Talk:Neural scaling law
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: Computing (Software, Computer science, Computer hardware) and Artificial Intelligence.
todo list
More data
https://epochai.org/blog/extrapolating-performance-in-language-modelling-benchmarks
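For context on what such extrapolations involve, here is a minimal, self-contained sketch with made-up numbers (not Epoch's data or fitted constants): fitting a saturating power law L(C) = a·C^(−α) + L_∞ to a handful of (compute, loss) points and extrapolating to a larger budget.

```python
# Illustrative only: made-up (compute, loss) points, not any real benchmark data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(log10_compute, a, alpha, loss_floor):
    # L(C) = a * C^(-alpha) + loss_floor, parameterised in log10(C) for numerical stability
    return a * 10.0 ** (-alpha * log10_compute) + loss_floor

log10_compute = np.array([18.0, 19.0, 20.0, 21.0, 22.0])   # training FLOPs (log10), synthetic
loss = np.array([6.45, 5.48, 4.70, 4.08, 3.59])            # made-up validation losses

params, _ = curve_fit(scaling_law, log10_compute, loss, p0=[100.0, 0.1, 1.0], maxfev=10000)
a, alpha, loss_floor = params
print(f"a={a:.3g}, alpha={alpha:.3f}, irreducible loss={loss_floor:.2f}")
print("extrapolated loss at 1e24 FLOPs:", scaling_law(24.0, *params))
```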
Llama 3 paper, section 3.2.1 Scaling Laws.
PaLM2 paper. Almost no details, but there's something.
Relevant parts of the PaLM 2 report: section 2 "Scaling law experiments" (2.1 "Scaling laws", 2.2 "Downstream metric evaluations") and Appendix A "Detailed results" (A.1 "Scaling laws"). pony in a strange land (talk) 15:25, 20 May 2023 (UTC)
[[2305.18565] PaLI-X: On Scaling up a Multilingual Vision and Language Model](https://arxiv.org/abs/2305.18565)
[[2306.13575] Scaling MLPs: A Tale of Inductive Bias](https://arxiv.org/abs/2306.13575)
> Performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet); the lack of inductive bias is compensated for by scale. MLPs mimic the behaviour of their modern counterparts faithfully, though some components of the learning setting surprisingly exhibit stronger or unexpected behaviours.
Trading training and inference costs
[[2104.03113] Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)
> Training compute and inference compute (MCTS) can be traded off against each other: 10x more MCTS steps at play time is worth roughly the same as 10x more training compute (see the toy sketch below).
Figure 6, 7 of Alphacode https://arxiv.org/pdf/2203.07814.pdf
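A toy illustration of the 1:1 log-compute tradeoff quoted above (my own sketch, not the paper's fitted frontier), assuming playing strength depends only on the sum of log train compute and log test-time search:

```python
# Hypothetical iso-performance model: Elo grows with log10(train FLOPs) + log10(MCTS nodes),
# so multiplying either factor by 10 has the same effect. Constants are arbitrary.
from math import log10

def elo_proxy(train_flops, mcts_nodes, k=100.0):
    # k is an arbitrary scale factor for the toy Elo proxy
    return k * (log10(train_flops) + log10(mcts_nodes))

baseline      = elo_proxy(1e15, 1e3)
more_search   = elo_proxy(1e15, 1e4)   # 10x more MCTS steps at play time
more_training = elo_proxy(1e16, 1e3)   # 10x more training compute
assert abs(more_search - more_training) < 1e-9   # identical under this toy model
print(baseline, more_search, more_training)      # 1800.0 1900.0 1900.0
```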
Scaling by data quality
[[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)
> The scaling of error with dataset size can be beaten: faster-than-power-law, possibly even exponential, scaling is achievable given a high-quality data-pruning metric that ranks the order in which training examples should be discarded to reach any pruned dataset size.
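A toy sketch of the pruning setup described in that quote (my own code with a made-up metric, not the paper's): rank training examples by an informativeness score and keep only the top fraction, so that smaller kept datasets are increasingly well-curated.

```python
import numpy as np

def prune_dataset(examples, metric_scores, keep_fraction):
    """Keep the top `keep_fraction` of examples, ranked by the pruning metric."""
    order = np.argsort(metric_scores)[::-1]                # highest score first
    n_keep = max(1, int(round(len(examples) * keep_fraction)))
    return [examples[i] for i in order[:n_keep]]

# Toy usage: 10 examples with made-up informativeness scores.
rng = np.random.default_rng(0)
examples = [f"example_{i}" for i in range(10)]
scores = rng.random(10)
print(prune_dataset(examples, scores, keep_fraction=0.3))  # the 3 highest-scoring examples
```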
If phi-1 replicates, incorporate it too: https://arxiv.org/abs/2306.11644 (a toy sketch of the recipe follows the list below).
1. Started with The Stack (a 3 TB collection of code) and text from StackOverflow.
2. Used an LLM to select 6B "high-quality" tokens from (1).
3. Used GPT-3.5 to generate 1B tokens of text similar to textbooks.
4. Trained a small (1.3B-parameter) model ("phi-1") on (2) and (3).
5. Used GPT-3.5 to generate text similar to textbook exercises.
6. Fine-tuned phi-1 on (5).
7. Tested phi-1 on HumanEval to evaluate its programming ability.
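And the promised toy sketch of the recipe: every function below is a stub standing in for the corresponding step in the list; names and return values are placeholders, not the paper's code.

```python
# Runnable toy pipeline; the real version needs The Stack, GPT-3.5 access, and training infrastructure.

def load_web_code_corpus():              # (1) The Stack + StackOverflow text
    return ["raw code/text document"] * 1000

def filter_with_llm(corpus, keep=0.5):   # (2) LLM quality filter (~6B tokens kept in the paper)
    return corpus[: int(len(corpus) * keep)]

def generate_textbook_text(n=200):       # (3) GPT-3.5-generated textbook-style data (~1B tokens)
    return ["synthetic textbook passage"] * n

def pretrain(data, n_params=1.3e9):      # (4) pretrain the 1.3B-parameter "phi-1" on (2)+(3)
    return {"params": n_params, "pretraining_docs": len(data)}

def generate_exercises(n=50):            # (5) GPT-3.5-generated textbook-exercise data
    return ["synthetic exercise + solution"] * n

def finetune(model, data):               # (6) fine-tune phi-1 on (5)
    model["finetuning_docs"] = len(data)
    return model

def evaluate_humaneval(model):           # (7) measure coding ability on HumanEval
    return {"pass@1": None}              # placeholder; no real evaluation here

phi1 = pretrain(filter_with_llm(load_web_code_corpus()) + generate_textbook_text())
phi1 = finetune(phi1, generate_exercises())
print(evaluate_humaneval(phi1))
```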
RL scaling
Scaling laws for reward model overoptimization
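If that paper goes in, the key quantitative result (as I remember it; worth double-checking against the paper before citing) is a pair of functional forms for how the true "gold" reward varies with the square root of the KL divergence from the initial policy, for best-of-n sampling and for RL respectively:

<math display="block">
R_{\text{BoN}}(d) = d\,(\alpha_{\text{BoN}} - \beta_{\text{BoN}}\, d), \qquad
R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d), \qquad
d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{init}})}
</math>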
GATO? RoboCat?
Theoretical explanations
Bottou, Léon; Bousquet, Olivier (2011). "The Tradeoffs of Large-Scale Learning". Optimization for Machine Learning.
A Tale of Tails: Model Collapse as a Change of Scaling Laws
References
- ^ Hutter, Marcus (2021-02-01). "Learning Curve Theory".
- ^ Sharma, Utkarsh; Kaplan, Jared (2022). "Scaling Laws from the Data Manifold Dimension". Journal of Machine Learning Research. 23 (9): 1–34. ISSN 1533-7928.
- ^ Allen-Zhu, Zeyuan; Li, Yuanzhi (2024-04-08). "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws". doi:10.48550/arXiv.2404.05405. Retrieved 2024-04-25.