Talk:Neural scaling law
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: Computing (Software, Computer science, Computer hardware) and Artificial Intelligence.
todo list
More data
https://epochai.org/blog/extrapolating-performance-in-language-modelling-benchmarks
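For context on what such extrapolations involve, here is a minimal, self-contained sketch with made-up numbers (not Epoch's data or fitted constants): fitting a saturating power law L(C) = a·C^(−α) + L_∞ to a handful of (compute, loss) points and extrapolating to a larger budget.

```python
# Illustrative only: made-up (compute, loss) points, not any real benchmark data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(log10_compute, a, alpha, loss_floor):
    # L(C) = a * C^(-alpha) + loss_floor, parameterised in log10(C) for numerical stability
    return a * 10.0 ** (-alpha * log10_compute) + loss_floor

log10_compute = np.array([18.0, 19.0, 20.0, 21.0, 22.0])   # training FLOPs (log10), synthetic
loss = np.array([6.45, 5.48, 4.70, 4.08, 3.59])            # made-up validation losses

params, _ = curve_fit(scaling_law, log10_compute, loss, p0=[100.0, 0.1, 1.0], maxfev=10000)
a, alpha, loss_floor = params
print(f"a={a:.3g}, alpha={alpha:.3f}, irreducible loss={loss_floor:.2f}")
print("extrapolated loss at 1e24 FLOPs:", scaling_law(24.0, *params))
```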
Llama 3 paper, section 3.2.1 Scaling Laws.
PaLM2 paper. Almost no details, but there's something.
Relevant parts of the PaLM 2 report: section 2 "Scaling law experiments" (2.1 "Scaling laws", 2.2 "Downstream metric evaluations") and Appendix A "Detailed results" (A.1 "Scaling laws"). pony in a strange land (talk) 15:25, 20 May 2023 (UTC)
[[2305.18565] PaLI-X: On Scaling up a Multilingual Vision and Language Model](https://arxiv.org/abs/2305.18565)
[[2306.13575] Scaling MLPs: A Tale of Inductive Bias](https://arxiv.org/abs/2306.13575)
> Performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet); the lack of inductive bias is compensated for by scale. MLPs mimic the behaviour of their modern counterparts faithfully, though some components of the learning setting surprisingly exhibit stronger or unexpected behaviours.
Trading training and inference costs
[[2104.03113] Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)
> Training compute and inference compute (MCTS) can be traded off against each other: 10x more MCTS steps at play time is worth roughly the same as 10x more training compute (see the toy sketch below).
Figure 6, 7 of Alphacode https://arxiv.org/pdf/2203.07814.pdf
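A toy illustration of the 1:1 log-compute tradeoff quoted above (my own sketch, not the paper's fitted frontier), assuming playing strength depends only on the sum of log train compute and log test-time search:

```python
# Hypothetical iso-performance model: Elo grows with log10(train FLOPs) + log10(MCTS nodes),
# so multiplying either factor by 10 has the same effect. Constants are arbitrary.
from math import log10

def elo_proxy(train_flops, mcts_nodes, k=100.0):
    # k is an arbitrary scale factor for the toy Elo proxy
    return k * (log10(train_flops) + log10(mcts_nodes))

baseline      = elo_proxy(1e15, 1e3)
more_search   = elo_proxy(1e15, 1e4)   # 10x more MCTS steps at play time
more_training = elo_proxy(1e16, 1e3)   # 10x more training compute
assert abs(more_search - more_training) < 1e-9   # identical under this toy model
print(baseline, more_search, more_training)      # 1800.0 1900.0 1900.0
```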
Scaling by data quality
[[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)
> The scaling of error with dataset size can be beaten: faster-than-power-law, possibly even exponential, scaling is achievable given a high-quality data-pruning metric that ranks the order in which training examples should be discarded to reach any pruned dataset size.
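A toy sketch of the pruning setup described in that quote (my own code with a made-up metric, not the paper's): rank training examples by an informativeness score and keep only the top fraction, so that smaller kept datasets are increasingly well-curated.

```python
import numpy as np

def prune_dataset(examples, metric_scores, keep_fraction):
    """Keep the top `keep_fraction` of examples, ranked by the pruning metric."""
    order = np.argsort(metric_scores)[::-1]                # highest score first
    n_keep = max(1, int(round(len(examples) * keep_fraction)))
    return [examples[i] for i in order[:n_keep]]

# Toy usage: 10 examples with made-up informativeness scores.
rng = np.random.default_rng(0)
examples = [f"example_{i}" for i in range(10)]
scores = rng.random(10)
print(prune_dataset(examples, scores, keep_fraction=0.3))  # the 3 highest-scoring examples
```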
If phi-1 replicates, incorporate it too: https://arxiv.org/abs/2306.11644 (a toy sketch of the recipe follows the list below).
1. Started with The Stack (a 3 TB collection of code) and text from StackOverflow.
2. Used an LLM to select 6B "high-quality" tokens from (1).
3. Used GPT-3.5 to generate 1B tokens of text similar to textbooks.
4. Trained a small (1.3B-parameter) model ("phi-1") on (2) and (3).
5. Used GPT-3.5 to generate text similar to textbook exercises.
6. Fine-tuned phi-1 on (5).
7. Tested phi-1 on HumanEval to evaluate its programming ability.
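And the promised toy sketch of the recipe: every function below is a stub standing in for the corresponding step in the list; names and return values are placeholders, not the paper's code.

```python
# Runnable toy pipeline; the real version needs The Stack, GPT-3.5 access, and training infrastructure.

def load_web_code_corpus():              # (1) The Stack + StackOverflow text
    return ["raw code/text document"] * 1000

def filter_with_llm(corpus, keep=0.5):   # (2) LLM quality filter (~6B tokens kept in the paper)
    return corpus[: int(len(corpus) * keep)]

def generate_textbook_text(n=200):       # (3) GPT-3.5-generated textbook-style data (~1B tokens)
    return ["synthetic textbook passage"] * n

def pretrain(data, n_params=1.3e9):      # (4) pretrain the 1.3B-parameter "phi-1" on (2)+(3)
    return {"params": n_params, "pretraining_docs": len(data)}

def generate_exercises(n=50):            # (5) GPT-3.5-generated textbook-exercise data
    return ["synthetic exercise + solution"] * n

def finetune(model, data):               # (6) fine-tune phi-1 on (5)
    model["finetuning_docs"] = len(data)
    return model

def evaluate_humaneval(model):           # (7) measure coding ability on HumanEval
    return {"pass@1": None}              # placeholder; no real evaluation here

phi1 = pretrain(filter_with_llm(load_web_code_corpus()) + generate_textbook_text())
phi1 = finetune(phi1, generate_exercises())
print(evaluate_humaneval(phi1))
```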
RL scaling
Scaling laws for reward model overoptimization
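If that paper goes in, the key quantitative result (as I remember it; worth double-checking against the paper before citing) is a pair of functional forms for how the true "gold" reward varies with the square root of the KL divergence from the initial policy, for best-of-n sampling and for RL respectively:

<math display="block">
R_{\text{BoN}}(d) = d\,(\alpha_{\text{BoN}} - \beta_{\text{BoN}}\, d), \qquad
R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d), \qquad
d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{init}})}
</math>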
GATO? RoboCat?
Theoretical explanations
Bottou, Léon; Bousquet, Olivier (2011). "The Tradeoffs of Large-Scale Learning". Optimization for Machine Learning.
A Tale of Tails: Model Collapse as a Change of Scaling Laws
References
- ^ Hutter, Marcus (2021-02-01). "Learning Curve Theory".
- ^ Sharma, Utkarsh; Kaplan, Jared (2022). "Scaling Laws from the Data Manifold Dimension". Journal of Machine Learning Research. 23 (9): 1–34. ISSN 1533-7928.
- ^ Allen-Zhu, Zeyuan; Li, Yuanzhi (2024-04-08). "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws". doi:10.48550/arXiv.2404.05405. Retrieved 2024-04-25.