Double descent

In statistics and machine learning, double descent is the phenomenon where a statistical model with a small number of parameters and a model with an extremely large number of parameters have a small test error, but a model whose number of parameters is about the same as the number of data points used to train the model will have a large error.^[2] This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.^[1]

History

Early observations of what would later be called double descent in specific models date back to 1989.^[3]^[4]

The term "double descent" was coined by Belkin et. al.^[5] in 2019,^[1] when the phenomenon as a broader concept shared by many models gained popularity.^[6]^[7] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in the model result in a significant overfitting error (an extrapolation of bias-variance tradeoff),^[8] and the empirical observations in the 2010s that some modern machine learning models tend to perform better with larger models.^[5]^[9]

Theoretical models

Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.^[10]

A model of double descent at the thermodynamic limit has been analyzed by the replica method, and the result has been confirmed numerically.^[11]

Empirical examples

The scaling behavior of double descent has been found to follow a broken neural scaling law^[12] functional form.

References

^ ^a ^b ^c Schaeffer, Rylan; Khona, Mikail; Robertson, Zachary; Boopathy, Akhilan; Pistunova, Kateryna; Rocks, Jason W.; Fiete, Ila Rani; Koyejo, Oluwasanmi (2023-03-24). "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle". arXiv:2303.14151v1 [cs.LG].
^ "Deep Double Descent". OpenAI. 2019-12-05. Retrieved 2022-08-12.
^ Vallet, F.; Cailton, J.-G.; Refregier, Ph (June 1989). "Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions". Europhysics Letters. 9 (4): 315. Bibcode:1989EL......9..315V. doi:10.1209/0295-5075/9/4/003. ISSN 0295-5075.
^ Loog, Marco; Viering, Tom; Mey, Alexander; Krijthe, Jesse H.; Tax, David M. J. (2020-05-19). "A brief prehistory of double descent". Proceedings of the National Academy of Sciences. 117 (20): 10625–10626. arXiv:2004.04328. Bibcode:2020PNAS..11710625L. doi:10.1073/pnas.2001875117. ISSN 0027-8424. PMC 7245109. PMID 32371495.
^ ^a ^b Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik (2019-08-06). "Reconciling modern machine learning practice and the bias-variance trade-off". Proceedings of the National Academy of Sciences. 116 (32): 15849–15854. arXiv:1812.11118. doi:10.1073/pnas.1903070116. ISSN 0027-8424. PMC 6689936. PMID 31341078.
^ Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (2019-11-22). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical. 52 (47): 474001. arXiv:1810.09665. doi:10.1088/1751-8121/ab4c8b. ISSN 1751-8113.
^ Viering, Tom; Loog, Marco (2023-06-01). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (6): 7799–7819. arXiv:2103.10948. doi:10.1109/TPAMI.2022.3220744. ISSN 0162-8828. PMID 36350870.
^ Geman, Stuart; Bienenstock, Élie; Doursat, René (1992). "Neural networks and the bias/variance dilemma" (PDF). Neural Computation. 4: 1–58. doi:10.1162/neco.1992.4.1.1. S2CID 14215320.
^ Preetum Nakkiran; Gal Kaplun; Yamini Bansal; Tristan Yang; Boaz Barak; Ilya Sutskever (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment. 2021 (12). IOP Publishing Ltd and SISSA Medialab srl: 124003. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N. doi:10.1088/1742-5468/ac3a74. S2CID 207808916.
^ Nakkiran, Preetum (2019-12-16). "More Data Can Hurt for Linear Regression: Sample-wise Double Descent". arXiv:1912.07242v1 [stat.ML].
^ Advani, Madhu S.; Saxe, Andrew M.; Sompolinsky, Haim (2020-12-01). "High-dimensional dynamics of generalization error in neural networks". Neural Networks. 132: 428–446. doi:10.1016/j.neunet.2020.08.022. ISSN 0893-6080. PMC 7685244. PMID 33022471.
^ Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.

External links

Brent Werness; Jared Wilber. "Double Descent: Part 1: A Visual Introduction".
Brent Werness; Jared Wilber. "Double Descent: Part 2: A Mathematical Explanation".
Understanding "Deep Double Descent" at evhub.

This statistics-related article is a stub. You can help Wikipedia by expanding it.

[:1-1] Schaeffer, Rylan; Khona, Mikail; Robertson, Zachary; Boopathy, Akhilan; Pistunova, Kateryna; Rocks, Jason W.; Fiete, Ila Rani; Koyejo, Oluwasanmi (2023-03-24). "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle". arXiv:2303.14151v1 [cs.LG].

[2] "Deep Double Descent". OpenAI. 2019-12-05. Retrieved 2022-08-12.

[3] Vallet, F.; Cailton, J.-G.; Refregier, Ph (June 1989). "Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions". Europhysics Letters. 9 (4): 315. Bibcode:1989EL......9..315V. doi:10.1209/0295-5075/9/4/003. ISSN 0295-5075.

[4] Loog, Marco; Viering, Tom; Mey, Alexander; Krijthe, Jesse H.; Tax, David M. J. (2020-05-19). "A brief prehistory of double descent". Proceedings of the National Academy of Sciences. 117 (20): 10625–10626. arXiv:2004.04328. Bibcode:2020PNAS..11710625L. doi:10.1073/pnas.2001875117. ISSN 0027-8424. PMC 7245109. PMID 32371495.

[:0-5] Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik (2019-08-06). "Reconciling modern machine learning practice and the bias-variance trade-off". Proceedings of the National Academy of Sciences. 116 (32): 15849–15854. arXiv:1812.11118. doi:10.1073/pnas.1903070116. ISSN 0027-8424. PMC 6689936. PMID 31341078.

[6] Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (2019-11-22). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical. 52 (47): 474001. arXiv:1810.09665. doi:10.1088/1751-8121/ab4c8b. ISSN 1751-8113.

[7] Viering, Tom; Loog, Marco (2023-06-01). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (6): 7799–7819. arXiv:2103.10948. doi:10.1109/TPAMI.2022.3220744. ISSN 0162-8828. PMID 36350870.

[geman-8] Geman, Stuart; Bienenstock, Élie; Doursat, René (1992). "Neural networks and the bias/variance dilemma" (PDF). Neural Computation. 4: 1–58. doi:10.1162/neco.1992.4.1.1. S2CID 14215320.

[9] Preetum Nakkiran; Gal Kaplun; Yamini Bansal; Tristan Yang; Boaz Barak; Ilya Sutskever (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment. 2021 (12). IOP Publishing Ltd and SISSA Medialab srl: 124003. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N. doi:10.1088/1742-5468/ac3a74. S2CID 207808916.

[10] Nakkiran, Preetum (2019-12-16). "More Data Can Hurt for Linear Regression: Sample-wise Double Descent". arXiv:1912.07242v1 [stat.ML].

[11] Advani, Madhu S.; Saxe, Andrew M.; Sompolinsky, Haim (2020-12-01). "High-dimensional dynamics of generalization error in neural networks". Neural Networks. 132: 428–446. doi:10.1016/j.neunet.2020.08.022. ISSN 0893-6080. PMC 7685244. PMID 33022471.

[12] Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

History

Theoretical models

Empirical examples

References

Further reading

External links