
Draft:The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks


Neural networks have become integral to modern machine learning, achieving remarkable performance in tasks such as image recognition and natural language processing. However, these models are often over-parameterized, containing far more weights and connections than necessary. Pruning can reduce network size after training by removing up to 90% of parameters without sacrificing accuracy, but the sparse architectures that pruning produces are difficult to train from scratch, even though training such small networks directly would make training more efficient.[1]

The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin in their 2018 paper,[1] offers a different perspective. It suggests that within a randomly initialized, dense neural network lies a smaller, sparse subnetwork, referred to as a “winning ticket”,[1] that, when trained independently from its original initialization, can match or exceed the performance of the full network. These subnetworks “win the initialization lottery”: their specific initial weights make them particularly well suited to effective training.

Through experiments on fully connected and convolutional networks, the authors demonstrate that these winning tickets, which are often only 10–20% of the size of the original model, can be trained to match or exceed the accuracy of the full network.

The Lottery Ticket Hypothesis


Statement of the Hypothesis


The Lottery Ticket Hypothesis[1] proposes that within a large, randomly initialized, dense neural network, there exists a much smaller subnetwork that can be trained in isolation to achieve performance comparable to that of the full network. When trained from its original random initialization, this subnetwork can reach comparable or better accuracy in at most the same number of training iterations while using significantly fewer parameters.

Formal Explanation


Consider a dense feed-forward neural network f(x; θ) with initial parameters θ = θ₀ ~ D_θ, where D_θ is the distribution from which the weights are randomly initialized. When this network is optimized using stochastic gradient descent (SGD), it reaches a minimum validation loss l after j iterations and achieves a test accuracy a.[1]

Now, introduce a binary mask m ∈ {0, 1}^|θ|, where each entry in the mask corresponds to a weight in the network. The mask defines a subnetwork f(x; m ⊙ θ₀) by turning off some of the weights in θ₀. The subnetwork is trained with the remaining weights initialized to their original random values, m ⊙ θ₀ (where ⊙ denotes element-wise multiplication).[1]

The Lottery Ticket Hypothesis predicts that there exists a mask m such that the pruned subnetwork, when trained using the same dataset and optimization method, will (as illustrated in the code sketch after this list):

  • Reach the same or lower validation loss in at most the same number of iterations, j′ ≤ j,[1]
  • Achieve comparable or higher test accuracy, a′ ≥ a,[1]
  • Use significantly fewer parameters, such that ‖m‖₀ ≪ |θ| (meaning the subnetwork contains far fewer non-zero weights than the original network).[1]
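
The following is a minimal illustration of this formulation in PyTorch. The tensor names, the layer size, and the 20% sparsity level are chosen for the example and are not taken from the paper.

    import torch

    # Randomly initialized dense weights θ₀ for a single 784→300 layer
    theta0 = torch.randn(300, 784)                    # |θ| = 235,200 weights

    # Binary mask m ∈ {0, 1}^|θ|; here it keeps roughly 20% of the weights
    mask = (torch.rand_like(theta0) < 0.2).float()

    # The subnetwork's weights are m ⊙ θ₀ (element-wise product);
    # entries where the mask is 0 are "turned off"
    subnet_weights = mask * theta0

    # ‖m‖₀ ≪ |θ|: the subnetwork has far fewer non-zero weights
    print(int(mask.sum().item()), theta0.numel())     # roughly 47,000 vs 235,200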

Identifying Winning Tickets


A winning ticket is identified by training a network and pruning its smallest-magnitude weights. The remaining, unpruned connections form the architecture of the winning ticket. After pruning, each unpruned connection is reset to its value from the original initialization, before training began.[1] The process of identifying a winning ticket is as follows:

  1. Randomly initialize a neural network f(x; θ₀), where θ₀ ~ D_θ.[1]
  2. Train the network for j iterations, resulting in parameters θ_j.[1]
  3. Prune p% of the parameters in θ_j, creating a mask m.[1]
  4. Reset the remaining parameters to their initial values from θ₀, creating the winning ticket f(x; m ⊙ θ₀).[1]

This pruning approach is called one-shot pruning:[1] the network is trained once, p% of the weights are pruned, and the surviving weights are reset to their initial values. However, the authors primarily focus on iterative pruning, which repeats this process over n rounds, pruning p^(1/n)% of the weights that survived the previous round in each round. Their results show that iterative pruning finds winning tickets that match the accuracy of the original network with fewer parameters than one-shot pruning does.[1]
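
The iterative procedure can be sketched in PyTorch as follows. The function names, the `train_fn` callback (assumed to train the model for j iterations while holding masked weights at zero), and the per-round pruning rate are illustrative assumptions rather than details taken from the paper; for simplicity the sketch treats all parameters alike, whereas the paper prunes connections.

    import copy
    import torch

    def prune_by_magnitude(weights, masks, rate):
        """Prune the lowest-magnitude surviving weights in each layer at the
        given rate and return the updated binary masks."""
        new_masks = {}
        for name, w in weights.items():
            surviving = w[masks[name].bool()].abs()
            threshold = torch.quantile(surviving, rate)       # per-layer cutoff
            new_masks[name] = masks[name] * (w.abs() > threshold).float()
        return new_masks

    def find_winning_ticket(model, train_fn, rounds=5, rate_per_round=0.2):
        """Iterative magnitude pruning: train, prune, reset to θ₀, repeat."""
        theta0 = copy.deepcopy(model.state_dict())            # save initialization θ₀
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train_fn(model, masks)                            # train with masked weights held at zero
            trained = {n: p.detach().clone() for n, p in model.named_parameters()}
            masks = prune_by_magnitude(trained, masks, rate_per_round)
            with torch.no_grad():                             # reset survivors to θ₀
                for name, p in model.named_parameters():
                    p.copy_(theta0[name] * masks[name])
        return masks   # m defines the winning ticket f(x; m ⊙ θ₀)

Replacing theta0[name] in the reset step with freshly sampled random weights corresponds to the random-reinitialization control discussed under Key Findings below.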

Authors


The Lottery Ticket Hypothesis was proposed by Jonathan Frankle and Michael Carbin, who first posted the paper in 2018; it was presented at the International Conference on Learning Representations (ICLR) in 2019. At the time, both authors were affiliated with the Massachusetts Institute of Technology (MIT), working in computer science with a focus on machine learning.

Experimental Setup


Types of Architectures


The authors tested the Lottery Ticket Hypothesis on three types of neural network architectures:

  1. Fully Connected Networks: These were used for experiments on the MNIST dataset. Specifically, the authors used the LeNet-300-100[4] architecture, a simple three-layer fully connected network with 300 units in the first hidden layer, 100 units in the second hidden layer, and 10 output units.
  2. Convolutional Neural Networks (CNNs): The authors also tested convolutional neural networks on the CIFAR-10 dataset. They used several variations of a VGG-style architecture with two, four, and six convolutional layers (referred to as Conv-2, Conv-4, and Conv-6). They additionally studied the effects of dropout.
  3. Deeper Architectures: Additional experiments were performed on deeper networks such as ResNet-18[5] and VGG-19,[6] which are commonly used in large-scale image recognition tasks.

Each architecture was trained using standard optimization techniques, such as stochastic gradient descent (SGD).

Pruning Strategies


The pruning strategy involved unstructured pruning, where individual weights, rather than entire neurons or layers, were removed. The authors used a simple pruning heuristic in which the weights with the smallest magnitudes (in terms of absolute value) were pruned first. This method allowed the authors to iteratively reduce the size of the network while keeping the overall layer architecture intact.

For some experiments, the authors compared the performance of winning tickets found through one-shot pruning versus iterative pruning:

  • One-shot pruning: The network was trained once, a large percentage of the weights (e.g., 50%) was pruned, and the surviving weights were reset to their initial values and retrained.[1]
  • Iterative pruning: The network was pruned by a small percentage (e.g., 10–20%) at each round, with the remaining weights reset to their initial values and retrained after each round.[1]
  • Global pruning: For ResNet-18[5] and VGG-19,[6] the networks were pruned globally rather than layer by layer, removing the lowest-magnitude weights collectively across all convolutional layers (the sketch after this list contrasts the two thresholding schemes).[1]
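
The difference between layer-wise and global magnitude pruning can be illustrated with the following PyTorch sketch; the layer names, shapes, and the 20% pruning rate are arbitrary choices for the example.

    import torch

    # Illustrative per-layer weight tensors (shapes are arbitrary)
    weights = {
        "conv1": torch.randn(64, 3, 3, 3),
        "conv2": torch.randn(128, 64, 3, 3),
    }
    rate = 0.2   # fraction of weights to prune

    # Layer-wise pruning: a separate magnitude threshold for each layer
    layer_masks = {
        name: (w.abs() > torch.quantile(w.abs().flatten(), rate)).float()
        for name, w in weights.items()
    }

    # Global pruning: one threshold over all layers collectively, so layers
    # with many small weights end up pruned more heavily than others
    all_magnitudes = torch.cat([w.abs().flatten() for w in weights.values()])
    global_threshold = torch.quantile(all_magnitudes, rate)
    global_masks = {name: (w.abs() > global_threshold).float()
                    for name, w in weights.items()}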

Key Findings from Experiments

  1. Winning tickets exist: In both fully connected and convolutional networks, the authors were able to find winning tickets that, when trained in isolation from their original initialization, matched or exceeded the performance of the full, unpruned network.[1]
  2. Winning tickets are much smaller: The pruned subnetworks (winning tickets) often contained only 10–20% of the parameters of the original network, or fewer, yet achieved similar test accuracy in a comparable number of training iterations.[1]
  3. Initialization matters: Winning tickets only performed well when their weights were reset to their original initialization. When the weights of the pruned network were randomly reinitialized, the network performed significantly worse, supporting the hypothesis that these winning tickets "won the initialization lottery."[1]
  4. Iterative pruning outperforms one-shot pruning: Iterative pruning consistently found smaller winning tickets that performed better than those found through one-shot pruning. This suggests that progressively reducing the size of the network helps in identifying the most critical weights and connections.[1]

Implications

  1. Improve training performance: if winning tickets could be identified early, training could focus on smaller subnetworks from the start.[1]
  2. Design better networks: the sparse architectures and initializations of winning tickets may inform new network designs and initialization schemes.[1]
  3. Improve our theoretical understanding of neural networks, in particular of why over-parameterized networks optimize and generalize well.[1]

Historical context


Over-parameterization and Network Simplification


In modern neural networks, over-parameterization has become a common phenomenon. This means that networks often contain more parameters than are strictly necessary to learn the target function. Despite this excess capacity, neural networks still manage to find simpler functions during training, indicating that even over-parameterized networks can generalize well without memorizing the entire dataset.

Techniques for Reducing Network Complexity


Distillation and Pruning

  • Distillation: Developed by Ba & Caruana (2014)[7] and Hinton et al. (2015),[8] distillation compresses large models into smaller ones by training a small network to mimic the behavior of a large, complex network. This reduces computational resources and makes deployment on devices easier.
  • Pruning: Techniques like those introduced by LeCun et al. (1990)[9] and later refined by Han et al. (2015)[10] allow for the removal of unnecessary weights from a trained network, maintaining accuracy while significantly reducing the size of the model. This is particularly useful for reducing computational costs during inference.

References

  1. Frankle, Jonathan; Carbin, Michael (2019-03-04). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". arXiv:1803.03635 [cs.LG].
  2. Frankle, Jonathan; Carbin, Michael (September 27, 2018). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" – via openreview.net.
  3. "Mosaic Research Hub". Databricks. March 11, 2024.
  4. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791.
  5. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-12-10). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). p. 1. arXiv:1512.03385. Bibcode:2016cvpr.confE...1H. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
  6. Simonyan, Karen; Zisserman, Andrew (2015-04-10). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556 [cs.CV].
  7. Ba, Lei Jimmy; Caruana, Rich (2014-10-10). "Do Deep Nets Really Need to be Deep?". arXiv:1312.6184 [cs.LG].
  8. Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015-03-09). "Distilling the Knowledge in a Neural Network". arXiv:1503.02531 [stat.ML].
  9. LeCun, Yann; Denker, John; Solla, Sara (1989). "Optimal Brain Damage". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
  10. Han, Song; Pool, Jeff; Tran, John; Dally, William J. (2015-10-30). "Learning both Weights and Connections for Efficient Neural Networks". arXiv:1506.02626 [cs.NE].