
Draft:The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks


Neural networks have become integral to modern machine learning, achieving remarkable performance in tasks such as image recognition and natural language processing. However, these models are often over-parameterized, containing far more weights and connections than necessary. Pruning can reduce network size after training by removing up to 90% of parameters without sacrificing accuracy, but the sparse architectures that pruning produces are difficult to train from scratch, even though training such small networks directly would make training more efficient.[1]

The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin in their 2018 paper,[1] offers a different perspective. It suggests that within a randomly initialized, dense neural network lies a smaller, sparse subnetwork, referred to as a “winning ticket”,[1] that, when trained independently from its original initialization, can match or exceed the performance of the full network. These subnetworks “win the initialization lottery”: their specific initial weights make them particularly well suited to effective training.

Through experiments on fully connected and convolutional networks, the authors demonstrate that these winning tickets, which are often only 10–20% of the size of the original model, can be trained to match or exceed the accuracy of the full network.

The Lottery Ticket Hypothesis


Statement of the Hypothesis


The Lottery Ticket Hypothesis[1] proposes that within a large, randomly initialized, dense neural network, there exists a much smaller subnetwork that can be trained in isolation to achieve performance comparable to that of the full network. When trained from its original random initialization, this subnetwork can reach comparable or better accuracy in at most the same number of training iterations while using significantly fewer parameters.

Formal Explanation


Consider a dense feed-forward neural network f(x; θ) with initial parameters θ = θ₀ ~ D_θ, where D_θ is the distribution from which the weights are randomly initialized. When this network is optimized using stochastic gradient descent (SGD), it reaches a minimum validation loss l after j iterations and achieves a test accuracy a.[1]

Now, introduce a binary mask m ∈ {0, 1}^|θ|, where each entry in the mask corresponds to a weight in the network. The mask defines a subnetwork f(x; m ⊙ θ₀) by turning off some of the weights in θ₀. The subnetwork is trained with the remaining weights initialized to their original random values, m ⊙ θ₀ (where ⊙ denotes element-wise multiplication).[1]

The Lottery Ticket Hypothesis predicts that there exists a mask m such that the pruned subnetwork, when trained using the same dataset and optimization method, will (as illustrated in the code sketch after this list):

  • Reach the same or lower validation loss in at most the same number of iterations, j′ ≤ j,[1]
  • Achieve comparable or higher test accuracy, a′ ≥ a,[1]
  • Use significantly fewer parameters, such that ‖m‖₀ ≪ |θ| (meaning the subnetwork contains far fewer non-zero weights than the original network).[1]
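
The following is a minimal illustration of this formulation in PyTorch. The tensor names, the layer size, and the 20% sparsity level are chosen for the example and are not taken from the paper.

    import torch

    # Randomly initialized dense weights θ₀ for a single 784→300 layer
    theta0 = torch.randn(300, 784)                    # |θ| = 235,200 weights

    # Binary mask m ∈ {0, 1}^|θ|; here it keeps roughly 20% of the weights
    mask = (torch.rand_like(theta0) < 0.2).float()

    # The subnetwork's weights are m ⊙ θ₀ (element-wise product);
    # entries where the mask is 0 are "turned off"
    subnet_weights = mask * theta0

    # ‖m‖₀ ≪ |θ|: the subnetwork has far fewer non-zero weights
    print(int(mask.sum().item()), theta0.numel())     # roughly 47,000 vs 235,200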

Identifying Winning Tickets


A winning ticket is identified by training a network and pruning its smallest-magnitude weights. The remaining, unpruned connections form the architecture of the winning ticket. After pruning, each unpruned connection is reset to its value from the original initialization, before training began.[1] The process of identifying a winning ticket is as follows:

  1. Randomly initialize a neural network f(x; θ₀), where θ₀ ~ D_θ.[1]
  2. Train the network for j iterations, resulting in parameters θ_j.[1]
  3. Prune p% of the parameters in θ_j, creating a mask m.[1]
  4. Reset the remaining parameters to their initial values from θ₀, creating the winning ticket f(x; m ⊙ θ₀).[1]

This pruning approach is called one-shot pruning:[1] the network is trained once, p% of the weights are pruned, and the surviving weights are reset to their initial values. However, the authors primarily focus on iterative pruning, which repeats this process over n rounds, pruning p^(1/n)% of the weights that survived the previous round in each round. Their results show that iterative pruning finds winning tickets that match the accuracy of the original network with fewer parameters than one-shot pruning does.[1]
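
The iterative procedure can be sketched in PyTorch as follows. The function names, the `train_fn` callback (assumed to train the model for j iterations while holding masked weights at zero), and the per-round pruning rate are illustrative assumptions rather than details taken from the paper; for simplicity the sketch treats all parameters alike, whereas the paper prunes connections.

    import copy
    import torch

    def prune_by_magnitude(weights, masks, rate):
        """Prune the lowest-magnitude surviving weights in each layer at the
        given rate and return the updated binary masks."""
        new_masks = {}
        for name, w in weights.items():
            surviving = w[masks[name].bool()].abs()
            threshold = torch.quantile(surviving, rate)       # per-layer cutoff
            new_masks[name] = masks[name] * (w.abs() > threshold).float()
        return new_masks

    def find_winning_ticket(model, train_fn, rounds=5, rate_per_round=0.2):
        """Iterative magnitude pruning: train, prune, reset to θ₀, repeat."""
        theta0 = copy.deepcopy(model.state_dict())            # save initialization θ₀
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train_fn(model, masks)                            # train with masked weights held at zero
            trained = {n: p.detach().clone() for n, p in model.named_parameters()}
            masks = prune_by_magnitude(trained, masks, rate_per_round)
            with torch.no_grad():                             # reset survivors to θ₀
                for name, p in model.named_parameters():
                    p.copy_(theta0[name] * masks[name])
        return masks   # m defines the winning ticket f(x; m ⊙ θ₀)

Replacing theta0[name] in the reset step with freshly sampled random weights corresponds to the random-reinitialization control discussed under Key Findings below.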

Authors


The Lottery Ticket Hypothesis was proposed by Jonathan Frankle and Michael Carbin, who first posted the paper in 2018; it was presented at the International Conference on Learning Representations (ICLR) in 2019. At the time, both authors were affiliated with the Massachusetts Institute of Technology (MIT), working in computer science with a focus on machine learning.

Experimental Setup


Types of Architectures


The authors tested the Lottery Ticket Hypothesis on three types of neural network architectures:

  1. Fully Connected Networks: These were used for experiments on the MNIST dataset. Specifically, the authors used the LeNet-300-100[4] architecture, a simple three-layer fully connected network with 300 units in the first hidden layer, 100 units in the second hidden layer, and 10 output units.
  2. Convolutional Neural Networks (CNNs): The authors also tested convolutional neural networks on the CIFAR-10 dataset. They used several variations of a VGG-style architecture with two, four, and six convolutional layers (referred to as Conv-2, Conv-4, and Conv-6). They additionally studied the effects of dropout.
  3. Deeper Architectures: Additional experiments were performed on deeper networks such as ResNet-18[5] and VGG-19,[6] which are commonly used in large-scale image recognition tasks.

Each architecture was trained using standard optimization techniques, such as stochastic gradient descent (SGD).

Pruning Strategies


The pruning strategy involved unstructured pruning, where individual weights, rather than entire neurons or layers, were removed. The authors used a simple pruning heuristic in which the weights with the smallest magnitudes (in terms of absolute value) were pruned first. This method allowed the authors to iteratively reduce the size of the network while keeping the overall layer architecture intact.

For some experiments, the authors compared the performance of winning tickets found through one-shot pruning versus iterative pruning:

  • One-shot pruning: The network was trained once, a large percentage of the weights (e.g., 50%) was pruned, and the surviving weights were reset to their initial values and retrained.[1]
  • Iterative pruning: The network was pruned by a small percentage (e.g., 10–20%) at each round, with the remaining weights reset to their initial values and retrained after each round.[1]
  • Global pruning: For ResNet-18[5] and VGG-19,[6] the networks were pruned globally rather than layer by layer, removing the lowest-magnitude weights collectively across all convolutional layers (the sketch after this list contrasts the two thresholding schemes).[1]
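
The difference between layer-wise and global magnitude pruning can be illustrated with the following PyTorch sketch; the layer names, shapes, and the 20% pruning rate are arbitrary choices for the example.

    import torch

    # Illustrative per-layer weight tensors (shapes are arbitrary)
    weights = {
        "conv1": torch.randn(64, 3, 3, 3),
        "conv2": torch.randn(128, 64, 3, 3),
    }
    rate = 0.2   # fraction of weights to prune

    # Layer-wise pruning: a separate magnitude threshold for each layer
    layer_masks = {
        name: (w.abs() > torch.quantile(w.abs().flatten(), rate)).float()
        for name, w in weights.items()
    }

    # Global pruning: one threshold over all layers collectively, so layers
    # with many small weights end up pruned more heavily than others
    all_magnitudes = torch.cat([w.abs().flatten() for w in weights.values()])
    global_threshold = torch.quantile(all_magnitudes, rate)
    global_masks = {name: (w.abs() > global_threshold).float()
                    for name, w in weights.items()}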

Key Findings from Experiments

  1. Winning tickets exist: In both fully connected and convolutional networks, the authors were able to find winning tickets that, when trained in isolation from their original initialization, matched or exceeded the performance of the full, unpruned network.[1]
  2. Winning tickets are much smaller: The pruned subnetworks (winning tickets) often contained only 10–20% of the parameters of the original network, or fewer, yet achieved similar test accuracy in a comparable number of training iterations.[1]
  3. Initialization matters: Winning tickets only performed well when their weights were reset to their original initialization. When the weights of the pruned network were randomly reinitialized, the network performed significantly worse, supporting the hypothesis that these winning tickets "won the initialization lottery."[1]
  4. Iterative pruning outperforms one-shot pruning: Iterative pruning consistently found smaller winning tickets that performed better than those found through one-shot pruning. This suggests that progressively reducing the size of the network helps in identifying the most critical weights and connections.[1]

Implications

  1. Improve training performance: if winning tickets could be identified early, training could focus on smaller subnetworks from the start.[1]
  2. Design better networks: the sparse architectures and initializations of winning tickets may inform new network designs and initialization schemes.[1]
  3. Improve our theoretical understanding of neural networks, in particular of why over-parameterized networks optimize and generalize well.[1]

Historical context


Over-parameterization and Network Simplification


In modern neural networks, over-parameterization has become a common phenomenon. This means that networks often contain more parameters than are strictly necessary to learn the target function. Despite this excess capacity, neural networks still manage to find simpler functions during training, indicating that even over-parameterized networks can generalize well without memorizing the entire dataset.

Techniques for Reducing Network Complexity


Distillation and Pruning

  • Distillation: Developed by Ba & Caruana (2014)[7] and Hinton et al. (2015),[8] distillation compresses large models into smaller ones by training a small network to mimic the behavior of a large, complex network. This reduces computational resources and makes deployment on devices easier.
  • Pruning: Techniques like those introduced by LeCun et al. (1990)[9] and later refined by Han et al. (2015)[10] allow for the removal of unnecessary weights from a trained network, maintaining accuracy while significantly reducing the size of the model. This is particularly useful for reducing computational costs during inference.

References

  1. Frankle, Jonathan; Carbin, Michael (2019-03-04). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". arXiv:1803.03635 [cs.LG].
  2. Frankle, Jonathan; Carbin, Michael (September 27, 2018). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" – via openreview.net.
  3. "Mosaic Research Hub". Databricks. March 11, 2024.
  4. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791.
  5. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-12-10). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). p. 1. arXiv:1512.03385. Bibcode:2016cvpr.confE...1H. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
  6. Simonyan, Karen; Zisserman, Andrew (2015-04-10). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556 [cs.CV].
  7. Ba, Lei Jimmy; Caruana, Rich (2014-10-10). "Do Deep Nets Really Need to be Deep?". arXiv:1312.6184 [cs.LG].
  8. Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015-03-09). "Distilling the Knowledge in a Neural Network". arXiv:1503.02531 [stat.ML].
  9. LeCun, Yann; Denker, John; Solla, Sara (1989). "Optimal Brain Damage". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
  10. Han, Song; Pool, Jeff; Tran, John; Dally, William J. (2015-10-30). "Learning both Weights and Connections for Efficient Neural Networks". arXiv:1506.02626 [cs.NE].