Regularization Perspectives on SVM


Support vector machines (SVMs), like regularized least squares, are a special case of Tikhonov regularization. In the case of SVMs, the loss function is the hinge loss.[1][2][3][4]

Background


In the supervised learning framework, an algorithm is a strategy for choosing a function $f: X \to Y$ given a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ of inputs $x_i \in X$ and their labels $y_i \in Y$ (the labels are usually $\pm 1$). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$$f = \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \left\{ \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^{2} \right\},$$

where $\mathcal{H}$ is a hypothesis space[5] of functions, $V: Y \times Y \to \mathbb{R}$ is the loss function, $\|\cdot\|_{\mathcal{H}}$ is a norm on the hypothesis space of functions, and $\lambda \in \mathbb{R}$ is the regularization parameter.[6]

When $\mathcal{H}$ is a reproducing kernel Hilbert space, there exists a kernel function $K: X \times X \to \mathbb{R}$ that can be written as an $n \times n$ symmetric positive-definite matrix $\mathbf{K}$. By the representer theorem,[7]

$$f(x) = \sum_{i=1}^{n} c_i \mathbf{K}(x_i, x), \quad \text{and} \quad \|f\|_{\mathcal{H}}^{2} = \langle f, f \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \mathbf{K}(x_i, x_j) = c^{T} \mathbf{K} c.$$
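The following is a minimal illustrative sketch (not from the cited sources) of these objects for one concrete choice of kernel: it builds the Gram matrix $\mathbf{K}$ for a Gaussian (RBF) kernel on synthetic inputs, evaluates a function of the form $f(x) = \sum_i c_i \mathbf{K}(x_i, x)$, and computes its squared RKHS norm $c^T \mathbf{K} c$. The data, the kernel width, and the coefficients are arbitrary placeholders.

```python
# Sketch only: RBF Gram matrix, representer-theorem expansion, and RKHS norm.
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2), evaluated pairwise."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-gamma * np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))     # training inputs x_1, ..., x_n (placeholder data)
c = rng.normal(size=20)          # expansion coefficients c_1, ..., c_n (arbitrary)

K = rbf_kernel(X, X)             # n x n symmetric positive semi-definite Gram matrix

def f(x_new):
    """Evaluate f(x) = sum_i c_i K(x_i, x) at new points."""
    return rbf_kernel(x_new, X) @ c

norm_sq = c @ K @ c              # squared RKHS norm of f: c^T K c
print(f(X[:3]), norm_sq)
```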

Hinge loss


Figure: Hinge and misclassification loss functions.


The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if $f(x_i) = y_i$ and 1 if $f(x_i) \neq y_i$, i.e. the Heaviside step function on $-y_i f(x_i)$. However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, $V(y_i, f(x_i)) = (1 - y_i f(x_i))_{+}$, where $(s)_{+} = \max(s, 0)$, provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0–1 misclassification loss function,[8] and with infinite data returns the Bayes-optimal solution:[9]

$$f_b(x) = \begin{cases} 1, & p(1 \mid x) > p(-1 \mid x), \\ -1, & p(1 \mid x) < p(-1 \mid x). \end{cases}$$
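As a quick illustrative check (assuming labels in $\{-1, +1\}$ and writing $s = f(x)$ for the prediction score), the sketch below evaluates both losses on a grid of scores and verifies numerically that the hinge loss upper-bounds the 0–1 loss; it is not taken from the cited sources.

```python
# Sketch only: compare the 0-1 misclassification loss with the hinge loss.
import numpy as np

def zero_one_loss(y, s):
    """1 if the prediction score disagrees in sign with the label, else 0."""
    return (y * s <= 0).astype(float)

def hinge_loss(y, s):
    """(1 - y s)_+ = max(0, 1 - y s)."""
    return np.maximum(0.0, 1.0 - y * s)

s = np.linspace(-3, 3, 601)   # candidate scores f(x)
y = 1.0                       # fix the label; both losses depend only on y * f(x)

# The hinge loss upper-bounds the 0-1 loss everywhere.
assert np.all(hinge_loss(y, s) >= zero_one_loss(y, s))
```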

Derivation[10]


With the hinge loss, $V(y_i, f(x_i)) = (1 - y_i f(x_i))_{+}$, where $(s)_{+} = \max(s, 0)$, the regularization problem becomes:

$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_{+} + \lambda \|f\|_{\mathcal{H}}^{2}.$$

In most of the SVM literature, this is written equivalently as:

$$\min_{f \in \mathcal{H}} C \sum_{i=1}^{n} (1 - y_i f(x_i))_{+} + \frac{1}{2} \|f\|_{\mathcal{H}}^{2},$$

where the two formulations coincide for $C = \frac{1}{2 \lambda n}$, so $C$ plays the role of an inverse regularization parameter.
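The equivalence can also be checked numerically: with $C = \frac{1}{2\lambda n}$, the SVM-style objective equals the Tikhonov-style objective divided by $2\lambda$, so the two share the same minimizers. The sketch below (with arbitrary placeholder data, kernel, and coefficients, not from the cited sources) verifies this identity for one random instance.

```python
# Sketch only: the two objectives differ by the constant factor 1/(2*lambda)
# when C = 1/(2*lambda*n), so they have the same minimizers.
import numpy as np

rng = np.random.default_rng(1)
n, lam = 30, 0.1
C = 1.0 / (2.0 * lam * n)

X = rng.normal(size=(n, 2))
y = np.sign(rng.normal(size=n))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))   # RBF Gram matrix
c = rng.normal(size=n)                                          # arbitrary coefficients

fx = K @ c                                   # f(x_i) under the representer expansion
hinge = np.maximum(0.0, 1.0 - y * fx)
norm_sq = c @ K @ c                          # ||f||^2 in the RKHS

tikhonov = hinge.mean() + lam * norm_sq      # (1/n) sum hinge + lambda ||f||^2
svm_form = C * hinge.sum() + 0.5 * norm_sq   # C sum hinge + (1/2) ||f||^2

assert np.isclose(svm_form, tikhonov / (2.0 * lam))
```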

This problem is non-differentiable because of the "kink" in the loss function. However, we can rewrite it using slack variables $\xi_i$, which at the optimum equal the hinge losses $(1 - y_i f(x_i))_{+}$:

$$\min_{f \in \mathcal{H},\, \xi_i \in \mathbb{R}} C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|f\|_{\mathcal{H}}^{2}$$

subject to:

$$\xi_i \geq 1 - y_i f(x_i), \qquad \xi_i \geq 0, \qquad i = 1, \dots, n.$$

Next we apply the representer theorem, substituting $f(x) = \sum_{j=1}^{n} c_j \mathbf{K}(x_j, x)$, to get:

$$\min_{c \in \mathbb{R}^{n},\, \xi \in \mathbb{R}^{n}} C \sum_{i=1}^{n} \xi_i + \frac{1}{2} c^{T} \mathbf{K} c$$

subject to:

$$\xi_i \geq 1 - y_i \sum_{j=1}^{n} c_j \mathbf{K}(x_i, x_j), \qquad \xi_i \geq 0, \qquad i = 1, \dots, n.$$
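Since the problem is now finite-dimensional in $(c, \xi)$, it can in principle be handed to any generic constrained optimizer. The sketch below is one illustrative way to do that with SciPy's SLSQP solver on a small synthetic data set; the data, kernel, and value of $C$ are placeholders, and in practice a dedicated QP solver would be used instead.

```python
# Sketch only: solve the finite-dimensional primal with slack variables
# using a generic constrained optimizer (SciPy SLSQP).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, C = 20, 1.0
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)               # toy labels in {-1, +1}
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # RBF Gram matrix

def objective(z):
    """C * sum(xi) + (1/2) c^T K c, with z = [c, xi]."""
    c, xi = z[:n], z[n:]
    return C * xi.sum() + 0.5 * c @ K @ c

constraints = [
    # margin constraints: xi_i - 1 + y_i * sum_j c_j K(x_i, x_j) >= 0
    {"type": "ineq", "fun": lambda z: z[n:] - 1.0 + y * (K @ z[:n])},
    # nonnegativity of the slacks: xi_i >= 0
    {"type": "ineq", "fun": lambda z: z[n:]},
]

res = minimize(objective, np.zeros(2 * n), method="SLSQP", constraints=constraints)
c_opt, xi_opt = res.x[:n], res.x[n:]
```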

This is a constrained optimization problem, which we will solve using the Lagrangian to derive the dual problem. The Lagrangian is:

$$L(c, \xi, \alpha, \zeta) = C \sum_{i=1}^{n} \xi_i + \frac{1}{2} c^{T} \mathbf{K} c - \sum_{i=1}^{n} \alpha_i \Big( \xi_i - 1 + y_i \sum_{j=1}^{n} c_j \mathbf{K}(x_i, x_j) \Big) - \sum_{i=1}^{n} \zeta_i \xi_i,$$

where $\alpha_i \geq 0$ and $\zeta_i \geq 0$ are the Lagrange multipliers of the margin and nonnegativity constraints, respectively.

The dual problem is:

$$\max_{\alpha \geq 0,\, \zeta \geq 0}\; \min_{c,\, \xi}\; L(c, \xi, \alpha, \zeta).$$

Minimizing with respect to $c$:

$$\frac{\partial L}{\partial c} = \mathbf{K} c - \mathbf{K} Y \alpha = 0 \quad \Rightarrow \quad c_i = y_i \alpha_i, \qquad \text{where } Y = \operatorname{diag}(y_1, \dots, y_n).$$

Minimizing with respect to $\xi$:

$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \zeta_i = 0 \quad \Rightarrow \quad 0 \leq \alpha_i \leq C \quad \text{(since } \alpha_i \geq 0 \text{ and } \zeta_i \geq 0\text{)}.$$

Then, plugging into the Lagrangian, we can write the dual problem as:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} c^{T} \mathbf{K} c.$$

Then, plugging in $c_i = y_i \alpha_i$, we get:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{K}(x_i, x_j)$$

subject to:

$$0 \leq \alpha_i \leq C, \qquad i = 1, \dots, n.$$

Note that this dual problem is easier to solve than the original problem because it is box-constrained (the $\alpha_i$ are bounded between $0$ and $C$). Also notice that the slack variables $\xi_i$ have disappeared in the dual problem.
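Because the only constraints are the bounds $0 \leq \alpha_i \leq C$, the dual can be handed to any bound-constrained optimizer. The sketch below (with placeholder data and $C$, not from the cited sources) minimizes the negated dual objective with SciPy's L-BFGS-B and then recovers the expansion coefficients via $c_i = y_i \alpha_i$ as in the derivation above.

```python
# Sketch only: solve the box-constrained dual
#   max_alpha  sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j),
#   subject to 0 <= alpha_i <= C,
# by minimizing its negation with a bound-constrained quasi-Newton method.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, C = 40, 1.0
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
Q = (y[:, None] * y[None, :]) * K                 # Q_ij = y_i y_j K(x_i, x_j)

def neg_dual(alpha):
    """Negated dual objective: (1/2) alpha^T Q alpha - sum(alpha)."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def neg_dual_grad(alpha):
    return Q @ alpha - 1.0

res = minimize(neg_dual, np.zeros(n), jac=neg_dual_grad,
               method="L-BFGS-B", bounds=[(0.0, C)] * n)
alpha = res.x
c = alpha * y                                     # recover c_i = y_i alpha_i
```

In practice, specialized solvers (e.g. sequential minimal optimization) exploit the box structure more efficiently, but the example shows that even a generic bound-constrained method applies directly.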

Consequences and interpretations


The Karush–Kuhn–Tucker conditions dictate that all optimal solutions must satisfy the following complementary slackness conditions for $i = 1, \dots, n$:

$$\alpha_i \Big( \xi_i - 1 + y_i \sum_{j=1}^{n} c_j \mathbf{K}(x_i, x_j) \Big) = 0 \qquad \text{and} \qquad \zeta_i \xi_i = (C - \alpha_i)\, \xi_i = 0.$$

From these constraints, and recalling that $f(x_i) = \sum_{j=1}^{n} c_j \mathbf{K}(x_i, x_j)$ with $c_j = y_j \alpha_j$, we can derive conditions relating the $\alpha_i$ to the margins $y_i f(x_i)$:[11]

$$\alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \geq 1, \qquad 0 < \alpha_i < C \;\Rightarrow\; y_i f(x_i) = 1, \qquad \alpha_i = C \;\Rightarrow\; y_i f(x_i) \leq 1.$$

Note that the solution is relatively sparse, because $c_i = y_i \alpha_i = 0$ whenever $y_i f(x_i) > 1$. In SVM, the input points $x_i$ with non-zero coefficients $c_i$ are called support vectors. Given the above constraints, the support vectors are precisely the input points where $y_i f(x_i) \leq 1$.
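The sketch below illustrates how the support vectors would be read off from an optimal $\alpha$; the numbers are hypothetical values chosen to be consistent with the conditions above, not the output of a solver.

```python
# Sketch only: identify support vectors from (hypothetical) optimal dual variables.
import numpy as np

C, tol = 1.0, 1e-6
alpha = np.array([0.0, 0.3, 1.0, 0.0, 0.7])     # hypothetical optimal alpha_i
margins = np.array([1.4, 1.0, 0.2, 2.1, 1.0])   # corresponding values of y_i f(x_i)

support = alpha > tol                   # support vectors: nonzero coefficients
on_margin = support & (alpha < C - tol) # 0 < alpha_i < C  =>  y_i f(x_i) = 1
inside_margin = alpha >= C - tol        # alpha_i = C      =>  y_i f(x_i) <= 1

print(np.flatnonzero(support))          # indices of the support vectors
```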


Notes

  1. ^ Rosasco, Lorenzo. "Regularized Least-Squares and Support Vector Machines" (PDF).
  2. ^ Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
  3. ^ Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
  4. ^ Rosasco, Lorenzo; De Vito, Ernesto; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
  5. ^ This hypothesis space of functions is a Hilbert space of all the functions the algorithm is allowed to choose from.
  6. ^ For insight on choosing the parameter, see, e.g., Wahba, Grace; Wang, Yonghua (1990). "When is the optimal regularization parameter insensitive to the choice of the loss function". Communications in Statistics – Theory and Methods. 19 (5): 1685–1700. doi:10.1080/03610929008830285.
  7. ^ See Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science. 2111: 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.
  8. ^ Lee, Yoonkyung; Lin, Yi; Wahba, Grace (2004). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
  9. ^ Rosasco, Lorenzo; De Vito, Ernesto; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
  10. ^ For a detailed derivation, see Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
  11. ^ For more detail, see Rosasco, Lorenzo. "Regularized Least Squares and Support Vector Machines" (PDF).
