Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as to machine learning as part of statistical learning theory.
The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class $S$ from one and the same sample. Moreover, it is required that the relative frequency of the events converge to the probability uniformly over the entire class of events $S$.[1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.
For a class of predicates $H$ defined on a set $X$ and a set of samples $x=(x_{1},x_{2},\dots ,x_{m})$, where $x_{i}\in X$, the empirical frequency of $h\in H$ on $x$ is

$$\widehat{Q}_{x}(h)={\frac {1}{m}}|\{i:1\leq i\leq m,\,h(x_{i})=1\}|.$$
The theoretical probability of $h\in H$ is defined as

$$Q_{P}(h)=P\{y\in X:h(y)=1\}.$$
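The following Python sketch illustrates these two definitions by simulation; it is not part of the source, and the particular predicate (a threshold function) and distribution (uniform on $[0,1]$) are assumptions chosen only for illustration.

```python
import random

def h(y, theta=0.3):
    # Hypothetical predicate (an assumption for illustration): h(y) = 1 iff y <= theta.
    return 1 if y <= theta else 0

def empirical_frequency(h, sample):
    # Q_hat_x(h) = (1/m) * |{i : h(x_i) = 1}|
    return sum(h(xi) for xi in sample) / len(sample)

random.seed(0)
m = 10_000
sample = [random.random() for _ in range(m)]  # i.i.d. draws from P = Uniform[0, 1]

print("empirical frequency    :", empirical_frequency(h, sample))
print("theoretical probability:", 0.3)  # Q_P(h) = P{y : y <= 0.3} = 0.3
```

For a large enough sample the two printed numbers are close, which is exactly what the law of large numbers guarantees for a single predicate.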
The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$, then with high probability the empirical frequency will be close to its expected value, which is the theoretical probability.[2]
Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function.
The statement of the uniform convergence theorem is as follows:[3]
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon >0$ and $m$ a positive integer, we have:

$$P^{m}\{|Q_{P}(h)-\widehat{Q}_{x}(h)|\geq \varepsilon {\text{ for some }}h\in H\}\leq 4\Pi _{H}(2m)e^{-\varepsilon ^{2}m/8},$$
where, for any $x\in X^{m}$,

$$Q_{P}(h)=P\{y\in X:h(y)=1\},$$
$$\widehat{Q}_{x}(h)={\frac {1}{m}}|\{i:1\leq i\leq m,\,h(x_{i})=1\}|,$$

and $|x|=m$.
$P^{m}$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
$\Pi _{H}$ is defined as follows: for any set $H$ of $\{0,1\}$-valued functions over $X$ and any $D\subseteq X$,

$$\Pi _{H}(D)=\{h\cap D:h\in H\}.$$
And for any natural number $m$, the shattering number $\Pi _{H}(m)$ is defined as

$$\Pi _{H}(m)=\max |\{h\cap D:|D|=m,h\in H\}|.$$
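As an illustration of the shattering number, the sketch below (a toy example, not from the source) computes $\Pi _{H}(m)$ by brute force for an assumed finite ground set and an assumed class of threshold predicates $h_{t}(y)=1$ iff $y\leq t$, by enumerating the distinct restrictions of $H$ over all $m$-point subsets $D$.

```python
from itertools import combinations

# Toy ground set and a toy class of threshold predicates h_t(y) = 1 iff y <= t
# (both are assumptions chosen only to make the computation concrete).
X = list(range(10))
H = [lambda y, t=t: 1 if y <= t else 0 for t in range(-1, 10)]

def shattering_number(H, X, m):
    # Pi_H(m): the largest number of distinct restrictions of H to a subset D
    # of X with |D| = m.
    best = 0
    for D in combinations(X, m):
        restrictions = {tuple(h(y) for y in D) for h in H}
        best = max(best, len(restrictions))
    return best

for m in range(1, 5):
    print(m, shattering_number(H, X, m))  # for threshold predicates this prints m + 1
```

For threshold predicates the growth is linear ($m+1$), reflecting their VC dimension of 1.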
From the point of view of learning theory one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem we state Sauer's lemma, which we will need in the proof.
The Sauer–Shelah lemma[4] relates the shattering number $\Pi _{H}(m)$ to the VC dimension.
Lemma: $\Pi _{H}(m)\leq \left({\frac {em}{d}}\right)^{d}$, where $d$ is the VC dimension of the concept class $H$.
Corollary: $\Pi _{H}(m)\leq m^{d}$.
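To get a feel for the quantitative content of the theorem, the sketch below (an illustrative calculation, not from the source) evaluates the right-hand side $4\Pi _{H}(2m)e^{-\varepsilon ^{2}m/8}$ with $\Pi _{H}(2m)$ replaced by the Sauer-corollary bound $(2m)^{d}$, and reports the smallest sample size (found by doubling) at which the bound drops below a target failure probability $\delta$; the particular values of $d$, $\varepsilon$, and $\delta$ are assumptions.

```python
import math

def uniform_convergence_bound(m, d, eps):
    # Right-hand side of the theorem, with Pi_H(2m) replaced by the
    # Sauer-corollary upper bound (2m)^d.
    return 4 * (2 * m) ** d * math.exp(-eps ** 2 * m / 8)

def sample_size_for(d, eps, delta):
    # Smallest m (found by doubling) at which the bound falls below delta.
    m = 1
    while uniform_convergence_bound(m, d, eps) > delta:
        m *= 2
    return m

d, eps, delta = 3, 0.1, 0.05        # illustrative choices
m = sample_size_for(d, eps, delta)
print(f"m = {m}, bound = {uniform_convergence_bound(m, d, eps):.3e}")
```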
[1] and [3] are the sources of the proof below. Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview of the proof.
Symmetrization: We transform the problem of analyzing $|Q_{P}(h)-\widehat{Q}_{x}(h)|\geq \varepsilon $ into the problem of analyzing $|\widehat{Q}_{r}(h)-\widehat{Q}_{s}(h)|\geq \varepsilon /2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as a testing sample used to estimate $Q_{P}(h)$.
Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So, we will try to bound the probability of $|\widehat{Q}_{r}(h)-\widehat{Q}_{s}(h)|\geq \varepsilon /2$ for some $h\in H$ by considering the effect of a specific collection of permutations of the joint sample $x=r||s$. Specifically, we consider permutations $\sigma (x)$ which swap $x_{i}$ and $x_{m+i}$ for $i$ in some subset of $\{1,2,\ldots ,m\}$. The symbol $r||s$ means the concatenation of $r$ and $s$.
Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
We present the technical details of the proof.
Lemma: Let $V=\{x\in X^{m}:|Q_{P}(h)-\widehat{Q}_{x}(h)|\geq \varepsilon {\text{ for some }}h\in H\}$ and

$$R=\{(r,s)\in X^{m}\times X^{m}:|\widehat{Q}_{r}(h)-\widehat{Q}_{s}(h)|\geq \varepsilon /2{\text{ for some }}h\in H\}.$$

Then for $m\geq {\frac {2}{\varepsilon ^{2}}}$, $P^{m}(V)\leq 2P^{2m}(R)$.
Proof: By the triangle inequality, if $|Q_{P}(h)-\widehat{Q}_{r}(h)|\geq \varepsilon $ and $|Q_{P}(h)-\widehat{Q}_{s}(h)|\leq \varepsilon /2$ then $|\widehat{Q}_{r}(h)-\widehat{Q}_{s}(h)|\geq \varepsilon /2$.
Therefore,

$$\begin{aligned}P^{2m}(R)\geq {}&P^{2m}\{\exists h\in H,\,|Q_{P}(h)-\widehat{Q}_{r}(h)|\geq \varepsilon {\text{ and }}|Q_{P}(h)-\widehat{Q}_{s}(h)|\leq \varepsilon /2\}\\={}&\int _{V}P^{m}\{s:\exists h\in H,\,|Q_{P}(h)-\widehat{Q}_{r}(h)|\geq \varepsilon {\text{ and }}|Q_{P}(h)-\widehat{Q}_{s}(h)|\leq \varepsilon /2\}\,dP^{m}(r)\\={}&A\end{aligned}$$

since $r$ and $s$ are independent.
Now for $r\in V$ fix an $h\in H$ such that $|Q_{P}(h)-\widehat{Q}_{r}(h)|\geq \varepsilon $. For this $h$, we shall show that

$$P^{m}\left\{|Q_{P}(h)-\widehat{Q}_{s}(h)|\leq {\frac {\varepsilon }{2}}\right\}\geq {\frac {1}{2}}.$$

Thus for any $r\in V$, $A\geq {\frac {P^{m}(V)}{2}}$ and hence $P^{2m}(R)\geq {\frac {P^{m}(V)}{2}}$. This establishes the first step of the high-level plan.
Notice that $m\cdot \widehat{Q}_{s}(h)$ is a binomial random variable with expectation $m\cdot Q_{P}(h)$ and variance $m\cdot Q_{P}(h)(1-Q_{P}(h))$. By Chebyshev's inequality we get

$$P^{m}\left\{|Q_{P}(h)-\widehat{Q}_{s}(h)|>{\frac {\varepsilon }{2}}\right\}\leq {\frac {m\cdot Q_{P}(h)(1-Q_{P}(h))}{(\varepsilon m/2)^{2}}}\leq {\frac {1}{\varepsilon ^{2}m}}\leq {\frac {1}{2}}$$

for the stated bound on $m$. Here we use the fact that $x(1-x)\leq 1/4$ for all $x$.
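The binomial/Chebyshev step can be checked numerically. The sketch below (a Monte Carlo illustration under assumed values of $Q_{P}(h)$, $\varepsilon$, and $m\geq 2/\varepsilon ^{2}$, not part of the source) estimates $P^{m}\{|Q_{P}(h)-\widehat{Q}_{s}(h)|>\varepsilon /2\}$ by simulation and compares it with the Chebyshev bound $1/(\varepsilon ^{2}m)$.

```python
import random

random.seed(1)
q = 0.4          # assumed theoretical probability Q_P(h)
eps = 0.2        # accuracy parameter
m = 200          # satisfies m >= 2 / eps**2 = 50
trials = 20_000

def deviation_exceeds(q, eps, m):
    # Draw m Bernoulli(q) labels and test whether |Q_P(h) - Q_hat_s(h)| > eps/2.
    q_hat = sum(random.random() < q for _ in range(m)) / m
    return abs(q - q_hat) > eps / 2

estimate = sum(deviation_exceeds(q, eps, m) for _ in range(trials)) / trials
chebyshev = 1 / (eps ** 2 * m)   # q(1-q) m / (eps m / 2)^2 relaxed via q(1-q) <= 1/4
print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Chebyshev bound:      {chebyshev:.4f}  (<= 1/2 since m >= 2/eps^2)")
```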
Let $\Gamma _{m}$ be the set of all permutations of $\{1,2,3,\dots ,2m\}$ that swap $i$ and $m+i$, for all $i$ in some subset of $\{1,2,\ldots ,m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,

$$P^{2m}(R)=E[\Pr[\sigma (x)\in R]]\leq \max _{x\in X^{2m}}(\Pr[\sigma (x)\in R]),$$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma $ chosen uniformly from $\Gamma _{m}$.
Proof: For any $\sigma \in \Gamma _{m}$,

$$P^{2m}(R)=P^{2m}\{x:\sigma (x)\in R\}$$

(since coordinate permutations preserve the product distribution $P^{2m}$).
Therefore,

$$\begin{aligned}P^{2m}(R)={}&\int _{X^{2m}}1_{R}(x)\,dP^{2m}(x)\\={}&{\frac {1}{|\Gamma _{m}|}}\sum _{\sigma \in \Gamma _{m}}\int _{X^{2m}}1_{R}(\sigma (x))\,dP^{2m}(x)\\={}&\int _{X^{2m}}{\frac {1}{|\Gamma _{m}|}}\sum _{\sigma \in \Gamma _{m}}1_{R}(\sigma (x))\,dP^{2m}(x)\quad {\text{(because }}|\Gamma _{m}|{\text{ is finite)}}\\={}&\int _{X^{2m}}\Pr[\sigma (x)\in R]\,dP^{2m}(x)\quad {\text{(the expectation)}}\\\leq {}&\max _{x\in X^{2m}}(\Pr[\sigma (x)\in R]).\end{aligned}$$
The maximum is guaranteed to exist, since there is only a finite set of values that the probability under a random permutation can take.
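The identity $P^{2m}(R)=E[\Pr[\sigma (x)\in R]]$ can be sanity-checked by simulation. The sketch below (an illustrative Monte Carlo check with an assumed event $R$, predicate, and distribution, not from the source) draws a joint sample of length $2m$, applies a uniformly random swap permutation from $\Gamma _{m}$, and verifies that the frequency of $\{\sigma (x)\in R\}$ matches the frequency of $\{x\in R\}$.

```python
import random

random.seed(2)
m = 5
eps = 0.5
h = lambda y: 1 if y < 0.3 else 0          # assumed predicate, for illustration

def in_R(x):
    # R: the empirical frequencies of h on the two halves of x differ by >= eps/2.
    r, s = x[:m], x[m:]
    q_r = sum(map(h, r)) / m
    q_s = sum(map(h, s)) / m
    return abs(q_r - q_s) >= eps / 2

def random_swap_permutation(x):
    # A uniform element of Gamma_m: independently swap x[i] and x[m+i] for each i.
    x = list(x)
    for i in range(m):
        if random.random() < 0.5:
            x[i], x[m + i] = x[m + i], x[i]
    return x

trials = 50_000
plain, permuted = 0, 0
for _ in range(trials):
    x = [random.random() for _ in range(2 * m)]   # i.i.d. draws from P = Uniform[0, 1]
    plain += in_R(x)
    permuted += in_R(random_swap_permutation(x))

print(f"frequency of x in R:        {plain / trials:.4f}")
print(f"frequency of sigma(x) in R: {permuted / trials:.4f}")
```

The two estimates agree up to simulation noise, reflecting that coordinate swaps preserve the product distribution.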
Reduction to a finite class
Lemma: Based on the previous lemma,

$$\max _{x\in X^{2m}}(\Pr[\sigma (x)\in R])\leq 4\Pi _{H}(2m)e^{-\varepsilon ^{2}m/8}.$$
Proof: Let us write $x=(x_{1},x_{2},\ldots ,x_{2m})$ and let $t=|H|_{x}|$, the number of distinct restrictions of functions in $H$ to the sample $x$, which is at most $\Pi _{H}(2m)$. This means there are functions $h_{1},h_{2},\ldots ,h_{t}\in H$ such that for any $h\in H$ there exists $i$ between $1$ and $t$ with $h_{i}(x_{k})=h(x_{k})$ for $1\leq k\leq 2m$.
We see that $\sigma (x)\in R$ iff some $h$ in $H$ satisfies

$$\left|{\frac {1}{m}}|\{1\leq i\leq m:h(x_{\sigma _{i}})=1\}|-{\frac {1}{m}}|\{m+1\leq i\leq 2m:h(x_{\sigma _{i}})=1\}|\right|\geq {\frac {\varepsilon }{2}}.$$
Hence, if we define $w_{i}^{j}=1$ if $h_{j}(x_{i})=1$ and $w_{i}^{j}=0$ otherwise, for $1\leq i\leq m$ and $1\leq j\leq t$, then $\sigma (x)\in R$ iff some $j$ in $\{1,\ldots ,t\}$ satisfies

$$\left|{\frac {1}{m}}\left(\sum _{i}w_{\sigma (i)}^{j}-\sum _{i}w_{\sigma (m+i)}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}.$$
By the union bound we get

$$\Pr[\sigma (x)\in R]\leq t\cdot \max _{j}\left(\Pr \left[\left|{\frac {1}{m}}\left(\sum _{i}w_{\sigma (i)}^{j}-\sum _{i}w_{\sigma (m+i)}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}\right]\right)\leq \Pi _{H}(2m)\cdot \max _{j}\left(\Pr \left[\left|{\frac {1}{m}}\left(\sum _{i}w_{\sigma (i)}^{j}-\sum _{i}w_{\sigma (m+i)}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}\right]\right).$$
Since the distribution over the permutations $\sigma $ is uniform, for each $i$ the difference $w_{\sigma (i)}^{j}-w_{\sigma (m+i)}^{j}$ equals $\pm |w_{i}^{j}-w_{m+i}^{j}|$ with equal probability.
Thus,

$$\Pr \left[\left|{\frac {1}{m}}\sum _{i}\left(w_{\sigma (i)}^{j}-w_{\sigma (m+i)}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}\right]=\Pr \left[\left|{\frac {1}{m}}\sum _{i}|w_{i}^{j}-w_{m+i}^{j}|\beta _{i}\right|\geq {\frac {\varepsilon }{2}}\right],$$

where the probability on the right is over independent uniformly random signs $\beta _{i}\in \{-1,+1\}$, both possibilities being equally likely. By Hoeffding's inequality, this is at most $2e^{-m\varepsilon ^{2}/8}$.
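This final step can likewise be illustrated numerically: for fixed coefficients $a_{i}=|w_{i}^{j}-w_{m+i}^{j}|\in \{0,1\}$ and independent random signs $\beta _{i}$, the tail probability of ${\frac {1}{m}}\sum _{i}a_{i}\beta _{i}$ is at most $2e^{-m\varepsilon ^{2}/8}$. The sketch below (a Monte Carlo check with assumed coefficients and parameters, not from the source) compares a simulated tail with this Hoeffding bound.

```python
import math
import random

random.seed(3)
m = 100
eps = 0.4
# Assumed coefficients a_i = |w_i^j - w_{m+i}^j| in {0, 1}; here half of them are 1.
a = [1 if i < m // 2 else 0 for i in range(m)]

def tail_event(a, m, eps):
    # Draw independent uniform signs beta_i and test |(1/m) sum a_i beta_i| >= eps/2.
    s = sum(ai * random.choice((-1, 1)) for ai in a)
    return abs(s / m) >= eps / 2

trials = 50_000
estimate = sum(tail_event(a, m, eps) for _ in range(trials)) / trials
hoeffding = 2 * math.exp(-m * eps ** 2 / 8)
print(f"Monte Carlo tail probability:      {estimate:.5f}")
print(f"Hoeffding bound 2 e^(-m eps^2/8):  {hoeffding:.5f}")
```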
Finally, combining all three parts of the proof, we obtain the Uniform Convergence Theorem.
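As a closing illustration (not from the source), the sketch below estimates by simulation the uniform deviation $\sup _{h\in H}|Q_{P}(h)-\widehat{Q}_{x}(h)|$ for an assumed class of threshold predicates under an assumed uniform distribution, showing that the worst-case deviation over the whole class shrinks as the sample size grows, as the theorem predicts.

```python
import bisect
import random

random.seed(4)

def sup_deviation(m, thresholds):
    # Draw m i.i.d. points from Uniform[0, 1]; for the threshold predicate
    # h_t(y) = 1 iff y <= t we have Q_P(h_t) = t, and the empirical frequency
    # is the fraction of sample points <= t.
    sample = sorted(random.random() for _ in range(m))
    return max(
        abs(t - bisect.bisect_right(sample, t) / m)
        for t in thresholds
    )

thresholds = [k / 100 for k in range(101)]      # a fine grid of threshold hypotheses
for m in (100, 1_000, 10_000):
    devs = [sup_deviation(m, thresholds) for _ in range(100)]
    print(f"m = {m:6d}   average sup deviation = {sum(devs) / len(devs):.4f}")
```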
^ a b c Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025.
This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968.
The translation was reproduced as:
Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
^ "Martingales", Probability with Martingales, Cambridge University Press, pp. 93–105, 1991-02-14, doi:10.1017/cbo9780511813658.014, ISBN 978-0-521-40455-6.
^ a b Martin Anthony, Peter L. Bartlett. Neural Network Learning: Theoretical Foundations, pages 46–50. First Edition, 1999. Cambridge University Press. ISBN 0-521-57353-X.
^ Sham Kakade and Ambuj Tewari, CMSC 35900 (Spring 2008) Learning Theory, Lecture 11