As the very first example, we consider the binary classification problem, where the data points have features 𝒙i from a domain set 𝒳 and (deterministic) labels yi in {0,1}. We assume that:

  • The i.i.d. assumption: The examples in the training set {𝒙1,…,𝒙m} are independently and identically distributed (i.i.d.) according to an underlying distribution π’Ÿ over the domain set 𝒳.

  • Finiteness of the hypothesis class: The learning algorithm chooses the predictor h from a finite hypothesis class β„‹, i.e. |β„‹|<∞.

  • The realizability assumption: There exists h*βˆˆβ„‹ such that the true loss L(π’Ÿ,f)(h*) is zero, where f is the true labeling function (i.e. yi=f(𝒙i)). This implies that for a randomly sampled training set SβˆΌπ’Ÿ^m, the empirical loss LS(h*) is 0 (with probability 1).

Say our training set is S={(𝒙1,y1),…,(𝒙m,ym)} (drawn from π’Ÿ^m), and our learning algorithm is A. The output of the algorithm is the predictor h=A(S). We measure the performance of h over the training set S using the empirical loss:

LS(h)=|{i∈[m]:h(𝒙i)β‰ yi}|/m=(1/m)βˆ‘_{i=1}^{m}𝟏[h(𝒙i)β‰ yi].

Once we have obtained the predictor h, we care about its performance in the environment, i.e. over the underlying distribution π’Ÿ. The generalization loss (also called the true error or the risk) is the probability that h predicts wrongly on a random sample:

L(π’Ÿ,f)⁒(h)=Prπ’™βˆΌπ’Ÿβ‘[h⁒(𝒙)β‰ f⁒(𝒙)]=π’Ÿβ’({𝒙:h⁒(𝒙)β‰ f⁒(𝒙)}),

where π’Ÿβ’(S) is defined as the probability that we sample a point 𝒙 in 𝒳 w.r.t. π’Ÿ and the point 𝒙 lies in S.

Empirical Risk Minimization (ERM)

The objective of a learning algorithm (or a learner) is to minimize the true error w.r.t. π’Ÿ and f. However, π’Ÿ and f are unknown. The only thing we can observe is the training set SβˆΌπ’Ÿ^m. Thus a natural algorithm is to minimize the empirical loss:

h=argmin_{hβˆˆβ„‹} LS(h).

The algorithm is known as the Empirical Risk Minimization (ERM) algorithm, which we denote by 𝖀𝖱𝖬ℋ. Though the concept is simple, ERM has a serious weakness: it may overfit the training set. When the training set S fails to reveal the underlying structure of π’Ÿ, the true loss of the ERM predictor can be large. For instance, if by bad luck all the labels in S happen to be one, the ERM predictor may classify everything as positive.
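For a finite β„‹, the ERM rule can be implemented by brute-force search over the class; a minimal sketch (the function names are illustrative assumptions):

```python
def erm(H, S):
    """ERM_H(S): return a hypothesis in H minimizing the empirical loss on S.
    For finite H this is a brute-force search; min breaks ties arbitrarily."""
    def L_S(h):
        return sum(1 for x, y in S if h(x) != y) / len(S)
    return min(H, key=L_S)
```

For example, with β„‹ a small class of threshold functions, erm returns a threshold making the fewest training mistakes.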

A common solution to this problem is to restrict the search space of ERM, i.e. to shrink the candidate set β„‹. To get some intuition, consider the most extreme case: if the hypothesis class β„‹ is the set of all functions from 𝒳 to {0,1}, ERM can learn a predictor h with zero empirical loss that fits the training set perfectly (h(𝒙i)=yi for all i∈[m]). However, the true loss can be arbitrarily bad when |S|/|𝒳| is close to zero. Philosophically, if someone can explain every phenomenon, their explanations are worthless.
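The extreme case above can be made concrete with a hypothetical "memorizer" hypothesis that replays the training labels and predicts 0 elsewhere: its empirical loss is zero, yet its true loss approaches one when S covers only a sliver of 𝒳 (the whole setup below, including the uniform distribution and constant labeling, is purely illustrative):

```python
def memorizer(S):
    """A hypothesis that memorizes the training labels and predicts 0 elsewhere."""
    table = {x: y for x, y in S}
    return lambda x: table.get(x, 0)

# Illustrative setup: domain {0, ..., 99}, uniform D, true label always 1,
# but the training set covers only two points.
f = lambda x: 1
S = [(0, 1), (1, 1)]
h = memorizer(S)
empirical = sum(1 for x, y in S if h(x) != y) / len(S)   # 0.0: perfect fit on S
true = sum(1 for x in range(100) if h(x) != f(x)) / 100  # 0.98: wrong off S
```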

Though restricting the space of β„‹ seems helpful in handling the overfitting problem, it also introduces bias. Therefore, ideally we want the choice of β„‹ to be based on some prior knowledge of the problem. As the other extreme case, if β„‹ contains only one predictor (say a single threshold function), it will surely not overfit S, but its true loss is also unlikely to be small. For now, we simply assume that the realizability assumption holds, which means there exists an h*βˆˆβ„‹ such that L(π’Ÿ,f)(h*)=0. Such an β„‹ encodes perfect prior knowledge about π’Ÿ and f, and is thus unrealistic. We will see how to remove this assumption later, but for simplicity let's keep it for now.

Finite Hypothesis Class

In this section we show that if the i.i.d. assumption and the realizability assumption hold, then with a sufficiently large number of samples, the true loss of the predictor 𝖀𝖱𝖬ℋ(S) learned from a finite β„‹ is small:

L(π’Ÿ,f)⁒(𝖀𝖱𝖬ℋ⁒(S))≀Ρ with probability at least ⁒(1-Ξ΄).

Intuitively this holds because with more samples the training set reveals more structure of the underlying distribution. Suppose the training set S contains m samples drawn i.i.d. from π’Ÿ, and let hS=𝖀𝖱𝖬ℋ(S). We want to upper-bound the probability

PrSβˆΌπ’Ÿm⁑[L(π’Ÿ,f)⁒(hS)>Ξ΅].

Let β„‹BβŠ‚β„‹ be the set of all β€œbad” predictors whose true loss is greater than Ξ΅, i.e. for each hBβˆˆβ„‹B we have L(π’Ÿ,f)(hB)>Ξ΅. For each such hB, we see that

Prπ’™βˆΌπ’Ÿβ‘[hB⁒(𝒙)=f⁒(𝒙)]=1-L(π’Ÿ,f)⁒(hB)≀1-Ξ΅.

Under the realizability assumption, the ERM output always attains zero empirical loss, so ERM can output the specific bad predictor hB only if hB's empirical loss on S is zero. Since the m samples are drawn i.i.d.,

PrSβˆΌπ’Ÿm⁑[LS⁒(hB)=0]≀(1-Ξ΅)m≀e-Ρ⁒m.

By the union bound, the probability that we learn any bad predictor hBβˆˆβ„‹B satisfies:

PrSβˆΌπ’Ÿm⁑[βˆƒhBβˆˆβ„‹B:LS⁒(hB)=0]β‰€βˆ‘hBβˆˆβ„‹BPrSβˆΌπ’Ÿm⁑[LS⁒(hB)=0]≀|β„‹B|⁒e-Ρ⁒m≀|β„‹|⁒e-Ρ⁒m,

which, since the realizability assumption guarantees LS(hS)=0, implies:

PrSβˆΌπ’Ÿm⁑[L(π’Ÿ,f)⁒(hS)>Ξ΅]≀|β„‹|⁒e-Ρ⁒m.

We conclude that:

Corollary 1.

Let β„‹ be a finite hypothesis class, let δ∈(0,1) and Ξ΅>0, and let m be an integer satisfying

mβ‰₯log(|β„‹|/Ξ΄)/Ξ΅.

Then, for any distribution π’Ÿ and for any labeling function f, if the realizable assumption holds, with probability at least (1-Ξ΄), we have

L(π’Ÿ,f)⁒(𝖀𝖱𝖬ℋ⁒(S))≀Ρ.

References

  • [1] S. Shalev-Shwartz and S. Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.