As the very first example, we consider the binary classification problem, where the data points have features from a domain set $\mathcal{X}$ and (deterministic) labels in $\mathcal{Y} = \{0, 1\}$. We assume that:
- The i.i.d. assumption: The examples in the training set are independently and identically distributed (i.i.d.) according to an underlying distribution $\mathcal{D}$ over the domain set $\mathcal{X}$.
- Finiteness of the hypothesis class: The learning algorithm chooses the predictor from a finite hypothesis class $\mathcal{H}$, i.e. $|\mathcal{H}| < \infty$.
- The realizable assumption: There exists $h^* \in \mathcal{H}$ such that the true loss $L_{\mathcal{D},f}(h^*)$ is zero, where $f$ is the true labeling function (i.e. $y_i = f(x_i)$ for every example). This implies that if we randomly sample a training set $S$, the empirical loss $L_S(h^*)$ is always 0.
Say our training set is $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ (whose instances are drawn i.i.d. from $\mathcal{D}$ and labeled by $f$), and our learning algorithm is $A$. The output of the algorithm is the predictor $h_S = A(S)$. We measure the performance of $h_S$ over the training set using the empirical loss:

$$L_S(h) = \frac{\big|\{i \in [m] : h(x_i) \neq y_i\}\big|}{m}.$$
Once we have obtained the predictor $h_S$, we care about its general performance in the environment (over the underlying distribution $\mathcal{D}$). The generalization loss (also called the true error or the risk) is the probability that the predictor is wrong on a random sample:

$$L_{\mathcal{D},f}(h) = \mathbb{P}_{x \sim \mathcal{D}}\big[h(x) \neq f(x)\big] = \mathcal{D}\big(\{x : h(x) \neq f(x)\}\big),$$

where $\mathcal{D}(A)$ is defined as the probability that a point sampled from $\mathcal{X}$ according to $\mathcal{D}$ lies in the set $A$.
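While $\mathcal{D}$ and $f$ are unknown to the learner, in a toy setting we can simulate them. Below is a minimal Python sketch (not part of the original notes; the uniform distribution on $[0,1]$, the threshold labeling function, and the particular predictor are made-up assumptions) contrasting the empirical loss on a small sample with a Monte Carlo estimate of the true loss.

```python
import random

# Illustrative assumptions: D is uniform on [0, 1], the true labeling
# function f is a threshold at 0.5, and h is a slightly wrong threshold.
def f(x):            # assumed true labeling function
    return 1 if x >= 0.5 else 0

def h(x):            # some fixed predictor we want to evaluate
    return 1 if x >= 0.6 else 0

def sample_D(m, rng):
    return [rng.random() for _ in range(m)]

def empirical_loss(h, S):
    """L_S(h): fraction of training points that h mislabels."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def true_loss_mc(h, n, rng):
    """Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]."""
    xs = sample_D(n, rng)
    return sum(1 for x in xs if h(x) != f(x)) / n

rng = random.Random(0)
S = [(x, f(x)) for x in sample_D(20, rng)]        # small training set
print("empirical loss L_S(h):", empirical_loss(h, S))
print("estimated true loss  :", true_loss_mc(h, 100_000, rng))  # about 0.1
```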
Empirical Risk Minimization (ERM)
The objective of a learning algorithm (or a learner) is to minimize the true error w.r.t. $\mathcal{D}$ and $f$. However, $\mathcal{D}$ and $f$ are unknown; the only thing we can observe is the training set $S$. Thus a natural algorithm is to minimize the empirical loss:

$$h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h).$$
This algorithm is known as the Empirical Risk Minimization (ERM) algorithm, which we denote as $\mathrm{ERM}_{\mathcal{H}}$. Though the concept is simple, ERM has a serious weakness: it may overfit the training set. When the training set does not reveal the underlying structure of $\mathcal{D}$, the true loss of the ERM predictor can be large. Imagine that the labels of the samples in $S$ all happen to be one; the ERM predictor may then predict everything as positive (though this is a very unlikely situation for a large sample).
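As a concrete illustration of $\mathrm{ERM}_{\mathcal{H}}$ over a finite class, the sketch below enumerates a small set of threshold predictors and returns one minimizing the empirical loss. The class, the distribution, and the labeling function are assumptions made up for this example, not something specified in the notes.

```python
import random

# Hypothetical finite hypothesis class: 11 threshold predictors on [0, 1],
# h_t(x) = 1 iff x >= t for t in {0.0, 0.1, ..., 1.0}, so |H| = 11.
thresholds = [t / 10 for t in range(11)]
H = [(t, (lambda x, t=t: 1 if x >= t else 0)) for t in thresholds]

def empirical_loss(h, S):
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """ERM_H(S): return a hypothesis in H minimizing the empirical loss."""
    return min(H, key=lambda pair: empirical_loss(pair[1], S))

# Assumed environment: D uniform on [0, 1], true labels given by the 0.5
# threshold, so the realizable assumption holds (the target is in H).
rng = random.Random(0)
f = lambda x: 1 if x >= 0.5 else 0
S = [(x, f(x)) for x in (rng.random() for _ in range(50))]

t_hat, h_hat = erm(H, S)
print("ERM picked threshold:", t_hat)                      # close to 0.5
print("its empirical loss  :", empirical_loss(h_hat, S))   # 0.0 under realizability
```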
A common solution to this problem is to restrict the search space of ERM (i.e. shrink the candidate set $\mathcal{H}$). To get some intuition, consider the most extreme case where the hypothesis class is the set of all functions from $\mathcal{X}$ to $\{0, 1\}$. ERM can then learn a predictor $h_S$ with zero empirical loss that fits the training set perfectly ($L_S(h_S) = 0$). However, the true loss $L_{\mathcal{D},f}(h_S)$ can be arbitrarily bad if the probability mass that $\mathcal{D}$ places on the training points is close to zero. Philosophically, if someone can explain every phenomenon, their explanations are worthless.
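To make this failure mode concrete, here is a hedged sketch (the uniform distribution and the threshold labels are again made-up assumptions) of a memorizing predictor: it replays the training labels exactly, so its empirical loss is zero, yet on a continuous domain fresh samples are unseen almost surely, and the predictor is no better than a constant guess there.

```python
import random

rng = random.Random(1)
f = lambda x: 1 if x >= 0.5 else 0             # assumed true labeling function
S = [(x, f(x)) for x in (rng.random() for _ in range(30))]

memory = dict(S)                                # memorize the training set

def memorizer(x):
    """Replays the stored label if x was seen in training,
    otherwise falls back to a constant guess."""
    return memory.get(x, 1)

emp = sum(1 for x, y in S if memorizer(x) != y) / len(S)
xs = [rng.random() for _ in range(100_000)]     # fresh points, unseen a.s.
true = sum(1 for x in xs if memorizer(x) != f(x)) / len(xs)
print("empirical loss:", emp)    # 0.0
print("true loss     :", true)   # about 0.5
```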
Though restricting the space $\mathcal{H}$ seems helpful in handling the overfitting problem, it also introduces bias. Therefore, ideally we want the choice of $\mathcal{H}$ to be based on some prior knowledge of the problem. As another extreme case, if $\mathcal{H}$ contains only one predictor (say a single threshold function), it will surely not overfit $S$, but the true loss is also unlikely to be small. For now, we simply assume that the realizable assumption holds, which means there exists an $h^* \in \mathcal{H}$ such that $L_{\mathcal{D},f}(h^*) = 0$. Such an $\mathcal{H}$ encodes perfect prior knowledge about $\mathcal{D}$ and $f$, and is thus unrealistic. We will see how to remove this assumption later, but for simplicity let's keep it for now.
Finite Hypothesis Class
In this section we show that if the i.i.d. assumption and the realizable assumption hold, then with a sufficiently large number of samples, the true loss of the predictor $h_S = \mathrm{ERM}_{\mathcal{H}}(S)$ learned from a finite $\mathcal{H}$ is small with high probability:

$$\mathbb{P}\big[L_{\mathcal{D},f}(h_S) \leq \epsilon\big] \geq 1 - \delta.$$
Intuitively this is true because the training set reveals more structure of the underlying distribution as the number of samples grows. Suppose the training set $S$ contains $m$ samples drawn i.i.d. from $\mathcal{D}$ and let $h_S = \mathrm{ERM}_{\mathcal{H}}(S)$. We want to upper-bound the probability

$$\mathbb{P}\big[L_{\mathcal{D},f}(h_S) > \epsilon\big].$$
Let $\mathcal{H}_B$ be the set of all "bad" predictors whose true loss is greater than $\epsilon$, i.e. for each $h \in \mathcal{H}_B$ we have $L_{\mathcal{D},f}(h) > \epsilon$. For each $h \in \mathcal{H}_B$ and a single sample $x \sim \mathcal{D}$, we see that

$$\mathbb{P}_{x \sim \mathcal{D}}\big[h(x) = f(x)\big] = 1 - L_{\mathcal{D},f}(h) < 1 - \epsilon.$$
Since the ERM predictor always attains zero empirical loss under the realizable assumption, the probability that we learn the specific bad predictor $h$ is at most the probability that its empirical loss on $S$ is zero:

$$\mathbb{P}\big[L_S(h) = 0\big] = \prod_{i=1}^{m} \mathbb{P}_{x_i \sim \mathcal{D}}\big[h(x_i) = f(x_i)\big] \leq (1 - \epsilon)^m \leq e^{-\epsilon m},$$

where the product follows from the i.i.d. assumption and the last step uses $1 - \epsilon \leq e^{-\epsilon}$.
By the union bound, the probability that we learn any bad predictor is at most:

$$\mathbb{P}\big[\exists h \in \mathcal{H}_B : L_S(h) = 0\big] \leq \sum_{h \in \mathcal{H}_B} \mathbb{P}\big[L_S(h) = 0\big] \leq |\mathcal{H}_B|\, e^{-\epsilon m} \leq |\mathcal{H}|\, e^{-\epsilon m},$$
which is equivalent to:

$$\mathbb{P}\big[L_{\mathcal{D},f}(h_S) \leq \epsilon\big] \geq 1 - |\mathcal{H}|\, e^{-\epsilon m}.$$
Requiring the failure probability $|\mathcal{H}|\, e^{-\epsilon m}$ to be at most $\delta$ and solving for $m$, we conclude that:
Corollary 1.
Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta \in (0, 1)$ and $\epsilon > 0$, and let $m$ be an integer that satisfies

$$m \geq \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}.$$

Then, for any distribution $\mathcal{D}$ and for any labeling function $f$, if the realizable assumption holds, then with probability at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have

$$L_{\mathcal{D},f}(h_S) \leq \epsilon \quad \text{for } h_S = \mathrm{ERM}_{\mathcal{H}}(S).$$
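As a quick numerical reading of the corollary (the particular values of $|\mathcal{H}|$, $\epsilon$, and $\delta$ below are arbitrary choices for illustration), a small Python sketch of the sample-size bound:

```python
import math

def sample_complexity(H_size, eps, delta):
    """Smallest integer m with m >= log(|H| / delta) / eps,
    the bound from Corollary 1 (natural logarithm)."""
    return math.ceil(math.log(H_size / delta) / eps)

# Arbitrary illustrative values: |H| = 1000, eps = 0.05, delta = 0.01.
print(sample_complexity(1_000, 0.05, 0.01))       # 231
# Squaring the size of H only adds log(|H|) / eps more samples:
print(sample_complexity(1_000_000, 0.05, 0.01))   # 369
```

The dependence on $|\mathcal{H}|$ and $1/\delta$ is only logarithmic, which is what makes the bound useful even for very large (but finite) hypothesis classes.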
References
- [1] Shai Shalev-Shwartz and Shai Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.