As the very first example, we consider the binary classification problem, where the data points have features 𝒙i from a domain set 𝒳 and (deterministic) labels yi in {0,1}. We assume that:

  • The i.i.d. assumption: The examples in the training set {𝒙1,…,𝒙m} are independently and identically distributed (i.i.d.) according to an underlying distribution π’Ÿ over the domain set 𝒳.

  • Finiteness of the hypothesis class: The learning algorithm chooses the predictor h from a finite hypothesis class β„‹, i.e. |β„‹|<∞.

  • The realizability assumption: There exists h*βˆˆβ„‹ such that the true loss L(π’Ÿ,f)(h*) is zero, where f is the true labeling function (i.e. yi=f(𝒙i)). This implies that for a randomly sampled training set SβˆΌπ’Ÿ^m, the empirical loss LS(h*) is 0 (with probability 1).

Say our training set is S={(𝒙1,y1),…,(𝒙m,ym)} (drawn from π’Ÿ^m), and our learning algorithm is A. The output of the algorithm is the predictor h=A(S). We measure the performance of h over the training set S using the empirical loss:

LS(h)=|{i∈[m]:h(𝒙i)β‰ yi}|/m=(1/m)βˆ‘_{i=1}^{m}𝟏[h(𝒙i)β‰ yi].

Once we have obtained the predictor h, we care about its performance in the environment, i.e. over the underlying distribution π’Ÿ. The generalization loss (also called the true error or the risk) is the probability that h predicts wrongly on a random sample:

L(π’Ÿ,f)⁒(h)=Prπ’™βˆΌπ’Ÿβ‘[h⁒(𝒙)β‰ f⁒(𝒙)]=π’Ÿβ’({𝒙:h⁒(𝒙)β‰ f⁒(𝒙)}),

where π’Ÿβ’(S) is defined as the probability that we sample a point 𝒙 in 𝒳 w.r.t. π’Ÿ and the point 𝒙 lies in S.

Empirical Risk Minimization (ERM)

The objective of a learning algorithm (or a learner) is to minimize the true error w.r.t. π’Ÿ and f. However, π’Ÿ and f are unknown. The only thing we can observe is the training set SβˆΌπ’Ÿ^m. Thus a natural algorithm is to minimize the empirical loss:

h=argmin_{hβˆˆβ„‹} LS(h).

The algorithm is known as the Empirical Risk Minimization (ERM) algorithm, which we denote by 𝖀𝖱𝖬ℋ. Though the concept is simple, ERM has a serious weakness: it may overfit the training set. When the training set S fails to reveal the underlying structure of π’Ÿ, the true loss of the ERM predictor can be large. For instance, if by bad luck all the labels in S happen to be one, the ERM predictor may classify everything as positive.
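For a finite β„‹, the ERM rule can be implemented by brute-force search over the class; a minimal sketch (the function names are illustrative assumptions):

```python
def erm(H, S):
    """ERM_H(S): return a hypothesis in H minimizing the empirical loss on S.
    For finite H this is a brute-force search; min breaks ties arbitrarily."""
    def L_S(h):
        return sum(1 for x, y in S if h(x) != y) / len(S)
    return min(H, key=L_S)
```

For example, with β„‹ a small class of threshold functions, erm returns a threshold making the fewest training mistakes.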

A common solution to this problem is to restrict the search space of ERM, i.e. to shrink the candidate set β„‹. To get some intuition, consider the most extreme case: if the hypothesis class β„‹ is the set of all functions from 𝒳 to {0,1}, ERM can learn a predictor h with zero empirical loss that fits the training set perfectly (h(𝒙i)=yi for all i∈[m]). However, the true loss can be arbitrarily bad when |S|/|𝒳| is close to zero. Philosophically, if someone can explain every phenomenon, their explanations are worthless.
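The extreme case above can be made concrete with a hypothetical "memorizer" hypothesis that replays the training labels and predicts 0 elsewhere: its empirical loss is zero, yet its true loss approaches one when S covers only a sliver of 𝒳 (the whole setup below, including the uniform distribution and constant labeling, is purely illustrative):

```python
def memorizer(S):
    """A hypothesis that memorizes the training labels and predicts 0 elsewhere."""
    table = {x: y for x, y in S}
    return lambda x: table.get(x, 0)

# Illustrative setup: domain {0, ..., 99}, uniform D, true label always 1,
# but the training set covers only two points.
f = lambda x: 1
S = [(0, 1), (1, 1)]
h = memorizer(S)
empirical = sum(1 for x, y in S if h(x) != y) / len(S)   # 0.0: perfect fit on S
true = sum(1 for x in range(100) if h(x) != f(x)) / 100  # 0.98: wrong off S
```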

Though restricting the space of β„‹ seems helpful in handling the overfitting problem, it also introduces bias. Therefore, ideally we want the choice of β„‹ to be based on some prior knowledge of the problem. As the other extreme case, if β„‹ contains only one predictor (say a single threshold function), it will surely not overfit S, but its true loss is also unlikely to be small. For now, we simply assume that the realizability assumption holds, which means there exists an h*βˆˆβ„‹ such that L(π’Ÿ,f)(h*)=0. Such an β„‹ encodes perfect prior knowledge about π’Ÿ and f, and is thus unrealistic. We will see how to remove this assumption later, but for simplicity let's keep it for now.

Finite Hypothesis Class

In this section we show that if the i.i.d. assumption and the realizability assumption hold, then with a sufficiently large number of samples, the true loss of the predictor 𝖀𝖱𝖬ℋ(S) learned from a finite β„‹ is small:

L(π’Ÿ,f)⁒(𝖀𝖱𝖬ℋ⁒(S))≀Ρ with probability at least ⁒(1-Ξ΄).

Intuitively this holds because with more samples the training set reveals more structure of the underlying distribution. Suppose the training set S contains m samples drawn i.i.d. from π’Ÿ, and let hS=𝖀𝖱𝖬ℋ(S). We want to upper-bound the probability

PrSβˆΌπ’Ÿm⁑[L(π’Ÿ,f)⁒(hS)>Ξ΅].

Let β„‹BβŠ‚β„‹ be the set of all β€œbad” predictors whose true loss is greater than Ξ΅, i.e. for each hBβˆˆβ„‹B we have L(π’Ÿ,f)(hB)>Ξ΅. For each such hB, we see that

Prπ’™βˆΌπ’Ÿβ‘[hB⁒(𝒙)=f⁒(𝒙)]=1-L(π’Ÿ,f)⁒(hB)≀1-Ξ΅.

Under the realizability assumption, the ERM output always attains zero empirical loss, so ERM can output the specific bad predictor hB only if hB's empirical loss on S is zero. Since the m samples are drawn i.i.d.,

PrSβˆΌπ’Ÿm⁑[LS⁒(hB)=0]≀(1-Ξ΅)m≀e-Ρ⁒m.

By the union bound, the probability that we learn any bad predictor hBβˆˆβ„‹B satisfies:

PrSβˆΌπ’Ÿm⁑[βˆƒhBβˆˆβ„‹B:LS⁒(hB)=0]β‰€βˆ‘hBβˆˆβ„‹BPrSβˆΌπ’Ÿm⁑[LS⁒(hB)=0]≀|β„‹B|⁒e-Ρ⁒m≀|β„‹|⁒e-Ρ⁒m,

which, since the realizability assumption guarantees LS(hS)=0, implies:

PrSβˆΌπ’Ÿm⁑[L(π’Ÿ,f)⁒(hS)>Ξ΅]≀|β„‹|⁒e-Ρ⁒m.

We conclude that:

Corollary 1.

Let β„‹ be a finite hypothesis class, let δ∈(0,1) and Ξ΅>0, and let m be an integer satisfying

mβ‰₯log(|β„‹|/Ξ΄)/Ξ΅.

Then, for any distribution π’Ÿ and for any labeling function f, if the realizable assumption holds, with probability at least (1-Ξ΄), we have

L(π’Ÿ,f)⁒(𝖀𝖱𝖬ℋ⁒(S))≀Ρ.

References

  • [1] S. Shalev-Shwartz and S. Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.