$\newcommand{\ip}[2]{\langle #1, #2 \rangle}$
The Representer Theorem
Consider the general problem
$$\min_{w} \; f\big(\ip{w}{\psi(x_1)}, \dots, \ip{w}{\psi(x_m)}\big) + R(\lVert w \rVert) \tag{1}$$
where $f: \mathbb{R}^m \rightarrow \mathbb{R}$ is typically a cost function, $R: \mathbb{R}_+ \rightarrow \mathbb{R}$ is typically a monotonically nondecreasing regularizer, and $\psi$ is a mapping from $\mathcal{X}$ to a Hilbert space. Then there exists a vector $\alpha \in \mathbb{R}^m$ such that $w = \sum_i \alpha_i \psi(x_i)$ is an optimal solution of (1).
Corollary
Substituting $w = \sum_i \alpha_i \psi(x_i)$ into (1), we can rewrite
$\ip{w}{\psi(x_j)} = \ip{\sum_i \alpha_i \psi(x_i)}{\psi(x_j)} = \sum_i \alpha_i \ip{\psi(x_i)}{\psi(x_j)}$
and
$\lVert w \rVert = \sqrt{\ip{\sum_i \alpha_i \psi(x_i)}{\sum_j \alpha_j \psi(x_j)}} = \sqrt{\sum_{i,j} \alpha_i \alpha_j \ip{\psi(x_i)}{\psi(x_j)}}.$
Let $K(x, x') = \ip{\psi(x)}{\psi(x')}$. We can then rewrite (1) as
$$\min_{\alpha \in \mathbb{R}^m} \; f\Big(\sum_j \alpha_j K(x_j, x_1), \dots, \sum_j \alpha_j K(x_j, x_m)\Big) + R\Big(\sqrt{\sum_{i,j} \alpha_i \alpha_j K(x_j, x_i)}\Big). \tag{2}$$
Thus we only need to compute inner products in the embedding space $\psi(\mathcal X)$, never the embedding $\psi$ itself.
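As a concrete instance of this corollary, take $f$ to be the squared loss and $R(\lVert w \rVert) = \lambda \lVert w \rVert^2$: problem (2) becomes kernel ridge regression, whose minimizer is $\alpha = (K + \lambda I)^{-1} y$. Below is a minimal NumPy sketch; the RBF kernel and the toy 1-D data are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# toy 1-D regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 0.1
K = rbf_kernel(X, X)                                  # Gram matrix K_ij = <psi(x_i), psi(x_j)>
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # representer coefficients

# prediction at new points needs only kernel evaluations, never psi itself
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha
print(y_pred)
```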
SVM
Hard SVM
Assume the dataset (in the embedding space) is linearly separable. This is a rather strong assumption: whether it holds depends on which embedding we are using, and we had no guarantee of separability in the first place. By leveraging the freedom to rescale $w$ and $b$, we can impose the constraints $y_i(w^T \psi(x_i)+b) \ge 1, \ \forall i = 1, \dots, N$, thus turning (3) into
(Primal)
$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_i (w^T \psi(x_i) + b) \ge 1, \ \forall i = 1, \dots, N \tag{4}$$
(Dual)
$$\max_{a \ge 0} \; \sum_i a_i - \frac{1}{2} \sum_{i,j} a_i a_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_i a_i y_i = 0 \tag{5}$$
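The dual (5) is a finite-dimensional quadratic program in $a$ and can be handed to any off-the-shelf QP routine. A rough Python sketch using `scipy.optimize.minimize` (SLSQP) is shown below; the tiny linearly separable toy set and the linear kernel are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# tiny linearly separable toy set (illustrative)
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                                  # linear kernel K_ij = <x_i, x_j>
Q = (y[:, None] * y[None, :]) * K

def neg_dual(a):                             # negated dual objective, to be minimized
    return 0.5 * a @ Q @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0.0, None)] * len(y)                   # a_i >= 0 (hard margin)
res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=bounds, constraints=cons)

a = res.x
w = (a * y) @ X                              # w = sum_i a_i y_i x_i
sv = a > 1e-6                                # support vectors
b = np.mean(y[sv] - X[sv] @ w)               # from y_i (w^T x_i + b) = 1 on the margin
print("alpha:", a.round(3), "w:", w, "b:", b)
```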
Remark
1)
The decision boundary, i.e. the separating hyperplane $y(x) = w^T \psi(x) + b$, can be reformulated, by using $w = \sum_i a_i y_i \psi(x_i)$, into
$$y(x) = \sum_i a_i y_i K(x_i, x) + b. \tag{6}$$
2)
Because the primal problem (4) achieves zero duality gap (it is a convex quadratic program whose constraints are affine in $(w, b)$), complementary slackness gives
$a_i (1 - y_i (w^T \psi(x_i) + b)) = 0, \forall i$
thus only a small portion of the data points (denoted as a set $\mathcal S$) will have $a_i \neq 0$, and for these $y_i (w^T \psi(x_i) + b) = 1$; these data points are called the support vectors.
1) + 2) =>
So at inference time we only need to retain the support vectors in (6): the other data points, having $a_i = 0$, play no role in (6).
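Library implementations behave exactly this way: after fitting, only the support set $\mathcal S$, the coefficients $a_i y_i$, and $b$ are stored, and (6) is evaluated over that set. A small scikit-learn sketch follows; the very large $C$ (to approximate the hard margin) and the toy Gaussian blobs are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# two well-separated toy blobs (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (30, 2)), rng.normal(-2.0, 0.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="rbf", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

# only the support vectors (a_i != 0) are stored and used at inference time
print("support vectors kept:", len(clf.support_vectors_), "of", len(X))
# decision_function(x) evaluates sum_{i in S} a_i y_i K(x_i, x) + b
print(clf.decision_function(X[:3]))
```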
Soft SVM
A relaxation of the hard SVM that does not require a linearly separable dataset, obtained by using the hinge loss. An alternative motivation is to allow the constraints in (4) to be violated by slack variables $\xi_i \ge 0$, which are then penalized in the objective.
(Primal)
$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T \psi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i \tag{7}$$
(Dual)
$$\max_{a} \; \sum_i a_i - \frac{1}{2} \sum_{i,j} a_i a_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le a_i \le C, \ \sum_i a_i y_i = 0 \tag{8}$$
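Eliminating the slack via $\xi_i = \max(0,\, 1 - y_i(w^T \psi(x_i) + b))$ turns the primal (7) into the unconstrained hinge-loss objective $\frac{1}{2}\lVert w\rVert^2 + C\sum_i \max(0,\, 1 - y_i(w^T \psi(x_i) + b))$, which can be minimized directly by subgradient descent. A rough sketch for the linear kernel ($\psi$ the identity); the toy data, step size, and iteration count are assumptions for illustration.

```python
import numpy as np

# toy, slightly overlapping two-class data (illustrative)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.array([1.0] * 50 + [-1.0] * 50)

C, lr, epochs = 1.0, 0.01, 200
w, b = np.zeros(2), 0.0

for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                     # points with nonzero slack xi_i
    # subgradient of 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + b))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean(np.sign(X @ w + b) == y)
print("train accuracy:", acc)
```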