Support Vector Machine¶
- Support vector machine (SVM) is a binary linear classifier
- There are tricks that allow SVM to solve non-linear problems
- There are extensions that allow using SVM for multiclass classification or regression
- SVM is a supervised learning algorithm
- There are extensions that allow using SVM for (unsupervised) clustering
Linear SVM¶
- Let's consider a training dataset of N samples (\vec x_1, y_1), \cdots, (\vec x_N, y_N)
- \vec x_i is a D-dimensional vector representing features
- y_i is a class label; for convenience y_i \in \left\{-1, 1\right\}
- At this point we assume that the classes are linearly separable
- For visualization purposes let's use D = 2
# our standard imports: matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# just to overwrite default colab style
plt.style.use('default')
plt.style.use('seaborn-talk')
# class I
X01 = [(2, 9), (7, 19), (1, 10), (3, 19), (4, 16), (5, 18)]
# class II
X02 = [(4, 3), (6, 7), (1, -10), (3, -1), (9, 5), (5, -7)]

plt.xlim([0, 10])
plt.scatter(*(zip(*X01)))
plt.scatter(*(zip(*X02)));
- In general we want to find a hyperplane which separates the two classes
- In the example above it is just a line
plt.scatter(*(zip(*X01)))
plt.scatter(*(zip(*X02)))
plt.plot([0, 10], [0, 15], 'C2-');
- Once a hyperplane is found, the classification is straightforward
- In this case, everything above the line is classified as a blue point and everything below as an orange point
- However, there are infinitely many possible lines / hyperplanes
plt.scatter(*(zip(*X01)))
plt.scatter(*(zip(*X02)))

plt.scatter(5, 9, c='C3')  # test point

plt.plot([0, 10], [2, 22], 'C4-');
plt.plot([0, 10], [0, 15], 'C2-');
- The plot above shows two possible lines which correctly classify all training data
- The red point represents a test sample
- Judging "by eye", one could say it is rather orange than blue
- The final prediction depends on the choice of line, though
- Thus, one needs a criterion to choose the best line / hyperplane
- SVM chooses the hyperplane which is maximally far away from any training point
plt.plot([0, 10], [0, 20], 'C2-', zorder=0)
plt.plot([0, 10], [5, 25], 'C2--', zorder=0)
plt.plot([0, 10], [-5, 15], 'C2--', zorder=0)

plt.scatter(5, 9, c='C3')  # test point

plt.annotate('', (0, 5), (0.55, -4), arrowprops=dict(arrowstyle='<->'))
plt.text(0.5, -1.5, '$m$')
plt.text(0.3, 2.5, '$m$')

plt.annotate('', (3.05, 18.1), (3.8, 7.75), arrowprops=dict(arrowstyle='<->'))
plt.text(2.9, 14, '$m_i$')

plt.scatter(*(zip(*X01)), zorder=1)
plt.scatter(*(zip(*X02)), zorder=1)

sv = X01[:2] + X02[:2]
plt.scatter(*(zip(*sv)), zorder=1, facecolors='none', edgecolors='r', s=500);
- The goal is to maximize the margin 2m
- Please note that the margin is given by the distance from the closest data points to the hyperplane
- This means that the classifier depends only on a small subset of the training data, called support vectors
Hard margin¶
- Any hyperplane can be defined by an intercept term b and a normal vector \vec w (usually called weights in machine learning)
- All the points \vec x on the hyperplane satisfy \vec w \cdot \vec x + b = 0
- If \vec w \cdot \vec x + b > 0 the point lies on one side of the hyperplane; if \vec w \cdot \vec x + b < 0, on the other
- We can easily define the linear classifier
f(\vec x) = \text{sign}\left(\vec w \cdot \vec x + b\right)
- That is why we wanted class labels to be -1 or 1 (instead of the more standard 0 and 1)
- All we need is to find the weights \vec w and the intercept b
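- Once the weights are known, prediction is a single dot product and a sign; a minimal sketch (the values of w and b below are hypothetical, chosen only for illustration):

def linear_classifier(x, w, b):
    """Return the predicted class label (-1 or 1) for a feature vector x"""
    return np.sign(np.dot(w, x) + b)

w = np.array([-2.0, 1.0])  # normal vector of the (hypothetical) line y = 2x
b = 0.0                    # intercept

print(linear_classifier(np.array([2, 9]), w, b))  # 1.0  (above the line)
print(linear_classifier(np.array([4, 3]), w, b))  # -1.0 (below the line)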
Functional margin¶
- Let's define the functional margin of a hyperplane (\vec w, b) w.r.t. a training sample (\vec x_i, y_i) as
\hat m_i = y_i\cdot \left(\vec w \cdot \vec x_i + b\right)
- It is positive only if the training sample label has the same sign as the prediction
- The magnitude tells us something about the confidence of a prediction
- But please note that we can choose an arbitrary factor k > 0 and \text{sign}\left(\vec w \cdot \vec x + b\right) = \text{sign}\left(k\cdot\vec w \cdot \vec x + k\cdot b\right)
- The functional margin w.r.t. the whole training dataset is the smallest one
\hat m = \min\limits_{i = 1, \cdots, N}\hat m_i
Geometric margin¶
- Let m_i be the distance from a training point \vec x_i to the decision boundary / hyperplane (\vec w, b)
- Let \vec x_h be the point on the hyperplane closest to \vec x_i
- The shortest path must be perpendicular to the hyperplane (so parallel to \vec w)
- We can express \vec x_h by
\vec x_h = \vec x_i - y_i\cdot m_i\cdot \frac{\vec w}{|\vec w|}
- Because \vec x_h lies on the hyperplane it must fulfill \vec w \cdot \vec x_h + b = 0, thus
\vec w\cdot\left(\vec x_i - y_i\cdot m_i\cdot \frac{\vec w}{|\vec w|}\right) + b = 0
- It is straightforward to solve for m_i
m_i = y_i\frac{\vec w\cdot \vec x_i + b}{|\vec w|} = \frac{\hat m_i}{|\vec w|}
- As before, the geometric margin w.r.t. the whole training dataset is the smallest one
m = \min\limits_{i = 1, \cdots, N} m_i = \frac{\hat m}{|\vec w|}
- Please note that m does not change when the weights are scaled
- \hat m = m for |\vec w| = 1
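- A short numeric sketch of these formulas, using the toy data X01/X02 from the plots above and a hypothetical separating line y = 2x (i.e. \vec w = (-2, 1), b = 0):

# functional and geometric margins for a hypothetical hyperplane
w, b = np.array([-2.0, 1.0]), 0.0

X = np.array(X01 + X02)                     # training points
y = np.array([1]*len(X01) + [-1]*len(X02))  # blue = 1, orange = -1

m_hat = y * (X @ w + b)            # functional margins \hat m_i
m_geo = m_hat / np.linalg.norm(w)  # geometric margins m_i

print(m_hat.min(), m_geo.min())  # margins w.r.t. the whole dataset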
The optimal margin¶
- We want to find the maximum geometric margin m
- But with the constraint that every sample lies outside the margin
\left.\begin{array}{cc}\text{maximize}_{w, b, m} & m\\\text{subject to} & m_i \geq m\end{array}\right.
- As we discussed, we can choose any normalization of the normal vector |\vec w| as it does not change the value of m_i = y_i\frac{\vec w\cdot \vec x_i + b}{|\vec w|}
- Let's choose |\vec w| = \frac{1}{m}, so
\left.\begin{array}{cc}\text{maximize}_{w, b} & \frac{1}{|\vec w|}\\\text{subject to} & |\vec w|\cdot m_i \geq 1\end{array}\right.
- Since maximizing |\vec w|^{-1} means the same as minimizing |\vec w|, we can finally write
\left.\begin{array}{cc}\text{minimize}_{w,b} & \frac{1}{2}|\vec w|^2\\\text{subject to} & \hat m_i \geq 1\end{array}\right.
- Please note that we choose |\vec w|^2 over |\vec w| to avoid the square root, and \frac{1}{2} is for mathematical convenience
- Now, all we need is to optimize a quadratic function subject to linear constraints (quadratic programming)
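- For instance, the primal problem can be handed directly to a generic solver; a sketch using scipy's SLSQP on the toy data above (this is not how SVM libraries actually solve it, but it shows the structure of the problem):

from scipy.optimize import minimize

X = np.array(X01 + X02, dtype=float)
y = np.array([1]*len(X01) + [-1]*len(X02))

def objective(wb):  # wb = (w1, w2, b)
    return 0.5 * np.dot(wb[:2], wb[:2])

# one linear constraint y_i * (w . x_i + b) - 1 >= 0 per training sample
constraints = [{'type': 'ineq',
                'fun': lambda wb, xi=xi, yi=yi: yi * (np.dot(wb[:2], xi) + wb[2]) - 1}
               for xi, yi in zip(X, y)]

result = minimize(objective, x0=np.zeros(3), constraints=constraints)
print(result.x)  # (w1, w2, b) of the maximum margin hyperplane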
Lagrange multipliers¶
- The method of Lagrange multipliers is a common method for solving constrained optimization problems
- Let's consider the following problem (with an equality constraint):
\left.\begin{array}{cc}\text{minimize} & f(\vec x)\\\text{subject to} & g(\vec x) = 0\end{array}\right.
- Please note that one could also consider g(\vec x) = c, where c is a constant (then define g'(\vec x) = g(\vec x) - c = 0)
- The Lagrange multiplier theorem says that at a constrained optimum (if it exists) \nabla f is parallel to \nabla g, so
\nabla f(\vec x) = \lambda\nabla g(\vec x)
- where \lambda is a Lagrange multiplier
How it works¶
- Let's consider the following problem:
\left.\begin{array}{cc}\text{minimize} & f(x, y) = x^2 + y^2\\\text{subject to} & g(x, y) = x\cdot y - 1 = 0\end{array}\right.
def flc(c, n=100):
    """Level curves of the objective function"""
    return np.array([(c * np.cos(ang), c * np.sin(ang))
                     for ang in np.linspace(0, 2*np.pi, n)])

def g(n=100):
    """Constraint"""
    return np.array([(x, 1./x) for x in np.linspace(0.1, 2.5, n)])

##### PLOT SETTINGS #####
plt.figure(figsize=(8, 8))
plt.xlim([-2.5, 2.5])
plt.ylim([-2.5, 2.5])
plt.grid(color='0.5', linestyle='--', linewidth=0.5)

##### LEVEL CURVES OF f(x, y) #####
for c in (0.2, 0.6, 1, 1.8, 2.2):
    plt.plot(*flc(c).T, color='C0')
plt.plot(*flc(np.sqrt(2)).T, color='C0', label='$f(x, y) = c$')

##### CONSTRAINTS #####
plt.plot(*g().T, color='C1', label='$g(x, y) = 0$')
plt.plot(*-g().T, color='C1')

##### INTERSECTIONS #####
plt.scatter(1, 1, c='C2', zorder=4)
plt.scatter(np.sqrt(0.345), np.sqrt(1/0.345), c='C3', zorder=4)

plt.legend();
- Blue lines represent the level curves of the objective function (f(x, y) = const)
- Orange lines represent the constraint (g(x, y) = 0)
- Let's first consider a point where the blue and orange curves are not tangent (the red point)
  - moving along the constraint in one direction results in a decrease of the objective function
  - while moving in the other direction results in an increase of the objective function
  - thus, this point cannot be an optimum
- The only possibility is that a constrained optimum lies where the curves are tangent (it still may not be an optimum, but at least it could be)
Lagrangian¶
- In general, there may be many constraints:
\left.\begin{array}{cc}\text{minimize} & f(\vec x)\\\text{subject to} & g_i(\vec x) = 0, \hspace{5pt} i = 1,\cdots, M\end{array}\right.
- Then, there are M Lagrange multipliers \lambda_i
- It is convenient to define the Lagrangian:
\mathcal{L}(\vec x, \lambda) = f(\vec x) + \sum\limits_{i=1}^M\lambda_ig_i(\vec x)
- and solve:
\nabla_{x_1,\cdots, x_n, \lambda_1,\cdots, \lambda_M} \mathcal{L}(\vec x, \lambda) = 0
- which is equivalent to:
\nabla f(\vec x) = \sum\limits_{i=1}^M\lambda_i\nabla g_i(\vec x)
g_1(\vec x) = \cdots = g_M(\vec x) = 0
- The optimum (if it exists) is always a saddle point of the Lagrangian
- On the one hand, we want to minimize the Lagrangian over \vec x
- On the other hand, we want to maximize the Lagrangian over \lambda_i
Example¶
- Let's consider the objective function f(x) = x^2 with the constraint g(x) = x - 1 = 0
- The Lagrangian is given by:
\mathcal{L}(x, \lambda) = x^2 + \lambda\cdot (x - 1)
\frac{\partial\mathcal{L}}{\partial x} = 2x + \lambda = 0
\frac{\partial\mathcal{L}}{\partial\lambda} = x - 1 = 0
- It has a saddle point at x = 1 and \lambda = -2
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

ax = plt.figure().add_subplot(111, projection=Axes3D.name)

X, Y = np.meshgrid(np.linspace(-1, 3, 50), np.linspace(-5, 1, 50))
L = X**2 + Y * (X - 1)

ax.plot_surface(X, Y, L, cmap=cm.hsv)
ax.view_init(elev=45, azim=120)
ax.set_xlabel('$x$', labelpad=20)
ax.set_ylabel('$\lambda$', labelpad=20)
ax.set_zlabel('$\mathcal{L}$', labelpad=10);
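- A quick symbolic check of this example (a sketch assuming sympy is available):

import sympy as sp

x, lam = sp.symbols('x lambda')
L = x**2 + lam * (x - 1)  # the Lagrangian

# solve dL/dx = 0 and dL/dlambda = 0 simultaneously
print(sp.solve([sp.diff(L, x), sp.diff(L, lam)], [x, lam]))
# {x: 1, lambda: -2}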
Lagrange duality¶
- Let's consider the optimization problem with inequality constraints:
\left.\begin{array}{cc}\text{minimize} & f(\vec x)\\\text{subject to} & g_i(\vec x) \leq 0, \hspace{5pt} i = 1,\cdots, M\\\text{} & h_i(\vec x) = 0, \hspace{5pt} i = 1,\cdots, N\end{array}\right.
- We will call it the primal optimization problem and define the generalized Lagrangian:
\mathcal{L}(\vec x, \mu, \lambda) = f(\vec x) + \sum\limits_{i=1}^M\mu_ig_i(\vec x) + \sum\limits_{i=1}^N\lambda_ih_i(\vec x)
- with the additional restriction that \mu_i \geq 0
  - if the i-th constraint is fulfilled (g_i(\vec x) \leq 0), making \mu_i more positive decreases the Lagrangian
  - making \mu_i negative is not allowed
- Please note that in the case of constraints g_i \geq 0, the restriction is \mu_i \leq 0
- The task can be expressed in terms of minimizing the following function:
\theta_P(\vec x) = \left\{\begin{array}{cc}f(\vec x) & \text{if x satisfies primal constraints} \\ \infty & \text{otherwise}\end{array}\right. = \max\limits_{\mu, \lambda; \mu_i \geq 0}\mathcal{L}(\vec x, \mu, \lambda)
- Thus, we can write the final task in the form:
\min\limits_{\vec x}\theta_P(\vec x) = \min\limits_{\vec x}\max\limits_{\mu, \lambda; \mu_i \geq 0}\mathcal{L}(\vec x, \mu, \lambda)
- Now, let's consider the following "reversed" task:
\max\limits_{\mu, \lambda; \mu_i \geq 0}\min\limits_{\vec x}\mathcal{L}(\vec x, \mu, \lambda) = \max\limits_{\mu, \lambda; \mu_i \geq 0}\theta_D(\mu, \lambda)
- \theta_D(\mu, \lambda) is known as the dual function and maximizing it is called the dual problem
- \theta_D(\mu, \lambda) is a concave function (because the Lagrangian is affine, i.e. linear, in the multipliers), so maximizing \theta_D is a convex optimization problem
- It is related to the primal problem as follows:
\mathcal{L}(\vec x, \mu, \lambda) \leq \theta_P(\vec x)
\min\limits_{\vec x}\mathcal{L}(\vec x, \mu, \lambda) = \theta_D(\mu, \lambda) \leq \min\limits_{\vec x}\theta_P(\vec x)\equiv p^*
d^* \equiv \max\limits_{\mu, \lambda; \mu_i \geq 0}\theta_D(\mu, \lambda) \leq p^*
- Thus, the optimum of the dual problem d^* is a lower bound for the optimum of the primal problem p^* (known as weak duality)
- The difference p^* - d^* is known as the duality gap
- If the gap is zero (p^* = d^*) we have strong duality
- If the optimization problem is convex and strictly feasible (i.e. there is \vec x which strictly satisfies all inequality constraints), strong duality holds, so there must exist \vec x^*, \mu^*, \lambda^*, so that \vec x^* is the solution of the primal problem, \mu^* and \lambda^* are solutions of the dual problem, and
p^* = d^* = \mathcal{L}(\vec x^*, \mu^*, \lambda^*)
KKT conditions¶
- Moreover, \vec x^*, \mu^*, \lambda^* satisfy the Karush-Kuhn-Tucker (KKT) conditions
- Stationarity
\nabla_{\vec x} \mathcal{L}(\vec x^*, \mu^*, \lambda^*) = 0
- Primal feasibility
g_i(\vec x^*) \leq 0, \hspace{10pt} h_i(\vec x^*) = 0
- Dual feasibility
\mu_i^* \geq 0
- Complementary slackness (or dual complementarity)
\mu_i^*\cdot g_i(\vec x^*) = 0
- Why the last one? For strong duality:
f(\vec x^*) = \theta_D(\mu^*, \lambda^*) = \min\limits_{\vec x}\mathcal{L}(\vec x, \mu^*, \lambda^*) \leq f(\vec x^*) + \sum\limits_{i=1}^M\mu_i^*\cdot g_i(\vec x^*) + \sum\limits_{i=1}^N\lambda_i^*\cdot h_i(\vec x ^*) \leq f(\vec x^*)
- The last inequality holds because \sum\limits_{i=1}^M\mu_i^*\cdot g_i(\vec x^*) \leq 0 and h_i(\vec x^*) = 0
- Since the chain starts and ends with f(\vec x^*), these inequalities are in fact equalities, so \sum\limits_{i=1}^M\mu_i^*\cdot g_i(\vec x^*) = 0
- Because \mu_i \geq 0 and g_i \leq 0, every term of the sum must vanish, and we get the complementary slackness condition: \mu_i^*\cdot g_i(\vec x^*) = 0
- Conversely, if \vec x^*, \mu^*, \lambda^* satisfy the KKT conditions, then they are primal and dual solutions
Optimal margin¶
- Let's go back to our original optimization problem:
\left.\begin{array}{cc}\text{minimize}_{w,b} & \frac{1}{2}|\vec w|^2\\\text{subject to} & y_i\cdot \left(\vec w \cdot \vec x_i + b\right) \geq 1\end{array}\right.
- Let's rewrite the constraints in the form: g_i(\vec w) = - y_i\cdot \left(\vec w \cdot \vec x_i + b\right) + 1 \leq 0
- Please note that, because of the complementary slackness condition, \mu_i > 0 only for the training examples (\vec x_i) that have functional margin equal to one (the support vectors)
- Let's write the Lagrangian for this problem:
\mathcal{L}(\vec w, b, \mu) = \frac{1}{2}|\vec w|^2 - \sum\limits_{i=1}^M\mu_i\cdot \left[y_i\cdot\left(\vec w \cdot \vec x_i + b\right) - 1\right]
Dual problem¶
- To find the dual form of the problem we need to minimize the Lagrangian over \vec w and b:
\nabla_{w}\mathcal{L}(\vec w, b, \mu) = \vec w - \sum\limits_{i=1}^M\mu_iy_i\vec x_i = 0 \Rightarrow \vec w = \sum\limits_{i=1}^M\mu_iy_i\vec x_i
- And for the intercept term:
\frac{\partial}{\partial b}\mathcal{L}(\vec w, b, \mu) = \sum\limits_{i=1}^M\mu_iy_i = 0
- It is straightforward to show that, plugging the two equations above back into the Lagrangian, one obtains:
\mathcal{L}(\vec w, b, \mu) = \sum\limits_{i=1}^M\mu_i - \frac{1}{2}|\vec w|^2 = \sum\limits_{i=1}^M\mu_i - \frac{1}{2}\sum\limits_{i,j=1}^My_iy_j\mu_i\mu_j\cdot\left(\vec x_i\cdot \vec x_j\right)
- In the dual form the optimization problem is expressed in terms of inner products of input feature vectors:
\left.\begin{array}{cc}\text{maximize}_{\mu} & \sum\limits_{i=1}^M\mu_i - \frac{1}{2}\sum\limits_{i,j=1}^My_iy_j\mu_i\mu_j\cdot\left<\vec x_i, \vec x_j\right>\\\text{subject to} & \mu_i \geq 0 \\ & \sum\limits_{i=1}^M\mu_iy_i = 0\end{array}\right.
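- This dual view is exactly what off-the-shelf solvers expose; a sketch using sklearn on the toy data X01/X02 from above, recovering the weights from the dual coefficients y_i\mu_i:

from sklearn import svm

clf = svm.SVC(kernel='linear', C=1e10)  # very large C approximates the hard margin
clf.fit(X01 + X02, [-1]*len(X01) + [1]*len(X02))

# clf.dual_coef_ holds y_i * mu_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual, clf.coef_)  # the two should agree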
Non-linear SVM¶
- SVM can be applied to non-linear problems using the kernel trick
- However, before we go there, let's consider a simple non-linear problem
Example¶
- Let's consider two classes of 2D points:
  - inside a circle
  - outside a circle
def generate_circle_data(R1=0, R2=1, N=500):
    """Generate N points in a circle for radius range (R1, R2)"""
    r = lambda: R1 + np.random.random() * (R2 - R1)
    return np.array([(r() * np.cos(ang), r() * np.sin(ang))
                     for ang in np.linspace(0, 2*np.pi, N)])

C01 = generate_circle_data()
C02 = generate_circle_data(1, 2)

plt.scatter(*C01.T, marker='.')
plt.scatter(*C02.T, marker='.');
- There is no way to find a line which separates these classes
- But would it be possible to add another dimension and find a plane?
- Let's consider a feature mapping \phi which maps the original attributes:
\phi(x, y) = \left[\begin{array}{c} x \\ y \\ x^2 + y^2\end{array}\right]
from mpl_toolkits.mplot3d import Axes3D

ax = plt.figure().add_subplot(111, projection=Axes3D.name)

Z01 = np.array([x**2 + y**2 for x, y in C01])
Z02 = np.array([x**2 + y**2 for x, y in C02])

ax.scatter(*C01.T, Z01)
ax.scatter(*C02.T, Z02)
ax.view_init(elev=15, azim=60)
- The new dataset is linearly separable and can be easily solved with SVM
- All we need is to replace (in the dual formulation): \left<\vec x_i, \vec x_j\right> \rightarrow \left<\phi(\vec x_i), \phi(\vec x_j)\right> = \phi^T(\vec x_i)\phi(\vec x_j)
Kernel trick¶
- Let's define the kernel (for a given feature mapping \phi) as
K(\vec x_i, \vec x_j) = \phi^T(\vec x_i)\phi(\vec x_j)
- Thus, the optimization problem we are trying to solve now is:
\left.\begin{array}{cc}\text{maximize}_{\mu} & \sum\limits_{i=1}^M\mu_i - \frac{1}{2}\sum\limits_{i,j=1}^My_iy_j\mu_i\mu_j\cdot K(\vec x_i, \vec x_j)\\\text{subject to} & \mu_i \geq 0 \\ & \sum\limits_{i=1}^M\mu_iy_i = 0\end{array}\right.
- The trick is that we never have to calculate (or even know) the mapping \phi, which in general could be very expensive
- The classifier was defined as:
\hat y = \text{sign}\left(\vec w \cdot \vec x + b\right)
- its dual form is:
\hat y = \text{sign}\left(\sum\limits_{i=1}^M \mu_iy_i\left<\vec x_i, \vec x\right> + b\right)
- and it can now be rewritten in the form:
\hat y = \text{sign}\left(\sum\limits_{i=1}^M \mu_iy_iK(\vec x_i, \vec x) + b\right)
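- A sketch of this kernelized decision function, checked against sklearn's own decision_function (the choice of RBF kernel, its gamma value, and the circle data C01/C02 are assumptions for illustration):

from sklearn import svm

clf = svm.SVC(kernel='rbf', gamma=0.5)
clf.fit(np.concatenate((C01, C02)), [-1]*len(C01) + [1]*len(C02))

def rbf(xi, x, gamma=0.5):
    """Gaussian RBF kernel, K(x_i, x) = exp(-gamma * |x_i - x|^2)"""
    return np.exp(-gamma * np.sum((xi - x)**2))

# sum_i mu_i y_i K(x_i, x) + b, using the support vectors found by sklearn
x_test = np.array([0.1, 0.2])
f = sum(coef * rbf(sv, x_test)
        for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_)) + clf.intercept_[0]

print(f, clf.decision_function([x_test])[0])  # the two should match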
Example¶
- Let's consider the following feature mapping \phi (for 3-dimensional input):
\phi(\vec x) = \left[\begin{array}{c} x_1x_1 \\ x_1x_2 \\ x_1x_3 \\ x_2x_1 \\ x_2x_2 \\ x_2x_3 \\ x_3x_1 \\ x_3x_2 \\ x_3x_3\end{array}\right]
- Calculating \phi(\vec x), where \vec x \in \mathcal{R}^n, requires O(n^2) time
- The related kernel is given by:
K(\vec x_i, \vec x_j) = \left<\vec x_i, \vec x_j\right>^2
- and requires only O(n) time
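- A quick numeric check that the O(n) kernel agrees with the O(n^2) explicit mapping:

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

phi = lambda v: np.outer(v, v).ravel()  # the 9-dimensional mapping above

print(np.dot(phi(x), phi(y)))  # 1024.0
print(np.dot(x, y)**2)         # 1024.0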
Mercer theorem¶
- Having some function K, how can we tell if it is a valid kernel?
- For a valid kernel, there must exist a feature mapping \phi, so that K(\vec x_i, \vec x_j) = \phi^T(\vec x_i)\phi(\vec x_j)
- Let's define the kernel matrix K_{ij} = K(\vec x_i, \vec x_j)
- If K is a valid kernel the corresponding matrix must be symmetric: K_{ij} = \phi^T(\vec x_i)\phi(\vec x_j) = \phi^T(\vec x_j)\phi(\vec x_i) = K_{ji}
- Moreover, K must be positive semi-definite (K \geq 0):
z^TKz = \sum\limits_i\sum\limits_jz_iK_{ij}z_j = \sum\limits_i\sum\limits_jz_i\phi^T(\vec x_i)\phi(\vec x_j)z_j = \sum\limits_i\sum\limits_j\sum\limits_kz_i\phi_k(\vec x_i)\phi_k(\vec x_j)z_j = \sum\limits_k\left(\sum\limits_iz_i\phi_k(\vec x_i)\right)^2 \geq 0
- Mercer theorem
Let $K: \mathcal{R}^n\times \mathcal{R}^n \rightarrow \mathcal{R}$ be given. Then, for $K$ to be a valid kernel, it is necessary and **sufficient** that for any $\left\{x_1, \cdots x_m\right\}$, the corresponding kernel matrix is symmetric positive semi-definite.
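- In practice one can sanity-check a candidate kernel numerically by inspecting the eigenvalues of its kernel matrix on random points (a necessary-condition check on a sample, not a proof):

def kernel_matrix(K, X):
    """Kernel matrix K_ij = K(x_i, x_j) for a set of points X"""
    return np.array([[K(xi, xj) for xj in X] for xi in X])

pts = np.random.randn(20, 3)

K_valid = lambda x, y: (np.dot(x, y) + 1)**2  # polynomial kernel
K_bad = lambda x, y: -np.dot(x, y)            # not a valid kernel

print(np.linalg.eigvalsh(kernel_matrix(K_valid, pts)).min())  # >= 0 (up to rounding)
print(np.linalg.eigvalsh(kernel_matrix(K_bad, pts)).min())    # negative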
Kernel examples¶
- Gaussian kernel
K(\vec x, \vec y) = \exp\left(-\frac{||\vec x - \vec y||^2}{2\sigma^2}\right)
- Polynomial kernel
K(\vec x, \vec y) = \left(\vec x \cdot \vec y + c\right)^d
- Sigmoid kernel
K(\vec x, \vec y) = \tanh\left(a\vec x \cdot \vec y + c\right)
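- These can be written directly as functions of two vectors; a small sketch (the default parameter values are arbitrary):

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c)**d

def sigmoid_kernel(x, y, a=1.0, c=0.0):
    return np.tanh(a * np.dot(x, y) + c)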
Soft margin¶
- So far, everything was considered under the assumption that the data are (linearly) separable
- Mapping to a high-dimensional feature space may make the data separable, but it cannot guarantee that
- Also, due to noise in the data, some outliers may lead to overfitting
- Let's consider the example below
plt.plot([0, 10], [0, 20], 'C2-', zorder=0)
plt.plot([0, 10], [4, 24], 'C3-', zorder=0)

plt.scatter(3, 9, c='C1', marker=',')

plt.scatter(*(zip(*X01)), zorder=1)
plt.scatter(*(zip(*X02)), zorder=1);
- The green line represents the optimal decision boundary if the orange square point is not included
- The red line represents the case when the outlier is included
- Including the extra point drastically changes the result
- The idea behind the soft margin is to allow some points to lie on the wrong side of the decision boundary
plt.plot([0, 10], [0, 20], 'C2-', zorder=0)
plt.plot([0, 10], [5, 25], 'C2--', zorder=0)
plt.plot([0, 10], [-5, 15], 'C2--', zorder=0)

plt.scatter(3, 9, c='C1', marker=',')
plt.annotate('', (3.05, 8.7), (3.5, 1.75), arrowprops=dict(arrowstyle='<->'))
plt.text(3.5, 4.5, '$\\xi_i$')

plt.annotate('', (0, 5), (0.55, -4), arrowprops=dict(arrowstyle='<->'))
plt.text(0.5, -1.5, '$m$')
plt.text(0.3, 2.5, '$m$')

plt.annotate('', (3.05, 18.1), (3.8, 7.75), arrowprops=dict(arrowstyle='<->'))
plt.text(2.9, 14, '$m_i$')

plt.scatter(*(zip(*X01)), zorder=1)
plt.scatter(*(zip(*X02)), zorder=1)

sv = X01[:2] + X02[:2]
plt.scatter(*(zip(*sv)), zorder=1, facecolors='none', edgecolors='r', s=500);
Regularization¶
- Regularization is a common technique to prevent overfitting (more about that next week)
- Let's reformulate the original optimization problem:
\left.\begin{array}{cc}\text{minimize}_{w,b} & \frac{1}{2}|\vec w|^2 + C\sum \limits_i\xi_i\\\text{subject to} & y_i\cdot\left(\vec w \cdot \vec x_i + b\right) \geq 1 - \xi_i \\ & \xi_i \geq 0\end{array}\right.
- We now allow functional margins smaller than 1, but if a training point has functional margin 1 - \xi_i we pay the cost of the objective function being increased by C\xi_i
- Thus, C controls how much we penalize "bad" points (with C = \infty we are back to the original, perfectly separable case)
- Let's write the Lagrangian for this optimization problem:
\mathcal{L}(\vec w, b, \xi, \mu, r) = \frac{1}{2}|\vec w|^2 + C\sum \limits_{i=1}^M \xi_i - \sum\limits_{i=1}^M\mu_i\cdot \left[y_i\cdot\left(\vec w \cdot \vec x_i + b\right) - 1 + \xi_i\right] - \sum\limits_{i=1}^Mr_i\xi_i
- After the same procedure as before we get a similar dual problem:
\left.\begin{array}{cc}\text{maximize}_{\mu} & \sum\limits_{i=1}^M\mu_i - \frac{1}{2}\sum\limits_{i,j=1}^My_iy_j\mu_i\mu_j\cdot\left<\vec x_i, \vec x_j\right>\\\text{subject to} & C \geq \mu_i \geq 0 \\ & \sum\limits_{i=1}^M\mu_iy_i = 0\end{array}\right.
- The only difference is the extra constraint on \mu_i, which comes from \frac{\partial\mathcal{L}}{\partial\xi_i} = C - \mu_i - r_i = 0
- Let's also consider the complementary slackness conditions:
\mu_i\cdot\left[y_i\cdot\left(\vec w \cdot \vec x_i + b\right) - 1 + \xi_i\right] = 0
r_i\xi_i = 0
- In the case \mu_i = 0, r_i = C > 0 \Rightarrow \xi_i = 0:
y_i\cdot\left(\vec w \cdot \vec x_i + b\right) \geq 1
- In the case \mu_i = C, r_i = 0 \Rightarrow \xi_i \geq 0:
y_i\cdot\left(\vec w \cdot \vec x_i + b\right) \leq 1
- In the case 0 < \mu_i < C, r_i = C - \mu_i > 0 \Rightarrow \xi_i = 0:
y_i\cdot\left(\vec w \cdot \vec x_i + b\right) = 1
SMO algorithm¶
- We need to solve:
\left.\begin{array}{cc}\text{maximize}_{\mu} & \sum\limits_{i=1}^M\mu_i - \frac{1}{2}\sum\limits_{i,j=1}^My_iy_j\mu_i\mu_jk_{ij} \equiv \mathcal{L}_D(\mu)\\\text{subject to} & C \geq \mu_i \geq 0 \\ & \sum\limits_{i=1}^M\mu_iy_i = 0\end{array}\right.
- where (for convenience) k_{ij} = K(\vec x_i, \vec x_j)
- The sequential minimal optimization (SMO) algorithm solves the smallest possible optimization problem at a time
- by updating two Lagrange multipliers \mu_i in each step
- One cannot update only one because of the constraint \sum\limits_i\mu_iy_i = 0
Quick math¶
- Without loss of generality, let's consider optimizing \mu_1 and \mu_2, starting from an old feasible solution: \mu_1^{old}, \mu_2^{old}, \mu_3, \cdots, \mu_M
- Because of the constraint, y_1\mu_1 + y_2\mu_2 = y_1\mu_1^{old} + y_2\mu_2^{old}
- Which can be rewritten as:
\mu_1 = \gamma - s\mu_2
- where s = y_1y_2 and \gamma = \mu_1 + s\mu_2 = \mu_1^{old} + s\mu_2^{old}
- Thus, the optimization is along a line, as shown below (two possibilities based on the sign of s):
plt.xticks([], [])
plt.yticks([], [])
plt.xlim([0, 1])
plt.ylim([0, 1])

plt.text(1.05, 0.5, "$\mu_1 = C$")
plt.text(-0.1, 0.5, "$\mu_1 = 0$")
plt.text(0.5, 1.05, "$\mu_2 = C$")
plt.text(0.5, -0.05, "$\mu_2 = 0$")

plt.plot((0.6, 1), (0, 0.75), label="$y_1 \\neq y_2 \Rightarrow \mu_1 - \mu_2 = \gamma$")
plt.plot((0.4, 0), (0, 0.75), label="$y_1 = y_2 \Rightarrow \mu_1 + \mu_2 = \gamma$")

plt.legend();
- First let's fix all \mu_i for i \neq 1, 2 and write the objective function in the following form:
$\begin{eqnarray} \mathcal{L}_D & = & \mu_1 + \mu_2 \\ & - & \frac{1}{2}\left(k_{11}\mu_1^2 + k_{22}\mu_2^2 + 2sk_{12}\mu_1\mu_2\right) \\ & - & y_1\mu_1\sum\limits_{i=3}^My_i\mu_ik_{1i} - y_2\mu_2\sum\limits_{i=3}^My_i\mu_ik_{2i} \\ & + & const \end{eqnarray}$
- where const depends on neither \mu_1 nor \mu_2
- Please note that when formulating the dual problem we got:
\nabla_{w}\mathcal{L}(\vec w, b, \mu) = \vec w - \sum\limits_{i=1}^M\mu_iy_i\vec x_i = 0 \Rightarrow \vec w = \sum\limits_{i=1}^M\mu_iy_i\vec x_i
- For convenience, let's introduce:
$\begin{eqnarray}
v_j & = & \sum\limits_{i=3}^My_i\mu_ik_{ij} \\
& = & \sum\limits_{i=1}^My_i\mu_i^{old}k_{ij} + b^{old} - y_1\mu_1^{old}k_{1j} - y_2\mu_2^{old}k_{2j} - b^{old} \\
& = & z_j^{old} - b^{old} - y_1\mu_1^{old}k_{1j} - y_2\mu_2^{old}k_{2j}
\end{eqnarray}$
- where z_j^{old} = \vec w^{old} \cdot \vec x_j + b^{old} is the output for \vec x_j under the old parameters
- Let's get back to the objective function:
$\begin{eqnarray} \mathcal{L}_D & = & \mu_1 + \mu_2 \\ & - & \frac{1}{2}\left(k_{11}\mu_1^2 + k_{22}\mu_2^2 + 2sk_{12}\mu_1\mu_2\right) \\ & - & y_1\mu_1v_1 - y_2\mu_2v_2 + const \end{eqnarray}$
- We can remove \mu_1 from the equation using \mu_1 = \gamma - s \mu_2
- After long (but relatively straightforward) algebra we get:
\mathcal{L}_D = \frac{1}{2}\eta\mu_2^2 + \left[y_2\cdot\left(E_1^{old} - E_2^{old}\right) - \eta\mu_2^{old}\right]\mu_2 + const
- where
$\begin{eqnarray} \eta & = & 2k_{12} - k_{11} - k_{22} \\ E_i^{old} & = & z_i^{old} - y_i \end{eqnarray}$
- To get a maximum we need the first derivative to be zero and the second one to be negative
$\begin{eqnarray} \frac{d\mathcal{L}_D}{d\mu_2} & = & \eta\mu_2 + \left[y_2\cdot\left(E_1^{old} - E_2^{old}\right) - \eta\mu_2^{old}\right] = 0 \\ \frac{d^2\mathcal{L}_D}{d\mu_2^2} & = & \eta \end{eqnarray}$
- Please note that \eta \leq 0 by definition, but one must be careful when \eta = 0
- If \eta < 0 we get the unconstrained maximum point:
\mu_2^{new} = \mu_2^{old} + \frac{y_2\left(E_2^{old} - E_1^{old}\right)}{\eta}
- However, because of the constraint 0 \leq \mu_i \leq C, we have to clip \mu_2^{new}
- The minimum and maximum values depend on s (see the plot above)
- If s = 1, then \mu_1 + \mu_2 = \gamma:
  - If \gamma > C, then \max\mu_2 = C and \min\mu_2 = \gamma - C
  - If \gamma < C, then \max\mu_2 = \gamma and \min\mu_2 = 0
- If s = -1, then \mu_1 - \mu_2 = \gamma:
  - If \gamma > 0, then \max\mu_2 = C - \gamma and \min\mu_2 = 0
  - If \gamma < 0, then \max\mu_2 = C and \min\mu_2 = -\gamma
- Let the minimum feasible value of \mu_2 be L, and the maximum be H:
\mu_2^{new} \rightarrow \left\{\begin{array}{ll}H, & \mu_2^{new} > H \\ \mu_2^{new}, & L \leq \mu_2^{new} \leq H \\ L, & \mu_2^{new} < L\end{array}\right.
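- The feasible interval and the clipping can be written compactly (a sketch following the case analysis above):

def clip_mu2(mu2_new, mu1_old, mu2_old, s, C):
    """Clip the unconstrained optimum of mu_2 to the feasible segment"""
    if s == 1:   # mu_1 + mu_2 = gamma
        gamma = mu1_old + mu2_old
        L, H = max(0, gamma - C), min(C, gamma)
    else:        # mu_1 - mu_2 = gamma
        gamma = mu1_old - mu2_old
        L, H = max(0, -gamma), min(C, C - gamma)
    return min(max(mu2_new, L), H)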
Math summary¶
- Given \mu_1, \mu_2 and the corresponding y_1, y_2, k_{11}, k_{12}, k_{22}, \Delta E = E_2^{old} - E_1^{old}
- Calculate \eta = 2k_{12} - k_{11} - k_{22}
- If \eta < 0: calculate \Delta\mu_2 = y_2\Delta E / \eta, clip the solution to the feasible region, and calculate \Delta\mu_1 = -s\Delta\mu_2
- If \eta = 0: evaluate the objective function at the two endpoints and choose the \mu_2 which corresponds to the larger value of the objective function
Updating after a successful optimization step¶
- Let's define the prediction error on (\vec x, y):
E(\vec x, y) = \sum\limits_{i=1}^My_i\mu_iK(\vec x_i, \vec x) + b - y
- The change in error after an update:
\Delta E(\vec x, y) = E^{new}(\vec x, y) - E^{old}(\vec x, y) = y_1\Delta\mu_1 K(\vec x_1, \vec x) + y_2\Delta\mu_2 K(\vec x_2, \vec x) + \Delta b
- If \mu_1^{new} is not at the bounds, the KKT conditions force E^{new}(\vec x_1, y_1) = 0; solving for \Delta b (with E_1 \equiv E^{old}(\vec x_1, y_1)) gives
\Delta b_1 = -E_1 - y_1\Delta\mu_1 K(\vec x_1, \vec x_1) - y_2\Delta\mu_2 K(\vec x_2, \vec x_1)
- Alternatively, if \mu_2^{new} is not at the bounds, it forces E^{new}(\vec x_2, y_2) = 0 (with E_2 \equiv E^{old}(\vec x_2, y_2)), thus
\Delta b_2 = -E_2 - y_1\Delta\mu_1 K(\vec x_1, \vec x_2) - y_2\Delta\mu_2 K(\vec x_2, \vec x_2)
- If both \mu_1 and \mu_2 end up at 0 or C, the SMO algorithm calculates the new b for both and takes the average (any b between the new b_1 and b_2 satisfies the KKT conditions)
- Updating the weights (for a linear kernel) is straightforward: \Delta \vec w = y_1\Delta\mu_1\vec x_1 + y_2\Delta\mu_2\vec x_2
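- Putting the pieces together, a single optimization step looks roughly like this (a simplified sketch following the formulas above and reusing clip_mu2 from the previous sketch; K is a kernel function, E holds cached errors):

def smo_step(i, j, mu, X, y, E, b, C, K):
    """One (simplified) SMO update of mu[i], mu[j]; returns new (mu_i, mu_j, b)"""
    eta = 2*K(X[i], X[j]) - K(X[i], X[i]) - K(X[j], X[j])
    if eta >= 0:
        return mu[i], mu[j], b  # skip the degenerate case in this sketch
    s = y[i] * y[j]
    mu_j_new = clip_mu2(mu[j] + y[j]*(E[j] - E[i])/eta, mu[i], mu[j], s, C)
    mu_i_new = mu[i] - s*(mu_j_new - mu[j])  # from mu_1 = gamma - s*mu_2
    d_i, d_j = mu_i_new - mu[i], mu_j_new - mu[j]
    # threshold update: two candidates, averaged if both multipliers are at the bounds
    b1 = b - E[i] - y[i]*d_i*K(X[i], X[i]) - y[j]*d_j*K(X[j], X[i])
    b2 = b - E[j] - y[i]*d_i*K(X[i], X[j]) - y[j]*d_j*K(X[j], X[j])
    if 0 < mu_i_new < C:
        b_new = b1
    elif 0 < mu_j_new < C:
        b_new = b2
    else:
        b_new = (b1 + b2) / 2
    return mu_i_new, mu_j_new, b_new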
Choosing \mu to optimize¶
- The outer loop selects the first multiplier \mu_i and the inner loop selects the second one, \mu_j, to maximize the step size estimate |E_j - E_i|
- The outer loop first goes through all examples, selecting the ones violating the KKT conditions
- Then it iterates over all examples whose Lagrange multipliers are neither 0 nor C (the non-bound examples)
- It repeats passes through the non-bound examples until all of them satisfy the KKT conditions within some error \epsilon
- Then it goes back and loops over the whole training set
- SMO repeats this procedure until all examples satisfy the KKT conditions within \epsilon
- To avoid wasting CPU time, SMO keeps a cached error value for every non-bound example
Examples¶
Initial problem¶
- Let's first consider the simple example given at the beginning of the lecture
# class I
X01 = [(2, 9), (7, 19), (1, 10), (3, 19), (4, 16), (5, 18)]
# class II
X02 = [(4, 3), (6, 7), (1, -10), (3, -1), (9, 5), (5, -7)]

plt.xlim([0, 10])
plt.scatter(*(zip(*X01)))
plt.scatter(*(zip(*X02)));
- As we know it is a linear problem, we will use the linear kernel
from sklearn import svm

# create a classifier
clf = svm.SVC(kernel="linear")

# train classifier - assign -1 label for X01 and 1 for X02
clf.fit(X01 + X02, [-1]*len(X01) + [1]*len(X02))
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
- Let's visualize the result
sv = clf.support_vectors_  # support vectors
w = clf.coef_[0]           # weights
b = clf.intercept_         # intercept

# w[0] * x + w[1] * y + b = 0
f = lambda x: -(b + w[0]*x) / w[1]

# plot training data
plt.scatter(*(zip(*X01)))
plt.scatter(*(zip(*X02)))

# plot decision boundary
plt.plot([1, 9], [f(1), f(9)])

# mark support vectors
plt.scatter(*(zip(*sv)), zorder=1, facecolors='none', edgecolors='r', s=500);
Circle¶
- For a simple non-linear problem let's consider once again points inside / outside a circle
- But this time let them also overlap a little bit due to some noise in the data
def generate_circle_data(R1=0, R2=1, N=500):
    """Generate N noisy points in a circle for radius range (R1, R2)"""
    r = lambda: R1 + np.random.random() * (R2 - R1) + np.random.normal(0, 0.2)
    return np.array([(r() * np.cos(ang), r() * np.sin(ang))
                     for ang in np.linspace(0, 2*np.pi, N)])

C01 = generate_circle_data()
C02 = generate_circle_data(1, 2)

plt.scatter(*C01.T, marker='.')
plt.scatter(*C02.T, marker='.');
- We will consider 4 different kernels:
  - linear
  - polynomial of degree 3
  - polynomial of degree 10
  - Gaussian radial basis function (RBF)
# create classifiers with different kernels
clf_linear = svm.SVC(kernel="linear")
clf_rbf = svm.SVC(kernel="rbf")
clf_poly3 = svm.SVC(kernel="poly", degree=3)
clf_poly10 = svm.SVC(kernel="poly", degree=10)

titles = ("Linear", "RBF", "Polynomial, degree = 3", "Polynomial, degree = 10")

# create a mesh to plot in
xx, yy = np.meshgrid(np.arange(-3, 3, 0.01), np.arange(-3, 3, 0.01))

# loop over classifiers
for i, clf in enumerate((clf_linear, clf_rbf, clf_poly3, clf_poly10)):
    # train classifier - assign -1 label for C01 and 1 for C02
    clf.fit(np.concatenate((C01, C02), axis=0), [-1]*len(C01) + [1]*len(C02))

    # visualize results
    plt.subplot(2, 2, i + 1)

    # decision boundary
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

    # training data
    plt.scatter(*C01.T, marker='.')
    plt.scatter(*C02.T, marker='.')

    plt.title(titles[i])

plt.tight_layout()
Multiclass classification¶
- There are two popular methods for applying binary classifiers to a multiclass problem
- one-vs-rest (ovr), also called one-vs-all (ova)
  - having K classes with labels y_i \in \left\{1, \cdots, K\right\}
  - train K classifiers, where the i-th classifier treats class i as 1 and all other classes as -1
  - the prediction of a label for an unseen sample is based on the classifier with the highest confidence score
- one-vs-one (ovo)
  - having K classes with labels y_i \in \left\{1, \cdots, K\right\}
  - train K(K-1)/2 classifiers, one for each possible pair of classes
  - the prediction is based on a voting scheme
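- sklearn's SVC, for instance, trains one-vs-one classifiers internally but can expose either shape of decision function; a small sketch:

from sklearn import svm
from sklearn.datasets import make_blobs

# a small toy problem with K = 4 classes
Xm, Ym = make_blobs(n_samples=80, centers=4, random_state=0)

clf_ovo = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(Xm, Ym)
clf_ovr = svm.SVC(kernel='linear', decision_function_shape='ovr').fit(Xm, Ym)

print(clf_ovo.decision_function(Xm[:1]).shape)  # (1, 6): K(K-1)/2 pairwise scores
print(clf_ovr.decision_function(Xm[:1]).shape)  # (1, 4): one score per class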
Example: blobs¶
- Let's use blobs from the previous lecture
from sklearn.datasets import make_blobs

# generate 8 blobs with a fixed random generator
X, Y = make_blobs(n_samples=500, centers=8, random_state=300)

plt.scatter(*X.T, c=Y, marker='.', cmap='Dark2');
- We can use the same function as last time to train and visualize
def train_and_look(classifier, X, Y, ax=None, title='', cmap='Dark2'):
    """Train classifier on (X, Y). Plot data and prediction."""
    # create new axis if not provided
    ax = ax or plt.gca()
    ax.set_title(title)

    # plot training data
    ax.scatter(*X.T, c=Y, marker='.', cmap=cmap)

    # train a classifier
    classifier.fit(X, Y)

    # create a grid of testing points
    x_, y_ = np.meshgrid(np.linspace(*ax.get_xlim(), num=200),
                         np.linspace(*ax.get_ylim(), num=200))

    # convert to an array of 2D points
    test_data = np.vstack([x_.ravel(), y_.ravel()]).T

    # make a prediction and reshape to grid structure
    z_ = classifier.predict(test_data).reshape(x_.shape)

    # arrange z bins so class labels are in the middle
    z_levels = np.arange(len(np.unique(Y)) + 1) - 0.5

    # plot contours corresponding to classifier prediction
    ax.contourf(x_, y_, z_, alpha=0.25, cmap=cmap, levels=z_levels)
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

title = ("Linear", "RBF", "Polynomial")
settings = ({"kernel": "linear"}, {"kernel": "rbf"}, {"kernel": "poly", "degree": 5})

# train and look at SVM with different kernels
for i in range(3):
    train_and_look(svm.SVC(**settings[i]), X, Y, ax=ax[i], title=title[i])
SVM regression¶
- Let's consider the problem of approximating the set of data \left\{(\vec x_1, y_1), \cdots, (\vec x_N, y_N)\right\}, where \vec x_i \in \mathcal{R}^n and y_i \in \mathcal{R}, with a linear function:
f(\vec x) = \left<\vec w, \vec x\right> + b
- In epsilon-insensitive SVM (\varepsilon-SVM) the goal is to find an f(x) that deviates from y_i by a value no greater than \varepsilon for each training point, and at the same time is as flat as possible
- And once again we end up with an optimization problem:
\left.\begin{array}{cc}\text{minimize}_{w,b} & \frac{1}{2}|\vec w|^2 \\\text{subject to} & y_i - \left<\vec w, \vec x_i\right> - b \leq \varepsilon\\ & \left<\vec w, \vec x_i\right> + b - y_i \leq \varepsilon\end{array}\right.
- This is feasible only if such a function f actually exists
- However, analogously to the soft margin, one can allow for some errors by introducing slack variables \xi_i and \xi^*_i, and reformulate the optimization problem:
\left.\begin{array}{cc}\text{minimize}_{w,b} & \frac{1}{2}|\vec w|^2 + C\sum\limits_{i=1}^N(\xi_i + \xi^*_i) \\\text{subject to} & y_i - \left<\vec w, \vec x_i\right> - b \leq \varepsilon + \xi_i\\ & \left<\vec w, \vec x_i\right> + b - y_i \leq \varepsilon + \xi^*_i \\ & \xi_i, \xi^*_i \geq 0\end{array}\right.
- The corresponding dual problem is given by:
\left.\begin{array}{cc}\text{maximize}_{\mu} & -\frac{1}{2}\sum\limits_{i, j = 1}^N(\mu_i - \mu^*_i)(\mu_j - \mu^*_j)\left<\vec x_i, \vec x_j\right> - \varepsilon\sum\limits_{i=1}^N(\mu_i + \mu^*_i) + \sum\limits_{i=1}^Ny_i(\mu_i - \mu^*_i)\\\text{subject to} & \sum\limits_{i=1}^N(\mu_i - \mu^*_i) = 0 \\ & \mu_i, \mu^*_i\in\left[0, C\right]\end{array}\right.
- And the solution is given by \vec w = \sum\limits_{i=1}^N(\mu_i - \mu^*_i)\vec x_i, thus:
f(\vec x) = \sum\limits_{i=1}^N(\mu_i - \mu^*_i)\left<\vec x_i, \vec x\right> + b
- Or for the kernelized version:
f(\vec x) = \sum\limits_{i=1}^N(\mu_i - \mu^*_i)K(\vec x_i, \vec x)+ b
- The intercept b can be calculated from any support vector with 0 < \mu_{SV} < C, for which f(\vec x_{SV}) = y_{SV} \mp \varepsilon
Example - Boston Housing dataset¶
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.DESCR)
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
Visualize dataset¶
for i, feature in enumerate(boston.data.T):
    plt.subplot(3, 5, i+1)
    plt.scatter(feature, boston.target, marker='.')
    plt.title(boston.feature_names[i])

plt.tight_layout()
import pandas as pd

boston_pd = pd.DataFrame(boston.data)     # load features
boston_pd.columns = boston.feature_names  # add feature names
boston_pd['PRICE'] = boston.target        # add a column with price

boston_pd.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Test SVR¶
- Below is the result for the linear kernel - feel free to play with the kernels and/or the parameter C on your own
from sklearn.svm import SVR

regressor = SVR(kernel="linear")
regressor.fit(boston.data, boston.target)

prediction = regressor.predict(boston.data)

plt.xlabel("True price")
plt.ylabel("Predicted price")
plt.scatter(boston.target, prediction, marker='.');
Summary¶
- Support vector machine is a powerful model for both classification and regression
- It guarantees finding the optimal hyperplane / decision boundary (if one exists)
- It is effective with high-dimensional data
- The dual formulation of the optimization problem makes it easy to introduce kernels and deal with non-linear data
- It has just a few hyperparameters: the kernel (with its parameters) and C
- Inference is fast as it depends only on a subset of training samples (the support vectors) - although training may be slow for large datasets