The support vector machine (SVM) is a powerful **discriminative classification** technique. Using the labelled training vectors, SVM training finds a separating hyper plane (H) that maximizes the **margin** of separation between these two classes. Based on trining data, SVM can be divided into two categories: (a) **Linearly separable** training, (b) **Non-linearly separable** training.

**✓ An example of a SVM trained using linearly-separable data is given in following Figure.**

**✓ Let’s define the hyperplane H such that:**

w^{T}x_{i} + b ≥ +1 when y_{i} =+1

w^{T}x_{i} + b ≤ -1 when y_{i} =-1

**The distance between H and H1** is y(w^{T}x+b)/||w||

**✓ The distance between H1 and H2 is**: 2/||w||.

The distance between H1 and H2 is called margin. We want a classifier with as big margin as possible. In order to maximize the margin, we need to minimize ||w||. With the condition that there are no datapoints between H1 and H2:

w^{T}x_{i} +b ≥ +1 when y_{i} =+1

w^{T}x_{i} +b ≤ -1 when y_{i} =-1

**✓ Find w and b such that**

||w||^{2}/2 minimized and for all (x_{i},y_{i}), y_{i}(w^{T}x_{i} +b) ≥ 1

We are now optimizing a quadratic function subject to linear constraints. It’s Lagrangian optimization problem.

w = ∑ α_{i}y_{i}x_{i
}b = y_{k} – w^{T}x_{k}

**✓ Non-linearly separable training**

In the case of performing SVM training on non-linearly separable data, the missclassication of training examples will cause the Lagrangian multipliers to grow exceedingly large and prevent a feasible solution.

The idea is to gain **linearly separation by mapping the data to a higher dimensional space**. An **SVM kernel** is used to transform training and testing data into a higher dimensional space, which provides better linear separation.

**✓ A kernel function is defined as follows,**

K(x_{i},x_{j}) = Φ(x_{i}) Φ(x_{j})

where Φ(x) is a mapping function used to convert input vectors x to a desired space. Linear, polynomial, radial basis function (gaussians) and sigmoid (neural net activation function) are used as non-linear mapping functions.

An example of non-linearly separable data and how kernal function transforms into separable data is shown in below Figure,

SVM classification can be used in several applications, such as natural language processing, computer vision and bioinformatics.

This is a very good explanation for SVM and i didn’t understand it before