Introduction to Support Vector Machines (SVMs)

The support vector machine (SVM) is a supervised machine learning algorithm. To build one, we modify the two terms of the cost function for logistic regression. The first term (the one that applies when y = 1) is modified so that when theta transpose x (θᵀx) is greater than 1, it outputs 0. When θᵀx is less than 1, we use a straight decreasing line instead of the smooth logistic curve. This is called the hinge loss function. Similarly, we modify the second term (the one that applies when y = 0) so that when θᵀx is less than -1, it outputs 0, and for values of θᵀx greater than -1, we use a straight increasing line instead of the smooth logistic curve.
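
The two modified cost terms can be sketched in a few lines of Python. This is a minimal illustration, assuming the straight-line parts have slope 1 (the standard hinge loss); the function names cost1 and cost0 are chosen here for illustration and z stands for θᵀx.

```python
def cost1(z):
    """Cost for a positive example (y = 1): 0 once z >= 1, a decreasing line below 1."""
    return max(0.0, 1.0 - z)


def cost0(z):
    """Cost for a negative example (y = 0): 0 once z <= -1, an increasing line above -1."""
    return max(0.0, 1.0 + z)


def example_cost(y, z):
    """Pick the term that applies to a single example with label y (0 or 1)."""
    return cost1(z) if y == 1 else cost0(z)


if __name__ == "__main__":
    # Print both costs for a few values of z = theta^T x.
    for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(z, cost1(z), cost0(z))
```

Note that a positive example with θᵀx between 0 and 1 is classified correctly but still incurs a cost, which is what pushes the SVM toward a wider margin.
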
In SVMs, the decision boundary has the special property that it is as far away as possible from both the positive and the negative examples. The distance from the decision boundary to the nearest example is called the margin. Since an SVM maximizes this margin, it is often called a large margin classifier. Data is linearly separable when a straight line can separate the positive and negative examples, and in that case the SVM separates them by a large margin. This large-margin behavior is only obtained when C (= 1/λ, where λ is the regularization parameter) is very large. If we have outlier examples that we don't want to affect the decision boundary, we can reduce C. Increasing C is similar to decreasing λ, and decreasing C is similar to increasing λ. Decreasing C can therefore simplify our decision boundary.
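
To make the idea of the margin concrete, here is a small sketch that measures the distance from a straight-line boundary to its nearest example. The data points and the two candidate boundaries are made up for illustration, and the code only compares hand-picked lines; it does not perform the optimization an SVM uses to find the best one.

```python
import math


def margin(theta, bias, examples):
    """Distance from the line theta . x + bias = 0 to the nearest example."""
    norm = math.sqrt(sum(t * t for t in theta))
    return min(
        abs(sum(t * x for t, x in zip(theta, point)) + bias) / norm
        for point in examples
    )


# Hypothetical data: the first two points are negative, the last two positive.
examples = [(1.0, 1.0), (2.0, 2.5), (4.0, 4.0), (5.0, 4.5)]

# Two lines that both separate the classes (x1 + x2 = 5 and x1 + x2 = 6.25).
margin_a = margin((1.0, 1.0), -5.0, examples)
margin_b = margin((1.0, 1.0), -6.25, examples)

if __name__ == "__main__":
    print("boundary A margin:", margin_a)
    print("boundary B margin:", margin_b)
```

Both lines classify every point correctly, but the second one sits farther from its nearest example, so it is the kind of boundary a large margin classifier prefers.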

Kernels allow us to build complex, non-linear classifiers with support vector machines. Given x, we compute new features based on its proximity to a set of landmarks. To do this, we measure the "similarity" between x and each landmark. One such similarity function is the Gaussian kernel, which is a specific example of a kernel. With it, if x and the landmark are close, the similarity is close to 1, and if x and the landmark are far away from each other, the similarity is close to 0. Each landmark gives us one feature in our hypothesis. One way to choose the landmarks is to put them in the exact same locations as the training examples, so that m training examples give m landmarks.
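
The following sketch shows how the kernel features could be computed. It assumes the usual form of the Gaussian kernel, exp(-||x - l||^2 / (2σ^2)), which the text above describes but does not write out; the function names and the sample data are made up for illustration.

```python
import math


def gaussian_kernel(x, landmark, sigma2):
    """Similarity between x and a landmark: 1 when identical, near 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma2))


def kernel_features(x, landmarks, sigma2):
    """One feature per landmark."""
    return [gaussian_kernel(x, landmark, sigma2) for landmark in landmarks]


# Hypothetical training set; the landmarks are placed on the training examples.
training_examples = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0)]
landmarks = training_examples

features = kernel_features((1.0, 1.0), landmarks, sigma2=1.0)

if __name__ == "__main__":
    print(features)
```

The first feature is exactly 1 because x coincides with the first landmark, the second is close to 1 because that landmark is nearby, and the third is close to 0 because that landmark is far away.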

Choosing SVM parameter C (= 1/λ)

If C is large, we get higher variance and lower bias. If C is small, we get lower variance and higher bias. The other parameter we must choose is σ^2 from the Gaussian kernel function. With a large σ^2, the features f_i vary more smoothly, causing higher bias and lower variance. With a small σ^2, the features f_i vary less smoothly, causing lower bias and higher variance.
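
The effect of σ^2 on smoothness can be seen by evaluating the similarity at a few fixed distances. This is a small sketch that again assumes the usual Gaussian form exp(-d^2 / (2σ^2)); the distances and σ^2 values are arbitrary choices for illustration.

```python
import math


def gaussian_similarity(distance, sigma2):
    """Gaussian similarity as a function of the distance between x and a landmark."""
    return math.exp(-(distance ** 2) / (2.0 * sigma2))


if __name__ == "__main__":
    distances = (0.0, 1.0, 2.0)
    for sigma2 in (0.25, 1.0, 4.0):
        row = [round(gaussian_similarity(d, sigma2), 3) for d in distances]
        print("sigma^2 =", sigma2, "similarities at", distances, "->", row)
```

With a larger σ^2 the similarity falls off slowly as the distance grows, so the features change gradually. With a smaller σ^2 it drops sharply, so the features react strongly to small movements of x.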

Logistic Regression vs. SVMs

Here n is the number of features and m is the number of training examples.
(a) If n is large (relative to m), then we generally use logistic regression, or an SVM without a kernel (linear kernel).
(b) If n is small and m is intermediate, then use an SVM with a Gaussian kernel.
(c) If n is small and m is large, then manually create/add more features, then use logistic regression or an SVM without a kernel.

In the first case, we don't have enough examples to need a complicated polynomial hypothesis. In the second case, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase our features so that logistic regression becomes applicable.
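
The three cases can be written down as a simple rule of thumb. This is only a sketch: the text gives no numeric cut-offs, so treating "n is large relative to m" as n >= m and "m is large" as m >= 50,000 are assumptions made here for illustration, and the function name suggest_model is hypothetical.

```python
def suggest_model(n, m, large_m=50_000):
    """Suggest a model family from n (features) and m (training examples).

    The thresholds are illustrative assumptions, not values from the text.
    """
    if n >= m:
        # Case (a): many features relative to the number of examples.
        return "logistic regression or SVM without a kernel"
    if m < large_m:
        # Case (b): few features, intermediate number of examples.
        return "SVM with a Gaussian kernel"
    # Case (c): few features, very many examples.
    return "add more features, then logistic regression or SVM without a kernel"


if __name__ == "__main__":
    print(suggest_model(n=10_000, m=1_000))
    print(suggest_model(n=100, m=5_000))
    print(suggest_model(n=100, m=1_000_000))
```

In practice the boundaries between the cases are fuzzy, so the cut-offs should be adjusted to the data set and the available training time.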