There is of course a lot of material on the Internet to study about support vector machines, but while doing so I've found some points to stay stubbornly unclear and kept wondering about them for a while. In this article I'll gather insight I've found here and there or understood myself as a memo for myself and eventual help for someone else.
A Support Vector Machine (SVM) is a class of machine learning model, used to find a boundary between two categories. It can be extended to several categories by using a once-against-all strategy: train one SVM per category, each one predicting if an input belongs to one of the categories, and aggregate the \(n\) predictions to get the result category with some heuristic to resolve conflicts (several SVMs predicting the appartenance to their respective category for a same input). Initially designed for linear boundaries, the introduction of the so-called "kernel trick" allows SVM to handle non-linear boundary as well. There exists an efficient algorithm to perform the training of SVM: the sequential minimal optimization, and once trained they generally provide a simple and performant prediction method. Finally, they have been shown to perform efficiently on various real-world problems, which make them overall a very attractive model. More about the theory can be found on the Wikipedia page and in this presentation by Dr Martin Hoffmann.
The training of a SVM consists of solving an optimisation problem. It is then necessary to convert the dataset into a numerical matrix. For categorical fields, the set of all possible values is created and the index of each value is used as its numerical representation. Each example (row) in the dataset gives a row in the matrix. SVM also performs better if the input data are standardized, for example by mapping values of each field from \([m,M]\) to \([0,1]\) where \(m\) and \(M\) are respectively the minimum and maximum values of the field (other normalisation strategies also exist). The predicted output is set to be the last column. It has a special encoding: it is assigned the value 1.0 for examples of the target category, and -1.0 for examples of other categories. For example, the following dataset (one numerical input, one categorical input, one categorical output):
will be converted and normalised to train a SVM predicting 'on' into the matrix:
The optimization problem looks like this: $$ \begin{array}{l} min_{\alpha}\left(\frac{1}{2}\sum_i^N\sum_j^N\left(y_iy_j\alpha_i\alpha_jK(\vec{x}_i,\vec{x}_j)\right)-\sum_i^N\alpha_i\right)\\ 0\le\alpha_i\le C,\forall i\\ \sum_i^Ny_i\alpha_i=0\\ \end{array} $$ where, \(N\) is the number of examples (rows) in the dataset, \(y_i\) is the \(i\)-th example's output value (-1.0 or 1.0), \(\alpha_i\) is the Lagrangian multiplier of the \(i\)-th example, \(K()\) is the kernel function, \(\vec{x}_i\) is the \(i\)-th example's input values, and \(C\) is what I call the relaxation coefficient (maybe it has a more common name, can't find).
Lagrangian multipliers allow to express the minimisation problem into a form easier to solve. See this video to know more about them. How the optimisation problem is built using them is explained in the links above.
The kernel function encodes how examples relate to each other. The simplest kernel is the inner product, \(K(\vec{x},\vec{y})=\vec{x}.\vec{y}\), and leads to SVMs with linear boundaries. Other kernel functions like the polynomial kernel \(K(\vec{x},\vec{y})=(\vec{x}.\vec{y} + a)^b\) or the Gaussian radial basis kernel \(K(\vec{x},\vec{y})=exp(-\gamma(\vec{x}.\vec{y}))\) lead to non-linear boundaries. See these StatQuest videos if you want to know more: 1, 2, 3. There is no rule to choose which kernel to use (as well as how to choose their parameters), so one has to try them out and see which one performs the best. However the Gaussian radial basis kernel seems to be accepted as the best default. K-fold cross validation is a good framework to help one determine the best choice. Also, one can create its own kernel function, it just has to satisfy the Mercer's theorem (see also Hoffmann's presentation linked above).
The relaxation coefficient controls how tightly the SVM will try to fit the training data. If \(C=+\infty\) (in practice a value around 1000.0 seems common) the SVM tries to reproduce exactly all the training examples. The final accuracy will probably be higher, but the risk of overfitting is also higher. As \(C\) decreases toward \(0.0\), the boundary between categories get "smoothed out" and outliers are more likely to be ignored. The final accuracy might decrease, but the risk of overfitting also decreases. Here again k-fold cross validation is a useful framework to help one choose the best value.
The optimisation problem has some particular mathematical properties which allowed the design of an efficient algorithm to solve it: the Sequential Minimal Optimisation invented by John Platt. The algorithm is described in this paper. Dr Platt provides the algorithm as pseudo-code, however if it may be clear to himself it is far from it to myself. So, for the other dumbasses like me I provide below a more explicit version which will probably save them a lot of pain and sweat.
Once the training is done, one can perform a prediction on a new input \(\vec{x}\) by calculating \(u=\sum_i^N\left(y_i\alpha_iK(\vec{x}_i,\vec{x}')\right)-b\), where \(\vec{x}'\) is equal to the input \(\vec{x}\) after applying the same normalisation as it has been done on the training dataset. If \(u\) is positive, the input matches the category predicted by the SVM (category assigned a value of 1.0 in the training dataset), else it isn't. \(b\) is the bias of the SVM and is hidden in the formulation of the optimisation problem using the Lagrangian multipliers. Its calculation is explained in my version of the pseudo-code below.
In practice, most of the \(\alpha_i\) will be equal to \(0.0\) (or below a given \(\epsilon\)). So, despite the prediction depending in theory on the whole training dataset, it is in fact a sum over a much smaller subset of the training examples. After training (ie, computing the values of \(\alpha_i\) and \(b\)), you only need to memorise \(b\) and \(\{\alpha_i,y_i,x_i\}\) for non-null \(\alpha_i\). This makes SVM's prediction algorithm light in memory usage and fast in execution time. The examples which are not "thrown away" (\(\alpha_i\gt\epsilon\)) are called the "support vectors" and give their name to the SVM model. They lay along or near the border between each category. Their number varies in function of the dataset and kernel, and increases as the relaxation coefficient decreases.
An interesting value is the absolute value of \(u\). The sign of \(u\) is enough to categorize the input into "target category" or "other category", but its absolute value is also an indication of the confidence of the prediction. The boundary is defined by \(\left\lbrace\vec{x},u(\vec{x})=0.0\right\rbrace\), and \(u(\vec{x})\) increases as \(\vec{x}\) goes away from the boundary toward the 'target category', or decreases as it goes away toward 'another category'. Hence it can be used to define a heuristic to resolve conflicts in the case of several categories: choose the one with maximum \(|u|\).
Here is now my version of the pseudo-code of the SMO algorithm.
To get a better grasp on what's going on I've made a simple dataset, trained a SVM on it with various parameters and a Gaussian kernel, done some visualisation. The dataset is made of two categories (red dots and blue dots) normally distributed in 2D, with an outlier blue lying into the red area.
I've performed the training with values 10, 100, 1000 for the relaxation coefficient \(C\) and values 0.1, 1, 10 for the Gaussian parameter \(\gamma\). In the visualisation the shade of grey indicates the confidence (black: low, white: high), the white line is the predicted boundary between the categories, the grey lines are lines of equi-confidence, and the dot circled in white are the support vectors. Click the image to enlarge.
In all cases the SVM returns a plausible boundary between the two categories. As expected, the number of support vectors is significantly less than the number of training examples, and they tend to be located near the boundary. As \(C\) increases the influence of the outlier (blue dot near the bottom-right corner) is clearly visible. For a given \(\gamma\), the confidence of the SVM in the outlier area decreases as \(C\) increases, going up to predicting that area as belonging to the blue category when \(\gamma=10\). The \(\gamma\) parameter of the Gaussian radial basis function controls the range around an example affected by that example. A low value of \(\gamma\) means a large range, a high one means a short range. So,
As mentioned earlier, finding the best pair of values can be done using k-fold cross validation. With the view to automate it, differential evolution (cf previous article) can be used to explore these values. \(C\in[\epsilon,1000]\) and \(\gamma\in[\epsilon,10]\) seems appropriate domains to explore, and for the fitness function I think that the sum of min/avg/max of the accuracy on the validation folds is a good choice.
Finally, I wanted to try these support vector machines onto real world examples. I've implemented SVM as explained in this article and added it to LibCapy (v0.4), then made a small CLI app to use it. For the datasets, I've used several ones from openml.org. I put below the output of that app for each dataset, with the link to the dataset on OpenML. The time limit for searching \(C, \gamma\) was 60mn per dataset. For comparison, I give the average accuracy on validation splits for my SVM implementation and the best run (not limited to SVM models) on OpenML obtained with the Python script:
Note however that I do not use the same splits as those on OpenML.
Diabetes (OpenML):
Average accuracy on validation splits, me: 0.8060, OpenML: 0.7877.
Haberman (OpenML):
Average accuracy on validation splits, me: 0.9478, OpenML: 0.7679.
Heart-statlog (OpenML):
Average accuracy on validation splits, me: 0.7148, OpenML: 0.8629.
Ionosphere (OpenML):
Average accuracy on validation splits, me: 0.7693, OpenML: 0.9601.
Kr-vs-kp (OpenML):
Average accuracy on validation splits, me: 0.9146, OpenML: 0.9978.
Monks-problems-1 (OpenML):
Average accuracy on validation splits, me: 0.9964, OpenML: 1.0.
Monks-problems-2 (OpenML):
Average accuracy on validation splits, me: 0.7552, OpenML: 1.0.
Mushroom (OpenML):
Average accuracy on validation splits, me: 1.0, OpenML: 1.0.
Sonar (OpenML):
Average accuracy on validation splits, me: 0.8700, OpenML: 0.8990.
Spambase (OpenML):
Average accuracy on validation splits, me: 0.9082, OpenML: 0.9626.
Tic-tac-toe (OpenML):
Average accuracy on validation splits, me: 0.7765, OpenML: 1.0.
Wdbc (OpenML):
Average accuracy on validation splits, me: 0.9684, OpenML: 0.9824.
Summary:
Within 12 datasets, the average accuracy over validation splits was above 90% for 6 datasets, above 80% for 8 datasets, and at worst 71.5%. In two cases it is above and in 4 cases at par with the best result on OpenML. These results validate my implementation of the support vector machine model, sequential minimal optimisation algorithm, and search of hyper-parameters using differential evolution. It would be interesting to see how it performs with a different kernel, or with more time to search the hyper-parameters (especially on large datasets for which one hour wasn't enough to perform a meaningful exploration), and to use the OpenML split instead of a random one for better comparison.
Edit on 2022/12/14: A follow up of this article is available here, where I compare support vector machines with neural networks.