Baillehache Pascal's personal website

Support Vector Machine model

There is of course a lot of material on the Internet about support vector machines, but while studying it I found that some points stayed stubbornly unclear, and I kept wondering about them for a while. In this article I gather the insights I've found here and there or worked out myself, as a memo for myself and possibly a help for someone else.

A Support Vector Machine (SVM) is a class of machine learning model used to find a boundary between two categories. It can be extended to several categories by using a one-against-all strategy: train one SVM per category, each one predicting whether an input belongs to its category, and aggregate the $n$ predictions to get the resulting category, with some heuristic to resolve conflicts (several SVMs predicting membership of their respective category for the same input). Initially designed for linear boundaries, SVMs can also handle non-linear boundaries thanks to the so-called "kernel trick". There exists an efficient algorithm to train SVMs, the sequential minimal optimisation, and once trained they generally provide a simple and fast prediction method. Finally, they have been shown to perform well on various real-world problems, which makes them overall a very attractive model. More about the theory can be found on the Wikipedia page and in this presentation by Dr Martin Hoffmann.

The training of an SVM consists of solving an optimisation problem, so the dataset must first be converted into a numerical matrix. For categorical fields, the set of all possible values is created and the index of each value is used as its numerical representation. Each example (row) in the dataset gives a row in the matrix. SVMs also perform better if the input data are normalised, for example by mapping the values of each field from $[m,M]$ to $[0,1]$ where $m$ and $M$ are respectively the minimum and maximum values of the field (other normalisation strategies also exist). The output to predict is stored in the last column. It has a special encoding: it is assigned the value 1.0 for examples of the target category, and -1.0 for examples of other categories. For example, the following dataset (one numerical input, one categorical input, one categorical output):

1.0,A,on
4.0,B,off
3.0,C,on
4.0,A,hs

will be converted and normalised to train a SVM predicting 'on' into the matrix:

[[0.000, 0.0,  1.0],
 [1.000, 0.5, -1.0],
 [0.666, 1.0,  1.0],
 [1.000, 0.0, -1.0]]
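The conversion above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in LibCapy: the function name `to_matrix` is made up for the example, and it assumes that categorical values are indexed in order of first appearance and that the indices are rescaled like any other field.

```python
def to_matrix(rows, target):
    """Convert rows of [input, ..., output] into a normalised numerical matrix."""
    n_col = len(rows[0])
    cols = []
    for c in range(n_col - 1):
        raw = [row[c] for row in rows]
        if all(isinstance(v, (int, float)) for v in raw):
            vals = [float(v) for v in raw]
        else:
            # Categorical field: use the index of each value
            # (assumption: values are indexed in order of first appearance).
            seen = list(dict.fromkeys(raw))
            vals = [float(seen.index(v)) for v in raw]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0  # avoid dividing by zero on constant fields
        cols.append([(v - lo) / span for v in vals])
    # Output column: 1.0 for the target category, -1.0 for all the others.
    out = [1.0 if row[-1] == target else -1.0 for row in rows]
    return [[col[r] for col in cols] + [out[r]] for r in range(len(rows))]


rows = [[1.0, "A", "on"], [4.0, "B", "off"], [3.0, "C", "on"], [4.0, "A", "hs"]]
for line in to_matrix(rows, "on"):
    print([round(v, 3) for v in line])
```

The same minimum and maximum of each field must be kept to normalise new inputs at prediction time.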

The optimisation problem looks like this: $$ \begin{array}{l} \min_{\alpha}\left(\frac{1}{2}\sum_i^N\sum_j^N\left(y_iy_j\alpha_i\alpha_jK(\vec{x}_i,\vec{x}_j)\right)-\sum_i^N\alpha_i\right)\\ 0\le\alpha_i\le C,\forall i\\ \sum_i^Ny_i\alpha_i=0\\ \end{array} $$ where $N$ is the number of examples (rows) in the dataset, $y_i$ is the $i$-th example's output value (-1.0 or 1.0), $\alpha_i$ is the Lagrange multiplier of the $i$-th example, $K()$ is the kernel function, $\vec{x}_i$ is the $i$-th example's input values, and $C$ is what I call the relaxation coefficient (it is more commonly called the regularisation parameter, or soft-margin penalty).
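To make the formula concrete, here is a small Python sketch that evaluates the objective and checks the two constraints for a given vector of multipliers. The function names and the two-point toy problem are assumptions made for the illustration only.

```python
def dual_objective(alpha, y, X, kernel):
    """Value of 1/2 * sum_ij(y_i y_j a_i a_j K(x_i, x_j)) - sum_i(a_i)."""
    n = len(alpha)
    quad = sum(
        y[i] * y[j] * alpha[i] * alpha[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= a_i <= C for all i, and sum_i(y_i a_i) = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    return in_box and balanced


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


# Toy problem: one example per category, on a line.
X = [[0.0], [1.0]]
y = [-1.0, 1.0]
for alpha in ([1.0, 1.0], [2.0, 2.0]):
    print(alpha, is_feasible(alpha, y, C=10.0), dual_objective(alpha, y, X, dot))
```

On this toy problem both candidates satisfy the constraints, and the second one gives the lower objective value, so it is the better of the two.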

Lagrange multipliers allow the minimisation problem to be expressed in a form that is easier to solve. See this video to know more about them. How the optimisation problem is built using them is explained in the links above.

The kernel function encodes how examples relate to each other. The simplest kernel is the inner product, $K(\vec{x},\vec{y})=\vec{x}.\vec{y}$, and leads to SVMs with linear boundaries. Other kernel functions like the polynomial kernel $K(\vec{x},\vec{y})=(\vec{x}.\vec{y} + a)^b$ or the Gaussian radial basis kernel $K(\vec{x},\vec{y})=\exp(-\gamma\|\vec{x}-\vec{y}\|^2)$ lead to non-linear boundaries. See these StatQuest videos if you want to know more: 1, 2, 3. There is no rule to choose which kernel to use (nor how to choose its parameters), so one has to try them out and see which one performs best. However the Gaussian radial basis kernel seems to be accepted as a good default. K-fold cross validation is a good framework to help one determine the best choice. Also, one can create one's own kernel function, it just has to satisfy Mercer's condition (see also Hoffmann's presentation linked above).
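As an illustration, the three kernels can be written as follows in Python. The function names and default parameter values are arbitrary choices for this sketch; the Gaussian kernel is written in its usual form, based on the squared distance between the two vectors.

```python
import math


def linear_kernel(x, y):
    # Inner product of the two vectors.
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, a=1.0, b=2):
    # (x.y + a)^b
    return (linear_kernel(x, y) + a) ** b


def gaussian_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2)
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * sq_dist)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), gaussian_kernel(u, v))
```

Note that the Gaussian kernel of a vector with itself is always 1, and that it tends to 0 as the two vectors get further apart.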

The relaxation coefficient controls how tightly the SVM will try to fit the training data. If $C=+\infty$ (in practice a value around 1000.0 seems common) the SVM tries to reproduce exactly all the training examples. The final accuracy will probably be higher, but the risk of overfitting is also higher. As $C$ decreases toward $0.0$, the boundary between categories gets "smoothed out" and outliers are more likely to be ignored. The final accuracy might decrease, but the risk of overfitting also decreases. Here again k-fold cross validation is a useful framework to help one choose the best value.

The optimisation problem has some particular mathematical properties which allowed the design of an efficient algorithm to solve it: the Sequential Minimal Optimisation (SMO), invented by John Platt. The algorithm is described in this paper. Dr Platt provides the algorithm as pseudo-code; however, clear as it may be to him, it is far from clear to me. So, for the other dumbasses like me, I provide below a more explicit version which will probably save them a lot of pain and sweat.

Once the training is done, one can perform a prediction on a new input $\vec{x}$ by calculating $u=\sum_i^N\left(y_i\alpha_iK(\vec{x}_i,\vec{x}')\right)-b$, where $\vec{x}'$ is equal to the input $\vec{x}$ after applying the same normalisation as was applied to the training dataset. If $u$ is positive, the input matches the category predicted by the SVM (the category assigned a value of 1.0 in the training dataset), otherwise it doesn't. $b$ is the bias of the SVM and is hidden in the formulation of the optimisation problem using the Lagrange multipliers. Its calculation is explained in my version of the pseudo-code below.

In practice, most of the $\alpha_i$ will be equal to $0.0$ (or below a given $\epsilon$). So, despite the prediction depending in theory on the whole training dataset, it is in fact a sum over a much smaller subset of the training examples. After training (i.e., computing the values of $\alpha_i$ and $b$), you only need to memorise $b$ and $\{\alpha_i,y_i,\vec{x}_i\}$ for non-zero $\alpha_i$. This makes the SVM's prediction algorithm light in memory usage and fast in execution time. The examples which are not "thrown away" ($\alpha_i\gt\epsilon$) are called the "support vectors" and give their name to the SVM model. They lie along or near the border between the categories. Their number depends on the dataset and the kernel, and tends to increase as the relaxation coefficient decreases.

An interesting value is the absolute value of $u$. The sign of $u$ is enough to categorize the input into "target category" or "other category", but its absolute value is also an indication of the confidence of the prediction. The boundary is defined by $\left\lbrace\vec{x},u(\vec{x})=0.0\right\rbrace$, and $u(\vec{x})$ increases as $\vec{x}$ goes away from the boundary toward the 'target category', or decreases as it goes away toward 'another category'. Hence it can be used to define a heuristic to resolve conflicts in the case of several categories: choose the category whose SVM gives the maximum $u$ (among the SVMs claiming the input, this is also the maximum $|u|$).
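A possible way to write this heuristic in Python is sketched below. The function name and the values of $u$ are invented for the example. It picks the largest signed $u$, which coincides with the largest $|u|$ when several SVMs claim the input, and falls back to the "least rejected" category when none does.

```python
def predict_category(decision_values):
    """decision_values maps each category to the u of that category's SVM."""
    return max(decision_values, key=decision_values.get)


# Two SVMs claim the input ('on' and 'off'): the most confident one wins.
print(predict_category({"on": 0.4, "off": 1.3, "hs": -2.0}))
# No SVM claims the input: the least rejected category is returned.
print(predict_category({"on": -0.2, "off": -1.5}))
```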

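Here is now my version of the pseudo-code of the SMO algorithm. Note that alpha, error and b are updated by SMOStep, so they must be shared between the functions (passed by reference, not copied).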

epsilon = 0.001
SMOTrain(dataset, kernel, C):
  mat = array of dataset.nbRow rows by (dataset.nbInput+1) columns of
        floating point values
  convert the dataset into numerical as described above and memorise it in mat
  alpha = array of dataset.nbRow floating point values
  for i in [0..dataset.nbRow[
    alpha[i] = 0
  b = 0
  error = array of dataset.nbRow floating point values
  for i in [0..dataset.nbRow[
    error[i] = SMOError(kernel, mat, i, alpha, b)
  nbChanged = 0
  examineAll = 1
  while nbChanged > 0 or examineAll = 1
    nbChanged = 0
    if examineAll = 1
      for i in [0..dataset.nbRow[
        nbChanged += SMOExamine(kernel, mat, i, alpha, error, b, C)
    else
      for i in [0..dataset.nbRow[
        if epsilon < alpha[i] < C - epsilon
          nbChanged += SMOExamine(kernel, mat, i, alpha, error, b, C)
    if examineAll = 1
      examineAll = 0
    else if nbChanged = 0
      examineAll = 1
  nbSupport = 0
  for i in [0..dataset.nbRow[
    if alpha[i] > epsilon
      nbSupport += 1
  support = array of nbSupport rows by (mat.nbCol-1) columns
            floating point values
  lambda = array of nbSupport floating point values
  y = array of nbSupport floating point values
  iSupport = 0
  for i in [0..dataset.nbRow[
    if alpha[i] > epsilon
      lambda[iSupport] = alpha[i]
      y[iSupport] = mat[i][mat.nbCol-1]
      for j in [0..(mat.nbCol-1)[
        support[iSupport][j] = mat[i][j]
      iSupport += 1
  return support, y, lambda, b

SMOError(kernel, mat, iRow, alpha, b):
  xi = mat[iRow][0..(mat.nbCol-2)]
  yi = mat[iRow][mat.nbCol-1]
  ui = SMOEval(kernel, mat, xi, alpha, b)
  error = ui - yi
  return error

SMOEval(kernel, mat, x, alpha, b):
  u = 0
  for i in [0..mat.nbRow[
    xi = mat[i][0..(mat.nbCol-2)]
    yi = mat[i][mat.nbCol-1]
    u += yi * alpha[i] * kernel(xi, x)
  u -= b
  return u

SMOExamine(kernel, mat, iRow, alpha, error, b, C):
  if SMOViolatesKKT(mat, kernel, alpha, b, iRow, C)
    max = 0
    jRow = 0
    for j in [0..mat.nbRow[
      e = abs(error[iRow] - error[j])
      if max < e
        max = e
        jRow = j
    n = SMOStep(mat, kernel, iRow, jRow, alpha, error, b, C)
    if n > 0
      return n
    jRowMax = jRow
    shiftRow = random integer in [0..mat.nbRow[
    for j in [0..mat.nbRow[
      jRow = (j + shiftRow) modulo mat.nbRow
      if jRow <> iRow and jRow <> jRowMax and 
          epsilon < alpha[jRow] < C - epsilon
        n = SMOStep(mat, kernel, iRow, jRow, alpha, error, b, C)
        if n > 0
          return n
    for j in [0..mat.nbRow[
      jRow = (j + shiftRow) modulo mat.nbRow
      if jRow <> iRow and jRow <> jRowMax and
          (alpha[jRow] <= epsilon or C - epsilon <= alpha[jRow])
        n = SMOStep(mat, kernel, iRow, jRow, alpha, error, b, C)
        if n > 0
          return n
  return 0

SMOViolatesKKT(mat, kernel, alpha, b, iRow, C):
  xi = mat[iRow][0..(mat.nbCol-2)]
  yi = mat[iRow][mat.nbCol-1]
  ui = SMOEval(kernel, mat, xi, alpha, b)
  ei = yi * ui
  if alpha[iRow] < epsilon
    if ei < 1.0 - epsilon
      return true
  else if alpha[iRow] > C - epsilon
    if ei > 1.0 + epsilon
      return true
  else
    if abs(ei - 1.0) > epsilon
      return true
  return false

SMOStep(mat, kernel, iRow, jRow, alpha, error, b, C):
  x1 = mat[iRow][0..(mat.nbCol-2)]
  y1 = mat[iRow][mat.nbCol-1]
  x2 = mat[jRow][0..(mat.nbCol-2)]
  y2 = mat[jRow][mat.nbCol-1]
  alpha1 = alpha[iRow]
  alpha2 = alpha[jRow]
  e1 = error[iRow]
  e2 = error[jRow]
  if iRow = jRow
    return 0
  s = y1 * y2
  if s < 0
    L = max(0, alpha2 - alpha1)
    H = min(C, C + alpha2 - alpha1)
  else
    L = max(0, alpha2 + alpha1 - C)
    H = min(C, alpha2 + alpha1)
  if L = H
    return 0
  k11 = kernel(x1, x1)
  k12 = kernel(x1, x2)
  k22 = kernel(x2, x2)
  eta = k11 + k22 - 2 * k12
  if eta > 0.0
    newAlpha2 = alpha2 + y2 * (e1 - e2) / eta
    newAlpha2 = min(H, max(L, newAlpha2))
  else
    f1 = y1 * (e1 + b) - alpha1 * k11 - s * alpha2 * k12
    f2 = y2 * (e2 + b) - s * alpha1 * k12 - alpha2 * k22
    L1 = alpha1 + s * (alpha2 - L)
    H1 = alpha1 + s * (alpha2 - H)
    Lobj =
      L1 * f1 + L * f2 + 0.5 * L1 * L1 * k11 + 0.5 * L * L * k22 +
      s * L * L1 * k12
    Hobj =
      H1 * f1 + H * f2 + 0.5 * H1 * H1 * k11 + 0.5 * H * H * k22 +
      s * H * H1 * k12
    if Lobj < Hobj
      newAlpha2 = L
    else if Hobj < Lobj
      newAlpha2 = H
    else
      newAlpha2 = alpha2
  if alpha2 = newAlpha2
    return 0
  newAlpha1 = alpha1 + s * (alpha2 - newAlpha2)
  deltaAlpha1 = newAlpha1 - alpha1
  deltaAlpha2 = newAlpha2 - alpha2
  b1 = e1 + y1 * deltaAlpha1 * k11 + y2 * deltaAlpha2 * k12 + b
  b2 = e2 + y1 * deltaAlpha1 * k12 + y2 * deltaAlpha2 * k22 + b
  if epsilon < newAlpha1 < C - epsilon
    newB = b1
  else if epsilon < newAlpha2 < C - epsilon
    newB = b2
  else
    newB = 0.5 * (b1 + b2)
  alpha[iRow] = newAlpha1
  alpha[jRow] = newAlpha2
  deltaB = newB - b
  for i in [0..mat.nbRow[
    xi = mat[i][0..(mat.nbCol-2)]
    k1 = kernel(x1, xi)
    k2 = kernel(x2, xi)
    error[i] += y1 * deltaAlpha1 * k1 + y2 * deltaAlpha2 * k2 - deltaB
  b = newB
  return 1

SMOPredict(kernel, lambda, y, b, support, x):
  u = 0
  for i in [0..support.nbRow[
    xi = support[i][0..(support.nbCol-1)]
    u += lambda[i] * kernel(xi, x) * y[i]
  u -= b
  if u > 0
    x is in the target category
  else
    x is not in the target category
  confidence = abs(u)
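For readers who prefer something they can run, here is a compact Python sketch following the same structure as the pseudo-code. It is not the LibCapy implementation and it takes a few liberties, all of them assumptions of this sketch: the class name `SMO` is made up, the kernel values are cached in a matrix, the branch for `eta <= 0` is simply skipped, a step is abandoned when the change of alpha is negligible (relative tolerance as in Platt's paper), and the toy dataset is invented.

```python
import math
import random

EPS = 1e-3  # plays the role of epsilon in the pseudo-code

class SMO:
    def __init__(self, X, y, kernel, C, seed=0):
        self.X, self.y, self.kernel, self.C = X, y, kernel, C
        self.n = len(X)
        self.alpha = [0.0] * self.n
        self.b = 0.0
        # Cache every kernel value once (only reasonable for small datasets).
        self.K = [[kernel(p, q) for q in X] for p in X]
        # With alpha = 0 and b = 0 every u_i is 0, hence error_i = -y_i.
        self.error = [-t for t in y]
        self.rng = random.Random(seed)

    def non_bound(self, i):
        return EPS < self.alpha[i] < self.C - EPS

    def violates_kkt(self, i):
        r = self.y[i] * self.error[i]  # equals y_i * u_i - 1
        a = self.alpha[i]
        return (r < -EPS and a < self.C - EPS) or (r > EPS and a > EPS)

    def step(self, i, j):
        if i == j:
            return 0
        a1, a2, y1, y2 = self.alpha[i], self.alpha[j], self.y[i], self.y[j]
        e1, e2, C = self.error[i], self.error[j], self.C
        s = y1 * y2
        if s < 0:
            lo, hi = max(0.0, a2 - a1), min(C, C + a2 - a1)
        else:
            lo, hi = max(0.0, a2 + a1 - C), min(C, a2 + a1)
        k11, k12, k22 = self.K[i][i], self.K[i][j], self.K[j][j]
        eta = k11 + k22 - 2.0 * k12
        if lo >= hi or eta <= 0.0:
            return 0  # simplification: the eta <= 0 branch is skipped
        new_a2 = min(hi, max(lo, a2 + y2 * (e1 - e2) / eta))
        if abs(new_a2 - a2) < EPS * (new_a2 + a2 + EPS):
            return 0  # negligible progress
        new_a1 = a1 + s * (a2 - new_a2)
        d1, d2 = new_a1 - a1, new_a2 - a2
        b1 = e1 + y1 * d1 * k11 + y2 * d2 * k12 + self.b
        b2 = e2 + y1 * d1 * k12 + y2 * d2 * k22 + self.b
        if EPS < new_a1 < C - EPS:
            new_b = b1
        elif EPS < new_a2 < C - EPS:
            new_b = b2
        else:
            new_b = 0.5 * (b1 + b2)
        self.alpha[i], self.alpha[j] = new_a1, new_a2
        db = new_b - self.b
        for k in range(self.n):
            self.error[k] += y1 * d1 * self.K[i][k] + y2 * d2 * self.K[j][k] - db
        self.b = new_b
        return 1

    def examine(self, i):
        if not self.violates_kkt(i):
            return 0
        # First choice: the example maximising |E_i - E_j|.
        j_max = max(range(self.n), key=lambda j: abs(self.error[i] - self.error[j]))
        if self.step(i, j_max):
            return 1
        # Then the non-bound examples, then the others, from a random start.
        shift = self.rng.randrange(self.n)
        order = [(j + shift) % self.n for j in range(self.n)]
        for want_non_bound in (True, False):
            for j in order:
                if j != j_max and self.non_bound(j) == want_non_bound and self.step(i, j):
                    return 1
        return 0

    def train(self):
        examine_all, changed = True, 0
        while changed > 0 or examine_all:
            changed = 0
            for i in range(self.n):
                if examine_all or self.non_bound(i):
                    changed += self.examine(i)
            if examine_all:
                examine_all = False
            elif changed == 0:
                examine_all = True
        return self

    def decision(self, x):
        # Only the support vectors (alpha > EPS) contribute to u.
        return sum(a * t * self.kernel(p, x)
                   for a, t, p in zip(self.alpha, self.y, self.X) if a > EPS) - self.b

# Invented toy dataset, already normalised: two small clusters.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0], [0.9, 0.8], [0.8, 1.0]]
y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
gaussian = lambda p, q: math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)))
model = SMO(X, y, gaussian, C=10.0).train()
print([round(model.decision(x), 2) for x in X])
```

The sign of `model.decision(x)` gives the predicted category and its absolute value the confidence, exactly as in SMOPredict.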

To get a better grasp of what's going on I've made a simple dataset, trained an SVM on it with a Gaussian kernel and various parameters, and made some visualisations. The dataset is made of two categories (red dots and blue dots) normally distributed in 2D, with one blue outlier lying in the red area.

3
in,in,out
x,y,cat
num,num,cat
100
7.500,7.500,blue
6.818,6.851,red
5.055,4.633,blue
8.292,7.110,red
5.554,5.560,blue
7.904,7.262,red
6.597,5.059,blue
7.276,6.869,red
4.236,5.180,blue
5.814,6.678,red
5.192,5.053,blue
6.138,6.623,red
4.490,4.621,blue
7.778,7.353,red
5.913,4.076,blue
6.562,7.425,red
4.996,4.802,blue
7.276,6.370,red
3.720,5.913,blue
6.435,6.525,red
2.469,5.164,blue
5.484,6.685,red
3.557,4.533,blue
7.284,6.814,red
4.537,5.467,blue
6.392,7.308,red
5.980,4.803,blue
6.848,6.047,red
5.377,4.779,blue
6.072,6.467,red
4.714,5.057,blue
8.779,7.117,red
5.157,4.908,blue
7.241,6.688,red
5.878,4.398,blue
7.198,7.093,red
6.642,4.528,blue
5.624,7.174,red
5.491,4.814,blue
5.588,7.203,red
6.281,5.916,blue
6.455,7.177,red
3.094,5.464,blue
7.202,6.816,red
5.930,4.659,blue
7.961,7.544,red
4.550,6.443,blue
6.345,6.585,red
2.575,4.341,blue
6.774,7.279,red
3.195,5.599,blue
5.413,7.113,red
6.505,4.585,blue
7.129,7.429,red
4.784,5.086,blue
5.918,6.545,red
4.776,5.091,blue
6.954,6.389,red
6.652,4.168,blue
7.948,6.841,red
7.343,5.644,blue
7.478,8.186,red
2.066,4.534,blue
6.839,6.938,red
4.604,4.777,blue
6.648,6.760,red
5.736,4.172,blue
7.187,8.051,red
4.330,5.468,blue
7.628,7.436,red
5.855,5.365,blue
7.840,7.162,red
5.177,5.037,blue
7.411,6.476,red
4.794,4.484,blue
7.104,7.117,red
5.689,5.908,blue
6.941,6.673,red
4.347,5.323,blue
6.276,6.153,red
4.986,3.919,blue
7.243,7.005,red
5.005,5.672,blue
8.085,6.499,red
5.641,4.823,blue
7.172,6.792,red
4.864,5.141,blue
6.153,6.815,red
4.250,5.083,blue
7.990,7.660,red
4.625,4.477,blue
7.546,6.965,red
5.260,5.471,blue
7.127,6.147,red
3.752,5.434,blue
5.946,6.158,red
4.374,5.669,blue
8.283,6.656,red
5.663,4.180,blue
5.756,7.406,red

I've performed the training with values 10, 100, 1000 for the relaxation coefficient $C$ and values 0.1, 1, 10 for the Gaussian parameter $\gamma$. In the visualisation the shade of grey indicates the confidence (black: low, white: high), the white line is the predicted boundary between the categories, the grey lines are lines of equal confidence, and the dots circled in white are the support vectors.

In all cases the SVM returns a plausible boundary between the two categories. As expected, the number of support vectors is significantly less than the number of training examples, and they tend to be located near the boundary. As $C$ increases the influence of the outlier (blue dot near the bottom-right corner) is clearly visible. For a given $\gamma$, the confidence of the SVM in the outlier area decreases as $C$ increases, going as far as predicting that area as belonging to the blue category when $\gamma=10$. The $\gamma$ parameter of the Gaussian radial basis function controls how far the influence of an example extends around it. A low value of $\gamma$ means a long range, a high one means a short range. So,

for $C=10,\gamma=0.1$ the SVM doesn't mind misclassifying a few examples and looks at the whole dataset at once: the categories are separated by an almost linear boundary.
for $C=1000,\gamma=0.1$ the SVM is more sensitive to misclassification and looks at the whole dataset at once: the boundary is bent toward the misclassified blue dots near the boundary.
for $C=10,\gamma=10$ the SVM doesn't mind misclassifying a few examples and looks only at examples near a given area: the confidence is more concentrated around the examples of each category, and the outlier is ignored.
for $C=1000,\gamma=10$ the SVM is more sensitive to misclassification and looks only at examples near a given area: the confidence is more concentrated around the examples of each category, and the outlier gets predicted as really belonging to the blue category.
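The effect of $\gamma$ on the range of influence can be checked numerically. The short Python sketch below evaluates the Gaussian kernel, in its usual squared-distance form, at a few distances for the three values of $\gamma$ used here; the function name `influence` and the distances are arbitrary choices for the illustration.

```python
import math


def influence(gamma, distance):
    # Value of the Gaussian kernel between two points 'distance' apart.
    return math.exp(-gamma * distance * distance)


for gamma in (0.1, 1.0, 10.0):
    values = [influence(gamma, d) for d in (0.5, 1.0, 2.0)]
    print(gamma, ["%.5f" % v for v in values])
```

With $\gamma=0.1$ an example still weighs heavily on points at distance 1, while with $\gamma=10$ its influence has almost vanished at the same distance.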

As mentioned earlier, finding the best pair of values can be done using k-fold cross validation. In order to automate it, differential evolution (cf previous article) can be used to explore these values. $C\in[\epsilon,1000]$ and $\gamma\in[\epsilon,10]$ seem to be appropriate domains to explore, and for the fitness function I think that the sum of min/avg/max of the accuracy on the validation folds is a good choice.

1. for each agent, choose a random pair of values $C,\gamma$
2. for each agent, run a 10-fold cross validation (split the dataset in 10, train the agent on 9 splits, evaluate it on the remaining split, for all 10 combinations of the splits)
3. for each agent, calculate the fitness of the agent as the sum of the minimum, average and maximum accuracy on the 10 validation splits
4. apply the differential evolution rules to update the agents' parameters $C,\gamma$
5. repeat steps 2 to 4 until a perfect fitness is reached or the time limit expires
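The fitness evaluation of one agent can be sketched as follows in Python. This is a hedged illustration only: the helper names are invented, the folds are built round-robin rather than randomly, and `accuracy_on_fold` is a placeholder standing for "train an SVM with the agent's $C,\gamma$ on the training indices and return its accuracy on the validation indices".

```python
def k_fold_indices(n, k):
    """Split range(n) into k folds of nearly equal size (round-robin)."""
    return [list(range(f, n, k)) for f in range(k)]


def fitness(n, k, accuracy_on_fold):
    """Sum of the min, average and max accuracy over the k validation folds.

    accuracy_on_fold(train_idx, valid_idx) must return a value in [0, 1].
    """
    accs = []
    for valid in k_fold_indices(n, k):
        held_out = set(valid)
        train = [i for i in range(n) if i not in held_out]
        accs.append(accuracy_on_fold(train, valid))
    return min(accs) + sum(accs) / len(accs) + max(accs)


# Placeholder evaluation: pretend the model is right on every validation example.
print(fitness(20, 10, lambda train, valid: 1.0))
```

With this definition a perfect fitness is 3.0, reached when the accuracy is 1.0 on every validation split.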

Finally, I wanted to try these support vector machines on real-world examples. I've implemented SVM as explained in this article and added it to LibCapy (v0.4), then made a small CLI app to use it. For the datasets, I've used several from openml.org. I put below the output of that app for each dataset, with the link to the dataset on OpenML. The time limit for searching $C, \gamma$ was 60 minutes per dataset. For comparison, I give the average accuracy on validation splits for my SVM implementation and for the best run (not limited to SVM models) on OpenML, obtained with the Python script:

import openml
task_id = ... # cf link for each dataset
metric = "predictive_accuracy"
evals = openml.evaluations.list_evaluations(
    function=metric, tasks=[task_id], output_format="dataframe")
evals = evals.sort_values(by="value", ascending=False)
print(evals.head())

Note however that I do not use the same splits as those on OpenML.

Diabetes (OpenML):

Training a SVM with gaussian kernel on Resources/diabetes.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Thu Nov 17 21:00:31, maximum training time: 60mn
Search the best hyper parameters...
Thu Nov 17 22:09:08 (60mn) #7, hyperParams:[266.062, 0.969], fitness:2.520    
End the search of the best hyper parameters.
Hyper parameters value: [266.062054, 0.969021]
Perform 10-fold cross validation...
10-fold cross valid, split #3, acc./fit. train 0.890/0.523, valid 0.753/0.443
10-fold cross valid, split #8, acc./fit. train 0.874/0.494, valid 0.829/0.468
10-fold cross valid, split #7, acc./fit. train 0.887/0.523, valid 0.779/0.459
10-fold cross valid, split #2, acc./fit. train 0.887/0.516, valid 0.792/0.461
10-fold cross valid, split #0, acc./fit. train 0.896/0.528, valid 0.779/0.459
10-fold cross valid, split #4, acc./fit. train 0.893/0.534, valid 0.779/0.466
10-fold cross valid, split #6, acc./fit. train 0.883/0.502, valid 0.818/0.465
10-fold cross valid, split #1, acc./fit. train 0.880/0.495, valid 0.935/0.526
10-fold cross valid, split #5, acc./fit. train 0.886/0.510, valid 0.805/0.464
10-fold cross valid, split #9, acc./fit. train 0.889/0.518, valid 0.789/0.460
Avg acc training: 0.88643, avg acc validation: 0.80600
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.884, conf. matrix:[465,20,69,214]
Nb support vector: 324 (in 768 candidates, reduc: 0.578125)
Export the classifier as C functions to ./classify.c
Ends on Thu Nov 17 22:20:03

Average accuracy on validation splits, me: 0.8060, OpenML: 0.7877.

Haberman (OpenML):

Training a SVM with gaussian kernel on Resources/haberman.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Thu Nov 17 22:20:03, maximum training time: 60mn
Search the best hyper parameters...
Thu Nov 17 23:20:24 (60mn) #135, hyperParams:[789.023, 0.417], fitness:2.851    
End the search of the best hyper parameters.
Hyper parameters value: [789.022733, 0.417199]
Perform 10-fold cross validation...
10-fold cross valid, split #6, acc./fit. train 0.946/0.778, valid 0.900/0.740
10-fold cross valid, split #1, acc./fit. train 0.953/0.773, valid 0.903/0.732
10-fold cross valid, split #0, acc./fit. train 0.945/0.767, valid 0.935/0.759
10-fold cross valid, split #5, acc./fit. train 0.938/0.761, valid 0.968/0.785
10-fold cross valid, split #9, acc./fit. train 0.931/0.752, valid 0.967/0.781
10-fold cross valid, split #2, acc./fit. train 0.935/0.741, valid 0.968/0.767
10-fold cross valid, split #4, acc./fit. train 0.942/0.760, valid 0.903/0.729
10-fold cross valid, split #7, acc./fit. train 0.942/0.741, valid 1.000/0.786
10-fold cross valid, split #3, acc./fit. train 0.935/0.744, valid 0.968/0.771
10-fold cross valid, split #8, acc./fit. train 0.946/0.761, valid 0.967/0.778
Avg acc training: 0.94118, avg acc validation: 0.94785
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.938, conf. matrix:[249,10,9,38]
Nb support vector: 56 (in 306 candidates, reduc: 0.816993)
Export the classifier as C functions to ./classify.c
Ends on Thu Nov 17 23:21:09

Average accuracy on validation splits, me: 0.9478, OpenML: 0.7679.

Heart-statlog (OpenML):

Training a SVM with gaussian kernel on Resources/heart-statlog.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Thu Nov 17 23:21:09, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 00:22:10 (60mn) #53, hyperParams:[661.549, 1.562], fitness:2.307    
End the search of the best hyper parameters.
Hyper parameters value: [661.548583, 1.561873]
Perform 10-fold cross validation...
10-fold cross valid, split #8, acc./fit. train 1.000/0.506, valid 0.630/0.319
10-fold cross valid, split #1, acc./fit. train 0.996/0.504, valid 0.593/0.300
10-fold cross valid, split #7, acc./fit. train 1.000/0.473, valid 0.778/0.368
10-fold cross valid, split #2, acc./fit. train 1.000/0.444, valid 0.815/0.362
10-fold cross valid, split #9, acc./fit. train 1.000/0.477, valid 0.630/0.301
10-fold cross valid, split #5, acc./fit. train 0.996/0.459, valid 0.741/0.341
10-fold cross valid, split #3, acc./fit. train 1.000/0.420, valid 0.593/0.249
10-fold cross valid, split #0, acc./fit. train 0.992/0.457, valid 0.741/0.341
10-fold cross valid, split #6, acc./fit. train 0.992/0.400, valid 0.852/0.344
10-fold cross valid, split #4, acc./fit. train 1.000/0.412, valid 0.778/0.320
Avg acc training: 0.99753, avg acc validation: 0.71481
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[143,0,0,127]
Nb support vector: 150 (in 270 candidates, reduc: 0.444444)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 00:24:48

Average accuracy on validation splits, me: 0.7148, OpenML: 0.8629.

Ionosphere (OpenML):

Training a SVM with gaussian kernel on Resources/ionosphere.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 00:24:48, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 01:27:14 (60mn) #23, hyperParams:[140.242, 0.421], fitness:2.341    
End the search of the best hyper parameters.
Hyper parameters value: [140.241901, 0.421165]
Perform 10-fold cross validation...
10-fold cross valid, split #9, acc./fit. train 0.965/0.559, valid 0.743/0.430
10-fold cross valid, split #3, acc./fit. train 0.968/0.555, valid 0.743/0.425
10-fold cross valid, split #0, acc./fit. train 0.975/0.529, valid 0.722/0.392
10-fold cross valid, split #7, acc./fit. train 0.959/0.510, valid 0.771/0.410
10-fold cross valid, split #5, acc./fit. train 0.959/0.522, valid 0.771/0.420
10-fold cross valid, split #2, acc./fit. train 0.972/0.510, valid 0.857/0.450
10-fold cross valid, split #8, acc./fit. train 0.965/0.510, valid 0.800/0.423
10-fold cross valid, split #6, acc./fit. train 0.978/0.514, valid 0.800/0.420
10-fold cross valid, split #4, acc./fit. train 0.978/0.501, valid 0.771/0.395
10-fold cross valid, split #1, acc./fit. train 0.962/0.490, valid 0.714/0.364
Avg acc training: 0.96803, avg acc validation: 0.76937
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.972, conf. matrix:[119,9,1,222]
Nb support vector: 160 (in 351 candidates, reduc: 0.544160)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 01:32:52

Average accuracy on validation splits, me: 0.7693, OpenML: 0.9601.

Kr-vs-kp (OpenML):

Training a SVM with gaussian kernel on Resources/kr-vs-kp.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 01:33:01, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 02:34:01 (60mn) #6, hyperParams:[192.353, 0.463], fitness:2.749    
End the search of the best hyper parameters.
Hyper parameters value: [192.353214, 0.462814]
Perform 10-fold cross validation...
10-fold cross valid, split #3, acc./fit. train 1.000/0.648, valid 0.922/0.597
10-fold cross valid, split #9, acc./fit. train 1.000/0.651, valid 0.915/0.596
10-fold cross valid, split #2, acc./fit. train 1.000/0.648, valid 0.909/0.589
10-fold cross valid, split #4, acc./fit. train 1.000/0.652, valid 0.909/0.593
10-fold cross valid, split #1, acc./fit. train 1.000/0.645, valid 0.897/0.578
10-fold cross valid, split #6, acc./fit. train 1.000/0.651, valid 0.925/0.602
10-fold cross valid, split #7, acc./fit. train 1.000/0.659, valid 0.934/0.616
10-fold cross valid, split #0, acc./fit. train 1.000/0.648, valid 0.891/0.577
10-fold cross valid, split #5, acc./fit. train 1.000/0.656, valid 0.900/0.591
10-fold cross valid, split #8, acc./fit. train 1.000/0.649, valid 0.944/0.612
Avg acc training: 1.00000, avg acc validation: 0.91460
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[1799,0,0,1397]
Nb support vector: 1045 (in 3196 candidates, reduc: 0.673029)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 04:21:14

Average accuracy on validation splits, me: 0.9146, OpenML: 0.9978.

Monks-problems-1 (OpenML):

Training a SVM with gaussian kernel on Resources/monks-problems-1.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 04:21:14, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 05:23:11 (60mn) #30, hyperParams:[594.606, 2.098], fitness:2.979    
End the search of the best hyper parameters.
Hyper parameters value: [594.605742, 2.098373]
Perform 10-fold cross validation...
10-fold cross valid, split #3, acc./fit. train 1.000/0.558, valid 1.000/0.558
10-fold cross valid, split #2, acc./fit. train 1.000/0.570, valid 1.000/0.570
10-fold cross valid, split #7, acc./fit. train 1.000/0.579, valid 1.000/0.579
10-fold cross valid, split #0, acc./fit. train 1.000/0.582, valid 1.000/0.582
10-fold cross valid, split #8, acc./fit. train 1.000/0.577, valid 1.000/0.577
10-fold cross valid, split #5, acc./fit. train 1.000/0.550, valid 1.000/0.550
10-fold cross valid, split #6, acc./fit. train 1.000/0.569, valid 1.000/0.569
10-fold cross valid, split #1, acc./fit. train 1.000/0.548, valid 0.982/0.538
10-fold cross valid, split #4, acc./fit. train 1.000/0.534, valid 0.982/0.524
10-fold cross valid, split #9, acc./fit. train 1.000/0.515, valid 1.000/0.515
Avg acc training: 1.00000, avg acc validation: 0.99643
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[278,0,0,278]
Nb support vector: 209 (in 556 candidates, reduc: 0.624101)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 05:33:52

Average accuracy on validation splits, me: 0.9964, OpenML: 1.0.

Monks-problems-2 (OpenML):

Training a SVM with gaussian kernel on Resources/monks-problems-2.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 05:33:52, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 06:40:10 (60mn) #11, hyperParams:[428.097, 3.936], fitness:2.291    
End the search of the best hyper parameters.
Hyper parameters value: [428.097289, 3.935634]
Perform 10-fold cross validation...
10-fold cross valid, split #2, acc./fit. train 0.967/0.413, valid 0.783/0.334
10-fold cross valid, split #8, acc./fit. train 0.967/0.398, valid 0.783/0.323
10-fold cross valid, split #9, acc./fit. train 0.954/0.510, valid 0.717/0.383
10-fold cross valid, split #3, acc./fit. train 0.965/0.408, valid 0.717/0.303
10-fold cross valid, split #0, acc./fit. train 0.963/0.374, valid 0.836/0.325
10-fold cross valid, split #1, acc./fit. train 0.970/0.506, valid 0.767/0.400
10-fold cross valid, split #5, acc./fit. train 0.967/0.484, valid 0.700/0.351
10-fold cross valid, split #4, acc./fit. train 0.967/0.384, valid 0.750/0.298
10-fold cross valid, split #6, acc./fit. train 0.967/0.493, valid 0.717/0.366
10-fold cross valid, split #7, acc./fit. train 0.972/0.397, valid 0.783/0.320
Avg acc training: 0.96580, avg acc validation: 0.75527
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.965, conf. matrix:[214,13,8,366]
Nb support vector: 336 (in 601 candidates, reduc: 0.440932)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 06:49:59

Average accuracy on validation splits, me: 0.7552, OpenML: 1.0.

Mushroom (OpenML):

Training a SVM with gaussian kernel on Resources/mushroom.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 06:49:59, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 08:31:27 (0mn) #1, hyperParams:[154.258, 9.383], fitness:3.000    
End the search of the best hyper parameters.
Hyper parameters value: [154.257718, 9.383137]
Perform 10-fold cross validation...
10-fold cross valid, split #7, acc./fit. train 1.000/0.622, valid 1.000/0.622
10-fold cross valid, split #1, acc./fit. train 1.000/0.622, valid 1.000/0.622
10-fold cross valid, split #6, acc./fit. train 1.000/0.621, valid 1.000/0.621
10-fold cross valid, split #9, acc./fit. train 1.000/0.622, valid 1.000/0.622
10-fold cross valid, split #3, acc./fit. train 1.000/0.625, valid 1.000/0.625
10-fold cross valid, split #2, acc./fit. train 1.000/0.624, valid 1.000/0.624
10-fold cross valid, split #8, acc./fit. train 1.000/0.627, valid 1.000/0.627
10-fold cross valid, split #4, acc./fit. train 1.000/0.622, valid 1.000/0.622
10-fold cross valid, split #5, acc./fit. train 1.000/0.622, valid 1.000/0.622
10-fold cross valid, split #0, acc./fit. train 1.000/0.622, valid 1.000/0.622
Avg acc training: 1.00000, avg acc validation: 1.00000
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[3916,0,0,4208]
Nb support vector: 3031 (in 8124 candidates, reduc: 0.626908)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 12:25:31

Average accuracy on validation splits, me: 1.0, OpenML: 1.0.

Sonar (OpenML):

Training a SVM with gaussian kernel on Resources/sonar.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 12:25:31, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 13:25:34 (60mn) #192, hyperParams:[215.318, 0.003], fitness:2.632    
End the search of the best hyper parameters.
Hyper parameters value: [215.318451, 0.003051]
Perform 10-fold cross validation...
10-fold cross valid, split #0, acc./fit. train 0.904/0.493, valid 0.762/0.416
10-fold cross valid, split #7, acc./fit. train 0.898/0.471, valid 0.905/0.474
10-fold cross valid, split #1, acc./fit. train 0.904/0.478, valid 0.810/0.429
10-fold cross valid, split #3, acc./fit. train 0.898/0.485, valid 0.905/0.489
10-fold cross valid, split #2, acc./fit. train 0.904/0.474, valid 0.857/0.449
10-fold cross valid, split #4, acc./fit. train 0.914/0.484, valid 0.905/0.479
10-fold cross valid, split #5, acc./fit. train 0.920/0.502, valid 0.857/0.468
10-fold cross valid, split #9, acc./fit. train 0.904/0.491, valid 0.900/0.488
10-fold cross valid, split #8, acc./fit. train 0.915/0.492, valid 0.800/0.430
10-fold cross valid, split #6, acc./fit. train 0.898/0.452, valid 1.000/0.503
Avg acc training: 0.90598, avg acc validation: 0.87000
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.894, conf. matrix:[91,9,13,95]
Nb support vector: 94 (in 208 candidates, reduc: 0.548077)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 13:26:30

Average accuracy on validation splits, me: 0.8700, OpenML: 0.8990.
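
The "acc" and "reduc" figures in these logs can be recomputed from the other logged values. The sketch below shows the arithmetic. It assumes that the first and last entries of the confusion matrix count the correctly classified examples, which matches the logged accuracies but is inferred from the logs rather than stated in them. The names `ConfMatrix`, `Accuracy` and `Reduction` are hypothetical and are not taken from the actual implementation.

```c
#include <assert.h>

/* Assumed layout of the logged confusion matrix: the first and last
   entries count the correctly classified examples (one per class),
   the two middle entries count the misclassified ones. */
typedef struct {
  int correctA;
  int wrongA;
  int wrongB;
  int correctB;
} ConfMatrix;

/* Accuracy: correctly classified examples over all examples. */
double Accuracy(ConfMatrix m) {
  int total = m.correctA + m.wrongA + m.wrongB + m.correctB;
  assert(total > 0);
  return (double)(m.correctA + m.correctB) / (double)total;
}

/* Reduction: share of the candidate examples which are not kept as
   support vectors. */
double Reduction(int nbSupport, int nbCandidate) {
  assert(nbCandidate > 0 && nbSupport >= 0 && nbSupport <= nbCandidate);
  return 1.0 - (double)nbSupport / (double)nbCandidate;
}
```

With the Sonar figures above, `Accuracy((ConfMatrix){91, 9, 13, 95})` is 186/208, around 0.894, and `Reduction(94, 208)` is 114/208, around 0.548077. Both agree with the log.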

Spambase (OpenML):

Training a SVM with gaussian kernel on Resources/spambase.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 13:26:30, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 16:10:46 (60mn) #1, hyperParams:[973.718, 5.795], fitness:2.710    
End the search of the best hyper parameters.
Hyper parameters value: [973.718172, 5.795249]
Perform 10-fold cross validation...
10-fold cross valid, split #5, acc./fit. train 0.951/0.858, valid 0.911/0.822
10-fold cross valid, split #2, acc./fit. train 0.951/0.866, valid 0.900/0.819
10-fold cross valid, split #9, acc./fit. train 0.992/0.891, valid 0.948/0.851
10-fold cross valid, split #0, acc./fit. train 0.959/0.872, valid 0.898/0.816
10-fold cross valid, split #8, acc./fit. train 0.926/0.837, valid 0.885/0.800
10-fold cross valid, split #6, acc./fit. train 0.956/0.867, valid 0.917/0.832
10-fold cross valid, split #4, acc./fit. train 0.963/0.870, valid 0.913/0.825
10-fold cross valid, split #7, acc./fit. train 0.943/0.848, valid 0.891/0.802
10-fold cross valid, split #1, acc./fit. train 0.942/0.850, valid 0.909/0.820
10-fold cross valid, split #3, acc./fit. train 0.967/0.871, valid 0.911/0.820
Avg acc training: 0.95503, avg acc validation: 0.90828
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.929, conf. matrix:[1592,124,204,2681]
Nb support vector: 422 (in 4601 candidates, reduc: 0.908281)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 19:36:42

Average accuracy on validation splits, me: 0.9082, OpenML: 0.9626.
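
The "avg acc validation" value compared against OpenML appears to be the plain arithmetic mean of the ten per-split validation accuracies. A minimal sketch of that computation, with the hypothetical helper name `MeanAccuracy` (not taken from the actual implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of the per-split accuracies. */
double MeanAccuracy(const double* acc, size_t nbSplit) {
  assert(acc != NULL && nbSplit > 0);
  double sum = 0.0;
  for (size_t i = 0; i < nbSplit; ++i) {
    sum += acc[i];
  }
  return sum / (double)nbSplit;
}
```

Applied to the ten Spambase validation accuracies as printed above (rounded to three decimals), this gives about 0.9083, consistent with the logged 0.90828.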

Tic-tac-toe (OpenML):

Training a SVM with gaussian kernel on Resources/tic-tac-toe.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 19:36:42, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 20:38:07 (60mn) #42, hyperParams:[276.168, 1.216], fitness:2.389    
End the search of the best hyper parameters.
Hyper parameters value: [276.167884, 1.215780]
Perform 10-fold cross validation...
10-fold cross valid, split #6, acc./fit. train 1.000/0.441, valid 0.781/0.344
10-fold cross valid, split #1, acc./fit. train 1.000/0.476, valid 0.719/0.342
10-fold cross valid, split #8, acc./fit. train 1.000/0.459, valid 0.758/0.348
10-fold cross valid, split #7, acc./fit. train 1.000/0.463, valid 0.844/0.391
10-fold cross valid, split #3, acc./fit. train 1.000/0.451, valid 0.750/0.338
10-fold cross valid, split #0, acc./fit. train 1.000/0.472, valid 0.760/0.359
10-fold cross valid, split #5, acc./fit. train 1.000/0.462, valid 0.781/0.361
10-fold cross valid, split #2, acc./fit. train 1.000/0.458, valid 0.760/0.348
10-fold cross valid, split #9, acc./fit. train 1.000/0.438, valid 0.758/0.332
10-fold cross valid, split #4, acc./fit. train 1.000/0.455, valid 0.854/0.388
Avg acc training: 1.00000, avg acc validation: 0.77658
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[667,0,0,291]
Nb support vector: 498 (in 958 candidates, reduc: 0.480167)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 21:01:08

Average accuracy on validation splits, me: 0.7765, OpenML: 1.0.

Wdbc (OpenML):

Training a SVM with gaussian kernel on Resources/wdbc.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Fri Nov 18 21:01:08, maximum training time: 60mn
Search the best hyper parameters...
Fri Nov 18 22:02:08 (60mn) #73, hyperParams:[128.599, 4.375], fitness:2.968    
End the search of the best hyper parameters.
Hyper parameters value: [128.599107, 4.374594]
Perform 10-fold cross validation...
10-fold cross valid, split #1, acc./fit. train 1.000/0.676, valid 0.947/0.640
10-fold cross valid, split #6, acc./fit. train 1.000/0.686, valid 0.965/0.661
10-fold cross valid, split #3, acc./fit. train 1.000/0.684, valid 0.982/0.672
10-fold cross valid, split #8, acc./fit. train 1.000/0.693, valid 0.930/0.645
10-fold cross valid, split #5, acc./fit. train 1.000/0.703, valid 0.947/0.666
10-fold cross valid, split #9, acc./fit. train 1.000/0.680, valid 1.000/0.680
10-fold cross valid, split #2, acc./fit. train 1.000/0.678, valid 1.000/0.678
10-fold cross valid, split #4, acc./fit. train 1.000/0.682, valid 0.930/0.634
10-fold cross valid, split #7, acc./fit. train 1.000/0.680, valid 0.982/0.668
10-fold cross valid, split #0, acc./fit. train 1.000/0.684, valid 1.000/0.684
Avg acc training: 1.00000, avg acc validation: 0.96842
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[205,0,0,364]
Nb support vector: 174 (in 569 candidates, reduc: 0.694200)
Export the classifier as C functions to ./classify.c
Ends on Fri Nov 18 22:03:36

Average accuracy on validation splits, me: 0.9684, OpenML: 0.9824.

Summary:

Across the 12 datasets, the average accuracy over validation splits was above 90% for 6 datasets, above 80% for 8 datasets, and 71.5% at worst. In 2 cases it is above the best result on OpenML, and in 4 cases it is on par with it. These results validate my implementation of the support vector machine model, of the sequential minimal optimisation algorithm, and of the search for hyper-parameters using differential evolution. It would be interesting to see how it performs with a different kernel, or with more time to search for the hyper-parameters (especially on large datasets, for which one hour wasn't enough to perform a meaningful exploration), and to use the OpenML splits instead of random ones for a fairer comparison.

Edit on 2022/12/14: A follow-up to this article is available here, where I compare support vector machines with neural networks.

2022-11-19