Baillehache Pascal's personal website

Neural network and comparison with SVM

Edit on 2024/10/18: Added clarifications and corrected typos in the text and pseudocode. Results unchanged.

Following my study of Support Vector Machines I moved on to Neural Networks after watching this video by Sebastian Lague. This article is a memo about what I've learnt and a comparison with my previous work on SVM.

Artifical neural networks are oriented graphs where nodes are organised into layers and each node has an associated value calculated from its input nodes incoming edges and an 'activation' function. The Wikipedia page and the video linked above contain enough information to understand how they work and start implementing them. Of course it's another endless topic, but as results below are showing, there is already something practically useful to get from just scratching the surface. So as a first step I've mainly focused on what's explained in Lague's video.

There is no point in repeating what's already very well explained in his video, but I personally prefer to see an algorithm as pseudo code so that's what I'll share below. Also, I wished something more generic in term of activation function and network architecture than what's shown in the video.

First the data structures to implement a neural network are as follow:

Struct NeuralNetwork:
  nbInput: integer, number of input nodes
  nbOutput: integer, number of output nodes
  nbLayer: integer, at least two (the input and output layers), eventually more
           ("hidden layers")
  layers: array of 'nbLayer' Struct NNLayer, the layers of the network

Struct NNLayer:
  depth: integer, the depth of the layer, from 0 (input of the neural network)
         to NeuralNetwork.nbLayer - 1 (output of the neural network)
  activation: the activation function for the nodes of the layer, function
              taking one real value as input and returning one real value
  derivActivation: the derivative of the activation function
  nbNode: integer, number of nodes in the layer, at least one
  nodes: array of 'nbNode' Struct NNNode, the nodes of the layer

Struct NNNode:
  bias: float, the bias value of the node
  valInput: float, value of the node before applying its layer's activation
            function
  val: float, value of the node after applying its layer's activation function
  prevVal: float, previous value of the node (to allow recurrent network, not
           only feed forward network)
  propagDeriv: float, propagating derivation value, used during training
  nbLink: integer, number of link incoming the node
  links: array of Struct NNLink, the links incoming to the node

Struct NNLink:
  weight: float, the weight of the link
  iLayer: integer, index of the layer at the origin of the link
  iNode: integer, index of the node in the layer at the origin of the link

Some commonly used activation functions are:

Linear(x):
  return x

ReLU(x):
  if x >= 0.0:
    return x
  else:
    return 0.0

SiLU(x):
  return x / (1.0 + exp(-x))

Sigmoid(x):
  return 1.0 / (1.0 + exp(-x))

HyperTangent(x):
  y = exp(x)
  z = exp(-x)
  return (y - z) / (y + z)

And their derivatives:

DerivLinear(x):
  return 1.0

DerivReLU(x):
  if x >= 0.0:
    return 1.0
  else:
    return 0.0

DerivSiLU(x):
  y = exp(-x);
  z = 1.0 + y;
  out[0] = (z + x * y) / (z * z);

DerivSigmoid(x):
  y = Sigmoid(x)
  return y * (1.0 - y)

DerivHyperTangent(x):
  y = HyperTangent(x)
  return 1.0 - y * y;

More activation functions are available here.

Next, the prediction can be performed as follow:

nn: Struct NeuralNetwork, the neural network used for prediction
inputs: array of nn.nbInput float, the values of the input fields
        eventually preprocessed
outputs: array of nn.nbOutput float, updated with the values of the output
         layer at the end of the evaluation
Predict(nn, inputs, outputs):
  for i in [0, nn.nbInput[:
    nn.layers[0].nodes[i].val = inputs[i]
  for iLayer in [1, nn.nbLayer[:
    layer = nn.layers[iLayer]
    for iNode in [0, layer.nbNode[:
      layer.nodes[iNode].prevVal = layer.nodes[iNode].val
    for iNode in [0, layer.nbNode[:
      node = layer.nodes[iNode]
      node.valInput = node.bias
      for iLink in [0, node.nbLinks[:
        link = node.links[iLink]
        if nn.layers[link.iLayer].depth < layer.depth:
          valLink = nn.layers[link.iLayer].nodes[link.iNode].val
        else:
          valLink = nn.layers[link.iLayer].nodes[link.iNode].prevVal
        node.valInput += valLink * link.weight
      node.val = layer.activation(node.valInput)
  for i in [0, nn.nbOutput[:
    outputs[i] = nn.layers[nn.nbLayer - 1].nodes[i].val

Note that a neural network's input and output are vectors of real values. It means that any categorical values must be first converted to numerical ones. For regression task, the output is necessarily numerical, no problem. For a classification task, the solution is to use "one hot encoding": one output per predicted category value. The index of the predicted value is then equal to argmax(outputs) after executing Predict(). For the inputs, if it's numerical, except for wether or not it should be normalised it's easy, but for categorical ones it's more difficult.

This article suggests ordinal encoding, one hot encoding, embedding vector, or domain specific solutions. Embedding vectors and domain specific are beyond the scope of my study for now, so I simply ignore them. In my previous study of SVM I was simply using ordinal encoding. Now I wonder what the results would look like with one hot encoding. However results have shown that ordinal encoding is already enough to obtain good results, and one hot encoding systematically categorical variables makes the number of input dimension explode. I think that SVMs are much less sensitive to ordinal encoding because they are searching for support vectors delimiting subdomains of the input domain. So it doesn't matter much if along the dimension of one categorical input values do not represent a continuous concept. Anyway they will be clamped into several regions in the end.

A neural network works in a very different manner, with input "exciting" neurons. So if one categorical input is encoded into one single input node, the following nodes in the feed forward graph get excited at various levels transforming a discrete information into a nonsensical continous one. If that input is split into one node per value with one hot encoding, this problem doesn't occur. And as we are training the neural network using the gradient descent method, preserving sensical continuity seems natural. Still the explosion in number of dimensions is annoying. Looking for more info about that question I came accross this article arguing that one hot encoding is bad, and where other approaches are introduced. I'll definitely try them in the future, but in this article I'll stick to ordinal encoding to see how good (bad?) it's performing and have something to compare too when I'll try other encoding.

Then the problem is to train the neural network, i.e. to calculate the values of all the node.bias and link.weight parameters. This is done by minimising a loss function over those parameters. There are tons of possible loss function (cf here). In his video Sebastian Lague uses the squared error and I've simply followed him. To solve the minimisation problem I could have reused differential evolution, but, as sequential minimal optimisation for support vector machines, there exists an efficient algorithm for neural network: gradient descent with backpropagation.

The gradient descent method consists of calculating the gradient \(\nabla\) (partial derivatives of the loss function with respect to each parameter) and use it to update these parameters until all partial derivatives are null (which means an optimum has been found). The method has three parameters: random starting parameters value \(\overrightarrow{p_0}\), the step (also called learn rate in ML) \(s\) and the momentum \(m\). Then one step of the method is: \(\overrightarrow{p_{i+1}}=\overrightarrow{p_i}-s(\nabla(\overrightarrow{p_i})+m\nabla(\overrightarrow{p_{i-1}}))\).

The gradient method requires the partial derivatives for all parameters of the loss function which, averaged over all the samples in the training dataset. Without knowing those derivatives in closed form, the naive way is to approximate it with finite difference, i.e. calculate twice the loss function per partial derivative. It means that one step of the gradient method applied to neural network training requires (nbLink * nbNode * 2 * nbSample) evaluations of the neural network. In practice it's too time consuming for large datasets and/or large neural networks.

Thanksfully the so-called backpropagation trick saves the day. It's based on the chain rule which allows to calculate a derivative based on intermediate derivatives (\(\frac{dz}{dx}=\frac{dz}{dy}\frac{dy}{dx}\)). How it applies to our neural network is brillantly explained in Lague's video, so I won't even try here. I'll just summarise it as calculating intermediate derivatives at each node and link of the network and combining them backward (from the output nodes toward the input nodes) to calculate the gradient of the loss function from only one single evaluation of the loss function per step. Truly impressive!

For very large datasets and very large networks this may still be insufficient. Some more time can be saved using stochastic gradient descent. It consists of using only a subset of the training samples at each step. The gradient calculated is incorrect due to the missing samples, but provided there are enough samples at each step the "overall direction" stays good enough and efficient training can be achieved while saving significant amount of time. The number of samples used per step is called the batch size. How to choose it is a mistery I haven't investigated. Also, as the calculated gradient descent is incorrect, the evolution of the loss at each step becomes noisy, making it difficult to use as ending condition of the descent or adaptive learning rate (cf below). To compensate for that I'm using a sliding average with window size equal to the number of sample in the dataset divided by the number of sample in the batch.

There are two other problems with the gradient descent method: it is sensitive to its starting point and it has a strong tendency to diverge if the learning rate is not choosen carefully. About the starting point, I refer to this article and content myself with the Xavier method for now (biases set to null, weights set randomly according to a normal distibution with \(\sigma=1/n\) where \(n\) is the number of nodes in the layer). About divergence, there are methods called adaptive learning rate which helps avoid it. That's another never ending story from which I cowardly escape by implementing my own: at each gradient descent step, multiply the learning rate by \(a>1.0\) if the loss decreases, and by \(b<1.0\) if the loss increases (down to a minimun of 0.1 times the initial learning rate). In practice I have used \(a=1.1\) and \(b=0.5\).

Then the pseudo code for training is as follow:

Struct GradientDescent:
  params: array of float, current values of weights and biases
  gradient: array of float, current gradient for each param
  step: float, learning rate
  momentum: float, weight of the previous step

Struct Loss:
  derivatives: array of float, derivative of the loss value relative to each
               weight and bias, averaged over batch samples
  batchSize: integer, number of samples in the dataset used to calculate the
             loss
  batchStart: integer, index of the first row of the current batch
  val: current loss value, averaged over batch samples

nn: Struct NeuralNetwork, the neural network to train
dataset: the training dataset, assuming all values have been preprocessed to
         convenient numerical values as explained above
loss: Struct Loss, loss function
Train(nn, dataset, loss):
  gradientDescent: instance of Struct GradientDescent
  for param in gradientDescent.params:
    param = initialise according to Xavier method
  bestParams: array of float to memorise the current best parameters
  bestLoss: float, current best loss value (with sliding average)
  prevLoss: float, previous loss value (with sliding average)
  loss.batchStart = 0
  prevLoss = 0
  while first step or norm(gradientDescent.gradient) > epsilon:
    copy gradientDescent.params to nn's weights and biases
    sumDerivatives: array of float (same size as loss.derivatives),
                    initialised to 0
    for iBatchSample in [0, loss.batchSize[:
      for derivative in loss.derivatives:
        derivative = 0
      for node in nn's nodes:
        node.propagDeriv = 0
      outputs: array of float to memorise the output of the neural network
      iSample = (iBatchSample + loss.batchStart) modulo dataset.nbSample
      nn.Predict(dataset.samples[iSample], outputs)
      loss.val += Loss(outputs, dataset.targets[iSample])
      for iLayer for (nn.nbLayer-1) to 1:
        layer = nn.layers[iLayer]
        if iLayer = (nn.nbLayer - 1):
          UpdateDerivativeOutputLayer(
            nn, dataset, loss, layer, outputs, iSample)
        else:
          UpdateDerivativeHiddenLayer(nn, loss, layer)
      sumDerivatives += loss.derivatives
    loss.val /= loss.batchSize
    loss.derivatives = sumDerivatives / loss.batchSize
    if first step or slidingAverage(loss.val) < bestLoss:
      bestLoss = slidingAverage(loss.val)
      bestParams = gradientDescent.params
    if first step or slidingAverage(loss.val) > prevLoss:
      if gradientDescent.step > 0.1 * initial learn rate:
        gradientDescent.step *= 0.5
    else:
      gradientDescent.step *= 1.1
    prevLoss = slidingAverage(loss.val)
    for param in gradientDescent.params:
      gradientDescent.gradient[param] =
        gradientDescent.momentum * gradientDescent.gradient[param] +
        loss.derivatives[param]
      param -= gradientDescent.step * gradientDescent.gradient[param]
    loss.batchStart =
      (loss.batchStart + loss.batchSize) modulo dataset.nbSample 
  update nn's weights and biases with bestParams

UpdateDerivativeOutputLayer(nn, dataset, loss, layer, outputs, iSample):
  for node in layer.nodes:
    // derivative of the bias for the node
    loss.derivatives[node] =
      DerivLoss(outputs[node], dataset.target[iSample][node])
    for link in node.links:
      if nn.layers[link.iLayer].depth < layer.depth:
        valLink = layer.nodes[link.iNode].val
      else:
        valLink = layer.nodes[link.iNode].prevVal
      // derivative of the weight for the link
      loss.derivatives[link] = valLink * loss.derivatives[node]
      // propagation of the derivative to the previous node
      layer.nodes[link.iNode].propagDeriv +=
        link.weight * loss.derivatives[node]

UpdateDerivativeHiddenLayer(nn, loss, layer):
  for node in layer.nodes:
    // derivative of the bias for the node
    derivActivation = layer.derivActivation(node.valInput)
    loss.derivatives[node] = derivActivation * node.propagDeriv
    for link in node.links:
      if nn.layers[link.iLayer].depth < layer.depth:
        valLink = layer.nodes[link.iNode].val
      else:
        valLink = layer.nodes[link.iNode].prevVal
      // derivative of the weight for the link
      loss.derivatives[link] = valLink * loss.derivatives[node]
      // propagation of the derivative to the previous node
      layer.nodes[link.iNode].propagDeriv +=
        link.weight * loss.derivatives[node]

Loss(outputs, targets):
  loss = 0
  for iOutput in [0, outputs.size[:
    loss += (outputs[iOutput] - targets[iOutput])^2
  return loss

DerivLoss(output, target):
  return 2 * (output - target)

To check my implementation of neural networks and gradient descent with backpropagation in LibCapy, I've used the same datasets than for my study of support vector machines. In my previous article I had lazily used random splits for the k-fold cross validation, invalidating the comparison with OpenML results, and make it impossible to compare with neural network. The first thing to do was then to fix that, results below are all using the OpenML splits and I've rerun all the SVM training with those splits. The updated results for SVM are as follow:

Diabetes (OpenML):

Training a SVM with gaussian kernel on Resources/diabetes.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 08:53:05, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 10:05:13 (60mn) #6, hyperParams:[476.569, 0.812], fitness:2.225    
End the search of the best hyper parameters.
Hyper parameters value: [476.568821, 0.811600]
Perform 10-fold cross validation...
Uses Resources/diabetes_split.txt as the splits description file
10-fold cross valid, split #9, acc./fit. train 0.853/0.447, valid 0.711/0.373
10-fold cross valid, split #3, acc./fit. train 0.849/0.454, valid 0.714/0.381
10-fold cross valid, split #7, acc./fit. train 0.848/0.453, valid 0.753/0.402
10-fold cross valid, split #6, acc./fit. train 0.848/0.446, valid 0.792/0.416
10-fold cross valid, split #4, acc./fit. train 0.854/0.437, valid 0.779/0.399
10-fold cross valid, split #8, acc./fit. train 0.850/0.435, valid 0.776/0.397
10-fold cross valid, split #1, acc./fit. train 0.849/0.449, valid 0.688/0.364
10-fold cross valid, split #5, acc./fit. train 0.851/0.447, valid 0.714/0.375
10-fold cross valid, split #2, acc./fit. train 0.841/0.425, valid 0.766/0.387
10-fold cross valid, split #0, acc./fit. train 0.849/0.433, valid 0.753/0.384
Avg acc training: 0.84925, avg acc validation: 0.74479
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.842, conf. matrix:[179,89,32,468]
Nb support vector: 366 (in 768 candidates, reduc: 0.523438)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 10:17:54

Average accuracy on validation splits, CapySupportVectorMachine: 0.74479, best of all on OpenML: 0.7877.

Haberman (OpenML):

Training a SVM with gaussian kernel on Resources/haberman.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 10:17:54, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 11:19:25 (60mn) #37, hyperParams:[0.010, 1.071], fitness:2.269      
End the search of the best hyper parameters.
Hyper parameters value: [0.010000, 1.070970]
Perform 10-fold cross validation...
Uses Resources/haberman_split.txt as the splits description file
10-fold cross valid, split #3, acc./fit. train 0.735/0.323, valid 0.742/0.326
10-fold cross valid, split #9, acc./fit. train 0.743/0.339, valid 0.667/0.304
10-fold cross valid, split #2, acc./fit. train 0.735/0.313, valid 0.742/0.316
10-fold cross valid, split #5, acc./fit. train 0.735/0.318, valid 0.742/0.321
10-fold cross valid, split #6, acc./fit. train 0.721/0.295, valid 0.867/0.355
10-fold cross valid, split #8, acc./fit. train 0.743/0.336, valid 0.667/0.302
10-fold cross valid, split #0, acc./fit. train 0.742/0.343, valid 0.677/0.313
10-fold cross valid, split #4, acc./fit. train 0.735/0.302, valid 0.742/0.305
10-fold cross valid, split #1, acc./fit. train 0.738/0.322, valid 0.710/0.310
10-fold cross valid, split #7, acc./fit. train 0.728/0.309, valid 0.800/0.339
Avg acc training: 0.73530, avg acc validation: 0.73548
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.735, conf. matrix:[225,0,81,0]
Nb support vector: 174 (in 306 candidates, reduc: 0.431373)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 11:20:05

Average accuracy on validation splits, CapySupportVectorMachine: 0.73548, best of all on OpenML: 0.7679.

Heart-statlog (OpenML):

Training a SVM with gaussian kernel on Resources/heart-statlog.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 11:20:05, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 12:21:17 (60mn) #52, hyperParams:[678.609, 2.860], fitness:2.381    
End the search of the best hyper parameters.
Hyper parameters value: [678.608503, 2.859588]
Perform 10-fold cross validation...
Uses Resources/heart-statlog_split.txt as the splits description file
10-fold cross valid, split #6, acc./fit. train 1.000/0.255, valid 0.852/0.217
10-fold cross valid, split #5, acc./fit. train 1.000/0.247, valid 0.778/0.192
10-fold cross valid, split #3, acc./fit. train 1.000/0.255, valid 0.815/0.208
10-fold cross valid, split #1, acc./fit. train 1.000/0.251, valid 0.667/0.167
10-fold cross valid, split #9, acc./fit. train 1.000/0.251, valid 0.889/0.223
10-fold cross valid, split #7, acc./fit. train 1.000/0.251, valid 0.741/0.186
10-fold cross valid, split #0, acc./fit. train 1.000/0.317, valid 0.741/0.235
10-fold cross valid, split #2, acc./fit. train 1.000/0.235, valid 0.815/0.191
10-fold cross valid, split #4, acc./fit. train 1.000/0.251, valid 0.926/0.232
10-fold cross valid, split #8, acc./fit. train 1.000/0.235, valid 0.667/0.156
Avg acc training: 1.00000, avg acc validation: 0.78889
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[120,0,0,150]
Nb support vector: 199 (in 270 candidates, reduc: 0.262963)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 12:24:27

Average accuracy on validation splits, CapySupportVectorMachine: 0.78889, best of all on OpenML: 0.8629.

Ionosphere (OpenML):

Training a SVM with gaussian kernel on Resources/ionosphere.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 12:24:27, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 13:25:17 (60mn) #55, hyperParams:[999.425, 1.786], fitness:2.869    
End the search of the best hyper parameters.
Hyper parameters value: [999.425335, 1.786343]
Perform 10-fold cross validation...
Uses Resources/ionosphere_split.txt as the splits description file
10-fold cross valid, split #4, acc./fit. train 1.000/0.484, valid 0.943/0.457
10-fold cross valid, split #1, acc./fit. train 1.000/0.453, valid 0.943/0.427
10-fold cross valid, split #6, acc./fit. train 1.000/0.437, valid 0.971/0.424
10-fold cross valid, split #7, acc./fit. train 1.000/0.449, valid 1.000/0.449
10-fold cross valid, split #0, acc./fit. train 1.000/0.463, valid 0.944/0.438
10-fold cross valid, split #3, acc./fit. train 1.000/0.453, valid 0.971/0.440
10-fold cross valid, split #9, acc./fit. train 1.000/0.449, valid 0.943/0.424
10-fold cross valid, split #2, acc./fit. train 1.000/0.459, valid 0.914/0.420
10-fold cross valid, split #8, acc./fit. train 1.000/0.462, valid 0.914/0.422
10-fold cross valid, split #5, acc./fit. train 1.000/0.459, valid 1.000/0.459
Avg acc training: 1.00000, avg acc validation: 0.95444
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[225,0,0,126]
Nb support vector: 184 (in 351 candidates, reduc: 0.475783)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 13:29:51

Average accuracy on validation splits, CapySupportVectorMachine: 0.95444, best of all on OpenML: 0.9601.

Kr-vs-kp (OpenML):

Training a SVM with gaussian kernel on Resources/kr-vs-kp.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 18:46:51, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 19:47:58 (60mn) #17, hyperParams:[650.978, 1.335], fitness:2.914    
End the search of the best hyper parameters.
Hyper parameters value: [650.977791, 1.335398]
Perform 10-fold cross validation...
Uses Resources/kr-vs-kp_split.txt as the splits description file
10-fold cross valid, split #2, acc./fit. train 1.000/0.051, valid 0.950/0.049
10-fold cross valid, split #5, acc./fit. train 1.000/0.055, valid 0.978/0.053
10-fold cross valid, split #9, acc./fit. train 1.000/0.051, valid 0.962/0.049
10-fold cross valid, split #3, acc./fit. train 1.000/0.054, valid 0.988/0.054
10-fold cross valid, split #0, acc./fit. train 1.000/0.047, valid 0.991/0.047
10-fold cross valid, split #8, acc./fit. train 1.000/0.048, valid 0.972/0.047
10-fold cross valid, split #6, acc./fit. train 1.000/0.051, valid 0.959/0.049
10-fold cross valid, split #4, acc./fit. train 1.000/0.055, valid 0.978/0.054
10-fold cross valid, split #7, acc./fit. train 1.000/0.048, valid 0.978/0.047
10-fold cross valid, split #1, acc./fit. train 1.000/0.049, valid 0.978/0.048
Avg acc training: 1.00000, avg acc validation: 0.97340
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[1669,0,0,1527]
Nb support vector: 3015 (in 3196 candidates, reduc: 0.056633)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 20:41:14

Average accuracy on validation splits, CapySupportVectorMachine: 0.9734, best of all on OpenML: 0.9978.

Monks-problems-1 (OpenML):

Training a SVM with gaussian kernel on Resources/monks-problems-1.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 14:32:48, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 15:36:11 (60mn) #29, hyperParams:[75.649, 0.864], fitness:2.953     
End the search of the best hyper parameters.
Hyper parameters value: [75.648897, 0.863917]
Perform 10-fold cross validation...
Uses Resources/monks-problems-1_split.txt as the splits description file
10-fold cross valid, split #7, acc./fit. train 0.988/0.765, valid 0.964/0.746
10-fold cross valid, split #5, acc./fit. train 0.992/0.762, valid 0.982/0.754
10-fold cross valid, split #0, acc./fit. train 0.984/0.758, valid 1.000/0.770
10-fold cross valid, split #9, acc./fit. train 0.996/0.755, valid 1.000/0.758
10-fold cross valid, split #3, acc./fit. train 0.982/0.738, valid 1.000/0.752
10-fold cross valid, split #1, acc./fit. train 0.994/0.753, valid 0.982/0.744
10-fold cross valid, split #4, acc./fit. train 1.000/0.738, valid 0.982/0.725
10-fold cross valid, split #8, acc./fit. train 0.998/0.747, valid 1.000/0.749
10-fold cross valid, split #2, acc./fit. train 0.996/0.745, valid 1.000/0.748
10-fold cross valid, split #6, acc./fit. train 0.994/0.738, valid 0.982/0.729
Avg acc training: 0.99240, avg acc validation: 0.98919
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.998, conf. matrix:[277,1,0,278]
Nb support vector: 111 (in 556 candidates, reduc: 0.800360)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 15:41:44

Average accuracy on validation splits, CapySupportVectorMachine: 0.98919, best of all on OpenML: 1.0.

Monks-problems-2 (OpenML):

Training a SVM with gaussian kernel on Resources/monks-problems-2.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 15:41:44, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 16:42:55 (60mn) #23, hyperParams:[992.210, 1.607], fitness:2.675    
End the search of the best hyper parameters.
Hyper parameters value: [992.210170, 1.606826]
Perform 10-fold cross validation...
Uses Resources/monks-problems-2_split.txt as the splits description file
10-fold cross valid, split #1, acc./fit. train 0.993/0.640, valid 0.850/0.548
10-fold cross valid, split #3, acc./fit. train 0.991/0.634, valid 0.900/0.576
10-fold cross valid, split #5, acc./fit. train 0.998/0.638, valid 0.817/0.522
10-fold cross valid, split #0, acc./fit. train 0.989/0.615, valid 0.902/0.561
10-fold cross valid, split #2, acc./fit. train 0.983/0.613, valid 0.950/0.592
10-fold cross valid, split #8, acc./fit. train 0.994/0.612, valid 0.900/0.554
10-fold cross valid, split #4, acc./fit. train 0.996/0.619, valid 0.883/0.549
10-fold cross valid, split #9, acc./fit. train 0.991/0.612, valid 0.867/0.535
10-fold cross valid, split #7, acc./fit. train 0.989/0.594, valid 0.967/0.581
10-fold cross valid, split #6, acc./fit. train 0.985/0.592, valid 0.883/0.531
Avg acc training: 0.99094, avg acc validation: 0.89183
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.975, conf. matrix:[389,6,9,197]
Nb support vector: 218 (in 601 candidates, reduc: 0.637271)
Export the classifier as C functions to /dev/null
Ends on Tue Dec 13 16:59:56

Average accuracy on validation splits, CapySupportVectorMachine: 0.89183, best of all on OpenML: 1.0.

Mushroom (OpenML):

Training a SVM with gaussian kernel on Resources/mushroom.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Tue Dec 13 20:41:14, maximum training time: 60mn
Search the best hyper parameters...
Tue Dec 13 22:36:51 (0mn) #1, hyperParams:[753.903, 6.105], fitness:3.000    
End the search of the best hyper parameters.
Hyper parameters value: [753.902988, 6.104595]
Perform 10-fold cross validation...
Uses Resources/mushroom_split.txt as the splits description file
10-fold cross valid, split #3, acc./fit. train 1.000/0.649, valid 1.000/0.649
10-fold cross valid, split #1, acc./fit. train 1.000/0.650, valid 1.000/0.650
10-fold cross valid, split #0, acc./fit. train 1.000/0.652, valid 1.000/0.652
10-fold cross valid, split #4, acc./fit. train 1.000/0.651, valid 1.000/0.651
10-fold cross valid, split #8, acc./fit. train 1.000/0.648, valid 1.000/0.648
10-fold cross valid, split #5, acc./fit. train 1.000/0.643, valid 1.000/0.643
10-fold cross valid, split #2, acc./fit. train 1.000/0.649, valid 1.000/0.649
10-fold cross valid, split #6, acc./fit. train 1.000/0.647, valid 1.000/0.647
10-fold cross valid, split #9, acc./fit. train 1.000/0.651, valid 1.000/0.651
10-fold cross valid, split #7, acc./fit. train 1.000/0.658, valid 1.000/0.658
Avg acc training: 1.00000, avg acc validation: 1.00000
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[3916,0,0,4208]
Nb support vector: 2891 (in 8124 candidates, reduc: 0.644141)
Export the classifier as C functions to /dev/null
Ends on Wed Dec 14 03:04:32

Average accuracy on validation splits, CapySupportVectorMachine: 1.0, best of all on OpenML: 1.0.

Sonar (OpenML):

Training a SVM with gaussian kernel on Resources/sonar.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Wed Dec 14 03:04:32, maximum training time: 60mn
Search the best hyper parameters...
Wed Dec 14 04:05:04 (60mn) #128, hyperParams:[433.925, 0.543], fitness:2.714    
End the search of the best hyper parameters.
Hyper parameters value: [433.925232, 0.542691]
Perform 10-fold cross validation...
Uses Resources/sonar_split.txt as the splits description file
10-fold cross valid, split #5, acc./fit. train 1.000/0.262, valid 0.905/0.237
10-fold cross valid, split #6, acc./fit. train 1.000/0.214, valid 0.952/0.204
10-fold cross valid, split #3, acc./fit. train 1.000/0.257, valid 0.857/0.220
10-fold cross valid, split #2, acc./fit. train 1.000/0.235, valid 0.810/0.190
10-fold cross valid, split #9, acc./fit. train 1.000/0.250, valid 1.000/0.250
10-fold cross valid, split #1, acc./fit. train 1.000/0.225, valid 0.857/0.193
10-fold cross valid, split #8, acc./fit. train 1.000/0.245, valid 0.950/0.232
10-fold cross valid, split #0, acc./fit. train 1.000/0.246, valid 0.857/0.211
10-fold cross valid, split #4, acc./fit. train 1.000/0.225, valid 0.857/0.193
10-fold cross valid, split #7, acc./fit. train 1.000/0.219, valid 1.000/0.219
Avg acc training: 1.00000, avg acc validation: 0.90452
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[97,0,0,111]
Nb support vector: 157 (in 208 candidates, reduc: 0.245192)
Export the classifier as C functions to /dev/null
Ends on Wed Dec 14 04:06:36

Average accuracy on validation splits, CapySupportVectorMachine: 0.90452, best of all on OpenML: 0.8990.

Spambase (OpenML):

Training a SVM with gaussian kernel on Resources/spambase.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Wed Dec 14 04:06:36, maximum training time: 60mn
Search the best hyper parameters...
Wed Dec 14 07:03:10 (60mn) #1, hyperParams:[395.513, 7.680], fitness:2.639    
End the search of the best hyper parameters.
Hyper parameters value: [395.513095, 7.679614]
Perform 10-fold cross validation...
Uses Resources/spambase_split.txt as the splits description file
10-fold cross valid, split #2, acc./fit. train 0.906/0.807, valid 0.870/0.774
10-fold cross valid, split #9, acc./fit. train 0.974/0.845, valid 0.904/0.784
10-fold cross valid, split #1, acc./fit. train 0.920/0.815, valid 0.863/0.765
10-fold cross valid, split #5, acc./fit. train 0.932/0.829, valid 0.902/0.802
10-fold cross valid, split #8, acc./fit. train 0.923/0.816, valid 0.880/0.778
10-fold cross valid, split #0, acc./fit. train 0.917/0.810, valid 0.861/0.760
10-fold cross valid, split #3, acc./fit. train 0.922/0.812, valid 0.852/0.751
10-fold cross valid, split #4, acc./fit. train 0.968/0.832, valid 0.885/0.760
10-fold cross valid, split #6, acc./fit. train 0.936/0.825, valid 0.850/0.749
10-fold cross valid, split #7, acc./fit. train 0.975/0.835, valid 0.911/0.780
Avg acc training: 0.93731, avg acc validation: 0.87786
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.920, conf. matrix:[1563,250,120,2668]
Nb support vector: 490 (in 4601 candidates, reduc: 0.893501)
Export the classifier as C functions to /dev/null
Ends on Wed Dec 14 10:53:53

Average accuracy on validation splits, CapySupportVectorMachine: 0.87786, best of all on OpenML: 0.9626.

Tic-tac-toe (OpenML):

Training a SVM with gaussian kernel on Resources/tic-tac-toe.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Wed Dec 14 10:53:53, maximum training time: 60mn
Search the best hyper parameters...
Wed Dec 14 11:56:04 (60mn) #27, hyperParams:[868.385, 0.976], fitness:2.870    
End the search of the best hyper parameters.
Hyper parameters value: [868.385195, 0.975931]
Perform 10-fold cross validation...
Uses Resources/tic-tac-toe_split.txt as the splits description file
10-fold cross valid, split #4, acc./fit. train 1.000/0.476, valid 0.990/0.471
10-fold cross valid, split #9, acc./fit. train 1.000/0.483, valid 0.926/0.448
10-fold cross valid, split #0, acc./fit. train 1.000/0.472, valid 0.938/0.443
10-fold cross valid, split #6, acc./fit. train 1.000/0.487, valid 0.969/0.472
10-fold cross valid, split #8, acc./fit. train 1.000/0.475, valid 0.937/0.445
10-fold cross valid, split #2, acc./fit. train 1.000/0.466, valid 0.927/0.432
10-fold cross valid, split #3, acc./fit. train 1.000/0.474, valid 0.969/0.460
10-fold cross valid, split #7, acc./fit. train 1.000/0.468, valid 0.958/0.448
10-fold cross valid, split #1, acc./fit. train 1.000/0.480, valid 0.958/0.460
10-fold cross valid, split #5, acc./fit. train 1.000/0.463, valid 0.969/0.448
Avg acc training: 1.00000, avg acc validation: 0.95402
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:[626,0,0,332]
Nb support vector: 484 (in 958 candidates, reduc: 0.494781)
Export the classifier as C functions to /dev/null
Ends on Wed Dec 14 12:25:47

Average accuracy on validation splits, CapySupportVectorMachine: 0.95402, best of all on OpenML: 1.0.

Wdbc (OpenML):

Training a SVM with gaussian kernel on Resources/wdbc.csv
Hyper parameters: [coeffRelax, gamma]
Hyper parameters domain: [0.010, 1000.000], [0.001, 10.000]
Starts on Wed Dec 14 12:25:47, maximum training time: 60mn
Search the best hyper parameters...
Wed Dec 14 13:26:44 (60mn) #59, hyperParams:[218.057, 0.187], fitness:2.947    
End the search of the best hyper parameters.
Hyper parameters value: [218.057160, 0.186996]
Perform 10-fold cross validation...
Uses Resources/wdbc_split.txt as the splits description file
10-fold cross valid, split #5, acc./fit. train 0.994/0.928, valid 0.965/0.901
10-fold cross valid, split #8, acc./fit. train 0.990/0.921, valid 0.982/0.913
10-fold cross valid, split #6, acc./fit. train 0.992/0.913, valid 0.982/0.904
10-fold cross valid, split #1, acc./fit. train 0.988/0.913, valid 0.982/0.908
10-fold cross valid, split #0, acc./fit. train 0.988/0.919, valid 0.965/0.897
10-fold cross valid, split #3, acc./fit. train 0.988/0.907, valid 1.000/0.918
10-fold cross valid, split #4, acc./fit. train 0.990/0.921, valid 0.982/0.913
10-fold cross valid, split #9, acc./fit. train 0.992/0.921, valid 0.964/0.895
10-fold cross valid, split #7, acc./fit. train 0.988/0.909, valid 1.000/0.920
10-fold cross valid, split #2, acc./fit. train 0.990/0.915, valid 1.000/0.924
Avg acc training: 0.99024, avg acc validation: 0.98239
Train the classifier on the whole dataset...
End the training of the classifier.
Evaluation, acc:0.989, conf. matrix:[206,6,0,357]
Nb support vector: 39 (in 569 candidates, reduc: 0.931459)
Export the classifier as C functions to /dev/null
Ends on Wed Dec 14 13:28:55

Average accuracy on validation splits, CapySupportVectorMachine: 0.98239, best of all on OpenML: 0.9824.

Summary:

Once again I've obtained good predictive accuracy, below the best models on OpenML but not far away, even above on the sonar dataset. This helps build confidence in my implementation of SVM and there is always hope for better results by running the search for hyperparameters longer. About the sonar dataset, may I have been lucky enough to find hyper parameters which improve the results ? I tried to reproduce the results through the python API, but there was a problem and I'm currently waiting reply from OpenML.

After adding my neural network implementation to the appli I had previously made to use SVM, I used it to see how well it performs. But contrary to SVM I haven't automated the search for hyper parameters. One problem is the structure of the network, whose exploration can't be performed using differential evolution like I did on SVM. Another problem is that it's difficult to define how to split the time allocated to hyper parameters search. I decided to search manually the best network structure, learning rate and momentum instead for now with the following strategy. For the network structure, given \(n\) the number of inputs of the network I've tried the ten combinations (one layer of \(n\) nodes), (one layer of \(2n\) nodes), (two layers of \(n\) nodes), (two layers of \(2n\) nodes), (one layer of \(n\) nodes and one layer of \(2n\) nodes), each using Linear or Sigmoid activation. For other parameters, I use 5mn of training time, maximum batch size of 1000 samples, learning rate and momentum equals to 0.01. Best results are as follow:

Diabetes (OpenML):

Training a fully connected neural network on Resources/diabetes.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Fri Dec 16 20:21:48
Hyper parameters value: [0.010000, 0.010000, 768]
layer #0: 16 nodes, sigmoid activation
layer #1: 16 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/diabetes_split.txt as the splits description file
10-fold cross valid, split #5, acc./fit. train 0.805/0.805, valid 0.805/0.805
10-fold cross valid, split #3, acc./fit. train 0.815/0.815, valid 0.727/0.727
10-fold cross valid, split #4, acc./fit. train 0.808/0.808, valid 0.805/0.805
10-fold cross valid, split #8, acc./fit. train 0.802/0.802, valid 0.763/0.763
10-fold cross valid, split #0, acc./fit. train 0.795/0.795, valid 0.753/0.753
10-fold cross valid, split #1, acc./fit. train 0.812/0.812, valid 0.792/0.792
10-fold cross valid, split #2, acc./fit. train 0.792/0.792, valid 0.727/0.727
10-fold cross valid, split #9, acc./fit. train 0.798/0.798, valid 0.803/0.803
10-fold cross valid, split #7, acc./fit. train 0.806/0.806, valid 0.792/0.792
10-fold cross valid, split #6, acc./fit. train 0.810/0.810, valid 0.805/0.805
Avg acc training: 0.80411, avg acc validation: 0.77736
Train the classifier on the whole dataset...
 300s/300s #157511: best loss 0.268630, loss 0.268630, gradient 0.001604       
End the training of the classifier.
Evaluation, acc:0.810, conf. matrix:
0171 0097 
0049 0451 
Export the classifier as C functions to /dev/null
Ends on Fri Dec 16 20:31:48

Average accuracy on validation splits, CapyNeuralNetwork: 0.77736, best of all on OpenML: 0.7877.

Haberman (OpenML):

Training a fully connected neural network on Resources/haberman.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Fri Dec 16 20:31:52
Hyper parameters value: [0.010000, 0.010000, 306]
layer #0: 6 nodes, linear activation
layer #1: 6 nodes, linear activation
Perform 10-fold cross validation...
Uses Resources/haberman_split.txt as the splits description file
10-fold cross valid, split #7, acc./fit. train 0.732/0.732, valid 0.800/0.800
10-fold cross valid, split #9, acc./fit. train 0.750/0.750, valid 0.667/0.667
10-fold cross valid, split #3, acc./fit. train 0.742/0.742, valid 0.774/0.774
10-fold cross valid, split #8, acc./fit. train 0.746/0.746, valid 0.733/0.733
10-fold cross valid, split #4, acc./fit. train 0.764/0.764, valid 0.677/0.677
10-fold cross valid, split #5, acc./fit. train 0.745/0.745, valid 0.774/0.774
10-fold cross valid, split #2, acc./fit. train 0.745/0.745, valid 0.742/0.742
10-fold cross valid, split #0, acc./fit. train 0.749/0.749, valid 0.742/0.742
10-fold cross valid, split #6, acc./fit. train 0.750/0.750, valid 0.767/0.767
10-fold cross valid, split #1, acc./fit. train 0.745/0.745, valid 0.774/0.774
Avg acc training: 0.74692, avg acc validation: 0.74505
Train the classifier on the whole dataset...
 0s/300s #3440: best loss 0.350355, loss 0.350355, gradient 0.000001       
End the training of the classifier.
Evaluation, acc:0.745, conf. matrix:
0217 0008 
0070 0011 
Export the classifier as C functions to /dev/null
Ends on Fri Dec 16 20:31:55

Average accuracy on validation splits, CapyNeuralNetwork: 0.74505, best of all on OpenML: 0.7679.

Heart-statlog (OpenML):

Training a fully connected neural network on Resources/heart-statlog.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Fri Dec 16 21:22:42
Hyper parameters value: [0.010000, 0.010000, 270]
layer #0: 13 nodes, linear activation
Perform 10-fold cross validation...
Uses Resources/heart-statlog_split.txt as the splits description file
10-fold cross valid, split #8, acc./fit. train 0.872/0.872, valid 0.741/0.741
10-fold cross valid, split #6, acc./fit. train 0.848/0.848, valid 0.889/0.889
10-fold cross valid, split #9, acc./fit. train 0.864/0.864, valid 0.852/0.852
10-fold cross valid, split #1, acc./fit. train 0.860/0.860, valid 0.778/0.778
10-fold cross valid, split #5, acc./fit. train 0.856/0.856, valid 0.815/0.815
10-fold cross valid, split #0, acc./fit. train 0.860/0.860, valid 0.852/0.852
10-fold cross valid, split #2, acc./fit. train 0.840/0.840, valid 0.926/0.926
10-fold cross valid, split #3, acc./fit. train 0.856/0.856, valid 0.815/0.815
10-fold cross valid, split #4, acc./fit. train 0.840/0.840, valid 0.889/0.889
10-fold cross valid, split #7, acc./fit. train 0.848/0.848, valid 0.815/0.815
Avg acc training: 0.85432, avg acc validation: 0.83704
Train the classifier on the whole dataset...
 1s/300s #2091: best loss 0.224569, loss 0.224569, gradient 0.000001       
End the training of the classifier.
Evaluation, acc:0.848, conf. matrix:
0096 0024 
0017 0133 
Export the classifier as C functions to /dev/null
Ends on Fri Dec 16 21:22:43

Average accuracy on validation splits, CapyNeuralNetwork: 0.83704, best of all on OpenML: 0.8629.

Ionosphere (OpenML):

Training a fully connected neural network on Resources/ionosphere.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Fri Dec 16 22:56:41
Hyper parameters value: [0.010000, 0.010000, 351]
layer #0: 34 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/ionosphere_split.txt as the splits description file
10-fold cross valid, split #0, acc./fit. train 0.987/0.987, valid 0.861/0.861
10-fold cross valid, split #1, acc./fit. train 0.991/0.991, valid 0.943/0.943
10-fold cross valid, split #3, acc./fit. train 0.994/0.994, valid 0.886/0.886
10-fold cross valid, split #5, acc./fit. train 0.991/0.991, valid 1.000/1.000
10-fold cross valid, split #6, acc./fit. train 0.984/0.984, valid 0.886/0.886
10-fold cross valid, split #8, acc./fit. train 0.987/0.987, valid 0.886/0.886
10-fold cross valid, split #4, acc./fit. train 0.997/0.997, valid 0.886/0.886
10-fold cross valid, split #2, acc./fit. train 0.987/0.987, valid 0.886/0.886
10-fold cross valid, split #9, acc./fit. train 0.991/0.991, valid 0.943/0.943
10-fold cross valid, split #7, acc./fit. train 0.987/0.987, valid 0.943/0.943
Avg acc training: 0.98955, avg acc validation: 0.91183
Train the classifier on the whole dataset...
 300s/300s #158501: best loss 0.024310, loss 0.024310, gradient 0.002026       
End the training of the classifier.
Evaluation, acc:0.997, conf. matrix:
0225 0000 
0001 0125 
Export the classifier as C functions to /dev/null
Ends on Fri Dec 16 23:06:41

Average accuracy on validation splits, CapyNeuralNetwork: 0.91183, best of all on OpenML: 0.9601.

Kr-vs-kp (OpenML):

Training a fully connected neural network on Resources/kr-vs-kp.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Fri Dec 16 23:56:42
Hyper parameters value: [0.010000, 0.010000, 990]
layer #0: 72 nodes, linear activation
Perform 10-fold cross validation...
Uses Resources/kr-vs-kp_split.txt as the splits description file
10-fold cross valid, split #5, acc./fit. train 0.939/0.939, valid 0.944/0.944
10-fold cross valid, split #8, acc./fit. train 0.943/0.943, valid 0.931/0.931
10-fold cross valid, split #3, acc./fit. train 0.942/0.942, valid 0.953/0.953
10-fold cross valid, split #1, acc./fit. train 0.939/0.939, valid 0.956/0.956
10-fold cross valid, split #0, acc./fit. train 0.940/0.940, valid 0.941/0.941
10-fold cross valid, split #4, acc./fit. train 0.941/0.941, valid 0.947/0.947
10-fold cross valid, split #9, acc./fit. train 0.941/0.941, valid 0.934/0.934
10-fold cross valid, split #2, acc./fit. train 0.943/0.943, valid 0.916/0.916
10-fold cross valid, split #7, acc./fit. train 0.942/0.942, valid 0.940/0.940
10-fold cross valid, split #6, acc./fit. train 0.943/0.943, valid 0.925/0.925
Avg acc training: 0.94128, avg acc validation: 0.93867
Train the classifier on the whole dataset...
 300s/300s #31182: best loss 0.175802, loss 0.179762, gradient 2.258641       
End the training of the classifier.
Evaluation, acc:0.942, conf. matrix:
1586 0083 
0103 1424 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 00:06:42

Average accuracy on validation splits, CapyNeuralNetwork: 0.93867, best of all on OpenML: 0.9978.

Monks-problems-1 (OpenML):

Training a fully connected neural network on Resources/monks-problems-1.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 01:16:52
Hyper parameters value: [0.010000, 0.010000, 556]
layer #0: 12 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/monks-problems-1_split.txt as the splits description file
10-fold cross valid, split #3, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #4, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #1, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #0, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #9, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #2, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #7, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #5, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #6, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #8, acc./fit. train 1.000/1.000, valid 1.000/1.000
Avg acc training: 1.00000, avg acc validation: 1.00000
Train the classifier on the whole dataset...
 300s/300s #711863: best loss 0.000009, loss 0.000009, gradient 0.000134       
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:
0278 0000 
0000 0278 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 01:26:52

Average accuracy on validation splits, CapyNeuralNetwork: 1.0, best of all on OpenML: 1.0.

Monks-problems-2 (OpenML):

Training a fully connected neural network on Resources/monks-problems-2.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 02:36:55
Hyper parameters value: [0.010000, 0.010000, 601]
layer #0: 12 nodes, sigmoid activation
layer #1: 6 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/monks-problems-2_split.txt as the splits description file
10-fold cross valid, split #6, acc./fit. train 0.998/0.998, valid 1.000/1.000
10-fold cross valid, split #2, acc./fit. train 0.998/0.998, valid 0.983/0.983
10-fold cross valid, split #0, acc./fit. train 0.996/0.996, valid 1.000/1.000
10-fold cross valid, split #3, acc./fit. train 0.991/0.991, valid 1.000/1.000
10-fold cross valid, split #4, acc./fit. train 0.998/0.998, valid 1.000/1.000
10-fold cross valid, split #5, acc./fit. train 0.996/0.996, valid 1.000/1.000
10-fold cross valid, split #9, acc./fit. train 0.996/0.996, valid 1.000/1.000
10-fold cross valid, split #7, acc./fit. train 1.000/1.000, valid 1.000/1.000
10-fold cross valid, split #8, acc./fit. train 0.991/0.991, valid 1.000/1.000
10-fold cross valid, split #1, acc./fit. train 0.998/0.998, valid 0.967/0.967
Avg acc training: 0.99630, avg acc validation: 0.99500
Train the classifier on the whole dataset...
 300s/300s #439865: best loss 0.014990, loss 0.014990, gradient 0.003472       
End the training of the classifier.
Evaluation, acc:0.992, conf. matrix:
0394 0001 
0004 0202 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 02:46:55

Average accuracy on validation splits, CapyNeuralNetwork: 0.995, best of all on OpenML: 1.0.

Mushroom (OpenML):

Training a fully connected neural network on Resources/mushroom.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 03:16:56
Hyper parameters value: [0.010000, 0.010000, 974]
layer #0: 44 nodes, linear activation
Perform 10-fold cross validation...
Uses Resources/mushroom_split.txt as the splits description file
10-fold cross valid, split #2, acc./fit. train 0.964/0.964, valid 0.970/0.970
10-fold cross valid, split #9, acc./fit. train 0.964/0.964, valid 0.961/0.961
10-fold cross valid, split #6, acc./fit. train 0.962/0.962, valid 0.964/0.964
10-fold cross valid, split #1, acc./fit. train 0.964/0.964, valid 0.959/0.959
10-fold cross valid, split #3, acc./fit. train 0.961/0.961, valid 0.966/0.966
10-fold cross valid, split #5, acc./fit. train 0.963/0.963, valid 0.968/0.968
10-fold cross valid, split #0, acc./fit. train 0.963/0.963, valid 0.959/0.959
10-fold cross valid, split #8, acc./fit. train 0.964/0.964, valid 0.957/0.957
10-fold cross valid, split #4, acc./fit. train 0.963/0.963, valid 0.963/0.963
10-fold cross valid, split #7, acc./fit. train 0.964/0.964, valid 0.963/0.963
Avg acc training: 0.96310, avg acc validation: 0.96307
Train the classifier on the whole dataset...
 300s/300s #78045: best loss 0.084216, loss 0.094246, gradient 0.606141       
End the training of the classifier.
Evaluation, acc:0.964, conf. matrix:
3832 0084 
0206 4002 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 03:26:56

Average accuracy on validation splits, CapyNeuralNetwork: 0.96307, best of all on OpenML: 1.0.

Sonar (OpenML):

Training a fully connected neural network on Resources/sonar.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 05:31:43
Hyper parameters value: [0.010000, 0.010000, 208]
layer #0: 60 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/sonar_split.txt as the splits description file
10-fold cross valid, split #5, acc./fit. train 1.000/1.000, valid 0.762/0.762
10-fold cross valid, split #3, acc./fit. train 1.000/1.000, valid 0.762/0.762
10-fold cross valid, split #4, acc./fit. train 1.000/1.000, valid 0.714/0.714
10-fold cross valid, split #2, acc./fit. train 1.000/1.000, valid 0.810/0.810
10-fold cross valid, split #0, acc./fit. train 1.000/1.000, valid 0.619/0.619
10-fold cross valid, split #7, acc./fit. train 0.995/0.995, valid 0.810/0.810
10-fold cross valid, split #9, acc./fit. train 0.995/0.995, valid 0.800/0.800
10-fold cross valid, split #6, acc./fit. train 1.000/1.000, valid 0.714/0.714
10-fold cross valid, split #8, acc./fit. train 1.000/1.000, valid 0.900/0.900
10-fold cross valid, split #1, acc./fit. train 1.000/1.000, valid 0.905/0.905
Avg acc training: 0.99893, avg acc validation: 0.77952
Train the classifier on the whole dataset...
 300s/300s #95480: best loss 0.008575, loss 0.008575, gradient 0.002969       
End the training of the classifier.
Evaluation, acc:1.000, conf. matrix:
0097 0000 
0000 0111 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 05:41:43

Average accuracy on validation splits, CapyNeuralNetwork: 0.77952, best of all on OpenML: 0.8990.

Spambase (OpenML):

Training a fully connected neural network on Resources/spambase.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 06:31:44
Hyper parameters value: [0.010000, 0.010000, 966]
layer #0: 114 nodes, linear activation
Perform 10-fold cross validation...
Uses Resources/spambase_split.txt as the splits description file
10-fold cross valid, split #8, acc./fit. train 0.856/0.856, valid 0.850/0.850
10-fold cross valid, split #4, acc./fit. train 0.856/0.856, valid 0.859/0.859
10-fold cross valid, split #2, acc./fit. train 0.871/0.871, valid 0.859/0.859
10-fold cross valid, split #7, acc./fit. train 0.853/0.853, valid 0.859/0.859
10-fold cross valid, split #0, acc./fit. train 0.864/0.864, valid 0.870/0.870
10-fold cross valid, split #1, acc./fit. train 0.855/0.855, valid 0.857/0.857
10-fold cross valid, split #5, acc./fit. train 0.856/0.856, valid 0.867/0.867
10-fold cross valid, split #3, acc./fit. train 0.858/0.858, valid 0.857/0.857
10-fold cross valid, split #9, acc./fit. train 0.867/0.867, valid 0.841/0.841
10-fold cross valid, split #6, acc./fit. train 0.860/0.860, valid 0.854/0.854
Avg acc training: 0.85969, avg acc validation: 0.85720
Train the classifier on the whole dataset...
 300s/300s #12533: best loss 0.189798, loss 0.252300, gradient 3.866341       
End the training of the classifier.
Evaluation, acc:0.871, conf. matrix:
1334 0479 
0116 2672 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 06:41:45

Average accuracy on validation splits, CapyNeuralNetwork: 0.8572, best of all on OpenML: 0.9626.

Tic-tac-toe (OpenML):

Training a fully connected neural network on Resources/tic-tac-toe.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 07:52:03
Hyper parameters value: [0.010000, 0.010000, 958]
layer #0: 18 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/tic-tac-toe_split.txt as the splits description file
10-fold cross valid, split #2, acc./fit. train 0.981/0.981, valid 1.000/1.000
10-fold cross valid, split #6, acc./fit. train 0.986/0.986, valid 1.000/1.000
10-fold cross valid, split #1, acc./fit. train 0.992/0.992, valid 0.958/0.958
10-fold cross valid, split #0, acc./fit. train 0.997/0.997, valid 0.990/0.990
10-fold cross valid, split #7, acc./fit. train 0.994/0.994, valid 0.990/0.990
10-fold cross valid, split #9, acc./fit. train 0.985/0.985, valid 0.989/0.989
10-fold cross valid, split #4, acc./fit. train 0.986/0.986, valid 0.969/0.969
10-fold cross valid, split #8, acc./fit. train 0.986/0.986, valid 0.968/0.968
10-fold cross valid, split #5, acc./fit. train 0.991/0.991, valid 0.979/0.979
10-fold cross valid, split #3, acc./fit. train 0.999/0.999, valid 0.990/0.990
Avg acc training: 0.98968, avg acc validation: 0.98329
Train the classifier on the whole dataset...
 300s/300s #225716: best loss 0.020331, loss 0.020331, gradient 0.001991       
End the training of the classifier.
Evaluation, acc:0.986, conf. matrix:
0626 0000 
0013 0319 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 08:02:03

Average accuracy on validation splits, CapyNeuralNetwork: 0.98329, best of all on OpenML: 1.0.

Wdbc (OpenML):

Training a fully connected neural network on Resources/wdbc.csv
Hyper parameters: [learning rate, momentum, batch size, model]
Starts on Sat Dec 17 09:46:47
Hyper parameters value: [0.010000, 0.010000, 569]
layer #0: 30 nodes, sigmoid activation
Perform 10-fold cross validation...
Uses Resources/wdbc_split.txt as the splits description file
10-fold cross valid, split #5, acc./fit. train 0.990/0.990, valid 0.930/0.930
10-fold cross valid, split #0, acc./fit. train 0.986/0.986, valid 1.000/1.000
10-fold cross valid, split #1, acc./fit. train 0.986/0.986, valid 1.000/1.000
10-fold cross valid, split #4, acc./fit. train 0.990/0.990, valid 0.982/0.982
10-fold cross valid, split #2, acc./fit. train 0.988/0.988, valid 0.965/0.965
10-fold cross valid, split #9, acc./fit. train 0.988/0.988, valid 0.929/0.929
10-fold cross valid, split #6, acc./fit. train 0.984/0.984, valid 0.982/0.982
10-fold cross valid, split #3, acc./fit. train 0.982/0.982, valid 1.000/1.000
10-fold cross valid, split #8, acc./fit. train 0.988/0.988, valid 0.965/0.965
10-fold cross valid, split #7, acc./fit. train 0.986/0.986, valid 1.000/1.000
Avg acc training: 0.98711, avg acc validation: 0.97531
Train the classifier on the whole dataset...
 300s/300s #122658: best loss 0.026291, loss 0.026291, gradient 0.006039       
End the training of the classifier.
Evaluation, acc:0.989, conf. matrix:
0208 0004 
0002 0355 
Export the classifier as C functions to /dev/null
Ends on Sat Dec 17 09:56:47

Average accuracy on validation splits, CapyNeuralNetwork: 0.97531, best of all on OpenML: 0.9824.

Summary:

Once again the results are very good and make me confident I've implemented correctly neural network and their training using gradient descent with back propagation. Also, it's not shown here but the gradient very rarely exploded, hint that my simple solution to control the learning rate is already good enough.

Compare to SVMs, neural networks have a lot of knobs and switches to tune: the network structure and activations, the loss function, the gradient descent parameters, the batch size. SVMs only have the choice of the kernel and its parameters, which is furthermore easy to automate. Still, using the most basic solution I could immediately obtain the results introduced here. That's really satisfying ! I'd like anyway to keep studying on that subject and expect a lot more fun to see how these results could be improved. The points I like to dig further are:

how to deal with missing data in the dataset (I've simply avoided them until now)
how to encode categorical data (other than ordinal encoding)
how to automate the search for the network structure (genetic algorithm ?)
other loss functions (I have considered MSE only until now)
other gradient descent heuristic to control the learning rate (better than my home made one)
other method to initialise the gradient descent (I have used Xavier method only without really thinking seriously about it)
a good heuristic to choose the batch size (I've set the 1000 samples limit arbitrarily, and looking at the log I think now that I could as well set it much higher probably)
the so called layer normalisation seems to be a nice feature to add
and of course extend the implementation to convolutional neural network

Finally a comparison SVM/NN/OpenML:

Both achieve very good results, and bad results do not overlap, which means that trying both and choosing the best gives results almost at par with the best one on OpenML for all datasets except one. Only the spambase dataset was behind for both, but given how good the others are there is hope that tuning parameters would correct that. Neural networks are more cumbersome to train (lot of parameters and no automation of their search), SVMs are easier to train thanks to automation but the hyper parameters search is super time consuming. The sum of validation accuracies on all datasets for SVM and NN are respectively 10.80 and 10.76, the difference is insignificant. So there is no real winner, both are performant tools in my toolbox waiting to be improved even more and used on future projects !

2022-12-17
in AI/ML, All,
186 views
A comment, question, correction ? A project we could work together on ? Email me!
Learn more about me in my profile.