Edit on 2024/10/18: Added clarifications and corrected typos in the text and pseudocode. Results unchanged.
Following my study of Support Vector Machines, I moved on to Neural Networks after watching this video by Sebastian Lague. This article is a memo about what I've learnt and a comparison with my previous work on SVMs.
Artificial neural networks are directed graphs where the nodes are organised into layers, and each node has an associated value calculated from the values of its input nodes, the weights of its incoming edges, and an 'activation' function. The Wikipedia page and the video linked above contain enough information to understand how they work and start implementing them. Of course it's another endless topic, but as the results below show, there is already something practically useful to get from just scratching the surface. So as a first step I've mainly focused on what's explained in Lague's video.
There is no point in repeating what's already very well explained in his video, but I personally prefer to see an algorithm as pseudocode, so that's what I'll share below. Also, I wanted something more generic in terms of activation functions and network architecture than what's shown in the video.
First, the data structures needed to implement a neural network are as follows:
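Here is a minimal sketch in Python/NumPy of one possible representation (the names are illustrative, not the actual LibCapy structures): a fully connected feed-forward network stored as one weight matrix, one bias vector and one activation function (plus its derivative, needed later for training) per layer.

```python
from dataclasses import dataclass, field
from typing import Callable, List
import numpy as np

@dataclass
class Layer:
    weights: np.ndarray   # shape (nbNode, nbNodePrev); weights[i, j] is the weight
                          # of the link from node j of the previous layer to node i
    biases: np.ndarray    # shape (nbNode,); one bias per node
    activation: Callable[[np.ndarray], np.ndarray]        # activation function of the layer
    activation_deriv: Callable[[np.ndarray], np.ndarray]  # its derivative, used during training

@dataclass
class NeuralNetwork:
    layers: List[Layer] = field(default_factory=list)
```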
Some commonly used activation functions are:
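For instance (a Python/NumPy sketch; the selection is illustrative, far from exhaustive):

```python
import numpy as np

# A few commonly used activation functions (vectorised with NumPy).
def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)
```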
And their derivatives:
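Expressed as functions of the node's weighted sum (same sketch style as above):

```python
import numpy as np

# Derivatives of the activation functions above, taking the node's raw
# weighted sum x as input (this is what backpropagation needs).
def linear_deriv(x):
    return np.ones_like(x)

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2

def relu_deriv(x):
    return (x > 0.0).astype(float)
```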
More activation functions are available here.
Next, the prediction can be performed as follows:
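A sketch of Predict() as a forward pass over the Layer structure sketched above (again illustrative, not the LibCapy code):

```python
import numpy as np

# Sketch of Predict(): feed the input vector forward through the layers.
# 'layers' is a list of Layer objects as in the earlier sketch.
def predict(layers, inputs):
    values = np.asarray(inputs, dtype=float)
    for layer in layers:
        # Weighted sum of the previous layer's values, plus the node biases...
        weighted_sum = layer.weights @ values + layer.biases
        # ...passed through the layer's activation function.
        values = layer.activation(weighted_sum)
    return values  # the network's output vector
```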
Note that a neural network's input and output are vectors of real values. It means that any categorical values must first be converted to numerical ones. For a regression task the output is necessarily numerical, so there is no problem. For a classification task, the solution is to use "one hot encoding": one output per predicted category value. The index of the predicted value is then equal to argmax(outputs) after executing Predict(). For the inputs, if a value is numerical it's easy (apart from deciding whether or not it should be normalised), but for categorical values it's more difficult.
This article suggests ordinal encoding, one hot encoding, embedding vectors, or domain-specific solutions. Embedding vectors and domain-specific solutions are beyond the scope of my study for now, so I simply ignore them. In my previous study of SVM I was simply using ordinal encoding. Now I wonder what the results would look like with one hot encoding. However, results have shown that ordinal encoding is already enough to obtain good results, and systematically one-hot encoding categorical variables makes the number of input dimensions explode. I think that SVMs are much less sensitive to ordinal encoding because they are searching for support vectors delimiting subdomains of the input domain. So it doesn't matter much if, along the dimension of a categorical input, the values do not represent a continuous concept: they will be clamped into several regions in the end anyway.
A neural network works in a very different manner, with inputs "exciting" neurons. So if one categorical input is encoded into one single input node, the following nodes in the feed-forward graph get excited at various levels, transforming a discrete piece of information into a nonsensical continuous one. If that input is split into one node per value with one hot encoding, this problem doesn't occur. And as we are training the neural network using the gradient descent method, preserving a meaningful continuity seems natural. Still, the explosion in the number of dimensions is annoying. Looking for more info about that question I came across this article arguing that one hot encoding is bad, and where other approaches are introduced. I'll definitely try them in the future, but in this article I'll stick to ordinal encoding to see how well (or badly?) it performs, and to have something to compare against when I try other encodings.
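To make the two options concrete, here is a small sketch (the category values are hypothetical) of ordinal versus one hot encoding of a single categorical input:

```python
import numpy as np

# Hypothetical categorical input with three possible values.
categories = ["red", "green", "blue"]

def ordinal_encode(value):
    # One input node: the index of the value (introduces an arbitrary order).
    return [float(categories.index(value))]

def one_hot_encode(value):
    # One input node per possible value: 1.0 for the observed value, 0.0 elsewhere.
    encoding = np.zeros(len(categories))
    encoding[categories.index(value)] = 1.0
    return encoding

print(ordinal_encode("green"))  # [1.0]
print(one_hot_encode("green"))  # [0. 1. 0.]
```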
Then the problem is to train the neural network, i.e. to calculate the values of all the node.bias and link.weight parameters. This is done by minimising a loss function over those parameters. There are tons of possible loss functions (cf. here). In his video Sebastian Lague uses the squared error and I've simply followed him. To solve the minimisation problem I could have reused differential evolution but, just as sequential minimal optimisation exists for support vector machines, there exists an efficient algorithm for neural networks: gradient descent with backpropagation.
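For reference, for one sample with network outputs \(o_k\) and expected outputs \(y_k\), the squared error is simply \(L=\sum_k(o_k-y_k)^2\) (sometimes written with a \(\frac{1}{2}\) factor to simplify the derivative); during training it is averaged over the samples used at each step.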
The gradient descent method consists of calculating the gradient \(\nabla\) (the partial derivatives of the loss function with respect to each parameter) and using it to update these parameters until all partial derivatives are null (which means an optimum has been found). The method has three parameters: the random starting parameter values \(\overrightarrow{p_0}\), the step (also called learning rate in ML) \(s\) and the momentum \(m\). One step of the method is then: \(\overrightarrow{p_{i+1}}=\overrightarrow{p_i}-s(\nabla(\overrightarrow{p_i})+m\nabla(\overrightarrow{p_{i-1}}))\).
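As a sketch on a flat parameter vector (the function name is mine, purely for illustration):

```python
import numpy as np

# One step of gradient descent with momentum, as written above:
# p_{i+1} = p_i - s * (grad(p_i) + m * grad(p_{i-1}))
def gradient_descent_step(params, grad, prev_grad, learn_rate, momentum):
    return params - learn_rate * (grad + momentum * prev_grad)
```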
The gradient descent method requires the partial derivatives of the loss function with respect to all parameters, where the loss is averaged over all the samples in the training dataset. Without knowing those derivatives in closed form, the naive way is to approximate them with finite differences, i.e. evaluate the loss function twice per partial derivative. It means that one step of the gradient method applied to neural network training requires ((nbLink + nbNode) * 2 * nbSample) evaluations of the neural network. In practice it's too time consuming for large datasets and/or large neural networks.
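For illustration, the naive finite difference approach would look like the sketch below, where `loss(params)` is assumed to average the loss over the whole dataset and `params` is a flat vector of all weights and biases:

```python
import numpy as np

# Naive gradient estimation by central finite differences: two evaluations of
# the (dataset-averaged) loss function per parameter.
def finite_difference_gradient(loss, params, epsilon=1e-6):
    grad = np.zeros_like(params)
    for i in range(params.size):
        bumped_up, bumped_down = params.copy(), params.copy()
        bumped_up[i] += epsilon
        bumped_down[i] -= epsilon
        grad[i] = (loss(bumped_up) - loss(bumped_down)) / (2.0 * epsilon)
    return grad
```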
Thankfully the so-called backpropagation trick saves the day. It's based on the chain rule, which allows a derivative to be calculated from intermediate derivatives (\(\frac{dz}{dx}=\frac{dz}{dy}\frac{dy}{dx}\)). How it applies to our neural network is brilliantly explained in Lague's video, so I won't even try here. I'll just summarise it as calculating intermediate derivatives at each node and link of the network and combining them backward (from the output nodes toward the input nodes) to calculate the gradient of the loss function from a single evaluation of the loss function per step (one forward and one backward pass), instead of two per parameter. Truly impressive!
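Just to fix ideas with one concrete case: for a weight \(w_{jk}\) between node \(j\) of the last hidden layer and output node \(k\), with \(z_k\) the weighted sum at node \(k\), \(a_k=f(z_k)\) its activation and the squared error loss above, the chain rule gives \(\frac{\partial L}{\partial w_{jk}}=\frac{\partial L}{\partial a_k}\frac{\partial a_k}{\partial z_k}\frac{\partial z_k}{\partial w_{jk}}=2(a_k-y_k)\,f'(z_k)\,a_j\). Backpropagation reuses the first two factors (the node's "delta") to propagate the calculation toward the previous layers.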
For very large datasets and very large networks this may still be insufficient. Some more time can be saved using stochastic gradient descent. It consists of using only a subset of the training samples at each step. The gradient calculated is inexact due to the missing samples, but provided there are enough samples at each step the "overall direction" stays good enough, and efficient training can be achieved while saving a significant amount of time. The number of samples used per step is called the batch size. How to choose it is a mystery I haven't investigated. Also, as the calculated gradient is inexact, the evolution of the loss at each step becomes noisy, making it difficult to use as a stopping condition for the descent or for the adaptive learning rate (cf. below). To compensate for that I'm using a sliding average with a window size equal to the number of samples in the dataset divided by the number of samples in a batch.
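A small sketch of both ideas (illustrative names; the window size follows the rule described above, roughly one epoch):

```python
import numpy as np
from collections import deque

# Pick a random mini-batch of sample indices for one stochastic gradient step.
def sample_batch(nb_sample, batch_size, rng):
    return rng.choice(nb_sample, size=min(batch_size, nb_sample), replace=False)

# Sliding average used to smooth the noisy per-batch loss; the window is
# nbSample // batchSize so it covers roughly one pass over the dataset.
class SlidingAverage:
    def __init__(self, window_size):
        self.values = deque(maxlen=max(1, window_size))
    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)
```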
There are two other problems with the gradient descent method: it is sensitive to its starting point, and it has a strong tendency to diverge if the learning rate is not chosen carefully. About the starting point, I refer to this article and content myself with the Xavier method for now (biases set to null, weights set randomly according to a normal distribution with \(\sigma=1/n\) where \(n\) is the number of nodes in the layer). About divergence, there are methods called adaptive learning rates which help avoid it. That's another never-ending story from which I cowardly escape by implementing my own: at each gradient descent step, multiply the learning rate by \(a>1.0\) if the loss decreases, and by \(b<1.0\) if the loss increases (down to a minimum of 0.1 times the initial learning rate). In practice I have used \(a=1.1\) and \(b=0.5\).
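A sketch of both, matching the description above (illustrative names; here \(n\) is taken as the number of incoming nodes of the layer):

```python
import numpy as np

# Xavier-style initialisation: biases at zero, weights drawn from a normal
# distribution whose spread shrinks with the layer size n.
def init_layer(nb_node_prev, nb_node, rng):
    weights = rng.normal(0.0, 1.0 / nb_node_prev, size=(nb_node, nb_node_prev))
    biases = np.zeros(nb_node)
    return weights, biases

# Simple adaptive learning rate: grow it while the (smoothed) loss decreases,
# shrink it when the loss increases, never below 0.1 times the initial value.
def adapt_learn_rate(learn_rate, init_learn_rate, loss, prev_loss, a=1.1, b=0.5):
    if loss < prev_loss:
        return learn_rate * a
    return max(learn_rate * b, 0.1 * init_learn_rate)
```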
Then the pseudocode for training is as follows:
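Here it is as a condensed, self-contained sketch in Python/NumPy rather than pseudocode (illustrative, not the LibCapy code; sigmoid activations everywhere for brevity), gathering the forward pass, backpropagation, mini-batching and the momentum update described above:

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Layer structure repeated from the earlier sketch so this block runs on its own.
@dataclass
class Layer:
    weights: np.ndarray   # shape (nbNode, nbNodePrev)
    biases: np.ndarray    # shape (nbNode,)
    activation: Callable[[np.ndarray], np.ndarray]
    activation_deriv: Callable[[np.ndarray], np.ndarray]

def init_network(layer_sizes, rng) -> List[Layer]:
    # Biases at zero, weights from a normal distribution shrinking with the fan-in.
    return [Layer(rng.normal(0.0, 1.0 / n_in, size=(n_out, n_in)),
                  np.zeros(n_out), sigmoid, sigmoid_deriv)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(layers, x):
    # Forward pass, keeping the weighted sums and activations of every layer
    # because backpropagation needs them.
    sums, activations = [], [np.asarray(x, dtype=float)]
    for layer in layers:
        z = layer.weights @ activations[-1] + layer.biases
        sums.append(z)
        activations.append(layer.activation(z))
    return sums, activations

def backprop(layers, x, y):
    # Gradient of the squared error loss for one sample, via backpropagation.
    sums, activations = forward(layers, x)
    grad_w = [np.zeros_like(layer.weights) for layer in layers]
    grad_b = [np.zeros_like(layer.biases) for layer in layers]
    # Delta of the output layer: dL/dz = 2 * (output - expected) * f'(z).
    delta = 2.0 * (activations[-1] - y) * layers[-1].activation_deriv(sums[-1])
    for l in range(len(layers) - 1, -1, -1):
        grad_w[l] = np.outer(delta, activations[l])
        grad_b[l] = delta
        if l > 0:
            # Propagate the delta backward to the previous layer.
            delta = (layers[l].weights.T @ delta) * layers[l - 1].activation_deriv(sums[l - 1])
    return grad_w, grad_b

def train(layers, X, Y, nb_step=1000, batch_size=32,
          learn_rate=0.01, momentum=0.01, rng=None):
    rng = rng or np.random.default_rng()
    prev_gw = [np.zeros_like(layer.weights) for layer in layers]
    prev_gb = [np.zeros_like(layer.biases) for layer in layers]
    for _ in range(nb_step):
        # Stochastic gradient descent: average the gradient over one mini-batch.
        batch = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
        gw = [np.zeros_like(layer.weights) for layer in layers]
        gb = [np.zeros_like(layer.biases) for layer in layers]
        for i in batch:
            sample_gw, sample_gb = backprop(layers, X[i], Y[i])
            for l in range(len(layers)):
                gw[l] += sample_gw[l] / len(batch)
                gb[l] += sample_gb[l] / len(batch)
        # Update step: p <- p - learnRate * (grad + momentum * prevGrad).
        for l, layer in enumerate(layers):
            layer.weights -= learn_rate * (gw[l] + momentum * prev_gw[l])
            layer.biases -= learn_rate * (gb[l] + momentum * prev_gb[l])
        prev_gw, prev_gb = gw, gb
    return layers
```

A real run would also plug in the adaptive learning rate and the sliding loss average sketched above, and use a time limit rather than a fixed number of steps as stopping condition.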
To check my implementation of neural networks and gradient descent with backpropagation in LibCapy, I've used the same datasets as for my study of support vector machines. In my previous article I had lazily used random splits for the k-fold cross validation, invalidating the comparison with OpenML results and making it impossible to compare with the neural networks. The first thing to do was then to fix that: the results below all use the OpenML splits, and I've rerun all the SVM training with those splits. The updated results for SVM are as follows:
Diabetes (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.74479, best of all on OpenML: 0.7877.
Haberman (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.73548, best of all on OpenML: 0.7679.
Heart-statlog (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.78889, best of all on OpenML: 0.8629.
Ionosphere (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.95444, best of all on OpenML: 0.9601.
Kr-vs-kp (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.9734, best of all on OpenML: 0.9978.
Monks-problems-1 (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.98919, best of all on OpenML: 1.0.
Monks-problems-2 (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.89183, best of all on OpenML: 1.0.
Mushroom (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 1.0, best of all on OpenML: 1.0.
Sonar (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.90452, best of all on OpenML: 0.8990.
Spambase (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.87786, best of all on OpenML: 0.9626.
Tic-tac-toe (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.95402, best of all on OpenML: 1.0.
Wdbc (OpenML):
Average accuracy on validation splits, CapySupportVectorMachine: 0.98239, best of all on OpenML: 0.9824.
Summary:
Once again I've obtained good predictive accuracy, below the best models on OpenML but not far away, and even above on the sonar dataset. This helps build confidence in my implementation of SVM, and there is always hope for better results by running the search for hyperparameters longer. About the sonar dataset, could I have been lucky enough to find hyperparameters which improve on the best results? I tried to reproduce the results through the Python API, but there was a problem and I'm currently waiting for a reply from OpenML.
After adding my neural network implementation to the application I had previously made to use SVM, I used it to see how well it performs. But contrary to SVM, I haven't automated the search for hyperparameters. One problem is the structure of the network, whose exploration can't be performed using differential evolution like I did for SVM. Another problem is that it's difficult to define how to split the time allocated to the hyperparameter search. So for now I decided to manually search for the best network structure, learning rate and momentum with the following strategy. For the network structure, given \(n\) the number of inputs of the network, I've tried the ten combinations of the five structures (one layer of \(n\) nodes), (one layer of \(2n\) nodes), (two layers of \(n\) nodes), (two layers of \(2n\) nodes), (one layer of \(n\) nodes and one layer of \(2n\) nodes), each using Linear or Sigmoid activation. For the other parameters, I use 5 minutes of training time, a maximum batch size of 1000 samples, and a learning rate and momentum equal to 0.01. The best results are as follows:
Diabetes (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.77736, best of all on OpenML: 0.7877.
Haberman (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.74505, best of all on OpenML: 0.7679.
Heart-statlog (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.83704, best of all on OpenML: 0.8629.
Ionosphere (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.91183, best of all on OpenML: 0.9601.
Kr-vs-kp (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.93867, best of all on OpenML: 0.9978.
Monks-problems-1 (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 1.0, best of all on OpenML: 1.0.
Monks-problems-2 (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.995, best of all on OpenML: 1.0.
Mushroom (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.96307, best of all on OpenML: 1.0.
Sonar (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.77952, best of all on OpenML: 0.8990.
Spambase (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.8572, best of all on OpenML: 0.9626.
Tic-tac-toe (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.98329, best of all on OpenML: 1.0.
Wdbc (OpenML):
Average accuracy on validation splits, CapyNeuralNetwork: 0.97531, best of all on OpenML: 0.9824.
Summary:
Once again the results are very good and make me confident I've correctly implemented neural networks and their training using gradient descent with backpropagation. Also, it's not shown here but the gradient very rarely exploded, a hint that my simple solution to control the learning rate is already good enough.
Compared to SVMs, neural networks have a lot of knobs and switches to tune: the network structure and activations, the loss function, the gradient descent parameters, the batch size. SVMs only have the choice of the kernel and its parameters, which is furthermore easy to automate. Still, using the most basic solutions I could immediately obtain the results introduced here. That's really satisfying! I'd like to keep studying that subject anyway, and expect a lot more fun seeing how these results could be improved. The points I'd like to dig into further are:
Finally, a comparison of SVM, NN and the best OpenML results:
Both achieve very good results, and their bad results do not overlap, which means that trying both and choosing the best gives results almost on par with the best on OpenML for all datasets except one. Only the spambase dataset lags behind for both, but given how good the others are there is hope that tuning the parameters would correct that. Neural networks are more cumbersome to train (lots of parameters and no automation of their search); SVMs are easier to train thanks to automation, but the hyperparameter search is very time consuming. The sums of the validation accuracies over all datasets for SVM and NN are respectively 10.80 and 10.76; the difference is insignificant. So there is no real winner: both are performant tools in my toolbox, waiting to be improved even more and used on future projects!