Comparison of standard, momentum and Adam methods for gradient descent

Following my previous article where I compared support vector machines and neural networks, in this one I will compare three types of optimiser: standard gradient descent, gradient descent with momentum, and Adam.

During my study of neural networks I had left aside the question of the adaptive learning rate and used a simple home-made one. Coming back to that subject, I've now added to my implementation in LibCapy what's considered the state-of-the-art optimiser: Adam.

Training the coefficients of a neural network is usually done with the gradient descent method. One calculates the derivative of the loss function with respect to the neural network coefficients over the training dataset, and corrects these coefficients with a small step in the direction opposite to the gradient. The problem is to decide how big that small step should be. Too large and the training may fail, too small and the training may take forever to converge.

In the standard gradient descent method, the size of the step is a constant chosen by the user. $$ w_{t+1}=w_t-\eta\nabla_wL_t $$ where \(\eta\) is the learning rate, \(w\) the neural network coefficients, \(L\) the loss function, \(t\) the iteration step.
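To make this concrete, here is a minimal sketch in C of one such update, assuming the gradient of the loss with respect to each coefficient has already been computed (the function and variable names are mine for illustration, not LibCapy's API):

```c
#include <stddef.h>

// One standard gradient descent update over all coefficients.
void GradientDescentStep(
  double* w,           // neural network coefficients
  double const* grad,  // gradient of the loss w.r.t. the coefficients
  size_t nbCoeff,      // number of coefficients
  double eta) {        // learning rate
  for (size_t i = 0; i < nbCoeff; ++i) {
    // Constant-size step in the direction opposite to the gradient.
    w[i] -= eta * grad[i];
  }
}
```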

In gradient descent with momentum, the update at each iteration combines the current gradient step with a fraction of the previous update (controlled by another coefficient). $$ \begin{array}{l} \Delta w_{t+1}=\alpha\Delta w_t-\eta\nabla_wL_t\\ w_{t+1}=w_t+\Delta w_{t+1}\\ \end{array} $$ where \(\eta\) is the learning rate, \(w\) the neural network coefficients, \(L\) the loss function, \(t\) the iteration step, \(\alpha\) the momentum coefficient.
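A sketch of the same update with momentum, under the same assumptions as above (the array of previous updates must be zero-initialised before the first call and kept between iterations):

```c
#include <stddef.h>

// One gradient descent update with momentum over all coefficients.
void MomentumStep(
  double* w,           // neural network coefficients
  double* deltaW,      // previous update, carried over between iterations
  double const* grad,  // gradient of the loss w.r.t. the coefficients
  size_t nbCoeff,      // number of coefficients
  double eta,          // learning rate
  double alpha) {      // momentum coefficient
  for (size_t i = 0; i < nbCoeff; ++i) {
    // New update: a fraction of the previous update minus the gradient step.
    deltaW[i] = alpha * deltaW[i] - eta * grad[i];
    w[i] += deltaW[i];
  }
}
```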

Adam goes one step further and updates the step according to two measurements of the current descent direction (one based on the gradient and another based on the square of the gradient). Adam has been shown to converge much faster and to be more stable than the other methods. $$ \begin{array}{l} m_{t+1}=\beta_1m_t+(1-\beta_1)\nabla_wL_t\\ v_{t+1}=\beta_2v_t+(1-\beta_2)(\nabla_wL_t)^2\\ \hat{m}=m_{t+1}/(1-\beta_1^{t+1})\\ \hat{v}=v_{t+1}/(1-\beta_2^{t+1})\\ w_{t+1}=w_t-\eta\hat{m}/(\sqrt{\hat{v}}+\epsilon)\\ \end{array} $$ where \(\eta\) is the learning rate, \(w\) the neural network coefficients, \(L\) the loss function, \(t\) the iteration step, \(\beta_1\) and \(\beta_2\) the decay rates, \(\epsilon\) a small value to avoid division by zero.
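And a sketch of one Adam step following the formulas above, again with illustrative names rather than LibCapy's API; the moment arrays must be zero-initialised before the first call, and the iteration index starts at 0 (hence the bias correction uses t + 1):

```c
#include <stddef.h>
#include <math.h>

// One Adam update over all coefficients.
void AdamStep(
  double* w,           // neural network coefficients
  double* m,           // first moment estimate (average of the gradient)
  double* v,           // second moment estimate (average of the squared gradient)
  double const* grad,  // gradient of the loss w.r.t. the coefficients
  size_t nbCoeff,      // number of coefficients
  long t,              // iteration step, starting at 0
  double eta,          // learning rate
  double beta1,        // decay rate of the first moment
  double beta2,        // decay rate of the second moment
  double epsilon) {    // small value to avoid division by zero
  for (size_t i = 0; i < nbCoeff; ++i) {
    // Exponential moving averages of the gradient and of its square.
    m[i] = beta1 * m[i] + (1.0 - beta1) * grad[i];
    v[i] = beta2 * v[i] + (1.0 - beta2) * grad[i] * grad[i];
    // Bias-corrected estimates.
    double mHat = m[i] / (1.0 - pow(beta1, (double)(t + 1)));
    double vHat = v[i] / (1.0 - pow(beta2, (double)(t + 1)));
    // Per-coefficient adaptive step.
    w[i] -= eta * mHat / (sqrt(vHat) + epsilon);
  }
}
```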

Reusing the same tests as in my previous article, I've this time compared the performance of these three methods. For the momentum version I've tried three values: \(\alpha=0.8, 0.9, 0.99\). For the Adam version I've used \(\beta_1=0.9\) and \(\beta_2=0.999\). For all tests \(\eta=0.01\).

Results were as follows (average accuracy on the validation data):

All three methods gave similarly good results. The results are also similar to what I got in my previous article (except for monks2, unlucky?), suggesting I haven't broken anything since then. If all three methods give similar results, what's the point, you may ask? Let's plot the evolution of the loss value during training for the tic-tac-toe dataset (chosen at random).

The difference between the methods is striking. The standard gradient method converges very slowly. Adding momentum improves the convergence speed, but there is a sweet spot: too little or too much momentum reduces the improvement. Also, the larger the coefficient, the stronger the staircase-like pattern that appears (I guess it shows the weights overshooting and coming back toward the optimum?). The Adam method converges far faster than the two others, as promised. Also, even if I have no data to show here (sorry for being lazy), I had the feeling that Adam was significantly more stable than the two other methods. Finally, it's important to note that Adam is known to not always converge. But given the time it can save, using it as the default and falling back to momentum when it doesn't work looks like the appropriate strategy.

2024-11-18