Gradient descent optimization algorithms are very popular to perform optimization and by far the most common way to optimize neural networks. However, there are also many other algorithms, like Hessian Free, Conjugate Gradient, BFGS, L-BFGS and etc., are proposed to deal with optimization tasks, here we only take those gradient descent methods into account. And we will discuss the drawbacks among those gradient algorithms and the methods to solve these blemishes.

# Gradient Descent Optimization Algorithm

## Vanilla Gradient Descent

Consider an objective $\mathcal{L}(\boldsymbol{\theta})$, and our goal is aiming to mimimize it. Here, the $\boldsymbol{\theta}\in\mathbb{R}^{n}$ is the set of parameters for entire dataset. Using **(vanilla/batch) gradient descent**, to minimize the objective, we need to update $\boldsymbol{\theta}$ by
$$
\boldsymbol{\theta}:=\boldsymbol{\theta}-\eta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})
$$
where $\eta$ denotes learning rate, which determines the size of the steps we take to reach a (local) minimum, $\nabla_{\boldsymbol{\theta}}$ represents the partial derivative of $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$, we can derive from the formula above that, for each update, the gradients for whole dataset need to be computed, which is extraordinarily time consuming and heavy memory occupation. Moreover, the batch gradient descent does not allow online training. Other drawbacks:

- It is relatively slow close to the minimum: technically, its asymptotic rate of convergence is inferior to many other methods. For poorly conditioned convex problems, gradient descent increasingly ‘zigzags’ as the gradients point nearly orthogonally to the shortest direction to a minimum point.
- It is only guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Vanilla gradient descent is `OptimizationAlgorithm.LINE_GRADIENT_DESCENT`

in DeepLearning4J.

## Stochastic Gradient Descent

Stochastic gradient descent (often shortened in SGD), also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method for minimizing an objective function. Unlike vanilla gradient descent, which performs redundant computations (since it may recompute gradients for similar examples before each parameter updates), SGD performs parameter update for each sample $(\mathbf{x}^{(i)},y^{(i)})$, i.e., it computes one update a time
$$
\boldsymbol{\theta}:=\boldsymbol{\theta}-\eta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})
$$
Thus, it is much faster than vanilla gradient descent and is capable of training online. Moreover, becasue of updating by one sample each time, SGD always perform frequent updates with a high variance that cause the objective function to fluctuate heavily. For SGD’s fluctuation, it may enable it to jump from a local minimum to new and potentially better local (even global) minima, however, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. Besides, when the learning rate is slowly decreased, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively. In deeplearning4j, Stochastic Gradient Descent is `OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT`

.

## Mini-batch (Stochastic) Gradient Descent

Mini-batch (Stochastic) Gradient Descent performs an update for every $n$ training sample ($n=1$, it becomes an online method)
$$
\boldsymbol{\theta}:=\boldsymbol{\theta}-\eta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i:i+n}(\boldsymbol{\theta})
$$
it reduces the variance of the parameter updates, which can lead to more stable convergence and is able to make use of highly optimized matrix optimizations that make computing the gradient with respect to a mini-batch very efficient. Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used. In deeplearning4j, by using mini-batch, set `.miniBatch(true)`

and the batch size is defined by data processor.

Albeit improvements are made to ameliorate gradient descent methods, there are still some challanges:

- Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
- Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.
- Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
- Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. some argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

# Parameter Update Methods

Here I summarize some methods that are widely used in neural networks to deal with the aforementioned challanges.

## Momentum

Since SGD has trouble navigating ravines, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum is a method, which appeared in Rumelhart, Hinton and Williams’ seminal paper on backpropagation learning, that helps to acclerate SGD in the relevant direction and dampens oscillations. Momentum remembers the update $\Delta w$ at each iteration, and determines the next update as a convex combination of the gradient and the previous update
$$
\begin{aligned}
\Delta w_{t} & :=\alpha\Delta w_{t-1}+\eta\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})\newline
\boldsymbol{\theta} & :=\boldsymbol{\theta}-\Delta w_{t}
\end{aligned}
$$
where, $\alpha$ is a hyperparameter, which controls the contribution of the update vector of the past time step to the current update vector. The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation. Normally, the momentum term $\alpha$ is usually set to 0.9 or a similar value. In deeplearning4j, using momentum by setting `.momentum(0.9)`

.

## Nesterov Momentum

Nesterov momentum, also know as Nesterov accelerated gradient (NAG), is a way to give the momentum term a kind of prescience that prevents the gradient blindly following the slope and let it knows where to go and then to slow down before the hill slopes up again.
$$
\begin{aligned}
\Delta w_{t} & :=\alpha\Delta w_{t-1}+\eta\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}-\alpha\Delta w_{t-1})\newline
\boldsymbol{\theta} & :=\boldsymbol{\theta}-\Delta w_{t}
\end{aligned}
$$
where the $\eta$ denotes the learning rate, $\alpha$ denotes the momentum, computing $\boldsymbol{\theta}-\alpha\Delta w_{t-1}$ is to obtain an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea where the parameters are going to be.
See from the graph, while Momentum first computes the current gradient (*small blue vector*) and then takes a big jump in the direction of the updated accumulated gradient (*big blue vector*), NAG first makes a big jump in the direction of the previous accumulated gradient (*brown vector*), measures the gradient and then makes a correction (*green vector*). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks. In deeplearning4j, Nesterov is `.updater(Updater.NESTEROVS).momentum(0.9)`

.

## AdaGrad

AdaGrad (also know as adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, it increases the learning rate (lager updates) for more sparse parameters and decreases the learning rate (smaller updates) for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent for dealing with sparse data. In practice, AdaGrad greatly improved the robustness of SGD and it is often used for training large-scale neural nets, besides, GloVe model also use the AdaGrad.

Different from the above methods, who updates all parameters $\boldsymbol{\theta}$ at once and use the same learning rate for every parameter, however, AdaGrad uses a different learning rate for every parameter $\theta_{i}$ at every time step $t$. Denote $g_{t,i}$ to be the gradient of the objective function with respect to the parameter $\theta_{i}$ at time step $t$, then we have $g_{t,i}=\nabla_{\boldsymbol{\theta}}\mathcal{L}(\theta_{i})$, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$ based on the previous gradients that have been computed for $\theta_{i}$:
$$
\theta_{t,i}:=\theta_{t-1,i}-\frac{\eta\cdot g_{t-1,i}}{\sqrt{G_{t-1,ii}+\epsilon}}:=\theta_{t-1,i}-\frac{\eta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}(\theta_{i-1})}{\sqrt{G_{t-1,ii}+\epsilon}}
$$
where $G_{t}\in\mathbb{R}^{n\times n}$ is a diagonal matrix where each diagonal element is the sum of the squares of the gradients with respect to $\theta_{i}$ up to time step $t$, while $\epsilon$ is a smoothing term that avoids division by zero (usually takes $1e-8$). Thus, for $G_{t}$, which contains the sum of the squares of the past gradients with respect to all parameters $\boldsymbol{\theta}$ along its diagonal, so we have
$$
\boldsymbol{\theta}_{t}:=\boldsymbol{\theta}_{t-1}-\frac{\eta\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}(\theta)}{\sqrt{G_{t-1}+\epsilon}}
$$
Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate. While, its main weakness is its accumulation of the squared gradients in the denominator, since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. In deeplearning4j, AdaGrad is `Updater.ADAGRAD`

.

## AdaDelta

AdaDelta is an extension of AdaGrad that seeks to reduce the aggressive, monotonically decreasing learning rate of AdaGrad. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $win$.

Instead of inefficiently storing $win$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^{2}]_{t}$ at time step $t$ then depends (as a fraction $\alpha$ similarly to the Momentum term) only on the previous average and the current gradient:
$$
E[g^{2}]_{t}=\alpha E[g^{2}]_{t-1}+(1-\alpha^{2})g_{t}^{2}
$$
From AdaGrad, we have
$$\Delta w_{t}=-\frac{\eta\cdot g_{t}}{\sqrt{G_{t}+\epsilon}}
$$
Now, simply replace the diagonal matrix $G_{t}$ with the decaying average over past squared gradients $E[g^{2}]_{t}$:
$$\Delta w_{t}=-\frac{\eta\cdot g_{t}}{\sqrt{E[g^{2}]_{t}+\epsilon}}=-\frac{\eta\cdot g_{t}}{RMS[g]_{t}}
$$
The authors note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, they first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:
$$
E[\Delta w^{2}]_{t}=\alpha\cdot E[\Delta w^{2}]_{t-1}+(1-\alpha)\Delta w_{t}^{2}
$$
The root mean squared error of parameter updates is thus:
$$
RMS[\Delta w]_{t}=\sqrt{E[\Delta w^{2}]_{t}+\epsilon}
$$
Since $RMS[\Delta w]_{t}$ is unknown, it is approximated with the RMS of parameter updates until the previous time step, replacing the learning rate $\eta$ in the previous update rule with $RMS[\Delta w]_{t-1}$, yields the Adadelta update rule:
$$
\begin{aligned}
\Delta w_{t} & =-\frac{RMS[\Delta w]_{t-1}}{RMS[g]_{t}}\cdot g_{t}\newline
\boldsymbol{\theta}_{t+1} & =\boldsymbol{\theta}_{t}+\Delta w_{t}
\end{aligned}
$$
With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule. In deeplearning4j, AdaDelta is `Updater.ADADELTA`

.

## RMSProp

RMSProp is another alternative to resolve AdaGrad’s radically diminishing learning rates, it is identical to the first update vector of AdaDelta
$$
\begin{aligned}
E[g^{2}]_{t} & :=0.9\cdot E[g^{2}]_{t-1}+0.1\cdot g_{t}^{2}\newline
\boldsymbol{\theta}_{t+1} & :=\boldsymbol{\theta}_{t}-\frac{\eta\cdot g_{t}}{\sqrt{E[g^{2}]_{t}+\epsilon}}
\end{aligned}
$$
RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients. In practice, $\alpha$ equals to 0.9, and $\eta$ equals to 0.001 are good default values. In deeplearning4j, RMSProp is `Updater.RMSPROP`

.

## Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_{t}$, Adam also keeps an exponentially decaying average of past gradients $m_{t}$:
$$
\begin{aligned}
m_{t} & =\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\newline
v_{t} & =\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2}
\end{aligned}
$$
where $m_{t}$ and $v_{t}$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, then correcting the biaes by computing bias-corrected first and second moment estimates:
$$
\begin{aligned}
\hat{m}_{t} & =\frac{m_{t}}{1-\beta_{1}^{t}}\newline
\hat{v}_{t} & =\frac{v_{t}}{1-\beta_{2}^{t}}
\end{aligned}
$$
then use these to update the parameter yield the Adam update rule:
$$
\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\frac{\eta\cdot\hat{m}_{t}}{\sqrt{\hat{v}_{t}+\epsilon}}
$$
Normally, in practice, $\beta_{1}=0.9$, $\beta_{2}=0.999$ and $\epsilon=10^{-8}$. In deeplearning4j, Adam is `Updater.ADAM`

.

## kSGD

Kalman-based Stochastic Gradient Descent (kSGD) is an online and offline algorithm for learning parameters from statistical problems from quasi-likelihood models, which include linear models, non-linear models, generalized linear models, and neural networks with squared error loss as special cases. The benefits of kSGD is not sensitive to the condition number of the problem and as a robust choice of hyperparameters as well as has a stopping condition. The drawbacks of kSGD is that the algorithm requires storing a dense covariance matrix between iterations, and requires a matrix-vector product at each iteration. More details please visit its Wikipedia – [link]

## Conclusion

Generally, SGD is totally depends on the gradient of current batch, so it is hard to choose a reasonable learning rate, and it uses the same learning rate for all parameters, which is not suitable for sparse data/features. Moreover, SGD can be captured by local minimum easily or be trapped in saddle point.

Momentum is a way to speed up SGD, reduce oscillation and gain faster convergence. In the initial period, it uses the update parameter of last stage to keep the same descent diretion, and speed up by multiple a large $\alpha$ (momentum factor); In the middle and later periods, when the value osillates around the local minimum, and gradient tends to zero, it enlarges the update amplitude to jump out of the trap; What’s more, when the gradient changes the direction, it inhibits the update.

Nesterov plays a role of correction function when the gradient updates, which avoids the gradient goes forward too fast and increases the sensitivity.

AdaGrad acts as a constraint for learning rate. when the gradient is small in the initial period, the constraint term is large, which is able to magnify the gradient, and when the gradient is large in the later period, the constraint term is small, which restrains the gradient. And it is suitable for handling sparse gradient. However, it relys on the pre-set learning rate $\eta$, if the $\eta$ is too large, which makes contraint term hypersensitive, overamplifies the gradient in the initial period, and makes the gradient tends to zero in the later period and causes the training process terminates early.

AdaDelta is an improvement of AdaGrad, which has good acceleration effect in the prelimilary and middle periods, while in the later period, it ossilates at local minimum.

RMSProp can be treated as a particular case of AdaDelta, but it relys on the global learning rate. RMSProp is suitable for non-stationary target and performs well on RNNs.

Adam combines the merit of AdaGrad, which is good at dealing with sparse gradient, and the merit of RMSProp, which is good at handling non-stationary target, it uses different adaptive learning rates for different parameters. So it is suitable in most non-convex optimization and also appropriate for big dataset and high-dimensional space.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numinator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Some researchers show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice (AdaMax and NAdam are two variants of Adam, which asserts that the performance is better than Adam).