Gradient descent

A large majority of artificial neural networks are based on the gradient descent algortihm. It is necessary to understand the fundamentals of this algorithm before studying neural networks. Gradient descent is an optimization algorithm for finding the minimum of a function.


# Principle

Let's consider the differentiable function \(f(x)\) to minimize. The gradient descent algorithm starts at an arbitrary position and iteratively converge to the minimum, as illustrated below:

Let's name \( x_0 \) the starting point of the algorithm. To computes the next point \( x_1 \), the gradient descent algorithm calculates the derivative \( f'(x_o) \), as illustrated on the following figure:

As the derivative is the slope of the tangent line to the function at that point, it is generaly a good indicator of how far the point is from the minimum. The next point is calculated according to the following formula:

$$ x_{n+1} = x_n - \alpha.\frac{df}{dx}(x_n) $$ where:

# Step size multiplier

The step size multiplier \( \alpha \) is a parameter to tune. It represents how big will be the step to the next point. The dilemna hidden behind this parameter is that large values will allow fast convergences, but small values will ensure more stability. On the illustration below, a too large step size prevents the algorithm from converging to the expected mimima:

Note that in artificial neural networks, this parameter is no longer called step size ( \( \alpha \) ) , but learning rate ( \( \eta \) ).


# Multidimensional optimization

For the sake of clarity, the algorithm has been presented on a function in one dimension ( \( f: \mathbb{R} \mapsto \mathbb{R} \) ). The previous principles can be extended to multidimentional functions ( \( f: \mathbb{R}^n \mapsto \mathbb{R} \) ). The derivative is then replaced by the gradient of the function. This is why the algorithm was named gradient descent. The multidimensional gradient descent generalization is given by the following equation:

$$ X_{n+1} = X_n - \alpha\nabla f(X_n) $$ where: The gradient is given by: $$ \nabla f(X) = \begin{pmatrix} \frac{df}{dx_1} \\ \frac{df}{dx_2} \\ \vdots \\ \frac{df}{dx_n} \end{pmatrix} $$ The figure below (taken from Hoang Duong blog) illustrates the convergence of the algorithm on a multidimentional function:

# Local minima

When using the gradient descent algorithm, we have to consider the fact that the algorithm can converge to local minima, as illustrated below:

When this algorithm is used for optimizing artificial neural netwoks parameters, this limitation can prevent the network to learn properly. Fortunatly, in practice we consider models with large number of parameters. Generaly, the local minima in a given dimension is not shared with the other dimensions. This explains why artificial neural networks can be applied to complex systems, but this also explains why it's particularly difficult to prove the convergence of theses algorithms.