Best explanation! Batch Gradient Descent, Mini-batch Gradient Descent & Stochastic Gradient Descent

Hello World.

In this article, we’ll cover the gradient descent algorithm and its variants: Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent.



Gradient Descent

Gradient descent is a very popular Optimization algorithm used in Machine Learning and Deep Learning. The optimization algorithm in the case of ML has certain goals to fulfill:


  • Searching for the global minima of the objective function(cost function). This is only possible when the objective function is convex, which indirectly means that the function will be of bowl-shaped and any local minima will be global minima.

  • In the case of Non-convex function(not a bowl-shaped function), gradient descent finds the nearest minima across it and this minimum of a function is called local minima.


Gradient descent is a first-order optimization algorithm. This signifies that while updating the parameters it only considers the first derivative of the function. Our main aim is to make the gradient travel in the direction of the steepest slope for this on each iteration, we update the parameters in the opposite direction of the gradient of the objective function w.r.t the parameters. One more important factor is the size of steps we are taking to reach the local minimum which is determined by the learning rate α.


Let's dive into the gradient descent algorithm working on Logistic regression. For simplicity, let's assume that logistic regression has only 2 parameters, Weight w, and Bias b.


1. Assign initial values of weight and bias to be any random number.


2. Choose the value of α(learning rate). It will determine the step size on each iteration.

  • If the learning rate α happened to be too small, the gradient will take a long time to converge and become computationally expensive.

  • If the learning rate α happened to be too large, the gradient will surpass the minima and may never converge.


3. Therefore, we have to choose the optimum learning rate α so that our gradient converges well. We can see the convergence rate by plotting the cost function against different values of the learning rate α and consider the learning rate that is right before the value that didn't converge well. If we do it, we would achieve a very fast learning rate that converges well.


Let's look at the graphical representation for what we have discussed above.


Figure 1: Graph of gradient descent against different learning rates.Source


  • Some of the most commonly practiced Learning rate values are: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.5

Here is one more mistake people do, they don't normalize(Making the data points constrained in between some values) the data in costs the learning rate to converge slower because the level curves (contours) would be narrower and taller which in turn would take longer time to converge. Therefore make sure you normalize the data points.


Figure 2: Gradient descent: Convergence in normalized versus unnormalized level contours.


You can find a z-score normalization Here.


4. By taking the partial derivative of the cost function with respect to parameters on each iteration we get the gradient scores.


After the gradient calculation, the updated equations will be:


For diagram illustration, we shall assume that we only have weight, not bias. If the slope of the particular value of weight(w)> 0, it signifies that we are to the right of the optimal w*. In this case, the update will be negative, and w will start getting closer to the optimal w*. However, If the slope of the particular value of weight(w)< 0, the update will be positive and will increase the current values to w to converge to the optimal values of w*.


  • Repeat the method until the cost function comes to convergence i.e the error curve becomes flat and doesn't change any more.


Now that when we know all about gradient descent, let's dive into gradient descent's three variants. The three variants namely are:


  1. Batch Gradient Descent

  2. Mini-batch Gradient Descent

  3. Stochastic Gradient Descent

The main distinguishing factor between the three of them is the amount of data intake we do for computing the gradients at each step. The trade-off between them is the accuracy of the gradient versus the time complexity to perform each parameter’s update (learning rate).



Batch Gradient Descent


In Batch Gradient Descent, the whole sum of training examples is passed through a neural network at a time and the parameters(weights) are updated for the whole training example. Therefore, for every update we do in weights, we have to sum over all training examples and then update the weight with this equation:

Python code Snippet for the above equation:

for i in range(num_epochs):
    grad = compute_gradient(data, params)
    params = params — learning_rate * grad


Mini-batch Gradient Descent


In mini-batch, the name only gives us enough references that instead of considering all the training examples, it considers summing up to over the lower number of examples based on the batch size given. Therefore learning takes place on each mini-batch of b examples.


mini-batch gradient descent can be written in the equation form like this:


Python code Snippet for the above equation:

for i in range(num_epochs):
    np.random.shuffle(data)
    for batch in radom_minibatches(data, batch_size=32):
        grad = compute_gradient(batch, params)
        params = params — learning_rate * grad

Some of the key points:


  • To avoid pre-existing order of example, always shuffle the dataset

  • The training dataset can be partitioned into b mini-batches based on its batch size. After dividing the training examples into batches evenly, if there are some of the training examples left, they will become a separate batch.


We can tune the batch size according to our ram capacity. Usually, it is chosen to be as a power of 2 for example 32, 64, 128, 256, 512, etc. It is done because many computing hardware like GPUs achieves better run time with common batch sizes such as the power of 2.



Stochastic Gradient Descent


In Stochastic Gradient Descent, each training example is passed through a neural network at a time and the parameter(weight) is updated for that particular training example. The parameters of all the layers of the network are updated after every training sample. For example, if the particular training set contains 200 samples, then it's corresponding parameters are updated 200 times i.e. one time after every individual example is passed through the network.


The mathematical equation of Stochastic Gradient Descent:

Python code Snippet for the above equation:

for i in range(num_epochs):
    np.random.shuffle(data)
    for example in data:
        grad = compute_gradient(example, params)
        params = params — learning_rate * grad

Some of the key points:


  • To avoid pre-existing order of example, always shuffle the dataset

  • The training dataset can be partitioned into m examples.

  • As we are considering each example individually, the noise or the shakiness of the gradient will be more while converging. This increases the run-time of this algorithm.


Let's have a look at gradient descent’s variants and their direction towards the minimum:




With this, I conclude my gradient descent explanation article. For beginners, I have made a simpler blog of gradient descent (Part-1), have a look at it HERE.


If you liked my writing style and explanation, please do consider liking, sharing and subscribing to my website for future feeds like this.


724 views0 comments

Recent Posts

See All