Understanding Gradient Descent Optimization Algorithm



What Is Gradient Descent?

In this article, we will explore how gradient descent works through a practical example of simple linear regression.

Gradient is a measure of the increase or decrease in the magnitude of a property.

Descent indicates a drop in that property. The gradient descent algorithm, in simple words, is an algorithm that minimizes a cost function by iteratively adjusting its parameters until it reaches a local minimum.

Before we delve into gradient descent and how it works, let us briefly talk about the cost function.

Cost Function

The cost function is a way of determining the accuracy of a model. It measures the separation between the actual and predicted values and expresses it as a single number.

In the case of regression, this separation should be as small as possible, which means the cost function should be minimized.

Cost function for simple linear regression
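
In symbols, the cost function that the code below implements for a hypothesis h(x) over m training examples is the halved mean squared error:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( h(x_i) - y_i \big)^2

where h(x_i) is the predicted value for the i-th example and y_i is the actual value; the smaller J(\theta) is, the closer the predictions are to the actual values.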

Working of Gradient Descent

Gradient descent minimizes the cost function by simultaneously updating both parameters, the coefficient and the intercept, until it reaches a local minimum.

Gradient descent algorithm
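
In symbols, each iteration moves a parameter θ a small step against the slope of the cost function, where α is the learning step and m is the number of training examples. For the coefficient-only model of the next section, this is exactly the update the gradient_descent function below performs:

\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta} = \theta - \frac{\alpha}{m} \sum_{i=1}^{m} \big( h(x_i) - y_i \big)\, x_i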

1. Modelling Without Intercept

In this section, we will implement gradient descent considering only the coefficient (i.e., intercept=0).

Code:

def hypothesis(theta, x):
    # Prediction of the model without an intercept: y_hat = theta * x
    return theta * x


def cost_function(theta, x, y):
    # Halved mean squared error between predictions and actual values
    hypo = hypothesis(theta, x)
    cf = sum(pow(hypo - y, 2)) / (2 * len(x))
    return cf


def gradient_descent(learning_step, start_theta, no_of_iterations, x, y):
    thetas = []            # coefficient value after each iteration
    cost_functions = []    # cost value after each iteration
    theta = start_theta
    m = len(x)
    thetas.append(theta)
    cost_functions.append(cost_function(theta, x, y))
    for i in range(no_of_iterations):
        h = hypothesis(theta, x)
        # Move theta a small step against the gradient of the cost function
        new_theta = theta - (learning_step / m) * sum((h - y) * x)
        theta = new_theta
        cf = cost_function(theta, x, y)
        thetas.append(theta)
        cost_functions.append(cf)
    return thetas, cost_functions
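
The article's X_train and Y_train arrays are not shown, so the minimal usage sketch below fills in made-up stand-in values (assuming NumPy arrays) and plots how the recorded cost changes over the iterations; the data and plotting details are illustrative only.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data (the article's actual X_train/Y_train are not shown)
X_train = np.array([1500.0, 2000.0, 2500.0, 3000.0, 3500.0])
Y_train = np.array([300.0, 410.0, 495.0, 610.0, 690.0])

thetas, costs = gradient_descent(0.00000001, -400, 100, X_train, Y_train)

plt.plot(costs)                  # cost recorded after each iteration
plt.xlabel("Iteration")
plt.ylabel("Cost function")
plt.show()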

Result:

Cost function

The above chart shows the cost function plotted for a set of coefficient values ranging from -400 to 850. It can be observed that the cost function reaches its minimum at a certain value of the coefficient.

t,c=gradient_descent(0.00000001,-400,100,X_train,Y_train)
Gradient descent - 1

Here, gradient descent starts with a coefficient value of -400 and works iteratively to bring down the value of the cost function over 100 iterations.

t,c=gradient_descent(0.00000001,800,100,X_train,Y_train)
Gradient descent - 2

In this case, gradient descent starts with a coefficient value of 800 and runs over 100 iterations.

Both cases use the same learning step.

How is gradient descent able to work in two different directions?

You may be wondering why it works differently in the above two cases. They clearly show that gradient descent can work in two different directions: in plot 1 it moves from left to right, while in plot 2 it moves from right to left.

This is because of the derivative term that is subtracted from the current value of the coefficient (refer to the gradient descent algorithm under the Working of Gradient Descent section).

This derivative is simply the slope of the curve at a given point. In the first case, the slope of the curve is negative, which results in a positive term being added to the coefficient, so its value increases with each iteration. In the second case, the slope of the curve is positive, so the value of the coefficient keeps decreasing.
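
In symbols, each iteration computes

\theta_{\text{new}} = \theta - \alpha \cdot \frac{dJ}{d\theta}

so a negative slope makes the subtracted term negative and the coefficient grows (the descent moves right), while a positive slope makes the term positive and the coefficient shrinks (the descent moves left). With purely illustrative numbers, if \theta = -400, \alpha = 0.01 and dJ/d\theta = -50, the update gives \theta_{\text{new}} = -400 - 0.01 \times (-50) = -399.5, a step to the right.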

What if the learning step of gradient descent is set too high?

t,c=gradient_descent(0.0000005,-400,10,X_train,Y_train)
Learning step value is too high

For gradient descent to work correctly, it is important to set the learning step to an appropriate value. A large learning step means big parameter updates.

If the learning step is set too high, the algorithm may never reach the local minimum because it bounces back and forth across the convex curve, as in the scenario above.

If the learning step is very small, the algorithm will eventually find the local minimum, but it may converge quite slowly.
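
One quick, informal way to see this in practice is to compare the recorded cost values for a small and a large learning step; if the cost grows from one iteration to the next, the step is too large. The sketch below reuses the hypothetical X_train and Y_train arrays from earlier, with illustrative step values.

# Compare cost trajectories for a small and a large learning step (illustrative values)
_, costs_small = gradient_descent(0.00000001, -400, 10, X_train, Y_train)
_, costs_large = gradient_descent(0.0000005, -400, 10, X_train, Y_train)

print("small step, cost keeps decreasing:", all(b <= a for a, b in zip(costs_small, costs_small[1:])))
print("large step, cost keeps decreasing:", all(b <= a for a, b in zip(costs_large, costs_large[1:])))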

2. Modelling With Intercept

Here, we allow both the coefficient and the intercept to take non-zero values.

Code:

def hypothesis2(coef, intercept, x):
    # Prediction of the full model: y_hat = coef * x + intercept
    return coef * x + intercept


def cost_function2(coef, intercept, x, y):
    # Halved mean squared error between predictions and actual values
    hypo = hypothesis2(coef, intercept, x)
    cf = sum(pow(hypo - y, 2)) / (2 * len(x))
    return cf


def gradient_descent2(learning_step, start_coef, start_intercept, no_of_iterations, x, y):
    coefs = []             # coefficient value after each iteration
    intercepts = []        # intercept value after each iteration
    cost_functions = []    # cost value after each iteration
    coef = start_coef
    intcp = start_intercept
    m = len(x)
    coefs.append(coef)
    intercepts.append(intcp)
    cost_functions.append(cost_function2(coef, intcp, x, y))
    for i in range(no_of_iterations):
        h = hypothesis2(coef, intcp, x)
        # Simultaneous update: both gradients use the predictions h computed
        # from the old parameter values before either parameter changes
        new_coef = coef - (learning_step / m) * sum((h - y) * x)
        new_intcp = intcp - (learning_step / m) * sum(h - y)
        coef = new_coef
        intcp = new_intcp
        cf = cost_function2(coef, intcp, x, y)
        coefs.append(coef)
        intercepts.append(intcp)
        cost_functions.append(cf)
    return coefs, intercepts, cost_functions
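
As a rough usage sketch, the call below reuses the hypothetical X_train and Y_train arrays from the earlier example with illustrative starting values and prints the final parameters.

coefs, intercepts, costs = gradient_descent2(0.00000001, -400, 0, 100, X_train, Y_train)

print("final coefficient:", coefs[-1])
print("final intercept:", intercepts[-1])
print("final cost:", costs[-1])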

Result:

Cost function

In this case, the cost function depends not only on the coefficient but also on the intercept.

The above plot shows the cost function plotted over a range of values of the coefficient and the intercept. Note that the graph has a global minimum.

Gradient descent

The above image summarizes how gradient descent works when both the coefficient and the intercept participate.

It can be clearly seen that both parameters are updated simultaneously until the cost function reaches its local minimum.
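
Written out, the simultaneous update that the code above performs at each iteration, with coefficient \theta_1, intercept \theta_0, learning step \alpha and hypothesis h(x) = \theta_1 x + \theta_0, is:

\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \big( h(x_i) - y_i \big)\, x_i \qquad \theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \big( h(x_i) - y_i \big)

Both gradients are evaluated with the predictions from the current parameter values before either parameter is changed.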

Check your knowledge:

For each question, choose the best answer. The answer key is below.

  1. Gradient descent works by ______ updating of parameters.
    • simultaneous
    • consecutive
  2. Gradient descent works very well if the learning step is set to a very large value.
    • True
    • False

Answer Key

  1. simultaneous
  2. False

Interpreting Your Score

If you got 2 correct answers: Well done!

This content is accurate and true to the best of the author’s knowledge and is not meant to substitute for formal and individualized advice from a qualified professional.

© 2021 Riya Bindra
