Gradient Descent for Univariate Linear Regression

The purpose of this article is to implement the gradient descent algorithm for univariate linear regression in Python. We first briefly go over the mathematics of the algorithm and then write the Python code that implements it.

The cost function that will be used is the mean squared error (MSE). Let's define some notation that will be used later.

h(x_i) = \theta_0 + \theta_{1}x_i
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}{(h(x_i) - y_i)^2}

h(x_i) is our linear model and J(\theta_0, \theta_1) is our objective function, or cost function, where m is the number of training samples and (x_i, y_i) is the i-th sample.

The purpose of gradient descent is to repeatedly update our parameters, \theta_0 and \theta_1, to minimize the cost function J(\theta_0, \theta_1). To know the direction of each update, we first need the partial derivatives of J(\theta_0, \theta_1) with respect to \theta_0 and \theta_1. The update rules are:

\theta_0 := \theta_0 - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0}
\theta_1 := \theta_1 - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1}
where \alpha is the learning rate. The partial derivatives are computed as follows:

\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0}
= \frac{\partial}{\partial \theta_0}\left(\frac{1}{2m}\sum_{i=1}^{m}{(\theta_0 + \theta_1 x_i - y_i)^2}\right) (plug in h(x_i))
= \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i) (chain rule)
And similarly,
\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1}
= \frac{\partial}{\partial \theta_1}\left(\frac{1}{2m}\sum_{i=1}^{m}{(\theta_0 + \theta_1 x_i - y_i)^2}\right)
= \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x_i - y_i)x_i
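
If you want to double-check this derivation, a quick symbolic computation reproduces both gradients. This is an optional verification that assumes SymPy is installed; the 1/m factor is omitted since it passes through the derivative unchanged.

import sympy as sp

t0, t1, xi, yi = sp.symbols('theta_0 theta_1 x_i y_i')
J_i = (t0 + t1*xi - yi)**2 / 2  # one term of the sum
print(sp.diff(J_i, t0))  # theta_0 + theta_1*x_i - y_i
print(sp.diff(J_i, t1))  # x_i*(theta_0 + theta_1*x_i - y_i)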

To generate some data, we let x be a series from 1 to 100 with a step of 1 and y = 0.6x + 0.2 + \epsilon, where \epsilon \sim N(0, 5^2), i.e., Gaussian noise with standard deviation 5. The code is shown as follows:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)
x = np.arange(1, 101, 1)
y = 0.6*x + 0.2 + np.random.normal(scale=5, size=len(x))

f, ax = plt.subplots(figsize=(16,4))
plt.title("Data Generated")
ax.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

The generated data are plotted.

Now, we need to define some functions for our gradient descent algorithm.

# Hypothesis
def h(x, t0, t1):
    # build the design matrix [1, x] and compute t0 + t1*x for every sample
    x_ = np.array([np.ones(len(x)), x]).T
    y_hat = np.matmul(x_, np.array([t0, t1]))
    return y_hat

# Cost Function
def cost_f(y, y_hat):
    # J = (1/2m) * sum((y_hat - y)^2), i.e. half the mean squared error
    diff = y_hat - y
    diff_sq = np.power(diff, 2)
    cost = 0.5 * np.mean(diff_sq)
    return cost
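
As a quick sanity check of these two functions (an addition here, not part of the main walkthrough), the cost evaluated at the true parameters \theta_0 = 0.2, \theta_1 = 0.6 should sit near half the noise variance, about 0.5 * 25 = 12.5, while a poor guess gives a much larger cost:

# Cost near the noise floor at the true parameters, large at a bad guess
print(cost_f(y, h(x, t0=0.2, t1=0.6)))  # roughly 12.5
print(cost_f(y, h(x, t0=0.0, t1=0.0)))  # much larger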
    
# Gradient Functions
def t0_grad_f(t0, t1, x, y):
    # dJ/d(theta_0) = (1/m) * sum(h(x_i) - y_i)
    diff = t0 + t1*x - y
    t0_grad = np.mean(diff)
    return t0_grad

def t1_grad_f(t0, t1, x, y):
    # dJ/d(theta_1) = (1/m) * sum((h(x_i) - y_i) * x_i)
    diff = (t0 + t1*x - y) * x
    t1_grad = np.mean(diff)
    return t1_grad
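
Before running the full loop, we can also verify the analytic gradients against a centered finite-difference approximation of the cost. This is a sketch of a standard check; eps and the probe point (1.0, 1.0) are arbitrary choices.

# Finite-difference check of the analytic gradients
eps = 1e-6
t0, t1 = 1.0, 1.0
num_t0 = (cost_f(y, h(x, t0 + eps, t1)) - cost_f(y, h(x, t0 - eps, t1))) / (2*eps)
num_t1 = (cost_f(y, h(x, t0, t1 + eps)) - cost_f(y, h(x, t0, t1 - eps))) / (2*eps)
print(num_t0, t0_grad_f(t0, t1, x, y))  # the two numbers should match closely
print(num_t1, t1_grad_f(t0, t1, x, y))  # the two numbers should match closely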

# Gradient Descent
def grad_desc(x, y, t0=0, t1=0, lr=0.0001, max_iter=1000000):
    cost_list = []
    cost = np.inf
    for _ in range(max_iter):  # cap the iterations so the loop always terminates
        t0_grad = t0_grad_f(t0=t0, t1=t1, x=x, y=y)
        t1_grad = t1_grad_f(t0=t0, t1=t1, x=x, y=y)
        t0 = t0 - lr*t0_grad
        t1 = t1 - lr*t1_grad
        y_hat = h(x=x, t0=t0, t1=t1)
        cost_new = cost_f(y, y_hat)
        cost_list.append(cost_new)
        # stop once the cost no longer decreases
        if cost_new >= cost:
            break
        cost = cost_new
    return [t0, t1], cost_list

Then, we can run the function and plot our fitted model:

# Run gradient descent
coef_, cost_list = grad_desc(x=x, y=y)

# Fit data using GD results
y_hat = coef_[0] + x*coef_[1]

# Plot the data and model
f, ax = plt.subplots(figsize=(16,4))
plt.title("Linear Regression based on Gradient Descent")
plt.text(x=0, y=70, s=r"$\theta_0$={}, $\theta_1$={}".format(np.round(coef_[0],4),np.round(coef_[1],4)))
plt.xlabel("x")
plt.ylabel("y")
ax.scatter(x,y)
ax.plot(x,y_hat, color='orange', lw=2)
plt.show()
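
Since grad_desc also returns the cost at every iteration, we can plot the learning curve to confirm that the algorithm converged, and compare the estimates against NumPy's closed-form least-squares fit. Both checks are additions for verification rather than part of the original walkthrough:

# Plot the cost at every iteration
f, ax = plt.subplots(figsize=(16,4))
plt.title("Cost over Iterations")
ax.plot(cost_list)
plt.xlabel("Iteration")
plt.ylabel(r"$J(\theta_0, \theta_1)$")
plt.show()

# Compare with the closed-form least-squares solution
t1_ls, t0_ls = np.polyfit(x, y, deg=1)  # returns [slope, intercept]
print(t0_ls, t1_ls)  # should be close to coef_[0], coef_[1]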

Notes:

  1. For the univariate linear regression problem, the initial parameters \theta_0 and \theta_1 can be set to any value. This is because the cost function is convex: there are no local minima, only a single global minimum.
  2. The learning rate needs to be set carefully. A learning rate that is too large will cause the updates to diverge, while one that is too small will take a long time to converge, as the sketch below shows.
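
To make note 2 concrete, the sketch below runs a fixed number of plain update steps for a few learning rates. The values 1e-5, 1e-4, and 1e-3 are illustrative choices; which rates diverge depends on the scale of x.

# Compare learning rates over a fixed number of simultaneous update steps
for lr in (1e-5, 1e-4, 1e-3):
    t0, t1 = 0.0, 0.0
    for _ in range(30):
        t0, t1 = (t0 - lr*t0_grad_f(t0=t0, t1=t1, x=x, y=y),
                  t1 - lr*t1_grad_f(t0=t0, t1=t1, x=x, y=y))
    print(lr, cost_f(y, h(x=x, t0=t0, t1=t1)))
# Expected pattern: 1e-5 is still far from the noise floor (too slow),
# 1e-4 is close to it (converging), and 1e-3 has blown up (diverging).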
