
Linear Regression with Gradient Descent

In your quest to learn machine learning, this is probably the first and simplest prediction model you will learn. Each one of these words has a meaning! Let’s break it down:

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

So, if you have two variables and they have a relationship, you can use this to create a prediction model.

The classic example is Housing Prices. The bigger the house is, the pricier it gets. So one variable could be Area Size and the other Price. We can use Linear Regression to predict the price of a house based on its size!

Disclaimer
Of course, houses have way more variables than that.
Things like the number of bedrooms, number of bathrooms, the neighborhood and city, year of construction, and many other parameters influence the price.
This is not the best model for this prediction – it is just the simplest!

When we say linear, what we mean is that we are going to fit the data with a linear equation. On one axis you have the house area, and on the other the price!

Imagine that, by the end, we have a mathematical function to which you just provide the house area and get back the price. Something like this:

Y = aX + b

Where:

  • Y is the house price
  • X is the house area
  • a is the slope of the line
  • b is the intercept (value of Y when X=0)

In code it will look like this:

def predict_house_price(area):
    # a (slope) and b (intercept) are the parameters we still need to find
    price = area * a + b
    return price

price = predict_house_price(area)

Seems easy!

Calm down. We have to understand another thing: cost functions!!

Let’s begin with an easy example:

  • Consider that a house with an area of 50m² is priced at US$100,000 (we wish!).
  • Now consider that another house with an area of 75m² is priced at US$125,000.
  • And last, another one with an area of 100m² is priced at US$150,000.

This is easy: the function would be Price = Area * 1,000 + 50,000. This is the graph:

Easy example of Linear equation
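
As a quick sanity check, here is a tiny sketch that plugs those three areas into the line above:

def easy_price(area):
    # Price = Area * 1,000 + 50,000
    return area * 1000 + 50000

for area, expected in [(50, 100_000), (75, 125_000), (100, 150_000)]:
    assert easy_price(area) == expected
    print(area, "m² ->", easy_price(area))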

But data in the real world isn’t as easy, and house prices are not influenced only by their area. For example, houses with bigger areas could mean more bedrooms and more bathrooms, or it could mean a pool! These details are what complicate this prediction. Consider the following graph:

Not so simple example of Linear equation

Now a simple straight line can’t really match and predict correctly every single dot in the graph.

This is where we get more technical.

We can try to quantify the accuracy of this function by measuring the distance of every dot to the line. This is the cost function. The function for the line is called the hypothesis.

Cost Function

It is the average of the differences between the results of the hypothesis (with inputs from X) and the actual outputs Y.

This function gives us the average accuracy of all the predictions we make with the straight line.

For this article, we’ll implement a type of Mean Squared Error, but keep in mind there are other types of cost functions.

For every X we’ll do the following:

  • Run the hypothesis > (Y = aX + b)
  • Get the difference > predictedY – actualY
  • Square it > difference*difference
  • Add it to all errors

At the end, we divide by the number of data points (i.e. the number of houses/prices) in order to get the average. You can also take half of that average (dividing by 2m instead of m), which is a common convention when we’re going to use gradient descent, because the 1/2 cancels the 2 that appears when we take the derivative of the square (more on this later).

In Python, this is what it would look like:

def cost_function(
    y,              # Target variable
    predictions,    # Model predictions
    m               # Number of training examples
):
    """
        Calculate the cost function for linear regression.
        J(theta) = (1/(2*m)) * sum(errors^2)
    """
    sumErrorsSquared = 0.0
    for i in range(m):
        sumErrorsSquared += (predictions[i] - y[i]) ** 2.0

    return (1.0 / (2.0 * m)) * sumErrorsSquared
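
As a quick illustration of how this behaves, using the easy dataset from before: predictions from the perfect line give a cost of zero, while a worse line gives a much larger number (a small sketch, with a made-up "worse" line):

areas = [50, 75, 100]
prices = [100_000, 125_000, 150_000]

# Predictions from the perfect line: Price = Area * 1,000 + 50,000
perfect = [a * 1000 + 50000 for a in areas]
print(cost_function(prices, perfect, len(areas)))   # 0.0

# Predictions from a worse, made-up line: Price = Area * 2,000
worse = [a * 2000 for a in areas]
print(cost_function(prices, worse, len(areas)))     # much larger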

We square the difference for two reasons:

  • Negative errors cancel out positive ones

Sometimes we predict the value Y as being higher than the real Y, and other times we predict Y as being lower. By squaring it, we make sure it does not matter if the difference is positive or negative.

  • Penalize bigger mistakes more

Big errors become huge once squared, so they hurt the cost far more than small ones, and the model is pushed harder to avoid them.
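
A tiny numeric sketch of both points:

errors = [10, -10]

# Without squaring, the two errors cancel out and the average looks perfect
print(sum(errors) / len(errors))                   # 0.0

# With squaring, both mistakes still count
print(sum(e ** 2 for e in errors) / len(errors))   # 100.0

# Squaring also punishes one big mistake far more than a small one
print(2 ** 2, 20 ** 2)                             # 4 400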

Okay, but why are you telling me about this?

Gradient Descent

Gradient descent is a method for unconstrained mathematical optimization. It is an iterative algorithm to minimize a function.

Can you see where we’re going with this?

We’re going to start out with a random arbitrary hypothesis, run our cost function, and then run gradient descent in order to minimize the value of this cost function. This will increase the accuracy of our hypothesis!

Gradient descent works by the following:

  • It begins with arbitrary parameter values
  • It calculates the loss (cost function)
  • It decides which way to go – in this case, descent means it wants to minimize the value
  • It takes another step in that direction, and the step size is the learning rate
  • It calculates again, and repeats until it runs out of steps

If you (for some reason?) want the mathematical formula:

repeat until convergence {
    θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)      (for j = 0 and j = 1)
}

This basically means:

  • repeat until convergence means that we’ll repeat the following steps until the values of the parameters stop changing significantly (convergence is reached).
  • θⱼ is the parameter being updated in every repetition.
  • α is the learning rate: a small positive value that controls the size of the update step.
  • ∂ is the partial derivative symbol. It means we’re taking the derivative of a function with respect to one variable, while keeping the others constant.
  • ∂/∂θⱼ J(θ₀, θ₁) is the partial derivative of the cost function J(θ₀, θ₁) with respect to θⱼ. In other words, how much the cost would change if we nudged θⱼ slightly.
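
For our specific cost function (the halved Mean Squared Error above), working out those partial derivatives gives exactly the update rules you’ll see in the docstring of the code below:

∂/∂θ₀ J(θ₀, θ₁) = (1/m) · Σ (hypothesis(xᵢ) − yᵢ)
∂/∂θ₁ J(θ₀, θ₁) = (1/m) · Σ (hypothesis(xᵢ) − yᵢ) · xᵢ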

I find code easier to understand, so here it is:

def gradient_descent(
    X,                              # Input features
    y,                              # Target variable
    alpha,                          # Learning rate
    steps                           # Number of iterations
):
    """
    Perform gradient descent to find the best fitting line for linear regression.
    """

    theta0 = 0.0
    theta1 = 0.0
    m = len(X)

    for s in range(steps):
        theta0, theta1 = gradient_descent_step(X, y, m, (theta0, theta1), alpha)

    return theta0, theta1

And now for each step:

def gradient_descent_step(
        X,          # Input features
        y,          # Target variable
        m,          # Number of training examples
        thetas,     # Tuple of (theta0, theta1)
        alpha       # Learning rate
):
    """
        Perform a single step of gradient descent.
        theta0 = theta0 - alpha * (1/m) * sum(errors)
        theta1 = theta1 - alpha * (1/m) * sum(errors * X)
    """
    theta0, theta1 = thetas

    sumHypothesisMinusValue = 0
    sumHypothesisMinusValueTimesX = 0
    for i in range(m):
        hypothesis_value = hypothesis(theta0, theta1, X[i])
        error = hypothesis_value - y[i]
        sumHypothesisMinusValue += error
        sumHypothesisMinusValueTimesX += error * X[i]

    theta0 = theta0 - (alpha * 1 / m) * sumHypothesisMinusValue
    theta1 = theta1 - (alpha * 1 / m) * sumHypothesisMinusValueTimesX

    return theta0, theta1
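
The hypothesis function called above isn’t shown in the snippet, but it’s just the straight line from the beginning of the article (with theta0 as the intercept and theta1 as the slope):

def hypothesis(theta0, theta1, x):
    # Our straight line: predicted price = intercept + slope * area
    return theta0 + theta1 * x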

Real world example

In this repository, I have a script that gets the data from a local real estate broker in my hometown (along with all the code shared in this article!).

This results in a bunch of real-world data from houses in the market at the time of recording.

First rows in my data source

First, we begin by importing this data:

import pandas as pd

data = pd.read_csv(filename)
X = data['area'].tolist()
y = data['price'].tolist()

When working with data, it is always good to run some data processing and cleaning.

  • First, we remove the outliers
def remove_outliers(X, y, threshold=3.0):
    def z_scores(values):
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values], mean, std

    x_z, x_mean, x_std = z_scores(X)
    y_z, y_mean, y_std = z_scores(y)

    filtered = [
        (xi, yi)
        for xi, yi, zx, zy in zip(X, y, x_z, y_z)
        if abs(zx) <= threshold and abs(zy) <= threshold
    ]

    if not filtered:
        raise ValueError("All data removed as outliers!")

    X_filtered, y_filtered = zip(*filtered)

    return X_filtered, y_filtered

Then, because the values are big (upwards of hundreds of thousands), we can run into numerical issues in Python (especially since we’re squaring some values).

  • To fix this, we scale the numbers down.
def minmax(X, y):
    x_min, x_max = min(X), max(X)
    y_min, y_max = min(y), max(y)

    X_scaled = [(xi - x_min) / (x_max - x_min) for xi in X]
    y_scaled = [(yi - y_min) / (y_max - y_min) for yi in y]

    return X_scaled, y_scaled, x_min, x_max, y_min, y_max
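
Putting the two preprocessing steps together, the cleaning stage might look like this (a minimal sketch using the two functions above):

# Remove extreme points, then squash both axes into the [0, 1] range
X_clean, y_clean = remove_outliers(X, y)
X_scaled, y_scaled, x_min, x_max, y_min, y_max = minmax(X_clean, y_clean)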

After the data is cleaned, we can run gradient descent.

theta0_scaled, theta1_scaled = gradient_descent(
    X_scaled, 
    y_scaled, 
    alpha=alpha, 
    steps=steps
)
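
Here, alpha and steps are hyperparameters you choose yourself. The values below are purely illustrative assumptions, not tuned recommendations:

alpha = 0.01    # hypothetical learning rate
steps = 10_000  # hypothetical number of iterations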

The theta0 and theta1 mentioned in the code are the a and b we discussed previously: theta0 is the intercept (b) and theta1 is the slope (a).

We must scale them back to the original units if we want to use these values to make real predictions.

  • Scaling the thetas back:
def unscale_thetas(theta0_scaled, theta1_scaled, x_min, x_max, y_min, y_max):
    """
    Unscale the thetas back to the original scale.
    This is necessary after performing gradient descent on scaled data.
    """
    x_range = x_max - x_min
    y_range = y_max - y_min

    theta1_unscaled = (y_range / x_range) * theta1_scaled
    theta0_unscaled = y_min + y_range * (theta0_scaled - theta1_scaled * (x_min / x_range))
    return theta0_unscaled, theta1_unscaled
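
Calling it with the values we already have from the scaling and training steps gives us thetas that work directly on raw areas and prices:

theta0, theta1 = unscale_thetas(
    theta0_scaled, theta1_scaled,
    x_min, x_max, y_min, y_max
)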

And that’s it! 🎉🍾👏

This is an example of a plot I got from running my code:

Plot

And these are the thetas (a and b):

  • theta0: 104727.42003321546
  • theta1: 7699.98612392038

Now, if you want to implement the predict_house_price method from the beginning of this article:

def predict_house_price(area):
    theta0 = 104727.42003321546   # learned intercept (b)
    theta1 = 7699.98612392038     # learned slope (a)

    price = theta0 + theta1 * area
    return price
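
For example, plugging in a hypothetical 80m² house:

# 104,727.42 + 7,699.99 * 80 ≈ 720,726 (in whatever currency the dataset uses)
print(predict_house_price(80))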

Conclusion

Take a moment to look at my repository if you want to run this code on your own dataset.

Keep in mind, this process is highly iterative, and a different number of steps or a bigger or smaller learning rate may produce better or worse results.

In the end, house prices cannot be predicted with only one variable, so this is more of a thought experiment than a real model for prediction.

Keep learning!
