In your quest to learn machine learning, this is probably the first and simplest prediction model you will learn. Each one of these words has a meaning! Let’s break it down:
Linear Regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
So, if you have two variables and they have a relationship you can use this to create a prediction model.
The classic example is Housing Prices. The bigger the house is, the pricier it gets. So one variable could be Area Size and the other Price. We can use Linear Regression to predict the price of a house based on its size!
Disclaimer
Of course, houses have way more variables than that.
Things like the number of bedrooms, number of bathrooms, the neighborhood and city, year of construction, and many other parameters influence the price.
This is not the best model for this prediction – it is just the simplest!
When we say linear, what we mean is that we are going to fit a linear equation to the data. On one axis you have the house area, and on the other the price!
Imagine that, by the end, we have a mathematical function where you just provide the house area and get back the price. Something like this:

Y = aX + b

Where:
- Y is the house price
- X is the house area
- a is the slope of the line
- b is the intercept (value of Y when X=0)
In code it will look like this:
def predict_house_price(area):
    price = area * a + b
    return price

price = predict_house_price(area)
Seems easy!
Calm down. We have to understand another thing: cost functions!!
Let’s begin with an easy example:
- Consider that a house with an area of 50m² is priced at US$ 100,000 (we wish!).
- Now consider that another house with an area of 75m² is priced at US$ 125,000.
- And last, another one with an area of 100m² is priced at US$ 150,000.
This is easy: the function would be Price = Area * 1000 + 50,000. This is the graph:
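Just to sanity-check that toy formula in code (this snippet is mine, reusing the predict_house_price sketch from above with a = 1000 and b = 50,000):

a = 1000      # slope: every extra square meter adds US$ 1,000
b = 50_000    # intercept: the price when the area is 0

def predict_house_price(area):
    price = area * a + b
    return price

print(predict_house_price(50))   # 100000
print(predict_house_price(75))   # 125000
print(predict_house_price(100))  # 150000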
But data in the real world isn’t this tidy, and house prices are not influenced only by their area. For example, a bigger area could mean more bedrooms and more bathrooms, or it could mean a pool! These details are what complicate the prediction. Consider the following graph:
Now a simple straight line can’t really match and predict correctly every single dot in the graph.
This is where we get more technical.
We can try to quantify the accuracy of this function by measuring the distance from every dot to the line. That measurement is the cost function. The function for the line itself is called the hypothesis.
Cost Function
It is the average difference between the results of the hypothesis (run on every input X) and the actual outputs Y.
This function gives us a single number telling us, on average, how far off the straight line’s predictions are.
For this article, we’ll implement a type of Mean Squared Error, but keep in mind there are other types of cost functions.
For every X we’ll do the following:
- Run the hypothesis > (Y = aX + b)
- Get the difference > predictedY – actualY
- Square it > difference*difference
- Add it to a running sum of errors
At the end, we divide by the number of data points (e.g. the number of houses/prices) in order to get the average. You can also divide by twice that number (half of the average) when you’re going to use gradient descent, since the extra 1/2 cancels out nicely when taking derivatives (more on this later).
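Written out as a formula (this is exactly what the code below implements), the cost for parameters a and b is:

J(a, b) = (1 / (2m)) · Σᵢ (predictedYᵢ − actualYᵢ)²

where m is the number of data points and the sum runs over all of them.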
In Python, this is what it would look like:
def cost_function(
    y,            # Target variable
    predictions,  # Model predictions
    m             # Number of training examples
):
    """
    Calculate the cost function for linear regression.
    J(theta) = (1/(2*m)) * sum(errors^2)
    """
    sumErrorsSquared = 0.0
    for i in range(m):
        sumErrorsSquared += (predictions[i] - y[i]) ** 2.0
    return (1.0 / (2.0 * m)) * sumErrorsSquared
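As a quick illustration (my own numbers, using the toy data from before), here is the cost of a slightly “wrong” line, Price = Area * 900 + 50,000, versus the perfect one:

X = [50, 75, 100]
y = [100_000, 125_000, 150_000]

# A slightly "wrong" line: Price = Area * 900 + 50,000
predictions = [x * 900 + 50_000 for x in X]
print(cost_function(y, predictions, len(X)))  # ~30,208,333 -- far from zero

# The "perfect" line from before: Price = Area * 1000 + 50,000
perfect = [x * 1000 + 50_000 for x in X]
print(cost_function(y, perfect, len(X)))      # 0.0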
We square the difference for two reasons:
- Negative errors cancel out positive ones
Sometimes we predict the value Y as being higher than the real Y, and other times we predict Y as being lower. By squaring it, we make sure it does not matter if the difference is positive or negative.
- Penalize bigger mistakes more
Squaring amplifies big errors: an error of 10 contributes 100 to the sum, while an error of 2 contributes only 4. So a few large mistakes hurt the cost far more than many small ones.
Okay, but why are you telling me about this?
Gradient Descent
Gradient descent is a method for unconstrained mathematical optimization. It is an iterative algorithm to minimize a function.
Can you see where we’re going with this?
We’re going to start out with a random arbitrary hypothesis, run our cost function, and then run gradient descent in order to minimize the value of this cost function. This will increase the accuracy of our hypothesis!
Gradient descent works as follows:
- It begins with arbitrary parameter values
- It calculates the loss (the cost function)
- It decides which way to go – in this case, descent means it wants to minimize the value
- It takes a step in that direction; the step size is the learning rate
- It calculates again, and repeats until it runs out of steps
If you (for some reason?) want the mathematical formula:

repeat until convergence: θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)   (simultaneously for j = 0 and j = 1)

This basically means:
- repeat until convergence means that we’ll repeat the following steps until the values of the parameters stop changing significantly (convergence is reached).
- θⱼ is the parameter being updated in every repetition
- α is the learning rate. A small positive value that controls the size of the update step.
- ∂ is a partial derivative. It means we’re taking the derivative of a function with respect to one variable, while keeping the others constant.
- ∂/∂θⱼ J(θ₀, θ₁) is the partial derivative of the cost function J(θ₀, θ₁) with respect to θⱼ. In other words, how much the cost would change if we nudged θⱼ slightly.
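If you plug our cost function into that formula and work out the partial derivatives (a standard calculus step I’m filling in here), the two update rules become:

θ₀ := θ₀ − α · (1/m) · Σᵢ (h(xᵢ) − yᵢ)
θ₁ := θ₁ − α · (1/m) · Σᵢ (h(xᵢ) − yᵢ) · xᵢ

where h(xᵢ) = θ₀ + θ₁·xᵢ is the hypothesis. These are exactly the two sums accumulated in the code below.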
I find code easier to understand, so here it is:
def gradient_descent(
    X,      # Input features
    y,      # Target variable
    alpha,  # Learning rate
    steps   # Number of iterations
):
    """
    Perform gradient descent to find the best fitting line for linear regression.
    """
    theta0 = 0.0
    theta1 = 0.0
    m = len(X)
    for s in range(steps):
        theta0, theta1 = gradient_descent_step(X, y, m, (theta0, theta1), alpha)
    return theta0, theta1
And now for each step:
def gradient_descent_step(
    X,       # Input features
    y,       # Target variable
    m,       # Number of training examples
    thetas,  # Tuple of (theta0, theta1)
    alpha    # Learning rate
):
    """
    Perform a single step of gradient descent.
    theta0 = theta0 - alpha * (1/m) * sum(errors)
    theta1 = theta1 - alpha * (1/m) * sum(errors * X)
    """
    theta0, theta1 = thetas
    sumHypothesisMinusValue = 0
    sumHypothesisMinusValueTimesX = 0
    for i in range(m):
        hypothesis_value = hypothesis(theta0, theta1, X[i])
        error = hypothesis_value - y[i]
        sumHypothesisMinusValue += error
        sumHypothesisMinusValueTimesX += error * X[i]
    theta0 = theta0 - (alpha / m) * sumHypothesisMinusValue
    theta1 = theta1 - (alpha / m) * sumHypothesisMinusValueTimesX
    return theta0, theta1
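The hypothesis function used inside the loop isn’t shown above; a minimal version, matching the Y = aX + b line we defined earlier, would be:

def hypothesis(theta0, theta1, x):
    # The straight line itself: Y = aX + b, with b = theta0 and a = theta1
    return theta0 + theta1 * x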
Real world example
In this repository, I have a script that gets its data from a local real estate broker in my hometown (along with all the code shared in this article!).
This results in a bunch of real-world data from houses that were on the market at the time the data was collected.
First, we begin by importing this data:
import pandas as pd
data = pd.read_csv(filename)
X = data['area'].tolist()
y = data['price'].tolist()
When working with data, it is always good to run some data processing and cleaning.
- First, we remove the outliers using a simple z-score filter:
def remove_outliers(X, y, threshold=3.0):
    def z_scores(values):
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values], mean, std

    x_z, x_mean, x_std = z_scores(X)
    y_z, y_mean, y_std = z_scores(y)

    # Keep only the (x, y) pairs whose z-scores are both within the threshold
    filtered = [
        (xi, yi)
        for xi, yi, zi_x, zi_y in zip(X, y, x_z, y_z)
        if abs(zi_x) <= threshold and abs(zi_y) <= threshold
    ]

    if not filtered:
        raise ValueError("All data removed as outliers!")

    X_filtered, y_filtered = zip(*filtered)
    return X_filtered, y_filtered
Then, because the values are big (upwards of hundreds of thousands), we can run into numerical trouble – especially since we’re squaring some values, which makes the errors enormous and can cause gradient descent to overflow or diverge.
- To fix this, we scale the numbers down with min-max scaling:
def minmax(X, y):
    x_min, x_max = min(X), max(X)
    y_min, y_max = min(y), max(y)
    X_scaled = [(xi - x_min) / (x_max - x_min) for xi in X]
    y_scaled = [(yi - y_min) / (y_max - y_min) for yi in y]
    return X_scaled, y_scaled, x_min, x_max, y_min, y_max
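Putting the two cleaning steps together (these two lines are mine; the variable names are chosen to match the snippet that follows):

X_clean, y_clean = remove_outliers(X, y)
X_scaled, y_scaled, x_min, x_max, y_min, y_max = minmax(X_clean, y_clean)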
After the data is cleaned, we can run gradient descent.
theta0_scaled, theta1_scaled = gradient_descent(
X_scaled,
y_scaled,
alpha=alpha,
steps=steps
)
The theta0 and theta1 in the code are the a and b we discussed previously.
We must scale them back to the original units if we want to use these values to make real predictions.
- Scaling the thetas back:
def unscale_thetas(theta0_scaled, theta1_scaled, x_min, x_max, y_min, y_max):
    """
    Unscale the thetas back to the original scale.
    This is necessary after performing gradient descent on scaled data.
    """
    x_range = x_max - x_min
    y_range = y_max - y_min
    theta1_unscaled = (y_range / x_range) * theta1_scaled
    theta0_unscaled = y_min + y_range * (theta0_scaled - theta1_scaled * (x_min / x_range))
    return theta0_unscaled, theta1_unscaled
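Calling it with the values we got from gradient descent and from minmax:

theta0, theta1 = unscale_thetas(
    theta0_scaled, theta1_scaled,
    x_min, x_max, y_min, y_max
)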
And that’s it! 🎉🍾👏
This is an example of a plot I got from running my code:
And these are the thetas (a and b):
- theta0: 104727.42003321546
- theta1: 7699.98612392038
Now, if you want to implement the predict_house_price function from the beginning of this article:
def predict_house_price(area):
    theta0 = 104727.42003321546
    theta1 = 7699.98612392038
    price = theta0 + theta1 * area
    return price
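For example, for a hypothetical 80m² house:

print(predict_house_price(80))  # ~720,726, in the same currency as the training data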
Conclusion
Take a moment to look at my repository if you want to run this code on your own dataset.
Keep in mind that this process is very iterative: a different number of steps or a bigger or smaller learning rate may give better or worse results.
In the end, house prices can’t really be predicted with only one variable, so this is more of a thought experiment than a real prediction model.
Keep learning!