Ender Dincer
Apr 4, 2023
Linear Regression with Ordinary Least Squares
In this article we will see how the ordinary least squares method is used to fit a regression line, and then we will implement it in Python. Although there are existing optimised implementations in Python, our example is for learning purposes, like doing arithmetic without a calculator.
Ordinary least squares determines the coefficients of a linear regression line by minimising the total area of the squares formed between the observation points and the points estimated by the line. It is widely used in machine learning, statistics, finance and many other fields.
Figure 1: Linear Regression
To derive the formula we will first simplify the figure above, focusing on just a single point and the regression line, as shown below.
Figure 2: LSE for a Single Point
The green point is an observation point. The yellow square is one of the many squares whose total area we aim to minimise. Since all four sides of a square are equal in length, it is sufficient to express the length of a single side in terms of the regression line's coefficients and the observation point; the area is then that length squared.
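The side of the square is the vertical distance between the observed value and the line's estimate, so for an observation (x_i, y_i) and a line with intercept b_0 and slope b_1 the area of its square can be written as

$$A_i = \left(y_i - (b_0 + b_1 x_i)\right)^2$$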
Now that we have the area function for a single point, denoted by "A", we can generalise it to all n observation points by summing, as in Equation 1.

Equation 1:
$$S(b_0, b_1) = \sum_{i=1}^{n} A_i = \sum_{i=1}^{n}\left(y_i - (b_0 + b_1 x_i)\right)^2$$
The regression line's slope and position, and therefore the total area of the squares, depend on the coefficients. Thus, we take the partial derivative of S with respect to each coefficient and set it equal to zero.

Equation 2:
$$\frac{\partial S}{\partial b_0} = 0, \qquad \frac{\partial S}{\partial b_1} = 0$$
Solve for b0

Taking the partial derivative of S with respect to b0:

Equation 3:
$$\frac{\partial S}{\partial b_0} = -2\sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right) = 0$$

Distributing the sum over each term:

Equation 4:
$$\sum_{i=1}^{n} y_i - n b_0 - b_1 \sum_{i=1}^{n} x_i = 0$$

Equation 5: Leave b0 Alone
$$b_0 = \frac{\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i}{n}$$

Equation 6:
$$b_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - b_1 \cdot \frac{1}{n}\sum_{i=1}^{n} x_i$$

Writing the averages in mean notation:

Equation 7: Mean notation
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Equation 8:
$$b_0 = \bar{y} - b_1 \bar{x}$$

The intercept b0 is now expressed in terms of b1, so we turn to the second partial derivative:

Equation 9:
$$\frac{\partial S}{\partial b_1} = -2\sum_{i=1}^{n} x_i\left(y_i - b_0 - b_1 x_i\right) = 0$$

Equation 10:
$$\sum_{i=1}^{n} x_i y_i - b_0 \sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 = 0$$

Substituting Equation 8 for b0:

Equation 11:
$$\sum_{i=1}^{n} x_i y_i - \left(\bar{y} - b_1 \bar{x}\right)\sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 = 0$$
Solve for b1

Expanding Equation 11, grouping the terms that contain b1 and isolating it gives the slope in the form used in the implementation below:

Equation 12:
$$b_1 = \frac{\sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right)}$$

Because the deviations from a mean sum to zero, subtracting $\bar{x}\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ from the numerator and $\bar{x}\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ from the denominator changes nothing and gives the familiar centred form:

Equation 13:
$$b_1 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Dividing the numerator and the denominator by n shows that the slope is the covariance of x and y over the variance of x:

Equation 14:
$$b_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Equation 15:
$$b_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}$$
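As a quick sanity check on the derivation (this verification is a side note with made-up sample data, not part of the derivation itself), we can build the sum of squares symbolically with SymPy, set its partial derivatives to zero, and compare the solution with the closed forms from Equations 8 and 12:

import numpy as np
import sympy as sp

# Made-up sample data, purely for verification.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = sp.symbols("b0 b1")
# Equation 1: the sum of squared residuals.
S = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Equation 2: set both partial derivatives to zero and solve.
solution = sp.solve([sp.diff(S, b0), sp.diff(S, b1)], [b0, b1])
print(solution)  # b0 ≈ 2.2, b1 ≈ 0.6

# Closed forms from Equations 8 and 12 give the same answer.
x_arr, y_arr = np.array(xs), np.array(ys)
slope = (x_arr * (y_arr - y_arr.mean())).sum() / (x_arr * (x_arr - x_arr.mean())).sum()
intercept = y_arr.mean() - slope * x_arr.mean()
print(intercept, slope)  # 2.2 0.6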
We now know how to get the parameters of the regression line. Let's see how we can implement a linear regression method with ordinary least squares.
We will first create a class called LeastSquaresLinearRegressor that has two public methods. The first method, fit, fits the regression line using the equations we just derived.
import numpy as np


class LeastSquaresLinearRegressor:
    def fit(self, x, y):
        x_mean = np.mean(x)
        y_mean = np.mean(y)
        # Equation 12: the slope of the regression line.
        self.b1 = (x * (y - y_mean)).sum() / (x * (x - x_mean)).sum()
        # Equation 8: the intercept of the regression line.
        self.b0 = y_mean - self.b1 * x_mean

    def predict(self, x):
        return self.b0 + self.b1 * x
Equations 12 and 8 are used to find the parameters of the regression line. The second method, predict, uses the parameters calculated by fit to estimate y for a given x.
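As a quick illustration (the numbers below are made up), fitting the regressor on a small dataset where y is roughly 2x + 1 recovers the intercept and slope:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])  # roughly y = 2x + 1 plus noise

regressor = LeastSquaresLinearRegressor()
regressor.fit(x, y)
print(regressor.b0, regressor.b1)          # close to 1 and 2 (1.05, 1.99)
print(regressor.predict(np.array([6.0])))  # close to 13 ([12.99])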
To test and visualise this regressor we will use the scikit-learn and matplotlib libraries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split


def mean_squared_error(actual, prediction):
    # Mean of the squared residuals.
    return np.mean((actual - prediction) ** 2)


def mean_absolute_pc_error(actual, prediction):
    # Mean of the absolute percentage errors.
    return np.abs((actual - prediction) / actual).sum() * (100 / actual.size)


def test():
    # Generate a noisy one-dimensional regression dataset.
    dataset_x, dataset_y = datasets.make_regression(
        n_samples=100, n_features=1, noise=15, random_state=3
    )
    x = np.ravel(dataset_x)
    y = np.ravel(dataset_y)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=5
    )
    regressor = LeastSquaresLinearRegressor()
    regressor.fit(x_train, y_train)
    predictions = regressor.predict(x_test)
    print(f"MSE: {mean_squared_error(y_test, predictions)}")
    print(f"MAPE: {mean_absolute_pc_error(y_test, predictions)}%")
    plt.scatter(x_train, y_train)
    plt.scatter(x_test, y_test)
    plt.plot(x_train, regressor.predict(x_train), color="#11aa00")
    plt.show()


test()
Output:
MSE: 208.2299503682563
MAPE: 48.37193549306731%
Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE) are widely used metrics for measuring the accuracy of a regression model. A 48 percent MAPE can look very bad, but for a simple linear model fitted to deliberately noisy data it is not surprising, and the value heavily depends on the nature of the data: MAPE in particular is inflated when actual values lie close to zero, as some do in this centred synthetic dataset.
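To make the metrics concrete, here is a small hand-checkable example using the functions defined above (the numbers are invented for illustration):

import numpy as np

actual = np.array([10.0, 20.0, 40.0])
prediction = np.array([12.0, 18.0, 44.0])

# MSE = ((-2)**2 + 2**2 + (-4)**2) / 3 = 24 / 3 = 8.0
print(mean_squared_error(actual, prediction))

# MAPE = (2/10 + 2/20 + 4/40) * 100 / 3 ≈ 13.33%
print(mean_absolute_pc_error(actual, prediction))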
Let's see the plotted regression line. The green line is the regression line, the blue dots are observation points (the training set) and the orange dots are the test data.
Figure 3: Regression Line Plotted with Python
In conclusion, linear regression with ordinary least squares is a powerful and widely used statistical tool. With its ability to estimate the coefficients of a linear equation that best fits the data, OLS is a valuable technique for understanding the nature and strength of the relationship between variables.