It’s used to express the mathematical relationship between two variables. When you use it, you are making the assumption that there is a linear relationship between an outcome variable (dependent variable) and a predictor (independent variable).
It can be used in several practical applications, such as analysing the impact of a price change or assessing risk.
Two things are captured in the model. The first is the trend and the second is the variation. A linear model can be written as follows,
y = β₀ + β₁x
β₀ and β₁ need to be estimated using the observed data: (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ).
The model can also be written in matrix format, y = Xβ, where X is the design matrix whose i-th row is (1, xᵢ) and β = (β₀, β₁)ᵗ.
✓ Fitting the model – The intuition behind linear regression is to find the line that minimizes the distance between all the points and the line. Linear regression seeks the line that minimizes the residual sum of squares (RSS): RSS(β) = Σᵢ (yᵢ − β₀ − β₁xᵢ)²
We want to optimize the RSS (β) with respect to β to find the optimal line.
To minimize RSS(β) = (y − Xβ)ᵗ(y − Xβ), where X is the design matrix, differentiate it with respect to β, set the derivative equal to zero, and solve for β. The estimator is,
β̂ = (XᵗX)⁻¹Xᵗy
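As a sketch, this estimator can be computed directly with NumPy. The data, seed, and true coefficients below are invented purely for illustration:

```python
import numpy as np

# Simulated observations from a hypothetical line y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=x.size)

# Design matrix X: a column of ones for the intercept, then the predictor
X = np.column_stack([np.ones_like(x), x])

# Normal-equation estimator: (XᵗX)⁻¹Xᵗy
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # close to the true values 2.0 and 3.0
```

In practice `np.linalg.lstsq` (or a QR/SVD-based solver) is preferred over forming the inverse explicitly, since inverting XᵗX is numerically less stable; the explicit inverse is used here only to mirror the formula.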
So far we have modelled the trend; the variation still needs to be modelled.
✓ Adding in modelling assumptions about the errors – If you use your model to predict y for a given value of x, your prediction is deterministic and doesn’t capture the variability in the observed data. To capture this variability, we extend the model as follows,
y = β₀ + β₁x + ϵ
where the new term ϵ is referred to as noise; it represents the actual error, the difference between the observations and the true regression line. We assume the noise is normally distributed: ϵ ∼ N(0, σ²).
Here, we made the assumption that the noise follows a normal distribution. In financial modelling, however, a fat-tailed distribution is usually more appropriate.
For any given value of x, the conditional distribution of y given x is:
p(y | x) ∼ N(β₀ + β₁x, σ²).
Now we have the estimated line, and we can see how far away the observed data points are from the line. We can treat these differences as residuals, or estimates of the actual errors. They are estimated as follows,
eᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ) for all i.
Then the variance σ² of ϵ can be estimated as follows,
σ̂² = Σᵢ eᵢ² / (n − 2)
where dividing by n − 2 rather than n accounts for the two estimated parameters. This is called the mean squared error and captures how much the predicted values vary from the observed ones. Mean squared error is a useful quantity for any prediction problem.
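The residual and variance calculation can be sketched as follows; the data and its true parameters (σ² = 0.25) are made up for illustration:

```python
import numpy as np

# Simulated observations from a hypothetical line y = 1 + 2x, noise sd 0.5
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

# Fit the line by least squares
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Residuals: e_i = y_i - (b0 + b1 * x_i)
residuals = y - (b0 + b1 * x)

# Estimate of the noise variance, dividing by n - 2
sigma2_hat = np.sum(residuals**2) / (len(y) - 2)
print(sigma2_hat)  # should be near the true variance 0.25
```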
✓ Evaluation metrics
✶ R-squared – R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²
In the above equation, the residual sum of squares is divided by the total sum of squares, so the ratio is the proportion of variance unexplained by our model. R² is therefore the proportion of variance explained by our model.
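The R² formula can be sketched directly; the observed and predicted values below are invented for illustration:

```python
import numpy as np

# Hypothetical observations and model predictions
y = np.array([3.0, 5.1, 6.9, 9.2, 11.0])
y_pred = np.array([3.1, 5.0, 7.0, 9.0, 11.1])

rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - rss / tss
print(r_squared)  # close to 1: nearly all variance is explained
```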
✶ p-values – We make the null hypothesis that the βs are zero. For any given β, the p-value captures the probability of observing a test statistic at least as extreme as the one computed from the data, assuming the null hypothesis is true. This means that if we have a low p-value, it is highly unlikely to observe such a test statistic under the null hypothesis, so the coefficient is highly likely to be nonzero and therefore significant.
✶ Cross-validation – Divide the data into a training set and a test set: 80% in the training set and 20% in the test set. Fit the model on the training set, then compare the mean squared error on the test set to that on the training set. This comparison should be done across sample sizes as well. If the mean squared errors are approximately the same, our model generalizes well; otherwise, we have an overfitting problem.
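An 80/20 split can be sketched with NumPy; the data here is simulated for illustration:

```python
import numpy as np

# Simulated data from a hypothetical line y = 1.5 + 0.8x, noise sd 1
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=x.size)

# Shuffle indices, then take 80% for training and 20% for testing
idx = rng.permutation(x.size)
train, test = idx[:160], idx[160:]

# Fit on the training set only
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]

# Compare mean squared error on the two sets
mse_train = np.mean((y[train] - X[train] @ beta) ** 2)
mse_test = np.mean((y[test] - X[test] @ beta) ** 2)
print(mse_train, mse_test)  # similar values suggest the model generalizes
```

Libraries such as scikit-learn provide this split (and k-fold variants) ready-made, but the index-shuffling idea is the same.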
✓ Multiple linear regression
So far we have looked at simple linear regression, with one dependent variable and one predictor. The model can be extended by adding further predictors. Multiple linear regression can be written as follows,
y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ϵ
Because we previously expressed everything in matrix notation, the estimator β̂ = (XᵗX)⁻¹Xᵗy carries over unchanged: X simply gains one column per predictor. Feature selection plays an important role in multiple regression.
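A sketch of the same least-squares estimator with three predictors; the data and true coefficients are simulated for illustration:

```python
import numpy as np

# Simulated data: y = 2 + 1*x1 - 0.5*x2 + 0.25*x3 plus small noise
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2.0 + 1.0 * x1 - 0.5 * x2 + 0.25 * x3 + rng.normal(0, 0.1, size=n)

# Design matrix: intercept column plus one column per predictor
X = np.column_stack([np.ones(n), x1, x2, x3])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # approximately [2.0, 1.0, -0.5, 0.25]
```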
So far we have assumed that the relationship between the dependent and independent variables is linear, but the same machinery also handles polynomial relationships, which can be written as follows,
y = β₀ + β₁x + β₂x² + β₃x³
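A polynomial model is still linear in the βs, so the same estimator applies once the design matrix contains the powers of x. A sketch on simulated data (the true coefficients below are invented for illustration):

```python
import numpy as np

# Simulated data from a hypothetical cubic y = 1 - 2x + 0.5x² + 0.1x³
rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 80)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * x**3 + rng.normal(0, 0.05, size=x.size)

# The "predictors" are simply x, x², and x³: still linear in β
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # approximately [1.0, -2.0, 0.5, 0.1]
```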