Some years ago, when I created my first simple linear regression model, I had no clear idea of how to evaluate it, or even whether it was a good predictor. So I decided to write this post gathering the action items I use to determine the goodness of my simple linear regression models. You will find a brief review of the basics and assumptions of simple linear regression, followed by a list of to-dos for examining the model.
What Simple Linear Regression Is
It is the simplest regression analysis model for examining the relationship between two variables: a dependent variable (also known as the response, target or outcome) and an independent variable (also known as the explanatory or predictor variable). The estimated relationship can be drawn on a plot as a single straight line.
The main objectives of simple regression are:
- Identify the relationship between the variables
- Predict the dependent variable Y on the basis of the single independent variable x
- Test hypotheses about the association between the variables
The goal is to find the line that best represents the linear relationship. That line is determined by minimizing the sum of the squared differences between the predicted values and the actual values. For that purpose, we have to estimate the parameters β0 and β1 of the model Y = β0 + β1x + ε, where ε is the error term. β0 is the intercept, which is the expected value of the dependent variable Y when the independent variable x is zero, and β1 is the slope, which is the change in the expected value of Y for a one-unit change in x.
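To make the estimation concrete, here is a minimal sketch of the closed-form least squares estimates for β0 and β1, assuming NumPy is available. The data and the helper name `fit_simple_ols` are made up for illustration and are not part of any library.

```python
import numpy as np

# Hypothetical toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariation of x and y divided by the variation of x.
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean).
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

beta0, beta1 = fit_simple_ols(x, y)
y_pred = beta0 + beta1 * x  # predicted values on the fitted line
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
```

In practice a library routine such as `numpy.polyfit(x, y, 1)` should give the same two numbers; writing the formulas out simply shows where they come from.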
Linear Regression Assumptions
It is important to recall the linear regression assumptions because a good understanding of them will help to evaluate the model.
- Linearity assumption means that the relationship between x and the expected value of Y is linear, so the expected value of the errors (the deviations from the line) is zero.
- Constant variance assumption, or homoscedasticity, means that the variance of the residuals should be constant across all levels of the independent variable. In simple words, the model should not be more accurate for some segments of the data than for others.
- Independence assumption means the errors are independent random variables.
- Normality assumption means that the errors are assumed to be normally distributed.
Evaluate Model Effectiveness and Performance
To evaluate a simple linear regression model, there are some key metrics and plots that we can consider. Let’s review them:
Calculate Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)
MSE is the average squared difference between the actual and predicted values, and RMSE is its square root, which puts the error back in the units of the dependent variable. Lower MSE or RMSE values indicate better predictive accuracy.
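As a rough sketch, both metrics can be computed in a few lines with NumPy. The numbers and the function names `mse` and `rmse` are made up for illustration. Note that this version divides by n; some regression texts divide the squared error by n - 2 instead, so check which convention your tool uses.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the units of y."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Hypothetical actual values and model predictions.
y_true = [2.1, 3.9, 6.2, 7.8, 10.1]
y_pred = [2.04, 4.03, 6.02, 8.01, 10.00]

print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```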
Verify the Coefficient of Determination (R-squared)
The R-squared is the proportion of the total variability of the dependent variable that can be explained by the independent variable in the linear regression model. It ranges from 0 to 1, where a value closer to 1 indicates a better fit.
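The definition translates directly into code: one minus the ratio of the unexplained variability to the total variability. The sketch below assumes NumPy; the data and the function name `r_squared` are illustrative only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variability left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variability of y
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual values and predictions from a fitted line.
y_true = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
y_pred = np.array([2.04, 4.03, 6.02, 8.01, 10.00])

r2 = r_squared(y_true, y_pred)
print(f"R-squared = {r2:.4f}")
```

A model that always predicts the mean of y gets an R-squared of 0 under this formula, and a model that predicts every point exactly gets 1.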
Analyze the p-values of Coefficients
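Evaluate the significance of the coefficients by examining their corresponding p-values. A low p-value (commonly, one below a chosen significance level such as 0.05) indicates that the coefficient is statistically different from zero, that is, the predictor variable has a statistically significant linear association with the response variable.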
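One way to get these p-values, assuming SciPy (version 1.6 or later for `intercept_stderr`) is installed, is `scipy.stats.linregress`. It reports the p-value for the slope directly; the intercept p-value below is computed by hand from the t distribution. The data is made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

result = stats.linregress(x, y)
n = len(x)
df = n - 2  # two parameters (intercept and slope) were estimated

# Slope: linregress already tests H0 "slope = 0" with a two-sided t-test.
p_slope = result.pvalue

# The same test written out by hand, to show the mechanics.
t_slope = result.slope / result.stderr
p_slope_manual = 2 * stats.t.sf(abs(t_slope), df)

# Intercept: same idea, using the intercept's standard error.
t_intercept = result.intercept / result.intercept_stderr
p_intercept = 2 * stats.t.sf(abs(t_intercept), df)

print(f"slope p-value     = {p_slope:.6f}")
print(f"intercept p-value = {p_intercept:.6f}")
```

With this toy data the slope is clearly significant while the intercept is not, which is a common and usually harmless situation: it only says the line may well pass through the origin.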
Calculate the Correlation Coefficient Between Variables
It measures the strength and direction of the linear relationship between two variables and provides an indication of the linearity between them. A higher absolute value of the correlation coefficient suggests a closer adherence to a linear relationship. In simple linear regression, the square of the correlation coefficient is exactly the R-squared.
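A short check of that last statement, assuming NumPy and using made-up data: compute the Pearson correlation with `numpy.corrcoef`, fit the line with `numpy.polyfit`, and compare the squared correlation with the R-squared of the fit.

```python
import numpy as np

# Hypothetical toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# R-squared of the least squares line fitted to the same data.
slope, intercept = np.polyfit(x, y, 1)
y_pred = intercept + slope * x
r2 = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R-squared = {r2:.4f}")
```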
Plot The Residuals vs Predicted Values
Ideally, the residuals should be randomly scattered around zero, indicating that the model captures the essential relationship well and that the variance of the error terms is constant; see the left image below. The right image shows a megaphone effect, which means the constant variance (homoscedasticity) assumption does not hold.
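A possible way to produce this plot, assuming NumPy and Matplotlib are installed, is sketched below. The helper names `fitted_and_residuals` and `plot_residuals_vs_fitted` and the data are made up for illustration.

```python
import numpy as np

def fitted_and_residuals(x, y):
    """Fit a least squares line and return (predicted values, residuals)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def plot_residuals_vs_fitted(x, y):
    """Scatter the residuals against the predicted values."""
    import matplotlib.pyplot as plt  # imported here so it is only needed for plotting

    fitted, residuals = fitted_and_residuals(x, y)
    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals)
    ax.axhline(0.0, linestyle="--", color="gray")  # reference line at zero
    ax.set_xlabel("Predicted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs predicted values")
    return fig

# Hypothetical toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
fitted, residuals = fitted_and_residuals(x, y)
```

Calling `plot_residuals_vs_fitted(x, y)` and then `matplotlib.pyplot.show()` displays the figure. With real data you would look for a band of points of roughly equal width around the dashed line.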
Create a QQ (Quantile-Quantile) Plot
If the residuals are normally distributed, their quantiles will line up with the normal quantiles along a straight line. If the plot does not show a straight line, the normality assumption does not hold, which indicates either a skewed distribution or a heavy-tailed distribution. Below there are some examples of QQ plots; the one we want to see as a result of our model is the plot on the left.
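As a sketch, `scipy.stats.probplot` computes the pairs of theoretical and observed quantiles behind a QQ plot and can also draw them on a Matplotlib axes. This assumes SciPy, and Matplotlib for the drawing part; the residuals and the helper name `plot_qq` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted model, invented for illustration only.
residuals = np.array([0.06, -0.13, 0.18, -0.21, 0.10])

# Theoretical normal quantiles, ordered residuals, and a straight-line fit to them.
(theoretical_q, ordered_residuals), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(f"correlation of the QQ points with a straight line: {r:.3f}")

def plot_qq(residuals):
    """Draw the QQ plot of the residuals against the normal distribution."""
    import matplotlib.pyplot as plt  # imported here so it is only needed for plotting

    fig, ax = plt.subplots()
    stats.probplot(residuals, dist="norm", plot=ax)
    return fig
```

The value `r` is only a rough numeric summary of how straight the points are; with so few observations the plot itself, from `plot_qq(residuals)`, is more informative.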
Create a Histogram of Residuals
Histograms are often used to observe the shape of the distribution of the residuals and to verify whether the normality assumption holds. For instance, the plot below shows a normally shaped distribution of errors, which is what we expect to see from our models.
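A minimal sketch of the histogram, assuming NumPy and Matplotlib. The residuals here are simulated from a normal distribution as a stand-in for the residuals of a real model, and the helper name `plot_residual_histogram` is made up for illustration.

```python
import numpy as np

# Simulated stand-in for model residuals (not real results).
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Bin counts and bin edges, useful for inspecting the shape without a plot.
counts, edges = np.histogram(residuals, bins=10)
print(counts)

def plot_residual_histogram(residuals, bins=10):
    """Draw a histogram of the residuals."""
    import matplotlib.pyplot as plt  # imported here so it is only needed for plotting

    fig, ax = plt.subplots()
    ax.hist(residuals, bins=bins, edgecolor="black")
    ax.set_xlabel("Residual")
    ax.set_ylabel("Frequency")
    ax.set_title("Histogram of residuals")
    return fig
```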
- If some of the assumptions do not hold, then the model fit is inadequate, but it does not mean that the regression is not useful.
- Violations of these assumptions can affect the validity and reliability of the regression results.
- Finally, it is important to explain the results in the context of the specific problem we want to solve.