Linear Regression Python Scikit-learn Supervised Learning

Predictions of Admissions to the Master’s Degree

This project was developed to predict the chance of admission of foreign students to Master’s degree programs at American colleges. To do that, I will use the Linear Regression algorithm.

Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


I’ll work with the Admission_Predict.csv file from the Kaggle website. It contains the parameters that are considered important when applying to Master’s programs.

admissions = pd.read_csv('Admission_Predict.csv')
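
The plotting code in this post refers to columns such as GRE_Score and Chance_of_Admit_, which suggests that the spaces in the CSV headers were replaced with underscores after loading. That step is not shown here, so the following is only a minimal sketch of how it could be done. It assumes the raw headers contain spaces (including a trailing one after "Chance of Admit"), and the small in-memory frame is an illustrative stand-in for the real file.

```python
import pandas as pd

# Illustrative stand-in for pd.read_csv('Admission_Predict.csv').
# Assumption: the raw headers contain spaces, including a trailing one.
raw = pd.DataFrame({
    "GRE Score": [337, 324, 316],
    "TOEFL Score": [118, 107, 104],
    "CGPA": [9.65, 8.87, 8.00],
    "Chance of Admit ": [0.92, 0.76, 0.72],
})

# Replace every space with an underscore so the names match the plotting code.
admissions = raw.copy()
admissions.columns = admissions.columns.str.replace(" ", "_")

print(list(admissions.columns))
```

With the real file, the same one-line rename would be applied right after pd.read_csv.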

The dataset contains 400 instances and 9 columns:

  1. Serial No.
  2. GRE Scores ( out of 340 )
  3. TOEFL Scores ( out of 120 )
  4. University Rating ( out of 5 )
  5. Statement of Purpose Strength ( out of 5 )
  6. Letter of Recommendation Strength ( out of 5 )
  7. Undergraduate GPA ( out of 10 )
  8. Research Experience ( either 0 or 1 )
  9. Chance of Admit ( ranging from 0 to 1 ), the label

Exploratory Data Analysis

Now it is time to see how the variables are related. For this purpose I will use the common seaborn pair plot.

print('Relationships Between different Test Scores and The Chance to Be Admitted in College')
# Note: these names assume the spaces in the CSV headers were replaced with underscores
g = sns.pairplot(admissions,
                 x_vars=["Chance_of_Admit_", "GRE_Score", "CGPA", "TOEFL_Score"],
                 y_vars=["Chance_of_Admit_", "GRE_Score", "CGPA", "TOEFL_Score"])
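
To put numbers on what a pair plot shows, the pairwise Pearson correlations can be computed with DataFrame.corr(). The sketch below runs on synthetic data, so the generated values, the seed, and the frame name demo are all assumptions made for illustration. With the real data, the same call would be made on the admissions frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; none of these numbers come from the real dataset.
rng = np.random.default_rng(0)
gre = rng.normal(316, 11, size=200)
cgpa = 8.6 + 0.04 * (gre - 316) + rng.normal(0, 0.3, size=200)
chance = (0.72 + 0.1 * (cgpa - 8.6) + 0.003 * (gre - 316)
          + rng.normal(0, 0.05, size=200))

demo = pd.DataFrame({"GRE_Score": gre, "CGPA": cgpa, "Chance_of_Admit_": chance})

# Pearson correlation between every pair of columns (values in [-1, 1]).
corr = demo.corr()
print(corr.round(2))
```

Values close to 1 correspond to the tight, upward-sloping clouds in the pair plot.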

Using Linear Regression

Linear regression is one of the most interpretable machine learning algorithms: it is easy to explain and easy to use. Based on the linear relationships depicted in the visualization above, I decided to use Linear Regression to predict the value of my dependent variable (chance to be admitted) from the values of the other variables: GRE Score, TOEFL Score, etc.

# Defining features (X) and target (y)
# Note: the target name assumes the underscore-style column names used in the pair plot
X = admissions.drop(columns=['Chance_of_Admit_'])
y = admissions['Chance_of_Admit_']

# Splitting the data into training and testing sets
X_training, X_test, y_training, y_test = train_test_split(X, y, 
                                         test_size=0.3, random_state=101)

# Creating an instance of a LinearRegression() model named lm
lm = LinearRegression()

# Training and fitting lm on the training data
lm.fit(X_training, y_training)

# Predicting results
y_predicted = lm.predict(X_test)
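
With test_size=0.3, train_test_split holds out 30% of the rows, so a 400-row dataset yields 280 training rows and 120 test rows, and a fixed random_state makes the shuffle reproducible. A quick sketch to check this, using placeholder arrays rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 400 rows and 2 feature columns (not the real dataset).
X = np.arange(800).reshape(400, 2)
y = np.arange(400)

X_training, X_test, y_training, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# Repeating the split with the same random_state gives the same rows.
X_training_2, X_test_2, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=101)

print(X_training.shape, X_test.shape)
print(np.array_equal(X_test, X_test_2))
```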

Results of training and validation

# Creating a new dataframe to store results
results = pd.DataFrame()
results['Test'] = y_test
results['Predictions'] = y_predicted

# Creating a scatterplot of real test values vs predicted values.
plt.scatter(results['Test'], results['Predictions'])
plt.xlabel('Y Test')
plt.ylabel('Y Predicted')

Measuring the Model

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

MAE = mean_absolute_error(y_test, y_predicted)
MSE = mean_squared_error(y_test, y_predicted)
RMSE = np.sqrt(MSE)

print('MAE: {}'.format(MAE))
print('MSE: {}'.format(MSE))
print('RMSE: {}'.format(RMSE))
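
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy and compares the results with scikit-learn. The four true and predicted values are made up for illustration and are not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values, used only to illustrate the formulas.
y_true = np.array([0.90, 0.70, 0.50, 0.80])
y_pred = np.array([0.85, 0.75, 0.55, 0.70])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean of the absolute errors
mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # square root of MSE, in the target's units

print('MAE: {}'.format(mae))
print('MSE: {}'.format(mse))
print('RMSE: {}'.format(rmse))

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does.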

Getting Coefficients

# Verifying coefficients
coefficients = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])

Each coefficient estimates how much the mean of the dependent variable, Chance of Admit, changes for a one-unit increase in that independent variable while the other variables in the model are held constant. For instance, holding all other features fixed, a 1-point increase in GRE Score is associated with an increase of 0.002392 in the chance of admission.
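
This "one-unit shift" reading can be checked directly: raising a single feature by 1 while keeping the others fixed changes the prediction by exactly that feature's coefficient. The sketch below uses a synthetic two-feature dataset, so the data, the seed, and the true weights are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two features; not the admissions dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 0.5 + 0.03 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.01, size=100)

lm = LinearRegression().fit(X, y)

base = np.array([[5.0, 5.0]])
bumped = base + np.array([[1.0, 0.0]])  # raise only the first feature by one unit

# The change in the prediction equals the first coefficient.
delta = lm.predict(bumped)[0] - lm.predict(base)[0]
print(delta, lm.coef_[0])
```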

Histogram of Residuals
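
The residuals are the differences between the real test values and the predicted values. The plotting code for this section is not shown, so the following is only one possible sketch: it fits a model on synthetic data (an assumption made for illustration) and draws the histogram with matplotlib. With the real model, residuals would be y_test - y_predicted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the admissions features and target.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 3))
y = 0.2 + X @ np.array([0.02, 0.03, 0.01]) + rng.normal(0, 0.05, size=400)

X_training, X_test, y_training, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)
lm = LinearRegression().fit(X_training, y_training)

# Residual = real value minus predicted value, one per test row.
residuals = y_test - lm.predict(X_test)

fig, ax = plt.subplots()
ax.hist(residuals, bins=20)
ax.set_xlabel('Residual (Y Test - Y Predicted)')
ax.set_ylabel('Count')
```

A roughly bell-shaped histogram centered near zero is what a well-behaved linear model is expected to produce.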



  • The histogram of the residuals looks approximately normal, which is consistent with the assumptions of linear regression and suggests that the model fits the data reasonably well
  • All coefficients are greater than 0, which indicates a positive relationship between each feature and the chance to be admitted
  • GPA Score is the most impactful feature on the chance to be admitted
  • The complete code is available in the GitHub repository