
Online News Popularity Prediction


The objective of this project is to predict the popularity of articles published by the Mashable website, where popularity is based on the number of shares an article receives. The machine learning algorithms used for this project are Random Forest, Support Vector Classification, and K-Nearest Neighbors (KNN).


This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity). It was retrieved from the UCI Machine Learning Repository in May 2019.

Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import time

from sklearn import metrics
from sklearn import linear_model
from sklearn import ensemble
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.tree import DecisionTreeClassifier

from IPython.display import Image

Loading Data

When I first looked at the data, I noticed that the column names contain blanks, so I used the rename function to strip them out.

onlinenews = pd.read_csv('OnlineNewsPopularity.csv')

# Remove blanks from every column name in a single rename call
onlinenews.rename(columns=lambda col: col.replace(" ", ""), inplace=True)


Exploratory Data Analysis

This phase of the project is about exploring the data to discover insights, identify patterns, relationships, and trends, and test assumptions. Here I look for outliers and missing values, check the variable types, verify whether the variables are correlated, and finally define the features for my model.

Variable Types

This dataset has 39644 instances and 61 attributes; most of them are of float type.

Data Cleaning

Let’s check for missing values. I recommend the plot below because it makes null values easy to spot: any null value would be colored yellow. In this case, the dataset has no null values.

# Looking for null values

print('All null values are in yellow')
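
A plot is not the only way to confirm this. As a complementary check, a per-column count of nulls gives the same answer in text form. The sketch below runs on a small made-up frame; `demo` and its values are assumptions for illustration and stand in for `onlinenews`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for onlinenews (hypothetical values)
demo = pd.DataFrame({
    "n_tokens_title": [12.0, 9.0, np.nan, 10.0],
    "shares": [593, 711, 1500, 1200],
})

# Number of missing values in each column
null_counts = demo.isnull().sum()
print(null_counts)

# True only when the frame contains no missing value at all
has_no_nulls = not demo.isnull().values.any()
print("No null values:", has_no_nulls)
```

On the real dataset the same two calls would be applied to `onlinenews` instead of `demo`.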

Dealing with outliers

This dataset contains a lot of outliers. The statistical method used to detect outliers is the interquartile range (IQR). I invite you to visit the GitHub Repository to see the details of the techniques I used to treat outliers.

# Restrict to numeric columns so the quantiles are well defined (the url column is text)
numeric = onlinenews.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# Creating notinvalidarea dataframe with boolean values:
# False means the value lies within the valid range
# True indicates the presence of an outlier
notinvalidarea = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))

# Calling function outliersinColumns
columns_w_outliers = outliersinColumns(notinvalidarea)

# Printing Results
print('Columns with outliers: {}'.format(len(columns_w_outliers)))
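
The helper `outliersinColumns` is not defined in this post. A minimal sketch of what such a helper might look like follows, assuming it simply returns the names of the columns that contain at least one flagged value. The body below and the toy `demo` frame are assumptions, not the version from the repository.

```python
import pandas as pd

def outliersinColumns(mask: pd.DataFrame) -> list:
    """Hypothetical helper: names of columns with at least one True flag."""
    return [col for col in mask.columns if mask[col].any()]

# Toy data: column "a" has one extreme value, column "b" has none
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [5, 6, 7, 8, 9]})

Q1 = demo.quantile(0.25)
Q3 = demo.quantile(0.75)
IQR = Q3 - Q1

# True marks a value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
flags = (demo < (Q1 - 1.5 * IQR)) | (demo > (Q3 + 1.5 * IQR))

columns_w_outliers = outliersinColumns(flags)
print('Columns with outliers: {}'.format(len(columns_w_outliers)))
```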

Detecting and Visualizing Correlation

# Correlation is only defined for numeric columns
correlation = onlinenews.select_dtypes(include='number').corr()

sns.heatmap(correlation, square=True, annot=True, linewidths=.5)
plt.title("Correlation Matrix (Online News)")

Binning the Target

This is a classification problem: predict whether the popularity of an article is low, medium, or high based on its number of shares. First of all, the target must be binned into the groups low, medium, and high, defined by quantiles of the target variable shares.

# Binning groups
onlinenews['labelbin_shares'] = pd.cut(onlinenews['shares'], 

onlinenews['bin_shares'] = pd.cut(onlinenews['shares'], 
# Visualization of Target Distribution
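
The arguments of the two `pd.cut` calls are cut off above, so the exact bin edges are not known. As an illustration only, here is a sketch of quantile-based binning into three groups, assuming tercile edges and a toy `shares` series; both are assumptions, not the values used in the project.

```python
import pandas as pd

# Toy share counts (hypothetical values)
shares = pd.Series([100, 500, 900, 1200, 1400, 2000, 3500, 8000, 50000])

# Bin edges taken from the quantiles of the target (terciles assumed here)
edges = shares.quantile([0, 1/3, 2/3, 1]).to_numpy()

# Text labels for each bin
labelbin = pd.cut(shares, bins=edges, labels=['low', 'medium', 'high'],
                  include_lowest=True)

# Integer codes (0, 1, 2) for the same bins, usable as a model target
bin_codes = pd.cut(shares, bins=edges, labels=False, include_lowest=True)

print(labelbin.value_counts())
```

With quantile edges the three classes come out roughly equal in size, which is worth comparing against the target distribution before deciding on class balancing.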

Feature Selection

I will select features with a univariate technique, SelectKBest: it scores each feature with a scoring function (here the ANOVA F-value, f_classif) and keeps only the k highest-scoring features.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

X = onlinenews.drop(columns=['labelbin_shares','shares','bin_shares'])
y = onlinenews['bin_shares']    

# Create a SelectKBest object to select the 21 features with the best ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=21)

# Apply the SelectKBest object to the features and target
X_kbest = fvalue_selector.fit_transform(X, y)

print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])

mask = fvalue_selector.get_support() #list of booleans
selectedFeatures = [] # The list of K best features

for selected, feature in zip(mask, X.columns.values):
    if selected:
        selectedFeatures.append(feature)

Now, I will check the feature importance using Random Forest.

# Check feature importance using Random Forest

X = onlinenews[selectedFeatures]
Y = onlinenews['bin_shares']  

rfc = ensemble.RandomForestClassifier(n_estimators=100)

# Fitting the model
rfc.fit(X, Y)

importantFeatures = {}
for feature,importance in zip(selectedFeatures,rfc.feature_importances_):
    importantFeatures[feature] = importance
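
To read the importances, it helps to sort them from highest to lowest. A minimal sketch on synthetic data follows; the `make_classification` data and the feature names `f0` to `f4` are placeholders, not the news features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the selected features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

rfc_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_demo.fit(X_demo, y_demo)

# Pair each feature with its importance and sort in descending order
ranking = pd.Series(rfc_demo.feature_importances_,
                    index=feature_names).sort_values(ascending=False)
print(ranking)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.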


Class Balancing

SMOTEENN is a resampling method that combines over-sampling and under-sampling: it over-samples with SMOTE and then cleans the result with Edited Nearest Neighbours (ENN).

from collections import Counter
from imblearn.combine import SMOTEENN 

X = onlinenews[selectedFeatures]
y = onlinenews['bin_shares']    

sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

The Model: KNN – K-Nearest Neighbor

from sklearn.neighbors import KNeighborsClassifier

X = X_res
Y = y_res

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 101)

knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, 
                           n_jobs=None, n_neighbors=9, p=2)
# Fitting the model
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

Accuracy and Score

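# Mean accuracy on the training set
knn_score = round(knn.score(X_train,y_train),4)
# Accuracy on the held-out test set
accu_score = metrics.accuracy_score(y_test, y_pred)

print('Accuracy:', accu_score)
print('Score:', knn_score)

Accuracy: 0.755119901977945
Score: 0.8261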


  • To get the results above, the model had to be tuned with GridSearch to find the best parameters.
  • I also applied two more algorithms to this problem, Random Forest and Support Vector Classification. If you want to see the complete code, I invite you to visit my GitHub repository.
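
The tuning step mentioned above is not shown in this post. Here is a minimal sketch of how such a search might look with scikit-learn's `GridSearchCV`, run on synthetic data. The parameter grid, the 5-fold setting, and the `_demo` variable names are assumptions for illustration, not the settings behind the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for the resampled news data
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=101)

# Candidate values to try (hypothetical grid)
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
print("Test accuracy:", round(search.score(X_te, y_te), 4))
```

Fitting the search on the training split only keeps the test split untouched for the final accuracy estimate.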