About
The objective of this project is to predict the popularity of articles published on the Mashable website, based on the number of shares of each article. The machine learning algorithms used in this project are Random Forest, Support Vector Classification, and K-Nearest Neighbors (KNN).
Dataset
This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares on social networks (popularity). It was retrieved from the UCI Machine Learning Repository in May 2019: https://archive.ics.uci.edu/ml/datasets/online+news+popularity
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import time
from sklearn import metrics
from sklearn import linear_model
from sklearn import ensemble
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
Loading Data
When I first looked at the data, I noticed that the column names contain blanks, so I used the rename function to replace each blank with nothing, effectively removing them.
onlinenews = pd.read_csv('OnlineNewsPopularity.csv')
for col in onlinenews.columns:
    onlinenews.rename(columns={col: col.replace(" ", "")}, inplace=True)
onlinenews.head()

Exploratory Data Analysis
This phase of the project is about exploring the data to discover insights, identify patterns, establish relationships and trends, and test assumptions. During this phase I looked for outliers, missing values, and variable types, checked for correlations among variables, and finally defined the features for my model.
Variable Types
onlinenews.info()

This dataset has 39,644 instances and 61 attributes, most of them of float type.
Data Cleaning
Let’s see if there are any missing values. I recommend the plot below since it makes null values very easy to spot: any nulls would be colored in yellow. In this case, the dataset has no null values.
# Looking for null values
print('All null values are in yellow')
sns.heatmap(onlinenews.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Dealing with outliers
This dataset contains a lot of outliers. The statistical method used to detect outliers is the interquartile range (IQR). I invite you to visit the GitHub Repository to see the details of the techniques I used to treat outliers.
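The outliersinColumns helper used below is defined in the GitHub repository; here is a minimal sketch of what it presumably does, assuming it simply returns the names of the columns that contain at least one flagged value:
# Hypothetical sketch of the outliersinColumns helper used below:
# given a boolean DataFrame, return the columns containing at least one True (outlier) flag
def outliersinColumns(mask_df):
    return [col for col in mask_df.columns if mask_df[col].any()]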
Q1 = onlinenews.quantile(0.25)
Q3 = onlinenews.quantile(0.75)
IQR = Q3 - Q1
# Creating the notinvalidarea dataframe of boolean values:
# False means the value falls inside the valid area
# True indicates the presence of an outlier
notinvalidarea = (onlinenews < (Q1 - 1.5 * IQR)) | (onlinenews > (Q3 + 1.5 * IQR))
# Calling function outliersinColumns
columns_w_outliers = outliersinColumns(notinvalidarea)
# Printing Results
print('Columns with outliers: {}'.format(len(columns_w_outliers)))
print('\n')
print(columns_w_outliers)

Detecting and Visualizing Correlation
correlation = onlinenews.corr()
plt.figure(figsize=(25,25))
sns.heatmap(correlation, square=True, annot=True, linewidths=.5)
plt.title("Correlation Matrix (Online News)")

Binning the Target
This is a classification problem: predict whether the popularity of an article is low, medium, or high based on its number of shares. First of all, the target must be binned to create the low, medium, and high groups, defined by quantiles of the target variable shares.
# Binning groups
onlinenews['labelbin_shares'] = pd.cut(onlinenews['shares'],
                                       bins=[0, 946, 2800, 843300],
                                       labels=['Low', 'Medium', 'High'])
onlinenews['bin_shares'] = pd.cut(onlinenews['shares'],
                                  bins=[0, 946, 2800, 843300],
                                  labels=[1, 2, 3])
# Visualization of Target Distribution
onlinenews.groupby('bin_shares').labelbin_shares.value_counts().plot.bar()
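The cut points 946 and 2800 are presumably quantiles of shares (the groups are defined by quantiles, as noted above); assuming the 25th and 75th percentiles were used, they can be checked with:
# Inspect the quantiles of the target to verify the bin edges used above
print(onlinenews['shares'].quantile([0.25, 0.75]))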

Feature Selection
I will select features with univariate feature selection using SelectKBest, which scores each feature with a statistical test and then removes all but the k highest-scoring features.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
X = onlinenews.drop(columns=['labelbin_shares', 'shares', 'bin_shares'])
y = onlinenews['bin_shares']
# Create a SelectKBest object to keep the 21 features with the best ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=21)
# Apply the SelectKBest object to the features and target
X_kbest = fvalue_selector.fit_transform(X, y)
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
mask = fvalue_selector.get_support()  # list of booleans
selectedFeatures = []  # The list of k best features
for selected, feature in zip(mask, X.columns.values):
    if selected:
        selectedFeatures.append(feature)
Now, I will check the feature importance using Random Forest.
# Check feature importance using Random Forest
X = onlinenews[selectedFeatures]
Y = onlinenews['bin_shares']
rfc = ensemble.RandomForestClassifier(n_estimators=100)
# Fitting the model
rfc.fit(X, Y)
importantFeatures = {}
for feature, importance in zip(selectedFeatures, rfc.feature_importances_):
    importantFeatures[feature] = importance
    # print(feature, importance)
plt.barh(selectedFeatures, rfc.feature_importances_, height=.5)

Class Balancing
SMOTEENN is a method that combines over-sampling and under-sampling: it performs over-sampling with SMOTE and then cleans the result with Edited Nearest Neighbours (ENN).
from collections import Counter
from imblearn.combine import SMOTEENN
X = onlinenews[selectedFeatures]
y = onlinenews['bin_shares']
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
The Model: KNN (K-Nearest Neighbors)
from sklearn.neighbors import KNeighborsClassifier
X = X_res
Y = y_res
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=101)
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30,
                           metric='minkowski', metric_params=None,
                           n_jobs=None, n_neighbors=9, p=2,
                           weights='uniform')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
Accuracy and Score
# Accuracy on the training set
knn_score = round(knn.score(X_train, y_train), 4)
print("Score", knn_score)
# Accuracy on the held-out test set
accu_score = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accu_score)
Score: 0.8261
Accuracy: 0.755119901977945
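Since classification_report and confusion_matrix are already imported, a more detailed per-class view of the same test split could look like this (a minimal sketch reusing y_test and y_pred from above):
# Per-class precision, recall, and F1 on the held-out test set
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))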
Conclusions
- To get the results above, a model tuning process using GridSearchCV was required to find the best parameters (a minimal sketch is shown after this list).
- I also applied two more algorithms to this problem, Random Forest and Support Vector Classification; to see the complete code, I invite you to visit my GitHub repository.
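Below is a minimal sketch of the GridSearchCV tuning step mentioned above, assuming the search was over the number of neighbors and the weighting scheme; the exact grid I used is in the repository.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Hypothetical parameter grid; the actual grid used for tuning is in the GitHub repository
param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validated accuracy:', round(grid.best_score_, 4))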