This notebook presents an approach for predicting the number of claps a Medium article will receive after publication. The model combines NLP methods (including Word2Vec embeddings) with feedforward neural networks. The scikit-learn (sklearn) tools serve as the main machine learning package for the regression model.
# connect to Google Drive
from google.colab import drive
drive.mount("/content/gdrive")
# import libraries
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.preprocessing import PowerTransformer
import pickle
import warnings
from IPython.display import Image
# settings
warnings.filterwarnings("ignore")
%matplotlib inline
# training data
df_train = pd.read_csv("gdrive/My Drive/Adams Assignment/Train.csv", sep=",")
# testing data
df_test = pd.read_csv("gdrive/My Drive/Adams Assignment/Test.csv", sep=",")
# check train data frame
df_train.head()
# check test data frame
df_test.head()
A look at the feature names, i.e. the column names, shows that the training dataframe and the testing dataframe do not use the same names, so a harmonization needs to be applied.
# check for intersecting feature names
test_colnames = df_test.columns.values
print(test_colnames)
train_colnames = df_train.columns.values
print(train_colnames)
if np.intersect1d(train_colnames,test_colnames).size == 0:
print("\nThere are no intersecting column names.")
The training data set differs in its structure from the test data set. Since the final predictions will be based on the features of the test data, the superfluous features of the training data set need to be harmonized. The Index is needed later to map the final predictions into the output file. Claps is the target variable of the presented model. DaysPublication measures the time since publication in days. The column Text contains the title and the textual content of an article and will be used for the NLP modelling. The number of followers is given by AuthorFollowers. The features TitleWordCount and TextWordCount represent the length of the title and of the article. Some of these features are created and/or transformed in the following section to prepare the data for the neural networks.
Training Data | Test Data | New Column Name
---|---|---
N/A | 'Unnamed: 0' | [drop] |
N/A | 'index' | 'Index' |
'totalClapCount' | N/A | 'Claps' |
'author' | 'Author' | [drop] |
'firstPublishedDate' | 'PublicationDetails' | 'DaysPublication' |
N/A | 'Responses' | [drop] |
'title' | 'Header' | [drop] |
'text' | 'Text' | 'Text' |
'wordCount' | 'Length' | [drop] |
various | N/A | [drop] |
'usersFollowedByCount' | N/A | 'AuthorFollowers' |
N/A | N/A | 'TitleWordCount' |
N/A | N/A | 'TextWordCount' |
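As a compact overview of this mapping, a minimal sketch of the column harmonization is given below; the dataframe names df_train_h and df_test_h are only illustrative, and the notebook performs the actual steps cell by cell in the following cells.
# sketch: harmonize the column names according to the mapping table above
df_train_h = df_train.rename(columns={"totalClapCount": "Claps",
                                      "text": "Text",
                                      "usersFollowedByCount": "AuthorFollowers"})
df_test_h = df_test.rename(columns={"index": "Index"})
# columns marked [drop] are removed; the engineered columns
# (DaysPublication, TitleWordCount, TextWordCount) are added in the cells below
df_test_h = df_test_h.drop(columns=["Unnamed: 0", "Responses", "Length", "Author"], errors="ignore")
df_train_h = df_train_h.drop(columns=["author", "wordCount"], errors="ignore")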
# test data
Details = df_test.PublicationDetails
Dates =[]
def getDateOfPublication_test():
months = ["Jan ", "Feb ", "Mar ", "Apr ", "May ", "Jun ", "Jul ", "Aug ", "Sep ", "Oct ", "Nov ", "Dec "]
for cnt, det in enumerate(Details):
for m in months:
if m in det:
tmp = det[det.find(m):]
if len(tmp)<8:
tmp = tmp + str(", 2019")
tmp = datetime.strptime(tmp, '%b %d, %Y')
tmp = datetime(2019, 12, 31, 0, 0) - tmp
Dates.append(tmp.days)
getDateOfPublication_test()
df_test["DaysPublication"]=Dates
del Dates, Details
# train data
Dates = df_train.firstPublishedDate
Dates = [datetime(2018, 11, 4, 0, 0) - datetime.strptime(d, '%Y-%m-%d') for d in Dates]
Dates = [d.days for d in Dates] # scraping date
df_train["DaysPublication"] = Dates
del Dates
The publication dates of the training data and the testing data are not identically distributed. The training data was scraped on 4 November 2018 and the testing data about one year later. This bias will be treated in chapter 4.3 (Scaling).
From a theoretical perspective, the author of an article will influence the number of claps. The popularity of an author can be approximated by the number of followers. Since the testing dataframe contains neither this number nor the exact username, a two-step web scraping is used to recreate this information. The column PublicationDetails consists of the author's name and the publication date. The name is extracted, and a combined Google search for the author's name and the keyword "Medium" returns the author's Medium profile page with high probability. This URL is then used to extract the number of followers from the HTML.
# web scrape for author's followers
# !pip install google
from googlesearch import search
Authors = [row[0:row.find(" in ")] for row in df_test.PublicationDetails]
Followers =[]
def getAuthorFollowers():
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
for cnt, auth in enumerate(Authors):
for m in months:
if m in auth:
Authors[cnt] = auth[0:auth.find(m)]
# has this author been scraped? (speed up)
flag = False
for cnt2 in range(0, cnt):
if auth == Authors[cnt2]:
Followers.append(Followers[cnt2])
flag = True
break
if flag:
print(Authors[cnt], Followers[cnt])
continue
# Google search
import requests
for g_res in search("Medium " + Authors[cnt], tld='com', lang='en', num=10, start=0, stop=10, pause=2.0):
if "https://medium.com/@" in g_res:
page = requests.get(g_res)
index = page.text.find("Followers")-9
if index<0:
continue
tmp =page.text[index-10:index]
tmp = tmp[tmp.find(">")+1:]
tmp = tmp.replace("K","000")
tmp = tmp.replace(".","")
tmp = tmp.replace(",","")
tmp = int(tmp)
Followers.append(tmp)
print(Authors[cnt], tmp)
break
else:
Followers.append(0)
# apply on test data
getAuthorFollowers()
df_test["AuthorFollowers"]=Followers
del Authors, Followers
# train data
df_train.rename(columns={"usersFollowedByCount":"AuthorFollowers"}, inplace=True)
def countHeaderWords(input_header):
tmp = input_header.split(" ")
return len(tmp)
# test data
cntHeader = [countHeaderWords(str(h)) for h in df_test.Header]
df_test["HeaderWordCount"] = cntHeader
# train data
cntHeader = [countHeaderWords(str(h)) for h in df_train.title]
df_train["HeaderWordCount"] = cntHeader
del cntHeader
The feature extraction of TextWordCount will be performed after the NLP preprocessing in chapter 4.5.
Two new dataframes, df_train_new and df_test_new, are created with the defined features for the prediction model.
# Create Reduced Test DataFrame
df_test_new = df_test.loc[:,["index", "Header", "Text", "DaysPublication", "AuthorFollowers","HeaderWordCount"]].values
df_test_new = pd.DataFrame(df_test_new)
df_test_new.columns = ["Index","Header", "Text", "DaysPublication", "AuthorFollowers","TitleWordCount"]
df_test_new["Header"].fillna(" ", inplace = True) # replace missing headlines
df_test_new["Text"] = df_test_new["Header"] + " " + df_test_new["Text"]
del df_test_new["Header"]
df_test_new.head()
# Create Reduced Train DataFrame
df_train_new = df_train.loc[:,["totalClapCount", "title", "text", "DaysPublication", "AuthorFollowers","HeaderWordCount"]].values
df_train_new = pd.DataFrame(df_train_new)
df_train_new.columns = ["Claps", "Header", "Text", "DaysPublication", "AuthorFollowers","TitleWordCount"]
df_train_new.Text[3]
df_train_new["Text"] = df_train_new["Header"] + " " + df_train_new["Text"]
del df_train_new["Header"]
df_train_new.head()
#print("The Training Data has" + len(df_train.index) + " rows and " + len(df_train.columns) + " columns.")
print("The Training Data has %d rows and %d columns." % (len(df_train_new.index), len(df_train_new.columns)))
print("The Testing Data has %d rows and %d columns." % (len(df_test_new.index), len(df_test_new.columns)))
#df_train.shape
There are some duplicate rows in df_train_new due to the exclusion of the feature tag. In the new dataframe, these duplicate rows no longer serve a purpose and can be dropped.
# remove duplicates in the train set
len_before = len(df_train_new.Claps)
df_train_new = df_train_new.drop_duplicates()
len_after = len(df_train_new.Claps)
print(str(len_before - len_after) + " (" + format(100*(len_before - len_after)/len_before, '.2f') + "%) duplicate rows have been deleted.")
del len_before, len_after
# drop null values
n_before = len(df_train_new)
df_train_new = df_train_new.dropna()
print(str(n_before - len(df_train_new)) + " row(s) with null values dropped.")
A literature-based NLP preprocessing is performed using regular expressions (re) and the NLP library nltk. First, HTML tags, links and non-alphabetic characters are removed. The lower-cased text is then tokenized into single words and all English stopwords are removed. Finally, a simple lemmatization is applied to the words. The output is a list (bag) of words for each article, stored in the column Text.
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
def nlp_preprocessing(input_data):
    # remove HTML tags, links and non-alphabetic characters
tmp = re.sub(r"<[^>]*>", "", input_data, flags=re.MULTILINE)
tmp = re.sub(r"http\S+", "", tmp, flags=re.MULTILINE)
tmp = re.sub(r"[^a-zA-Z]", " ", tmp, flags=re.MULTILINE)
# to lower case
tmp = tmp.lower()
# tokenization
tokenizer = RegexpTokenizer(r'\w+')
tmp = tokenizer.tokenize(tmp)
# remove stopwords
stop_words = stopwords.words("english")
tmp = [word for word in tmp if word not in stop_words]
# lemmatization
# only include words, exclude single characters
lemmatizer = WordNetLemmatizer()
tmp = [lemmatizer.lemmatize(word, pos='v') for word in tmp if len(word)>1]
return tmp
# apply cleaning on training data
train_text = pd.DataFrame(df_train_new)
tqdm.pandas()
df_train_new["Text"] = train_text["Text"].progress_apply(lambda x: nlp_preprocessing(x))
del train_text
# apply cleaning on testing data
test_text = pd.DataFrame(df_test_new)
tqdm.pandas()
df_test_new["Text"] = test_text["Text"].progress_apply(lambda x: nlp_preprocessing(x))
del test_text
In order to reduce the effect of outliers on the training of the neural network(s), I define the 95th percentile as an upper bound for the claps. Any value above it will be truncated to this upper_bound. Afterwards, the claps are scaled with a self-defined Min-Max-Log-Scaler.
# get a first impression of Claps
print("The min(Claps) is " + str(df_train_new.Claps.min()) + " and the max(Claps) is " + str(df_train_new.Claps.max()) + ".")
_=plt.boxplot(df_train_new.Claps)
The boxplot visualisation shows that there are articles with a very high number of claps. A maximum of 291706 claps has been reached by one article. In order to smooth the distribution, these outliers are truncated to the 95th percentile:
# calculate the percentiles
np.percentile(df_train_new.Claps,[0,25,50,75,90,95,99,100])
# outlier treatment
# select upper bound (95th percentile)
upper_bound = int(np.percentile(df_train_new.Claps, 95))
# truncate to the upper bound
df_train_new.loc[df_train_new.Claps > upper_bound, "Claps"] = upper_bound
For better results in a neural network, the target variable is scaled with a self-defined Min-Max-Log-Scaler, which reshapes the distribution and sets the minimum to 0 and the maximum to 1.
# clap transformation
def logTransformation(c):
# log transformation
c = np.log(c+1)
return c
def min_maxTransformation(c, max):
c = c / max
return c
# apply on train data
trans_Claps = [logTransformation(c) for c in df_train_new.Claps]
upper_bound_log = max(trans_Claps)
trans_Claps = [min_maxTransformation(c, upper_bound_log) for c in trans_Claps]
plt.hist(trans_Claps)
df_train_new["Claps"] = trans_Claps
del trans_Claps
# retransformation function (needed later)
def re_Transformation_Claps(c):
if c<0:
c=0
c = c * upper_bound_log
c = np.exp(c)
c = c-1
return c
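As a quick sanity check of this transformation pair, a single value can be pushed through the forward scaling and back again (a minimal sketch that reuses the functions and the upper_bound_log computed above):
# round-trip check: scale one clap count and invert it again
c_original = 100
c_scaled = min_maxTransformation(logTransformation(c_original), upper_bound_log)
c_restored = re_Transformation_Claps(c_scaled)
print(c_original, c_scaled, round(c_restored))  # expected: 100, a value in [0, 1], 100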
The sklearn.preprocessing package offers the class PowerTransformer, which performs a Box-Cox transformation of strictly positive numeric data. The advantage of the Box-Cox transformation is a normal-like distribution with a mean of 0:
# train data
from sklearn.preprocessing import PowerTransformer
tmp_min = min(df_train_new.DaysPublication)
tmp = [(dp - tmp_min + 1) for dp in df_train_new.DaysPublication]
del tmp_min
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
scaler_DaysPublication = PowerTransformer(method='box-cox', standardize=True, copy=True)
scaler_DaysPublication.fit(tmp)
df_train_new.DaysPublication = scaler_DaysPublication.transform(tmp)
_=plt.hist(df_train_new.DaysPublication)
As mentioned in a previous chapter, applying this scaler to the testing data gives biased results. One reason for this is the different scraping dates and therefore the different distributions of the publication dates. Furthermore, it is not entirely clear when the testing data was scraped. For that reason, I decided to harmonize this temporal gap by setting the latest publication date of each data set to $t_0$.
# test data
tmp_min = min(df_test_new.DaysPublication)
tmp = [(dp - tmp_min + 1) for dp in df_test_new.DaysPublication]
del tmp_min
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
df_test_new.DaysPublication =scaler_DaysPublication.transform(tmp)
_=plt.hist(df_test_new.DaysPublication)
The testing data still shows a bias after the transformation.
For scaling the authors' follower counts, a Box-Cox transformation using PowerTransformer is applied as well.
# train data
from sklearn.preprocessing import PowerTransformer
tmp = [(dp+1) for dp in df_train_new.AuthorFollowers]
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
scaler_AuthorFollowers = PowerTransformer(method='box-cox', standardize=True, copy=True)
scaler_AuthorFollowers.fit(tmp)
df_train_new.AuthorFollowers = scaler_AuthorFollowers.transform(tmp)
_=plt.hist(df_train_new.AuthorFollowers)
# test data
tmp = [(dp + 1) for dp in df_test_new.AuthorFollowers]
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
df_test_new.AuthorFollowers = scaler_AuthorFollowers.transform(tmp)
_=plt.hist(df_test_new.AuthorFollowers)
The distribution of the testing data after applying the scaling is left-skewed and does not follow a normal distribution.
For TitleWordCount, the Box-Cox transformation is applied again.
# train data
from sklearn.preprocessing import PowerTransformer
tmp = [(dp+1) for dp in df_train_new.TitleWordCount]
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
scaler_TitleWordCount = PowerTransformer(method='box-cox', standardize=True, copy=True)
scaler_TitleWordCount.fit(tmp)
df_train_new.TitleWordCount = scaler_TitleWordCount.transform(tmp)
_=plt.hist(df_train_new.TitleWordCount)
# test data
tmp = [(dp + 1) for dp in df_test_new.TitleWordCount]
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
df_test_new.TitleWordCount = scaler_TitleWordCount.transform(tmp)
_=plt.hist(df_test_new.TitleWordCount)
The feature TextWordCount contains the number of words in an article after the NLP preprocessing. This number is also scaled with the Box-Cox transformation.
# train data
tmp = [len(BOW) for BOW in df_train_new.Text]
df_train_new["TextWordCount"] = tmp
# test data
tmp = [len(BOW) for BOW in df_test_new.Text]
df_test_new["TextWordCount"] = tmp
# check df
df_train_new.head()
# train data
from sklearn.preprocessing import PowerTransformer
tmp = [(dp+1) for dp in df_train_new.TextWordCount]
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
scaler_TextWordCount = PowerTransformer(method='box-cox', standardize=True, copy=True)
scaler_TextWordCount.fit(tmp)
df_train_new.TextWordCount = scaler_TextWordCount.transform(tmp)
_=plt.hist(df_train_new.TextWordCount)
# test data
tmp = [(dp + 1) for dp in df_test_new.TextWordCount]
tmp = np.array(tmp)
tmp = tmp.reshape(-1, 1)
df_test_new.TextWordCount = scaler_TextWordCount.transform(tmp)
_=plt.hist(df_test_new.TextWordCount)
Now, all required features have been selected and transformed in both dataframes df_train_new and df_test_new.
df_train_new.head()
df_test_new.head()
# save cleaned and preprocessed dataframes
with open('gdrive/My Drive/Adams Assignment/New Approach/df_train_new.pkl','wb') as path_name:
pickle.dump(df_train_new, path_name)
with open('gdrive/My Drive/Adams Assignment/New Approach/df_test_new.pkl','wb') as path_name:
pickle.dump(df_test_new, path_name)
with open('gdrive/My Drive/Adams Assignment/New Approach/upper_bound_log.pkl','wb') as path_name:
pickle.dump(upper_bound_log, path_name)
# load backup
import pickle
with open('gdrive/My Drive/Adams Assignment/New Approach/df_train_new.pkl','rb') as path_name:
df_train_new = pickle.load(path_name)
with open('gdrive/My Drive/Adams Assignment/New Approach/df_test_new.pkl','rb') as path_name:
df_test_new = pickle.load(path_name)
with open('gdrive/My Drive/Adams Assignment/New Approach/upper_bound_log.pkl','rb') as path_name:
upper_bound_log = pickle.load(path_name)
df_train_new.Claps[df_train_new.Claps>1]
import seaborn
# plt.figure(figsize=(w, h), dpi=d)
seaborn.regplot(df_train_new.DaysPublication, df_train_new.Claps,scatter=True, scatter_kws={'color': 'darkgrey','alpha':0.01},line_kws={'color':'#F12B04'})
plt.title('Scatter Plot with Regression Line: Claps ~ DaysPublication')
plt.ylabel('Claps')
plt.xlabel('DaysPublication')
plt.show()
There is a positive trend between DaysPublication and Claps. This suggests that older articles tend to have accumulated a higher number of claps.
import seaborn
seaborn.regplot(df_train_new.AuthorFollowers, df_train_new.Claps,scatter=True, scatter_kws={'color': 'darkgrey', 'alpha':0.01},line_kws={'color':'#F12B04'})
plt.title('Scatter Plot with Regression Line: Claps ~ AuthorFollowers')
plt.ylabel('Claps')
plt.xlabel('AuthorFollowers')
plt.show()
There is a positive trend between AuthorFollowers and Claps. It can be assumed that an author with many followers will receive more claps on their articles.
plt.title('Heatmap: Claps ~ AuthorFollowers')
plt.ylabel('Claps')
plt.xlabel('AuthorFollowers')
plt.hist2d(np.array(df_train_new.AuthorFollowers), np.array(df_train_new.Claps), bins=10, density=False, cmap='plasma')
plt.show()
The heatmap shows the joint distribution of Claps and AuthorFollowers. Authors with few followers often post articles with only few claps, and the articles with the most claps are mostly written by authors with many followers.
seaborn.regplot(df_train_new.TitleWordCount, df_train_new.Claps,scatter=True, scatter_kws={'color': 'darkgrey', 'alpha':0.01},line_kws={'color':'#F12B04'})
plt.title('Scatter Plot with Regression Line: Claps ~ TitleWordCount')
plt.ylabel('Claps')
plt.xlabel('TitleWordCount')
plt.show()
A slight positive trend between TitleWordCount and Claps could be expected.
seaborn.regplot(df_train_new.TextWordCount, df_train_new.Claps,scatter=True, scatter_kws={'color': 'darkgrey', 'alpha':0.01},line_kws={'color':'#F12B04'})
plt.title('Scatter Plot with Regression Line: Claps ~ TextWordCount')
plt.ylabel('Claps')
plt.xlabel('TextWordCount')
plt.show()
There is a strong positive trend between TextWordCount and Claps. It seems plausible that longer articles were written with more effort, which might explain the higher number of claps.
In the later training process, a 5-fold cross-validation technique will be used in order to find the best model. In the end, the final model will be tested against additional held-back data in a validation set. Therefore, df_train_new is split into two data sets:
# split data
from sklearn import model_selection
df_train_fin, df_val_fin = model_selection.train_test_split(df_train_new, test_size=0.20, random_state=42)
print("Training size: " + str(len(df_train_fin)))
print("Validation size: " + str(len(df_val_fin)))
# make copy for further working
df_test_fin = df_test_new.copy()
In the first step, a Word2Vec model with 200 dimensions is trained on the words of the training set. After checking the effectiveness of this Word2Vec model, a DocumentVector will be created as a feature that represents the whole article.
Let $\phi(1..k)$ denote a mapping function of a regression neural network with $k$ input values and one output value. The first MLPRegressor is then trained as a feedforward neural network with

$$Claps \sim \phi\left(DocumentVector_{1}, \dots, DocumentVector_{200}\right),$$

where Claps is the regression target. Let $\psi_{1}$ be the predicted output of the first NN.
On top of this first MLPRegressor, a second neural network is trained with

$$Claps \sim \phi\left(\psi_{1},\, DaysPublication,\, AuthorFollowers,\, TitleWordCount,\, TextWordCount\right),$$

where Claps is again the regression target and $\psi_{2}$ is the final prediction of the second model. This stacked model architecture has been chosen in order to minimize the regression training error $\varepsilon_{1}^{2}= (Claps - \psi_{1})^{2}$ of the first model.
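As a compact illustration of this stacking idea, the following minimal sketch trains the two stages on dummy data; the variable names and layer sizes are purely illustrative, and the actual features and tuned models are built step by step in the chapters below.
# sketch: two stacked MLPRegressors on dummy data (illustrative only)
import numpy as np
from sklearn.neural_network import MLPRegressor
rng = np.random.default_rng(0)
X_doc = rng.normal(size=(500, 200))   # stand-in for the 200-dim document vectors
X_meta = rng.normal(size=(500, 4))    # stand-in for DaysPublication, AuthorFollowers, TitleWordCount, TextWordCount
y = rng.uniform(size=500)             # stand-in for the scaled Claps target
# stage 1: psi_1 = phi(DocumentVector)
nn1 = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500).fit(X_doc, y)
psi_1 = nn1.predict(X_doc)
# stage 2: psi_2 = phi(psi_1, remaining features)
X_stacked = np.column_stack([psi_1, X_meta])
nn2 = MLPRegressor(hidden_layer_sizes=(15, 15, 15, 15), max_iter=500).fit(X_stacked, y)
psi_2 = nn2.predict(X_stacked)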
A Word2Vec model analyzes textual data and transforms words into vectors with $k$ dimensions (word embeddings). In addition to the single words of the articles, a bigram model is trained to capture frequently used bigrams in the data. Here, only bigrams with a frequency of $q\geq 100$ are included.
The actual model training is done with the gensim API and an embedding size of $k=200$ dimensions.
import gensim
from gensim.models.phrases import Phrases
bigram_model = gensim.models.Phrases(df_train_fin.Text, min_count=100) #runs for 1-2 minutes
# show all bigrams
bigram_model.vocab
# apply bigram model to the training & validation data
df_train_fin.Text = [bigram_model[line] for line in df_train_fin.Text] #runs for 3 minutes
df_val_fin.Text = [bigram_model[line] for line in df_val_fin.Text]
# apply bigram model to the testing data
df_test_fin.Text = [bigram_model[line] for line in df_test_fin.Text]
# check one exemplary BOW
print(df_train_fin.Text[min(df_train_fin.index)])
The progress of the model training is printed out with an EpochLogger as recommended by Kite Documentation.
# print progress
from gensim.models.callbacks import CallbackAny2Vec
class EpochLogger(CallbackAny2Vec):
"""Print progress of Word2Vec training to console"""
def __init__(self):
self.epoch = 0
def on_epoch_begin(self, model):
print("Epoch #{} start".format(self.epoch))
def on_epoch_end(self, model):
print("Epoch #{} end".format(self.epoch))
self.epoch += 1
# train W2V model with 200 dimensions
from gensim.models import Word2Vec # one and a half hours of computation
epoch_logger = EpochLogger()
w2v_model = Word2Vec(df_train_fin.Text,
min_count=30,
window=5,
iter=100,
size=200,
workers=4,
callbacks=[epoch_logger])
# backup
with open('gdrive/My Drive/Adams Assignment/New Approach/w2v_model.pkl','wb') as path_name:
pickle.dump(w2v_model, path_name)
# load backup
with open('gdrive/My Drive/Adams Assignment/New Approach/w2v_model.pkl','rb') as path_name:
w2v_model = pickle.load(path_name)
# example vector of the word "data"
w2v_model.wv["data"]
# example most similar words to "python"
w2v_model.wv.most_similar("python")
# example cosine similarity of "dog" and "cat"
print(w2v_model.wv.similarity("dog", "cat"))
# example: find the word that does not match the others
print(w2v_model.wv.doesnt_match(["python", "dog", "cat", "bird"]))
A popular method for visualizing a Word2Vec model is a t-distributed stochastic neighbor embedding (t-SNE) plot. The algorithm reduces the dimensionality of the high-dimensional word embeddings so that they can be plotted in 2D or 3D. In this case, I directly followed the idea and code presented by Sergey Smetanin in 2019 on Habr, who uses a 2D TSNE model. For each key word, its 10 most similar words are added to the plot.
# define key words and get top 10 similar words with their embeddings
keys = ['paris', 'python', 'sunday', 'twitter', 'bachelor', 'delivery', 'election', 'expensive',
'experience', 'financial', 'food', 'ios', 'peace', 'release', 'war', 'christopher']
embedding_clusters = []
word_clusters = []
for word in keys:
if word in w2v_model.wv.vocab:
embeddings = []
words = []
for similar_word, _ in w2v_model.most_similar(word, topn=10):
words.append(similar_word)
embeddings.append(w2v_model[similar_word])
embedding_clusters.append(embeddings)
word_clusters.append(words)
# train the TSNE model and transform the embeddings into 2D
from sklearn.manifold import TSNE
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)
# create the TSNE plot
import matplotlib.cm as cm
# suppress warning messages
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
def tsne_plot_similar_words(labels, embedding_clusters, word_clusters, a=0.7):
plt.figure(figsize=(20, 10))
colors = cm.rainbow(np.linspace(0, 1, len(labels)))
for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
x = embeddings[:,0]
y = embeddings[:,1]
plt.scatter(x, y, c=color, alpha=a, label=label)
for i, word in enumerate(words):
plt.annotate(word, alpha=0.8, xy=(x[i], y[i]), xytext=(5, 2),
textcoords='offset points', ha='right', va='bottom', size=11)
plt.legend(loc='upper right', frameon=True, fontsize=11)
plt.grid(True)
plt.show()
tsne_plot_similar_words(keys, embeddings_en_2d, word_clusters)
The TSNE plot shows each predefined key word as a cluster centroid together with the 10 most similar words from the Word2Vec model. The clustered words appear to be sensibly chosen. For instance, the cluster key "paris" is surrounded by other metropolitan cities around the world, and this cluster can easily be distinguished from the others in the plot.
The final step in the Word2Vec model is the calculation of a document vector that represents a specific article with one single vector of dimensionality $k=200$. There are different approaches to create such a document vector; a complication is that the articles have different numbers of words.
In this solution, the arithmetic mean is used to consolidate the word vectors of an article into one vector. It can be assumed that a vector space with $k=200$ dimensions is sufficiently large to create enough distinction/distance between the documents. Similar articles will thus obtain a similar vector projection.
For instance, an article about data science should be more similar to an article about maths than to a political text.
# calculate document vector
def calcDocMean(input_data):
    tmp = [w2v_model.wv[word] for word in input_data if word in w2v_model.wv.vocab]
tmp = np.array(tmp)
if len(tmp)<1:
return np.zeros((200,), dtype=float)
else: return np.nanmean(tmp.astype('float64'), axis=0)
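To make this intuition concrete, the document vectors of a few hand-picked word lists can be compared with cosine similarity (a minimal sketch; the word lists are purely illustrative and assume these words occur in the Word2Vec vocabulary):
# sketch: compare document vectors of illustrative word lists
from sklearn.metrics.pairwise import cosine_similarity
doc_ds = calcDocMean(["data", "machine", "learn", "model"])
doc_math = calcDocMean(["math", "statistics", "probability", "equation"])
doc_politics = calcDocMean(["election", "president", "government", "vote"])
print(cosine_similarity([doc_ds], [doc_math]))      # expected to be comparatively high
print(cosine_similarity([doc_ds], [doc_politics]))  # expected to be lower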
df_train_fin.head()
# train data
from tqdm import tqdm
tqdm.pandas()
tmp_train_doc = df_train_fin.Text.progress_apply(lambda x: calcDocMean(x))
tmp_train_doc = [list(r) for r in tmp_train_doc]
df_train_fin["DocumentVector"] = tmp_train_doc
del tmp_train_doc
# backup
with open('gdrive/My Drive/Adams Assignment/New Approach/df_train_fin.pkl','wb') as path_name:
pickle.dump(df_train_fin, path_name)
# validation data
from tqdm import tqdm
tqdm.pandas()
tmp_val_doc = df_val_fin.Text.progress_apply(lambda x: calcDocMean(x))
tmp_val_doc = [list(r) for r in tmp_val_doc]
df_val_fin["DocumentVector"] = tmp_val_doc
del tmp_val_doc
# backup
with open('gdrive/My Drive/Adams Assignment/New Approach/df_val_fin.pkl','wb') as path_name:
pickle.dump(df_val_fin, path_name)
# test data
from tqdm import tqdm
tqdm.pandas()
tmp_test_doc = df_test_fin.Text.progress_apply(lambda x: calcDocMean(x))
tmp_test_doc = [list(r) for r in tmp_test_doc]
df_test_fin["DocumentVector"] = tmp_test_doc
del tmp_test_doc
# backup
with open('gdrive/My Drive/Adams Assignment/New Approach/df_test_fin.pkl','wb') as path_name:
pickle.dump(df_test_fin, path_name)
import pickle
with open('gdrive/My Drive/Adams Assignment/New Approach/df_test_fin.pkl','rb') as path_name:
df_test_fin = pickle.load(path_name)
with open('gdrive/My Drive/Adams Assignment/New Approach/df_val_fin.pkl','rb') as path_name:
df_val_fin = pickle.load(path_name)
with open('gdrive/My Drive/Adams Assignment/New Approach/df_train_fin.pkl','rb') as path_name:
df_train_fin = pickle.load(path_name)
In order to check the semantics of the document vectors, four keyword tuples, each belonging to one specific topic, are chosen: (blockchain, bitcoin), (ai, ml), (trump, obama) and (company, business).
For each tuple, 10 articles containing both words are sampled and a composite document vector of those 10 articles is calculated. Then an inverse search for the most similar words to this composite vector is performed. This way, one can quickly evaluate whether the document vector approach is effective.
# choose 10 blockchain/bitcoin articles
blockchain_articles = [a for a in df_train_fin.Text if "blockchain" in a and "bitcoin" in a]
blockchain_articles= blockchain_articles[0:10]
bc = [calcDocMean(b) for b in blockchain_articles]
# print words with highest similarity to this composite vector
bc_docVec_mean = np.mean(bc, axis=0)
print(w2v_model.wv.most_similar([bc_docVec_mean]))
del bc, blockchain_articles, bc_docVec_mean
# choose 10 ai/ml articles
ai_articles = [a for a in df_train_fin.Text if "ai" in a and "ml" in a]
ai_articles= ai_articles[0:10]
ai = [calcDocMean(a) for a in ai_articles]
# print words with highest similarity to this composite vector
ai_docVec_mean = np.mean(ai, axis=0)
print(w2v_model.wv.most_similar([ai_docVec_mean]))
del ai, ai_articles, ai_docVec_mean
# choose 10 trump/obama articles
america_articles = [a for a in df_train_fin.Text if "trump" in a and "obama" in a]
america_articles= america_articles[0:10]
am = [calcDocMean(a) for a in america_articles]
# print words with highest similarity to this composite vector
am_docVec_mean = np.mean(am, axis=0)
print(w2v_model.wv.most_similar([am_docVec_mean]))
del am, america_articles, am_docVec_mean
# choose 10 company/business articles
company_articles = [a for a in df_train_fin.Text if "company" in a and "business" in a]
company_articles= company_articles[0:10]
cp = [calcDocMean(a) for a in company_articles]
# print words with highest similarity to this composite vector
cp_docVec_mean = np.mean(cp, axis=0)
print(w2v_model.wv.most_similar([cp_docVec_mean]))
del cp, company_articles, cp_docVec_mean
Considering the output for these four tuples, the document vector model seems to be able to represent semantic essences of documents.
The idea of a document vector space immediately suggests performing a cluster analysis. In the following subsection, the applicability of a K-Means model is tested, but then discarded. The reasons are the difficulty of choosing the number of clusters and the potential loss of information.
# k=2 clusters
from sklearn.cluster import KMeans
X=np.stack(df_train_fin["DocumentVector"], axis=0)
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[0]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[1]])])
A clustering with $k=2$ centroids reveals a very plausible splitting criterion: Spanish vs. English articles.
# k=5 clusters
from sklearn.cluster import KMeans
X=np.stack(df_train_fin["DocumentVector"], axis=0)
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[0]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[1]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[2]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[3]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[4]])])
A clustering with $k=5$ centroids shows the following splits: Business, Turkish, General, Spanish, Data Science.
# k=10 clusters
from sklearn.cluster import KMeans
X=np.stack(df_train_fin["DocumentVector"], axis=0)
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[0]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[1]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[2]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[3]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[4]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[5]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[6]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[7]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[8]])])
print([w[0] for w in w2v_model.wv.most_similar([kmeans.cluster_centers_[9]])])
A clustering with $k=10$ centroids shows the following splits: Python Coding, Humans, French, Business, Spanish, Data Science, General, Turkish, Statistical Modelling, AI.
To some extent, adding more clusters seems to improve the semantic precision of the clusters. That works against the general idea of clustering, namely dimensionality reduction: in this case, too much information would be lost by reducing the document vectors to cluster assignments. The elbow-curve visualization also gives no clear hint for a suitable number of clusters.
# visualization with elbow curve
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
X=np.stack(df_train_fin["DocumentVector"], axis=0)
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,30))
visualizer.fit(X)
Hence, the idea of clustering is kept for completeness, but not included in the further model building.
The first NN is based on the feature DocumentVector and thus predicts the number of claps based on the textual data.
Remember that $\phi(1..k)$ denotes a mapping function of a regression neural network with $k$ input values and one output value. Here, the first MLPRegressor is trained as a feedforward neural network with

$$Claps \sim \phi\left(DocumentVector_{1}, \dots, DocumentVector_{200}\right),$$

where Claps is the regression target. Let $\psi_{1}$ be the predicted output of the first NN. $\psi_{1}$ is stored in the feature NN1_output.
# import libraries
# for training
import sklearn
from sklearn import neural_network
from sklearn.model_selection import RandomizedSearchCV
# for assessment
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
# create grid with hidden layers for randomized CV
hiddenLayerGrid = []
for layers in range(1,10):
for neurons in range(25, 201,25):
# for position in len(neurons):
hiddenLayerGrid.append((neurons,)*layers)
hiddenLayerGrid
# build NLP Regression NN: Claps ~ DocumentVector
reg_ML_NLP = sklearn.neural_network.MLPRegressor(max_iter=1000, verbose = True)
param_list = {"hidden_layer_sizes": hiddenLayerGrid,
"activation": ["logistic", "relu"],
"solver": ["adam"],
"alpha": [0.00005, 0.0001, 0.0005]}
X_NLP = [np.array(x) for x in df_train_fin.DocumentVector]
y_NLP = np.array(df_train_fin.Claps, dtype=float)
y_NLP = y_NLP.reshape((-1,1))
randCV_NLP = RandomizedSearchCV(reg_ML_NLP, param_list, scoring="neg_mean_squared_error", n_jobs=-1, n_iter=10 )
randCV_NLP.fit(X_NLP, y_NLP)
randCV_NLP.best_params_
Based on the best parameters of the randomized parameter search, the following architecture of the first NN emerges:
# Image(filename='NN1.png')
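Since the architecture figure (NN1.png) is not rendered here, the selected configuration can also be read directly from the fitted search object (a small optional check):
# inspect the architecture chosen by the randomized search
best_nn1 = randCV_NLP.best_estimator_
print("hidden layers:", best_nn1.hidden_layer_sizes)
print("activation:", best_nn1.activation)
print("alpha:", best_nn1.alpha)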
# backup randCV_NLP
with open('gdrive/My Drive/Adams Assignment/New Approach/randCV.pkl','wb') as path_name:
pickle.dump(randCV_NLP, path_name)
# re-load randCV_NLP
with open('gdrive/My Drive/Adams Assignment/New Approach/randCV.pkl','rb') as path_name:
randCV_NLP = pickle.load(path_name)
y_hat_NLP = randCV_NLP.predict(X_NLP)
# accuracy on the training data
print("MSE: {}".format(mean_squared_error(y_NLP, y_hat_NLP)))
print("R^2: {}".format(r2_score(y_NLP, y_hat_NLP)))
# train data
X=[np.array(x) for x in df_train_fin.DocumentVector]
y_pred = randCV_NLP.predict(X)
df_train_fin["NN1_output"] = y_pred
# val data
X=[np.array(x) for x in df_val_fin.DocumentVector]
y_pred = randCV_NLP.predict(X)
df_val_fin["NN1_output"] = y_pred
# test data
X=[np.array(x) for x in df_test_fin.DocumentVector]
y_pred = randCV_NLP.predict(X)
df_test_fin["NN1_output"] = y_pred
df_train_fin.head()
Using the output $\psi_{1}$ of the first NN and the remaining features, a second neural network is trained with

$$Claps \sim \phi\left(\psi_{1},\, DaysPublication,\, AuthorFollowers,\, TitleWordCount,\, TextWordCount\right),$$

where Claps is again the regression target and $\psi_{2}$ is the final prediction of the second model. This stacked model architecture has been chosen in order to minimize the regression training error $\varepsilon_{1}^{2}= (Claps - \psi_{1})^{2}$ of the first model.
# create grid with hidden layers for randomized CV
hiddenLayerGrid2 = []
for layers in range(1,15):
for neurons in range(3, 20, 1):
# for position in len(neurons):
hiddenLayerGrid2.append((neurons,)*layers)
hiddenLayerGrid2
reg_ML_fin = sklearn.neural_network.MLPRegressor(max_iter=1000, verbose = True)
param_list = {"hidden_layer_sizes": hiddenLayerGrid2,
"activation": ["logistic", "relu"],
"solver": ["adam"],
"alpha": [0.00005, 0.0001, 0.0005]}
X_fin = df_train_fin[["DaysPublication", "AuthorFollowers", "TitleWordCount", "TextWordCount", "NN1_output"]]
y_fin = np.array(df_train_fin.Claps, dtype=float)
y_fin = y_fin.reshape((-1,1))
randCV_fin = RandomizedSearchCV(reg_ML_fin, param_list, scoring="neg_mean_squared_error", n_jobs=-1, n_iter=30 )
randCV_fin.fit(X_fin, y_fin)
randCV_fin.best_params_
The following composite final model architecture with 4 hidden layers with 15 neurons each is chosen:
# Image(filename='NN2.png')
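As before, since the figure (NN2.png) is not rendered here, the chosen configuration of the second model can be read from the fitted search object (a small optional check):
# inspect the architecture chosen for the second (final) model
best_nn2 = randCV_fin.best_estimator_
print("hidden layers:", best_nn2.hidden_layer_sizes)
print("activation:", best_nn2.activation)
print("alpha:", best_nn2.alpha)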
# backup randCV_fin
with open('gdrive/My Drive/Adams Assignment/New Approach/randCV_fin.pkl','wb') as path_name:
pickle.dump(randCV_fin, path_name)
# load randCV_fin
with open('gdrive/My Drive/Adams Assignment/New Approach/randCV_fin.pkl','rb') as path_name:
randCV_fin = pickle.load(path_name)
# select the training features
X_train = df_train_fin[["DaysPublication", "AuthorFollowers", "TitleWordCount", "TextWordCount", "NN1_output"]]
# predict
df_train_fin["predicted_Claps"] = randCV_fin.predict(X_train)
# accuracy on the training data
print("MSE: {}".format(mean_squared_error(df_train_fin["Claps"], df_train_fin["predicted_Claps"])))
print("R^2: {}".format(r2_score(df_train_fin["Claps"], df_train_fin["predicted_Claps"])))
Based on the training data, this final model improves the predictions compared to the plain NLP model (1), where $MSE_{(1)}= 0.0745$ and $R^2_{(1)}= 0.3697$. The composite model performs better.
# create a dataframe with retransformed Claps and predicted_claps
retransf_train = df_train_fin[["Claps", "predicted_Claps"]]
retransf_train.Claps = [re_Transformation_Claps(r) for r in retransf_train.Claps]
retransf_train.predicted_Claps = [re_Transformation_Claps(r) for r in retransf_train.predicted_Claps]
retransf_train.head()
# accuracy on the training data after retransformation
print("MSE: {}".format(mean_squared_error(retransf_train["Claps"], retransf_train["predicted_Claps"])))
print("MAE: {}".format(mean_absolute_error(retransf_train["Claps"], retransf_train["predicted_Claps"])))
print("R^2: {}".format(r2_score(retransf_train["Claps"], retransf_train["predicted_Claps"])))
Though $R^2$ decreased after the retransformation, the predictions seem acceptable with $MSE_{(2)}=13271.2$ and $MAE_{(2)}=53.6$. That is, on average the prediction deviates from the correct number of claps by only about 53 claps.
These first assumptions on the model efficiency need to be validated with the validation data set.
# select the validation features
X_val = df_val_fin[["DaysPublication", "AuthorFollowers", "TitleWordCount", "TextWordCount", "NN1_output"]]
# predict
df_val_fin["predicted_Claps"] = randCV_fin.predict(X_val)
df_val_fin.head()
# create a dataframe with retransformed Claps and predicted_claps
retransf_val = df_val_fin[["Claps", "predicted_Claps"]]
retransf_val.Claps = [re_Transformation_Claps(r) for r in retransf_val.Claps]
retransf_val.predicted_Claps = [re_Transformation_Claps(r) for r in retransf_val.predicted_Claps]
retransf_val.head(n=15)
# accuracy on the validation data after retransformation
print("MSE: {}".format(mean_squared_error(retransf_val["Claps"], retransf_val["predicted_Claps"])))
print("MAE: {}".format(mean_absolute_error(retransf_val["Claps"], retransf_val["predicted_Claps"])))
print("R^2: {}".format(r2_score(retransf_val["Claps"], retransf_val["predicted_Claps"])))
The model shows similar results when applied to the validation data set. Though $R^2_{val}$ decreased even further, the predictions seem to be as acceptable as on the training data, with $MSE_{(2), val}=14663.6$ and $MAE_{(2), val}=58.7$. That is, on average the prediction deviates from the correct number of claps by only about 58 claps. Since the models were trained with the $MSE$ as loss function, I focus on this measure.
The predictive power of the composite NN has been shown. In this section, a plain linear regression is fitted on the input features of model 2, with the model's predictions as target. This surrogate model gives insight into the black-box NN and helps to understand how the individual features drive its output.
# surrogate model - linear regression on: predicted_Claps ~ [DaysPublication, AuthorFollowers, TitleWordCount, TextWordCount, NN1_output]
from sklearn.linear_model import LinearRegression
X_lin = df_train_fin[["DaysPublication", "AuthorFollowers", "TitleWordCount", "TextWordCount", "NN1_output"]]
y_lin = np.array(df_train_fin.predicted_Claps, dtype=float)
y_lin = y_lin.reshape((-1,1))
lin_reg = LinearRegression().fit(X_lin, y_lin)
lin_reg.score(X_lin, y_lin)
print(["DaysPublication", "AuthorFollowers", "TitleWordCount", "TextWordCount", "NN1_output"])
print(lin_reg.coef_)
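For readability, the surrogate coefficients can also be paired with their feature names (a small optional sketch using the objects defined above):
# pair the surrogate-model coefficients with the feature names
coef_overview = pd.Series(lin_reg.coef_.ravel(), index=X_lin.columns)
print(coef_overview.sort_values())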
The coefficients of the surrogate model demonstrate that the output of the first NLP regression model has by far the strongest impact on the final prediction. Hence, the article text and its related document vector carry a lot of predictive power.
The features DaysPublication, AuthorFollowers and TextWordCount have a weak positive per-unit effect on the predicted number of claps, while TitleWordCount shows a weak negative effect.
# select the test features
X_test = df_test_fin[["DaysPublication", "AuthorFollowers", "TitleWordCount", "TextWordCount", "NN1_output"]]
# predict
df_test_fin["predicted_Claps"] = randCV_fin.predict(X_test)
df_test_fin.head()
min(df_test_fin["predicted_Claps"])
max(df_test_fin["predicted_Claps"])
_=plt.hist(df_test_fin["predicted_Claps"])
# retransform the claps
df_test_fin["predicted_Claps_retransformed"] = [re_Transformation_Claps(r) for r in df_test_fin["predicted_Claps"]]
df_test_fin.head()
# create output DataFrame
df_submission = df_test_fin[["Index", "predicted_Claps_retransformed"]]
df_submission.columns= (["index", "Claps"])
df_submission.head()
pth = "gdrive/My Drive/Adams Assignment/New Approach/Final_Submission.csv"
df_submission.to_csv(pth, sep=",", index=False)