Overview
- Today, recommendation engines are everywhere, and people expect data scientists to know how to build one.
- Word2Vec is a very popular word embedding used for various NLP tasks.
- We will use Word2Vec to build our own recommendation system. Let’s see how NLP and recommendation engines can work together!
The complete code can be downloaded from here:
https://github.com/prateekjoshi565/recommendation_system/blob/master/recommender_2.ipynb
Introduction
Have you ever noticed the “Recommended for You” section on Amazon? Ever since I learned a few years ago that machine learning powers it, I have been fascinated by it. Every time I log into Amazon, I pay close attention to that section.
Companies like Netflix, Google, Amazon, and Flipkart spend millions of dollars perfecting their recommendation engines for a reason: it is a powerful channel for information retrieval and enhances the consumer experience.
Let me illustrate this with a recent example. I went to a popular online marketplace to buy a recliner. There were many types of recliners on offer, and I liked most of them; I clicked on a faux-leather manual recliner to view its details.

Notice the different types of information displayed on the page; the left half of the image contains product images from different angles. The right half contains some details about the product and some similar products.
And this is my favorite part: the website is recommending similar products to me, which saves me the time of manually browsing through similar recliners.
In this article, we will build our own recommendation system. However, we will approach this problem from a unique perspective. We will use an NLP concept—Word2Vec—to recommend products to users. If you feel a little excited about this tutorial, then let’s get started!
Along the way I will mention a few concepts; for a quick refresher, I suggest taking a look at these two articles:
Understanding Neural Networks: https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/?utm_source=blog&utm_medium=how-to-build-recommendation-system-word2vec-python
Comprehensive Guide to Building Recommendation Engines: https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/?utm_source=blog&utm_medium=how-to-build-recommendation-system-word2vec-python
Word2Vec – Vector Representation of Words
We know that machines find it difficult to process raw text data. In fact, machines can handle little other than numerical data. Therefore, representing text as vectors is almost always one of the most important steps in any NLP task.
One of the most important steps in this direction is using Word2Vec embeddings, which were introduced to the NLP community in 2013 and revolutionized the entire development of NLP.
It turns out that these embeddings are state-of-the-art for tasks such as word analogy and word similarity. Word2Vec embeddings can even solve analogies like King - man + woman ~= Queen, which is an almost magical result.
There are two types of Word2Vec models—Continuous Bag of Words model and Skip-Gram model. In this article, we will use the Skip-Gram model.
First, let’s understand how Word2Vec vectors, or embeddings, are computed.
How to Obtain Word2Vec Embeddings?
The Word2Vec model is a simple neural network with a single hidden layer, and its training task is to predict the neighboring words of each word in a sentence. However, our goal has nothing to do with this task. What we want is the weights learned by the model’s hidden layer once training is done; those weights can then be used as the word embeddings.
Let me give you an example to illustrate how the Word2Vec model works. Consider the following sentence:

Suppose the word “teleport” (highlighted in yellow) is our input word, and we use a context window of size 2. This means we only consider the two words on either side of the input word as its neighboring words.
Note: The size of the context window is not fixed and can be changed according to our needs.
Now, the model’s task is this: take the input word and predict, for every word in the vocabulary, the probability of it appearing inside the context window (i.e. of being a neighboring word). This should sound fairly intuitive, right?
Let’s take another example to understand the whole process in detail.
Preparing Training Data
We need a labeled dataset to train the neural network model. This means the dataset should have a set of inputs and a corresponding output for each input. At this point, you might have some questions, such as:
- Where can I find such a dataset?
- What does this dataset contain?
- How large is this dataset?
And so on.
However, I want to tell you that we can easily create our own labeled data to train the Word2Vec model. Below, I will demonstrate how to generate this dataset from any text. Let’s use a sentence and create training data from it.
Step One: The yellow highlighted word will serve as the input, and the green highlighted words will serve as the output words. We will use a window size of 2 words. Let’s start with the first word as the input word.

So, the training samples for this input word are as follows:
Step Two: Next, we will take the second word as the input word. The context window will also move accordingly. Now, the neighboring words are “we”, “become”, and “what”.

The new training samples will be added to the previous training samples as follows:
We will repeat these steps until we reach the last word. Finally, the complete training data for this sentence is as follows:
We extracted 27 training samples from one sentence, which is one of the many aspects I enjoy about handling unstructured data—creating a labeled dataset out of thin air.
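To make the pair-generation process concrete, here is a minimal sketch (the sentence is a hypothetical stand-in, since the article’s example sentence only appears in the images above) that builds skip-gram (input word, context word) training pairs with a window size of 2:
# A minimal sketch: generate skip-gram (input word, context word) pairs
# from a sentence with a context window of size 2.
# The sentence below is a hypothetical stand-in for the article's pictured example.
sentence = "we become what we think about all day long".split()
window_size = 2
training_pairs = []
for idx, input_word in enumerate(sentence):
    # Neighbors: up to `window_size` positions to the left and right of the input word
    for ctx in range(max(0, idx - window_size), min(len(sentence), idx + window_size + 1)):
        if ctx != idx:
            training_pairs.append((input_word, sentence[ctx]))
print(training_pairs[:3])  # [('we', 'become'), ('we', 'what'), ('become', 'we')]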
Obtaining Word2Vec Embeddings
Now, suppose we have a bunch of sentences, and we extract training samples from these sentences in the same way. We will eventually obtain a fairly large training dataset.
Suppose this dataset has 5,000 unique words, and we want to create a 100-dimensional vector for each of them. Then, for the Word2Vec architecture given below:
- V = 5000 (vocabulary size)
- N = 100 (number of hidden units, i.e. the length of each word embedding)
The input will be a one-hot encoded vector, while the output layer will give the probability of each word in the vocabulary being nearby.
Once the model is trained, we can easily extract the learned V x N weight matrix of the hidden layer and use it to look up word vectors:
As you can see above, the shape of the weight matrix is 5000 x 100. The first row of this matrix corresponds to the first word in the vocabulary, the second to the second word, and so on.
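As a small illustration (a toy sketch with random numbers, not part of the original article), multiplying a word’s one-hot vector by this V x N weight matrix simply picks out the corresponding row, and that row is the word’s embedding:
import numpy as np
# Toy sketch: V = 5000 vocabulary words, N = 100 hidden units
V, N = 5000, 100
W = np.random.rand(V, N)    # stands in for the learned hidden-layer weight matrix
word_index = 42             # position of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[word_index] = 1
embedding = one_hot @ W     # selects row 42 of W
assert np.allclose(embedding, W[word_index])
print(embedding.shape)      # (100,)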

This is how we obtain fixed-size word vectors or embeddings through Word2Vec. Words that are similar in this dataset will have similar vectors, meaning they point in the same direction. For example, the words “car” and “jeep” have similar vectors:

This is a high-level overview of how Word2Vec is used in NLP.
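One detail worth spelling out: the similarity scores that appear later in this article are cosine similarities between embedding vectors, i.e. a measure of whether two vectors point in the same direction. A quick sketch with hypothetical vectors:
import numpy as np
def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means they are orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Hypothetical 100-dimensional embeddings for "car" and "jeep"
car = np.random.rand(100)
jeep = car + 0.1 * np.random.rand(100)   # a vector close to "car"
print(cosine_similarity(car, jeep))      # close to 1.0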
Before we start building the recommendation system, let me ask you a question. How can Word2Vec be used for non-NLP tasks, such as product recommendation? I believe you have been thinking about this question since you read the title of this article. Let’s solve this puzzle together.
Applying Word2Vec Model on Non-Text Data
Can you guess the basic characteristic of natural language that Word2Vec uses to create vector representations of text?
It is the sequential nature of text. Every sentence or phrase has a sequence of words. Without this order, it would be difficult for us to understand the text. Try explaining the following sentence:
“these most been languages deciphered written of have already”
This sentence has no order, making it difficult for us to understand, which is why the order of words is so important in any natural language. It is this characteristic that got me thinking about other, non-textual kinds of data that are sequential in nature as well.
One such type of data is consumer purchasing behavior on e-commerce websites. Most of the time, consumers’ purchasing behavior follows a pattern; for example, a person who is into sports is likely to have an online purchasing pattern along these lines:

If we can represent each product as a vector, we can easily find similar products. Therefore, if a user views a product online, we can easily recommend similar products by using the vector similarity scores between products.
But how do we obtain these vector representations for products? Can we use the Word2Vec model to obtain these vectors?
The answer is, of course, yes! Imagine the consumer’s purchase history as a sentence, and the products as the words in that sentence:

With that idea in place, let’s dive into some online retail data and use Word2Vec to build a recommendation system.
Case Study: Online Product Recommendation Using Word2Vec in Python
Now, let’s once again clarify our problem and requirements:
We are tasked with creating a system that automatically recommends a certain number of products to consumers on an e-commerce website based on their past purchasing behavior.
We will use an online retail dataset, which you can download from this link:
https://archive.ics.uci.edu/ml/machine-learning-databases/00352/
Let’s launch Jupyter Notebook, quickly import the required libraries, and load the dataset.
import pandas as pd
import numpy as np
import random
from tqdm import tqdm
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
%matplotlib inline
import warnings;
warnings.filterwarnings('ignore')
df = pd.read_excel('Online Retail.xlsx')
df.head()

Here is a description of the fields in the dataset:
- InvoiceNo: Invoice number; a unique identifier assigned to each transaction
- StockCode: Product code; a unique identifier assigned to each distinct product
- Description: Description of the product
- Quantity: Quantity of the product in each transaction
- InvoiceDate: Date and time of each transaction
- CustomerID: Customer number; a unique identifier assigned to each customer
df.shape
Output: (541909, 8)
The dataset contains 541,909 records, which is plenty of data for building our model.
Handling Missing Data
# Check for missing values
df.isnull().sum()

Since we have enough data, we will remove all rows with missing values.
# Remove rows with missing values
df.dropna(inplace=True)
Preparing the Data
Let’s convert StockCode to string data type:
df['StockCode']= df['StockCode'].astype(str)
Let’s take a look at the number of customers in our dataset:
customers = df["CustomerID"].unique().tolist()
len(customers)
Output: 4372
There are 4,372 customers in our dataset, and we will extract their purchase histories. In other words, we can have 4372 purchase sequences.
It’s a good practice to reserve a small portion of the dataset for validation. Therefore, I will use 90% of the customers’ data to create Word2Vec embeddings. Let’s start splitting the data.
# Shuffle customer IDs
random.shuffle(customers)
# Extract 90% of customers
customers_train = [customers[i] for i in range(round(0.9*len(customers)))]
# Split into training and validation sets
train_df = df[df['CustomerID'].isin(customers_train)]
validation_df = df[~df['CustomerID'].isin(customers_train)]
We will create sequences of purchases made by customers in the dataset for the training and validation sets.
# Store customers' purchase histories
purchases_train = []

# Fill the list with product codes
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

# Store customers' purchase histories
purchases_val = []

# Fill the list with product codes
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)
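As a side note, the same sequences can be built more concisely with pandas’ groupby. A sketch of the equivalent one-liners (not part of the original code; note that groupby sorts customers by ID, so the order of the sequences may differ from the loop version above):
# Equivalent, more concise construction of the purchase sequences using groupby
purchases_train = train_df.groupby("CustomerID")["StockCode"].apply(list).tolist()
purchases_val = validation_df.groupby("CustomerID")["StockCode"].apply(list).tolist()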
Building Word2Vec Embeddings for Products
# Train the Word2Vec model
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
Since we do not intend to train the model any further, we call init_sims() here. This will make the model more memory-efficient:
model.init_sims(replace=True)
Let’s take a look at the relevant parameters of the “model”:
print(model)
Output: Word2Vec(vocab=3151, size=100, alpha=0.03)
Our model has a vocabulary of 3,151 unique words (products), each represented by a 100-dimensional vector. Next, we will extract the vectors for all the words in the vocabulary and store them in one place for easy access.
# Extract vectors
X = model[model.wv.vocab]
X.shape
Output: (3151, 100)
Visualizing Word2Vec Embeddings
Visualizing the embeddings you created is very helpful. Here, we have 100-dimensional embeddings. We cannot visualize 4-dimensional space, let alone 100-dimensional space, so how do we do it?
We will use the UMAP algorithm to reduce the dimensionality of the product embeddings from 100 to 2; UMAP is commonly used for dimensionality reduction.
import umap

cluster_embedding = umap.UMAP(n_neighbors=30, min_dist=0.0,
                              n_components=2, random_state=42).fit_transform(X)

plt.figure(figsize=(10,9))
plt.scatter(cluster_embedding[:, 0], cluster_embedding[:, 1], s=3, cmap='Spectral')

Each point in this graph represents a product. As you can see, these data points have several small clusters. These are groups of similar products.
Starting Product Recommendations
Congratulations! We are finally ready with Word2Vec embeddings for each product in our online retail dataset. Now, our next step is to recommend similar products for a specific product or based on the vector of a specific product.
Let’s first create a dictionary of product IDs and product descriptions for easy mapping of product descriptions to their IDs and vice versa.
products = train_df[["StockCode", "Description"]]
# Remove duplicates
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
# Create a dictionary of product IDs and product descriptions
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()
# Dictionary test
products_dict['84029E']
Output: [‘RED WOOLLY HOTTIE WHITE HEART.’]
I have defined the function below, which takes a product’s vector (v) as input and returns the top n similar products (6 by default):
def similar_products(v, n=6):
    # Extract the most similar products for the input vector
    ms = model.similar_by_vector(v, topn=n+1)[1:]
    # Extract the names and similarity scores of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
Let’s test this by passing the product ID ‘90019A’ (‘SILVER M.O.P ORBIT BRACELET’):
similar_products(model['90019A'])
Output:
[(‘SILVER M.O.P ORBIT DROP EARRINGS’, 0.766798734664917),
(‘PINK HEART OF GLASS BRACELET’, 0.7607438564300537),
(‘AMBER DROP EARRINGS W LONG BEADS’, 0.7573930025100708),
(‘GOLD/M.O.P PENDANT ORBIT NECKLACE’, 0.7413625121116638),
(‘ANT COPPER RED BOUDICCA BRACELET’, 0.7289256453514099),
(‘WHITE VINT ART DECO CRYSTAL NECKLAC’, 0.7265784740447998)]
Cool! The results are very relevant and match well with the input product. However, this output is based on the vector of a single product. What if we want to recommend products based on a user’s multiple past purchases?
A simple solution is to take the average of the vectors of all products purchased by the user so far and use this resulting vector to find similar products. We will use the function below, which takes a list of product IDs and returns a 100-dimensional vector that is the average of the vectors of the products in the input list:
def aggregate_vectors(products):
    product_vec = []
    for i in products:
        try:
            product_vec.append(model[i])
        except KeyError:
            continue
    return np.mean(product_vec, axis=0)
Recall that for validation purposes, we have created a separate list of purchase sequences. Now we can conveniently utilize it.
len(purchases_val[0])
Output: 314
The length of the list of products purchased by the first user is 314. We will pass this list of products from the validation set to the aggregate_vectors function.
aggregate_vectors(purchases_val[0]).shape
Output: (100, )
The function returned a 100-dimensional array. This means the function is working correctly. Now we can use this result to get the most similar products:
similar_products(aggregate_vectors(purchases_val[0]))
Output:
[(‘PARTY BUNTING’, 0.661663293838501),
(‘ALARM CLOCK BAKELIKE RED ‘, 0.640213131904602),
(‘ALARM CLOCK BAKELIKE IVORY’, 0.6287959814071655),
(‘ROSES REGENCY TEACUP AND SAUCER ‘, 0.6286610960960388),
(‘SPOTTY BUNTING’, 0.6270893216133118),
(‘GREEN REGENCY TEACUP AND SAUCER’, 0.6261675357818604)]
As a result, our system recommended 6 products based on the user’s entire purchase history. Additionally, you can also recommend products based on the most recent purchases.
Below, I simply provided the last 10 products purchased as input:
similar_products(aggregate_vectors(purchases_val[0][-10:]))
Output:
[(‘PARISIENNE KEY CABINET ‘, 0.6296610832214355),
(‘FRENCH ENAMEL CANDLEHOLDER’, 0.6204789876937866),
(‘VINTAGE ZINC WATERING CAN’, 0.5855435729026794),
(‘CREAM HANGING HEART T-LIGHT HOLDER’, 0.5839680433273315),
(‘ENAMEL FLOWER JUG CREAM’, 0.5806118845939636)]
Feel free to play with this code and generate recommendations for more of the product sequences in the validation set. You can also optimize or extend the code further.
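For instance, here is a quick way to eyeball recommendations for a few more validation users, reusing the functions defined above (a sketch):
# Recommend products for the first three validation users,
# based on each user's last 10 purchases
for purchases in purchases_val[:3]:
    print(similar_products(aggregate_vectors(purchases[-10:])))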
Conclusion
Finally, you can try implementing this code on similar non-text sequential data. For example, music recommendation is a great use case.
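As a hint of what that might look like (a hypothetical sketch; the track IDs and listening histories below are made up), the recipe is identical: treat each user’s listening history as a “sentence” of track IDs and train Word2Vec on those sequences.
from gensim.models import Word2Vec
# Hypothetical listening histories: each inner list is one user's sequence of track IDs
listening_histories = [
    ["track_12", "track_98", "track_45", "track_12"],
    ["track_98", "track_45", "track_77"],
    ["track_45", "track_77", "track_12", "track_98"],
]
music_model = Word2Vec(listening_histories, window=3, sg=1, min_count=1, seed=14)
print(music_model.wv.most_similar("track_45"))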