The recommendation engine is at the heart of the business strategy of every e-commerce giant. For example, according to a McKinsey study, 35 percent of Amazon’s e-commerce revenue is generated by its recommendation engine. Every day we see carousels of recommended products on the sites we browse.
In this article we will help you understand what is behind such an engine and build your own. By nature, a recommendation engine operates on a large dataset. We will therefore use a language suited to this kind of dataset: Spark, through its Python API (PySpark).
1. The Dataset
We will use a dataset available on Kaggle: https://www.kaggle.com/retailrocket/ecommerce-dataset
This data illustrates the activity of an e-commerce platform over a period of four and a half months: properties of the available items, item categories, customer activity, and so on.
The dataset is (for obvious reasons) anonymized both in terms of the visitors and the products. It is divided into four parts: events.csv, category_tree.csv, item_properties_part1.csv and item_properties_part2.csv.
Let’s start by reading these files to perform a first analysis.
import pyspark.sql.functions as f

events = spark.read.csv('/events.csv', header=True, inferSchema=True)

# The timestamp column being of type long, we convert it to a date
events = events.withColumn(
    'date',
    f.from_unixtime((events.timestamp.cast('bigint') / 1000)).cast('timestamp')
)

category_tree = spark.read.csv('/category_tree.csv', header=True, inferSchema=True)
item_properties_part1 = spark.read.csv('/item_properties_part1.csv', header=True, inferSchema=True)
item_properties_part2 = spark.read.csv('/item_properties_part2.csv', header=True, inferSchema=True)

# We group the item properties into a single dataframe
item_properties = item_properties_part1.union(item_properties_part2)
It is important to study the dataset at our disposal before we can choose the methodology and type of recommendation engine we want.
A few statistics
We have three types of events: view when the customer clicks the item, addtocart when the item is added to the cart and transaction when the customer buys the item.
In all, we have 2,756,101 events: 2,664,312 views, 69,332 addtocart and 22,457 transactions.
These events were made by 1,407,580 visitors on 235,061 unique items.
The data extend over a period of 4 and a half months (from 2015-05-03 to 2015-09-18).
The item properties contain 20,275,902 unique properties describing 417,053 distinct items (not all of these items appear in events).
The presence of the timestamp column indicates that some properties change value over time (e.g. the price of the item).
category_tree contains a total of 1,669 unique categories.
To obtain a usable dataset, we assume that the chain of events is consistent: for every visitorid/itemid pair, every purchased item was added to the cart at an earlier date, and every item added to the cart had been viewed before.
This check will not be performed in this article, but you are free to do so.
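As a sketch, such a consistency check could look like this for a single visitorid/itemid pair (pure Python for illustration; the event chains below are made up, only the event names follow the dataset):

```python
# Sketch: verify event-chain consistency for one visitorid/itemid pair.
# Events are (timestamp_ms, event) tuples; event names follow the dataset.

def chain_is_consistent(events):
    """Every transaction must be preceded by an addtocart,
    and every addtocart by a view, for this visitor/item pair."""
    first_seen = {}
    for ts, event in sorted(events):
        first_seen.setdefault(event, ts)
    if 'transaction' in first_seen:
        if first_seen.get('addtocart', float('inf')) > first_seen['transaction']:
            return False
    if 'addtocart' in first_seen:
        if first_seen.get('view', float('inf')) > first_seen['addtocart']:
            return False
    return True

# Hypothetical event chains for illustration
ok = [(1, 'view'), (5, 'addtocart'), (9, 'transaction')]
bad = [(5, 'addtocart'), (9, 'transaction')]   # never viewed
print(chain_is_consistent(ok))    # True
print(chain_is_consistent(bad))   # False
```

On the real dataframe the same logic would be applied per visitorid/itemid group, for instance after a groupBy in Spark.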
If we look at the distribution of visitors, we see that almost half of them (1,118,109) viewed only one item.
2. Recommendation engine
The goal of a recommendation engine is to generate sales that would not have happened without its recommendations. Yet it is very difficult to measure the performance of such an engine. If we recommend Volume 2 of Harry Potter to a buyer of Volume 1 and that buyer clicks and purchases Volume 2, does this mean the recommendation engine works? That customer would probably have bought the item on their own, without any suggestion. Any performance metric will count this action as a well-predicted case, but things are not always that clear-cut. We will discuss the choice of performance metrics at the end of this article.
Several methodologies exist; together they must cover all the possibilities when a user connects to the site:
- The visitor is considered a new user, either because we cannot identify them (no cookie, not logged in to an account, etc.) or because they really are new. We will offer them the top 5 purchases (cf. 2.1).
- The visitor is recognized, but their history is not sufficient to derive relevant recommendations. We could then use whatever personal information is available to suggest the most relevant items. We cannot implement this engine in our case (we lack visitor information), but it is important to note the feasibility of such an approach.
- The visitor is identified and has a sufficient history (to be defined) to implement a relevant recommendation engine. Two classic choices are available to us: content-based (cf. 2.2) or user-based (cf. 2.3).
Let’s go back to our data. About 200,000 items cannot be recommended by the user-based method (their page has never been visited), while they can still be recommended by the content-based method.
In practice, we would not implement just one or the other but combine both. We will not explore this combination further in this article.
2.1 “New” User
In the case of a new user, we can offer them the most popular products. Several approaches are possible:
- Recommend the best-selling items over the full period available
- Recommend the most recently sold items (last month for example)
Let’s go straight to practice and start with the first approach.
items_sold = events.where(f.col("event") == "transaction").groupby('itemid').count()

# Number of distinct items sold
items_sold.select('count').count()

# Top 10 best-selling items
items_sold.sort(f.desc("count")).show(10)
The top 10 items sold over the full period are:
Let’s move on to the second approach. Let’s calculate the 10 best selling items in the last month.
# Top 10 best-selling items over the last month
import datetime

last_month = (datetime.datetime(2015, 9, 18) - datetime.timedelta(days=30)).strftime(format='%Y-%m-%d')

# We filter on the date column built earlier (timestamp converted to date)
items_recently_sold = events.where(f.col("date") >= last_month).\
    where(f.col("event") == "transaction").groupby('itemid').count()

# Number of distinct items sold over the last month
items_recently_sold.select('count').count()

items_recently_sold.sort(f.desc("count")).show(10)
We obtain the following items:
The data being anonymized, we cannot qualitatively evaluate these recommendations. We note that the first 5 items proposed by the second approach are among the most purchased items over the full period, and that 5 new items appear among the biggest sales of the last month.
Another approach would be to recommend the items with the highest sales growth.
2.2 Content based
This type of recommendation engine is based on the characteristics of the items (content based). Typically, if you have a movie with Denzel Washington, you’ll probably love another movie with Denzel Washington. To do so, we must look at the properties of each item. For a certain type of product, such as movies or books, it is easier to calculate these similarities (same author, same director, same actors, same genre of books or movies, etc.). For others, such as home appliances, calculations are probably less relevant because the properties are less shareable from one item to another.
In our situation, the anonymization of properties makes it difficult to apply this methodology. However, we can give you the main steps:
- Construct a matrix with the itemids as rows and the set of possible property values as columns. The values in this matrix are 0s and 1s indicating whether or not the item has that property.
- Apply a Jaccard or cosine distance to obtain a distance matrix between all itemid/itemid pairs.
- Select the 5 items most similar to the item viewed by the visitor, i.e. those with the smallest distances.
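The steps above can be sketched on a toy binary item/property matrix (pure Python for illustration; the item IDs and property columns below are invented):

```python
# Sketch: content-based similarity on a toy item/property matrix.
# Rows = items, columns = binary property indicators (values invented).

def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two binary property vectors."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return 1.0 if union == 0 else 1 - inter / union

item_props = {
    101: [1, 1, 0, 0],
    102: [1, 1, 1, 0],
    103: [0, 0, 1, 1],
}

def most_similar(itemid, k=5):
    """Return up to k items closest to `itemid` by Jaccard distance."""
    dists = [(other, jaccard_distance(item_props[itemid], props))
             for other, props in item_props.items() if other != itemid]
    return sorted(dists, key=lambda t: t[1])[:k]

print(most_similar(101))  # item 102 comes first (distance 1/3)
```

At the scale of this dataset, the same computation would be done in a distributed way (e.g. with Spark), but the logic is identical.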
2.3 User based
This type of recommendation engine is based on the users’ tastes (collaborative filtering). Typically, if client A likes items 1, 2, 3 and client B likes items 2, 3, 4, then we will propose item 4 to client A and item 1 to client B. It is important to note that the algorithm only looks at the past, not the context.
Several algorithms predict which items customers will like.
Among these algorithms, matrix factorization is one of the most used techniques.
This family of algorithms became widely known through the Netflix Prize challenge (https://www.netflixprize.com/) thanks to its efficiency.
The main idea is to decompose the user-item rating matrix into two matrices of lower rank: one represents characteristics of the user and the other of the item.
The product of these two matrices makes it possible to find a good approximation of the user-item observation matrix at the beginning.
Alternating Least Squares (ALS)
Let R be the rating matrix such that Rij is the rating given to item j by visitor i.
R will contain empty elements and the goal will be to estimate each visitor’s rating for all items based on the existing ratings.
ALS attempts to approximate the matrix R as the product of two lower-rank matrices, X and Y, such that R ≈ X · Yᵀ.
Generally, these approximations are called “factor” matrices.
The general approach is iterative: at each iteration, one of the factor matrices is held constant while the other is solved for; the newly computed factor matrix is then held constant while solving for the other one.
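The alternation described above can be sketched in a few lines of NumPy (a toy illustration, not Spark’s implementation: the rating matrix, rank and regularization below are invented, and missing ratings are simply treated as zeros):

```python
import numpy as np

# Toy ALS sketch on a tiny rating matrix R (values invented).
# Spark's ALS handles sparsity properly; this only shows the
# alternating least-squares updates of X (visitors) and Y (items).

rng = np.random.default_rng(0)
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])
k, lam = 2, 0.1                      # latent rank and regularization
X = rng.random((R.shape[0], k))      # visitor factor matrix
Y = rng.random((R.shape[1], k))      # item factor matrix

for _ in range(20):
    # Solve for X with Y held constant, then for Y with X held constant
    X = np.linalg.solve(Y.T @ Y + lam * np.eye(k), Y.T @ R.T).T
    Y = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ R).T

approx = X @ Y.T                     # low-rank approximation of R
```

Each update is a regularized least-squares problem with a closed-form solution, which is why fixing one factor matrix makes the other easy to compute.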
The choice of making recommendations at the category level or the item level depends on the machine resources at our disposal. We will apply this methodology to our dataset after recoding a number of columns. We then present a first approach that could be improved as part of an industrialization project.
- Let’s begin by filtering out the visitors whose event history is reduced to the visit of a single item (1,118,109 visitors).
- These visitors are considered new customers and we can remove them from our dataframe.
- In events, we transform the event types into a score reflecting the visitor’s appreciation of the item:
- view = 1
- addtocart = 10
- transaction = 30
- These scores were chosen arbitrarily; you can choose your own.
from pyspark.sql.types import IntegerType
from pyspark.sql import Window

df_nbview = events.where(f.col("event") == "view").\
    groupby("visitorid").agg(f.countDistinct("itemid")).\
    withColumnRenamed("count(DISTINCT itemid)", "nb_views")

df_visitors_to_drop = df_nbview.where(f.col("nb_views") == 1).select("visitorid")

events = events.join(df_visitors_to_drop,
                     [events.visitorid == df_visitors_to_drop.visitorid],
                     how='left_anti')

def convert_event_value(event):
    if event == 'transaction':
        return 30
    if event == 'addtocart':
        return 10
    if event == 'view':
        return 1
    return 0

udf_convert_event = f.udf(convert_event_value, IntegerType())
events = events.withColumn("event", udf_convert_event(f.col("event")))

# For each visitor/item pair, keep only the event with the highest score
w = Window.partitionBy(events["visitorid"], events["itemid"])
events = events.withColumn('maxEvent', f.max(f.col("event")).over(w)). \
    where(f.col("event") == f.col("maxEvent")).\
    drop(f.col("maxEvent"))

events = events.select('timestamp', 'itemid', 'visitorid', 'event')

df_final = events.\
    select(f.col('timestamp'),
           f.col('visitorid').cast('int'),
           f.col('itemid').cast('int'),
           f.col('event').cast('int'))
- Now that our events dataset is ready, we split it into train and test sets. We use an 80%/20% split and take the timestamp into account: the test set contains events subsequent to those of the train set. Additional work is then done on the test set:
- Visitors seen in the test set but not present in the train set will be excluded from the test set; we will offer them the most popular items.
- Items seen in the test set but not present in the train set will be excluded from the test set and handled by the content-based algorithm.
If we do not exclude these visitors and items, we will get NaN values during prediction because they are not known to our model.
df_final = df_final.withColumn(
    "rank",
    f.percent_rank().over(Window.partitionBy().orderBy("timestamp"))
)
train = df_final.where("rank <= .8").drop("rank")
test = df_final.where("rank > .8").drop("rank")

test = test.\
    join(train.select('visitorid').distinct(), 'visitorid', how='inner').\
    join(train.select('itemid').distinct(), 'itemid', how='inner')

train.cache()
test.cache()
- We can now launch the ALS algorithm (Alternating Least squares) available in pyspark.ml.recommendation.
A limitation of Spark MLlib’s ALS implementation is that it requires user and item IDs to be non-negative 32-bit integers. This means that IDs greater than Integer.MAX_VALUE, i.e. 2147483647, cannot be used.
We must therefore check whether this data set conforms to this condition.
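A quick sketch of this check, in plain Python on the maximum IDs (with Spark you would first obtain them, e.g. via f.max on the visitorid and itemid columns):

```python
# Sketch: verify that the largest IDs fit in a signed 32-bit integer,
# as required by Spark MLlib's ALS.
# With Spark: df_final.agg(f.max("visitorid"), f.max("itemid")).first()

INT32_MAX = 2**31 - 1                 # 2147483647, i.e. Integer.MAX_VALUE

def fits_int32(max_id):
    """True if an ID column whose maximum is max_id is usable by ALS."""
    return 0 <= max_id <= INT32_MAX

# Hypothetical maxima for illustration
print(fits_int32(1407579))            # True: well below the limit
print(fits_int32(2147483648))         # False: would overflow
```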
from pyspark.ml.recommendation import ALS

def initModel(inputData):
    als = ALS(maxIter=10, regParam=0.01,
              userCol="visitorid", itemCol="itemid", ratingCol="event",
              nonnegative=True)
    return als.fit(inputData)

def predict(model, toPredict):
    return model.transform(toPredict).withColumn('prediction', f.round('prediction'))

model = initModel(train)
predictions = predict(model, test)
predictions.cache()
We will recode the “prediction” column to compare it with our test set. If prediction:
- < 1, then prediction = 0
- < 10, then prediction = 1
- < 20, then prediction = 10
- otherwise, prediction = 30
from pyspark.sql.types import FloatType

def convert_predict_value(score):
    if score < 1.0:
        return 0.0
    if score < 10.0:
        return 1.0
    if score < 20.0:
        return 10.0
    return 30.0

convert_score = f.udf(convert_predict_value, FloatType())
predictions = predictions.withColumn("score", convert_score(f.col("prediction")))
Let’s move on to the most interesting but certainly the most difficult part, the analysis of the results.
In practice, you would run live A/B testing, letting your users interact with your engine so you can evaluate it. Through the click-through rate (CTR) and the conversion rate (CR) of the recommendations made, you can determine whether your engine’s contribution is substantial.
Unfortunately, we are not in this case and we must find methods to evaluate our engine. Let’s start by describing our results:
- 68.19% of good recommendations.
- 94% good predictions on all “addtocart” and “transaction” events
Can we consider that our engine performs satisfactorily?
At first sight, yes, the rate of good recommendations is rather satisfactory, even if the major imbalance between our different events is problematic.
Other performance measures, such as recall or the RMSE (computed on the predicted values before recoding), could allow us to better understand these results.
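As an illustration, RMSE on the raw (un-recoded) predictions could be computed like this (the scores and predictions below are toy values; with Spark you would use pyspark.ml.evaluation.RegressionEvaluator):

```python
import math

# Sketch: RMSE between true event scores and raw ALS predictions.
def rmse(truth, preds):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth))

truth = [1, 1, 10, 30]                # actual event scores (toy values)
preds = [0.8, 1.3, 7.5, 24.0]         # hypothetical raw ALS predictions
print(round(rmse(truth, preds), 3))
```

Unlike an accuracy rate on recoded classes, the RMSE penalizes large errors more heavily, which helps reveal whether the rare high-score events are predicted well.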
Finally, we think it is important to propose a number of improvements, on both the modeling side and the business side:
- We chose to keep the different types of events. We could have kept only the views and built a recommendation engine on this type of event.
- By keeping all the events, we could have oversampled the minority class (the “transaction” events) or undersampled the majority class (the “view” events), allowing us to better differentiate these event types.
- We built our test and train set under a time constraint. The recommendation engine must still continue to learn over time. The information collected for each user over time should be used to better predict future actions. Here, we treat all the actions in the test set in the same way. We should update the engine for users who have taken new actions to predict next actions. In this way, we will increase our chances of good recommendations.
Finally, it is always difficult to estimate success in the field of performance measurement (or in advertising, for example), because the actions we have tried to provoke do not always have direct consequences.