Skills: K-means Clustering, PCA, Silhouette Score


An unsupervised learning project that clusters items together, with the goal of replacing the old, manually created product categories.

We have two datasets:

  1. item_to_id, which maps each item name to the ID that uniquely identifies that product.
  2. purchase_history, which contains each user’s purchase history, i.e. the products users tend to buy together.

We can use these datasets to come up with insights that may help expand the business.

Business Questions:

Target Audience

  1. Identify the top 10 customers who bought the most items overall.
  2. For each item, identify the customer who bought that product the most.

This will help us identify our most valuable and loyal customers who can then advocate for our business.

Cluster Items

  1. Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal of this is to replace the old/manually created categories with these new ones. Each item can belong to just one cluster.
import re
from collections import Counter
import itertools

import numpy as np
import pandas as pd

from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

Load and Inspect Data

There are 48 unique items in the inventory.

item = pd.read_csv("item_to_id.csv")
item.head()
item.sort_values(by = 'Item_id')
Item_name Item_id
18 sugar 1
35 lettuce 2
47 pet items 3
46 baby items 4
20 waffles 5
23 poultry 6
41 sandwich bags 7
15 butter 8
3 soda 9
32 carrots 10
16 cereals 11
42 shampoo 12
7 bagels 13
12 eggs 14
40 aluminum foil 15
13 milk 16
24 beef 17
36 laundry detergent 18
45 shaving cream 19
29 grapefruit 20
11 cheeses 21
21 frozen vegetables 22
1 tea 23
38 paper towels 24
28 cherries 25
9 spaghetti sauce 26
37 dishwashing 27
8 canned vegetables 28
44 hand soap 29
17 flour 30
19 pasta 31
30 apples 32
39 toilet paper 33
6 tortillas 34
43 soap 35
22 ice cream 36
5 dinner rolls 37
2 juice 38
4 sandwich loaves 39
27 berries 40
10 ketchup 41
34 cucumbers 42
0 coffee 43
31 broccoli 44
33 cauliflower 45
26 bananas 46
25 pork 47
14 yogurt 48
purchase = pd.read_csv('purchase_history.csv')
purchase.head()
user_id id
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
4 224107 31,18,5,13,1,21,48,16,26,2,44,32,20,37,42,35,4...

Data Summary

There are 39,474 purchases made by 24,885 unique customers. The data has no missing values.

item.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Item_name  48 non-null     object
 1   Item_id    48 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 896.0+ bytes
purchase.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39474 entries, 0 to 39473
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  39474 non-null  int64 
 1   id       39474 non-null  object
dtypes: int64(1), object(1)
memory usage: 616.9+ KB
len(purchase.user_id.unique())
24885
purchase.isnull().sum()
user_id    0
id         0
dtype: int64

Data Pre-processing

In order to get any meaningful insights from the dataset, we will have to do some pre-processing. We will build a frequency table of items purchased by each user: if a user buys a specific item 2 times, the table will have (user, item) = 2. An alternative, vectorized construction is sketched after the output below.

def count_items(df):
    """Count how many times each of the 48 items appears in one user's purchases."""
    ids = df['id'].str.split(',').sum()        # flatten all item ids purchased by this user
    id_list = [0 for i in range(1, 49)]
    for i in ids:
        id_list[int(i) - 1] += 1
    return pd.Series(id_list, index=list(range(1, 49)))

user_item_cnt = purchase.groupby('user_id').apply(count_items)
user_item_cnt.head()
1 2 3 4 5 6 7 8 9 10 ... 39 40 41 42 43 44 45 46 47 48
user_id
47 0 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 1 0 0
68 0 0 0 0 0 1 0 0 0 1 ... 1 0 0 1 0 0 0 0 0 0
113 0 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
123 0 0 0 1 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
223 1 1 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0

5 rows × 48 columns
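As an aside, the same user-item count table could also be built in a vectorized way with explode and pd.crosstab. This is only a sketch, assuming the same purchase dataframe as above:

# one row per (user, item) purchase
exploded = purchase.assign(id=purchase['id'].str.split(',')).explode('id')
exploded['id'] = exploded['id'].astype(int)
# cross-tabulate counts and make sure all 48 item ids appear as columns
user_item_cnt_alt = pd.crosstab(exploded['user_id'], exploded['id']).reindex(columns=range(1, 49), fill_value=0)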

Top 10 customers who bought the most items

user_count = user_item_cnt.sum(axis = 1).sort_values(ascending = False).reset_index().rename(columns = {0:'item_cnt'})
user_count.head(10)
user_id item_cnt
0 269335 72
1 367872 70
2 599172 64
3 397623 64
4 377284 63
5 1485538 62
6 917199 62
7 718218 60
8 653800 60
9 828721 58

For each item, the customer who bought that product the most

item_max = user_item_cnt.apply(lambda s: pd.Series([s.idxmax(), s.max()], index=['max_user', 'max_cnt']))
item_max = item_max.transpose()
item_max.index.name = 'Item_id'
item_max
max_user max_cnt
Item_id
1 512 4
2 512 5
3 2552 4
4 92 3
5 3605 3
6 5555 4
7 2926 3
8 2493 3
9 4445 4
10 10238 4
11 6111 3
12 9261 3
13 10799 4
14 2855 3
15 2356 3
16 1175 3
17 6085 4
18 15193 5
19 512 3
20 14623 4
21 14597 4
22 19892 4
23 15251 5
24 3154 3
25 1094 4
26 16026 4
27 15840 4
28 3400 4
29 6565 4
30 375 2
31 4803 3
32 1771 4
33 21668 3
34 5085 4
35 7526 3
36 4445 4
37 749 4
38 4231 4
39 9918 5
40 634 4
41 2192 4
42 1303 4
43 16510 4
44 512 4
45 19857 5
46 20214 4
47 6419 4
48 5575 3

We were able to identify, for each item, the user who bought it the most. Identifying a target audience gives the business a clear focus on whom it serves and why those consumers need its goods or services. It also keeps the target audience at a manageable size.

Clustering of Items

Item-Item Similarity Matrix

To cluster items, we first create an item-item cosine similarity matrix.
Cosine similarity is a metric that measures how similar two vectors are irrespective of their magnitude. Mathematically, it is the cosine of the angle between the two vectors in a multi-dimensional space. Here each item is represented as a vector of purchase counts, with one dimension per user; cosine similarity captures the orientation (the angle) of these vectors, not their magnitude. If the magnitude matters, the Euclidean distance should be used instead.

Cosine similarity is advantageous because two similar items can be far apart by Euclidean distance purely because of scale (e.g. an item purchased 10 times by one user and once by another) yet still have a small angle between them. The smaller the angle, the higher the similarity.
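For reference, for two item count vectors A and B (rows of the transposed user-item matrix, one entry per user), the cosine similarity computed by scikit-learn's cosine_similarity below is

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{u} A_u B_u}{\sqrt{\sum_{u} A_u^{2}} \, \sqrt{\sum_{u} B_u^{2}}}$$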

user_item_cnt.head()
1 2 3 4 5 6 7 8 9 10 ... 39 40 41 42 43 44 45 46 47 48
user_id
47 0 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 1 0 0
68 0 0 0 0 0 1 0 0 0 1 ... 1 0 0 1 0 0 0 0 0 0
113 0 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
123 0 0 0 1 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
223 1 1 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0

5 rows × 48 columns

user_item_cnt_transpose = user_item_cnt.T
similarity = cosine_similarity(user_item_cnt_transpose,user_item_cnt_transpose)
similarity_df = pd.DataFrame(similarity, index=user_item_cnt.columns, columns=user_item_cnt.columns)
similarity_df[:5]
1 2 3 4 5 6 7 8 9 10 ... 39 40 41 42 43 44 45 46 47 48
1 1.000000 0.506895 0.420145 0.296986 0.271132 0.388250 0.271743 0.335303 0.403690 0.390641 ... 0.388034 0.390286 0.358599 0.393056 0.395696 0.396766 0.390253 0.394998 0.392164 0.328221
2 0.506895 1.000000 0.466874 0.322744 0.285125 0.468199 0.312200 0.390521 0.464872 0.527894 ... 0.462968 0.462548 0.409401 0.529100 0.464579 0.527325 0.521058 0.462407 0.460257 0.380077
3 0.420145 0.466874 1.000000 0.277325 0.224537 0.358326 0.238133 0.301868 0.362091 0.352597 ... 0.351093 0.368199 0.309078 0.357794 0.351209 0.362522 0.361922 0.354933 0.351832 0.297972
4 0.296986 0.322744 0.277325 1.000000 0.162860 0.247414 0.166012 0.216166 0.252662 0.258313 ... 0.245623 0.261272 0.210767 0.244946 0.253282 0.253119 0.250190 0.253835 0.260541 0.218717
5 0.271132 0.285125 0.224537 0.162860 1.000000 0.233618 0.164699 0.203245 0.239445 0.234254 ... 0.235543 0.238557 0.211006 0.238466 0.235064 0.241835 0.238087 0.238247 0.232387 0.188269

5 rows × 48 columns

Feature Extraction Using PCA

Now that we have the similarity matrix, we can cluster the items based on their cosine similarity scores. Before that, we use PCA to reduce the dimensionality of the dataset.
PCA will help us discover which dimensions of the dataset best maximize the variance of the features involved. In addition to finding these dimensions, PCA also reports the explained variance ratio of each dimension, i.e. how much of the variance in the data is explained by that dimension alone.

pca_model = PCA()
items_rotate = pca_model.fit_transform(similarity_df)

items_rotate = pd.DataFrame(items_rotate,index=user_item_cnt.columns,
                        columns=["pc{}".format(idx+1) for idx in range(item.shape[0])])
items_rotate.head()
pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 ... pc39 pc40 pc41 pc42 pc43 pc44 pc45 pc46 pc47 pc48
1 0.355639 -0.025103 -0.051129 -0.039214 -0.049529 0.003300 0.019830 -0.011458 0.084798 -0.042284 ... -0.019760 -0.012837 -0.006332 -0.022289 -0.002654 -0.001332 0.011263 -0.026911 -0.129284 5.732095e-17
2 0.793894 -0.041103 0.013244 -0.009035 -0.141627 -0.036540 -0.114816 -0.087070 0.000959 -0.035030 ... -0.039028 -0.015895 -0.027280 0.017406 0.013680 0.004437 -0.030161 0.006713 0.379136 5.732095e-17
3 0.135668 -0.029584 -0.028246 -0.012556 -0.013317 0.004220 -0.020503 -0.050246 0.177599 -0.080864 ... 0.020455 -0.007168 0.040652 0.004387 0.029288 0.000224 0.009908 0.001288 -0.035429 5.732095e-17
4 -0.532034 0.020259 -0.022628 -0.037650 0.022756 -0.046219 0.042124 -0.132828 0.651482 -0.274771 ... 0.000133 0.002444 0.005894 -0.003617 -0.005383 0.001555 -0.006114 0.007527 0.020217 5.732095e-17
5 -0.678770 -0.093874 -0.347623 -0.232866 -0.090366 -0.057124 -0.018074 0.038866 -0.048059 -0.014378 ... -0.008466 0.011890 0.002598 0.007605 0.009321 -0.003278 0.006540 0.000677 0.023603 5.732095e-17

5 rows × 48 columns

Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. We can even visualize the components and variance explained by each of them.
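Concretely, if $\lambda_j$ denotes the variance explained by the $j$-th principal component (the eigenvalues of the covariance matrix, in decreasing order) and $d$ is the total number of components, the cumulative explained variance ratio of the first $K$ components is

$$\text{EVR}(K) = \frac{\sum_{j=1}^{K} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$

which is what pca_model.explained_variance_ratio_.cumsum() computes below.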

fig = plt.figure(figsize=(10, 10))
plt.bar(range(1,len(pca_model.explained_variance_)+1),pca_model.explained_variance_)
plt.ylabel('Explained variance')
plt.xlabel('Components')
plt.title("Explained variance plot")
Text(0.5, 1.0, 'Explained variance plot')

[Figure: Explained variance plot]

plt.plot(pca_model.explained_variance_ratio_)
plt.xlabel('number of components')
plt.ylabel('explained variance ratio')
plt.title("Explained Variance Ratio Plot")
plt.show()

[Figure: Explained Variance Ratio Plot]

# show the total variance which can be explained by first K principle components
explained_variance_by_k = pca_model.explained_variance_ratio_.cumsum()

plt.plot(range(1,len(explained_variance_by_k)+1),explained_variance_by_k,marker="*")
plt.title("PCA Cumulative Variance Plot")
Text(0.5, 1.0, 'PCA Cumulative Variance Plot')

[Figure: PCA Cumulative Variance Plot]

We can see that the first 35 principal components explain about 90% of the variance in the dataset.

sum(pca_model.explained_variance_ratio_[:35])
0.9036104968698064
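To pick this number programmatically rather than reading it off the plot, a small sketch using the cumulative ratios computed above (explained_variance_by_k) could be:

# smallest number of components whose cumulative explained variance ratio reaches 90%
n_components_90 = int(np.argmax(explained_variance_by_k >= 0.90)) + 1
print(n_components_90)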

K-means Clustering

We are going to use K-means clustering to identify the clusters for our items.
Advantages of k-means

  1. Relatively simple to implement.
  2. Scales to large data sets.
  3. Guarantees convergence.
  4. Easily adapts to new examples.
  5. Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Random Initialization: In some cases, if the initial cluster centers are chosen poorly, k-means can converge to arbitrarily bad clusterings. This is where k-means++ helps. It specifies a procedure to initialize the cluster centers before running the standard k-means algorithm: the first centroid is chosen at random, and each subsequent centroid is sampled from the remaining points with probability proportional to its squared distance from the nearest centroid chosen so far, so points far from the existing centroids are more likely to be selected. A minimal sketch of this seeding procedure is shown below.
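The following is a minimal, illustrative sketch of k-means++ seeding, assuming a 2-D NumPy array X of item features; it is not scikit-learn's exact implementation, and passing init='k-means++' to KMeans (as done below) handles all of this automatically:

def kmeans_plus_plus_init(X, k, seed=42):
    """Illustrative k-means++ seeding (sketch, not sklearn's implementation)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # first centroid: a uniformly random point
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # squared distance of each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)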

Creating Clusters

When the number of clusters is not known a priori, as in this problem, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient and the model's inertia.
The silhouette coefficient of a data point measures how similar it is to its assigned cluster, on a scale from -1 (dissimilar) to 1 (similar). The mean silhouette coefficient over all points gives a simple score for a given clustering.
Inertia is the sum of squared distances between each instance and its closest centroid; lower inertia means tighter clusters, though it always decreases as the number of clusters grows.
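For reference, with $a(i)$ the mean distance from point $i$ to the other points in its own cluster and $b(i)$ the mean distance from $i$ to the points of the nearest other cluster, the silhouette coefficient is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

and the inertia reported by KMeans is the within-cluster sum of squares,

$$\text{inertia} = \sum_{i=1}^{n} \min_{j} \lVert x_i - \mu_j \rVert^2$$

where the $\mu_j$ are the cluster centroids.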

# We will use the first 25 principal components
items_filter = items_rotate.values[:, :25]
clusters = range(5, 20)
inertias = []
silhouettes = []

for n_clusters in clusters:
    model = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    label = model.fit_predict(items_filter)

    inertias.append(model.inertia_)
    silhouettes.append(silhouette_score(items_filter, label))
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
ax[0].plot(clusters, inertias, 'o-', label='Sum of Squared Distances')
ax[0].grid(True)
ax[0].legend(fontsize=12)
ax[1].plot(clusters, silhouettes, 'o-', label='Silhouette Coefficient')
ax[1].grid(True)
ax[1].legend(fontsize=12)
plt.tight_layout()
plt.show()

[Figure: inertia (left) and silhouette coefficient (right) vs. number of clusters]

The silhouette coefficient plateaus after 14 clusters, so 14 clusters looks optimal.
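As a quick programmatic cross-check, a sketch using the clusters and silhouettes lists computed above (note that the k with the highest mean silhouette score may not coincide exactly with the plateau read off the plot):

best_k = list(clusters)[int(np.argmax(silhouettes))]
print("k with the highest mean silhouette score:", best_k)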

Cluster Visualization

We can see that 14 clusters are created and the items are cleanly separated, using just 25 of the 48 PCA features.
The items group into intuitive categories, for example bakery items (cluster 4) such as bagels, tortillas, dinner rolls, and sandwich loaves, and canned goods (cluster 0) such as spaghetti sauce, canned vegetables, and ketchup. The remaining clusters are similarly coherent.

def show_clusters(items_rotated, labels):
    """
    Plot and print the clustering result.
    """
    fig = plt.figure(figsize=(15, 15))
    colors = itertools.cycle(["blue", "green", "red", "cyan", "yellow", "purple", "orange", "olive",
                              "seagreen", "magenta", "brown", "lightpink", "indigo", "aqua", "teal"])

    # scatter each cluster in its own color and print its members
    grps = items_rotated.groupby(labels)
    for label, grp in grps:
        plt.scatter(grp.pc1, grp.pc2, c=next(colors), label=label)

        print("*********** Label [{}] ***********".format(label))
        print(grp.index)
        for j in grp.index:
            names = item[item['Item_id'] == j]["Item_name"].to_string(index=False)
            print(j, names)

    # annotate each point with its item name
    for itemid in items_rotated.index:
        x = items_rotated.loc[itemid, "pc1"]
        y = items_rotated.loc[itemid, "pc2"]
        name = item[item['Item_id'] == itemid]["Item_name"].to_string(index=False)
        plt.text(x, y, name)
def cluster(n_clusters, n_components=48):
    """
    n_components=K means use the first K principal components in the clustering.
    n_clusters: the number of clusters we want to create.
    """
    print("first {} PCs explain {:.1f}% of the variance".format(
        n_components, 100 * sum(pca_model.explained_variance_ratio_[:n_components])))

    kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    kmeans_model.fit(items_rotate.values[:, :n_components])

    # display results
    show_clusters(items_rotate, kmeans_model.labels_)

cluster(n_clusters=14, n_components=25)
first 25 PCs explain 79.8% of the variance
*********** Label [0] ***********
Int64Index([26, 28, 41], dtype='int64')
26  spaghetti sauce
28  canned vegetables
41  ketchup
*********** Label [1] ***********
Int64Index([30], dtype='int64')
30  flour
*********** Label [2] ***********
Int64Index([1, 3, 20, 25, 32, 40, 46], dtype='int64')
1  sugar
3  pet items
20  grapefruit
25  cherries
32  apples
40  berries
46  bananas
*********** Label [3] ***********
Int64Index([12, 19, 29, 35], dtype='int64')
12  shampoo
19  shaving cream
29  hand soap
35  soap
*********** Label [4] ***********
Int64Index([13, 34, 37, 39], dtype='int64')
13  bagels
34  tortillas
37  dinner rolls
39  sandwich loaves
*********** Label [5] ***********
Int64Index([7, 15, 24, 33], dtype='int64')
7  sandwich bags
15  aluminum foil
24  paper towels
33  toilet paper
*********** Label [6] ***********
Int64Index([2, 10, 42, 44, 45], dtype='int64')
2  lettuce
10  carrots
42  cucumbers
44  broccoli
45  cauliflower
*********** Label [7] ***********
Int64Index([5, 11, 31], dtype='int64')
5  waffles
11  cereals
31  pasta
*********** Label [8] ***********
Int64Index([8, 14, 16, 21, 48], dtype='int64')
8  butter
14  eggs
16  milk
21  cheeses
48  yogurt
*********** Label [9] ***********
Int64Index([9, 23, 38, 43], dtype='int64')
9  soda
23  tea
38  juice
43  coffee
*********** Label [10] ***********
Int64Index([22, 36], dtype='int64')
22  frozen vegetables
36  ice cream
*********** Label [11] ***********
Int64Index([18, 27], dtype='int64')
18  laundry detergent
27  dishwashing 
*********** Label [12] ***********
Int64Index([4], dtype='int64')
4  baby items
*********** Label [13] ***********
Int64Index([6, 17, 47], dtype='int64')
6  poultry
17  beef
47  pork

[Figure: item clusters plotted on the first two principal components]
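Finally, since the stated goal is to replace the old, manually created categories, one could persist the item-to-cluster mapping. A minimal sketch (refitting KMeans with the same settings, because the cluster function above does not return the fitted model):

# refit with the chosen settings to obtain a persistent item -> cluster mapping
kmeans_final = KMeans(n_clusters=14, init='k-means++', random_state=42)
final_labels = kmeans_final.fit_predict(items_rotate.values[:, :25])

# attach the new cluster label to each item; item ids are the index of items_rotate
item_clusters = (item.set_index('Item_id')
                     .assign(cluster=pd.Series(final_labels, index=items_rotate.index))
                     .sort_values('cluster'))
item_clusters.head()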