Skills: K-means Clustering, PCA, Silhouette Score


An unsupervised learning project that clusters items together, with the goal of replacing the old, manually created product categories.

We have two datasets:

  1. item_to_id, which maps each item name to the ID that uniquely identifies that product.
  2. purchase_history, which contains each user’s purchase history, i.e. the products users tend to buy together.

We can use these datasets to come up with insights that may help expand the business.

Business Questions:

Target Audience

  1. Identify the top 10 customers who bought the most items overall.
  2. For each item, identify the customer who bought that product the most.

This will help us identify our most valuable and loyal customers who can then advocate for our business.

Cluster Items

  1. Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal of this is to replace the old/manually created categories with these new ones. Each item can belong to just one cluster.
import re
from collections import Counter
import itertools

import numpy as np
import pandas as pd

from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

Load and Inspect Data

There are 48 unique items in the inventory.

item = pd.read_csv("item_to_id.csv")
item.head()
item.sort_values(by = 'Item_id')
Item_name Item_id
18 sugar 1
35 lettuce 2
47 pet items 3
46 baby items 4
20 waffles 5
23 poultry 6
41 sandwich bags 7
15 butter 8
3 soda 9
32 carrots 10
16 cereals 11
42 shampoo 12
7 bagels 13
12 eggs 14
40 aluminum foil 15
13 milk 16
24 beef 17
36 laundry detergent 18
45 shaving cream 19
29 grapefruit 20
11 cheeses 21
21 frozen vegetables 22
1 tea 23
38 paper towels 24
28 cherries 25
9 spaghetti sauce 26
37 dishwashing 27
8 canned vegetables 28
44 hand soap 29
17 flour 30
19 pasta 31
30 apples 32
39 toilet paper 33
6 tortillas 34
43 soap 35
22 ice cream 36
5 dinner rolls 37
2 juice 38
4 sandwich loaves 39
27 berries 40
10 ketchup 41
34 cucumbers 42
0 coffee 43
31 broccoli 44
33 cauliflower 45
26 bananas 46
25 pork 47
14 yogurt 48
purchase = pd.read_csv('purchase_history.csv')
purchase.head()
user_id id
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
4 224107 31,18,5,13,1,21,48,16,26,2,44,32,20,37,42,35,4...

Data Summary

There are 39,474 purchases made by 24,885 unique customers. The data has no missing values.

item.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Item_name  48 non-null     object
 1   Item_id    48 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 896.0+ bytes
purchase.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39474 entries, 0 to 39473
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  39474 non-null  int64 
 1   id       39474 non-null  object
dtypes: int64(1), object(1)
memory usage: 616.9+ KB
len(purchase.user_id.unique())
24885
purchase.isnull().sum()
user_id    0
id         0
dtype: int64

Data Pre-processing

In order to get any meaningful insights from the dataset, we will have to do some pre-processing. We will build a frequency table of items purchased by each user: if a user buys a specific item 2 times, the table will have (user, item) = 2. An alternative, vectorized construction is sketched after the output below.

def count_items(df):
    """Count how many times each of the 48 items appears in one user's purchases."""
    ids = df['id'].str.split(',').sum()        # flatten all item ids purchased by this user
    id_list = [0 for i in range(1, 49)]
    for i in ids:
        id_list[int(i) - 1] += 1
    return pd.Series(id_list, index=list(range(1, 49)))

user_item_cnt = purchase.groupby('user_id').apply(count_items)
user_item_cnt.head()
1 2 3 4 5 6 7 8 9 10 ... 39 40 41 42 43 44 45 46 47 48
user_id
47 0 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 1 0 0
68 0 0 0 0 0 1 0 0 0 1 ... 1 0 0 1 0 0 0 0 0 0
113 0 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
123 0 0 0 1 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
223 1 1 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0

5 rows × 48 columns
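As an aside, the same user-item count table could also be built in a vectorized way with explode and pd.crosstab. This is only a sketch, assuming the same purchase dataframe as above:

# one row per (user, item) purchase
exploded = purchase.assign(id=purchase['id'].str.split(',')).explode('id')
exploded['id'] = exploded['id'].astype(int)
# cross-tabulate counts and make sure all 48 item ids appear as columns
user_item_cnt_alt = pd.crosstab(exploded['user_id'], exploded['id']).reindex(columns=range(1, 49), fill_value=0)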

Top 10 customers who bought the most items

user_count = user_item_cnt.sum(axis = 1).sort_values(ascending = False).reset_index().rename(columns = {0:'item_cnt'})
user_count.head(10)
user_id item_cnt
0 269335 72
1 367872 70
2 599172 64
3 397623 64
4 377284 63
5 1485538 62
6 917199 62
7 718218 60
8 653800 60
9 828721 58

For each item, the customer who bought that product the most

item_max = user_item_cnt.apply(lambda s: pd.Series([s.idxmax(), s.max()], index=['max_user', 'max_cnt']))
item_max = item_max.transpose()
item_max.index.name = 'Item_id'
item_max
max_user max_cnt
Item_id
1 512 4
2 512 5
3 2552 4
4 92 3
5 3605 3
6 5555 4
7 2926 3
8 2493 3
9 4445 4
10 10238 4
11 6111 3
12 9261 3
13 10799 4
14 2855 3
15 2356 3
16 1175 3
17 6085 4
18 15193 5
19 512 3
20 14623 4
21 14597 4
22 19892 4
23 15251 5
24 3154 3
25 1094 4
26 16026 4
27 15840 4
28 3400 4
29 6565 4
30 375 2
31 4803 3
32 1771 4
33 21668 3
34 5085 4
35 7526 3
36 4445 4
37 749 4
38 4231 4
39 9918 5
40 634 4
41 2192 4
42 1303 4
43 16510 4
44 512 4
45 19857 5
46 20214 4
47 6419 4
48 5575 3

We were able to identify, for each item, the user who bought it the most. Identifying a target audience gives the business a clear focus on whom it serves and why those consumers need its goods or services. It also keeps the target audience at a manageable size.

Clustering of Items

Item-Item Similarity Matrix

To cluster items, we first create an item-item cosine similarity matrix.
Cosine similarity is a metric that measures how similar two vectors are irrespective of their magnitude. Mathematically, it is the cosine of the angle between the two vectors in a multi-dimensional space. Here each item is represented as a vector of purchase counts, with one dimension per user; cosine similarity captures the orientation (the angle) of these vectors, not their magnitude. If the magnitude matters, the Euclidean distance should be used instead.

Cosine similarity is advantageous because two similar items can be far apart by Euclidean distance purely because of scale (e.g. an item purchased 10 times by one user and once by another) yet still have a small angle between them. The smaller the angle, the higher the similarity.
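For reference, for two item count vectors A and B (rows of the transposed user-item matrix, one entry per user), the cosine similarity computed by scikit-learn's cosine_similarity below is

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{u} A_u B_u}{\sqrt{\sum_{u} A_u^{2}} \, \sqrt{\sum_{u} B_u^{2}}}$$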

user_item_cnt.head()
1 2 3 4 5 6 7 8 9 10 ... 39 40 41 42 43 44 45 46 47 48
user_id
47 0 1 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 1 0 0
68 0 0 0 0 0 1 0 0 0 1 ... 1 0 0 1 0 0 0 0 0 0
113 0 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
123 0 0 0 1 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
223 1 1 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0

5 rows × 48 columns

user_item_cnt_transpose = user_item_cnt.T
similarity = cosine_similarity(user_item_cnt_transpose,user_item_cnt_transpose)
similarity_df = pd.DataFrame(similarity, index=user_item_cnt.columns, columns=user_item_cnt.columns)
similarity_df[:5]
1 2 3 4 5 6 7 8 9 10 ... 39 40 41 42 43 44 45 46 47 48
1 1.000000 0.506895 0.420145 0.296986 0.271132 0.388250 0.271743 0.335303 0.403690 0.390641 ... 0.388034 0.390286 0.358599 0.393056 0.395696 0.396766 0.390253 0.394998 0.392164 0.328221
2 0.506895 1.000000 0.466874 0.322744 0.285125 0.468199 0.312200 0.390521 0.464872 0.527894 ... 0.462968 0.462548 0.409401 0.529100 0.464579 0.527325 0.521058 0.462407 0.460257 0.380077
3 0.420145 0.466874 1.000000 0.277325 0.224537 0.358326 0.238133 0.301868 0.362091 0.352597 ... 0.351093 0.368199 0.309078 0.357794 0.351209 0.362522 0.361922 0.354933 0.351832 0.297972
4 0.296986 0.322744 0.277325 1.000000 0.162860 0.247414 0.166012 0.216166 0.252662 0.258313 ... 0.245623 0.261272 0.210767 0.244946 0.253282 0.253119 0.250190 0.253835 0.260541 0.218717
5 0.271132 0.285125 0.224537 0.162860 1.000000 0.233618 0.164699 0.203245 0.239445 0.234254 ... 0.235543 0.238557 0.211006 0.238466 0.235064 0.241835 0.238087 0.238247 0.232387 0.188269

5 rows × 48 columns

Feature Extraction Using PCA

Now that we have the similarity matrix, we can cluster the items based on their cosine similarity scores. Before that, we use PCA to reduce the dimensionality of the dataset.
PCA will help us discover which dimensions of the dataset best maximize the variance of the features involved. In addition to finding these dimensions, PCA also reports the explained variance ratio of each dimension, i.e. how much of the variance in the data is explained by that dimension alone.

pca_model = PCA()
items_rotate = pca_model.fit_transform(similarity_df)

items_rotate = pd.DataFrame(items_rotate,index=user_item_cnt.columns,
                        columns=["pc{}".format(idx+1) for idx in range(item.shape[0])])
items_rotate.head()
pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 ... pc39 pc40 pc41 pc42 pc43 pc44 pc45 pc46 pc47 pc48
1 0.355639 -0.025103 -0.051129 -0.039214 -0.049529 0.003300 0.019830 -0.011458 0.084798 -0.042284 ... -0.019760 -0.012837 -0.006332 -0.022289 -0.002654 -0.001332 0.011263 -0.026911 -0.129284 5.732095e-17
2 0.793894 -0.041103 0.013244 -0.009035 -0.141627 -0.036540 -0.114816 -0.087070 0.000959 -0.035030 ... -0.039028 -0.015895 -0.027280 0.017406 0.013680 0.004437 -0.030161 0.006713 0.379136 5.732095e-17
3 0.135668 -0.029584 -0.028246 -0.012556 -0.013317 0.004220 -0.020503 -0.050246 0.177599 -0.080864 ... 0.020455 -0.007168 0.040652 0.004387 0.029288 0.000224 0.009908 0.001288 -0.035429 5.732095e-17
4 -0.532034 0.020259 -0.022628 -0.037650 0.022756 -0.046219 0.042124 -0.132828 0.651482 -0.274771 ... 0.000133 0.002444 0.005894 -0.003617 -0.005383 0.001555 -0.006114 0.007527 0.020217 5.732095e-17
5 -0.678770 -0.093874 -0.347623 -0.232866 -0.090366 -0.057124 -0.018074 0.038866 -0.048059 -0.014378 ... -0.008466 0.011890 0.002598 0.007605 0.009321 -0.003278 0.006540 0.000677 0.023603 5.732095e-17

5 rows × 48 columns

Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. We can even visualize the components and variance explained by each of them.
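Concretely, if $\lambda_j$ denotes the variance explained by the $j$-th principal component (the eigenvalues of the covariance matrix, in decreasing order) and $d$ is the total number of components, the cumulative explained variance ratio of the first $K$ components is

$$\text{EVR}(K) = \frac{\sum_{j=1}^{K} \lambda_j}{\sum_{j=1}^{d} \lambda_j}$$

which is what pca_model.explained_variance_ratio_.cumsum() computes below.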

fig = plt.figure(figsize=(10, 10))
plt.bar(range(1,len(pca_model.explained_variance_)+1),pca_model.explained_variance_)
plt.ylabel('Explained variance')
plt.xlabel('Components')
plt.title("Explained variance plot")
Text(0.5, 1.0, 'Explained variance plot')

[Figure: Explained variance plot]

plt.plot(pca_model.explained_variance_ratio_)
plt.xlabel('number of components')
plt.ylabel('explained variance ratio')
plt.title("Explained Variance Ratio Plot")
plt.show()

[Figure: Explained Variance Ratio Plot]

# show the total variance which can be explained by first K principle components
explained_variance_by_k = pca_model.explained_variance_ratio_.cumsum()

plt.plot(range(1,len(explained_variance_by_k)+1),explained_variance_by_k,marker="*")
plt.title("PCA Cumulative Variance Plot")
Text(0.5, 1.0, 'PCA Cumulative Variance Plot')

[Figure: PCA Cumulative Variance Plot]

We can see that the first 35 principal components explain about 90% of the variance in the dataset.

sum(pca_model.explained_variance_ratio_[:35])
0.9036104968698064
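To pick this number programmatically rather than reading it off the plot, a small sketch using the cumulative ratios computed above (explained_variance_by_k) could be:

# smallest number of components whose cumulative explained variance ratio reaches 90%
n_components_90 = int(np.argmax(explained_variance_by_k >= 0.90)) + 1
print(n_components_90)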

K-means Clustering

We are going to use K-means clustering to identify the clusters for our items.
Advantages of k-means

  1. Relatively simple to implement.
  2. Scales to large data sets.
  3. Guarantees convergence.
  4. Easily adapts to new examples.
  5. Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Random Initialization: In some cases, if the initial cluster centers are chosen poorly, k-means can converge to arbitrarily bad clusterings. This is where k-means++ helps. It specifies a procedure to initialize the cluster centers before running the standard k-means algorithm: the first centroid is chosen at random, and each subsequent centroid is sampled from the remaining points with probability proportional to its squared distance from the nearest centroid chosen so far, so points far from the existing centroids are more likely to be selected. A minimal sketch of this seeding procedure is shown below.
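The following is a minimal, illustrative sketch of k-means++ seeding, assuming a 2-D NumPy array X of item features; it is not scikit-learn's exact implementation, and passing init='k-means++' to KMeans (as done below) handles all of this automatically:

def kmeans_plus_plus_init(X, k, seed=42):
    """Illustrative k-means++ seeding (sketch, not sklearn's implementation)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # first centroid: a uniformly random point
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # squared distance of each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)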

Creating Clusters

When the number of clusters is not known a priori, as in this problem, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient and the model's inertia.
The silhouette coefficient of a data point measures how similar it is to its assigned cluster, on a scale from -1 (dissimilar) to 1 (similar). The mean silhouette coefficient over all points gives a simple score for a given clustering.
Inertia is the sum of squared distances between each instance and its closest centroid; lower inertia means tighter clusters, though it always decreases as the number of clusters grows.
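For reference, with $a(i)$ the mean distance from point $i$ to the other points in its own cluster and $b(i)$ the mean distance from $i$ to the points of the nearest other cluster, the silhouette coefficient is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

and the inertia reported by KMeans is the within-cluster sum of squares,

$$\text{inertia} = \sum_{i=1}^{n} \min_{j} \lVert x_i - \mu_j \rVert^2$$

where the $\mu_j$ are the cluster centroids.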

# We will use the first 25 principal components
items_filter = items_rotate.values[:, :25]
clusters = range(5, 20)
inertias = []
silhouettes = []

for n_clusters in clusters:
    model = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    label = model.fit_predict(items_filter)

    inertias.append(model.inertia_)
    silhouettes.append(silhouette_score(items_filter, label))
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
ax[0].plot(clusters, inertias, 'o-', label='Sum of Squared Distances')
ax[0].grid(True)
ax[0].legend(fontsize=12)
ax[1].plot(clusters, silhouettes, 'o-', label='Silhouette Coefficient')
ax[1].grid(True)
ax[1].legend(fontsize=12)
plt.tight_layout()
plt.show()

[Figure: inertia (left) and silhouette coefficient (right) vs. number of clusters]

The silhouette coefficient plateaus after 14 clusters, so 14 clusters looks optimal.
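As a quick programmatic cross-check, a sketch using the clusters and silhouettes lists computed above (note that the k with the highest mean silhouette score may not coincide exactly with the plateau read off the plot):

best_k = list(clusters)[int(np.argmax(silhouettes))]
print("k with the highest mean silhouette score:", best_k)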

Cluster Visualization

We can see that 14 clusters are created and the items are cleanly separated, using just 25 of the 48 PCA features.
The items group into intuitive categories, for example bakery items (cluster 4) such as bagels, tortillas, dinner rolls, and sandwich loaves, and canned goods (cluster 0) such as spaghetti sauce, canned vegetables, and ketchup. The remaining clusters are similarly coherent.

def show_clusters(items_rotated, labels):
    """
    Plot and print the clustering result.
    """
    fig = plt.figure(figsize=(15, 15))
    colors = itertools.cycle(["blue", "green", "red", "cyan", "yellow", "purple", "orange", "olive",
                              "seagreen", "magenta", "brown", "lightpink", "indigo", "aqua", "teal"])

    # scatter each cluster in its own color and print its members
    grps = items_rotated.groupby(labels)
    for label, grp in grps:
        plt.scatter(grp.pc1, grp.pc2, c=next(colors), label=label)

        print("*********** Label [{}] ***********".format(label))
        print(grp.index)
        for j in grp.index:
            names = item[item['Item_id'] == j]["Item_name"].to_string(index=False)
            print(j, names)

    # annotate each point with its item name
    for itemid in items_rotated.index:
        x = items_rotated.loc[itemid, "pc1"]
        y = items_rotated.loc[itemid, "pc2"]
        name = item[item['Item_id'] == itemid]["Item_name"].to_string(index=False)
        plt.text(x, y, name)
def cluster(n_clusters, n_components=48):
    """
    n_components=K means use the first K principal components in the clustering.
    n_clusters: the number of clusters we want to create.
    """
    print("first {} PCs explain {:.1f}% of the variance".format(
        n_components, 100 * sum(pca_model.explained_variance_ratio_[:n_components])))

    kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
    kmeans_model.fit(items_rotate.values[:, :n_components])

    # display results
    show_clusters(items_rotate, kmeans_model.labels_)

cluster(n_clusters=14, n_components=25)
first 25 PCs explain 79.8% of the variance
*********** Label [0] ***********
Int64Index([26, 28, 41], dtype='int64')
26  spaghetti sauce
28  canned vegetables
41  ketchup
*********** Label [1] ***********
Int64Index([30], dtype='int64')
30  flour
*********** Label [2] ***********
Int64Index([1, 3, 20, 25, 32, 40, 46], dtype='int64')
1  sugar
3  pet items
20  grapefruit
25  cherries
32  apples
40  berries
46  bananas
*********** Label [3] ***********
Int64Index([12, 19, 29, 35], dtype='int64')
12  shampoo
19  shaving cream
29  hand soap
35  soap
*********** Label [4] ***********
Int64Index([13, 34, 37, 39], dtype='int64')
13  bagels
34  tortillas
37  dinner rolls
39  sandwich loaves
*********** Label [5] ***********
Int64Index([7, 15, 24, 33], dtype='int64')
7  sandwich bags
15  aluminum foil
24  paper towels
33  toilet paper
*********** Label [6] ***********
Int64Index([2, 10, 42, 44, 45], dtype='int64')
2  lettuce
10  carrots
42  cucumbers
44  broccoli
45  cauliflower
*********** Label [7] ***********
Int64Index([5, 11, 31], dtype='int64')
5  waffles
11  cereals
31  pasta
*********** Label [8] ***********
Int64Index([8, 14, 16, 21, 48], dtype='int64')
8  butter
14  eggs
16  milk
21  cheeses
48  yogurt
*********** Label [9] ***********
Int64Index([9, 23, 38, 43], dtype='int64')
9  soda
23  tea
38  juice
43  coffee
*********** Label [10] ***********
Int64Index([22, 36], dtype='int64')
22  frozen vegetables
36  ice cream
*********** Label [11] ***********
Int64Index([18, 27], dtype='int64')
18  laundry detergent
27  dishwashing 
*********** Label [12] ***********
Int64Index([4], dtype='int64')
4  baby items
*********** Label [13] ***********
Int64Index([6, 17, 47], dtype='int64')
6  poultry
17  beef
47  pork

[Figure: item clusters plotted on the first two principal components]
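Finally, since the stated goal is to replace the old, manually created categories, one could persist the item-to-cluster mapping. A minimal sketch (refitting KMeans with the same settings, because the cluster function above does not return the fitted model):

# refit with the chosen settings to obtain a persistent item -> cluster mapping
kmeans_final = KMeans(n_clusters=14, init='k-means++', random_state=42)
final_labels = kmeans_final.fit_predict(items_rotate.values[:, :25])

# attach the new cluster label to each item; item ids are the index of items_rotate
item_clusters = (item.set_index('Item_id')
                     .assign(cluster=pd.Series(final_labels, index=items_rotate.index))
                     .sort_values('cluster'))
item_clusters.head()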