K-Means Cluster: Okay You Built the Model, Then What??? (No Math)

Datasans
5 min read · Jan 14, 2023


Explanation of K-Means clustering

K-Means clustering is an unsupervised machine learning algorithm used for grouping similar data points together. It is one of the most commonly used clustering algorithms in data mining and machine learning. The algorithm divides a dataset into “k” clusters, where “k” is a user-specified number. Each cluster is represented by its centroid, which is the mean of the data points assigned to it. The algorithm iteratively reassigns data points to the cluster with the nearest centroid and recomputes the centroids until the cluster assignments no longer change. The goal of K-Means is to minimize the sum of the squared distances between each data point and the centroid of its cluster.

The basic process of K-Means clustering:

  • The algorithm starts with k initial centroids, which are chosen randomly from the dataset.
  • Each data point is assigned to the cluster whose centroid is closest to it.
  • The centroid of each cluster is then recalculated as the mean of all the data points in the cluster.
  • The process of assigning data points to clusters and recalculating centroids is repeated until the cluster assignments no longer change.
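
The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration and not the scikit-learn implementation used later in this article; the function name `kmeans_sketch`, its parameters, and the toy points are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Toy K-Means (illustrative only): returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated, made-up groups of points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random starting centroids, so the label numbers themselves carry no meaning.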

Advantages of K-Means clustering

  • K-Means is easy to understand and implement
  • It is computationally efficient and scalable for large datasets
  • It works well when the clusters are roughly spherical and of similar size

Disadvantages of K-Means clustering

  • It requires the number of clusters, k, to be specified in advance
  • It is sensitive to initial centroid choice
  • It is not well-suited for non-spherical clusters or clusters of varying sizes.
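
To get a feel for the sensitivity to the initial centroids, here is a small sketch that fits K-Means several times, each time from a single random initialization, and compares the resulting inertia (the sum of squared distances to the nearest centroid). The synthetic data from `make_blobs` and the choice of five seeds are assumptions for illustration only, not part of the wholesale example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration, not the wholesale dataset)
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.5, random_state=0)

# Fit once per seed, each time starting from one random set of centroids
inertias = {}
for seed in range(5):
    model = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed)
    model.fit(X)
    inertias[seed] = model.inertia_
    print(f"seed={seed}  inertia={model.inertia_:.1f}")

# Keep the run with the lowest inertia
best_seed = min(inertias, key=inertias.get)
print("best seed:", best_seed)
```

Different seeds may or may not land on the same solution. scikit-learn's `n_init` parameter automates this idea by running several initializations and keeping the one with the lowest inertia.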

Applications of K-Means clustering

  • Image segmentation
  • Market segmentation
  • Document clustering
  • Anomaly detection
  • Gene expression analysis.

Use Case Example

In this example, we will be using the Wholesale customers dataset, which contains the annual spending records of wholesale customers on different product categories such as Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen. The dataset contains 440 observations and has been collected from a wholesale distributor. We will use the K-Means algorithm to cluster the customers into different groups based on their spending patterns. By analyzing the cluster centroids, distribution of data points within each cluster, and the attributes of the data points in each cluster, we can gain insights into the customer segments and identify patterns and trends in the data.

The Script

import pandas as pd

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv'
df = pd.read_csv(url)

# Drop the Region and Channel columns so that only the spending features remain
df = df.drop(['Region', 'Channel'], axis=1)
df.head()

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialize an empty list to store the silhouette scores
scores = []

# Try different numbers of clusters
for k in range(2, 11):
    # Fit K-Means with k clusters and record the silhouette score
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    score = silhouette_score(df, kmeans.labels_)
    scores.append(score)

# Plot the silhouette scores
import matplotlib.pyplot as plt
plt.plot(range(2, 11), scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). A score closer to 1 indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A score closer to -1 indicates that the object is poorly matched to its own cluster and well matched to neighboring clusters. A score of 0 indicates that the object is on or very close to the decision boundary between two neighboring clusters.
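
To make these ranges concrete, here is a small sketch on made-up one-dimensional data: two tight groups plus one point sitting roughly halfway between them. The data and the label assignment are hypothetical and chosen by hand; `silhouette_samples` returns the score of each individual point, and `silhouette_score` is the average of those values.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Made-up 1-D data: two tight groups plus one point halfway between them
X = np.array([[0.0], [0.2], [0.4], [5.0], [10.0], [10.2], [10.4]])
labels = np.array([0, 0, 0, 0, 1, 1, 1])  # hypothetical cluster assignment

# Silhouette value of every individual point
per_point = silhouette_samples(X, labels)
for x, s in zip(X.ravel(), per_point):
    print(f"x={x:5.1f}  silhouette={s:+.2f}")

# The overall score is the mean of the per-point values
mean_score = silhouette_score(X, labels)
print("mean silhouette:", round(mean_score, 2))
```

The points inside the tight groups should score close to 1, while the point at 5.0 scores close to 0 because it is almost as near to the other cluster as it is to its own.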

In this example, the best silhouette score appears to be around 0.51, when the number of clusters is 2. But we will use K=5 for this case because I want to focus on how to interpret the clusters once they have been scored.

# Choose k = 5 clusters
kmeans = KMeans(n_clusters=5)

# Fit the model to the data
kmeans.fit(df)

# Get the cluster labels
labels = kmeans.labels_
# Add the cluster labels to the dataframe
df['cluster'] = labels

# Plot the data points, colored by cluster
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x=df['Grocery'], y=df['Frozen'], hue=df['cluster'], legend='full')
plt.xlabel('Grocery')
plt.ylabel('Frozen')
plt.show()

# Group the data by the predicted cluster
cluster_group = df.groupby('cluster')

# Calculate the mean of each feature for each cluster
cluster_mean = cluster_group.mean().reset_index()
cluster_mean

# Define a diverging color map and shade every feature column by its value
color_map = sns.diverging_palette(10, 220, as_cmap=True)

feature_cols = [col for col in cluster_mean.columns if col != 'cluster']
cluster_mean.style.background_gradient(cmap=color_map, subset=feature_cols)

Interpret the Clusters

To interpret the clusters, you can look at a measure of central tendency, such as the centroid, for each feature in each cluster. In this example I will use the average of each feature, grouped by cluster.
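
The per-cluster average and the K-Means centroid are closely related: once the algorithm has converged, each centroid is the mean of the points assigned to it. The sketch below checks this on a tiny made-up table; the spending figures and the variable names `toy`, `centroids`, and `means` are hypothetical and not taken from the wholesale dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up spending figures (hypothetical, not from the wholesale dataset)
toy = pd.DataFrame({
    'Fresh':   [100, 120, 110, 900, 950, 920],
    'Grocery': [800, 780, 820, 150, 140, 160],
})

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Centroids reported by the fitted model, one row per cluster label
centroids = pd.DataFrame(km.cluster_centers_, columns=toy.columns)

# Per-cluster feature averages computed from the labels
means = toy.assign(cluster=km.labels_).groupby('cluster').mean()

print(centroids.round(1))
print(means.round(1))
```

On this well-separated toy data both tables should show the same numbers, which is why reading the grouped averages is a reasonable way to describe what each centroid looks like.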

For example,

0 — Cluster 0 has an average of 23,553.19 for the Fresh feature, which is considerably higher than in the other clusters, indicating that customers in this cluster prefer fresh foods. These customers can be characterized as medium spenders with a preference for fresh foods and little interest in Detergents_Paper. If I had to give this cluster a name, I would probably call it “Fresh Products Low Value Customers”.

1 — Cluster 1 has very few purchases in almost all categories relative to the other clusters. Maybe we can give it the name “Low Value Customers”.

2 — Cluster 2 has an average of 43,460.60 for the Grocery feature, which is the highest of all clusters, indicating that customers in this cluster have a strong preference for grocery items. It also has an average of 29,974.20 for the Detergents_Paper feature, again the highest of all clusters, indicating a strong preference for detergents and paper products as well. We can give it the name “Savvy” (Skillful, Advantageous, Value-oriented Purchaser of Milk, Grocery, and Detergents & Papers).

3 — Cluster 3 is similar to cluster 2: it is dominated by Grocery, Milk, and Detergents_Paper purchases, but with lower spending. Maybe we can call it “Middle Class Savvy”.

In your opinion, what name should be given to cluster 4 based on its behavior?

Anyway, you can give a cluster any name you like, but remember that the purpose of naming is to make each segment easier to picture and to give it a treatment that is right on target.

Good Luck!
