Cluster Analysis
By Chi Kit Yeung in Statistics Python Machine Learning
August 25, 2024
Introduction
Cluster analysis is a multivariate statistical technique that groups observations based on one or more of the features that describe them. The goal of clustering is to maximize the similarity of observations within a cluster while maximizing the differences between clusters.
Cluster analysis is useful as a starting point to help explore and identify patterns in the data. It is rarely the sole method used in a project and should be supplemented with other techniques.
The lesson will include:
- How to perform Cluster analysis
- How to find the optimal number of clusters
- How to identify appropriate features
- How to interpret results
Math Concepts
Before we proceed, we'll need to know two mathematical concepts: how to measure the distance between two data points, and the centroid.
Measuring Distance Between Two Points
When performing clustering, we will need to measure the distance between data points and between clusters.
On a two dimensional plane, we can find the distance between two points using their coordinates in the following formula:
$$ \text{Euclidean Distance (2D)} $$
$$ d\left(A,B \right)=d\left(B,A \right)=\sqrt{\left( x_{1}-x_{2} \right)^{2} +\left( y_{1}-y_{2} \right)^{2}} $$
In three dimensional space:
$$ \text{Euclidean Distance (3D)} $$
$$ d\left(A,B \right)=d\left(B,A \right)=\sqrt{\left( x_{1}-x_{2} \right)^{2} +\left( y_{1}-y_{2} \right)^{2} + \left( z_{1}-z_{2} \right)^{2}} $$
In n-dimensional space:
$$ \text{Euclidean Distance (n-dimensions)} $$
$$ d\left(A,B \right)=d\left(B,A \right)=\sqrt{\left( a_{1}-b_{1} \right)^{2} +\left( a_{2}-b_{2} \right)^{2} +...+ \left( a_{n}-b_{n} \right)^{2}} $$
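As a quick sketch, the n-dimensional formula maps directly to a few lines of Python (using NumPy here; SciPy's scipy.spatial.distance.euclidean does the same job):
import numpy as np

def euclidean_distance(a, b):
    """Distance between two points given as equal-length coordinate sequences."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

euclidean_distance([1, 2], [4, 6])        # 2D example: 5.0
euclidean_distance([1, 2, 3], [4, 6, 3])  # 3D example: 5.0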
Centroid
A centroid is the mean position of a group of points. It is also known as the center of mass in physics. Think of it as the center of a cluster.
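To make the idea concrete, here is a minimal sketch: the centroid of a cluster is just the element-wise mean of its points.
import numpy as np

points = np.array([[1, 2], [3, 4], [5, 0]])  # three 2D points in one cluster
centroid = points.mean(axis=0)               # array([3., 2.]), the mean x and y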
K-means Clustering
K-means clustering is an unsupervised machine learning method that partitions a dataset into clusters based on the distance of each observation to the cluster centroids.
The algorithm first requires us to specify the number of clusters we want. It then iterates through the following steps until the clusters stabilize.
- Initialize the cluster seeds (centroids)
- Assign each data point to the nearest seed based on its distance
- Adjust the centroids
- Repeat steps 2 and 3
GeeksforGeeks' article on the subject has a very good illustration of how the algorithm works.
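To see these four steps in code, below is a minimal from-scratch sketch using NumPy. It is purely illustrative (plain random seeding, no handling of empty clusters); the next section shows the scikit-learn implementation we actually use.
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids by picking k random points as seeds
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: adjust each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat steps 2 and 3 until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids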
Performing K-means Clustering
# Import the packages
from sklearn.cluster import KMeans

# Select the features (the column positions here are placeholders for the desired features)
x = data.iloc[:, [0, 1]]

# Fit and generate the predictions
i = 2  # choose the number of clusters
kmeans = KMeans(n_clusters=i)
kmeans.fit(x)                 # fit the data
clusters = kmeans.predict(x)  # generate the cluster labels
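The predictions are simply an array of cluster labels, one per row of x. A common follow-up (a sketch that continues from the block above; the plotted columns are whichever two features were selected) is to attach the labels to the data and visualize the clusters:
import matplotlib.pyplot as plt

labels = kmeans.fit_predict(x)          # fit and predict in a single call
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = labels  # attach the cluster labels to the original data

plt.scatter(x.iloc[:, 0], x.iloc[:, 1], c=labels, cmap='rainbow')
plt.show()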
Selecting the Number of Clusters
The Elbow Method
Recall that the goal of clustering is to minimise the differences between observations within the same cluster and maximise the differences between data points in different clusters. We can think of these differences as the distances between the data points. By definition, minimising the distance between data points within a cluster automatically maximises its distance from other clusters.
The distance between points within a cluster is represented by the 'Within-Cluster Sum of Squares' (WCSS). After fitting a model, it is available through the inertia_ attribute:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=i).fit(x)
kmeans.inertia_  # the WCSS of the fitted model
If we plot the WCSS against the number of clusters, the line typically looks like this.
This shows that the more clusters we specify, the lower the distance between points within a cluster will be. We can think of it this way. The closer the number of clusters we define gets to the total number of observations, the closer WCSS gets to 0. However, having 100 clusters for a data set of 100 observations defeats the point of performing a cluster analysis in the first place.
Looking at the chart again, increasing the number of clusters provides diminishing benefit the higher it gets. The ideal number of clusters is where the 'elbow' of the line occurs, the point beyond which adding more clusters does not meaningfully lower the WCSS.
Making the Elbow Chart
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

clusters = range(1, 11)  # number of clusters to iterate over
wcss = []

for i in clusters:
    model = KMeans(n_clusters=i)  # define the number of clusters
    model.fit(x)                  # x is the training data
    wcss.append(model.inertia_)   # append the WCSS to the list

plt.plot(clusters, wcss)          # plot the elbow chart
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Feature Scaling
Scale matters in clustering. Suppose our dataset has feature A with values ranging from 1 to 10 while feature B has values from 1,000 to 10,000. Because K-means works on Euclidean distances, the feature with the larger scale will dominate the clustering; higher numbers are effectively given higher weights. The aim of standardization is to put every feature on a comparable scale so that no feature dominates purely because of its units.
Feature scaling can be accomplished with the same method described in the Linear Regression post.
We can also take advantage of this if we know that a certain feature in our dataset is more important than another: in that case we can choose to standardize only some of the features and leave the more important one on its original scale, as sketched below.
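As a sketch (assuming x is a pandas DataFrame of the selected features, and 'feature_a' is a stand-in column name), standardization can be done with scikit-learn's preprocessing module, and we can choose to scale only some of the columns:
from sklearn import preprocessing

# Standardize every feature to mean 0 and standard deviation 1
x_scaled = preprocessing.scale(x)

# Or standardize only a chosen column ('feature_a' is a placeholder name),
# leaving the more important feature on its original scale
x_partial = x.copy()
x_partial[['feature_a']] = preprocessing.scale(x_partial[['feature_a']])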
Pros and Cons
Pros
- Simple to understand
- Fast to cluster and easy to implement
- There will always be results (for better or worse)
Cons and their remedies
- The number of clusters must be manually specified.
Remedy: the Elbow Method. It can help us arrive at the most likely number of clusters in a dataset, but it is not totally scientific.
- The clusters can be strongly influenced by the initial centroids. The K-means algorithm initializes the centroids randomly, so the model can produce vastly different results from one run to the next.
Remedy: k-means++ initialization (see the sketch after this list).
- K-means is sensitive to outliers. Outliers tend to end up in their own cluster because their Euclidean distance to the rest of the data points is large.
Remedy: remove outliers before performing the clustering.
- K-means produces spherical clusters. Because K-means assigns each point by its Euclidean distance to the nearest centroid, clusters tend to be circular in two dimensions (spherical in higher dimensions), even when the true clusters have a different shape.
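For the initialization problem, scikit-learn's KMeans already defaults to k-means++ seeding; being explicit about it, fixing random_state, and running several initializations with n_init is a minimal sketch of the remedy:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
kmeans.fit(x)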
Other Types of Clustering
There are two types of clustering: flat and hierarchical. K-means belongs to the former. Hierarchical clustering, as its name suggests, divides a dataset into multiple hierarchical levels of clusters. These can be formed in two ways: divisive and agglomerative. Think of the dataset as a pyramid of sorts.
In the divisive method, we start with one mega cluster that encompasses the whole dataset, which gets split progressively further until we're left with the individual observations. An example of this is how species of animals are divided. We start with the cluster 'Animal', then divide it into sub-clusters like 'Fish', 'Mammals', 'Birds', etc. Those sub-clusters are divided further and further until we reach the species level.
The agglomerative process, on the other hand, works from the bottom up. Each individual observation (in this case each species) starts off as its own cluster, and clusters are grouped together into larger sub-clusters until they finally form the top 'Animal' cluster. A common way to represent hierarchical clustering is the dendrogram.
The Dendrogram
The Dendrogram is a representation of hierarchical clusters that shows the relationship between individual observations.
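As a sketch (assuming x_scaled is the standardized feature matrix from the feature-scaling section), a dendrogram can be drawn with SciPy:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(x_scaled, method='ward')  # build the hierarchy bottom-up (agglomerative, Ward linkage)
dendrogram(linked)                         # each merge is drawn as a link between clusters
plt.show()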
Pros
- Hierarchical clustering shows all possible linkages between clusters
- Easy to understand
- No need to specify the number of clusters
- Many available methods to perform it (e.g., the Ward method)
Cons
- Can get messy in large sample sizes
- Computationally expensive
Seaborn Hierarchical Heatmaps
import seaborn as sns

# Create the visualization (x_scaled is the standardized feature matrix)
sns.clustermap(x_scaled)
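clustermap also accepts the linkage method and distance metric directly, so the same Ward linkage mentioned above can be used for the heatmap; a sketch:
sns.clustermap(x_scaled, method='ward', metric='euclidean', cmap='viridis')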