A Guide to Data Clustering Methods in Python
Clustering is the process of grouping data points based on shared characteristics. Industries as disparate as retail, finance, and healthcare use clustering techniques for a variety of analytical tasks. In retail, clustering can help identify distinct consumer populations, which a business can then target with ads tailored to demographics that may be too complex to inspect manually. In finance, clustering can detect different forms of illegal market activity, such as order book spoofing, in which traders place deceptive large orders to pressure other traders into buying or selling an asset. In healthcare, clustering methods have been used to characterize patient cost patterns, early-onset neurological disorders, and cancer gene expression.
Python offers many useful tools for performing cluster analysis. The best tool to use depends on the problem at hand and the type of data available. Three widely used techniques are K-means clustering, Gaussian mixture models, and spectral clustering. For relatively simple tasks such as identifying distinct consumer populations, K-means clustering is an excellent choice. For more complicated tasks such as detecting illegal market activity, a more robust and flexible model such as a Gaussian mixture model is better suited. Finally, for large problems with potentially thousands of inputs, spectral clustering is often the best option.
In addition to selecting a suitable clustering algorithm for the problem, you need a way to assess how well the algorithm performs. Typically, the average distance of each observation from the center of its cluster, called the centroid, is used to measure a cluster's compactness. This makes sense because a good clustering algorithm should generate tightly packed groups of data: the closer the data points within a cluster are to each other, the better the result. Plotting the within-cluster sum of squared distances against the number of clusters is a common way to assess performance.
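To make this metric concrete, the following sketch computes the within-cluster sum of squares (WCSS) by hand for a tiny made-up two-cluster example; the data points here are purely illustrative.

```python
import numpy as np

# Toy example: four 2-D points in two known clusters.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of its member points.
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])

# WCSS: sum of squared distances from each point to its cluster centroid.
wcss = sum(
    np.sum((points[labels == k] - centroids[k]) ** 2) for k in (0, 1)
)
print(wcss)  # 1.25
```

A lower WCSS means tighter clusters, which is why the curve of WCSS versus cluster count is used later in this article to choose the number of clusters.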
For our purposes, we will perform a customer segmentation analysis on mall customer data.
Data clustering techniques in Python
- K-means clustering
- Gaussian mixture models
- Spectral clustering
Let’s start by reading our data into a pandas DataFrame:
import pandas as pd

df = pd.read_csv("Mall_Customers.csv")
print(df.head())
We see that our data is quite simple. It contains a column of customer IDs, along with gender, age, income, and a spending score on a scale of 1 to 100. The objective of our clustering exercise is to generate distinct groups of customers, where the members of each group are more similar to one another than to members of other groups.
K-means clustering
K-means clustering is a type of unsupervised machine learning, which means the algorithm trains on inputs only, with no labeled outputs. It works by partitioning the data into k distinct groups (clusters): each point is assigned to the cluster whose mean, or centroid, is closest to that point, and the centroids are recomputed until the assignments stabilize.
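To make the assign-and-update mechanics concrete, here is a minimal from-scratch sketch; the toy data and the helper name `kmeans_sketch` are illustrative only, not part of this article's dataset or of scikit-learn.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Minimal K-means: assign points to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs should end up in different clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [9.0, 9.0], [9.2, 8.8]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
```

In practice you would use scikit-learn's `KMeans` class, as the rest of this article does; the sketch only shows the two alternating steps that class performs internally.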
Let’s import the KMeans class from the cluster module in Scikit-learn:
from sklearn.cluster import KMeans
Next, let’s define the inputs we’ll use for our K-means clustering algorithm. We’ll use age and spending score:
X = df[['Age', 'Spending Score (1-100)']].copy()
The next thing we need to do is decide how many clusters to use. We will use the elbow method, which plots the within-cluster sum of squares (WCSS) against the number of clusters. We need to define a for loop that creates an instance of the KMeans class for each cluster count from one through 10. We’ll also initialize a list that we’ll use to collect the WCSS values:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
We then append each WCSS value to our list. We access these values via the inertia_ attribute of the fitted KMeans object:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
Finally, we can plot the WCSS as a function of the number of clusters. First, let’s import Matplotlib and Seaborn, which will allow us to create and format data visualizations:
import matplotlib.pyplot as plt
import seaborn as sns
Let’s style the plots using Seaborn:
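The styling call itself is missing above; a minimal sketch, assuming seaborn's standard default theme, would be:

```python
import seaborn as sns

# Apply seaborn's default theme (grid, muted colors) to all
# subsequent matplotlib plots.
sns.set()
```

Newer seaborn versions also expose this as `sns.set_theme()`; either applies the same default styling.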
Then plot the WCSS against the clusters:
plt.plot(range(1, 11), wcss)
Then add a title:
plt.title('Selecting the Number of Clusters Using the Elbow Method')
And finally, name the axes:
plt.xlabel('Clusters')
plt.ylabel('WCSS')
plt.show()
From this graph, we can see that four is the optimal number of clusters, as this is where the "elbow," or bend, of the curve appears.
We can see that K-means found four clusters, which break down as follows:
Young customers with a moderate spend score.
Young customers with a high spend score.
Middle-aged customers with a low spend score.
Senior customers with a moderate spend score.
This type of information can be very useful for retail businesses looking to target specific consumer demographics. For example, if most of the people with high spending scores are younger, the company can target those populations with ads and promotions.
Gaussian mixture model (GMM)
This model assumes that clusters can be described by Gaussian distributions. Gaussian distributions, informally known as bell curves, are functions that describe many natural phenomena, such as heights and weights within a population.
These models are useful because Gaussian distributions have well-defined properties: the mean, the variance, and the covariance. The mean is simply the average value of an input within a cluster. The variance measures how much the values of a single input fluctuate. The covariance is a matrix of statistics describing how the inputs relate to one another, specifically how they vary together.
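These three quantities are easy to compute directly; the following sketch does so for a made-up three-point cluster over two inputs (standing in for age and spending score).

```python
import numpy as np

# Toy cluster: rows are observations, columns are the two inputs
# (e.g. age and spending score).
cluster = np.array([[20.0, 80.0], [22.0, 75.0], [24.0, 85.0]])

mean = cluster.mean(axis=0)           # per-input average
var = cluster.var(axis=0)             # per-input spread (population variance)
cov = np.cov(cluster, rowvar=False)   # 2x2 covariance matrix (sample estimate)

print(mean)  # [22. 80.]
```

The off-diagonal entries of the covariance matrix are what let a GMM capture how inputs move together, and hence tilt and stretch each cluster's shape.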
Collectively, these parameters allow the GMM algorithm to flexibly identify clusters of complex shapes. While K-means generally identifies spherical clusters, GMM can identify clusters of many different shapes. This makes GMM more robust than K-means in practice.
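As a quick illustration of that flexibility (the toy data here is synthetic, not the mall dataset), scikit-learn's GaussianMixture can separate an elongated cluster from a round one; its covariance_type parameter controls how flexible each component's shape is allowed to be.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One stretched (non-spherical) blob around the origin and one
# round blob centered at (10, 10).
stretched = rng.normal(size=(100, 2)) * np.array([2.0, 0.5])
round_blob = rng.normal(size=(100, 2)) + np.array([10.0, 10.0])
X_toy = np.vstack([stretched, round_blob])

# covariance_type='full' lets each component learn its own elliptical
# shape; 'spherical' would constrain components to circles, closer in
# spirit to K-means.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X_toy)
print(np.unique(labels))
```

With `covariance_type='full'`, each component gets its own covariance matrix, so the stretched blob can be modeled as an ellipse rather than forced into a circle.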
Let’s start by importing the GMM package from Scikit-learn:
from sklearn.mixture import GaussianMixture
Next, let’s initialize an instance of the GaussianMixture class. Let’s start by considering three clusters and fit the model to our inputs (in this case, age and spending score):
n_clusters = 3
gmm_model = GaussianMixture(n_components=n_clusters)
gmm_model.fit(X)
Now, let’s generate the cluster labels and store the results, along with our inputs, in a new data frame:
cluster_labels = gmm_model.predict(X)
X['cluster'] = cluster_labels
Next, let’s plot each cluster in a for loop:
colors = ['blue', 'green', 'red', 'black', 'yellow']
for k in range(n_clusters):
    data = X[X["cluster"] == k]
    plt.scatter(data["Age"], data["Spending Score (1-100)"], c=colors[k])
And, finally, format the plot:
plt.title("Clusters Identified by Gaussian Mixture Model")
plt.ylabel("Spending Score (1-100)")
plt.xlabel("Age")
plt.show()
The red and blue clusters appear relatively well defined. The blue cluster represents young customers with a high spend score and the red represents young customers with a moderate spend score. The green group is less well defined because it covers all ages and low to moderate spending scores.
Now let’s try four clusters:
...
n_clusters = 4
gmm_model = GaussianMixture(n_components=n_clusters)
...
Although four clusters show a slight improvement, the red and blue clusters are still quite broad in their age and spending score values. So let’s try five clusters:
Five clusters seem appropriate here. They can be described as follows:
Young customers with a high spend score (green).
Young customers with a moderate spend score (black).
Young to middle-aged customers with a low spend score (blue).
Middle-aged to senior customers with a low spend score (yellow).
Middle-aged to senior customers with a moderate spend score (red).
Gaussian mixture models are generally more robust and flexible than K-means clustering. Because GMM captures complex cluster shapes while K-means does not, it can accurately identify clusters that are more complex than the spherical clusters K-means finds. GMM is an ideal method for datasets of moderate size and complexity, as it is better able to capture clusters with complicated shapes.
Spectral clustering
Spectral clustering is a commonly used method for cluster analysis on large and often complex datasets. It works by performing dimensionality reduction on the inputs and generating clusters in the reduced-dimensional space. Since our data doesn’t have many entries, this will be mainly for illustration purposes, but the same method applies straightforwardly to larger and more complex datasets.
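For intuition about the dimensionality-reduction step, here is a rough from-scratch sketch; it is a simplified illustration on made-up data, not scikit-learn's exact implementation. The idea is to build an affinity graph over the points, take the smallest-eigenvalue eigenvectors of its normalized Laplacian as a low-dimensional embedding, and then cluster in that embedding space (for example with K-means).

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embed_sketch(X, k, sigma=1.0):
    """Toy spectral embedding: RBF affinity -> symmetric normalized
    Laplacian -> coordinates from the k smallest eigenvectors."""
    # Pairwise squared distances and Gaussian (RBF) affinities.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Each point is represented by its entries in the k eigenvectors
    # with the smallest eigenvalues (eigh returns them in ascending order).
    _, vecs = eigh(L)
    return vecs[:, :k]

X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
embedding = spectral_embed_sketch(X_toy, k=2)
print(embedding.shape)  # (4, 2)
```

Points in the same tight group land close together in the embedding, so an ordinary clustering algorithm applied to these coordinates recovers the groups. Scikit-learn's SpectralClustering class, used below, packages this pipeline (with a nearest-neighbors affinity option) into one estimator.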
Let’s start by importing the SpectralClustering class from the cluster module in Scikit-learn:
from sklearn.cluster import SpectralClustering
Next, let’s define our SpectralClustering class instance with five clusters:
spectral_cluster_model = SpectralClustering(
    n_clusters=5,
    random_state=25,
    n_neighbors=8,
    affinity='nearest_neighbors'
)
Next, let’s fit our model to our inputs and store the results in the same data frame:
X['cluster'] = spectral_cluster_model.fit_predict(X[['Age', 'Spending Score (1-100)']])
Finally, let’s plot our clusters:
fig, ax = plt.subplots()
sns.scatterplot(x='Age', y='Spending Score (1-100)', data=X, hue="cluster", ax=ax)
ax.set(title="Spectral Clustering")
We see that clusters one through four are quite distinct, while cluster zero looks quite large. Generally, we see some of the same patterns as with K-means and GMM, although those methods gave better separation between the clusters. Again, spectral clustering is best suited for problems involving much larger datasets, such as those with many columns and thousands to millions of rows.
The code for this article is available on GitHub.
Add clustering to your toolbox
Although we have only considered cluster analysis in the context of customer segmentation, it is broadly applicable across a wide range of industries. The clustering methods we have discussed have been used to solve many kinds of problems. K-means clustering has been used to identify vulnerable patient populations. Gaussian mixture models have been used to detect illegal market activities such as fraudulent trading, pump-and-dump schemes, and quote stuffing. Spectral clustering methods have been used on complex healthcare problems, such as clustering medical terms for healthcare knowledge discovery.
No matter what industry, any modern organization or business can find great value in being able to identify important clusters from their data. Python provides many easy-to-implement tools for performing cluster analysis at all levels of data complexity. In addition, having a good knowledge of the methods that work best given the complexity of the data is an invaluable skill for any data scientist. What we’ve covered provides a solid foundation for data scientists starting to learn how to perform cluster analysis.