Cluster Analysis: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Cluster Analysis all the way through advanced algorithms, evaluation, interpretation, and practical usage within the DataStatPro application. Whether you are encountering cluster analysis for the first time or looking to deepen your understanding of unsupervised learning and data segmentation, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is Cluster Analysis?
- The Mathematics Behind Cluster Analysis
- Assumptions of Cluster Analysis
- Types of Cluster Analysis Methods
- Using the Cluster Analysis Component
- Hierarchical Clustering
- K-Means and K-Medoids Clustering
- Model-Based and Density-Based Clustering
- Model Fit and Evaluation
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into cluster analysis, it is helpful to be comfortable with the following foundational statistical and mathematical concepts. Each is briefly reviewed below.
1.1 Distance and Similarity
Cluster analysis is fundamentally about grouping objects that are similar to each other and dissimilar from objects in other groups. The core computational tool is a distance or dissimilarity measure between pairs of observations.
For two observations $\mathbf{x}_i$ and $\mathbf{x}_j$ with $p$ variables, the most commonly used distance measure is the Euclidean distance:

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
Key properties of a valid distance measure:
- $d(\mathbf{x}_i, \mathbf{x}_j) \ge 0$ (non-negativity).
- $d(\mathbf{x}_i, \mathbf{x}_j) = 0 \iff \mathbf{x}_i = \mathbf{x}_j$ (identity of indiscernibles).
- $d(\mathbf{x}_i, \mathbf{x}_j) = d(\mathbf{x}_j, \mathbf{x}_i)$ (symmetry).
- $d(\mathbf{x}_i, \mathbf{x}_k) \le d(\mathbf{x}_i, \mathbf{x}_j) + d(\mathbf{x}_j, \mathbf{x}_k)$ (triangle inequality).
A similarity measure is the inverse concept: high similarity means the objects are close. It can always be converted to a dissimilarity: $d = 1 - s$ (for similarities $s$ bounded by $[0, 1]$).
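The metric properties above can be verified numerically. The following is a minimal sketch (the function name `euclidean` and the toy points are illustrative, not part of DataStatPro), assuming NumPy is available:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean (L2) distance between two observation vectors."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

# Three toy observations with p = 2 variables each
a, b, c = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]

d_ab = euclidean(a, b)
d_ba = euclidean(b, a)
d_ac = euclidean(a, c)
d_bc = euclidean(b, c)

# The four metric properties from the list above:
assert d_ab >= 0                      # non-negativity
assert euclidean(a, a) == 0.0         # identity of indiscernibles
assert d_ab == d_ba                   # symmetry
assert d_ac <= d_ab + d_bc            # triangle inequality

# Converting a bounded similarity s in [0, 1] to a dissimilarity:
s = 0.8
d = 1 - s
```

For the points chosen here, `d_ab` works out to 5.0 (a 3-4-5 triangle), which makes the triangle-inequality check easy to follow by hand.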
1.2 Vectors and Centroids
An observation in a dataset with $p$ variables is represented as a vector in $p$-dimensional space:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^\top$$

The centroid of a set of $n$ observations is their arithmetic mean:

$$\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$$

In K-means clustering, the centroid is the representative point of each cluster.
1.3 Variance and Within-Group Variance
The total variance of a dataset measures overall dispersion:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \|\mathbf{x}_i - \bar{\mathbf{x}}\|^2$$

The within-cluster sum of squares (WCSS) measures how tightly packed observations are within their assigned clusters:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

Where $C_k$ is the set of observations in cluster $k$ and $\boldsymbol{\mu}_k$ is the centroid of cluster $k$. Minimising WCSS is the objective of K-means clustering.
1.4 Matrix Notation and the Distance Matrix
Given $n$ observations, the distance matrix $\mathbf{D}$ is an $n \times n$ symmetric matrix whose element $d_{ij}$ is the distance between observations $i$ and $j$:

$$\mathbf{D} = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}$$

The distance matrix is the primary input to hierarchical and many other clustering algorithms. It is always symmetric ($d_{ij} = d_{ji}$) with zeros on the diagonal.
1.5 Probability Distributions and Mixture Models
A probability distribution describes the likelihood of observing each value of a random variable $X$. In model-based clustering, the data are assumed to arise from a mixture of $K$ distributions:

$$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, f_k(\mathbf{x} \mid \boldsymbol{\theta}_k)$$

Where:
- $K$ = number of mixture components (clusters).
- $\pi_k$ = mixing proportion for component $k$ ($\pi_k > 0$, $\sum_{k=1}^{K} \pi_k = 1$).
- $f_k$ = density function of component $k$ with parameters $\boldsymbol{\theta}_k$.
The most common choice is the Gaussian mixture model (GMM), where each component is a multivariate normal distribution.
1.6 Optimisation and Convergence
Many clustering algorithms work by iteratively optimising an objective function (e.g., minimising WCSS). Key concepts:
Iteration: Repeating the same update steps until a stopping criterion is met.
Convergence: The algorithm has converged when the objective function $J$ changes by less than a small threshold $\epsilon$ between iterations:

$$|J^{(t)} - J^{(t-1)}| < \epsilon$$
Local vs. Global Optimum: Iterative algorithms like K-means often converge to a local (not global) optimum — the solution depends on the random starting configuration. Running the algorithm multiple times with different starts helps find a better (though still not guaranteed globally optimal) solution.
1.7 Supervised vs. Unsupervised Learning
Supervised learning uses labelled data (known outcomes) to train a model — examples include regression and classification. Unsupervised learning discovers structure in unlabelled data without a predefined outcome variable.
Cluster analysis is unsupervised — there is no "correct answer" against which to evaluate the solution. This makes cluster analysis both powerful (works without labels) and challenging (no ground truth for validation). All cluster validation is internal (based on the structure of the clustered data itself) or theoretical (based on external knowledge).
2. What is Cluster Analysis?
2.1 The Core Idea
Cluster analysis (also called clustering or cluster detection) is a class of unsupervised statistical learning methods that partition a dataset into groups (clusters) such that:
- Observations within the same cluster are as similar as possible to each other (high intra-cluster homogeneity).
- Observations in different clusters are as different as possible from each other (high inter-cluster separation).
Unlike classification (which assigns new observations to predefined categories), cluster analysis discovers the categories themselves from the data. The researcher does not specify what the groups should look like — the algorithm determines which observations naturally belong together.
2.2 What Cluster Analysis Can and Cannot Do
Cluster analysis CAN:
- Reveal natural groupings in complex high-dimensional data.
- Generate hypotheses about subgroup differences.
- Reduce data complexity by summarising observations by their cluster membership.
- Identify outliers and unusual observations (those that do not fit well in any cluster).
- Segment populations for targeted interventions or personalisation.
Cluster analysis CANNOT:
- Prove that clusters are "real" in a statistical testing sense — there is always some grouping solution, even for completely random data.
- Identify causes of group differences (it is descriptive, not inferential).
- Tell you the "right" number of clusters — this requires the researcher's judgement.
- Produce clusters that generalise to new samples without validation.
2.3 Real-World Applications
| Field | Application | Variables Used |
|---|---|---|
| Medicine | Identifying patient subgroups with similar disease profiles or treatment responses | Lab values, symptoms, genetic markers |
| Marketing | Customer segmentation for targeted advertising and product development | Purchase behaviour, demographics, preferences |
| Genomics | Grouping genes with similar expression patterns across conditions | Gene expression levels across samples |
| Psychology | Identifying latent subtypes in clinical populations | Symptom profiles, cognitive test scores |
| Finance | Portfolio diversification by grouping assets with correlated returns | Return series, risk metrics |
| Image Analysis | Colour segmentation; pixel grouping in image compression | RGB values, spatial coordinates |
| Social Science | Identifying socioeconomic neighbourhood profiles | Income, education, employment, housing |
| Ecology | Species distribution grouping; habitat classification | Environmental variables, species counts |
| Astronomy | Classifying galaxies or stars by spectral characteristics | Luminosity, colour index, mass |
2.4 Cluster Analysis vs. Related Methods
| Feature | Cluster Analysis | Factor Analysis | PCA | Discriminant Analysis |
|---|---|---|---|---|
| Goal | Group observations | Group variables | Reduce variable dimensions | Separate known groups |
| Supervised? | No | No | No | Yes |
| Output | Cluster memberships | Latent factors | Component scores | Classification rules |
| Known groups? | No | No | No | Yes |
| Operates on | Rows (observations) | Columns (variables) | Columns (variables) | Both |
| Distance used? | Yes (usually) | No | No | Yes (Mahalanobis) |
2.5 The Fundamental Challenge: Defining Similarity
Every clustering algorithm embeds assumptions about what "similar" means. The choice of:
- Distance measure (Euclidean, Manhattan, cosine, etc.).
- Linkage method (for hierarchical clustering).
- Number of clusters $K$.
- Cluster shape assumption (spherical, elliptical, arbitrary).
...all profoundly influence the resulting clusters. There is no universally correct clustering solution — different algorithms and distance measures can produce very different groupings from the same data. Always examine results from multiple approaches.
3. The Mathematics Behind Cluster Analysis
3.1 Distance and Dissimilarity Measures
The choice of distance measure determines how "closeness" is defined and which clusters are formed. Different measures emphasise different aspects of dissimilarity.
Euclidean Distance (L2 norm):

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$

Sensitive to scale differences between variables — standardisation is essential.
Manhattan Distance (L1 norm, City Block):

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$$

Less sensitive to outliers than Euclidean distance. Appropriate for grid-like data.
Minkowski Distance (generalisation of Euclidean and Manhattan):

$$d(\mathbf{x}_i, \mathbf{x}_j) = \left(\sum_{k=1}^{p} |x_{ik} - x_{jk}|^m\right)^{1/m}$$

- $m = 1$: Manhattan distance.
- $m = 2$: Euclidean distance.
- $m \to \infty$: Chebyshev distance ($\max_k |x_{ik} - x_{jk}|$).
Mahalanobis Distance (accounts for correlations and scale):

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^\top \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}$$

Where $\mathbf{S}$ is the sample covariance matrix. Invariant to scale and rotation; accounts for correlations among variables.
Cosine Dissimilarity (based on angle between vectors):

$$d(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\| \, \|\mathbf{x}_j\|}$$

Appropriate for high-dimensional data (text, documents) where magnitude is less important than direction.
Gower's Distance (for mixed variable types):

For a dataset with $p$ variables of mixed types (continuous, ordinal, binary, nominal):

$$d_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk} \, d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$$

Where $\delta_{ijk} = 1$ if the $k$-th variable is usable for comparison (non-missing), and $d_{ijk}$ is the contribution of variable $k$ to the dissimilarity:

- Continuous: $d_{ijk} = |x_{ik} - x_{jk}| / R_k$, where $R_k$ is the range of variable $k$.
- Binary/Nominal: $d_{ijk} = 0$ if $x_{ik} = x_{jk}$, else $1$.
- Ordinal: Similar to continuous after ranking.
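Gower's weighted average is easy to compute by hand for a pair of observations. A minimal sketch (the `gower` helper, variable names, and the three-variable example are illustrative assumptions):

```python
import numpy as np

def gower(x, y, kinds, ranges):
    """Gower dissimilarity between two observations of mixed types.
    kinds: 'num' (continuous) or 'cat' (binary/nominal) per variable.
    ranges: range R_k of each continuous variable over the dataset."""
    num, den = 0.0, 0.0
    for xv, yv, kind, r in zip(x, y, kinds, ranges):
        if xv is None or yv is None:        # delta_ijk = 0: skip missing
            continue
        if kind == 'num':
            num += abs(xv - yv) / r         # |x_ik - x_jk| / R_k
        else:
            num += 0.0 if xv == yv else 1.0
        den += 1.0                           # delta_ijk = 1
    return num / den

# Variables: age (continuous, dataset range 50), smoker (binary), blood type (nominal)
kinds  = ['num', 'cat', 'cat']
ranges = [50.0, None, None]
d = gower([30, 'yes', 'A'], [40, 'yes', 'B'], kinds, ranges)
# contributions: 10/50 = 0.2, 0, 1  ->  (0.2 + 0 + 1) / 3 = 0.4
```

Because every contribution is bounded by 1, Gower's distance always lies in $[0, 1]$ regardless of how the variable types are mixed.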
3.2 Distance Measure Comparison Table
| Distance | Formula | Scale Sensitive | Outlier Robust | Variable Type | Best For |
|---|---|---|---|---|---|
| Euclidean | $\sqrt{\sum_k (x_{ik} - x_{jk})^2}$ | Yes | No | Continuous | General purpose (after standardising) |
| Manhattan | $\sum_k \lvert x_{ik} - x_{jk} \rvert$ | Yes | Moderate | Continuous | Data with mild outliers |
| Minkowski | $(\sum_k \lvert x_{ik} - x_{jk} \rvert^m)^{1/m}$ | Yes | Varies | Continuous | Tuning between L1 and L2 |
| Mahalanobis | $\sqrt{(\mathbf{x}_i - \mathbf{x}_j)^\top \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}$ | No | No | Continuous | Correlated variables |
| Cosine | $1 - \cos\theta$ | No | Moderate | Continuous | High-dimensional, text |
| Gower | Weighted average | No | Moderate | Mixed | Mixed data types |
| Binary (Jaccard) | $1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | No | Moderate | Binary | Presence/absence data |
3.3 The Hierarchical Clustering Algorithm
Hierarchical clustering builds a dendrogram (tree structure) by successively merging (agglomerative) or splitting (divisive) clusters.
Agglomerative clustering algorithm (bottom-up):
Step 1: Start with $n$ clusters, each containing exactly one observation:

$$C_i = \{\mathbf{x}_i\}, \quad i = 1, \dots, n$$

Step 2: Compute the $n \times n$ distance matrix $\mathbf{D}$.
Step 3: Find the two clusters $A$ and $B$ with minimum inter-cluster distance:

$$(A, B) = \arg\min_{A \neq B} D(A, B)$$

Step 4: Merge $A$ and $B$ into a new cluster: $C_{\text{new}} = A \cup B$.
Step 5: Update the distance matrix by removing the rows/columns for $A$ and $B$ and adding a new row/column for $C_{\text{new}}$ using the chosen linkage method.
Step 6: Repeat Steps 3–5 until a single cluster remains.
The result is a binary tree (dendrogram) recording the order and heights of all merges.
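The agglomerative procedure above is implemented in standard scientific libraries; a minimal sketch using SciPy (assuming SciPy is installed — `linkage` performs Steps 1–6 internally and returns the full merge record):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two clearly separated toy groups in 2-D
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

# Z records every merge of the dendrogram: for n observations there are
# n - 1 rows, each holding (cluster a, cluster b, merge height, size).
Z = linkage(X, method='ward')

# Cut the tree so that exactly K = 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
```

Because the two toy groups are five standard deviations apart, Ward's linkage recovers them exactly: the first ten observations receive one label and the last ten the other.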
3.4 Linkage Methods
The linkage method determines how the distance between two clusters is computed from the distances between their constituent observations. Different linkage methods produce dramatically different cluster shapes.
Single Linkage (Minimum):

$$D(A, B) = \min_{\mathbf{x} \in A,\, \mathbf{y} \in B} d(\mathbf{x}, \mathbf{y})$$

Distance = distance between the closest pair of observations from each cluster. Prone to chaining — elongated, chain-like clusters.

Complete Linkage (Maximum):

$$D(A, B) = \max_{\mathbf{x} \in A,\, \mathbf{y} \in B} d(\mathbf{x}, \mathbf{y})$$

Distance = distance between the farthest pair of observations from each cluster. Produces compact, roughly equal-sized clusters.

Average Linkage (UPGMA):

$$D(A, B) = \frac{1}{|A|\,|B|} \sum_{\mathbf{x} \in A} \sum_{\mathbf{y} \in B} d(\mathbf{x}, \mathbf{y})$$

Distance = average of all pairwise distances between the two clusters. A compromise between single and complete linkage.

Ward's Method (Minimum Variance):
Merges the two clusters that result in the smallest increase in total within-cluster sum of squares (WCSS):

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|} \|\bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B\|^2$$

Where $\bar{\mathbf{x}}_A$ and $\bar{\mathbf{x}}_B$ are the centroids of clusters $A$ and $B$. Ward's method tends to produce compact, similarly sized spherical clusters and is the most widely used linkage in practice.

Centroid Linkage:

$$D(A, B) = \|\bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B\|^2$$

Distance = squared Euclidean distance between cluster centroids. Can produce inversions in the dendrogram (a merged cluster appearing lower than its components) — generally not recommended.
3.5 The K-Means Objective Function
K-means clustering partitions $n$ observations into $K$ clusters by minimising the within-cluster sum of squares (WCSS), also called the inertia:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

Where:
- $K$ = number of clusters (pre-specified by the researcher).
- $C_k$ = set of observations in cluster $k$.
- $\boldsymbol{\mu}_k$ = centroid of cluster $k$.

The WCSS can be equivalently written using the between-cluster sum of squares (BCSS), since the total sum of squares is fixed:

$$\text{TSS} = \text{WCSS} + \text{BCSS}, \qquad \text{BCSS} = \sum_{k=1}^{K} n_k \|\boldsymbol{\mu}_k - \bar{\mathbf{x}}\|^2$$

Where $n_k = |C_k|$ and $\bar{\mathbf{x}}$ is the global mean of all observations. Minimising WCSS is therefore equivalent to maximising BCSS.
3.6 The K-Means Algorithm (Lloyd's Algorithm)
Initialisation: Randomly select $K$ observations as initial centroids $\boldsymbol{\mu}_1^{(0)}, \dots, \boldsymbol{\mu}_K^{(0)}$.

Step 1 — Assignment Step: Assign each observation to the cluster with the nearest centroid:

$$C_k^{(t)} = \{\mathbf{x}_i : \|\mathbf{x}_i - \boldsymbol{\mu}_k^{(t)}\| \le \|\mathbf{x}_i - \boldsymbol{\mu}_j^{(t)}\| \text{ for all } j\}$$

Step 2 — Update Step: Recompute each centroid as the mean of all observations assigned to that cluster:

$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{|C_k^{(t)}|} \sum_{\mathbf{x}_i \in C_k^{(t)}} \mathbf{x}_i$$

Convergence: Repeat Steps 1–2 until cluster assignments no longer change:

$$C_k^{(t+1)} = C_k^{(t)} \text{ for all } k$$

Or until the change in WCSS is below a threshold $\epsilon$:

$$|\text{WCSS}^{(t)} - \text{WCSS}^{(t-1)}| < \epsilon$$
Convergence guarantee: K-means always converges in a finite number of iterations because:
- The assignment step never increases WCSS.
- The update step never increases WCSS.
- The number of possible cluster assignments is finite.
However, the converged solution may be a local optimum. Run with at least 20–50 random starts and keep the solution with the lowest WCSS.
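Lloyd's two-step iteration, including the multiple-starts guard against local optima, can be written compactly in NumPy. A minimal sketch (the `kmeans` function and all parameter defaults are illustrative, not DataStatPro's implementation):

```python
import numpy as np

def kmeans(X, K, n_starts=20, max_iter=100, seed=0):
    """Lloyd's algorithm with multiple random starts; keeps the
    solution with the lowest WCSS (a guard against local optima)."""
    rng = np.random.default_rng(seed)
    best_wcss, best_labels, best_centroids = np.inf, None, None
    for _ in range(n_starts):
        centroids = X[rng.choice(len(X), K, replace=False)]
        for _ in range(max_iter):
            # Assignment step: nearest centroid for every observation
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: centroid = mean of assigned observations
            new_centroids = np.array(
                [X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                 for k in range(K)])
            if np.allclose(new_centroids, centroids):   # converged
                break
            centroids = new_centroids
        # Final assignment and objective for this start
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                                axis=2).argmin(axis=1)
        wcss = np.sum((X - centroids[labels]) ** 2)
        if wcss < best_wcss:
            best_wcss, best_labels, best_centroids = wcss, labels, centroids
    return best_labels, best_centroids, best_wcss

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])
labels, centroids, wcss = kmeans(X, K=2)
```

With two well-separated toy groups, every restart converges to the same global solution, so the best-of-20 WCSS simply reflects the natural grouping.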
3.7 K-Means++ Initialisation
Standard random initialisation of K-means often converges to poor local optima. K-means++ (Arthur & Vassilvitskii, 2007) provides a smarter initialisation:

Step 1: Choose the first centroid $\boldsymbol{\mu}_1$ uniformly at random from the data.

Step 2: For each subsequent centroid $\boldsymbol{\mu}_j$ ($j = 2, \dots, K$), choose observation $\mathbf{x}$ with probability proportional to its squared distance from the nearest already-chosen centroid:

$$P(\mathbf{x}) = \frac{D(\mathbf{x})^2}{\sum_{\mathbf{x}'} D(\mathbf{x}')^2}$$

Where $D(\mathbf{x}) = \min_{j' < j} \|\mathbf{x} - \boldsymbol{\mu}_{j'}\|$.

Step 3: Repeat Step 2 until $K$ centroids are chosen, then run standard K-means.

K-means++ provides an $O(\log K)$ approximation guarantee on the expected WCSS and typically converges much faster than random initialisation.
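The distance-weighted sampling of Steps 1–3 fits in a few lines of NumPy. A minimal sketch (the `kmeanspp_init` helper is an illustrative name):

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    """K-means++ seeding: each new centroid is sampled with
    probability proportional to its squared distance D(x)^2
    from the nearest centroid already chosen."""
    centroids = [X[rng.integers(len(X))]]            # Step 1: uniform draw
    for _ in range(1, K):                            # Steps 2-3
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                        # P(x) = D(x)^2 / sum D^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(6, 0.2, (15, 2))])
init = kmeanspp_init(X, K=2, rng=rng)
# With two tight, distant groups, the second seed is drawn almost
# surely from the group the first seed did not come from.
```

The key design point is that points already near a chosen centroid have $D(\mathbf{x})^2 \approx 0$ and are effectively never re-selected, which is what spreads the seeds out.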
3.8 The Gaussian Mixture Model (GMM)
Model-based clustering assumes that the data arise from a finite mixture of multivariate Gaussian distributions. Each component represents a cluster:

$$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Where:
- $\pi_k$ = mixing proportion of component $k$ ($\sum_{k=1}^{K} \pi_k = 1$).
- $\boldsymbol{\mu}_k$ = mean vector of component $k$.
- $\boldsymbol{\Sigma}_k$ = covariance matrix of component $k$.
- $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ = multivariate normal density:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-p/2} \, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
3.9 The EM Algorithm for GMM
The parameters $\boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$ are estimated by maximising the log-likelihood:

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

The Expectation-Maximisation (EM) algorithm iterates between two steps:

E-Step (Expectation): Compute the posterior responsibility (soft assignment probability) of component $k$ for observation $i$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

M-Step (Maximisation): Update parameters using the current responsibilities:

Effective cluster sizes: $N_k = \sum_{i=1}^{n} \gamma_{ik}$

Updated mixing proportions: $\pi_k = N_k / n$

Updated means: $\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, \mathbf{x}_i$

Updated covariance matrices: $\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top$
The EM algorithm is guaranteed to converge to a local maximum of the log-likelihood. Key advantage over K-means: soft (probabilistic) cluster memberships — each observation belongs to each cluster with a probability, not a hard 0/1 assignment.
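The E- and M-steps above can be demonstrated with a deliberately simplified one-dimensional, two-component sketch (the `em_gmm_1d` helper and its min/max initialisation are illustrative assumptions, not a production GMM fitter):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """One-dimensional two-component Gaussian mixture fitted by EM."""
    n = len(x)
    pi = np.array([0.5, 0.5])                  # mixing proportions
    mu = np.array([x.min(), x.max()])          # crude but stable init
    var = np.full(2, np.var(x))                # initial variances
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik (soft assignments)
        dens = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update pi_k, mu_k, var_k from the responsibilities
        Nk = gamma.sum(axis=0)                 # effective cluster sizes
        pi = Nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, gamma

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)])
pi, mu, var, gamma = em_gmm_1d(x)
# With well-separated components, the recovered means land near 0 and 10
# and the mixing proportions near 0.5 each.
```

Note how `gamma` rows always sum to 1: each observation's membership is split probabilistically across the two components, which is exactly the "soft assignment" contrast with K-means.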
4. Assumptions of Cluster Analysis
4.1 The Existence of Clusters (Clusterability)
The most fundamental assumption is that the data contain genuine cluster structure — that the population from which the data are drawn is genuinely multimodal or heterogeneous.
Why it matters: Clustering algorithms will always produce clusters, even for completely random, uniform data. A solution from random data is meaningless.
How to check:
- Hopkins statistic: Tests whether data are uniformly distributed (no cluster structure).

$$H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}$$

Where $u_i$ is the distance from a randomly sampled point to its nearest neighbour in the data, and $w_i$ is the distance from a randomly chosen data point to its nearest neighbour. Under a uniform distribution, $H \approx 0.5$; $H > 0.75$ suggests cluster structure.
- Visual methods: Plot pairwise scatter plots and look for visible groupings.
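The Hopkins check can be sketched in a few lines of NumPy (the `hopkins` function name and the $m = n/10$ probe fraction are illustrative choices):

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic: compares nearest-neighbour distances of
    uniformly sampled probe points (u_i) with those of real data
    points (w_i). H near 0.5 -> uniform; H near 1 -> clustered."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, p))            # uniform probe points
    sample_idx = rng.choice(n, m, replace=False)    # real probe points

    def nn_dist(point, exclude=None):
        d = np.linalg.norm(X - point, axis=1)
        if exclude is not None:
            d[exclude] = np.inf                     # don't match itself
        return d.min()

    u = sum(nn_dist(pt) for pt in U)
    w = sum(nn_dist(X[i], exclude=i) for i in sample_idx)
    return u / (u + w)

rng = np.random.default_rng(7)
clustered = np.vstack([rng.normal(0, 0.1, (50, 2)),
                       rng.normal(5, 0.1, (50, 2))])
uniform = rng.uniform(0, 5, (100, 2))
```

For the tightly clustered toy data, $w_i$ values are tiny while most uniform probes fall far from either blob, pushing $H$ close to 1; for the uniform data, $H$ hovers around 0.5.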
4.2 Scale and Measurement Consistency
All continuous variables must be on comparable scales before computing Euclidean distances. If one variable is measured in millimetres (range 1–10) and another in kilograms (range 50–100), the larger-scale variable will dominate the distance calculation.
How to check:
- Compare the standard deviations of all variables.
- If they differ substantially, standardise (z-score) all variables before clustering.
Standard z-score transformation:

$$z = \frac{x - \bar{x}}{s}$$
When NOT to standardise:
- When the original scale differences are meaningful (e.g., all variables are on the same scale, such as blood cell counts in the same unit).
- When using Mahalanobis distance (already scale-invariant).
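Column-wise standardisation is a one-liner in NumPy. A minimal sketch (the `standardise` helper and the height/weight example are illustrative):

```python
import numpy as np

def standardise(X):
    """Column-wise z-scores: (x - mean) / sd, so every variable
    ends up with mean 0 and SD 1 before distances are computed."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Height in mm (large scale) vs. weight in kg (small scale):
# without standardising, height would dominate any Euclidean distance.
X = np.array([[1700.0, 60.0],
              [1800.0, 80.0],
              [1650.0, 70.0],
              [1750.0, 90.0]])
Z = standardise(X)
```

After the transformation both columns contribute on the same footing, which is the whole point of Step 4 in the application workflow described later.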
4.3 Absence of Extreme Outliers
Outliers severely distort distance-based clustering methods (especially K-means and Ward's hierarchical clustering):
- Outliers may form their own singleton clusters.
- Outliers pull centroids toward them, distorting the clusters for all other observations.
How to check:
- Univariate: Box plots, $z$-scores with $|z| > 3$.
- Multivariate: Mahalanobis distance compared against a $\chi^2_p$ critical value.
- Visualise: Scatter plot matrix for small $p$.
How to handle:
- Remove extreme outliers before clustering (and report their removal).
- Use outlier-robust clustering methods (e.g., K-medoids, DBSCAN).
- Add outlier detection as a separate pre-processing step.
4.4 Appropriate Choice of Distance Measure
The distance measure should be appropriate for the type and scale of the variables:
- Continuous variables: Euclidean or Mahalanobis (after standardising).
- Binary variables: Jaccard or Dice coefficient.
- Mixed types: Gower's distance.
- High-dimensional: Cosine similarity.
Choosing the wrong distance measure can produce meaningless clusters even when genuine structure exists.
4.5 Spherical Cluster Shape (for K-Means)
K-means assumes clusters are convex and roughly spherical in shape (equal variance in all directions). Non-spherical cluster shapes (elongated, curved, or irregular) will be poorly recovered by K-means.
How to check:
- After clustering, plot the clusters and visually inspect their shapes.
- Compute silhouette coefficients — low values for many points suggest non-spherical clusters.
Remedy: Use GMM (which can model elliptical shapes), hierarchical clustering with single linkage (for chain-like clusters), or DBSCAN (for arbitrary shapes).
4.6 Appropriate Number of Clusters (K)
The number of clusters $K$ must be chosen appropriately. Specifying too few clusters over-groups heterogeneous observations; too many clusters fragments natural groups.

How to determine $K$: Use multiple criteria (see Section 10) including the elbow method, silhouette analysis, gap statistic, and BIC. Never rely on a single criterion.
4.7 Adequate Sample Size
Reliable clustering requires sufficient observations per cluster for stable, reproducible results:
| Method | Minimum Total | Minimum per Cluster |
|---|---|---|
| Hierarchical | 30 | 5 |
| K-means | $10 \times K$ | 10 |
| GMM | $10 \times K \times p$ | $> p$ (to estimate $\boldsymbol{\Sigma}_k$) |
| DBSCAN | 50 | Depends on MinPts |

Where $K$ is the number of clusters and $p$ is the number of variables. (These minimums are rough rules of thumb, not hard cut-offs.)
⚠️ With small samples, cluster solutions are highly unstable and may not replicate in new data. Always validate the solution using bootstrapping or an independent sample.
5. Types of Cluster Analysis Methods
5.1 Classification by Methodology
Clustering methods are broadly classified by their algorithmic strategy:
| Category | Methods | Cluster Shape | Hard/Soft | Output |
|---|---|---|---|---|
| Hierarchical | Ward, Complete, Single, Average, Centroid | Flexible | Hard | Dendrogram + any K |
| Partitional | K-Means, K-Medoids (PAM), K-Modes | Spherical/Euclidean | Hard | K clusters |
| Model-Based | GMM, Mclust | Elliptical | Soft | K clusters + probabilities |
| Density-Based | DBSCAN, OPTICS, HDBSCAN | Arbitrary | Hard | Clusters + noise points |
| Fuzzy | Fuzzy C-Means | Spherical | Soft | Membership degrees |
| Spectral | Spectral Clustering | Arbitrary | Hard | K clusters |
| Grid-Based | CLIQUE, STING | Grid cells | Hard | Dense grid cells |
5.2 Hierarchical vs. Partitional Methods
| Feature | Hierarchical | Partitional (K-Means) |
|---|---|---|
| Specify $K$ in advance? | No | Yes |
| Deterministic? | Yes | No (random starts) |
| Scalability | $O(n^2)$ to $O(n^3)$ | $O(nKp)$ per iteration |
| Output | Complete hierarchy (dendrogram) | Single solution for a given $K$ |
| Reversible merges/splits? | No | Yes (re-run with different $K$) |
| Works for large $n$? | Poorly ($O(n^2)$ memory) | Well |
| Mixed variable types | Yes (with Gower) | Limited |
5.3 Agglomerative vs. Divisive Hierarchical Clustering
| Feature | Agglomerative (Bottom-Up) | Divisive (Top-Down) |
|---|---|---|
| Start | $n$ singleton clusters | 1 cluster containing all $n$ |
| Process | Merge closest clusters | Split most heterogeneous cluster |
| Common example | Ward, Complete, Average linkage | DIANA (Divisive Analysis) |
| Computational cost | $O(n^3)$ (naive) | $O(2^n)$ in exact form |
| Usage | Very common | Less common |
5.4 Hard vs. Soft (Fuzzy) Clustering
| Feature | Hard (Crisp) Clustering | Soft (Fuzzy) Clustering |
|---|---|---|
| Assignment | Each observation belongs to exactly one cluster | Each observation has a degree of membership in each cluster |
| Membership | $u_{ik} \in \{0, 1\}$, $\sum_{k=1}^{K} u_{ik} = 1$ | $u_{ik} \in [0, 1]$, $\sum_{k=1}^{K} u_{ik} = 1$ |
| Examples | K-Means, Hierarchical | GMM, Fuzzy C-Means |
| Best for | Well-separated clusters | Overlapping clusters; ambiguous boundary cases |
| Output | Cluster label vector | Membership matrix |
5.5 DataStatPro Implementation Overview
The DataStatPro Cluster Analysis component implements:
| Method | Implementation | Best For |
|---|---|---|
| Agglomerative Hierarchical | Ward, Complete, Average, Single linkage | Exploratory; unknown $K$; small to medium $n$ |
| K-Means | Lloyd's algorithm with K-Means++ init | Large $n$; continuous data; known approximate $K$ |
| K-Medoids (PAM) | Partitioning Around Medoids | Robust to outliers; mixed data |
| Gaussian Mixture Models | EM algorithm; multiple covariance structures | Soft assignments; elliptical clusters |
| DBSCAN | Density-based; automatic noise detection | Arbitrary shapes; outlier detection |
6. Using the Cluster Analysis Component
The Cluster Analysis component in DataStatPro provides a complete end-to-end workflow for performing, evaluating, and visualising cluster analyses.
Step-by-Step Guide
Step 1 — Select Dataset
Choose the dataset from the "Dataset" dropdown. Ensure:
- All clustering variables are in separate numeric columns.
- Categorical variables are appropriately recoded (dummy-coded or use Gower's distance).
- The dataset has been screened for missing data and extreme outliers.
- Variables are on comparable scales (or will be standardised in Step 4).
💡 Tip: Run descriptive statistics and a correlation matrix before clustering. Highly correlated variables are redundant — consider removing one from each pair to avoid over-weighting that dimension in the distance calculation.
Step 2 — Select Clustering Variables
Select all variables to include in the cluster analysis from the "Clustering Variables" dropdown. Only include variables that are:
- Theoretically relevant to the grouping hypothesis.
- Sufficiently variable (variables with near-zero variance contribute nothing to clustering).
- Measured on compatible scales (or will be standardised).
⚠️ Important: Do not include identifier variables (ID numbers), date variables, or outcome variables you intend to use to validate the clusters (these should be held out for post-hoc validation, not used in forming the clusters).
Step 3 — Select Clustering Method
Choose from the "Method" dropdown:
- Hierarchical (Agglomerative): For exploratory analysis; produces a dendrogram. Select linkage method: Ward (recommended), Complete, Average, or Single.
- K-Means: For large datasets with a known approximate number of clusters. Uses K-Means++ initialisation by default.
- K-Medoids (PAM): More robust than K-Means; works with any distance matrix.
- Gaussian Mixture Model (GMM): For soft assignments and elliptical clusters. Select covariance structure: VVV (most flexible), EEE (equal covariance), or others.
- DBSCAN: For arbitrary-shaped clusters and automatic outlier detection. Specify $\varepsilon$ (neighbourhood radius) and MinPts (minimum points).
Step 4 — Standardisation Options
Select how to scale the variables before clustering:
- Standardise (z-scores): Recommended for most analyses with continuous variables of different scales. Transforms each variable to mean = 0, SD = 1.
- Range normalisation (0–1): Scales all variables to the range [0, 1]: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.
- No standardisation: Use when variables are already on comparable scales.
- Robust standardisation (median/IQR): Recommended when outliers are present: $x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}$.
Step 5 — Select Distance Measure
Choose the appropriate distance measure for your data type:
- Euclidean (default): Continuous, standardised variables.
- Manhattan (City Block): Continuous variables with potential outliers.
- Mahalanobis: Continuous, correlated variables.
- Gower: Mixed variable types (continuous, ordinal, binary, nominal).
- Cosine: High-dimensional data (text, genomics).
- Correlation (1-r): When pattern matters more than magnitude.
Step 6 — Specify Number of Clusters
- For Hierarchical: The dendrogram is produced for all $K$ from 1 to $n$ — cut it at any level to obtain the desired number of clusters. Use the automatic cutting tools in the application (based on height ratios or number of clusters).
- For K-Means / K-Medoids: Specify the number of clusters $K$, or use the automated Elbow Method, Silhouette Analysis, or Gap Statistic to recommend $K$.
- For GMM: Specify the range of $K$ to evaluate; the application selects the optimal $K$ using BIC or ICL.
- For DBSCAN: No $K$ specified; clusters emerge automatically from $\varepsilon$ and MinPts.
💡 Best practice: Always evaluate at least three values of $K$ (your hypothesised number, one fewer, and one more) and compare the solutions on both statistical criteria and theoretical interpretability.
Step 7 — Display Options
Select which outputs and visualisations to display:
- ✅ Dendrogram with highlighted clusters (Hierarchical).
- ✅ Scree/Elbow plot of WCSS vs. $K$.
- ✅ Silhouette plot and average silhouette width.
- ✅ Cluster profile table (means/medians per cluster per variable).
- ✅ PCA biplot coloured by cluster.
- ✅ Cluster size table (n per cluster, percentage).
- ✅ Within-cluster and between-cluster sum of squares.
- ✅ Gap statistic plot.
- ✅ Cluster stability indices (bootstrapped).
- ✅ Heatmap of variable means by cluster.
Step 8 — Run the Analysis
Click "Run Cluster Analysis". The application will:
- Standardise variables (if selected).
- Compute the distance matrix.
- Run the clustering algorithm.
- Compute cluster validity indices (silhouette, Calinski-Harabasz, Davies-Bouldin).
- Generate all selected visualisations.
- Produce a cluster membership variable that can be saved back to the dataset for further analysis.
7. Hierarchical Clustering
7.1 The Dendrogram
The dendrogram is the primary output of hierarchical clustering. It is a binary tree where:
- Leaves (bottom): Individual observations.
- Internal nodes: Cluster merges.
- Height of each node: The dissimilarity at which the merge occurred.
- Root (top): A single cluster containing all observations.
The dendrogram encodes a complete hierarchical clustering solution for all possible numbers of clusters from 1 to $n$. Cutting the dendrogram at a given height produces the clustering solution for that number of clusters.
Reading the dendrogram:
- Observations (or clusters) that are merged at a low height are very similar.
- A large gap between consecutive merging heights suggests a natural cut point.
- The number of vertical lines cut by a horizontal line at height $h$ equals the number of clusters $K$.
7.2 Choosing the Cut Point
Method 1 — Largest height gap:
Inspect the heights of consecutive merges and cut below the largest jump:

$$k^* = \arg\max_{k} \,(h_k - h_{k+1})$$

Where $h_k$ is the height of the $k$-th merge from the top (merging $k+1$ clusters into $k$). Cutting in the gap below $h_{k^*}$ yields $K = k^* + 1$ clusters.
Method 2 — Consistent percentage:
Cut at a height corresponding to a fixed percentage (e.g., 70%) of the total dendrogram height range.
Method 3 — External criteria:
Use cluster validity indices (silhouette width, Calinski-Harabasz) evaluated for several cut heights to determine the optimal $K$.
💡 There is no single "correct" cut point. Combine statistical guidance with theoretical reasoning about the expected number of subgroups in your population.
7.3 Linkage Method Comparison
| Linkage | Properties | Cluster Shape | Recommended When |
|---|---|---|---|
| Ward's | Minimises WCSS increase | Compact, spherical | Most general-purpose analyses |
| Complete | Maximum diameter | Compact, similar size | Well-separated clusters expected |
| Average | Average pairwise distances | Moderate compactness | Balanced trade-off |
| Single | Chain-like merging | Elongated, chained | Detecting filamentary structures |
| Centroid | Distance between centroids | Moderate | Rarely recommended (inversions possible) |
⚠️ Ward's linkage with Euclidean distance (on standardised data) is the most commonly recommended default. However, it assumes roughly spherical, equally-sized clusters. If clusters are expected to differ dramatically in size or shape, consider average linkage.
7.4 Cophenetic Correlation Coefficient
The cophenetic correlation coefficient (CCC) measures how faithfully the dendrogram preserves the pairwise distances in the original data:

$$\text{CCC} = \frac{\sum_{i<j} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \sum_{i<j} (c_{ij} - \bar{c})^2}}$$

Where:
- $d_{ij}$ = original pairwise distance between observations $i$ and $j$.
- $c_{ij}$ = cophenetic distance (the height at which $i$ and $j$ first merge in the dendrogram).
- $\bar{d}$, $\bar{c}$ = means of the $d_{ij}$ and $c_{ij}$ respectively.
| CCC | Interpretation |
|---|---|
| $\ge 0.80$ | Good representation |
| $0.70$–$0.80$ | Acceptable |
| $< 0.70$ | Poor — hierarchical structure may not be appropriate |

A high CCC ($\ge 0.80$) indicates that the dendrogram is a faithful representation of the original distances. Compare CCC across linkage methods and choose the one with the highest CCC.
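Comparing CCC across linkage methods takes only a few lines with SciPy (assuming SciPy is installed; the variable names are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (15, 2)),
               rng.normal(4, 0.5, (15, 2))])

d = pdist(X)                          # original pairwise distances d_ij
ccc_by_linkage = {}
for method in ['ward', 'complete', 'average', 'single']:
    Z = linkage(X, method=method)
    ccc, _ = cophenet(Z, d)           # correlation of d_ij with c_ij
    ccc_by_linkage[method] = ccc

# Pick the linkage whose dendrogram best preserves the original distances
best = max(ccc_by_linkage, key=ccc_by_linkage.get)
```

For this cleanly separated toy dataset all four linkages score well, but on messier real data the ranking often differs, which is exactly why the comparison is worth running.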
7.5 Hierarchical Clustering for Large Datasets
Standard agglomerative hierarchical clustering has $O(n^3)$ time complexity (in the naive implementation) and $O(n^2)$ space complexity — it requires storing the full distance matrix. For large $n$:
- Use minimax linkage or robust single linkage methods designed for scalability.
- Use DIANA (Divisive Analysis) which can be implemented more efficiently for some data.
- Cluster a random subsample first, then assign remaining observations.
- Use K-means or DBSCAN for large $n$.
7.6 Dissimilarity Matrix Heatmap
A dissimilarity heatmap displays the distance matrix as a colour-coded image, with observations reordered according to the dendrogram. In a well-structured dataset:
- Block-diagonal patterns emerge when clusters are distinct.
- Observations within the same cluster appear in low-dissimilarity (dark) blocks.
- Off-diagonal blocks show higher dissimilarity (lighter colour) between clusters.
This visual provides an immediate, intuitive check on the quality and separation of the cluster solution.
8. K-Means and K-Medoids Clustering
8.1 Choosing K — The Elbow Method
The elbow method plots WCSS (total within-cluster sum of squares) against $K$ and looks for an "elbow" — a point where the rate of decrease in WCSS slows dramatically:

$$\text{WCSS}(K) = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

As $K$ increases from 1 to $n$:
- WCSS decreases monotonically (more clusters always explain more variance).
- The marginal decrease diminishes after the "true" number of clusters.
- The elbow occurs at the $K$ where the curve transitions from steep to relatively flat.
Limitation: The elbow is often subjective — in real data, the curve is frequently smooth without a clear kink. Combine with silhouette analysis and the gap statistic.
8.2 The Silhouette Coefficient
For each observation $i$ assigned to cluster $C_I$, the silhouette coefficient measures how well the observation fits its assigned cluster compared to the nearest other cluster:
$a(i)$: Average distance from observation $i$ to all other observations in the same cluster (intra-cluster cohesion):
$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)$
$b(i)$: Minimum average distance from observation $i$ to all observations in any other cluster (inter-cluster separation):
$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$
Silhouette coefficient for observation $i$:
$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
Ranges from $-1$ to $+1$:
- $s(i) \approx +1$: Observation is well-matched to its cluster and poorly matched to neighbouring clusters.
- $s(i) \approx 0$: Observation is on or very close to the boundary between two clusters.
- $s(i) \approx -1$: Observation would be better placed in a neighbouring cluster (probable misclassification).
Average silhouette width (ASW): The mean $s(i)$ across all observations:
$\mathrm{ASW} = \frac{1}{n} \sum_{i=1}^{n} s(i)$
Choose the $K$ that maximises $\mathrm{ASW}$.
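The definitions above translate directly into code. A minimal pure-Python sketch on toy 1-D data (the `silhouette` helper and the distance function are illustrative assumptions, not DataStatPro code):

```python
def silhouette(i, points, labels, dist):
    """s(i) = (b - a) / max(a, b) for a single observation i."""
    own = labels[i]
    # a(i): mean distance to the other members of i's own cluster
    same = [dist(points[i], points[j]) for j in range(len(points))
            if j != i and labels[j] == own]
    a = sum(same) / len(same)
    # b(i): smallest mean distance to the members of any other cluster
    b = min(sum(dist(points[i], points[j])
                for j in range(len(points)) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != own)
    return (b - a) / max(a, b)

points = [0.0, 1.0, 9.0, 10.0]            # two tight, well-separated 1-D groups
labels = [0, 0, 1, 1]
d = lambda p, q: abs(p - q)
asw = sum(silhouette(i, points, labels, d) for i in range(4)) / 4
print(round(asw, 3))                       # close to +1: strong structure
```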
| ASW | Cluster Structure Interpretation |
|---|---|
| $0.71$–$1.00$ | Strong structure |
| $0.51$–$0.70$ | Reasonable structure |
| $0.26$–$0.50$ | Weak structure — may be artificial |
| $\leq 0.25$ | No substantial structure |
8.3 The Gap Statistic
The gap statistic (Tibshirani, Walther & Hastie, 2001) compares the observed WCSS for $K$ clusters to the expected WCSS under a null reference distribution (uniform data with no cluster structure):
$\mathrm{Gap}(K) = E^*[\log W_K] - \log W_K$
Where $E^*[\log W_K]$ is the expectation under the null reference distribution (estimated by Monte Carlo simulation of $B$ uniform datasets).
Choosing $K$: Select the smallest $K$ such that:
$\mathrm{Gap}(K) \geq \mathrm{Gap}(K+1) - s_{K+1}$
Where $s_{K+1}$ is the simulation error.
The gap statistic accounts for sampling variability and provides a more principled stopping rule than the elbow method.
8.4 Calinski-Harabasz Index (Variance Ratio Criterion)
The Calinski-Harabasz (CH) index (also called the Variance Ratio Criterion) measures the ratio of between-cluster variance to within-cluster variance:
$\mathrm{CH}(K) = \frac{\mathrm{BSS}/(K-1)}{\mathrm{WSS}/(n-K)}$
Where:
- $\mathrm{BSS}$ = between-cluster sum of squares.
- $\mathrm{WSS}$ = within-cluster sum of squares.
Higher CH index = better-defined clusters. Choose the $K$ that maximises $\mathrm{CH}$.
8.5 Davies-Bouldin Index
The Davies-Bouldin (DB) index measures the average similarity between each cluster and its most similar other cluster:
$\mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{S_i + S_j}{d(c_i, c_j)}$
Where:
- $S_i$ = average within-cluster distance from centroid $c_i$.
- $d(c_i, c_j)$ = distance between centroids of clusters $i$ and $j$.
Lower DB index = better separation. Choose the $K$ that minimises $\mathrm{DB}$.
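As a concreteness check, the DB index for toy 1-D data can be computed in a few lines; `davies_bouldin` is an illustrative helper, not a DataStatPro API:

```python
def davies_bouldin(points, labels, K):
    """DB index for 1-D data: mean over clusters of the worst
    (S_i + S_j) / d(c_i, c_j) ratio; S_k is mean distance to centroid c_k."""
    clusters = [[p for p, l in zip(points, labels) if l == k] for k in range(K)]
    cents = [sum(c) / len(c) for c in clusters]
    scatter = [sum(abs(p - cents[k]) for p in clusters[k]) / len(clusters[k])
               for k in range(K)]
    return sum(max((scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
                   for j in range(K) if j != i)
               for i in range(K)) / K

db = davies_bouldin([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1], 2)
print(db)  # small value: compact, well-separated clusters
```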
8.6 K-Medoids (PAM — Partitioning Around Medoids)
$K$-medoids is a robust alternative to $K$-means that represents each cluster by an actual medoid — the observation within the cluster that minimises the total dissimilarity to all other cluster members:
$m_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} d(x_i, x_j)$
PAM algorithm:
Step 1 (BUILD): Sequentially select $K$ initial medoids.
Step 2 (SWAP): For each medoid $m$ and each non-medoid observation $o$:
- Tentatively swap $m$ with $o$ and compute the total cost (sum of dissimilarities).
- If the total cost decreases, make the swap permanent.
Step 3: Repeat until no improvement is possible.
Objective function:
$\min \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, m_k)$
Advantages of K-Medoids over K-Means:
- Works with any distance matrix (not just Euclidean).
- Medoids are actual data points (interpretable).
- More robust to outliers (outliers are not selected as medoids under PAM).
- Applicable to mixed data with Gower's distance.
Disadvantage: Computationally more expensive than $K$-means ($O(K(n-K)^2)$ per iteration).
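The medoid definition is easy to demonstrate, and it makes the robustness claim tangible — an outlier drags the mean but cannot become the medoid. A toy sketch with an illustrative `medoid` helper:

```python
def medoid(cluster, dist):
    """PAM-style medoid: the member minimising total dissimilarity to the rest."""
    return min(cluster, key=lambda m: sum(dist(m, p) for p in cluster))

d = lambda p, q: abs(p - q)
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]     # contains an extreme outlier
print(medoid(cluster, d))                 # a real data point near the bulk, not 100.0
print(sum(cluster) / len(cluster))        # the mean (22.0) is dragged by the outlier
```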
8.7 K-Modes for Categorical Data
When all variables are categorical (nominal), $K$-means and $K$-medoids are not directly applicable. $K$-modes (Huang, 1998) extends $K$-means by:
- Replacing the mean with the mode as the cluster representative.
- Using the simple matching dissimilarity (proportion of mismatches):
$d(x, y) = \sum_{j=1}^{p} \mathbb{1}(x_j \neq y_j)$
For mixed data (continuous + categorical), $K$-prototypes combines:
$d(x, y) = \sum_{j=1}^{p_c} (x_j - y_j)^2 + \gamma \sum_{j=1}^{p_n} \mathbb{1}(x_j \neq y_j)$
Where $p_c$ = number of continuous variables, $p_n$ = number of categorical variables, and $\gamma$ is a weight balancing the two contributions.
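Both dissimilarities are one-liners. The sketch below uses hypothetical helper names and toy values (the choice $\gamma = 0.5$ is arbitrary and for illustration only):

```python
def simple_matching(x, y):
    """Simple matching dissimilarity: proportion of mismatching categories."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def k_prototypes_dist(x_num, y_num, x_cat, y_cat, gamma):
    """K-prototypes distance: squared Euclidean on the continuous part plus a
    gamma-weighted count of categorical mismatches."""
    num = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    cat = sum(a != b for a, b in zip(x_cat, y_cat))
    return num + gamma * cat

print(simple_matching(["red", "S", "yes"], ["red", "M", "no"]))   # 2 of 3 differ
print(k_prototypes_dist((1.0, 2.0), (2.0, 4.0), ("A",), ("B",), gamma=0.5))
```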
9. Model-Based and Density-Based Clustering
9.1 Gaussian Mixture Models (GMM) — Model-Based Clustering
Model-based clustering (Fraley & Raftery, 2002) treats clustering as a model selection problem. The data are assumed to arise from a GMM, and the optimal number of clusters and covariance structure are selected using the Bayesian Information Criterion (BIC).
Covariance matrix parameterisation:
The key advantage of GMM is the ability to model different cluster shapes, sizes, and orientations by parameterising the covariance matrices $\Sigma_k$.
Using eigenvalue decomposition of $\Sigma_k$:
$\Sigma_k = \lambda_k D_k A_k D_k^{\top}$
Where:
- $\lambda_k$ = volume (overall size of cluster $k$).
- $D_k$ = matrix of eigenvectors = orientation (direction of cluster axes).
- $A_k$ = diagonal matrix with normalised eigenvalues = shape (ratio of axes).
By constraining or freeing these three components across clusters, 14 different model types are defined in the mclust framework:
| Model | Volume | Shape | Orientation | Description |
|---|---|---|---|---|
| EII | Equal | Spherical | — | $K$-means-like spherical clusters |
| VII | Variable | Spherical | — | Spherical clusters of different sizes |
| EEI | Equal | Equal | Axis-aligned | Equal diagonal covariance |
| VEI | Variable | Equal | Axis-aligned | Variable-size axis-aligned |
| EEE | Equal | Equal | Equal | Equal ellipsoidal (all same shape) |
| VVV | Variable | Variable | Variable | Fully unconstrained (most flexible) |
BIC for model selection:
$\mathrm{BIC} = 2 \log L - m \log n$
Where $\log L$ is the maximised log-likelihood and $m$ is the number of free parameters.
Higher BIC = better model (note: some software uses $\mathrm{BIC} = -2 \log L + m \log n$; here higher means better for the positive convention used in mclust).
ICL (Integrated Complete-data Likelihood):
ICL adds an entropy penalty for fuzzy assignments. Models where observations are clearly assigned (low entropy) are favoured by ICL over BIC.
9.2 Soft Cluster Membership in GMM
A major advantage of GMM over $K$-means is the production of posterior probabilities of cluster membership for each observation. After EM convergence:
$\hat{z}_{ik} = P(\text{cluster } k \mid x_i) = \frac{\hat{\pi}_k \, \phi(x_i \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{\pi}_j \, \phi(x_i \mid \hat{\mu}_j, \hat{\Sigma}_j)}$
These probabilities allow identification of:
- Borderline cases ($\hat{z}_{ik} \approx 0.5$ for two clusters) — observations near cluster boundaries.
- Atypical members (low probability of belonging to any cluster).
- Uncertainty maps — spatial or other visualisations of cluster membership uncertainty.
9.3 DBSCAN — Density-Based Spatial Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise; Ester et al., 1996) identifies clusters as dense regions of the data space, separated by low-density regions. It does not require specifying $K$ in advance and can identify outliers (noise points).
Key parameters:
- $\varepsilon$ (epsilon): The neighbourhood radius — how far from a point to look for neighbours.
- MinPts: The minimum number of points within distance $\varepsilon$ for a point to be considered a core point.
Point classification:
A point $p$ is classified as:
Core point: $|N_\varepsilon(p)| \geq \text{MinPts}$
Where $N_\varepsilon(p) = \{q : d(p, q) \leq \varepsilon\}$ is the $\varepsilon$-neighbourhood of $p$.
Border point: Not a core point, but within distance $\varepsilon$ of a core point.
Noise point (outlier): Neither a core point nor a border point.
DBSCAN algorithm:
- Label all points as unvisited.
- For each unvisited core point $p$: mark $p$ as visited; create a new cluster $C$; add all points in $N_\varepsilon(p)$ to $C$.
- For each newly added point $q$ in $C$: if $q$ is a core point, add all points in $N_\varepsilon(q)$ to $C$ (density reachability).
- Continue until no more points can be added to $C$; label all non-core border points.
- Mark remaining unvisited non-core points as noise.
Key properties of DBSCAN:
- Does not require specifying $K$.
- Can discover clusters of arbitrary shape.
- Automatically identifies and labels outliers as noise.
- Produces the same results for the same $\varepsilon$ and MinPts (deterministic).
- Struggles with clusters of varying density.
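The algorithm above fits in about twenty lines of pure Python. This is an illustrative sketch on toy 1-D data, not production DBSCAN (no spatial index, so the neighbourhood search is O(n²)):

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN sketch: returns labels[i] = cluster id, or -1 for noise."""
    n = len(points)
    # eps-neighbourhoods, including the point itself
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    labels = [None] * n
    cid = -1
    for i in range(n):
        if labels[i] is not None or len(neigh[i]) < min_pts:
            continue                      # already assigned, or not a core point
        cid += 1
        labels[i] = cid
        queue = list(neigh[i])
        while queue:                      # expand cluster by density reachability
            j = queue.pop()
            if labels[j] is not None:
                continue
            labels[j] = cid               # border or core member joins the cluster
            if len(neigh[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neigh[j])
    return [l if l is not None else -1 for l in labels]

pts = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 50.0]   # two dense 1-D groups + an outlier
d = lambda p, q: abs(p - q)
print(dbscan(pts, eps=1.0, min_pts=3, dist=d))
```

With these settings the two dense groups become clusters 0 and 1, while the isolated point at 50.0 is labelled −1 (noise) — no outlier removal step was needed.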
9.4 Choosing DBSCAN Parameters
Setting MinPts:
- Rule of thumb: MinPts $\geq 2p$ (twice the number of dimensions $p$).
- Larger MinPts → more stringent core point requirement → fewer, larger clusters.
- Minimum recommended: MinPts $= 4$ for 2D data; MinPts $= 2p$ for higher dimensions.
Setting $\varepsilon$:
- Plot the $k$-nearest neighbour distance plot (k-NN distance plot) with $k = \text{MinPts}$.
- Sort observations by their MinPts-nearest neighbour distance in decreasing order.
- The knee/elbow of this plot corresponds to the appropriate $\varepsilon$.
9.5 HDBSCAN — Hierarchical DBSCAN
HDBSCAN extends DBSCAN to handle varying density clusters by transforming the space using the concept of mutual reachability distance:
$d_{\text{mreach}}(p, q) = \max\{\text{core}(p),\ \text{core}(q),\ d(p, q)\}$
Where $\text{core}(p)$ is the MinPts-nearest neighbour distance (the core distance) of point $p$.
HDBSCAN builds a hierarchical clustering tree on this transformed space and extracts a flat clustering by selecting the most stable clusters across all density levels. Only MinPts needs to be specified (not $\varepsilon$).
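The mutual reachability transform is simple to sketch. In this illustrative helper the core distance excludes the point itself — conventions differ between implementations:

```python
def mutual_reachability(i, j, points, dist, min_pts):
    """d_mreach(i, j) = max(core(i), core(j), d(i, j)), where core(p) is the
    distance from p to its min_pts-th nearest neighbour (self excluded here)."""
    def core(p):
        ds = sorted(dist(points[p], points[q])
                    for q in range(len(points)) if q != p)
        return ds[min_pts - 1]
    return max(core(i), core(j), dist(points[i], points[j]))

pts = [0.0, 1.0, 2.0, 10.0]
d = lambda p, q: abs(p - q)
# points 0 and 1 are only 1 apart, but point 0's core distance (2.0) dominates,
# pushing sparse points apart while leaving dense regions unchanged
print(mutual_reachability(0, 1, pts, d, min_pts=2))
```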
9.6 Fuzzy C-Means
Fuzzy C-Means (FCM) is a soft-assignment generalisation of $K$-means. Each observation $i$ has a degree of membership $u_{ik}$ in each cluster $k$, with $\sum_{k=1}^{K} u_{ik} = 1$.
FCM objective function:
$J_m = \sum_{i=1}^{n} \sum_{k=1}^{K} u_{ik}^{m} \lVert x_i - c_k \rVert^2$
Where $m > 1$ is the fuzziness parameter (typically $m = 2$):
- $m \to 1$: Hard assignments (approaches $K$-means).
- $m \to \infty$: All memberships equal (completely fuzzy, no structure).
FCM update equations:
Centroids:
$c_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_i}{\sum_{i=1}^{n} u_{ik}^{m}}$
Memberships:
$u_{ik} = \left[ \sum_{j=1}^{K} \left( \frac{\lVert x_i - c_k \rVert}{\lVert x_i - c_j \rVert} \right)^{2/(m-1)} \right]^{-1}$
Iterate until convergence (the maximum change in any $u_{ik}$ falls below a small tolerance).
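One iteration of these two updates can be sketched in pure Python for 1-D data. The helpers are illustrative, and the sketch assumes no point coincides exactly with a centroid (which would make a distance zero):

```python
def fcm_memberships(points, centroids, m, dist):
    """Membership update: u_ik proportional to (1/d_ik)^(2/(m-1)), normalised
    over clusters. Assumes all distances are strictly positive."""
    U = []
    for x in points:
        w = [(1.0 / dist(x, c)) ** (2.0 / (m - 1)) for c in centroids]
        s = sum(w)
        U.append([v / s for v in w])
    return U

def fcm_centroids(points, U, m):
    """Centroid update: mean weighted by u_ik^m (1-D data for simplicity)."""
    K = len(U[0])
    return [sum(U[i][k] ** m * points[i] for i in range(len(points)))
            / sum(U[i][k] ** m for i in range(len(points)))
            for k in range(K)]

pts = [0.0, 1.0, 9.0, 10.0]
d = lambda p, q: abs(p - q)
U = fcm_memberships(pts, [0.5, 9.5], m=2.0, dist=d)
cents = fcm_centroids(pts, U, m=2.0)
print([round(u, 3) for u in U[0]])   # the point at 0.0 belongs mostly to cluster 0
print([round(c, 2) for c in cents])
```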
10. Model Fit and Evaluation
Evaluating cluster quality is fundamentally more challenging than evaluating supervised model fit because there is no ground truth to compare against. Multiple complementary criteria must be consulted.
10.1 Internal Validity Indices
Internal validity indices assess cluster quality using only the clustering data and the resulting cluster assignments.
Average Silhouette Width (ASW):
$\mathrm{ASW} = \frac{1}{n} \sum_{i=1}^{n} s(i)$
Higher is better. Optimal $K$ = the value that maximises $\mathrm{ASW}$.
Calinski-Harabasz (CH) Index:
$\mathrm{CH} = \frac{\mathrm{BSS}/(K-1)}{\mathrm{WSS}/(n-K)}$
Higher is better.
Davies-Bouldin (DB) Index:
$\mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{S_i + S_j}{d(c_i, c_j)}$
Lower is better.
Dunn Index:
$\mathrm{Dunn} = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \Delta(C_k)}$
Where $\delta(C_i, C_j)$ is the minimum inter-cluster distance and $\Delta(C_k)$ is the diameter (maximum intra-cluster distance) of cluster $C_k$.
Higher Dunn index = better separation and compactness. Sensitive to outliers.
Summary of Internal Validity Indices:
| Index | Optimal | Better Solution | Sensitivity |
|---|---|---|---|
| ASW | Maximise | Higher | Moderate |
| CH | Maximise | Higher | Low |
| DB | Minimise | Lower | Moderate |
| Dunn | Maximise | Higher | High (outliers) |
| WCSS (Elbow) | Elbow point | Lower | Low |
| Gap statistic | First local max | Higher | Moderate |
10.2 External Validity Indices
When true cluster labels are available (e.g., in a validation or simulation study), external validity indices compare the clustering result to the known ground truth.
Adjusted Rand Index (ARI):
The Rand Index counts agreements between pairs of observations in the clustering and in the true partition. The Adjusted Rand Index corrects for chance agreement:
$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$
More explicitly, for a contingency table with elements $n_{ij}$ (number of objects in cluster $i$ of solution 1 AND cluster $j$ of solution 2):
$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}$
Where $a_i$ and $b_j$ are the row and column marginals.
| ARI | Interpretation |
|---|---|
| $1.00$ | Perfect agreement |
| $0.80$–$1.00$ | Excellent |
| $0.60$–$0.80$ | Good |
| $0.40$–$0.60$ | Fair |
| $\approx 0$ | Slight / random agreement |
| $< 0$ | Worse than random |
Normalised Mutual Information (NMI):
$\mathrm{NMI}(U, V) = \frac{I(U, V)}{\sqrt{H(U)\,H(V)}}$
Where $I(U, V)$ is the mutual information and $H(U)$, $H(V)$ are the entropies of the two clusterings.
NMI ranges from 0 (no agreement) to 1 (perfect agreement). Does not correct for chance (use AMI — Adjusted Mutual Information — for a chance-corrected version).
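The ARI can be computed directly from the contingency table. A pure-Python sketch (`adjusted_rand_index` is an illustrative helper, not a DataStatPro function):

```python
from math import comb

def adjusted_rand_index(a, b):
    """ARI computed from the contingency table of two label vectors a and b."""
    table, rows, cols = {}, {}, {}
    for x, y in zip(a, b):
        table[(x, y)] = table.get((x, y), 0) + 1
        rows[x] = rows.get(x, 0) + 1
        cols[y] = cols.get(y, 0) + 1
    index = sum(comb(n, 2) for n in table.values())   # co-clustered pair agreements
    ra = sum(comb(n, 2) for n in rows.values())       # row-marginal pair counts
    cb = sum(comb(n, 2) for n in cols.values())       # column-marginal pair counts
    expected = ra * cb / comb(len(a), 2)              # chance-level agreement
    max_index = (ra + cb) / 2
    return (index - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # relabelled but identical
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 1, 1]))  # partial agreement
```

Note that the first pair of labelings scores perfectly even though the label names are swapped — ARI compares partitions, not label values.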
10.3 Relative Validity — Stability Indices
Cluster stability assesses whether the cluster solution is reproducible across subsamples of the data. A stable solution will produce similar cluster assignments when the analysis is repeated on bootstrap samples.
Bootstrap stability algorithm:
- Draw $B$ bootstrap samples of size $n$ (with replacement).
- Apply the clustering algorithm to each bootstrap sample.
- For each bootstrap solution, match clusters to the original solution (using the Hungarian algorithm or maximum overlap matching).
- Compute the mean ARI between the original and bootstrap solutions:
$\overline{\mathrm{ARI}} = \frac{1}{B} \sum_{b=1}^{B} \mathrm{ARI}(\mathcal{C}_0, \mathcal{C}_b)$
| Stability | Interpretation |
|---|---|
| $\geq 0.90$ | Very stable — reproducible solution |
| $0.80$–$0.90$ | Stable |
| $0.60$–$0.80$ | Moderately stable |
| $< 0.60$ | Unstable — solution may not replicate |
10.4 BIC and AIC for Model-Based Clustering
For GMM, the optimal number of components is selected using BIC:
$\mathrm{BIC} = 2 \log L - m \log n$
Select the $K$ that maximises BIC (using the mclust convention where higher = better).
AIC:
$\mathrm{AIC} = 2 \log L - 2m$
AIC tends to select more components than BIC (it penalises complexity less strictly than BIC).
10.5 Comparing $K$ Values — A Comprehensive Decision Framework
| Criterion | Supports Low $K$ | Supports High $K$ | Best Used With |
|---|---|---|---|
| Elbow (WCSS) | Elbow is early | Elbow is late | K-Means |
| Silhouette | High ASW at low $K$ | High ASW at high $K$ | Any method |
| Gap Statistic | Early plateau | Late plateau | Any method |
| CH Index | Peak at low $K$ | Peak at high $K$ | K-Means, Hierarchical |
| DB Index | Minimum at low $K$ | Minimum at high $K$ | K-Means, Hierarchical |
| BIC (GMM) | Peak at low $K$ | Peak at high $K$ | GMM only |
| Bootstrap stability | High stability at low $K$ | High stability at high $K$ | Any method |
| Theoretical expectations | Strong prior for few types | Strong prior for many types | All analyses |
💡 Best practice: Evaluate the clustering solution at $K-1$, $K$, and $K+1$ for the statistically suggested $K$. Choose the solution that is most theoretically interpretable AND statistically defensible. Document the rationale for the final selection.
10.6 Cluster Profiling and Interpretation
After selecting the final clustering solution, profile each cluster by computing:
- Means and standard deviations of each continuous clustering variable by cluster.
- Frequencies and proportions of categorical variables by cluster.
- ANOVA or Kruskal-Wallis tests for each variable to confirm clusters differ.
- Effect sizes ($\eta^2$ or rank-biserial $r$) for each variable.
- Discriminant function analysis to identify which variables best discriminate clusters.
A cluster profile heatmap (rows = clusters, columns = variables, cells = standardised mean) provides an intuitive summary of how clusters differ across all variables simultaneously.
11. Advanced Topics
11.1 Dimensionality Reduction Before Clustering
High-dimensional data (large $p$) pose several challenges for clustering:
- Curse of dimensionality: In high dimensions, all pairwise distances converge to the same value, making distance-based clustering meaningless.
- Noise variables: Many variables may be unrelated to the cluster structure, adding noise to the distance calculation.
- Visualisation: Impossible to visualise clusters beyond 3 dimensions.
Solution 1 — PCA before clustering:
Apply Principal Component Analysis to reduce to $q \ll p$ dimensions that capture most of the variance:
$Z = X W_q$
Where $W_q$ contains the first $q$ principal components. Cluster on $Z$ instead of $X$. Retain enough components to explain a large share of the total variance.
Solution 2 — UMAP/t-SNE for visualisation:
For visualising high-dimensional clusters in 2D:
- t-SNE (t-Distributed Stochastic Neighbour Embedding): Preserves local structure. Computationally expensive; not suitable for large $n$.
- UMAP (Uniform Manifold Approximation and Projection): Faster than t-SNE; preserves both local and global structure better.
⚠️ Never cluster on t-SNE or UMAP embeddings — their stochastic, non-linear nature distorts distances in ways that invalidate distance-based clustering. Use PCA for pre-clustering dimensionality reduction; use t-SNE/UMAP only for visualisation.
11.2 Cluster Validation with External Variables
After obtaining a cluster solution, validate it by examining whether clusters differ meaningfully on variables that were NOT used in forming the clusters (held-out variables):
Criterion validity: Do clusters differ significantly on a relevant outcome variable not used in clustering?
Predictive validity: Can cluster membership predict future outcomes? Run a logistic regression (or multinomial logistic regression for $K > 2$) predicting cluster membership from domain-relevant external predictors. A model that classifies with high accuracy validates the cluster solution.
Cross-tabulation with known groups: If true group membership is partially known (e.g., diagnosed patients vs. healthy controls), compare cluster memberships against true groups using chi-squared tests and ARI.
11.3 Mixed-Type Data Clustering
Real-world datasets often contain a mixture of continuous, ordinal, and nominal variables. Three main strategies exist:
Strategy 1 — Gower's distance + hierarchical/K-Medoids:
Compute Gower's distance matrix and feed it into hierarchical clustering or PAM. Most widely recommended approach for mixed data.
Strategy 2 — Factor Analysis of Mixed Data (FAMD):
Apply FAMD to extract factor scores that accommodate both continuous (PCA-style) and categorical (MCA-style) variables, then cluster on the factor scores.
Strategy 3 — Latent Class Analysis (LCA):
Model-based approach where the cluster structure is defined by a joint probability model over all variable types. Each "class" is characterised by a probability distribution over all variables. Suitable for primarily categorical data.
11.4 Longitudinal Clustering
When the same subjects are measured repeatedly over time, trajectory clustering (also called group-based trajectory modelling or latent growth curve clustering) identifies subgroups with similar temporal patterns.
Group-Based Trajectory Model (Nagin, 2005):
For each class $k$, the trajectory of outcome $Y$ over time $t$ is modelled as a class-specific polynomial:
$E[Y_{it} \mid \text{class } k] = \beta_0^{(k)} + \beta_1^{(k)} t + \beta_2^{(k)} t^2 + \dots$
Model selection is based on BIC across models with different numbers of trajectory groups and different polynomial orders for each group.
11.5 Ensemble Clustering (Consensus Clustering)
Consensus clustering combines multiple clustering solutions (from different algorithms, different values, or different random starts) to produce a more stable and robust final partition.
Algorithm:
- Generate $B$ clustering solutions $\mathcal{C}_1, \dots, \mathcal{C}_B$.
- Build the co-association matrix $M$, where $M_{ij}$ = proportion of solutions in which observations $i$ and $j$ are assigned to the same cluster:
$M_{ij} = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}[\mathcal{C}_b(i) = \mathcal{C}_b(j)]$
- Apply hierarchical clustering to $1 - M$ (treating $M$ as a similarity matrix).
- The consensus clustering is the result of cutting this dendrogram at the desired $K$.
A cluster solution with high consensus (many entries in $M$ near 0 or 1) is more stable than one with many entries near 0.5 (indicating ambiguous assignments).
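The co-association matrix at the heart of this procedure is a short computation. An illustrative sketch with three toy "runs" (hypothetical labelings, not real output):

```python
def co_association(solutions):
    """Co-association matrix: M[i][j] = proportion of clustering solutions in
    which observations i and j share a cluster."""
    n = len(solutions[0])
    B = len(solutions)
    return [[sum(s[i] == s[j] for s in solutions) / B for j in range(n)]
            for i in range(n)]

runs = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]   # three clustering runs
M = co_association(runs)
print(M[0][1], M[0][2])   # pair (0,1) always together; pair (0,2) never together
```

Entries of exactly 0 or 1, as here, indicate perfectly consistent co-assignments across runs; values near 0.5 would flag ambiguous pairs.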
11.6 Reporting Cluster Analysis Results
Best-practice reporting for cluster analysis includes:
Pre-analysis decisions:
- Variables selected for clustering and rationale.
- Standardisation approach.
- Distance measure chosen and justification.
- Algorithm(s) and settings (linkage for hierarchical; random starts for K-Means).
- Method for determining $K$.
Main results:
- Final $K$ chosen and rationale (statistical indices + theory).
- Cluster sizes ($n$ and percentage).
- Cluster validity indices (at minimum ASW and one other).
- Stability assessment (bootstrap ARI).
Cluster characterisation:
- Profile table: means/proportions for all clustering variables by cluster (with SDs).
- Statistical tests comparing clusters on each variable.
- Effect sizes for each variable.
- A descriptive label for each cluster.
Visualisations:
- Dendrogram (for hierarchical).
- PCA biplot coloured by cluster.
- Cluster profile heatmap.
- Interaction plots for key variables.
12. Worked Examples
Example 1: Hierarchical Clustering — Patient Subtype Identification
A clinical researcher collects data on $n = 80$ patients with chronic fatigue syndrome on five symptom severity scores (each 0–10): Pain, Fatigue, Cognitive Impairment, Sleep Disturbance, and Mood.
Step 1 — Standardise variables:
All five variables are on the same 0–10 scale with similar SDs — standardisation applied for consistency.
Step 2 — Compute distance matrix:
Euclidean distances on standardised scores.
Step 3 — Run Ward's hierarchical clustering.
Step 4 — Evaluate the dendrogram:
Dendrogram heights of the final merges are inspected from the bottom up. The largest height gap occurs between the 3-cluster and 2-cluster merges, suggesting cutting at 3 clusters (before the large jump to 2 clusters).
Cophenetic correlation is high → the dendrogram represents the original pairwise distances well.
Step 5 — Evaluate 3-cluster solution:
| Index | $K=2$ | $K=3$ | $K=4$ |
|---|---|---|---|
| ASW | 0.58 | 0.64 | 0.51 |
| CH | 41.2 | 52.8 | 47.1 |
| DB | 0.82 | 0.61 | 0.78 |
All three indices favour $K = 3$. Decision: 3 clusters.
Step 6 — Cluster Profiles:
Mean symptom scores were computed for each cluster. In brief: Cluster 1 ($n = 28$) scores high on all five dimensions; Cluster 2 ($n = 31$) shows moderate physical symptoms with milder cognitive and mood symptoms; Cluster 3 ($n = 21$) shows lower pain and fatigue but elevated cognitive impairment and mood disturbance.
Bootstrap ARI: 0.88.
Cluster Labels:
- Cluster 1 (n=28, 35%): "Severe" — high scores on all dimensions (mean profile 7.8–8.5).
- Cluster 2 (n=31, 39%): "Moderate/Somatic" — moderate physical symptoms, milder cognitive and mood symptoms.
- Cluster 3 (n=21, 26%): "Cognitive-Affective" — lower pain and fatigue but elevated cognitive impairment and mood disturbance.
Conclusion: Ward's hierarchical clustering identified three clinically meaningful patient subtypes. The 3-cluster solution was supported by all validity indices (ASW = 0.64, CH = 52.8, DB = 0.61) and demonstrated good bootstrap stability (ARI = 0.88). The three profiles suggest distinct symptom phenotypes that may benefit from different treatment approaches: Cluster 1 (Severe) may require multimodal comprehensive intervention; Cluster 2 (Moderate/Somatic) may respond to physical symptom management; Cluster 3 (Cognitive-Affective) may benefit from cognitive-behavioural and mood-focused therapies.
Example 2: K-Means Clustering — Customer Segmentation
A retail company collects data on a sample of customers across five variables: Annual spend (£), Purchase frequency (orders/year), Average order value (£), Days since last purchase (recency), and Customer age (years).
Step 1 — Standardise all variables.
Step 2 — Determine optimal K:
WCSS, silhouette, and gap statistic evaluated for $K = 2$ to $K = 8$:
| $K$ | WCSS | $\Delta$WCSS | ASW | CH | Gap |
|---|---|---|---|---|---|
| 2 | 1842 | — | 0.52 | 285 | 0.41 |
| 3 | 1341 | 501 | 0.61 | 341 | 0.58 |
| 4 | 1052 | 289 | 0.58 | 312 | 0.62 |
| 5 | 892 | 160 | 0.51 | 287 | 0.60 |
| 6 | 798 | 94 | 0.44 | 251 | 0.55 |
| 7 | 741 | 57 | 0.38 | 218 | 0.51 |
| 8 | 712 | 29 | 0.32 | 190 | 0.48 |
Elbow: Largest $\Delta$WCSS at $K = 3$ (501) vs. $K = 4$ (289) → elbow at $K = 3$. Silhouette: Maximised at $K = 3$ (ASW = 0.61). Gap statistic: Gap(3) = 0.58 $\geq$ Gap(4) $-$ $s_4$ = 0.62 $-$ 0.05 = 0.57, so $K = 3$ is the smallest qualifying value. ✅ Decision: $K = 3$ clusters.
Step 3 — Run K-Means (50 random starts, K-Means++ init).
Step 4 — Cluster profiles (unstandardised means):
| Variable | Cluster 1 "Premium" | Cluster 2 "Standard" | Cluster 3 "Dormant" |
|---|---|---|---|
| Annual Spend | £4,820 | £1,240 | £380 |
| Frequency | 18.2 orders | 8.4 orders | 2.1 orders |
| Avg Order Value | £265 | £148 | £181 |
| Days Since Purchase | 12 days | 41 days | 287 days |
| Age | 42 years | 38 years | 55 years |
| Size | 28.4% | 43.6% | 28.0% |
Bootstrap stability: Mean ARI = 0.91 (Very stable).
ANOVA results: All five variables differ significantly across clusters.
Business Interpretation:
- Cluster 1 "Premium Actives" (28%): High-value, high-frequency recent buyers. Target with loyalty rewards, early access to new products, and premium services.
- Cluster 2 "Standard Actives" (44%): Mid-range regular customers, recently active. Target with upselling campaigns and frequency incentives.
- Cluster 3 "Dormant" (28%): Low spend, infrequent, long since last purchase. Target with win-back campaigns offering discounts and re-engagement communications.
Example 3: Gaussian Mixture Model — Gene Expression Subtype Discovery
A genomics researcher analyses RNA-seq expression data for $n = 200$ tumour samples on a set of cancer-relevant gene expression scores (after PCA dimensionality reduction).
GMM with multiple covariance structures evaluated:
| Model | $K$ | BIC | ICL | Best Fit? |
|---|---|---|---|---|
| EII (spherical, equal volume) | 3 | 4821 | 4712 | No |
| EEE (ellipsoidal, equal) | 3 | 5108 | 5041 | No |
| VVV (fully unconstrained) | 2 | 4950 | 4882 | No |
| VEE (variable volume, equal shape) | 3 | 5342 | 5198 | Yes |
| VVE (variable volume and shape) | 4 | 5188 | 5014 | No |
Best model: VEE with $K = 3$ (highest BIC = 5342).
Cluster assignments:
| Property | Cluster 1 ($n=78$) | Cluster 2 ($n=71$) | Cluster 3 ($n=51$) |
|---|---|---|---|
| Size (%) | 39.0% | 35.5% | 25.5% |
| Mean entropy | 0.12 | 0.18 | 0.31 |
| Low-uncertainty assignments | 92% | 88% | 71% |
Cluster profiles (top discriminating genes):
| Gene | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| TP53 | Low | High | Moderate |
| BRCA1 | Moderate | Low | High |
| EGFR | High | Low | Low |
| MYC | Moderate | High | Low |
Clinical validation (held-out variables not used in clustering):
| Outcome | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| 5-year survival | 72% | 41% | 58% |
| Stage III/IV (%) | 28% | 68% | 45% |
| Hormone receptor+ (%) | 82% | 31% | 61% |
Molecular Subtypes Identified:
- Cluster 1 (n=78, 39%): "EGFR-high / Luminal-like" — EGFR overexpression, hormone receptor positive, best prognosis (72% 5-year survival).
- Cluster 2 (n=71, 36%): "TP53-mutant / Aggressive" — TP53 high, MYC amplification, mostly advanced stage, worst prognosis (41% 5-year survival).
- Cluster 3 (n=51, 26%): "BRCA1-high / Basal-like" — BRCA1 elevated, intermediate prognosis, high hormone receptor positivity (58% 5-year survival).
Conclusion: GMM with VEE covariance structure identified three biologically meaningful and clinically validated tumour subtypes. The soft assignment probabilities revealed that Cluster 3 members were more ambiguous (lower certainty), suggesting this subtype is transitional. The strong associations with 5-year survival and stage distribution (all statistically significant) confirm the clinical validity of these molecular subtypes.
13. Common Mistakes and How to Avoid Them
Mistake 1: Not Standardising Variables Before Distance-Based Clustering
Problem: Variables measured on different scales (e.g., income in £10,000s and age in
years) will result in the large-scale variable dominating the Euclidean distance calculation.
Clusters will reflect variation in that variable rather than the true multivariate structure.
Solution: Always standardise continuous variables to $z$-scores before computing Euclidean distances, unless all variables are on the same scale or Mahalanobis distance is used. Report which standardisation method was applied.
Mistake 2: Choosing $K$ Based Solely on the Elbow Plot
Problem: The elbow in the WCSS curve is frequently ambiguous — the "elbow" can be in
different positions depending on the scale and the eye of the beholder. Over-reliance on
a single criterion leads to arbitrary selection.
Solution: Use at least three complementary criteria: the elbow method, silhouette analysis, and the gap statistic (or BIC for GMM). Combine statistical guidance with theoretical expectations about the expected number of subgroups. Evaluate the interpretability and stability of solutions at $K-1$, $K$, and $K+1$.
Mistake 3: Running K-Means With a Single Random Start
Problem: $K$-means converges to a local optimum that depends on the random initialisation. A single random start frequently produces a poor solution with unnecessarily high WCSS.
Solution: Always use multiple random starts (at minimum 20, preferably 50–100) and keep
the solution with the lowest WCSS. Use K-Means++ initialisation as the default — it
dramatically reduces the probability of poor local optima.
Mistake 4: Treating Cluster Labels as Stable Across Analyses
Problem: Cluster labels (Cluster 1, 2, 3, etc.) are arbitrarily assigned and can change between runs of $K$-means. Researchers sometimes reference "Cluster 1" without realising it may correspond to a completely different group in a different run.
Solution: Always refer to clusters by their content labels (e.g., "High-Risk Group,"
"Moderate Group") rather than numerical labels. Match clusters across runs using the
Hungarian algorithm or maximum overlap matching. Set a random seed for reproducibility.
Mistake 5: Ignoring Cluster Validity and Stability
Problem: Reporting a cluster solution without any assessment of its validity or
reproducibility. A clustering that "looks interesting" may simply reflect the noise structure
of a small or particular sample and fail to replicate in new data.
Solution: Always report at least one internal validity index (ASW recommended) and a stability assessment (bootstrap ARI). Solutions with low ASW or low bootstrap ARI should be interpreted with extreme caution. Validate the solution on an independent holdout sample or using cross-validation.
Mistake 6: Including Outcome Variables in the Clustering
Problem: Including the outcome variable (or variables highly correlated with it) in the
feature set used to define clusters creates a circular validation problem — the clusters
will naturally differ on the outcome because it was used to define them.
Solution: Cluster using only the predictor variables or features that define the
hypothesised subgroups. Hold out the outcome variable and use it only for post-hoc
validation (i.e., testing whether the clusters actually differ in a meaningful way on
the outcome).
Mistake 7: Using Hierarchical Clustering With Single Linkage for General Purposes
Problem: Single linkage is prone to chaining — it forms long, elongated clusters
by progressively adding observations to the nearest existing cluster, even if they are
far from the cluster's core members. This produces biologically/psychologically
uninterpretable clusters in most applications.
Solution: Use Ward's linkage as the default for general-purpose hierarchical
clustering with continuous variables. Use single linkage only when elongated, chain-like
clusters are theoretically expected (e.g., trajectory data or sequentially ordered data).
Mistake 8: Claiming Clusters Are "Real" Without Validation
Problem: Clustering always produces groups — even completely random data will be partitioned into clusters by $K$-means. Presenting the resulting solution as a discovery of genuine population subgroups without testing whether the clusters reflect real structure is scientifically inappropriate.
Solution: Test the null hypothesis of no cluster structure using the Hopkins statistic
before clustering. Validate the solution using held-out variables, external criteria, or
an independent replication sample. Use language that reflects the exploratory and hypothesis-
generating nature of cluster analysis (e.g., "consistent with three subtypes" rather than
"three subtypes were identified").
Mistake 9: Including Highly Correlated Variables Without Addressing Redundancy
Problem: When two highly correlated variables are both included in the clustering, that dimension is effectively double-weighted in the distance calculation. The resulting clusters reflect that dimension disproportionately compared to others.
Solution: Inspect the correlation matrix of clustering variables. Remove one variable
from each highly correlated pair, or combine them into a composite. Alternatively, apply
PCA first and cluster on principal component scores, which are uncorrelated by construction.
Mistake 10: Using Cluster Means to Validate Clusters Using the Same Data
Problem: Computing ANOVA or $t$-tests to compare clusters on the variables used to form them, and then reporting these differences as evidence that the clusters are "real" or "distinct," is tautological. Of course the clusters differ on the variables used to create them — that is how clustering works.
Solution: Validate clusters using held-out variables (variables not used in
clustering) or external outcomes (e.g., diagnosis, treatment response, survival).
ANOVA on clustering variables is useful only for describing the clusters (profiling), not
for validating their existence or clinical/practical significance.
14. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| K-Means produces a singleton cluster (cluster of 1) | An extreme outlier has been assigned its own cluster | Remove outliers before clustering; use K-Medoids (PAM); check for data entry errors |
| WCSS elbow plot shows no clear elbow | Data may have no genuine cluster structure; too many/few $K$ values tested | Test with Hopkins statistic; extend the range of $K$; try hierarchical clustering to see the dendrogram |
| Silhouette width is low ($< 0.25$) across all $K$ | No cluster structure in data; wrong distance measure; wrong variables selected | Reassess variable selection; try Gower's distance for mixed data; consider that the data may not have clusters |
| K-Means result changes dramatically between runs | Too few random starts; very flat WCSS landscape; data near a saddle point | Increase to 100+ random starts; use K-Means++ initialisation; try a different $K$ |
| Hierarchical dendrogram shows inversions (lower heights after higher) | Centroid linkage used with non-Euclidean distances | Switch to Ward, Average, or Complete linkage; avoid centroid linkage |
| DBSCAN classifies almost all points as noise | $\varepsilon$ is too small | Increase $\varepsilon$; replot the $k$-NN distance graph and identify the correct elbow |
| DBSCAN produces one giant cluster | $\varepsilon$ is too large | Decrease $\varepsilon$; re-examine the $k$-NN distance plot |
| GMM EM algorithm does not converge | Too many components; degenerate covariance (near-singular); small $n$ | Reduce $G$; use constrained covariance (e.g., EEE); increase sample size; add regularisation |
| GMM produces a cluster with near-zero mixing proportion ($\pi_g \approx 0$) | Over-specification of $G$; one cluster is absorbing outliers | Reduce $G$; check for outliers in the data |
| Bootstrap stability is very low | Sample size too small; clusters not well-separated; wrong $K$ | Increase $n$; try a different $K$; increase the number of bootstrap replicates to 500 |
| All validity indices disagree on the optimal $K$ | Genuine ambiguity in the data structure; clusters are overlapping | Report multiple solutions; choose based on theory; use GMM soft assignments to quantify overlap |
| Cluster sizes are extremely unequal (e.g., one giant cluster plus several tiny ones) | Outliers forming their own clusters; $K$ is too large; forced spherical clusters | Remove outliers; reduce $K$; use DBSCAN, which flags outliers as noise |
| Variables with zero or near-zero variance included | Data preprocessing omission | Check variable SDs; remove constant or near-constant variables before clustering |
| PCA biplot shows no cluster separation | Clusters may differ on dimensions not captured by PC1 and PC2 | Plot PC2 vs. PC3, PC1 vs. PC3; use 3D plots; check if cluster differences are in higher components |
15. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
| $d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{v=1}^{p} (x_{iv} - x_{jv})^2}$ | Euclidean distance |
| $d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{v=1}^{p} \lvert x_{iv} - x_{jv} \rvert$ | Manhattan distance |
| $d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^{\top} \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}$ | Mahalanobis distance |
| $d_{ij} = \frac{\sum_{v=1}^{p} w_{ijv}\, d_{ijv}}{\sum_{v=1}^{p} w_{ijv}}$ | Gower's distance (mixed data) |
| $\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2$ | Within-cluster sum of squares |
| $c_i = \arg\min_{k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2$ | K-Means assignment step |
| $\boldsymbol{\mu}_k = \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} \mathbf{x}_i$ | K-Means update step (centroid) |
| $s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$ | Silhouette coefficient |
| $\mathrm{ASW} = \frac{1}{n} \sum_{i=1}^{n} s(i)$ | Average silhouette width |
| $\mathrm{CH} = \frac{\mathrm{BCSS} / (K - 1)}{\mathrm{WCSS} / (n - K)}$ | Calinski-Harabasz index |
| $\mathrm{DB} = \frac{1}{K} \sum_{k=1}^{K} \max_{j \neq k} \frac{s_k + s_j}{d_{kj}}$ | Davies-Bouldin index |
| $\mathrm{Gap}(K) = \mathbb{E}^{*}[\log W_K] - \log W_K$ | Gap statistic |
| $\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{G} \pi_j\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$ | GMM posterior responsibility |
| $\mathrm{BIC} = 2 \log \hat{L} - m \log n$ | BIC for GMM model selection |
| $c = \operatorname{corr}(d_{ij},\, t_{ij})$, with $t_{ij}$ the cophenetic distances | Cophenetic correlation (hierarchical quality) |
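Several of the equations above have direct scikit-learn counterparts. A short sketch (the two-blob toy dataset is illustrative only) computing WCSS, the average silhouette width, and the Calinski-Harabasz and Davies-Bouldin indices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(42)
# Two clearly separated Gaussian blobs.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
asw = silhouette_score(X, km.labels_)        # average silhouette width (ASW)
ch = calinski_harabasz_score(X, km.labels_)  # CH: between/within variance ratio
db = davies_bouldin_score(X, km.labels_)     # DB: average worst-case overlap
wcss = km.inertia_                           # within-cluster sum of squares
```

Higher ASW and CH, and lower DB, indicate better-separated clusters, matching the optimisation directions in the Selection Methods table below.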
Method Selection Guide
| Scenario | Recommended Method |
|---|---|
| Exploratory; unknown $K$; small-to-medium $n$ | Agglomerative hierarchical (Ward's linkage) |
| Large $n$; continuous data; approximate $K$ known | K-Means with K-Means++ |
| Outliers present; any distance matrix | K-Medoids (PAM) |
| Soft assignments desired; elliptical clusters | Gaussian Mixture Model (GMM) |
| Arbitrary-shaped clusters; automatic noise detection | DBSCAN or HDBSCAN |
| Mixed variable types | Gower's distance + PAM or hierarchical |
| Categorical-only variables | K-Modes or Latent Class Analysis |
| High-dimensional data (large $p$) | PCA first, then cluster on scores |
| Varying density clusters | HDBSCAN |
| Overlapping clusters; uncertainty quantification | GMM or Fuzzy C-Means |
| Temporal/longitudinal data | Group-Based Trajectory Modelling |
| Combining multiple algorithms | Ensemble/Consensus Clustering |
Distance Measure Selection Guide
| Data Type | Recommended Distance |
|---|---|
| Continuous, different scales | Euclidean (after z-standardisation) |
| Continuous, outliers present | Manhattan |
| Continuous, correlated variables | Mahalanobis |
| Mixed (continuous + categorical) | Gower's |
| Binary (presence/absence) | Jaccard |
| High-dimensional (text, genomics) | Cosine |
| Time series | Dynamic Time Warping (DTW) |
| Count data | Bray-Curtis |
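Most distances in the guide above are available in `scipy.spatial.distance`. A small sketch on hand-picked vectors (the Mahalanobis call uses an identity inverse-covariance purely for illustration, which reduces it to Euclidean distance):

```python
import numpy as np
from scipy.spatial.distance import (euclidean, cityblock, mahalanobis,
                                    jaccard, cosine)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

d_euc = euclidean(x, y)    # sqrt(1^2 + 2^2 + 0^2) = sqrt(5)
d_man = cityblock(x, y)    # |1| + |2| + |0| = 3  (Manhattan)
# mahalanobis() expects the INVERSE covariance matrix; with the identity
# it coincides with the Euclidean distance (real use: np.linalg.inv(S)).
d_mah = mahalanobis(x, y, np.eye(3))
d_cos = cosine(x, y)       # 1 - cos(angle between x and y)

# Jaccard distance on binary presence/absence profiles.
a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0], dtype=bool)
d_jac = jaccard(a, b)      # disagreements / positions where either is present
```

Gower's distance is not in scipy; for mixed data it is typically taken from a dedicated package or computed per-variable as in the Core Equations table.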
Linkage Method Guide (Hierarchical Clustering)
| Linkage | Cluster Shape | Sensitive to Outliers | Recommended |
|---|---|---|---|
| Ward's | Compact, spherical | Moderate | ✅ Default choice |
| Complete | Compact, equal size | Yes | Compact expected clusters |
| Average | Moderate | Moderate | Good general alternative |
| Single | Elongated, chained | Very high | Filamentary structures only |
| Centroid | Moderate | Low | Not recommended (inversions) |
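The linkage trade-offs above can be inspected directly with `scipy.cluster.hierarchy`, together with the cophenetic correlation from the Core Equations table (the two-blob dataset is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (30, 2)), rng.normal(3, 0.4, (30, 2))])
d = pdist(X)                      # condensed Euclidean distance matrix

for method in ("ward", "complete", "average", "single"):
    Z = linkage(d, method=method)
    c, _ = cophenet(Z, d)         # cophenetic correlation vs. original d
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(f"{method:8s} cophenetic r = {c:.3f}")
```

Centroid linkage is deliberately omitted, per the table's warning about inversions.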
Validity Index Benchmarks
| Index | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Average Silhouette Width | $< 0.25$ | $0.25$–$0.50$ | $0.51$–$0.70$ | $> 0.70$ |
| Bootstrap Stability (ARI) | $< 0.50$ | $0.50$–$0.70$ | $0.70$–$0.85$ | $> 0.85$ |
| Cophenetic Correlation | $< 0.60$ | $0.60$–$0.75$ | $0.75$–$0.90$ | $> 0.90$ |
| ARI vs. ground truth | $< 0.30$ | $0.30$–$0.50$ | $0.50$–$0.80$ | $> 0.80$ |
Selection Methods Comparison
| Method | Statistic | Optimal $K$ | Requires $K$ Range | For |
|---|---|---|---|---|
| Elbow | WCSS | At the elbow | Yes | K-Means |
| Silhouette | ASW | Maximise | Yes | Any |
| Gap Statistic | Gap | First $K$ with $\mathrm{Gap}(K) \ge \mathrm{Gap}(K+1) - s_{K+1}$ | Yes | Any |
| CH Index | CH | Maximise | Yes | K-Means, Hierarchical |
| DB Index | DB | Minimise | Yes | K-Means, Hierarchical |
| BIC | BIC | Maximise | Yes | GMM only |
| Dendrogram | Height gaps | Cut at largest gap | No | Hierarchical only |
Minimum Sample Size Guidelines
| Method | Minimum | Minimum per Cluster |
|---|---|---|
| Hierarchical | 30 | 5 |
| K-Means | $10 \times K$ | 10 |
| K-Medoids | $10 \times K$ | 10 |
| GMM | 100 | 20 |
| DBSCAN | 50 | $\ge$ MinPts |
| Fuzzy C-Means | $10 \times K$ | 10 |
Key Pre-Processing Checklist
| Step | Action | When Required |
|---|---|---|
| Standardise | Z-score all continuous variables | Always (Euclidean distance) |
| Remove outliers | Screen using Mahalanobis or box plots | Before distance-based clustering |
| Handle missing data | Impute or use Gower's distance | When missing data present |
| Remove redundant variables | Drop one of each pair with $\lvert r \rvert > 0.90$ | Highly correlated variable pairs |
| Check clusterability | Hopkins statistic | Before clustering |
| Dimensionality reduction | PCA when $p$ is large relative to $n$ | High-dimensional data |
| Check variable variance | Remove near-zero variance variables | Always |
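The checklist can be sketched end-to-end in a few lines. This is a simplified illustration: the synthetic three-column dataset, the 0.90 correlation cut-off, and the nearest-neighbour Hopkins variant (which does not hold out the sampled points) are all assumptions of the sketch, not DataStatPro's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
# A clustered variable, a near-duplicate of it, and a constant column.
v1 = np.concatenate([rng.normal(0, 1, 60), rng.normal(6, 1, 60)])
v2 = v1 + rng.normal(0, 0.05, 120)     # redundant: |r| with v1 near 1
v3 = np.full(120, 2.0)                 # near-zero variance
X = np.column_stack([v1, v2, v3])

# 1) Remove near-zero-variance variables.
X = X[:, X.std(axis=0) > 1e-8]

# 2) Drop one of each highly correlated pair (|r| > 0.90).
r = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(r[j, k]) <= 0.90 for k in keep):
        keep.append(j)
X = X[:, keep]

# 3) Z-standardise the remaining variables.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 4) Hopkins statistic: ~0.5 means uniform (no structure), values
#    approaching 1 suggest the data are clusterable.
m = 30
sample = X[rng.choice(len(X), m, replace=False)]
uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))

def nn_dist(points, data):
    # Distance from each point to its nearest neighbour in `data`,
    # ignoring exact self-matches.
    d = np.linalg.norm(points[:, None, :] - data[None, :, :], axis=2)
    d[d == 0] = np.inf
    return d.min(axis=1)

u = nn_dist(uniform, X).sum()   # uniform points -> real data
w = nn_dist(sample, X).sum()    # real points -> other real points
hopkins = u / (u + w)
```

Here the duplicate and constant columns are removed before standardisation, and the bimodal structure of the remaining variable pushes the Hopkins statistic above 0.5.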
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Cluster Analysis using the DataStatPro application. For further reading, consult Kaufman & Rousseeuw's "Finding Groups in Data: An Introduction to Cluster Analysis" (2005), Everitt, Landau, Leese & Stahl's "Cluster Analysis" (5th ed., 2011), and Fraley & Raftery's "Model-Based Clustering, Discriminant Analysis, and Density Estimation" (Journal of the American Statistical Association, 2002). For feature requests or support, contact the DataStatPro team.