
Cluster Analysis

Comprehensive reference guide for cluster analysis and unsupervised learning methods.

Cluster Analysis: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Cluster Analysis all the way through advanced algorithms, evaluation, interpretation, and practical usage within the DataStatPro application. Whether you are encountering cluster analysis for the first time or looking to deepen your understanding of unsupervised learning and data segmentation, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Cluster Analysis?
  3. The Mathematics Behind Cluster Analysis
  4. Assumptions of Cluster Analysis
  5. Types of Cluster Analysis Methods
  6. Using the Cluster Analysis Component
  7. Hierarchical Clustering
  8. K-Means and K-Medoids Clustering
  9. Model-Based and Density-Based Clustering
  10. Model Fit and Evaluation
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into cluster analysis, it is helpful to be comfortable with the following foundational statistical and mathematical concepts. Each is briefly reviewed below.

1.1 Distance and Similarity

Cluster analysis is fundamentally about grouping objects that are similar to each other and dissimilar from objects in other groups. The core computational tool is a distance or dissimilarity measure between pairs of observations.

For two observations $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $\mathbf{x}_j = (x_{j1}, x_{j2}, \dots, x_{jp})$ with $p$ variables, the most commonly used distance measure is the Euclidean distance:

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{p}(x_{ik} - x_{jk})^2}$$

Key properties of a valid distance measure:

- Non-negativity: $d(\mathbf{x}_i, \mathbf{x}_j) \geq 0$
- Identity: $d(\mathbf{x}_i, \mathbf{x}_j) = 0$ if and only if $\mathbf{x}_i = \mathbf{x}_j$
- Symmetry: $d(\mathbf{x}_i, \mathbf{x}_j) = d(\mathbf{x}_j, \mathbf{x}_i)$
- Triangle inequality: $d(\mathbf{x}_i, \mathbf{x}_j) \leq d(\mathbf{x}_i, \mathbf{x}_k) + d(\mathbf{x}_k, \mathbf{x}_j)$

A similarity measure $s(\mathbf{x}_i, \mathbf{x}_j)$ is the inverse concept: high similarity means the objects are close. A similarity bounded by $[0, 1]$ can always be converted to a dissimilarity via $d = 1 - s$.
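The distance and the similarity-to-dissimilarity conversion translate directly into a few lines of Python (an illustrative sketch, not the DataStatPro implementation):

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_k (x_k - y_k)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def to_dissimilarity(s):
    # convert a similarity bounded by [0, 1] into a dissimilarity
    return 1 - s

d = euclidean([1.0, 2.0], [4.0, 6.0])   # a 3-4-5 right triangle, so d = 5.0
```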

1.2 Vectors and Centroids

An observation in a dataset with $p$ variables is represented as a vector in $p$-dimensional space:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T \in \mathbb{R}^p$$

The centroid of a set of $n_k$ observations $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{n_k}\}$ is their arithmetic mean:

$$\bar{\mathbf{x}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\mathbf{x}_i$$

In $K$-means clustering, the centroid is the representative point of each cluster.

1.3 Variance and Within-Group Variance

The total variance of a dataset measures overall dispersion:

$$SS_{total} = \sum_{i=1}^{n}\sum_{k=1}^{p}(x_{ik} - \bar{x}_k)^2$$

The within-cluster sum of squares (WCSS) measures how tightly packed observations are within their assigned clusters:

$$\text{WCSS} = \sum_{c=1}^{K}\sum_{i \in C_c}\|\mathbf{x}_i - \bar{\mathbf{x}}_c\|^2$$

Where $C_c$ is the set of observations in cluster $c$ and $\bar{\mathbf{x}}_c$ is the centroid of cluster $c$. Minimising WCSS is the objective of $K$-means clustering.
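The WCSS definition can be checked with a small sketch (plain Python for illustration):

```python
def centroid(points):
    # arithmetic mean of a list of equal-length vectors
    p = len(points[0])
    return [sum(x[k] for x in points) / len(points) for k in range(p)]

def wcss(clusters):
    # sum of squared Euclidean distances of each point to its cluster centroid
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sum((a - b) ** 2 for a, b in zip(x, c)) for x in pts)
    return total

# two clusters of two points each; every point is 1 unit from its centroid
clusters = [[[0.0, 0.0], [2.0, 0.0]], [[10.0, 10.0], [12.0, 10.0]]]
```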

1.4 Matrix Notation and the Distance Matrix

Given $n$ observations, the distance matrix $\mathbf{D}$ is an $n \times n$ symmetric matrix whose element $d_{ij}$ is the distance between observations $i$ and $j$:

$$\mathbf{D} = \begin{pmatrix} 0 & d_{12} & d_{13} & \cdots & d_{1n} \\ d_{21} & 0 & d_{23} & \cdots & d_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & d_{n3} & \cdots & 0 \end{pmatrix}$$

The distance matrix is the primary input to hierarchical and many other clustering algorithms. It is always symmetric ($d_{ij} = d_{ji}$) with zeros on the diagonal.
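Building such a matrix only requires computing each pair once, since symmetry gives the other half for free (a minimal sketch):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(X, dist=euclidean):
    n = len(X)
    D = [[0.0] * n for _ in range(n)]            # zeros on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = dist(X[i], X[j])  # symmetry: d_ij = d_ji
    return D

D = distance_matrix([[0.0, 0.0], [3.0, 4.0], [0.0, 8.0]])
```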

1.5 Probability Distributions and Mixture Models

A probability distribution $f(x)$ describes the likelihood of observing each value of a random variable $X$. In model-based clustering, the data are assumed to arise from a mixture of distributions:

$$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k f_k(\mathbf{x} \mid \boldsymbol{\theta}_k)$$

Where:

- $K$ is the number of mixture components (clusters);
- $\pi_k$ is the mixing proportion of component $k$, with $\pi_k \geq 0$ and $\sum_k \pi_k = 1$;
- $f_k(\mathbf{x} \mid \boldsymbol{\theta}_k)$ is the density of component $k$ with parameters $\boldsymbol{\theta}_k$.

The most common choice is the Gaussian mixture model (GMM), where each component is a multivariate normal distribution.

1.6 Optimisation and Convergence

Many clustering algorithms work by iteratively optimising an objective function (e.g., minimising WCSS). Key concepts:

Iteration: Repeating the same update steps until a stopping criterion is met.

Convergence: The algorithm has converged when the objective function changes by less than a small threshold $\varepsilon$ between iterations:

$$|F^{(t+1)} - F^{(t)}| < \varepsilon$$

Local vs. Global Optimum: Iterative algorithms like $K$-means often converge to a local (not global) optimum — the solution depends on the random starting configuration. Running the algorithm multiple times with different starts helps find a better (though still not guaranteed globally optimal) solution.

1.7 Supervised vs. Unsupervised Learning

Supervised learning uses labelled data (known outcomes) to train a model — examples include regression and classification. Unsupervised learning discovers structure in unlabelled data without a predefined outcome variable.

Cluster analysis is unsupervised — there is no "correct answer" against which to evaluate the solution. This makes cluster analysis both powerful (works without labels) and challenging (no ground truth for validation). All cluster validation is internal (based on the structure of the clustered data itself) or theoretical (based on external knowledge).


2. What is Cluster Analysis?

2.1 The Core Idea

Cluster analysis (also called clustering or cluster detection) is a class of unsupervised statistical learning methods that partition a dataset into groups (clusters) such that:

- observations within the same cluster are as similar to each other as possible (internal cohesion), and
- observations in different clusters are as dissimilar as possible (external separation).

Unlike classification (which assigns new observations to predefined categories), cluster analysis discovers the categories themselves from the data. The researcher does not specify what the groups should look like — the algorithm determines which observations naturally belong together.

2.2 What Cluster Analysis Can and Cannot Do

Cluster analysis CAN:

- Discover natural groupings in data without predefined labels
- Reduce a large, heterogeneous dataset to a small number of interpretable segments
- Generate hypotheses about subgroups for follow-up analysis
- Produce a cluster membership variable for use in subsequent modelling

Cluster analysis CANNOT:

- Prove that the discovered clusters correspond to "real" population subgroups
- Determine the "true" number of clusters with certainty
- Establish causal relationships between variables
- Produce meaningful groups from structureless data (it will still output clusters)

2.3 Real-World Applications

| Field | Application | Variables Used |
|---|---|---|
| Medicine | Identifying patient subgroups with similar disease profiles or treatment responses | Lab values, symptoms, genetic markers |
| Marketing | Customer segmentation for targeted advertising and product development | Purchase behaviour, demographics, preferences |
| Genomics | Grouping genes with similar expression patterns across conditions | Gene expression levels across samples |
| Psychology | Identifying latent subtypes in clinical populations | Symptom profiles, cognitive test scores |
| Finance | Portfolio diversification by grouping assets with correlated returns | Return series, risk metrics |
| Image Analysis | Colour segmentation; pixel grouping in image compression | RGB values, spatial coordinates |
| Social Science | Identifying socioeconomic neighbourhood profiles | Income, education, employment, housing |
| Ecology | Species distribution grouping; habitat classification | Environmental variables, species counts |
| Astronomy | Classifying galaxies or stars by spectral characteristics | Luminosity, colour index, mass |

2.4 Cluster Analysis vs. Related Methods

| Feature | Cluster Analysis | Factor Analysis | PCA | Discriminant Analysis |
|---|---|---|---|---|
| Goal | Group observations | Group variables | Reduce variable dimensions | Separate known groups |
| Supervised? | No | No | No | Yes |
| Output | Cluster memberships | Latent factors | Component scores | Classification rules |
| Known groups? | No | No | No | Yes |
| Operates on | Rows (observations) | Columns (variables) | Columns (variables) | Both |
| Distance used? | Yes (usually) | No | No | Yes (Mahalanobis) |

2.5 The Fundamental Challenge: Defining Similarity

Every clustering algorithm embeds assumptions about what "similar" means. The choice of:

- distance or dissimilarity measure,
- variables included in the analysis,
- standardisation approach, and
- clustering algorithm (and, for hierarchical methods, linkage)

...all profoundly influence the resulting clusters. There is no universally correct clustering solution — different algorithms and distance measures can produce very different groupings from the same data. Always examine results from multiple approaches.


3. The Mathematics Behind Cluster Analysis

3.1 Distance and Dissimilarity Measures

The choice of distance measure determines how "closeness" is defined and which clusters are formed. Different measures emphasise different aspects of dissimilarity.

Euclidean Distance (L2 norm):

$$d_E(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{p}(x_{ik} - x_{jk})^2} = \|\mathbf{x}_i - \mathbf{x}_j\|_2$$

Sensitive to scale differences between variables — standardisation is essential.

Manhattan Distance (L1 norm, City Block):

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p}|x_{ik} - x_{jk}|$$

Less sensitive to outliers than Euclidean distance. Appropriate for grid-like data.

Minkowski Distance (generalisation of Euclidean and Manhattan):

$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \left(\sum_{k=1}^{p}|x_{ik} - x_{jk}|^p\right)^{1/p}$$

Mahalanobis Distance (accounts for correlations and scale):

$$d_{Mah}(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}$$

Where $\mathbf{S}$ is the sample covariance matrix. Invariant to scale and rotation; accounts for correlations among variables.

Cosine Dissimilarity (based on the angle between vectors):

$$d_{cos}(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\mathbf{x}_i^T \mathbf{x}_j}{\|\mathbf{x}_i\| \cdot \|\mathbf{x}_j\|}$$

Appropriate for high-dimensional data (text, documents) where magnitude is less important than direction.

Gower's Distance (for mixed variable types):

For a dataset with $p$ variables of mixed types (continuous, ordinal, binary, nominal):

$$d_G(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{k=1}^{p} \delta_{ijk} \cdot d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$$

Where $\delta_{ijk} = 1$ if the $k$-th variable is usable for comparison (non-missing), and $d_{ijk}$ is the contribution of variable $k$ to the dissimilarity:

- Continuous and ordinal variables: $d_{ijk} = |x_{ik} - x_{jk}| / R_k$, where $R_k$ is the range of variable $k$;
- Binary and nominal variables: $d_{ijk} = 0$ if the two values match, $1$ otherwise.

3.2 Distance Measure Comparison Table

| Distance | Formula | Scale Sensitive | Outlier Robust | Variable Type | Best For |
|---|---|---|---|---|---|
| Euclidean | $\sqrt{\sum_k (x_{ik} - x_{jk})^2}$ | Yes | No | Continuous | General purpose (after standardising) |
| Manhattan | $\sum_k \lvert x_{ik} - x_{jk} \rvert$ | Yes | Moderate | Continuous | Outlier-prone or grid-like data |
| Minkowski | $(\sum_k \lvert x_{ik} - x_{jk} \rvert^p)^{1/p}$ | Yes | Varies with $p$ | Continuous | Tuning between L1 and L2 |
| Mahalanobis | $\sqrt{(\mathbf{x}_i-\mathbf{x}_j)^T\mathbf{S}^{-1}(\mathbf{x}_i-\mathbf{x}_j)}$ | No | No | Continuous | Correlated variables |
| Cosine | $1 - \cos(\theta)$ | No | Moderate | Continuous | High-dimensional, text |
| Gower | Weighted average | No | Moderate | Mixed | Mixed data types |
| Binary (Jaccard) | $1 - \lvert A \cap B \rvert / \lvert A \cup B \rvert$ | No | Moderate | Binary | Presence/absence data |

3.3 The Hierarchical Clustering Algorithm

Hierarchical clustering builds a dendrogram (tree structure) by successively merging (agglomerative) or splitting (divisive) clusters.

Agglomerative clustering algorithm (bottom-up):

Step 1: Start with $n$ clusters, each containing exactly one observation:

$$\mathcal{C} = \{C_1, C_2, \dots, C_n\}, \quad C_i = \{\mathbf{x}_i\}$$

Step 2: Compute the $n \times n$ distance matrix $\mathbf{D}$.

Step 3: Find the two clusters $C_i$ and $C_j$ with minimum inter-cluster distance:

$$(i^*, j^*) = \arg\min_{i \neq j} d(C_i, C_j)$$

Step 4: Merge $C_{i^*}$ and $C_{j^*}$ into a new cluster:

$$C_{new} = C_{i^*} \cup C_{j^*}$$

Step 5: Update the distance matrix by removing the rows/columns for $C_{i^*}$ and $C_{j^*}$ and adding a new row/column for $C_{new}$ using the chosen linkage method.

Step 6: Repeat Steps 3–5 until a single cluster remains.

The result is a binary tree (dendrogram) recording the order and heights of all merges.
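Steps 1–6 can be sketched as a naive agglomerative loop; production implementations (including DataStatPro's) use optimised distance-matrix updates, but the logic is the same:

```python
def agglomerative(X, K, dist, linkage=min):
    """Naive agglomerative clustering, stopping at K clusters.
    linkage=min gives single linkage; linkage=max gives complete linkage."""
    clusters = [[x] for x in X]                      # Step 1: n singletons
    while len(clusters) > K:                         # Step 6: repeat
        best = None
        for i in range(len(clusters)):               # Step 3: closest pair
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]      # Step 4: merge
        del clusters[j]                              # Step 5: update
    return clusters

groups = agglomerative([[0.0], [0.5], [10.0], [10.5]], K=2,
                       dist=lambda a, b: abs(a[0] - b[0]))
```

Stopping at `K` clusters here stands in for cutting the dendrogram at the corresponding height.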

3.4 Linkage Methods

The linkage method determines how the distance between two clusters is computed from the distances between their constituent observations. Different linkage methods produce dramatically different cluster shapes.

Single Linkage (Minimum):

$$d_{SL}(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$

Distance = distance between the closest pair of observations from each cluster. Prone to chaining — elongated, chain-like clusters.

Complete Linkage (Maximum):

$$d_{CL}(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$

Distance = distance between the farthest pair of observations from each cluster. Produces compact, roughly equal-sized clusters.

Average Linkage (UPGMA):

$$d_{AL}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|}\sum_{\mathbf{x} \in C_i}\sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$

Distance = average of all pairwise distances between the two clusters. A compromise between single and complete linkage.

Ward's Method (Minimum Variance):

Merges the two clusters that result in the smallest increase in total within-cluster sum of squares (WCSS):

$$d_{Ward}(C_i, C_j) = \frac{|C_i| \cdot |C_j|}{|C_i| + |C_j|} \|\bar{\mathbf{x}}_{C_i} - \bar{\mathbf{x}}_{C_j}\|^2$$

Where $\bar{\mathbf{x}}_{C_i}$ and $\bar{\mathbf{x}}_{C_j}$ are the centroids of clusters $C_i$ and $C_j$. Ward's method tends to produce compact, similarly sized spherical clusters and is the most widely used linkage in practice.

Centroid Linkage:

$$d_{Centroid}(C_i, C_j) = \|\bar{\mathbf{x}}_{C_i} - \bar{\mathbf{x}}_{C_j}\|^2$$

Distance = squared Euclidean distance between cluster centroids. Can produce inversions in the dendrogram (a merged cluster appearing lower than its components) — generally not recommended.

3.5 The K-Means Objective Function

$K$-means clustering partitions $n$ observations into $K$ clusters by minimising the within-cluster sum of squares (WCSS), also called the inertia:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2$$

Where:

- $C_k$ is the set of observations assigned to cluster $k$;
- $\bar{\mathbf{x}}_k$ is the centroid of cluster $k$;
- $K$ is the number of clusters, fixed in advance.

The WCSS can be equivalently written using the between-cluster sum of squares (BCSS):

$$\text{WCSS} = SS_{total} - \text{BCSS}$$

Where:

$$\text{BCSS} = \sum_{k=1}^{K} |C_k| \|\bar{\mathbf{x}}_k - \bar{\mathbf{x}}\|^2$$

And $\bar{\mathbf{x}}$ is the global mean of all observations.

3.6 The K-Means Algorithm (Lloyd's Algorithm)

Initialisation: Randomly select $K$ observations as initial centroids $\bar{\mathbf{x}}_1^{(0)}, \bar{\mathbf{x}}_2^{(0)}, \dots, \bar{\mathbf{x}}_K^{(0)}$.

Step 1 — Assignment Step: Assign each observation $\mathbf{x}_i$ to the cluster with the nearest centroid:

$$c_i^{(t)} = \arg\min_{k \in \{1,\dots,K\}} \|\mathbf{x}_i - \bar{\mathbf{x}}_k^{(t)}\|^2$$

Step 2 — Update Step: Recompute each centroid as the mean of all observations assigned to that cluster:

$$\bar{\mathbf{x}}_k^{(t+1)} = \frac{1}{|C_k^{(t)}|}\sum_{\mathbf{x}_i \in C_k^{(t)}} \mathbf{x}_i$$

Convergence: Repeat Steps 1–2 until cluster assignments no longer change:

$$c_i^{(t+1)} = c_i^{(t)} \quad \forall i$$

Or until the change in WCSS is below a threshold $\varepsilon > 0$:

$$|\text{WCSS}^{(t+1)} - \text{WCSS}^{(t)}| < \varepsilon$$

Convergence guarantee: $K$-means always converges in a finite number of iterations because:

  1. The assignment step never increases WCSS.
  2. The update step never increases WCSS.
  3. The number of possible cluster assignments is finite.

However, the converged solution may be a local optimum. Run with at least 20–50 random starts and keep the solution with the lowest WCSS.
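The assignment/update loop and the multiple-restart strategy can be sketched in plain Python (illustrative only; DataStatPro uses its own optimised implementation):

```python
import random

def sq(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean(points):
    return [sum(x[k] for x in points) / len(points) for k in range(len(points[0]))]

def kmeans(X, K, n_starts=20, max_iter=100, seed=0):
    """Lloyd's algorithm with multiple random starts; keeps the lowest-WCSS run."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centroids = rng.sample(X, K)                   # random initialisation
        labels = [0] * len(X)
        for _ in range(max_iter):
            # Step 1 -- assignment: each point to its nearest centroid
            labels = [min(range(K), key=lambda k: sq(x, centroids[k])) for x in X]
            # Step 2 -- update: recompute each centroid as its cluster mean
            new = []
            for k in range(K):
                members = [x for x, l in zip(X, labels) if l == k]
                new.append(mean(members) if members else centroids[k])
            if new == centroids:                       # assignments are stable
                break
            centroids = new
        w = sum(sq(x, centroids[l]) for x, l in zip(X, labels))
        if best is None or w < best[0]:                # keep the best restart
            best = (w, labels)
    return best

wcss, labels = kmeans([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]], K=2)
```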

3.7 K-Means++ Initialisation

Standard random initialisation of $K$-means often converges to poor local optima. K-means++ (Arthur & Vassilvitskii, 2007) provides a smarter initialisation:

Step 1: Choose the first centroid uniformly at random from the data:

$$\bar{\mathbf{x}}_1 \sim \text{Uniform}\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$$

Step 2: For each subsequent centroid $\bar{\mathbf{x}}_k$ ($k = 2, \dots, K$), choose observation $\mathbf{x}_i$ with probability proportional to its squared distance from the nearest already-chosen centroid:

$$P(\mathbf{x}_i) = \frac{D(\mathbf{x}_i)^2}{\sum_{j=1}^{n} D(\mathbf{x}_j)^2}$$

Where $D(\mathbf{x}_i) = \min_{c \in \text{chosen centroids}} \|\mathbf{x}_i - c\|$.

Step 3: Repeat Step 2 until $K$ centroids are chosen, then run standard K-means.

K-means++ provides an $O(\log K)$ approximation guarantee on the WCSS and typically converges much faster than random initialisation.
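The $D(\mathbf{x}_i)^2$-weighted sampling of Step 2 can be sketched as follows (illustrative; real implementations vectorise this):

```python
import random

def kmeanspp_init(X, K, seed=0):
    """k-means++ seeding: spread the initial centroids apart (sketch)."""
    rng = random.Random(seed)
    sq = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    centroids = [rng.choice(X)]            # Step 1: uniform at random
    while len(centroids) < K:
        # Step 2: sample x_i with probability proportional to D(x_i)^2
        d2 = [min(sq(x, c) for c in centroids) for x in X]
        r = rng.uniform(0, sum(d2))
        cum = 0.0
        for x, w in zip(X, d2):
            cum += w
            if cum >= r:                   # inverse-CDF sampling
                centroids.append(x)
                break
    return centroids

seeds = kmeanspp_init([[0.0], [1.0], [2.0], [100.0]], K=2)
```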

3.8 The Gaussian Mixture Model (GMM)

Model-based clustering assumes that the data arise from a finite mixture of multivariate Gaussian distributions. Each component represents a cluster:

$$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Where:

- $\pi_k$ is the mixing proportion of component $k$;
- $\boldsymbol{\mu}_k$ is the mean vector of component $k$;
- $\boldsymbol{\Sigma}_k$ is the covariance matrix of component $k$;

and the multivariate normal density is:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

3.9 The EM Algorithm for GMM

The parameters $\boldsymbol{\Theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K$ are estimated by maximising the log-likelihood:

$$\ell(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \ln\left[\sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]$$

The Expectation-Maximisation (EM) algorithm iterates between two steps:

E-Step (Expectation): Compute the posterior responsibility (soft assignment probability) of component $k$ for observation $i$:

$$r_{ik}^{(t)} = \frac{\pi_k^{(t)} \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$

M-Step (Maximisation): Update parameters using the current responsibilities:

Effective cluster sizes:

$$N_k^{(t)} = \sum_{i=1}^{n} r_{ik}^{(t)}$$

Updated mixing proportions:

$$\pi_k^{(t+1)} = \frac{N_k^{(t)}}{n}$$

Updated means:

$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^{n} r_{ik}^{(t)} \mathbf{x}_i$$

Updated covariance matrices:

$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^{n} r_{ik}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})^T$$

The EM algorithm is guaranteed to converge to a local maximum of the log-likelihood. Key advantage over $K$-means: soft (probabilistic) cluster memberships — each observation belongs to each cluster with a probability, not a hard 0/1 assignment.
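The E- and M-step updates are easiest to see in one dimension, where each covariance matrix collapses to a scalar variance. The following is a minimal illustrative sketch (not the DataStatPro implementation, which handles multivariate data and multiple covariance structures):

```python
import math

def em_gmm_1d(x, K=2, iters=100):
    """Minimal EM for a one-dimensional Gaussian mixture."""
    n = len(x)
    xs = sorted(x)
    gm = sum(x) / n
    mu = [xs[(2 * k + 1) * n // (2 * K)] for k in range(K)]   # spread-out init
    var = [sum((v - gm) ** 2 for v in x) / n] * K
    pi = [1.0 / K] * K
    for _ in range(iters):
        # E-step: responsibilities r[i][k]
        r = []
        for xi in x:
            dens = [pi[k] * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k]) for k in range(K)]
            s = sum(dens)
            r.append([d / s for d in dens])
        # M-step: effective sizes, proportions, means, variances
        Nk = [sum(ri[k] for ri in r) for k in range(K)]
        pi = [Nk[k] / n for k in range(K)]
        mu = [sum(r[i][k] * x[i] for i in range(n)) / Nk[k] for k in range(K)]
        var = [max(1e-9, sum(r[i][k] * (x[i] - mu[k]) ** 2 for i in range(n)) / Nk[k])
               for k in range(K)]
    return pi, mu, var

data = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
pi, mu, var = em_gmm_1d(data, K=2)
```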


4. Assumptions of Cluster Analysis

4.1 The Existence of Clusters (Clusterability)

The most fundamental assumption is that the data contain genuine cluster structure — that the population from which the data are drawn is genuinely multimodal or heterogeneous.

Why it matters: Clustering algorithms will always produce clusters, even for completely random, uniform data. A solution from random data is meaningless.

How to check: Inspect pairwise scatterplots and the ordered dissimilarity heatmap for visible grouping, and compute the Hopkins statistic:

$$H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}$$

Where $u_i$ is the distance from a randomly sampled point to its nearest neighbour in the data, $w_i$ is the distance from a randomly chosen data point to its nearest neighbour, and $d$ is the number of dimensions. Under a uniform distribution, $H \approx 0.5$; $H > 0.75$ suggests cluster structure.
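A simplified sketch of the Hopkins calculation (sampling uniform points from the data's bounding box; assumes at least two observations):

```python
import random

def hopkins(X, m, seed=0):
    """Hopkins statistic sketch; H near 0.5 suggests uniform (unclustered) data."""
    rng = random.Random(seed)
    sq = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    n, p = len(X), len(X[0])
    lo = [min(x[k] for x in X) for k in range(p)]
    hi = [max(x[k] for x in X) for k in range(p)]
    u = w = 0.0
    for _ in range(m):
        # u_i: random point in the bounding box -> nearest data point
        q = [rng.uniform(lo[k], hi[k]) for k in range(p)]
        u += min(sq(q, x) for x in X) ** (p / 2)   # distance^p = (squared)^(p/2)
        # w_i: random data point -> nearest other data point
        i = rng.randrange(n)
        w += min(sq(X[i], X[j]) for j in range(n) if j != i) ** (p / 2)
    return u / (u + w)

# two tight, well-separated blobs -> H well above 0.5
blobs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [9.0, 9.0], [9.1, 9.0], [9.0, 9.1]]
H = hopkins(blobs, m=20)
```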

4.2 Scale and Measurement Consistency

All continuous variables must be on comparable scales before computing Euclidean distances. If one variable is measured in millimetres (range 1–10) and another in kilograms (range 50–100), the larger-scale variable will dominate the distance calculation.

How to check: Compare the standard deviations and ranges of the clustering variables. If they differ by an order of magnitude or more, standardise before clustering.

Standard z-score transformation:

$$z_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_{x_k}}$$

When NOT to standardise:

- All variables are already measured in the same units on comparable scales;
- The differences in scale are themselves substantively meaningful and should influence the grouping.
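The z-score transformation applied to one variable (a minimal sketch using the population standard deviation):

```python
def zscore(column):
    """Standardise one variable to mean 0 and SD 1."""
    n = len(column)
    m = sum(column) / n
    sd = (sum((v - m) ** 2 for v in column) / n) ** 0.5
    return [(v - m) / sd for v in column]

z = zscore([50.0, 60.0, 70.0, 80.0, 90.0])
```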

4.3 Absence of Extreme Outliers

Outliers severely distort distance-based clustering methods (especially $K$-means and Ward's hierarchical clustering):

How to check:

- Univariate: boxplots and z-scores ($|z| > 3$);
- Multivariate: Mahalanobis distance from the overall centroid.

How to handle:

- Investigate and correct data-entry errors;
- Use a robust method such as $K$-medoids (PAM) or DBSCAN (which labels outliers as noise);
- Winsorise or remove clearly erroneous observations, documenting the decision.

4.4 Appropriate Choice of Distance Measure

The distance measure should be appropriate for the type and scale of the variables:

- Continuous variables: Euclidean (after standardising) or Manhattan;
- Correlated continuous variables: Mahalanobis;
- Binary variables: Jaccard or simple matching;
- Mixed variable types: Gower's distance.

Choosing the wrong distance measure can produce meaningless clusters even when genuine structure exists.

4.5 Spherical Cluster Shape (for K-Means)

$K$-means assumes clusters are convex and roughly spherical (equal variance in all directions). Non-spherical cluster shapes (elongated, curved, or irregular) will be poorly recovered by $K$-means.

How to check: Plot the clusters on pairwise scatterplots or the first two principal components and inspect their shapes; compare the $K$-means solution against a GMM or DBSCAN solution.

Remedy: Use GMM (which can model elliptical shapes), hierarchical clustering with single linkage (for chain-like clusters), or DBSCAN (for arbitrary shapes).

4.6 Appropriate Number of Clusters (K)

The number of clusters $K$ must be chosen appropriately. Specifying too few clusters over-groups heterogeneous observations; too many clusters fragments natural groups.

How to determine $K$: Use multiple criteria (see Section 10) including the elbow method, silhouette analysis, gap statistic, and BIC. Never rely on a single criterion.

4.7 Adequate Sample Size

Reliable clustering requires sufficient observations per cluster for stable, reproducible results:

| Method | Minimum Total $n$ | Minimum per Cluster |
|---|---|---|
| Hierarchical | 30 | 5 |
| $K$-means | $10K$ | 10 |
| GMM | $10Kp$ | $5p$ |
| DBSCAN | 50 | Depends on MinPts |

Where $K$ is the number of clusters and $p$ is the number of variables.

⚠️ With small samples ($n < 50$), cluster solutions are highly unstable and may not replicate in new data. Always validate the solution using bootstrapping or an independent sample.


5. Types of Cluster Analysis Methods

5.1 Classification by Methodology

Clustering methods are broadly classified by their algorithmic strategy:

| Category | Methods | Cluster Shape | Hard/Soft | Output |
|---|---|---|---|---|
| Hierarchical | Ward, Complete, Single, Average, Centroid | Flexible | Hard | Dendrogram + any K |
| Partitional | K-Means, K-Medoids (PAM), K-Modes | Spherical/Euclidean | Hard | K clusters |
| Model-Based | GMM, Mclust | Elliptical | Soft | K clusters + probabilities |
| Density-Based | DBSCAN, OPTICS, HDBSCAN | Arbitrary | Hard | Clusters + noise points |
| Fuzzy | Fuzzy C-Means | Spherical | Soft | Membership degrees |
| Spectral | Spectral Clustering | Arbitrary | Hard | K clusters |
| Grid-Based | CLIQUE, STING | Grid cells | Hard | Dense grid cells |

5.2 Hierarchical vs. Partitional Methods

| Feature | Hierarchical | Partitional ($K$-Means) |
|---|---|---|
| Specify $K$ in advance? | No | Yes |
| Deterministic? | Yes | No (random starts) |
| Scalability | $O(n^2 \log n)$ to $O(n^3)$ | $O(nKI)$ |
| Output | Complete hierarchy (dendrogram) | Single solution for a given $K$ |
| Reversible merges/splits? | No | Yes (re-run with different $K$) |
| Works for large $n$? | Poorly ($n > 10{,}000$) | Well |
| Mixed variable types | Yes (with Gower) | Limited |

5.3 Agglomerative vs. Divisive Hierarchical Clustering

| Feature | Agglomerative (Bottom-Up) | Divisive (Top-Down) |
|---|---|---|
| Start | $n$ singleton clusters | 1 cluster containing all $n$ |
| Process | Merge closest clusters | Split most heterogeneous cluster |
| Common examples | Ward, Complete, Average linkage | DIANA (Divisive Analysis) |
| Computational cost | $O(n^2 \log n)$ | $O(2^n)$ in exact form |
| Usage | Very common | Less common |

5.4 Hard vs. Soft (Fuzzy) Clustering

| Feature | Hard (Crisp) Clustering | Soft (Fuzzy) Clustering |
|---|---|---|
| Assignment | Each observation belongs to exactly one cluster | Each observation has a degree of membership in each cluster |
| Membership | $u_{ik} \in \{0, 1\}$ | $u_{ik} \in [0, 1]$, $\sum_k u_{ik} = 1$ |
| Examples | $K$-Means, Hierarchical | GMM, Fuzzy C-Means |
| Best for | Well-separated clusters | Overlapping clusters; ambiguous boundary cases |
| Output | Cluster label vector | Membership matrix |

5.5 DataStatPro Implementation Overview

The DataStatPro Cluster Analysis component implements:

| Method | Implementation | Best For |
|---|---|---|
| Agglomerative Hierarchical | Ward, Complete, Average, Single linkage | Exploratory; unknown $K$; small to medium $n$ |
| K-Means | Lloyd's algorithm with K-Means++ init | Large $n$; continuous data; known approximate $K$ |
| K-Medoids (PAM) | Partitioning Around Medoids | Robust to outliers; mixed data |
| Gaussian Mixture Models | EM algorithm; multiple covariance structures | Soft assignments; elliptical clusters |
| DBSCAN | Density-based; automatic noise detection | Arbitrary shapes; outlier detection |

6. Using the Cluster Analysis Component

The Cluster Analysis component in DataStatPro provides a complete end-to-end workflow for performing, evaluating, and visualising cluster analyses.

Step-by-Step Guide

Step 1 — Select Dataset

Choose the dataset from the "Dataset" dropdown. Ensure:

- The candidate clustering variables are numeric (or coded appropriately for the chosen distance measure);
- Missing values have been handled (imputed or excluded);
- The sample size is adequate for the intended method (see Section 4.7).

💡 Tip: Run descriptive statistics and a correlation matrix before clustering. Variables that are highly correlated ($|r| > 0.90$) are redundant — consider removing one from each pair to avoid over-weighting that dimension in the distance calculation.

Step 2 — Select Clustering Variables

Select all variables to include in the cluster analysis from the "Clustering Variables" dropdown. Only include variables that are:

- Theoretically relevant to the grouping you expect to find;
- Measured on scales compatible with the chosen distance measure;
- Not highly redundant with each other (see the correlation tip above).

⚠️ Important: Do not include identifier variables (ID numbers), date variables, or outcome variables you intend to use to validate the clusters (these should be held out for post-hoc validation, not used in forming the clusters).

Step 3 — Select Clustering Method

Choose from the "Method" dropdown:

Step 4 — Standardisation Options

Select how to scale the variables before clustering:

Step 5 — Select Distance Measure

Choose the appropriate distance measure for your data type:

Step 6 — Specify Number of Clusters

💡 Best practice: Always evaluate at least three values of $K$ (your hypothesised number, one fewer, and one more) and compare the solutions on both statistical criteria and theoretical interpretability.

Step 7 — Display Options

Select which outputs and visualisations to display:

Step 8 — Run the Analysis

Click "Run Cluster Analysis". The application will:

  1. Standardise variables (if selected).
  2. Compute the distance matrix.
  3. Run the clustering algorithm.
  4. Compute cluster validity indices (silhouette, Calinski-Harabasz, Davies-Bouldin).
  5. Generate all selected visualisations.
  6. Produce a cluster membership variable that can be saved back to the dataset for further analysis.

7. Hierarchical Clustering

7.1 The Dendrogram

The dendrogram is the primary output of hierarchical clustering. It is a binary tree where:

- each leaf is a single observation;
- each internal node represents the merge of two clusters;
- the height of a node is the inter-cluster distance at which that merge occurred.

The dendrogram encodes a complete hierarchical clustering solution for all possible numbers of clusters $K$ from 1 to $n$. Cutting the dendrogram at a given height $h$ produces the clustering solution for that number of clusters.

Reading the dendrogram: Draw a horizontal line at the chosen height; each branch it crosses is one cluster. Merges low in the tree join very similar observations, while long vertical stems before a merge indicate well-separated clusters.
7.2 Choosing the Cut Point

Method 1 — Largest height gap:

Inspect the heights of consecutive merges and cut below the largest jump:

$$K^* = \arg\max_{k} [h(k) - h(k-1)]$$

Where $h(k)$ is the height of the $k$-th merge from the top (merging the last $k$ clusters into $k-1$).

Method 2 — Consistent percentage:

Cut at a height corresponding to a fixed percentage (e.g., 70%) of the total dendrogram height range.

Method 3 — External criteria:

Use cluster validity indices (silhouette width, Calinski-Harabasz) evaluated for several cut heights to determine the optimal $K$.

💡 There is no single "correct" cut point. Combine statistical guidance with theoretical reasoning about the expected number of subgroups in your population.
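Method 1 can be automated given the list of merge heights that hierarchical software reports (a sketch; variable names are illustrative):

```python
def choose_k_by_gap(merge_heights, n):
    """Cut in the largest gap between consecutive merge heights.
    merge_heights: the n-1 merge heights; n: number of observations."""
    h = sorted(merge_heights)
    gaps = [h[i + 1] - h[i] for i in range(len(h) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    # after i+1 merges have been performed, n - (i + 1) clusters remain
    return n - (i + 1)

# small merges at 1.0-1.5, then a big jump to 8.0: cut there, leaving 3 clusters
K = choose_k_by_gap([1.0, 1.2, 1.5, 8.0, 9.0], n=6)
```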

7.3 Linkage Method Comparison

| Linkage | Properties | Cluster Shape | Recommended When |
|---|---|---|---|
| Ward's | Minimises WCSS increase | Compact, spherical | Most general-purpose analyses |
| Complete | Maximum diameter | Compact, similar size | Well-separated clusters expected |
| Average | Average pairwise distances | Moderate compactness | Balanced trade-off |
| Single | Chain-like merging | Elongated, chained | Detecting filamentary structures |
| Centroid | Distance between centroids | Moderate | Rarely recommended (inversions possible) |

⚠️ Ward's linkage with Euclidean distance (on standardised data) is the most commonly recommended default. However, it assumes roughly spherical, equally-sized clusters. If clusters are expected to differ dramatically in size or shape, consider average linkage.

7.4 Cophenetic Correlation Coefficient

The cophenetic correlation coefficient (CCC) measures how faithfully the dendrogram preserves the pairwise distances in the original data:

$$r_{coph} = \frac{\sum_{i < j}(d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i < j}(d_{ij} - \bar{d})^2 \cdot \sum_{i < j}(c_{ij} - \bar{c})^2}}$$

Where:

- $d_{ij}$ is the original distance between observations $i$ and $j$;
- $c_{ij}$ is the cophenetic distance, i.e. the dendrogram height at which $i$ and $j$ first appear in the same cluster;
- $\bar{d}$ and $\bar{c}$ are the means of the $d_{ij}$ and $c_{ij}$.

| CCC | Interpretation |
|---|---|
| $> 0.75$ | Good representation |
| $0.60 - 0.75$ | Acceptable |
| $< 0.60$ | Poor — hierarchical structure may not be appropriate |

A high CCC ($> 0.75$) indicates that the dendrogram is a faithful representation of the original distances. Compare the CCC across linkage methods and choose the one with the highest value.
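Given the two sets of pairwise distances as flat lists over the same $i < j$ pairs, the CCC is an ordinary Pearson correlation (a sketch; clustering software computes the cophenetic distances for you):

```python
def cophenetic_corr(d, c):
    """Pearson correlation between original distances d and cophenetic
    distances c, each a flat list over the same i<j pairs."""
    n = len(d)
    db, cb = sum(d) / n, sum(c) / n
    num = sum((x - db) * (y - cb) for x, y in zip(d, c))
    den = (sum((x - db) ** 2 for x in d) * sum((y - cb) ** 2 for y in c)) ** 0.5
    return num / den

r = cophenetic_corr([1.0, 2.0, 3.0, 9.0], [1.5, 1.5, 4.0, 8.0])
```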

7.5 Hierarchical Clustering for Large Datasets

Standard agglomerative hierarchical clustering has time complexity $O(n^2 \log n)$ and space complexity $O(n^2)$ — it requires storing the full $n \times n$ distance matrix. For large $n > 5{,}000$, consider:

- Running hierarchical clustering on a random subsample, then assigning the remaining observations to the nearest cluster centroid;
- Switching to a scalable partitional method such as $K$-means;
- Using a scalable hierarchical variant (e.g., BIRCH).

7.6 Dissimilarity Matrix Heatmap

A dissimilarity heatmap displays the $n \times n$ distance matrix as a colour-coded image, with observations reordered according to the dendrogram. In a well-structured dataset:

- dark (low-distance) square blocks appear along the diagonal, one per cluster;
- the off-diagonal regions between blocks are light (high distance).

This visual provides an immediate, intuitive check on the quality and separation of the cluster solution.


8. K-Means and K-Medoids Clustering

8.1 Choosing K — The Elbow Method

The elbow method plots WCSS (total within-cluster sum of squares) against $K$ and looks for an "elbow" — a point where the rate of decrease in WCSS slows dramatically:

$$\text{WCSS}(K) = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k}\|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2$$

As $K$ increases from 1 to $n$:

- WCSS decreases monotonically, reaching 0 when $K = n$ (every observation is its own cluster);
- the largest drops occur while genuine clusters are still being separated; beyond the true number of clusters, further increases in $K$ yield only small reductions.

Limitation: The elbow is often subjective — in real data, the curve is frequently smooth without a clear kink. Combine with silhouette analysis and the gap statistic.

8.2 The Silhouette Coefficient

For each observation $i$ assigned to cluster $C_k$, the silhouette coefficient measures how well the observation fits its assigned cluster compared to the nearest other cluster:

$a(i)$: Average distance from observation $i$ to all other observations in the same cluster $C_k$ (intra-cluster cohesion):

$$a(i) = \frac{1}{|C_k| - 1}\sum_{j \in C_k, j \neq i} d(i, j)$$

$b(i)$: Minimum average distance from observation $i$ to all observations in any other cluster (inter-cluster separation):

$$b(i) = \min_{l \neq k} \frac{1}{|C_l|}\sum_{j \in C_l} d(i, j)$$

Silhouette coefficient for observation $i$:

$$s(i) = \frac{b(i) - a(i)}{\max[a(i), b(i)]}$$

It ranges from $-1$ to $+1$:

- $s(i)$ near $+1$: the observation is well matched to its cluster;
- $s(i)$ near $0$: the observation lies on the boundary between two clusters;
- $s(i) < 0$: the observation is probably assigned to the wrong cluster.

Average silhouette width (ASW): The mean $s(i)$ across all observations:

$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i)$$

Choose the $K$ that maximises $\bar{s}$.

| ASW | Cluster Structure Interpretation |
|---|---|
| $> 0.70$ | Strong structure |
| $0.51 - 0.70$ | Reasonable structure |
| $0.26 - 0.50$ | Weak structure — may be artificial |
| $\leq 0.25$ | No substantial structure |
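The $a(i)$, $b(i)$, and $s(i)$ definitions above can be computed directly (a sketch that assumes every cluster has at least two members):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def average_silhouette(X, labels, dist=euclidean):
    n = len(X)
    total = 0.0
    for i in range(n):
        # a(i): mean distance to the other members of i's own cluster
        own = [dist(X[i], X[j]) for j in range(n)
               if j != i and labels[j] == labels[i]]
        a = sum(own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(dist(X[i], X[j]) for j in range(n) if labels[j] == l)
                / labels.count(l)
                for l in set(labels) if l != labels[i])
        total += (b - a) / max(a, b)
    return total / n

asw = average_silhouette([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
                         [0, 0, 1, 1])
```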

8.3 The Gap Statistic

The gap statistic (Tibshirani, Walther & Hastie, 2001) compares the observed WCSS for $K$ clusters to the expected WCSS under a null reference distribution (uniform data with no cluster structure):

$$\text{Gap}(K) = E^*[\log \text{WCSS}(K)] - \log \text{WCSS}(K)$$

Where $E^*[\cdot]$ is the expectation under the null reference distribution (estimated by Monte Carlo simulation of $B = 100$ uniform datasets).

Choosing $K$: Select the smallest $K$ such that:

$$\text{Gap}(K) \geq \text{Gap}(K+1) - s_{K+1}$$

Where $s_{K+1} = SD[\log \text{WCSS}(K+1)] \cdot \sqrt{1 + 1/B}$ is the simulation error.

The gap statistic accounts for sampling variability and provides a more principled stopping rule than the elbow method.

8.4 Calinski-Harabasz Index (Variance Ratio Criterion)

The Calinski-Harabasz (CH) index (also called the Variance Ratio Criterion) measures the ratio of between-cluster variance to within-cluster variance:

CH(K)=BCSS/(K1)WCSS/(nK)CH(K) = \frac{\text{BCSS}/(K-1)}{\text{WCSS}/(n-K)}

Where:

- BCSS is the between-cluster sum of squares, \sum_{k=1}^{K} n_k \|\bar{\mathbf{x}}_k - \bar{\mathbf{x}}\|^2, with \bar{\mathbf{x}} the grand centroid.
- WCSS is the within-cluster sum of squares defined earlier.

Higher CH index = better-defined clusters. Choose KK that maximises CH(K)CH(K).

8.5 Davies-Bouldin Index

The Davies-Bouldin (DB) index measures the average similarity between each cluster and its most similar other cluster:

DB(K)=1Kk=1Kmaxlk(σk+σld(xˉk,xˉl))DB(K) = \frac{1}{K}\sum_{k=1}^{K} \max_{l \neq k} \left(\frac{\sigma_k + \sigma_l}{d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)}\right)

Where:

- σ_k is the average distance of the observations in cluster k to their centroid x̄_k (the cluster's scatter).
- d(x̄_k, x̄_l) is the distance between the centroids of clusters k and l.

Lower DB index = better separation. Choose KK that minimises DB(K)DB(K).
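Both indices can be computed from the cluster assignments alone. A compact sketch (illustrative helper names; Euclidean distance assumed):

```python
def ch_db(points, labels):
    """Calinski-Harabasz and Davies-Bouldin indices for a labelled partition."""
    def mean(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    groups = [[p for p, l in zip(points, labels) if l == g] for g in sorted(set(labels))]
    K, n = len(groups), len(points)
    grand = mean(points)
    cents = [mean(g) for g in groups]
    wcss = sum(d(p, c) ** 2 for g, c in zip(groups, cents) for p in g)
    bcss = sum(len(g) * d(c, grand) ** 2 for g, c in zip(groups, cents))
    ch = (bcss / (K - 1)) / (wcss / (n - K))
    # sigma_k: average distance of cluster members to their centroid
    sigma = [sum(d(p, c) for p in g) / len(g) for g, c in zip(groups, cents)]
    db = sum(max((sigma[k] + sigma[l]) / d(cents[k], cents[l])
                 for l in range(K) if l != k) for k in range(K)) / K
    return ch, db

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
ch, db = ch_db(pts, [0, 0, 0, 1, 1, 1])   # well-separated blobs: high CH, low DB
```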

8.6 K-Medoids (PAM — Partitioning Around Medoids)

KK-medoids is a robust alternative to KK-means that represents each cluster by an actual medoid — the observation within the cluster that minimises the total dissimilarity to all other cluster members:

medoid(Ck)=argminxiCkxjCkd(xi,xj)\text{medoid}(C_k) = \arg\min_{\mathbf{x}_i \in C_k} \sum_{\mathbf{x}_j \in C_k} d(\mathbf{x}_i, \mathbf{x}_j)

PAM algorithm:

Step 1 (BUILD): Sequentially select KK initial medoids, at each step adding the observation that most reduces the total dissimilarity between observations and their nearest medoid.

Step 2 (SWAP): For each medoid mm and each non-medoid observation oo:

- Compute the change in total cost that would result from swapping mm and oo.
- If the best available swap reduces the total cost, perform it.

Step 3: Repeat until no improvement is possible.

Objective function:

Total Cost=k=1KxiCkd(xi,medoidk)\text{Total Cost} = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k} d(\mathbf{x}_i, \text{medoid}_k)

Advantages of K-Medoids over K-Means:

- Robust to outliers: medoids are actual observations, and the objective sums unsquared dissimilarities, so extreme points exert far less influence.
- Works with any dissimilarity measure (Manhattan, Gower's for mixed data), not just squared Euclidean distance.
- Each cluster is represented by an interpretable exemplar: a real observation rather than an averaged centroid.

Disadvantage: Computationally more expensive than KK-means (O(K(nK)2)O(K(n-K)^2) per iteration).
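The BUILD/SWAP loop can be sketched against a precomputed dissimilarity matrix. This toy implementation (illustrative, written for clarity rather than speed) returns medoid indices as the cluster labels:

```python
def pam(dist, k):
    """Tiny PAM sketch: `dist` is a full n x n dissimilarity matrix."""
    n = len(dist)
    def cost(meds):
        return sum(min(dist[i][m] for m in meds) for i in range(n))
    # BUILD: greedily add the medoid that most reduces total cost.
    medoids = []
    while len(medoids) < k:
        best = min((i for i in range(n) if i not in medoids),
                   key=lambda i: cost(medoids + [i]))
        medoids.append(best)
    # SWAP: replace a medoid with a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                trial = [o if x == m else x for x in medoids]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
D = [[euclid(p, q) for q in pts] for p in pts]
medoids, labels = pam(D, 2)
```

Because PAM only ever consults the dissimilarity matrix, the same code works unchanged with Manhattan or Gower dissimilarities.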

8.7 K-Modes for Categorical Data

When all variables are categorical (nominal), KK-means and KK-medoids are not directly applicable. KK-modes (Huang, 1998) extends KK-means by replacing cluster means with cluster modes (the most frequent category of each variable within the cluster) and by replacing the Euclidean distance with the simple matching dissimilarity:

d(xi,xj)=k=1p1[xikxjk]d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p} \mathbf{1}[x_{ik} \neq x_{jk}]

For mixed data (continuous + categorical), KK-prototypes combines a squared Euclidean term for the continuous variables with a weighted simple-matching term for the categorical variables:

d(xi,xj)=k=1pc(xikxjk)2+γk=1pd1[xikxjk]d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p_c}(x_{ik} - x_{jk})^2 + \gamma \sum_{k=1}^{p_d}\mathbf{1}[x_{ik} \neq x_{jk}]

Where pcp_c = number of continuous variables, pdp_d = number of categorical variables, and γ>0\gamma > 0 is a weight balancing the two contributions.
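The K-prototypes dissimilarity is straightforward to express in code. A sketch assuming the first `n_cont` entries of each record are continuous and the rest categorical (the names here are illustrative):

```python
def kproto_dist(x, y, n_cont, gamma=1.0):
    """K-prototypes dissimilarity: squared Euclidean distance on the first
    `n_cont` (continuous) entries plus gamma times the number of mismatches
    among the remaining (categorical) entries."""
    cont = sum((a - b) ** 2 for a, b in zip(x[:n_cont], y[:n_cont]))
    cat = sum(a != b for a, b in zip(x[n_cont:], y[n_cont:]))
    return cont + gamma * cat

# Continuous part contributes (4 - 2)^2 = 4; one categorical mismatch adds gamma.
d = kproto_dist((1.0, 2.0, "red", "A"), (1.0, 4.0, "red", "B"), n_cont=2, gamma=0.5)
```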


9. Model-Based and Density-Based Clustering

9.1 Gaussian Mixture Models (GMM) — Model-Based Clustering

Model-based clustering (Fraley & Raftery, 2002) treats clustering as a model selection problem. The data are assumed to arise from a GMM, and the optimal number of clusters KK and covariance structure are selected using the Bayesian Information Criterion (BIC).

Covariance matrix parameterisation:

The key advantage of GMM is the ability to model different cluster shapes, sizes, and orientations by parameterising the covariance matrices Σk\boldsymbol{\Sigma}_k.

Using eigenvalue decomposition of Σk\boldsymbol{\Sigma}_k:

Σk=λkDkAkDkT\boldsymbol{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}_k^T

Where:

- λ_k is a scalar controlling the volume of cluster k.
- D_k is the matrix of eigenvectors, controlling the orientation of cluster k.
- A_k is a diagonal matrix of normalised eigenvalues, controlling the shape of cluster k.

By constraining or freeing these three components across clusters, 14 different model types are defined in the mclust framework:

| Model | Volume | Shape | Orientation | Description |
|---|---|---|---|---|
| EII | Equal | Spherical | — | K-means-like spherical clusters |
| VII | Variable | Spherical | — | Spherical clusters of different sizes |
| EEI | Equal | Equal | Axis-aligned | Equal diagonal covariance |
| VEI | Variable | Equal | Axis-aligned | Variable-size axis-aligned |
| EEE | Equal | Equal | Equal | Equal ellipsoidal (all same shape) |
| VVV | Variable | Variable | Variable | Fully unconstrained (most flexible) |

BIC for model selection:

BIC(K,model)=2^qln(n)\text{BIC}(K, \text{model}) = 2\hat{\ell} - q \ln(n)

Where ^\hat{\ell} is the maximised log-likelihood and qq is the number of free parameters. Higher BIC = better model (note: some software uses 2^+qln(n)-2\hat{\ell} + q\ln(n); here higher means better for the positive convention used in mclust).

ICL (Integrated Complete-data Likelihood):

ICL(K)=BIC(K)+2i=1nk=1Kr^ikln(r^ik)\text{ICL}(K) = \text{BIC}(K) + 2\sum_{i=1}^{n}\sum_{k=1}^{K} \hat{r}_{ik}\ln(\hat{r}_{ik})

ICL adds an entropy penalty for fuzzy assignments. Models where observations are clearly assigned (low entropy) are favoured by ICL over BIC.

9.2 Soft Cluster Membership in GMM

A major advantage of GMM over KK-means is the production of posterior probabilities of cluster membership for each observation. After EM convergence:

P(cluster=kxi)=rik=π^kN(xiμ^k,Σ^k)j=1Kπ^jN(xiμ^j,Σ^j)P(\text{cluster} = k \mid \mathbf{x}_i) = r_{ik} = \frac{\hat{\pi}_k \mathcal{N}(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)}{\sum_{j=1}^{K} \hat{\pi}_j \mathcal{N}(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}_j)}

These probabilities allow identification of:

- Core observations assigned with near-certainty (max_k r_ik close to 1).
- Boundary observations whose membership is split across two or more clusters.
- A per-observation uncertainty measure, e.g. 1 - max_k r_ik, for flagging ambiguous cases.
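The posterior formula reduces to a few lines of code. A univariate sketch (illustrative, with hand-set parameters rather than EM-fitted ones):

```python
import math

def responsibilities(x, pis, mus, sigmas):
    """Posterior membership probabilities r_ik for one observation x under a
    univariate Gaussian mixture with the given weights, means, and SDs."""
    def phi(x, mu, s):  # normal density
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    w = [p * phi(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(w)
    return [v / total for v in w]

# An observation near the first component is assigned to it almost certainly.
r = responsibilities(0.2, pis=[0.5, 0.5], mus=[0.0, 5.0], sigmas=[1.0, 1.0])
```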

9.3 DBSCAN — Density-Based Spatial Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise; Ester et al., 1996) identifies clusters as dense regions of the data space, separated by low-density regions. It does not require specifying KK in advance and can identify outliers (noise points).

Key parameters:

- ε (epsilon): the radius that defines each point's neighbourhood.
- MinPts: the minimum number of points (including the point itself) that must fall within that radius for the point to count as a core point.

Point classification:

A point xi\mathbf{x}_i is classified as:

Core point: Nε(xi)MinPts|\mathcal{N}_\varepsilon(\mathbf{x}_i)| \geq \text{MinPts}

Where Nε(xi)={xj:d(xi,xj)ε}\mathcal{N}_\varepsilon(\mathbf{x}_i) = \{\mathbf{x}_j : d(\mathbf{x}_i, \mathbf{x}_j) \leq \varepsilon\} is the ε\varepsilon-neighbourhood of xi\mathbf{x}_i.

Border point: Not a core point, but within distance ε\varepsilon of a core point.

Noise point (outlier): Neither a core point nor a border point.

DBSCAN algorithm:

  1. Label all points as unvisited.
  2. For each unvisited core point pp: mark as visited; create a new cluster CC; add all points in Nε(p)\mathcal{N}_\varepsilon(p) to CC.
  3. For each newly added point qq in CC: if qq is a core point, add all points in Nε(q)\mathcal{N}_\varepsilon(q) to CC (density reachability).
  4. Continue until no more points can be added to CC; label all non-core border points.
  5. Mark remaining unvisited non-core points as noise.
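The expansion loop above can be condensed into a short sketch (pure Python with a quadratic neighbourhood search; illustrative rather than production code). Noise points receive the label -1:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point; -1 marks noise."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    n = len(points)
    # epsilon-neighbourhoods (each point neighbours itself)
    nbrs = [[j for j in range(n) if d(points[i], points[j]) <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None or len(nbrs[i]) < min_pts:
            continue                 # already clustered, or not a core point
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs[i])
        while queue:                 # grow the cluster by density reachability
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster
                if len(nbrs[j]) >= min_pts:   # core point: expand further
                    queue.extend(nbrs[j])
    return [l if l is not None else -1 for l in labels]

# Two dense blobs and one isolated outlier.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (10.0, -10.0)]
labels = dbscan(pts, eps=0.5, min_pts=2)
```

Note how the outlier is never reached from any core point and therefore stays labelled -1, with no need to specify K.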

Key properties of DBSCAN:

- The number of clusters is not specified in advance; it emerges from the density structure.
- Clusters may take arbitrary, non-spherical shapes.
- Outliers are explicitly labelled as noise rather than forced into a cluster.
- Results are sensitive to ε and MinPts, and a single ε cannot accommodate clusters of very different densities.

9.4 Choosing DBSCAN Parameters

Setting MinPts:

- Rule of thumb: MinPts ≥ p + 1, with MinPts = 2p a common default; use larger values for noisy data or larger n.

Setting ε\varepsilon:

- Plot each point's distance to its MinPts-th nearest neighbour in increasing order (the k-NN distance plot) and set ε at the knee of the curve:

εdMinPtsNN,knee\varepsilon^* \approx d_{MinPts-NN, \text{knee}}

9.5 HDBSCAN — Hierarchical DBSCAN

HDBSCAN extends DBSCAN to handle varying density clusters by transforming the space using the concept of mutual reachability distance:

dmreach,MinPts(xi,xj)=max[coreMinPts(xi),coreMinPts(xj),d(xi,xj)]d_{mreach,MinPts}(\mathbf{x}_i, \mathbf{x}_j) = \max[\text{core}_{MinPts}(\mathbf{x}_i), \text{core}_{MinPts}(\mathbf{x}_j), d(\mathbf{x}_i, \mathbf{x}_j)]

Where coreMinPts(xi)\text{core}_{MinPts}(\mathbf{x}_i) is the MinPts-nearest neighbour distance (the core distance) of point xi\mathbf{x}_i.

HDBSCAN builds a hierarchical clustering tree on this transformed space and extracts a flat clustering by selecting the most stable clusters across all density levels. Only MinPts needs to be specified (not ε\varepsilon).

9.6 Fuzzy C-Means

Fuzzy C-Means (FCM) is a soft-assignment generalisation of KK-means. Each observation has a degree of membership uik[0,1]u_{ik} \in [0, 1] in each cluster kk, with k=1Kuik=1\sum_{k=1}^K u_{ik} = 1.

FCM objective function:

Jm=i=1nk=1Kuikmxixˉk2J_m = \sum_{i=1}^{n}\sum_{k=1}^{K} u_{ik}^m \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2

Where m>1m > 1 is the fuzziness parameter (typically m=2m = 2):

- As m approaches 1 from above, memberships harden towards 0/1 and FCM approaches KK-means.
- Larger m yields fuzzier, more uniform memberships.

FCM update equations:

Centroids:

xˉk=i=1nuikmxii=1nuikm\bar{\mathbf{x}}_k = \frac{\sum_{i=1}^{n} u_{ik}^m \mathbf{x}_i}{\sum_{i=1}^{n} u_{ik}^m}

Memberships:

uik=1j=1K(xixˉkxixˉj)2/(m1)u_{ik} = \frac{1}{\sum_{j=1}^{K}\left(\frac{\|\mathbf{x}_i - \bar{\mathbf{x}}_k\|}{\|\mathbf{x}_i - \bar{\mathbf{x}}_j\|}\right)^{2/(m-1)}}

Iterate until convergence (U(t+1)U(t)<ε\|U^{(t+1)} - U^{(t)}\| < \varepsilon).
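Alternating the two update equations gives the full algorithm. A compact sketch (illustrative, with random initial memberships and a small floor guarding against zero distances):

```python
import random

def fcm(points, k, m=2.0, iters=100, tol=1e-6, seed=0):
    """Fuzzy C-Means sketch: returns (centroids, memberships)."""
    rng = random.Random(seed)
    n, p = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    for _ in range(iters):
        # Centroid update: membership-weighted means.
        C = []
        for j in range(k):
            w = [U[i][j] ** m for i in range(n)]
            C.append(tuple(sum(w[i] * points[i][t] for i in range(n)) / sum(w)
                           for t in range(p)))
        # Membership update (floor the distance so a point on a centroid is safe).
        newU = []
        for i in range(n):
            ds = [max(dist(points[i], C[j]), 1e-12) for j in range(k)]
            newU.append([1.0 / sum((ds[j] / ds[l]) ** (2 / (m - 1)) for l in range(k))
                         for j in range(k)])
        shift = max(abs(newU[i][j] - U[i][j]) for i in range(n) for j in range(k))
        U = newU
        if shift < tol:
            break
    return C, U

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
C, U = fcm(pts, k=2)   # memberships approach 0/1 for well-separated blobs
```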


10. Model Fit and Evaluation

Evaluating cluster quality is fundamentally more challenging than evaluating supervised model fit because there is no ground truth to compare against. Multiple complementary criteria must be consulted.

10.1 Internal Validity Indices

Internal validity indices assess cluster quality using only the clustering data and the resulting cluster assignments.

Average Silhouette Width (ASW):

sˉ=1ni=1ns(i),s(i)=b(i)a(i)max[a(i),b(i)]\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i), \quad s(i) = \frac{b(i) - a(i)}{\max[a(i), b(i)]}

Higher is better. Optimal KK = KK that maximises sˉ\bar{s}.

Calinski-Harabasz (CH) Index:

CH=(nK)BCSS(K1)WCSSCH = \frac{(n - K) \cdot \text{BCSS}}{(K - 1) \cdot \text{WCSS}}

Higher is better.

Davies-Bouldin (DB) Index:

DB=1Kk=1Kmaxlkσk+σld(xˉk,xˉl)DB = \frac{1}{K}\sum_{k=1}^{K}\max_{l \neq k}\frac{\sigma_k + \sigma_l}{d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)}

Lower is better.

Dunn Index:

D=minkldmin(Ck,Cl)maxmΔ(Cm)D = \frac{\min_{k \neq l} d_{min}(C_k, C_l)}{\max_m \Delta(C_m)}

Where dmin(Ck,Cl)=minxCk,yCld(x,y)d_{min}(C_k, C_l) = \min_{\mathbf{x} \in C_k, \mathbf{y} \in C_l} d(\mathbf{x}, \mathbf{y}) is the minimum inter-cluster distance and Δ(Cm)=maxx,yCmd(x,y)\Delta(C_m) = \max_{\mathbf{x}, \mathbf{y} \in C_m} d(\mathbf{x}, \mathbf{y}) is the diameter (maximum intra-cluster distance) of cluster mm.

Higher Dunn index = better separation and compactness. Sensitive to outliers.

Summary of Internal Validity Indices:

| Index | Optimal KK | Better Solution | Sensitivity |
|---|---|---|---|
| ASW | Maximise | Higher | Moderate |
| CH | Maximise | Higher | Low |
| DB | Minimise | Lower | Moderate |
| Dunn | Maximise | Higher | High (outliers) |
| WCSS (Elbow) | Elbow point | Lower | Low |
| Gap statistic | First local max | Higher | Moderate |

10.2 External Validity Indices

When true cluster labels are available (e.g., in a validation or simulation study), external validity indices compare the clustering result to the known ground truth.

Adjusted Rand Index (ARI):

The Rand Index counts agreements between pairs of observations in the clustering and in the true partition. The Adjusted Rand Index corrects for chance agreement:

ARI=RIE[RI]max(RI)E[RI]\text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}

More explicitly, for contingency table N\mathbf{N} with elements nijn_{ij} (number of objects in cluster ii of solution 1 AND cluster jj of solution 2):

ARI=ij(nij2)[i(ai2)j(bj2)]/(n2)12[i(ai2)+j(bj2)][i(ai2)j(bj2)]/(n2)\text{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}

Where ai=jnija_i = \sum_j n_{ij} and bj=inijb_j = \sum_i n_{ij} are row and column marginals.

| ARI | Interpretation |
|---|---|
| 1.0 | Perfect agreement |
| 0.61 – 0.99 | Excellent |
| 0.41 – 0.60 | Good |
| 0.21 – 0.40 | Fair |
| 0.00 – 0.20 | Slight / random agreement |
| < 0 | Worse than random |
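The contingency-table formula translates directly into code. A sketch using only the standard library (note the index is undefined when both partitions consist of a single cluster):

```python
from collections import Counter

def ari(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    def comb2(x):
        return x * (x - 1) // 2
    n = len(labels_a)
    nij = Counter(zip(labels_a, labels_b))   # contingency table counts
    ai = Counter(labels_a)                   # row marginals
    bj = Counter(labels_b)                   # column marginals
    sum_ij = sum(comb2(v) for v in nij.values())
    sum_a = sum(comb2(v) for v in ai.values())
    sum_b = sum(comb2(v) for v in bj.values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions score 1 even when the label values themselves differ.
perfect = ari([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
```

Label-permutation invariance is the key property here: ARI compares the partitions, not the arbitrary cluster numbers.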

Normalised Mutual Information (NMI):

NMI(U,V)=2I(U;V)H(U)+H(V)\text{NMI}(U, V) = \frac{2 \cdot I(U; V)}{H(U) + H(V)}

Where I(U;V)I(U;V) is the mutual information and H(U)H(U), H(V)H(V) are the entropies of the two clusterings:

I(U;V)=k=1Kl=1LnklnlnnnklnknlI(U;V) = \sum_{k=1}^{K}\sum_{l=1}^{L} \frac{n_{kl}}{n} \ln\frac{n \cdot n_{kl}}{n_k \cdot n_l}

NMI ranges from 0 (no agreement) to 1 (perfect agreement). Does not correct for chance (use AMI — Adjusted Mutual Information — for a chance-corrected version).

10.3 Relative Validity — Stability Indices

Cluster stability assesses whether the cluster solution is reproducible across subsamples of the data. A stable solution will produce similar cluster assignments when the analysis is repeated on bootstrap samples.

Bootstrap stability algorithm:

  1. Draw B=100B = 100 bootstrap samples of size nn (with replacement).
  2. Apply the clustering algorithm to each bootstrap sample.
  3. For each bootstrap solution, match clusters to the original solution (using the Hungarian algorithm or maximum overlap matching).
  4. Compute the mean ARI between the original and bootstrap solutions:

Stability=1Bb=1BARI(C^,C^b)\text{Stability} = \frac{1}{B}\sum_{b=1}^{B} \text{ARI}(\hat{C}, \hat{C}_b^*)

| Stability | Interpretation |
|---|---|
| > 0.85 | Very stable — reproducible solution |
| 0.75 – 0.85 | Stable |
| 0.60 – 0.74 | Moderately stable |
| < 0.60 | Unstable — solution may not replicate |

10.4 BIC and AIC for Model-Based Clustering

For GMM, the optimal number of components KK is selected using BIC:

BIC=2^qln(n)\text{BIC} = 2\hat{\ell} - q\ln(n)

Select KK that maximises BIC (using the mclust convention where higher = better).

AIC:

AIC=2^2q\text{AIC} = 2\hat{\ell} - 2q

AIC tends to select more components than BIC because its complexity penalty (2q) is weaker than BIC's (q ln n) for any n > e² ≈ 7.4.

10.5 Comparing KK Values — A Comprehensive Decision Framework

| Criterion | Supports Low KK | Supports High KK | Best Used With |
|---|---|---|---|
| Elbow (WCSS) | Elbow is early | Elbow is late | K-Means |
| Silhouette | High s̄ at low K | High s̄ at high K | Any method |
| Gap Statistic | Early plateau | Late plateau | Any method |
| CH Index | Peak at low K | Peak at high K | K-Means, Hierarchical |
| DB Index | Minimum at low K | Minimum at high K | K-Means, Hierarchical |
| BIC (GMM) | Peak at low K | Peak at high K | GMM only |
| Bootstrap stability | High stability at low K | High stability at high K | Any method |
| Theoretical expectations | Strong prior for few types | Strong prior for many types | All analyses |

💡 Best practice: Evaluate the clustering solution at K1K-1, KK, and K+1K+1 for the statistically suggested KK. Choose the solution that is most theoretically interpretable AND statistically defensible. Document the rationale for the final KK selection.

10.6 Cluster Profiling and Interpretation

After selecting the final clustering solution, profile each cluster by computing:

- Cluster sizes (n and % of the sample).
- Mean and SD of each clustering variable per cluster, in both raw and standardised units.
- The variables that most distinguish each cluster (largest standardised deviations from the grand mean).
- Exemplar observations: those closest to each centroid or medoid.

A cluster profile heatmap (rows = clusters, columns = variables, cells = standardised mean) provides an intuitive summary of how clusters differ across all variables simultaneously.


11. Advanced Topics

11.1 Dimensionality Reduction Before Clustering

High-dimensional data (large pp) pose several challenges for clustering:

- Distance concentration: as p grows, pairwise Euclidean distances become increasingly similar, weakening the contrast between near and far neighbours (the "curse of dimensionality").
- Irrelevant or noisy variables can mask the true cluster structure.
- Computation, visualisation, and interpretation all become harder.

Solution 1 — PCA before clustering:

Apply Principal Component Analysis to reduce pp to ppp' \ll p dimensions that capture most of the variance:

Z=XVp\mathbf{Z} = \mathbf{X}\mathbf{V}_{p'}

Where Vp\mathbf{V}_{p'} contains the first pp' principal components. Cluster on Z\mathbf{Z} instead of X\mathbf{X}. Retain enough components to explain 80%\geq 80\% of total variance.

Solution 2 — UMAP/t-SNE for visualisation:

For visualising high-dimensional clusters in 2D:

- t-SNE emphasises local neighbourhood structure; apparent cluster sizes and inter-cluster distances in the embedding are not meaningful.
- UMAP is typically faster and preserves more global structure, but the same caveat about embedded distances applies.

⚠️ Never cluster on t-SNE or UMAP embeddings — their stochastic, non-linear nature distorts distances in ways that invalidate distance-based clustering. Use PCA for pre-clustering dimensionality reduction; use t-SNE/UMAP only for visualisation.

11.2 Cluster Validation with External Variables

After obtaining a cluster solution, validate it by examining whether clusters differ meaningfully on variables that were NOT used in forming the clusters (held-out variables):

Criterion validity: Do clusters differ significantly on a relevant outcome variable not used in clustering?

Fcriterion=BCSScriterion/(K1)WCSScriterion/(nK)F_{\text{criterion}} = \frac{\text{BCSS}_{criterion}/(K-1)}{\text{WCSS}_{criterion}/(n-K)}

Predictive validity: Can cluster membership predict future outcomes? Fit a regression model (linear, logistic, or survival, as appropriate) predicting a later, domain-relevant outcome from cluster membership. Clusters that predict the outcome substantially better than chance support the validity of the solution.

Cross-tabulation with known groups: If true group membership is partially known (e.g., diagnosed patients vs. healthy controls), compare cluster memberships against true groups using chi-squared tests and ARI.

11.3 Mixed-Type Data Clustering

Real-world datasets often contain a mixture of continuous, ordinal, and nominal variables. Three main strategies exist:

Strategy 1 — Gower's distance + hierarchical/K-Medoids:

Compute Gower's distance matrix and feed it into hierarchical clustering or PAM. Most widely recommended approach for mixed data.

Strategy 2 — Factor Analysis of Mixed Data (FAMD):

Apply FAMD to extract factor scores that accommodate both continuous (PCA-style) and categorical (MCA-style) variables, then cluster on the factor scores.

Strategy 3 — Latent Class Analysis (LCA):

Model-based approach where the cluster structure is defined by a joint probability model over all variable types. Each "class" is characterised by a probability distribution over all variables. Suitable for primarily categorical data.

11.4 Longitudinal Clustering

When the same subjects are measured repeatedly over time, trajectory clustering (also called group-based trajectory modelling or latent growth curve clustering) identifies subgroups with similar temporal patterns.

Group-Based Trajectory Model (Nagin, 2005):

P(class=k)=πkP(\text{class} = k) = \pi_k

For each class kk, the trajectory of outcome YitY_{it} over time tt is modelled as:

E(Yitclass=k)=βk0+βk1t+βk2t2+E(Y_{it} \mid \text{class} = k) = \beta_{k0} + \beta_{k1}t + \beta_{k2}t^2 + \dots

Model selection is based on BIC across models with different numbers of trajectory groups and different polynomial orders for each group.

11.5 Ensemble Clustering (Consensus Clustering)

Consensus clustering combines multiple clustering solutions (from different algorithms, different KK values, or different random starts) to produce a more stable and robust final partition.

Algorithm:

  1. Generate BB clustering solutions: {C^1,C^2,,C^B}\{\hat{C}_1, \hat{C}_2, \dots, \hat{C}_B\}.
  2. Build the co-association matrix A\mathbf{A} where AijA_{ij} = proportion of solutions in which observations ii and jj are assigned to the same cluster:

Aij=1Bb=1B1[c^i(b)=c^j(b)]A_{ij} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}[\hat{c}_i^{(b)} = \hat{c}_j^{(b)}]

  3. Apply hierarchical clustering to A\mathbf{A} (treating it as a similarity matrix).
  4. The consensus clustering is the result of cutting this dendrogram at the desired KK.

A cluster solution with high consensus (many entries in A\mathbf{A} near 0 or 1) is more stable than one with many entries near 0.5 (indicating ambiguous assignments).
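Step 2, the co-association matrix, is a one-liner once the label vectors are collected (a sketch with illustrative names):

```python
def co_association(solutions):
    """Co-association matrix A: A[i][j] is the fraction of clustering
    solutions that place observations i and j in the same cluster."""
    B, n = len(solutions), len(solutions[0])
    return [[sum(sol[i] == sol[j] for sol in solutions) / B for j in range(n)]
            for i in range(n)]

# Pairs (0,1) and (1,2) co-occur in half the solutions; pair (0,2) never does.
A = co_association([[0, 0, 1], [0, 1, 1]])
```

Entries near 0.5, like those here, are precisely the ambiguous assignments that signal an unstable consensus.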

11.6 Reporting Cluster Analysis Results

Best-practice reporting for cluster analysis includes:

Pre-analysis decisions:

- Variables included in the clustering and the rationale for their selection.
- Standardisation method and distance/dissimilarity measure.
- Algorithm (and linkage, for hierarchical clustering) with justification.
- Range of KK values considered and the criteria used to choose among them.

Main results:

- The final KK with the statistical and theoretical rationale for selecting it.
- Internal validity indices (at minimum ASW) for the chosen and adjacent KK.
- Stability evidence (e.g., bootstrap mean ARI).

Cluster characterisation:

- Cluster sizes and a profile table of means/SDs per variable.
- Substantive cluster labels grounded in the profiles.

Visualisations:

- Dendrogram (for hierarchical solutions), silhouette plot, cluster scatter in PCA space, and/or a cluster profile heatmap.


12. Worked Examples

Example 1: Hierarchical Clustering — Patient Subtype Identification

A clinical researcher collects data on n=80n = 80 patients with chronic fatigue syndrome on five symptom severity scores (each 0–10): Pain (X1X_1), Fatigue (X2X_2), Cognitive Impairment (X3X_3), Sleep Disturbance (X4X_4), and Mood (X5X_5).

Step 1 — Standardise variables:

All five variables are on the same 0–10 scale with similar SDs — standardisation applied for consistency.

Step 2 — Compute distance matrix:

Euclidean distances on standardised scores.

Step 3 — Run Ward's hierarchical clustering.

Step 4 — Evaluate the dendrogram:

Dendrogram heights of final merges (from bottom up): h5=1.2,h4=1.8,h3=4.1,h2=8.9,h1=14.2h_5 = 1.2, h_4 = 1.8, h_3 = 4.1, h_2 = 8.9, h_1 = 14.2

Height gaps: 0.6, 2.3, 4.8, 5.3 → the last two merges both involve large jumps, so the dendrogram supports either 3 or 2 clusters. The jump of Δ = 4.8 between h3 and h2 already marks a natural cut at 3 clusters, and the validity indices below confirm this choice.

Cophenetic correlation: rcoph=0.82r_{coph} = 0.82 → Good representation.

Step 5 — Evaluate 3-cluster solution:

| Index | K=2 | K=3 | K=4 |
|---|---|---|---|
| ASW | 0.58 | 0.64 | 0.51 |
| CH | 41.2 | 52.8 | 47.1 |
| DB | 0.82 | 0.61 | 0.78 |

All three indices favour K=3K = 3. Decision: 3 clusters.

Step 6 — Cluster Profiles:

| Variable | Cluster 1 (n=28) | Cluster 2 (n=31) | Cluster 3 (n=21) |
|---|---|---|---|
| Pain | 7.8 ± 1.2 | 4.2 ± 1.5 | 2.1 ± 1.0 |
| Fatigue | 8.5 ± 0.9 | 6.1 ± 1.4 | 3.5 ± 1.2 |
| Cognitive | 7.2 ± 1.5 | 3.8 ± 1.6 | 5.8 ± 1.3 |
| Sleep | 8.1 ± 1.1 | 5.5 ± 1.5 | 4.9 ± 1.4 |
| Mood | 7.4 ± 1.3 | 4.9 ± 1.6 | 6.1 ± 1.2 |

Bootstrap ARI: 0.88.

Cluster Labels:

- Cluster 1 "Severe": uniformly high severity across all five symptom domains.
- Cluster 2 "Moderate/Somatic": moderate overall severity, with pain, fatigue, and sleep disturbance more prominent than cognitive symptoms.
- Cluster 3 "Cognitive-Affective": low pain and fatigue but elevated cognitive impairment and mood disturbance.

Conclusion: Ward's hierarchical clustering identified three clinically meaningful patient subtypes. The 3-cluster solution was supported by all validity indices (ASW = 0.64, CH = 52.8, DB = 0.61) and demonstrated good bootstrap stability (ARI = 0.88). The three profiles suggest distinct symptom phenotypes that may benefit from different treatment approaches: Cluster 1 (Severe) may require multimodal comprehensive intervention; Cluster 2 (Moderate/Somatic) may respond to physical symptom management; Cluster 3 (Cognitive-Affective) may benefit from cognitive-behavioural and mood-focused therapies.


Example 2: K-Means Clustering — Customer Segmentation

A retail company collects data on n=500n = 500 customers: Annual spend (X1X_1, £), Purchase frequency (X2X_2, orders/year), Average order value (X3X_3, £), Days since last purchase (X4X_4, recency), and Customer age (X5X_5, years).

Step 1 — Standardise all variables.

Step 2 — Determine optimal K:

WCSS, silhouette, and gap statistic evaluated for K=2K = 2 to K=8K = 8:

| KK | WCSS | ΔWCSS | ASW | CH | Gap |
|---|---|---|---|---|---|
| 2 | 1842 | — | 0.52 | 285 | 0.41 |
| 3 | 1341 | 501 | 0.61 | 341 | 0.58 |
| 4 | 1052 | 289 | 0.58 | 312 | 0.62 |
| 5 | 892 | 160 | 0.51 | 287 | 0.60 |
| 6 | 798 | 94 | 0.44 | 251 | 0.55 |
| 7 | 741 | 57 | 0.38 | 218 | 0.51 |
| 8 | 712 | 29 | 0.32 | 190 | 0.48 |

- Elbow: largest ΔWCSS drop at K = 3 (501, vs. 289 at K = 4) → elbow at K = 3.
- Silhouette: maximised at K = 3 (ASW = 0.61).
- Gap statistic: Gap(3) = 0.58 ≥ Gap(4) - s_4 = 0.62 - 0.05 = 0.57. ✅

Decision: K = 3 clusters.

Step 3 — Run K-Means (50 random starts, K-Means++ init).

Step 4 — Cluster profiles (unstandardised means):

| Variable | Cluster 1 "Premium" (n=142) | Cluster 2 "Standard" (n=218) | Cluster 3 "Dormant" (n=140) |
|---|---|---|---|
| Annual Spend | £4,820 | £1,240 | £380 |
| Frequency | 18.2 orders | 8.4 orders | 2.1 orders |
| Avg Order Value | £265 | £148 | £181 |
| Days Since Purchase | 12 days | 41 days | 287 days |
| Age | 42 years | 38 years | 55 years |
| Size | 28.4% | 43.6% | 28.0% |

Bootstrap stability: Mean ARI = 0.91 (Very stable).

ANOVA results: All five variables differ significantly across clusters (p<.001p < .001).

Business Interpretation:

- Premium (28.4%): high-spend, high-frequency, recently active customers; prioritise retention and loyalty programmes.
- Standard (43.6%): mid-value customers with growth potential; target with cross-sell and frequency-building campaigns.
- Dormant (28.0%): older customers with low spend and roughly nine months since their last purchase; candidates for reactivation campaigns.


Example 3: Gaussian Mixture Model — Gene Expression Subtype Discovery

A genomics researcher analyses RNA-seq expression data for n=200n = 200 tumour samples on p=8p = 8 cancer-relevant gene expression scores (after PCA dimensionality reduction).

GMM with multiple covariance structures evaluated:

| Model | KK | BIC | ICL | Best Fit? |
|---|---|---|---|---|
| EII (spherical, equal volume) | 3 | 4821 | 4712 | No |
| EEE (ellipsoidal, equal) | 3 | 5108 | 5041 | No |
| VVV (fully unconstrained) | 2 | 4950 | 4882 | No |
| VEE (variable volume, equal shape) | 3 | 5342 | 5198 | Yes |
| VVE (variable volume and shape) | 4 | 5188 | 5014 | No |

Best model: VEE with K=3K = 3 (highest BIC = 5342).

Cluster assignments:

| Property | Cluster 1 (n=78) | Cluster 2 (n=71) | Cluster 3 (n=51) |
|---|---|---|---|
| Size (%) | 39.0% | 35.5% | 25.5% |
| Mean entropy | 0.12 | 0.18 | 0.31 |
| Low-uncertainty (prob > 0.90) | 92% | 88% | 71% |

Cluster profiles (top discriminating genes):

| Gene | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| TP53 | Low (−1.8) | High (+1.4) | Moderate (+0.3) |
| BRCA1 | Moderate (+0.2) | Low (−1.5) | High (+1.6) |
| EGFR | High (+1.7) | Low (−0.8) | Low (−0.9) |
| MYC | Moderate (+0.5) | High (+1.3) | Low (−1.4) |

Clinical validation (held-out variables not used in clustering):

| Outcome | Cluster 1 | Cluster 2 | Cluster 3 | p |
|---|---|---|---|---|
| 5-year survival | 72% | 41% | 58% | < .001 |
| Stage III/IV (%) | 28% | 68% | 45% | < .001 |
| Hormone receptor+ (%) | 82% | 31% | 61% | < .001 |

Molecular Subtypes Identified:

- Cluster 1 (EGFR-high, TP53-low): predominantly hormone receptor-positive, with the best 5-year survival (72%).
- Cluster 2 (TP53/MYC-high, BRCA1-low): enriched for stage III/IV disease, with the poorest survival (41%).
- Cluster 3 (BRCA1-high): intermediate outcomes and the least certain assignments, consistent with a transitional subtype.

Conclusion: GMM with VEE covariance structure identified three biologically meaningful and clinically validated tumour subtypes. The soft assignment probabilities revealed that Cluster 3 members were more ambiguous (lower certainty), suggesting this subtype is transitional. The strong association with 5-year survival and stage distribution (all p<.001p < .001) confirms the clinical validity of these molecular subtypes.


13. Common Mistakes and How to Avoid Them

Mistake 1: Not Standardising Variables Before Distance-Based Clustering

Problem: Variables measured on different scales (e.g., income in £10,000s and age in years) will result in the large-scale variable dominating the Euclidean distance calculation. Clusters will reflect variation in that variable rather than the true multivariate structure.
Solution: Always standardise continuous variables to zz-scores before computing Euclidean distances, unless all variables are on the same scale or Mahalanobis distance is used. Report which standardisation method was applied.

Mistake 2: Choosing KK Based Solely on the Elbow Plot

Problem: The elbow in the WCSS curve is frequently ambiguous — the "elbow" can be in different positions depending on the scale and the eye of the beholder. Over-reliance on a single criterion leads to arbitrary KK selection.
Solution: Use at least three complementary criteria: the elbow method, silhouette analysis, and the gap statistic (or BIC for GMM). Combine statistical guidance with theoretical expectations about the expected number of subgroups. Evaluate the interpretability and stability of solutions at K1K-1, KK, and K+1K+1.

Mistake 3: Running K-Means With a Single Random Start

Problem: KK-means converges to a local optimum that depends on the random initialisation. A single random start frequently produces a poor solution with unnecessarily high WCSS.
Solution: Always use multiple random starts (at minimum 20, preferably 50–100) and keep the solution with the lowest WCSS. Use K-Means++ initialisation as the default — it dramatically reduces the probability of poor local optima.

Mistake 4: Treating Cluster Labels as Stable Across Analyses

Problem: Cluster labels (Cluster 1, 2, 3, etc.) are arbitrarily assigned and change between runs of KK-means. Researchers sometimes reference "Cluster 1" without realising it may correspond to a completely different group in a different run.
Solution: Always refer to clusters by their content labels (e.g., "High-Risk Group," "Moderate Group") rather than numerical labels. Match clusters across runs using the Hungarian algorithm or maximum overlap matching. Set a random seed for reproducibility.

Mistake 5: Ignoring Cluster Validity and Stability

Problem: Reporting a cluster solution without any assessment of its validity or reproducibility. A clustering that "looks interesting" may simply reflect the noise structure of a small or particular sample and fail to replicate in new data.
Solution: Always report at least one internal validity index (ASW recommended) and a stability assessment (bootstrap ARI). Solutions with ASW <0.26< 0.26 or bootstrap ARI <0.60< 0.60 should be interpreted with extreme caution. Validate the solution on an independent holdout sample or using cross-validation.

Mistake 6: Including Outcome Variables in the Clustering

Problem: Including the outcome variable (or variables highly correlated with it) in the feature set used to define clusters creates a circular validation problem — the clusters will naturally differ on the outcome because it was used to define them.
Solution: Cluster using only the predictor variables or features that define the hypothesised subgroups. Hold out the outcome variable and use it only for post-hoc validation (i.e., testing whether the clusters actually differ in a meaningful way on the outcome).

Mistake 7: Using Hierarchical Clustering With Single Linkage for General Purposes

Problem: Single linkage is prone to chaining — it forms long, elongated clusters by progressively adding observations to the nearest existing cluster, even if they are far from the cluster's core members. This produces biologically/psychologically uninterpretable clusters in most applications.
Solution: Use Ward's linkage as the default for general-purpose hierarchical clustering with continuous variables. Use single linkage only when elongated, chain-like clusters are theoretically expected (e.g., trajectory data or sequentially ordered data).

Mistake 8: Claiming Clusters Are "Real" Without Validation

Problem: Clustering always produces groups — even completely random data will be partitioned into KK clusters by KK-means. Presenting the resulting solution as a discovery of genuine population subgroups without testing whether the clusters reflect real structure is scientifically inappropriate.
Solution: Test the null hypothesis of no cluster structure using the Hopkins statistic before clustering. Validate the solution using held-out variables, external criteria, or an independent replication sample. Use language that reflects the exploratory and hypothesis-generating nature of cluster analysis (e.g., "consistent with three subtypes" rather than "three subtypes were identified").
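The Hopkins statistic mentioned above compares nearest-neighbour distances of uniform "probe" points against those of the real data: values near 0.5 suggest spatial randomness, values near 1 suggest genuine clustering tendency. A pure-Python sketch (illustrative; sampling the reference points from the data's bounding box is one common convention):

```python
import random

def hopkins(points, m=None, seed=0):
    """Hopkins statistic sketch: ~0.5 for spatially random data, near 1 for
    clustered data, below 0.5 for regularly spaced data."""
    rng = random.Random(seed)
    n, p = len(points), len(points[0])
    m = m or max(1, n // 10)
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    lo = [min(pt[t] for pt in points) for t in range(p)]
    hi = [max(pt[t] for pt in points) for t in range(p)]
    u = []                           # NN distances of uniform probes to the data
    for _ in range(m):
        x = [rng.uniform(lo[t], hi[t]) for t in range(p)]
        u.append(min(d(x, q) for q in points))
    w = []                           # NN distances of sampled real points to the data
    for i in rng.sample(range(n), m):
        w.append(min(d(points[i], points[j]) for j in range(n) if j != i))
    return sum(u) / (sum(u) + sum(w))

g = random.Random(1)
blobs = ([(g.gauss(0, 0.05), g.gauss(0, 0.05)) for _ in range(15)] +
         [(g.gauss(10, 0.05), g.gauss(10, 0.05)) for _ in range(15)])
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]  # regular, unclustered
```

Running `hopkins(blobs)` yields a value far above `hopkins(grid)`, illustrating how the statistic separates genuine clustering tendency from structureless data before any algorithm is run.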

Mistake 9: Including Highly Correlated Variables Without Addressing Redundancy

Problem: When two highly correlated variables (r>0.85|r| > 0.85) are both included in the clustering, that dimension is effectively double-weighted in the distance calculation. The resulting clusters reflect that dimension disproportionately compared to others.
Solution: Inspect the correlation matrix of clustering variables. Remove one variable from each highly correlated pair, or combine them into a composite. Alternatively, apply PCA first and cluster on principal component scores, which are uncorrelated by construction.

Mistake 10: Using Cluster Means to Validate Clusters Using the Same Data

Problem: Computing ANOVA or tt-tests to compare clusters on the variables used to form them, and then reporting these differences as evidence that the clusters are "real" or "distinct," is tautological. Of course the clusters differ on the variables used to create them — that is how clustering works.
Solution: Validate clusters using held-out variables (variables not used in clustering) or external outcomes (e.g., diagnosis, treatment response, survival). ANOVA on clustering variables is useful only for describing the clusters (profiling), not for validating their existence or clinical/practical significance.


14. Troubleshooting

ProblemLikely CauseSolution
| K-Means produces a singleton cluster (cluster of 1) | An extreme outlier has been assigned its own cluster | Remove outliers before clustering; use K-Medoids (PAM); check for data entry errors |
| WCSS elbow plot shows no clear elbow | Data may have no genuine cluster structure; too many/few $K$ values tested | Test with the Hopkins statistic; extend the range of $K$; try hierarchical clustering to see the dendrogram |
| Silhouette width is low ($< 0.25$) across all $K$ | No cluster structure in the data; wrong distance measure; wrong variables selected | Reassess variable selection; try Gower's distance for mixed data; consider that the data may not have clusters |
| K-Means result changes dramatically between runs | Too few random starts; very flat WCSS landscape; data near a saddle point | Increase to 100+ random starts; use K-Means++ initialisation; try a different $K$ |
| Hierarchical dendrogram shows inversions (lower heights after higher) | Centroid linkage used with non-Euclidean distances | Switch to Ward's, average, or complete linkage; avoid centroid linkage |
| DBSCAN classifies almost all points as noise | $\varepsilon$ is too small | Increase $\varepsilon$; replot the $k$-NN distance graph and identify the correct elbow |
| DBSCAN produces one giant cluster | $\varepsilon$ is too large | Decrease $\varepsilon$; re-examine the $k$-NN distance plot |
| GMM EM algorithm does not converge | Too many components; degenerate (near-singular) covariance; small $n$ | Reduce $K$; use a constrained covariance structure (e.g., EEE); increase the sample size; add regularisation |
| GMM produces a cluster with a near-zero mixing proportion ($\pi_k \approx 0$) | Over-specification of $K$; one cluster is absorbing outliers | Reduce $K$; check for outliers in the data |
| Bootstrap stability is very low ($< 0.60$) | Sample size too small; clusters not well separated; wrong $K$ | Increase $n$; try a different $K$; increase the number of bootstrap replicates to 500 |
| All validity indices disagree on the optimal $K$ | Genuine ambiguity in the data structure; clusters are overlapping | Report multiple solutions; choose based on theory; use GMM soft assignments to quantify the overlap |
| Cluster sizes are extremely unequal (e.g., $n = 2$ vs. $n = 498$) | Outliers forming their own cluster; $K$ is too large; forced spherical clusters | Remove outliers; reduce $K$; use DBSCAN, which flags outliers as noise |
| Variables with zero or near-zero variance included | Data preprocessing omission | Check variable SDs; remove constant or near-constant variables before clustering |
| PCA biplot shows no cluster separation | Clusters may differ on dimensions not captured by PC1 and PC2 | Plot PC2 vs. PC3 and PC1 vs. PC3; use 3D plots; check whether cluster differences lie in higher components |
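
The "result changes dramatically between runs" row can be illustrated with a minimal Python sketch (scikit-learn assumed available; the `make_blobs` data and all variable names are illustrative, not part of DataStatPro). Running many K-Means++ restarts and keeping the lowest-WCSS solution is what makes the result reproducible:

```python
# Illustrative only: why more random starts stabilise K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

# A single random start: the solution can land in a poor local optimum.
km_few = KMeans(n_clusters=4, n_init=1, random_state=1).fit(X)

# Fifty k-means++ restarts: sklearn keeps the best (lowest-WCSS) run.
km_many = KMeans(n_clusters=4, n_init=50, random_state=1).fit(X)

print(km_few.inertia_, km_many.inertia_)  # best-of-50 WCSS is never worse
```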

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
| --- | --- |
| $d_E = \sqrt{\sum_{k=1}^p (x_{ik}-x_{jk})^2}$ | Euclidean distance |
| $d_M = \sum_{k=1}^p \lvert x_{ik}-x_{jk} \rvert$ | Manhattan distance |
| $d_{Mah} = \sqrt{(\mathbf{x}_i-\mathbf{x}_j)^T \mathbf{S}^{-1} (\mathbf{x}_i-\mathbf{x}_j)}$ | Mahalanobis distance |
| $d_G = \frac{\sum_k \delta_{ijk} d_{ijk}}{\sum_k \delta_{ijk}}$ | Gower's distance (mixed data) |
| $\text{WCSS} = \sum_{k=1}^K \sum_{\mathbf{x}_i \in C_k} \lVert\mathbf{x}_i - \bar{\mathbf{x}}_k\rVert^2$ | Within-cluster sum of squares |
| $c_i = \arg\min_k \lVert\mathbf{x}_i - \bar{\mathbf{x}}_k\rVert^2$ | K-Means assignment step |
| $\bar{\mathbf{x}}_k = \frac{1}{n_k} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$ | K-Means update step (centroid) |
| $s(i) = \frac{b(i)-a(i)}{\max[a(i),\,b(i)]}$ | Silhouette coefficient |
| $\bar{s} = \frac{1}{n} \sum_{i=1}^n s(i)$ | Average silhouette width |
| $CH = \frac{(n-K)\cdot\text{BCSS}}{(K-1)\cdot\text{WCSS}}$ | Calinski-Harabasz index |
| $DB = \frac{1}{K} \sum_{k=1}^K \max_{l \neq k} \frac{\sigma_k + \sigma_l}{d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)}$ | Davies-Bouldin index |
| $\text{Gap}(K) = E^*[\log \text{WCSS}(K)] - \log \text{WCSS}(K)$ | Gap statistic |
| $r_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$ | GMM posterior responsibility |
| $\text{BIC} = 2\hat{\ell} - q\ln(n)$ | BIC for GMM model selection |
| $r_{coph}$ | Cophenetic correlation (hierarchical quality) |
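
Two of these formulas (WCSS and the average silhouette width) can be checked by hand against a library implementation. A minimal NumPy/scikit-learn sketch, with synthetic data and variable names chosen purely for illustration:

```python
# Illustrative cross-check of the WCSS and silhouette formulas above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight synthetic groups around 0 and 5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WCSS: sum over clusters of squared distances to each cluster mean.
wcss = sum(
    ((X[km.labels_ == k] - X[km.labels_ == k].mean(axis=0)) ** 2).sum()
    for k in range(2)
)
print(round(wcss, 6), round(km.inertia_, 6))  # the two should agree

# Average silhouette width (s-bar) over all observations.
asw = silhouette_score(X, km.labels_)
print(round(asw, 3))
```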

Method Selection Guide

| Scenario | Recommended Method |
| --- | --- |
| Exploratory; unknown $K$; small-to-medium $n$ | Agglomerative hierarchical (Ward's linkage) |
| Large $n$; continuous data; approximate $K$ known | $K$-Means with K-Means++ |
| Outliers present; any distance matrix | $K$-Medoids (PAM) |
| Soft assignments desired; elliptical clusters | Gaussian Mixture Model (GMM) |
| Arbitrary-shaped clusters; automatic noise detection | DBSCAN or HDBSCAN |
| Mixed variable types | Gower's distance + PAM or hierarchical |
| Categorical-only variables | $K$-Modes or Latent Class Analysis |
| High-dimensional data ($p > 20$) | PCA first, then cluster on the scores |
| Varying-density clusters | HDBSCAN |
| Overlapping clusters; uncertainty quantification | GMM or Fuzzy C-Means |
| Temporal/longitudinal data | Group-Based Trajectory Modelling |
| Combining multiple algorithms | Ensemble/Consensus Clustering |
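
The "soft assignments" row is the one that differs most from the hard-assignment methods. A short sketch (scikit-learn assumed; data and names illustrative) shows that a GMM returns a full matrix of posterior responsibilities $r_{ik}$ rather than a single label per observation:

```python
# Illustrative only: GMM soft assignments vs. hard cluster labels.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)

resp = gmm.predict_proba(X)   # n x K matrix of responsibilities r_ik
print(resp.shape)             # each row sums to 1 across the K components
```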

Distance Measure Selection Guide

| Data Type | Recommended Distance |
| --- | --- |
| Continuous, different scales | Euclidean (after z-standardisation) |
| Continuous, outliers present | Manhattan |
| Continuous, correlated variables | Mahalanobis |
| Mixed (continuous + categorical) | Gower's |
| Binary (presence/absence) | Jaccard |
| High-dimensional (text, genomics) | Cosine |
| Time series | Dynamic Time Warping (DTW) |
| Count data | Bray-Curtis |
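
Several of these distances are available through SciPy's `pdist`. A minimal sketch (toy data; z-standardisation before Euclidean follows the first row of the table):

```python
# Illustrative pairwise-distance computation with SciPy.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: two variables on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [10.0, 50.0]])

# Z-standardise each column before Euclidean distance.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

d_euc = squareform(pdist(Xz, metric="euclidean"))
d_man = squareform(pdist(Xz, metric="cityblock"))  # Manhattan
d_cos = squareform(pdist(X, metric="cosine"))

print(d_euc.round(2))  # symmetric, zero diagonal
```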

Linkage Method Guide (Hierarchical Clustering)

| Linkage | Cluster Shape | Sensitive to Outliers | Recommended |
| --- | --- | --- | --- |
| Ward's | Compact, spherical | Moderate | ✅ Default choice |
| Complete | Compact, equal size | Yes | When compact clusters are expected |
| Average | Moderate | Moderate | Good general-purpose alternative |
| Single | Elongated, chained | Very high | Filamentary structures only |
| Centroid | Moderate | Low | Not recommended (inversions) |
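
Ward's linkage, the default recommendation above, can be run with SciPy; the sketch below (synthetic data, illustrative names) also computes the cophenetic correlation $r_{coph}$ used as a quality check for hierarchical solutions:

```python
# Illustrative Ward-linkage clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

# Two tight synthetic groups around 0 and 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])

d = pdist(X)                    # condensed Euclidean distance matrix
Z = linkage(d, method="ward")   # Ward's linkage (the default choice)

r_coph, _ = cophenet(Z, d)      # cophenetic correlation r_coph
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at K = 2

print(round(r_coph, 3), sorted(set(labels)))
```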

Validity Index Benchmarks

| Index | Poor | Acceptable | Good | Excellent |
| --- | --- | --- | --- | --- |
| Average Silhouette Width | $< 0.25$ | $0.26 - 0.50$ | $0.51 - 0.70$ | $> 0.70$ |
| Bootstrap Stability (ARI) | $< 0.60$ | $0.60 - 0.74$ | $0.75 - 0.85$ | $> 0.85$ |
| Cophenetic Correlation | $< 0.60$ | $0.60 - 0.74$ | $0.75 - 0.85$ | $> 0.85$ |
| ARI vs. ground truth | $< 0.20$ | $0.21 - 0.40$ | $0.41 - 0.70$ | $> 0.70$ |
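
The Adjusted Rand Index behind the stability and ground-truth rows compares two partitions while ignoring label names. A tiny sketch (hand-made labelings, illustrative only):

```python
# Illustrative ARI computation: label-permutation invariance.
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [1, 1, 1, 0, 0, 0, 2, 2, 2]   # same partition, labels swapped
noisy = [0, 1, 0, 1, 0, 1, 2, 2, 0]   # a partly scrambled partition

print(adjusted_rand_score(truth, found))  # 1.0 - identical partitions
print(adjusted_rand_score(truth, noisy))  # well below 1.0
```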

KK Selection Methods Comparison

| Method | Statistic | Optimal $K$ | Requires $K$ Range | For |
| --- | --- | --- | --- | --- |
| Elbow | WCSS | At the elbow | Yes | K-Means |
| Silhouette | ASW | Maximise | Yes | Any |
| Gap Statistic | $\text{Gap}(K)$ | Smallest $K$ with $\text{Gap}(K) \geq \text{Gap}(K+1) - s_{K+1}$ | Yes | Any |
| CH Index | $CH$ | Maximise | Yes | K-Means, Hierarchical |
| DB Index | $DB$ | Minimise | Yes | K-Means, Hierarchical |
| BIC | $2\hat{\ell} - q\ln n$ | Maximise | Yes | GMM only |
| Dendrogram | Height gaps | Largest gap | No | Hierarchical only |
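
A typical workflow sweeps a range of $K$ and records both WCSS (for the elbow) and the average silhouette width. A minimal sketch under the usual assumptions (scikit-learn available; synthetic three-cluster data):

```python
# Illustrative K-selection sweep: elbow (WCSS) plus silhouette.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

# Silhouette rule: pick the K that maximises the average silhouette width.
best_k = max(results, key=lambda k: results[k][1])
print(best_k)
```

In practice the WCSS column of `results` would be plotted against $K$ to look for the elbow, and `best_k` cross-checked against it.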

Minimum Sample Size Guidelines

| Method | Minimum $n$ | Minimum per Cluster |
| --- | --- | --- |
| Hierarchical | 30 | 5 |
| $K$-Means | $10K$ | 10 |
| $K$-Medoids | $15K$ | 10 |
| GMM | $10Kp$ | $5p$ |
| DBSCAN | 50 | $\geq$ MinPts |
| Fuzzy C-Means | $10K$ | 10 |

Key Pre-Processing Checklist

| Step | Action | When Required |
| --- | --- | --- |
| Standardise | Z-score all continuous variables | Always (Euclidean distance) |
| Remove outliers | Screen using Mahalanobis $D^2$ or box plots | Before distance-based clustering |
| Handle missing data | Impute or use Gower's distance | When missing data are present |
| Remove redundant variables | Drop if $r$ | |
| Check clusterability | Hopkins statistic $> 0.75$ | Before clustering |
| Dimensionality reduction | PCA if $p > 20$ | High-dimensional data |
| Check variable variance | Remove near-zero-variance variables | Always |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Cluster Analysis using the DataStatPro application. For further reading, consult Kaufman & Rousseeuw's "Finding Groups in Data: An Introduction to Cluster Analysis" (2005), Everitt, Landau, Leese & Stahl's "Cluster Analysis" (5th ed., 2011), and Fraley & Raftery's "Model-Based Clustering, Discriminant Analysis, and Density Estimation" (Journal of the American Statistical Association, 2002). For feature requests or support, contact the DataStatPro team.