Cluster Analysis: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Cluster Analysis all the way through advanced algorithms, evaluation, interpretation, and practical usage within the DataStatPro application. Whether you are encountering cluster analysis for the first time or looking to deepen your understanding of unsupervised learning and data segmentation, this guide builds your knowledge systematically from the ground up.

Prerequisites and Background Concepts
What is Cluster Analysis?
The Mathematics Behind Cluster Analysis
Assumptions of Cluster Analysis
Types of Cluster Analysis Methods
Using the Cluster Analysis Component
Hierarchical Clustering
K-Means and K-Medoids Clustering
Model-Based and Density-Based Clustering
Model Fit and Evaluation
Advanced Topics
Worked Examples
Common Mistakes and How to Avoid Them
Troubleshooting
Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into cluster analysis, it is helpful to be comfortable with the following foundational statistical and mathematical concepts. Each is briefly reviewed below.

1.1 Distance and Similarity

Cluster analysis is fundamentally about grouping objects that are similar to each other and dissimilar from objects in other groups. The core computational tool is a distance or dissimilarity measure between pairs of observations.

For two observations $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $\mathbf{x}_j = (x_{j1}, x_{j2}, \dots, x_{jp})$ with $p$ variables, the most commonly used distance measure is the Euclidean distance:

$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{p}(x_{ik} - x_{jk})^2}$

Key properties of a valid distance measure:

$d(\mathbf{x}_i, \mathbf{x}_j) \geq 0$ (non-negativity).
$d(\mathbf{x}_i, \mathbf{x}_j) = 0 \Leftrightarrow \mathbf{x}_i = \mathbf{x}_j$ (identity of indiscernibles).
$d(\mathbf{x}_i, \mathbf{x}_j) = d(\mathbf{x}_j, \mathbf{x}_i)$ (symmetry).
$d(\mathbf{x}_i, \mathbf{x}_k) \leq d(\mathbf{x}_i, \mathbf{x}_j) + d(\mathbf{x}_j, \mathbf{x}_k)$ (triangle inequality).

A similarity measure $s(\mathbf{x}_i, \mathbf{x}_j)$ is the inverse concept: high similarity means the objects are close. It can always be converted to a dissimilarity: $d = 1 - s$ (for similarities bounded by [0, 1]).

1.2 Vectors and Centroids

An observation in a dataset with $p$ variables is represented as a vector in $p$ -dimensional space:

$\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T \in \mathbb{R}^p$

The centroid of a set of $n_k$ observations $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{n_k}\}$ is their arithmetic mean:

$\bar{\mathbf{x}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\mathbf{x}_i$

In $K$ -means clustering, the centroid is the representative point of each cluster.

1.3 Variance and Within-Group Variance

The total variance of a dataset measures overall dispersion:

$SS_{total} = \sum_{i=1}^{n}\sum_{k=1}^{p}(x_{ik} - \bar{x}_k)^2$

The within-cluster sum of squares (WCSS) measures how tightly packed observations are within their assigned clusters:

$\text{WCSS} = \sum_{c=1}^{K}\sum_{i \in C_c}\|\mathbf{x}_i - \bar{\mathbf{x}}_c\|^2$

Where $C_c$ is the set of observations in cluster $c$ and $\bar{\mathbf{x}}_c$ is the centroid of cluster $c$ . Minimising WCSS is the objective of $K$ -means clustering.

1.4 Matrix Notation and the Distance Matrix

Given $n$ observations, the distance matrix $\mathbf{D}$ is an $n \times n$ symmetric matrix where element $d_{ij}$ is the distance between observations $i$ and $j$ :

$\mathbf{D} = \begin{pmatrix} 0 & d_{12} & d_{13} & \cdots & d_{1n} \\ d_{21} & 0 & d_{23} & \cdots & d_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & d_{n3} & \cdots & 0 \end{pmatrix}$

The distance matrix is the primary input to hierarchical and many other clustering algorithms. It is always symmetric ( $d_{ij} = d_{ji}$ ) with zeros on the diagonal.

1.5 Probability Distributions and Mixture Models

A probability distribution $f(x)$ describes the likelihood of observing each value of a random variable $X$ . In model-based clustering, the data are assumed to arise from a mixture of distributions:

$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k f_k(\mathbf{x} \mid \boldsymbol{\theta}_k)$

Where:

$K$ = number of mixture components (clusters).
$\pi_k$ = mixing proportion for component $k$ ( $\sum_k \pi_k = 1$ , $\pi_k > 0$ ).
$f_k(\mathbf{x} \mid \boldsymbol{\theta}_k)$ = density function of component $k$ with parameters $\boldsymbol{\theta}_k$ .

The most common choice is the Gaussian mixture model (GMM), where each component is a multivariate normal distribution.

1.6 Optimisation and Convergence

Many clustering algorithms work by iteratively optimising an objective function (e.g., minimising WCSS). Key concepts:

Iteration: Repeating the same update steps until a stopping criterion is met.

Convergence: The algorithm has converged when the objective function changes by less than a small threshold $\varepsilon$ between iterations:

$|F^{(t+1)} - F^{(t)}| < \varepsilon$

Local vs. Global Optimum: Iterative algorithms like $K$ -means often converge to a local (not global) optimum — the solution depends on the random starting configuration. Running the algorithm multiple times with different starts helps find a better (though still not guaranteed globally optimal) solution.

1.7 Supervised vs. Unsupervised Learning

Supervised learning uses labelled data (known outcomes) to train a model — examples include regression and classification. Unsupervised learning discovers structure in unlabelled data without a predefined outcome variable.

Cluster analysis is unsupervised — there is no "correct answer" against which to evaluate the solution. This makes cluster analysis both powerful (works without labels) and challenging (no ground truth for validation). All cluster validation is internal (based on the structure of the clustered data itself) or theoretical (based on external knowledge).

2. What is Cluster Analysis?

2.1 The Core Idea

Cluster analysis (also called clustering or cluster detection) is a class of unsupervised statistical learning methods that partition a dataset into groups (clusters) such that:

Observations within the same cluster are as similar as possible to each other (high intra-cluster homogeneity).
Observations in different clusters are as different as possible from each other (high inter-cluster separation).

Unlike classification (which assigns new observations to predefined categories), cluster analysis discovers the categories themselves from the data. The researcher does not specify what the groups should look like — the algorithm determines which observations naturally belong together.

2.2 What Cluster Analysis Can and Cannot Do

Cluster analysis CAN:

Reveal natural groupings in complex high-dimensional data.
Generate hypotheses about subgroup differences.
Reduce data complexity by summarising observations by their cluster membership.
Identify outliers and unusual observations (those that do not fit well in any cluster).
Segment populations for targeted interventions or personalisation.

Cluster analysis CANNOT:

Prove that clusters are "real" in a statistical testing sense — there is always some grouping solution, even for completely random data.
Identify causes of group differences (it is descriptive, not inferential).
Tell you the "right" number of clusters — this requires the researcher's judgement.
Produce clusters that generalise to new samples without validation.

2.3 Real-World Applications

Field	Application	Variables Used
Medicine	Identifying patient subgroups with similar disease profiles or treatment responses	Lab values, symptoms, genetic markers
Marketing	Customer segmentation for targeted advertising and product development	Purchase behaviour, demographics, preferences
Genomics	Grouping genes with similar expression patterns across conditions	Gene expression levels across samples
Psychology	Identifying latent subtypes in clinical populations	Symptom profiles, cognitive test scores
Finance	Portfolio diversification by grouping assets with correlated returns	Return series, risk metrics
Image Analysis	Colour segmentation; pixel grouping in image compression	RGB values, spatial coordinates
Social Science	Identifying socioeconomic neighbourhood profiles	Income, education, employment, housing
Ecology	Species distribution grouping; habitat classification	Environmental variables, species counts
Astronomy	Classifying galaxies or stars by spectral characteristics	Luminosity, colour index, mass

2.4 Cluster Analysis vs. Related Methods

Feature	Cluster Analysis	Factor Analysis	PCA	Discriminant Analysis
Goal	Group observations	Group variables	Reduce variable dimensions	Separate known groups
Supervised?	No	No	No	Yes
Output	Cluster memberships	Latent factors	Component scores	Classification rules
Known groups?	No	No	No	Yes
Operates on	Rows (observations)	Columns (variables)	Columns (variables)	Both
Distance used?	Yes (usually)	No	No	Yes (Mahalanobis)

2.5 The Fundamental Challenge: Defining Similarity

Every clustering algorithm embeds assumptions about what "similar" means. The choice of:

Distance measure (Euclidean, Manhattan, cosine, etc.).
Linkage method (for hierarchical clustering).
Number of clusters $K$ .
Cluster shape assumption (spherical, elliptical, arbitrary).

...all profoundly influence the resulting clusters. There is no universally correct clustering solution — different algorithms and distance measures can produce very different groupings from the same data. Always examine results from multiple approaches.

3. The Mathematics Behind Cluster Analysis

3.1 Distance and Dissimilarity Measures

The choice of distance measure determines how "closeness" is defined and which clusters are formed. Different measures emphasise different aspects of dissimilarity.

Euclidean Distance (L2 norm):

$d_E(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{p}(x_{ik} - x_{jk})^2} = \|\mathbf{x}_i - \mathbf{x}_j\|_2$

Sensitive to scale differences between variables — standardisation is essential.

Manhattan Distance (L1 norm, City Block):

$d_M(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p}|x_{ik} - x_{jk}|$

Less sensitive to outliers than Euclidean distance. Appropriate for grid-like data.

Minkowski Distance (generalisation of Euclidean and Manhattan):

$d_p(\mathbf{x}_i, \mathbf{x}_j) = \left(\sum_{k=1}^{p}|x_{ik} - x_{jk}|^p\right)^{1/p}$

$p = 1$ : Manhattan distance.
$p = 2$ : Euclidean distance.
$p \to \infty$ : Chebyshev distance ( $\max_k |x_{ik} - x_{jk}|$ ).

Mahalanobis Distance (accounts for correlations and scale):

$d_{Mah}(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}$

Where $\mathbf{S}$ is the sample covariance matrix. Invariant to scale and rotation; accounts for correlations among variables.

Cosine Dissimilarity (based on angle between vectors):

$d_{cos}(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\mathbf{x}_i^T \mathbf{x}_j}{\|\mathbf{x}_i\| \cdot \|\mathbf{x}_j\|}$

Appropriate for high-dimensional data (text, documents) where magnitude is less important than direction.

Gower's Distance (for mixed variable types):

For a dataset with $p$ variables of mixed types (continuous, ordinal, binary, nominal):

$d_G(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{k=1}^{p} \delta_{ijk} \cdot d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$

Where $\delta_{ijk} = 1$ if the $k$ -th variable is usable for comparison (non-missing), and $d_{ijk}$ is the contribution of variable $k$ to the dissimilarity:

Continuous: $d_{ijk} = |x_{ik} - x_{jk}| / \text{range}_k$
Binary/Nominal: $d_{ijk} = \mathbf{1}[x_{ik} \neq x_{jk}]$
Ordinal: Similar to continuous after ranking.

3.2 Distance Measure Comparison Table

Distance	Formula	Scale Sensitive	Outlier Robust	Variable Type	Best For
Euclidean	$\sqrt{\sum(x_i - x_j)^2}$	Yes	No	Continuous	General purpose (after standardising)
Manhattan	$\sum	x_i - x_j	$	Yes	Moderate
Minkowski	$(\sum	x_i - x_j	^p)^{1/p}$	Yes	Varies
Mahalanobis	$\sqrt{(\mathbf{x}_i-\mathbf{x}_j)^T\mathbf{S}^{-1}(\mathbf{x}_i-\mathbf{x}_j)}$	No	No	Continuous	Correlated variables
Cosine	$1 - \cos(\theta)$	No	Moderate	Continuous	High-dimensional, text
Gower	Weighted average	No	Moderate	Mixed	Mixed data types
Binary (Jaccard)	$1 - \frac{	A \cap B	}{	A \cup B	}$

3.3 The Hierarchical Clustering Algorithm

Hierarchical clustering builds a dendrogram (tree structure) by successively merging (agglomerative) or splitting (divisive) clusters.

Agglomerative clustering algorithm (bottom-up):

Step 1: Start with $n$ clusters, each containing exactly one observation:

$\mathcal{C} = \{C_1, C_2, \dots, C_n\}$ where $C_i = \{\mathbf{x}_i\}$

Step 2: Compute the $n \times n$ distance matrix $\mathbf{D}$ .

Step 3: Find the two clusters $C_i$ and $C_j$ with minimum inter-cluster distance:

$(i^*, j^*) = \arg\min_{i \neq j} d(C_i, C_j)$

Step 4: Merge $C_{i^*}$ and $C_{j^*}$ into a new cluster:

$C_{new} = C_{i^*} \cup C_{j^*}$

Step 5: Update the distance matrix by removing rows/columns for $C_{i^*}$ and $C_{j^*}$ and adding a new row/column for $C_{new}$ using the chosen linkage method.

Step 6: Repeat Steps 3–5 until a single cluster remains.

The result is a binary tree (dendrogram) recording the order and heights of all merges.

3.4 Linkage Methods

The linkage method determines how the distance between two clusters is computed from the distances between their constituent observations. Different linkage methods produce dramatically different cluster shapes.

Single Linkage (Minimum):

$d_{SL}(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$

Distance = distance between the closest pair of observations from each cluster. Prone to chaining — elongated, chain-like clusters.

Complete Linkage (Maximum):

$d_{CL}(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$

Distance = distance between the farthest pair of observations from each cluster. Produces compact, roughly equal-sized clusters.

Average Linkage (UPGMA):

$d_{AL}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|}\sum_{\mathbf{x} \in C_i}\sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$

Distance = average of all pairwise distances between the two clusters. A compromise between single and complete linkage.

Ward's Method (Minimum Variance):

Merges the two clusters that result in the smallest increase in total within-cluster sum of squares (WCSS):

$d_{Ward}(C_i, C_j) = \frac{|C_i| \cdot |C_j|}{|C_i| + |C_j|} \|\bar{\mathbf{x}}_{C_i} - \bar{\mathbf{x}}_{C_j}\|^2$

Where $\bar{\mathbf{x}}_{C_i}$ and $\bar{\mathbf{x}}_{C_j}$ are the centroids of clusters $C_i$ and $C_j$ . Ward's method tends to produce compact, similarly sized spherical clusters and is the most widely used linkage in practice.

Centroid Linkage:

$d_{Centroid}(C_i, C_j) = \|\bar{\mathbf{x}}_{C_i} - \bar{\mathbf{x}}_{C_j}\|^2$

Distance = squared Euclidean distance between cluster centroids. Can produce inversions in the dendrogram (a merged cluster appearing lower than its components) — generally not recommended.

3.5 The K-Means Objective Function

$K$ -means clustering partitions $n$ observations into $K$ clusters by minimising the within-cluster sum of squares (WCSS), also called the inertia:

$\text{WCSS} = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2$

Where:

$K$ = number of clusters (pre-specified by the researcher).
$C_k$ = set of observations in cluster $k$ .
$\bar{\mathbf{x}}_k = \frac{1}{|C_k|}\sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$ = centroid of cluster $k$ .

The WCSS can be equivalently written using the between-cluster sum of squares (BCSS):

$\text{WCSS} = SS_{total} - \text{BCSS}$

Where:

$\text{BCSS} = \sum_{k=1}^{K} |C_k| \|\bar{\mathbf{x}}_k - \bar{\mathbf{x}}\|^2$

And $\bar{\mathbf{x}}$ is the global mean of all observations.

3.6 The K-Means Algorithm (Lloyd's Algorithm)

Initialisation: Randomly select $K$ observations as initial centroids $\bar{\mathbf{x}}_1^{(0)}, \bar{\mathbf{x}}_2^{(0)}, \dots, \bar{\mathbf{x}}_K^{(0)}$ .

Step 1 — Assignment Step: Assign each observation $\mathbf{x}_i$ to the cluster with the nearest centroid:

$c_i^{(t)} = \arg\min_{k \in \{1,\dots,K\}} \|\mathbf{x}_i - \bar{\mathbf{x}}_k^{(t)}\|^2$

Step 2 — Update Step: Recompute each centroid as the mean of all observations assigned to that cluster:

$\bar{\mathbf{x}}_k^{(t+1)} = \frac{1}{|C_k^{(t)}|}\sum_{\mathbf{x}_i \in C_k^{(t)}} \mathbf{x}_i$

Convergence: Repeat Steps 1–2 until cluster assignments no longer change:

$c_i^{(t+1)} = c_i^{(t)} \quad \forall i$

Or until the change in WCSS is below a threshold $\varepsilon > 0$ :

$|\text{WCSS}^{(t+1)} - \text{WCSS}^{(t)}| < \varepsilon$

Convergence guarantee: $K$ -means always converges in a finite number of iterations because:

The assignment step never increases WCSS.
The update step never increases WCSS.
The number of possible cluster assignments is finite.

However, the converged solution may be a local optimum. Run with at least 20–50 random starts and keep the solution with the lowest WCSS.

3.7 K-Means++ Initialisation

Standard random initialisation of $K$ -means often converges to poor local optima. K-means++ (Arthur & Vassilvitskii, 2007) provides a smarter initialisation:

Step 1: Choose the first centroid uniformly at random from the data:

$\bar{\mathbf{x}}_1 \sim \text{Uniform}\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$

Step 2: For each subsequent centroid $\bar{\mathbf{x}}_k$ ( $k = 2, \dots, K$ ), choose observation $\mathbf{x}_i$ with probability proportional to its squared distance from the nearest already-chosen centroid:

$P(\mathbf{x}_i) = \frac{D(\mathbf{x}_i)^2}{\sum_{j=1}^{n} D(\mathbf{x}_j)^2}$

Where $D(\mathbf{x}_i) = \min_{c \in \text{chosen centroids}} \|\mathbf{x}_i - c\|$ .

Step 3: Repeat Step 2 until $K$ centroids are chosen, then run standard K-means.

K-means++ provides an $O(\log K)$ approximation guarantee on the WCSS and typically converges much faster than random initialisation.

3.8 The Gaussian Mixture Model (GMM)

Model-based clustering assumes that the data arise from a finite mixture of multivariate Gaussian distributions. Each component represents a cluster:

$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Where:

$\pi_k$ = mixing proportion of component $k$ ( $\sum_k \pi_k = 1$ ).
$\boldsymbol{\mu}_k$ = mean vector of component $k$ .
$\boldsymbol{\Sigma}_k$ = covariance matrix of component $k$ .
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ = multivariate normal density:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$

3.9 The EM Algorithm for GMM

The parameters $\boldsymbol{\Theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K$ are estimated by maximising the log-likelihood:

$\ell(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \ln\left[\sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]$

The Expectation-Maximisation (EM) algorithm iterates between two steps:

E-Step (Expectation): Compute the posterior responsibility (soft assignment probability) of component $k$ for observation $i$ :

$r_{ik}^{(t)} = \frac{\pi_k^{(t)} \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$

M-Step (Maximisation): Update parameters using the current responsibilities:

Effective cluster sizes:

$N_k^{(t)} = \sum_{i=1}^{n} r_{ik}^{(t)}$

Updated mixing proportions:

$\pi_k^{(t+1)} = \frac{N_k^{(t)}}{n}$

Updated means:

$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^{n} r_{ik}^{(t)} \mathbf{x}_i$

Updated covariance matrices:

$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^{n} r_{ik}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})^T$

The EM algorithm is guaranteed to converge to a local maximum of the log-likelihood. Key advantage over $K$ -means: soft (probabilistic) cluster memberships — each observation belongs to each cluster with a probability, not a hard 0/1 assignment.

4. Assumptions of Cluster Analysis

4.1 The Existence of Clusters (Clusterability)

The most fundamental assumption is that the data contain genuine cluster structure — that the population from which the data are drawn is genuinely multimodal or heterogeneous.

Why it matters: Clustering algorithms will always produce clusters, even for completely random, uniform data. A solution from random data is meaningless.

How to check:

Hopkins statistic: Tests whether data are uniformly distributed (no cluster structure).

$H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}$

Where $u_i$ is the distance from a randomly sampled point to its nearest neighbour in the data, and $w_i$ is the distance from a randomly chosen data point to its nearest neighbour. Under a uniform distribution, $H \approx 0.5$ . $H > 0.75$ suggests cluster structure.

Visual methods: Plot pairwise scatter plots and look for visible groupings.

4.2 Scale and Measurement Consistency

All continuous variables must be on comparable scales before computing Euclidean distances. If one variable is measured in millimetres (range 1–10) and another in kilograms (range 50–100), the larger-scale variable will dominate the distance calculation.

How to check:

Compare the standard deviations of all variables.
If they differ substantially, standardise (z-score) all variables before clustering.

Standard z-score transformation:

$z_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_{x_k}}$

When NOT to standardise:

When the original scale differences are meaningful (e.g., all variables are on the same scale, such as blood cell counts in the same unit).
When using Mahalanobis distance (already scale-invariant).

4.3 Absence of Extreme Outliers

Outliers severely distort distance-based clustering methods (especially $K$ -means and Ward's hierarchical clustering):

Outliers may form their own singleton clusters.
Outliers pull centroids toward them, distorting the clusters for all other observations.

How to check:

Univariate: Box plots, $z$ -scores $> 3$ .
Multivariate: Mahalanobis distance with chi-squared critical value.
Visualise: Scatter plot matrix for small $p$ .

How to handle:

Remove extreme outliers before clustering (and report their removal).
Use outlier-robust clustering methods (e.g., $K$ -medoids, DBSCAN).
Add outlier detection as a separate pre-processing step.

4.4 Appropriate Choice of Distance Measure

The distance measure should be appropriate for the type and scale of the variables:

Continuous variables: Euclidean or Mahalanobis (after standardising).
Binary variables: Jaccard or Dice coefficient.
Mixed types: Gower's distance.
High-dimensional: Cosine similarity.

Choosing the wrong distance measure can produce meaningless clusters even when genuine structure exists.

4.5 Spherical Cluster Shape (for K-Means)

$K$ -means assumes clusters are convex and roughly spherical in shape (equal variance in all directions). Non-spherical cluster shapes (elongated, curved, or irregular) will be poorly recovered by $K$ -means.

How to check:

After clustering, plot the clusters and visually inspect their shapes.
Compute silhouette coefficients — low values for many points suggest non-spherical clusters.

Remedy: Use GMM (which can model elliptical shapes), hierarchical clustering with single linkage (for chain-like clusters), or DBSCAN (for arbitrary shapes).

4.6 Appropriate Number of Clusters (K)

The number of clusters $K$ must be chosen appropriately. Specifying too few clusters over-groups heterogeneous observations; too many clusters fragments natural groups.

How to determine $K$ : Use multiple criteria (see Section 10) including the elbow method, silhouette analysis, gap statistic, and BIC. Never rely on a single criterion.

4.7 Adequate Sample Size

Reliable clustering requires sufficient observations per cluster for stable, reproducible results:

Method	Minimum Total $n$	Minimum per Cluster
Hierarchical	30	5
$K$ -means	$10K$	10
GMM	$10Kp$	$5p$
DBSCAN	50	Depends on MinPts

Where $K$ is the number of clusters and $p$ is the number of variables.

⚠️ With small samples ( $n < 50$ ), cluster solutions are highly unstable and may not replicate in new data. Always validate the solution using bootstrapping or an independent sample.

5. Types of Cluster Analysis Methods

5.1 Classification by Methodology

Clustering methods are broadly classified by their algorithmic strategy:

Category	Methods	Cluster Shape	Hard/Soft	Output
Hierarchical	Ward, Complete, Single, Average, Centroid	Flexible	Hard	Dendrogram + any K
Partitional	K-Means, K-Medoids (PAM), K-Modes	Spherical/Euclidean	Hard	K clusters
Model-Based	GMM, Mclust	Elliptical	Soft	K clusters + probabilities
Density-Based	DBSCAN, OPTICS, HDBSCAN	Arbitrary	Hard	Clusters + noise points
Fuzzy	Fuzzy C-Means	Spherical	Soft	Membership degrees
Spectral	Spectral Clustering	Arbitrary	Hard	K clusters
Grid-Based	CLIQUE, STING	Grid cells	Hard	Dense grid cells

5.2 Hierarchical vs. Partitional Methods

Feature	Hierarchical	Partitional ( $K$ -Means)
Specify $K$ in advance?	No	Yes
Deterministic?	Yes	No (random starts)
Scalability	$O(n^2 \log n)$ to $O(n^3)$	$O(nKI)$
Output	Complete hierarchy (dendrogram)	Single solution for given $K$
Reversible merges/splits?	No	Yes (re-run with different $K$ )
Works for large $n$ ?	Poorly ( $n > 10{,}000$ )	Well
Mixed variable types	Yes (with Gower)	Limited

5.3 Agglomerative vs. Divisive Hierarchical Clustering

Feature	Agglomerative (Bottom-Up)	Divisive (Top-Down)
Start	$n$ singleton clusters	1 cluster containing all $n$
Process	Merge closest clusters	Split most heterogeneous cluster
Common example	Ward, Complete, Average linkage	DIANA (Divisive Analysis)
Computational cost	$O(n^2 \log n)$	$O(2^n)$ in exact form
Usage	Very common	Less common

5.4 Hard vs. Soft (Fuzzy) Clustering

Feature	Hard (Crisp) Clustering	Soft (Fuzzy) Clustering
Assignment	Each observation belongs to exactly one cluster	Each observation has a degree of membership in each cluster
Membership	$u_{ik} \in \{0, 1\}$	$u_{ik} \in [0, 1]$ , $\sum_k u_{ik} = 1$
Examples	$K$ -Means, Hierarchical	GMM, Fuzzy C-Means
Best for	Well-separated clusters	Overlapping clusters; ambiguous boundary cases
Output	Cluster label vector	Membership matrix

5.5 DataStatPro Implementation Overview

The DataStatPro Cluster Analysis component implements:

Method	Implementation	Best For
Agglomerative Hierarchical	Ward, Complete, Average, Single linkage	Exploratory; unknown $K$ ; small to medium $n$
K-Means	Lloyd's algorithm with K-Means++ init	Large $n$ ; continuous data; known approximate $K$
K-Medoids (PAM)	Partitioning Around Medoids	Robust to outliers; mixed data
Gaussian Mixture Models	EM algorithm; multiple covariance structures	Soft assignments; elliptical clusters
DBSCAN	Density-based; automatic noise detection	Arbitrary shapes; outlier detection

6. Using the Cluster Analysis Component

The Cluster Analysis component in DataStatPro provides a complete end-to-end workflow for performing, evaluating, and visualising cluster analyses.

Step-by-Step Guide

Step 1 — Select Dataset

Choose the dataset from the "Dataset" dropdown. Ensure:

All clustering variables are in separate numeric columns.
Categorical variables are appropriately recoded (dummy-coded or use Gower's distance).
The dataset has been screened for missing data and extreme outliers.
Variables are on comparable scales (or will be standardised in Step 4).

💡 Tip: Run descriptive statistics and a correlation matrix before clustering. Variables that are highly correlated ( $|r| > 0.90$ ) are redundant — consider removing one from each pair to avoid over-weighting that dimension in the distance calculation.

Step 2 — Select Clustering Variables

Select all variables to include in the cluster analysis from the "Clustering Variables" dropdown. Only include variables that are:

Theoretically relevant to the grouping hypothesis.
Sufficiently variable (variables with near-zero variance contribute nothing to clustering).
Measured on compatible scales (or will be standardised).

⚠️ Important: Do not include identifier variables (ID numbers), date variables, or outcome variables you intend to use to validate the clusters (these should be held out for post-hoc validation, not used in forming the clusters).

Step 3 — Select Clustering Method

Choose from the "Method" dropdown:

Hierarchical (Agglomerative): For exploratory analysis; produces a dendrogram. Select linkage method: Ward (recommended), Complete, Average, or Single.
K-Means: For large datasets with a known approximate number of clusters. Uses K-Means++ initialisation by default.
K-Medoids (PAM): More robust than K-Means; works with any distance matrix.
Gaussian Mixture Model (GMM): For soft assignments and elliptical clusters. Select covariance structure: VVV (most flexible), EEE (equal covariance), or others.
DBSCAN: For arbitrary-shaped clusters and automatic outlier detection. Specify $\varepsilon$ (neighbourhood radius) and MinPts (minimum points).

Step 4 — Standardisation Options

Select how to scale the variables before clustering:

Standardise (z-scores): Recommended for most analyses with continuous variables of different scales. Transforms each variable to mean = 0, SD = 1.
Range normalisation (0–1): Scales all variables to the range [0, 1]. $x^* = (x - x_{min})/(x_{max} - x_{min})$ .
No standardisation: Use when variables are already on comparable scales.
Robust standardisation (median/IQR): Recommended when outliers are present. $x^* = (x - \text{median})/\text{IQR}$ .

Step 5 — Select Distance Measure

Choose the appropriate distance measure for your data type:

Euclidean (default): Continuous, standardised variables.
Manhattan (City Block): Continuous variables with potential outliers.
Mahalanobis: Continuous, correlated variables.
Gower: Mixed variable types (continuous, ordinal, binary, nominal).
Cosine: High-dimensional data (text, genomics).
Correlation (1-r): When pattern matters more than magnitude.

Step 6 — Specify Number of Clusters

For Hierarchical: The dendrogram is produced for all $K$ — cut it at any level to obtain the desired number of clusters. Use the automatic cutting tools in the application (based on height ratios or number of clusters).
For K-Means / K-Medoids: Specify the number of clusters $K$ , or use the automated Elbow Method, Silhouette Analysis, or Gap Statistic to recommend $K$ .
For GMM: Specify the range of $K$ to evaluate; the application selects the optimal $K$ using BIC or ICL.
For DBSCAN: No $K$ specified; clusters emerge automatically from $\varepsilon$ and MinPts.

💡 Best practice: Always evaluate at least three values of $K$ (your hypothesised number, one fewer, and one more) and compare the solutions on both statistical criteria and theoretical interpretability.

Step 7 — Display Options

Select which outputs and visualisations to display:

✅ Dendrogram with highlighted clusters (Hierarchical).
✅ Scree/Elbow plot of WCSS vs. $K$ .
✅ Silhouette plot and average silhouette width.
✅ Cluster profile table (means/medians per cluster per variable).
✅ PCA biplot coloured by cluster.
✅ Cluster size table (n per cluster, percentage).
✅ Within-cluster and between-cluster sum of squares.
✅ Gap statistic plot.
✅ Cluster stability indices (bootstrapped).
✅ Heatmap of variable means by cluster.

Step 8 — Run the Analysis

Click "Run Cluster Analysis". The application will:

Standardise variables (if selected).
Compute the distance matrix.
Run the clustering algorithm.
Compute cluster validity indices (silhouette, Calinski-Harabasz, Davies-Bouldin).
Generate all selected visualisations.
Produce a cluster membership variable that can be saved back to the dataset for further analysis.

7. Hierarchical Clustering

7.1 The Dendrogram

The dendrogram is the primary output of hierarchical clustering. It is a binary tree where:

Leaves (bottom): Individual observations.
Internal nodes: Cluster merges.
Height of each node: The dissimilarity at which the merge occurred.
Root (top): A single cluster containing all observations.

The dendrogram encodes a complete hierarchical clustering solution for all possible numbers of clusters $K$ from 1 to $n$ . Cutting the dendrogram at a given height $h$ produces the clustering solution for that number of clusters.

Reading the dendrogram:

Observations (or clusters) that are merged at a low height are very similar.
A large gap between consecutive merging heights suggests a natural cut point.
The number of vertical lines cut by a horizontal cut at height $h$ equals the number of clusters $K$ .

7.2 Choosing the Cut Point

Method 1 — Largest height gap:

Inspect the heights of consecutive merges and cut below the largest jump:

$K^* = \arg\max_{k} [h(k) - h(k-1)]$

Where $h(k)$ is the height of the $k$ -th merge from the top (merging the last $k$ clusters into $k-1$ ).

Method 2 — Consistent percentage:

Cut at a height corresponding to a fixed percentage (e.g., 70%) of the total dendrogram height range.

Method 3 — External criteria:

Use cluster validity indices (silhouette width, Calinski-Harabasz) evaluated for several cut heights to determine the optimal $K$ .

💡 There is no single "correct" cut point. Combine statistical guidance with theoretical reasoning about the expected number of subgroups in your population.

7.3 Linkage Method Comparison

Linkage	Properties	Cluster Shape	Recommended When
Ward's	Minimises WCSS increase	Compact, spherical	Most general-purpose analyses
Complete	Maximum diameter	Compact, similar size	Well-separated clusters expected
Average	Average pairwise distances	Moderate compactness	Balanced trade-off
Single	Chain-like merging	Elongated, chained	Detecting filamentary structures
Centroid	Distance between centroids	Moderate	Rarely recommended (inversions possible)

⚠️ Ward's linkage with Euclidean distance (on standardised data) is the most commonly recommended default. However, it assumes roughly spherical, equally-sized clusters. If clusters are expected to differ dramatically in size or shape, consider average linkage.

7.4 Cophenetic Correlation Coefficient

The cophenetic correlation coefficient (CCC) measures how faithfully the dendrogram preserves the pairwise distances in the original data:

$r_{coph} = \frac{\sum_{i < j}(d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i < j}(d_{ij} - \bar{d})^2 \cdot \sum_{i < j}(c_{ij} - \bar{c})^2}}$

Where:

$d_{ij}$ = original pairwise distance between observations $i$ and $j$ .
$c_{ij}$ = cophenetic distance (height at which $i$ and $j$ first merge in the dendrogram).
$\bar{d}$ , $\bar{c}$ = means of $d_{ij}$ and $c_{ij}$ respectively.

CCC	Interpretation
$> 0.75$	Good representation
$0.60 - 0.75$	Acceptable
$< 0.60$	Poor — hierarchical structure may not be appropriate

A high CCC ( $> 0.75$ ) indicates that the dendrogram is a faithful representation of the original distances. Compare CCC across linkage methods and choose the one with the highest CCC.

7.5 Hierarchical Clustering for Large Datasets

Standard agglomerative hierarchical clustering has time complexity $O(n^2 \log n)$ and space complexity $O(n^2)$ — it requires storing the full $n \times n$ distance matrix. For large $n > 5{,}000$ :

Use minimax linkage or robust single linkage methods designed for scalability.
Use DIANA (Divisive Analysis) which can be implemented more efficiently for some data.
Cluster a random subsample first, then assign remaining observations.
Use $K$ -means or DBSCAN for large $n$ .

7.6 Dissimilarity Matrix Heatmap

A dissimilarity heatmap displays the $n \times n$ distance matrix as a colour-coded image, with observations reordered according to the dendrogram. In a well-structured dataset:

Block-diagonal patterns emerge when clusters are distinct.
Observations within the same cluster appear in low-dissimilarity (dark) blocks.
Off-diagonal blocks show higher dissimilarity (lighter colour) between clusters.

This visual provides an immediate, intuitive check on the quality and separation of the cluster solution.

8. K-Means and K-Medoids Clustering

8.1 Choosing K — The Elbow Method

The elbow method plots WCSS (total within-cluster sum of squares) against $K$ and looks for an "elbow" — a point where the rate of decrease in WCSS slows dramatically:

$\text{WCSS}(K) = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k}\|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2$

As $K$ increases from 1 to $n$ :

WCSS decreases monotonically (more clusters always explain more variance).
The marginal decrease diminishes after the "true" number of clusters.
The elbow occurs at the $K$ where the curve transitions from steep to relatively flat.

Limitation: The elbow is often subjective — in real data, the curve is frequently smooth without a clear kink. Combine with silhouette analysis and the gap statistic.

8.2 The Silhouette Coefficient

For each observation $i$ assigned to cluster $C_k$ , the silhouette coefficient measures how well the observation fits its assigned cluster compared to the nearest other cluster:

$a(i)$ : Average distance from observation $i$ to all other observations in the same cluster $C_k$ (intra-cluster cohesion):

$a(i) = \frac{1}{|C_k| - 1}\sum_{j \in C_k, j \neq i} d(i, j)$

$b(i)$ : Minimum average distance from observation $i$ to all observations in any other cluster (inter-cluster separation):

$b(i) = \min_{l \neq k} \frac{1}{|C_l|}\sum_{j \in C_l} d(i, j)$

Silhouette coefficient for observation $i$ :

$s(i) = \frac{b(i) - a(i)}{\max[a(i), b(i)]}$

Ranges from $-1$ to $+1$ :

$s(i) \approx +1$ : Observation is well-matched to its cluster and poorly matched to neighbouring clusters.
$s(i) \approx 0$ : Observation is on or very close to the boundary between two clusters.
$s(i) \approx -1$ : Observation would be better placed in a neighbouring cluster (probable misclassification).

Average silhouette width (ASW): The mean $s(i)$ across all observations:

$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i)$

Choose $K$ that maximises $\bar{s}$ .

ASW	Cluster Structure Interpretation
$> 0.70$	Strong structure
$0.51 - 0.70$	Reasonable structure
$0.26 - 0.50$	Weak structure — may be artificial
$\leq 0.25$	No substantial structure

8.3 The Gap Statistic

The gap statistic (Tibshirani, Walther & Hastie, 2001) compares the observed WCSS for $K$ clusters to the expected WCSS under a null reference distribution (uniform data with no cluster structure):

$\text{Gap}(K) = E^*[\log \text{WCSS}(K)] - \log \text{WCSS}(K)$

Where $E^*[\cdot]$ is the expectation under the null reference distribution (estimated by Monte Carlo simulation of $B = 100$ uniform datasets).

Choosing $K$ : Select the smallest $K$ such that:

$\text{Gap}(K) \geq \text{Gap}(K+1) - s_{K+1}$

Where $s_{K+1} = SD[\log \text{WCSS}(K+1)] \cdot \sqrt{1 + 1/B}$ is the simulation error.

The gap statistic accounts for sampling variability and provides a more principled stopping rule than the elbow method.

8.4 Calinski-Harabasz Index (Variance Ratio Criterion)

The Calinski-Harabasz (CH) index (also called the Variance Ratio Criterion) measures the ratio of between-cluster variance to within-cluster variance:

$CH(K) = \frac{\text{BCSS}/(K-1)}{\text{WCSS}/(n-K)}$

Where:

$\text{BCSS} = \sum_{k=1}^{K} n_k \|\bar{\mathbf{x}}_k - \bar{\mathbf{x}}\|^2$ = between-cluster sum of squares.
$\text{WCSS} = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k}\|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2$ = within-cluster sum of squares.

Higher CH index = better-defined clusters. Choose $K$ that maximises $CH(K)$ .

8.5 Davies-Bouldin Index

The Davies-Bouldin (DB) index measures the average similarity between each cluster and its most similar other cluster:

$DB(K) = \frac{1}{K}\sum_{k=1}^{K} \max_{l \neq k} \left(\frac{\sigma_k + \sigma_l}{d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)}\right)$

Where:

$\sigma_k = \frac{1}{n_k}\sum_{\mathbf{x}_i \in C_k}\|\mathbf{x}_i - \bar{\mathbf{x}}_k\|$ = average within-cluster distance from centroid.
$d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)$ = distance between centroids of clusters $k$ and $l$ .

Lower DB index = better separation. Choose $K$ that minimises $DB(K)$ .

8.6 K-Medoids (PAM — Partitioning Around Medoids)

$K$ -medoids is a robust alternative to $K$ -means that represents each cluster by an actual medoid — the observation within the cluster that minimises the total dissimilarity to all other cluster members:

$\text{medoid}(C_k) = \arg\min_{\mathbf{x}_i \in C_k} \sum_{\mathbf{x}_j \in C_k} d(\mathbf{x}_i, \mathbf{x}_j)$

PAM algorithm:

Step 1 (BUILD): Sequentially select $K$ initial medoids.

Step 2 (SWAP): For each medoid $m$ and each non-medoid observation $o$ :

Tentatively swap $m$ with $o$ and compute the total cost (sum of dissimilarities).
If the total cost decreases, make the swap permanent.

Step 3: Repeat until no improvement is possible.

Objective function:

$\text{Total Cost} = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k} d(\mathbf{x}_i, \text{medoid}_k)$

Advantages of K-Medoids over K-Means:

Works with any distance matrix (not just Euclidean).
Medoids are actual data points (interpretable).
More robust to outliers (outliers are not selected as medoids under PAM).
Applicable to mixed data with Gower's distance.

Disadvantage: Computationally more expensive than $K$ -means ( $O(K(n-K)^2)$ per iteration).

8.7 K-Modes for Categorical Data

When all variables are categorical (nominal), $K$ -means and $K$ -medoids are not directly applicable. $K$ -modes (Huang, 1998) extends $K$ -means by:

Replacing the mean with the mode as the cluster representative.
Using the simple matching dissimilarity (proportion of mismatches):

$d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p} \mathbf{1}[x_{ik} \neq x_{jk}]$

For mixed data (continuous + categorical), $K$ -prototypes combines:

$d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p_c}(x_{ik} - x_{jk})^2 + \gamma \sum_{k=1}^{p_d}\mathbf{1}[x_{ik} \neq x_{jk}]$

Where $p_c$ = number of continuous variables, $p_d$ = number of categorical variables, and $\gamma > 0$ is a weight balancing the two contributions.

9. Model-Based and Density-Based Clustering

9.1 Gaussian Mixture Models (GMM) — Model-Based Clustering

Model-based clustering (Fraley & Raftery, 2002) treats clustering as a model selection problem. The data are assumed to arise from a GMM, and the optimal number of clusters $K$ and covariance structure are selected using the Bayesian Information Criterion (BIC).

Covariance matrix parameterisation:

The key advantage of GMM is the ability to model different cluster shapes, sizes, and orientations by parameterising the covariance matrices $\boldsymbol{\Sigma}_k$ .

Using eigenvalue decomposition of $\boldsymbol{\Sigma}_k$ :

$\boldsymbol{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}_k^T$

Where:

$\lambda_k = |\boldsymbol{\Sigma}_k|^{1/p}$ = volume (overall size of cluster $k$ ).
$\mathbf{D}_k$ = matrix of eigenvectors = orientation (direction of cluster axes).
$\mathbf{A}_k$ = diagonal matrix with normalised eigenvalues = shape (ratio of axes).

By constraining or freeing these three components across clusters, 14 different model types are defined in the mclust framework:

Model	Volume	Shape	Orientation	Description
EII	Equal	Spherical	—	$K$ -means-like spherical clusters
VII	Variable	Spherical	—	Spherical clusters of different sizes
EEI	Equal	Equal	Axis-aligned	Equal diagonal covariance
VEI	Variable	Equal	Axis-aligned	Variable-size axis-aligned
EEE	Equal	Equal	Equal	Equal ellipsoidal (all same shape)
VVV	Variable	Variable	Variable	Fully unconstrained (most flexible)

BIC for model selection:

$\text{BIC}(K, \text{model}) = 2\hat{\ell} - q \ln(n)$

Where $\hat{\ell}$ is the maximised log-likelihood and $q$ is the number of free parameters. Higher BIC = better model (note: some software uses $-2\hat{\ell} + q\ln(n)$ ; here higher means better for the positive convention used in mclust).

ICL (Integrated Complete-data Likelihood):

$\text{ICL}(K) = \text{BIC}(K) + 2\sum_{i=1}^{n}\sum_{k=1}^{K} \hat{r}_{ik}\ln(\hat{r}_{ik})$

ICL adds an entropy penalty for fuzzy assignments. Models where observations are clearly assigned (low entropy) are favoured by ICL over BIC.

9.2 Soft Cluster Membership in GMM

A major advantage of GMM over $K$ -means is the production of posterior probabilities of cluster membership for each observation. After EM convergence:

$P(\text{cluster} = k \mid \mathbf{x}_i) = r_{ik} = \frac{\hat{\pi}_k \mathcal{N}(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)}{\sum_{j=1}^{K} \hat{\pi}_j \mathcal{N}(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}_j)}$

These probabilities allow identification of:

Borderline cases ( $r_{ik} \approx 0.5$ for two clusters) — observations near cluster boundaries.
Atypical members (low probability of belonging to any cluster).
Uncertainty maps — spatial or other visualisations of cluster membership uncertainty.

9.3 DBSCAN — Density-Based Spatial Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise; Ester et al., 1996) identifies clusters as dense regions of the data space, separated by low-density regions. It does not require specifying $K$ in advance and can identify outliers (noise points).

Key parameters:

$\varepsilon$ (epsilon): The neighbourhood radius — how far from a point to look for neighbours.
MinPts: The minimum number of points within distance $\varepsilon$ for a point to be considered a core point.

Point classification:

A point $\mathbf{x}_i$ is classified as:

Core point: $|\mathcal{N}_\varepsilon(\mathbf{x}_i)| \geq \text{MinPts}$

Where $\mathcal{N}_\varepsilon(\mathbf{x}_i) = \{\mathbf{x}_j : d(\mathbf{x}_i, \mathbf{x}_j) \leq \varepsilon\}$ is the $\varepsilon$ -neighbourhood of $\mathbf{x}_i$ .

Border point: Not a core point, but within distance $\varepsilon$ of a core point.

Noise point (outlier): Neither a core point nor a border point.

DBSCAN algorithm:

Label all points as unvisited.
For each unvisited core point $p$ : mark as visited; create a new cluster $C$ ; add all points in $\mathcal{N}_\varepsilon(p)$ to $C$ .
For each newly added point $q$ in $C$ : if $q$ is a core point, add all points in $\mathcal{N}_\varepsilon(q)$ to $C$ (density reachability).
Continue until no more points can be added to $C$ ; label all non-core border points.
Mark remaining unvisited non-core points as noise.

Key properties of DBSCAN:

Does not require specifying $K$ .
Can discover clusters of arbitrary shape.
Automatically identifies and labels outliers as noise.
Produces the same results for the same $\varepsilon$ and MinPts (deterministic).
Struggles with clusters of varying density.

9.4 Choosing DBSCAN Parameters

Setting MinPts:

Rule of thumb: MinPts $= 2 \times p$ (twice the number of dimensions).
Larger MinPts → more stringent core point requirement → fewer, larger clusters.
Minimum recommended: MinPts $\geq 4$ for 2D data; MinPts $\geq 2p$ for higher dimensions.

Setting $\varepsilon$ :

Plot the $k$ -nearest neighbour distance plot (k-NN distance plot) where $k = \text{MinPts}$ .
Sort observations by their MinPts-nearest neighbour distance in decreasing order.
The knee/elbow of this plot corresponds to the appropriate $\varepsilon$ .

$\varepsilon^* \approx d_{MinPts-NN, \text{knee}}$

9.5 HDBSCAN — Hierarchical DBSCAN

HDBSCAN extends DBSCAN to handle varying density clusters by transforming the space using the concept of mutual reachability distance:

$d_{mreach,MinPts}(\mathbf{x}_i, \mathbf{x}_j) = \max[\text{core}_{MinPts}(\mathbf{x}_i), \text{core}_{MinPts}(\mathbf{x}_j), d(\mathbf{x}_i, \mathbf{x}_j)]$

Where $\text{core}_{MinPts}(\mathbf{x}_i)$ is the MinPts-nearest neighbour distance (the core distance) of point $\mathbf{x}_i$ .

HDBSCAN builds a hierarchical clustering tree on this transformed space and extracts a flat clustering by selecting the most stable clusters across all density levels. Only MinPts needs to be specified (not $\varepsilon$ ).

9.6 Fuzzy C-Means

Fuzzy C-Means (FCM) is a soft-assignment generalisation of $K$ -means. Each observation has a degree of membership $u_{ik} \in [0, 1]$ in each cluster $k$ , with $\sum_{k=1}^K u_{ik} = 1$ .

FCM objective function:

$J_m = \sum_{i=1}^{n}\sum_{k=1}^{K} u_{ik}^m \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2$

Where $m > 1$ is the fuzziness parameter (typically $m = 2$ ):

$m \to 1$ : Hard assignments (approaches $K$ -means).
$m \to \infty$ : All memberships equal $1/K$ (completely fuzzy, no structure).

FCM update equations:

Centroids:

$\bar{\mathbf{x}}_k = \frac{\sum_{i=1}^{n} u_{ik}^m \mathbf{x}_i}{\sum_{i=1}^{n} u_{ik}^m}$

Memberships:

$u_{ik} = \frac{1}{\sum_{j=1}^{K}\left(\frac{\|\mathbf{x}_i - \bar{\mathbf{x}}_k\|}{\|\mathbf{x}_i - \bar{\mathbf{x}}_j\|}\right)^{2/(m-1)}}$

Iterate until convergence ( $\|U^{(t+1)} - U^{(t)}\| < \varepsilon$ ).

10. Model Fit and Evaluation

Evaluating cluster quality is fundamentally more challenging than evaluating supervised model fit because there is no ground truth to compare against. Multiple complementary criteria must be consulted.

10.1 Internal Validity Indices

Internal validity indices assess cluster quality using only the clustering data and the resulting cluster assignments.

Average Silhouette Width (ASW):

$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i), \quad s(i) = \frac{b(i) - a(i)}{\max[a(i), b(i)]}$

Higher is better. Optimal $K$ = $K$ that maximises $\bar{s}$ .

Calinski-Harabasz (CH) Index:

$CH = \frac{(n - K) \cdot \text{BCSS}}{(K - 1) \cdot \text{WCSS}}$

Higher is better.

Davies-Bouldin (DB) Index:

$DB = \frac{1}{K}\sum_{k=1}^{K}\max_{l \neq k}\frac{\sigma_k + \sigma_l}{d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)}$

Lower is better.

Dunn Index:

$D = \frac{\min_{k \neq l} d_{min}(C_k, C_l)}{\max_m \Delta(C_m)}$

Where $d_{min}(C_k, C_l) = \min_{\mathbf{x} \in C_k, \mathbf{y} \in C_l} d(\mathbf{x}, \mathbf{y})$ is the minimum inter-cluster distance and $\Delta(C_m) = \max_{\mathbf{x}, \mathbf{y} \in C_m} d(\mathbf{x}, \mathbf{y})$ is the diameter (maximum intra-cluster distance) of cluster $m$ .

Higher Dunn index = better separation and compactness. Sensitive to outliers.

Summary of Internal Validity Indices:

Index	Optimal $K$	Better Solution	Sensitivity
ASW	Maximise	Higher	Moderate
CH	Maximise	Higher	Low
DB	Minimise	Lower	Moderate
Dunn	Maximise	Higher	High (outliers)
WCSS (Elbow)	Elbow point	Lower	Low
Gap statistic	First local max	Higher	Moderate

10.2 External Validity Indices

When true cluster labels are available (e.g., in a validation or simulation study), external validity indices compare the clustering result to the known ground truth.

Adjusted Rand Index (ARI):

The Rand Index counts agreements between pairs of observations in the clustering and in the true partition. The Adjusted Rand Index corrects for chance agreement:

$\text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}$

More explicitly, for contingency table $\mathbf{N}$ with elements $n_{ij}$ (number of objects in cluster $i$ of solution 1 AND cluster $j$ of solution 2):

$\text{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}$

Where $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$ are row and column marginals.

ARI	Interpretation
$1.0$	Perfect agreement
$0.61 - 0.99$	Excellent
$0.41 - 0.60$	Good
$0.21 - 0.40$	Fair
$0.00 - 0.20$	Slight / random agreement
$< 0$	Worse than random

Normalised Mutual Information (NMI):

$\text{NMI}(U, V) = \frac{2 \cdot I(U; V)}{H(U) + H(V)}$

Where $I(U;V)$ is the mutual information and $H(U)$ , $H(V)$ are the entropies of the two clusterings:

$I(U;V) = \sum_{k=1}^{K}\sum_{l=1}^{L} \frac{n_{kl}}{n} \ln\frac{n \cdot n_{kl}}{n_k \cdot n_l}$

NMI ranges from 0 (no agreement) to 1 (perfect agreement). Does not correct for chance (use AMI — Adjusted Mutual Information — for a chance-corrected version).

10.3 Relative Validity — Stability Indices

Cluster stability assesses whether the cluster solution is reproducible across subsamples of the data. A stable solution will produce similar cluster assignments when the analysis is repeated on bootstrap samples.

Bootstrap stability algorithm:

Draw $B = 100$ bootstrap samples of size $n$ (with replacement).
Apply the clustering algorithm to each bootstrap sample.
For each bootstrap solution, match clusters to the original solution (using the Hungarian algorithm or maximum overlap matching).
Compute the mean ARI between the original and bootstrap solutions:

$\text{Stability} = \frac{1}{B}\sum_{b=1}^{B} \text{ARI}(\hat{C}, \hat{C}_b^*)$

Stability	Interpretation
$> 0.85$	Very stable — reproducible solution
$0.75 - 0.85$	Stable
$0.60 - 0.74$	Moderately stable
$< 0.60$	Unstable — solution may not replicate

10.4 BIC and AIC for Model-Based Clustering

For GMM, the optimal number of components $K$ is selected using BIC:

$\text{BIC} = 2\hat{\ell} - q\ln(n)$

Select $K$ that maximises BIC (using the mclust convention where higher = better).

AIC:

$\text{AIC} = 2\hat{\ell} - 2q$

AIC tends to select more components than BIC (favours complexity less strictly than BIC).

10.5 Comparing $K$ Values — A Comprehensive Decision Framework

Criterion	Supports Low $K$	Supports High $K$	Best Used With
Elbow (WCSS)	Elbow is early	Elbow is late	K-Means
Silhouette	High $\bar{s}$ at low $K$	High $\bar{s}$ at high $K$	Any method
Gap Statistic	Early plateau	Late plateau	Any method
CH Index	Peak at low $K$	Peak at high $K$	K-Means, Hierarchical
DB Index	Minimum at low $K$	Minimum at high $K$	K-Means, Hierarchical
BIC (GMM)	Peak at low $K$	Peak at high $K$	GMM only
Bootstrap stability	High stability at low $K$	High stability at high $K$	Any method
Theoretical expectations	Strong prior for few types	Strong prior for many types	All analyses

💡 Best practice: Evaluate the clustering solution at $K-1$ , $K$ , and $K+1$ for the statistically suggested $K$ . Choose the solution that is most theoretically interpretable AND statistically defensible. Document the rationale for the final $K$ selection.

10.6 Cluster Profiling and Interpretation

After selecting the final clustering solution, profile each cluster by computing:

Means and standard deviations of each continuous clustering variable by cluster.
Frequencies and proportions of categorical variables by cluster.
ANOVA or Kruskal-Wallis tests for each variable to confirm clusters differ.
Effect sizes ( $\eta^2$ or rank-biserial $r$ ) for each variable.
Discriminant function analysis to identify which variables best discriminate clusters.

A cluster profile heatmap (rows = clusters, columns = variables, cells = standardised mean) provides an intuitive summary of how clusters differ across all variables simultaneously.

11. Advanced Topics

11.1 Dimensionality Reduction Before Clustering

High-dimensional data (large $p$ ) pose several challenges for clustering:

Curse of dimensionality: In high dimensions, all pairwise distances converge to the same value, making distance-based clustering meaningless.
Noise variables: Many variables may be unrelated to the cluster structure, adding noise to the distance calculation.
Visualisation: Impossible to visualise clusters beyond 3 dimensions.

Solution 1 — PCA before clustering:

Apply Principal Component Analysis to reduce $p$ to $p' \ll p$ dimensions that capture most of the variance:

$\mathbf{Z} = \mathbf{X}\mathbf{V}_{p'}$

Where $\mathbf{V}_{p'}$ contains the first $p'$ principal components. Cluster on $\mathbf{Z}$ instead of $\mathbf{X}$ . Retain enough components to explain $\geq 80\%$ of total variance.

Solution 2 — UMAP/t-SNE for visualisation:

For visualising high-dimensional clusters in 2D:

t-SNE (t-Distributed Stochastic Neighbour Embedding): Preserves local structure. Computationally expensive; not suitable for large $n$ .
UMAP (Uniform Manifold Approximation and Projection): Faster than t-SNE; preserves both local and global structure better.

⚠️ Never cluster on t-SNE or UMAP embeddings — their stochastic, non-linear nature distorts distances in ways that invalidate distance-based clustering. Use PCA for pre-clustering dimensionality reduction; use t-SNE/UMAP only for visualisation.

11.2 Cluster Validation with External Variables

After obtaining a cluster solution, validate it by examining whether clusters differ meaningfully on variables that were NOT used in forming the clusters (held-out variables):

Criterion validity: Do clusters differ significantly on a relevant outcome variable not used in clustering?

$F_{\text{criterion}} = \frac{\text{BCSS}_{criterion}/(K-1)}{\text{WCSS}_{criterion}/(n-K)}$

Predictive validity: Can cluster membership predict future outcomes? Run a logistic regression (or multinomial logistic regression for $K > 2$ ) predicting cluster membership from domain-relevant external predictors. A model that classifies with high accuracy validates the cluster solution.

Cross-tabulation with known groups: If true group membership is partially known (e.g., diagnosed patients vs. healthy controls), compare cluster memberships against true groups using chi-squared tests and ARI.

11.3 Mixed-Type Data Clustering

Real-world datasets often contain a mixture of continuous, ordinal, and nominal variables. Three main strategies exist:

Strategy 1 — Gower's distance + hierarchical/K-Medoids:

Compute Gower's distance matrix and feed it into hierarchical clustering or PAM. Most widely recommended approach for mixed data.

Strategy 2 — Factor Analysis of Mixed Data (FAMD):

Apply FAMD to extract factor scores that accommodate both continuous (PCA-style) and categorical (MCA-style) variables, then cluster on the factor scores.

Strategy 3 — Latent Class Analysis (LCA):

Model-based approach where the cluster structure is defined by a joint probability model over all variable types. Each "class" is characterised by a probability distribution over all variables. Suitable for primarily categorical data.

11.4 Longitudinal Clustering

When the same subjects are measured repeatedly over time, trajectory clustering (also called group-based trajectory modelling or latent growth curve clustering) identifies subgroups with similar temporal patterns.

Group-Based Trajectory Model (Nagin, 2005):

$P(\text{class} = k) = \pi_k$

For each class $k$ , the trajectory of outcome $Y_{it}$ over time $t$ is modelled as:

$E(Y_{it} \mid \text{class} = k) = \beta_{k0} + \beta_{k1}t + \beta_{k2}t^2 + \dots$

Model selection is based on BIC across models with different numbers of trajectory groups and different polynomial orders for each group.

11.5 Ensemble Clustering (Consensus Clustering)

Consensus clustering combines multiple clustering solutions (from different algorithms, different $K$ values, or different random starts) to produce a more stable and robust final partition.

Algorithm:

Generate $B$ clustering solutions: $\{\hat{C}_1, \hat{C}_2, \dots, \hat{C}_B\}$ .
Build the co-association matrix $\mathbf{A}$ where $A_{ij}$ = proportion of solutions in which observations $i$ and $j$ are assigned to the same cluster:

$A_{ij} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}[\hat{c}_i^{(b)} = \hat{c}_j^{(b)}]$

Apply hierarchical clustering to $\mathbf{A}$ (treating it as a similarity matrix).
The consensus clustering is the result of cutting this dendrogram at the desired $K$ .

A cluster solution with high consensus (many entries in $\mathbf{A}$ near 0 or 1) is more stable than one with many entries near 0.5 (indicating ambiguous assignments).

11.6 Reporting Cluster Analysis Results

Best-practice reporting for cluster analysis includes:

Pre-analysis decisions:

Variables selected for clustering and rationale.
Standardisation approach.
Distance measure chosen and justification.
Algorithm(s) and settings (linkage for hierarchical; random starts for K-Means).
Method for determining $K$ .

Main results:

Final $K$ chosen and rationale (statistical indices + theory).
Cluster sizes ( $n$ and percentage).
Cluster validity indices (at minimum ASW and one other).
Stability assessment (bootstrap ARI).

Cluster characterisation:

Profile table: means/proportions for all clustering variables by cluster (with SDs).
Statistical tests comparing clusters on each variable.
Effect sizes for each variable.
A descriptive label for each cluster.

Visualisations:

Dendrogram (for hierarchical).
PCA biplot coloured by cluster.
Cluster profile heatmap.
Interaction plots for key variables.

12. Worked Examples

Example 1: Hierarchical Clustering — Patient Subtype Identification

A clinical researcher collects data on $n = 80$ patients with chronic fatigue syndrome on five symptom severity scores (each 0–10): Pain ( $X_1$ ), Fatigue ( $X_2$ ), Cognitive Impairment ( $X_3$ ), Sleep Disturbance ( $X_4$ ), and Mood ( $X_5$ ).

Step 1 — Standardise variables:

All five variables are on the same 0–10 scale with similar SDs — standardisation applied for consistency.

Step 2 — Compute distance matrix:

Euclidean distances on standardised scores.

Step 3 — Run Ward's hierarchical clustering.

Step 4 — Evaluate the dendrogram:

Dendrogram heights of final merges (from bottom up): $h_5 = 1.2, h_4 = 1.8, h_3 = 4.1, h_2 = 8.9, h_1 = 14.2$

Height gaps: $0.6, 2.3, 4.8, 5.3$ → largest gap is between $h_2$ and $h_3$ ( $\Delta = 4.8$ ), suggesting cutting at 3 clusters (before the large jump to 2 clusters).

Cophenetic correlation: $r_{coph} = 0.82$ → Good representation.

Step 5 — Evaluate 3-cluster solution:

Index	$K=2$	$K=3$	$K=4$
ASW	0.58	0.64	0.51
CH	41.2	52.8	47.1
DB	0.82	0.61	0.78

All three indices favour $K = 3$ . Decision: 3 clusters.

Step 6 — Cluster Profiles:

Variable	Cluster 1 ( $n=28$ )	Cluster 2 ( $n=31$ )	Cluster 3 ( $n=21$ )
Pain	$7.8 \pm 1.2$	$4.2 \pm 1.5$	$2.1 \pm 1.0$
Fatigue	$8.5 \pm 0.9$	$6.1 \pm 1.4$	$3.5 \pm 1.2$
Cognitive	$7.2 \pm 1.5$	$3.8 \pm 1.6$	$5.8 \pm 1.3$
Sleep	$8.1 \pm 1.1$	$5.5 \pm 1.5$	$4.9 \pm 1.4$
Mood	$7.4 \pm 1.3$	$4.9 \pm 1.6$	$6.1 \pm 1.2$
Bootstrap ARI	0.88

Cluster Labels:

Cluster 1 (n=28, 35%): "Severe" — high scores on all dimensions (mean profile 7.8–8.5).
Cluster 2 (n=31, 39%): "Moderate/Somatic" — moderate physical symptoms, milder cognitive and mood symptoms.
Cluster 3 (n=21, 26%): "Cognitive-Affective" — lower pain and fatigue but elevated cognitive impairment and mood disturbance.

Conclusion: Ward's hierarchical clustering identified three clinically meaningful patient subtypes. The 3-cluster solution was supported by all validity indices (ASW = 0.64, CH = 52.8, DB = 0.61) and demonstrated good bootstrap stability (ARI = 0.88). The three profiles suggest distinct symptom phenotypes that may benefit from different treatment approaches: Cluster 1 (Severe) may require multimodal comprehensive intervention; Cluster 2 (Moderate/Somatic) may respond to physical symptom management; Cluster 3 (Cognitive-Affective) may benefit from cognitive-behavioural and mood-focused therapies.

Example 2: K-Means Clustering — Customer Segmentation

A retail company collects data on $n = 500$ customers: Annual spend ( $X_1$ , £), Purchase frequency ( $X_2$ , orders/year), Average order value ( $X_3$ , £), Days since last purchase ( $X_4$ , recency), and Customer age ( $X_5$ , years).

Step 1 — Standardise all variables.

Step 2 — Determine optimal K:

WCSS, silhouette, and gap statistic evaluated for $K = 2$ to $K = 8$ :

$K$	WCSS	$\Delta$ WCSS	ASW	CH	Gap
2	1842	—	0.52	285	0.41
3	1341	501	0.61	341	0.58
4	1052	289	0.58	312	0.62
5	892	160	0.51	287	0.60
6	798	94	0.44	251	0.55
7	741	57	0.38	218	0.51
8	712	29	0.32	190	0.48

Elbow: Largest $\Delta$ WCSS at $K=3$ (501) vs. $K=4$ (289) → elbow at $K=3$ . Silhouette: Maximised at $K=3$ (ASW = 0.61). Gap statistic: Gap(3) = 0.58 $\geq$ Gap(4) $-$ $s_4$ = 0.62 $-$ 0.05 = 0.57. ✅ Decision: $K = 3$ clusters.

Step 3 — Run K-Means (50 random starts, K-Means++ init).

Step 4 — Cluster profiles (unstandardised means):

Variable	Cluster 1 "Premium" ( $n=142$ )	Cluster 2 "Standard" ( $n=218$ )	Cluster 3 "Dormant" ( $n=140$ )
Annual Spend	£4,820	£1,240	£380
Frequency	18.2 orders	8.4 orders	2.1 orders
Avg Order Value	£265	£148	£181
Days Since Purchase	12 days	41 days	287 days
Age	42 years	38 years	55 years
Size	28.4%	43.6%	28.0%

Bootstrap stability: Mean ARI = 0.91 (Very stable).

ANOVA results: All five variables differ significantly across clusters ( $p < .001$ ).

Business Interpretation:

Cluster 1 "Premium Actives" (28%): High-value, high-frequency recent buyers. Target with loyalty rewards, early access to new products, and premium services.
Cluster 2 "Standard Actives" (44%): Mid-range regular customers, recently active. Target with upselling campaigns and frequency incentives.
Cluster 3 "Dormant" (28%): Low spend, infrequent, long since last purchase. Target with win-back campaigns offering discounts and re-engagement communications.

Example 3: Gaussian Mixture Model — Gene Expression Subtype Discovery

A genomics researcher analyses RNA-seq expression data for $n = 200$ tumour samples on $p = 8$ cancer-relevant gene expression scores (after PCA dimensionality reduction).

GMM with multiple covariance structures evaluated:

Model	$K$	BIC	ICL	Best Fit?
EII (spherical, equal volume)	3	4821	4712	No
EEE (ellipsoidal, equal)	3	5108	5041	No
VVV (fully unconstrained)	2	4950	4882	No
VEE (variable volume, equal shape)	3	5342	5198	Yes
VVE (variable volume and shape)	4	5188	5014	No

Best model: VEE with $K = 3$ (highest BIC = 5342).

Cluster assignments:

Property	Cluster 1 ( $n=78$ )	Cluster 2 ( $n=71$ )	Cluster 3 ( $n=51$ )
Size (%)	39.0%	35.5%	25.5%
Mean entropy	0.12	0.18	0.31
Low-uncertainty (prob $>0.90$ )	92%	88%	71%

Cluster profiles (top discriminating genes):

Gene	Cluster 1	Cluster 2	Cluster 3
TP53	Low ( $-1.8$ )	High ( $+1.4$ )	Moderate ( $+0.3$ )
BRCA1	Moderate ( $+0.2$ )	Low ( $-1.5$ )	High ( $+1.6$ )
EGFR	High ( $+1.7$ )	Low ( $-0.8$ )	Low ( $-0.9$ )
MYC	Moderate ( $+0.5$ )	High ( $+1.3$ )	Low ( $-1.4$ )

Clinical validation (held-out variables not used in clustering):

Outcome	Cluster 1	Cluster 2	Cluster 3	$p$
5-year survival	72%	41%	58%	$< .001$
Stage III/IV (%)	28%	68%	45%	$< .001$
Hormone receptor+ (%)	82%	31%	61%	$< .001$

Molecular Subtypes Identified:

Cluster 1 (n=78, 39%): "EGFR-high / Luminal-like" — EGFR overexpression, hormone receptor positive, best prognosis (72% 5-year survival).
Cluster 2 (n=71, 36%): "TP53-mutant / Aggressive" — TP53 high, MYC amplification, mostly advanced stage, worst prognosis (41% 5-year survival).
Cluster 3 (n=51, 26%): "BRCA1-high / Basal-like" — BRCA1 elevated, intermediate prognosis, high hormone receptor positivity (58% 5-year survival).

Conclusion: GMM with VEE covariance structure identified three biologically meaningful and clinically validated tumour subtypes. The soft assignment probabilities revealed that Cluster 3 members were more ambiguous (lower certainty), suggesting this subtype is transitional. The strong association with 5-year survival and stage distribution (all $p < .001$ ) confirms the clinical validity of these molecular subtypes.

13. Common Mistakes and How to Avoid Them

Mistake 1: Not Standardising Variables Before Distance-Based Clustering

Problem: Variables measured on different scales (e.g., income in £10,000s and age in years) will result in the large-scale variable dominating the Euclidean distance calculation. Clusters will reflect variation in that variable rather than the true multivariate structure.
Solution: Always standardise continuous variables to $z$ -scores before computing Euclidean distances, unless all variables are on the same scale or Mahalanobis distance is used. Report which standardisation method was applied.

Mistake 2: Choosing $K$ Based Solely on the Elbow Plot

Problem: The elbow in the WCSS curve is frequently ambiguous — the "elbow" can be in different positions depending on the scale and the eye of the beholder. Over-reliance on a single criterion leads to arbitrary $K$ selection.
Solution: Use at least three complementary criteria: the elbow method, silhouette analysis, and the gap statistic (or BIC for GMM). Combine statistical guidance with theoretical expectations about the expected number of subgroups. Evaluate the interpretability and stability of solutions at $K-1$ , $K$ , and $K+1$ .

Mistake 3: Running K-Means With a Single Random Start

Problem: $K$ -means converges to a local optimum that depends on the random initialisation. A single random start frequently produces a poor solution with unnecessarily high WCSS.
Solution: Always use multiple random starts (at minimum 20, preferably 50–100) and keep the solution with the lowest WCSS. Use K-Means++ initialisation as the default — it dramatically reduces the probability of poor local optima.

Mistake 4: Treating Cluster Labels as Stable Across Analyses

Problem: Cluster labels (Cluster 1, 2, 3, etc.) are arbitrarily assigned and change between runs of $K$ -means. Researchers sometimes reference "Cluster 1" without realising it may correspond to a completely different group in a different run.
Solution: Always refer to clusters by their content labels (e.g., "High-Risk Group," "Moderate Group") rather than numerical labels. Match clusters across runs using the Hungarian algorithm or maximum overlap matching. Set a random seed for reproducibility.

Mistake 5: Ignoring Cluster Validity and Stability

Problem: Reporting a cluster solution without any assessment of its validity or reproducibility. A clustering that "looks interesting" may simply reflect the noise structure of a small or particular sample and fail to replicate in new data.
Solution: Always report at least one internal validity index (ASW recommended) and a stability assessment (bootstrap ARI). Solutions with ASW $< 0.26$ or bootstrap ARI $< 0.60$ should be interpreted with extreme caution. Validate the solution on an independent holdout sample or using cross-validation.

Mistake 6: Including Outcome Variables in the Clustering

Problem: Including the outcome variable (or variables highly correlated with it) in the feature set used to define clusters creates a circular validation problem — the clusters will naturally differ on the outcome because it was used to define them.
Solution: Cluster using only the predictor variables or features that define the hypothesised subgroups. Hold out the outcome variable and use it only for post-hoc validation (i.e., testing whether the clusters actually differ in a meaningful way on the outcome).

Mistake 7: Using Hierarchical Clustering With Single Linkage for General Purposes

Problem: Single linkage is prone to chaining — it forms long, elongated clusters by progressively adding observations to the nearest existing cluster, even if they are far from the cluster's core members. This produces biologically/psychologically uninterpretable clusters in most applications.
Solution: Use Ward's linkage as the default for general-purpose hierarchical clustering with continuous variables. Use single linkage only when elongated, chain-like clusters are theoretically expected (e.g., trajectory data or sequentially ordered data).

Mistake 8: Claiming Clusters Are "Real" Without Validation

Problem: Clustering always produces groups — even completely random data will be partitioned into $K$ clusters by $K$ -means. Presenting the resulting solution as a discovery of genuine population subgroups without testing whether the clusters reflect real structure is scientifically inappropriate.
Solution: Test the null hypothesis of no cluster structure using the Hopkins statistic before clustering. Validate the solution using held-out variables, external criteria, or an independent replication sample. Use language that reflects the exploratory and hypothesis- generating nature of cluster analysis (e.g., "consistent with three subtypes" rather than "three subtypes were identified").

Mistake 9: Including Highly Correlated Variables Without Addressing Redundancy

Problem: When two highly correlated variables ( $|r| > 0.85$ ) are both included in the clustering, that dimension is effectively double-weighted in the distance calculation. The resulting clusters reflect that dimension disproportionately compared to others.
Solution: Inspect the correlation matrix of clustering variables. Remove one variable from each highly correlated pair, or combine them into a composite. Alternatively, apply PCA first and cluster on principal component scores, which are uncorrelated by construction.

Mistake 10: Using Cluster Means to Validate Clusters Using the Same Data

Problem: Computing ANOVA or $t$ -tests to compare clusters on the variables used to form them, and then reporting these differences as evidence that the clusters are "real" or "distinct," is tautological. Of course the clusters differ on the variables used to create them — that is how clustering works.
Solution: Validate clusters using held-out variables (variables not used in clustering) or external outcomes (e.g., diagnosis, treatment response, survival). ANOVA on clustering variables is useful only for describing the clusters (profiling), not for validating their existence or clinical/practical significance.

14. Troubleshooting

Problem	Likely Cause	Solution
K-Means produces a singleton cluster (cluster of 1)	An extreme outlier has been assigned its own cluster	Remove outliers before clustering; use K-Medoids (PAM); check for data entry errors
WCSS elbow plot shows no clear elbow	Data may have no genuine cluster structure; too many/few $K$ values tested	Test with Hopkins statistic; extend the range of $K$ ; try hierarchical clustering to see dendrogram
Silhouette width is low ( $< 0.25$ ) across all $K$	No cluster structure in data; wrong distance measure; wrong variables selected	Reassess variable selection; try Gower's distance for mixed data; consider the data may not have clusters
K-Means result changes dramatically between runs	Too few random starts; very flat WCSS landscape; data near a saddle point	Increase to 100+ random starts; use K-Means++ initialisation; try different $K$
Hierarchical dendrogram shows inversions (lower heights after higher)	Centroid linkage used with non-Euclidean distances	Switch to Ward, Average, or Complete linkage; avoid centroid linkage
DBSCAN classifies almost all points as noise	$\varepsilon$ is too small	Increase $\varepsilon$ ; replot the $k$ -NN distance graph and identify the correct elbow
DBSCAN produces one giant cluster	$\varepsilon$ is too large	Decrease $\varepsilon$ ; re-examine $k$ -NN distance plot
GMM EM algorithm does not converge	Too many components; degenerate covariance (near-singular); small $n$	Reduce $K$ ; use constrained covariance (e.g., EEE); increase sample size; add regularisation
GMM produces a cluster with near-zero mixing proportion ( $\pi_k \approx 0$ )	Over-specification of $K$ ; one cluster is absorbing outliers	Reduce $K$ ; check for outliers in the data
Bootstrap stability is very low ( $< 0.60$ )	Sample size too small; clusters not well-separated; wrong $K$	Increase $n$ ; try different $K$ ; increase the number of bootstrap replicates to 500
All validity indices disagree on the optimal $K$	Genuine ambiguity in the data structure; clusters are overlapping	Report multiple solutions; choose based on theory; use GMM soft assignments to quantify overlap
Cluster sizes are extremely unequal (e.g., $n=2$ vs. $n=498$ )	Outliers forming their own cluster; $K$ is too large; forced spherical clusters	Remove outliers; reduce $K$ ; use DBSCAN which flags outliers as noise
Variables with zero or near-zero variance included	Data preprocessing omission	Check variable SDs; remove constant or near-constant variables before clustering
PCA biplot shows no cluster separation	Clusters may differ on dimensions not captured by PC1 and PC2	Plot PC2 vs. PC3, PC1 vs. PC3; use 3D plots; check if cluster differences are in higher components

15. Quick Reference Cheat Sheet

Core Equations

Formula	Description
$d_E = \sqrt{\sum_{k=1}^p(x_{ik}-x_{jk})^2}$	Euclidean distance
$d_M = \sum_{k=1}^p \vert x_{ik}-x_{jk} \vert$	Manhattan distance
$d_{Mah} = \sqrt{(\mathbf{x}_i-\mathbf{x}_j)^T\mathbf{S}^{-1}(\mathbf{x}_i-\mathbf{x}_j)}$	Mahalanobis distance
$d_G = \frac{\sum_k \delta_{ijk}d_{ijk}}{\sum_k \delta_{ijk}}$	Gower's distance (mixed data)
$\text{WCSS} = \sum_{k=1}^K\sum_{\mathbf{x}_i \in C_k}\\|\mathbf{x}_i - \bar{\mathbf{x}}_k\\|^2$	Within-cluster sum of squares
$c_i = \arg\min_k \\|\mathbf{x}_i - \bar{\mathbf{x}}_k\\|^2$	K-Means assignment step
$\bar{\mathbf{x}}_k = \frac{1}{n_k}\sum_{\mathbf{x}_i \in C_k}\mathbf{x}_i$	K-Means update step (centroid)
$s(i) = \frac{b(i)-a(i)}{\max[a(i),b(i)]}$	Silhouette coefficient
$\bar{s} = \frac{1}{n}\sum_{i=1}^n s(i)$	Average silhouette width
$CH = \frac{(n-K)\cdot\text{BCSS}}{(K-1)\cdot\text{WCSS}}$	Calinski-Harabasz index
$DB = \frac{1}{K}\sum_{k=1}^K\max_{l\neq k}\frac{\sigma_k+\sigma_l}{d(\bar{\mathbf{x}}_k,\bar{\mathbf{x}}_l)}$	Davies-Bouldin index
$\text{Gap}(K) = E^*[\log\text{WCSS}(K)] - \log\text{WCSS}(K)$	Gap statistic
$r_{ik} = \frac{\pi_k\mathcal{N}(\mathbf{x}_i\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\sum_j\pi_j\mathcal{N}(\mathbf{x}_i\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}$	GMM posterior responsibility
$\text{BIC} = 2\hat{\ell} - q\ln(n)$	BIC for GMM model selection
$r_{coph}$	Cophenetic correlation (hierarchical quality)

Method Selection Guide

Scenario	Recommended Method
Exploratory; unknown $K$ ; small-medium $n$	Agglomerative hierarchical (Ward's linkage)
Large $n$ ; continuous data; approximate $K$ known	$K$ -Means with K-Means++
Outliers present; any distance matrix	$K$ -Medoids (PAM)
Soft assignments desired; elliptical clusters	Gaussian Mixture Model (GMM)
Arbitrary-shaped clusters; automatic noise detection	DBSCAN or HDBSCAN
Mixed variable types	Gower's distance + PAM or hierarchical
Categorical-only variables	$K$ -Modes or Latent Class Analysis
High-dimensional data ( $p > 20$ )	PCA first, then cluster on scores
Varying density clusters	HDBSCAN
Overlapping clusters; uncertainty quantification	GMM or Fuzzy C-Means
Temporal/longitudinal data	Group-Based Trajectory Modelling
Combining multiple algorithms	Ensemble/Consensus Clustering

Distance Measure Selection Guide

Data Type	Recommended Distance
Continuous, different scales	Euclidean (after z-standardisation)
Continuous, outliers present	Manhattan
Continuous, correlated variables	Mahalanobis
Mixed (continuous + categorical)	Gower's
Binary (presence/absence)	Jaccard
High-dimensional (text, genomics)	Cosine
Time series	Dynamic Time Warping (DTW)
Count data	Bray-Curtis

Linkage Method Guide (Hierarchical Clustering)

Linkage	Cluster Shape	Sensitive to Outliers	Recommended
Ward's	Compact, spherical	Moderate	✅ Default choice
Complete	Compact, equal size	Yes	Compact expected clusters
Average	Moderate	Moderate	Good general alternative
Single	Elongated, chained	Very high	Filamentary structures only
Centroid	Moderate	Low	Not recommended (inversions)

Validity Index Benchmarks

Index	Poor	Acceptable	Good	Excellent
Average Silhouette Width	$< 0.25$	$0.26 - 0.50$	$0.51 - 0.70$	$> 0.70$
Bootstrap Stability (ARI)	$< 0.60$	$0.60 - 0.74$	$0.75 - 0.85$	$> 0.85$
Cophenetic Correlation	$< 0.60$	$0.60 - 0.74$	$0.75 - 0.85$	$> 0.85$
ARI vs. ground truth	$< 0.20$	$0.21 - 0.40$	$0.41 - 0.70$	$> 0.70$

$K$ Selection Methods Comparison

Method	Statistic	Optimal $K$	Requires $K$ Range	For
Elbow	WCSS	At elbow	Yes	K-Means
Silhouette	ASW	Maximise	Yes	Any
Gap Statistic	Gap $(K)$	First $\geq$ gap $(K+1)-s$	Yes	Any
CH Index	CH	Maximise	Yes	K-Means, Hierarchical
DB Index	DB	Minimise	Yes	K-Means, Hierarchical
BIC	$2\hat{\ell} - q\ln n$	Maximise	Yes	GMM only
Dendrogram	Height gaps	Largest gap	No	Hierarchical only

Minimum Sample Size Guidelines

Method	Minimum $n$	Minimum per Cluster
Hierarchical	30	5
$K$ -Means	$10K$	10
$K$ -Medoids	$15K$	10
GMM	$10Kp$	$5p$
DBSCAN	50	$\geq$ MinPts
Fuzzy C-Means	$10K$	10

Key Pre-Processing Checklist

Step	Action	When Required
Standardise	Z-score all continuous variables	Always (Euclidean distance)
Remove outliers	Screen using Mahalanobis $D^2$ or box plots	Before distance-based clustering
Handle missing data	Impute or use Gower's distance	When missing data present
Remove redundant variables	Drop if $	r
Check clusterability	Hopkins statistic $> 0.75$	Before clustering
Dimensionality reduction	PCA if $p > 20$	High-dimensional data
Check variable variance	Remove near-zero variance variables	Always

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Cluster Analysis using the DataStatPro application. For further reading, consult Kaufman & Rousseeuw's "Finding Groups in Data: An Introduction to Cluster Analysis" (2005), Everitt, Landau, Leese & Stahl's "Cluster Analysis" (5th ed., 2011), and Fraley & Raftery's "Model-Based Clustering, Discriminant Analysis, and Density Estimation" (Journal of the American Statistical Association, 2002). For feature requests or support, contact the DataStatPro team.

Cluster Analysis

Cluster Analysis: Zero to Hero Tutorial

Table of Contents

1. Prerequisites and Background Concepts

1.1 Distance and Similarity

1.2 Vectors and Centroids

1.3 Variance and Within-Group Variance

1.4 Matrix Notation and the Distance Matrix

1.5 Probability Distributions and Mixture Models

1.6 Optimisation and Convergence

1.7 Supervised vs. Unsupervised Learning

2. What is Cluster Analysis?

2.1 The Core Idea

2.2 What Cluster Analysis Can and Cannot Do

2.3 Real-World Applications

2.4 Cluster Analysis vs. Related Methods

2.5 The Fundamental Challenge: Defining Similarity

3. The Mathematics Behind Cluster Analysis

3.1 Distance and Dissimilarity Measures

3.2 Distance Measure Comparison Table

3.3 The Hierarchical Clustering Algorithm

3.4 Linkage Methods

3.5 The K-Means Objective Function

3.6 The K-Means Algorithm (Lloyd's Algorithm)

3.7 K-Means++ Initialisation

3.8 The Gaussian Mixture Model (GMM)

3.9 The EM Algorithm for GMM

4. Assumptions of Cluster Analysis

4.1 The Existence of Clusters (Clusterability)

4.2 Scale and Measurement Consistency

4.3 Absence of Extreme Outliers

4.4 Appropriate Choice of Distance Measure

4.5 Spherical Cluster Shape (for K-Means)

4.6 Appropriate Number of Clusters (K)

4.7 Adequate Sample Size

5. Types of Cluster Analysis Methods

5.1 Classification by Methodology

5.2 Hierarchical vs. Partitional Methods

5.3 Agglomerative vs. Divisive Hierarchical Clustering

5.4 Hard vs. Soft (Fuzzy) Clustering

5.5 DataStatPro Implementation Overview

6. Using the Cluster Analysis Component

Step-by-Step Guide

7. Hierarchical Clustering

7.1 The Dendrogram

7.2 Choosing the Cut Point

7.3 Linkage Method Comparison

7.4 Cophenetic Correlation Coefficient

7.5 Hierarchical Clustering for Large Datasets

7.6 Dissimilarity Matrix Heatmap

8. K-Means and K-Medoids Clustering

8.1 Choosing K — The Elbow Method

8.2 The Silhouette Coefficient

8.3 The Gap Statistic

8.4 Calinski-Harabasz Index (Variance Ratio Criterion)

8.5 Davies-Bouldin Index

8.6 K-Medoids (PAM — Partitioning Around Medoids)

8.7 K-Modes for Categorical Data

9. Model-Based and Density-Based Clustering

9.1 Gaussian Mixture Models (GMM) — Model-Based Clustering

9.2 Soft Cluster Membership in GMM

9.3 DBSCAN — Density-Based Spatial Clustering

9.4 Choosing DBSCAN Parameters

9.5 HDBSCAN — Hierarchical DBSCAN

9.6 Fuzzy C-Means

10. Model Fit and Evaluation

10.1 Internal Validity Indices

10.2 External Validity Indices

10.3 Relative Validity — Stability Indices

10.4 BIC and AIC for Model-Based Clustering

10.5 Comparing KKK Values — A Comprehensive Decision Framework

10.6 Cluster Profiling and Interpretation

11. Advanced Topics

11.1 Dimensionality Reduction Before Clustering

11.2 Cluster Validation with External Variables

11.3 Mixed-Type Data Clustering

11.4 Longitudinal Clustering

11.5 Ensemble Clustering (Consensus Clustering)

11.6 Reporting Cluster Analysis Results

12. Worked Examples

10.5 Comparing $K$ Values — A Comprehensive Decision Framework

Mistake 2: Choosing $K$ Based Solely on the Elbow Plot

$K$ Selection Methods Comparison