
PLS Regression

Comprehensive reference guide for Partial Least Squares Regression analysis.

PLS Regression: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Partial Least Squares (PLS) Regression all the way through advanced component interpretation, model validation, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is PLS Regression?
  3. The Mathematics Behind PLS Regression
  4. Types of PLS Methods
  5. Assumptions of PLS Regression
  6. Data Preprocessing
  7. PLS Components and Latent Variables
  8. Choosing the Number of Components
  9. Model Fit and Evaluation
  10. Interpretation of PLS Results
  11. Validation Methods
  12. Comparison with Related Methods
  13. Using the PLS Regression Component
  14. Computational and Formula Details
  15. Worked Examples
  16. Common Mistakes and How to Avoid Them
  17. Troubleshooting
  18. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into PLS regression, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Vectors and Matrices

A vector is an ordered list of numbers (a one-dimensional array). A matrix is a two-dimensional array of numbers with $n$ rows and $p$ columns, denoted $\mathbf{X}_{n \times p}$.

Key matrix operations used throughout this tutorial include the transpose $\mathbf{X}^T$, matrix multiplication, and the matrix inverse $\mathbf{X}^{-1}$.

1.2 Projection and Orthogonality

The projection of vector $\mathbf{y}$ onto vector $\mathbf{t}$ is:

$$\text{proj}_{\mathbf{t}} \mathbf{y} = \frac{\mathbf{t}^T \mathbf{y}}{\mathbf{t}^T \mathbf{t}} \mathbf{t}$$

Two vectors are orthogonal if their dot product equals zero: $\mathbf{a}^T \mathbf{b} = 0$. Orthogonality is a central property of PLS components.
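Both definitions are easy to verify numerically. A minimal NumPy sketch on arbitrary illustrative vectors: project $\mathbf{y}$ onto $\mathbf{t}$ and confirm that the residual is orthogonal to $\mathbf{t}$.

```python
import numpy as np

# Arbitrary illustrative vectors
t = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 1.0, 4.0])

# Projection of y onto t: (t'y / t't) * t
proj = (t @ y) / (t @ t) * t

# The part of y not explained by t ...
residual = y - proj

# ... is orthogonal to t: the dot product is (numerically) zero
dot = residual @ t
```

The same decomposition (a score direction plus an orthogonal residual) is exactly what PLS deflation performs at each step.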

1.3 Variance and Covariance

The variance of a variable $X$ measures its spread:

$$\text{Var}(X) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

The covariance between variables $X$ and $Y$ measures how they vary together:

$$\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$

The covariance matrix $\mathbf{S}_{p \times p}$ of a matrix $\mathbf{X}$ contains the pairwise covariances of all $p$ variables. After mean-centring $\mathbf{X}$:

$$\mathbf{S} = \frac{1}{n-1}\mathbf{X}^T \mathbf{X}$$
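The matrix identity above can be checked against NumPy's built-in covariance routine (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 observations, 3 variables (synthetic)

Xc = X - X.mean(axis=0)               # mean-centre each column
S = Xc.T @ Xc / (X.shape[0] - 1)      # covariance matrix via the formula above

S_ref = np.cov(X.T)                   # np.cov expects variables in rows
```

Both computations give the same $3 \times 3$ covariance matrix.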

1.4 Correlation and Multicollinearity

Correlation is the standardised covariance:

$$r_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}$$

Multicollinearity occurs when predictor variables are highly correlated with each other. It causes serious problems in ordinary least squares (OLS) regression: coefficient estimates become unstable, their standard errors inflate, and coefficients can even change sign between very similar datasets.

PLS regression was specifically designed to handle multicollinearity.

1.5 Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$, an eigenvector $\mathbf{v}$ and its associated eigenvalue $\lambda$ satisfy:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

Eigenvalues indicate the amount of variance explained by each eigenvector direction. They are the foundation of Principal Component Analysis (PCA) and are closely related to PLS decompositions.

1.6 Ordinary Least Squares (OLS) Regression

OLS regression models the response $\mathbf{y}$ as a linear function of predictors $\mathbf{X}$:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

The OLS estimator minimises the sum of squared residuals:

$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

This requires $\mathbf{X}^T \mathbf{X}$ to be invertible, which fails when:

- $p > n$ (more predictors than observations), so $\mathbf{X}^T \mathbf{X}$ is rank-deficient.
- Predictors are perfectly (or near-perfectly) collinear, making $\mathbf{X}^T \mathbf{X}$ singular or severely ill-conditioned.

PLS provides a solution to both of these failure modes.
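Both failure modes are easy to demonstrate: $\mathbf{X}^T\mathbf{X}$ has rank below $p$, and hence is singular, whenever $p > n$ or one predictor is an exact linear combination of others. A quick NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Failure mode 1: more predictors than observations (p > n)
X_wide = rng.normal(size=(5, 10))                 # n = 5, p = 10
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)

# Failure mode 2: a perfectly collinear predictor
X_col = rng.normal(size=(20, 3))
X_col = np.column_stack([X_col, X_col[:, 0] + X_col[:, 1]])  # 4th col = col0 + col1
rank_col = np.linalg.matrix_rank(X_col.T @ X_col)

# In both cases rank(X'X) < p, so X'X is singular and the OLS formula is undefined
```

`rank_wide` is 5 (not 10) and `rank_col` is 3 (not 4), so neither Gram matrix can be inverted.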


2. What is PLS Regression?

Partial Least Squares (PLS) Regression is a multivariate statistical method that models the relationship between a matrix of predictor variables $\mathbf{X}$ and one or more response variables $\mathbf{Y}$ by extracting a small number of latent components (factors) that simultaneously:

  1. Explain as much variance in $\mathbf{X}$ as possible.
  2. Have maximum covariance (predictive power) with $\mathbf{Y}$.

Unlike OLS regression, PLS does not require $\mathbf{X}^T \mathbf{X}$ to be invertible and performs well even when predictors far outnumber observations or are highly collinear.

2.1 The Core Idea

The name "Partial Least Squares" reflects the method's origins: the algorithm builds its model through a sequence of partial (simple, local) least-squares regressions rather than one global fit, an approach developed by Herman Wold within the NIPALS framework.

In essence, PLS finds a compressed, low-dimensional representation of $\mathbf{X}$ (the latent components or scores) that is maximally informative about $\mathbf{Y}$. This is what distinguishes PLS from Principal Component Regression (PCR), which only maximises variance in $\mathbf{X}$ without considering $\mathbf{Y}$.

2.2 Real-World Applications

PLS regression is the workhorse of many applied sciences. Common applications include chemometrics (predicting chemical concentrations from spectra), metabolomics and genomics (thousands of measured variables per sample), and industrial process monitoring.

2.3 When to Use PLS Regression

PLS regression is particularly appropriate when:

| Situation | Reason PLS is Preferred |
| --- | --- |
| $p \gg n$ (many more predictors than observations) | OLS is undefined; PLS works with far fewer components than predictors |
| High multicollinearity among predictors | OLS is unstable; PLS constructs orthogonal components |
| Noisy predictor variables (measurement error) | PLS filters noise by focusing on components relevant to $\mathbf{Y}$ |
| Multiple correlated response variables | PLS-2 handles multivariate responses simultaneously |
| Interpretable latent structure is desired | PLS components have clear loading and score interpretations |
| Prediction accuracy is the primary goal | PLS minimises prediction error via cross-validation component selection |

2.4 PLS vs. Related Methods: An Overview

| Feature | OLS | PCR | Ridge | PLS |
| --- | --- | --- | --- | --- |
| Handles $p > n$ | ❌ | ✅ | ✅ | ✅ |
| Handles multicollinearity | ❌ | ✅ | ✅ | ✅ |
| Uses $\mathbf{Y}$ to construct components | N/A | ❌ | N/A | ✅ |
| Produces interpretable components | N/A | ✅ | N/A | ✅ |
| Handles multiple responses | Partially | Partially | Partially | ✅ |
| Requires matrix inversion | ✅ | Partially | ✅ | ❌ |

3. The Mathematics Behind PLS Regression

3.1 The PLS Model Structure

PLS simultaneously decomposes both $\mathbf{X}$ ($n \times p$) and $\mathbf{Y}$ ($n \times q$) into score matrices and loading matrices:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}$$

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{F}$$

Where:

- $\mathbf{T}$ ($n \times A$) is the matrix of X-scores and $\mathbf{P}$ ($p \times A$) the matrix of X-loadings.
- $\mathbf{U}$ ($n \times A$) is the matrix of Y-scores and $\mathbf{Q}$ ($q \times A$) the matrix of Y-loadings.
- $\mathbf{E}$ and $\mathbf{F}$ are residual matrices.

The relationship between the $\mathbf{X}$-scores and $\mathbf{Y}$-scores is modelled as:

$$\mathbf{U} = \mathbf{T}\mathbf{B}_{diag} + \mathbf{H}$$

Where $\mathbf{B}_{diag}$ is a diagonal matrix of inner relation coefficients and $\mathbf{H}$ is a residual matrix.

3.2 The Objective: Maximum Covariance

The core objective of PLS is to find weight vectors $\mathbf{w}$ (for $\mathbf{X}$) and $\mathbf{c}$ (for $\mathbf{Y}$) such that the covariance between the resulting scores is maximised:

$$\max_{\mathbf{w}, \mathbf{c}} \text{Cov}(\mathbf{X}\mathbf{w},\ \mathbf{Y}\mathbf{c})^2 = \max_{\mathbf{w}, \mathbf{c}} (\mathbf{w}^T \mathbf{X}^T \mathbf{Y} \mathbf{c})^2$$

Subject to the normalisation constraints $\|\mathbf{w}\| = 1$ and $\|\mathbf{c}\| = 1$.

This is equivalent to computing the singular value decomposition (SVD) of the cross-product matrix $\mathbf{X}^T \mathbf{Y}$:

$$\mathbf{X}^T \mathbf{Y} = \mathbf{W} \mathbf{S} \mathbf{C}^T$$

Where $\mathbf{S}$ is diagonal with singular values (covariances), $\mathbf{W}$ contains the X-weight vectors, and $\mathbf{C}$ contains the Y-weight vectors.
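This equivalence can be checked directly: take the SVD of the cross-product matrix and confirm that the covariance $|\mathbf{w}^T \mathbf{X}^T \mathbf{Y} \mathbf{c}|$ attained by the leading singular vectors equals the first singular value. A NumPy sketch on centred synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 40, 6, 2
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)   # centred synthetic predictors
Y = rng.normal(size=(n, q)); Y -= Y.mean(axis=0)   # centred synthetic responses

# SVD of the cross-product matrix X'Y
U, s, Vt = np.linalg.svd(X.T @ Y)
w = U[:, 0]        # first X-weight vector (unit norm)
c = Vt[0, :]       # first Y-weight vector (unit norm)

# The covariance attained by these weights is the first singular value
attained = abs(w @ (X.T @ Y) @ c)
```

No other unit-norm pair $(\mathbf{w}, \mathbf{c})$ can attain a larger value, which is exactly the PLS objective.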

3.3 Score and Loading Computation

For each component $a = 1, 2, \dots, A$:

X-scores (latent variable scores):

$$\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$$

Where $\mathbf{X}_a$ is the deflated (residualised) $\mathbf{X}$ matrix at step $a$, and $\mathbf{w}_a$ is the X-weight vector for component $a$.

Y-scores:

$$\mathbf{u}_a = \mathbf{Y}_a \mathbf{c}_a$$

X-loadings (regression of $\mathbf{X}_a$ on $\mathbf{t}_a$):

$$\mathbf{p}_a = \frac{\mathbf{X}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$$

Y-loadings (regression of $\mathbf{Y}_a$ on $\mathbf{u}_a$):

$$\mathbf{q}_a = \frac{\mathbf{Y}_a^T \mathbf{u}_a}{\mathbf{u}_a^T \mathbf{u}_a}$$

Inner relation coefficient:

$$b_a = \frac{\mathbf{u}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$$

3.4 Deflation

After extracting each component, the data matrices are deflated (the component's contribution is removed) to ensure that subsequent components capture new, orthogonal information:

$$\mathbf{X}_{a+1} = \mathbf{X}_a - \mathbf{t}_a \mathbf{p}_a^T$$

For PLS1 (single response):

$$\mathbf{y}_{a+1} = \mathbf{y}_a - b_a \mathbf{t}_a$$

For PLS2 (multiple responses):

$$\mathbf{Y}_{a+1} = \mathbf{Y}_a - b_a \mathbf{t}_a \mathbf{q}_a^T$$

This sequential deflation ensures the X-scores $\mathbf{t}_1, \mathbf{t}_2, \dots, \mathbf{t}_A$ are mutually orthogonal: $\mathbf{t}_a^T \mathbf{t}_b = 0$ for $a \neq b$.

3.5 The PLS Regression Coefficients

After extracting $A$ components, the final PLS regression coefficients relating $\mathbf{X}$ directly to $\mathbf{Y}$ are:

$$\hat{\mathbf{B}}_{PLS} = \mathbf{W}^* \mathbf{Q}^T$$

Where $\mathbf{W}^*$ ($p \times A$) is the matrix of modified weight vectors (also called the W-star or $\mathbf{R}$ matrix), which accounts for the sequential deflation:

$$\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1}$$

The predicted values are then:

$$\hat{\mathbf{Y}} = \mathbf{X} \hat{\mathbf{B}}_{PLS} = \mathbf{X} \mathbf{W}^* \mathbf{Q}^T = \mathbf{T} \mathbf{Q}^T$$

Where $\mathbf{T} = \mathbf{X} \mathbf{W}^*$ are the X-scores computed directly from $\mathbf{X}$ without deflation.

For a single response variable $\mathbf{y}$:

$$\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}_{PLS}, \quad \hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^* \mathbf{q}$$

The predicted value for a new observation $\mathbf{x}_{new}$ is:

$$\hat{y}_{new} = \mathbf{x}_{new}^T \hat{\boldsymbol{\beta}}_{PLS} = \mathbf{x}_{new}^T \mathbf{W}^* \mathbf{q}$$
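Sections 3.3–3.5 condense into a short NIPALS-style sketch. This is an illustrative implementation on synthetic data, not the DataStatPro code; it verifies two properties derived above: the scores produced by sequential deflation are mutually orthogonal, and $\mathbf{T} = \mathbf{X}\mathbf{W}^*$ recovers them directly from the original (undeflated) $\mathbf{X}$.

```python
import numpy as np

def pls1_nipals(X, y, A):
    """Minimal PLS1 via NIPALS with deflation. X and y must be mean-centred."""
    Xa, ya = X.copy(), y.copy()
    W, P, T, q = [], [], [], []
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)           # X-weight (unit norm)
        t = Xa @ w                       # X-score
        p_a = Xa.T @ t / (t @ t)         # X-loading
        q_a = ya @ t / (t @ t)           # Y-loading (a scalar in PLS1)
        Xa = Xa - np.outer(t, p_a)       # deflate X
        ya = ya - q_a * t                # deflate y
        W.append(w); P.append(p_a); T.append(t); q.append(q_a)
    W, P = np.array(W).T, np.array(P).T
    T, q = np.column_stack(T), np.array(q)
    W_star = W @ np.linalg.inv(P.T @ W)  # W* = W (P'W)^{-1}
    beta = W_star @ q                    # final regression coefficients
    return T, W_star, beta

# Synthetic data with a known linear signal plus noise
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5)); X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
y -= y.mean()

T, W_star, beta = pls1_nipals(X, y, A=3)

gram = T.T @ T                            # scores are mutually orthogonal,
off_diag = gram - np.diag(np.diag(gram))  # so off-diagonal entries vanish
T_direct = X @ W_star                     # and T is recovered from the raw X
```

The fitted `beta` can then be applied to new centred observations exactly as in the prediction equation above.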

3.6 Relationship Between PLS and SVD

The core step of PLS, finding weight vectors that maximise the covariance between $\mathbf{t} = \mathbf{X}\mathbf{w}$ and $\mathbf{u} = \mathbf{Y}\mathbf{c}$, reduces to finding the leading left and right singular vectors of $\mathbf{X}^T \mathbf{Y}$:

$$\mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X} \mathbf{w} = \lambda^2 \mathbf{w}$$

$$\mathbf{Y}^T \mathbf{X} \mathbf{X}^T \mathbf{Y} \mathbf{c} = \lambda^2 \mathbf{c}$$

The maximum covariance $\lambda$ is the first singular value of $\mathbf{X}^T \mathbf{Y}$, and $\mathbf{w}$, $\mathbf{c}$ are the corresponding left and right singular vectors.


4. Types of PLS Methods

Several algorithmic variants of PLS exist. The most important ones are described below.

4.1 PLS1

PLS1 handles a single continuous response variable $\mathbf{y}$ ($q = 1$). It is the most common form of PLS regression in practice.

4.2 PLS2

PLS2 handles multiple response variables simultaneously ($q > 1$). It extracts components that explain $\mathbf{X}$ variance while maximising covariance with the entire $\mathbf{Y}$ matrix.

💡 PLS2 is more parsimonious than running separate PLS1 models for each response, but if the responses are very different in nature, separate PLS1 models may give better individual predictions.

4.3 PLS-DA (PLS Discriminant Analysis)

PLS-DA applies PLS regression to a categorical response variable by encoding the class membership as a binary (or dummy-coded) $\mathbf{Y}$ matrix. It is widely used for classification in chemometrics, metabolomics, and genomics.

⚠️ PLS-DA can overfit when many components are used, especially with small or imbalanced datasets. Cross-validation is essential for assessing classification performance.

4.4 OPLS (Orthogonal PLS)

OPLS (Trygg & Wold) separates the $\mathbf{X}$ variation into:

- Predictive variation, correlated with $\mathbf{Y}$.
- Orthogonal variation, systematic but uncorrelated with $\mathbf{Y}$.

This produces a model with a single predictive component (for PLS1) plus orthogonal components that account for systematic $\mathbf{X}$ variation not related to $\mathbf{Y}$. OPLS produces simpler, more interpretable loading plots.

4.5 Kernel PLS

Kernel PLS extends PLS to handle non-linear relationships between $\mathbf{X}$ and $\mathbf{Y}$ by implicitly mapping the data into a high-dimensional feature space using a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$.

Common kernels include the linear, polynomial, and Gaussian (RBF) kernels.

4.6 Sparse PLS

Sparse PLS introduces variable selection by imposing $L_1$ (Lasso-type) penalties on the weight vectors, driving some weights to exactly zero. This simultaneously performs dimensionality reduction and variable selection, producing more interpretable models in high-dimensional settings ($p \gg n$).

4.7 Summary of PLS Variants

| Variant | Response Type | Key Feature | Typical Use |
| --- | --- | --- | --- |
| PLS1 | Single continuous | Standard PLS | Most regression problems |
| PLS2 | Multiple continuous | Joint modelling of all responses | Multivariate responses |
| PLS-DA | Categorical (classes) | Classification via dummy encoding | Chemometrics, omics |
| OPLS | Single/multiple continuous | Separates predictive from orthogonal variation | Improved interpretation |
| Kernel PLS | Single/multiple | Non-linear extension via kernels | Non-linear relationships |
| Sparse PLS | Single/multiple | Variable selection via $L_1$ penalty | High-dimensional data |

The DataStatPro application implements PLS1 and PLS2 (standard PLS regression) and PLS-DA, which are the focus of this tutorial.


5. Assumptions of PLS Regression

PLS regression is a relatively assumption-light method compared to OLS. However, certain conditions should be met for valid results.

5.1 Linearity

PLS assumes a linear relationship between the latent components (scores $\mathbf{T}$) and the response $\mathbf{Y}$. Non-linear relationships between the original $\mathbf{X}$ variables and $\mathbf{Y}$ may be partially captured if the non-linearity is reflected in the latent structure, but Kernel PLS is preferred for strongly non-linear data.

5.2 Continuous (or Appropriately Encoded) Variables

Predictors and responses should be continuous (interval or ratio scale). Categorical variables must be appropriately encoded before use, for example dummy-coded responses as in PLS-DA.

5.3 No Requirement for Multivariate Normality

Unlike some classical multivariate methods, PLS does not require the predictors or response to follow a multivariate normal distribution. This makes PLS robust to skewed or non-normal variables (though severe non-normality may still affect inference).

5.4 No Perfect Redundancy (Degenerate Cases)

If a predictor variable $X_j$ is a perfect linear combination of other predictor variables (i.e., the column is a deterministic function of others), it carries no additional information and should be removed before fitting PLS. Similarly, a response variable that is a perfect linear combination of $\mathbf{X}$ columns will cause a degenerate model.

5.5 Sufficient Sample Size

While PLS handles $p > n$ settings, model quality improves with more observations. General guidelines: as a rough rule of thumb, aim for several observations per extracted component, and prefer LOOCV over $k$-fold CV when $n < 30$.

5.6 Representativeness of Calibration Set

The calibration (training) samples should span the full range of variation expected in future prediction samples. PLS is an interpolation method — predictions for samples outside the calibration space (extrapolation) are unreliable.

5.7 No Gross Outliers

Extreme outliers in $\mathbf{X}$ or $\mathbf{Y}$ can disproportionately influence the extracted components and distort the model. Outliers should be detected (using score plots and leverage/residual diagnostics) and investigated before finalising the model.


6. Data Preprocessing

Data preprocessing is arguably the most critical step in PLS analysis. The choice of preprocessing can have a profound impact on the extracted components and the resulting model.

6.1 Mean Centring

Mean centring subtracts the column mean from each variable:

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_j, \quad \bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}$$

Mean centring is almost always required for PLS. Without it, the first component tends to describe the mean of the data rather than the variance structure, and subsequent components are distorted.

After mean centring, the PLS model is fitted to $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{Y}}$, and predictions are adjusted using the response mean:

$$\hat{y}_{new} = \bar{y} + \tilde{\mathbf{x}}_{new}^T \hat{\boldsymbol{\beta}}_{PLS}$$
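The centring workflow is: store the training means, fit on centred data, and add $\bar{y}$ back at prediction time. A minimal sketch using ordinary least squares as a stand-in for the fitted coefficients (the centring logic is identical for PLS):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, size=(25, 3))     # synthetic data with non-zero means
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 10.0                  # exact linear response (no noise)

# Store the training means, then fit on centred data
x_mean, y_mean = X.mean(axis=0), y.mean()
beta = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)[0]

# New sample: centre with the *training* means, then add y-bar back
x_new = np.array([4.0, 6.0, 5.0])
y_hat = y_mean + (x_new - x_mean) @ beta
y_true = x_new @ true_beta + 10.0
```

Using the training means (never the test-set means) at prediction time is what keeps the model honest on new data.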

6.2 Autoscaling (Mean Centring + Unit Variance Scaling)

Autoscaling (also called standardisation or z-score scaling) both mean-centres and scales each variable to unit variance:

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$

When to use autoscaling: when variables are measured in different units or have very different variances, so that no single variable dominates the components purely because of its scale. It is the usual default choice.

When not to autoscale: when all variables share the same units and their variance is itself informative (e.g., spectral intensities), or when a variable is nearly constant, since dividing by a tiny standard deviation amplifies noise.

6.3 Other Scaling Methods

| Method | Formula | When to Use |
| --- | --- | --- |
| No scaling (mean-centre only) | $\tilde{x}_{ij} = x_{ij} - \bar{x}_j$ | Same units; variance is meaningful |
| Autoscaling (UV) | $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/s_j$ | Different units; default choice |
| Pareto scaling | $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/\sqrt{s_j}$ | Compromise: down-weights high-variance variables less severely than UV |
| Range scaling | $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/(\max_j - \min_j)$ | All variables on a comparable range |
| Vast scaling | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \cdot \frac{\bar{x}_j}{s_j}$ | Focuses on variables with a low coefficient of variation |
| Log transformation | $x_{ij}' = \ln(x_{ij})$ | Right-skewed, multiplicative data (e.g., concentration data, metabolomics) |
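A compact sketch of the column-wise scaling methods from the table (the `scale` helper and its method names are illustrative, not a DataStatPro API):

```python
import numpy as np

def scale(X, method="uv"):
    """Column-wise scaling per the table above (illustrative helper)."""
    xbar = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)
    if method == "center":
        return X - xbar
    if method == "uv":                                   # autoscaling
        return (X - xbar) / s
    if method == "pareto":
        return (X - xbar) / np.sqrt(s)
    if method == "range":
        return (X - xbar) / (X.max(axis=0) - X.min(axis=0))
    raise ValueError(f"unknown method: {method}")

# Synthetic data with wildly different column scales
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

Z = scale(X, "uv")
col_means = Z.mean(axis=0)       # ~0 after autoscaling
col_sds = Z.std(axis=0, ddof=1)  # exactly 1 after autoscaling
```

After autoscaling, every column contributes equally to the component extraction regardless of its original scale.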

💡 The response variable $\mathbf{y}$ should also be mean-centred (and scaled if using PLS2 with multiple responses on different scales). For PLS1, scaling $\mathbf{y}$ to unit variance is optional but often beneficial.

6.4 Handling Missing Data

PLS with missing data requires special treatment. Options include: deleting incomplete observations (listwise deletion), imputing missing entries (mean, nearest-neighbour, or model-based imputation), or using NIPALS-type algorithms that skip missing elements during the iterative fitting.

⚠️ Missing data in $\mathbf{X}$ is more tractable than missing data in $\mathbf{Y}$. If $\mathbf{y}$ values are missing, those observations typically cannot be used in model calibration.

6.5 Outlier Detection Before Modelling

Before fitting PLS, screen for outliers using univariate checks (boxplots, z-scores), PCA score plots, and leverage or Hotelling's $T^2$ diagnostics.


7. PLS Components and Latent Variables

Understanding what PLS components represent is essential for interpreting PLS models.

7.1 X-Scores ($\mathbf{T}$)

The X-scores matrix $\mathbf{T}$ ($n \times A$) contains the coordinates of each observation in the low-dimensional latent space:

$$\mathbf{t}_a = \mathbf{X} \mathbf{w}_a^*$$

Where $\mathbf{w}_a^*$ is the $a$-th column of $\mathbf{W}^*$.

7.2 X-Loadings ($\mathbf{P}$)

The X-loadings matrix $\mathbf{P}$ ($p \times A$) describes how the original $\mathbf{X}$ variables contribute to each latent component during deflation:

$$\mathbf{p}_a = \frac{\mathbf{X}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$$

7.3 X-Weights ($\mathbf{W}$ and $\mathbf{W}^*$)

The X-weights $\mathbf{W}$ describe how the $\mathbf{X}$ variables are weighted to form the X-scores:

$$\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a \quad \text{(during deflation)}$$

The modified X-weights $\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1}$ relate the original (undeflated) $\mathbf{X}$ directly to the scores:

$$\mathbf{T} = \mathbf{X} \mathbf{W}^*$$

⚠️ There is a subtle but important distinction between $\mathbf{W}$ and $\mathbf{W}^*$. For interpreting the relationship between $\mathbf{X}$ variables and the PLS components, use $\mathbf{W}^*$ (not $\mathbf{W}$), as $\mathbf{W}^*$ accounts for the deflation steps and relates to the original $\mathbf{X}$ directly.

7.4 Y-Loadings ($\mathbf{Q}$) and Y-Weights ($\mathbf{C}$)

The Y-loadings $\mathbf{Q}$ ($q \times A$) describe how the $\mathbf{Y}$ variables are reconstructed from the X-scores:

$$\hat{\mathbf{Y}} = \mathbf{T} \mathbf{Q}^T$$

Loading $q_{ja}$ indicates how strongly response variable $Y_j$ is associated with component $a$ in the final prediction equation. For PLS1 ($q = 1$), $\mathbf{Q}$ reduces to a scalar $q_a$ for each component.

7.5 Y-Scores ($\mathbf{U}$)

The Y-scores $\mathbf{U}$ ($n \times A$) are the latent components extracted from $\mathbf{Y}$:

$$\mathbf{u}_a = \mathbf{Y}_a \mathbf{c}_a$$

In the inner relation $\mathbf{u}_a \approx b_a \mathbf{t}_a$, the Y-scores should be close to the X-scores (scaled by $b_a$). The tightness of this relationship is diagnostic of model quality: a tight, linear $\mathbf{t}_a$ vs. $\mathbf{u}_a$ scatter indicates a strong component, whereas a diffuse cloud indicates a weak one.

7.6 The Biplot

The PLS biplot overlays the score plot (observations) and loading plot (variables) in the same space, so that observations can be interpreted directly in terms of the variables that drive their position.


8. Choosing the Number of Components

The number of latent components $A$ is the primary hyperparameter in PLS regression. Too few components underfit (high bias, missing important structure); too many overfit (low bias but high variance, memorising noise).

8.1 Cross-Validation (Primary Method)

Cross-validation (CV) is the gold-standard method for selecting $A$ in PLS. The most common approach is $k$-fold cross-validation:

  1. Divide the $n$ observations into $k$ roughly equal folds.
  2. For each fold $v = 1, \dots, k$:
     a. Fit PLS with $a = 1, 2, \dots, A_{\max}$ components on the data excluding fold $v$.
     b. Predict fold $v$ observations using each fitted model.
     c. Compute prediction errors for fold $v$.
  3. Average the prediction errors across all folds.
  4. Select $A$ minimising the cross-validated prediction error.

Leave-One-Out Cross-Validation (LOOCV): Special case of $k$-fold CV with $k = n$. Computationally intensive but uses the maximum amount of data for fitting at each step.

Recommended $k$: $k = 5$ or $k = 10$ is standard. For small datasets ($n < 30$), LOOCV is preferred.
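The steps above, together with the PRESS statistic and $Q^2$ defined in Sections 8.2 and 8.3, can be sketched as follows. The compact `pls1_fit` helper is an illustrative NIPALS implementation, and the fold-splitting scheme (seeded shuffle, $k = 5$) is an assumption for the sketch:

```python
import numpy as np

def pls1_fit(X, y, A):
    """Compact NIPALS PLS1 returning regression coefficients (centred inputs)."""
    Xa, ya = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)
        t = Xa @ w
        p = Xa.T @ t / (t @ t)
        qa = ya @ t / (t @ t)
        Xa = Xa - np.outer(t, p)
        ya = ya - qa * t
        W.append(w); P.append(p); q.append(qa)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.inv(P.T @ W) @ np.array(q)

def press_kfold(X, y, A, k=5, seed=0):
    """PRESS(A) under k-fold CV; folds come from a seeded shuffle."""
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for held in np.array_split(idx, k):
        train = np.setdiff1d(idx, held)
        xm, ym = X[train].mean(axis=0), y[train].mean()
        beta = pls1_fit(X[train] - xm, y[train] - ym, A)
        pred = ym + (X[held] - xm) @ beta        # centre with *training* means
        press += np.sum((y[held] - pred) ** 2)
    return press

# Synthetic data: two informative predictors out of eight
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=60)

press = {A: press_kfold(X, y, A) for A in range(1, 6)}
best_A = min(press, key=press.get)               # A* = argmin PRESS(A)
ss_tot = np.sum((y - y.mean()) ** 2)
q2 = {A: 1 - press[A] / ss_tot for A in press}   # cross-validated R²
```

Note that each fold is centred with its own training means, so no information from the held-out samples leaks into the fit.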

8.2 PRESS Statistic (Predicted Residual Error Sum of Squares)

The PRESS statistic summarises the cross-validated prediction error for $A$ components:

$$PRESS(A) = \sum_{v=1}^k \sum_{i \in \text{fold } v} \left(y_i - \hat{y}_{i,-v}(A)\right)^2$$

Where $\hat{y}_{i,-v}(A)$ is the prediction for observation $i$ when it was in the held-out fold $v$, using a model with $A$ components.

Select $A^* = \arg\min_A PRESS(A)$.

💡 To guard against overfitting while maintaining parsimony, some guidelines recommend choosing the smallest $A$ for which $PRESS(A)$ is within one standard error of the minimum PRESS (the "one-standard-error rule").

8.3 $Q^2$ (Cross-Validated $R^2$)

$Q^2$ is the cross-validated analogue of $R^2$:

$$Q^2(A) = 1 - \frac{PRESS(A)}{SS_{Y,total}}$$

Where:

$$SS_{Y,total} = \sum_{i=1}^n (y_i - \bar{y})^2$$

$Q^2$ ranges from $-\infty$ (model is worse than predicting the mean) to 1 (perfect cross-validated prediction).

Guidelines for $Q^2$:

| $Q^2$ Value | Interpretation |
| --- | --- |
| $Q^2 < 0$ | Model is worse than the mean; no predictive power |
| $0 \leq Q^2 < 0.50$ | Poor to moderate predictive ability |
| $0.50 \leq Q^2 < 0.70$ | Moderate predictive ability |
| $0.70 \leq Q^2 < 0.90$ | Good predictive ability |
| $Q^2 \geq 0.90$ | Excellent predictive ability (verify not overfitting) |

The optimal number of components is where $Q^2$ reaches a maximum (or where adding more components does not meaningfully increase $Q^2$).

8.4 The $R^2_Y$ vs. $Q^2$ Plot

A standard diagnostic in PLS is to plot both $R^2_Y$ (variance explained in $\mathbf{Y}$, a training set metric) and $Q^2$ (the cross-validated metric) against the number of components $A$.

⚠️ A model with $R^2_Y$ much higher than $Q^2$ is overfitting. Aim for $Q^2$ close to $R^2_Y$.

8.5 Scree Plot of Eigenvalues / Variance Explained

A scree plot of the variance explained in $\mathbf{X}$ ($R^2_X$) by each successive component can also guide component selection: look for an "elbow" where additional components explain diminishing amounts of variance. However, $R^2_X$ alone ignores predictive relevance for $\mathbf{Y}$; always prioritise $Q^2$.

8.6 Permutation Testing

Permutation tests provide a rigorous null-hypothesis test for whether a PLS model with $A$ components explains more variance in $\mathbf{Y}$ than expected by chance:

  1. Randomly permute (shuffle) the $\mathbf{y}$ vector $B$ times (e.g., $B = 999$).
  2. Fit PLS with $A$ components to each permuted dataset and record $R^2_Y$ and $Q^2$.
  3. The p-value is the proportion of permuted $R^2_Y$ (or $Q^2$) values that exceed the observed value.

A significant result ($p < 0.05$) confirms the model captures a real relationship, not a spurious one.
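The procedure above can be sketched with a compact two-component PLS1 fit. This is illustrative only: it uses the training $R^2_Y$ as the permuted statistic and $B = 200$ rather than 999 to keep the run short.

```python
import numpy as np

def r2_pls1(X, y, A=2):
    """Training R²_Y of a compact A-component PLS1 fit (illustrative)."""
    Xa = X - X.mean(axis=0)
    ya = y - y.mean()
    y0 = ya.copy()
    y_hat = np.zeros_like(ya)
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)
        t = Xa @ w
        p = Xa.T @ t / (t @ t)
        q = ya @ t / (t @ t)
        y_hat += q * t                   # accumulated fit: sum_a q_a t_a
        Xa = Xa - np.outer(t, p)
        ya = ya - q * t
    return 1.0 - np.sum((y0 - y_hat) ** 2) / np.sum(y0 ** 2)

# Synthetic data with a genuine X-y relationship
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 10))
y = 2.0 * X[:, 0] + 0.3 * rng.normal(size=30)

r2_obs = r2_pls1(X, y)

# Refit on B datasets with a shuffled (permuted) response
B = 200
r2_perm = np.array([r2_pls1(X, rng.permutation(y)) for _ in range(B)])
p_value = (1 + np.sum(r2_perm >= r2_obs)) / (B + 1)  # one-sided permutation p
```

The permuted fits still achieve non-trivial training $R^2_Y$ (PLS can always fit some noise), which is precisely why the permutation reference distribution, rather than the raw $R^2_Y$ value, should be used to judge significance.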


9. Model Fit and Evaluation

9.1 $R^2_X$ (Variance Explained in $\mathbf{X}$)

The proportion of total $\mathbf{X}$ variance explained by component $a$ is:

$$R^2_{X,a} = \frac{\|\mathbf{t}_a \mathbf{p}_a^T\|^2}{\|\mathbf{X}\|^2} = \frac{(\mathbf{t}_a^T \mathbf{t}_a)(\mathbf{p}_a^T \mathbf{p}_a)}{\|\mathbf{X}\|^2}$$

Cumulative $R^2_X$ after $A$ components:

$$R^2_X(A) = \sum_{a=1}^A R^2_{X,a}$$

A high $R^2_X$ indicates the components capture most of the systematic variation in $\mathbf{X}$.

9.2 $R^2_Y$ (Variance Explained in $\mathbf{Y}$)

The proportion of total $\mathbf{Y}$ variance explained by $A$ components (the training set $R^2$):

$$R^2_Y(A) = 1 - \frac{\|\mathbf{Y} - \hat{\mathbf{Y}}\|^2}{\|\mathbf{Y} - \bar{\mathbf{Y}}\|^2} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

For PLS1, this reduces to:

$$R^2_Y(A) = 1 - \frac{RSS(A)}{SS_{total}}$$

Where $RSS(A) = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the residual sum of squares.

⚠️ $R^2_Y$ alone is an optimistic (biased) estimate of model quality because it is computed on the training data. Always report $Q^2$ alongside $R^2_Y$.

9.3 Root Mean Squared Error of Calibration (RMSEC)

$$RMSEC = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$$

This is the training set prediction error, in the same units as $\mathbf{y}$.

9.4 Root Mean Squared Error of Cross-Validation (RMSECV)

$$RMSECV = \sqrt{\frac{PRESS}{n}} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{i,-v})^2}$$

RMSECV is the cross-validated prediction error and is the primary model selection criterion alongside $Q^2$.

$$Q^2 = 1 - \frac{PRESS}{SS_{total}} = 1 - \frac{n \cdot RMSECV^2}{SS_{total}}$$

9.5 Root Mean Squared Error of Prediction (RMSEP)

When a separate independent test set of $n_{test}$ observations is available (not used in model fitting or cross-validation):

$$RMSEP = \sqrt{\frac{1}{n_{test}}\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2}$$

RMSEP is the most honest estimate of true prediction error and should always be reported when a genuine external test set exists.

Typical hierarchy of prediction error estimates:

$$RMSEC \leq RMSECV \leq RMSEP$$

The gap between RMSEC and RMSECV/RMSEP indicates the degree of overfitting.

9.6 Bias and Slope of Predicted vs. Observed

A predicted vs. observed plot should ideally show points scattered symmetrically around the 1:1 line (slope = 1, intercept = 0). Formally test for systematic bias using a regression of observed on predicted:

$$y_i = b_0 + b_1 \hat{y}_i + \epsilon_i$$

9.7 Summary of Model Fit Statistics

| Statistic | Definition | Optimal | Purpose |
| --- | --- | --- | --- |
| $R^2_X$ | Variance in $\mathbf{X}$ explained | High (informative but secondary) | Assess how well components represent $\mathbf{X}$ |
| $R^2_Y$ | Variance in $\mathbf{Y}$ explained (training) | High | Training set fit |
| $Q^2$ | Cross-validated $R^2_Y$ | High, close to $R^2_Y$ | Predictive ability, component selection |
| RMSEC | Training set RMSE | Low (but biased) | Training error in original units |
| RMSECV | Cross-validated RMSE | Low | CV prediction error (primary) |
| RMSEP | External test set RMSE | Low | True prediction error |

10. Interpretation of PLS Results

10.1 Regression Coefficients ($\hat{\boldsymbol{\beta}}_{PLS}$)

The PLS regression coefficients $\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^* \mathbf{q}$ have the same interpretation as OLS coefficients: $\hat{\beta}_j$ is the expected change in $\hat{y}$ for a one-unit increase in $X_j$, with all other variables held constant.

However, because PLS implicitly regularises through dimensionality reduction, the coefficients are shrunken towards zero relative to OLS, and their variance is correspondingly reduced (a bias-variance trade-off).

⚠️ When the $\mathbf{X}$ variables are autoscaled, the coefficients are on the standardised scale. To interpret them on the original scales, multiply by $s_y$ and divide by $s_j$ (the SD of $X_j$), or report the raw (unstandardised) coefficients after back-scaling.

10.2 Variable Importance in Projection (VIP)

The VIP score (Wold, 1994) quantifies the contribution of each predictor variable to the PLS model, accounting for all $A$ components:

$$VIP_j = \sqrt{\frac{p \sum_{a=1}^A \left(R^2_Y(a) - R^2_Y(a-1)\right) w_{ja}^{*2}}{\sum_{a=1}^A \left(R^2_Y(a) - R^2_Y(a-1)\right) \sum_{j=1}^p w_{ja}^{*2}}}$$

Where:

- $w_{ja}^*$ is the weight of variable $j$ for component $a$.
- $R^2_Y(a) - R^2_Y(a-1)$ is the additional $\mathbf{Y}$-variance explained by component $a$.
- $p$ is the number of predictor variables.

Properties of VIP: the squared VIP scores average to 1 across all variables, so values above 1 mark above-average contributions.

Interpretation:

| VIP Score | Interpretation |
| --- | --- |
| $VIP_j > 1.0$ | Variable $j$ is important for the model (above-average contribution) |
| $0.8 \leq VIP_j \leq 1.0$ | Variable $j$ has moderate importance |
| $VIP_j < 0.8$ | Variable $j$ has low importance; may be a candidate for removal |

💡 VIP is the most widely used variable selection criterion in PLS. Variables with VIP < 0.8 (or a domain-specific threshold) are candidates for removal, which can improve model parsimony and sometimes prediction performance.
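A sketch of the VIP computation following the formula above. Note that the classical definition uses the raw weights $\mathbf{W}$, whereas this document's formula uses $\mathbf{W}^*$; the sketch follows the latter and should be treated as illustrative. It also checks the property that the squared VIP scores average to 1.

```python
import numpy as np

def pls1_vip(X, y, A):
    """VIP scores per the formula above (illustrative sketch using W*)."""
    Xa = X - X.mean(axis=0)
    ya = y - y.mean()
    ss_y = np.sum(ya ** 2)
    W, P, expl = [], [], []
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)
        t = Xa @ w
        p_a = Xa.T @ t / (t @ t)
        q = ya @ t / (t @ t)
        expl.append((q * t) @ (q * t) / ss_y)   # ΔR²_Y from this component
        Xa = Xa - np.outer(t, p_a)
        ya = ya - q * t
        W.append(w); P.append(p_a)
    W, P, expl = np.array(W).T, np.array(P).T, np.array(expl)
    Ws = W @ np.linalg.inv(P.T @ W)             # W*
    p_vars = X.shape[1]
    num = p_vars * (Ws ** 2) @ expl             # p · Σ_a ΔR²_a · w*²_ja
    den = expl @ (Ws ** 2).sum(axis=0)          # Σ_a ΔR²_a · Σ_j w*²_ja
    return np.sqrt(num / den)

# Synthetic data: only the first predictor carries signal
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 6))
y = 3.0 * X[:, 0] + 0.2 * rng.normal(size=50)

vip = pls1_vip(X, y, A=2)
```

On these data the informative variable receives a VIP well above 1 while the pure-noise variables fall below it.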

10.3 Loadings Plot

The loadings plot displays the X-loadings $\mathbf{p}_a$ (or weights $\mathbf{w}_a^*$) for two components simultaneously, revealing which variables contribute strongly to each component and which variables group together (i.e., are correlated).

For spectroscopic data, a plot of the loadings as a function of wavelength (a "loading spectrum") is particularly informative, as it reveals which spectral regions are important.

10.4 Score Plot

The score plot displays the X-scores ($t_1$ vs. $t_2$, etc.) for all observations, revealing clusters, trends, and potential outliers among the samples.

When colour-coded by $\mathbf{y}$ values, the score plot shows whether the latent structure in $\mathbf{X}$ aligns with $\mathbf{Y}$, the hallmark of a good PLS model.

10.5 Weight-Loading Biplot ($w^*$-$q$ Plot)

The correlation loadings plot or $w^*$-$q$ biplot plots the modified X-weights $\mathbf{w}^*$ and Y-loadings $\mathbf{q}$ together in the same space. Variables (X and Y) that appear close together in this plot are positively correlated; variables on opposite sides are negatively correlated. This provides a comprehensive view of the $\mathbf{X}$-$\mathbf{Y}$ relationship structure.

10.6 Leverage and Residuals (Influence Analysis)

Leverage measures how influential observation $i$ is in determining the model:

$$h_i = \frac{1}{n} + \mathbf{t}_{i\cdot}(\mathbf{T}^T\mathbf{T})^{-1}\mathbf{t}_{i\cdot}^T = \frac{1}{n} + \sum_{a=1}^A \frac{t_{ia}^2}{\sum_{j=1}^n t_{ja}^2}$$

Where $\mathbf{t}_{i\cdot}$ is the $i$-th row of $\mathbf{T}$.

Standardised Y-residuals:

$$e_i^* = \frac{y_i - \hat{y}_i}{\hat{\sigma}\sqrt{1 - h_i}}$$

A leverage-residual plot (also called a Williams plot) displays $h_i$ vs. $e_i^*$: observations combining high leverage with large standardised residuals are influential outliers and warrant investigation.

10.7 Hotelling's $T^2$ and SPE (DModX)

Two complementary multivariate control statistics are used to identify unusual samples:

Hotelling's $T^2$: Measures the distance of a sample from the centre of the model in the latent space:

$$T^2_i = \mathbf{t}_i^T (\mathbf{T}^T \mathbf{T} / n)^{-1} \mathbf{t}_i = n \sum_{a=1}^A \frac{t_{ia}^2}{\mathbf{t}_a^T \mathbf{t}_a}$$

An approximate $F$-distribution critical value:

$$T^2_{crit} = \frac{A(n^2-1)}{n(n-A)} F_{\alpha,A,n-A}$$

Observations with $T^2_i > T^2_{crit}$ are outliers within the model space (an unusual combination of latent components).

SPE (Squared Prediction Error) / DModX: Measures the distance of a sample from the PLS model plane (residual variability not captured by the model):

$$SPE_i = \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2 = \sum_{j=1}^p (x_{ij} - \hat{x}_{ij})^2$$

A high SPE indicates the observation does not conform to the $\mathbf{X}$-covariance structure of the calibration set.

💡 Use both $T^2$ and SPE together. A sample can be unusual inside the model ($T^2$ high), outside the model (SPE high), or both.
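The three influence statistics (leverage, Hotelling's $T^2$, SPE) all derive from the score matrix $\mathbf{T}$ and the model reconstruction $\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^T$. In this sketch the scores and loadings come from a rank-2 SVD of centred synthetic data, as a stand-in for a fitted 2-component PLS model:

```python
import numpy as np

# Stand-in for a fitted 2-component model: scores/loadings from a rank-2 SVD
# of centred synthetic data (in practice T and P come from the PLS fit)
rng = np.random.default_rng(9)
X = rng.normal(size=(40, 5)); X -= X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :2] * s[:2]                   # scores (n × A)
P = Vt[:2, :].T                        # loadings (p × A)
n, A = T.shape

# Leverage: h_i = 1/n + Σ_a t_ia² / Σ_j t_ja²
h = 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

# Hotelling's T²: distance from the model centre, in the latent space
t2 = n * np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
t2_check = np.diag(T @ np.linalg.inv(T.T @ T / n) @ T.T)  # matrix form

# SPE: squared residual distance from the model plane
spe = np.sum((X - T @ P.T) ** 2, axis=1)
```

Note that $T^2_i = n\,(h_i - 1/n)$, so the two statistics rank samples identically; SPE adds the complementary "outside the model" perspective.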


11. Validation Methods

Rigorous validation is essential to ensure a PLS model genuinely predicts new observations, rather than memorising noise in the training data.

11.1 Internal Validation: Cross-Validation

$k$-fold cross-validation and LOOCV (described in Section 8.1) provide internal validation estimates ($Q^2$, RMSECV). Internal validation is mandatory but can still be optimistic if the same data were used to select preprocessing, model type, and components.

Monte Carlo Cross-Validation (MCCV): Randomly splits the data into training and validation sets $B$ times (e.g., $B = 100$), each time using a different random split (e.g., 80% train / 20% validate). The distribution of RMSECV values across splits provides uncertainty estimates.

11.2 External Validation: Independent Test Set

The most rigorous validation uses a truly independent test set — samples not used in any stage of model building (not in training, not in cross-validation, not in component selection):

$$RMSEP = \sqrt{\frac{1}{n_{test}}\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2}$$

How to split data for external validation:

11.3 Permutation Test for Model Validity

(Described in Section 8.6.) A permutation test confirms that the model's $Q^2$ is significantly above what would be obtained by chance.

A permutation plot shows the distribution of $Q^2$ values from permuted models overlaid with the observed $Q^2$. If the observed $Q^2$ falls well above the permutation distribution, the model is valid.

11.4 Y-Randomisation Test

Related to the permutation test, Y-randomisation (also called response permutation) repeatedly randomises $\mathbf{y}$, fits PLS with the same $A$ components, and records $R^2_Y$ and $Q^2$.

11.5 The Ratio $R^2_Y / Q^2$ as a Validity Check

A simple heuristic from chemometrics practice:


12. Comparison with Related Methods

12.1 PLS vs. Principal Component Regression (PCR)

Both PLS and PCR reduce dimensionality before regression. The key difference:

| Aspect | PCR | PLS |
|---|---|---|
| Component extraction criterion | Maximise variance in $\mathbf{X}$ only | Maximise covariance between $\mathbf{X}$ scores and $\mathbf{Y}$ |
| $\mathbf{Y}$ used in component extraction | ❌ No | ✅ Yes |
| Relevant components | May not be the first few (components uncorrelated with $\mathbf{Y}$ may explain much $\mathbf{X}$ variance) | First few components are typically most predictive of $\mathbf{Y}$ |
| Number of components needed | Typically more | Typically fewer |
| Predictive performance | Comparable or lower | Generally better |

💡 PLS is generally preferred over PCR for prediction tasks because it ensures the extracted components are relevant for predicting $\mathbf{Y}$. PCR may extract components that explain much $\mathbf{X}$ variance but are useless for prediction.

12.2 PLS vs. Ridge Regression

| Aspect | Ridge Regression | PLS |
|---|---|---|
| Handles multicollinearity | ✅ | ✅ |
| Shrinkage mechanism | Continuous $L_2$ penalty on $\hat{\boldsymbol{\beta}}$ | Discrete (choice of $A$) |
| Variable selection | ❌ (all variables retained) | Indirect (via VIP) |
| Interpretable components | ❌ | ✅ |
| Handles $p > n$ | ✅ | ✅ |
| Cross-validation parameter | $\lambda$ (regularisation) | $A$ (components) |

12.3 PLS vs. Lasso

| Aspect | Lasso | PLS |
|---|---|---|
| Variable selection | ✅ (explicit, drives coefficients to zero) | Indirect (via VIP) |
| Correlated predictors | ❌ (arbitrary selection among correlated variables) | ✅ (handled gracefully; correlated variables share weight) |
| Latent structure | ❌ | ✅ |
| Interpretable components | ❌ | ✅ |
| $p \gg n$ | ✅ (but arbitrarily selects one from each correlated group) | ✅ |

12.4 PLS vs. OLS Multiple Regression

| Aspect | OLS | PLS |
|---|---|---|
| Multicollinearity | ❌ (unstable, inflated SE) | ✅ (stable) |
| $p > n$ | ❌ (undefined) | ✅ |
| Coefficient interpretation | Straightforward (ceteris paribus) | Requires care (components summarise $\mathbf{X}$) |
| Statistical inference (p-values) | ✅ (exact under assumptions) | ❌ (non-trivial; use jack-knife or bootstrap) |
| Prediction accuracy | ✅ (when $n \gg p$, no collinearity) | ✅ (especially when $p > n$ or collinearity present) |

12.5 When to Choose Which Method

| Condition | Recommended Method |
|---|---|
| $n \gg p$, low multicollinearity, need inference | OLS |
| High multicollinearity, $p < n$, need inference | Ridge Regression |
| Explicit variable selection needed, $p < n$ | Lasso or Elastic Net |
| $p \gg n$, high multicollinearity, interpretability needed | PLS |
| Multiple correlated responses ($\mathbf{Y}$ matrix) | PLS2 |
| Classification with high-dimensional $\mathbf{X}$ | PLS-DA |
| Non-linear relationships | Kernel PLS or non-linear methods (RF, SVM) |
| Want PCR but with $\mathbf{Y}$-guided component selection | PLS |

13. Using the PLS Regression Component

The PLS Regression component in the DataStatPro application provides a full end-to-end workflow for fitting, validating, and interpreting PLS models.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should have at least one response variable and two or more predictor variables.

Step 2 — Select Analysis Type Choose the PLS analysis type:

Step 3 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. These should be continuous or ordinal numeric variables.

💡 You can select all available numeric predictors and rely on VIP scores for post-hoc variable selection.

Step 4 — Select Response Variable(s) (Y)

Step 5 — Configure Preprocessing Select the preprocessing method for $\mathbf{X}$ (and optionally $\mathbf{Y}$):

Step 6 — Configure Number of Components Choose how to determine the number of components $A$:

Step 7 — Configure Cross-Validation Select the cross-validation scheme:

Step 8 — Select Display Options Choose which outputs to display:

Step 9 — Run the Analysis Click "Run PLS Regression". The application will:

  1. Apply the selected preprocessing to $\mathbf{X}$ and $\mathbf{Y}$.
  2. Run cross-validation to determine the optimal number of components (if automatic).
  3. Fit the final PLS model with the selected $A$ components using NIPALS or SIMPLS.
  4. Compute scores ($\mathbf{T}$, $\mathbf{U}$), loadings ($\mathbf{P}$, $\mathbf{Q}$), weights ($\mathbf{W}$, $\mathbf{W}^*$).
  5. Compute VIP scores and regression coefficients.
  6. Compute model fit statistics ($R^2_X$, $R^2_Y$, $Q^2$, RMSEC, RMSECV).
  7. Compute leverage, residuals, $T^2$, and SPE for all observations.
  8. Generate all selected visualisations and tables.
  9. Run permutation tests (if selected).

14. Computational and Formula Details

14.1 The NIPALS Algorithm for PLS1

The Non-linear Iterative Partial Least Squares (NIPALS) algorithm is the classical iterative procedure for PLS decomposition. For PLS1 (single response):

Inputs: Mean-centred (and scaled) $\mathbf{X}_1 = \tilde{\mathbf{X}}$ ($n \times p$) and $\mathbf{y}_1 = \tilde{\mathbf{y}}$ ($n \times 1$).

For each component $a = 1, 2, \dots, A$:

  1. Initialise: $\mathbf{u}_a = \mathbf{y}_a$ (or any non-zero starting vector).

  2. Compute X-weight vector: $\mathbf{w}_a = \frac{\mathbf{X}_a^T \mathbf{u}_a}{\|\mathbf{X}_a^T \mathbf{u}_a\|}$ (normalise to unit length).

  3. Compute X-score vector: $\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$

  4. Compute Y-weight (scalar for PLS1): $c_a = \frac{\mathbf{y}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$

  5. Compute Y-score vector: $\mathbf{u}_a = \mathbf{y}_a \, c_a / c_a^2 = \mathbf{y}_a / c_a$

For PLS1, since $q = 1$, $\mathbf{u}_a$ is simply proportional to $\mathbf{y}_a$, so no iteration is needed; convergence is immediate.

  6. Check convergence (PLS2 only; for PLS1, skip to step 7).

  7. Compute X-loadings: $\mathbf{p}_a = \frac{\mathbf{X}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$

  8. Compute inner relation coefficient: $b_a = \frac{\mathbf{u}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$

  9. Deflate $\mathbf{X}$: $\mathbf{X}_{a+1} = \mathbf{X}_a - \mathbf{t}_a \mathbf{p}_a^T$

  10. Deflate $\mathbf{y}$: $\mathbf{y}_{a+1} = \mathbf{y}_a - b_a \mathbf{t}_a$

After all $A$ components, compute $\mathbf{W}^*$:

$$\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1}$$

Where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_A]$ and $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_A]$.

Compute PLS regression coefficients:

$$\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^* \mathbf{b}$$

Where $\mathbf{b} = [b_1, b_2, \dots, b_A]^T$ (inner relation coefficients, equal to the Y-loadings $\mathbf{q}$ in PLS1).
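The PLS1 loop above can be sketched in NumPy as follows (an illustrative implementation, not the DataStatPro source; it assumes `X` and `y` are already mean-centred):

```python
import numpy as np

def pls1_nipals(X, y, A):
    """PLS1 via NIPALS. X (n x p) and y (n,) must be mean-centred."""
    Xa, ya = X.copy(), y.copy()
    p = X.shape[1]
    W, P, b = np.zeros((p, A)), np.zeros((p, A)), np.zeros(A)
    for a in range(A):
        w = Xa.T @ ya                     # steps 1-2: weight from u = y
        w /= np.linalg.norm(w)            # normalise to unit length
        t = Xa @ w                        # step 3: X-score
        tt = t @ t
        c = (ya @ t) / tt                 # step 4: Y-weight (scalar)
        pa = Xa.T @ t / tt                # step 7: X-loading
        Xa = Xa - np.outer(t, pa)         # step 9: deflate X
        ya = ya - c * t                   # step 10: deflate y (b_a = c_a)
        W[:, a], P[:, a], b[a] = w, pa, c
    W_star = W @ np.linalg.inv(P.T @ W)   # modified weights
    return W, P, b, W_star @ b            # last value: beta = W* b

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 6))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
W, P, b, beta = pls1_nipals(X, y, A=2)
```

A useful sanity check is that the NIPALS weight vectors come out mutually orthonormal, a classical property of the algorithm.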

14.2 The NIPALS Algorithm for PLS2

For PLS2 ($q > 1$), an inner iteration is needed at each component to converge to the dominant covariance direction:

  1. Initialise: $\mathbf{u}_a =$ first column of $\mathbf{Y}_a$ (or any non-zero column).

  2. Outer iteration (repeat until convergence of $\mathbf{t}_a$):

    a. Compute X-weight: $\mathbf{w}_a = \mathbf{X}_a^T \mathbf{u}_a / \|\mathbf{X}_a^T \mathbf{u}_a\|$

    b. Compute X-score: $\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$

    c. Compute Y-weight: $\mathbf{c}_a = \mathbf{Y}_a^T \mathbf{t}_a / \|\mathbf{Y}_a^T \mathbf{t}_a\|$

    d. Compute Y-score: $\mathbf{u}_a = \mathbf{Y}_a \mathbf{c}_a$

    e. Check: if $\|\mathbf{t}_a^{(new)} - \mathbf{t}_a^{(old)}\| / \|\mathbf{t}_a^{(old)}\| < \epsilon$ (e.g., $\epsilon = 10^{-10}$), converged.

  3. Compute loadings: $\mathbf{p}_a = \mathbf{X}_a^T \mathbf{t}_a / (\mathbf{t}_a^T \mathbf{t}_a)$

  4. Compute inner coefficient: $b_a = \mathbf{u}_a^T \mathbf{t}_a / (\mathbf{t}_a^T \mathbf{t}_a)$

  5. Deflate both matrices: $\mathbf{X}_{a+1} = \mathbf{X}_a - \mathbf{t}_a \mathbf{p}_a^T$ and $\mathbf{Y}_{a+1} = \mathbf{Y}_a - b_a \mathbf{t}_a \mathbf{c}_a^T$

14.3 The SIMPLS Algorithm

SIMPLS (de Jong, 1993) is an alternative, non-deflation-based algorithm that directly computes the PLS weight vectors $\mathbf{w}_a^*$ without deflating $\mathbf{X}$. It is computationally more efficient and numerically more stable for large datasets.

SIMPLS directly finds $\mathbf{w}_a^*$ as the leading eigenvector of $\mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X}$ deflated by projections onto previously found weight vectors:

$$\mathbf{w}_a^* = \text{leading eigenvector of } \left[\mathbf{I} - \mathbf{W}_{a-1}^*(\mathbf{W}_{a-1}^{*T}\mathbf{W}_{a-1}^*)^{-1}\mathbf{W}_{a-1}^{*T}\right] \mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X}$$

NIPALS vs. SIMPLS:

| Feature | NIPALS | SIMPLS |
|---|---|---|
| Handles missing data | ✅ (iterative imputation) | ❌ |
| Computational efficiency | $O(npA)$ per iteration | $O(np^2)$ once |
| Numerical stability | Good | Excellent |
| Equivalence of results | (reference) | ✅ (identical for PLS1; slightly different for PLS2) |

14.4 VIP Score Computation

$$VIP_j = \sqrt{\frac{p \sum_{a=1}^A SSY_a \, w_{ja}^{*2}}{\sum_{a=1}^A SSY_a}}$$

Where $SSY_a = \left[R^2_Y(a) - R^2_Y(a-1)\right] \cdot SS_{total}$ is the $\mathbf{Y}$ sum of squares explained by component $a$, so the denominator normalisation is $\sum_{a=1}^A SSY_a = R^2_Y(A) \cdot SS_{total}$. (Because $SSY_a$ appears in both numerator and denominator, using the $R^2_Y$ increments directly gives the same result.)
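The formula translates into a one-line NumPy computation (the hypothetical `vip_scores` helper assumes the columns of the $\mathbf{W}^*$ matrix passed in are unit-normalised, in which case $\sum_j VIP_j^2 = p$):

```python
import numpy as np

def vip_scores(W_star, ssy):
    """VIP_j = sqrt(p * sum_a SSY_a * w*_ja^2 / sum_a SSY_a)."""
    p = W_star.shape[0]
    ssy = np.asarray(ssy, dtype=float)      # explained-Y increments per component
    return np.sqrt(p * (W_star ** 2 @ ssy) / ssy.sum())

rng = np.random.default_rng(5)
W = rng.standard_normal((10, 3))
W /= np.linalg.norm(W, axis=0)              # unit-normalise each column
vip = vip_scores(W, ssy=[0.5, 0.3, 0.1])    # illustrative SSY increments
```

With unit-norm columns, the mean squared VIP is exactly 1, which is why 1.0 is the natural reference threshold.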

14.5 Confidence Intervals for PLS Coefficients (Jack-Knife)

Standard errors for PLS regression coefficients can be estimated using the jack-knife procedure:

  1. For each observation $i$, fit a PLS model with $A$ components leaving out observation $i$: $\hat{\boldsymbol{\beta}}_{PLS}^{(-i)}$.

  2. The jack-knife estimate of the coefficient vector: $\hat{\boldsymbol{\beta}}_{JK} = \frac{1}{n}\sum_{i=1}^n \hat{\boldsymbol{\beta}}_{PLS}^{(-i)}$

  3. Jack-knife standard error of coefficient $j$: $SE_{JK}(\hat{\beta}_j) = \sqrt{\frac{n-1}{n}\sum_{i=1}^n \left(\hat{\beta}_j^{(-i)} - \hat{\beta}_{JK,j}\right)^2}$

  4. Approximate $(1-\alpha) \times 100\%$ confidence interval: $\hat{\beta}_j \pm t_{\alpha/2, n-1} \cdot SE_{JK}(\hat{\beta}_j)$

A jack-knife $t$-statistic for testing $H_0: \beta_j = 0$:

$$t_j = \frac{\hat{\beta}_j}{SE_{JK}(\hat{\beta}_j)}$$

⚠️ Jack-knife standard errors for PLS coefficients are approximate. Bootstrap-based confidence intervals are more accurate but computationally more demanding.

14.6 Back-Scaling of Coefficients

When $\mathbf{X}$ and $\mathbf{y}$ are autoscaled before fitting, the PLS coefficients $\hat{\boldsymbol{\beta}}_{PLS}^{scaled}$ are on the standardised scale. Back-scaling to original units:

$$\hat{\beta}_j^{original} = \hat{\beta}_j^{scaled} \cdot \frac{s_y}{s_j}$$

Where $s_y$ is the standard deviation of $\mathbf{y}$ and $s_j$ is the standard deviation of $X_j$.

The intercept is:

$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^p \hat{\beta}_j^{original} \bar{x}_j$$
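The back-scaling arithmetic can be sketched as follows (illustrative: ordinary least squares on the autoscaled data stands in for the scaled PLS coefficients, since the back-scaling step itself is identical):

```python
import numpy as np

rng = np.random.default_rng(7)
# Predictors with very different units and locations
X = rng.standard_normal((50, 3)) * np.array([2.0, 5.0, 0.5]) + np.array([10.0, 100.0, 1.0])
y = 3.0 * X[:, 0] - 0.2 * X[:, 1] + rng.standard_normal(50)

# Autoscale X and y
x_mean, s_x = X.mean(axis=0), X.std(axis=0, ddof=1)
y_mean, s_y = y.mean(), y.std(ddof=1)
Xs = (X - x_mean) / s_x
ys = (y - y_mean) / s_y

# Stand-in for the scaled coefficients (OLS here, PLS in practice)
beta_scaled, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Back-scale slopes and recover the intercept in original units
beta_orig = beta_scaled * s_y / s_x
intercept = y_mean - beta_orig @ x_mean
pred = intercept + X @ beta_orig
```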

14.7 $Q^2$ Computation in Detail

For LOOCV with $n$ folds:

$$PRESS = \sum_{i=1}^n (y_i - \hat{y}_{i,-i})^2$$

Where $\hat{y}_{i,-i}$ is the prediction for observation $i$ from a model trained on all observations except $i$.

$$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

For $k$-fold CV with $k < n$, PRESS is computed as the sum over all $n$ held-out predictions (one per observation), where each observation is held out exactly once.


15. Worked Examples

Example 1: PLS1 Regression — Predicting Protein Content from NIR Spectra

Research Question: Can near-infrared (NIR) spectral measurements (100 wavelengths, $p = 100$) predict the protein content (%, $y$) of wheat flour samples?

Dataset: $n = 50$ wheat samples; $p = 100$ spectral absorbance variables; $y$ = protein content (%).

Step 1: Preprocessing

Apply autoscaling to $\mathbf{X}$ (spectral data; different variances at different wavelengths). Mean-centre and scale $\mathbf{y}$ ($\bar{y} = 11.84\%$, $s_y = 1.214\%$).

Step 2: Cross-Validation to Select $A$

Run PLS1 with $A = 1$ to $10$ components using 10-fold cross-validation:

| Components ($A$) | $R^2_Y$ | $Q^2$ | RMSEC | RMSECV |
|---|---|---|---|---|
| 1 | 0.612 | 0.583 | 0.731 | 0.764 |
| 2 | 0.843 | 0.812 | 0.468 | 0.511 |
| 3 | 0.923 | 0.895 | 0.329 | 0.382 |
| 4 | 0.961 | 0.934 | 0.233 | 0.302 |
| 5 | 0.972 | 0.931 | 0.197 | 0.311 |
| 6 | 0.979 | 0.924 | 0.169 | 0.328 |
| 7 | 0.983 | 0.910 | 0.152 | 0.352 |
| 8 | 0.986 | 0.896 | 0.137 | 0.380 |

$Q^2$ peaks at $A = 4$ (0.934) and decreases thereafter (despite $R^2_Y$ continuing to rise → overfitting). Select $A = 4$ components.

Step 3: Fit Final PLS Model with $A = 4$

Model summary:

| Component | $R^2_X$ (cumulative) | $R^2_Y$ (cumulative) | $Q^2$ (cumulative) |
|---|---|---|---|
| 1 | 0.423 | 0.612 | 0.583 |
| 2 | 0.617 | 0.843 | 0.812 |
| 3 | 0.748 | 0.923 | 0.895 |
| 4 | 0.814 | 0.961 | 0.934 |

Final statistics:

$$R^2_Y = 0.961, \quad Q^2 = 0.934, \quad RMSEC = 0.233\%, \quad RMSECV = 0.302\%$$

Step 4: VIP Scores

The top 5 most important variables by VIP:

| Wavelength (nm) | VIP Score | Interpretation |
|---|---|---|
| 2180 | 2.14 | Highly important (protein N-H stretch) |
| 2100 | 1.98 | Highly important |
| 1680 | 1.87 | Highly important |
| 2240 | 1.76 | Important |
| 1940 | 1.62 | Important |

Variables with VIP < 0.8: 42 out of 100 wavelengths are candidates for removal.

Step 5: Prediction for New Sample

New wheat sample: spectral vector $\mathbf{x}_{new}$ (autoscaled).

$$\hat{t}_{a} = \tilde{\mathbf{x}}_{new}^T \mathbf{w}_a^*, \quad a = 1, \dots, 4$$

$$\hat{\tilde{y}}_{new} = \sum_{a=1}^4 b_a \hat{t}_a = 0.247 \times 0.812 + (-0.183) \times (-0.441) + \dots = 0.384$$

Back-scale: $\hat{y}_{new} = \bar{y} + \hat{\tilde{y}}_{new} \times s_y = 11.84 + 0.384 \times 1.214 = 11.84 + 0.466 = 12.31\%$

95% prediction interval (jack-knife SE = 0.28%):

$$\hat{y}_{new} \pm 1.96 \times 0.28 = 12.31 \pm 0.55 = [11.76\%, 12.86\%]$$

Conclusion: The 4-component PLS model achieves excellent predictive performance ($Q^2 = 0.934$, RMSECV = 0.302%). The model is not overfitting (ratio $Q^2/R^2_Y = 0.934/0.961 = 0.972 > 0.5$). NIR wavelengths around 2180 nm and 2100 nm (protein N-H stretching bands) are the most important predictors.


Example 2: PLS1 Regression — Predicting Blood Pressure from Clinical Variables

Research Question: Can clinical variables (age, BMI, cholesterol, glucose, smoking status, exercise frequency) predict systolic blood pressure (SBP)?

Dataset: $n = 120$ patients; $p = 6$ predictors; $y$ = SBP (mmHg).

Predictors: Age (years), BMI (kg/m²), Total Cholesterol (mmol/L), Fasting Glucose (mmol/L), Smoking (0/1), Exercise (days/week).

Step 1: Preprocessing

Apply autoscaling to all 6 predictors (different units) and mean-centre $\mathbf{y}$ ($\bar{y} = 128.4$ mmHg).

Step 2: CV to Select $A$

| Components | $R^2_Y$ | $Q^2$ | RMSECV |
|---|---|---|---|
| 1 | 0.542 | 0.519 | 8.12 |
| 2 | 0.631 | 0.601 | 7.41 |
| 3 | 0.649 | 0.589 | 7.58 |
| 4 | 0.655 | 0.571 | 7.82 |

Select $A = 2$ (maximum $Q^2 = 0.601$, RMSECV = 7.41 mmHg).

Step 3: Regression Coefficients (back-scaled to original units)

| Predictor | $\hat{\beta}_j$ (mmHg / unit) | SE (jack-knife) | $t$-statistic | Significant? |
|---|---|---|---|---|
| Age (per year) | 0.482 | 0.091 | 5.30 | $p < 0.001$ |
| BMI (per kg/m²) | 1.241 | 0.213 | 5.83 | $p < 0.001$ |
| Cholesterol (per mmol/L) | 0.837 | 0.281 | 2.98 | $p = 0.003$ |
| Glucose (per mmol/L) | 0.614 | 0.244 | 2.51 | $p = 0.013$ |
| Smoking | 3.921 | 1.482 | 2.65 | $p = 0.009$ |
| Exercise (per day/week) | -1.183 | 0.392 | -3.02 | $p = 0.003$ |

Step 4: VIP Scores

| Predictor | VIP Score | Importance |
|---|---|---|
| BMI | 1.48 | High |
| Age | 1.32 | High |
| Smoking | 1.21 | High |
| Exercise | 1.13 | High |
| Cholesterol | 0.92 | Moderate |
| Glucose | 0.74 | Low (VIP < 0.8) |

Step 5: Prediction

For a new patient: Age = 55, BMI = 28.4, Cholesterol = 5.2, Glucose = 5.8, Smoking = 1, Exercise = 2:

$$\hat{y}_{new} = 128.4 + 0.482(55-\bar{x}_{Age}) + 1.241(28.4 - \bar{x}_{BMI}) + \dots = 143.7 \text{ mmHg}$$

Conclusion: The 2-component PLS model explains 63.1% of SBP variance ($R^2_Y = 0.631$) with reasonable cross-validated performance ($Q^2 = 0.601$, RMSECV = 7.4 mmHg). BMI, age, and smoking are the strongest predictors. Glucose has a VIP < 0.8, suggesting limited predictive contribution in this dataset.


Example 3: PLS2 Regression — Predicting Multiple Sensory Attributes from Chemical Composition

Research Question: Can the chemical composition of wine (8 chemical variables) jointly predict 3 sensory attributes (acidity rating, bitterness rating, overall quality score)?

Dataset: $n = 80$ wines; $p = 8$ chemical predictors (pH, alcohol %, residual sugar, sulphates, fixed acidity, volatile acidity, citric acid, density); $q = 3$ responses.

Step 1: Preprocessing

Autoscale all $\mathbf{X}$ variables. Mean-centre and scale all $\mathbf{Y}$ variables (different scales).

Step 2: CV to Select $A$

| Components | $R^2_X$ | $R^2_Y$ (avg) | $Q^2$ (avg) |
|---|---|---|---|
| 1 | 0.381 | 0.443 | 0.412 |
| 2 | 0.544 | 0.651 | 0.597 |
| 3 | 0.634 | 0.712 | 0.584 |

Select $A = 2$ ($Q^2$ peaks at 0.597).

Step 3: Component Summary

| Component | $R^2_X$ cumul. | $R^2_{Y,Acidity}$ cumul. | $R^2_{Y,Bitterness}$ cumul. | $R^2_{Y,Quality}$ cumul. |
|---|---|---|---|---|
| 1 | 0.381 | 0.521 | 0.409 | 0.398 |
| 2 | 0.544 | 0.698 | 0.612 | 0.643 |

Step 4: Y-Loadings ($\mathbf{Q}$)

| Response | $q_1$ | $q_2$ |
|---|---|---|
| Acidity | 0.611 | -0.392 |
| Bitterness | 0.524 | 0.481 |
| Quality | 0.593 | 0.144 |

Component 1 positively loads on all three responses (general quality/intensity factor). Component 2 contrasts bitterness (positive) against acidity (negative) — a sensory contrast axis.

Step 5: Top VIP Scores (averaged across responses)

| Chemical Variable | VIP Score |
|---|---|
| Volatile Acidity | 1.63 |
| Alcohol % | 1.41 |
| pH | 1.28 |
| Sulphates | 1.17 |
| Residual Sugar | 0.94 |
| Fixed Acidity | 0.86 |
| Citric Acid | 0.72 |
| Density | 0.68 |

Citric acid and density have VIP < 0.8 — candidates for removal in a reduced model.

Conclusion: The 2-component PLS2 model jointly predicts all three sensory attributes with moderate-to-good accuracy ($Q^2_{avg} = 0.597$). Volatile acidity and alcohol content are the most influential predictors. The two PLS components reveal a general quality factor and a bitterness-versus-acidity contrast factor in the sensory space.


16. Common Mistakes and How to Avoid Them

Mistake 1: Skipping Preprocessing

Problem: Applying PLS to unscaled data where variables differ widely in units and magnitude. Variables with larger numerical ranges (e.g., income in thousands vs. age in decades) dominate the components, producing misleading results.
Solution: Always mean-centre the data. Apply autoscaling (UV scaling) when variables are in different units. Carefully consider the appropriate scaling for your specific domain and data type.

Mistake 2: Selecting Too Many Components

Problem: Using the number of components that maximises $R^2_Y$ (training set fit) rather than $Q^2$ (cross-validated fit), resulting in an overfitted model that performs well on training data but poorly on new observations.
Solution: Always use cross-validation ($Q^2$, RMSECV) to select $A$. Look for the point where $Q^2$ peaks or where $R^2_Y - Q^2$ begins to widen. Apply the one-standard-error rule for extra parsimony.

Mistake 3: Ignoring Model Validation

Problem: Reporting only training set statistics ($R^2_Y$, RMSEC) without cross-validation or external validation, giving an overly optimistic picture of model performance.
Solution: Always report $Q^2$ and RMSECV. Whenever possible, reserve an independent external test set and report RMSEP. Run permutation tests to confirm the model is not a statistical artefact.

Mistake 4: Confusing W-Weights with P-Loadings

Problem: Using the X-loadings $\mathbf{P}$ (or raw weights $\mathbf{W}$) to interpret the relationship between $\mathbf{X}$ variables and the model, rather than the modified weights $\mathbf{W}^*$.
Solution: For variable importance interpretation, use $\mathbf{W}^*$ (the modified weights) or VIP scores. Loadings $\mathbf{P}$ describe how $\mathbf{X}$ is reconstructed from the scores; modified weights $\mathbf{W}^*$ describe how $\mathbf{X}$ variables linearly combine to form the scores from the original $\mathbf{X}$.

Mistake 5: Using VIP Threshold Rigidly

Problem: Mechanically removing all variables with VIP < 0.8 and accepting all variables with VIP > 1.0, without considering domain knowledge, model stability, or the effect of variable removal on $Q^2$.
Solution: Treat VIP as a guide, not a hard rule. After removing low-VIP variables, refit the model and check whether $Q^2$ improves or remains stable. Incorporate domain knowledge about which variables are mechanistically meaningful.

Mistake 6: Applying PLS to a Completely Heterogeneous Dataset

Problem: Fitting a single global PLS model to data comprising fundamentally different subgroups (e.g., different product types, different analytical conditions), producing a model that fits no subgroup well.
Solution: Inspect score plots for clustering. If distinct subgroups are visible, consider fitting separate PLS models per subgroup, or use class-based modelling approaches such as PLS-DA to first classify then model within class.

Mistake 7: Not Detecting Outliers Before Modelling

Problem: Leaving extreme outliers in the dataset, which disproportionately influence the PLS components and distort the model for the remaining, majority observations.
Solution: Check univariate distributions, Mahalanobis distances, and initial PCA scores before PLS. After fitting, use the Williams plot (leverage vs. residuals), $T^2$, and SPE plots to identify influential observations. Investigate outliers; do not simply delete them without justification.

Mistake 8: Extrapolating Beyond the Calibration Range

Problem: Using the PLS model to predict samples that fall outside the range of the calibration set (extrapolation), where the linear relationship may not hold and the model has no basis for reliable prediction.
Solution: Check new samples for consistency with the calibration set using $T^2$ and SPE control charts. If a new sample falls outside the 95% control limits, flag the prediction as unreliable. Expand the calibration set to cover the full expected range of future samples.

Mistake 9: Misinterpreting RY2R^2_Y Without Context

Problem: Reporting a high $R^2_Y$ (e.g., 0.98) as evidence of an excellent model, without noting that this is the training set fit and may reflect overfitting.
Solution: Always pair $R^2_Y$ with $Q^2$. A model with $R^2_Y = 0.98$ and $Q^2 = 0.42$ is severely overfitting. The $Q^2$ value is the meaningful indicator of predictive performance.

Mistake 10: Applying PLS Regression to a Classification Problem Without PLS-DA

Problem: Using PLS1 to predict a binary class label (0/1) without proper classification thresholding or performance assessment with classification metrics (sensitivity, specificity, AUC).
Solution: For categorical outcomes, use PLS-DA with appropriate class encoding. Evaluate using classification metrics (confusion matrix, sensitivity, specificity, AUC-ROC) with cross-validated class assignments, not just RMSECV.


17. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| $Q^2 < 0$ for all components | No predictive relationship; noisy data; too few samples | Check data quality; verify $\mathbf{X}$ and $\mathbf{y}$ are correctly entered; increase $n$; reconsider predictor selection |
| $R^2_Y$ is high but $Q^2$ is very low (large gap) | Severe overfitting due to too many components or $p \gg n$ | Reduce $A$; use stricter CV; consider sparse PLS or variable pre-selection |
| NIPALS fails to converge | Near-zero variance columns; perfect collinearity; numerical issues | Remove zero-variance variables before fitting; check for duplicate columns; increase max iterations |
| Score plot shows extreme outlier separated from main cluster | Outlier with unusual $\mathbf{X}$ or $y$ value; data entry error | Investigate observation; check for data entry errors; assess leverage and SPE |
| All VIP scores are approximately 1 | Only one component extracted ($A=1$); VIP is uniform when $A=1$ | Increase $A$ if justified by CV; interpret $\mathbf{w}_1^*$ coefficients directly for $A=1$ |
| Permutation test shows $Q^2$ of permuted models as high as observed | No real relationship between $\mathbf{X}$ and $\mathbf{y}$; chance correlation | Do not use the model; reconsider variable selection; collect more data; verify correct $\mathbf{y}$ assignment |
| Jack-knife SEs are very large | Too few samples relative to components ($n \approx A$) | Reduce $A$; collect more samples; do not report jack-knife inference for very small $n$ |
| Predicted vs. observed plot shows systematic curvature | Non-linear relationship between $\mathbf{X}$ and $\mathbf{y}$ | Apply polynomial or logarithmic transformation to $\mathbf{y}$; use kernel PLS; consider non-linear models |
| RMSECV does not decrease with more components | No additional predictive structure beyond first component | Accept a 1-component model; data may be well-described by a single latent variable |
| New sample has very high SPE | New sample's $\mathbf{X}$ pattern does not match calibration set structure | Flag prediction as unreliable; expand calibration set to include similar samples |
| Negative $Q^2$ for external test set despite positive cross-validated $Q^2$ | Test set is not representative of training set (distribution shift) | Re-examine train/test split; use Kennard-Stone or DUPLEX for representative splitting; collect more diverse calibration samples |
| PLS2 gives worse predictions than separate PLS1 models | Responses are poorly correlated with each other | Run separate PLS1 models for each response; or use OPLS for better separation of effects |

18. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
|---|---|
| $\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}$ | PLS decomposition of $\mathbf{X}$ |
| $\mathbf{Y} = \mathbf{T}\mathbf{Q}^T + \mathbf{F}$ | PLS prediction of $\mathbf{Y}$ |
| $\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$ | X-score for component $a$ |
| $\mathbf{p}_a = \mathbf{X}_a^T \mathbf{t}_a / (\mathbf{t}_a^T \mathbf{t}_a)$ | X-loading for component $a$ |
| $\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1}$ | Modified X-weight matrix |
| $\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^*\mathbf{q}$ | PLS regression coefficients |
| $\hat{\mathbf{Y}} = \mathbf{X}\mathbf{W}^*\mathbf{Q}^T$ | Predicted values |
| $R^2_Y = 1 - RSS/SS_{total}$ | Training set variance explained |
| $Q^2 = 1 - PRESS/SS_{total}$ | Cross-validated variance explained |
| $RMSECV = \sqrt{PRESS/n}$ | Cross-validated RMSE |
| $RMSEP = \sqrt{\sum e_i^2 / n_{test}}$ | External test set RMSE |
| $VIP_j = \sqrt{p\sum_a SSY_a w_{ja}^{*2} / \sum_a SSY_a}$ | Variable importance in projection |
| $T^2_i = n\sum_a t_{ia}^2/(\mathbf{t}_a^T\mathbf{t}_a)$ | Hotelling's $T^2$ |
| $SPE_i = \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2$ | Squared prediction error |

Preprocessing Guide

| Situation | Recommended Scaling |
|---|---|
| Variables in different units | Autoscaling (UV) |
| Variables in same units; variance meaningful | Mean centring only |
| High-variance variables dominate | Pareto scaling |
| Right-skewed, multiplicative data | Log transform first, then mean centre |
| Near-zero variance variables present | Remove before scaling |

Component Selection Guide

| Evidence | Action |
|---|---|
| $Q^2$ increasing with each component | Add another component |
| $Q^2$ has peaked and starts decreasing | Stop adding components |
| $R^2_Y - Q^2$ gap widening | Overfitting; reduce $A$ |
| $Q^2 < 0$ | No predictive signal; check data |
| Permuted $Q^2 \approx$ observed $Q^2$ | Model is a statistical artefact |
| $Q^2/R^2_Y < 0.5$ | Overfitting; reduce $A$ |

VIP Score Interpretation

| VIP Score | Variable Importance |
|---|---|
| $VIP > 1.0$ | High importance |
| $0.8 \leq VIP \leq 1.0$ | Moderate importance |
| $VIP < 0.8$ | Low importance; candidate for removal |

Model Evaluation Hierarchy

| Metric | Type | Bias | Recommended Use |
|---|---|---|---|
| $R^2_Y$ / RMSEC | Training set | Optimistic | Report but do not use alone |
| $Q^2$ / RMSECV | Cross-validated | Slight | Primary model selection criterion |
| RMSEP | External test | Unbiased | Best estimate of true prediction error |

Outlier Detection Summary

| Statistic | What It Detects | Threshold |
|---|---|---|
| Hotelling's $T^2$ | Unusual within the model space | $F_{\alpha, A, n-A}$ critical value |
| SPE (DModX) | Does not fit the model structure | $\chi^2_{\alpha, p-A}$ (approximate) |
| Leverage $h_i$ | Influence on model coefficients | $2A/n$ (rough guideline) |
| Standardised residual $e_i^*$ | Poor fit in $\mathbf{Y}$ | $|e_i^*| > 2$ or $3$ |

PLS Model Type Selection

| Scenario | PLS Variant |
|---|---|
| One continuous response | PLS1 |
| Multiple continuous responses | PLS2 |
| Binary or multi-class outcome | PLS-DA |
| Improve interpretability (single response) | OPLS |
| Non-linear relationships | Kernel PLS |
| Automatic variable selection | Sparse PLS |

Comparison of Regression Methods

| Feature | OLS | Ridge | PCR | PLS |
|---|---|---|---|---|
| $p > n$ | ❌ | ✅ | ✅ | ✅ |
| Handles collinearity | ❌ | ✅ | ✅ | ✅ |
| Uses $\mathbf{Y}$ in reduction | N/A | N/A | ❌ | ✅ |
| Interpretable components | N/A | N/A | ✅ | ✅ |
| Multiple responses | ❌ | ❌ | Partial | ✅ |
| Variable selection | ❌ | ❌ | ❌ | Via VIP |
| Exact inference (p-values) | ✅ | ❌ | ❌ | Approx. (jack-knife) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting PLS Regression using the DataStatPro application. For further reading, consult Wold, Sjöström & Eriksson's "PLS-regression: a basic tool of chemometrics" (Chemometrics and Intelligent Laboratory Systems, 2001), Mevik & Wehrens's "The pls Package: Principal Component and Partial Least Squares Regression in R" (Journal of Statistical Software, 2007), or Höskuldsson's "PLS regression methods" (Journal of Chemometrics, 1988). For feature requests or support, contact the DataStatPro team.