
PLS Regression

Comprehensive reference guide for Partial Least Squares Regression analysis.

PLS Regression: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Partial Least Squares (PLS) Regression all the way through advanced component interpretation, model validation, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is PLS Regression?
  3. The Mathematics Behind PLS Regression
  4. Types of PLS Methods
  5. Assumptions of PLS Regression
  6. Data Preprocessing
  7. PLS Components and Latent Variables
  8. Choosing the Number of Components
  9. Model Fit and Evaluation
  10. Interpretation of PLS Results
  11. Validation Methods
  12. Comparison with Related Methods
  13. Using the PLS Regression Component
  14. Computational and Formula Details
  15. Worked Examples
  16. Common Mistakes and How to Avoid Them
  17. Troubleshooting
  18. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into PLS regression, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Vectors and Matrices

A vector is an ordered list of numbers (a one-dimensional array). A matrix is a two-dimensional array of numbers with $n$ rows and $p$ columns, denoted $\mathbf{X}_{n \times p}$.

Key matrix operations used throughout this tutorial include the transpose $\mathbf{X}^T$, matrix multiplication, and the matrix inverse $\mathbf{X}^{-1}$.

1.2 Projection and Orthogonality

The projection of vector $\mathbf{y}$ onto vector $\mathbf{t}$ is:

$$\text{proj}_{\mathbf{t}} \mathbf{y} = \frac{\mathbf{t}^T \mathbf{y}}{\mathbf{t}^T \mathbf{t}} \mathbf{t}$$

Two vectors are orthogonal if their dot product equals zero: $\mathbf{a}^T \mathbf{b} = 0$. Orthogonality is a central property of PLS components.
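Both definitions are easy to verify numerically. A minimal NumPy sketch on arbitrary illustrative vectors: project $\mathbf{y}$ onto $\mathbf{t}$ and confirm that the residual is orthogonal to $\mathbf{t}$.

```python
import numpy as np

# Arbitrary illustrative vectors
t = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 1.0, 4.0])

# Projection of y onto t: (t'y / t't) * t
proj = (t @ y) / (t @ t) * t

# The part of y not explained by t ...
residual = y - proj

# ... is orthogonal to t: the dot product is (numerically) zero
dot = residual @ t
```

The same decomposition (a score direction plus an orthogonal residual) is exactly what PLS deflation performs at each step.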

1.3 Variance and Covariance

The variance of a variable $X$ measures its spread:

$$\text{Var}(X) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

The covariance between variables $X$ and $Y$ measures how they vary together:

$$\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$

The covariance matrix $\mathbf{S}_{p \times p}$ of a matrix $\mathbf{X}$ contains the pairwise covariances of all $p$ variables. After mean-centring $\mathbf{X}$:

$$\mathbf{S} = \frac{1}{n-1}\mathbf{X}^T \mathbf{X}$$
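The matrix identity above can be checked against NumPy's built-in covariance routine (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 observations, 3 variables (synthetic)

Xc = X - X.mean(axis=0)               # mean-centre each column
S = Xc.T @ Xc / (X.shape[0] - 1)      # covariance matrix via the formula above

S_ref = np.cov(X.T)                   # np.cov expects variables in rows
```

Both computations give the same $3 \times 3$ covariance matrix.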

1.4 Correlation and Multicollinearity

Correlation is the standardised covariance:

$$r_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}$$

Multicollinearity occurs when predictor variables are highly correlated with each other. It causes serious problems in ordinary least squares (OLS) regression: coefficient estimates become unstable, their standard errors inflate, and coefficients can even change sign between very similar datasets.

PLS regression was specifically designed to handle multicollinearity.

1.5 Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$, an eigenvector $\mathbf{v}$ and its associated eigenvalue $\lambda$ satisfy:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

Eigenvalues indicate the amount of variance explained by each eigenvector direction. They are the foundation of Principal Component Analysis (PCA) and are closely related to PLS decompositions.

1.6 Ordinary Least Squares (OLS) Regression

OLS regression models the response $\mathbf{y}$ as a linear function of predictors $\mathbf{X}$:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

The OLS estimator minimises the sum of squared residuals:

$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

This requires $\mathbf{X}^T \mathbf{X}$ to be invertible, which fails when:

- $p > n$ (more predictors than observations), so $\mathbf{X}^T \mathbf{X}$ is rank-deficient.
- Predictors are perfectly (or near-perfectly) collinear, making $\mathbf{X}^T \mathbf{X}$ singular or severely ill-conditioned.

PLS provides a solution to both of these failure modes.
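Both failure modes are easy to demonstrate: $\mathbf{X}^T\mathbf{X}$ has rank below $p$, and hence is singular, whenever $p > n$ or one predictor is an exact linear combination of others. A quick NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Failure mode 1: more predictors than observations (p > n)
X_wide = rng.normal(size=(5, 10))                 # n = 5, p = 10
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)

# Failure mode 2: a perfectly collinear predictor
X_col = rng.normal(size=(20, 3))
X_col = np.column_stack([X_col, X_col[:, 0] + X_col[:, 1]])  # 4th col = col0 + col1
rank_col = np.linalg.matrix_rank(X_col.T @ X_col)

# In both cases rank(X'X) < p, so X'X is singular and the OLS formula is undefined
```

`rank_wide` is 5 (not 10) and `rank_col` is 3 (not 4), so neither Gram matrix can be inverted.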


2. What is PLS Regression?

Partial Least Squares (PLS) Regression is a multivariate statistical method that models the relationship between a matrix of predictor variables $\mathbf{X}$ and one or more response variables $\mathbf{Y}$ by extracting a small number of latent components (factors) that simultaneously:

  1. Explain as much variance in $\mathbf{X}$ as possible.
  2. Have maximum covariance (predictive power) with $\mathbf{Y}$.

Unlike OLS regression, PLS does not require $\mathbf{X}^T \mathbf{X}$ to be invertible and performs well even when predictors far outnumber observations or are highly collinear.

2.1 The Core Idea

The name "Partial Least Squares" reflects the method's origins: the algorithm builds its model through a sequence of partial (simple, local) least-squares regressions rather than one global fit, an approach developed by Herman Wold within the NIPALS framework.

In essence, PLS finds a compressed, low-dimensional representation of $\mathbf{X}$ (the latent components or scores) that is maximally informative about $\mathbf{Y}$. This is what distinguishes PLS from Principal Component Regression (PCR), which only maximises variance in $\mathbf{X}$ without considering $\mathbf{Y}$.

2.2 Real-World Applications

PLS regression is the workhorse of many applied sciences. Common applications include chemometrics (predicting chemical concentrations from spectra), metabolomics and genomics (thousands of measured variables per sample), and industrial process monitoring.

2.3 When to Use PLS Regression

PLS regression is particularly appropriate when:

| Situation | Reason PLS is Preferred |
| --- | --- |
| $p \gg n$ (many more predictors than observations) | OLS is undefined; PLS works with far fewer components than predictors |
| High multicollinearity among predictors | OLS is unstable; PLS constructs orthogonal components |
| Noisy predictor variables (measurement error) | PLS filters noise by focusing on components relevant to $\mathbf{Y}$ |
| Multiple correlated response variables | PLS-2 handles multivariate responses simultaneously |
| Interpretable latent structure is desired | PLS components have clear loading and score interpretations |
| Prediction accuracy is the primary goal | PLS minimises prediction error via cross-validation component selection |

2.4 PLS vs. Related Methods: An Overview

| Feature | OLS | PCR | Ridge | PLS |
| --- | --- | --- | --- | --- |
| Handles $p > n$ | ❌ | ✅ | ✅ | ✅ |
| Handles multicollinearity | ❌ | ✅ | ✅ | ✅ |
| Uses $\mathbf{Y}$ to construct components | N/A | ❌ | N/A | ✅ |
| Produces interpretable components | N/A | ✅ | N/A | ✅ |
| Handles multiple responses | Partially | Partially | Partially | ✅ |
| Requires matrix inversion | ✅ | Partially | ✅ | ❌ |

3. The Mathematics Behind PLS Regression

3.1 The PLS Model Structure

PLS simultaneously decomposes both $\mathbf{X}$ ($n \times p$) and $\mathbf{Y}$ ($n \times q$) into score matrices and loading matrices:

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}$$

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{F}$$

Where:

- $\mathbf{T}$ ($n \times A$) is the matrix of X-scores and $\mathbf{P}$ ($p \times A$) the matrix of X-loadings.
- $\mathbf{U}$ ($n \times A$) is the matrix of Y-scores and $\mathbf{Q}$ ($q \times A$) the matrix of Y-loadings.
- $\mathbf{E}$ and $\mathbf{F}$ are residual matrices.

The relationship between the $\mathbf{X}$-scores and $\mathbf{Y}$-scores is modelled as:

$$\mathbf{U} = \mathbf{T}\mathbf{B}_{diag} + \mathbf{H}$$

Where $\mathbf{B}_{diag}$ is a diagonal matrix of inner relation coefficients and $\mathbf{H}$ is a residual matrix.

3.2 The Objective: Maximum Covariance

The core objective of PLS is to find weight vectors $\mathbf{w}$ (for $\mathbf{X}$) and $\mathbf{c}$ (for $\mathbf{Y}$) such that the covariance between the resulting scores is maximised:

$$\max_{\mathbf{w}, \mathbf{c}} \text{Cov}(\mathbf{X}\mathbf{w},\ \mathbf{Y}\mathbf{c})^2 = \max_{\mathbf{w}, \mathbf{c}} (\mathbf{w}^T \mathbf{X}^T \mathbf{Y} \mathbf{c})^2$$

Subject to the normalisation constraints $\|\mathbf{w}\| = 1$ and $\|\mathbf{c}\| = 1$.

This is equivalent to computing the singular value decomposition (SVD) of the cross-product matrix $\mathbf{X}^T \mathbf{Y}$:

$$\mathbf{X}^T \mathbf{Y} = \mathbf{W} \mathbf{S} \mathbf{C}^T$$

Where $\mathbf{S}$ is diagonal with singular values (covariances), $\mathbf{W}$ contains the X-weight vectors, and $\mathbf{C}$ contains the Y-weight vectors.
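This equivalence can be checked directly: take the SVD of the cross-product matrix and confirm that the covariance $|\mathbf{w}^T \mathbf{X}^T \mathbf{Y} \mathbf{c}|$ attained by the leading singular vectors equals the first singular value. A NumPy sketch on centred synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 40, 6, 2
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)   # centred synthetic predictors
Y = rng.normal(size=(n, q)); Y -= Y.mean(axis=0)   # centred synthetic responses

# SVD of the cross-product matrix X'Y
U, s, Vt = np.linalg.svd(X.T @ Y)
w = U[:, 0]        # first X-weight vector (unit norm)
c = Vt[0, :]       # first Y-weight vector (unit norm)

# The covariance attained by these weights is the first singular value
attained = abs(w @ (X.T @ Y) @ c)
```

No other unit-norm pair $(\mathbf{w}, \mathbf{c})$ can attain a larger value, which is exactly the PLS objective.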

3.3 Score and Loading Computation

For each component $a = 1, 2, \dots, A$:

X-scores (latent variable scores):

$$\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$$

Where $\mathbf{X}_a$ is the deflated (residualised) $\mathbf{X}$ matrix at step $a$, and $\mathbf{w}_a$ is the X-weight vector for component $a$.

Y-scores:

$$\mathbf{u}_a = \mathbf{Y}_a \mathbf{c}_a$$

X-loadings (regression of $\mathbf{X}_a$ on $\mathbf{t}_a$):

$$\mathbf{p}_a = \frac{\mathbf{X}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$$

Y-loadings (regression of $\mathbf{Y}_a$ on $\mathbf{u}_a$):

$$\mathbf{q}_a = \frac{\mathbf{Y}_a^T \mathbf{u}_a}{\mathbf{u}_a^T \mathbf{u}_a}$$

Inner relation coefficient:

$$b_a = \frac{\mathbf{u}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$$

3.4 Deflation

After extracting each component, the data matrices are deflated (the component's contribution is removed) to ensure that subsequent components capture new, orthogonal information:

$$\mathbf{X}_{a+1} = \mathbf{X}_a - \mathbf{t}_a \mathbf{p}_a^T$$

For PLS1 (single response):

$$\mathbf{y}_{a+1} = \mathbf{y}_a - b_a \mathbf{t}_a$$

For PLS2 (multiple responses):

$$\mathbf{Y}_{a+1} = \mathbf{Y}_a - b_a \mathbf{t}_a \mathbf{q}_a^T$$

This sequential deflation ensures the X-scores $\mathbf{t}_1, \mathbf{t}_2, \dots, \mathbf{t}_A$ are mutually orthogonal: $\mathbf{t}_a^T \mathbf{t}_b = 0$ for $a \neq b$.

3.5 The PLS Regression Coefficients

After extracting $A$ components, the final PLS regression coefficients relating $\mathbf{X}$ directly to $\mathbf{Y}$ are:

$$\hat{\mathbf{B}}_{PLS} = \mathbf{W}^* \mathbf{Q}^T$$

Where $\mathbf{W}^*$ ($p \times A$) is the matrix of modified weight vectors (also called the W-star or $\mathbf{R}$ matrix), which accounts for the sequential deflation:

$$\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1}$$

The predicted values are then:

$$\hat{\mathbf{Y}} = \mathbf{X} \hat{\mathbf{B}}_{PLS} = \mathbf{X} \mathbf{W}^* \mathbf{Q}^T = \mathbf{T} \mathbf{Q}^T$$

Where $\mathbf{T} = \mathbf{X} \mathbf{W}^*$ are the X-scores computed directly from $\mathbf{X}$ without deflation.

For a single response variable $\mathbf{y}$:

$$\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}_{PLS}, \quad \hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^* \mathbf{q}$$

The predicted value for a new observation $\mathbf{x}_{new}$ is:

$$\hat{y}_{new} = \mathbf{x}_{new}^T \hat{\boldsymbol{\beta}}_{PLS} = \mathbf{x}_{new}^T \mathbf{W}^* \mathbf{q}$$
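Sections 3.3–3.5 condense into a short NIPALS-style sketch. This is an illustrative implementation on synthetic data, not the DataStatPro code; it verifies two properties derived above: the scores produced by sequential deflation are mutually orthogonal, and $\mathbf{T} = \mathbf{X}\mathbf{W}^*$ recovers them directly from the original (undeflated) $\mathbf{X}$.

```python
import numpy as np

def pls1_nipals(X, y, A):
    """Minimal PLS1 via NIPALS with deflation. X and y must be mean-centred."""
    Xa, ya = X.copy(), y.copy()
    W, P, T, q = [], [], [], []
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)           # X-weight (unit norm)
        t = Xa @ w                       # X-score
        p_a = Xa.T @ t / (t @ t)         # X-loading
        q_a = ya @ t / (t @ t)           # Y-loading (a scalar in PLS1)
        Xa = Xa - np.outer(t, p_a)       # deflate X
        ya = ya - q_a * t                # deflate y
        W.append(w); P.append(p_a); T.append(t); q.append(q_a)
    W, P = np.array(W).T, np.array(P).T
    T, q = np.column_stack(T), np.array(q)
    W_star = W @ np.linalg.inv(P.T @ W)  # W* = W (P'W)^{-1}
    beta = W_star @ q                    # final regression coefficients
    return T, W_star, beta

# Synthetic data with a known linear signal plus noise
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5)); X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
y -= y.mean()

T, W_star, beta = pls1_nipals(X, y, A=3)

gram = T.T @ T                            # scores are mutually orthogonal,
off_diag = gram - np.diag(np.diag(gram))  # so off-diagonal entries vanish
T_direct = X @ W_star                     # and T is recovered from the raw X
```

The fitted `beta` can then be applied to new centred observations exactly as in the prediction equation above.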

3.6 Relationship Between PLS and SVD

The core step of PLS, finding weight vectors that maximise the covariance between $\mathbf{t} = \mathbf{X}\mathbf{w}$ and $\mathbf{u} = \mathbf{Y}\mathbf{c}$, reduces to finding the leading left and right singular vectors of $\mathbf{X}^T \mathbf{Y}$:

$$\mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X} \mathbf{w} = \lambda^2 \mathbf{w}$$

$$\mathbf{Y}^T \mathbf{X} \mathbf{X}^T \mathbf{Y} \mathbf{c} = \lambda^2 \mathbf{c}$$

The maximum covariance $\lambda$ is the first singular value of $\mathbf{X}^T \mathbf{Y}$, and $\mathbf{w}$, $\mathbf{c}$ are the corresponding left and right singular vectors.


4. Types of PLS Methods

Several algorithmic variants of PLS exist. The most important ones are described below.

4.1 PLS1

PLS1 handles a single continuous response variable $\mathbf{y}$ ($q = 1$). It is the most common form of PLS regression in practice.

4.2 PLS2

PLS2 handles multiple response variables simultaneously ($q > 1$). It extracts components that explain $\mathbf{X}$ variance while maximising covariance with the entire $\mathbf{Y}$ matrix.

💡 PLS2 is more parsimonious than running separate PLS1 models for each response, but if the responses are very different in nature, separate PLS1 models may give better individual predictions.

4.3 PLS-DA (PLS Discriminant Analysis)

PLS-DA applies PLS regression to a categorical response variable by encoding the class membership as a binary (or dummy-coded) $\mathbf{Y}$ matrix. It is widely used for classification in chemometrics, metabolomics, and genomics.

⚠️ PLS-DA can overfit when many components are used, especially with small or imbalanced datasets. Cross-validation is essential for assessing classification performance.

4.4 OPLS (Orthogonal PLS)

OPLS (Trygg & Wold) separates the $\mathbf{X}$ variation into:

- Predictive variation, correlated with $\mathbf{Y}$.
- Orthogonal variation, systematic but uncorrelated with $\mathbf{Y}$.

This produces a model with a single predictive component (for PLS1) plus orthogonal components that account for systematic $\mathbf{X}$ variation not related to $\mathbf{Y}$. OPLS produces simpler, more interpretable loading plots.

4.5 Kernel PLS

Kernel PLS extends PLS to handle non-linear relationships between $\mathbf{X}$ and $\mathbf{Y}$ by implicitly mapping the data into a high-dimensional feature space using a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$.

Common kernels include the linear, polynomial, and Gaussian (RBF) kernels.

4.6 Sparse PLS

Sparse PLS introduces variable selection by imposing $L_1$ (Lasso-type) penalties on the weight vectors, driving some weights to exactly zero. This simultaneously performs dimensionality reduction and variable selection, producing more interpretable models in high-dimensional settings ($p \gg n$).

4.7 Summary of PLS Variants

| Variant | Response Type | Key Feature | Typical Use |
| --- | --- | --- | --- |
| PLS1 | Single continuous | Standard PLS | Most regression problems |
| PLS2 | Multiple continuous | Joint modelling of all responses | Multivariate responses |
| PLS-DA | Categorical (classes) | Classification via dummy encoding | Chemometrics, omics |
| OPLS | Single/multiple continuous | Separates predictive from orthogonal variation | Improved interpretation |
| Kernel PLS | Single/multiple | Non-linear extension via kernels | Non-linear relationships |
| Sparse PLS | Single/multiple | Variable selection via $L_1$ penalty | High-dimensional data |

The DataStatPro application implements PLS1 and PLS2 (standard PLS regression) and PLS-DA, which are the focus of this tutorial.


5. Assumptions of PLS Regression

PLS regression is a relatively assumption-light method compared to OLS. However, certain conditions should be met for valid results.

5.1 Linearity

PLS assumes a linear relationship between the latent components (scores $\mathbf{T}$) and the response $\mathbf{Y}$. Non-linear relationships between the original $\mathbf{X}$ variables and $\mathbf{Y}$ may be partially captured if the non-linearity is reflected in the latent structure, but Kernel PLS is preferred for strongly non-linear data.

5.2 Continuous (or Appropriately Encoded) Variables

Predictors and responses should be continuous (interval or ratio scale). Categorical variables must be appropriately encoded before use, for example dummy-coded responses as in PLS-DA.

5.3 No Requirement for Multivariate Normality

Unlike some classical multivariate methods, PLS does not require the predictors or response to follow a multivariate normal distribution. This makes PLS robust to skewed or non-normal variables (though severe non-normality may still affect inference).

5.4 No Perfect Redundancy (Degenerate Cases)

If a predictor variable $X_j$ is a perfect linear combination of other predictor variables (i.e., the column is a deterministic function of others), it carries no additional information and should be removed before fitting PLS. Similarly, a response variable that is a perfect linear combination of $\mathbf{X}$ columns will cause a degenerate model.

5.5 Sufficient Sample Size

While PLS handles $p > n$ settings, model quality improves with more observations. General guidelines: as a rough rule of thumb, aim for several observations per extracted component, and prefer LOOCV over $k$-fold CV when $n < 30$.

5.6 Representativeness of Calibration Set

The calibration (training) samples should span the full range of variation expected in future prediction samples. PLS is an interpolation method — predictions for samples outside the calibration space (extrapolation) are unreliable.

5.7 No Gross Outliers

Extreme outliers in $\mathbf{X}$ or $\mathbf{Y}$ can disproportionately influence the extracted components and distort the model. Outliers should be detected (using score plots and leverage/residual diagnostics) and investigated before finalising the model.


6. Data Preprocessing

Data preprocessing is arguably the most critical step in PLS analysis. The choice of preprocessing can have a profound impact on the extracted components and the resulting model.

6.1 Mean Centring

Mean centring subtracts the column mean from each variable:

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_j, \quad \bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}$$

Mean centring is almost always required for PLS. Without it, the first component tends to describe the mean of the data rather than the variance structure, and subsequent components are distorted.

After mean centring, the PLS model is fitted to $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{Y}}$, and predictions are adjusted using the response mean:

$$\hat{y}_{new} = \bar{y} + \tilde{\mathbf{x}}_{new}^T \hat{\boldsymbol{\beta}}_{PLS}$$
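The centring workflow is: store the training means, fit on centred data, and add $\bar{y}$ back at prediction time. A minimal sketch using ordinary least squares as a stand-in for the fitted coefficients (the centring logic is identical for PLS):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, size=(25, 3))     # synthetic data with non-zero means
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 10.0                  # exact linear response (no noise)

# Store the training means, then fit on centred data
x_mean, y_mean = X.mean(axis=0), y.mean()
beta = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)[0]

# New sample: centre with the *training* means, then add y-bar back
x_new = np.array([4.0, 6.0, 5.0])
y_hat = y_mean + (x_new - x_mean) @ beta
y_true = x_new @ true_beta + 10.0
```

Using the training means (never the test-set means) at prediction time is what keeps the model honest on new data.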

6.2 Autoscaling (Mean Centring + Unit Variance Scaling)

Autoscaling (also called standardisation or z-score scaling) both mean-centres and scales each variable to unit variance:

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$

When to use autoscaling: when variables are measured in different units or have very different variances, so that no single variable dominates the components purely because of its scale. It is the usual default choice.

When not to autoscale: when all variables share the same units and their variance is itself informative (e.g., spectral intensities), or when a variable is nearly constant, since dividing by a tiny standard deviation amplifies noise.

6.3 Other Scaling Methods

| Method | Formula | When to Use |
| --- | --- | --- |
| No scaling (mean-centre only) | $\tilde{x}_{ij} = x_{ij} - \bar{x}_j$ | Same units; variance is meaningful |
| Autoscaling (UV) | $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/s_j$ | Different units; default choice |
| Pareto scaling | $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/\sqrt{s_j}$ | Compromise: down-weights high-variance variables less severely than UV |
| Range scaling | $\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/(\max_j - \min_j)$ | All variables on a comparable range |
| Vast scaling | $\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \cdot \frac{\bar{x}_j}{s_j}$ | Focuses on variables with a low coefficient of variation |
| Log transformation | $x_{ij}' = \ln(x_{ij})$ | Right-skewed, multiplicative data (e.g., concentration data, metabolomics) |
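A compact sketch of the column-wise scaling methods from the table (the `scale` helper and its method names are illustrative, not a DataStatPro API):

```python
import numpy as np

def scale(X, method="uv"):
    """Column-wise scaling per the table above (illustrative helper)."""
    xbar = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)
    if method == "center":
        return X - xbar
    if method == "uv":                                   # autoscaling
        return (X - xbar) / s
    if method == "pareto":
        return (X - xbar) / np.sqrt(s)
    if method == "range":
        return (X - xbar) / (X.max(axis=0) - X.min(axis=0))
    raise ValueError(f"unknown method: {method}")

# Synthetic data with wildly different column scales
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

Z = scale(X, "uv")
col_means = Z.mean(axis=0)       # ~0 after autoscaling
col_sds = Z.std(axis=0, ddof=1)  # exactly 1 after autoscaling
```

After autoscaling, every column contributes equally to the component extraction regardless of its original scale.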

💡 The response variable $\mathbf{y}$ should also be mean-centred (and scaled if using PLS2 with multiple responses on different scales). For PLS1, scaling $\mathbf{y}$ to unit variance is optional but often beneficial.

6.4 Handling Missing Data

PLS with missing data requires special treatment. Options include: deleting incomplete observations (listwise deletion), imputing missing entries (mean, nearest-neighbour, or model-based imputation), or using NIPALS-type algorithms that skip missing elements during the iterative fitting.

⚠️ Missing data in $\mathbf{X}$ is more tractable than missing data in $\mathbf{Y}$. If $\mathbf{y}$ values are missing, those observations typically cannot be used in model calibration.

6.5 Outlier Detection Before Modelling

Before fitting PLS, screen for outliers using univariate checks (boxplots, z-scores), PCA score plots, and leverage or Hotelling's $T^2$ diagnostics.


7. PLS Components and Latent Variables

Understanding what PLS components represent is essential for interpreting PLS models.

7.1 X-Scores ($\mathbf{T}$)

The X-scores matrix $\mathbf{T}$ ($n \times A$) contains the coordinates of each observation in the low-dimensional latent space:

$$\mathbf{t}_a = \mathbf{X} \mathbf{w}_a^*$$

Where $\mathbf{w}_a^*$ is the $a$-th column of $\mathbf{W}^*$.

7.2 X-Loadings ($\mathbf{P}$)

The X-loadings matrix $\mathbf{P}$ ($p \times A$) describes how the original $\mathbf{X}$ variables contribute to each latent component during deflation:

$$\mathbf{p}_a = \frac{\mathbf{X}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$$

7.3 X-Weights ($\mathbf{W}$ and $\mathbf{W}^*$)

The X-weights $\mathbf{W}$ describe how the $\mathbf{X}$ variables are weighted to form the X-scores:

$$\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a \quad \text{(during deflation)}$$

The modified X-weights $\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1}$ relate the original (undeflated) $\mathbf{X}$ directly to the scores:

$$\mathbf{T} = \mathbf{X} \mathbf{W}^*$$

⚠️ There is a subtle but important distinction between $\mathbf{W}$ and $\mathbf{W}^*$. For interpreting the relationship between $\mathbf{X}$ variables and the PLS components, use $\mathbf{W}^*$ (not $\mathbf{W}$), as $\mathbf{W}^*$ accounts for the deflation steps and relates to the original $\mathbf{X}$ directly.

7.4 Y-Loadings ($\mathbf{Q}$) and Y-Weights ($\mathbf{C}$)

The Y-loadings $\mathbf{Q}$ ($q \times A$) describe how the $\mathbf{Y}$ variables are reconstructed from the X-scores:

$$\hat{\mathbf{Y}} = \mathbf{T} \mathbf{Q}^T$$

Loading $q_{ja}$ indicates how strongly response variable $Y_j$ is associated with component $a$ in the final prediction equation. For PLS1 ($q = 1$), $\mathbf{Q}$ reduces to a scalar $q_a$ for each component.

7.5 Y-Scores ($\mathbf{U}$)

The Y-scores $\mathbf{U}$ ($n \times A$) are the latent components extracted from $\mathbf{Y}$:

$$\mathbf{u}_a = \mathbf{Y}_a \mathbf{c}_a$$

In the inner relation $\mathbf{u}_a \approx b_a \mathbf{t}_a$, the Y-scores should be close to the X-scores (scaled by $b_a$). The tightness of this relationship is diagnostic of model quality: a tight, linear $\mathbf{t}_a$ vs. $\mathbf{u}_a$ scatter indicates a strong component, whereas a diffuse cloud indicates a weak one.

7.6 The Biplot

The PLS biplot overlays the score plot (observations) and loading plot (variables) in the same space, so that observations can be interpreted directly in terms of the variables that drive their position.


8. Choosing the Number of Components

The number of latent components $A$ is the primary hyperparameter in PLS regression. Too few components underfit (high bias, missing important structure); too many overfit (low bias but high variance, memorising noise).

8.1 Cross-Validation (Primary Method)

Cross-validation (CV) is the gold-standard method for selecting $A$ in PLS. The most common approach is $k$-fold cross-validation:

  1. Divide the $n$ observations into $k$ roughly equal folds.
  2. For each fold $v = 1, \dots, k$:
     a. Fit PLS with $a = 1, 2, \dots, A_{\max}$ components on the data excluding fold $v$.
     b. Predict fold $v$ observations using each fitted model.
     c. Compute prediction errors for fold $v$.
  3. Average the prediction errors across all folds.
  4. Select $A$ minimising the cross-validated prediction error.

Leave-One-Out Cross-Validation (LOOCV): Special case of $k$-fold CV with $k = n$. Computationally intensive but uses the maximum amount of data for fitting at each step.

Recommended $k$: $k = 5$ or $k = 10$ is standard. For small datasets ($n < 30$), LOOCV is preferred.
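The steps above, together with the PRESS statistic and $Q^2$ defined in Sections 8.2 and 8.3, can be sketched as follows. The compact `pls1_fit` helper is an illustrative NIPALS implementation, and the fold-splitting scheme (seeded shuffle, $k = 5$) is an assumption for the sketch:

```python
import numpy as np

def pls1_fit(X, y, A):
    """Compact NIPALS PLS1 returning regression coefficients (centred inputs)."""
    Xa, ya = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)
        t = Xa @ w
        p = Xa.T @ t / (t @ t)
        qa = ya @ t / (t @ t)
        Xa = Xa - np.outer(t, p)
        ya = ya - qa * t
        W.append(w); P.append(p); q.append(qa)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.inv(P.T @ W) @ np.array(q)

def press_kfold(X, y, A, k=5, seed=0):
    """PRESS(A) under k-fold CV; folds come from a seeded shuffle."""
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for held in np.array_split(idx, k):
        train = np.setdiff1d(idx, held)
        xm, ym = X[train].mean(axis=0), y[train].mean()
        beta = pls1_fit(X[train] - xm, y[train] - ym, A)
        pred = ym + (X[held] - xm) @ beta        # centre with *training* means
        press += np.sum((y[held] - pred) ** 2)
    return press

# Synthetic data: two informative predictors out of eight
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=60)

press = {A: press_kfold(X, y, A) for A in range(1, 6)}
best_A = min(press, key=press.get)               # A* = argmin PRESS(A)
ss_tot = np.sum((y - y.mean()) ** 2)
q2 = {A: 1 - press[A] / ss_tot for A in press}   # cross-validated R²
```

Note that each fold is centred with its own training means, so no information from the held-out samples leaks into the fit.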

8.2 PRESS Statistic (Predicted Residual Error Sum of Squares)

The PRESS statistic summarises the cross-validated prediction error for $A$ components:

$$PRESS(A) = \sum_{v=1}^k \sum_{i \in \text{fold } v} \left(y_i - \hat{y}_{i,-v}(A)\right)^2$$

Where $\hat{y}_{i,-v}(A)$ is the prediction for observation $i$ when it was in the held-out fold $v$, using a model with $A$ components.

Select $A^* = \arg\min_A PRESS(A)$.

💡 To guard against overfitting while maintaining parsimony, some guidelines recommend choosing the smallest $A$ for which $PRESS(A)$ is within one standard error of the minimum PRESS (the "one-standard-error rule").

8.3 $Q^2$ (Cross-Validated $R^2$)

$Q^2$ is the cross-validated analogue of $R^2$:

$$Q^2(A) = 1 - \frac{PRESS(A)}{SS_{Y,total}}$$

Where:

$$SS_{Y,total} = \sum_{i=1}^n (y_i - \bar{y})^2$$

$Q^2$ ranges from $-\infty$ (model is worse than predicting the mean) to 1 (perfect cross-validated prediction).

Guidelines for $Q^2$:

| $Q^2$ Value | Interpretation |
| --- | --- |
| $Q^2 < 0$ | Model is worse than the mean; no predictive power |
| $0 \leq Q^2 < 0.50$ | Poor to moderate predictive ability |
| $0.50 \leq Q^2 < 0.70$ | Moderate predictive ability |
| $0.70 \leq Q^2 < 0.90$ | Good predictive ability |
| $Q^2 \geq 0.90$ | Excellent predictive ability (verify not overfitting) |

The optimal number of components is where $Q^2$ reaches a maximum (or where adding more components does not meaningfully increase $Q^2$).

8.4 The $R^2_Y$ vs. $Q^2$ Plot

A standard diagnostic in PLS is to plot both $R^2_Y$ (variance explained in $\mathbf{Y}$, a training set metric) and $Q^2$ (the cross-validated metric) against the number of components $A$.

⚠️ A model with $R^2_Y$ much higher than $Q^2$ is overfitting. Aim for $Q^2$ close to $R^2_Y$.

8.5 Scree Plot of Eigenvalues / Variance Explained

A scree plot of the variance explained in $\mathbf{X}$ ($R^2_X$) by each successive component can also guide component selection: look for an "elbow" where additional components explain diminishing amounts of variance. However, $R^2_X$ alone ignores predictive relevance for $\mathbf{Y}$; always prioritise $Q^2$.

8.6 Permutation Testing

Permutation tests provide a rigorous null-hypothesis test for whether a PLS model with $A$ components explains more variance in $\mathbf{Y}$ than expected by chance:

  1. Randomly permute (shuffle) the $\mathbf{y}$ vector $B$ times (e.g., $B = 999$).
  2. Fit PLS with $A$ components to each permuted dataset and record $R^2_Y$ and $Q^2$.
  3. The p-value is the proportion of permuted $R^2_Y$ (or $Q^2$) values that exceed the observed value.

A significant result ($p < 0.05$) confirms the model captures a real relationship, not a spurious one.
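The procedure above can be sketched with a compact two-component PLS1 fit. This is illustrative only: it uses the training $R^2_Y$ as the permuted statistic and $B = 200$ rather than 999 to keep the run short.

```python
import numpy as np

def r2_pls1(X, y, A=2):
    """Training R²_Y of a compact A-component PLS1 fit (illustrative)."""
    Xa = X - X.mean(axis=0)
    ya = y - y.mean()
    y0 = ya.copy()
    y_hat = np.zeros_like(ya)
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)
        t = Xa @ w
        p = Xa.T @ t / (t @ t)
        q = ya @ t / (t @ t)
        y_hat += q * t                   # accumulated fit: sum_a q_a t_a
        Xa = Xa - np.outer(t, p)
        ya = ya - q * t
    return 1.0 - np.sum((y0 - y_hat) ** 2) / np.sum(y0 ** 2)

# Synthetic data with a genuine X-y relationship
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 10))
y = 2.0 * X[:, 0] + 0.3 * rng.normal(size=30)

r2_obs = r2_pls1(X, y)

# Refit on B datasets with a shuffled (permuted) response
B = 200
r2_perm = np.array([r2_pls1(X, rng.permutation(y)) for _ in range(B)])
p_value = (1 + np.sum(r2_perm >= r2_obs)) / (B + 1)  # one-sided permutation p
```

The permuted fits still achieve non-trivial training $R^2_Y$ (PLS can always fit some noise), which is precisely why the permutation reference distribution, rather than the raw $R^2_Y$ value, should be used to judge significance.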


9. Model Fit and Evaluation

9.1 $R^2_X$ (Variance Explained in $\mathbf{X}$)

The proportion of total $\mathbf{X}$ variance explained by component $a$ is:

$$R^2_{X,a} = \frac{\|\mathbf{t}_a \mathbf{p}_a^T\|^2}{\|\mathbf{X}\|^2} = \frac{(\mathbf{t}_a^T \mathbf{t}_a)(\mathbf{p}_a^T \mathbf{p}_a)}{\|\mathbf{X}\|^2}$$

Cumulative $R^2_X$ after $A$ components:

$$R^2_X(A) = \sum_{a=1}^A R^2_{X,a}$$

A high $R^2_X$ indicates the components capture most of the systematic variation in $\mathbf{X}$.

9.2 $R^2_Y$ (Variance Explained in $\mathbf{Y}$)

The proportion of total $\mathbf{Y}$ variance explained by $A$ components (the training set $R^2$):

$$R^2_Y(A) = 1 - \frac{\|\mathbf{Y} - \hat{\mathbf{Y}}\|^2}{\|\mathbf{Y} - \bar{\mathbf{Y}}\|^2} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

For PLS1, this reduces to:

$$R^2_Y(A) = 1 - \frac{RSS(A)}{SS_{total}}$$

Where $RSS(A) = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the residual sum of squares.

⚠️ $R^2_Y$ alone is an optimistic (biased) estimate of model quality because it is computed on the training data. Always report $Q^2$ alongside $R^2_Y$.

9.3 Root Mean Squared Error of Calibration (RMSEC)

$$RMSEC = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$$

This is the training set prediction error, in the same units as $\mathbf{y}$.

9.4 Root Mean Squared Error of Cross-Validation (RMSECV)

$$RMSECV = \sqrt{\frac{PRESS}{n}} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{i,-v})^2}$$

RMSECV is the cross-validated prediction error and is the primary model selection criterion alongside $Q^2$.

$$Q^2 = 1 - \frac{PRESS}{SS_{total}} = 1 - \frac{n \cdot RMSECV^2}{SS_{total}}$$

9.5 Root Mean Squared Error of Prediction (RMSEP)

When a separate independent test set of $n_{test}$ observations is available (not used in model fitting or cross-validation):

$$RMSEP = \sqrt{\frac{1}{n_{test}}\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2}$$

RMSEP is the most honest estimate of true prediction error and should always be reported when a genuine external test set exists.

Typical hierarchy of prediction error estimates:

$$RMSEC \leq RMSECV \leq RMSEP$$

The gap between RMSEC and RMSECV/RMSEP indicates the degree of overfitting.

9.6 Bias and Slope of Predicted vs. Observed

A predicted vs. observed plot should ideally show points scattered symmetrically around the 1:1 line (slope = 1, intercept = 0). Formally test for systematic bias using a regression of observed on predicted:

$$y_i = b_0 + b_1 \hat{y}_i + \epsilon_i$$

9.7 Summary of Model Fit Statistics

| Statistic | Definition | Optimal | Purpose |
| --- | --- | --- | --- |
| $R^2_X$ | Variance in $\mathbf{X}$ explained | High (informative but secondary) | Assess how well components represent $\mathbf{X}$ |
| $R^2_Y$ | Variance in $\mathbf{Y}$ explained (training) | High | Training set fit |
| $Q^2$ | Cross-validated $R^2_Y$ | High, close to $R^2_Y$ | Predictive ability, component selection |
| RMSEC | Training set RMSE | Low (but biased) | Training error in original units |
| RMSECV | Cross-validated RMSE | Low | CV prediction error (primary) |
| RMSEP | External test set RMSE | Low | True prediction error |

10. Interpretation of PLS Results

10.1 Regression Coefficients ($\hat{\boldsymbol{\beta}}_{PLS}$)

The PLS regression coefficients $\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^* \mathbf{q}$ have the same interpretation as OLS coefficients: $\hat{\beta}_j$ is the expected change in $\hat{y}$ for a one-unit increase in $X_j$, with all other variables held constant.

However, because PLS implicitly regularises through dimensionality reduction, the coefficients are shrunken towards zero relative to OLS, and their variance is correspondingly reduced (a bias-variance trade-off).

⚠️ When the $\mathbf{X}$ variables are autoscaled, the coefficients are on the standardised scale. To interpret them on the original scales, multiply by $s_y$ and divide by $s_j$ (the SD of $X_j$), or report the raw (unstandardised) coefficients after back-scaling.

10.2 Variable Importance in Projection (VIP)

The VIP score (Wold, 1994) quantifies the contribution of each predictor variable to the PLS model, accounting for all $A$ components:

$$VIP_j = \sqrt{\frac{p \sum_{a=1}^A \left(R^2_Y(a) - R^2_Y(a-1)\right) w_{ja}^{*2}}{\sum_{a=1}^A \left(R^2_Y(a) - R^2_Y(a-1)\right) \sum_{j=1}^p w_{ja}^{*2}}}$$

Where:

- $w_{ja}^*$ is the weight of variable $j$ for component $a$.
- $R^2_Y(a) - R^2_Y(a-1)$ is the additional $\mathbf{Y}$-variance explained by component $a$.
- $p$ is the number of predictor variables.

Properties of VIP: the squared VIP scores average to 1 across all variables, so values above 1 mark above-average contributions.

Interpretation:

| VIP Score | Interpretation |
| --- | --- |
| $VIP_j > 1.0$ | Variable $j$ is important for the model (above-average contribution) |
| $0.8 \leq VIP_j \leq 1.0$ | Variable $j$ has moderate importance |
| $VIP_j < 0.8$ | Variable $j$ has low importance; may be a candidate for removal |

💡 VIP is the most widely used variable selection criterion in PLS. Variables with VIP < 0.8 (or a domain-specific threshold) are candidates for removal, which can improve model parsimony and sometimes prediction performance.
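A sketch of the VIP computation following the formula above. Note that the classical definition uses the raw weights $\mathbf{W}$, whereas this document's formula uses $\mathbf{W}^*$; the sketch follows the latter and should be treated as illustrative. It also checks the property that the squared VIP scores average to 1.

```python
import numpy as np

def pls1_vip(X, y, A):
    """VIP scores per the formula above (illustrative sketch using W*)."""
    Xa = X - X.mean(axis=0)
    ya = y - y.mean()
    ss_y = np.sum(ya ** 2)
    W, P, expl = [], [], []
    for _ in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)
        t = Xa @ w
        p_a = Xa.T @ t / (t @ t)
        q = ya @ t / (t @ t)
        expl.append((q * t) @ (q * t) / ss_y)   # ΔR²_Y from this component
        Xa = Xa - np.outer(t, p_a)
        ya = ya - q * t
        W.append(w); P.append(p_a)
    W, P, expl = np.array(W).T, np.array(P).T, np.array(expl)
    Ws = W @ np.linalg.inv(P.T @ W)             # W*
    p_vars = X.shape[1]
    num = p_vars * (Ws ** 2) @ expl             # p · Σ_a ΔR²_a · w*²_ja
    den = expl @ (Ws ** 2).sum(axis=0)          # Σ_a ΔR²_a · Σ_j w*²_ja
    return np.sqrt(num / den)

# Synthetic data: only the first predictor carries signal
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 6))
y = 3.0 * X[:, 0] + 0.2 * rng.normal(size=50)

vip = pls1_vip(X, y, A=2)
```

On these data the informative variable receives a VIP well above 1 while the pure-noise variables fall below it.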

10.3 Loadings Plot

The loadings plot displays the X-loadings $\mathbf{p}_a$ (or weights $\mathbf{w}_a^*$) for two components simultaneously, revealing which variables contribute strongly to each component and which variables group together (i.e., are correlated).

For spectroscopic data, a plot of the loadings as a function of wavelength (a "loading spectrum") is particularly informative, as it reveals which spectral regions are important.

10.4 Score Plot

The score plot displays the X-scores ($t_1$ vs. $t_2$, etc.) for all observations, revealing clusters, trends, and potential outliers among the samples.

When colour-coded by $\mathbf{y}$ values, the score plot shows whether the latent structure in $\mathbf{X}$ aligns with $\mathbf{Y}$, the hallmark of a good PLS model.

10.5 Weight-Loading Biplot ($w^*$-$q$ Plot)

The correlation loadings plot or $w^*$-$q$ biplot plots the modified X-weights $\mathbf{w}^*$ and Y-loadings $\mathbf{q}$ together in the same space. Variables (X and Y) that appear close together in this plot are positively correlated; variables on opposite sides are negatively correlated. This provides a comprehensive view of the $\mathbf{X}$-$\mathbf{Y}$ relationship structure.

10.6 Leverage and Residuals (Influence Analysis)

Leverage measures how influential observation $i$ is in determining the model:

$$h_i = \frac{1}{n} + \mathbf{t}_{i\cdot}(\mathbf{T}^T\mathbf{T})^{-1}\mathbf{t}_{i\cdot}^T = \frac{1}{n} + \sum_{a=1}^A \frac{t_{ia}^2}{\sum_{j=1}^n t_{ja}^2}$$

Where $\mathbf{t}_{i\cdot}$ is the $i$-th row of $\mathbf{T}$.

Standardised Y-residuals:

$$e_i^* = \frac{y_i - \hat{y}_i}{\hat{\sigma}\sqrt{1 - h_i}}$$

A leverage-residual plot (also called a Williams plot) displays $h_i$ vs. $e_i^*$: observations combining high leverage with large standardised residuals are influential outliers and warrant investigation.

10.7 Hotelling's $T^2$ and SPE (DModX)

Two complementary multivariate control statistics are used to identify unusual samples:

Hotelling's $T^2$: Measures the distance of a sample from the centre of the model in the latent space:

$$T^2_i = \mathbf{t}_i^T (\mathbf{T}^T \mathbf{T} / n)^{-1} \mathbf{t}_i = n \sum_{a=1}^A \frac{t_{ia}^2}{\mathbf{t}_a^T \mathbf{t}_a}$$

An approximate $F$-distribution critical value:

$$T^2_{crit} = \frac{A(n^2-1)}{n(n-A)} F_{\alpha,A,n-A}$$

Observations with $T^2_i > T^2_{crit}$ are outliers within the model space (an unusual combination of latent components).

SPE (Squared Prediction Error) / DModX: Measures the distance of a sample from the PLS model plane (residual variability not captured by the model):

$$SPE_i = \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2 = \sum_{j=1}^p (x_{ij} - \hat{x}_{ij})^2$$

A high SPE indicates the observation does not conform to the $\mathbf{X}$-covariance structure of the calibration set.

💡 Use both $T^2$ and SPE together. A sample can be unusual inside the model ($T^2$ high), outside the model (SPE high), or both.
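The three influence statistics (leverage, Hotelling's $T^2$, SPE) all derive from the score matrix $\mathbf{T}$ and the model reconstruction $\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^T$. In this sketch the scores and loadings come from a rank-2 SVD of centred synthetic data, as a stand-in for a fitted 2-component PLS model:

```python
import numpy as np

# Stand-in for a fitted 2-component model: scores/loadings from a rank-2 SVD
# of centred synthetic data (in practice T and P come from the PLS fit)
rng = np.random.default_rng(9)
X = rng.normal(size=(40, 5)); X -= X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :2] * s[:2]                   # scores (n × A)
P = Vt[:2, :].T                        # loadings (p × A)
n, A = T.shape

# Leverage: h_i = 1/n + Σ_a t_ia² / Σ_j t_ja²
h = 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

# Hotelling's T²: distance from the model centre, in the latent space
t2 = n * np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
t2_check = np.diag(T @ np.linalg.inv(T.T @ T / n) @ T.T)  # matrix form

# SPE: squared residual distance from the model plane
spe = np.sum((X - T @ P.T) ** 2, axis=1)
```

Note that $T^2_i = n\,(h_i - 1/n)$, so the two statistics rank samples identically; SPE adds the complementary "outside the model" perspective.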


11. Validation Methods

Rigorous validation is essential to ensure a PLS model genuinely predicts new observations, rather than memorising noise in the training data.

11.1 Internal Validation: Cross-Validation

$k$-fold cross-validation and LOOCV (described in Section 8.1) provide internal validation estimates ($Q^2$, RMSECV). Internal validation is mandatory but can still be optimistic if the same data were used to select preprocessing, model type, and components.

Monte Carlo Cross-Validation (MCCV): Randomly splits the data into training and validation sets $B$ times (e.g., $B = 100$), each time using a different random split (e.g., 80% train / 20% validate). The distribution of RMSECV values across splits provides uncertainty estimates.

11.2 External Validation: Independent Test Set

The most rigorous validation uses a truly independent test set — samples not used in any stage of model building (not in training, not in cross-validation, not in component selection):

$$RMSEP = \sqrt{\frac{1}{n_{test}}\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2}$$

How to split data for external validation:

11.3 Permutation Test for Model Validity

(Described in Section 8.6.) A permutation test confirms that the model's $Q^2$ is significantly above what would be obtained by chance.

A permutation plot shows the distribution of $Q^2$ values from permuted models overlaid with the observed $Q^2$. If the observed $Q^2$ falls well above the permutation distribution, the model is valid.

11.4 Y-Randomisation Test

Related to the permutation test, Y-randomisation (also called response permutation) repeatedly randomises $\mathbf{y}$, fits PLS with the same $A$ components, and records $R^2_Y$ and $Q^2$.

11.5 The Ratio $R^2_Y / Q^2$ as a Validity Check

A simple heuristic from chemometrics practice:


12. Comparison with Related Methods

12.1 PLS vs. Principal Component Regression (PCR)

Both PLS and PCR reduce dimensionality before regression. The key difference:

| Aspect | PCR | PLS |
|---|---|---|
| Component extraction criterion | Maximise variance in $\mathbf{X}$ only | Maximise covariance between $\mathbf{X}$ scores and $\mathbf{Y}$ |
| $\mathbf{Y}$ used in component extraction | ❌ No | ✅ Yes |
| Relevant components | May not be the first few (components uncorrelated with $\mathbf{Y}$ may explain much $\mathbf{X}$ variance) | First few components are typically most predictive of $\mathbf{Y}$ |
| Number of components needed | Typically more | Typically fewer |
| Predictive performance | Comparable or lower | Generally better |

💡 PLS is generally preferred over PCR for prediction tasks because it ensures the extracted components are relevant for predicting $\mathbf{Y}$. PCR may extract components that explain much $\mathbf{X}$ variance but are useless for prediction.

12.2 PLS vs. Ridge Regression

| Aspect | Ridge Regression | PLS |
|---|---|---|
| Handles multicollinearity | ✅ | ✅ |
| Shrinkage mechanism | Continuous $L_2$ penalty on $\hat{\boldsymbol{\beta}}$ | Discrete (choice of $A$) |
| Variable selection | ❌ (all variables retained) | Indirect (via VIP) |
| Interpretable components | ❌ | ✅ |
| Handles $p > n$ | ✅ | ✅ |
| Cross-validation parameter | $\lambda$ (regularisation) | $A$ (components) |

12.3 PLS vs. Lasso

| Aspect | Lasso | PLS |
|---|---|---|
| Variable selection | ✅ (explicit, drives coefficients to zero) | Indirect (via VIP) |
| Correlated predictors | ❌ (arbitrary selection among correlated variables) | ✅ (handled gracefully; correlated variables share weight) |
| Latent structure | ❌ | ✅ |
| Interpretable components | ❌ | ✅ |
| $p \gg n$ | ✅ (but arbitrarily selects one from each correlated group) | ✅ |

12.4 PLS vs. OLS Multiple Regression

| Aspect | OLS | PLS |
|---|---|---|
| Multicollinearity | ❌ (unstable, inflated SE) | ✅ (stable) |
| $p > n$ | ❌ (undefined) | ✅ |
| Coefficient interpretation | Straightforward (ceteris paribus) | Requires care (components summarise $\mathbf{X}$) |
| Statistical inference (p-values) | ✅ (exact under assumptions) | ❌ (non-trivial; use jack-knife or bootstrap) |
| Prediction accuracy | ✅ (when $n \gg p$, no collinearity) | ✅ (especially when $p > n$ or collinearity present) |

12.5 When to Choose Which Method

| Condition | Recommended Method |
|---|---|
| $n \gg p$, low multicollinearity, need inference | OLS |
| High multicollinearity, $p < n$, need inference | Ridge Regression |
| Explicit variable selection needed, $p < n$ | Lasso or Elastic Net |
| $p \gg n$, high multicollinearity, interpretability needed | PLS |
| Multiple correlated responses ($\mathbf{Y}$ matrix) | PLS2 |
| Classification with high-dimensional $\mathbf{X}$ | PLS-DA |
| Non-linear relationships | Kernel PLS or non-linear methods (RF, SVM) |
| Want PCR but with $\mathbf{Y}$-guided component selection | PLS |

13. Using the PLS Regression Component

The PLS Regression component in the DataStatPro application provides a full end-to-end workflow for fitting, validating, and interpreting PLS models.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should have at least one response variable and two or more predictor variables.

Step 2 — Select Analysis Type Choose the PLS analysis type:

Step 3 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. These should be continuous or ordinal numeric variables.

💡 You can select all available numeric predictors and rely on VIP scores for post-hoc variable selection.

Step 4 — Select Response Variable(s) (Y)

Step 5 — Configure Preprocessing Select the preprocessing method for $\mathbf{X}$ (and optionally $\mathbf{Y}$):

Step 6 — Configure Number of Components Choose how to determine the number of components $A$:

Step 7 — Configure Cross-Validation Select the cross-validation scheme:

Step 8 — Select Display Options Choose which outputs to display:

Step 9 — Run the Analysis Click "Run PLS Regression". The application will:

  1. Apply the selected preprocessing to $\mathbf{X}$ and $\mathbf{Y}$.
  2. Run cross-validation to determine the optimal number of components (if automatic).
  3. Fit the final PLS model with the selected $A$ components using NIPALS or SIMPLS.
  4. Compute scores ($\mathbf{T}$, $\mathbf{U}$), loadings ($\mathbf{P}$, $\mathbf{Q}$), weights ($\mathbf{W}$, $\mathbf{W}^*$).
  5. Compute VIP scores and regression coefficients.
  6. Compute model fit statistics ($R^2_X$, $R^2_Y$, $Q^2$, RMSEC, RMSECV).
  7. Compute leverage, residuals, $T^2$, and SPE for all observations.
  8. Generate all selected visualisations and tables.
  9. Run permutation tests (if selected).

14. Computational and Formula Details

14.1 The NIPALS Algorithm for PLS1

The Non-linear Iterative Partial Least Squares (NIPALS) algorithm is the classical iterative procedure for PLS decomposition. For PLS1 (single response):

Inputs: Mean-centred (and scaled) $\mathbf{X}_1 = \tilde{\mathbf{X}}$ ($n \times p$) and $\mathbf{y}_1 = \tilde{\mathbf{y}}$ ($n \times 1$).

For each component $a = 1, 2, \dots, A$:

  1. Initialise: $\mathbf{u}_a = \mathbf{y}_a$ (or any non-zero starting vector).

  2. Compute X-weight vector: $\mathbf{w}_a = \frac{\mathbf{X}_a^T \mathbf{u}_a}{\|\mathbf{X}_a^T \mathbf{u}_a\|}$ (normalise to unit length).

  3. Compute X-score vector: $\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$

  4. Compute Y-weight (scalar for PLS1): $c_a = \frac{\mathbf{y}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$

  5. Compute Y-score vector: $\mathbf{u}_a = \mathbf{y}_a \, c_a / c_a^2 = \mathbf{y}_a / c_a$

For PLS1, since $q = 1$, $\mathbf{u}_a$ is simply proportional to $\mathbf{y}_a$, so no iteration is needed; convergence is immediate.

  6. Check convergence (PLS2 only; for PLS1, skip to step 7).

  7. Compute X-loadings: $\mathbf{p}_a = \frac{\mathbf{X}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$

  8. Compute inner relation coefficient: $b_a = \frac{\mathbf{u}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}$

  9. Deflate $\mathbf{X}$: $\mathbf{X}_{a+1} = \mathbf{X}_a - \mathbf{t}_a \mathbf{p}_a^T$

  10. Deflate $\mathbf{y}$: $\mathbf{y}_{a+1} = \mathbf{y}_a - b_a \mathbf{t}_a$

After all $A$ components, compute $\mathbf{W}^*$:

$$\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T \mathbf{W})^{-1}$$

Where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_A]$ and $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_A]$.

Compute PLS regression coefficients:

$$\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^* \mathbf{b}$$

Where $\mathbf{b} = [b_1, b_2, \dots, b_A]^T$ (inner relation coefficients, equal to the Y-loadings $\mathbf{q}$ in PLS1).
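The PLS1 loop above can be sketched in NumPy as follows (an illustrative implementation, not the DataStatPro source; it assumes `X` and `y` are already mean-centred):

```python
import numpy as np

def pls1_nipals(X, y, A):
    """PLS1 via NIPALS. X (n x p) and y (n,) must be mean-centred."""
    Xa, ya = X.copy(), y.copy()
    p = X.shape[1]
    W, P, b = np.zeros((p, A)), np.zeros((p, A)), np.zeros(A)
    for a in range(A):
        w = Xa.T @ ya                     # steps 1-2: weight from u = y
        w /= np.linalg.norm(w)            # normalise to unit length
        t = Xa @ w                        # step 3: X-score
        tt = t @ t
        c = (ya @ t) / tt                 # step 4: Y-weight (scalar)
        pa = Xa.T @ t / tt                # step 7: X-loading
        Xa = Xa - np.outer(t, pa)         # step 9: deflate X
        ya = ya - c * t                   # step 10: deflate y (b_a = c_a)
        W[:, a], P[:, a], b[a] = w, pa, c
    W_star = W @ np.linalg.inv(P.T @ W)   # modified weights
    return W, P, b, W_star @ b            # last value: beta = W* b

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 6))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
W, P, b, beta = pls1_nipals(X, y, A=2)
```

A useful sanity check is that the NIPALS weight vectors come out mutually orthonormal, a classical property of the algorithm.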

14.2 The NIPALS Algorithm for PLS2

For PLS2 ($q > 1$), an inner iteration is needed at each component to converge to the dominant covariance direction:

  1. Initialise: $\mathbf{u}_a =$ first column of $\mathbf{Y}_a$ (or any non-zero column).

  2. Outer iteration (repeat until convergence of $\mathbf{t}_a$):

    a. Compute X-weight: $\mathbf{w}_a = \mathbf{X}_a^T \mathbf{u}_a / \|\mathbf{X}_a^T \mathbf{u}_a\|$

    b. Compute X-score: $\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$

    c. Compute Y-weight: $\mathbf{c}_a = \mathbf{Y}_a^T \mathbf{t}_a / \|\mathbf{Y}_a^T \mathbf{t}_a\|$

    d. Compute Y-score: $\mathbf{u}_a = \mathbf{Y}_a \mathbf{c}_a$

    e. Check: if $\|\mathbf{t}_a^{(new)} - \mathbf{t}_a^{(old)}\| / \|\mathbf{t}_a^{(old)}\| < \epsilon$ (e.g., $\epsilon = 10^{-10}$), converged.

  3. Compute loadings: $\mathbf{p}_a = \mathbf{X}_a^T \mathbf{t}_a / (\mathbf{t}_a^T \mathbf{t}_a)$

  4. Compute inner coefficient: $b_a = \mathbf{u}_a^T \mathbf{t}_a / (\mathbf{t}_a^T \mathbf{t}_a)$

  5. Deflate both matrices: $\mathbf{X}_{a+1} = \mathbf{X}_a - \mathbf{t}_a \mathbf{p}_a^T$ and $\mathbf{Y}_{a+1} = \mathbf{Y}_a - b_a \mathbf{t}_a \mathbf{c}_a^T$

14.3 The SIMPLS Algorithm

SIMPLS (de Jong, 1993) is an alternative, non-deflation-based algorithm that directly computes the PLS weight vectors $\mathbf{w}_a^*$ without deflating $\mathbf{X}$. It is computationally more efficient and numerically more stable for large datasets.

SIMPLS directly finds $\mathbf{w}_a^*$ as the leading eigenvector of $\mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X}$ deflated by projections onto previously found weight vectors:

$$\mathbf{w}_a^* = \text{leading eigenvector of } \left[\mathbf{I} - \mathbf{W}_{a-1}^*(\mathbf{W}_{a-1}^{*T}\mathbf{W}_{a-1}^*)^{-1}\mathbf{W}_{a-1}^{*T}\right] \mathbf{X}^T \mathbf{Y} \mathbf{Y}^T \mathbf{X}$$

NIPALS vs. SIMPLS:

| Feature | NIPALS | SIMPLS |
|---|---|---|
| Handles missing data | ✅ (iterative imputation) | ❌ |
| Computational efficiency | $O(npA)$ per iteration | $O(np^2)$ once |
| Numerical stability | Good | Excellent |
| Equivalence of results | (reference) | ✅ (identical for PLS1; slightly different for PLS2) |

14.4 VIP Score Computation

$$VIP_j = \sqrt{\frac{p \sum_{a=1}^A SSY_a \, w_{ja}^{*2}}{\sum_{a=1}^A SSY_a}}$$

Where $SSY_a = \left[R^2_Y(a) - R^2_Y(a-1)\right] \cdot SS_{total}$ is the $\mathbf{Y}$ sum of squares explained by component $a$, so the denominator normalisation is $\sum_{a=1}^A SSY_a = R^2_Y(A) \cdot SS_{total}$. (Because $SSY_a$ appears in both numerator and denominator, using the $R^2_Y$ increments directly gives the same result.)
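The formula translates into a one-line NumPy computation (the hypothetical `vip_scores` helper assumes the columns of the $\mathbf{W}^*$ matrix passed in are unit-normalised, in which case $\sum_j VIP_j^2 = p$):

```python
import numpy as np

def vip_scores(W_star, ssy):
    """VIP_j = sqrt(p * sum_a SSY_a * w*_ja^2 / sum_a SSY_a)."""
    p = W_star.shape[0]
    ssy = np.asarray(ssy, dtype=float)      # explained-Y increments per component
    return np.sqrt(p * (W_star ** 2 @ ssy) / ssy.sum())

rng = np.random.default_rng(5)
W = rng.standard_normal((10, 3))
W /= np.linalg.norm(W, axis=0)              # unit-normalise each column
vip = vip_scores(W, ssy=[0.5, 0.3, 0.1])    # illustrative SSY increments
```

With unit-norm columns, the mean squared VIP is exactly 1, which is why 1.0 is the natural reference threshold.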

14.5 Confidence Intervals for PLS Coefficients (Jack-Knife)

Standard errors for PLS regression coefficients can be estimated using the jack-knife procedure:

  1. For each observation $i$, fit a PLS model with $A$ components leaving out observation $i$: $\hat{\boldsymbol{\beta}}_{PLS}^{(-i)}$.

  2. The jack-knife estimate of the coefficient vector: $\hat{\boldsymbol{\beta}}_{JK} = \frac{1}{n}\sum_{i=1}^n \hat{\boldsymbol{\beta}}_{PLS}^{(-i)}$

  3. Jack-knife standard error of coefficient $j$: $SE_{JK}(\hat{\beta}_j) = \sqrt{\frac{n-1}{n}\sum_{i=1}^n \left(\hat{\beta}_j^{(-i)} - \hat{\beta}_{JK,j}\right)^2}$

  4. Approximate $(1-\alpha) \times 100\%$ confidence interval: $\hat{\beta}_j \pm t_{\alpha/2, n-1} \cdot SE_{JK}(\hat{\beta}_j)$

A jack-knife $t$-statistic for testing $H_0: \beta_j = 0$:

$$t_j = \frac{\hat{\beta}_j}{SE_{JK}(\hat{\beta}_j)}$$

⚠️ Jack-knife standard errors for PLS coefficients are approximate. Bootstrap-based confidence intervals are more accurate but computationally more demanding.

14.6 Back-Scaling of Coefficients

When $\mathbf{X}$ and $\mathbf{y}$ are autoscaled before fitting, the PLS coefficients $\hat{\boldsymbol{\beta}}_{PLS}^{scaled}$ are on the standardised scale. Back-scaling to original units:

$$\hat{\beta}_j^{original} = \hat{\beta}_j^{scaled} \cdot \frac{s_y}{s_j}$$

Where $s_y$ is the standard deviation of $\mathbf{y}$ and $s_j$ is the standard deviation of $X_j$.

The intercept is:

$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^p \hat{\beta}_j^{original} \bar{x}_j$$
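The back-scaling arithmetic can be sketched as follows (illustrative: ordinary least squares on the autoscaled data stands in for the scaled PLS coefficients, since the back-scaling step itself is identical):

```python
import numpy as np

rng = np.random.default_rng(7)
# Predictors with very different units and locations
X = rng.standard_normal((50, 3)) * np.array([2.0, 5.0, 0.5]) + np.array([10.0, 100.0, 1.0])
y = 3.0 * X[:, 0] - 0.2 * X[:, 1] + rng.standard_normal(50)

# Autoscale X and y
x_mean, s_x = X.mean(axis=0), X.std(axis=0, ddof=1)
y_mean, s_y = y.mean(), y.std(ddof=1)
Xs = (X - x_mean) / s_x
ys = (y - y_mean) / s_y

# Stand-in for the scaled coefficients (OLS here, PLS in practice)
beta_scaled, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Back-scale slopes and recover the intercept in original units
beta_orig = beta_scaled * s_y / s_x
intercept = y_mean - beta_orig @ x_mean
pred = intercept + X @ beta_orig
```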

14.7 $Q^2$ Computation in Detail

For LOOCV with $n$ folds:

$$PRESS = \sum_{i=1}^n (y_i - \hat{y}_{i,-i})^2$$

Where $\hat{y}_{i,-i}$ is the prediction for observation $i$ from a model trained on all observations except $i$.

$$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

For $k$-fold CV with $k < n$, PRESS is computed as the sum over all $n$ held-out predictions (one per observation), where each observation is held out exactly once.


15. Worked Examples

Example 1: PLS1 Regression — Predicting Protein Content from NIR Spectra

Research Question: Can near-infrared (NIR) spectral measurements (100 wavelengths, $p = 100$) predict the protein content (%, $y$) of wheat flour samples?

Dataset: $n = 50$ wheat samples; $p = 100$ spectral absorbance variables; $y$ = protein content (%).

Step 1: Preprocessing

Apply autoscaling to $\mathbf{X}$ (spectral data; different variances at different wavelengths). Mean-centre and scale $\mathbf{y}$ ($\bar{y} = 11.84\%$, $s_y = 1.214\%$).

Step 2: Cross-Validation to Select $A$

Run PLS1 with $A = 1$ to $10$ components using 10-fold cross-validation:

| Components ($A$) | $R^2_Y$ | $Q^2$ | RMSEC | RMSECV |
|---|---|---|---|---|
| 1 | 0.612 | 0.583 | 0.731 | 0.764 |
| 2 | 0.843 | 0.812 | 0.468 | 0.511 |
| 3 | 0.923 | 0.895 | 0.329 | 0.382 |
| 4 | 0.961 | 0.934 | 0.233 | 0.302 |
| 5 | 0.972 | 0.931 | 0.197 | 0.311 |
| 6 | 0.979 | 0.924 | 0.169 | 0.328 |
| 7 | 0.983 | 0.910 | 0.152 | 0.352 |
| 8 | 0.986 | 0.896 | 0.137 | 0.380 |

$Q^2$ peaks at $A = 4$ (0.934) and decreases thereafter (despite $R^2_Y$ continuing to rise → overfitting). Select $A = 4$ components.

Step 3: Fit Final PLS Model with $A = 4$

Model summary:

| Component | $R^2_X$ (cumulative) | $R^2_Y$ (cumulative) | $Q^2$ (cumulative) |
|---|---|---|---|
| 1 | 0.423 | 0.612 | 0.583 |
| 2 | 0.617 | 0.843 | 0.812 |
| 3 | 0.748 | 0.923 | 0.895 |
| 4 | 0.814 | 0.961 | 0.934 |

Final statistics:

$$R^2_Y = 0.961, \quad Q^2 = 0.934, \quad RMSEC = 0.233\%, \quad RMSECV = 0.302\%$$

Step 4: VIP Scores

The top 5 most important variables by VIP:

| Wavelength (nm) | VIP Score | Interpretation |
|---|---|---|
| 2180 | 2.14 | Highly important (protein N-H stretch) |
| 2100 | 1.98 | Highly important |
| 1680 | 1.87 | Highly important |
| 2240 | 1.76 | Important |
| 1940 | 1.62 | Important |

Variables with VIP < 0.8: 42 out of 100 wavelengths are candidates for removal.

Step 5: Prediction for New Sample

New wheat sample: spectral vector $\mathbf{x}_{new}$ (autoscaled).

$$\hat{t}_{a} = \tilde{\mathbf{x}}_{new}^T \mathbf{w}_a^*, \quad a = 1, \dots, 4$$

$$\hat{\tilde{y}}_{new} = \sum_{a=1}^4 b_a \hat{t}_a = 0.247 \times 0.812 + (-0.183) \times (-0.441) + \dots = 0.384$$

Back-scale: $\hat{y}_{new} = \bar{y} + \hat{\tilde{y}}_{new} \times s_y = 11.84 + 0.384 \times 1.214 = 11.84 + 0.466 = 12.31\%$

95% prediction interval (jack-knife SE = 0.28%):

$$\hat{y}_{new} \pm 1.96 \times 0.28 = 12.31 \pm 0.55 = [11.76\%, 12.86\%]$$

Conclusion: The 4-component PLS model achieves excellent predictive performance ($Q^2 = 0.934$, RMSECV = 0.302%). The model is not overfitting (ratio $Q^2/R^2_Y = 0.934/0.961 = 0.972 > 0.5$). NIR wavelengths around 2180 nm and 2100 nm (protein N-H stretching bands) are the most important predictors.


Example 2: PLS1 Regression — Predicting Blood Pressure from Clinical Variables

Research Question: Can clinical variables (age, BMI, cholesterol, glucose, smoking status, exercise frequency) predict systolic blood pressure (SBP)?

Dataset: $n = 120$ patients; $p = 6$ predictors; $y$ = SBP (mmHg).

Predictors: Age (years), BMI (kg/m²), Total Cholesterol (mmol/L), Fasting Glucose (mmol/L), Smoking (0/1), Exercise (days/week).

Step 1: Preprocessing

Apply autoscaling to all 6 predictors (different units) and mean-centre $\mathbf{y}$ ($\bar{y} = 128.4$ mmHg).

Step 2: CV to Select $A$

| Components | $R^2_Y$ | $Q^2$ | RMSECV |
|---|---|---|---|
| 1 | 0.542 | 0.519 | 8.12 |
| 2 | 0.631 | 0.601 | 7.41 |
| 3 | 0.649 | 0.589 | 7.58 |
| 4 | 0.655 | 0.571 | 7.82 |

Select $A = 2$ (maximum $Q^2 = 0.601$, RMSECV = 7.41 mmHg).

Step 3: Regression Coefficients (back-scaled to original units)

| Predictor | $\hat{\beta}_j$ (mmHg / unit) | SE (jack-knife) | $t$-statistic | Significant? |
|---|---|---|---|---|
| Age (per year) | 0.482 | 0.091 | 5.30 | $p < 0.001$ |
| BMI (per kg/m²) | 1.241 | 0.213 | 5.83 | $p < 0.001$ |
| Cholesterol (per mmol/L) | 0.837 | 0.281 | 2.98 | $p = 0.003$ |
| Glucose (per mmol/L) | 0.614 | 0.244 | 2.51 | $p = 0.013$ |
| Smoking | 3.921 | 1.482 | 2.65 | $p = 0.009$ |
| Exercise (per day/week) | -1.183 | 0.392 | -3.02 | $p = 0.003$ |

Step 4: VIP Scores

| Predictor | VIP Score | Importance |
|---|---|---|
| BMI | 1.48 | High |
| Age | 1.32 | High |
| Smoking | 1.21 | High |
| Exercise | 1.13 | High |
| Cholesterol | 0.92 | Moderate |
| Glucose | 0.74 | Low (VIP < 0.8) |

Step 5: Prediction

For a new patient: Age = 55, BMI = 28.4, Cholesterol = 5.2, Glucose = 5.8, Smoking = 1, Exercise = 2:

$$\hat{y}_{new} = 128.4 + 0.482(55-\bar{x}_{Age}) + 1.241(28.4 - \bar{x}_{BMI}) + \dots = 143.7 \text{ mmHg}$$

Conclusion: The 2-component PLS model explains 63.1% of SBP variance ($R^2_Y = 0.631$) with reasonable cross-validated performance ($Q^2 = 0.601$, RMSECV = 7.4 mmHg). BMI, age, and smoking are the strongest predictors. Glucose has a VIP < 0.8, suggesting limited predictive contribution in this dataset.


Example 3: PLS2 Regression — Predicting Multiple Sensory Attributes from Chemical Composition

Research Question: Can the chemical composition of wine (8 chemical variables) jointly predict 3 sensory attributes (acidity rating, bitterness rating, overall quality score)?

Dataset: $n = 80$ wines; $p = 8$ chemical predictors (pH, alcohol %, residual sugar, sulphates, fixed acidity, volatile acidity, citric acid, density); $q = 3$ responses.

Step 1: Preprocessing

Autoscale all $\mathbf{X}$ variables. Mean-centre and scale all $\mathbf{Y}$ variables (different scales).

Step 2: CV to Select $A$

| Components | $R^2_X$ | $R^2_Y$ (avg) | $Q^2$ (avg) |
|---|---|---|---|
| 1 | 0.381 | 0.443 | 0.412 |
| 2 | 0.544 | 0.651 | 0.597 |
| 3 | 0.634 | 0.712 | 0.584 |

Select $A = 2$ ($Q^2$ peaks at 0.597).

Step 3: Component Summary

| Component | $R^2_X$ cumul. | $R^2_{Y,Acidity}$ cumul. | $R^2_{Y,Bitterness}$ cumul. | $R^2_{Y,Quality}$ cumul. |
|---|---|---|---|---|
| 1 | 0.381 | 0.521 | 0.409 | 0.398 |
| 2 | 0.544 | 0.698 | 0.612 | 0.643 |

Step 4: Y-Loadings ($\mathbf{Q}$)

| Response | $q_1$ | $q_2$ |
|---|---|---|
| Acidity | 0.611 | -0.392 |
| Bitterness | 0.524 | 0.481 |
| Quality | 0.593 | 0.144 |

Component 1 positively loads on all three responses (general quality/intensity factor). Component 2 contrasts bitterness (positive) against acidity (negative) — a sensory contrast axis.

Step 5: Top VIP Scores (averaged across responses)

| Chemical Variable | VIP Score |
|---|---|
| Volatile Acidity | 1.63 |
| Alcohol % | 1.41 |
| pH | 1.28 |
| Sulphates | 1.17 |
| Residual Sugar | 0.94 |
| Fixed Acidity | 0.86 |
| Citric Acid | 0.72 |
| Density | 0.68 |

Citric acid and density have VIP < 0.8 — candidates for removal in a reduced model.

Conclusion: The 2-component PLS2 model jointly predicts all three sensory attributes with moderate-to-good accuracy ($Q^2_{avg} = 0.597$). Volatile acidity and alcohol content are the most influential predictors. The two PLS components reveal a general quality factor and a bitterness-versus-acidity contrast factor in the sensory space.


16. Common Mistakes and How to Avoid Them

Mistake 1: Skipping Preprocessing

Problem: Applying PLS to unscaled data where variables differ widely in units and magnitude. Variables with larger numerical ranges (e.g., income in thousands vs. age in decades) dominate the components, producing misleading results.
Solution: Always mean-centre the data. Apply autoscaling (UV scaling) when variables are in different units. Carefully consider the appropriate scaling for your specific domain and data type.

Mistake 2: Selecting Too Many Components

Problem: Using the number of components that maximises $R^2_Y$ (training set fit) rather than $Q^2$ (cross-validated fit), resulting in an overfitted model that performs well on training data but poorly on new observations.
Solution: Always use cross-validation ($Q^2$, RMSECV) to select $A$. Look for the point where $Q^2$ peaks or where $R^2_Y - Q^2$ begins to widen. Apply the one-standard-error rule for extra parsimony.

Mistake 3: Ignoring Model Validation

Problem: Reporting only training set statistics ($R^2_Y$, RMSEC) without cross-validation or external validation, giving an overly optimistic picture of model performance.
Solution: Always report $Q^2$ and RMSECV. Whenever possible, reserve an independent external test set and report RMSEP. Run permutation tests to confirm the model is not a statistical artefact.

Mistake 4: Confusing W-Weights with P-Loadings

Problem: Using the X-loadings $\mathbf{P}$ (or raw weights $\mathbf{W}$) to interpret the relationship between $\mathbf{X}$ variables and the model, rather than the modified weights $\mathbf{W}^*$.
Solution: For variable importance interpretation, use $\mathbf{W}^*$ (the modified weights) or VIP scores. Loadings $\mathbf{P}$ describe how $\mathbf{X}$ is reconstructed from the scores; modified weights $\mathbf{W}^*$ describe how $\mathbf{X}$ variables linearly combine to form the scores from the original $\mathbf{X}$.

Mistake 5: Using VIP Threshold Rigidly

Problem: Mechanically removing all variables with VIP < 0.8 and accepting all variables with VIP > 1.0, without considering domain knowledge, model stability, or the effect of variable removal on $Q^2$.
Solution: Treat VIP as a guide, not a hard rule. After removing low-VIP variables, refit the model and check whether $Q^2$ improves or remains stable. Incorporate domain knowledge about which variables are mechanistically meaningful.

Mistake 6: Applying PLS to a Completely Heterogeneous Dataset

Problem: Fitting a single global PLS model to data comprising fundamentally different subgroups (e.g., different product types, different analytical conditions), producing a model that fits no subgroup well.
Solution: Inspect score plots for clustering. If distinct subgroups are visible, consider fitting separate PLS models per subgroup, or use class-based modelling approaches such as PLS-DA to first classify then model within class.

Mistake 7: Not Detecting Outliers Before Modelling

Problem: Leaving extreme outliers in the dataset, which disproportionately influence the PLS components and distort the model for the remaining, majority observations.
Solution: Check univariate distributions, Mahalanobis distances, and initial PCA scores before PLS. After fitting, use the Williams plot (leverage vs. residuals), $T^2$, and SPE plots to identify influential observations. Investigate outliers; do not simply delete them without justification.

Mistake 8: Extrapolating Beyond the Calibration Range

Problem: Using the PLS model to predict samples that fall outside the range of the calibration set (extrapolation), where the linear relationship may not hold and the model has no basis for reliable prediction.
Solution: Check new samples for consistency with the calibration set using $T^2$ and SPE control charts. If a new sample falls outside the 95% control limits, flag the prediction as unreliable. Expand the calibration set to cover the full expected range of future samples.

Mistake 9: Misinterpreting RY2R^2_Y Without Context

Problem: Reporting a high $R^2_Y$ (e.g., 0.98) as evidence of an excellent model, without noting that this is the training set fit and may reflect overfitting.
Solution: Always pair $R^2_Y$ with $Q^2$. A model with $R^2_Y = 0.98$ and $Q^2 = 0.42$ is severely overfitting. The $Q^2$ value is the meaningful indicator of predictive performance.

Mistake 10: Applying PLS Regression to a Classification Problem Without PLS-DA

Problem: Using PLS1 to predict a binary class label (0/1) without proper classification thresholding or performance assessment with classification metrics (sensitivity, specificity, AUC).
Solution: For categorical outcomes, use PLS-DA with appropriate class encoding. Evaluate using classification metrics (confusion matrix, sensitivity, specificity, AUC-ROC) with cross-validated class assignments, not just RMSECV.


17. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| $Q^2 < 0$ for all components | No predictive relationship; noisy data; too few samples | Check data quality; verify $\mathbf{X}$ and $\mathbf{y}$ are correctly entered; increase $n$; reconsider predictor selection |
| $R^2_Y$ is high but $Q^2$ is very low (large gap) | Severe overfitting due to too many components or $p \gg n$ | Reduce $A$; use stricter CV; consider sparse PLS or variable pre-selection |
| NIPALS fails to converge | Near-zero variance columns; perfect collinearity; numerical issues | Remove zero-variance variables before fitting; check for duplicate columns; increase max iterations |
| Score plot shows extreme outlier separated from main cluster | Outlier with unusual $\mathbf{X}$ or $y$ value; data entry error | Investigate observation; check for data entry errors; assess leverage and SPE |
| All VIP scores are approximately 1 | Only one component extracted ($A=1$); VIP is uniform when $A=1$ | Increase $A$ if justified by CV; interpret $\mathbf{w}_1^*$ coefficients directly for $A=1$ |
| Permutation test shows $Q^2$ of permuted models as high as observed | No real relationship between $\mathbf{X}$ and $\mathbf{y}$; chance correlation | Do not use the model; reconsider variable selection; collect more data; verify correct $\mathbf{y}$ assignment |
| Jack-knife SEs are very large | Too few samples relative to components ($n \approx A$) | Reduce $A$; collect more samples; do not report jack-knife inference for very small $n$ |
| Predicted vs. observed plot shows systematic curvature | Non-linear relationship between $\mathbf{X}$ and $\mathbf{y}$ | Apply polynomial or logarithmic transformation to $\mathbf{y}$; use kernel PLS; consider non-linear models |
| RMSECV does not decrease with more components | No additional predictive structure beyond first component | Accept a 1-component model; data may be well-described by a single latent variable |
| New sample has very high SPE | New sample's $\mathbf{X}$ pattern does not match calibration set structure | Flag prediction as unreliable; expand calibration set to include similar samples |
| Negative $Q^2$ for external test set despite positive cross-validated $Q^2$ | Test set is not representative of training set (distribution shift) | Re-examine train/test split; use Kennard-Stone or DUPLEX for representative splitting; collect more diverse calibration samples |
| PLS2 gives worse predictions than separate PLS1 models | Responses are poorly correlated with each other | Run separate PLS1 models for each response; or use OPLS for better separation of effects |

18. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
|---|---|
| $\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}$ | PLS decomposition of $\mathbf{X}$ |
| $\mathbf{Y} = \mathbf{T}\mathbf{Q}^T + \mathbf{F}$ | PLS prediction of $\mathbf{Y}$ |
| $\mathbf{t}_a = \mathbf{X}_a \mathbf{w}_a$ | X-score for component $a$ |
| $\mathbf{p}_a = \mathbf{X}_a^T \mathbf{t}_a / (\mathbf{t}_a^T \mathbf{t}_a)$ | X-loading for component $a$ |
| $\mathbf{W}^* = \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1}$ | Modified X-weight matrix |
| $\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{W}^*\mathbf{q}$ | PLS regression coefficients |
| $\hat{\mathbf{Y}} = \mathbf{X}\mathbf{W}^*\mathbf{Q}^T$ | Predicted values |
| $R^2_Y = 1 - RSS/SS_{total}$ | Training set variance explained |
| $Q^2 = 1 - PRESS/SS_{total}$ | Cross-validated variance explained |
| $RMSECV = \sqrt{PRESS/n}$ | Cross-validated RMSE |
| $RMSEP = \sqrt{\sum e_i^2 / n_{test}}$ | External test set RMSE |
| $VIP_j = \sqrt{p\sum_a SSY_a w_{ja}^{*2} / \sum_a SSY_a}$ | Variable importance in projection |
| $T^2_i = n\sum_a t_{ia}^2/(\mathbf{t}_a^T\mathbf{t}_a)$ | Hotelling's $T^2$ |
| $SPE_i = \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2$ | Squared prediction error |

Preprocessing Guide

| Situation | Recommended Scaling |
|---|---|
| Variables in different units | Autoscaling (UV) |
| Variables in same units; variance meaningful | Mean centring only |
| High-variance variables dominate | Pareto scaling |
| Right-skewed, multiplicative data | Log transform first, then mean centre |
| Near-zero variance variables present | Remove before scaling |

Component Selection Guide

| Evidence | Action |
|---|---|
| $Q^2$ increasing with each component | Add another component |
| $Q^2$ has peaked and starts decreasing | Stop adding components |
| $R^2_Y - Q^2$ gap widening | Overfitting; reduce $A$ |
| $Q^2 < 0$ | No predictive signal; check data |
| Permuted $Q^2 \approx$ observed $Q^2$ | Model is a statistical artefact |
| $Q^2/R^2_Y < 0.5$ | Overfitting; reduce $A$ |

VIP Score Interpretation

| VIP Score | Variable Importance |
|---|---|
| $VIP > 1.0$ | High importance |
| $0.8 \leq VIP \leq 1.0$ | Moderate importance |
| $VIP < 0.8$ | Low importance; candidate for removal |

Model Evaluation Hierarchy

| Metric | Type | Bias | Recommended Use |
|---|---|---|---|
| $R^2_Y$ / RMSEC | Training set | Optimistic | Report but do not use alone |
| $Q^2$ / RMSECV | Cross-validated | Slight | Primary model selection criterion |
| RMSEP | External test | Unbiased | Best estimate of true prediction error |

Outlier Detection Summary

| Statistic | What It Detects | Threshold |
|---|---|---|
| Hotelling's $T^2$ | Unusual within the model space | $F_{\alpha, A, n-A}$ critical value |
| SPE (DModX) | Does not fit the model structure | $\chi^2_{\alpha, p-A}$ (approximate) |
| Leverage $h_i$ | Influence on model coefficients | $2A/n$ (rough guideline) |
| Standardised residual $e_i^*$ | Poor fit in $\mathbf{Y}$ | $|e_i^*| > 2$ or $3$ |

PLS Model Type Selection

| Scenario | PLS Variant |
|---|---|
| One continuous response | PLS1 |
| Multiple continuous responses | PLS2 |
| Binary or multi-class outcome | PLS-DA |
| Improve interpretability (single response) | OPLS |
| Non-linear relationships | Kernel PLS |
| Automatic variable selection | Sparse PLS |

Comparison of Regression Methods

| Feature | OLS | Ridge | PCR | PLS |
|---|---|---|---|---|
| $p > n$ | ❌ | ✅ | ✅ | ✅ |
| Handles collinearity | ❌ | ✅ | ✅ | ✅ |
| Uses $\mathbf{Y}$ in reduction | N/A | N/A | ❌ | ✅ |
| Interpretable components | N/A | N/A | ✅ | ✅ |
| Multiple responses | ❌ | ❌ | Partial | ✅ |
| Variable selection | ❌ | ❌ | ❌ | Via VIP |
| Exact inference (p-values) | ✅ | ❌ | ❌ | Approx. (jack-knife) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting PLS Regression using the DataStatPro application. For further reading, consult Wold, Sjöström & Eriksson's "PLS-regression: a basic tool of chemometrics" (Chemometrics and Intelligent Laboratory Systems, 2001), Mevik & Wehrens's "The pls Package: Principal Component and Partial Least Squares Regression in R" (Journal of Statistical Software, 2007), or Höskuldsson's "PLS regression methods" (Journal of Chemometrics, 1988). For feature requests or support, contact the DataStatPro team.