PLS Regression: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Partial Least Squares (PLS) Regression all the way through advanced component interpretation, model validation, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.
Table of Contents
- Prerequisites and Background Concepts
- What is PLS Regression?
- The Mathematics Behind PLS Regression
- Types of PLS Methods
- Assumptions of PLS Regression
- Data Preprocessing
- PLS Components and Latent Variables
- Choosing the Number of Components
- Model Fit and Evaluation
- Interpretation of PLS Results
- Validation Methods
- Comparison with Related Methods
- Using the PLS Regression Component
- Computational and Formula Details
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into PLS regression, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.
1.1 Vectors and Matrices
A vector is an ordered list of numbers (a one-dimensional array). A matrix is a two-dimensional array of numbers with rows and columns, denoted $X \in \mathbb{R}^{n \times p}$ ($n$ rows, $p$ columns).
Key matrix operations used throughout this tutorial:
- Transpose: $X^T$ swaps the rows and columns of $X$.
- Matrix multiplication: $(AB)_{ij} = \sum_k A_{ik} B_{kj}$.
- Matrix inverse: $X^{-1}$ exists only if $X$ is square and non-singular.
- Dot product: $a^T b = \sum_i a_i b_i$ (a scalar).
- Norm: $\|a\| = \sqrt{a^T a}$ (the length of a vector).
1.2 Projection and Orthogonality
The projection of vector $a$ onto vector $b$ is:

$$\mathrm{proj}_b(a) = \frac{a^T b}{b^T b}\, b$$

Two vectors are orthogonal if their dot product equals zero: $a^T b = 0$. Orthogonality is a central property of PLS components.
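The projection formula and the orthogonality of the residual can be checked numerically. A minimal numpy sketch (the vectors here are arbitrary illustration values):

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([3.0, 0.0])

# Projection of a onto b: proj_b(a) = (a.b / b.b) * b
proj = (a @ b) / (b @ b) * b

# The residual a - proj is orthogonal to b: its dot product with b is zero
residual = a - proj
print(proj)          # [2. 0.]
print(residual @ b)  # 0.0
```

The same decomposition into a projection plus an orthogonal residual is exactly what PLS deflation performs, one component at a time.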
1.3 Variance and Covariance
The variance of a variable $x$ measures its spread:

$$\mathrm{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The covariance between variables $x$ and $y$ measures how they vary together:

$$\mathrm{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

The covariance matrix of a matrix $X$ contains pairwise covariances of all variables. After mean-centring $X$:

$$S = \frac{1}{n-1} X^T X$$
1.4 Correlation and Multicollinearity
Correlation is the standardised covariance:

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$
Multicollinearity occurs when predictor variables are highly correlated with each other. It causes serious problems in ordinary least squares (OLS) regression:
- Standard errors of coefficients become very large.
- Coefficient estimates become unstable and unreliable.
- The matrix $X^T X$ becomes nearly singular (non-invertible).
PLS regression was specifically designed to handle multicollinearity.
1.5 Eigenvalues and Eigenvectors
For a square matrix $A$, an eigenvector $v$ and its associated eigenvalue $\lambda$ satisfy:

$$A v = \lambda v$$
Eigenvalues indicate the amount of variance explained by each eigenvector direction. They are the foundation of Principal Component Analysis (PCA) and are closely related to PLS decompositions.
1.6 Ordinary Least Squares (OLS) Regression
OLS regression models the response $y$ as a linear function of predictors $X$:

$$y = X\beta + \varepsilon$$

The OLS estimator minimises the sum of squared residuals:

$$\hat{\beta}_{\mathrm{OLS}} = (X^T X)^{-1} X^T y$$

This requires $X^T X$ to be invertible — which fails when:
- $p > n$ (more predictors than observations).
- Predictors are highly collinear.
PLS provides a solution to both of these failure modes.
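Both failure modes are easy to demonstrate with numpy on synthetic data (an illustration only — the random matrices here stand in for any real design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Failure mode 1, p > n: 5 observations, 10 predictors.
# X'X is 10x10 but can have rank at most 5, so it is singular.
X = rng.normal(size=(5, 10))
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)  # 5

# Failure mode 2, perfect collinearity: duplicate a column.
X2 = rng.normal(size=(20, 3))
X2 = np.column_stack([X2, X2[:, 0]])  # 4th column == 1st column
rank2 = np.linalg.matrix_rank(X2.T @ X2)
print(rank2)  # 3 — one column is redundant, so (X'X)^-1 does not exist
```

In both cases `np.linalg.inv(X.T @ X)` would fail or return numerically meaningless values, which is precisely the situation PLS is designed for.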
2. What is PLS Regression?
Partial Least Squares (PLS) Regression is a multivariate statistical method that models the relationship between a matrix of predictor variables $X$ and one or more response variables $Y$ by extracting a small number of latent components (factors) that simultaneously:
- Explain as much variance in $X$ as possible.
- Have maximum covariance (predictive power) with $Y$.
Unlike OLS regression, PLS does not require $X^T X$ to be invertible and performs well even when predictors far outnumber observations or are highly collinear.
2.1 The Core Idea
The name "Partial Least Squares" reflects the method's origins:
- Partial: Each step involves projecting (regressing) the data onto a component that captures part of the variation in $X$ and $Y$.
- Least Squares: At each step, the component is found to minimise a least-squares criterion.
In essence, PLS finds a compressed, low-dimensional representation of $X$ (the latent components or scores) that is maximally informative about $Y$. This is what distinguishes PLS from Principal Component Regression (PCR), which only maximises variance in $X$ without considering $Y$.
2.2 Real-World Applications
PLS regression is the workhorse of many applied sciences. Common applications include:
- Chemometrics: Predicting chemical concentrations (e.g., protein content, moisture, pH) from near-infrared (NIR) or Raman spectroscopy data — the quintessential PLS application with hundreds or thousands of correlated spectral wavelengths as predictors.
- Genomics & Bioinformatics: Relating gene expression profiles (tens of thousands of genes) to clinical outcomes, phenotypes, or disease states.
- Food Science: Predicting sensory quality attributes (taste, texture) from instrumental measurements.
- Pharmaceutical Sciences: Relating molecular descriptors (QSAR/QSPR) to drug activity, toxicity, or physicochemical properties.
- Process Monitoring & Control: Modelling industrial process variables to predict product quality and detect deviations.
- Neuroscience: Relating brain imaging data (fMRI voxels) to behavioural or cognitive measures.
- Social & Behavioural Sciences: Predicting complex outcomes (job performance, wellbeing) from many correlated survey items.
- Environmental Science: Relating environmental variables (soil, water, climate) to ecological outcomes.
2.3 When to Use PLS Regression
PLS regression is particularly appropriate when:
| Situation | Reason PLS is Preferred |
|---|---|
| $p \gg n$ (many more predictors than observations) | OLS is undefined; PLS works with far fewer components than predictors |
| High multicollinearity among predictors | OLS is unstable; PLS constructs orthogonal components |
| Noisy predictor variables (measurement error) | PLS filters noise by focusing on components relevant to $Y$ |
| Multiple correlated response variables | PLS-2 handles multivariate responses simultaneously |
| Interpretable latent structure is desired | PLS components have clear loading and score interpretations |
| Prediction accuracy is the primary goal | PLS minimises prediction error via cross-validation component selection |
2.4 PLS vs. Related Methods: An Overview
| Feature | OLS | PCR | Ridge | PLS |
|---|---|---|---|---|
| Handles $p > n$ | ❌ | ✅ | ✅ | ✅ |
| Handles multicollinearity | ❌ | ✅ | ✅ | ✅ |
| Uses $Y$ to construct components | ❌ | ❌ | ❌ | ✅ |
| Produces interpretable components | N/A | ✅ | ❌ | ✅ |
| Handles multiple responses | ✅ | Partially | ✅ | ✅ |
| Requires matrix inversion | ✅ | Partially | ✅ | ❌ |
3. The Mathematics Behind PLS Regression
3.1 The PLS Model Structure
PLS simultaneously decomposes both $X$ ($n \times p$) and $Y$ ($n \times m$) into score matrices and loading matrices:

$$X = T P^T + E$$
$$Y = U Q^T + F$$

Where:
- $T$ ($n \times A$) = X-scores matrix (latent components from $X$).
- $P$ ($p \times A$) = X-loadings matrix (how $X$ variables relate to the components).
- $U$ ($n \times A$) = Y-scores matrix (latent components from $Y$).
- $Q$ ($m \times A$) = Y-loadings matrix (how $Y$ variables relate to the components).
- $E$ ($n \times p$) = X-residual matrix.
- $F$ ($n \times m$) = Y-residual matrix.
- $A$ = number of retained latent components.
The relationship between the X-scores and Y-scores is modelled as:

$$U = T B + H$$

Where $B$ is a diagonal matrix of inner relation coefficients and $H$ is residual.
3.2 The Objective: Maximum Covariance
The core objective of PLS is to find weight vectors $w$ (for $X$) and $c$ (for $Y$) such that the covariance between the resulting scores is maximised:

$$\max_{w,\,c}\ \mathrm{Cov}(Xw, Yc) \propto w^T X^T Y c$$

Subject to the normalisation constraints: $\|w\| = 1$ and $\|c\| = 1$.
This is equivalent to finding the first singular value decomposition (SVD) of the cross-product matrix $X^T Y$:

$$X^T Y = W \Delta C^T$$

Where $\Delta$ is diagonal with singular values (covariances), $W$ contains X-weight vectors, and $C$ contains Y-weight vectors.
3.3 Score and Loading Computation
For each component $a = 1, \ldots, A$:

X-scores (latent variable scores):

$$t_a = X_{a-1} w_a$$

Where $X_{a-1}$ is the deflated (residualised) $X$ matrix at step $a$, and $w_a$ is the X-weight vector for component $a$.

Y-scores:

$$u_a = \frac{Y_{a-1} q_a}{q_a^T q_a}$$

X-loadings (regression of $X_{a-1}$ on $t_a$):

$$p_a = \frac{X_{a-1}^T t_a}{t_a^T t_a}$$

Y-loadings (regression of $Y_{a-1}$ on $t_a$):

$$q_a = \frac{Y_{a-1}^T t_a}{t_a^T t_a}$$

Inner relation coefficient:

$$b_a = \frac{u_a^T t_a}{t_a^T t_a}$$
3.4 Deflation
After extracting each component, the data matrices are deflated (the component's contribution is removed) to ensure that subsequent components capture new, orthogonal information:

For PLS1 (single response):

$$X_a = X_{a-1} - t_a p_a^T, \qquad y_a = y_{a-1} - q_a t_a$$

For PLS2 (multiple responses):

$$X_a = X_{a-1} - t_a p_a^T, \qquad Y_a = Y_{a-1} - t_a q_a^T$$

This sequential deflation ensures the X-scores are mutually orthogonal: $t_a^T t_b = 0$ for $a \neq b$.
3.5 The PLS Regression Coefficients
After extracting $A$ components, the final PLS regression coefficients relating $X$ directly to $Y$ are:

$$\hat{B}_{\mathrm{PLS}} = W^* Q^T$$

Where $W^*$ ($p \times A$) is the matrix of modified weight vectors (also called the W-star matrix), which accounts for the sequential deflation:

$$W^* = W (P^T W)^{-1}$$

The predicted values are then:

$$\hat{Y} = X \hat{B}_{\mathrm{PLS}} = T Q^T$$

Where $T = X W^*$ are the X-scores computed directly from $X$ without deflation.

For a single response variable $y$:

$$\hat{\beta}_{\mathrm{PLS}} = W (P^T W)^{-1} q$$

The predicted value for a new observation $x_{\mathrm{new}}$ (after applying the calibration centring/scaling):

$$\hat{y}_{\mathrm{new}} = x_{\mathrm{new}}^T \hat{\beta}_{\mathrm{PLS}}$$
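The score/loading recursion, deflation, and coefficient formula of Sections 3.3–3.5 fit in a short function. Below is a minimal numpy sketch of NIPALS for PLS1 on synthetic data (an illustration, not DataStatPro's internal implementation); with $A = p$ and a full-rank $X$, the PLS coefficients coincide with the OLS solution, which gives a convenient correctness check:

```python
import numpy as np

def pls1_nipals(X, y, A):
    """Minimal PLS1 via NIPALS. X (n x p) and y (n,) must be mean-centred."""
    n, p = X.shape
    W = np.zeros((p, A)); P = np.zeros((p, A)); q = np.zeros(A)
    Xa, ya = X.copy(), y.copy()
    for a in range(A):
        w = Xa.T @ ya
        w /= np.linalg.norm(w)       # X-weight, unit length
        t = Xa @ w                   # X-score
        tt = t @ t
        p_a = Xa.T @ t / tt          # X-loading
        q_a = (ya @ t) / tt          # Y-loading (scalar for PLS1)
        Xa = Xa - np.outer(t, p_a)   # deflate X
        ya = ya - q_a * t            # deflate y
        W[:, a], P[:, a], q[a] = w, p_a, q_a
    # Coefficients relating the original X directly to y: W (P'W)^-1 q
    return W @ np.linalg.solve(P.T @ W, q)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
true_beta = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_beta + 0.01 * rng.normal(size=40)

Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta = pls1_nipals(Xc, yc, A=5)
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
print(np.allclose(beta, beta_ols, atol=1e-6))  # with A = p, PLS equals OLS
```

In practice one would of course retain far fewer than $p$ components; the point of the check is only that the algebra above is implemented consistently.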
3.6 Relationship Between PLS and SVD
The core step of PLS — finding weight vectors that maximise the covariance between $Xw$ and $Yc$ — reduces to finding the leading left and right singular vectors of $X^T Y$:

$$X^T Y = W \Delta C^T$$

The maximum covariance is proportional to the first singular value of $X^T Y$, and $w_1$, $c_1$ are the corresponding left and right singular vectors.
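This equivalence is directly verifiable: taking the leading singular vectors of $X^T Y$ and plugging them into $w^T X^T Y c$ recovers the first singular value. A small numpy sketch on synthetic centred data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 50, 6, 2
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, m)) + 0.1 * rng.normal(size=(n, m))

# Mean-centre both blocks
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# First left/right singular vectors of X'Y give the first PLS weight vectors
U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
w, c = U[:, 0], Vt[0, :]

# The achieved objective w' X'Y c equals the first singular value
value = w @ (X.T @ Y) @ c
print(np.isclose(value, s[0]))  # True
```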
4. Types of PLS Methods
Several algorithmic variants of PLS exist. The most important ones are described below.
4.1 PLS1
PLS1 handles a single continuous response variable ($m = 1$). It is the most common form of PLS regression in practice.
- Input: $X$ ($n \times p$) and $y$ ($n \times 1$).
- Output: $A$ latent components, regression coefficients $\hat{\beta}$ ($p \times 1$).
- The NIPALS algorithm (see Section 14) is used for computation.
4.2 PLS2
PLS2 handles multiple response variables simultaneously ($m > 1$). It extracts components that explain $X$ variance while maximising covariance with the entire $Y$ matrix.
- Input: $X$ ($n \times p$) and $Y$ ($n \times m$).
- Output: $A$ latent components, regression coefficient matrix $\hat{B}$ ($p \times m$).
- A single set of X-components is found for all responses simultaneously.
💡 PLS2 is more parsimonious than running separate PLS1 models for each response, but if the responses are very different in nature, separate PLS1 models may give better individual predictions.
4.3 PLS-DA (PLS Discriminant Analysis)
PLS-DA applies PLS regression to a categorical response variable by encoding the class membership as a binary (or dummy-coded) matrix. It is widely used for classification in chemometrics, metabolomics, and genomics.
- For a two-class problem: Encode $y \in \{0, 1\}$ and apply PLS1. Classify using a threshold on $\hat{y}$ (e.g., 0.5).
- For a multi-class problem ($g$ classes): Encode $Y$ as a $g$-column binary matrix (one-hot) and apply PLS2. Assign each observation to the class with the highest predicted value.
⚠️ PLS-DA can overfit when many components are used, especially with small or imbalanced datasets. Cross-validation is essential for assessing classification performance.
4.4 OPLS (Orthogonal PLS)
OPLS (Trygg & Wold) separates the $X$ variation into:
- Predictive variation: Correlated with $Y$ (the PLS component).
- Orthogonal variation: Uncorrelated with $Y$ (structured noise in $X$).
This produces a model with a single predictive component (for PLS1) plus orthogonal components that account for systematic $X$ variation not related to $Y$. OPLS produces simpler, more interpretable loading plots.
4.5 Kernel PLS
Kernel PLS extends PLS to handle non-linear relationships between $X$ and $Y$ by implicitly mapping the data into a high-dimensional feature space using a kernel function $k(x_i, x_j)$.
Common kernels:
- Linear: $k(x_i, x_j) = x_i^T x_j$
- Polynomial: $k(x_i, x_j) = (x_i^T x_j + c)^d$
- Radial Basis Function (RBF): $k(x_i, x_j) = \exp\!\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
4.6 Sparse PLS
Sparse PLS introduces variable selection by imposing $\ell_1$ (Lasso-type) penalties on the weight vectors, driving some weights to exactly zero. This simultaneously performs dimensionality reduction and variable selection, producing more interpretable models in high-dimensional settings ($p \gg n$).
4.7 Summary of PLS Variants
| Variant | Response Type | Key Feature | Typical Use |
|---|---|---|---|
| PLS1 | Single continuous | Standard PLS | Most regression problems |
| PLS2 | Multiple continuous | Joint modelling of all responses | Multivariate responses |
| PLS-DA | Categorical (classes) | Classification via dummy encoding | Chemometrics, omics |
| OPLS | Single/Multiple continuous | Separates predictive from orthogonal variation | Improved interpretation |
| Kernel PLS | Single/Multiple | Non-linear extension via kernels | Non-linear relationships |
| Sparse PLS | Single/Multiple | Variable selection via penalty | High-dimensional data |
The DataStatPro application implements PLS1 and PLS2 (standard PLS regression) and PLS-DA, which are the focus of this tutorial.
5. Assumptions of PLS Regression
PLS regression is a relatively assumption-light method compared to OLS. However, certain conditions should be met for valid results.
5.1 Linearity
PLS assumes a linear relationship between the latent components (scores $T$) and the response $Y$. Non-linear relationships between the original $X$ variables and $Y$ may be partially captured if the non-linearity is reflected in the latent structure, but Kernel PLS is preferred for strongly non-linear data.
5.2 Continuous (or Appropriately Encoded) Variables
- Predictor variables ($X$): Should be continuous or ordinal. Binary variables can be included but must be mean-centred and scaled.
- Response variable ($Y$): Should be continuous for PLS1/PLS2. For categorical responses, use PLS-DA with appropriate dummy coding.
5.3 No Requirement for Multivariate Normality
Unlike some classical multivariate methods, PLS does not require the predictors or response to follow a multivariate normal distribution. This makes PLS robust to skewed or non-normal variables (though severe non-normality may still affect inference).
5.4 No Perfect Redundancy (Degenerate Cases)
If a predictor variable is a perfect linear combination of other predictor variables (i.e., the column is a deterministic function of others), it carries no additional information and should be removed before fitting PLS. Similarly, a response variable that is a perfect linear combination of $X$ columns will cause a degenerate model.
5.5 Sufficient Sample Size
While PLS handles $p > n$ settings, model quality improves with more observations. General guidelines:
- Minimum: $n > A$, where $A$ is the number of components (otherwise the model is severely overfitted).
- Preferred: substantially more observations than components for reliable cross-validated results.
- For PLS-DA: At least 5–10 observations per class per component.
5.6 Representativeness of Calibration Set
The calibration (training) samples should span the full range of variation expected in future prediction samples. PLS is an interpolation method — predictions for samples outside the calibration space (extrapolation) are unreliable.
5.7 No Gross Outliers
Extreme outliers in or can disproportionately influence the extracted components and distort the model. Outliers should be detected (using score plots and leverage/residual diagnostics) and investigated before finalising the model.
6. Data Preprocessing
Data preprocessing is arguably the most critical step in PLS analysis. The choice of preprocessing can have a profound impact on the extracted components and the resulting model.
6.1 Mean Centring
Mean centring subtracts the column mean from each variable:

$$x_{ij}^{c} = x_{ij} - \bar{x}_j$$

Mean centring is almost always required for PLS. Without it, the first component tends to describe the mean of the data rather than the variance structure, and subsequent components are distorted.
After mean centring, the PLS model is fitted to $X^c$ and $y^c$, and predictions are adjusted using the response mean:

$$\hat{y} = \hat{y}^{c} + \bar{y}$$
6.2 Autoscaling (Mean Centring + Unit Variance Scaling)
Autoscaling (also called standardisation or z-score scaling) both mean-centres and scales each variable to unit variance:

$$x_{ij}^{s} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
When to use autoscaling:
- When predictor variables are measured in different units (e.g., age in years, income in dollars, height in centimetres). Without scaling, variables with larger numerical ranges will dominate the components.
- As the default choice when there is no strong reason to prefer otherwise.
When not to autoscale:
- When variables are measured in the same units and differences in variance are meaningful (e.g., spectroscopic data where high-variance spectral regions are genuinely more informative).
- When some variables are near-zero variance (noise-dominated) — autoscaling would amplify noise.
6.3 Other Scaling Methods
| Method | Formula | When to Use |
|---|---|---|
| No scaling (mean-centre only) | $x - \bar{x}$ | Same units; variance is meaningful |
| Autoscaling (UV) | $(x - \bar{x}) / s$ | Different units; default choice |
| Pareto scaling | $(x - \bar{x}) / \sqrt{s}$ | Compromise: down-weights high-variance variables less severely than UV |
| Range scaling | $(x - x_{\min}) / (x_{\max} - x_{\min})$ | All variables on [0,1] scale |
| Vast scaling | $\dfrac{x - \bar{x}}{s} \cdot \dfrac{\bar{x}}{s}$ | Focuses on variables with low coefficient of variation |
| Log transformation | $\log(x)$ | Right-skewed, multiplicative data (e.g., concentration data, metabolomics) |
💡 The response variable $y$ should also be mean-centred (and scaled if using PLS2 with multiple responses on different scales). For PLS1, scaling $y$ to unit variance is optional but often beneficial.
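The column-wise scaling methods in the table are each one line of numpy. A minimal sketch (the function name `scale_columns` and the test data are illustrative, not part of DataStatPro):

```python
import numpy as np

def scale_columns(X, method="uv"):
    """Column-wise preprocessing; methods follow the table above."""
    Xc = X - X.mean(axis=0)
    s = X.std(axis=0, ddof=1)
    if method == "center":
        return Xc                    # mean-centre only
    if method == "uv":
        return Xc / s                # autoscaling (unit variance)
    if method == "pareto":
        return Xc / np.sqrt(s)       # Pareto: divide by sqrt of the SD
    if method == "range":
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    raise ValueError(method)

rng = np.random.default_rng(4)
# Three variables on very different scales
X = rng.normal(loc=10.0, scale=[1.0, 5.0, 25.0], size=(100, 3))

Z = scale_columns(X, "uv")
print(Z.std(axis=0, ddof=1))                  # ~[1, 1, 1]
print(scale_columns(X, "range").min(axis=0))  # [0, 0, 0]
```

Whichever scaling is chosen, the same column means and SDs from the calibration set must be applied to any future prediction samples.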
6.4 Handling Missing Data
PLS with missing data requires special treatment. Options include:
- Complete case analysis: Remove observations with any missing values (only if missingness is minimal and random).
- Mean imputation: Replace missing values with the column mean (simple but ignores covariance structure).
- NIPALS-based imputation: The NIPALS algorithm can handle missing data iteratively — missing values are replaced with their model estimates at each iteration.
- Multiple imputation: Generate multiple complete datasets and pool the PLS results.
⚠️ Missing data in $X$ is more tractable than missing data in $Y$. If $y$ values are missing, those observations typically cannot be used in model calibration.
6.5 Outlier Detection Before Modelling
Before fitting PLS, screen for outliers using:
- Univariate checks: Box plots, histograms, z-scores ( as a flag).
- Multivariate checks: Mahalanobis distance, PCA score plots.
- Domain knowledge: Values that are physically impossible (negative concentrations, ages > 150) should be corrected or removed.
7. PLS Components and Latent Variables
Understanding what PLS components represent is essential for interpreting PLS models.
7.1 X-Scores ($T$)
The X-scores matrix $T$ ($n \times A$) contains the coordinates of each observation in the low-dimensional latent space:

$$t_a = X_{a-1} w_a$$

Where $t_a$ is the $a$-th column of $T$.
- Each row $t_i$ is the score vector for observation $i$ — its position in the latent space.
- Scores are mutually orthogonal: $t_a^T t_b = 0$ for $a \neq b$.
- Score plots ($t_1$ vs. $t_2$, etc.) reveal the structure of the observations: clusters, trends, outliers.
7.2 X-Loadings ($P$)
The X-loadings matrix $P$ ($p \times A$) describes how the original $X$ variables contribute to each latent component during deflation:

$$X = T P^T + E$$

- Loading $p_{ja}$ indicates how strongly variable $j$ is associated with component $a$.
- Loading plots reveal which variables are most important for each component and which variables are correlated (variables with similar loading vectors are collinear).
7.3 X-Weights ($W$ and $W^*$)
The X-weights $W$ describe how the deflated $X$ variables are weighted to form the X-scores:

$$t_a = X_{a-1} w_a$$

The modified X-weights $W^* = W (P^T W)^{-1}$ relate the original (undeflated) $X$ directly to the scores:

$$T = X W^*$$

⚠️ There is a subtle but important distinction between $W$ and $W^*$. For interpreting the relationship between $X$ variables and the PLS components, use $W^*$ (not $W$), as $W^*$ accounts for the deflation steps and relates to the original $X$ directly.
7.4 Y-Loadings ($Q$) and Y-Weights ($C$)
The Y-loadings $Q$ ($m \times A$) describe how the $Y$ variables are reconstructed from the X-scores:

$$\hat{Y} = T Q^T$$

Loading $q_{ka}$ indicates how strongly response variable $k$ is associated with component $a$ in the final prediction equation. For PLS1 ($m = 1$), $Q$ reduces to a scalar $q_a$ for each component.
7.5 Y-Scores ($U$)
The Y-scores $U$ ($n \times A$) are the latent components extracted from $Y$:

$$u_a = \frac{Y_{a-1} q_a}{q_a^T q_a}$$

In the inner relation $u_a \approx b_a t_a$, the Y-scores should be close to the X-scores (scaled by $b_a$). The tightness of this relationship is diagnostic of model quality:
- If $t_a$ and $u_a$ are strongly correlated, the component is predictively powerful.
- If they are weakly correlated, the component explains $X$ variance but not $Y$ — suggesting the component is not useful for prediction.
7.6 The Biplot
The PLS biplot overlays the score plot (observations) and loading plot (variables) in the same space:
- Observations (scores $T$) are plotted as points.
- Variables (loadings $P$ or weights $W^*$) are plotted as vectors.
- A variable vector pointing toward a cluster of observations indicates those observations have high values for that variable.
- Variables with loading vectors in the same direction are positively correlated; opposite directions indicate negative correlation.
8. Choosing the Number of Components
The number of latent components $A$ is the primary hyperparameter in PLS regression. Too few components underfits (high bias, misses important structure); too many overfits (low bias but high variance, memorises noise).
8.1 Cross-Validation (Primary Method)
Cross-validation (CV) is the gold-standard method for selecting $A$ in PLS. The most common approach is $K$-fold cross-validation:
- Divide the $n$ observations into $K$ roughly equal folds.
- For each fold $k = 1, \ldots, K$: a. Fit PLS with $1, 2, \ldots, A_{\max}$ components on the data excluding fold $k$. b. Predict the fold-$k$ observations using each fitted model. c. Compute prediction errors for fold $k$.
- Average the prediction errors across all $K$ folds.
- Select the $A$ minimising the cross-validated prediction error.
Leave-One-Out Cross-Validation (LOOCV): Special case of $K$-fold CV with $K = n$. Computationally intensive but uses maximum data for fitting at each step.
Recommended $K$: $K = 5$ or $K = 10$ is standard. For small datasets, LOOCV is preferred.
8.2 PRESS Statistic (Predicted Residual Error Sum of Squares)
The PRESS statistic summarises the cross-validated prediction error for $A$ components:

$$\mathrm{PRESS}(A) = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i),A} \right)^2$$

Where $\hat{y}_{(i),A}$ is the prediction for observation $i$ when it was in the held-out fold, using a model with $A$ components.
Select $A^* = \arg\min_A \mathrm{PRESS}(A)$.
💡 To guard against overfitting while maintaining parsimony, some guidelines recommend choosing the smallest $A$ for which $\mathrm{PRESS}(A)$ is within 1 standard error of the minimum PRESS (the "one-standard-error rule").
8.3 $Q^2$ (Cross-Validated $R^2$)
$Q^2$ is the cross-validated analogue of $R^2$:

$$Q^2(A) = 1 - \frac{\mathrm{PRESS}(A)}{\mathrm{TSS}}$$

Where:

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$Q^2$ ranges from $-\infty$ (model is worse than predicting the mean) to 1 (perfect cross-validated prediction).
Guidelines for $Q^2$:
| $Q^2$ Value | Interpretation |
|---|---|
| $\le 0$ | Model is worse than the mean; no predictive power |
| $0$–$0.3$ | Poor to moderate predictive ability |
| $0.3$–$0.5$ | Moderate predictive ability |
| $0.5$–$0.9$ | Good predictive ability |
| $> 0.9$ | Excellent predictive ability (verify not overfitting) |
The optimal number of components is where $Q^2$ reaches a maximum (or where adding more components does not meaningfully increase $Q^2$).
8.4 The $R^2Y$ vs. $Q^2$ Plot
A standard diagnostic in PLS is to plot both $R^2Y$ (variance explained in $Y$ — a training set metric) and $Q^2$ (cross-validated metric) against the number of components $A$:
- $R^2Y$ always increases with more components (never decreases).
- $Q^2$ first increases then decreases (or plateaus) as the model starts to overfit.
- The optimal $A$ is where $Q^2$ peaks or where the gap between $R^2Y$ and $Q^2$ begins to widen rapidly.
⚠️ A model with $R^2Y$ much higher than $Q^2$ is overfitting. Aim for $Q^2$ close to $R^2Y$.
8.5 Scree Plot of Eigenvalues / Variance Explained
A scree plot of the variance explained in $X$ ($R^2X$) by each successive component can also guide component selection: look for an "elbow" where additional components explain diminishing amounts of variance. However, $R^2X$ alone ignores predictive relevance for $Y$ — always prioritise $Q^2$.
8.6 Permutation Testing
Permutation tests provide a rigorous null-hypothesis test for whether a PLS model with $A$ components explains more variance in $Y$ than expected by chance:
- Randomly permute (shuffle) the $y$ vector many times (typically several hundred or more).
- Fit PLS with $A$ components to each permuted dataset and record $R^2$ and $Q^2$.
- The p-value is the proportion of permuted $R^2$ (or $Q^2$) values that exceed the observed value.
A significant result ($p < 0.05$) confirms the model captures a real relationship, not a spurious one.
9. Model Fit and Evaluation
9.1 $R^2X$ (Variance Explained in $X$)
The proportion of total $X$ variance explained by component $a$ is:

$$R^2X_a = \frac{\|t_a p_a^T\|^2}{\|X\|^2} = \frac{(t_a^T t_a)(p_a^T p_a)}{\sum_{i,j} x_{ij}^2}$$

Cumulative $R^2X$ after $A$ components:

$$R^2X(A) = 1 - \frac{\|E_A\|^2}{\|X\|^2}$$

A high $R^2X$ indicates the components capture most of the systematic variation in $X$.
9.2 $R^2Y$ (Variance Explained in $Y$)
The proportion of total $Y$ variance explained by $A$ components — the training set $R^2$:

$$R^2Y(A) = 1 - \frac{\|F_A\|^2}{\|Y\|^2}$$

For PLS1, this reduces to:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

Where $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares.
⚠️ $R^2Y$ alone is an optimistic (biased) estimate of model quality because it is computed on the training data. Always report $Q^2$ alongside $R^2Y$.
9.3 Root Mean Squared Error of Calibration (RMSEC)

$$\mathrm{RMSEC} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

This is the training set prediction error in the same units as $y$.
9.4 Root Mean Squared Error of Cross-Validation (RMSECV)

$$\mathrm{RMSECV} = \sqrt{\frac{\mathrm{PRESS}}{n}}$$

RMSECV is the cross-validated prediction error and is the primary model selection criterion alongside $Q^2$.
9.5 Root Mean Squared Error of Prediction (RMSEP)
When a separate independent test set of $n_{\mathrm{test}}$ observations is available (not used in model fitting or cross-validation):

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} (y_i - \hat{y}_i)^2}$$

RMSEP is the most honest estimate of true prediction error and should always be reported when a genuine external test set exists.
Hierarchy of prediction error estimates:

$$\mathrm{RMSEC} \le \mathrm{RMSECV} \approx \mathrm{RMSEP}$$

The gap between RMSEC and RMSECV/RMSEP indicates the degree of overfitting.
9.6 Bias and Slope of Predicted vs. Observed
A predicted vs. observed plot should ideally show points scattered symmetrically around the 1:1 line (slope = 1, intercept = 0). Formally test for systematic bias using a regression of observed on predicted:

$$y = a + b\,\hat{y} + e$$

- $a \neq 0$: Systematic bias (model consistently over- or under-predicts).
- $b \neq 1$: Scale bias (model predictions are proportionally stretched or compressed).
9.7 Summary of Model Fit Statistics
| Statistic | Definition | Optimal | Purpose |
|---|---|---|---|
| $R^2X$ | Variance in $X$ explained | High (informative but secondary) | Assess how well components represent $X$ |
| $R^2Y$ | Variance in $Y$ explained (training) | High | Training set fit |
| $Q^2$ | Cross-validated $R^2$ | High, close to $R^2Y$ | Predictive ability, component selection |
| RMSEC | Training set RMSE | Low (but biased) | Training error in original units |
| RMSECV | Cross-validated RMSE | Low | CV prediction error (primary) |
| RMSEP | External test set RMSE | Low | True prediction error |
10. Interpretation of PLS Results
10.1 Regression Coefficients ($\hat{\beta}$)
The PLS regression coefficients have the same interpretation as OLS coefficients: $\hat{\beta}_j$ is the expected change in $y$ for a one-unit increase in $x_j$, with all other variables held constant.
However, because PLS implicitly regularises through dimensionality reduction:
- Coefficients tend to be shrunk toward zero compared to OLS (especially for components not retained).
- They are more stable under multicollinearity than OLS coefficients.
- The number of components acts as the regularisation parameter: fewer components = more shrinkage.
⚠️ When variables are autoscaled, the coefficients are on the standardised scale. Multiply by $s_y$ (the SD of $y$) and divide by $s_{x_j}$ to interpret on the original scales, or report the raw (unstandardised) coefficients after back-scaling.
10.2 Variable Importance in Projection (VIP)
The VIP score (Wold, 1994) quantifies the contribution of each predictor variable $j$ to the PLS model, accounting for all $A$ components:

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \left( w_{ja} / \|w_a\| \right)^2}{\sum_{a=1}^{A} \mathrm{SS}_a}}$$

Where:
- $p$ = total number of predictor variables.
- $\mathrm{SS}_a$ = variance in $Y$ explained by component $a$.
- $(w_{ja} / \|w_a\|)^2$ = squared normalised weight of variable $j$ in component $a$.
Properties of VIP:
- $\mathrm{VIP}_j \ge 0$ always.
- The average squared VIP across all variables equals 1: $\frac{1}{p}\sum_{j=1}^{p} \mathrm{VIP}_j^2 = 1$.
- Therefore, the average VIP is approximately 1.
Interpretation:
| VIP Score | Interpretation |
|---|---|
| $> 1$ | Variable is important for the model (above-average contribution) |
| $0.8$–$1$ | Variable has moderate importance |
| $< 0.8$ | Variable has low importance; may be a candidate for removal |
💡 VIP is the most widely used variable selection criterion in PLS. Variables with VIP < 0.8 (or a domain-specific threshold) are candidates for removal, which can improve model parsimony and sometimes prediction performance.
10.3 Loadings Plot
The loadings plot displays the X-loadings $P$ (or weights $W^*$) for two components simultaneously, revealing:
- Which variables are most strongly associated with each component.
- Which variables are correlated with each other (similar loading vectors).
- The direction and magnitude of each variable's contribution.
For spectroscopic data, a plot of the loadings as a function of wavelength (a "loading spectrum") is particularly informative, as it reveals which spectral regions are important.
10.4 Score Plot
The score plot displays the X-scores ($t_1$ vs. $t_2$, etc.) for all observations, revealing:
- Clustering of observations (groups with similar profiles).
- Trends (e.g., samples ordered along $t_1$ by the response value).
- Outliers (isolated points far from the main cluster).
When colour-coded by $y$ values, the score plot shows whether the latent structure in $X$ aligns with $Y$ — the hallmark of a good PLS model.
10.5 Weight-Loading Biplot ($w^*q$ Plot)
The correlation loadings plot or $w^*q$ biplot plots the modified X-weights $W^*$ and Y-loadings $Q$ together in the same space. Variables (X and Y) that appear close together in this plot are positively correlated; variables on opposite sides are negatively correlated. This provides a comprehensive view of the $X$–$Y$ relationship structure.
10.6 Leverage and Residuals (Influence Analysis)
Leverage measures how influential observation $i$ is in determining the model:

$$h_i = \frac{1}{n} + t_i^T (T^T T)^{-1} t_i$$

Where $t_i$ is the $i$-th row of $T$ (as a column vector).
Standardised Y-residuals:

$$r_i = \frac{e_i}{s_e}, \qquad e_i = y_i - \hat{y}_i$$

A leverage-residual plot (also called a Williams plot) displays $h_i$ vs. $r_i$:
- High leverage + small residual: Influential, well-fitting observation (good leverage).
- High leverage + large residual: Outlier with high influence — investigate carefully.
- Low leverage + large residual: Poorly predicted outlier but low influence.
- Low leverage + small residual: Normal observation.
10.7 Hotelling's $T^2$ and SPE (DModX)
Two complementary multivariate control statistics are used to identify unusual samples:
Hotelling's $T^2$: Measures the distance of a sample from the centre of the model in the latent space:

$$T^2_i = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_{t_a}^2}$$

An approximate $F$-distribution critical value:

$$T^2_{\mathrm{crit}} = \frac{A(n-1)}{n-A}\, F_{\alpha}(A,\, n-A)$$

Observations with $T^2_i > T^2_{\mathrm{crit}}$ are outliers within the model space (unusual combination of latent components).
SPE (Squared Prediction Error) / DModX: Measures the distance of a sample from the PLS model plane (residual $X$ variability not captured by the model):

$$\mathrm{SPE}_i = \sum_{j=1}^{p} e_{ij}^2 = \|x_i - \hat{x}_i\|^2$$

A high SPE indicates the observation does not conform to the $X$-covariance structure of the calibration set.
💡 Use both $T^2$ and SPE together. A sample can be unusual inside the model ($T^2$ high) or outside the model (SPE high), or both.
11. Validation Methods
Rigorous validation is essential to ensure a PLS model genuinely predicts new observations, rather than memorising noise in the training data.
11.1 Internal Validation: Cross-Validation
$K$-fold cross-validation and LOOCV (described in Section 8.1) provide internal validation estimates ($Q^2$, RMSECV). Internal validation is mandatory but can still be optimistic if the same data were used to select preprocessing, model type, and components.
Monte Carlo Cross-Validation (MCCV): Randomly splits the data into training and validation sets many times, each time using a different random split (e.g., 80% train / 20% validate). The distribution of RMSECV values across splits provides uncertainty estimates.
11.2 External Validation: Independent Test Set
The most rigorous validation uses a truly independent test set — samples not used in any stage of model building (not in training, not in cross-validation, not in component selection). Performance on this set (RMSEP) is the definitive measure of predictive ability.
How to split data for external validation:
- Random split: Randomly allocate samples (e.g., 70–80% training, 20–30% test). Only valid if both subsets are representative.
- Kennard-Stone algorithm: Selects a maximally representative subset of the data for the training set based on Euclidean distances in $X$-space, ensuring the calibration set spans the full range of the data.
- DUPLEX algorithm: Simultaneously selects representative training and test sets.
- Systematic split: For time-ordered data, use earlier samples for training and later ones for testing.
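The Kennard-Stone algorithm mentioned above is simple to sketch: seed the selection with the two most distant samples, then repeatedly add the sample farthest from its nearest already-selected neighbour. A numpy illustration (the function name and data are hypothetical, not DataStatPro's API):

```python
import numpy as np

def kennard_stone(X, k):
    """Select k maximally spread sample indices for a calibration set."""
    # Pairwise Euclidean distances between all samples
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with the two most distant samples
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    while len(selected) < k:
        remaining = [r for r in range(len(X)) if r not in selected]
        # Farthest-from-nearest-selected ("maximin") criterion
        d_min = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(d_min))])
    return np.array(selected)

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 4))
train_idx = kennard_stone(X, 20)
test_idx = np.setdiff1d(np.arange(30), train_idx)
print(len(train_idx), len(test_idx))  # 20 10
```

Because the criterion is maximin distance, the selected calibration set covers the edges of the $X$-space, which is what makes the remaining samples a fair (interpolative) test set.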
11.3 Permutation Test for Model Validity
(Described in Section 8.6.) A permutation test confirms that the model's $Q^2$ is significantly above what would be obtained by chance.
A permutation plot shows the distribution of $Q^2$ values from permuted models overlaid with the observed $Q^2$. If the observed $Q^2$ falls well above the permutation distribution, the model is valid.
11.4 Y-Randomisation Test
Related to the permutation test, Y-randomisation (also called response permutation) repeatedly randomises $y$, fits PLS with the same number of components, and records $R^2$ and $Q^2$:
- The permuted $Q^2$ values should be near zero or negative (since a randomly shuffled $y$ has no true relationship with $X$).
- The observed $Q^2$ should be far above the permuted distribution.
- If permuted models give high $Q^2$ values, the original result is likely spurious (due to chance correlations).
11.5 The Q²/R² Ratio as a Validity Check
A simple heuristic from chemometrics practice:
- If Q²/R² is close to 1 (i.e., R² − Q² is small, roughly below 0.2–0.3): the model is not overfitting (good).
- If Q²/R² is well below 1 (or R² − Q² exceeds about 0.3): the model may be overfitting — reduce the number of components or collect more data.
12. Comparison with Related Methods
12.1 PLS vs. Principal Component Regression (PCR)
Both PLS and PCR reduce dimensionality before regression. The key difference:
| Aspect | PCR | PLS |
|---|---|---|
| Component extraction criterion | Maximise variance in X only | Maximise covariance between X scores and y |
| y used in component extraction | ❌ No | ✅ Yes |
| Relevant components | May not be the first few (components uncorrelated with y may explain much X-variance) | First few components are typically most predictive of y |
| Number of components needed | Typically more | Typically fewer |
| Predictive performance | Comparable or lower | Generally better |
💡 PLS is generally preferred over PCR for prediction tasks because it ensures the extracted components are relevant for predicting y. PCR may extract components that explain much X-variance but are useless for prediction.
12.2 PLS vs. Ridge Regression
| Aspect | Ridge Regression | PLS |
|---|---|---|
| Handles multicollinearity | ✅ | ✅ |
| Shrinkage mechanism | Continuous (penalty λ on coefficient size) | Discrete (choice of A) |
| Variable selection | ❌ (all variables retained) | Indirectly (via VIP) |
| Interpretable components | ❌ | ✅ |
| Handles p > n | ✅ | ✅ |
| Cross-validation parameter | λ (regularisation) | A (components) |
12.3 PLS vs. Lasso
| Aspect | Lasso | PLS |
|---|---|---|
| Variable selection | ✅ (explicit, drives coefficients to zero) | Indirectly via VIP |
| Correlated predictors | ❌ (arbitrary selection among correlated vars) | ✅ (handles gracefully; correlated vars share weight) |
| Latent structure | ❌ | ✅ |
| Interpretable components | ❌ | ✅ |
| Handles p > n | ✅ (but arbitrarily selects one from each correlated group) | ✅ |
12.4 PLS vs. OLS Multiple Regression
| Aspect | OLS | PLS |
|---|---|---|
| Multicollinearity | ❌ (unstable, inflated SE) | ✅ (stable) |
| Handles p > n | ❌ (undefined) | ✅ |
| Coefficient interpretation | Straightforward (ceteris paribus) | Requires care (components summarise X) |
| Statistical inference (p-values) | ✅ (exact under assumptions) | ❌ (non-trivial; use jack-knife or bootstrap) |
| Prediction accuracy | ✅ (when n ≫ p and no collinearity) | ✅ (especially when p > n or collinearity) |
12.5 When to Choose Which Method
| Condition | Recommended Method |
|---|---|
| n ≫ p, low multicollinearity, need inference | OLS |
| High multicollinearity, n > p, need inference | Ridge Regression |
| Explicit variable selection needed, p > n | Lasso or Elastic Net |
| p ≫ n, high multicollinearity, interpretability needed | PLS |
| Multiple correlated responses (Y matrix) | PLS2 |
| Classification with high-dimensional X | PLS-DA |
| Non-linear relationships | Kernel PLS or non-linear methods (RF, SVM) |
| Want PCR but with y-guided component selection | PLS |
13. Using the PLS Regression Component
The PLS Regression component in the DataStatPro application provides a full end-to-end workflow for fitting, validating, and interpreting PLS models.
Step-by-Step Guide
Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should have at least one response variable and two or more predictor variables.
Step 2 — Select Analysis Type Choose the PLS analysis type:
- PLS1 (single continuous response)
- PLS2 (multiple continuous responses)
- PLS-DA (discriminant analysis; categorical response)
Step 3 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. These should be continuous or ordinal numeric variables.
💡 You can select all available numeric predictors and rely on VIP scores for post-hoc variable selection.
Step 4 — Select Response Variable(s) (Y)
- For PLS1: Select a single continuous response from the "Response Variable (Y)" dropdown.
- For PLS2: Select two or more continuous response variables.
- For PLS-DA: Select a categorical variable. You will be prompted to confirm the class coding.
Step 5 — Configure Preprocessing Select the preprocessing method for X (and optionally Y):
- Mean Centring Only (recommended when variables are in the same units)
- Autoscaling (UV) (recommended default)
- Pareto Scaling
- No Preprocessing (not recommended unless data are already preprocessed)
Step 6 — Configure Number of Components Choose how to determine the number of components A:
- Automatic (Cross-Validation): The application selects A by minimising RMSECV.
- Manual: Specify A directly (e.g., based on domain knowledge or prior analysis).
- Set the maximum number of components to evaluate.
Step 7 — Configure Cross-Validation Select the cross-validation scheme:
- k-Fold CV: Specify k (default: 10).
- Leave-One-Out CV (LOOCV): Use for small datasets.
- Monte Carlo CV: Specify the number of iterations and train/test split ratio.
Step 8 — Select Display Options Choose which outputs to display:
- ✅ Score Plot (t₁ vs. t₂)
- ✅ Loading Plot (p₁ vs. p₂)
- ✅ Biplot (combined score and loading plot)
- ✅ R² and Q² vs. Number of Components Plot
- ✅ RMSEC and RMSECV vs. Number of Components Plot
- ✅ Predicted vs. Observed Plot
- ✅ VIP Scores Plot
- ✅ Regression Coefficients Plot
- ✅ Residuals Plot
- ✅ Leverage-Residual (Williams) Plot
- ✅ Hotelling's T² and SPE (DModX) Control Charts
- ✅ Component Summary Table (R²X, R²Y, Q² per component)
- ✅ Permutation Plot
Step 9 — Run the Analysis Click "Run PLS Regression". The application will:
- Apply the selected preprocessing to X and Y.
- Run cross-validation to determine the optimal number of components (if automatic).
- Fit the final PLS model with the selected components using NIPALS or SIMPLS.
- Compute scores (T, U), loadings (P, Q), weights (W, W*).
- Compute VIP scores and regression coefficients.
- Compute model fit statistics (R²X, R²Y, Q², RMSEC, RMSECV).
- Compute leverage, residuals, Hotelling's T², and SPE for all observations.
- Generate all selected visualisations and tables.
- Run permutation tests (if selected).
14. Computational and Formula Details
14.1 The NIPALS Algorithm for PLS1
The Non-linear Iterative Partial Least Squares (NIPALS) algorithm is the classical iterative procedure for PLS decomposition. For PLS1 (single response):
Inputs: Mean-centred (and scaled) X (n × p) and y (n × 1).
For each component a = 1, …, A:
1. Initialise: u = y (or any non-zero starting vector).
2. Compute X-weight vector: w = Xᵀu / ‖Xᵀu‖ (normalised to unit length).
3. Compute X-score vector: t = Xw.
4. Compute Y-weight (a scalar c for PLS1): c = yᵀt / (tᵀt).
5. Compute Y-score vector: u = yc / c². For PLS1, since u is proportional to y, convergence is immediate — no iteration is needed.
6. Check convergence (PLS2 only — for PLS1, skip to step 7).
7. Compute X-loadings: p = Xᵀt / (tᵀt).
8. Compute inner relation coefficient: b = uᵀt / (tᵀt) (equal to c in PLS1).
9. Deflate X: X ← X − tpᵀ.
10. Deflate y: y ← y − bt.
After all A components, compute the modified weights:
W* = W(PᵀW)⁻¹
Where W = [w₁, …, w_A] and P = [p₁, …, p_A].
Compute the PLS regression coefficients:
b_PLS = W*c = W(PᵀW)⁻¹c
Where c = (c₁, …, c_A)ᵀ (the inner relation coefficients, equal to the Y-loadings in PLS1).
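The steps above transcribe directly into numpy (a sketch; as a sanity check, with A = p components the PLS coefficients coincide with OLS on centred data, a known property of PLS):

```python
# PLS1 NIPALS, following the numbered steps in the text.
import numpy as np

def nipals_pls1(X, y, n_components):
    X = X - X.mean(axis=0)          # mean-centre
    y = y - y.mean()
    W, P, cs = [], [], []
    Xa, ya = X.copy(), y.copy()
    for _ in range(n_components):
        w = Xa.T @ ya
        w = w / np.linalg.norm(w)   # step 2: X-weight, unit length
        t = Xa @ w                  # step 3: X-score
        c = (ya @ t) / (t @ t)      # step 4: Y-weight (scalar for PLS1)
        p = (Xa.T @ t) / (t @ t)    # step 7: X-loading
        Xa = Xa - np.outer(t, p)    # step 9: deflate X
        ya = ya - t * c             # step 10: deflate y (b = c in PLS1)
        W.append(w); P.append(p); cs.append(c)
    W, P, c = np.array(W).T, np.array(P).T, np.array(cs)
    return W @ np.linalg.inv(P.T @ W) @ c   # b_PLS = W (P'W)^-1 c

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
b = nipals_pls1(X, y, n_components=5)            # A = p: should match OLS
b_ols, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)
```
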
14.2 The NIPALS Algorithm for PLS2
For PLS2 (m ≥ 2 response variables), an inner iteration is needed at each component to converge to the dominant covariance direction:
1. Initialise: u = first column of Y (or any non-zero column).
2. Outer iteration (repeat until convergence of t):
   a. Compute X-weight: w = Xᵀu / ‖Xᵀu‖
   b. Compute X-score: t = Xw
   c. Compute Y-weight: c = Yᵀt / (tᵀt)
   d. Compute Y-score: u = Yc / (cᵀc)
   e. Check: if ‖t_new − t_old‖ / ‖t_new‖ < ε (e.g., ε = 10⁻¹⁰), converged.
3. Compute loadings: p = Xᵀt / (tᵀt)
4. Compute inner coefficient: b = uᵀt / (tᵀt)
5. Deflate both matrices: X ← X − tpᵀ and Y ← Y − btcᵀ
14.3 The SIMPLS Algorithm
SIMPLS (de Jong, 1993) is an alternative, non-deflation-based algorithm that directly computes the PLS weight vectors without deflating X. It is computationally more efficient and numerically more stable for large datasets.
SIMPLS directly finds each weight vector as the leading eigenvector of the cross-product matrix S = XᵀY, with S deflated at each step by projecting out the previously found loading directions.
NIPALS vs. SIMPLS:
| Feature | NIPALS | SIMPLS |
|---|---|---|
| Handles missing data | ✅ (iterative imputation) | ❌ |
| Computational efficiency | Deflates the n × p matrix X at every component | Works on the smaller p × m matrix XᵀY, computed once |
| Numerical stability | Good | Excellent |
| Equivalence | — (reference algorithm) | Identical to NIPALS for PLS1; slightly different for PLS2 |
14.4 VIP Score Computation

VIPⱼ = √( p · Σₐ SSYₐ · (wₐⱼ / ‖wₐ‖)² / Σₐ SSYₐ )

Where SSYₐ is the Y-variance explained by component a, and the denominator normalisation ensures that Σⱼ VIPⱼ² = p.
14.5 Confidence Intervals for PLS Coefficients (Jack-Knife)
Standard errors for PLS regression coefficients can be estimated using the jack-knife procedure:
1. For each observation i = 1, …, n, fit a PLS model with A components leaving out observation i, giving the coefficient vector b₍₋ᵢ₎.
2. The jack-knife estimate of the coefficient vector: b̄ = (1/n) Σᵢ b₍₋ᵢ₎.
3. Jack-knife standard error of coefficient j: SE(bⱼ) = √( ((n − 1)/n) · Σᵢ (b₍₋ᵢ₎ⱼ − b̄ⱼ)² ).
4. Approximate 95% confidence interval: bⱼ ± t₀.₉₇₅,ₙ₋₁ · SE(bⱼ).
A jack-knife t-statistic for testing H₀: βⱼ = 0 is tⱼ = bⱼ / SE(bⱼ).
⚠️ Jack-knife standard errors for PLS coefficients are approximate. Bootstrap-based confidence intervals are more accurate but computationally more demanding.
14.6 Back-Scaling of Coefficients
When X and y are autoscaled before fitting, the PLS coefficients are on the standardised scale. Back-scaling to original units:

bⱼ(orig) = bⱼ(std) · sᵧ / sₓⱼ

Where sₓⱼ is the standard deviation of predictor xⱼ and sᵧ is the standard deviation of y.
The intercept is:

b₀ = ȳ − Σⱼ bⱼ(orig) · x̄ⱼ
14.7 Q² Computation in Detail
For LOOCV (n folds):

PRESS = Σᵢ (yᵢ − ŷ₍₋ᵢ₎)²,  Q² = 1 − PRESS / Σᵢ (yᵢ − ȳ)²

Where ŷ₍₋ᵢ₎ is the prediction for observation i from a model trained on all observations except i.
For k-fold CV with k < n, PRESS is computed as the sum over all held-out predictions (one per observation), where each observation is held out exactly once.
15. Worked Examples
Example 1: PLS1 Regression — Predicting Protein Content from NIR Spectra
Research Question: Can near-infrared (NIR) spectral measurements (100 wavelengths, X) predict the protein content (%, y) of wheat flour samples?
Dataset: n wheat samples; p = 100 spectral absorbance variables; y = protein content (%).
Step 1: Preprocessing
Apply autoscaling to X (spectral data; different variances at different wavelengths). Mean-centre y.
Step 2: Cross-Validation to Select A
Run PLS1 with A = 1 to 8 components using 10-fold cross-validation:
| Components (A) | R² | Q² | RMSEC | RMSECV |
|---|---|---|---|---|
| 1 | 0.612 | 0.583 | 0.731 | 0.764 |
| 2 | 0.843 | 0.812 | 0.468 | 0.511 |
| 3 | 0.923 | 0.895 | 0.329 | 0.382 |
| 4 | 0.961 | 0.934 | 0.233 | 0.302 |
| 5 | 0.972 | 0.931 | 0.197 | 0.311 |
| 6 | 0.979 | 0.924 | 0.169 | 0.328 |
| 7 | 0.983 | 0.910 | 0.152 | 0.352 |
| 8 | 0.986 | 0.896 | 0.137 | 0.380 |
Q² peaks at A = 4 (0.934) and decreases thereafter (despite R² continuing to rise → overfitting). Select A = 4 components.
Step 3: Fit Final PLS Model with A = 4
Model summary:
| Component | R²X (cumulative) | R²Y (cumulative) | Q² (cumulative) |
|---|---|---|---|
| 1 | 0.423 | 0.612 | 0.583 |
| 2 | 0.617 | 0.843 | 0.812 |
| 3 | 0.748 | 0.923 | 0.895 |
| 4 | 0.814 | 0.961 | 0.934 |
Final statistics: R²Y = 0.961, Q² = 0.934, RMSEC = 0.233%, RMSECV = 0.302%.
Step 4: VIP Scores
The top 5 most important variables by VIP:
| Wavelength (nm) | VIP Score | Interpretation |
|---|---|---|
| 2180 | 2.14 | Highly important (protein N-H stretch) |
| 2100 | 1.98 | Highly important |
| 1680 | 1.87 | Highly important |
| 2240 | 1.76 | Important |
| 1940 | 1.62 | Important |
Variables with VIP < 0.8: 42 out of 100 wavelengths are candidates for removal.
Step 5: Prediction for New Sample
New wheat sample: spectral vector x_new (autoscaled). Compute the standardised prediction ŷ_std = x_newᵀ b_PLS.
Back-scale: ŷ = ŷ_std · sᵧ + ȳ.
95% prediction interval (jack-knife SE = 0.28%): ŷ ± 1.96 × 0.28%.
Conclusion: The 4-component PLS model achieves excellent predictive performance (Q² = 0.934, RMSECV = 0.302%). The model is not overfitting (Q²/R² ratio ≈ 0.97). NIR wavelengths around 2180 nm and 2100 nm (protein N-H stretching bands) are the most important predictors.
Example 2: PLS1 Regression — Predicting Blood Pressure from Clinical Variables
Research Question: Can clinical variables (age, BMI, cholesterol, glucose, smoking status, exercise frequency) predict systolic blood pressure (SBP)?
Dataset: n patients; p = 6 predictors; y = SBP (mmHg).
Predictors: Age (years), BMI (kg/m²), Total Cholesterol (mmol/L), Fasting Glucose (mmol/L), Smoking (0/1), Exercise (days/week).
Step 1: Preprocessing
Apply autoscaling to all 6 predictors (different units) and mean-centre y (SBP, mmHg).
Step 2: CV to Select A
| Components | R² | Q² | RMSECV |
|---|---|---|---|
| 1 | 0.542 | 0.519 | 8.12 |
| 2 | 0.631 | 0.601 | 7.41 |
| 3 | 0.649 | 0.589 | 7.58 |
| 4 | 0.655 | 0.571 | 7.82 |
Select A = 2 (maximum Q² = 0.601, RMSECV = 7.41 mmHg).
Step 3: Regression Coefficients (back-scaled to original units)
| Predictor | b (mmHg per unit) | SE (jack-knife) | t-statistic | Significant? |
|---|---|---|---|---|
| Age (per year) | 0.482 | 0.091 | 5.30 | ✅ |
| BMI (per kg/m²) | 1.241 | 0.213 | 5.83 | ✅ |
| Cholesterol (per mmol/L) | 0.837 | 0.281 | 2.98 | ✅ |
| Glucose (per mmol/L) | 0.614 | 0.244 | 2.51 | ✅ |
| Smoking | 3.921 | 1.482 | 2.65 | ✅ |
| Exercise (per day/week) | -1.183 | 0.392 | -3.02 | ✅ |
Step 4: VIP Scores
| Predictor | VIP Score | Importance |
|---|---|---|
| BMI | 1.48 | High |
| Age | 1.32 | High |
| Smoking | 1.21 | High |
| Exercise | 1.13 | High |
| Cholesterol | 0.92 | Moderate |
| Glucose | 0.74 | Low (VIP < 0.8) |
Step 5: Prediction
For a new patient (Age = 55, BMI = 28.4, Cholesterol = 5.2, Glucose = 5.8, Smoking = 1, Exercise = 2), apply the back-scaled coefficients and intercept: ŷ = b₀ + Σⱼ bⱼ xⱼ.
Conclusion: The 2-component PLS model explains 63.1% of SBP variance (R² = 0.631) with reasonable cross-validated performance (Q² = 0.601, RMSECV = 7.4 mmHg). BMI, age, and smoking are the strongest predictors. Glucose has a VIP < 0.8, suggesting limited predictive contribution in this dataset.
Example 3: PLS2 Regression — Predicting Multiple Sensory Attributes from Chemical Composition
Research Question: Can the chemical composition of wine (8 chemical variables) jointly predict 3 sensory attributes (acidity rating, bitterness rating, overall quality score)?
Dataset: n wines; p = 8 chemical predictors (pH, alcohol %, residual sugar, sulphates, fixed acidity, volatile acidity, citric acid, density); m = 3 responses.
Step 1: Preprocessing
Autoscale (mean-centre and scale) all X and Y variables, since they are on different scales.
Step 2: CV to Select A
| Components | R²X | R²Y (avg) | Q² (avg) |
|---|---|---|---|
| 1 | 0.381 | 0.443 | 0.412 |
| 2 | 0.544 | 0.651 | 0.597 |
| 3 | 0.634 | 0.712 | 0.584 |
Select A = 2 (Q² peaks at 0.597).
Step 3: Component Summary
| Component | R²X cumul. | R²Y (acidity) cumul. | R²Y (bitterness) cumul. | R²Y (quality) cumul. |
|---|---|---|---|---|
| 1 | 0.381 | 0.521 | 0.409 | 0.398 |
| 2 | 0.544 | 0.698 | 0.612 | 0.643 |
Step 4: Y-Loadings (q)
| Response | q₁ | q₂ |
|---|---|---|
| Acidity | 0.611 | -0.392 |
| Bitterness | 0.524 | 0.481 |
| Quality | 0.593 | 0.144 |
Component 1 positively loads on all three responses (general quality/intensity factor). Component 2 contrasts bitterness (positive) against acidity (negative) — a sensory contrast axis.
Step 5: Top VIP Scores (averaged across responses)
| Chemical Variable | VIP Score |
|---|---|
| Volatile Acidity | 1.63 |
| Alcohol % | 1.41 |
| pH | 1.28 |
| Sulphates | 1.17 |
| Residual Sugar | 0.94 |
| Fixed Acidity | 0.86 |
| Citric Acid | 0.72 |
| Density | 0.68 |
Citric acid and density have VIP < 0.8 — candidates for removal in a reduced model.
Conclusion: The 2-component PLS2 model jointly predicts all three sensory attributes with moderate-to-good accuracy (average Q² = 0.597). Volatile acidity and alcohol content are the most influential predictors. The two PLS components reveal a general quality factor and a bitterness-versus-acidity contrast factor in the sensory space.
16. Common Mistakes and How to Avoid Them
Mistake 1: Skipping Preprocessing
Problem: Applying PLS to unscaled data where variables differ widely in units and magnitude. Variables with larger numerical ranges (e.g., income in thousands vs. age in decades) dominate the components, producing misleading results.
Solution: Always mean-centre the data. Apply autoscaling (UV scaling) when variables are in different units. Carefully consider the appropriate scaling for your specific domain and data type.
Mistake 2: Selecting Too Many Components
Problem: Using the number of components that maximises R² (training set fit) rather than Q² (cross-validated fit), resulting in an overfitted model that performs well on training data but poorly on new observations.
Solution: Always use cross-validation (Q², RMSECV) to select A. Look for the point where Q² peaks or where the R²–Q² gap begins to widen. Apply the one-standard-error rule for extra parsimony.
Mistake 3: Ignoring Model Validation
Problem: Reporting only training set statistics (R², RMSEC) without cross-validation or external validation, giving an overly optimistic picture of model performance.
Solution: Always report Q² and RMSECV. Whenever possible, reserve an independent external test set and report RMSEP. Run permutation tests to confirm the model is not a statistical artefact.
Mistake 4: Confusing W-Weights with P-Loadings
Problem: Using the X-loadings P (or the raw weights W) to interpret the relationship between variables and the model, rather than the modified weights W*.
Solution: For variable importance interpretation, use W* (the modified weights) or VIP scores. Loadings describe how X is reconstructed from the scores; modified weights describe how variables linearly combine to form the scores from the original X.
Mistake 5: Using VIP Threshold Rigidly
Problem: Mechanically removing all variables with VIP < 0.8 and accepting all variables with VIP > 1.0, without considering domain knowledge, model stability, or the effect of variable removal on Q².
Solution: Treat VIP as a guide, not a hard rule. After removing low-VIP variables, refit the model and check whether Q² improves or remains stable. Incorporate domain knowledge about which variables are mechanistically meaningful.
Mistake 6: Applying PLS to a Completely Heterogeneous Dataset
Problem: Fitting a single global PLS model to data comprising fundamentally different subgroups (e.g., different product types, different analytical conditions), producing a model that fits no subgroup well.
Solution: Inspect score plots for clustering. If distinct subgroups are visible, consider fitting separate PLS models per subgroup, or use class-based modelling approaches such as PLS-DA to first classify then model within class.
Mistake 7: Not Detecting Outliers Before Modelling
Problem: Leaving extreme outliers in the dataset, which disproportionately influence the PLS components and distort the model for the remaining, majority observations.
Solution: Check univariate distributions, Mahalanobis distances, and initial PCA scores before PLS. After fitting, use the Williams plot (leverage vs. residuals), T², and SPE plots to identify influential observations. Investigate outliers — do not simply delete them without justification.
Mistake 8: Extrapolating Beyond the Calibration Range
Problem: Using the PLS model to predict samples that fall outside the range of the calibration set (extrapolation), where the linear relationship may not hold and the model has no basis for reliable prediction.
Solution: Check new samples for consistency with the calibration set using T² and SPE control charts. If a new sample falls outside the 95% control limits, flag the prediction as unreliable. Expand the calibration set to cover the full expected range of future samples.
Mistake 9: Misinterpreting R² Without Context
Problem: Reporting a high R² (e.g., 0.98) as evidence of an excellent model, without noting that this is the training set fit and may reflect overfitting.
Solution: Always pair R² with Q². A model with a very high R² but a much lower Q² is severely overfitting. The Q² value is the meaningful indicator of predictive performance.
Mistake 10: Applying PLS Regression to a Classification Problem Without PLS-DA
Problem: Using PLS1 to predict a binary class label (0/1) without proper classification thresholding or performance assessment with classification metrics (sensitivity, specificity, AUC).
Solution: For categorical outcomes, use PLS-DA with appropriate class encoding. Evaluate using classification metrics (confusion matrix, sensitivity, specificity, AUC-ROC) with cross-validated class assignments, not just RMSECV.
17. Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| Q² ≤ 0 for all components | No predictive relationship; noisy data; too few samples | Check data quality; verify X and y are correctly entered; increase n; reconsider predictor selection |
| R² is high but Q² is very low (large gap) | Severe overfitting due to too many components or p ≫ n | Reduce A; use stricter CV; consider sparse PLS or variable pre-selection |
| NIPALS fails to converge | Near-zero variance columns; perfect collinearity; numerical issues | Remove zero-variance variables before fitting; check for duplicate columns; increase max iterations |
| Score plot shows extreme outlier separated from main cluster | Outlier with unusual or value; data entry error | Investigate observation; check for data entry errors; assess leverage and SPE |
| All VIP scores are approximately 1 | Only one component extracted (A = 1); VIP can be near-uniform when weights are spread evenly across variables | Increase A if justified by CV; interpret coefficients directly for A = 1 |
| Permutation test shows Q² of permuted models as high as observed | No real relationship between X and y; chance correlation | Do not use the model; reconsider variable selection; collect more data; verify correct y assignment |
| Jack-knife SEs are very large | Too few samples relative to components (small n) | Reduce A; collect more samples; do not report jack-knife inference for very small n |
| Predicted vs. observed plot shows systematic curvature | Non-linear relationship between X and y | Apply polynomial or logarithmic transformation to X or y; use kernel PLS; consider non-linear models |
| RMSECV does not decrease with more components | No additional predictive structure beyond first component | Accept a 1-component model; data may be well-described by a single latent variable |
| New sample has very high SPE | New sample's pattern does not match calibration set structure | Flag prediction as unreliable; expand calibration set to include similar samples |
| Negative Q² for external test set despite positive cross-validated Q² | Test set is not representative of training set (distribution shift) | Re-examine train/test split; use Kennard-Stone or DUPLEX for representative splitting; collect more diverse calibration samples |
| PLS2 gives worse predictions than separate PLS1 models | Responses are poorly correlated with each other | Run separate PLS1 models for each response; or use OPLS for better separation of effects |
18. Quick Reference Cheat Sheet
Core Formulas
| Formula | Description |
|---|---|
| X = TPᵀ + E | PLS decomposition of X |
| Y = TCᵀ + F | PLS prediction of Y |
| tₐ = Xₐ₋₁wₐ | X-score for component a |
| pₐ = Xₐ₋₁ᵀtₐ / (tₐᵀtₐ) | X-loading for component a |
| W* = W(PᵀW)⁻¹ | Modified X-weight matrix |
| b = W*c | PLS regression coefficients |
| ŷ = Xb | Predicted values |
| R² = 1 − RSS/TSS | Training set variance explained |
| Q² = 1 − PRESS/TSS | Cross-validated variance explained |
| RMSECV = √(PRESS/n) | Cross-validated RMSE |
| RMSEP = √(Σ(yᵢ − ŷᵢ)² / n_test) | External test set RMSE |
| VIPⱼ = √(p · Σₐ SSYₐ(wₐⱼ/‖wₐ‖)² / Σₐ SSYₐ) | Variable importance in projection |
| Tᵢ² = Σₐ tᵢₐ² / s²(tₐ) | Hotelling's T² |
| SPEᵢ = Σⱼ eᵢⱼ² | Squared prediction error |
Preprocessing Guide
| Situation | Recommended Scaling |
|---|---|
| Variables in different units | Autoscaling (UV) |
| Variables in same units; variance meaningful | Mean centring only |
| High-variance variables dominate | Pareto scaling |
| Right-skewed, multiplicative data | Log transform first, then mean centre |
| Near-zero variance variables present | Remove before scaling |
Component Selection Guide
| Evidence | Action |
|---|---|
| Q² increasing with each component | Add another component |
| Q² has peaked and starts decreasing | Stop adding components |
| R²–Q² gap widening | Overfitting — reduce A |
| Q² ≤ 0 | No predictive signal; check data |
| Permuted Q² ≈ observed Q² | Model is a statistical artefact |
| R² − Q² > 0.3 | Overfitting — reduce A |
VIP Score Interpretation
| VIP Score | Variable Importance |
|---|---|
| High importance | |
| Moderate importance | |
| Low importance; candidate for removal |
Model Evaluation Hierarchy
| Metric | Type | Bias | Recommended Use |
|---|---|---|---|
| R² / RMSEC | Training set | Optimistic | Report but do not use alone |
| Q² / RMSECV | Cross-validated | Slightly optimistic | Primary model selection criterion |
| RMSEP | External test | Unbiased | Best estimate of true prediction error |
Outlier Detection Summary
| Statistic | What It Detects | Threshold |
|---|---|---|
| Hotelling's T² | Unusual position within the model space | 95% F-distribution critical value |
| SPE (DModX) | Does not fit the model structure | 95% limit (approximate) |
| Leverage | Influence on model coefficients | > 2A/n to 3A/n (rough guideline) |
| Standardised residual | Poor fit in y | Absolute value > 2.5 or 3 |
PLS Model Type Selection
| Scenario | PLS Variant |
|---|---|
| One continuous response | PLS1 |
| Multiple continuous responses | PLS2 |
| Binary or multi-class outcome | PLS-DA |
| Improve interpretability (single response) | OPLS |
| Non-linear relationships | Kernel PLS |
| Automatic variable selection | Sparse PLS |
Comparison of Regression Methods
| Feature | OLS | Ridge | PCR | PLS |
|---|---|---|---|---|
| p > n | ❌ | ✅ | ✅ | ✅ |
| Handles collinearity | ❌ | ✅ | ✅ | ✅ |
| Uses y in dimension reduction | ❌ | ❌ | ❌ | ✅ |
| Interpretable components | N/A | ❌ | ✅ | ✅ |
| Multiple responses | ✅ | ✅ | Partial | ✅ |
| Variable selection | ❌ | ❌ | ❌ | Via VIP |
| Exact inference (p-values) | ✅ | ❌ | ❌ | Approx. (jack-knife) |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting PLS Regression using the DataStatPro application. For further reading, consult Wold, Sjöström & Eriksson's "PLS-regression: a basic tool of chemometrics" (Chemometrics and Intelligent Laboratory Systems, 2001), Mevik & Wehrens's "The pls Package: Principal Component and Partial Least Squares Regression in R" (Journal of Statistical Software, 2007), or Höskuldsson's "PLS regression methods" (Journal of Chemometrics, 1988). For feature requests or support, contact the DataStatPro team.