
Multivariate Linear Models

Comprehensive reference guide for Multivariate Linear Models analysis.

Multivariate Linear Models: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Multivariate Linear Models (MLM) all the way through advanced estimation, hypothesis testing, model diagnostics, coefficient interpretation, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Multivariate Linear Models?
  3. The Mathematical Framework
  4. Model Specification and Design Matrices
  5. Assumptions of Multivariate Linear Models
  6. Parameter Estimation
  7. Hypothesis Testing and Inference
  8. Multivariate Test Statistics
  9. Effect Size Measures
  10. Model Fit and Evaluation
  11. Model Diagnostics and Residuals
  12. Interpretation of Coefficients
  13. Variable Selection and Model Comparison
  14. Special Cases and Connections to Other Methods
  15. Using the Multivariate Linear Models Component
  16. Computational and Formula Details
  17. Worked Examples
  18. Common Mistakes and How to Avoid Them
  19. Troubleshooting
  20. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into Multivariate Linear Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Ordinary Least Squares (OLS) Regression

Ordinary Least Squares (OLS) regression models a single continuous response $y$ as a linear function of $p$ predictors $X_1, X_2, \dots, X_p$:

$$y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i$$

In matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$$

The OLS estimator minimises the sum of squared residuals:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Multivariate Linear Models extend this framework to the case where multiple response variables are measured simultaneously.
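The OLS estimator above can be sketched numerically. This is a minimal illustration on synthetic data (all variable names are illustrative, not part of any DataStatPro API); the normal equations are solved with a linear solve rather than an explicit inverse for numerical stability.

```python
import numpy as np

# Synthetic data: n observations, p = 3 predictors plus an intercept column.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'y, via the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this little noise and this many observations, `beta_hat` recovers `beta_true` closely.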

1.2 Matrices and Linear Algebra

Key matrix operations used throughout this tutorial:

- Transpose ($\mathbf{A}^T$) and matrix multiplication
- Inverse ($\mathbf{A}^{-1}$) of a square, full-rank matrix
- Determinant ($|\mathbf{A}|$) and trace ($\text{tr}(\mathbf{A})$)
- Eigenvalues and eigenvectors (Section 1.5)
- The Kronecker product ($\otimes$) and the $\text{vec}$ operator

1.3 The Multivariate Normal Distribution

The multivariate normal distribution $\mathcal{N}_q(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ has density:

$$f(\mathbf{y}) = \frac{1}{(2\pi)^{q/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right\}$$

Where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\Sigma}$ is the covariance matrix. MLMs assume that the response vectors follow this distribution conditionally on the predictors.

1.4 Covariance and Correlation Matrices

The covariance matrix $\boldsymbol{\Sigma}$ ($q \times q$) for $q$ response variables contains the variances $\sigma_{jj}$ on the diagonal and the covariances $\sigma_{jk}$ between responses $j$ and $k$ off the diagonal.

The correlation matrix $\mathbf{R}$ standardises covariances:

$$R_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}\sigma_{kk}}}$$

1.5 Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$, eigenvalue-eigenvector pairs $(\lambda_s, \mathbf{v}_s)$ satisfy $\mathbf{A}\mathbf{v}_s = \lambda_s\mathbf{v}_s$. They are critical for computing multivariate test statistics and understanding the structure of relationships in MLMs.

1.6 The Hat Matrix

In OLS regression, the hat (projection) matrix:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$

projects the response vector $\mathbf{y}$ onto the column space of $\mathbf{X}$: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. The diagonal elements $h_{ii}$ of the hat matrix measure the leverage of each observation. The same matrix plays the same role in MLMs, now projecting each response column.
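The defining properties of the hat matrix can be checked numerically. A minimal sketch on synthetic data (names are illustrative): $\mathbf{H}$ is idempotent, and its trace equals the number of columns of $\mathbf{X}$.

```python
import numpy as np

# Build a design matrix with an intercept and p = 2 predictors.
rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# H = X (X'X)^{-1} X', computed via a linear solve
H = X @ np.linalg.solve(X.T @ X, X.T)

# Leverages are the diagonal elements h_ii
leverages = np.diag(H)
```

Because $\mathbf{H}$ is a projection, $\mathbf{H}\mathbf{H} = \mathbf{H}$ and $\text{tr}(\mathbf{H}) = p + 1$; with an intercept present, every leverage is at least $1/n$.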


2. What are Multivariate Linear Models?

2.1 The Core Idea

A Multivariate Linear Model (MLM) — also called multivariate multiple regression or the general multivariate linear model — simultaneously models the linear relationship between a set of predictor variables (independent variables) and multiple continuous response variables (dependent variables).

Where univariate multiple regression has a single response $\mathbf{y}$ ($n \times 1$), the multivariate linear model has a response matrix $\mathbf{Y}$ ($n \times q$) with $q \geq 2$ response variables measured on the same $n$ observations.

The model is:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$$

Where:

- $\mathbf{Y}$ ($n \times q$) is the response matrix
- $\mathbf{X}$ ($n \times (p+1)$) is the design matrix (including an intercept column)
- $\mathbf{B}$ ($(p+1) \times q$) is the coefficient matrix
- $\mathbf{E}$ ($n \times q$) is the error matrix

2.2 Why Use MLM Instead of Separate Regressions?

A natural question is: "Why not simply run $q$ separate univariate regressions, one for each response?" There are several compelling reasons to prefer the multivariate approach:

| Reason | Explanation |
|---|---|
| Controls familywise Type I error | Running $q$ separate tests inflates the overall error rate. MLM tests all responses simultaneously at a single $\alpha$. |
| Exploits correlations among responses | Separate regressions ignore that responses are correlated. MLM uses the full error covariance structure, improving efficiency. |
| Detects multivariate effects | Predictors may not significantly relate to any single response but may have a significant combined effect on the response set. MLM detects these patterns. |
| Produces joint confidence regions | MLM yields joint confidence regions for multiple coefficients simultaneously — something separate regressions cannot provide. |
| Tests multivariate hypotheses directly | Hypotheses about linear combinations of coefficients across responses (e.g., "Does $X_1$ affect $Y_1$ and $Y_2$ equally?") can be tested directly in MLM. |
| More powerful when responses are correlated | When responses share common error variance, accounting for correlations can increase power compared to separate analyses. |

2.3 The Generality of the Multivariate Linear Model

The multivariate linear model is remarkably general. Many seemingly distinct statistical procedures are special cases:

| Special Case | Description |
|---|---|
| Univariate Multiple Regression | $q = 1$ (single response) |
| Multivariate ANOVA (MANOVA) | $\mathbf{X}$ contains only categorical predictors (dummy-coded) |
| Multivariate ANCOVA (MANCOVA) | $\mathbf{X}$ contains both categorical and continuous predictors |
| Seemingly Unrelated Regression (SUR) | $q$ regression equations with different predictors per equation |
| Profile Analysis | $\mathbf{X}$ contains group indicators; $\mathbf{Y}$ contains repeated measures |
| Canonical Correlation Analysis | Tests overall association between predictor set and response set |
| Discriminant Function Analysis | $\mathbf{X}$ contains group indicators; follows from MLM eigenstructure |

2.4 Real-World Applications

Multivariate Linear Models are used across a wide range of applied fields — for example, psychology (several outcome scales per participant), medicine (multiple biomarkers per patient), education (scores on several tests), ecology, and economics.


3. The Mathematical Framework

3.1 The Multivariate Linear Model

The core model is:

Yn×q=Xn×(p+1)B(p+1)×q+En×q\mathbf{Y}_{n \times q} = \mathbf{X}_{n \times (p+1)} \mathbf{B}_{(p+1) \times q} + \mathbf{E}_{n \times q}

Each row yiT\mathbf{y}_i^T (observation ii) satisfies:

yi=BTxi+ϵi,ϵiNq(0,Σ)\mathbf{y}_i = \mathbf{B}^T \mathbf{x}_i + \boldsymbol{\epsilon}_i, \quad \boldsymbol{\epsilon}_i \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma})

Where:

Written differently, the distribution of each row is:

yiTxiNq(xiTB, Σ)\mathbf{y}_i^T \mid \mathbf{x}_i \sim \mathcal{N}_q\left(\mathbf{x}_i^T \mathbf{B},\ \boldsymbol{\Sigma}\right)

3.2 The Coefficient Matrix B\mathbf{B}

The coefficient matrix $\mathbf{B}$ has dimensions $(p+1) \times q$:

$$\mathbf{B} = \begin{pmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0q} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1q} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p1} & \beta_{p2} & \cdots & \beta_{pq} \end{pmatrix}$$

Column $j$ contains the regression coefficients for response $Y_j$; row $k$ contains the coefficients of predictor $X_k$ across all $q$ responses.

3.3 The Error Structure

The error matrix $\mathbf{E}$ has the following properties:

$$E[\mathbf{E}] = \mathbf{0}_{n \times q}$$

$$E[\mathbf{e}_i \mathbf{e}_i^T] = \boldsymbol{\Sigma}, \quad E[\mathbf{e}_i \mathbf{e}_j^T] = \mathbf{0} \text{ for } i \neq j$$

This means the errors are independently and identically distributed (i.i.d.) multivariate normal. Equivalently, using the vec operator:

$$\text{vec}(\mathbf{E}) \sim \mathcal{N}_{nq}\left(\mathbf{0},\ \boldsymbol{\Sigma} \otimes \mathbf{I}_n\right)$$

The Kronecker product $\boldsymbol{\Sigma} \otimes \mathbf{I}_n$ captures the covariance structure: observations are independent (the block-diagonal structure from $\mathbf{I}_n$) but responses within each observation may be correlated (captured by $\boldsymbol{\Sigma}$).
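The Kronecker structure is easy to inspect on a tiny example. A minimal sketch (the $\boldsymbol{\Sigma}$ values are made up for illustration):

```python
import numpy as np

# Two correlated responses (q = 2) and n = 3 independent observations.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

n = 3
# Covariance of vec(E): a (nq x nq) block matrix whose (j,k) block is sigma_jk * I_n
V = np.kron(Sigma, np.eye(n))
```

The blocks of `V` show exactly what the text describes: within-observation covariance $\sigma_{jk}$ on matching positions, zeros everywhere observations differ.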

3.4 The Conditional Mean and Variance

For a new observation with predictor vector $\mathbf{x}_{new}$:

$$E[\mathbf{y}_{new} \mid \mathbf{x}_{new}] = \mathbf{B}^T \mathbf{x}_{new}$$

$$\text{Var}(\mathbf{y}_{new} \mid \mathbf{x}_{new}) = \boldsymbol{\Sigma}$$

The predicted response vector is:

$$\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T \mathbf{x}_{new}$$

The predicted response matrix for all $n$ observations:

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}} = \mathbf{H}\mathbf{Y}$$

Where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix.

3.5 The General Linear Hypothesis

The power of the multivariate linear model framework lies in the ability to test very general hypotheses of the form:

$$H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{\Gamma}_0$$

Where:

- $\mathbf{C}$ ($c \times (p+1)$) is a contrast matrix that selects or combines rows of $\mathbf{B}$ (predictors)
- $\mathbf{M}$ ($q \times m$) is a transformation matrix that selects or combines columns of $\mathbf{B}$ (responses)
- $\mathbf{\Gamma}_0$ ($c \times m$) is a matrix of hypothesised values, most often $\mathbf{0}$

This general framework subsumes all the special cases mentioned in Section 2.3.

Examples of $\mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$ hypotheses:

| Hypothesis | $\mathbf{C}$ | $\mathbf{M}$ |
|---|---|---|
| Test all predictors simultaneously | $\mathbf{I}_p$ (omitting intercept row) | $\mathbf{I}_q$ |
| Test effect of predictor $X_k$ on all DVs | $\mathbf{e}_k^T$ (unit vector) | $\mathbf{I}_q$ |
| Test effect of all predictors on DV $j$ | $\mathbf{I}_p$ | $\mathbf{e}_j$ (unit vector) |
| Test if $X_k$ has equal effects on $Y_1$ and $Y_2$ | $\mathbf{e}_k^T$ | $(1, -1)^T$ |
| MANOVA group effect | Group contrast matrix | $\mathbf{I}_q$ |

4. Model Specification and Design Matrices

4.1 The Design Matrix X\mathbf{X}

The design matrix $\mathbf{X}$ ($n \times (p+1)$) encodes all predictor information. The first column is typically a column of ones for the intercept:

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

The design matrix can accommodate:

- Continuous predictors (entered as-is, optionally centred or standardised)
- Categorical predictors (dummy-coded; Section 4.2)
- Interaction terms (products of predictor columns; Section 4.5)
- Polynomial terms (powers of continuous predictors)
4.2 Dummy Coding for Categorical Predictors

For a categorical predictor with $k$ categories, create $k-1$ dummy variables using a reference category. For a three-level factor (A, B, C) with A as reference:

| Category | $D_B$ | $D_C$ |
|---|---|---|
| A (reference) | 0 | 0 |
| B | 1 | 0 |
| C | 0 | 1 |

The coefficient for $D_B$ in column $j$ of $\hat{\mathbf{B}}$ represents the difference in the mean of $Y_j$ between groups B and A, holding other predictors constant.
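The dummy-coding scheme above can be sketched directly. A minimal illustration (the factor levels are made up):

```python
import numpy as np

# A three-level factor with A as the reference category.
levels = np.array(["A", "B", "C", "B", "A", "C"])

# Reference-cell coding: one indicator per non-reference level
D_B = (levels == "B").astype(float)
D_C = (levels == "C").astype(float)

# Design matrix: intercept column plus the two dummies
X = np.column_stack([np.ones(len(levels)), D_B, D_C])
```

Each row of `X` encodes one observation: group A rows are `[1, 0, 0]`, group B rows are `[1, 1, 0]`, and group C rows are `[1, 0, 1]`.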

4.3 The Response Matrix Y\mathbf{Y}

The response matrix $\mathbf{Y}$ ($n \times q$) contains all $q$ response variables:

$$\mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nq} \end{pmatrix}$$

Each column is a separate response variable. Each row is the complete response profile for one observation.

4.4 Centering and Scaling

Centering predictors (subtracting their means) is generally recommended to:

- Make the intercept interpretable as the expected response at average predictor values
- Reduce collinearity between main effects and their interaction or polynomial terms
- Improve the numerical stability of $(\mathbf{X}^T\mathbf{X})^{-1}$

Standardising both predictors and responses (centering + scaling to unit variance) additionally makes regression coefficients comparable across variables with different units and scales — the standardised coefficient represents the change in the response, in standard deviation units of the response, for a one-standard-deviation change in the predictor (Section 9.4).

4.5 Interaction Terms

An interaction term $X_j \times X_k$ is formed as the element-wise product of the two predictor columns:

$$\mathbf{x}_{jk} = \mathbf{x}_j \odot \mathbf{x}_k$$

Including an interaction in the model allows the effect of $X_j$ on the response vector to depend on the value of $X_k$ (and vice versa). For categorical × continuous interactions, the interaction allows the regression slope to differ across groups.


5. Assumptions of Multivariate Linear Models

5.1 Linearity

Assumption: The relationship between each predictor $X_k$ and each response $Y_j$ is linear (on the scale of the model), holding all other predictors constant:

$$E[Y_{ij} \mid \mathbf{x}_i] = \beta_{0j} + \beta_{1j}X_{i1} + \dots + \beta_{pj}X_{ip}$$

How to check: Partial residual plots (component-plus-residual plots) for each predictor-response combination; scatter plots of residuals against each predictor.

Consequences of violation: Biased and inconsistent coefficient estimates; poor prediction.

Remedies: Polynomial terms, log transformations of predictors or responses, spline terms, generalised additive models.

5.2 Multivariate Normality of Errors

Assumption: The error vectors $\boldsymbol{\epsilon}_i$ follow a multivariate normal distribution:

$$\boldsymbol{\epsilon}_i \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma})$$

How to check: Univariate Q-Q plots of the residuals for each response; a chi-square Q-Q plot of the squared Mahalanobis distances of the residual vectors; formal tests of multivariate skewness and kurtosis (e.g., Mardia's tests).

Consequences of violation: OLS estimates remain unbiased and consistent (Gauss-Markov theorem extends to MLM) but inference (hypothesis tests, confidence intervals) based on the multivariate normal assumption is affected, particularly with small samples.

Remedies: Transformations (log, Box-Cox) of skewed responses; robust estimation; bootstrap inference.

5.3 Independence of Observations

Assumption: The error vectors $\boldsymbol{\epsilon}_i$ are independent across observations:

$$\text{Cov}(\boldsymbol{\epsilon}_i, \boldsymbol{\epsilon}_j) = \mathbf{0} \quad \text{for } i \neq j$$

How to check: Consider the study design. Look for temporal autocorrelation (Durbin-Watson statistics per response), spatial correlation, or clustering structure.

Consequences of violation: Biased standard errors, invalid hypothesis tests, false significance.

Remedies: Mixed models (multilevel MLM) for clustered data; GEE for repeated measures; time-series corrections for temporal dependence.

5.4 Homoscedasticity (Constant Error Covariance)

Assumption: The error covariance matrix $\boldsymbol{\Sigma}$ is the same for all observations — it does not depend on the values of the predictors or the fitted values:

$$\text{Var}(\boldsymbol{\epsilon}_i) = \boldsymbol{\Sigma} \quad \text{for all } i$$

How to check: Residuals-vs-fitted and scale-location plots for each response (Section 11.2); formal tests such as Breusch-Pagan applied per response.

Consequences of violation: OLS estimates remain unbiased but are no longer BLUE (Best Linear Unbiased Estimator); standard errors are biased.

Remedies: Weighted least squares (if the heteroscedasticity pattern is known); heteroscedasticity-consistent (HC) standard errors; transformations of responses.

5.5 No Perfect Multicollinearity Among Predictors

Assumption: No predictor variable is a perfect linear combination of other predictor variables. Equivalently, $\mathbf{X}^T\mathbf{X}$ must be invertible (full rank).

How to check: Variance Inflation Factor (VIF) for each predictor: $VIF_k = 1/(1 - R^2_k)$, where $R^2_k$ is the R² from regressing $X_k$ on all other predictors. VIF > 10 is a common threshold for concern.

Consequences of violation: $(\mathbf{X}^T\mathbf{X})^{-1}$ does not exist; coefficient estimates are undefined or numerically unstable with inflated standard errors.

Remedies: Remove redundant predictors; use ridge regression or PLS; combine highly correlated predictors into a composite.
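The VIF definition above can be sketched directly from its formula. A minimal illustration on synthetic data in which one predictor is nearly collinear with another (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

def vif(X, k):
    """VIF_k = 1 / (1 - R^2_k), regressing column k on the remaining columns."""
    others = np.delete(X, k, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Z, X[:, k], rcond=None)[0]
    resid = X[:, k] - Z @ beta
    centred = X[:, k] - X[:, k].mean()
    r2 = 1.0 - (resid @ resid) / (centred @ centred)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, k) for k in range(X.shape[1])]
```

Here `x1` and `x3` should show VIFs far above the usual threshold of 10, while `x2` stays near 1.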

5.6 No Influential Outliers

Assumption: No single observation has undue influence on the estimated coefficient matrix $\hat{\mathbf{B}}$.

How to check: Cook's distance (multivariate extension), hat matrix diagonal $h_{ii}$, DFFITS, standardised residuals.

Consequences of violation: Estimated coefficients may be heavily distorted by a single atypical observation.

Remedies: Investigate outlying observations for data errors; use robust regression; report sensitivity analyses with and without influential points.

5.7 Sufficient Sample Size

Recommendation: For MLM to be reliable:

- The residual degrees of freedom must exceed the number of responses: $n - p - 1 > q$, otherwise the estimated error covariance matrix is singular (Section 6.3)
- Substantially larger samples are advisable for stable estimates — a common rule of thumb is at least 10–20 observations per estimated parameter


6. Parameter Estimation

6.1 The Ordinary Least Squares (OLS) Estimator

The OLS estimator for $\mathbf{B}$ minimises the total sum of squared residuals simultaneously across all response variables:

$$\hat{\mathbf{B}} = \arg\min_{\mathbf{B}} \text{tr}\left[(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\right]$$

Taking the matrix derivative and setting it to zero:

$$\frac{\partial}{\partial \mathbf{B}}\text{tr}\left[(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\right] = -2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\mathbf{B}) = \mathbf{0}$$

Solving the normal equations $\mathbf{X}^T\mathbf{X}\hat{\mathbf{B}} = \mathbf{X}^T\mathbf{Y}$:

$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Critically: this is simply the matrix of coefficients obtained by running $q$ separate OLS regressions — one for each column of $\mathbf{Y}$. The multivariate OLS estimator is column-by-column identical to the univariate OLS estimator applied $q$ times.
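The column-by-column equivalence can be verified numerically. A minimal sketch on synthetic data (names are illustrative): the one-shot multivariate solve and $q$ separate univariate fits produce the same coefficient matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 150, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = rng.normal(size=(p + 1, q))
Y = X @ B + rng.normal(scale=0.2, size=(n, q))

# Multivariate OLS: solve the normal equations once for all q responses
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# q separate univariate OLS fits, one per column of Y
B_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(q)]
)
```

Both routes give the same $(p+1) \times q$ matrix up to floating-point error — the multivariate gain lies in the joint inference, not in the point estimates.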

6.2 Properties of the OLS Estimator

Under the standard assumptions:

Unbiasedness: $E[\hat{\mathbf{B}}] = \mathbf{B}$

Covariance structure:

$$\text{Cov}(\hat{\boldsymbol{\beta}}_j, \hat{\boldsymbol{\beta}}_k) = \sigma_{jk}(\mathbf{X}^T\mathbf{X})^{-1}$$

Where $\sigma_{jk}$ is the $(j,k)$ element of $\boldsymbol{\Sigma}$ (the error covariance between responses $j$ and $k$). In full:

$$\text{Cov}(\text{vec}(\hat{\mathbf{B}})) = \boldsymbol{\Sigma} \otimes (\mathbf{X}^T\mathbf{X})^{-1}$$

Gauss-Markov (multivariate): Among all linear unbiased estimators, $\hat{\mathbf{B}}$ has the minimum covariance matrix (in the Loewner partial order) — it is BLUE.

Maximum Likelihood Equivalence: Under multivariate normality, the OLS estimator is also the maximum likelihood estimator (MLE) of $\mathbf{B}$.

6.3 Estimation of the Error Covariance Matrix

The unbiased estimator of the error covariance matrix $\boldsymbol{\Sigma}$ is:

$$\hat{\boldsymbol{\Sigma}} = \frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n - p - 1} = \frac{(\mathbf{Y} - \hat{\mathbf{Y}})^T(\mathbf{Y} - \hat{\mathbf{Y}})}{n - p - 1}$$

Where $\hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$ is the matrix of OLS residuals.

The MLE of $\boldsymbol{\Sigma}$ (biased, but with maximum likelihood properties) is:

$$\tilde{\boldsymbol{\Sigma}}_{MLE} = \frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n}$$

The unbiased estimator $\hat{\boldsymbol{\Sigma}}$ is generally preferred for inference.

Degrees of freedom: The residual SSCP matrix $\hat{\mathbf{E}}^T\hat{\mathbf{E}}$ has $n - p - 1$ degrees of freedom. Invertibility requires $n - p - 1 > q$.
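Both estimators come from the same residual SSCP matrix and differ only in the divisor. A minimal sketch on synthetic data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 120, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
E_hat = Y - X @ B_hat                      # residual matrix (n x q)
SSCP = E_hat.T @ E_hat                     # residual SSCP (q x q)

Sigma_unbiased = SSCP / (n - p - 1)        # divisor: residual degrees of freedom
Sigma_mle = SSCP / n                       # divisor: n (biased, MLE)
```

With $n - p - 1 > q$, both matrices are symmetric positive definite; the unbiased version is the one used for standard errors and tests.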

6.4 Seemingly Unrelated Regression (SUR) and GLS

When the predictor matrices differ across response variables (i.e., each response has its own set of predictors), running standard OLS on each response separately is not the most efficient approach. Seemingly Unrelated Regression (SUR) applies Generalised Least Squares (GLS) using the full covariance structure:

$$\hat{\mathbf{B}}_{GLS} = \left[\mathbf{X}^T(\hat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{I}_n)\mathbf{X}\right]^{-1}\mathbf{X}^T(\hat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{I}_n)\text{vec}(\mathbf{Y})$$

When all responses share the same predictor matrix $\mathbf{X}$ (the standard MLM case), SUR and OLS give identical estimates. The GLS advantage only manifests when predictor matrices differ across equations.

6.5 The Maximum Likelihood Estimator

The log-likelihood for the multivariate linear model is:

$$\ell(\mathbf{B}, \boldsymbol{\Sigma}) = -\frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\text{tr}\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\right] + \text{const}$$

Maximising over $\mathbf{B}$ gives $\hat{\mathbf{B}}_{MLE} = \hat{\mathbf{B}}_{OLS}$ (as stated above). Maximising over $\boldsymbol{\Sigma}$ gives $\tilde{\boldsymbol{\Sigma}}_{MLE} = \hat{\mathbf{E}}^T\hat{\mathbf{E}}/n$.

The maximised log-likelihood is:

$$\ell(\hat{\mathbf{B}}, \tilde{\boldsymbol{\Sigma}}_{MLE}) = -\frac{n}{2}\left[q\ln(2\pi) + \ln|\tilde{\boldsymbol{\Sigma}}_{MLE}| + q\right]$$


7. Hypothesis Testing and Inference

7.1 The General Linear Hypothesis Framework

The unified framework for hypothesis testing in MLM is the General Linear Hypothesis:

$$H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{\Gamma}_0$$

For testing $H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$ (the most common case), the Hypothesis SSCP matrix is:

$$\mathbf{H} = \mathbf{M}^T\hat{\mathbf{B}}^T\mathbf{C}^T\left[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T\right]^{-1}\mathbf{C}\hat{\mathbf{B}}\mathbf{M}$$

The Error SSCP matrix (the same for all hypotheses on a given $\mathbf{M}$):

$$\mathbf{E} = \mathbf{M}^T\hat{\mathbf{E}}^T\hat{\mathbf{E}}\mathbf{M} = \mathbf{M}^T(\mathbf{Y} - \hat{\mathbf{Y}})^T(\mathbf{Y} - \hat{\mathbf{Y}})\mathbf{M}$$

When $\mathbf{M} = \mathbf{I}_q$, $\mathbf{E} = \hat{\mathbf{E}}^T\hat{\mathbf{E}}$.

Test statistics are based on the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ (see Section 8).

7.2 Testing Individual Predictor Effects (Row of B\mathbf{B})

To test whether predictor $X_k$ has any effect on the set of responses ($H_0: \boldsymbol{\beta}_{k\cdot} = \mathbf{0}^T$, where $\boldsymbol{\beta}_{k\cdot}$ is the $k$-th row of $\mathbf{B}$):

Set $\mathbf{C} = \mathbf{e}_k^T$ (a row vector with 1 in position $k$ and zeros elsewhere) and $\mathbf{M} = \mathbf{I}_q$.

This produces a multivariate test of whether $X_k$ has zero effect on all $q$ responses simultaneously. The four multivariate test statistics (Section 8) are applied.

7.3 Testing a Subset of Predictors

To test whether a subset of $c$ predictors jointly contributes to the model ($H_0: \mathbf{C}\mathbf{B} = \mathbf{0}$ for a $c \times (p+1)$ matrix $\mathbf{C}$):

The hypothesis SSCP matrix with $df_H = c$ is compared to the error SSCP matrix with $df_E = n - p - 1$.

7.4 Testing Specific Linear Combinations of Responses

To test whether a predictor has differential effects on different responses (e.g., $H_0: \beta_{k1} = \beta_{k2}$):

Set $\mathbf{C} = \mathbf{e}_k^T$ and $\mathbf{M} = (1, -1)^T$.

The transformed response is $Y_1 - Y_2$; the test becomes a univariate $t$-test on the difference.

7.5 Univariate Tests Within the MLM Framework

For each response variable $Y_j$ individually, the standard univariate $F$-test for predictor $X_k$ is obtained by setting $\mathbf{M} = \mathbf{e}_j$ and applying the Wald test:

$$F_{kj} = \frac{\hat{\beta}_{kj}^2}{\hat{\sigma}_{jj}^{(kk)}} \sim F_{1,\, n-p-1}$$

Where $\hat{\sigma}_{jj}^{(kk)} = \hat{\sigma}_{jj} \left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{kk}$ is the estimated variance of $\hat{\beta}_{kj}$. This $F$ statistic is the square of the $t$ statistic in Section 7.6.

7.6 Wald Tests for Individual Coefficients

For a single coefficient $\beta_{kj}$ (predictor $k$, response $j$):

$$t_{kj} = \frac{\hat{\beta}_{kj}}{SE(\hat{\beta}_{kj})} \sim t_{n-p-1}$$

Where:

$$SE(\hat{\beta}_{kj}) = \sqrt{\hat{\sigma}_{jj} \cdot \left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{kk}}$$

A $(1-\alpha) \times 100\%$ confidence interval for $\beta_{kj}$:

$$\hat{\beta}_{kj} \pm t_{\alpha/2,\, n-p-1} \times SE(\hat{\beta}_{kj})$$
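The standard error and interval above can be sketched end to end. A minimal illustration on synthetic data (the indices `k`, `j` and all values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 100, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = np.array([[1.0, 0.5],
              [2.0, 0.0],
              [0.0, -1.0]])
Y = X @ B + rng.normal(scale=0.3, size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
Sigma_hat = E_hat.T @ E_hat / (n - p - 1)   # unbiased error covariance

k, j = 1, 0                                  # coefficient of X_1 on Y_1
se = np.sqrt(Sigma_hat[j, j] * XtX_inv[k, k])
t_crit = stats.t.ppf(0.975, n - p - 1)
ci = (B_hat[k, j] - t_crit * se, B_hat[k, j] + t_crit * se)
```

The interval has half-width $t_{\alpha/2,\,n-p-1} \times SE(\hat{\beta}_{kj})$, exactly as in the formula above.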

7.7 Simultaneous Confidence Regions

A key advantage of MLM is the ability to form joint confidence regions for multivariate coefficient vectors. The $100(1-\alpha)\%$ joint confidence region for the $k$-th row $\boldsymbol{\beta}_{k\cdot}$ of $\mathbf{B}$ (all $q$ responses simultaneously) is an ellipsoid:

$$\left(\hat{\boldsymbol{\beta}}_{k\cdot} - \boldsymbol{\beta}_{k\cdot}\right)^T \left[\hat{\boldsymbol{\Sigma}} \cdot \left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{kk}\right]^{-1} \left(\hat{\boldsymbol{\beta}}_{k\cdot} - \boldsymbol{\beta}_{k\cdot}\right) \leq \frac{q(n-p-1)}{n-p-q}\, F_{\alpha,\, q,\, n-p-q}$$

7.8 Testing the Overall Model

The overall model test asks whether any predictor (excluding the intercept) has a significant effect on any response:

$$H_0: \mathbf{B}_{-0} = \mathbf{0}_{p \times q}$$

Where $\mathbf{B}_{-0}$ is $\mathbf{B}$ with the intercept row removed. This is tested with $\mathbf{C} = [\mathbf{0}_p \mid \mathbf{I}_p]$ (omitting the intercept column) and $\mathbf{M} = \mathbf{I}_q$.


8. Multivariate Test Statistics

All multivariate tests in MLM are based on the eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{s^*}$ of $\mathbf{E}^{-1}\mathbf{H}$, where $s^* = \min(c, m)$, $c$ is the rank of $\mathbf{C}$ (the number of hypothesis degrees of freedom), and $m$ is the number of transformed responses (columns of $\mathbf{M}$).

8.1 Wilks' Lambda ($\Lambda^*$)

$$\Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H} + \mathbf{E}|} = \prod_{s=1}^{s^*} \frac{1}{1 + \lambda_s}$$

Range: $0 \leq \Lambda^* \leq 1$; smaller values indicate stronger effects.

F-approximation (Rao):

$$F = \frac{1 - \Lambda^{*1/t}}{\Lambda^{*1/t}} \cdot \frac{df_2}{df_1}$$

Where:

$$t = \sqrt{\frac{m^2 c^2 - 4}{m^2 + c^2 - 5}}, \quad df_1 = mc, \quad df_2 = t\left(df_E - \frac{m - c + 1}{2}\right) - \frac{mc - 2}{2}$$

The approximation is exact when $m \leq 2$ or $c \leq 2$.

8.2 Pillai's Trace ($V$)

$$V = \text{tr}\left[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}\right] = \sum_{s=1}^{s^*}\frac{\lambda_s}{1+\lambda_s}$$

Range: $0 \leq V \leq s^*$.

F-approximation:

$$F = \frac{2n_X + s^* + 1}{2n_Y + s^* + 1} \cdot \frac{V}{s^* - V} \sim F_{s^*(2n_Y+s^*+1),\ s^*(2n_X+s^*+1)}$$

Where $n_X = (df_E - m - 1)/2$ and $n_Y = (|c - m| - 1)/2$.

Most robust to violations of assumptions.

8.3 Hotelling-Lawley Trace ($U$)

$$U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{s=1}^{s^*}\lambda_s$$

Range: $0 \leq U < \infty$.

F-approximation (with $n_X$ and $n_Y$ as defined in Section 8.2):

$$F = \frac{2(s^* n_X + 1)\, U}{s^{*2}(2n_Y + s^* + 1)} \sim F_{s^*(2n_Y+s^*+1),\ 2(s^* n_X + 1)} \quad \text{(approximate)}$$

Most powerful when effects are concentrated on a single dimension.

8.4 Roy's Largest Root ($\theta$)

$$\theta = \frac{\lambda_1}{1 + \lambda_1}$$

Range: $0 \leq \theta \leq 1$.

The usual F-approximation for Roy's root is an upper bound on the statistic's distribution, so the resulting $p$-value is a lower bound (anti-conservative). Exact tables are needed for precise $p$-values.

8.5 Comparison of the Four Statistics in MLM Context

| Statistic | $H_0$ Rejected When | Best For | Robustness |
|---|---|---|---|
| Wilks' $\Lambda^*$ | $\Lambda^*$ small | General; effects spread | Moderate |
| Pillai's $V$ | $V$ large | Effects spread; small $n$; unequal $n$ | Highest |
| Hotelling-Lawley $U$ | $U$ large | Effects on one dimension | Lowest |
| Roy's $\theta$ | $\theta$ large | Dominant single dimension | Lowest |
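All four statistics come from the same eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$, so they can be computed together. A minimal sketch testing a single predictor against all responses on synthetic data (names and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = np.zeros((p + 1, q))
B[1] = [0.5, 0.3, 0.0]                              # only X_1 has an effect
Y = X @ B + rng.normal(size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
E = E_hat.T @ E_hat                                 # error SSCP (M = I_q)

C = np.array([[0.0, 1.0, 0.0]])                     # select the row of B for X_1
CB = C @ B_hat
H = CB.T @ np.linalg.inv(C @ XtX_inv @ C.T) @ CB    # hypothesis SSCP

eigs = np.sort(np.real(np.linalg.eigvals(np.linalg.inv(E) @ H)))[::-1]
wilks = np.prod(1.0 / (1.0 + eigs))
pillai = np.sum(eigs / (1.0 + eigs))
hotelling = np.sum(eigs)
roy = eigs[0] / (1.0 + eigs[0])
```

Here $c = 1$, so $s^* = 1$: only one eigenvalue is non-zero, and the four statistics are simple transforms of each other (e.g. Pillai's trace equals Roy's root, and Wilks' $\Lambda^*$ equals $1 - V$).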

8.6 Likelihood Ratio Test

An alternative formulation: under $H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$, the likelihood ratio statistic is:

$$\Lambda_{LR} = \left(\frac{|\mathbf{E}|}{|\mathbf{H}+\mathbf{E}|}\right)^{n/2} = (\Lambda^*)^{n/2}$$

The corresponding test statistic, with Bartlett's correction:

$$\chi^2 \approx -\left[n - p - 1 - \frac{m - c + 1}{2}\right]\ln(\Lambda^*)$$

Asymptotically $\sim \chi^2_{mc}$ under $H_0$.


9. Effect Size Measures

9.1 Multivariate Eta-Squared ($\eta^2_p$)

The most widely reported multivariate effect size for each hypothesis test:

From Wilks' Lambda:

$$\eta^2_p = 1 - \Lambda^{*1/t}$$

Where $t$ is defined as in Section 8.1.

From Pillai's Trace:

$$\eta^2_p \approx \frac{V}{s^*}$$

Benchmarks:

| $\eta^2_p$ | Effect Size |
|---|---|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |

9.2 Multivariate Omega-Squared ($\omega^2$)

A less biased (corrected) multivariate effect size:

$$\omega^2 = \frac{df_H(F - 1)}{df_H(F - 1) + n}$$

Applied to the $F$-approximation from the multivariate tests. It can be computed separately from each test statistic's $F$ approximation.

9.3 Univariate Effect Sizes per Response

For each response variable $Y_j$, report the univariate $R^2$:

$$R^2_j = 1 - \frac{SS_{res,j}}{SS_{tot,j}} = 1 - \frac{\hat{\mathbf{e}}_j^T\hat{\mathbf{e}}_j}{\sum_i(y_{ij}-\bar{y}_j)^2}$$

And the univariate partial $\eta^2_p$ for each predictor-response combination:

$$\eta^2_{p,kj} = \frac{SS_{k,j}}{SS_{k,j} + SS_{res,j}}$$

9.4 Standardised Coefficients

Standardised regression coefficients ($\beta^*_{kj}$) are equivalent to fitting the model after standardising both predictors and responses:

$$\beta^*_{kj} = \hat{\beta}_{kj} \cdot \frac{s_{X_k}}{s_{Y_j}}$$

Where $s_{X_k}$ and $s_{Y_j}$ are the standard deviations of predictor $k$ and response $j$. Standardised coefficients represent the change in $Y_j$ (in SD units) for a one-SD change in $X_k$, facilitating comparison across predictors and responses.
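The rescaling identity can be verified by refitting on z-scored variables. A minimal single-predictor sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400
x = 3.0 * rng.normal(size=n) + 10.0          # predictor on an arbitrary scale
y = 2.0 * x + rng.normal(size=n)

# Fit on the raw scale, then rescale the slope: beta* = b * s_x / s_y
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
beta_std = b[1] * x.std(ddof=1) / y.std(ddof=1)

# Refit on z-scored variables: the slope should equal beta_std
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
Z = np.column_stack([np.ones(n), zx])
b_z = np.linalg.lstsq(Z, zy, rcond=None)[0]
```

For a single predictor, the standardised slope is just the sample correlation between `x` and `y`, which both routes recover identically.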

9.5 Canonical Correlations

From the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$, the canonical correlations between the predictor and response spaces are:

$$r_{c,s} = \sqrt{\frac{\lambda_s}{1 + \lambda_s}}$$

The canonical $R^2_s = r_{c,s}^2$ represents the proportion of variance in the $s$-th canonical response variate explained by the corresponding canonical predictor variate. These describe the multivariate association structure.

9.6 Multivariate R² (Generalisations)

Several generalisations of $R^2$ to the multivariate case exist:

Pillai's Trace-based R²:

$$R^2_{mult} = \frac{V}{s^*} = \frac{\sum_s \lambda_s/(1+\lambda_s)}{s^*}$$

Wilks' Lambda-based R²:

$$R^2_{Wilks} = 1 - \Lambda^{*1/t}$$

Trace correlation coefficient (average $R^2$ across canonical variates):

$$R^2_{trace} = \frac{\sum_s r_{c,s}^2}{s^*}$$


10. Model Fit and Evaluation

10.1 Univariate R² for Each Response

Report the standard $R^2$ for each of the $q$ response regressions:

$$R^2_j = 1 - \frac{\hat{\mathbf{e}}_j^T\hat{\mathbf{e}}_j}{(\mathbf{y}_j - \bar{y}_j\mathbf{1})^T(\mathbf{y}_j - \bar{y}_j\mathbf{1})}$$

Adjusted R² penalises for model complexity:

$$\bar{R}^2_j = 1 - (1 - R^2_j)\frac{n-1}{n-p-1}$$

10.2 Trace of the Residual SSCP

The trace of the residual SSCP matrix measures total residual variation across all responses:

$$\text{tr}(\hat{\mathbf{E}}^T\hat{\mathbf{E}}) = \sum_{j=1}^q \hat{\mathbf{e}}_j^T\hat{\mathbf{e}}_j = \sum_{j=1}^q SS_{res,j}$$

This is the multivariate analogue of the residual sum of squares.

10.3 The Determinant Criterion ($|\hat{\boldsymbol{\Sigma}}|$)

The generalised variance $|\hat{\boldsymbol{\Sigma}}|$ (the determinant of the estimated error covariance matrix) is a scalar measure of total residual variation that accounts for correlations among responses. Smaller values indicate better overall fit.

Log-determinant criterion:

$$\ln|\hat{\boldsymbol{\Sigma}}| = \ln\left|\frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n-p-1}\right|$$

This quantity appears in the log-likelihood and in information criteria.

10.4 AIC and BIC for Multivariate Models

Akaike Information Criterion:

AIC=nlnΣ~MLE+2q(p+1)AIC = n\ln|\tilde{\boldsymbol{\Sigma}}_{MLE}| + 2q(p+1)

Or using the maximised log-likelihood:

AIC=2(B^,Σ~MLE)+2kAIC = -2\ell(\hat{\mathbf{B}}, \tilde{\boldsymbol{\Sigma}}_{MLE}) + 2k

Where k=q(p+1)+q(q+1)/2k = q(p+1) + q(q+1)/2 is the total number of free parameters (coefficients plus unique elements of Σ\boldsymbol{\Sigma}).

Bayesian Information Criterion:

BIC=2(B^,Σ~MLE)+kln(n)BIC = -2\ell(\hat{\mathbf{B}}, \tilde{\boldsymbol{\Sigma}}_{MLE}) + k\ln(n)

Lower AIC/BIC indicates a better model relative to complexity. Use AIC for predictive accuracy; BIC for parsimony.
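A sketch of the AIC/BIC computation from the ML error covariance, assuming the parameter count $k = q(p+1) + q(q+1)/2$ given above and dropping the additive log-likelihood constant (which cancels when comparing models fitted to the same data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 80, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
E_hat = Y - X @ B_hat
Sigma_mle = (E_hat.T @ E_hat) / n            # ML estimate divides by n

k = q * (p + 1) + q * (q + 1) // 2           # free parameters
sign, logdet = np.linalg.slogdet(Sigma_mle)  # stable log-determinant
# -2 * loglik = n * logdet, up to the constant n*q*(1 + ln 2*pi)
aic = n * logdet + 2 * k
bic = n * logdet + k * np.log(n)
print(aic, bic)
```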

10.5 Model Comparison via Likelihood Ratio Test

To compare a full model (all pp predictors) vs. a reduced model (pcp - c predictors, dropping the cc predictors specified by C\mathbf{C}):

\Lambda_{LR} = \frac{|\hat{\mathbf{E}}_{full}^T\hat{\mathbf{E}}_{full}|}{|\hat{\mathbf{E}}_{reduced}^T\hat{\mathbf{E}}_{reduced}|}

\chi^2 = -\left[n - p - 1 - \frac{m-c+1}{2}\right]\ln(\Lambda_{LR}) \sim \chi^2_{cm} \quad \text{under } H_0

Or equivalently using Wilks' Lambda for the hypothesis being tested.
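The nested-model test can be sketched directly from the residual SSCP determinants. This is a hedged illustration on synthetic data, using the Bartlett-style multiplier $n - p - 1 - (m - c + 1)/2$ with $m$ responses and $c$ dropped predictors:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, q = 100, 3                                                    # m = q responses
X_red = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # reduced: 2 predictors
X_full = np.column_stack([X_red, rng.normal(size=(n, 1))])       # full adds c = 1 predictor
Y = X_red @ rng.normal(size=(3, q)) + rng.normal(size=(n, q))    # extra predictor is null

def resid_sscp(X, Y):
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    E = Y - X @ B
    return E.T @ E

lam = np.linalg.det(resid_sscp(X_full, Y)) / np.linalg.det(resid_sscp(X_red, Y))
p_full, c, m = 3, 1, q
mult = n - p_full - 1 - (m - c + 1) / 2
stat = -mult * np.log(lam)
p_value = chi2.sf(stat, df=c * m)
print(stat, p_value)
```

Because the dropped predictor is unrelated to $\mathbf{Y}$ by construction, the test should usually not reject.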

10.6 Comparing Univariate and Multivariate Fit

An important diagnostic is to compare the multivariate model fit with the fit from qq separate univariate regressions. The coefficient estimates and univariate R2R^2 values are identical in both cases; what the multivariate model adds is the estimated error covariance Σ^\hat{\boldsymbol{\Sigma}} and the ability to test joint hypotheses across responses. If the residual correlations are all near zero, the multivariate analysis offers little beyond the separate regressions.


11. Model Diagnostics and Residuals

11.1 The Residual Matrix

The residual matrix is:

E^=YY^=YXB^=(IH)Y\hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}

Each row ϵ^i=yiy^i\hat{\boldsymbol{\epsilon}}_i = \mathbf{y}_i - \hat{\mathbf{y}}_i is the residual vector for observation ii. A well-fitting model produces residuals that resemble multivariate white noise.

11.2 Univariate Residual Diagnostics (Per Response)

For each response YjY_j, extract the jj-th column of E^\hat{\mathbf{E}} and apply standard univariate regression diagnostics:

Residuals vs. Fitted Values: Plot e^ij\hat{e}_{ij} against y^ij\hat{y}_{ij}. Expect a random scatter around zero with no pattern.

Normal Q-Q Plot: Plot ordered residuals against normal quantiles. Points should fall near the diagonal.

Scale-Location Plot: Plot e^ij\sqrt{|\hat{e}_{ij}|} against y^ij\hat{y}_{ij}. A horizontal band indicates homoscedasticity.

Residuals vs. Predictor Plots: Plot e^ij\hat{e}_{ij} against each XkX_k. Patterns indicate non-linearity or omitted variable bias.

11.3 Multivariate Residual Diagnostics

11.3.1 Mahalanobis Distance of Residuals

The squared Mahalanobis distance of the residual vector for observation ii:

Di2=ϵ^iTΣ^1ϵ^iD^2_i = \hat{\boldsymbol{\epsilon}}_i^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\epsilon}}_i

Under the model, $D^2_i$ approximately follows a $\chi^2_q$ distribution for large $n$. Large $D^2_i$ indicates observation $i$ is poorly fitted on the combined response profile.

Chi-squared Q-Q plot: Plot ordered D(i)2D^2_{(i)} against χq2\chi^2_q quantiles. Departures from linearity indicate non-normality of multivariate residuals or outliers.
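Computing all $n$ squared Mahalanobis residual distances is a one-liner with `einsum`. A sketch on synthetic data; note the exact identity $\sum_i D^2_i = q(n-p-1)$ when $\hat{\boldsymbol{\Sigma}}$ is computed from the same residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
E_hat = Y - X @ B_hat
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)

# D2_i = eps_i^T Sigma^{-1} eps_i for every row at once
D2 = np.einsum('ij,jk,ik->i', E_hat, np.linalg.inv(Sigma_hat), E_hat)
print(D2.mean())   # close to q; exactly q*(n-p-1)/n by construction
```

Sorting `D2` and plotting against `chi2.ppf((np.arange(n) + 0.5) / n, q)` gives the chi-squared Q-Q plot described above.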

11.3.2 Standardised Multivariate Residuals

Standardise each residual vector by the estimated error covariance:

ϵ~i=Σ^1/2ϵ^i\tilde{\boldsymbol{\epsilon}}_i = \hat{\boldsymbol{\Sigma}}^{-1/2}\hat{\boldsymbol{\epsilon}}_i

Under correct specification, ϵ~iNq(0,Iq)\tilde{\boldsymbol{\epsilon}}_i \approx \mathcal{N}_q(\mathbf{0}, \mathbf{I}_q).

11.3.3 Cross-Response Residual Correlation

Compute the correlation matrix of the columns of E^\hat{\mathbf{E}}:

R^ϵ=D1/2Σ^D1/2\hat{\mathbf{R}}_\epsilon = D^{-1/2} \hat{\boldsymbol{\Sigma}} D^{-1/2}

Where D=diag(Σ^)D = \text{diag}(\hat{\boldsymbol{\Sigma}}). This estimates the within-observation correlation structure of the errors. High correlations indicate that the responses share substantial common error variance.

11.4 Leverage and Influence

Leverage hiih_{ii} (diagonal of H\mathbf{H}): The same hat matrix applies to all responses in MLM. High leverage (hii>2(p+1)/nh_{ii} > 2(p+1)/n) indicates an observation with unusual predictor values.

Multivariate Cook's Distance:

DiCook=1q(p+1)tr[(B^B^(i))T(XTX)(B^B^(i))Σ^1]D_i^{Cook} = \frac{1}{q(p+1)}\text{tr}\left[(\hat{\mathbf{B}} - \hat{\mathbf{B}}^{(-i)})^T(\mathbf{X}^T\mathbf{X})(\hat{\mathbf{B}} - \hat{\mathbf{B}}^{(-i)})\hat{\boldsymbol{\Sigma}}^{-1}\right]

Where B^(i)\hat{\mathbf{B}}^{(-i)} is the coefficient matrix estimated without observation ii. An efficient formula:

D_i^{Cook} = \frac{h_{ii}}{(p+1)\,q\,(1-h_{ii})^2} D^2_i

Observations with DiCook>1D_i^{Cook} > 1 (or >4/n> 4/n) warrant investigation.
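The leverage shortcut avoids refitting the model $n$ times. A sketch, writing Cook's distance as $h_{ii} D^2_i / ((p+1)\,q\,(1-h_{ii})^2)$, consistent with the trace definition above; names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 100, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)        # leverages h_ii
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)
D2 = np.einsum('ij,jk,ik->i', E_hat, np.linalg.inv(Sigma_hat), E_hat)

cook = h * D2 / ((p + 1) * q * (1 - h) ** 2)
flagged = np.where(cook > 4 / n)[0]                # common screening cutoff
print(cook.max(), flagged[:5])
```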

Squared DFFITS (Multivariate), the analogue of the squared univariate DFFITS:

DFFITS_i^2 = \frac{h_{ii}}{1-h_{ii}} D^2_i

11.5 Testing for Multivariate Outliers

Bonferroni-corrected Mahalanobis distance test: Compare Di2D^2_i to the critical value:

Dcrit2=χq,α/(2n)2D^2_{crit} = \chi^2_{q,\, \alpha/(2n)}

(Bonferroni-corrected at the $\alpha/(2n)$ level, e.g., $\alpha = 0.05$, $n = 100$, $q = 3$: use $\chi^2_{3,\,0.00025} \approx 19.2$.)

Robust Mahalanobis distances: Use the Minimum Covariance Determinant (MCD) or Minimum Volume Ellipsoid (MVE) estimates of location and scatter, which are resistant to masking effects (outliers hiding other outliers).

11.6 Checking Multicollinearity Among Predictors

Variance Inflation Factor (VIF) for each predictor (same for all responses, since X\mathbf{X} is shared):

VIFk=11Rk2VIF_k = \frac{1}{1 - R^2_k}

Where Rk2R^2_k is the R2R^2 from regressing XkX_k on all other predictors.

| VIF | Interpretation |
|---|---|
| $1 - 5$ | Low multicollinearity |
| $5 - 10$ | Moderate multicollinearity (concern) |
| $> 10$ | Severe multicollinearity (serious problem) |

Condition number of XTX\mathbf{X}^T\mathbf{X}: Ratio of largest to smallest eigenvalue. Values >30> 30 indicate potentially problematic multicollinearity.
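VIFs and the condition number are straightforward to compute from the predictor block alone (intercept excluded). A sketch with deliberately correlated predictors; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlated with x1 by construction
x3 = rng.normal(size=n)
Xp = np.column_stack([x1, x2, x3])         # predictors only, no intercept column

def vif(Xp, k):
    """VIF_k = 1 / (1 - R^2 of X_k regressed on the other predictors)."""
    y = Xp[:, k]
    others = np.column_stack([np.ones(n), np.delete(Xp, k, axis=1)])
    beta = np.linalg.lstsq(others, y, rcond=None)[0]
    resid = y - others @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

vifs = [vif(Xp, k) for k in range(Xp.shape[1])]
eigvals = np.linalg.eigvalsh(Xp.T @ Xp)
cond = eigvals.max() / eigvals.min()       # eigenvalue condition number
print(vifs, cond)
```

Here `vifs[0]` and `vifs[1]` should be inflated (the two variables share variance) while `vifs[2]` stays near 1.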

11.7 Checking Homoscedasticity

For each response $Y_j$, check homoscedasticity using standard univariate tools — the scale-location plot (Section 11.2) or the Breusch-Pagan test applied to the $j$-th residual column.

For the multivariate setting, Box's M test can be adapted to test whether the error covariance matrix Σ\boldsymbol{\Sigma} is the same across subgroups defined by categorical predictors.


12. Interpretation of Coefficients

12.1 Interpreting the Coefficient Matrix B^\hat{\mathbf{B}}

The coefficient matrix B^\hat{\mathbf{B}} has dimensions (p+1)×q(p+1) \times q: each row corresponds to a predictor (the first row to the intercept) and each column to a response. The entry β^kj\hat{\beta}_{kj} is the coefficient of predictor XkX_k in the equation for response YjY_j, so the jj-th column of B^\hat{\mathbf{B}} is exactly the OLS coefficient vector from regressing YjY_j on X\mathbf{X} alone.

12.2 Ceteris Paribus Interpretation

Each coefficient β^kj\hat{\beta}_{kj} represents the partial effect of XkX_k on YjY_j:

β^kj=Y^jXkXk fixed\hat{\beta}_{kj} = \frac{\partial \hat{Y}_j}{\partial X_k}\Bigg|_{X_{-k} \text{ fixed}}

This is the expected change in the mean of YjY_j for a one-unit increase in XkX_k, with all other predictors Xk={X1,,Xk1,Xk+1,,Xp}X_{-k} = \{X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_p\} held constant.

12.3 Comparing Coefficients Across Responses

A key feature of MLM is the ability to compare the effects of a predictor across multiple responses. If the responses are on the same scale (or are standardised), directly compare β^k1\hat{\beta}_{k1}, β^k2\hat{\beta}_{k2}, ..., β^kq\hat{\beta}_{kq} for predictor XkX_k.

To formally test whether the effect of XkX_k on YjY_j equals its effect on YlY_l:

H0:βkj=βklH0:ekTBmjl=0H_0: \beta_{kj} = \beta_{kl} \quad \Leftrightarrow \quad H_0: \mathbf{e}_k^T\mathbf{B}\mathbf{m}_{jl} = 0

Where mjl=ejel\mathbf{m}_{jl} = \mathbf{e}_j - \mathbf{e}_l (contrast between responses jj and ll).

12.4 Interpreting Interactions

For a model with predictors X1X_1, X2X_2, and their interaction X1×X2X_1 \times X_2:

E[YjX1,X2]=β0j+β1jX1+β2jX2+β12jX1X2E[Y_j \mid X_1, X_2] = \beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \beta_{12j}X_1 X_2

The interaction coefficient β12j\beta_{12j} represents the modification of the effect of X1X_1 on YjY_j per unit change in X2X_2 (and vice versa):

E[Yj]X1=β1j+β12jX2\frac{\partial E[Y_j]}{\partial X_1} = \beta_{1j} + \beta_{12j}X_2

12.5 Interpreting Dummy Variables for Categorical Predictors

For a categorical predictor with kk levels (reference = Level 1):

β^Dm,j=yˉm,jadjyˉ1,jadj\hat{\beta}_{D_m, j} = \bar{y}_{m,j}^{adj} - \bar{y}_{1,j}^{adj}

The coefficient for dummy variable DmD_m in the jj-th response equation represents the adjusted mean difference between Level mm and the reference Level 1 on response YjY_j, controlling for all other predictors.

12.6 The Profile of Effects Across Responses

The multivariate coefficient vector for predictor XkX_k is the row:

β^k=(β^k1,β^k2,,β^kq)\hat{\boldsymbol{\beta}}_{k\cdot} = (\hat{\beta}_{k1}, \hat{\beta}_{k2}, \dots, \hat{\beta}_{kq})

Plotting this profile (coefficient value vs. response variable) visualises the pattern of effects of XkX_k across the response set. Parallel or similar profiles across predictors indicate a common underlying structure; divergent profiles suggest differential effects.


13. Variable Selection and Model Comparison

13.1 Criteria for Variable Selection

Variable selection in MLM must account for the joint effect on all qq responses simultaneously. Several criteria are available:

Multivariate AIC/BIC: Compare models using AIC or BIC (see Section 10.4). Select the model with the lowest criterion value.

Sequential Likelihood Ratio Tests: Add predictors one at a time and test whether each addition significantly improves the joint model fit using the χ2\chi^2 approximation to Wilks' Lambda.

Multivariate Mallow's CpC_p:

Cpmult=tr(E^pTE^p)tr(E^fullTE^full/(npfull1))(n2(p+1))C_p^{mult} = \frac{\text{tr}(\hat{\mathbf{E}}_p^T\hat{\mathbf{E}}_p)}{\text{tr}(\hat{\mathbf{E}}_{full}^T\hat{\mathbf{E}}_{full}/(n-p_{full}-1))} - (n - 2(p+1))

Where the sum is over all qq responses.

13.2 All-Subsets Selection

Evaluate all 2p2^p possible subsets of the pp predictors and select the best subset based on multivariate AIC or BIC, cross-validated prediction error, or multivariate Mallow's CpC_p.

Computational note: For large pp, all-subsets becomes infeasible. Use stepwise procedures or regularised alternatives.

13.3 Stepwise Variable Selection

Forward selection:

  1. Start with the intercept-only model.
  2. At each step, add the predictor that produces the greatest reduction in multivariate AIC (or the most significant Wilks' Lambda test).
  3. Stop when no addition improves the criterion.

Backward elimination:

  1. Start with the full model.
  2. At each step, remove the predictor whose removal produces the least deterioration in multivariate fit (highest p-value for Wilks' Lambda test or smallest AIC increase).
  3. Stop when removing any predictor significantly worsens fit.

Bidirectional: Combine forward and backward at each step.

⚠️ Stepwise selection using p-values suffers from multiple testing inflation, instability, and optimistic bias. AIC/BIC-based stepwise is preferred. Always validate the final model on held-out data and interpret with caution.

13.4 Regularised Estimation for High-Dimensional Settings

When pp is large relative to nn, standard OLS estimation becomes unstable. Regularisation methods extend to the multivariate setting:

Multivariate Ridge Regression:

B^ridge=(XTX+λI)1XTY\hat{\mathbf{B}}_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}

The ridge penalty λ>0\lambda > 0 shrinks all coefficients toward zero, stabilising estimation under multicollinearity.
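The closed form above is easy to sketch. In practice the intercept is usually left unpenalised; centring both blocks achieves this, so only slopes are shrunk. A hedged illustration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 50, 4, 2
Xp = rng.normal(size=(n, p))
Y = Xp @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))
Xc = Xp - Xp.mean(axis=0)                 # centred predictors (intercept unpenalised)
Yc = Y - Y.mean(axis=0)

def ridge(Xc, Yc, lam):
    p = Xc.shape[1]
    # (X'X + lambda I)^{-1} X'Y, solved rather than inverted
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Yc)

B0 = ridge(Xc, Yc, 0.0)          # OLS slopes
B10 = ridge(Xc, Yc, 10.0)        # shrunk toward zero
print(np.linalg.norm(B0), np.linalg.norm(B10))
```

The Frobenius norm of the coefficient matrix decreases monotonically as $\lambda$ grows, which is the shrinkage effect described above.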

Multivariate Lasso (Group Lasso): Applies an L1L_1 penalty structure that encourages sparse solutions. The group lasso applies the penalty at the row level of B\mathbf{B} — entire rows (effects of a predictor on all responses) are driven to zero, performing group variable selection.

Penalty:

minBYXBF2+λk=1pβk2\min_{\mathbf{B}} \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{k=1}^p \|\boldsymbol{\beta}_{k\cdot}\|_2

Where βk2=jβkj2\|\boldsymbol{\beta}_{k\cdot}\|_2 = \sqrt{\sum_j \beta_{kj}^2} is the Euclidean norm of the kk-th row of B\mathbf{B}.

13.5 Cross-Validation for Model Selection

kk-fold cross-validation for MLM:

  1. Divide the nn observations into kk folds.
  2. For each fold vv: fit the model on all data except fold vv; predict fold vv.
  3. Compute the cross-validated prediction error:

MSPECV=1ni=1n(yiy^i(vi))T(yiy^i(vi))MSPE_{CV} = \frac{1}{n}\sum_{i=1}^n (\mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v_i)})^T(\mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v_i)})

Written equivalently with the Euclidean norm (the trace criterion):

TraceCV=1ni=1nyiy^i(vi)2Trace_{CV} = \frac{1}{n}\sum_{i=1}^n \|\mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v_i)}\|^2

Select the model minimising the cross-validated prediction error.
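The three steps above can be sketched compactly. This example compares a one-predictor and a two-predictor model by 5-fold cross-validated trace error; the data-generating process and names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 90
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Both responses depend on x1; the first also depends on x2.
Y = np.column_stack([1 + 2 * x1 + 0.5 * x2, -1 + x1]) \
    + rng.normal(scale=0.3, size=(n, 2))

def cv_mspe(X, Y, k=5):
    idx = np.arange(n)
    sse = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                       # fit without fold v
        B = np.linalg.lstsq(X[train], Y[train], rcond=None)[0]
        resid = Y[fold] - X[fold] @ B                         # predict fold v
        sse += (resid ** 2).sum()
    return sse / n                                            # trace criterion

X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x1, x2])
print(cv_mspe(X1, Y), cv_mspe(X2, Y))
```

The model including `x2` should achieve the lower cross-validated error, since `x2` carries real signal for the first response.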


14. Special Cases and Connections to Other Methods

14.1 Univariate Multiple Regression (q=1q = 1)

When q=1q = 1, the response matrix Y\mathbf{Y} reduces to a vector y\mathbf{y} and the coefficient matrix B\mathbf{B} reduces to a vector β\boldsymbol{\beta}. All MLM formulas reduce to standard OLS:

β^=(XTX)1XTy\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

All multivariate test statistics reduce to the standard FF-test.

14.2 MANOVA as a Special Case

When all predictors in X\mathbf{X} are indicator variables (dummy coding of group membership) and no continuous covariates are included, the MLM reduces to MANOVA. The hypothesis H0:CB=0H_0: \mathbf{C}\mathbf{B} = \mathbf{0} with C\mathbf{C} specifying group contrasts is the MANOVA test for group mean vector differences.

The SSCP matrices then match the familiar MANOVA decomposition: the hypothesis SSCP H\mathbf{H} is the between-groups matrix and the error SSCP E\mathbf{E} is the within-groups matrix, with T=H+E\mathbf{T} = \mathbf{H} + \mathbf{E}.

14.3 MANCOVA as a Special Case

When X\mathbf{X} contains both group indicators (categorical) and continuous covariates, the MLM reduces to MANCOVA. The MANCOVA test for group differences (after adjusting for covariates) corresponds to the MLM hypothesis test H0:CgroupB=0H_0: \mathbf{C}_{group}\mathbf{B} = \mathbf{0} where Cgroup\mathbf{C}_{group} specifies contrasts among group effects.

The "adjustment" for covariates happens automatically within the MLM framework — the covariate columns of X\mathbf{X} absorb the variation they explain, leaving purer group comparisons.

14.4 Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis studies the association between two sets of variables: predictors X\mathbf{X} and responses Y\mathbf{Y}. It finds linear combinations:

u=Xaandv=Yb\mathbf{u} = \mathbf{X}\mathbf{a} \quad \text{and} \quad \mathbf{v} = \mathbf{Y}\mathbf{b}

That maximise Cor(u,v)\text{Cor}(\mathbf{u}, \mathbf{v}). The canonical correlations are the square roots of the eigenvalues of:

SXX1SXYSYY1SYX\mathbf{S}_{XX}^{-1}\mathbf{S}_{XY}\mathbf{S}_{YY}^{-1}\mathbf{S}_{YX}

In the MLM framework, CCA provides insight into the overall association structure between the predictor and response sets. The eigenvalues of E1H\mathbf{E}^{-1}\mathbf{H} in the MLM test of the full model are directly related to the squared canonical correlations.

14.5 Discriminant Function Analysis

When X\mathbf{X} consists of group indicators, the eigenvectors of E1H\mathbf{E}^{-1}\mathbf{H} define the linear discriminant functions — the linear combinations of responses that maximally separate the groups. DFA is therefore a post-hoc analysis tool following a significant MANOVA (and thus a significant MLM with categorical predictors).

14.6 Repeated Measures and Profile Analysis

When the qq response variables represent the same variable measured at qq time points on the same observations, the MLM becomes a repeated-measures (profile analysis) model. The response transformation matrix M\mathbf{M} is used to form contrasts among time points (e.g., successive differences), so that testing H0:CBM=0H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0} with such an M\mathbf{M} yields the within-subject tests of change over time.

14.7 Seemingly Unrelated Regression (SUR)

When each response variable has its own distinct set of predictors (not the shared X\mathbf{X} of standard MLM), the Seemingly Unrelated Regression model applies. Standard OLS applied separately to each equation is unbiased but inefficient; GLS using the cross-equation error covariances (Zellner's FGLS estimator) is asymptotically more efficient.


15. Using the Multivariate Linear Models Component

The Multivariate Linear Models component in the DataStatPro application provides a full end-to-end workflow for fitting, evaluating, and interpreting multivariate regression models.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain at least two continuous response variables and one or more predictor variables.

Step 2 — Select Response Variables (Y) Select two or more continuous response variables from the "Response Variables (Y)" panel. These represent the jointly modelled outcomes.

💡 Select response variables that are theoretically related and expected to be influenced by the same set of predictors. Responses that share common error variance benefit most from the multivariate framework.

Step 3 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. Predictors can be continuous (entered as-is) or categorical (automatically dummy-coded against a reference level).

Step 4 — Configure Interactions (Optional) Specify interaction terms by selecting pairs of predictor variables. The application creates the product terms and adds them to the design matrix.

Step 5 — Configure Polynomial Terms (Optional) For continuous predictors, specify polynomial degrees (e.g., quadratic: $X^2$, cubic: $X^3$) to model curvilinear relationships.

Step 6 — Configure Centering and Scaling Select preprocessing for predictors: none, mean-centering, or z-score standardisation. Centering eases interpretation of intercepts and interaction terms; standardisation puts coefficients on a common scale.

Step 7 — Specify Hypothesis Tests (Optional) Define custom hypotheses of the form $H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$ by supplying the contrast matrix $\mathbf{C}$ (combinations of coefficient rows) and, optionally, the response transformation matrix $\mathbf{M}$ (combinations of response columns).

Pre-built hypothesis options include the overall regression test (all predictors jointly, all responses) and individual per-predictor tests.

Step 8 — Select Confidence Level Choose the confidence level for confidence intervals (default: 95%).

Step 9 — Select Display Options Choose which outputs to display: coefficient tables, multivariate test statistics, effect sizes, residual diagnostics, and plots.

Step 10 — Run the Analysis Click "Run Multivariate Linear Model". The application will:

  1. Construct the design matrix X\mathbf{X} (with dummy coding, interaction, polynomial terms).
  2. Compute B^=(XTX)1XTY\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}.
  3. Compute residuals E^\hat{\mathbf{E}} and estimate Σ^\hat{\boldsymbol{\Sigma}}.
  4. Compute all four multivariate test statistics for each specified hypothesis.
  5. Compute univariate regression statistics for each response.
  6. Compute effect sizes, leverage, Cook's distances, VIFs.
  7. Run multivariate normality tests on residuals.
  8. Generate all selected visualisations and tables.

16. Computational and Formula Details

16.1 Step-by-Step Computation

Step 1: Construct the design matrix X\mathbf{X} (n×(p+1)n \times (p+1))

Include a column of ones (for the intercept) as the first column, followed by the predictor columns (with dummy coding for categorical predictors).

Step 2: Verify rank condition

Check that rank(X)=p+1\text{rank}(\mathbf{X}) = p + 1 (full column rank). If not, the model is under-identified — remove linearly dependent columns.

Step 3: Compute (XTX)1(\mathbf{X}^T\mathbf{X})^{-1}

Using Cholesky decomposition (more numerically stable than direct inversion):

XTX=LLT(Cholesky)\mathbf{X}^T\mathbf{X} = \mathbf{L}\mathbf{L}^T \quad (\text{Cholesky})

(XTX)1=(LT)1L1(\mathbf{X}^T\mathbf{X})^{-1} = (\mathbf{L}^T)^{-1}\mathbf{L}^{-1}

Step 4: Compute the hat matrix H\mathbf{H}

H=X(XTX)1XT(n×n)\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \quad (n \times n)

For large $n$, avoid storing $\mathbf{H}$ explicitly; compute the leverages $h_{ii}$ directly as quadratic forms:

hii=xiT(XTX)1xih_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i

Step 5: Compute the coefficient matrix

B^=(XTX)1XTY((p+1)×q)\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \quad ((p+1) \times q)

Step 6: Compute fitted values and residuals

Y^=XB^=HY,E^=YY^=(IH)Y\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}} = \mathbf{H}\mathbf{Y}, \quad \hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}

Step 7: Estimate the error covariance matrix

Σ^=E^TE^np1\hat{\boldsymbol{\Sigma}} = \frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n - p - 1}

Step 8: Compute standard errors for each coefficient

For coefficient β^kj\hat{\beta}_{kj}:

SE(β^kj)=σ^jj[(XTX)1]kkSE(\hat{\beta}_{kj}) = \sqrt{\hat{\sigma}_{jj} \cdot [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}

Where σ^jj=Σ^jj\hat{\sigma}_{jj} = \hat{\boldsymbol{\Sigma}}_{jj} is the estimated error variance for response jj.
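Steps 1-8 can be sketched end-to-end with a Cholesky solve, as recommended above. A minimal illustration on synthetic data (names and sizes are arbitrary):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(8)
n, p, q = 70, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])    # Step 1: design
assert np.linalg.matrix_rank(X) == p + 1                      # Step 2: full rank
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

XtX = X.T @ X
chol = cho_factor(XtX)                                        # Step 3: Cholesky
B_hat = cho_solve(chol, X.T @ Y)                              # Step 5: coefficients
h = np.einsum('ij,ij->i', X, cho_solve(chol, X.T).T)          # Step 4: leverages
E_hat = Y - X @ B_hat                                         # Step 6: residuals
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)                   # Step 7: error cov
XtX_inv = cho_solve(chol, np.eye(p + 1))
SE = np.sqrt(np.outer(np.diag(XtX_inv), np.diag(Sigma_hat)))  # Step 8: (p+1) x q
print(B_hat.round(2), SE.round(3))
```

The leverages satisfy $\sum_i h_{ii} = p+1$, a useful sanity check on the computation.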

16.2 Computing the Hypothesis SSCP Matrix

For hypothesis H0:CBM=0H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}:

Step 1: Compute B^M\hat{\mathbf{B}}\mathbf{M} — the transformed coefficient matrix ((p+1)×m(p+1) \times m).

Step 2: Compute CB^M\mathbf{C}\hat{\mathbf{B}}\mathbf{M} — the estimated contrast (c×mc \times m).

Step 3: Compute the hypothesis SSCP:

H=(CB^M)T[C(XTX)1CT]1(CB^M)(m×m)\mathbf{H} = (\mathbf{C}\hat{\mathbf{B}}\mathbf{M})^T \left[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T\right]^{-1} (\mathbf{C}\hat{\mathbf{B}}\mathbf{M}) \quad (m \times m)

Step 4: Compute the error SSCP:

E=MTE^TE^M(m×m)\mathbf{E} = \mathbf{M}^T\hat{\mathbf{E}}^T\hat{\mathbf{E}}\mathbf{M} \quad (m \times m)

Step 5: Compute eigenvalues λ1,,λs\lambda_1, \dots, \lambda_{s^*} of E1H\mathbf{E}^{-1}\mathbf{H}.

Step 6: Compute test statistics (Wilks', Pillai's, Hotelling-Lawley, Roy's) from the eigenvalues.
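Steps 1-6 translate directly into code. This sketch tests all slopes jointly ($\mathbf{C}$ selecting the non-intercept rows, $\mathbf{M} = \mathbf{I}_q$) on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, q = 80, 3, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
E_resid = Y - X @ B_hat

C = np.zeros((p, p + 1)); C[:, 1:] = np.eye(p)   # test all slopes jointly
M = np.eye(q)                                    # keep all responses

CBM = C @ B_hat @ M                              # estimated contrast
mid = np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T)
H = CBM.T @ mid @ CBM                            # hypothesis SSCP (m x m)
E = M.T @ E_resid.T @ E_resid @ M                # error SSCP (m x m)

lam = np.clip(np.linalg.eigvals(np.linalg.inv(E) @ H).real, 0, None)
wilks = np.prod(1 / (1 + lam))
pillai = np.sum(lam / (1 + lam))
hotelling = np.sum(lam)
roy = lam.max()
print(wilks, pillai, hotelling, roy)
```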

16.3 Rao's F-Approximation to Wilks' Lambda

Given Λ\Lambda^*, c=dfHc = df_H (rank of C\mathbf{C}), mm (columns of M\mathbf{M}), dfE=np1df_E = n - p - 1:

t=m2c24m2+c25(set t=1 if m2+c250)t = \sqrt{\frac{m^2c^2 - 4}{m^2 + c^2 - 5}} \quad (\text{set } t = 1 \text{ if } m^2+c^2-5 \leq 0)

df1=mc,df2=t(dfEmc+12)mc22df_1 = mc, \quad df_2 = t\left(df_E - \frac{m - c + 1}{2}\right) - \frac{mc-2}{2}

F=1Λ1/tΛ1/tdf2df1F = \frac{1-\Lambda^{*1/t}}{\Lambda^{*1/t}} \cdot \frac{df_2}{df_1}

This approximation is exact when m=1m = 1 or m=2m = 2 or c=1c = 1 or c=2c = 2.
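The approximation is a direct transcription of the formulas above. The call below uses arbitrary illustrative values (not taken from the worked examples):

```python
import numpy as np
from scipy.stats import f as f_dist

def rao_f(wilks, m, c, df_e):
    """Rao's F approximation for Wilks' Lambda (m = cols of M, c = rank of C)."""
    denom = m ** 2 + c ** 2 - 5
    t = np.sqrt((m ** 2 * c ** 2 - 4) / denom) if denom > 0 else 1.0
    df1 = m * c
    df2 = t * (df_e - (m - c + 1) / 2) - (m * c - 2) / 2
    lam_t = wilks ** (1.0 / t)
    F = (1 - lam_t) / lam_t * df2 / df1
    return F, df1, df2, f_dist.sf(F, df1, df2)

print(rao_f(0.72, m=3, c=2, df_e=114))   # exact case, since c = 2
```

With $m = 1$ the function collapses to the ordinary univariate $F$-test, a handy correctness check.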

16.4 Confidence Ellipse for Two Coefficients

The joint 100(1α)%100(1-\alpha)\% confidence ellipse for (βk1,j,βk2,j)(\beta_{k_1, j}, \beta_{k_2, j}) (two coefficients in the same response equation jj) is:

(β^(k1,k2),jβ(k1,k2),j)T[σ^jjV(k1,k2)]1(β^(k1,k2),jβ(k1,k2),j)2Fα,2,np1\left(\hat{\boldsymbol{\beta}}_{(k_1,k_2),j} - \boldsymbol{\beta}_{(k_1,k_2),j}\right)^T \left[\hat{\sigma}_{jj}\mathbf{V}_{(k_1,k_2)}\right]^{-1} \left(\hat{\boldsymbol{\beta}}_{(k_1,k_2),j} - \boldsymbol{\beta}_{(k_1,k_2),j}\right) \leq 2F_{\alpha,2,n-p-1}

Where V(k1,k2)\mathbf{V}_{(k_1,k_2)} is the 2×22 \times 2 submatrix of (XTX)1(\mathbf{X}^T\mathbf{X})^{-1} corresponding to rows/columns k1k_1 and k2k_2.

16.5 Prediction for New Observations

For a new observation xnew\mathbf{x}_{new} (p+1×1p+1 \times 1), the point prediction of the response vector is:

y^new=B^Txnew(q×1)\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new} \quad (q \times 1)

The prediction error covariance (accounting for both estimation uncertainty and future error variance):

Var(ynewy^new)=(1+xnewT(XTX)1xnew)Σ^\text{Var}(\mathbf{y}_{new} - \hat{\mathbf{y}}_{new}) = \left(1 + \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new}\right)\hat{\boldsymbol{\Sigma}}

A 100(1α)%100(1-\alpha)\% prediction ellipsoid for ynew\mathbf{y}_{new}:

(\mathbf{y}_{new} - \hat{\mathbf{y}}_{new})^T\left[\hat{\boldsymbol{\Sigma}}\left(1 + h_{new}\right)\right]^{-1}(\mathbf{y}_{new} - \hat{\mathbf{y}}_{new}) \leq \frac{q(n-p-1)}{n-p-q}\, F_{\alpha,q,n-p-q}

Where hnew=xnewT(XTX)1xnewh_{new} = \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new}.

Marginal (1α)%(1-\alpha)\% prediction intervals for individual response YjY_j:

y^new,j±tα/2,np1σ^jj(1+hnew)\hat{y}_{new,j} \pm t_{\alpha/2,\,n-p-1}\sqrt{\hat{\sigma}_{jj}\left(1 + h_{new}\right)}
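Point prediction and marginal prediction intervals follow the formulas above; a sketch on synthetic data, with all names illustrative:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(10)
n, p, q = 60, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)

x_new = np.array([1.0, 0.3, -1.2])                  # includes the leading 1
y_hat = B_hat.T @ x_new                             # q-vector point prediction
h_new = x_new @ XtX_inv @ x_new
t_crit = t_dist.ppf(0.975, n - p - 1)
half = t_crit * np.sqrt(np.diag(Sigma_hat) * (1 + h_new))
lower, upper = y_hat - half, y_hat + half
print(y_hat, lower, upper)
```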

16.6 Partitioning the SSCP Matrix

For the standard MLM, the total SSCP partitions as:

T=Hmodel+E\mathbf{T} = \mathbf{H}_{model} + \mathbf{E}

Where:

Hmodel=Y^T(I11T/n)Y^\mathbf{H}_{model} = \hat{\mathbf{Y}}^T(\mathbf{I} - \mathbf{1}\mathbf{1}^T/n)\hat{\mathbf{Y}}

E=E^TE^=YT(IH)Y\mathbf{E} = \hat{\mathbf{E}}^T\hat{\mathbf{E}} = \mathbf{Y}^T(\mathbf{I}-\mathbf{H})\mathbf{Y}

T=YT(I11T/n)Y\mathbf{T} = \mathbf{Y}^T(\mathbf{I} - \mathbf{1}\mathbf{1}^T/n)\mathbf{Y}

The diagonal elements give the familiar univariate sums of squares: Hjj=SSreg,jH_{jj} = SS_{reg,j} and Ejj=SSres,jE_{jj} = SS_{res,j}.


17. Worked Examples

Example 1: Predicting Multiple Environmental Outcomes from Land Use Variables

Research Question: Do land use variables (% Urban cover, % Agricultural cover, Distance to nearest river in km) simultaneously predict multiple water quality indicators (pH, Dissolved Oxygen mg/L, Nitrate mg/L) across sampling sites?

Data: n=80n = 80 water sampling sites; p=3p = 3 predictors; q=3q = 3 responses.

Step 1: Descriptive Statistics

| Variable | Mean | SD | Min | Max |
|---|---|---|---|---|
| Urban (%) | 18.4 | 12.8 | 0.2 | 68.3 |
| Agriculture (%) | 41.2 | 19.6 | 3.1 | 88.7 |
| Distance (km) | 2.84 | 1.93 | 0.12 | 9.41 |
| pH | 7.21 | 0.48 | 5.82 | 8.34 |
| DO (mg/L) | 8.14 | 1.62 | 3.81 | 11.42 |
| Nitrate (mg/L) | 4.83 | 2.91 | 0.21 | 14.72 |

Step 2: Estimated Coefficient Matrix B^\hat{\mathbf{B}}

B^=(β^0,pHβ^0,DOβ^0,NO3β^Urb,pHβ^Urb,DOβ^Urb,NO3β^Agr,pHβ^Agr,DOβ^Agr,NO3β^Dist,pHβ^Dist,DOβ^Dist,NO3)=(7.8329.2141.0410.0120.0310.0830.0080.0140.0410.0610.2180.312)\hat{\mathbf{B}} = \begin{pmatrix} \hat{\beta}_{0,pH} & \hat{\beta}_{0,DO} & \hat{\beta}_{0,NO3} \\ \hat{\beta}_{Urb,pH} & \hat{\beta}_{Urb,DO} & \hat{\beta}_{Urb,NO3} \\ \hat{\beta}_{Agr,pH} & \hat{\beta}_{Agr,DO} & \hat{\beta}_{Agr,NO3} \\ \hat{\beta}_{Dist,pH} & \hat{\beta}_{Dist,DO} & \hat{\beta}_{Dist,NO3} \end{pmatrix} = \begin{pmatrix} 7.832 & 9.214 & 1.041 \\ -0.012 & -0.031 & 0.083 \\ -0.008 & -0.014 & 0.041 \\ 0.061 & 0.218 & -0.312 \end{pmatrix}

Step 3: Standard Errors, t-values, and p-values (per response)

Response: pH

| Predictor | $\hat{\beta}$ | SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept | 7.832 | 0.241 | 32.50 | < 0.001 | [7.352, 8.312] |
| Urban (%) | -0.012 | 0.005 | -2.40 | 0.019 | [-0.022, -0.002] |
| Agriculture (%) | -0.008 | 0.004 | -2.00 | 0.049 | [-0.016, 0.000] |
| Distance (km) | 0.061 | 0.031 | 1.97 | 0.053 | [-0.001, 0.123] |

RpH2=0.348R^2_{pH} = 0.348, Adjusted R2=0.320R^2 = 0.320

Response: Dissolved Oxygen (DO)

| Predictor | $\hat{\beta}$ | SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept | 9.214 | 0.584 | 15.78 | < 0.001 | [8.051, 10.377] |
| Urban (%) | -0.031 | 0.012 | -2.58 | 0.012 | [-0.055, -0.007] |
| Agriculture (%) | -0.014 | 0.009 | -1.56 | 0.124 | [-0.032, 0.004] |
| Distance (km) | 0.218 | 0.075 | 2.91 | 0.005 | [0.069, 0.367] |

RDO2=0.411R^2_{DO} = 0.411, Adjusted R2=0.386R^2 = 0.386

Response: Nitrate (NO3)

| Predictor | $\hat{\beta}$ | SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept | 1.041 | 0.812 | 1.28 | 0.204 | [-0.573, 2.655] |
| Urban (%) | 0.083 | 0.017 | 4.88 | < 0.001 | [0.049, 0.117] |
| Agriculture (%) | 0.041 | 0.013 | 3.15 | 0.002 | [0.015, 0.067] |
| Distance (km) | -0.312 | 0.104 | -3.00 | 0.004 | [-0.519, -0.105] |

RNO32=0.562R^2_{NO3} = 0.562, Adjusted R2=0.543R^2 = 0.543

Step 4: Estimated Error Covariance and Correlation Matrices

Σ^=(0.1640.2140.3120.2141.8411.4280.3121.4285.214)\hat{\boldsymbol{\Sigma}} = \begin{pmatrix} 0.164 & 0.214 & -0.312 \\ 0.214 & 1.841 & -1.428 \\ -0.312 & -1.428 & 5.214 \end{pmatrix}

R^ϵ=(1.0000.3890.3380.3891.0000.4600.3380.4601.000)\hat{\mathbf{R}}_\epsilon = \begin{pmatrix} 1.000 & 0.389 & -0.338 \\ 0.389 & 1.000 & -0.460 \\ -0.338 & -0.460 & 1.000 \end{pmatrix}

The error correlation matrix reveals substantial within-site correlations among the water quality residuals (up to r=0.46|r| = 0.46), confirming that the multivariate framework is appropriate.

Step 5: Multivariate Hypothesis Tests

Overall model test (H0:B0=0H_0: \mathbf{B}_{-0} = \mathbf{0}, all three predictors, all three responses):

| Test Statistic | Value | $F$ | $df_1$ | $df_2$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.4027 | 10.842 | 9 | 187.4 | < 0.001 | 0.302 |
| Pillai's Trace | 0.6512 | 9.814 | 9 | 219 | < 0.001 | 0.287 |
| Hotelling-Lawley | 1.2483 | 12.842 | 9 | 178.0 | < 0.001 | 0.322 |
| Roy's Largest Root | 0.8941 | 21.727 | 3 | 76 | < 0.001 | 0.462 |

Interpretation: The overall model is highly significant (p<0.001p < 0.001). Land use variables jointly predict the combined water quality profile, with a large multivariate effect (ηp2=0.302\eta^2_p = 0.302 from Wilks' Lambda).

Per-predictor multivariate tests:

| Predictor | Wilks' $\Lambda^*$ | $F(3, 74)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Urban (%) | 0.6824 | 11.43 | < 0.001 | 0.317 |
| Agriculture (%) | 0.8241 | 5.27 | 0.002 | 0.176 |
| Distance (km) | 0.7613 | 7.74 | < 0.001 | 0.239 |

All three predictors have significant multivariate effects.

Step 6: Coefficient Profile Plot Interpretation

Plotting the rows of B^\hat{\mathbf{B}} across the three responses reveals mirror-image profiles: both land-cover predictors have negative effects on pH and DO but positive effects on Nitrate, while Distance shows the opposite pattern (positive for pH and DO, negative for Nitrate) — consistent with water quality degrading near urban and agricultural land.

Step 7: Prediction for a New Site

New site: Urban = 25%, Agriculture = 55%, Distance = 1.5 km.

y^new=B^Txnew=(7.832+(0.012)(25)+(0.008)(55)+(0.061)(1.5)9.214+(0.031)(25)+(0.014)(55)+(0.218)(1.5)1.041+(0.083)(25)+(0.041)(55)+(0.312)(1.5))\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new} = \begin{pmatrix} 7.832 + (-0.012)(25) + (-0.008)(55) + (0.061)(1.5) \\ 9.214 + (-0.031)(25) + (-0.014)(55) + (0.218)(1.5) \\ 1.041 + (0.083)(25) + (0.041)(55) + (-0.312)(1.5) \end{pmatrix}

=(7.8320.3000.440+0.0929.2140.7750.770+0.3271.041+2.075+2.2550.468)=(7.184 (pH)7.996 (DO mg/L)4.903 (NO3 mg/L))= \begin{pmatrix} 7.832 - 0.300 - 0.440 + 0.092 \\ 9.214 - 0.775 - 0.770 + 0.327 \\ 1.041 + 2.075 + 2.255 - 0.468 \end{pmatrix} = \begin{pmatrix} 7.184 \text{ (pH)} \\ 7.996 \text{ (DO mg/L)} \\ 4.903 \text{ (NO}_3\text{ mg/L)} \end{pmatrix}

Prediction intervals (95%):

hnew=xnewT(XTX)1xnew=0.024h_{new} = \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new} = 0.024

For pH: 7.184±1.9930.164(1+0.024)=7.184±0.817=[6.367,8.001]7.184 \pm 1.993\sqrt{0.164(1 + 0.024)} = 7.184 \pm 0.817 = [6.367, 8.001]
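The Step 7 point prediction can be reproduced directly from the coefficient matrix reported in Step 2:

```python
import numpy as np

# Coefficient matrix B_hat from Example 1, Step 2 (rows: intercept,
# Urban %, Agriculture %, Distance km; columns: pH, DO, NO3).
B_hat = np.array([
    [ 7.832,  9.214,  1.041],
    [-0.012, -0.031,  0.083],
    [-0.008, -0.014,  0.041],
    [ 0.061,  0.218, -0.312],
])
x_new = np.array([1.0, 25.0, 55.0, 1.5])   # intercept, Urban, Agriculture, Distance
y_hat = B_hat.T @ x_new
print(y_hat)   # -> approximately [7.1835, 7.996, 4.903] (pH, DO, NO3)
```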


Example 2: MLM with Categorical and Continuous Predictors — Treatment, Gender, and Biomarker Outcomes

Research Question: Do treatment condition (Drug A, Drug B, Placebo) and gender (Male/Female), along with baseline age, predict a profile of three biomarker outcomes (Cholesterol mmol/L, Blood Pressure mmHg, Glucose mmol/L)?

Data: n=120n = 120 participants; p=4p = 4 predictors (2 dummy variables for treatment, 1 dummy for gender, 1 continuous for age); q=3q = 3 responses.

Design Matrix Structure:

X=(1DDrugADDrugBDFemaleAgecentred)\mathbf{X} = \begin{pmatrix} 1 & D_{DrugA} & D_{DrugB} & D_{Female} & Age_{centred} \end{pmatrix}

Reference: Drug = Placebo, Gender = Male. Age is mean-centred (Ageˉ=48.3\bar{Age} = 48.3 years).

Step 1: Coefficient Matrix (Selected)

| Predictor | $\hat{\beta}_{Chol}$ | SE | $p$ | $\hat{\beta}_{BP}$ | SE | $p$ | $\hat{\beta}_{Gluc}$ | SE | $p$ |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 5.41 | 0.18 | < .001 | 132.4 | 2.41 | < .001 | 5.82 | 0.19 | < .001 |
| Drug A | -0.84 | 0.22 | < .001 | -8.21 | 2.94 | 0.006 | -0.61 | 0.23 | 0.009 |
| Drug B | -0.41 | 0.22 | 0.064 | -4.83 | 2.94 | 0.103 | -0.28 | 0.23 | 0.225 |
| Female | -0.32 | 0.19 | 0.094 | -5.14 | 2.54 | 0.046 | -0.18 | 0.20 | 0.369 |
| Age | 0.031 | 0.008 | < .001 | 0.412 | 0.107 | < .001 | 0.028 | 0.008 | 0.001 |

Step 2: Multivariate Tests

Treatment effect (H0H_0: Drug A = Drug B = Placebo on all DVs):

| Test | Value | $F(6, 228)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.7214 | 7.12 | < 0.001 | 0.158 |
| Pillai's Trace | 0.2912 | 6.78 | < 0.001 | 0.151 |

Treatment significantly and jointly predicts the biomarker profile (p<0.001p < 0.001, medium-large effect ηp2=0.158\eta^2_p = 0.158).

Gender effect (H0H_0: Male = Female on all DVs):

| Test | Value | $F(3, 113)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.9124 | 3.62 | 0.015 | 0.088 |
| Pillai's Trace | 0.0876 | 3.62 | 0.015 | 0.088 |

Gender has a significant multivariate effect (p=0.015p = 0.015, small-medium ηp2=0.088\eta^2_p = 0.088), driven primarily by blood pressure differences.

Age effect (H0H_0: Age coefficient = 0 on all DVs):

| Test | Value | $F(3, 113)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.8341 | 7.48 | < 0.001 | 0.166 |
| Pillai's Trace | 0.1659 | 7.48 | < 0.001 | 0.166 |

Age has a significant multivariate effect (p<0.001p < 0.001, large ηp2=0.166\eta^2_p = 0.166) — older participants have higher levels of all three biomarkers.

Step 3: Error Correlation Matrix

R^ϵ=(1.0000.4120.3310.4121.0000.2840.3310.2841.000)\hat{\mathbf{R}}_\epsilon = \begin{pmatrix} 1.000 & 0.412 & 0.331 \\ 0.412 & 1.000 & 0.284 \\ 0.331 & 0.284 & 1.000 \end{pmatrix}

Substantial positive error correlations confirm the appropriateness of the multivariate approach.

Step 4: Comparing Drug A vs. Drug B (Planned Contrast)

Hypothesis: H0:βDrugA,=βDrugB,H_0: \beta_{DrugA,\cdot} = \beta_{DrugB,\cdot} on all DVs (i.e., Drug A and Drug B have identical profiles of effects).

Using c=(0,1,1,0,0)T\mathbf{c} = (0, 1, -1, 0, 0)^T and M=I3\mathbf{M} = \mathbf{I}_3:

Wilks' Λ=0.9184\Lambda^* = 0.9184, F(3,113)=3.35F(3, 113) = 3.35, p=0.021p = 0.021, ηp2=0.082\eta^2_p = 0.082.

Drug A has a significantly different (stronger) effect profile compared to Drug B across the combined biomarker set (p=0.021p = 0.021). Examining individual coefficients: Drug A reduces cholesterol by 0.430.43 mmol/L more and blood pressure by 3.383.38 mmHg more than Drug B.

Conclusion: Drug A significantly and jointly reduces the three biomarker levels compared to Placebo (all p<0.01p < 0.01), whereas Drug B's effects are not individually significant. Age is a strong positive predictor of all biomarkers (all p0.001p \leq 0.001). Drug A's effects are significantly stronger than Drug B's across the combined biomarker profile.


Example 3: Profile of Cognitive Outcomes — Continuous Predictors

Research Question: Do IQ, years of education, and weekly exercise hours predict a profile of four cognitive assessment scores (Memory, Attention, Processing Speed, Executive Function) in older adults?

Data: $n = 150$ participants; $p = 3$ continuous predictors; $q = 4$ responses (all standardised T-scores, mean = 50, SD = 10).

Step 1: Standardised Coefficient Matrix $\hat{\mathbf{B}}^*$

| Predictor | $\hat{\beta}^*_{Mem}$ | $\hat{\beta}^*_{Att}$ | $\hat{\beta}^*_{Proc}$ | $\hat{\beta}^*_{Exec}$ |
|---|---|---|---|---|
| IQ | 0.412** | 0.381** | 0.448** | 0.521** |
| Education | 0.284** | 0.211** | 0.142* | 0.318** |
| Exercise | 0.118* | 0.081 | 0.241** | 0.098 |

\* $p < 0.05$; \*\* $p < 0.01$

Step 2: Multivariate Tests

| Predictor | Wilks' $\Lambda^*$ | $F(4, 143)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| IQ | 0.5423 | 30.14 | < 0.001 | 0.458 |
| Education | 0.7841 | 9.82 | < 0.001 | 0.216 |
| Exercise | 0.9124 | 3.44 | 0.011 | 0.088 |

All three predictors significantly and jointly predict the cognitive profile. IQ has the largest multivariate effect ($\eta^2_p = 0.458$, large); education has a medium effect ($\eta^2_p = 0.216$); exercise has a small but significant effect ($\eta^2_p = 0.088$).

Step 3: Differential Effects Across Responses

Test: Does IQ have equal effects on all four cognitive scores?

$H_0$: $\beta_{IQ,Mem} = \beta_{IQ,Att} = \beta_{IQ,Proc} = \beta_{IQ,Exec}$

Using pairwise contrasts ($\mathbf{M}$ = three-column difference matrix):

Wilks' $\Lambda^* = 0.9312$, $F(3, 145) = 3.56$, $p = 0.016$. IQ has differential effects across the cognitive domains: it most strongly predicts Executive Function and Processing Speed (standardised $\beta^* \approx 0.45$–$0.52$) compared to Memory and Attention ($\beta^* \approx 0.38$–$0.41$).

Test: Does exercise have equal effects on Processing Speed and Attention?

$H_0$: $\beta_{Ex,Proc} = \beta_{Ex,Att}$

$\mathbf{c} = \mathbf{e}_{Ex}^T$, $\mathbf{m} = \mathbf{e}_{Proc} - \mathbf{e}_{Att}$

$t(146) = 2.14$, $p = 0.034$. Exercise has a significantly stronger effect on Processing Speed than Attention ($\beta^*_{Proc} = 0.241$ vs. $\beta^*_{Att} = 0.081$).
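Both contrast tests above follow the same recipe: form $\mathbf{C}\hat{\mathbf{B}}\mathbf{M}$, build the hypothesis and error SSCP matrices, and take a determinant ratio. A numpy sketch of that recipe on illustrative data (it returns only Wilks' $\Lambda^*$, without the $F$-approximation or $p$-value):

```python
import numpy as np

def wilks_lambda(X, Y, C, M):
    """Wilks' Lambda for the general linear hypothesis C @ B @ M = 0
    (a sketch of the textbook formulas; no F-approximation here)."""
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E_resid = Y - X @ B_hat
    E = M.T @ E_resid.T @ E_resid @ M        # error SSCP on M-transformed responses
    CBM = C @ B_hat @ M
    middle = np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T)
    H = CBM.T @ middle @ CBM                 # hypothesis SSCP
    return np.linalg.det(E) / np.linalg.det(H + E)

# Illustrative data: intercept + one predictor, two responses.
rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = np.column_stack([1.0 + 0.8 * x, 2.0 - 0.5 * x]) + rng.normal(scale=1.0, size=(n, 2))

C = np.array([[0.0, 1.0]])   # select the slope row of B
M = np.eye(2)                # all responses
lam = wilks_lambda(X, Y, C, M)
print(round(lam, 4))  # in (0, 1); smaller values indicate a stronger effect
```

Swapping in a difference matrix for `M` (e.g., columns $\mathbf{e}_j - \mathbf{e}_l$) turns this into the differential-effect tests shown above.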

Step 4: Univariate R² Summary

| Response | $R^2$ | Adjusted $R^2$ | Primary Predictor |
|---|---|---|---|
| Memory | 0.348 | 0.334 | IQ ($\eta^2_p = 0.175$) |
| Attention | 0.291 | 0.276 | IQ ($\eta^2_p = 0.149$) |
| Processing Speed | 0.402 | 0.389 | IQ ($\eta^2_p = 0.204$) |
| Executive Function | 0.481 | 0.470 | IQ ($\eta^2_p = 0.271$) |

Conclusion: The three predictors jointly explain 29–48% of variance in individual cognitive scores. IQ is the dominant predictor across all cognitive domains, but exercise specifically benefits processing speed more than other domains. Education consistently predicts all cognitive outcomes.


18. Common Mistakes and How to Avoid Them

Mistake 1: Running Separate Univariate Regressions Without Multivariate Testing

Problem: Conducting $q$ separate OLS regressions and reporting results variable-by-variable without testing multivariate hypotheses. This ignores the correlational structure among responses, inflates the familywise Type I error rate, and misses effects that exist only on linear combinations of responses.
Solution: Use the MLM framework to obtain multivariate tests of each predictor's joint effect on all responses simultaneously. Report multivariate test statistics alongside univariate results. Apply Bonferroni or Holm corrections if separately reporting univariate results.
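For the Holm correction mentioned in the solution, a short plain-Python sketch (the `holm_adjust` helper name is hypothetical):

```python
# A minimal Holm step-down correction for the q univariate p-values, assuming
# you still report per-response tests alongside the multivariate result.
def holm_adjust(p_values):
    """Return Holm-adjusted p-values (same order as the input)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)  # enforce monotone step-down
        adjusted[idx] = running_max
    return adjusted

print(holm_adjust([0.003, 0.021, 0.047]))  # e.g. three univariate p-values
```

Holm is uniformly more powerful than plain Bonferroni while still controlling the familywise error rate, which is why it is usually the better default here.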

Mistake 2: Neglecting to Report and Interpret the Error Covariance Matrix

Problem: Fitting an MLM but reporting only univariate coefficients and ignoring $\hat{\boldsymbol{\Sigma}}$, thereby missing a key piece of the analysis: the residual correlation structure among responses.
Solution: Always estimate and report $\hat{\boldsymbol{\Sigma}}$ (or at least the correlation form $\hat{\mathbf{R}}_\epsilon$). Large within-observation error correlations are the key justification for using MLM rather than separate regressions. If error correlations are near zero, separate regressions are nearly equivalent to MLM for estimation.

Mistake 3: Confusing the Coefficient Matrix Interpretation Direction

Problem: Misinterpreting whether the $k$-th row or the $j$-th column of $\hat{\mathbf{B}}$ corresponds to predictor $k$ or response $j$.
Solution: Establish and maintain a consistent convention. In the standard MLM $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$: rows of $\mathbf{B}$ correspond to predictors (including the intercept); columns of $\mathbf{B}$ correspond to responses. Always label the rows and columns of $\hat{\mathbf{B}}$ clearly in output tables.

Mistake 4: Ignoring Residual Diagnostics Across All Responses

Problem: Checking diagnostic plots for only one or two response variables and assuming the model is adequate for all, when violations (non-linearity, heteroscedasticity, outliers) may affect only some responses.
Solution: Examine residual diagnostic plots for every response variable individually. Also conduct multivariate residual diagnostics (Mahalanobis distance Q-Q plot, multivariate normality tests). A violation in even one response requires remediation.

Mistake 5: Using MLM When Responses Are Unrelated

Problem: Applying MLM to a set of conceptually unrelated response variables simply because they were measured, producing a statistically valid but scientifically meaningless combined analysis.
Solution: Only combine responses into an MLM when they represent a coherent theoretical construct (e.g., multiple indicators of the same underlying outcome dimension, multiple time points of the same measure, multiple facets of the same construct). Unrelated responses should be modelled separately.

Mistake 6: Ignoring Multicollinearity Among Predictors

Problem: Failing to check VIFs, leading to inflated standard errors, unstable coefficient estimates, and unreliable hypothesis tests. This is especially problematic in MLM because the instability propagates across all response equations simultaneously.
Solution: Always compute VIFs before finalising the model. If any $VIF_k > 10$, consider removing one of the collinear predictors, creating a composite, using ridge regression, or using PLS. Report VIFs in the methods section.
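VIFs follow directly from their definition, $VIF_k = 1/(1 - R_k^2)$, where $R_k^2$ comes from regressing $X_k$ on the remaining predictors. A numpy sketch with a deliberately collinear pair of predictors:

```python
import numpy as np

def vifs(X):
    """VIF for each predictor column of X (X excludes the intercept column);
    a plain-numpy sketch of the regress-each-predictor-on-the-rest recipe."""
    n, p = X.shape
    out = []
    for k in range(p):
        others = np.delete(X, k, axis=1)
        Z = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(Z, X[:, k], rcond=None)
        resid = X[:, k] - Z @ coef
        r2 = 1.0 - resid @ resid / np.sum((X[:, k] - X[:, k].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
v = vifs(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))  # x1 and x2 show very large VIFs; x3 stays near 1
```

The collinear pair produces VIFs far above the 10 threshold, while the independent predictor stays near 1, matching the severity guide later in this tutorial.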

Mistake 7: Testing Multivariate Hypotheses Without Checking the Appropriate Contrast Matrices

Problem: Specifying incorrect or incomplete $\mathbf{C}$ and $\mathbf{M}$ matrices, leading to tests of unintended hypotheses. For example, forgetting that the intercept occupies the first row of $\mathbf{B}$ and incorrectly indexing predictors.
Solution: Always verify the $\mathbf{C}$ and $\mathbf{M}$ matrices by constructing them explicitly and checking that $\mathbf{C}\hat{\mathbf{B}}\mathbf{M}$ produces the intended contrast. The DataStatPro application automatically constructs $\mathbf{C}$ and $\mathbf{M}$ from user specifications, but always review the hypothesis being tested.

Mistake 8: Selecting Variables Based Only on Univariate Significance

Problem: Dropping predictors from the MLM based solely on their univariate significance for individual responses (e.g., "agriculture is not significant for DO, so remove it"), potentially removing predictors that have significant multivariate effects.
Solution: Use multivariate criteria for variable selection: multivariate AIC/BIC, a likelihood ratio test for the full row of $\mathbf{B}$ (all $q$ responses simultaneously), or the multivariate $F$-test for the predictor's row. A predictor non-significant for one response may still significantly predict other responses or combinations thereof.

Mistake 9: Over-Interpreting Large Standardised Coefficients Without Confidence Intervals

Problem: Ranking predictor importance based on standardised coefficient magnitudes alone, without reporting confidence intervals, leading to overconfident conclusions about which predictors matter most.
Solution: Always report confidence intervals for standardised coefficients. A large standardised coefficient with a wide confidence interval indicates imprecision. Use effect sizes ($\eta^2_p$, $\omega^2$) alongside coefficient estimates for importance assessment.

Mistake 10: Neglecting Multivariate Outlier Detection

Problem: Identifying and addressing univariate outliers (per response) but missing multivariate outliers: observations that are unusual on the combination of responses even if not extreme on any individual response.
Solution: Always compute Mahalanobis distances of residuals and create a $D^2$ vs. $\chi^2_q$ Q-Q plot. Investigate observations with $D^2 > \chi^2_{q, 0.001}$ for data errors or genuine anomalies. Report Cook's distances and the sensitivity of results to influential observations.
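A sketch of that screening step in numpy, assuming the residual matrix is already in hand; the cutoff $\chi^2_{3, 0.001} \approx 16.27$ is hardcoded from standard tables rather than computed:

```python
import numpy as np

# Multivariate outlier screening on MLM residuals: compute each observation's
# squared Mahalanobis distance and compare with a chi-square cutoff
# (the tabled value chi2_{3, 0.001} ~= 16.27 for q = 3 responses).
rng = np.random.default_rng(3)
n, q = 100, 3
E_hat = rng.normal(size=(n, q))           # stand-in residual matrix
E_hat[0] = [6.0, -6.0, 6.0]               # plant one gross multivariate outlier

Sigma_hat = np.cov(E_hat, rowvar=False)   # residual covariance (illustrative)
Sinv = np.linalg.inv(Sigma_hat)
D2 = np.einsum('ij,jk,ik->i', E_hat, Sinv, E_hat)  # D2_i = e_i^T Sigma^{-1} e_i

cutoff = 16.27                            # approx. upper-0.001 chi2 quantile, q = 3
flagged = np.where(D2 > cutoff)[0]
print(flagged)  # should include index 0
```

Note the planted observation is not wildly extreme on any single coordinate scale, yet its combination of residuals makes $D^2$ far exceed the cutoff, which is exactly the failure mode univariate screening misses.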


19. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| $(\mathbf{X}^T\mathbf{X})^{-1}$ does not exist | Perfect multicollinearity among predictors; rank-deficient design matrix; $n < p+1$ | Check for linearly dependent columns (e.g., sum of dummy variables = constant); remove redundant predictors; collect more data |
| $\hat{\boldsymbol{\Sigma}}$ is singular | $n - p - 1 \leq q$ (too few residual df); perfectly correlated responses | Increase $n$; reduce $q$ (remove highly correlated responses); use a regularised estimator |
| Very large standard errors for some coefficients | Near-multicollinearity ($VIF > 10$); very small predictor variance | Check VIFs; standardise predictors; use ridge regression; remove one of a pair of collinear predictors |
| Wilks' $\Lambda^* = 1$ (non-significant, $p = 1$) | No predictive signal; all responses uncorrelated with all predictors; data entry error | Verify data; check variable coding; consider whether the research question is plausible |
| Wilks' $\Lambda^* < 0$ or $> 1$ | Computational error; near-singular $\mathbf{E}$; numerical overflow | Check for perfect multicollinearity among responses; verify $n - p - 1 > q$; inspect raw data |
| Mahalanobis distance Q-Q plot strongly curved | Severe multivariate non-normality of residuals; influential outliers | Identify and investigate outliers (Cook's distance, leverage); transform skewed responses (log, Box-Cox); use bootstrap inference |
| Significant multivariate test but no significant univariate tests | Effect exists on a linear combination of DVs not captured individually | Focus on discriminant function / canonical variate analysis; do not dismiss the multivariate result; it is valid |
| Significant univariate tests but non-significant multivariate test | Insufficient power for multivariate test with many DVs; DVs poorly correlated with group effects | Consider removing uninformative DVs; increase sample size; report both results with context |
| Residual plots show clear non-linearity for one DV | Omitted non-linear term for that response; wrong functional form | Add polynomial or spline term for the offending predictor-response combination; consider transformation of the response |
| Residual plots show heteroscedasticity for one DV | Variance of one response changes with fitted values | Apply response transformation (log, square root); use heteroscedasticity-consistent (HC) standard errors; consider weighted MLM |
| Very high within-observation residual correlations ($r > 0.95$) | Near-duplicate responses measuring the same construct; $\hat{\boldsymbol{\Sigma}}$ approaching singularity | Drop or combine the redundant responses; verify $\hat{\boldsymbol{\Sigma}}$ remains invertible |
| Cook's distance extremely large for one observation | Data entry error; genuine anomaly; highly influential leverage point | Verify data accuracy; report analysis with and without the influential observation; use robust estimation |
| AIC increases when adding a theoretically important predictor | Predictor explains negligible variance or adds noise | Report both models; note the theoretically motivated predictor regardless; consider whether $n$ is adequate to detect the hypothesised effect |
| $F$-approximations for Wilks', Pillai's, and Roy's give different conclusions | Complex eigenstructure (effects spread across multiple dimensions); small sample size | Report all four statistics; prioritise Pillai's Trace; investigate canonical variate structure |

20. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
|---|---|
| $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$ | Multivariate linear model |
| $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ | OLS/MLE coefficient estimator |
| $\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}} = \mathbf{H}\mathbf{Y}$ | Fitted values |
| $\hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I}-\mathbf{H})\mathbf{Y}$ | Residual matrix |
| $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{E}}^T\hat{\mathbf{E}}/(n-p-1)$ | Error covariance estimator |
| $SE(\hat{\beta}_{kj}) = \sqrt{\hat{\sigma}_{jj}[(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}$ | Standard error of coefficient |
| $\mathbf{H} = (\mathbf{C}\hat{\mathbf{B}}\mathbf{M})^T[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T]^{-1}(\mathbf{C}\hat{\mathbf{B}}\mathbf{M})$ | Hypothesis SSCP matrix |
| $\Lambda^* = \det(\mathbf{E})/\det(\mathbf{H}+\mathbf{E}) = \prod 1/(1+\lambda_s)$ | Wilks' Lambda |
| $V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}] = \sum \lambda_s/(1+\lambda_s)$ | Pillai's Trace |
| $U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum \lambda_s$ | Hotelling-Lawley Trace |
| $\theta = \lambda_1/(1+\lambda_1)$ | Roy's Largest Root |
| $\eta^2_p = 1 - \Lambda^{*1/t}$ | Multivariate partial eta-squared |
| $\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new}$ | Point prediction for new observation |
| $D^2_i = \hat{\boldsymbol{\epsilon}}_i^T\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\epsilon}}_i$ | Mahalanobis distance of residuals |

Four Multivariate Test Statistics Summary

| Statistic | Formula | Range | Robustness | Best When |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | $\prod 1/(1+\lambda_s)$ | $[0,1]$ | Moderate | Standard analysis, effects spread |
| Pillai's Trace $V$ | $\sum \lambda_s/(1+\lambda_s)$ | $[0,s^*]$ | Highest | Robustness needed, small/unequal $n$ |
| Hotelling-Lawley $U$ | $\sum \lambda_s$ | $[0,\infty)$ | Lowest | Single dominant dimension |
| Roy's $\theta$ | $\lambda_1/(1+\lambda_1)$ | $[0,1]$ | Lowest | Theory predicts single dimension |
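All four statistics are deterministic functions of the eigenvalues $\lambda_s$ of $\mathbf{E}^{-1}\mathbf{H}$, which makes them easy to sanity-check numerically. A sketch with made-up SSCP matrices:

```python
import numpy as np

# Hypothetical hypothesis and error SSCP matrices (q = 2), chosen only to
# illustrate the eigenvalue formulas; real H and E come from the fitted model.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
E = np.array([[10.0, 2.0],
              [2.0,  8.0]])

lam = np.linalg.eigvals(np.linalg.inv(E) @ H).real  # eigenvalues lambda_s
lam = np.sort(lam)[::-1]

wilks = np.prod(1.0 / (1.0 + lam))          # Wilks' Lambda
pillai = np.sum(lam / (1.0 + lam))          # Pillai's Trace
hotelling = np.sum(lam)                     # Hotelling-Lawley Trace
roy = lam[0] / (1.0 + lam[0])               # Roy's Largest Root

print(round(wilks, 4), round(pillai, 4), round(hotelling, 4), round(roy, 4))
```

A useful check: the eigenvalue product form of Wilks' $\Lambda^*$ must equal the determinant ratio $\det(\mathbf{E})/\det(\mathbf{H}+\mathbf{E})$, since $\det(\mathbf{H}+\mathbf{E}) = \det(\mathbf{E})\prod(1+\lambda_s)$.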

General Linear Hypothesis Guide

| Hypothesis | $\mathbf{C}$ | $\mathbf{M}$ | Description |
|---|---|---|---|
| All predictors, all responses | $\mathbf{I}_p$ (no intercept) | $\mathbf{I}_q$ | Overall model test |
| Predictor $k$, all responses | $\mathbf{e}_k^T$ | $\mathbf{I}_q$ | Effect of $X_k$ on all DVs |
| All predictors, response $j$ | $\mathbf{I}_p$ | $\mathbf{e}_j$ | Univariate model for $Y_j$ |
| Predictor $k$, response $j$ | $\mathbf{e}_k^T$ | $\mathbf{e}_j$ | Single coefficient test |
| $X_k$ equal effect on $Y_j$ and $Y_l$ | $\mathbf{e}_k^T$ | $\mathbf{e}_j - \mathbf{e}_l$ | Differential effect test |
| Groups $g_1$ vs. $g_2$, all DVs | Contrast row for groups | $\mathbf{I}_q$ | MANOVA pairwise comparison |
| Subset of predictors (rows $k_1,\ldots,k_c$) | $c$-row selection matrix | $\mathbf{I}_q$ | Joint test of predictor subset |

Effect Size Benchmarks

| Effect Size | Small | Medium | Large |
|---|---|---|---|
| $\eta^2_p$ (multivariate) | 0.01 | 0.06 | 0.14 |
| $\omega^2$ | 0.01 | 0.06 | 0.14 |
| Canonical $R^2$ | 0.01 | 0.09 | 0.25 |
| Univariate $R^2$ | 0.02 | 0.13 | 0.26 |

Assumption Diagnostics Summary

| Assumption | Check | Threshold | Remedy |
|---|---|---|---|
| Linearity | Partial residual plots | Visual pattern | Polynomial/spline terms |
| Multivariate normality | Mardia's tests, $D^2$ Q-Q plot | $p < 0.05$ | Transform responses; bootstrap |
| Independence | Study design; Durbin-Watson | $DW < 1.5$ or $> 2.5$ | Mixed models; GEE |
| Homoscedasticity | Scale-location plot; Breusch-Pagan | $p < 0.05$ | Transform responses; HC SEs |
| No multicollinearity | VIF | $VIF > 10$ | Remove/combine predictors; ridge |
| No outliers | Cook's $D$; Mahalanobis $D^2$ | $D > 4/n$; $D^2 > \chi^2_{q,0.001}$ | Investigate; robust estimation |
| Sufficient $n$ | $n - p - 1 > q$ | Absolute minimum | Collect more data; reduce $q$ |

Model Selection Criteria

| Criterion | Formula | Prefer | Notes |
|---|---|---|---|
| AIC | $-2\ell + 2k$ | Lower | Predictive accuracy |
| BIC | $-2\ell + k\ln(n)$ | Lower | Parsimony |
| LRT ($\chi^2$) | $-[n-p-1-(m+c+1)/2]\ln(\Lambda^*)$ | $p > 0.05$ to retain reduced | Nested models only |
| CV MSPE | $\frac{1}{n}\sum \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v)} \rVert^2$ | Lower | Out-of-sample prediction |

Special Cases of the MLM

| Special Case | Conditions | Notes |
|---|---|---|
| Univariate regression | $q = 1$ | Standard OLS |
| MANOVA | $\mathbf{X}$ = group indicators only | Groups compared on $q$ DVs |
| MANCOVA | $\mathbf{X}$ = groups + covariates | Adjusted group comparisons |
| Profile analysis | $q$ = repeated measures; $\mathbf{M}$ = difference matrix | Tests parallelism, levels, flatness |
| Canonical correlation | Test of all predictors, $\mathbf{M} = \mathbf{I}_q$ | CCA = MLM eigenstructure |
| Discriminant analysis | $\mathbf{X}$ = groups; post-hoc eigenvectors | DFA follows MANOVA |

Prediction Formulas

| Target | Formula |
|---|---|
| Point prediction | $\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new}$ |
| Prediction SE (response $j$) | $\sqrt{\hat{\sigma}_{jj}(1 + h_{new})}$ |
| 95% prediction interval ($Y_j$) | $\hat{y}_{new,j} \pm t_{0.025,n-p-1}\sqrt{\hat{\sigma}_{jj}(1+h_{new})}$ |
| Leverage of new point | $h_{new} = \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new}$ |
| Prediction ellipsoid | $(\mathbf{y}-\hat{\mathbf{y}})^T[\hat{\boldsymbol{\Sigma}}(1+h_{new})]^{-1}(\mathbf{y}-\hat{\mathbf{y}}) \leq qF_{\alpha,q,n-p-q}$ |
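The prediction formulas above chain together in one pass; a numpy sketch on synthetic data, with the $t$ critical value hardcoded (~1.98 for roughly 100 residual df) rather than looked up, so the interval is illustrative only:

```python
import numpy as np

# Point prediction, leverage, and an approximate 95% prediction interval
# for one response of a fitted MLM (all data here is synthetic).
rng = np.random.default_rng(4)
n, q = 104, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = np.column_stack([2.0 + x, 1.0 - 0.5 * x]) + rng.normal(scale=0.3, size=(n, q))

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
E_hat = Y - X @ B_hat
p = X.shape[1] - 1
Sigma_hat = E_hat.T @ E_hat / (n - p - 1)

x_new = np.array([1.0, 0.5])                    # intercept + predictor value
y_new = B_hat.T @ x_new                         # point prediction (q-vector)
h_new = x_new @ np.linalg.inv(X.T @ X) @ x_new  # leverage of the new point
se_j = np.sqrt(Sigma_hat[0, 0] * (1.0 + h_new)) # prediction SE for response 1

t_crit = 1.98                                   # approx. t_{0.025, 102}
lo, hi = y_new[0] - t_crit * se_j, y_new[0] + t_crit * se_j
print(round(y_new[0], 2), (round(lo, 2), round(hi, 2)))
```

In an application, the $t$ quantile would come from a statistics library rather than being hardcoded, and the joint ellipsoid would replace the per-response interval when all $q$ responses are predicted simultaneously.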

Multicollinearity Severity Guide

| VIF Range | Severity | Action |
|---|---|---|
| 1 – 2 | None | Proceed normally |
| 2 – 5 | Low | Monitor; no action required |
| 5 – 10 | Moderate | Check; consider remediation |
| 10 – 30 | High | Remediation recommended |
| > 30 | Severe | Remediation required |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Multivariate Linear Models using the DataStatPro application. For further reading, consult Johnson & Wichern's "Applied Multivariate Statistical Analysis" (6th ed., Pearson, 2007), Rencher & Christensen's "Methods of Multivariate Analysis" (3rd ed., Wiley, 2012), or Anderson's "An Introduction to Multivariate Statistical Analysis" (3rd ed., Wiley, 2003). For feature requests or support, contact the DataStatPro team.