
Multivariate Linear Models

Comprehensive reference guide for Multivariate Linear Models analysis.

Multivariate Linear Models: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Multivariate Linear Models (MLM) all the way through advanced estimation, hypothesis testing, model diagnostics, coefficient interpretation, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Multivariate Linear Models?
  3. The Mathematical Framework
  4. Model Specification and Design Matrices
  5. Assumptions of Multivariate Linear Models
  6. Parameter Estimation
  7. Hypothesis Testing and Inference
  8. Multivariate Test Statistics
  9. Effect Size Measures
  10. Model Fit and Evaluation
  11. Model Diagnostics and Residuals
  12. Interpretation of Coefficients
  13. Variable Selection and Model Comparison
  14. Special Cases and Connections to Other Methods
  15. Using the Multivariate Linear Models Component
  16. Computational and Formula Details
  17. Worked Examples
  18. Common Mistakes and How to Avoid Them
  19. Troubleshooting
  20. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into Multivariate Linear Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Ordinary Least Squares (OLS) Regression

Ordinary Least Squares (OLS) regression models a single continuous response $y$ as a linear function of $p$ predictors $X_1, X_2, \dots, X_p$:

$$y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i$$

In matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$$

The OLS estimator minimises the sum of squared residuals:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Multivariate Linear Models extend this framework to the case where multiple response variables are measured simultaneously.
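The OLS estimator above can be sketched numerically. This is a minimal illustration on synthetic data (all variable names are illustrative, not part of any DataStatPro API); the normal equations are solved with a linear solve rather than an explicit inverse for numerical stability.

```python
import numpy as np

# Synthetic data: n observations, p = 3 predictors plus an intercept column.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'y, via the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this little noise and this many observations, `beta_hat` recovers `beta_true` closely.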

1.2 Matrices and Linear Algebra

Key matrix operations used throughout this tutorial:

- Transpose ($\mathbf{A}^T$) and matrix multiplication
- Inverse ($\mathbf{A}^{-1}$) of a square, full-rank matrix
- Determinant ($|\mathbf{A}|$) and trace ($\text{tr}(\mathbf{A})$)
- Eigenvalues and eigenvectors (Section 1.5)
- The Kronecker product ($\otimes$) and the $\text{vec}$ operator

1.3 The Multivariate Normal Distribution

The multivariate normal distribution $\mathcal{N}_q(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ has density:

$$f(\mathbf{y}) = \frac{1}{(2\pi)^{q/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right\}$$

Where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\Sigma}$ is the covariance matrix. MLMs assume that the response vectors follow this distribution conditionally on the predictors.

1.4 Covariance and Correlation Matrices

The covariance matrix $\boldsymbol{\Sigma}$ ($q \times q$) for $q$ response variables contains the variances $\sigma_{jj}$ on the diagonal and the covariances $\sigma_{jk}$ between responses $j$ and $k$ off the diagonal.

The correlation matrix $\mathbf{R}$ standardises covariances:

$$R_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_{jj}\sigma_{kk}}}$$

1.5 Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$, eigenvalue-eigenvector pairs $(\lambda_s, \mathbf{v}_s)$ satisfy $\mathbf{A}\mathbf{v}_s = \lambda_s\mathbf{v}_s$. They are critical for computing multivariate test statistics and understanding the structure of relationships in MLMs.

1.6 The Hat Matrix

In OLS regression, the hat (projection) matrix:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$

projects the response vector $\mathbf{y}$ onto the column space of $\mathbf{X}$: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. The diagonal elements $h_{ii}$ of the hat matrix measure the leverage of each observation. The same matrix plays the same role in MLMs, now projecting each response column.
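The defining properties of the hat matrix can be checked numerically. A minimal sketch on synthetic data (names are illustrative): $\mathbf{H}$ is idempotent, and its trace equals the number of columns of $\mathbf{X}$.

```python
import numpy as np

# Build a design matrix with an intercept and p = 2 predictors.
rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# H = X (X'X)^{-1} X', computed via a linear solve
H = X @ np.linalg.solve(X.T @ X, X.T)

# Leverages are the diagonal elements h_ii
leverages = np.diag(H)
```

Because $\mathbf{H}$ is a projection, $\mathbf{H}\mathbf{H} = \mathbf{H}$ and $\text{tr}(\mathbf{H}) = p + 1$; with an intercept present, every leverage is at least $1/n$.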


2. What are Multivariate Linear Models?

2.1 The Core Idea

A Multivariate Linear Model (MLM) — also called multivariate multiple regression or the general multivariate linear model — simultaneously models the linear relationship between a set of predictor variables (independent variables) and multiple continuous response variables (dependent variables).

Where univariate multiple regression has a single response $\mathbf{y}$ ($n \times 1$), the multivariate linear model has a response matrix $\mathbf{Y}$ ($n \times q$) with $q \geq 2$ response variables measured on the same $n$ observations.

The model is:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$$

Where:

- $\mathbf{Y}$ ($n \times q$) is the response matrix
- $\mathbf{X}$ ($n \times (p+1)$) is the design matrix (including an intercept column)
- $\mathbf{B}$ ($(p+1) \times q$) is the coefficient matrix
- $\mathbf{E}$ ($n \times q$) is the error matrix

2.2 Why Use MLM Instead of Separate Regressions?

A natural question is: "Why not simply run $q$ separate univariate regressions, one for each response?" There are several compelling reasons to prefer the multivariate approach:

| Reason | Explanation |
|---|---|
| Controls familywise Type I error | Running $q$ separate tests inflates the overall error rate. MLM tests all responses simultaneously at a single $\alpha$. |
| Exploits correlations among responses | Separate regressions ignore that responses are correlated. MLM uses the full error covariance structure, improving efficiency. |
| Detects multivariate effects | Predictors may not significantly relate to any single response but may have a significant combined effect on the response set. MLM detects these patterns. |
| Produces joint confidence regions | MLM yields joint confidence regions for multiple coefficients simultaneously — something separate regressions cannot provide. |
| Tests multivariate hypotheses directly | Hypotheses about linear combinations of coefficients across responses (e.g., "Does $X_1$ affect $Y_1$ and $Y_2$ equally?") can be tested directly in MLM. |
| More powerful when responses are correlated | When responses share common error variance, accounting for correlations can increase power compared to separate analyses. |

2.3 The Generality of the Multivariate Linear Model

The multivariate linear model is remarkably general. Many seemingly distinct statistical procedures are special cases:

| Special Case | Description |
|---|---|
| Univariate Multiple Regression | $q = 1$ (single response) |
| Multivariate ANOVA (MANOVA) | $\mathbf{X}$ contains only categorical predictors (dummy-coded) |
| Multivariate ANCOVA (MANCOVA) | $\mathbf{X}$ contains both categorical and continuous predictors |
| Seemingly Unrelated Regression (SUR) | $q$ regression equations with different predictors per equation |
| Profile Analysis | $\mathbf{X}$ contains group indicators; $\mathbf{Y}$ contains repeated measures |
| Canonical Correlation Analysis | Tests overall association between predictor set and response set |
| Discriminant Function Analysis | $\mathbf{X}$ contains group indicators; follows from MLM eigenstructure |

2.4 Real-World Applications

Multivariate Linear Models are used across a wide range of applied fields — for example, psychology (several outcome scales per participant), medicine (multiple biomarkers per patient), education (scores on several tests), ecology, and economics.


3. The Mathematical Framework

3.1 The Multivariate Linear Model

The core model is:

Yn×q=Xn×(p+1)B(p+1)×q+En×q\mathbf{Y}_{n \times q} = \mathbf{X}_{n \times (p+1)} \mathbf{B}_{(p+1) \times q} + \mathbf{E}_{n \times q}

Each row yiT\mathbf{y}_i^T (observation ii) satisfies:

yi=BTxi+ϵi,ϵiNq(0,Σ)\mathbf{y}_i = \mathbf{B}^T \mathbf{x}_i + \boldsymbol{\epsilon}_i, \quad \boldsymbol{\epsilon}_i \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma})

Where:

Written differently, the distribution of each row is:

yiTxiNq(xiTB, Σ)\mathbf{y}_i^T \mid \mathbf{x}_i \sim \mathcal{N}_q\left(\mathbf{x}_i^T \mathbf{B},\ \boldsymbol{\Sigma}\right)

3.2 The Coefficient Matrix B\mathbf{B}

The coefficient matrix $\mathbf{B}$ has dimensions $(p+1) \times q$:

$$\mathbf{B} = \begin{pmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0q} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1q} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p1} & \beta_{p2} & \cdots & \beta_{pq} \end{pmatrix}$$

Column $j$ contains the regression coefficients for response $Y_j$; row $k$ contains the coefficients of predictor $X_k$ across all $q$ responses.

3.3 The Error Structure

The error matrix $\mathbf{E}$ has the following properties:

$$E[\mathbf{E}] = \mathbf{0}_{n \times q}$$

$$E[\mathbf{e}_i \mathbf{e}_i^T] = \boldsymbol{\Sigma}, \quad E[\mathbf{e}_i \mathbf{e}_j^T] = \mathbf{0} \text{ for } i \neq j$$

This means the errors are independently and identically distributed (i.i.d.) multivariate normal. Equivalently, using the vec operator:

$$\text{vec}(\mathbf{E}) \sim \mathcal{N}_{nq}\left(\mathbf{0},\ \boldsymbol{\Sigma} \otimes \mathbf{I}_n\right)$$

The Kronecker product $\boldsymbol{\Sigma} \otimes \mathbf{I}_n$ captures the covariance structure: observations are independent (the block-diagonal structure from $\mathbf{I}_n$) but responses within each observation may be correlated (captured by $\boldsymbol{\Sigma}$).
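The Kronecker structure is easy to inspect on a tiny example. A minimal sketch (the $\boldsymbol{\Sigma}$ values are made up for illustration):

```python
import numpy as np

# Two correlated responses (q = 2) and n = 3 independent observations.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

n = 3
# Covariance of vec(E): a (nq x nq) block matrix whose (j,k) block is sigma_jk * I_n
V = np.kron(Sigma, np.eye(n))
```

The blocks of `V` show exactly what the text describes: within-observation covariance $\sigma_{jk}$ on matching positions, zeros everywhere observations differ.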

3.4 The Conditional Mean and Variance

For a new observation with predictor vector $\mathbf{x}_{new}$:

$$E[\mathbf{y}_{new} \mid \mathbf{x}_{new}] = \mathbf{B}^T \mathbf{x}_{new}$$

$$\text{Var}(\mathbf{y}_{new} \mid \mathbf{x}_{new}) = \boldsymbol{\Sigma}$$

The predicted response vector is:

$$\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T \mathbf{x}_{new}$$

The predicted response matrix for all $n$ observations:

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}} = \mathbf{H}\mathbf{Y}$$

Where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix.

3.5 The General Linear Hypothesis

The power of the multivariate linear model framework lies in the ability to test very general hypotheses of the form:

$$H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{\Gamma}_0$$

Where:

- $\mathbf{C}$ ($c \times (p+1)$) is a contrast matrix that selects or combines rows of $\mathbf{B}$ (predictors)
- $\mathbf{M}$ ($q \times m$) is a transformation matrix that selects or combines columns of $\mathbf{B}$ (responses)
- $\mathbf{\Gamma}_0$ ($c \times m$) is a matrix of hypothesised values, most often $\mathbf{0}$

This general framework subsumes all the special cases mentioned in Section 2.3.

Examples of $\mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$ hypotheses:

| Hypothesis | $\mathbf{C}$ | $\mathbf{M}$ |
|---|---|---|
| Test all predictors simultaneously | $\mathbf{I}_p$ (omitting intercept row) | $\mathbf{I}_q$ |
| Test effect of predictor $X_k$ on all DVs | $\mathbf{e}_k^T$ (unit vector) | $\mathbf{I}_q$ |
| Test effect of all predictors on DV $j$ | $\mathbf{I}_p$ | $\mathbf{e}_j$ (unit vector) |
| Test if $X_k$ has equal effects on $Y_1$ and $Y_2$ | $\mathbf{e}_k^T$ | $(1, -1)^T$ |
| MANOVA group effect | Group contrast matrix | $\mathbf{I}_q$ |

4. Model Specification and Design Matrices

4.1 The Design Matrix X\mathbf{X}

The design matrix $\mathbf{X}$ ($n \times (p+1)$) encodes all predictor information. The first column is typically a column of ones for the intercept:

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

The design matrix can accommodate:

- Continuous predictors (entered as-is, optionally centred or standardised)
- Categorical predictors (dummy-coded; Section 4.2)
- Interaction terms (products of predictor columns; Section 4.5)
- Polynomial terms (powers of continuous predictors)
4.2 Dummy Coding for Categorical Predictors

For a categorical predictor with $k$ categories, create $k-1$ dummy variables using a reference category. For a three-level factor (A, B, C) with A as reference:

| Category | $D_B$ | $D_C$ |
|---|---|---|
| A (reference) | 0 | 0 |
| B | 1 | 0 |
| C | 0 | 1 |

The coefficient for $D_B$ in column $j$ of $\hat{\mathbf{B}}$ represents the difference in the mean of $Y_j$ between groups B and A, holding other predictors constant.
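The dummy-coding scheme above can be sketched directly. A minimal illustration (the factor levels are made up):

```python
import numpy as np

# A three-level factor with A as the reference category.
levels = np.array(["A", "B", "C", "B", "A", "C"])

# Reference-cell coding: one indicator per non-reference level
D_B = (levels == "B").astype(float)
D_C = (levels == "C").astype(float)

# Design matrix: intercept column plus the two dummies
X = np.column_stack([np.ones(len(levels)), D_B, D_C])
```

Each row of `X` encodes one observation: group A rows are `[1, 0, 0]`, group B rows are `[1, 1, 0]`, and group C rows are `[1, 0, 1]`.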

4.3 The Response Matrix Y\mathbf{Y}

The response matrix $\mathbf{Y}$ ($n \times q$) contains all $q$ response variables:

$$\mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nq} \end{pmatrix}$$

Each column is a separate response variable. Each row is the complete response profile for one observation.

4.4 Centering and Scaling

Centering predictors (subtracting their means) is generally recommended to:

- Make the intercept interpretable as the expected response at average predictor values
- Reduce collinearity between main effects and their interaction or polynomial terms
- Improve the numerical stability of $(\mathbf{X}^T\mathbf{X})^{-1}$

Standardising both predictors and responses (centering + scaling to unit variance) additionally makes regression coefficients comparable across variables with different units and scales — the standardised coefficient represents the change in the response, in standard deviation units of the response, for a one-standard-deviation change in the predictor (Section 9.4).

4.5 Interaction Terms

An interaction term $X_j \times X_k$ is formed as the element-wise product of the two predictor columns:

$$\mathbf{x}_{jk} = \mathbf{x}_j \odot \mathbf{x}_k$$

Including an interaction in the model allows the effect of $X_j$ on the response vector to depend on the value of $X_k$ (and vice versa). For categorical × continuous interactions, the interaction allows the regression slope to differ across groups.


5. Assumptions of Multivariate Linear Models

5.1 Linearity

Assumption: The relationship between each predictor $X_k$ and each response $Y_j$ is linear (on the scale of the model), holding all other predictors constant:

$$E[Y_{ij} \mid \mathbf{x}_i] = \beta_{0j} + \beta_{1j}X_{i1} + \dots + \beta_{pj}X_{ip}$$

How to check: Partial residual plots (component-plus-residual plots) for each predictor-response combination; scatter plots of residuals against each predictor.

Consequences of violation: Biased and inconsistent coefficient estimates; poor prediction.

Remedies: Polynomial terms, log transformations of predictors or responses, spline terms, generalised additive models.

5.2 Multivariate Normality of Errors

Assumption: The error vectors $\boldsymbol{\epsilon}_i$ follow a multivariate normal distribution:

$$\boldsymbol{\epsilon}_i \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma})$$

How to check: Univariate Q-Q plots of the residuals for each response; a chi-square Q-Q plot of the squared Mahalanobis distances of the residual vectors; formal tests of multivariate skewness and kurtosis (e.g., Mardia's tests).

Consequences of violation: OLS estimates remain unbiased and consistent (Gauss-Markov theorem extends to MLM) but inference (hypothesis tests, confidence intervals) based on the multivariate normal assumption is affected, particularly with small samples.

Remedies: Transformations (log, Box-Cox) of skewed responses; robust estimation; bootstrap inference.

5.3 Independence of Observations

Assumption: The error vectors $\boldsymbol{\epsilon}_i$ are independent across observations:

$$\text{Cov}(\boldsymbol{\epsilon}_i, \boldsymbol{\epsilon}_j) = \mathbf{0} \quad \text{for } i \neq j$$

How to check: Consider the study design. Look for temporal autocorrelation (Durbin-Watson statistics per response), spatial correlation, or clustering structure.

Consequences of violation: Biased standard errors, invalid hypothesis tests, false significance.

Remedies: Mixed models (multilevel MLM) for clustered data; GEE for repeated measures; time-series corrections for temporal dependence.

5.4 Homoscedasticity (Constant Error Covariance)

Assumption: The error covariance matrix $\boldsymbol{\Sigma}$ is the same for all observations — it does not depend on the values of the predictors or the fitted values:

$$\text{Var}(\boldsymbol{\epsilon}_i) = \boldsymbol{\Sigma} \quad \text{for all } i$$

How to check: Residuals-vs-fitted and scale-location plots for each response (Section 11.2); formal tests such as Breusch-Pagan applied per response.

Consequences of violation: OLS estimates remain unbiased but are no longer BLUE (Best Linear Unbiased Estimator); standard errors are biased.

Remedies: Weighted least squares (if the heteroscedasticity pattern is known); heteroscedasticity-consistent (HC) standard errors; transformations of responses.

5.5 No Perfect Multicollinearity Among Predictors

Assumption: No predictor variable is a perfect linear combination of other predictor variables. Equivalently, $\mathbf{X}^T\mathbf{X}$ must be invertible (full rank).

How to check: Variance Inflation Factor (VIF) for each predictor: $VIF_k = 1/(1 - R^2_k)$, where $R^2_k$ is the R² from regressing $X_k$ on all other predictors. VIF > 10 is a common threshold for concern.

Consequences of violation: $(\mathbf{X}^T\mathbf{X})^{-1}$ does not exist; coefficient estimates are undefined or numerically unstable with inflated standard errors.

Remedies: Remove redundant predictors; use ridge regression or PLS; combine highly correlated predictors into a composite.
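The VIF definition above can be sketched directly from its formula. A minimal illustration on synthetic data in which one predictor is nearly collinear with another (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

def vif(X, k):
    """VIF_k = 1 / (1 - R^2_k), regressing column k on the remaining columns."""
    others = np.delete(X, k, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Z, X[:, k], rcond=None)[0]
    resid = X[:, k] - Z @ beta
    centred = X[:, k] - X[:, k].mean()
    r2 = 1.0 - (resid @ resid) / (centred @ centred)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, k) for k in range(X.shape[1])]
```

Here `x1` and `x3` should show VIFs far above the usual threshold of 10, while `x2` stays near 1.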

5.6 No Influential Outliers

Assumption: No single observation has undue influence on the estimated coefficient matrix $\hat{\mathbf{B}}$.

How to check: Cook's distance (multivariate extension), hat matrix diagonal $h_{ii}$, DFFITS, standardised residuals.

Consequences of violation: Estimated coefficients may be heavily distorted by a single atypical observation.

Remedies: Investigate outlying observations for data errors; use robust regression; report sensitivity analyses with and without influential points.

5.7 Sufficient Sample Size

Recommendation: For MLM to be reliable:

- The residual degrees of freedom must exceed the number of responses: $n - p - 1 > q$, otherwise the estimated error covariance matrix is singular (Section 6.3)
- Substantially larger samples are advisable for stable estimates — a common rule of thumb is at least 10–20 observations per estimated parameter


6. Parameter Estimation

6.1 The Ordinary Least Squares (OLS) Estimator

The OLS estimator for $\mathbf{B}$ minimises the total sum of squared residuals simultaneously across all response variables:

$$\hat{\mathbf{B}} = \arg\min_{\mathbf{B}} \text{tr}\left[(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\right]$$

Taking the matrix derivative and setting it to zero:

$$\frac{\partial}{\partial \mathbf{B}}\text{tr}\left[(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\right] = -2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\mathbf{B}) = \mathbf{0}$$

Solving the normal equations $\mathbf{X}^T\mathbf{X}\hat{\mathbf{B}} = \mathbf{X}^T\mathbf{Y}$:

$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Critically: this is simply the matrix of coefficients obtained by running $q$ separate OLS regressions — one for each column of $\mathbf{Y}$. The multivariate OLS estimator is column-by-column identical to the univariate OLS estimator applied $q$ times.
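The column-by-column equivalence can be verified numerically. A minimal sketch on synthetic data (names are illustrative): the one-shot multivariate solve and $q$ separate univariate fits produce the same coefficient matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 150, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = rng.normal(size=(p + 1, q))
Y = X @ B + rng.normal(scale=0.2, size=(n, q))

# Multivariate OLS: solve the normal equations once for all q responses
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# q separate univariate OLS fits, one per column of Y
B_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(q)]
)
```

Both routes give the same $(p+1) \times q$ matrix up to floating-point error — the multivariate gain lies in the joint inference, not in the point estimates.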

6.2 Properties of the OLS Estimator

Under the standard assumptions:

Unbiasedness: $E[\hat{\mathbf{B}}] = \mathbf{B}$

Covariance structure:

$$\text{Cov}(\hat{\boldsymbol{\beta}}_j, \hat{\boldsymbol{\beta}}_k) = \sigma_{jk}(\mathbf{X}^T\mathbf{X})^{-1}$$

Where $\sigma_{jk}$ is the $(j,k)$ element of $\boldsymbol{\Sigma}$ (the error covariance between responses $j$ and $k$). In full:

$$\text{Cov}(\text{vec}(\hat{\mathbf{B}})) = \boldsymbol{\Sigma} \otimes (\mathbf{X}^T\mathbf{X})^{-1}$$

Gauss-Markov (multivariate): Among all linear unbiased estimators, $\hat{\mathbf{B}}$ has the minimum covariance matrix (in the Loewner partial order) — it is BLUE.

Maximum Likelihood Equivalence: Under multivariate normality, the OLS estimator is also the maximum likelihood estimator (MLE) of $\mathbf{B}$.

6.3 Estimation of the Error Covariance Matrix

The unbiased estimator of the error covariance matrix $\boldsymbol{\Sigma}$ is:

$$\hat{\boldsymbol{\Sigma}} = \frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n - p - 1} = \frac{(\mathbf{Y} - \hat{\mathbf{Y}})^T(\mathbf{Y} - \hat{\mathbf{Y}})}{n - p - 1}$$

Where $\hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$ is the matrix of OLS residuals.

The MLE of $\boldsymbol{\Sigma}$ (biased, but with maximum likelihood properties) is:

$$\tilde{\boldsymbol{\Sigma}}_{MLE} = \frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n}$$

The unbiased estimator $\hat{\boldsymbol{\Sigma}}$ is generally preferred for inference.

Degrees of freedom: The residual SSCP matrix $\hat{\mathbf{E}}^T\hat{\mathbf{E}}$ has $n - p - 1$ degrees of freedom. Invertibility requires $n - p - 1 > q$.
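Both estimators come from the same residual SSCP matrix and differ only in the divisor. A minimal sketch on synthetic data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 120, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
E_hat = Y - X @ B_hat                      # residual matrix (n x q)
SSCP = E_hat.T @ E_hat                     # residual SSCP (q x q)

Sigma_unbiased = SSCP / (n - p - 1)        # divisor: residual degrees of freedom
Sigma_mle = SSCP / n                       # divisor: n (biased, MLE)
```

With $n - p - 1 > q$, both matrices are symmetric positive definite; the unbiased version is the one used for standard errors and tests.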

6.4 Seemingly Unrelated Regression (SUR) and GLS

When the predictor matrices differ across response variables (i.e., each response has its own set of predictors), running standard OLS on each response separately is not the most efficient approach. Seemingly Unrelated Regression (SUR) applies Generalised Least Squares (GLS) using the full covariance structure:

$$\hat{\mathbf{B}}_{GLS} = \left[\mathbf{X}^T(\hat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{I}_n)\mathbf{X}\right]^{-1}\mathbf{X}^T(\hat{\boldsymbol{\Sigma}}^{-1} \otimes \mathbf{I}_n)\text{vec}(\mathbf{Y})$$

When all responses share the same predictor matrix $\mathbf{X}$ (the standard MLM case), SUR and OLS give identical estimates. The GLS advantage only manifests when predictor matrices differ across equations.

6.5 The Maximum Likelihood Estimator

The log-likelihood for the multivariate linear model is:

$$\ell(\mathbf{B}, \boldsymbol{\Sigma}) = -\frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\text{tr}\left[\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\right] + \text{const}$$

Maximising over $\mathbf{B}$ gives $\hat{\mathbf{B}}_{MLE} = \hat{\mathbf{B}}_{OLS}$ (as stated above). Maximising over $\boldsymbol{\Sigma}$ gives $\tilde{\boldsymbol{\Sigma}}_{MLE} = \hat{\mathbf{E}}^T\hat{\mathbf{E}}/n$.

The maximised log-likelihood is:

$$\ell(\hat{\mathbf{B}}, \tilde{\boldsymbol{\Sigma}}_{MLE}) = -\frac{n}{2}\left[q\ln(2\pi) + \ln|\tilde{\boldsymbol{\Sigma}}_{MLE}| + q\right]$$


7. Hypothesis Testing and Inference

7.1 The General Linear Hypothesis Framework

The unified framework for hypothesis testing in MLM is the General Linear Hypothesis:

$$H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{\Gamma}_0$$

For testing $H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$ (the most common case), the Hypothesis SSCP matrix is:

$$\mathbf{H} = \mathbf{M}^T\hat{\mathbf{B}}^T\mathbf{C}^T\left[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T\right]^{-1}\mathbf{C}\hat{\mathbf{B}}\mathbf{M}$$

The Error SSCP matrix (the same for all hypotheses on a given $\mathbf{M}$):

$$\mathbf{E} = \mathbf{M}^T\hat{\mathbf{E}}^T\hat{\mathbf{E}}\mathbf{M} = \mathbf{M}^T(\mathbf{Y} - \hat{\mathbf{Y}})^T(\mathbf{Y} - \hat{\mathbf{Y}})\mathbf{M}$$

When $\mathbf{M} = \mathbf{I}_q$, $\mathbf{E} = \hat{\mathbf{E}}^T\hat{\mathbf{E}}$.

Test statistics are based on the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ (see Section 8).

7.2 Testing Individual Predictor Effects (Row of B\mathbf{B})

To test whether predictor $X_k$ has any effect on the set of responses ($H_0: \boldsymbol{\beta}_{k\cdot} = \mathbf{0}^T$, where $\boldsymbol{\beta}_{k\cdot}$ is the $k$-th row of $\mathbf{B}$):

Set $\mathbf{C} = \mathbf{e}_k^T$ (a row vector with 1 in position $k$ and zeros elsewhere) and $\mathbf{M} = \mathbf{I}_q$.

This produces a multivariate test of whether $X_k$ has zero effect on all $q$ responses simultaneously. The four multivariate test statistics (Section 8) are applied.

7.3 Testing a Subset of Predictors

To test whether a subset of $c$ predictors jointly contributes to the model ($H_0: \mathbf{C}\mathbf{B} = \mathbf{0}$ for a $c \times (p+1)$ matrix $\mathbf{C}$):

The hypothesis SSCP matrix with $df_H = c$ is compared to the error SSCP matrix with $df_E = n - p - 1$.

7.4 Testing Specific Linear Combinations of Responses

To test whether a predictor has differential effects on different responses (e.g., $H_0: \beta_{k1} = \beta_{k2}$):

Set $\mathbf{C} = \mathbf{e}_k^T$ and $\mathbf{M} = (1, -1)^T$.

The transformed response is $Y_1 - Y_2$; the test becomes a univariate $t$-test on the difference.

7.5 Univariate Tests Within the MLM Framework

For each response variable $Y_j$ individually, the standard univariate $F$-test for predictor $X_k$ is obtained by setting $\mathbf{M} = \mathbf{e}_j$ and applying the Wald test:

$$F_{kj} = \frac{\hat{\beta}_{kj}^2}{\hat{\sigma}_{jj}^{(kk)}} \sim F_{1,\, n-p-1}$$

Where $\hat{\sigma}_{jj}^{(kk)} = \hat{\sigma}_{jj} \left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{kk}$ is the estimated variance of $\hat{\beta}_{kj}$. This $F$ statistic is the square of the $t$ statistic in Section 7.6.

7.6 Wald Tests for Individual Coefficients

For a single coefficient $\beta_{kj}$ (predictor $k$, response $j$):

$$t_{kj} = \frac{\hat{\beta}_{kj}}{SE(\hat{\beta}_{kj})} \sim t_{n-p-1}$$

Where:

$$SE(\hat{\beta}_{kj}) = \sqrt{\hat{\sigma}_{jj} \cdot \left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{kk}}$$

A $(1-\alpha) \times 100\%$ confidence interval for $\beta_{kj}$:

$$\hat{\beta}_{kj} \pm t_{\alpha/2,\, n-p-1} \times SE(\hat{\beta}_{kj})$$
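The standard error and interval above can be sketched end to end. A minimal illustration on synthetic data (the indices `k`, `j` and all values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 100, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = np.array([[1.0, 0.5],
              [2.0, 0.0],
              [0.0, -1.0]])
Y = X @ B + rng.normal(scale=0.3, size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
Sigma_hat = E_hat.T @ E_hat / (n - p - 1)   # unbiased error covariance

k, j = 1, 0                                  # coefficient of X_1 on Y_1
se = np.sqrt(Sigma_hat[j, j] * XtX_inv[k, k])
t_crit = stats.t.ppf(0.975, n - p - 1)
ci = (B_hat[k, j] - t_crit * se, B_hat[k, j] + t_crit * se)
```

The interval has half-width $t_{\alpha/2,\,n-p-1} \times SE(\hat{\beta}_{kj})$, exactly as in the formula above.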

7.7 Simultaneous Confidence Regions

A key advantage of MLM is the ability to form joint confidence regions for multivariate coefficient vectors. The $100(1-\alpha)\%$ joint confidence region for the $k$-th row $\boldsymbol{\beta}_{k\cdot}$ of $\mathbf{B}$ (all $q$ responses simultaneously) is an ellipsoid:

$$\left(\hat{\boldsymbol{\beta}}_{k\cdot} - \boldsymbol{\beta}_{k\cdot}\right)^T \left[\hat{\boldsymbol{\Sigma}} \cdot \left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{kk}\right]^{-1} \left(\hat{\boldsymbol{\beta}}_{k\cdot} - \boldsymbol{\beta}_{k\cdot}\right) \leq \frac{q(n-p-1)}{n-p-q}\, F_{\alpha,\, q,\, n-p-q}$$

7.8 Testing the Overall Model

The overall model test asks whether any predictor (excluding the intercept) has a significant effect on any response:

$$H_0: \mathbf{B}_{-0} = \mathbf{0}_{p \times q}$$

Where $\mathbf{B}_{-0}$ is $\mathbf{B}$ with the intercept row removed. This is tested with $\mathbf{C} = [\mathbf{0}_p \mid \mathbf{I}_p]$ (omitting the intercept column) and $\mathbf{M} = \mathbf{I}_q$.


8. Multivariate Test Statistics

All multivariate tests in MLM are based on the eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{s^*}$ of $\mathbf{E}^{-1}\mathbf{H}$, where $s^* = \min(c, m)$, $c$ is the rank of $\mathbf{C}$ (the number of hypothesis degrees of freedom), and $m$ is the number of transformed responses (columns of $\mathbf{M}$).

8.1 Wilks' Lambda ($\Lambda^*$)

$$\Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H} + \mathbf{E}|} = \prod_{s=1}^{s^*} \frac{1}{1 + \lambda_s}$$

Range: $0 \leq \Lambda^* \leq 1$; smaller values indicate stronger effects.

F-approximation (Rao):

$$F = \frac{1 - \Lambda^{*1/t}}{\Lambda^{*1/t}} \cdot \frac{df_2}{df_1}$$

Where:

$$t = \sqrt{\frac{m^2 c^2 - 4}{m^2 + c^2 - 5}}, \quad df_1 = mc, \quad df_2 = t\left(df_E - \frac{m - c + 1}{2}\right) - \frac{mc - 2}{2}$$

The approximation is exact when $m \leq 2$ or $c \leq 2$.

8.2 Pillai's Trace ($V$)

$$V = \text{tr}\left[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}\right] = \sum_{s=1}^{s^*}\frac{\lambda_s}{1+\lambda_s}$$

Range: $0 \leq V \leq s^*$.

F-approximation:

$$F = \frac{2n_X + s^* + 1}{2n_Y + s^* + 1} \cdot \frac{V}{s^* - V} \sim F_{s^*(2n_Y+s^*+1),\ s^*(2n_X+s^*+1)}$$

Where $n_X = (df_E - m - 1)/2$ and $n_Y = (|c - m| - 1)/2$.

Most robust to violations of assumptions.

8.3 Hotelling-Lawley Trace ($U$)

$$U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{s=1}^{s^*}\lambda_s$$

Range: $0 \leq U < \infty$.

F-approximation (with $n_X$ and $n_Y$ as defined in Section 8.2):

$$F = \frac{2(s^* n_X + 1)\, U}{s^{*2}(2n_Y + s^* + 1)} \sim F_{s^*(2n_Y+s^*+1),\ 2(s^* n_X + 1)} \quad \text{(approximate)}$$

Most powerful when effects are concentrated on a single dimension.

8.4 Roy's Largest Root ($\theta$)

$$\theta = \frac{\lambda_1}{1 + \lambda_1}$$

Range: $0 \leq \theta \leq 1$.

The usual F-approximation for Roy's root is an upper bound on the statistic's distribution, so the resulting $p$-value is a lower bound (anti-conservative). Exact tables are needed for precise $p$-values.

8.5 Comparison of the Four Statistics in MLM Context

| Statistic | $H_0$ Rejected When | Best For | Robustness |
|---|---|---|---|
| Wilks' $\Lambda^*$ | $\Lambda^*$ small | General; effects spread | Moderate |
| Pillai's $V$ | $V$ large | Effects spread; small $n$; unequal $n$ | Highest |
| Hotelling-Lawley $U$ | $U$ large | Effects on one dimension | Lowest |
| Roy's $\theta$ | $\theta$ large | Dominant single dimension | Lowest |
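All four statistics come from the same eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$, so they can be computed together. A minimal sketch testing a single predictor against all responses on synthetic data (names and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = np.zeros((p + 1, q))
B[1] = [0.5, 0.3, 0.0]                              # only X_1 has an effect
Y = X @ B + rng.normal(size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
E = E_hat.T @ E_hat                                 # error SSCP (M = I_q)

C = np.array([[0.0, 1.0, 0.0]])                     # select the row of B for X_1
CB = C @ B_hat
H = CB.T @ np.linalg.inv(C @ XtX_inv @ C.T) @ CB    # hypothesis SSCP

eigs = np.sort(np.real(np.linalg.eigvals(np.linalg.inv(E) @ H)))[::-1]
wilks = np.prod(1.0 / (1.0 + eigs))
pillai = np.sum(eigs / (1.0 + eigs))
hotelling = np.sum(eigs)
roy = eigs[0] / (1.0 + eigs[0])
```

Here $c = 1$, so $s^* = 1$: only one eigenvalue is non-zero, and the four statistics are simple transforms of each other (e.g. Pillai's trace equals Roy's root, and Wilks' $\Lambda^*$ equals $1 - V$).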

8.6 Likelihood Ratio Test

An alternative formulation: under $H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$, the likelihood ratio statistic is:

$$\Lambda_{LR} = \left(\frac{|\mathbf{E}|}{|\mathbf{H}+\mathbf{E}|}\right)^{n/2} = (\Lambda^*)^{n/2}$$

The corresponding test statistic, with Bartlett's correction:

$$\chi^2 \approx -\left[n - p - 1 - \frac{m - c + 1}{2}\right]\ln(\Lambda^*)$$

Asymptotically $\sim \chi^2_{mc}$ under $H_0$.


9. Effect Size Measures

9.1 Multivariate Eta-Squared ($\eta^2_p$)

The most widely reported multivariate effect size for each hypothesis test:

From Wilks' Lambda:

$$\eta^2_p = 1 - \Lambda^{*1/t}$$

Where $t$ is defined as in Section 8.1.

From Pillai's Trace:

$$\eta^2_p \approx \frac{V}{s^*}$$

Benchmarks:

| $\eta^2_p$ | Effect Size |
|---|---|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |

9.2 Multivariate Omega-Squared ($\omega^2$)

A less biased (corrected) multivariate effect size:

$$\omega^2 = \frac{df_H(F - 1)}{df_H(F - 1) + n}$$

Applied to the $F$-approximation from the multivariate tests. It can be computed separately from each test statistic's $F$ approximation.

9.3 Univariate Effect Sizes per Response

For each response variable $Y_j$, report the univariate $R^2$:

$$R^2_j = 1 - \frac{SS_{res,j}}{SS_{tot,j}} = 1 - \frac{\hat{\mathbf{e}}_j^T\hat{\mathbf{e}}_j}{\sum_i(y_{ij}-\bar{y}_j)^2}$$

And the univariate partial $\eta^2_p$ for each predictor-response combination:

$$\eta^2_{p,kj} = \frac{SS_{k,j}}{SS_{k,j} + SS_{res,j}}$$

9.4 Standardised Coefficients

Standardised regression coefficients ($\beta^*_{kj}$) are equivalent to fitting the model after standardising both predictors and responses:

$$\beta^*_{kj} = \hat{\beta}_{kj} \cdot \frac{s_{X_k}}{s_{Y_j}}$$

Where $s_{X_k}$ and $s_{Y_j}$ are the standard deviations of predictor $k$ and response $j$. Standardised coefficients represent the change in $Y_j$ (in SD units) for a one-SD change in $X_k$, facilitating comparison across predictors and responses.
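The rescaling identity can be verified by refitting on z-scored variables. A minimal single-predictor sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400
x = 3.0 * rng.normal(size=n) + 10.0          # predictor on an arbitrary scale
y = 2.0 * x + rng.normal(size=n)

# Fit on the raw scale, then rescale the slope: beta* = b * s_x / s_y
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
beta_std = b[1] * x.std(ddof=1) / y.std(ddof=1)

# Refit on z-scored variables: the slope should equal beta_std
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
Z = np.column_stack([np.ones(n), zx])
b_z = np.linalg.lstsq(Z, zy, rcond=None)[0]
```

For a single predictor, the standardised slope is just the sample correlation between `x` and `y`, which both routes recover identically.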

9.5 Canonical Correlations

From the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$, the canonical correlations between the predictor and response spaces are:

$$r_{c,s} = \sqrt{\frac{\lambda_s}{1 + \lambda_s}}$$

The canonical $R^2_s = r_{c,s}^2$ represents the proportion of variance in the $s$-th canonical response variate explained by the corresponding canonical predictor variate. These describe the multivariate association structure.

9.6 Multivariate R² (Generalisations)

Several generalisations of $R^2$ to the multivariate case exist:

Pillai's Trace-based R²:

$$R^2_{mult} = \frac{V}{s^*} = \frac{\sum_s \lambda_s/(1+\lambda_s)}{s^*}$$

Wilks' Lambda-based R²:

$$R^2_{Wilks} = 1 - \Lambda^{*1/t}$$

Trace correlation coefficient (average $R^2$ across canonical variates):

$$R^2_{trace} = \frac{\sum_s r_{c,s}^2}{s^*}$$


10. Model Fit and Evaluation

10.1 Univariate R² for Each Response

Report the standard $R^2$ for each of the $q$ response regressions:

$$R^2_j = 1 - \frac{\hat{\mathbf{e}}_j^T\hat{\mathbf{e}}_j}{(\mathbf{y}_j - \bar{y}_j\mathbf{1})^T(\mathbf{y}_j - \bar{y}_j\mathbf{1})}$$

Adjusted R² penalises for model complexity:

$$\bar{R}^2_j = 1 - (1 - R^2_j)\frac{n-1}{n-p-1}$$

10.2 Trace of the Residual SSCP

The trace of the residual SSCP matrix measures total residual variation across all responses:

$$\text{tr}(\hat{\mathbf{E}}^T\hat{\mathbf{E}}) = \sum_{j=1}^q \hat{\mathbf{e}}_j^T\hat{\mathbf{e}}_j = \sum_{j=1}^q SS_{res,j}$$

This is the multivariate analogue of the residual sum of squares.

10.3 The Determinant Criterion ($|\hat{\boldsymbol{\Sigma}}|$)

The generalised variance $|\hat{\boldsymbol{\Sigma}}|$ (the determinant of the estimated error covariance matrix) is a scalar measure of total residual variation that accounts for correlations among responses. Smaller values indicate better overall fit.

Log-determinant criterion:

$$\ln|\hat{\boldsymbol{\Sigma}}| = \ln\left|\frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n-p-1}\right|$$

This quantity appears in the log-likelihood and in information criteria.

10.4 AIC and BIC for Multivariate Models

Akaike Information Criterion:

AIC=nlnΣ~MLE+2q(p+1)AIC = n\ln|\tilde{\boldsymbol{\Sigma}}_{MLE}| + 2q(p+1)

Or using the maximised log-likelihood:

AIC=2(B^,Σ~MLE)+2kAIC = -2\ell(\hat{\mathbf{B}}, \tilde{\boldsymbol{\Sigma}}_{MLE}) + 2k

Where k=q(p+1)+q(q+1)/2k = q(p+1) + q(q+1)/2 is the total number of free parameters (coefficients plus unique elements of Σ\boldsymbol{\Sigma}).

Bayesian Information Criterion:

BIC=2(B^,Σ~MLE)+kln(n)BIC = -2\ell(\hat{\mathbf{B}}, \tilde{\boldsymbol{\Sigma}}_{MLE}) + k\ln(n)

Lower AIC/BIC indicates a better model relative to complexity. Use AIC for predictive accuracy; BIC for parsimony.
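A sketch of the AIC/BIC computation from the ML error covariance, assuming the parameter count $k = q(p+1) + q(q+1)/2$ given above and dropping the additive log-likelihood constant (which cancels when comparing models fitted to the same data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 80, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
E_hat = Y - X @ B_hat
Sigma_mle = (E_hat.T @ E_hat) / n            # ML estimate divides by n

k = q * (p + 1) + q * (q + 1) // 2           # free parameters
sign, logdet = np.linalg.slogdet(Sigma_mle)  # stable log-determinant
# -2 * loglik = n * logdet, up to the constant n*q*(1 + ln 2*pi)
aic = n * logdet + 2 * k
bic = n * logdet + k * np.log(n)
print(aic, bic)
```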

10.5 Model Comparison via Likelihood Ratio Test

To compare a full model (all pp predictors) vs. a reduced model (pcp - c predictors, dropping the cc predictors specified by C\mathbf{C}):

\Lambda_{LR} = \frac{|\hat{\mathbf{E}}_{full}^T\hat{\mathbf{E}}_{full}|}{|\hat{\mathbf{E}}_{reduced}^T\hat{\mathbf{E}}_{reduced}|}

\chi^2 = -\left[n - p - 1 - \frac{m-c+1}{2}\right]\ln(\Lambda_{LR}) \sim \chi^2_{cm} \quad \text{under } H_0

Or equivalently using Wilks' Lambda for the hypothesis being tested.
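The nested-model test can be sketched directly from the residual SSCP determinants. This is a hedged illustration on synthetic data, using the Bartlett-style multiplier $n - p - 1 - (m - c + 1)/2$ with $m$ responses and $c$ dropped predictors:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, q = 100, 3                                                    # m = q responses
X_red = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # reduced: 2 predictors
X_full = np.column_stack([X_red, rng.normal(size=(n, 1))])       # full adds c = 1 predictor
Y = X_red @ rng.normal(size=(3, q)) + rng.normal(size=(n, q))    # extra predictor is null

def resid_sscp(X, Y):
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    E = Y - X @ B
    return E.T @ E

lam = np.linalg.det(resid_sscp(X_full, Y)) / np.linalg.det(resid_sscp(X_red, Y))
p_full, c, m = 3, 1, q
mult = n - p_full - 1 - (m - c + 1) / 2
stat = -mult * np.log(lam)
p_value = chi2.sf(stat, df=c * m)
print(stat, p_value)
```

Because the dropped predictor is unrelated to $\mathbf{Y}$ by construction, the test should usually not reject.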

10.6 Comparing Univariate and Multivariate Fit

An important diagnostic is to compare the multivariate model fit with the fit from qq separate univariate regressions. The coefficient estimates and univariate R2R^2 values are identical in both cases; what the multivariate model adds is the estimated error covariance Σ^\hat{\boldsymbol{\Sigma}} and the ability to test joint hypotheses across responses. If the residual correlations are all near zero, the multivariate analysis offers little beyond the separate regressions.


11. Model Diagnostics and Residuals

11.1 The Residual Matrix

The residual matrix is:

E^=YY^=YXB^=(IH)Y\hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}

Each row ϵ^i=yiy^i\hat{\boldsymbol{\epsilon}}_i = \mathbf{y}_i - \hat{\mathbf{y}}_i is the residual vector for observation ii. A well-fitting model produces residuals that resemble multivariate white noise.

11.2 Univariate Residual Diagnostics (Per Response)

For each response YjY_j, extract the jj-th column of E^\hat{\mathbf{E}} and apply standard univariate regression diagnostics:

Residuals vs. Fitted Values: Plot e^ij\hat{e}_{ij} against y^ij\hat{y}_{ij}. Expect a random scatter around zero with no pattern.

Normal Q-Q Plot: Plot ordered residuals against normal quantiles. Points should fall near the diagonal.

Scale-Location Plot: Plot e^ij\sqrt{|\hat{e}_{ij}|} against y^ij\hat{y}_{ij}. A horizontal band indicates homoscedasticity.

Residuals vs. Predictor Plots: Plot e^ij\hat{e}_{ij} against each XkX_k. Patterns indicate non-linearity or omitted variable bias.

11.3 Multivariate Residual Diagnostics

11.3.1 Mahalanobis Distance of Residuals

The squared Mahalanobis distance of the residual vector for observation ii:

Di2=ϵ^iTΣ^1ϵ^iD^2_i = \hat{\boldsymbol{\epsilon}}_i^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\epsilon}}_i

Under the model, $D^2_i$ approximately follows a $\chi^2_q$ distribution for large $n$. Large $D^2_i$ indicates observation $i$ is poorly fitted on the combined response profile.

Chi-squared Q-Q plot: Plot ordered D(i)2D^2_{(i)} against χq2\chi^2_q quantiles. Departures from linearity indicate non-normality of multivariate residuals or outliers.
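Computing all $n$ squared Mahalanobis residual distances is a one-liner with `einsum`. A sketch on synthetic data; note the exact identity $\sum_i D^2_i = q(n-p-1)$ when $\hat{\boldsymbol{\Sigma}}$ is computed from the same residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
E_hat = Y - X @ B_hat
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)

# D2_i = eps_i^T Sigma^{-1} eps_i for every row at once
D2 = np.einsum('ij,jk,ik->i', E_hat, np.linalg.inv(Sigma_hat), E_hat)
print(D2.mean())   # close to q; exactly q*(n-p-1)/n by construction
```

Sorting `D2` and plotting against `chi2.ppf((np.arange(n) + 0.5) / n, q)` gives the chi-squared Q-Q plot described above.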

11.3.2 Standardised Multivariate Residuals

Standardise each residual vector by the estimated error covariance:

ϵ~i=Σ^1/2ϵ^i\tilde{\boldsymbol{\epsilon}}_i = \hat{\boldsymbol{\Sigma}}^{-1/2}\hat{\boldsymbol{\epsilon}}_i

Under correct specification, ϵ~iNq(0,Iq)\tilde{\boldsymbol{\epsilon}}_i \approx \mathcal{N}_q(\mathbf{0}, \mathbf{I}_q).

11.3.3 Cross-Response Residual Correlation

Compute the correlation matrix of the columns of E^\hat{\mathbf{E}}:

R^ϵ=D1/2Σ^D1/2\hat{\mathbf{R}}_\epsilon = D^{-1/2} \hat{\boldsymbol{\Sigma}} D^{-1/2}

Where D=diag(Σ^)D = \text{diag}(\hat{\boldsymbol{\Sigma}}). This estimates the within-observation correlation structure of the errors. High correlations indicate that the responses share substantial common error variance.

11.4 Leverage and Influence

Leverage hiih_{ii} (diagonal of H\mathbf{H}): The same hat matrix applies to all responses in MLM. High leverage (hii>2(p+1)/nh_{ii} > 2(p+1)/n) indicates an observation with unusual predictor values.

Multivariate Cook's Distance:

DiCook=1q(p+1)tr[(B^B^(i))T(XTX)(B^B^(i))Σ^1]D_i^{Cook} = \frac{1}{q(p+1)}\text{tr}\left[(\hat{\mathbf{B}} - \hat{\mathbf{B}}^{(-i)})^T(\mathbf{X}^T\mathbf{X})(\hat{\mathbf{B}} - \hat{\mathbf{B}}^{(-i)})\hat{\boldsymbol{\Sigma}}^{-1}\right]

Where B^(i)\hat{\mathbf{B}}^{(-i)} is the coefficient matrix estimated without observation ii. An efficient formula:

D_i^{Cook} = \frac{h_{ii}}{(p+1)\,q\,(1-h_{ii})^2} D^2_i

Observations with DiCook>1D_i^{Cook} > 1 (or >4/n> 4/n) warrant investigation.
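The leverage shortcut avoids refitting the model $n$ times. A sketch, writing Cook's distance as $h_{ii} D^2_i / ((p+1)\,q\,(1-h_{ii})^2)$, consistent with the trace definition above; names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 100, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)        # leverages h_ii
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)
D2 = np.einsum('ij,jk,ik->i', E_hat, np.linalg.inv(Sigma_hat), E_hat)

cook = h * D2 / ((p + 1) * q * (1 - h) ** 2)
flagged = np.where(cook > 4 / n)[0]                # common screening cutoff
print(cook.max(), flagged[:5])
```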

Squared DFFITS (Multivariate), the analogue of the squared univariate DFFITS:

DFFITS_i^2 = \frac{h_{ii}}{1-h_{ii}} D^2_i

11.5 Testing for Multivariate Outliers

Bonferroni-corrected Mahalanobis distance test: Compare Di2D^2_i to the critical value:

Dcrit2=χq,α/(2n)2D^2_{crit} = \chi^2_{q,\, \alpha/(2n)}

(Bonferroni-corrected at the $\alpha/(2n)$ level, e.g., $\alpha = 0.05$, $n = 100$, $q = 3$: use $\chi^2_{3,\,0.00025} \approx 19.2$.)

Robust Mahalanobis distances: Use the Minimum Covariance Determinant (MCD) or Minimum Volume Ellipsoid (MVE) estimates of location and scatter, which are resistant to masking effects (outliers hiding other outliers).

11.6 Checking Multicollinearity Among Predictors

Variance Inflation Factor (VIF) for each predictor (same for all responses, since X\mathbf{X} is shared):

VIFk=11Rk2VIF_k = \frac{1}{1 - R^2_k}

Where Rk2R^2_k is the R2R^2 from regressing XkX_k on all other predictors.

| VIF | Interpretation |
|---|---|
| $1 - 5$ | Low multicollinearity |
| $5 - 10$ | Moderate multicollinearity (concern) |
| $> 10$ | Severe multicollinearity (serious problem) |

Condition number of XTX\mathbf{X}^T\mathbf{X}: Ratio of largest to smallest eigenvalue. Values >30> 30 indicate potentially problematic multicollinearity.
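VIFs and the condition number are straightforward to compute from the predictor block alone (intercept excluded). A sketch with deliberately correlated predictors; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlated with x1 by construction
x3 = rng.normal(size=n)
Xp = np.column_stack([x1, x2, x3])         # predictors only, no intercept column

def vif(Xp, k):
    """VIF_k = 1 / (1 - R^2 of X_k regressed on the other predictors)."""
    y = Xp[:, k]
    others = np.column_stack([np.ones(n), np.delete(Xp, k, axis=1)])
    beta = np.linalg.lstsq(others, y, rcond=None)[0]
    resid = y - others @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

vifs = [vif(Xp, k) for k in range(Xp.shape[1])]
eigvals = np.linalg.eigvalsh(Xp.T @ Xp)
cond = eigvals.max() / eigvals.min()       # eigenvalue condition number
print(vifs, cond)
```

Here `vifs[0]` and `vifs[1]` should be inflated (the two variables share variance) while `vifs[2]` stays near 1.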

11.7 Checking Homoscedasticity

For each response $Y_j$, check homoscedasticity using standard univariate tools — the scale-location plot (Section 11.2) or the Breusch-Pagan test applied to the $j$-th residual column.

For the multivariate setting, Box's M test can be adapted to test whether the error covariance matrix Σ\boldsymbol{\Sigma} is the same across subgroups defined by categorical predictors.


12. Interpretation of Coefficients

12.1 Interpreting the Coefficient Matrix B^\hat{\mathbf{B}}

The coefficient matrix B^\hat{\mathbf{B}} has dimensions (p+1)×q(p+1) \times q: each row corresponds to a predictor (the first row to the intercept) and each column to a response. The entry β^kj\hat{\beta}_{kj} is the coefficient of predictor XkX_k in the equation for response YjY_j, so the jj-th column of B^\hat{\mathbf{B}} is exactly the OLS coefficient vector from regressing YjY_j on X\mathbf{X} alone.

12.2 Ceteris Paribus Interpretation

Each coefficient β^kj\hat{\beta}_{kj} represents the partial effect of XkX_k on YjY_j:

β^kj=Y^jXkXk fixed\hat{\beta}_{kj} = \frac{\partial \hat{Y}_j}{\partial X_k}\Bigg|_{X_{-k} \text{ fixed}}

This is the expected change in the mean of YjY_j for a one-unit increase in XkX_k, with all other predictors Xk={X1,,Xk1,Xk+1,,Xp}X_{-k} = \{X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_p\} held constant.

12.3 Comparing Coefficients Across Responses

A key feature of MLM is the ability to compare the effects of a predictor across multiple responses. If the responses are on the same scale (or are standardised), directly compare β^k1\hat{\beta}_{k1}, β^k2\hat{\beta}_{k2}, ..., β^kq\hat{\beta}_{kq} for predictor XkX_k.

To formally test whether the effect of XkX_k on YjY_j equals its effect on YlY_l:

H0:βkj=βklH0:ekTBmjl=0H_0: \beta_{kj} = \beta_{kl} \quad \Leftrightarrow \quad H_0: \mathbf{e}_k^T\mathbf{B}\mathbf{m}_{jl} = 0

Where mjl=ejel\mathbf{m}_{jl} = \mathbf{e}_j - \mathbf{e}_l (contrast between responses jj and ll).

12.4 Interpreting Interactions

For a model with predictors X1X_1, X2X_2, and their interaction X1×X2X_1 \times X_2:

E[YjX1,X2]=β0j+β1jX1+β2jX2+β12jX1X2E[Y_j \mid X_1, X_2] = \beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \beta_{12j}X_1 X_2

The interaction coefficient β12j\beta_{12j} represents the modification of the effect of X1X_1 on YjY_j per unit change in X2X_2 (and vice versa):

E[Yj]X1=β1j+β12jX2\frac{\partial E[Y_j]}{\partial X_1} = \beta_{1j} + \beta_{12j}X_2

12.5 Interpreting Dummy Variables for Categorical Predictors

For a categorical predictor with kk levels (reference = Level 1):

β^Dm,j=yˉm,jadjyˉ1,jadj\hat{\beta}_{D_m, j} = \bar{y}_{m,j}^{adj} - \bar{y}_{1,j}^{adj}

The coefficient for dummy variable DmD_m in the jj-th response equation represents the adjusted mean difference between Level mm and the reference Level 1 on response YjY_j, controlling for all other predictors.

12.6 The Profile of Effects Across Responses

The multivariate coefficient vector for predictor XkX_k is the row:

β^k=(β^k1,β^k2,,β^kq)\hat{\boldsymbol{\beta}}_{k\cdot} = (\hat{\beta}_{k1}, \hat{\beta}_{k2}, \dots, \hat{\beta}_{kq})

Plotting this profile (coefficient value vs. response variable) visualises the pattern of effects of XkX_k across the response set. Parallel or similar profiles across predictors indicate a common underlying structure; divergent profiles suggest differential effects.


13. Variable Selection and Model Comparison

13.1 Criteria for Variable Selection

Variable selection in MLM must account for the joint effect on all qq responses simultaneously. Several criteria are available:

Multivariate AIC/BIC: Compare models using AIC or BIC (see Section 10.4). Select the model with the lowest criterion value.

Sequential Likelihood Ratio Tests: Add predictors one at a time and test whether each addition significantly improves the joint model fit using the χ2\chi^2 approximation to Wilks' Lambda.

Multivariate Mallow's CpC_p:

Cpmult=tr(E^pTE^p)tr(E^fullTE^full/(npfull1))(n2(p+1))C_p^{mult} = \frac{\text{tr}(\hat{\mathbf{E}}_p^T\hat{\mathbf{E}}_p)}{\text{tr}(\hat{\mathbf{E}}_{full}^T\hat{\mathbf{E}}_{full}/(n-p_{full}-1))} - (n - 2(p+1))

Where the sum is over all qq responses.

13.2 All-Subsets Selection

Evaluate all 2p2^p possible subsets of the pp predictors and select the best subset based on multivariate AIC or BIC, cross-validated prediction error, or multivariate Mallow's CpC_p.

Computational note: For large pp, all-subsets becomes infeasible. Use stepwise procedures or regularised alternatives.

13.3 Stepwise Variable Selection

Forward selection:

  1. Start with the intercept-only model.
  2. At each step, add the predictor that produces the greatest reduction in multivariate AIC (or the most significant Wilks' Lambda test).
  3. Stop when no addition improves the criterion.

Backward elimination:

  1. Start with the full model.
  2. At each step, remove the predictor whose removal produces the least deterioration in multivariate fit (highest p-value for Wilks' Lambda test or smallest AIC increase).
  3. Stop when removing any predictor significantly worsens fit.

Bidirectional: Combine forward and backward at each step.

⚠️ Stepwise selection using p-values suffers from multiple testing inflation, instability, and optimistic bias. AIC/BIC-based stepwise is preferred. Always validate the final model on held-out data and interpret with caution.

13.4 Regularised Estimation for High-Dimensional Settings

When pp is large relative to nn, standard OLS estimation becomes unstable. Regularisation methods extend to the multivariate setting:

Multivariate Ridge Regression:

B^ridge=(XTX+λI)1XTY\hat{\mathbf{B}}_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}

The ridge penalty λ>0\lambda > 0 shrinks all coefficients toward zero, stabilising estimation under multicollinearity.
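The closed form above is easy to sketch. In practice the intercept is usually left unpenalised; centring both blocks achieves this, so only slopes are shrunk. A hedged illustration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 50, 4, 2
Xp = rng.normal(size=(n, p))
Y = Xp @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))
Xc = Xp - Xp.mean(axis=0)                 # centred predictors (intercept unpenalised)
Yc = Y - Y.mean(axis=0)

def ridge(Xc, Yc, lam):
    p = Xc.shape[1]
    # (X'X + lambda I)^{-1} X'Y, solved rather than inverted
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Yc)

B0 = ridge(Xc, Yc, 0.0)          # OLS slopes
B10 = ridge(Xc, Yc, 10.0)        # shrunk toward zero
print(np.linalg.norm(B0), np.linalg.norm(B10))
```

The Frobenius norm of the coefficient matrix decreases monotonically as $\lambda$ grows, which is the shrinkage effect described above.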

Multivariate Lasso (Group Lasso): Applies an L1L_1 penalty structure that encourages sparse solutions. The group lasso applies the penalty at the row level of B\mathbf{B} — entire rows (effects of a predictor on all responses) are driven to zero, performing group variable selection.

Penalty:

minBYXBF2+λk=1pβk2\min_{\mathbf{B}} \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{k=1}^p \|\boldsymbol{\beta}_{k\cdot}\|_2

Where βk2=jβkj2\|\boldsymbol{\beta}_{k\cdot}\|_2 = \sqrt{\sum_j \beta_{kj}^2} is the Euclidean norm of the kk-th row of B\mathbf{B}.

13.5 Cross-Validation for Model Selection

kk-fold cross-validation for MLM:

  1. Divide the nn observations into kk folds.
  2. For each fold vv: fit the model on all data except fold vv; predict fold vv.
  3. Compute the cross-validated prediction error:

MSPECV=1ni=1n(yiy^i(vi))T(yiy^i(vi))MSPE_{CV} = \frac{1}{n}\sum_{i=1}^n (\mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v_i)})^T(\mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v_i)})

Written equivalently with the Euclidean norm (the trace criterion):

TraceCV=1ni=1nyiy^i(vi)2Trace_{CV} = \frac{1}{n}\sum_{i=1}^n \|\mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v_i)}\|^2

Select the model minimising the cross-validated prediction error.
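The three steps above can be sketched compactly. This example compares a one-predictor and a two-predictor model by 5-fold cross-validated trace error; the data-generating process and names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 90
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Both responses depend on x1; the first also depends on x2.
Y = np.column_stack([1 + 2 * x1 + 0.5 * x2, -1 + x1]) \
    + rng.normal(scale=0.3, size=(n, 2))

def cv_mspe(X, Y, k=5):
    idx = np.arange(n)
    sse = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                       # fit without fold v
        B = np.linalg.lstsq(X[train], Y[train], rcond=None)[0]
        resid = Y[fold] - X[fold] @ B                         # predict fold v
        sse += (resid ** 2).sum()
    return sse / n                                            # trace criterion

X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x1, x2])
print(cv_mspe(X1, Y), cv_mspe(X2, Y))
```

The model including `x2` should achieve the lower cross-validated error, since `x2` carries real signal for the first response.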


14. Special Cases and Connections to Other Methods

14.1 Univariate Multiple Regression (q=1q = 1)

When q=1q = 1, the response matrix Y\mathbf{Y} reduces to a vector y\mathbf{y} and the coefficient matrix B\mathbf{B} reduces to a vector β\boldsymbol{\beta}. All MLM formulas reduce to standard OLS:

β^=(XTX)1XTy\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

All multivariate test statistics reduce to the standard FF-test.

14.2 MANOVA as a Special Case

When all predictors in X\mathbf{X} are indicator variables (dummy coding of group membership) and no continuous covariates are included, the MLM reduces to MANOVA. The hypothesis H0:CB=0H_0: \mathbf{C}\mathbf{B} = \mathbf{0} with C\mathbf{C} specifying group contrasts is the MANOVA test for group mean vector differences.

The SSCP matrices then match the familiar MANOVA decomposition: the hypothesis SSCP H\mathbf{H} is the between-groups matrix and the error SSCP E\mathbf{E} is the within-groups matrix, with T=H+E\mathbf{T} = \mathbf{H} + \mathbf{E}.

14.3 MANCOVA as a Special Case

When X\mathbf{X} contains both group indicators (categorical) and continuous covariates, the MLM reduces to MANCOVA. The MANCOVA test for group differences (after adjusting for covariates) corresponds to the MLM hypothesis test H0:CgroupB=0H_0: \mathbf{C}_{group}\mathbf{B} = \mathbf{0} where Cgroup\mathbf{C}_{group} specifies contrasts among group effects.

The "adjustment" for covariates happens automatically within the MLM framework — the covariate columns of X\mathbf{X} absorb the variation they explain, leaving purer group comparisons.

14.4 Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis studies the association between two sets of variables: predictors X\mathbf{X} and responses Y\mathbf{Y}. It finds linear combinations:

u=Xaandv=Yb\mathbf{u} = \mathbf{X}\mathbf{a} \quad \text{and} \quad \mathbf{v} = \mathbf{Y}\mathbf{b}

That maximise Cor(u,v)\text{Cor}(\mathbf{u}, \mathbf{v}). The canonical correlations are the square roots of the eigenvalues of:

SXX1SXYSYY1SYX\mathbf{S}_{XX}^{-1}\mathbf{S}_{XY}\mathbf{S}_{YY}^{-1}\mathbf{S}_{YX}

In the MLM framework, CCA provides insight into the overall association structure between the predictor and response sets. The eigenvalues of E1H\mathbf{E}^{-1}\mathbf{H} in the MLM test of the full model are directly related to the squared canonical correlations.

14.5 Discriminant Function Analysis

When X\mathbf{X} consists of group indicators, the eigenvectors of E1H\mathbf{E}^{-1}\mathbf{H} define the linear discriminant functions — the linear combinations of responses that maximally separate the groups. DFA is therefore a post-hoc analysis tool following a significant MANOVA (and thus a significant MLM with categorical predictors).

14.6 Repeated Measures and Profile Analysis

When the qq response variables represent the same variable measured at qq time points on the same observations, the MLM becomes a repeated-measures (profile analysis) model. The response transformation matrix M\mathbf{M} is used to form contrasts among time points (e.g., successive differences), so that testing H0:CBM=0H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0} with such an M\mathbf{M} yields the within-subject tests of change over time.

14.7 Seemingly Unrelated Regression (SUR)

When each response variable has its own distinct set of predictors (not the shared X\mathbf{X} of standard MLM), the Seemingly Unrelated Regression model applies. Standard OLS applied separately to each equation is unbiased but inefficient; GLS using the cross-equation error covariances (Zellner's FGLS estimator) is asymptotically more efficient.


15. Using the Multivariate Linear Models Component

The Multivariate Linear Models component in the DataStatPro application provides a full end-to-end workflow for fitting, evaluating, and interpreting multivariate regression models.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain at least two continuous response variables and one or more predictor variables.

Step 2 — Select Response Variables (Y) Select two or more continuous response variables from the "Response Variables (Y)" panel. These represent the jointly modelled outcomes.

💡 Select response variables that are theoretically related and expected to be influenced by the same set of predictors. Responses that share common error variance benefit most from the multivariate framework.

Step 3 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. Predictors can be continuous (entered as-is) or categorical (automatically dummy-coded against a reference level).

Step 4 — Configure Interactions (Optional) Specify interaction terms by selecting pairs of predictor variables. The application creates the product terms and adds them to the design matrix.

Step 5 — Configure Polynomial Terms (Optional) For continuous predictors, specify polynomial degrees (e.g., quadratic: $X^2$, cubic: $X^3$) to model curvilinear relationships.

Step 6 — Configure Centering and Scaling Select preprocessing for predictors: none, mean-centering, or z-score standardisation. Centering eases interpretation of intercepts and interaction terms; standardisation puts coefficients on a common scale.

Step 7 — Specify Hypothesis Tests (Optional) Define custom hypotheses of the form $H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}$ by supplying the contrast matrix $\mathbf{C}$ (combinations of coefficient rows) and, optionally, the response transformation matrix $\mathbf{M}$ (combinations of response columns).

Pre-built hypothesis options include the overall regression test (all predictors jointly, all responses) and individual per-predictor tests.

Step 8 — Select Confidence Level Choose the confidence level for confidence intervals (default: 95%).

Step 9 — Select Display Options Choose which outputs to display: coefficient tables, multivariate test statistics, effect sizes, residual diagnostics, and plots.

Step 10 — Run the Analysis Click "Run Multivariate Linear Model". The application will:

  1. Construct the design matrix X\mathbf{X} (with dummy coding, interaction, polynomial terms).
  2. Compute B^=(XTX)1XTY\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}.
  3. Compute residuals E^\hat{\mathbf{E}} and estimate Σ^\hat{\boldsymbol{\Sigma}}.
  4. Compute all four multivariate test statistics for each specified hypothesis.
  5. Compute univariate regression statistics for each response.
  6. Compute effect sizes, leverage, Cook's distances, VIFs.
  7. Run multivariate normality tests on residuals.
  8. Generate all selected visualisations and tables.

16. Computational and Formula Details

16.1 Step-by-Step Computation

Step 1: Construct the design matrix X\mathbf{X} (n×(p+1)n \times (p+1))

Include a column of ones (for the intercept) as the first column, followed by the predictor columns (with dummy coding for categorical predictors).

Step 2: Verify rank condition

Check that rank(X)=p+1\text{rank}(\mathbf{X}) = p + 1 (full column rank). If not, the model is under-identified — remove linearly dependent columns.

Step 3: Compute (XTX)1(\mathbf{X}^T\mathbf{X})^{-1}

Using Cholesky decomposition (more numerically stable than direct inversion):

XTX=LLT(Cholesky)\mathbf{X}^T\mathbf{X} = \mathbf{L}\mathbf{L}^T \quad (\text{Cholesky})

(XTX)1=(LT)1L1(\mathbf{X}^T\mathbf{X})^{-1} = (\mathbf{L}^T)^{-1}\mathbf{L}^{-1}

Step 4: Compute the hat matrix H\mathbf{H}

H=X(XTX)1XT(n×n)\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \quad (n \times n)

For large $n$, avoid storing $\mathbf{H}$ explicitly; compute the leverages $h_{ii}$ directly as quadratic forms:

hii=xiT(XTX)1xih_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i

Step 5: Compute the coefficient matrix

B^=(XTX)1XTY((p+1)×q)\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \quad ((p+1) \times q)

Step 6: Compute fitted values and residuals

Y^=XB^=HY,E^=YY^=(IH)Y\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}} = \mathbf{H}\mathbf{Y}, \quad \hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}

Step 7: Estimate the error covariance matrix

Σ^=E^TE^np1\hat{\boldsymbol{\Sigma}} = \frac{\hat{\mathbf{E}}^T\hat{\mathbf{E}}}{n - p - 1}

Step 8: Compute standard errors for each coefficient

For coefficient β^kj\hat{\beta}_{kj}:

SE(β^kj)=σ^jj[(XTX)1]kkSE(\hat{\beta}_{kj}) = \sqrt{\hat{\sigma}_{jj} \cdot [(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}

Where σ^jj=Σ^jj\hat{\sigma}_{jj} = \hat{\boldsymbol{\Sigma}}_{jj} is the estimated error variance for response jj.
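Steps 1-8 can be sketched end-to-end with a Cholesky solve, as recommended above. A minimal illustration on synthetic data (names and sizes are arbitrary):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(8)
n, p, q = 70, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])    # Step 1: design
assert np.linalg.matrix_rank(X) == p + 1                      # Step 2: full rank
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

XtX = X.T @ X
chol = cho_factor(XtX)                                        # Step 3: Cholesky
B_hat = cho_solve(chol, X.T @ Y)                              # Step 5: coefficients
h = np.einsum('ij,ij->i', X, cho_solve(chol, X.T).T)          # Step 4: leverages
E_hat = Y - X @ B_hat                                         # Step 6: residuals
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)                   # Step 7: error cov
XtX_inv = cho_solve(chol, np.eye(p + 1))
SE = np.sqrt(np.outer(np.diag(XtX_inv), np.diag(Sigma_hat)))  # Step 8: (p+1) x q
print(B_hat.round(2), SE.round(3))
```

The leverages satisfy $\sum_i h_{ii} = p+1$, a useful sanity check on the computation.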

16.2 Computing the Hypothesis SSCP Matrix

For hypothesis H0:CBM=0H_0: \mathbf{C}\mathbf{B}\mathbf{M} = \mathbf{0}:

Step 1: Compute B^M\hat{\mathbf{B}}\mathbf{M} — the transformed coefficient matrix ((p+1)×m(p+1) \times m).

Step 2: Compute CB^M\mathbf{C}\hat{\mathbf{B}}\mathbf{M} — the estimated contrast (c×mc \times m).

Step 3: Compute the hypothesis SSCP:

H=(CB^M)T[C(XTX)1CT]1(CB^M)(m×m)\mathbf{H} = (\mathbf{C}\hat{\mathbf{B}}\mathbf{M})^T \left[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T\right]^{-1} (\mathbf{C}\hat{\mathbf{B}}\mathbf{M}) \quad (m \times m)

Step 4: Compute the error SSCP:

E=MTE^TE^M(m×m)\mathbf{E} = \mathbf{M}^T\hat{\mathbf{E}}^T\hat{\mathbf{E}}\mathbf{M} \quad (m \times m)

Step 5: Compute eigenvalues λ1,,λs\lambda_1, \dots, \lambda_{s^*} of E1H\mathbf{E}^{-1}\mathbf{H}.

Step 6: Compute test statistics (Wilks', Pillai's, Hotelling-Lawley, Roy's) from the eigenvalues.
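Steps 1-6 translate directly into code. This sketch tests all slopes jointly ($\mathbf{C}$ selecting the non-intercept rows, $\mathbf{M} = \mathbf{I}_q$) on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, q = 80, 3, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
E_resid = Y - X @ B_hat

C = np.zeros((p, p + 1)); C[:, 1:] = np.eye(p)   # test all slopes jointly
M = np.eye(q)                                    # keep all responses

CBM = C @ B_hat @ M                              # estimated contrast
mid = np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T)
H = CBM.T @ mid @ CBM                            # hypothesis SSCP (m x m)
E = M.T @ E_resid.T @ E_resid @ M                # error SSCP (m x m)

lam = np.clip(np.linalg.eigvals(np.linalg.inv(E) @ H).real, 0, None)
wilks = np.prod(1 / (1 + lam))
pillai = np.sum(lam / (1 + lam))
hotelling = np.sum(lam)
roy = lam.max()
print(wilks, pillai, hotelling, roy)
```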

16.3 Rao's F-Approximation to Wilks' Lambda

Given Λ\Lambda^*, c=dfHc = df_H (rank of C\mathbf{C}), mm (columns of M\mathbf{M}), dfE=np1df_E = n - p - 1:

t=m2c24m2+c25(set t=1 if m2+c250)t = \sqrt{\frac{m^2c^2 - 4}{m^2 + c^2 - 5}} \quad (\text{set } t = 1 \text{ if } m^2+c^2-5 \leq 0)

df1=mc,df2=t(dfEmc+12)mc22df_1 = mc, \quad df_2 = t\left(df_E - \frac{m - c + 1}{2}\right) - \frac{mc-2}{2}

F=1Λ1/tΛ1/tdf2df1F = \frac{1-\Lambda^{*1/t}}{\Lambda^{*1/t}} \cdot \frac{df_2}{df_1}

This approximation is exact when m=1m = 1 or m=2m = 2 or c=1c = 1 or c=2c = 2.
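The approximation is a direct transcription of the formulas above. The call below uses arbitrary illustrative values (not taken from the worked examples):

```python
import numpy as np
from scipy.stats import f as f_dist

def rao_f(wilks, m, c, df_e):
    """Rao's F approximation for Wilks' Lambda (m = cols of M, c = rank of C)."""
    denom = m ** 2 + c ** 2 - 5
    t = np.sqrt((m ** 2 * c ** 2 - 4) / denom) if denom > 0 else 1.0
    df1 = m * c
    df2 = t * (df_e - (m - c + 1) / 2) - (m * c - 2) / 2
    lam_t = wilks ** (1.0 / t)
    F = (1 - lam_t) / lam_t * df2 / df1
    return F, df1, df2, f_dist.sf(F, df1, df2)

print(rao_f(0.72, m=3, c=2, df_e=114))   # exact case, since c = 2
```

With $m = 1$ the function collapses to the ordinary univariate $F$-test, a handy correctness check.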

16.4 Confidence Ellipse for Two Coefficients

The joint 100(1α)%100(1-\alpha)\% confidence ellipse for (βk1,j,βk2,j)(\beta_{k_1, j}, \beta_{k_2, j}) (two coefficients in the same response equation jj) is:

(β^(k1,k2),jβ(k1,k2),j)T[σ^jjV(k1,k2)]1(β^(k1,k2),jβ(k1,k2),j)2Fα,2,np1\left(\hat{\boldsymbol{\beta}}_{(k_1,k_2),j} - \boldsymbol{\beta}_{(k_1,k_2),j}\right)^T \left[\hat{\sigma}_{jj}\mathbf{V}_{(k_1,k_2)}\right]^{-1} \left(\hat{\boldsymbol{\beta}}_{(k_1,k_2),j} - \boldsymbol{\beta}_{(k_1,k_2),j}\right) \leq 2F_{\alpha,2,n-p-1}

Where V(k1,k2)\mathbf{V}_{(k_1,k_2)} is the 2×22 \times 2 submatrix of (XTX)1(\mathbf{X}^T\mathbf{X})^{-1} corresponding to rows/columns k1k_1 and k2k_2.

16.5 Prediction for New Observations

For a new observation xnew\mathbf{x}_{new} (p+1×1p+1 \times 1), the point prediction of the response vector is:

y^new=B^Txnew(q×1)\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new} \quad (q \times 1)

The prediction error covariance (accounting for both estimation uncertainty and future error variance):

Var(ynewy^new)=(1+xnewT(XTX)1xnew)Σ^\text{Var}(\mathbf{y}_{new} - \hat{\mathbf{y}}_{new}) = \left(1 + \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new}\right)\hat{\boldsymbol{\Sigma}}

A 100(1α)%100(1-\alpha)\% prediction ellipsoid for ynew\mathbf{y}_{new}:

(\mathbf{y}_{new} - \hat{\mathbf{y}}_{new})^T\left[\hat{\boldsymbol{\Sigma}}\left(1 + h_{new}\right)\right]^{-1}(\mathbf{y}_{new} - \hat{\mathbf{y}}_{new}) \leq \frac{q(n-p-1)}{n-p-q}\, F_{\alpha,q,n-p-q}

Where hnew=xnewT(XTX)1xnewh_{new} = \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new}.

Marginal (1α)%(1-\alpha)\% prediction intervals for individual response YjY_j:

y^new,j±tα/2,np1σ^jj(1+hnew)\hat{y}_{new,j} \pm t_{\alpha/2,\,n-p-1}\sqrt{\hat{\sigma}_{jj}\left(1 + h_{new}\right)}
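Point prediction and marginal prediction intervals follow the formulas above; a sketch on synthetic data, with all names illustrative:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(10)
n, p, q = 60, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ rng.normal(size=(p + 1, q)) + rng.normal(size=(n, q))

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
E_hat = Y - X @ B_hat
Sigma_hat = (E_hat.T @ E_hat) / (n - p - 1)

x_new = np.array([1.0, 0.3, -1.2])                  # includes the leading 1
y_hat = B_hat.T @ x_new                             # q-vector point prediction
h_new = x_new @ XtX_inv @ x_new
t_crit = t_dist.ppf(0.975, n - p - 1)
half = t_crit * np.sqrt(np.diag(Sigma_hat) * (1 + h_new))
lower, upper = y_hat - half, y_hat + half
print(y_hat, lower, upper)
```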

16.6 Partitioning the SSCP Matrix

For the standard MLM, the total SSCP partitions as:

T=Hmodel+E\mathbf{T} = \mathbf{H}_{model} + \mathbf{E}

Where:

Hmodel=Y^T(I11T/n)Y^\mathbf{H}_{model} = \hat{\mathbf{Y}}^T(\mathbf{I} - \mathbf{1}\mathbf{1}^T/n)\hat{\mathbf{Y}}

E=E^TE^=YT(IH)Y\mathbf{E} = \hat{\mathbf{E}}^T\hat{\mathbf{E}} = \mathbf{Y}^T(\mathbf{I}-\mathbf{H})\mathbf{Y}

T=YT(I11T/n)Y\mathbf{T} = \mathbf{Y}^T(\mathbf{I} - \mathbf{1}\mathbf{1}^T/n)\mathbf{Y}

The diagonal elements give the familiar univariate sums of squares: Hjj=SSreg,jH_{jj} = SS_{reg,j} and Ejj=SSres,jE_{jj} = SS_{res,j}.


17. Worked Examples

Example 1: Predicting Multiple Environmental Outcomes from Land Use Variables

Research Question: Do land use variables (% Urban cover, % Agricultural cover, Distance to nearest river in km) simultaneously predict multiple water quality indicators (pH, Dissolved Oxygen mg/L, Nitrate mg/L) across sampling sites?

Data: n=80n = 80 water sampling sites; p=3p = 3 predictors; q=3q = 3 responses.

Step 1: Descriptive Statistics

| Variable | Mean | SD | Min | Max |
|---|---|---|---|---|
| Urban (%) | 18.4 | 12.8 | 0.2 | 68.3 |
| Agriculture (%) | 41.2 | 19.6 | 3.1 | 88.7 |
| Distance (km) | 2.84 | 1.93 | 0.12 | 9.41 |
| pH | 7.21 | 0.48 | 5.82 | 8.34 |
| DO (mg/L) | 8.14 | 1.62 | 3.81 | 11.42 |
| Nitrate (mg/L) | 4.83 | 2.91 | 0.21 | 14.72 |

Step 2: Estimated Coefficient Matrix B^\hat{\mathbf{B}}

B^=(β^0,pHβ^0,DOβ^0,NO3β^Urb,pHβ^Urb,DOβ^Urb,NO3β^Agr,pHβ^Agr,DOβ^Agr,NO3β^Dist,pHβ^Dist,DOβ^Dist,NO3)=(7.8329.2141.0410.0120.0310.0830.0080.0140.0410.0610.2180.312)\hat{\mathbf{B}} = \begin{pmatrix} \hat{\beta}_{0,pH} & \hat{\beta}_{0,DO} & \hat{\beta}_{0,NO3} \\ \hat{\beta}_{Urb,pH} & \hat{\beta}_{Urb,DO} & \hat{\beta}_{Urb,NO3} \\ \hat{\beta}_{Agr,pH} & \hat{\beta}_{Agr,DO} & \hat{\beta}_{Agr,NO3} \\ \hat{\beta}_{Dist,pH} & \hat{\beta}_{Dist,DO} & \hat{\beta}_{Dist,NO3} \end{pmatrix} = \begin{pmatrix} 7.832 & 9.214 & 1.041 \\ -0.012 & -0.031 & 0.083 \\ -0.008 & -0.014 & 0.041 \\ 0.061 & 0.218 & -0.312 \end{pmatrix}

Step 3: Standard Errors, t-values, and p-values (per response)

Response: pH

| Predictor | $\hat{\beta}$ | SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept | 7.832 | 0.241 | 32.50 | < 0.001 | [7.352, 8.312] |
| Urban (%) | -0.012 | 0.005 | -2.40 | 0.019 | [-0.022, -0.002] |
| Agriculture (%) | -0.008 | 0.004 | -2.00 | 0.049 | [-0.016, 0.000] |
| Distance (km) | 0.061 | 0.031 | 1.97 | 0.053 | [-0.001, 0.123] |

RpH2=0.348R^2_{pH} = 0.348, Adjusted R2=0.320R^2 = 0.320

Response: Dissolved Oxygen (DO)

| Predictor | $\hat{\beta}$ | SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept | 9.214 | 0.584 | 15.78 | < 0.001 | [8.051, 10.377] |
| Urban (%) | -0.031 | 0.012 | -2.58 | 0.012 | [-0.055, -0.007] |
| Agriculture (%) | -0.014 | 0.009 | -1.56 | 0.124 | [-0.032, 0.004] |
| Distance (km) | 0.218 | 0.075 | 2.91 | 0.005 | [0.069, 0.367] |

RDO2=0.411R^2_{DO} = 0.411, Adjusted R2=0.386R^2 = 0.386

Response: Nitrate (NO3)

| Predictor | $\hat{\beta}$ | SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept | 1.041 | 0.812 | 1.28 | 0.204 | [-0.573, 2.655] |
| Urban (%) | 0.083 | 0.017 | 4.88 | < 0.001 | [0.049, 0.117] |
| Agriculture (%) | 0.041 | 0.013 | 3.15 | 0.002 | [0.015, 0.067] |
| Distance (km) | -0.312 | 0.104 | -3.00 | 0.004 | [-0.519, -0.105] |

RNO32=0.562R^2_{NO3} = 0.562, Adjusted R2=0.543R^2 = 0.543

Step 4: Estimated Error Covariance and Correlation Matrices

Σ^=(0.1640.2140.3120.2141.8411.4280.3121.4285.214)\hat{\boldsymbol{\Sigma}} = \begin{pmatrix} 0.164 & 0.214 & -0.312 \\ 0.214 & 1.841 & -1.428 \\ -0.312 & -1.428 & 5.214 \end{pmatrix}

R^ϵ=(1.0000.3890.3380.3891.0000.4600.3380.4601.000)\hat{\mathbf{R}}_\epsilon = \begin{pmatrix} 1.000 & 0.389 & -0.338 \\ 0.389 & 1.000 & -0.460 \\ -0.338 & -0.460 & 1.000 \end{pmatrix}

The error correlation matrix reveals substantial within-site correlations among the water quality residuals (up to r=0.46|r| = 0.46), confirming that the multivariate framework is appropriate.

Step 5: Multivariate Hypothesis Tests

Overall model test (H0:B0=0H_0: \mathbf{B}_{-0} = \mathbf{0}, all three predictors, all three responses):

| Test Statistic | Value | $F$ | $df_1$ | $df_2$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.4027 | 10.842 | 9 | 187.4 | < 0.001 | 0.302 |
| Pillai's Trace | 0.6512 | 9.814 | 9 | 219 | < 0.001 | 0.287 |
| Hotelling-Lawley | 1.2483 | 12.842 | 9 | 178.0 | < 0.001 | 0.322 |
| Roy's Largest Root | 0.8941 | 21.727 | 3 | 76 | < 0.001 | 0.462 |

Interpretation: The overall model is highly significant (p<0.001p < 0.001). Land use variables jointly predict the combined water quality profile, with a large multivariate effect (ηp2=0.302\eta^2_p = 0.302 from Wilks' Lambda).

Per-predictor multivariate tests:

| Predictor | Wilks' $\Lambda^*$ | $F(3, 74)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Urban (%) | 0.6824 | 11.43 | < 0.001 | 0.317 |
| Agriculture (%) | 0.8241 | 5.27 | 0.002 | 0.176 |
| Distance (km) | 0.7613 | 7.74 | < 0.001 | 0.239 |

All three predictors have significant multivariate effects.

Step 6: Coefficient Profile Plot Interpretation

Plotting the rows of B^\hat{\mathbf{B}} across the three responses reveals mirror-image profiles: both land-cover predictors have negative effects on pH and DO but positive effects on Nitrate, while Distance shows the opposite pattern (positive for pH and DO, negative for Nitrate) — consistent with water quality degrading near urban and agricultural land.

Step 7: Prediction for a New Site

New site: Urban = 25%, Agriculture = 55%, Distance = 1.5 km.

y^new=B^Txnew=(7.832+(0.012)(25)+(0.008)(55)+(0.061)(1.5)9.214+(0.031)(25)+(0.014)(55)+(0.218)(1.5)1.041+(0.083)(25)+(0.041)(55)+(0.312)(1.5))\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new} = \begin{pmatrix} 7.832 + (-0.012)(25) + (-0.008)(55) + (0.061)(1.5) \\ 9.214 + (-0.031)(25) + (-0.014)(55) + (0.218)(1.5) \\ 1.041 + (0.083)(25) + (0.041)(55) + (-0.312)(1.5) \end{pmatrix}

=(7.8320.3000.440+0.0929.2140.7750.770+0.3271.041+2.075+2.2550.468)=(7.184 (pH)7.996 (DO mg/L)4.903 (NO3 mg/L))= \begin{pmatrix} 7.832 - 0.300 - 0.440 + 0.092 \\ 9.214 - 0.775 - 0.770 + 0.327 \\ 1.041 + 2.075 + 2.255 - 0.468 \end{pmatrix} = \begin{pmatrix} 7.184 \text{ (pH)} \\ 7.996 \text{ (DO mg/L)} \\ 4.903 \text{ (NO}_3\text{ mg/L)} \end{pmatrix}

Prediction intervals (95%):

hnew=xnewT(XTX)1xnew=0.024h_{new} = \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new} = 0.024

For pH: 7.184±1.9930.164(1+0.024)=7.184±0.817=[6.367,8.001]7.184 \pm 1.993\sqrt{0.164(1 + 0.024)} = 7.184 \pm 0.817 = [6.367, 8.001]
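The Step 7 point prediction can be reproduced directly from the coefficient matrix reported in Step 2:

```python
import numpy as np

# Coefficient matrix B_hat from Example 1, Step 2 (rows: intercept,
# Urban %, Agriculture %, Distance km; columns: pH, DO, NO3).
B_hat = np.array([
    [ 7.832,  9.214,  1.041],
    [-0.012, -0.031,  0.083],
    [-0.008, -0.014,  0.041],
    [ 0.061,  0.218, -0.312],
])
x_new = np.array([1.0, 25.0, 55.0, 1.5])   # intercept, Urban, Agriculture, Distance
y_hat = B_hat.T @ x_new
print(y_hat)   # -> approximately [7.1835, 7.996, 4.903] (pH, DO, NO3)
```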


Example 2: MLM with Categorical and Continuous Predictors — Treatment, Gender, and Biomarker Outcomes

Research Question: Do treatment condition (Drug A, Drug B, Placebo) and gender (Male/Female), along with baseline age, predict a profile of three biomarker outcomes (Cholesterol mmol/L, Blood Pressure mmHg, Glucose mmol/L)?

Data: n=120n = 120 participants; p=4p = 4 predictors (2 dummy variables for treatment, 1 dummy for gender, 1 continuous for age); q=3q = 3 responses.

Design Matrix Structure:

X=(1DDrugADDrugBDFemaleAgecentred)\mathbf{X} = \begin{pmatrix} 1 & D_{DrugA} & D_{DrugB} & D_{Female} & Age_{centred} \end{pmatrix}

Reference: Drug = Placebo, Gender = Male. Age is mean-centred (Ageˉ=48.3\bar{Age} = 48.3 years).

Step 1: Coefficient Matrix (Selected)

| Predictor | $\hat{\beta}_{Chol}$ | SE | $p$ | $\hat{\beta}_{BP}$ | SE | $p$ | $\hat{\beta}_{Gluc}$ | SE | $p$ |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 5.41 | 0.18 | < .001 | 132.4 | 2.41 | < .001 | 5.82 | 0.19 | < .001 |
| Drug A | -0.84 | 0.22 | < .001 | -8.21 | 2.94 | 0.006 | -0.61 | 0.23 | 0.009 |
| Drug B | -0.41 | 0.22 | 0.064 | -4.83 | 2.94 | 0.103 | -0.28 | 0.23 | 0.225 |
| Female | -0.32 | 0.19 | 0.094 | -5.14 | 2.54 | 0.046 | -0.18 | 0.20 | 0.369 |
| Age | 0.031 | 0.008 | < .001 | 0.412 | 0.107 | < .001 | 0.028 | 0.008 | 0.001 |

Step 2: Multivariate Tests

Treatment effect (H0H_0: Drug A = Drug B = Placebo on all DVs):

| Test | Value | $F(6, 228)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.7214 | 7.12 | < 0.001 | 0.158 |
| Pillai's Trace | 0.2912 | 6.78 | < 0.001 | 0.151 |

Treatment significantly and jointly predicts the biomarker profile (p<0.001p < 0.001, medium-large effect ηp2=0.158\eta^2_p = 0.158).

Gender effect (H0H_0: Male = Female on all DVs):

| Test | Value | $F(3, 113)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.9124 | 3.62 | 0.015 | 0.088 |
| Pillai's Trace | 0.0876 | 3.62 | 0.015 | 0.088 |

Gender has a significant multivariate effect (p=0.015p = 0.015, small-medium ηp2=0.088\eta^2_p = 0.088), driven primarily by blood pressure differences.

Age effect (H0H_0: Age coefficient = 0 on all DVs):

| Test | Value | $F(3, 113)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | 0.8341 | 7.48 | < 0.001 | 0.166 |
| Pillai's Trace | 0.1659 | 7.48 | < 0.001 | 0.166 |

Age has a significant multivariate effect (p<0.001p < 0.001, large ηp2=0.166\eta^2_p = 0.166) — older participants have higher levels of all three biomarkers.

Step 3: Error Correlation Matrix

R^ϵ=(1.0000.4120.3310.4121.0000.2840.3310.2841.000)\hat{\mathbf{R}}_\epsilon = \begin{pmatrix} 1.000 & 0.412 & 0.331 \\ 0.412 & 1.000 & 0.284 \\ 0.331 & 0.284 & 1.000 \end{pmatrix}

Substantial positive error correlations confirm the appropriateness of the multivariate approach.

Step 4: Comparing Drug A vs. Drug B (Planned Contrast)

Hypothesis: H0:βDrugA,=βDrugB,H_0: \beta_{DrugA,\cdot} = \beta_{DrugB,\cdot} on all DVs (i.e., Drug A and Drug B have identical profiles of effects).

Using c=(0,1,1,0,0)T\mathbf{c} = (0, 1, -1, 0, 0)^T and M=I3\mathbf{M} = \mathbf{I}_3:

Wilks' Λ=0.9184\Lambda^* = 0.9184, F(3,113)=3.35F(3, 113) = 3.35, p=0.021p = 0.021, ηp2=0.082\eta^2_p = 0.082.

Drug A has a significantly different (stronger) effect profile compared to Drug B across the combined biomarker set (p=0.021p = 0.021). Examining individual coefficients: Drug A reduces cholesterol by 0.430.43 mmol/L more and blood pressure by 3.383.38 mmHg more than Drug B.

Conclusion: Drug A significantly and jointly reduces the three biomarker levels compared to Placebo (all p<0.01p < 0.01), whereas Drug B's effects are not individually significant. Age is a strong positive predictor of all biomarkers (all p0.001p \leq 0.001). Drug A's effects are significantly stronger than Drug B's across the combined biomarker profile.


Example 3: Profile of Cognitive Outcomes — Continuous Predictors

Research Question: Do IQ, years of education, and weekly exercise hours predict a profile of four cognitive assessment scores (Memory, Attention, Processing Speed, Executive Function) in older adults?

Data: $n = 150$ participants; $p = 3$ continuous predictors; $q = 4$ responses (all standardised T-scores, mean = 50, SD = 10).

Step 1: Standardised Coefficient Matrix $\hat{\mathbf{B}}^*$

| Predictor | $\hat{\beta}^*_{Mem}$ | $\hat{\beta}^*_{Att}$ | $\hat{\beta}^*_{Proc}$ | $\hat{\beta}^*_{Exec}$ |
|---|---|---|---|---|
| IQ | 0.412** | 0.381** | 0.448** | 0.521** |
| Education | 0.284** | 0.211** | 0.142* | 0.318** |
| Exercise | 0.118* | 0.081 | 0.241** | 0.098 |

\* $p < 0.05$; \*\* $p < 0.01$

Step 2: Multivariate Tests

| Predictor | Wilks' $\Lambda^*$ | $F(4, 143)$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| IQ | 0.5423 | 30.14 | < 0.001 | 0.458 |
| Education | 0.7841 | 9.82 | < 0.001 | 0.216 |
| Exercise | 0.9124 | 3.44 | 0.011 | 0.088 |

All three predictors significantly and jointly predict the cognitive profile. IQ has the largest multivariate effect ($\eta^2_p = 0.458$, large); education has a medium effect ($\eta^2_p = 0.216$); exercise has a small but significant effect ($\eta^2_p = 0.088$).

Step 3: Differential Effects Across Responses

Test: Does IQ have equal effects on all four cognitive scores?

$H_0$: $\beta_{IQ,Mem} = \beta_{IQ,Att} = \beta_{IQ,Proc} = \beta_{IQ,Exec}$

Using pairwise contrasts ($\mathbf{M}$ = three-column difference matrix):

Wilks' $\Lambda^* = 0.9312$, $F(3, 145) = 3.56$, $p = 0.016$. IQ has differential effects across the cognitive domains: it most strongly predicts Executive Function and Processing Speed (standardised $\beta^* \approx 0.45$–$0.52$) compared to Memory and Attention ($\beta^* \approx 0.38$–$0.41$).

Test: Does exercise have equal effects on Processing Speed and Attention?

$H_0$: $\beta_{Ex,Proc} = \beta_{Ex,Att}$

$\mathbf{c} = \mathbf{e}_{Ex}^T$, $\mathbf{m} = \mathbf{e}_{Proc} - \mathbf{e}_{Att}$

$t(146) = 2.14$, $p = 0.034$. Exercise has a significantly stronger effect on Processing Speed than Attention ($\beta^*_{Proc} = 0.241$ vs. $\beta^*_{Att} = 0.081$).
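Both contrast tests above follow the same recipe: form $\mathbf{C}\hat{\mathbf{B}}\mathbf{M}$, build the hypothesis and error SSCP matrices, and take a determinant ratio. A numpy sketch of that recipe on illustrative data (it returns only Wilks' $\Lambda^*$, without the $F$-approximation or $p$-value):

```python
import numpy as np

def wilks_lambda(X, Y, C, M):
    """Wilks' Lambda for the general linear hypothesis C @ B @ M = 0
    (a sketch of the textbook formulas; no F-approximation here)."""
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E_resid = Y - X @ B_hat
    E = M.T @ E_resid.T @ E_resid @ M        # error SSCP on M-transformed responses
    CBM = C @ B_hat @ M
    middle = np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T)
    H = CBM.T @ middle @ CBM                 # hypothesis SSCP
    return np.linalg.det(E) / np.linalg.det(H + E)

# Illustrative data: intercept + one predictor, two responses.
rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = np.column_stack([1.0 + 0.8 * x, 2.0 - 0.5 * x]) + rng.normal(scale=1.0, size=(n, 2))

C = np.array([[0.0, 1.0]])   # select the slope row of B
M = np.eye(2)                # all responses
lam = wilks_lambda(X, Y, C, M)
print(round(lam, 4))  # in (0, 1); smaller values indicate a stronger effect
```

Swapping in a difference matrix for `M` (e.g., columns $\mathbf{e}_j - \mathbf{e}_l$) turns this into the differential-effect tests shown above.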

Step 4: Univariate R² Summary

| Response | $R^2$ | Adjusted $R^2$ | Primary Predictor |
|---|---|---|---|
| Memory | 0.348 | 0.334 | IQ ($\eta^2_p = 0.175$) |
| Attention | 0.291 | 0.276 | IQ ($\eta^2_p = 0.149$) |
| Processing Speed | 0.402 | 0.389 | IQ ($\eta^2_p = 0.204$) |
| Executive Function | 0.481 | 0.470 | IQ ($\eta^2_p = 0.271$) |

Conclusion: The three predictors jointly explain 29–48% of variance in individual cognitive scores. IQ is the dominant predictor across all cognitive domains, but exercise specifically benefits processing speed more than other domains. Education consistently predicts all cognitive outcomes.


18. Common Mistakes and How to Avoid Them

Mistake 1: Running Separate Univariate Regressions Without Multivariate Testing

Problem: Conducting $q$ separate OLS regressions and reporting results variable-by-variable without testing multivariate hypotheses. This ignores the correlational structure among responses, inflates the familywise Type I error rate, and misses effects that exist only on linear combinations of responses.
Solution: Use the MLM framework to obtain multivariate tests of each predictor's joint effect on all responses simultaneously. Report multivariate test statistics alongside univariate results. Apply Bonferroni or Holm corrections if separately reporting univariate results.
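For the Holm correction mentioned in the solution, a short plain-Python sketch (the `holm_adjust` helper name is hypothetical):

```python
# A minimal Holm step-down correction for the q univariate p-values, assuming
# you still report per-response tests alongside the multivariate result.
def holm_adjust(p_values):
    """Return Holm-adjusted p-values (same order as the input)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)  # enforce monotone step-down
        adjusted[idx] = running_max
    return adjusted

print(holm_adjust([0.003, 0.021, 0.047]))  # e.g. three univariate p-values
```

Holm is uniformly more powerful than plain Bonferroni while still controlling the familywise error rate, which is why it is usually the better default here.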

Mistake 2: Neglecting to Report and Interpret the Error Covariance Matrix

Problem: Fitting an MLM but reporting only univariate coefficients and ignoring $\hat{\boldsymbol{\Sigma}}$, thereby missing a key piece of the analysis: the residual correlation structure among responses.
Solution: Always estimate and report $\hat{\boldsymbol{\Sigma}}$ (or at least the correlation form $\hat{\mathbf{R}}_\epsilon$). Large within-observation error correlations are the key justification for using MLM rather than separate regressions. If error correlations are near zero, separate regressions are nearly equivalent to MLM for estimation.

Mistake 3: Confusing the Coefficient Matrix Interpretation Direction

Problem: Misinterpreting whether the $k$-th row or the $j$-th column of $\hat{\mathbf{B}}$ corresponds to predictor $k$ or response $j$.
Solution: Establish and maintain a consistent convention. In the standard MLM $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$: rows of $\mathbf{B}$ correspond to predictors (including the intercept); columns of $\mathbf{B}$ correspond to responses. Always label the rows and columns of $\hat{\mathbf{B}}$ clearly in output tables.

Mistake 4: Ignoring Residual Diagnostics Across All Responses

Problem: Checking diagnostic plots for only one or two response variables and assuming the model is adequate for all, when violations (non-linearity, heteroscedasticity, outliers) may affect only some responses.
Solution: Examine residual diagnostic plots for every response variable individually. Also conduct multivariate residual diagnostics (Mahalanobis distance Q-Q plot, multivariate normality tests). A violation in even one response requires remediation.

Mistake 5: Using MLM When Responses Are Unrelated

Problem: Applying MLM to a set of conceptually unrelated response variables simply because they were measured, producing a statistically valid but scientifically meaningless combined analysis.
Solution: Only combine responses into an MLM when they represent a coherent theoretical construct (e.g., multiple indicators of the same underlying outcome dimension, multiple time points of the same measure, multiple facets of the same construct). Unrelated responses should be modelled separately.

Mistake 6: Ignoring Multicollinearity Among Predictors

Problem: Failing to check VIFs, leading to inflated standard errors, unstable coefficient estimates, and unreliable hypothesis tests. This is especially problematic in MLM because the instability propagates across all response equations simultaneously.
Solution: Always compute VIFs before finalising the model. If any $VIF_k > 10$, consider removing one of the collinear predictors, creating a composite, using ridge regression, or using PLS. Report VIFs in the methods section.
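VIFs follow directly from their definition, $VIF_k = 1/(1 - R_k^2)$, where $R_k^2$ comes from regressing $X_k$ on the remaining predictors. A numpy sketch with a deliberately collinear pair of predictors:

```python
import numpy as np

def vifs(X):
    """VIF for each predictor column of X (X excludes the intercept column);
    a plain-numpy sketch of the regress-each-predictor-on-the-rest recipe."""
    n, p = X.shape
    out = []
    for k in range(p):
        others = np.delete(X, k, axis=1)
        Z = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(Z, X[:, k], rcond=None)
        resid = X[:, k] - Z @ coef
        r2 = 1.0 - resid @ resid / np.sum((X[:, k] - X[:, k].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
v = vifs(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))  # x1 and x2 show very large VIFs; x3 stays near 1
```

The collinear pair produces VIFs far above the 10 threshold, while the independent predictor stays near 1, matching the severity guide later in this tutorial.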

Mistake 7: Testing Multivariate Hypotheses Without Checking the Appropriate Contrast Matrices

Problem: Specifying incorrect or incomplete $\mathbf{C}$ and $\mathbf{M}$ matrices, leading to tests of unintended hypotheses. For example, forgetting that the intercept occupies the first row of $\mathbf{B}$ and incorrectly indexing predictors.
Solution: Always verify the $\mathbf{C}$ and $\mathbf{M}$ matrices by constructing them explicitly and checking that $\mathbf{C}\hat{\mathbf{B}}\mathbf{M}$ produces the intended contrast. The DataStatPro application automatically constructs $\mathbf{C}$ and $\mathbf{M}$ from user specifications, but always review the hypothesis being tested.

Mistake 8: Selecting Variables Based Only on Univariate Significance

Problem: Dropping predictors from the MLM based solely on their univariate significance for individual responses (e.g., "agriculture is not significant for DO, so remove it"), potentially removing predictors that have significant multivariate effects.
Solution: Use multivariate criteria for variable selection: multivariate AIC/BIC, a likelihood ratio test for the full row of $\mathbf{B}$ (all $q$ responses simultaneously), or the multivariate $F$-test for the predictor's row. A predictor non-significant for one response may still significantly predict other responses or combinations thereof.

Mistake 9: Over-Interpreting Large Standardised Coefficients Without Confidence Intervals

Problem: Ranking predictor importance based on standardised coefficient magnitudes alone, without reporting confidence intervals, leading to overconfident conclusions about which predictors matter most.
Solution: Always report confidence intervals for standardised coefficients. A large standardised coefficient with a wide confidence interval indicates imprecision. Use effect sizes ($\eta^2_p$, $\omega^2$) alongside coefficient estimates for importance assessment.

Mistake 10: Neglecting Multivariate Outlier Detection

Problem: Identifying and addressing univariate outliers (per response) but missing multivariate outliers: observations that are unusual on the combination of responses even if not extreme on any individual response.
Solution: Always compute Mahalanobis distances of residuals and create a $D^2$ vs. $\chi^2_q$ Q-Q plot. Investigate observations with $D^2 > \chi^2_{q, 0.001}$ for data errors or genuine anomalies. Report Cook's distances and the sensitivity of results to influential observations.
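A sketch of that screening step in numpy, assuming the residual matrix is already in hand; the cutoff $\chi^2_{3, 0.001} \approx 16.27$ is hardcoded from standard tables rather than computed:

```python
import numpy as np

# Multivariate outlier screening on MLM residuals: compute each observation's
# squared Mahalanobis distance and compare with a chi-square cutoff
# (the tabled value chi2_{3, 0.001} ~= 16.27 for q = 3 responses).
rng = np.random.default_rng(3)
n, q = 100, 3
E_hat = rng.normal(size=(n, q))           # stand-in residual matrix
E_hat[0] = [6.0, -6.0, 6.0]               # plant one gross multivariate outlier

Sigma_hat = np.cov(E_hat, rowvar=False)   # residual covariance (illustrative)
Sinv = np.linalg.inv(Sigma_hat)
D2 = np.einsum('ij,jk,ik->i', E_hat, Sinv, E_hat)  # D2_i = e_i^T Sigma^{-1} e_i

cutoff = 16.27                            # approx. upper-0.001 chi2 quantile, q = 3
flagged = np.where(D2 > cutoff)[0]
print(flagged)  # should include index 0
```

Note the planted observation is not wildly extreme on any single coordinate scale, yet its combination of residuals makes $D^2$ far exceed the cutoff, which is exactly the failure mode univariate screening misses.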


19. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| $(\mathbf{X}^T\mathbf{X})^{-1}$ does not exist | Perfect multicollinearity among predictors; rank-deficient design matrix; $n < p+1$ | Check for linearly dependent columns (e.g., sum of dummy variables = constant); remove redundant predictors; collect more data |
| $\hat{\boldsymbol{\Sigma}}$ is singular | $n - p - 1 \leq q$ (too few residual df); perfectly correlated responses | Increase $n$; reduce $q$ (remove highly correlated responses); use a regularised estimator |
| Very large standard errors for some coefficients | Near-multicollinearity ($VIF > 10$); very small predictor variance | Check VIFs; standardise predictors; use ridge regression; remove one of a pair of collinear predictors |
| Wilks' $\Lambda^* = 1$ (non-significant, $p = 1$) | No predictive signal; all responses uncorrelated with all predictors; data entry error | Verify data; check variable coding; consider whether the research question is plausible |
| Wilks' $\Lambda^* < 0$ or $> 1$ | Computational error; near-singular $\mathbf{E}$; numerical overflow | Check for perfect multicollinearity among responses; verify $n - p - 1 > q$; inspect raw data |
| Mahalanobis distance Q-Q plot strongly curved | Severe multivariate non-normality of residuals; influential outliers | Identify and investigate outliers (Cook's distance, leverage); transform skewed responses (log, Box-Cox); use bootstrap inference |
| Significant multivariate test but no significant univariate tests | Effect exists on a linear combination of DVs not captured individually | Focus on discriminant function / canonical variate analysis; do not dismiss the multivariate result; it is valid |
| Significant univariate tests but non-significant multivariate test | Insufficient power for multivariate test with many DVs; DVs poorly correlated with group effects | Consider removing uninformative DVs; increase sample size; report both results with context |
| Residual plots show clear non-linearity for one DV | Omitted non-linear term for that response; wrong functional form | Add polynomial or spline term for the offending predictor-response combination; consider transformation of the response |
| Residual plots show heteroscedasticity for one DV | Variance of one response changes with fitted values | Apply response transformation (log, square root); use heteroscedasticity-consistent (HC) standard errors; consider weighted MLM |
| Very high within-observation residual correlations ($r > 0.95$) | Near-duplicate responses measuring the same construct; $\hat{\boldsymbol{\Sigma}}$ approaching singularity | Drop or combine the redundant responses; verify $\hat{\boldsymbol{\Sigma}}$ remains invertible |
| Cook's distance extremely large for one observation | Data entry error; genuine anomaly; highly influential leverage point | Verify data accuracy; report analysis with and without the influential observation; use robust estimation |
| AIC increases when adding a theoretically important predictor | Predictor explains negligible variance or adds noise | Report both models; note the theoretically motivated predictor regardless; consider whether $n$ is adequate to detect the hypothesised effect |
| $F$-approximations for Wilks', Pillai's, and Roy's give different conclusions | Complex eigenstructure (effects spread across multiple dimensions); small sample size | Report all four statistics; prioritise Pillai's Trace; investigate canonical variate structure |

20. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
|---|---|
| $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$ | Multivariate linear model |
| $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ | OLS/MLE coefficient estimator |
| $\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}} = \mathbf{H}\mathbf{Y}$ | Fitted values |
| $\hat{\mathbf{E}} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I}-\mathbf{H})\mathbf{Y}$ | Residual matrix |
| $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{E}}^T\hat{\mathbf{E}}/(n-p-1)$ | Error covariance estimator |
| $SE(\hat{\beta}_{kj}) = \sqrt{\hat{\sigma}_{jj}[(\mathbf{X}^T\mathbf{X})^{-1}]_{kk}}$ | Standard error of coefficient |
| $\mathbf{H} = (\mathbf{C}\hat{\mathbf{B}}\mathbf{M})^T[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T]^{-1}(\mathbf{C}\hat{\mathbf{B}}\mathbf{M})$ | Hypothesis SSCP matrix |
| $\Lambda^* = \det(\mathbf{E})/\det(\mathbf{H}+\mathbf{E}) = \prod 1/(1+\lambda_s)$ | Wilks' Lambda |
| $V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}] = \sum \lambda_s/(1+\lambda_s)$ | Pillai's Trace |
| $U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum \lambda_s$ | Hotelling-Lawley Trace |
| $\theta = \lambda_1/(1+\lambda_1)$ | Roy's Largest Root |
| $\eta^2_p = 1 - \Lambda^{*1/t}$ | Multivariate partial eta-squared |
| $\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new}$ | Point prediction for new observation |
| $D^2_i = \hat{\boldsymbol{\epsilon}}_i^T\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\epsilon}}_i$ | Mahalanobis distance of residuals |

Four Multivariate Test Statistics Summary

| Statistic | Formula | Range | Robustness | Best When |
|---|---|---|---|---|
| Wilks' $\Lambda^*$ | $\prod 1/(1+\lambda_s)$ | $[0,1]$ | Moderate | Standard analysis, effects spread |
| Pillai's Trace $V$ | $\sum \lambda_s/(1+\lambda_s)$ | $[0,s^*]$ | Highest | Robustness needed, small/unequal $n$ |
| Hotelling-Lawley $U$ | $\sum \lambda_s$ | $[0,\infty)$ | Lowest | Single dominant dimension |
| Roy's $\theta$ | $\lambda_1/(1+\lambda_1)$ | $[0,1]$ | Lowest | Theory predicts single dimension |
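All four statistics are deterministic functions of the eigenvalues $\lambda_s$ of $\mathbf{E}^{-1}\mathbf{H}$, which makes them easy to sanity-check numerically. A sketch with made-up SSCP matrices:

```python
import numpy as np

# Hypothetical hypothesis and error SSCP matrices (q = 2), chosen only to
# illustrate the eigenvalue formulas; real H and E come from the fitted model.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
E = np.array([[10.0, 2.0],
              [2.0,  8.0]])

lam = np.linalg.eigvals(np.linalg.inv(E) @ H).real  # eigenvalues lambda_s
lam = np.sort(lam)[::-1]

wilks = np.prod(1.0 / (1.0 + lam))          # Wilks' Lambda
pillai = np.sum(lam / (1.0 + lam))          # Pillai's Trace
hotelling = np.sum(lam)                     # Hotelling-Lawley Trace
roy = lam[0] / (1.0 + lam[0])               # Roy's Largest Root

print(round(wilks, 4), round(pillai, 4), round(hotelling, 4), round(roy, 4))
```

A useful check: the eigenvalue product form of Wilks' $\Lambda^*$ must equal the determinant ratio $\det(\mathbf{E})/\det(\mathbf{H}+\mathbf{E})$, since $\det(\mathbf{H}+\mathbf{E}) = \det(\mathbf{E})\prod(1+\lambda_s)$.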

General Linear Hypothesis Guide

| Hypothesis | $\mathbf{C}$ | $\mathbf{M}$ | Description |
|---|---|---|---|
| All predictors, all responses | $\mathbf{I}_p$ (no intercept) | $\mathbf{I}_q$ | Overall model test |
| Predictor $k$, all responses | $\mathbf{e}_k^T$ | $\mathbf{I}_q$ | Effect of $X_k$ on all DVs |
| All predictors, response $j$ | $\mathbf{I}_p$ | $\mathbf{e}_j$ | Univariate model for $Y_j$ |
| Predictor $k$, response $j$ | $\mathbf{e}_k^T$ | $\mathbf{e}_j$ | Single coefficient test |
| $X_k$ equal effect on $Y_j$ and $Y_l$ | $\mathbf{e}_k^T$ | $\mathbf{e}_j - \mathbf{e}_l$ | Differential effect test |
| Groups $g_1$ vs. $g_2$, all DVs | Contrast row for groups | $\mathbf{I}_q$ | MANOVA pairwise comparison |
| Subset of predictors (rows $k_1,\ldots,k_c$) | $c$-row selection matrix | $\mathbf{I}_q$ | Joint test of predictor subset |

Effect Size Benchmarks

| Effect Size | Small | Medium | Large |
|---|---|---|---|
| $\eta^2_p$ (multivariate) | 0.01 | 0.06 | 0.14 |
| $\omega^2$ | 0.01 | 0.06 | 0.14 |
| Canonical $R^2$ | 0.01 | 0.09 | 0.25 |
| Univariate $R^2$ | 0.02 | 0.13 | 0.26 |

Assumption Diagnostics Summary

| Assumption | Check | Threshold | Remedy |
|---|---|---|---|
| Linearity | Partial residual plots | Visual pattern | Polynomial/spline terms |
| Multivariate normality | Mardia's tests, $D^2$ Q-Q plot | $p < 0.05$ | Transform responses; bootstrap |
| Independence | Study design; Durbin-Watson | $DW < 1.5$ or $> 2.5$ | Mixed models; GEE |
| Homoscedasticity | Scale-location plot; Breusch-Pagan | $p < 0.05$ | Transform responses; HC SEs |
| No multicollinearity | VIF | $VIF > 10$ | Remove/combine predictors; ridge |
| No outliers | Cook's $D$; Mahalanobis $D^2$ | $D > 4/n$; $D^2 > \chi^2_{q,0.001}$ | Investigate; robust estimation |
| Sufficient $n$ | $n - p - 1 > q$ | Absolute minimum | Collect more data; reduce $q$ |

Model Selection Criteria

| Criterion | Formula | Prefer | Notes |
|---|---|---|---|
| AIC | $-2\ell + 2k$ | Lower | Predictive accuracy |
| BIC | $-2\ell + k\ln(n)$ | Lower | Parsimony |
| LRT ($\chi^2$) | $-[n-p-1-(m+c+1)/2]\ln(\Lambda^*)$ | $p > 0.05$ to retain reduced | Nested models only |
| CV MSPE | $\frac{1}{n}\sum \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i^{(-v)} \rVert^2$ | Lower | Out-of-sample prediction |

Special Cases of the MLM

| Special Case | Conditions | Notes |
|---|---|---|
| Univariate regression | $q = 1$ | Standard OLS |
| MANOVA | $\mathbf{X}$ = group indicators only | Groups compared on $q$ DVs |
| MANCOVA | $\mathbf{X}$ = groups + covariates | Adjusted group comparisons |
| Profile analysis | $q$ = repeated measures; $\mathbf{M}$ = difference matrix | Tests parallelism, levels, flatness |
| Canonical correlation | Test of all predictors, $\mathbf{M} = \mathbf{I}_q$ | CCA = MLM eigenstructure |
| Discriminant analysis | $\mathbf{X}$ = groups; post-hoc eigenvectors | DFA follows MANOVA |

Prediction Formulas

| Target | Formula |
|---|---|
| Point prediction | $\hat{\mathbf{y}}_{new} = \hat{\mathbf{B}}^T\mathbf{x}_{new}$ |
| Prediction SE (response $j$) | $\sqrt{\hat{\sigma}_{jj}(1 + h_{new})}$ |
| 95% prediction interval ($Y_j$) | $\hat{y}_{new,j} \pm t_{0.025,n-p-1}\sqrt{\hat{\sigma}_{jj}(1+h_{new})}$ |
| Leverage of new point | $h_{new} = \mathbf{x}_{new}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_{new}$ |
| Prediction ellipsoid | $(\mathbf{y}-\hat{\mathbf{y}})^T[\hat{\boldsymbol{\Sigma}}(1+h_{new})]^{-1}(\mathbf{y}-\hat{\mathbf{y}}) \leq qF_{\alpha,q,n-p-q}$ |
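The prediction formulas above chain together in one pass; a numpy sketch on synthetic data, with the $t$ critical value hardcoded (~1.98 for roughly 100 residual df) rather than looked up, so the interval is illustrative only:

```python
import numpy as np

# Point prediction, leverage, and an approximate 95% prediction interval
# for one response of a fitted MLM (all data here is synthetic).
rng = np.random.default_rng(4)
n, q = 104, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = np.column_stack([2.0 + x, 1.0 - 0.5 * x]) + rng.normal(scale=0.3, size=(n, q))

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
E_hat = Y - X @ B_hat
p = X.shape[1] - 1
Sigma_hat = E_hat.T @ E_hat / (n - p - 1)

x_new = np.array([1.0, 0.5])                    # intercept + predictor value
y_new = B_hat.T @ x_new                         # point prediction (q-vector)
h_new = x_new @ np.linalg.inv(X.T @ X) @ x_new  # leverage of the new point
se_j = np.sqrt(Sigma_hat[0, 0] * (1.0 + h_new)) # prediction SE for response 1

t_crit = 1.98                                   # approx. t_{0.025, 102}
lo, hi = y_new[0] - t_crit * se_j, y_new[0] + t_crit * se_j
print(round(y_new[0], 2), (round(lo, 2), round(hi, 2)))
```

In an application, the $t$ quantile would come from a statistics library rather than being hardcoded, and the joint ellipsoid would replace the per-response interval when all $q$ responses are predicted simultaneously.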

Multicollinearity Severity Guide

| VIF Range | Severity | Action |
|---|---|---|
| 1 – 2 | None | Proceed normally |
| 2 – 5 | Low | Monitor; no action required |
| 5 – 10 | Moderate | Check; consider remediation |
| 10 – 30 | High | Remediation recommended |
| > 30 | Severe | Remediation required |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Multivariate Linear Models using the DataStatPro application. For further reading, consult Johnson & Wichern's "Applied Multivariate Statistical Analysis" (6th ed., Pearson, 2007), Rencher & Christensen's "Methods of Multivariate Analysis" (3rd ed., Wiley, 2012), or Anderson's "An Introduction to Multivariate Statistical Analysis" (3rd ed., Wiley, 2003). For feature requests or support, contact the DataStatPro team.