
Time Series Analysis

Learn fundamental time series analysis techniques.

Time Series Analysis: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of time series analysis all the way through advanced modelling, forecasting, and diagnostics, with practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Time Series Analysis?
  3. Components of a Time Series
  4. Time Series Decomposition
  5. Stationarity
  6. Autocorrelation and Partial Autocorrelation
  7. Classical Time Series Models
  8. Exponential Smoothing Models
  9. Model Identification, Estimation, and Selection
  10. Model Diagnostics
  11. Forecasting and Prediction Intervals
  12. Advanced Topics
  13. Using the Time Series Component
  14. Computational and Formula Details
  15. Worked Examples
  16. Common Mistakes and How to Avoid Them
  17. Troubleshooting
  18. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into time series analysis, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Random Variables and Expectation

A random variable X is a variable whose value is the outcome of a random phenomenon. The expectation (mean) of X is:

E[X] = \mu = \sum_x x \cdot P(X = x) \quad \text{(discrete)}, \qquad E[X] = \int_{-\infty}^{\infty} x f(x)\, dx \quad \text{(continuous)}

The variance is:

\text{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - \mu^2

1.2 Covariance and Correlation

The covariance between two random variables X and Y measures how they move together:

\text{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]

The Pearson correlation normalises covariance to the range [-1, 1]:

\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}}

In time series, the covariance between a series and its own lagged values plays a central role.

1.3 The Lag Operator

The lag operator L (also called the backshift operator B) shifts a time series back by one period:

L X_t = X_{t-1}, \qquad L^k X_t = X_{t-k}

The lag operator is a powerful notational tool that simplifies the expression of time series models:

\Delta X_t = X_t - X_{t-1} = (1 - L) X_t
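The differencing operator is easy to exercise in code. A minimal plain-Python sketch (the `difference` helper is a name chosen here for illustration, not a DataStatPro function):

```python
def difference(y, lag=1):
    """Apply (1 - L^lag) to a series: y_t - y_{t-lag}."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

y = [10, 12, 15, 19, 24]
print(difference(y))         # first differences, (1 - L) y_t
print(difference(y, lag=4))  # seasonal-style differencing, (1 - L^4) y_t
```

Note that each application of the operator shortens the series by `lag` observations.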

1.4 White Noise

A sequence \{\epsilon_t\} is called white noise if:

  1. Zero mean: E[\epsilon_t] = 0 for all t.
  2. Constant variance: \text{Var}(\epsilon_t) = \sigma^2 for all t.
  3. No autocorrelation: \text{Cov}(\epsilon_t, \epsilon_s) = 0 for all t \neq s.

White noise is the building block of all time series models. If additionally \epsilon_t \sim \mathcal{N}(0, \sigma^2), it is called Gaussian white noise.

1.5 The Normal Distribution

The normal distribution \mathcal{N}(\mu, \sigma^2) with probability density function:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

is assumed for residuals in many time series models. Departures from normality are assessed during model diagnostics.


2. What is Time Series Analysis?

A time series is a sequence of observations recorded at successive, equally spaced points in time. Time series analysis is the set of methods used to understand the structure of a time series, model its behaviour, and generate forecasts of future values.

2.1 What Makes Time Series Special?

Unlike cross-sectional data (where observations are assumed to be independent), observations in a time series are ordered in time and are typically correlated with past values. This temporal dependence is both a challenge and an opportunity: it violates the independence assumption underlying many standard statistical methods, yet it is precisely what makes forecasting from past values possible.

2.2 Real-World Applications

Time series analysis is one of the most widely applied quantitative tools, with uses spanning finance (asset prices, volatility), economics (GDP, inflation, unemployment), retail (demand forecasting), energy (load forecasting), meteorology, epidemiology, and industrial process monitoring.

2.3 Goals of Time Series Analysis

The primary goals of time series analysis are:

| Goal | Description |
|---|---|
| Description | Summarising the main features of the series (trend, seasonality, variability) |
| Explanation | Understanding relationships between the series and other variables |
| Decomposition | Separating the series into its underlying components |
| Modelling | Fitting a mathematical model that captures the series' structure |
| Forecasting | Predicting future values of the series with associated uncertainty |
| Anomaly Detection | Identifying unusual observations or structural breaks |
| Control | Monitoring a process and intervening when it deviates from target |

2.4 Types of Time Series Data

| Type | Description | Example |
|---|---|---|
| Univariate | A single variable observed over time | Monthly sales figures |
| Multivariate | Multiple variables observed over time | Daily temperature, humidity, and wind speed simultaneously |
| Continuous | Recorded continuously (or at very fine intervals) | ECG heart rate signal |
| Discrete | Recorded at distinct time points | Quarterly GDP |
| Equally spaced | Fixed time intervals between observations | Weekly stock closing prices |
| Irregularly spaced | Variable time intervals | Transaction data, event logs |

The DataStatPro application focuses primarily on univariate, equally spaced discrete time series, which covers the vast majority of applied use cases.


3. Components of a Time Series

Most time series can be understood as a combination of several underlying structural components. Identifying and separating these components is the first step in any time series analysis.

3.1 Trend (T_t)

The trend is the long-term, systematic increase or decrease in the level of the series over time. It represents the underlying direction of the data, abstracting away short-term fluctuations.

3.2 Seasonality (S_t)

Seasonality refers to regular, periodic fluctuations that repeat at a fixed and known frequency (the seasonal period m). Seasonality is caused by calendar or institutional factors.

| Frequency | Seasonal Period m | Example |
|---|---|---|
| Monthly data | m = 12 | Retail sales peak in December |
| Quarterly data | m = 4 | Energy consumption peaks in Q1/Q3 |
| Weekly data | m = 7 | Traffic higher on weekdays |
| Hourly data | m = 24 | Electricity demand peaks at 6–8pm |

💡 Seasonality is distinct from cyclical behaviour — seasonal patterns repeat at fixed intervals (e.g., every 12 months), whereas cycles have variable duration (typically 2–10 years) and are driven by broader economic forces.

3.3 Cyclical Fluctuations (C_t)

Cyclical fluctuations are wave-like patterns that occur over periods longer than one seasonal cycle, typically driven by economic or business cycles. Unlike seasonal patterns, cycles have no fixed period, vary in amplitude from one cycle to the next, and are far harder to forecast.

3.4 Irregular (Residual) Component (I_t)

The irregular component (also called the error, noise, or residual) is the portion of the series that remains after removing trend, seasonality, and cyclical components. It represents random noise, measurement error, and one-off events that the systematic components cannot explain.

In a well-fitted model, the irregular component should resemble white noise.

3.5 Summary of Components

Y_t = f(T_t,\ S_t,\ C_t,\ I_t)

Where the function f depends on the decomposition model (additive or multiplicative — see Section 4).


4. Time Series Decomposition

Decomposition is the process of separating a time series into its constituent components. It provides a clearer picture of the underlying structure and aids in modelling and forecasting.

4.1 Additive Decomposition

In the additive model, the components are assumed to add together:

Y_t = T_t + S_t + C_t + I_t

When to use: When the magnitude of seasonal fluctuations is constant over time, regardless of the level of the series. The amplitude of the seasonal swings does not change as the trend rises or falls.

4.2 Multiplicative Decomposition

In the multiplicative model, the components multiply together:

Y_t = T_t \times S_t \times C_t \times I_t

When to use: When the magnitude of seasonal fluctuations increases or decreases proportionally with the level of the series. As the trend rises, the seasonal swings get larger. Most economic and business time series exhibit multiplicative seasonality.

💡 A multiplicative model can be converted to an additive model by taking logarithms: \ln(Y_t) = \ln(T_t) + \ln(S_t) + \ln(C_t) + \ln(I_t). This is a common preprocessing step.

4.3 Moving Average Smoothing for Trend Estimation

A centred moving average (CMA) of order m is the most common method for estimating the trend-cycle component.

For odd m:

\hat{T}_t = \frac{1}{m} \sum_{j=-(m-1)/2}^{(m-1)/2} Y_{t+j}

For even m (requires a 2 \times m CMA to maintain centring):

\hat{T}_t = \frac{1}{m}\left(\frac{1}{2}Y_{t-m/2} + Y_{t-m/2+1} + \dots + Y_{t+m/2-1} + \frac{1}{2}Y_{t+m/2}\right)

For seasonal data with period m, the CMA of order m removes seasonality and smooths the irregular component, leaving the trend-cycle.
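The even-order 2 × m weighting can be transcribed directly from the formula above. A plain-Python illustration (`centred_ma_even` is a name chosen here, not a DataStatPro function):

```python
def centred_ma_even(y, m):
    """2 x m centred moving average: half weight on the two end terms."""
    half = m // 2
    out = []
    for t in range(half, len(y) - half):
        total = 0.5 * y[t - half] + sum(y[t - half + 1 : t + half]) + 0.5 * y[t + half]
        out.append(total / m)
    return out

# A purely linear series is reproduced exactly: symmetric weights preserve linear trends
print(centred_ma_even([1, 2, 3, 4, 5, 6, 7, 8], m=4))
```

Note that the output loses m/2 observations at each end of the series, which is one motivation for STL (Section 4.5).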

4.4 Classical Decomposition Procedure

Step 1: Estimate the Trend-Cycle (\hat{T}_t). Apply a centred moving average of order m (the seasonal period).

Step 2: Detrend the Series. Compute Y_t - \hat{T}_t (additive model) or Y_t / \hat{T}_t (multiplicative model).

Step 3: Estimate the Seasonal Component (\hat{S}_t). Average the detrended values for each season across all years. Normalise so that seasonal indices sum to zero (additive) or average to 1 (multiplicative).

Step 4: Calculate the Irregular Component (\hat{I}_t). Compute \hat{I}_t = Y_t - \hat{T}_t - \hat{S}_t (additive) or \hat{I}_t = Y_t / (\hat{T}_t \hat{S}_t) (multiplicative).
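The four steps can be chained into a small additive-decomposition sketch (even seasonal period assumed; the function and variable names are illustrative, not the DataStatPro API):

```python
def classical_additive_decompose(y, m):
    """Classical additive decomposition for even seasonal period m (a sketch)."""
    half = m // 2
    # Step 1: trend via a 2 x m centred moving average
    trend = {}
    for t in range(half, len(y) - half):
        total = 0.5 * y[t - half] + sum(y[t - half + 1 : t + half]) + 0.5 * y[t + half]
        trend[t] = total / m
    # Step 2: detrend
    detrended = {t: y[t] - trend[t] for t in trend}
    # Step 3: seasonal indices averaged per season, normalised to sum to zero
    raw = []
    for idx in range(m):
        vals = [v for t, v in detrended.items() if t % m == idx]
        raw.append(sum(vals) / len(vals))
    adjust = sum(raw) / m
    seasonal = [v - adjust for v in raw]
    # Step 4: irregular = detrended minus seasonal
    irregular = {t: detrended[t] - seasonal[t % m] for t in detrended}
    return trend, seasonal, irregular

# Quarterly toy series: linear trend 10 + t plus a fixed seasonal pattern
pattern = [2.0, -1.0, -2.0, 1.0]
y = [10 + t + pattern[t % 4] for t in range(24)]
trend, seasonal, irregular = classical_additive_decompose(y, m=4)
print(seasonal)   # recovers the pattern; irregular is ~0 for this noise-free series
```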

4.5 STL Decomposition (Seasonal and Trend Decomposition using Loess)

STL (Cleveland et al.) is a more robust and flexible decomposition method based on locally weighted regression (Loess). Advantages over classical decomposition: it handles any seasonal period, allows the seasonal component to evolve over time, offers a robust mode that resists outliers, and estimates the trend for all observations (the classical CMA loses values at both ends).

The STL decomposition is controlled by two primary smoothing parameters: the trend window (how smooth the trend-cycle is) and the seasonal window (how quickly the seasonal component is allowed to change).

4.6 Seasonal Adjustment

A seasonally adjusted series is obtained by removing the estimated seasonal component: Y_t^{SA} = Y_t - \hat{S}_t (additive) or Y_t^{SA} = Y_t / \hat{S}_t (multiplicative).

Seasonally adjusted series are widely used in economic reporting (e.g., seasonally adjusted unemployment rate) to reveal the underlying trend more clearly.


5. Stationarity

Stationarity is the single most important concept in classical time series modelling. Nearly all standard time series models (ARMA, ARIMA) assume some form of stationarity.

5.1 Strict Stationarity

A process \{Y_t\} is strictly stationary if the joint distribution of (Y_{t_1}, Y_{t_2}, \dots, Y_{t_k}) is identical to the joint distribution of (Y_{t_1+h}, Y_{t_2+h}, \dots, Y_{t_k+h}) for all k, all time points t_1, \dots, t_k, and all shifts h. This is a very strong condition.

5.2 Weak (Covariance) Stationarity

Weak stationarity (also called second-order stationarity) requires only:

  1. Constant mean: E[Y_t] = \mu for all t.
  2. Constant variance: \text{Var}(Y_t) = \sigma^2 < \infty for all t.
  3. Autocovariance depends only on lag: \text{Cov}(Y_t, Y_{t+h}) = \gamma(h) for all t, where \gamma(h) is a function of the lag h only, not of t.

In practice, "stationarity" almost always refers to weak stationarity. A non-stationary series has a time-varying mean (trend), time-varying variance, or both.

5.3 Why Stationarity Matters

ARMA models are only valid for stationary series. Applying them to a non-stationary series leads to spurious regression results, invalid standard errors and test statistics, and forecasts that fail to track the changing level of the series.

5.4 Types of Non-Stationarity

| Type | Description | Solution |
|---|---|---|
| Trend stationarity | Series has a deterministic trend (mean increases linearly) | Detrend by regression on time |
| Difference stationarity (unit root) | Series has a stochastic trend (random walk) | First-difference: \Delta Y_t = Y_t - Y_{t-1} |
| Seasonal non-stationarity | Seasonal pattern is non-stationary | Seasonal differencing: \Delta_m Y_t = Y_t - Y_{t-m} |
| Heteroscedasticity | Variance changes over time | Log or Box-Cox transformation |

5.5 Differencing

First differencing (d = 1) removes a stochastic linear trend (unit root):

\Delta Y_t = Y_t - Y_{t-1} = (1 - L) Y_t

Second differencing (d = 2) removes a stochastic quadratic trend:

\Delta^2 Y_t = \Delta(\Delta Y_t) = Y_t - 2Y_{t-1} + Y_{t-2} = (1-L)^2 Y_t

Seasonal differencing of order m removes seasonal non-stationarity:

\Delta_m Y_t = Y_t - Y_{t-m} = (1 - L^m) Y_t

⚠️ Over-differencing introduces unnecessary MA structure and inflates variance. Always apply the minimum number of differences needed to achieve stationarity.

5.6 The Box-Cox Transformation

When the series exhibits heteroscedasticity (variance increases with the level), the Box-Cox transformation stabilises the variance:

Y_t^{(\lambda)} = \begin{cases} \frac{Y_t^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(Y_t) & \text{if } \lambda = 0 \end{cases}

Common choices: \lambda = 1 (no transformation), \lambda = 0.5 (square root), \lambda = 0 (natural logarithm), and \lambda = -1 (reciprocal).

The optimal \lambda can be estimated by maximising the log-likelihood. The Guerrero method provides a fast, robust estimator.
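The transformation itself is a direct transcription of the piecewise formula (pure Python; `box_cox` is named here for illustration):

```python
import math

def box_cox(y, lam):
    """Box-Cox transform; reduces to the natural log when lambda = 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

print(box_cox([1.0, math.e], lam=0))   # log transform
print(box_cox([4.0, 9.0], lam=0.5))    # square-root-based transform
```

The transform requires strictly positive data; a constant shift is sometimes added first to satisfy this.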

5.7 Formal Tests for Stationarity

5.7.1 Augmented Dickey-Fuller (ADF) Test

The ADF test tests for the presence of a unit root (stochastic trend). The null hypothesis H_0 is that the series has a unit root (is non-stationary); the alternative is stationarity. The test regression is:

\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{j=1}^p \delta_j \Delta Y_{t-j} + \epsilon_t

The test statistic is \tau = \hat{\gamma}/SE(\hat{\gamma}), compared to critical values from the Dickey-Fuller distribution (not the standard t-distribution). A small (very negative) \tau and small p-value lead to rejection of H_0 in favour of stationarity.

The lag order p is chosen to remove autocorrelation from the residuals (e.g., using AIC/BIC).
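To make the mechanics concrete, here is a stripped-down Dickey-Fuller statistic computed from first principles (the drift variant with no augmentation lags, i.e. the plain DF rather than the full ADF; in practice you would use a statistics library). The 5% critical value of about -2.86 is the tabulated Dickey-Fuller value for this variant:

```python
import math
import random

def df_tau(y):
    """Dickey-Fuller tau (drift variant): OLS of dY_t on [1, Y_{t-1}]."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    x = y[:-1]
    n = len(dy)
    mx, md = sum(x) / n, sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (di - md) for xi, di in zip(x, dy))
    gamma = sxy / sxx                       # coefficient on Y_{t-1}
    alpha = md - gamma * mx
    resid = [di - alpha - gamma * xi for xi, di in zip(x, dy)]
    s2 = sum(e * e for e in resid) / (n - 2)
    return gamma / math.sqrt(s2 / sxx)      # tau = gamma_hat / SE(gamma_hat)

rng = random.Random(42)
eps = [rng.gauss(0, 1) for _ in range(300)]
walk = [0.0]
for e in eps:
    walk.append(walk[-1] + e)  # random walk: unit root, tau near zero
# White noise is stationary: tau far below the ~-2.86 critical value
print(df_tau(walk), df_tau(eps))
```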

Three variants exist based on the deterministic terms included:

| Variant | Equation | Use Case |
|---|---|---|
| No constant, no trend | \Delta Y_t = \gamma Y_{t-1} + \sum \delta_j \Delta Y_{t-j} + \epsilon_t | Series fluctuates around zero |
| With constant (drift) | Add \alpha | Series has a non-zero mean |
| With constant and trend | Add \alpha + \beta t | Series has both a mean and a linear trend |

5.7.2 KPSS Test (Kwiatkowski-Phillips-Schmidt-Shin)

The KPSS test has the opposite null hypothesis from the ADF test: H_0 is that the series is (trend-)stationary, and the alternative is that it has a unit root.

The test statistic is:

\text{KPSS} = \frac{1}{T^2 \hat{\sigma}^2} \sum_{t=1}^T S_t^2

Where S_t = \sum_{i=1}^t \hat{e}_i is the partial sum of OLS residuals from regressing Y_t on a constant (or constant + trend), and \hat{\sigma}^2 is a long-run variance estimator.

Large KPSS statistic → reject H_0 → series is non-stationary.

💡 Using both ADF and KPSS together is recommended. If ADF fails to reject (suggesting a unit root) and KPSS rejects (also suggesting non-stationarity), difference the series. If ADF rejects and KPSS fails to reject, treat the series as stationary. Any other combination is a contradiction that calls for more careful examination.

5.7.3 Phillips-Perron (PP) Test

The Phillips-Perron test is a non-parametric modification of the Dickey-Fuller test. Instead of adding lagged difference terms to control for serial correlation (as ADF does), it uses a non-parametric correction to the test statistic. It is more robust to heteroscedasticity and serial correlation in the errors.

Decision rule and interpretation are the same as for the ADF test.

5.8 Determining the Order of Differencing

| Evidence | Action |
|---|---|
| ADF: fail to reject H_0; KPSS: reject H_0 | Apply first difference (d = 1) |
| After first difference, ADF: reject H_0; KPSS: fail to reject | Series is I(1); use d = 1 in ARIMA |
| After first difference, still non-stationary | Apply second difference (d = 2); rarely needed |
| ACF decays very slowly | Strong evidence of non-stationarity; difference required |
| ACF cuts off quickly | Series may already be stationary |

6. Autocorrelation and Partial Autocorrelation

The autocorrelation function (ACF) and partial autocorrelation function (PACF) are the primary tools for identifying the structure of a time series and selecting appropriate model orders.

6.1 Autocovariance Function

For a stationary process, the autocovariance at lag h is:

\gamma(h) = \text{Cov}(Y_t, Y_{t+h}) = E\left[(Y_t - \mu)(Y_{t+h} - \mu)\right]

Note that \gamma(0) = \text{Var}(Y_t) = \sigma^2.

6.2 Autocorrelation Function (ACF)

The autocorrelation at lag h is the autocovariance normalised by the variance:

\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{\text{Cov}(Y_t, Y_{t+h})}{\text{Var}(Y_t)}

Properties: \rho(0) = 1, \rho(h) = \rho(-h) (symmetry), and |\rho(h)| \leq 1 for all h.

Sample ACF: Estimated from data as:

\hat{\rho}(h) = \frac{\sum_{t=1}^{T-h}(Y_t - \bar{Y})(Y_{t+h} - \bar{Y})}{\sum_{t=1}^T (Y_t - \bar{Y})^2}

Bartlett's approximate 95% confidence bounds for testing whether \hat{\rho}(h) = 0:

\pm \frac{1.96}{\sqrt{T}}

Autocorrelations that fall outside these bounds are considered statistically significant.
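The sample ACF estimator is a few lines of plain Python (illustrative, not the DataStatPro implementation):

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(h) for h = 1..max_lag."""
    n = len(y)
    ybar = sum(y) / n
    denom = sum((v - ybar) ** 2 for v in y)
    return [
        sum((y[t] - ybar) * (y[t + h] - ybar) for t in range(n - h)) / denom
        for h in range(1, max_lag + 1)
    ]

y = [1.0, -1.0] * 4          # strongly alternating series
print(sample_acf(y, 2))      # large negative lag-1, positive lag-2
```

Each value would then be compared against the \pm 1.96/\sqrt{T} bounds to judge significance.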

6.3 Partial Autocorrelation Function (PACF)

The partial autocorrelation at lag h, denoted \phi_{hh}, measures the correlation between Y_t and Y_{t+h} after removing the linear influence of the intervening lags Y_{t+1}, Y_{t+2}, \dots, Y_{t+h-1}.

It can be computed using the Yule-Walker equations:

\begin{pmatrix} \rho(1) \\ \rho(2) \\ \vdots \\ \rho(h) \end{pmatrix} = \begin{pmatrix} 1 & \rho(1) & \cdots & \rho(h-1) \\ \rho(1) & 1 & \cdots & \rho(h-2) \\ \vdots & \vdots & \ddots & \vdots \\ \rho(h-1) & \rho(h-2) & \cdots & 1 \end{pmatrix} \begin{pmatrix} \phi_{h1} \\ \phi_{h2} \\ \vdots \\ \phi_{hh} \end{pmatrix}

The PACF at lag h is the last element \phi_{hh} of the solution vector.

95% confidence bounds for PACF:

\pm \frac{1.96}{\sqrt{T}}
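Rather than inverting the Yule-Walker matrix at every lag, the Durbin-Levinson recursion solves the system one lag at a time. A sketch (`pacf_from_acf` takes a list `rho` with `rho[0] = 1`; names are illustrative):

```python
def pacf_from_acf(rho, max_lag):
    """Durbin-Levinson recursion: PACF phi_hh from autocorrelations rho."""
    phi_prev = []
    pacf = []
    for h in range(1, max_lag + 1):
        if h == 1:
            phi_hh = rho[1]
            phi_curr = [phi_hh]
        else:
            num = rho[h] - sum(phi_prev[j] * rho[h - 1 - j] for j in range(h - 1))
            den = 1 - sum(phi_prev[j] * rho[j + 1] for j in range(h - 1))
            phi_hh = num / den
            # Update the intermediate coefficients phi_{h,j}
            phi_curr = [phi_prev[j] - phi_hh * phi_prev[h - 2 - j]
                        for j in range(h - 1)] + [phi_hh]
        pacf.append(phi_hh)
        phi_prev = phi_curr
    return pacf

# ACF of an AR(1) with phi = 0.5 is 0.5^h; its PACF cuts off after lag 1
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125], 3))
```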

6.4 ACF and PACF as Model Identification Tools

The patterns of ACF and PACF are the fingerprints of different time series models:

| Model | ACF Pattern | PACF Pattern |
|---|---|---|
| White noise | No significant spikes at any lag | No significant spikes at any lag |
| AR(p) | Decays gradually (exponential or sinusoidal) | Cuts off sharply after lag p |
| MA(q) | Cuts off sharply after lag q | Decays gradually (exponential or sinusoidal) |
| ARMA(p,q) | Decays gradually after lag q - p | Decays gradually after lag p - q |
| Non-stationary (unit root) | Decays very slowly (near-unit persistence) | Large spike at lag 1, near zero thereafter |
| Seasonal AR | Significant spikes at multiples of m (decaying) | Spike cuts off at lag m |
| Seasonal MA | Spike at lag m only | Decaying spikes at multiples of m |

💡 In practice, ACF/PACF identification is an art as much as a science. Real data rarely produce textbook-perfect patterns. Use ACF/PACF alongside information criteria (AIC/BIC) for model selection.

6.5 The Ljung-Box Test for Autocorrelation

The Ljung-Box test (a refinement of the earlier Box-Pierce test) tests whether a group of autocorrelations are jointly zero:

Q_{LB}(m) = T(T+2) \sum_{h=1}^m \frac{\hat{\rho}^2(h)}{T-h}

Under H_0 (no autocorrelation up to lag m), Q_{LB}(m) \sim \chi^2_m.

This test is used primarily for residual diagnostics after fitting a model (see Section 10).
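The statistic is mechanical to compute once the sample autocorrelations are available (a sketch; the resulting Q would be compared to a chi-square quantile):

```python
def ljung_box_q(y, m):
    """Ljung-Box Q over lags 1..m; compare to a chi-square with m df."""
    T = len(y)
    ybar = sum(y) / T
    denom = sum((v - ybar) ** 2 for v in y)
    q = 0.0
    for h in range(1, m + 1):
        rho = sum((y[t] - ybar) * (y[t + h] - ybar) for t in range(T - h)) / denom
        q += rho * rho / (T - h)
    return T * (T + 2) * q

# An alternating series is heavily autocorrelated, so Q lands far in the tail
print(ljung_box_q([1.0, -1.0] * 25, m=3))
```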


7. Classical Time Series Models

7.1 Autoregressive (AR) Models

An autoregressive model of order p, denoted AR(p), models the current value as a linear combination of its p most recent past values plus white noise:

Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \epsilon_t

Where c is a constant (intercept), \phi_1, \dots, \phi_p are the autoregressive coefficients, and \epsilon_t is white noise.

Using the lag operator:

(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p) Y_t = c + \epsilon_t

\Phi(L) Y_t = c + \epsilon_t

Where \Phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p is the AR characteristic polynomial.

7.1.1 Stationarity Condition for AR(p)

An AR(p) process is stationary if and only if all roots of the characteristic polynomial lie outside the unit circle:

\Phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = 0 \implies |z| > 1 \text{ for all roots}

For AR(1): The stationarity condition is simply |\phi_1| < 1.

For AR(2): The stationarity conditions are: |\phi_2| < 1, \quad \phi_1 + \phi_2 < 1, \quad \phi_2 - \phi_1 < 1

7.1.2 Properties of the AR(1) Process

The simplest AR model, AR(1):

Y_t = c + \phi_1 Y_{t-1} + \epsilon_t

For |\phi_1| < 1, the process is stationary with mean \mu = c/(1 - \phi_1), variance \gamma(0) = \sigma^2/(1 - \phi_1^2), and autocorrelation function \rho(h) = \phi_1^h, which decays geometrically.
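The stationary AR(1) has closed-form moments, so a quick numeric check is possible. A sketch transcribing the standard formulas (mean c/(1 - \phi_1), variance \sigma^2/(1 - \phi_1^2), ACF \phi_1^h):

```python
def ar1_properties(c, phi, sigma2):
    """Theoretical mean, variance, and ACF of a stationary AR(1)."""
    assert abs(phi) < 1, "stationarity requires |phi| < 1"
    mean = c / (1 - phi)                 # long-run mean
    variance = sigma2 / (1 - phi ** 2)   # unconditional variance
    acf = lambda h: phi ** h             # geometric decay
    return mean, variance, acf

mean, variance, acf = ar1_properties(c=2.0, phi=0.5, sigma2=1.0)
print(mean, variance, acf(3))
```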

7.2 Moving Average (MA) Models

A moving average model of order q, denoted MA(q), models the current value as a linear combination of the current and q most recent white noise terms:

Y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}

Where \mu is the mean of the process, \theta_1, \dots, \theta_q are the moving average coefficients, and \epsilon_t is white noise.

Using the lag operator:

Y_t = \mu + (1 + \theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q) \epsilon_t = \mu + \Theta(L) \epsilon_t

7.2.1 Properties of the MA(q) Process

An MA(q) process is always stationary (it is a finite linear combination of white noise terms), has mean \mu, and has an ACF that cuts off exactly after lag q.

7.2.2 Invertibility Condition for MA(q)

An MA(q) process is invertible if all roots of the MA characteristic polynomial \Theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q = 0 lie outside the unit circle (|z| > 1). Invertibility ensures a unique MA representation and is required for estimation and forecasting.

For MA(1): Invertibility requires |\theta_1| < 1.

7.3 ARMA Models

The Autoregressive Moving Average model of orders p and q, denoted ARMA(p,q), combines both AR and MA components:

Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}

Or compactly using the lag operator:

\Phi(L) Y_t = c + \Theta(L) \epsilon_t

Stationarity requires the AR polynomial roots to lie outside the unit circle. Invertibility requires the MA polynomial roots to lie outside the unit circle. Both must hold for a well-behaved ARMA model.

Properties: both the ACF and the PACF tail off gradually (neither cuts off sharply), and an ARMA model can often represent a series with far fewer parameters than a pure AR or MA model of high order.

7.4 ARIMA Models

The Autoregressive Integrated Moving Average model, denoted ARIMA(p,d,q), extends ARMA to handle non-stationary series by incorporating d rounds of differencing:

\Phi(L)(1-L)^d Y_t = c + \Theta(L) \epsilon_t

Where p is the AR order, d is the number of differences, and q is the MA order.

The differenced series W_t = (1-L)^d Y_t follows an ARMA(p,q) model.

Special Cases:

| Model | Parameters | Description |
|---|---|---|
| ARIMA(0,1,0) | d = 1 | Random walk |
| ARIMA(1,0,0) | p = 1, d = 0 | AR(1) |
| ARIMA(0,0,1) | q = 1, d = 0 | MA(1) |
| ARIMA(0,1,1) | d = 1, q = 1 | Simple exponential smoothing |
| ARIMA(1,1,0) | p = 1, d = 1 | Differenced AR(1) |
| ARIMA(0,2,2) | d = 2, q = 2 | Equivalent to Holt's linear method |

7.5 SARIMA Models

The Seasonal ARIMA model, denoted SARIMA(p,d,q)(P,D,Q)_m, extends ARIMA by adding seasonal AR, differencing, and MA components with seasonal period m:

\Phi(L)\Phi_S(L^m)(1-L)^d(1-L^m)^D Y_t = c + \Theta(L)\Theta_S(L^m)\epsilon_t

Where \Phi_S(L^m) is the seasonal AR polynomial of order P, D is the number of seasonal differences, and \Theta_S(L^m) is the seasonal MA polynomial of order Q.

Common SARIMA Configurations:

| Model | Use Case |
|---|---|
| SARIMA(0,1,1)(0,1,1)_{12} | Airline model (Box-Jenkins); monthly data with seasonal differencing |
| SARIMA(1,1,1)(1,1,1)_{12} | General monthly seasonal model |
| SARIMA(0,1,1)(0,1,1)_4 | Quarterly seasonal model |
| SARIMA(2,1,0)(1,1,0)_{12} | Monthly data with AR structure |

7.6 SARIMAX Models

The SARIMAX model extends SARIMA by including exogenous (external) predictor variables X_t:

\Phi(L)\Phi_S(L^m)(1-L)^d(1-L^m)^D Y_t = c + \beta_1 X_{1t} + \dots + \beta_k X_{kt} + \Theta(L)\Theta_S(L^m)\epsilon_t

Where X_{jt} are exogenous variables (e.g., advertising spend, temperature, day-of-week indicators) and \beta_j are their regression coefficients.

💡 SARIMAX is equivalent to a dynamic regression model with ARIMA errors. The exogenous variables account for the deterministic part of the series structure, while the ARIMA component models the stochastic error structure.

7.7 The Random Walk and Random Walk with Drift

Random Walk: ARIMA(0,1,0) with c = 0:

Y_t = Y_{t-1} + \epsilon_t \implies Y_t = Y_0 + \sum_{i=1}^t \epsilon_i

Random Walk with Drift: ARIMA(0,1,0) with c \neq 0:

Y_t = c + Y_{t-1} + \epsilon_t


8. Exponential Smoothing Models

Exponential smoothing methods are a family of forecasting procedures that generate weighted averages of past observations, with weights that decay exponentially as observations recede into the past. They are intuitive, robust, and widely used in practice.

8.1 Simple Exponential Smoothing (SES)

SES (also called single exponential smoothing) is appropriate for series with no trend and no seasonality (or trend and seasonality that have been removed).

Smoothed Level:

\hat{Y}_{t+1|t} = \ell_t = \alpha Y_t + (1 - \alpha) \ell_{t-1}

Where \alpha \in (0, 1] is the smoothing parameter and \ell_t is the smoothed level at time t. Larger \alpha gives more weight to recent observations.

Equivalently, as a weighted average of all past observations:

\ell_t = \alpha \sum_{j=0}^{t-1}(1-\alpha)^j Y_{t-j} + (1-\alpha)^t \ell_0

Forecasts: \hat{Y}_{t+h|t} = \ell_t for all forecast horizons h \geq 1 (flat forecast line).

Error Correction Form (ETS interpretation):

\ell_t = \ell_{t-1} + \alpha \epsilon_t

Where \epsilon_t = Y_t - \ell_{t-1} is the one-step-ahead forecast error. SES updates the level proportionally to the most recent error.

Equivalence: SES is equivalent to an ARIMA(0,1,1) model with \theta_1 = \alpha - 1.
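The SES recursion is only a few lines. This sketch returns the one-step-ahead forecasts alongside the final level (illustrative names, not the DataStatPro API):

```python
def ses(y, alpha, level0):
    """Simple exponential smoothing: one-step-ahead forecasts and final level."""
    level = level0
    forecasts = []
    for obs in y:
        forecasts.append(level)                    # forecast made before seeing obs
        level = alpha * obs + (1 - alpha) * level  # level update
    return forecasts, level

forecasts, final_level = ses([10.0, 12.0, 11.0], alpha=0.5, level0=10.0)
print(forecasts, final_level)  # every h-step forecast equals final_level (flat line)
```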

8.2 Holt's Linear Exponential Smoothing (Double Exponential Smoothing)

Holt's method extends SES to handle series with a linear trend (but no seasonality), by separately smoothing both the level and the trend.

Level equation:

\ell_t = \alpha Y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})

Trend (slope) equation:

b_t = \beta^*(\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1}

Where \alpha is the level smoothing parameter and \beta^* \in (0, 1) is the trend smoothing parameter.

Forecasts:

\hat{Y}_{t+h|t} = \ell_t + h \cdot b_t

The h-step-ahead forecast is a linear extrapolation of the trend.

Equivalence: Holt's method is equivalent to ARIMA(0,2,2).

8.2.1 Damped Trend Method

The damped trend modification (Gardner & McKenzie) introduces a damping parameter \phi \in (0, 1) that dampens the trend toward zero for longer forecast horizons, avoiding the unrealistic assumption of a constant growth rate into the indefinite future:

Level: \ell_t = \alpha Y_t + (1-\alpha)(\ell_{t-1} + \phi b_{t-1})

Trend: b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*)\phi b_{t-1}

Forecasts: \hat{Y}_{t+h|t} = \ell_t + \left(\sum_{j=1}^h \phi^j\right) b_t = \ell_t + \frac{\phi(1-\phi^h)}{1-\phi} b_t

As h \to \infty, the forecast converges to the constant \ell_t + \phi b_t / (1 - \phi).

💡 The damped trend method is among the most robust and widely recommended methods for general-purpose forecasting. It is less likely to over-extrapolate trends than undamped Holt's method.
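The closed form for the damped forecast can be checked against the explicit geometric sum (a small sketch with illustrative values):

```python
def damped_forecast(level, trend, phi, h):
    """h-step damped-trend forecast: level + phi(1 - phi^h)/(1 - phi) * trend."""
    return level + phi * (1 - phi ** h) / (1 - phi) * trend

level, trend, phi = 100.0, 2.0, 0.8
explicit = level + sum(phi ** j for j in range(1, 4)) * trend
print(damped_forecast(level, trend, phi, h=3), explicit)  # the two agree
# For large h the forecast flattens out near level + phi/(1 - phi) * trend
print(damped_forecast(level, trend, phi, h=500))
```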

8.3 Holt-Winters Exponential Smoothing (Triple Exponential Smoothing)

Holt-Winters extends Holt's method to handle series with both trend and seasonality.

8.3.1 Additive Holt-Winters

Appropriate when the seasonal variation is roughly constant in magnitude.

Level: \ell_t = \alpha(Y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})

Trend: b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*) b_{t-1}

Seasonal: s_t = \gamma(Y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m}

Forecasts (h steps ahead): \hat{Y}_{t+h|t} = \ell_t + h \cdot b_t + s_{t+h-m(k+1)}

Where k = \lfloor (h-1)/m \rfloor ensures we pick the correct seasonal index.

8.3.2 Multiplicative Holt-Winters

Appropriate when the seasonal variation grows proportionally with the level of the series.

Level: \ell_t = \alpha \frac{Y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1} + b_{t-1})

Trend: b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*) b_{t-1}

Seasonal: s_t = \gamma \frac{Y_t}{\ell_{t-1} + b_{t-1}} + (1-\gamma) s_{t-m}

Forecasts: \hat{Y}_{t+h|t} = (\ell_t + h \cdot b_t) \times s_{t+h-m(k+1)}

Where \gamma is the seasonal smoothing parameter and k = \lfloor (h-1)/m \rfloor selects the correct seasonal index, as in the additive case.
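Both Holt-Winters variants are short recursions. An additive sketch (initial states are assumed given here; in practice they are estimated, and these names are illustrative, not the DataStatPro API):

```python
def holt_winters_additive(y, m, alpha, beta, gamma, level0, trend0, seas0):
    """One pass of additive Holt-Winters; returns the final states."""
    level, trend, seas = level0, trend0, list(seas0)
    for t, obs in enumerate(y):
        s_old = seas[t % m]
        new_level = alpha * (obs - s_old) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        # Seasonal update uses the states from before this step
        seas[t % m] = gamma * (obs - level - trend) + (1 - gamma) * s_old
        level, trend = new_level, new_trend
    return level, trend, seas

pattern = [2.0, -1.0, -2.0, 1.0]
y = [10.0 + t + pattern[t % 4] for t in range(12)]
level, trend, seas = holt_winters_additive(y, 4, 0.5, 0.3, 0.2,
                                           level0=9.0, trend0=1.0, seas0=pattern)

def forecast(h, T=len(y)):
    """h-step forecast: linear trend plus the matching seasonal term."""
    return level + h * trend + seas[(T - 1 + h) % 4]

print(forecast(1), forecast(2))
```

For this noise-free toy series the recursion tracks the true level and trend exactly, so the forecasts reproduce the future values.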

8.4 The ETS Framework

The ETS (Error, Trend, Seasonal) framework (Hyndman et al.) provides a unified taxonomy for all exponential smoothing methods based on the nature of the error (additive or multiplicative), the trend (none, additive, or additive damped), and the seasonality (none, additive, or multiplicative).

ETS Model Taxonomy (selected):

| ETS Model | Equivalent Method | Error | Trend | Seasonal |
|---|---|---|---|---|
| ETS(A,N,N) | Simple Exponential Smoothing | A | N | N |
| ETS(A,A,N) | Holt's Linear | A | A | N |
| ETS(A,A_d,N) | Damped Holt's | A | A_d | N |
| ETS(A,A,A) | Additive Holt-Winters | A | A | A |
| ETS(M,A,M) | Multiplicative Holt-Winters | M | A | M |
| ETS(M,A_d,M) | Damped Multiplicative HW | M | A_d | M |

The ETS framework provides a state space representation with a likelihood function, enabling proper maximum likelihood estimation of the smoothing parameters, model selection via information criteria (AIC/AICc/BIC), and prediction intervals with a sound statistical basis.

8.5 Initialisation of Smoothing Parameters

Initial values for \ell_0, b_0, and s_{1-m}, \dots, s_0 are required to start the recursions. Common methods include simple heuristics (e.g., averaging the first few observations, or decomposing the first seasons), backcasting, and treating the initial states as additional parameters optimised jointly with the smoothing parameters.

The DataStatPro application uses optimised initialisation by default.


9. Model Identification, Estimation, and Selection

9.1 The Box-Jenkins Methodology

The Box-Jenkins methodology is the classical framework for ARIMA model building. It proceeds through three iterative stages:

Stage 1 — Identification. Transform and difference the series until it is stationary, then examine the ACF and PACF of the stationary series to propose candidate orders p and q.

Stage 2 — Estimation. Estimate the model parameters (\phi, \theta, \sigma^2), typically by maximum likelihood.

Stage 3 — Diagnostic Checking. Examine the residuals; if they are not white noise, return to Stage 1 and revise the model.

9.2 Maximum Likelihood Estimation (MLE) for ARIMA

For ARIMA models, parameters are estimated by maximising the log-likelihood of the observed data. Assuming Gaussian innovations, the exact log-likelihood is:

\ell(\phi, \theta, \sigma^2) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^T \ln f_t - \frac{1}{2}\sum_{t=1}^T \frac{\epsilon_t^2}{f_t}

Where f_t is the conditional variance of Y_t given all past values (computed via the Kalman filter for exact likelihood, or recursively for conditional likelihood). MLE is solved numerically using iterative algorithms (e.g., L-BFGS-B, Newton-Raphson).

9.3 Information Criteria for Model Selection

Information criteria penalise the log-likelihood for model complexity to avoid overfitting.

Akaike Information Criterion (AIC):

\text{AIC} = -2\ell(\hat{\boldsymbol{\theta}}) + 2k

Corrected AIC (AICc) — recommended for small samples:

\text{AICc} = \text{AIC} + \frac{2k(k+1)}{T - k - 1}

Bayesian Information Criterion (BIC):

\text{BIC} = -2\ell(\hat{\boldsymbol{\theta}}) + k \ln(T)

Where \ell(\hat{\boldsymbol{\theta}}) is the maximised log-likelihood, k is the number of estimated parameters, and T is the sample size.

Rules: lower values are better; compare criteria only across models fitted to the same (identically differenced) data; prefer AICc when T is small relative to k; BIC penalises complexity more heavily and tends to select smaller models.
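Given a maximised log-likelihood, the three criteria are one-liners (a direct transcription of the formulas, with illustrative input values):

```python
import math

def info_criteria(loglik, k, T):
    """AIC, AICc, and BIC from a maximised log-likelihood."""
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (T - k - 1)
    bic = -2 * loglik + k * math.log(T)
    return aic, aicc, bic

aic, aicc, bic = info_criteria(loglik=-100.0, k=3, T=50)
print(aic, aicc, bic)  # AICc exceeds AIC; BIC penalises k more once ln(T) > 2
```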

9.4 Automatic ARIMA Selection (auto.ARIMA)

Searching all possible combinations of (p, d, q)(P, D, Q)_m is computationally expensive. The Hyndman-Khandakar algorithm (implemented as auto.arima in R and replicated in DataStatPro) automates ARIMA selection:

  1. Determine d and D using unit root tests (KPSS for d; Canova-Hansen or KPSS on the seasonally differenced series for D).
  2. Start with a default model (e.g., ARIMA(2,d,2)(1,D,1)_m).
  3. Evaluate neighbouring models (varying p, q, P, Q by ±1).
  4. Select the model with the lowest AICc.
  5. Repeat until no neighbouring model improves AICc.

⚠️ Automatic selection is a useful starting point but should not replace domain knowledge and manual inspection of ACF/PACF plots, residual diagnostics, and out-of-sample forecast evaluation.

9.5 Forecast Accuracy Metrics

To compare competing models, accuracy metrics are computed on a hold-out (test) set of the last h observations not used in model fitting. Let e_t = Y_t - \hat{Y}_t denote the forecast error.

Mean Error (ME): ME = \frac{1}{h} \sum_{t=1}^h e_t

Mean Absolute Error (MAE): MAE = \frac{1}{h} \sum_{t=1}^h |e_t|

Root Mean Squared Error (RMSE): RMSE = \sqrt{\frac{1}{h} \sum_{t=1}^h e_t^2}

Mean Absolute Percentage Error (MAPE): MAPE = \frac{100}{h} \sum_{t=1}^h \left|\frac{e_t}{Y_t}\right|

Mean Absolute Scaled Error (MASE) — scale-free, and defined even when the series contains zeros: MASE = \frac{MAE}{\frac{1}{T-m}\sum_{t=m+1}^T |Y_t - Y_{t-m}|}

Where the denominator is the in-sample MAE of the seasonal naïve forecast. MASE < 1 means the model outperforms the seasonal naïve benchmark.
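All five metrics in one sketch (plain Python; `accuracy` and its argument names are illustrative):

```python
def accuracy(actual, forecast, train, m=1):
    """ME, MAE, RMSE, MAPE, MASE on a hold-out set."""
    e = [a - f for a, f in zip(actual, forecast)]
    h = len(e)
    me = sum(e) / h
    mae = sum(abs(v) for v in e) / h
    rmse = (sum(v * v for v in e) / h) ** 0.5
    mape = 100 / h * sum(abs(v / a) for v, a in zip(e, actual))
    # MASE denominator: in-sample MAE of the seasonal naive forecast
    naive_mae = (sum(abs(train[t] - train[t - m]) for t in range(m, len(train)))
                 / (len(train) - m))
    return {"ME": me, "MAE": mae, "RMSE": rmse, "MAPE": mape, "MASE": mae / naive_mae}

metrics = accuracy(actual=[10.0, 20.0], forecast=[12.0, 18.0], train=[1.0, 2.0, 3.0, 4.0])
print(metrics)
```

Note how ME is zero here even though every forecast is off by 2: positive and negative errors cancel, which is why ME measures bias rather than accuracy.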

| Metric | Scale | Sensitive to Outliers | Notes |
|---|---|---|---|
| MAE | Same as data | No | Easy to interpret |
| RMSE | Same as data | Yes | Penalises large errors more heavily |
| MAPE | Percentage | No | Undefined when Y_t = 0; biased for asymmetric series |
| MASE | Scale-free | No | Preferred for cross-series comparison |

10. Model Diagnostics

After fitting a time series model, residual analysis is essential to verify that the model has adequately captured the structure of the data.

10.1 Residual Definition

For a fitted model, the residuals (one-step-ahead forecast errors) are:

\hat{\epsilon}_t = Y_t - \hat{Y}_{t|t-1}

A well-specified model should produce residuals that are approximately white noise: uncorrelated, zero-mean, and homoscedastic.

10.2 Diagnostic Checks

10.2.1 Time Plot of Residuals

Plot \hat{\epsilon}_t against time. Look for:

- any remaining trend or drift in the mean (the residuals should fluctuate around zero);
- changes in variance over time (possible heteroscedasticity);
- isolated outliers or clusters of large residuals.

10.2.2 ACF of Residuals

Plot the ACF of the residuals. For a well-fitted model:

- no autocorrelation should be significant at any lag;
- roughly 95% of the spikes should fall within the \pm 1.96/\sqrt{T} bounds.

10.2.3 Ljung-Box Test on Residuals

Test the joint significance of autocorrelations up to lag m:

Q_{LB}(m) = T(T+2) \sum_{h=1}^m \frac{\hat{\rho}^2(h)}{T-h} \sim \chi^2_{m-p-q}

For residuals from an ARMA(p,q) fit, the degrees of freedom are adjusted to m - p - q.
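The statistic is straightforward to compute from the sample ACF. A minimal numpy sketch (function name is an assumption; p-values would come from the \chi^2 distribution):

```python
import numpy as np

def ljung_box_q(resid, m, p=0, q=0):
    """Ljung-Box statistic on model residuals, per the formula above.
    Returns the Q statistic and its degrees of freedom (m - p - q)."""
    resid = np.asarray(resid, dtype=float)
    T = resid.size
    z = resid - resid.mean()
    gamma0 = np.sum(z * z) / T
    Q = 0.0
    for h in range(1, m + 1):
        rho_h = np.sum(z[:-h] * z[h:]) / T / gamma0   # sample ACF at lag h
        Q += rho_h ** 2 / (T - h)
    return T * (T + 2) * Q, m - p - q

rng = np.random.default_rng(42)
Q, df = ljung_box_q(rng.standard_normal(200), m=10)
# For white noise, Q should be unremarkable relative to chi^2_10
# (the 95% critical value is about 18.31).
```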

10.2.4 Histogram and Q-Q Plot of Residuals

Assess the normality of the residuals:

- the histogram should be roughly symmetric, bell-shaped, and centred at zero;
- the Q-Q plot points should lie close to the reference line, with no heavy-tailed curvature.

Normality matters chiefly for the validity of prediction intervals; point forecasts are less affected.

10.2.5 Jarque-Bera Test for Normality

A formal test of normality based on the sample skewness \hat{S} and kurtosis \hat{K} (for a normal distribution, \hat{S} \approx 0 and \hat{K} \approx 3):

JB = \frac{T}{6}\left[\hat{S}^2 + \frac{(\hat{K}-3)^2}{4}\right] \sim \chi^2_2 \text{ under } H_0

10.2.6 ARCH-LM Test for Heteroscedasticity

The ARCH-LM test (Engle, 1982) tests whether the residual variance is serially correlated, via the auxiliary regression:

\hat{\epsilon}_t^2 = a_0 + a_1 \hat{\epsilon}_{t-1}^2 + \dots + a_m \hat{\epsilon}_{t-m}^2 + v_t

Test statistic: LM = T \cdot R^2 from this regression, distributed \chi^2_m under H_0 of no ARCH effects.

10.3 Overfitting and Parsimony

A model with too many parameters may overfit the training data (low in-sample errors) but generalise poorly to new data (high out-of-sample errors). The principle of parsimony (Occam's Razor) favours the simplest model that adequately captures the data structure. AICc and BIC both penalise complexity to guard against overfitting.


11. Forecasting and Prediction Intervals

11.1 Point Forecasts

The h-step-ahead point forecast made at time T is the conditional expectation:

\hat{Y}_{T+h|T} = E[Y_{T+h} \mid Y_T, Y_{T-1}, \dots]

For ARIMA models, forecasts are computed recursively using the estimated model equations, replacing unknown future values with their forecasts and unknown future errors with zero.

AR(p) forecast: \hat{Y}_{T+h|T} = \hat{c} + \hat{\phi}_1 \hat{Y}_{T+h-1|T} + \dots + \hat{\phi}_p \hat{Y}_{T+h-p|T}

Where \hat{Y}_{T+j|T} = Y_{T+j} for j \leq 0 (known past values), while for j > 0 the previously computed forecasts are substituted.
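The recursion can be sketched in a few lines of numpy-flavoured Python (the function name is an assumption):

```python
import numpy as np

def ar_forecast(y, phi, c=0.0, h=3):
    """Recursive AR(p) forecasts as in the equation above: known past values
    are used where available, previously computed forecasts otherwise."""
    history = list(np.asarray(y, dtype=float))
    forecasts = []
    for _ in range(h):
        # zip stops after len(phi) terms, i.e. the p most recent values
        yhat = c + sum(p_k * v for p_k, v in zip(phi, history[::-1]))
        forecasts.append(yhat)
        history.append(yhat)        # feed the forecast back into the recursion
    return forecasts

# AR(1) with phi_1 = 0.5, c = 0, last observation 2.0:
fcst = ar_forecast([2.0], phi=[0.5], h=3)   # → [1.0, 0.5, 0.25]
```

With no constant and |phi_1| < 1, the forecasts decay geometrically towards zero, the process mean.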

11.2 Forecast Error and Variance

The h-step-ahead forecast error is:

e_{T+h|T} = Y_{T+h} - \hat{Y}_{T+h|T}

For ARIMA models, Y_t has an infinite MA representation (the Wold decomposition):

Y_t = \mu + \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}

Where the \psi-weights can be derived recursively from the AR and MA polynomials. The variance of the h-step-ahead forecast error is:

\text{Var}(e_{T+h|T}) = \sigma^2 \sum_{j=0}^{h-1} \psi_j^2

For h = 1: \text{Var}(e_{T+1|T}) = \sigma^2 (the one-step forecast error variance equals the innovation variance).

The forecast uncertainty grows with the forecast horizon h, reflecting increasing uncertainty about the distant future.

11.3 Prediction Intervals

A (1-\alpha) \times 100\% prediction interval for Y_{T+h} is:

\hat{Y}_{T+h|T} \pm z_{\alpha/2} \cdot \sqrt{\hat{\sigma}^2 \sum_{j=0}^{h-1} \hat{\psi}_j^2}

For a 95% prediction interval, z_{0.025} = 1.96.

Key properties:

- intervals widen as the horizon h grows, because the forecast error variance accumulates;
- the standard formula assumes normally distributed errors;
- intervals computed this way ignore parameter-estimation uncertainty, so they tend to be slightly too narrow in practice.

11.4 Bootstrap Prediction Intervals

When residuals are non-normal, bootstrap prediction intervals are more reliable:

  1. Fit the model; save the standardised residuals \hat{\epsilon}_t^* = \hat{\epsilon}_t / \hat{\sigma}.
  2. For each bootstrap replicate b = 1, \dots, B: (a) simulate future errors by sampling with replacement from \{\hat{\epsilon}_t^*\}; (b) generate a simulated future path Y_{T+1}^{(b)}, \dots, Y_{T+h}^{(b)} using the fitted model.
  3. The \alpha/2 and 1-\alpha/2 percentiles of \{Y_{T+h}^{(b)}\} form the bootstrap PI.

Bootstrap PIs are distribution-free and automatically capture non-normality and non-linearity.
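The three steps above can be sketched for an ARIMA(0,1,1) model. This is a simplified illustration under stated assumptions: `theta1` and `residuals` are taken as given from a previously fitted model, and the last in-sample innovation is approximated by a resample.

```python
import numpy as np

def bootstrap_pi(y_last, theta1, residuals, h=3, B=500, alpha=0.05, seed=0):
    """Bootstrap prediction intervals for ARIMA(0,1,1):
    Delta Y_t = eps_t + theta1 * eps_{t-1}."""
    rng = np.random.default_rng(seed)
    resid = np.asarray(residuals, dtype=float)
    paths = np.empty((B, h))
    for b in range(B):
        eps_prev = rng.choice(resid)   # stand-in for the last in-sample innovation
        level = y_last
        for j in range(h):
            eps = rng.choice(resid)    # resample future errors with replacement
            level = level + eps + theta1 * eps_prev
            eps_prev = eps
            paths[b, j] = level
    lower = np.quantile(paths, alpha / 2, axis=0)
    upper = np.quantile(paths, 1 - alpha / 2, axis=0)
    return lower, upper

demo_resid = np.random.default_rng(1).standard_normal(150)
lower, upper = bootstrap_pi(100.0, -0.35, demo_resid, h=3)
```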

11.5 Benchmark Forecasting Methods

Before applying sophisticated models, it is good practice to compare against simple benchmark methods:

| Method | Formula | Use Case |
| --- | --- | --- |
| Naïve | \hat{Y}_{T+h\mid T} = Y_T | Random-walk-like series |
| Seasonal Naïve | \hat{Y}_{T+h\mid T} = Y_{T+h-mk}, \; k = \lceil h/m \rceil | Strongly seasonal series |
| Drift | \hat{Y}_{T+h\mid T} = Y_T + h \frac{Y_T - Y_1}{T-1} | Trending series |
| Mean | \hat{Y}_{T+h\mid T} = \bar{Y} | Stationary series around a constant mean |

Any proposed model should outperform these benchmarks (MASE < 1 relative to seasonal naïve).
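The four benchmarks in the table can be computed in a few lines (a sketch; the function name is an assumption):

```python
import numpy as np

def benchmarks(y, h, m):
    """Naive, seasonal naive, drift, and mean forecasts for horizons 1..h."""
    y = np.asarray(y, dtype=float)
    T = y.size
    hs = np.arange(1, h + 1)
    naive = np.full(h, y[-1])
    k = np.ceil(hs / m).astype(int)
    snaive = y[T + hs - m * k - 1]                 # Y_{T+h-mk}, 0-based indexing
    drift = y[-1] + hs * (y[-1] - y[0]) / (T - 1)
    mean = np.full(h, y.mean())
    return {"naive": naive, "snaive": snaive, "drift": drift, "mean": mean}

f = benchmarks([1, 2, 3, 4, 5, 6, 7, 8], h=4, m=4)
# naive repeats 8; snaive repeats the last season [5, 6, 7, 8];
# drift extrapolates the average historical slope; mean repeats 4.5.
```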


12. Advanced Topics

12.1 ARCH and GARCH Models

When residuals exhibit volatility clustering (periods of high volatility followed by high volatility, and calm followed by calm), standard ARIMA models with constant variance \sigma^2 are inadequate.

12.1.1 ARCH(m) Model

The Autoregressive Conditional Heteroscedasticity model (Engle, 1982) models the conditional variance as:

\sigma_t^2 = \omega + \alpha_1 \epsilon_{t-1}^2 + \alpha_2 \epsilon_{t-2}^2 + \dots + \alpha_m \epsilon_{t-m}^2

Where \omega > 0 and \alpha_j \geq 0 to ensure positive variance. The conditional variance depends on past squared residuals.

12.1.2 GARCH(p,q) Model

The Generalised ARCH model (Bollerslev, 1986) adds lagged conditional variances to the equation:

\sigma_t^2 = \omega + \sum_{i=1}^q \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^p \beta_j \sigma_{t-j}^2

With constraints \omega > 0, \alpha_i \geq 0, \beta_j \geq 0, and \sum \alpha_i + \sum \beta_j < 1 for covariance stationarity.

The GARCH(1,1) model is by far the most widely used:

\sigma_t^2 = \omega + \alpha_1 \epsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2

It captures the fact that large shocks to volatility decay slowly (volatility persistence = \alpha_1 + \beta_1, typically close to 1 for financial data).
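Simulating a GARCH(1,1) process makes volatility clustering visible. A numpy sketch (the function name is an assumption; the unconditional variance is \omega / (1 - \alpha_1 - \beta_1)):

```python
import numpy as np

def simulate_garch11(T, omega, alpha1, beta1, seed=0):
    """Simulate eps_t = sigma_t * z_t with the GARCH(1,1) variance recursion."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(T)
    sigma2 = np.empty(T)
    eps = np.empty(T)
    sigma2[0] = omega / (1.0 - alpha1 - beta1)   # start at the long-run variance
    eps[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, T):
        sigma2[t] = omega + alpha1 * eps[t - 1] ** 2 + beta1 * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return eps, sigma2

# Persistence alpha1 + beta1 = 0.95; long-run variance = 0.1 / 0.05 = 2.0
eps, sigma2 = simulate_garch11(5000, omega=0.1, alpha1=0.1, beta1=0.85)
```

Plotting `eps` would show calm stretches interrupted by bursts of large returns, the clustering pattern described above.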

12.2 Vector Autoregression (VAR)

When analysing multiple time series simultaneously and modelling their interdependencies, VAR models extend univariate AR models to a multivariate setting.

A VAR(p) model for a K-dimensional vector \mathbf{Y}_t = (Y_{1t}, Y_{2t}, \dots, Y_{Kt})^T is:

\mathbf{Y}_t = \mathbf{c} + \mathbf{A}_1 \mathbf{Y}_{t-1} + \mathbf{A}_2 \mathbf{Y}_{t-2} + \dots + \mathbf{A}_p \mathbf{Y}_{t-p} + \boldsymbol{\epsilon}_t

Where:

- \mathbf{c} is a K \times 1 vector of intercepts;
- each \mathbf{A}_i is a K \times K coefficient matrix;
- \boldsymbol{\epsilon}_t is a K \times 1 vector of white-noise errors with covariance matrix \boldsymbol{\Sigma}.

VAR models are used for:

- joint forecasting of several interrelated series;
- Granger-causality analysis;
- impulse response analysis (tracing the effect of a shock in one variable on the others);
- forecast error variance decomposition.

12.3 Structural Breaks

A structural break is a sudden change in the parameters of a time series model — e.g., a shift in the mean, a change in slope, or a change in variance — caused by an external event (financial crisis, policy change, pandemic).

Detection methods:

- visual inspection of the time plot for sudden level or slope shifts;
- CUSUM tests on recursive residuals;
- the Chow test (when the candidate break date is known);
- the Bai-Perron procedure (multiple breaks at unknown dates).

Handling structural breaks:

- model the segments before and after the break separately;
- include dummy (intervention) variables for the break;
- restrict estimation to the post-break sample when only recent behaviour matters.

12.4 Spectral Analysis

Spectral analysis (frequency-domain analysis) decomposes a time series into sinusoidal components of different frequencies, revealing cyclical patterns.

The spectral density (power spectrum) at frequency \omega \in [0, \pi] is:

f(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h) e^{-i\omega h} = \frac{\sigma^2}{2\pi} \left|\frac{\Theta(e^{-i\omega})}{\Phi(e^{-i\omega})}\right|^2

Peaks in the spectral density correspond to dominant cyclical frequencies. The periodogram is the sample estimate of the spectral density.


13. Using the Time Series Component

The Time Series component in the DataStatPro application provides a full end-to-end workflow for analysing and forecasting time series data.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain:

- a numeric variable holding the observed values;
- a date or index column defining the time ordering of the observations.

Step 2 — Select Time Series Variable Select the numeric variable to analyse from the "Time Series Variable" dropdown.

Step 3 — Select Date/Index Column Select the column identifying the time ordering of observations. Specify the frequency/period (e.g., monthly = 12, quarterly = 4, weekly = 52, daily = 365, hourly = 24).

Step 4 — Select Analysis Type Choose the type of analysis to perform:

- decomposition (trend, seasonal, and irregular components);
- stationarity testing with ACF/PACF exploration;
- model fitting and forecasting (ARIMA/SARIMA or exponential smoothing).

Step 5 — Configure Preprocessing

- optionally apply a variance-stabilising transformation (log or Box-Cox);
- set the regular and seasonal differencing orders, or rely on the stationarity tests to suggest them.

Step 6 — Configure Model

For ARIMA/SARIMA:

- specify the orders p, d, q (and, for seasonal data, P, D, Q and the period m), or enable automatic order selection (Section 9.4).

For Exponential Smoothing (ETS):

- choose the error, trend, and seasonal types (additive or multiplicative) and whether the trend is damped, or let the application select the ETS form automatically.

Step 7 — Set Forecast Horizon Specify the number of periods to forecast ahead (h). Choose the confidence level for prediction intervals (default: 95%).

Step 8 — Select Display Options Choose which outputs to display:

- decomposition and forecast plots;
- ACF/PACF and residual diagnostic plots;
- model summary, accuracy metrics, and forecast tables.

Step 9 — Run the Analysis Click "Run Time Series Analysis". The application will:

  1. Parse and sort the time series by date/index.
  2. Apply any specified transformations.
  3. Run stationarity tests and produce ACF/PACF plots.
  4. Fit the specified (or automatically selected) model.
  5. Run residual diagnostics.
  6. Generate forecasts with prediction intervals.
  7. Compute accuracy metrics on the hold-out set (if configured).
  8. Produce all selected visualisations and tables.

14. Computational and Formula Details

14.1 The Wold Decomposition and ψ-Weights

Any covariance-stationary process can be represented as an infinite MA:

Y_t - \mu = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}, \quad \psi_0 = 1

For an ARMA(p,q) model, the ψ-weights satisfy the recursion:

\psi_j = \theta_j + \sum_{k=1}^{\min(j,p)} \phi_k \psi_{j-k}, \quad j = 1, 2, \dots

Where \theta_j = 0 for j > q and \phi_k = 0 for k > p.

These weights are used to compute forecast error variances and prediction intervals.
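The recursion and the resulting forecast error variance can be sketched as follows (function names are assumptions):

```python
import numpy as np

def psi_weights(phi, theta, n):
    """psi-weight recursion: psi_0 = 1 and
    psi_j = theta_j + sum_{k=1}^{min(j,p)} phi_k * psi_{j-k}."""
    psi = np.zeros(n + 1)
    psi[0] = 1.0
    for j in range(1, n + 1):
        psi[j] = theta[j - 1] if j <= len(theta) else 0.0
        for k in range(1, min(j, len(phi)) + 1):
            psi[j] += phi[k - 1] * psi[j - k]
    return psi

# AR(1) with phi_1 = 0.5: psi_j = 0.5**j
psi = psi_weights([0.5], [], 4)
# Forecast error variance at horizon h = 3 with sigma^2 = 1:
var_h3 = 1.0 * np.sum(psi[:3] ** 2)   # 1 + 0.25 + 0.0625 = 1.3125
```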

14.2 Yule-Walker Equations for AR Parameter Estimation

For an AR(p) model, the Yule-Walker equations relate the ACF to the AR coefficients:

\begin{pmatrix} \rho(1) \\ \rho(2) \\ \vdots \\ \rho(p) \end{pmatrix} = \begin{pmatrix} 1 & \rho(1) & \cdots & \rho(p-1) \\ \rho(1) & 1 & \cdots & \rho(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \rho(p-1) & \rho(p-2) & \cdots & 1 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}

Or in matrix form: \boldsymbol{\rho} = \mathbf{R}\boldsymbol{\phi}, giving \hat{\boldsymbol{\phi}} = \mathbf{R}^{-1}\boldsymbol{\rho}.

The innovation variance is: \hat{\sigma}^2 = \hat{\gamma}(0)(1 - \hat{\boldsymbol{\phi}}^T \hat{\boldsymbol{\rho}}).
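Solving the system is a one-liner once the Toeplitz matrix R is built. A numpy sketch (the function name is an assumption), verified against a theoretical AR(2) ACF:

```python
import numpy as np

def yule_walker(rho):
    """Solve R phi = rho for the AR coefficients, where rho = (rho(1),...,rho(p))."""
    rho = np.asarray(rho, dtype=float)
    p = rho.size
    acf = np.r_[1.0, rho]                 # rho(0) = 1 prepended
    # Toeplitz matrix: R[i, j] = rho(|i - j|)
    R = acf[np.abs(np.subtract.outer(np.arange(p), np.arange(p)))]
    return np.linalg.solve(R, rho)

# AR(2) with phi = (0.5, 0.3); the theoretical ACF satisfies
# rho(1) = phi1 / (1 - phi2) and rho(2) = phi1*rho(1) + phi2.
rho1 = 0.5 / (1 - 0.3)
rho2 = 0.5 * rho1 + 0.3
phi_hat = yule_walker([rho1, rho2])   # recovers (0.5, 0.3)
```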

14.3 ARIMA Estimation via the Kalman Filter

Exact MLE for ARIMA models is computed using the Kalman filter, which recursively computes the innovations and their variances.

State space representation of ARIMA(p,d,q):

\mathbf{x}_t = \mathbf{F} \mathbf{x}_{t-1} + \mathbf{g} \epsilon_t \quad \text{(transition equation)}

Y_t = \mathbf{h}^T \mathbf{x}_t + \epsilon_t \quad \text{(measurement equation)}

Where \mathbf{F}, \mathbf{g}, \mathbf{h} are matrices determined by the ARIMA orders. The Kalman filter provides the optimal linear predictor \hat{Y}_{t|t-1} and the innovation e_t = Y_t - \hat{Y}_{t|t-1} at each step. The log-likelihood is:

\ell = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^T \ln f_t - \frac{1}{2}\sum_{t=1}^T \frac{e_t^2}{f_t}

Where f_t = \text{Var}(e_t) is the innovation variance at time t.

14.4 Seasonal Decomposition Formulae

For a series with seasonal period m, the classical decomposition proceeds as:

Step 1: Estimate the trend using a centred moving average (CMA) of order m.

For even m (e.g., monthly data, m = 12):

\hat{T}_t = \frac{1}{24}\left(Y_{t-6} + 2Y_{t-5} + \dots + 2Y_{t+5} + Y_{t+6}\right) = \frac{1}{12}\left(\frac{Y_{t-6}}{2} + Y_{t-5} + \dots + Y_{t+5} + \frac{Y_{t+6}}{2}\right)

Step 2: Compute detrended values.

d_t = \begin{cases} Y_t - \hat{T}_t & \text{(additive)} \\ Y_t / \hat{T}_t & \text{(multiplicative)} \end{cases}

Step 3: Average d_t over all periods of the same season j to get raw seasonal indices:

\bar{s}_j = \frac{1}{n_j} \sum_{t: \text{season}(t) = j} d_t, \quad j = 1, \dots, m

Step 4: Normalise the seasonal indices.

Additive: \hat{s}_j = \bar{s}_j - \frac{1}{m}\sum_{j=1}^m \bar{s}_j (so they sum to zero).

Multiplicative: \hat{s}_j = \bar{s}_j / \left(\frac{1}{m}\sum_{j=1}^m \bar{s}_j\right) (so they average to 1).

Step 5: Compute the irregular component.

\hat{I}_t = \begin{cases} Y_t - \hat{T}_t - \hat{S}_t & \text{(additive)} \\ Y_t / (\hat{T}_t \cdot \hat{S}_t) & \text{(multiplicative)} \end{cases}
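Steps 1-4 for the additive case can be sketched in numpy (the function name is an assumption; edges where the CMA is undefined are left as NaN):

```python
import numpy as np

def classical_additive_decomposition(y, m):
    """Classical additive decomposition, steps 1-4, for an even seasonal period m."""
    y = np.asarray(y, dtype=float)
    T = y.size
    k = m // 2
    # Step 1: centred moving average of order m (weights 1/2, 1, ..., 1, 1/2).
    w = np.r_[0.5, np.ones(m - 1), 0.5] / m
    trend = np.full(T, np.nan)
    trend[k : T - k] = np.convolve(y, w, mode="valid")
    # Step 2: detrended values.
    d = y - trend
    # Steps 3-4: per-season averages, normalised to sum to zero.
    s_bar = np.array([np.nanmean(d[j::m]) for j in range(m)])
    s_hat = s_bar - s_bar.mean()
    return trend, s_hat

# Synthetic quarterly series: linear trend plus seasonal pattern (2, -1, -2, 1).
t = np.arange(40)
season = np.array([2.0, -1.0, -2.0, 1.0])
y = 10 + 0.5 * t + season[t % 4]
trend, s_hat = classical_additive_decomposition(y, m=4)
```

Because the CMA averages over exactly one full seasonal cycle, it recovers the linear trend exactly here, and the seasonal indices match the true pattern.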

14.5 Computing ACF and PACF

Sample ACF at lag h:

\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}, \quad \hat{\gamma}(h) = \frac{1}{T}\sum_{t=1}^{T-h}(Y_t - \bar{Y})(Y_{t+h} - \bar{Y})

Sample PACF via the Durbin-Levinson algorithm:

Initialise: \hat{\phi}_{1,1} = \hat{\rho}(1)

For k = 2, 3, \dots:

\hat{\phi}_{k,k} = \frac{\hat{\rho}(k) - \sum_{j=1}^{k-1}\hat{\phi}_{k-1,j}\hat{\rho}(k-j)}{1 - \sum_{j=1}^{k-1}\hat{\phi}_{k-1,j}\hat{\rho}(j)}

\hat{\phi}_{k,j} = \hat{\phi}_{k-1,j} - \hat{\phi}_{k,k}\hat{\phi}_{k-1,k-j}, \quad j = 1, \dots, k-1

The PACF at lag k is \hat{\phi}_{k,k}.
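The Durbin-Levinson recursion can be implemented directly (the function name is an assumption). For an AR(1) process the theoretical ACF is \rho(h) = \phi^h, and the PACF should cut off after lag 1:

```python
import numpy as np

def pacf_from_acf(rho):
    """Durbin-Levinson recursion: map ACF values rho(1..K) to the PACF
    values phi_{k,k} for k = 1..K."""
    rho = np.asarray(rho, dtype=float)
    K = rho.size
    pacf = np.zeros(K)
    pacf[0] = rho[0]                       # phi_{1,1} = rho(1)
    phi_prev = np.array([rho[0]])
    for k in range(2, K + 1):
        num = rho[k - 1] - np.sum(phi_prev * rho[k - 2::-1])
        den = 1.0 - np.sum(phi_prev * rho[:k - 1])
        phi_kk = num / den
        # Update phi_{k,j} = phi_{k-1,j} - phi_{k,k} * phi_{k-1,k-j}
        phi_prev = np.r_[phi_prev - phi_kk * phi_prev[::-1], phi_kk]
        pacf[k - 1] = phi_kk
    return pacf

# AR(1) with phi = 0.6: rho = (0.6, 0.36, 0.216) → PACF = (0.6, 0, 0)
pacf = pacf_from_acf([0.6, 0.36, 0.216])
```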

14.6 Optimising Exponential Smoothing Parameters

Smoothing parameters (\alpha, \beta^*, \gamma) are estimated by minimising the sum of squared one-step-ahead forecast errors:

SSE(\alpha, \beta^*, \gamma) = \sum_{t=1}^T \left(Y_t - \hat{Y}_{t|t-1}\right)^2

Subject to constraints (e.g., 0 < \alpha < 1, 0 < \beta^* < 1, 0 < \gamma < 1). This is solved using numerical optimisation (e.g., Nelder-Mead, L-BFGS-B), typically starting from a grid of initial values to avoid local minima.
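For the single-parameter SES case, the objective and a coarse grid search (a simple stand-in for the numerical optimisers named above) can be sketched as:

```python
import numpy as np

def ses_sse(y, alpha):
    """Sum of squared one-step-ahead errors for simple exponential smoothing,
    initialising the level at the first observation."""
    level = y[0]
    sse = 0.0
    for obs in y[1:]:
        err = obs - level                      # one-step-ahead forecast error
        sse += err ** 2
        level = alpha * obs + (1 - alpha) * level
    return sse

def fit_ses_alpha(y, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the grid value of alpha with the smallest SSE."""
    sses = [ses_sse(y, a) for a in grid]
    return grid[int(np.argmin(sses))]

rng = np.random.default_rng(7)
y = np.cumsum(rng.standard_normal(200))   # random walk: alpha near 1 is optimal
alpha_hat = fit_ses_alpha(y)
```

For a random walk the best one-step forecast is the last observation, so the search lands near the upper end of the grid.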


15. Worked Examples

Example 1: ARIMA Modelling of Monthly Sales Data

Data: Monthly sales figures for a retail company, T = 60 months (5 years), no seasonal pattern.

Step 1: Plot and Examine the Series

Visual inspection reveals an upward trend with roughly constant variance → possible ARIMA model with d = 1.

Step 2: Stationarity Testing

ADF test on the original series: \tau = -1.82, p = 0.37 → Fail to reject H_0 → Non-stationary.

Apply the first difference: \Delta Y_t = Y_t - Y_{t-1}.

ADF test on \Delta Y_t: \tau = -5.41, p < 0.01 → Reject H_0 → Stationary after first differencing. Therefore d = 1.

KPSS test on \Delta Y_t: statistic = 0.12, p > 0.10 → Fail to reject H_0 → Stationary. Both tests agree: d = 1.

Step 3: ACF and PACF of \Delta Y_t

| Lag | ACF \hat{\rho}(h) | Significant? | PACF \hat{\phi}_{hh} | Significant? |
| --- | --- | --- | --- | --- |
| 1 | -0.312 | Yes | -0.312 | Yes |
| 2 | 0.051 | No | -0.065 | No |
| 3 | -0.038 | No | -0.052 | No |
| 4 | 0.029 | No | 0.018 | No |

Pattern: the ACF has a single significant spike at lag 1 (cuts off after lag 1); the PACF decays (though here it also approximately cuts off after lag 1). This suggests an MA(1) process for \Delta Y_t, i.e., ARIMA(0,1,1).

Step 4: Fit ARIMA(0,1,1)

Estimated model:

\Delta Y_t = \hat{\epsilon}_t + \hat{\theta}_1 \hat{\epsilon}_{t-1}

\hat{\theta}_1 = -0.348 \quad (SE = 0.117, \quad z = -2.97, \quad p = 0.003)

\hat{\sigma}^2 = 156.4, \quad \text{AICc} = 423.7

For completeness, also fit ARIMA(1,1,0): \hat{\phi}_1 = -0.332, AICc = 424.3. ARIMA(0,1,1) has the lower AICc → preferred.

Step 5: Residual Diagnostics

Model passes all diagnostic checks.

Step 6: Forecasting

For h = 1, 2, 3 steps ahead, the ARIMA(0,1,1) forecasts are:

\hat{Y}_{T+1|T} = Y_T + \hat{\theta}_1 \hat{\epsilon}_T \quad (h = 1)

\hat{Y}_{T+h|T} = \hat{Y}_{T+1|T} \quad (h \geq 2, \text{ flat after } h = 1)

The \psi-weights: \psi_0 = 1, \psi_j = 1 + \hat{\theta}_1 = 1 + (-0.348) = 0.652 for j \geq 1.

95% Prediction Intervals:

\text{Var}(e_{T+h|T}) = \hat{\sigma}^2 \sum_{j=0}^{h-1}\psi_j^2

For h = 1: \text{Var} = 156.4 \times 1 = 156.4, SE = 12.51, PI = \hat{Y}_{T+1|T} \pm 1.96 \times 12.51

For h = 2: \text{Var} = 156.4 \times (1 + 0.652^2) = 156.4 \times 1.425 = 222.9, SE = 14.93

For h = 3: \text{Var} = 156.4 \times (1 + 0.652^2 + 0.652^2) = 156.4 \times 1.850 = 289.4, SE = 17.01

| Horizon | Forecast | 95% PI Lower | 95% PI Upper |
| --- | --- | --- | --- |
| T+1 | 524.3 | 499.8 | 548.8 |
| T+2 | 524.3 | 494.9 | 553.7 |
| T+3 | 524.3 | 490.9 | 557.7 |

Note: Prediction intervals widen with horizon, reflecting growing uncertainty.
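The interval arithmetic of this example can be verified directly from the fitted values \hat{\sigma}^2 = 156.4, \hat{\theta}_1 = -0.348, and the flat point forecast 524.3:

```python
import numpy as np

# ARIMA(0,1,1): psi_0 = 1 and psi_j = 1 + theta_1 for j >= 1.
sigma2, theta1, forecast, z = 156.4, -0.348, 524.3, 1.96
psi = np.array([1.0, 1 + theta1, 1 + theta1])      # psi_0, psi_1, psi_2
se = np.sqrt(sigma2 * np.cumsum(psi ** 2))         # SE for h = 1, 2, 3
lower, upper = forecast - z * se, forecast + z * se
```

The computed SEs and interval bounds reproduce the table values up to rounding.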


Example 2: SARIMA Modelling of Monthly Airline Passenger Data

Data: Monthly international airline passenger counts (thousands), T = 144 months (12 years). This is the classic Box-Jenkins "airline dataset."

Step 1: Plot and Transform

The series shows:

- a clear upward trend;
- strong annual seasonality (m = 12);
- seasonal swings that grow with the level (multiplicative seasonality).

Apply a log transformation (\lambda = 0) to stabilise the variance:

W_t = \ln(Y_t)

Step 2: Determine Differencing Orders

ADF test on W_t: non-stationary → apply regular differencing (d = 1).

Canova-Hansen test on \Delta W_t: seasonal non-stationarity detected → apply seasonal differencing (D = 1, m = 12).

Let V_t = \Delta\Delta_{12} W_t = (1-L)(1-L^{12}) W_t.

ADF test on V_t: stationary ✅. Therefore d = 1, D = 1.

Step 3: ACF and PACF of V_t

Significant ACF spikes at lags 1 and 12; the PACF decays from lag 1 and shows a spike at lag 12. This pattern strongly suggests:

- a non-seasonal MA(1) term (q = 1);
- a seasonal MA(1) term (Q = 1).

Candidate model: SARIMA(0,1,1)(0,1,1)_{12} — the airline model.

Step 4: Fit SARIMA(0,1,1)(0,1,1)_{12} on \ln(Y_t)

\Delta\Delta_{12} W_t = (1 + \hat{\theta}_1 L)(1 + \hat{\Theta}_1 L^{12}) \epsilon_t

| Parameter | Estimate | SE | z-value | p-value |
| --- | --- | --- | --- | --- |
| \hat{\theta}_1 (non-seasonal MA) | -0.402 | 0.083 | -4.84 | < 0.001 |
| \hat{\Theta}_1 (seasonal MA) | -0.557 | 0.073 | -7.63 | < 0.001 |
| \hat{\sigma}^2 | 0.00134 |  |  |  |

AICc = −467.3.

Step 5: Residual Diagnostics

Residual diagnostics (ACF plot, Ljung-Box test, Q-Q plot) show no evidence of remaining structure; the model is adequate for forecasting.

Step 6: Forecasting (12 months ahead)

Forecasts are generated on the log scale and back-transformed with a bias correction using half the h-step forecast error variance on the log scale, \hat{\sigma}_h^2 = \hat{\sigma}^2 \sum_{j=0}^{h-1}\hat{\psi}_j^2:

\hat{Y}_{T+h|T} = \exp\left(\hat{W}_{T+h|T} + \frac{\hat{\sigma}_h^2}{2}\right)

| Month | \hat{W}_{T+h} | \hat{Y}_{T+h} (000s) | 95% PI Lower | 95% PI Upper |
| --- | --- | --- | --- | --- |
| T+1 | 6.181 | 483.5 | 447.2 | 522.8 |
| T+6 | 6.302 | 545.1 | 490.3 | 606.2 |
| T+12 | 6.249 | 513.7 | 446.9 | 590.4 |

Example 3: Holt-Winters Forecasting of Quarterly Retail Sales

Data: Quarterly retail sales (T = 40 quarters, 10 years). Clear upward trend and stable multiplicative seasonality.

Selected Model: ETS(M,A,M) — Multiplicative Holt-Winters.

Estimated Parameters (via SSE minimisation):

\hat{\alpha} = 0.312, \quad \hat{\beta}^* = 0.058, \quad \hat{\gamma} = 0.183, \quad \hat{\phi} = 1.000 \text{ (not damped)}

Final State Values at t = T:

\hat{\ell}_T = 412.6, \quad \hat{b}_T = 4.82

Seasonal Indices:

| Quarter | \hat{s} |
| --- | --- |
| Q1 | 0.863 |
| Q2 | 0.971 |
| Q3 | 1.048 |
| Q4 | 1.118 |

Check: 0.863 + 0.971 + 1.048 + 1.118 = 4.000, average = 1.00 ✅

4-Step-Ahead Forecasts:

\hat{Y}_{T+h|T} = (\hat{\ell}_T + h \hat{b}_T) \times \hat{s}_{T+h-m}

h = 1 (Q1 of next year): \hat{Y}_{T+1|T} = (412.6 + 1 \times 4.82) \times 0.863 = 417.42 \times 0.863 = 360.2

h = 2 (Q2): \hat{Y}_{T+2|T} = (412.6 + 2 \times 4.82) \times 0.971 = 422.24 \times 0.971 = 410.0

h = 3 (Q3): \hat{Y}_{T+3|T} = (412.6 + 3 \times 4.82) \times 1.048 = 427.06 \times 1.048 = 447.6

h = 4 (Q4): \hat{Y}_{T+4|T} = (412.6 + 4 \times 4.82) \times 1.118 = 431.88 \times 1.118 = 482.8
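These four forecasts follow directly from the final states and seasonal indices reported above:

```python
import numpy as np

# Final Holt-Winters states and seasonal indices from this example.
level, trend = 412.6, 4.82
s_hat = np.array([0.863, 0.971, 1.048, 1.118])   # Q1..Q4 indices
h = np.arange(1, 5)
forecasts = (level + h * trend) * s_hat          # (l_T + h*b_T) * s
```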

Accuracy Metrics (on 8-quarter hold-out set):

| Metric | Value |
| --- | --- |
| MAE | 12.4 |
| RMSE | 15.7 |
| MAPE | 3.2% |
| MASE | 0.61 |

MASE = 0.61 < 1: the Holt-Winters model outperforms the seasonal naïve benchmark by 39%.


16. Common Mistakes and How to Avoid Them

Mistake 1: Fitting ARMA to a Non-Stationary Series

Problem: Applying ARMA models directly to a trending or non-stationary series, producing spurious results, unreliable coefficients, and invalid inference.
Solution: Always test for stationarity (ADF, KPSS) before fitting ARMA. Apply the necessary differencing (and/or transformation) to achieve stationarity. Use ARIMA(p,d,q) with the appropriate d.

Mistake 2: Over-Differencing

Problem: Applying more differences than necessary (e.g., differencing twice when once is sufficient), which introduces unnecessary MA components and inflates forecast variance.
Solution: Apply the minimum number of differences needed to pass stationarity tests. Check whether the differenced series passes ADF/KPSS before differencing again. If the ACF of the differenced series shows a large negative spike at lag 1, it may be over-differenced.

Mistake 3: Ignoring Seasonality

Problem: Fitting a non-seasonal ARIMA model to a clearly seasonal series, leaving seasonal structure in the residuals (which will show spikes in the ACF at seasonal lags).
Solution: Identify the seasonal period m from domain knowledge and data inspection. Apply seasonal differencing if needed (D = 1). Use SARIMA or ETS models with seasonal components. Always check the ACF of residuals at seasonal lags.

Mistake 4: Confusing ACF and PACF Patterns

Problem: Misreading the ACF/PACF and specifying wrong model orders (e.g., using an AR model when MA would be more appropriate).
Solution: Remember: ACF cuts off for MA; PACF cuts off for AR; both decay for ARMA. Use information criteria (AICc, BIC) alongside ACF/PACF to narrow down model orders. Consider multiple candidate models.

Mistake 5: Not Checking Residual Diagnostics

Problem: Accepting a model without verifying that the residuals are white noise, leading to a mis-specified model with poor forecasting performance.
Solution: Always perform a full residual diagnostic check: time plot, ACF plot, Ljung-Box test, histogram, Q-Q plot. If residuals show autocorrelation, refine the model. If they show heteroscedasticity, consider GARCH.

Mistake 6: Evaluating Model Fit on Training Data Only

Problem: Selecting a model based solely on in-sample fit metrics (e.g., the lowest AIC) without validating forecast performance on held-out data.
Solution: Reserve the last h observations as a test set. Compute out-of-sample accuracy metrics (RMSE, MASE). Use time-series cross-validation (rolling-origin or expanding-window) for more robust evaluation.

Mistake 7: Ignoring Structural Breaks

Problem: Fitting a single model over a period that contains a structural break (e.g., a financial crisis, a pandemic), leading to poor fit and unreliable forecasts.
Solution: Plot the series and look for sudden level shifts or trend changes. Test for breaks formally (CUSUM, Chow, Bai-Perron). If breaks are detected, model each segment separately or include dummy variables.

Mistake 8: Treating Multiplicative Seasonality as Additive

Problem: Using an additive decomposition or additive Holt-Winters when seasonal variation grows with the level, underestimating seasonality in peak periods and overestimating in troughs.
Solution: Plot the series and assess whether seasonal swings are roughly constant (additive) or grow proportionally with the level (multiplicative). Apply a log transformation to convert multiplicative to additive, or use the multiplicative ETS model directly.

Mistake 9: Extrapolating Trends Too Far

Problem: Generating long-horizon forecasts from a model with a strong trend, leading to unrealistic forecasts that grow without bound.
Solution: Use the damped trend method (ETS with damping) for longer horizons. Report widening prediction intervals to communicate growing uncertainty. Treat long-range forecasts with appropriate scepticism.

Mistake 10: Using MAPE with Near-Zero Values

Problem: MAPE is undefined or extremely large when the actual values Y_t are zero or close to zero, leading to misleading accuracy assessments.
Solution: Use MASE or RMSE instead of MAPE when the series contains zero or very small values. MASE is always well-defined and has the additional advantage of being scale-free.


17. Troubleshooting

| Issue | Likely Cause | Solution |
| --- | --- | --- |
| ARIMA model fails to converge | Very short series; too many parameters; near-unit-root behaviour | Reduce p, q; check stationarity; use a simpler model |
| AICc selects a very high-order model (p or q > 3) | Insufficient data; non-stationarity not fully addressed; outliers | Increase data length; check stationarity; inspect for outliers |
| Residual ACF shows a significant spike at the seasonal lag (e.g., lag 12) | Seasonal component not modelled | Add a seasonal MA or AR term (Q = 1 or P = 1); apply seasonal differencing (D = 1) |
| Residual ACF has one large negative spike at lag 1 | Over-differencing (d or D too large) | Reduce the differencing order by 1 |
| Prediction intervals are extremely wide | High d or D; large \hat{\sigma}^2; long forecast horizon | Reconsider differencing; check for outliers inflating \hat{\sigma}^2; report a shorter horizon |
| ADF and KPSS tests give contradictory results | Near-unit-root behaviour; small sample | Increase the sample if possible; use the PP test as a tiebreaker; consult the ACF pattern |
| Forecasts quickly converge to a flat line | Random walk structure (d = 1, no AR); or SES applied | Expected behaviour for ARIMA(0,1,1); use Holt's method if a trend is needed |
| Holt-Winters gives very poor out-of-sample accuracy | Wrong seasonality type (additive vs. multiplicative); outliers at the end of the series | Try both additive and multiplicative; inspect and handle outliers |
| GARCH estimation fails to converge | Near-integrated volatility (\alpha + \beta \approx 1); insufficient data | Try IGARCH; increase data; use a simpler ARCH(1) |
| Ljung-Box test is always significant regardless of model | Outliers or structural breaks inflating residual autocorrelation | Identify and handle outliers; test for structural breaks; use robust estimation |
| Seasonal naïve outperforms all fitted models (MASE > 1) | Insufficient data to estimate a model; strong irregular component | Collect more data; consider ensemble approaches; report seasonal naïve as the baseline |
| Log transformation cannot be applied or produces invalid values | Series contains zeros or negative values | Use Box-Cox with \lambda > 0; add a constant before transforming |

18. Quick Reference Cheat Sheet

Core Model Equations

| Model | Equation |
| --- | --- |
| AR(p) | Y_t = c + \sum_{j=1}^p \phi_j Y_{t-j} + \epsilon_t |
| MA(q) | Y_t = \mu + \epsilon_t + \sum_{j=1}^q \theta_j \epsilon_{t-j} |
| ARMA(p,q) | \Phi(L)Y_t = c + \Theta(L)\epsilon_t |
| ARIMA(p,d,q) | \Phi(L)(1-L)^d Y_t = c + \Theta(L)\epsilon_t |
| SARIMA(p,d,q)(P,D,Q)_m | \Phi(L)\Phi_S(L^m)(1-L)^d(1-L^m)^D Y_t = c + \Theta(L)\Theta_S(L^m)\epsilon_t |
| SES | \ell_t = \alpha Y_t + (1-\alpha)\ell_{t-1}; \quad \hat{Y}_{t+h} = \ell_t |
| Holt's | \ell_t = \alpha Y_t + (1-\alpha)(\ell_{t-1}+b_{t-1}); \quad b_t = \beta^*(\ell_t-\ell_{t-1})+(1-\beta^*)b_{t-1}; \quad \hat{Y}_{t+h} = \ell_t + hb_t |
| HW Additive | \hat{Y}_{t+h} = \ell_t + hb_t + s_{t+h-m(k+1)} |
| HW Multiplicative | \hat{Y}_{t+h} = (\ell_t + hb_t) \times s_{t+h-m(k+1)} |
| GARCH(1,1) | \sigma_t^2 = \omega + \alpha_1\epsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 |

ACF/PACF Pattern Guide

| Model | ACF | PACF |
| --- | --- | --- |
| White Noise | No spikes | No spikes |
| AR(p) | Decays exponentially/sinusoidally | Cuts off after lag p |
| MA(q) | Cuts off after lag q | Decays exponentially/sinusoidally |
| ARMA(p,q) | Decays after lag q-p | Decays after lag p-q |
| Non-stationary | Very slow decay (near 1.0) | Large spike at lag 1 |
| Seasonal AR(P) | Spikes at m, 2m, \dots decaying | Spike at lag m only |
| Seasonal MA(Q) | Spike at lag m only | Spikes at m, 2m, \dots decaying |

Stationarity Tests Summary

| Test | H_0 | H_1 | Significant result means |
| --- | --- | --- | --- |
| ADF | Unit root (non-stationary) | Stationary | p < 0.05: Stationary |
| KPSS | Stationary | Non-stationary (unit root) | p < 0.05: Non-stationary |
| PP | Unit root (non-stationary) | Stationary | p < 0.05: Stationary |

Model Selection Guide

| Scenario | Recommended Model |
| --- | --- |
| No trend, no seasonality | SES / ARIMA(0,1,1) |
| Trend, no seasonality | Holt's / ARIMA(0,2,2) |
| Trend, damped | Damped Holt's / ETS(A,A_d,N) |
| Additive seasonality + trend | Additive HW / SARIMA |
| Multiplicative seasonality + trend | Multiplicative HW / SARIMA on log scale |
| Volatility clustering (financial data) | GARCH(1,1) on residuals |
| Multiple interrelated series | VAR(p) |
| External predictors available | SARIMAX |
| Unknown structure; small dataset | ETS with automatic selection |
| Unknown structure; large dataset | Auto ARIMA |

Differencing Guide

| Pattern in Original Series | Action |
| --- | --- |
| No trend, no seasonality, ACF decays fast | No differencing needed (d = 0, D = 0) |
| Linear trend; ADF non-significant | First difference (d = 1) |
| Quadratic trend | Second difference (d = 2) |
| Seasonal non-stationarity | Seasonal difference (D = 1) |
| Both trend and seasonal non-stationarity | d = 1 and D = 1 |
| Increasing variance | Log or Box-Cox transformation first |

Information Criteria Reference

| Criterion | Formula | Prefer | Notes |
| --- | --- | --- | --- |
| AIC | -2\ell + 2k | Lower | Can overfit with small T |
| AICc | \text{AIC} + \frac{2k(k+1)}{T-k-1} | Lower | Recommended for time series |
| BIC | -2\ell + k\ln(T) | Lower | More parsimonious than AIC |

Forecast Accuracy Metrics

| Metric | Formula | Notes |
| --- | --- | --- |
| MAE | \frac{1}{h}\sum \vert e_t \vert | Intuitive; same units as data |
| RMSE | \sqrt{\frac{1}{h}\sum e_t^2} | Penalises large errors; same units |
| MAPE | \frac{100}{h}\sum \vert e_t/Y_t \vert | Percentage; undefined if Y_t = 0 |
| MASE | \text{MAE} / \text{MAE}_{\text{naïve}} | Scale-free; MASE < 1 beats naïve |

ETS Model Taxonomy

| Error | Trend | Seasonal | ETS Code | Method |
| --- | --- | --- | --- | --- |
| A | N | N | ETS(A,N,N) | SES |
| A | A | N | ETS(A,A,N) | Holt's Linear |
| A | A_d | N | ETS(A,A_d,N) | Damped Holt's |
| A | A | A | ETS(A,A,A) | Additive HW |
| M | A | M | ETS(M,A,M) | Multiplicative HW |
| M | A_d | M | ETS(M,A_d,M) | Damped Multiplicative HW |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Time Series Analysis using the DataStatPro application. For further reading, consult Hyndman & Athanasopoulos's "Forecasting: Principles and Practice" (freely available at otexts.com/fpp3), Box, Jenkins, Reinsel & Ljung's "Time Series Analysis: Forecasting and Control", or Brockwell & Davis's "Introduction to Time Series and Forecasting". For feature requests or support, contact the DataStatPro team.