
Survival Analysis

Comprehensive reference guide for time-to-event analysis and survival modeling.

Survival Analysis: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Survival Analysis all the way through advanced model specification, estimation, evaluation, and practical usage within the DataStatPro application. Whether you are encountering survival analysis for the first time or looking to deepen your understanding of time-to-event methods, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Survival Analysis?
  3. The Mathematics Behind Survival Analysis
  4. Assumptions of Survival Analysis
  5. Types of Survival Analysis Methods
  6. Using the Survival Analysis Component
  7. Data Structure and Censoring
  8. Non-Parametric Methods: Kaplan-Meier Estimation
  9. Comparing Survival Curves: Log-Rank and Related Tests
  10. Semi-Parametric Methods: Cox Proportional Hazards Model
  11. Parametric Survival Models
  12. Model Fit and Evaluation
  13. Advanced Topics
  14. Worked Examples
  15. Common Mistakes and How to Avoid Them
  16. Troubleshooting
  17. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into survival analysis, it is helpful to be familiar with the following foundational concepts. Each is briefly reviewed below.

1.1 Probability and Probability Distributions

A probability distribution describes the likelihood of all possible outcomes of a random variable. In survival analysis, the primary random variable is time — specifically, the time until a specific event occurs.

The probability density function (PDF) f(t) of a continuous random variable T satisfies:

f(t) \geq 0 \quad \text{and} \quad \int_0^{\infty} f(t) \, dt = 1

The cumulative distribution function (CDF) F(t) gives the probability that the event has occurred by time t:

F(t) = P(T \leq t) = \int_0^t f(u) \, du

1.2 The Complement: Survival Probability

The survival function S(t) is the complement of the CDF — it gives the probability of surviving (not experiencing the event) beyond time t:

S(t) = P(T > t) = 1 - F(t) = \int_t^{\infty} f(u) \, du

Key properties:

- S(0) = 1 — no one has experienced the event at time zero
- S(t) is monotonically non-increasing
- 0 \leq S(t) \leq 1 for all t

1.3 Rates and Conditional Probability

A rate measures how frequently an event occurs per unit of time. The conditional probability P(A \mid B) is the probability of event A given that event B has already occurred:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

In survival analysis, we often ask: "Given that an individual has survived until time t, what is the probability they experience the event in the next small interval of time?" This is the hazard — one of the central concepts in survival analysis.

1.4 Likelihood and Maximum Likelihood Estimation

The likelihood function L(\boldsymbol{\theta}) measures how probable the observed data are, given a set of model parameters \boldsymbol{\theta}:

L(\boldsymbol{\theta}) = \prod_{i=1}^{n} P(\text{observation}_i \mid \boldsymbol{\theta})

Maximum Likelihood Estimation (MLE) finds the parameter values \hat{\boldsymbol{\theta}} that maximise L(\boldsymbol{\theta}) (or equivalently, maximise the log-likelihood \ell(\boldsymbol{\theta}) = \ln L(\boldsymbol{\theta})). Most survival models are estimated via MLE or partial MLE.

1.5 The Exponential Distribution — A Simple Survival Model

The exponential distribution is the simplest parametric model for survival data. If T \sim \text{Exp}(\lambda):

f(t) = \lambda e^{-\lambda t}, \quad t \geq 0

S(t) = e^{-\lambda t}

F(t) = 1 - e^{-\lambda t}

Where \lambda > 0 is the rate parameter (events per unit time). The mean survival time is E(T) = 1/\lambda.

The exponential distribution has the memoryless property: the probability of experiencing the event in the next moment does not depend on how long an individual has already survived. This is a very strong and often unrealistic assumption — more flexible distributions are usually needed.
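
The memoryless property is easy to verify numerically. A minimal sketch, with an arbitrary illustrative rate λ = 0.1, checking that P(T > s + t | T > s) = S(s + t)/S(s) equals P(T > t):

```python
import math

def exp_survival(t, lam):
    """Exponential survival function S(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

lam = 0.1        # illustrative rate: 0.1 events per unit time (mean survival 10)
s, t = 5.0, 3.0

# Memoryless property: P(T > s + t | T > s) = S(s + t) / S(s), which equals S(t)
conditional = exp_survival(s + t, lam) / exp_survival(s, lam)
print(round(conditional, 6), round(exp_survival(t, lam), 6))   # both 0.740818
```

The conditional survival probability depends only on the additional time t, not on the time s already survived — the defining feature of the exponential model.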

1.6 Integration and Differentiation (Brief Review)

Survival analysis involves calculus. The key relationships to remember:

f(t) = \frac{d}{dt}F(t) = -\frac{d}{dt}S(t)

S(t) = \int_t^{\infty} f(u) \, du

These relationships connect the PDF, CDF, and survival function and are used extensively in deriving hazard functions and likelihood contributions.


2. What is Survival Analysis?

2.1 The Core Question

Survival analysis is a branch of statistics that analyses the time until a specific event of interest occurs. The defining questions are:

- How long until the event occurs, typically summarised by the median survival time?
- What is the probability of surviving (remaining event-free) beyond a given time t?
- How do characteristics such as treatment, group, or age affect the time to the event?

Despite the name "survival analysis," the event of interest does not have to be death. It can be any well-defined, non-repeating event:

| Field | Event of Interest | Time Variable |
|---|---|---|
| Medicine | Death, disease recurrence, hospital discharge | Months from diagnosis |
| Engineering | Machine failure, component breakdown | Hours of operation |
| Finance | Loan default, customer churn, bankruptcy | Days since account opening |
| Psychology | Relapse after treatment, onset of disorder | Weeks since therapy |
| Marketing | Customer purchase, subscription cancellation | Days since sign-up |
| Social Science | Marriage, unemployment ending, graduation | Years from event start |
| Manufacturing | Product defect, warranty claim | Cycles of use |

2.2 What Makes Survival Data Special?

Survival data has two unique characteristics that make it unsuitable for standard regression or t-tests:

1. Censoring: Not all individuals experience the event during the observation period. An individual who is still alive (or event-free) at the end of the study has a censored observation — we know they survived at least until time t_c, but we do not know their full survival time.

2. Skewed, positive-only time distributions: Survival times are always positive and often highly right-skewed (many short times, few very long times). Standard methods that assume normality are inappropriate.

These characteristics require special methods — the tools of survival analysis — to produce valid estimates and inferences.

2.3 The Three Central Functions

Every survival analysis revolves around three mathematically related functions. Understanding all three is essential:

| Function | Symbol | Question Answered | Range |
|---|---|---|---|
| Survival Function | S(t) | What is the probability of surviving past time t? | [0, 1] |
| Hazard Function | h(t) | At time t, how fast are events occurring among survivors? | [0, \infty) |
| Cumulative Hazard | H(t) | Total accumulated risk of the event up to time t | [0, \infty) |

These three functions are mathematically equivalent — knowing one fully determines the other two.

2.4 Survival Analysis vs. Standard Regression

| Feature | Standard Regression | Survival Analysis |
|---|---|---|
| Outcome variable | Continuous, binary, count | Time to event |
| Censoring handled? | No | Yes — fundamental feature |
| Distribution assumed | Normal (OLS) | Exponential, Weibull, etc. |
| Negative outcomes | Possible | Never (time \geq 0) |
| Primary interest | Mean or probability | Survival function, hazard |
| Key model | Linear / Logistic | Kaplan-Meier, Cox, Parametric |

3. The Mathematics Behind Survival Analysis

3.1 The Survival Function S(t)

The survival function is the probability that the event time T exceeds a specified time t:

S(t) = P(T > t) = 1 - F(t)

Key mathematical properties:

- S(0) = 1 — at time zero, no subject has yet experienced the event
- S(t) is non-increasing in t
- S(t) \to 0 as t \to \infty when every subject eventually experiences the event; otherwise the curve plateaus above zero

3.2 The Hazard Function h(t)

The hazard function (also called the hazard rate or instantaneous failure rate) is the conditional rate at which the event occurs at time t, given survival up to t:

h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}

This can be re-expressed using the PDF and survival function:

h(t) = \frac{f(t)}{S(t)}

Interpretation: h(t) \cdot \Delta t \approx P(\text{event in } [t, t+\Delta t] \mid \text{survived to } t) for small \Delta t.

Key properties of the hazard function:

- h(t) \geq 0 for all t — it is a rate, not a probability, and can exceed 1
- Its units are events per unit time
- Its shape over time distinguishes the common parametric distributions:

| Distribution | Hazard Shape |
|---|---|
| Exponential | Constant (flat) |
| Weibull (\kappa > 1) | Monotonically increasing |
| Weibull (\kappa < 1) | Monotonically decreasing |
| Weibull (\kappa = 1) | Constant (= Exponential) |
| Log-normal | Initially increases then decreases (hump-shaped) |
| Log-logistic | Initially increases then decreases (hump-shaped) |
| Gompertz | Monotonically increasing (exponentially) |
| Piecewise constant | Constant within intervals, steps between |

3.3 The Cumulative Hazard Function H(t)

The cumulative hazard function is the integral of the hazard function over time:

H(t) = \int_0^t h(u) \, du

It represents the total accumulated risk of the event from time 0 to time t.

3.4 The Fundamental Relationship Between S(t), h(t), and H(t)

The three functions are mathematically equivalent through the following relationships:

From hazard to survival:

S(t) = \exp\left(-\int_0^t h(u) \, du\right) = e^{-H(t)}

From survival to hazard:

h(t) = -\frac{d}{dt}\ln S(t) = \frac{f(t)}{S(t)}

From survival to cumulative hazard:

H(t) = -\ln S(t)

From PDF to survival:

S(t) = \int_t^{\infty} f(u) \, du

These relationships form the mathematical backbone of survival analysis. Every parametric model defines one of these functions — the others follow automatically.
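
These identities can be checked numerically. The sketch below integrates a Weibull hazard with the trapezoidal rule and confirms that exp(-H(t)) reproduces the closed-form S(t); the parameter values are arbitrary illustrations:

```python
import math

def weibull_hazard(t, lam, kappa):
    """Weibull hazard h(t) = (kappa/lam) * (t/lam)^(kappa - 1)."""
    return (kappa / lam) * (t / lam) ** (kappa - 1)

def weibull_survival(t, lam, kappa):
    """Closed-form Weibull survival S(t) = exp(-(t/lam)^kappa)."""
    return math.exp(-((t / lam) ** kappa))

lam, kappa, t_end = 10.0, 1.5, 8.0   # illustrative scale, shape, horizon
n = 20_000                           # trapezoid steps
dt = t_end / n

# H(t) = integral of h(u) du on [0, t] via the trapezoidal rule
H = sum(
    (weibull_hazard(i * dt, lam, kappa) + weibull_hazard((i + 1) * dt, lam, kappa)) / 2 * dt
    for i in range(n)
)
print(round(math.exp(-H), 5), round(weibull_survival(t_end, lam, kappa), 5))
```

The two printed values agree to within numerical integration error, illustrating that defining any one of h(t), H(t), or S(t) determines the other two.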

3.5 The Mean and Median Survival Time

The mean survival time (restricted to time [0, \tau]) is the area under the survival curve:

E(T) = \int_0^{\tau} S(t) \, dt

The median survival time t_{0.5} is the time at which exactly half of the subjects have experienced the event (i.e., where the survival curve crosses 0.50):

S(t_{0.5}) = 0.50

Similarly, the q-th quantile of survival is:

t_q = S^{-1}(1 - q)

💡 The median is usually preferred over the mean for survival data because the mean is sensitive to a few very long survival times (right tail), and may be undefined if the survival curve never reaches zero (i.e., if not all subjects experience the event).
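
As a concrete check of the quantile formula: under the exponential model S(t) = e^{-\lambda t}, solving S(t_{0.5}) = 0.50 gives t_{0.5} = \ln 2 / \lambda (the rate below is an illustrative value):

```python
import math

lam = 0.1                        # illustrative rate parameter
t_median = math.log(2) / lam     # solves exp(-lam * t) = 0.5
print(round(t_median, 3))        # 6.931
```

Note that for the exponential distribution the median (≈6.93) is shorter than the mean (1/λ = 10), reflecting the right skew typical of survival times.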

3.6 The Likelihood Contribution of Censored Observations

A key mathematical challenge in survival analysis is constructing the likelihood function correctly for censored observations. Each observation contributes to the likelihood differently:

Uncensored (event occurred at time t_i):

L_i = f(t_i) = h(t_i) \cdot S(t_i)

Right-censored (event not observed; known to have survived to t_i):

L_i = S(t_i) = P(T > t_i)

General likelihood contribution:

L_i = [h(t_i)]^{\delta_i} \cdot S(t_i)

Where \delta_i = 1 if the event was observed (uncensored) and \delta_i = 0 if censored.

Full likelihood for n independent observations:

L(\boldsymbol{\theta}) = \prod_{i=1}^{n} [h(t_i; \boldsymbol{\theta})]^{\delta_i} \cdot S(t_i; \boldsymbol{\theta})

Log-likelihood:

\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[\delta_i \ln h(t_i; \boldsymbol{\theta}) + \ln S(t_i; \boldsymbol{\theta})\right]

= \sum_{i=1}^{n} \left[\delta_i \ln h(t_i; \boldsymbol{\theta}) - H(t_i; \boldsymbol{\theta})\right]

This formulation correctly accounts for the information provided by both events and censored observations.
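
To make the construction concrete, the sketch below evaluates \ell(\lambda) = \sum_i [\delta_i \ln \lambda - \lambda t_i] for the exponential model (h = \lambda, H = \lambda t) on a small invented dataset. For this model the MLE has the closed form \hat{\lambda} = (number of events) / (total time at risk):

```python
import math

# Invented right-censored sample: (t_i, delta_i) with delta = 1 for an
# observed event and 0 for a censored observation.
data = [(3, 1), (5, 0), (7, 1), (10, 1), (14, 0), (18, 1), (22, 0), (25, 1)]

def exp_loglik(lam):
    """ell(lam) = sum_i [delta_i * ln(lam) - lam * t_i], since h = lam and H = lam * t."""
    return sum(d * math.log(lam) - lam * t for t, d in data)

events = sum(d for _, d in data)        # observed events: 5
total_time = sum(t for t, _ in data)    # total time at risk: 104 (censored times count too)
lam_hat = events / total_time           # closed-form exponential MLE
print(round(lam_hat, 4))                # 0.0481
```

Censored subjects contribute no events to the numerator but their full follow-up time to the denominator — exactly the role of the -\lambda t_i (i.e., -H) term in the log-likelihood.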

3.7 Common Parametric Survival Distributions

Exponential Distribution

h(t) = \lambda

S(t) = e^{-\lambda t}

f(t) = \lambda e^{-\lambda t}

Parameter: \lambda > 0 (rate). Mean = 1/\lambda.

Weibull Distribution

The Weibull is the most widely used parametric survival model due to its flexibility:

h(t) = \frac{\kappa}{\lambda}\left(\frac{t}{\lambda}\right)^{\kappa - 1}

S(t) = \exp\left[-\left(\frac{t}{\lambda}\right)^{\kappa}\right]

f(t) = \frac{\kappa}{\lambda}\left(\frac{t}{\lambda}\right)^{\kappa-1}\exp\left[-\left(\frac{t}{\lambda}\right)^{\kappa}\right]

Parameters: \lambda > 0 (scale), \kappa > 0 (shape).

Log-Normal Distribution

S(t) = 1 - \Phi\left(\frac{\ln t - \mu}{\sigma}\right)

h(t) = \frac{\phi\left(\frac{\ln t - \mu}{\sigma}\right)}{\sigma t \left[1 - \Phi\left(\frac{\ln t - \mu}{\sigma}\right)\right]}

Parameters: \mu (log-scale mean), \sigma > 0 (log-scale standard deviation). Produces a hump-shaped hazard (increases then decreases).

Log-Logistic Distribution

S(t) = \frac{1}{1 + (t/\alpha)^{\gamma}}

h(t) = \frac{(\gamma/\alpha)(t/\alpha)^{\gamma-1}}{1 + (t/\alpha)^{\gamma}}

Parameters: \alpha > 0 (scale), \gamma > 0 (shape). Produces a hump-shaped hazard when \gamma > 1 and a monotone decreasing hazard when \gamma \leq 1.

Gompertz Distribution

h(t) = \lambda e^{\gamma t}

S(t) = \exp\left[-\frac{\lambda}{\gamma}(e^{\gamma t} - 1)\right]

Parameters: \lambda > 0 (initial hazard), \gamma (rate of increase). Widely used in actuarial science and demography. Produces a monotonically increasing hazard.

3.8 Summary Table of Parametric Distributions

| Distribution | Parameters | Hazard Shape | Best For |
|---|---|---|---|
| Exponential | \lambda | Constant | Simple baseline; memoryless events |
| Weibull | \lambda, \kappa | Monotone (increasing/decreasing/flat) | Most general-purpose analyses |
| Log-Normal | \mu, \sigma | Hump (up then down) | Medical remission, biological processes |
| Log-Logistic | \alpha, \gamma | Hump (or decreasing) | Similar to log-normal; has closed-form S(t) |
| Gompertz | \lambda, \gamma | Monotone increasing | Mortality, ageing studies |
| Generalised Gamma | \mu, \sigma, Q | Very flexible | Model selection; nests Weibull, log-normal |
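
The tabulated forms are internally consistent via h(t) = -\frac{d}{dt}\ln S(t). As a sketch, the log-logistic closed forms can be cross-checked by a central finite difference (the scale and shape values are arbitrary illustrations):

```python
import math

alpha, gamma = 5.0, 2.0   # illustrative log-logistic scale and shape

def surv(t):
    """Log-logistic S(t) = 1 / (1 + (t/alpha)^gamma)."""
    return 1.0 / (1.0 + (t / alpha) ** gamma)

def hazard(t):
    """Log-logistic h(t) = (gamma/alpha)(t/alpha)^(gamma-1) / (1 + (t/alpha)^gamma)."""
    return (gamma / alpha) * (t / alpha) ** (gamma - 1) / (1.0 + (t / alpha) ** gamma)

# Central finite difference of -ln S(t) approximates h(t)
t, eps = 3.0, 1e-6
numeric = -(math.log(surv(t + eps)) - math.log(surv(t - eps))) / (2 * eps)
print(round(numeric, 6), round(hazard(t), 6))
```

The same check works for any row of the table, which is a quick way to catch transcription errors when implementing a parametric model by hand.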

4. Assumptions of Survival Analysis

4.1 Non-Informative Censoring

The most critical assumption in survival analysis is that censoring is non-informative — the reason an observation is censored must be independent of the true (unobserved) event time.

Examples of non-informative censoring (acceptable):

- Administrative censoring at the planned end of the study
- A subject relocates for reasons unrelated to their prognosis

Examples of informative censoring (problematic):

- Patients withdrawing because their condition is deteriorating
- Subjects removed from follow-up due to treatment side effects related to the outcome

⚠️ Informative censoring leads to biased estimates of the survival function and hazard. There is no purely statistical remedy — the study design must ensure non-informative censoring.

4.2 Independence of Observations

Each individual's survival time must be independent of all others. This is violated in:

- Clustered data (e.g., patients within hospitals or centres)
- Matched designs and family studies
- Recurrent events within the same individual

Solutions include frailty models (random effects), marginal models, and conditional models.

4.3 Proportional Hazards (Cox Model)

When using the Cox proportional hazards model (Section 10), the key assumption is that the hazard ratio between any two individuals is constant over time:

\frac{h_i(t)}{h_j(t)} = \text{constant for all } t

This means the hazard functions of any two individuals are proportional — they never cross. This is a strong assumption that must be checked (see Section 10.6).

4.4 Correct Parametric Form (Parametric Models)

When using parametric models, the assumed distribution must adequately describe the data. Choosing an incorrect parametric family leads to biased estimates. This must be checked via diagnostic plots (e.g., comparing the fitted parametric survival curve with the Kaplan-Meier estimate), information criteria (AIC/BIC), and residual analysis.

4.5 Consistent Definition of Time Origin and Event

The time origin (time zero) must be consistently and clearly defined for all subjects — for example, date of diagnosis, date of randomisation, or date of first exposure, applied uniformly across the cohort.

Similarly, the event must be defined unambiguously — the same event criterion must be applied to all subjects.

⚠️ Inconsistent definitions of time origin or event across subjects will produce biased survival estimates and hazard ratios. This is a design issue, not a statistical one.

4.6 Sufficient Follow-Up and Events

Survival analysis requires sufficient follow-up time for events to occur and a sufficient number of observed events; a common rule of thumb for regression models is at least 10 events per candidate covariate.


5. Types of Survival Analysis Methods

Survival analysis methods are broadly classified by the degree of parametric assumption they make:

5.1 Non-Parametric Methods

Non-parametric methods make no assumptions about the shape of the survival distribution or hazard function. They let the data speak for themselves.

| Method | Purpose |
|---|---|
| Kaplan-Meier Estimator | Estimate the survival curve for one or more groups |
| Nelson-Aalen Estimator | Estimate the cumulative hazard function |
| Log-Rank Test | Compare survival curves between two or more groups |
| Wilcoxon (Breslow) Test | Compare survival curves with emphasis on early times |
| Peto-Peto Test | Compare survival curves with modified weighting |

Advantages: No distributional assumptions; easy to visualise.
Disadvantages: Cannot adjust for multiple covariates simultaneously; no regression coefficients.

5.2 Semi-Parametric Methods

Semi-parametric methods make some parametric assumptions (about the covariate effects) but leave the baseline hazard unspecified.

| Method | Purpose |
|---|---|
| Cox Proportional Hazards Model | Assess covariate effects on survival while controlling for confounders |
| Cox Model with Time-Varying Covariates | Covariates that change in value over follow-up |
| Stratified Cox Model | Allow baseline hazard to differ across strata |
| Frailty Models (Semi-Parametric) | Account for unobserved heterogeneity and clustering |

Advantages: Handles multiple covariates; yields hazard ratios; robust (no distribution assumption).
Disadvantages: Cannot directly predict survival times; hazard ratios assume proportionality.

5.3 Parametric Methods

Parametric methods assume a specific distributional form for the survival times. They are more efficient (better precision) when the assumed distribution is correct.

| Method | Purpose |
|---|---|
| Exponential Regression | Simplest parametric model; constant hazard |
| Weibull Regression | Flexible monotone hazard; generalises exponential |
| Log-Normal Regression | Hump-shaped hazard; log-transformed normal model |
| Log-Logistic Regression | Hump-shaped hazard; has closed-form survival function |
| Gompertz Regression | Exponentially increasing hazard; ageing/mortality |
| Generalised Gamma | Very flexible; nests several other distributions |
| Accelerated Failure Time (AFT) Models | Covariate effects accelerate or decelerate time |

Advantages: Can predict absolute survival times; extrapolate beyond observed data; more efficient with correct specification.
Disadvantages: Wrong distributional assumption leads to biased results; more sensitive to model misspecification.

5.4 The Method Selection Framework

```
Subject 1: |—————————————X|                 Event at t=12
Subject 2: |————————————————————O|          Censored at t=24
Subject 3: |——————X|                        Event at t=6
Subject 4: |————————————————————————————O|  Censored at t=36
Subject 5: |—————————————————X|             Event at t=18
```

X = Event observed O = Censored

💡 Always plot the event timeline before analysis. It reveals data quality issues such as negative times, very short follow-up, or suspiciously high censoring proportions.

7.4 The Risk Set

The risk set \mathcal{R}(t) at time t is the set of all subjects who are:

- still under observation just before time t (not yet censored), and
- have not yet experienced the event.

The number at risk n(t) = |\mathcal{R}(t)| decreases over time as subjects experience the event or are censored. A number-at-risk table is routinely displayed below Kaplan-Meier plots to communicate how many subjects contribute information at each time point.


8. Non-Parametric Methods: Kaplan-Meier Estimation

8.1 The Kaplan-Meier (Product-Limit) Estimator

The Kaplan-Meier (KM) estimator is the most widely used method for estimating the survival function non-parametrically. It is also called the product-limit estimator because it computes \hat{S}(t) as a product of conditional survival probabilities at each event time.

Let t_1 < t_2 < \dots < t_k be the k distinct ordered event times (only times at which events occur, not censoring times). At each event time t_j, let:

- d_j = the number of events occurring at t_j
- n_j = the number of subjects at risk just before t_j (not yet failed, not yet censored)

The conditional probability of the event at t_j (given survival to t_j):

\hat{q}_j = \frac{d_j}{n_j}

The conditional probability of surviving past t_j (given survival to t_j):

\hat{p}_j = 1 - \hat{q}_j = 1 - \frac{d_j}{n_j} = \frac{n_j - d_j}{n_j}

The Kaplan-Meier survival estimate at time t is the product of all conditional survival probabilities at event times up to and including t:

\hat{S}(t) = \prod_{j: t_j \leq t} \left(1 - \frac{d_j}{n_j}\right) = \prod_{j: t_j \leq t} \frac{n_j - d_j}{n_j}

8.2 Worked Computation of the KM Estimator

Suppose n = 10 subjects are followed, with the following event/censoring times (sorted):

| Time | Status | n_j | d_j | \hat{p}_j = (n_j - d_j)/n_j | \hat{S}(t) |
|---|---|---|---|---|---|
| 3 | 1 (event) | 10 | 1 | 9/10 = 0.900 | 0.900 |
| 5 | 0 (censor) | — | — | — | 0.900 |
| 7 | 1 (event) | 8 | 1 | 7/8 = 0.875 | 0.900 × 0.875 = 0.788 |
| 10 | 1 (event) | 7 | 2 | 5/7 = 0.714 | 0.788 × 0.714 = 0.563 |
| 14 | 0 (censor) | — | — | — | 0.563 |
| 18 | 1 (event) | 4 | 1 | 3/4 = 0.750 | 0.563 × 0.750 = 0.422 |
| 22 | 0 (censor) | — | — | — | 0.422 |
| 25 | 1 (event) | 2 | 1 | 1/2 = 0.500 | 0.422 × 0.500 = 0.211 |

Key notes:

- Censored observations (t = 5, 14, 22) do not change \hat{S}(t); they only shrink the risk set n_j at later event times.
- Two events occur at t = 10 (d_j = 2), giving the conditional survival 5/7.
- The curve never reaches zero, because at least one subject remains at risk after the last observed event.
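
The table above can be reproduced programmatically. A sketch of the product-limit calculation — note that a final censored time of 30 is assumed here for the tenth subject, who remains at risk beyond t = 25 but is not listed in the table:

```python
# Data reconstructed from the Section 8.2 table: two events occur at t = 10,
# and a final censored time of 30 is ASSUMED for the tenth subject.
data = [(3, 1), (5, 0), (7, 1), (10, 1), (10, 1),
        (14, 0), (18, 1), (22, 0), (25, 1), (30, 0)]

def kaplan_meier(data):
    """Return [(t_j, S_hat(t_j))] via the product-limit formula."""
    event_times = sorted({t for t, d in data if d == 1})
    s_hat, curve = 1.0, []
    for t in event_times:
        n_j = sum(1 for tt, _ in data if tt >= t)           # at risk just before t_j
        d_j = sum(1 for tt, dd in data if tt == t and dd)   # events at t_j
        s_hat *= 1 - d_j / n_j
        curve.append((t, s_hat))
    return curve

curve = kaplan_meier(data)
print(round(curve[-1][1], 3))   # 0.211, matching the final row of the table
```

The estimator only drops at event times; censored times enter solely through the risk-set counts n_j.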

8.3 Confidence Intervals for the KM Estimator

Greenwood's Formula for the variance of \hat{S}(t):

\widehat{\text{Var}}[\hat{S}(t)] = [\hat{S}(t)]^2 \sum_{j: t_j \leq t} \frac{d_j}{n_j(n_j - d_j)}

Plain (linear) confidence interval:

\hat{S}(t) \pm z_{\alpha/2} \sqrt{\widehat{\text{Var}}[\hat{S}(t)]}

This can produce intervals outside [0, 1] for extreme times. The log-log transformed confidence interval is preferred because it always stays within [0, 1]:

Log-log (complementary log-log) transformation:

Let \hat{U}(t) = \ln[-\ln \hat{S}(t)]. The variance is approximated by:

\widehat{\text{Var}}[\hat{U}(t)] = \frac{1}{[\ln \hat{S}(t)]^2} \sum_{j: t_j \leq t} \frac{d_j}{n_j(n_j - d_j)}

The 95% CI for U(t) is \hat{U}(t) \pm 1.96 \sqrt{\widehat{\text{Var}}[\hat{U}(t)]}, which transforms back to a CI for S(t) that stays within (0, 1):

\left[\hat{S}(t)^{\exp(1.96\sqrt{\widehat{\text{Var}}[\hat{U}(t)]})}, \quad \hat{S}(t)^{\exp(-1.96\sqrt{\widehat{\text{Var}}[\hat{U}(t)]})}\right]
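
Continuing the worked example of Section 8.2, a sketch computing Greenwood's variance and the log-log 95% CI at t = 10, where the event times so far are 3, 7, 10 with (n_j, d_j) = (10, 1), (8, 1), (7, 2):

```python
import math

# (n_j, d_j) at the event times up to t = 10 in the Section 8.2 example
counts = [(10, 1), (8, 1), (7, 2)]
s_hat = (9 / 10) * (7 / 8) * (5 / 7)                 # S_hat(10) = 0.5625

g = sum(d / (n * (n - d)) for n, d in counts)        # Greenwood sum
var_s = s_hat ** 2 * g                               # Greenwood's variance
se_u = math.sqrt(g) / abs(math.log(s_hat))           # SE of U = ln(-ln S)

lower = s_hat ** math.exp(1.96 * se_u)               # log-log 95% CI bounds
upper = s_hat ** math.exp(-1.96 * se_u)
print(round(lower, 3), round(s_hat, 3), round(upper, 3))
```

With only ten subjects the interval is very wide — a reminder that KM estimates at later times, where few subjects remain at risk, are imprecise.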

8.4 The Nelson-Aalen Estimator of the Cumulative Hazard

An alternative non-parametric estimator is the Nelson-Aalen estimator of the cumulative hazard function:

\hat{H}(t) = \sum_{j: t_j \leq t} \frac{d_j}{n_j}

The corresponding survival estimate derived from the Nelson-Aalen estimator is:

\tilde{S}(t) = \exp(-\hat{H}(t))

The Nelson-Aalen estimator is less biased than the KM estimator in small samples and is particularly useful for diagnosing the shape of the hazard (e.g., a roughly linear \hat{H}(t) suggests a constant hazard) and for constructing log-log diagnostic plots.

8.5 Median Survival Time and Confidence Interval

The KM estimate of the median survival time \hat{t}_{0.5} is the smallest time t such that \hat{S}(t) \leq 0.50:

\hat{t}_{0.5} = \inf\{t : \hat{S}(t) \leq 0.50\}

The 95% confidence interval for the median is computed using the Brookmeyer-Crowley method, which inverts the confidence band for \hat{S}(t) to find the times corresponding to the upper and lower CI bounds for S(t) = 0.50.

⚠️ The median survival time is undefined (cannot be estimated) if the KM curve never drops to or below 0.50. This happens when fewer than half the subjects experience the event.

8.6 Reading and Reporting a Kaplan-Meier Curve

A complete KM plot includes:

  1. The step function \hat{S}(t) — drops at each event time.
  2. Confidence band (typically 95%) shown as a shaded region or dashed lines.
  3. Censoring marks — small vertical tick marks at censored times on the survival curve.
  4. Number at risk table — shown below the x-axis at selected time points.
  5. Median survival time — horizontal dashed line at S(t) = 0.50, reading off the x-axis.

9. Comparing Survival Curves: Log-Rank and Related Tests

9.1 The Log-Rank Test

The log-rank test (also called the Mantel-Cox test) is the standard non-parametric test for comparing survival curves between two or more groups. It tests the null hypothesis:

H_0: S_1(t) = S_2(t) = \dots = S_g(t) \quad \text{for all } t

H_1: \text{At least one group has a different survival function}

At each distinct event time t_j in the combined dataset, the test compares the observed number of events in each group to the expected number under H_0 (if all groups had the same survival function).

For each event time t_j and group g, the expected number of events under H_0 is:

e_{gj} = n_{gj} \cdot \frac{d_j}{n_j}

Where:

- n_{gj} = the number at risk in group g just before t_j
- d_j = the total number of events at t_j across all groups
- n_j = the total number at risk at t_j across all groups

The log-rank test statistic for two groups (g = 1, 2) is:

\chi^2_{\text{LR}} = \frac{\left(\sum_j (d_{1j} - e_{1j})\right)^2}{\sum_j V_{1j}}

Where V_{1j} is the hypergeometric variance at event time t_j:

V_{1j} = \frac{n_{1j} \cdot n_{2j} \cdot d_j \cdot (n_j - d_j)}{n_j^2(n_j - 1)}

Under H_0, \chi^2_{\text{LR}} follows approximately a \chi^2 distribution with g - 1 degrees of freedom (where g is the number of groups).
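
These formulas can be implemented directly. A sketch for two groups on a small invented dataset (times and censoring indicators are illustrative):

```python
# Two-group log-rank statistic computed directly from the formulas above.
# Invented data: (time, delta) pairs, delta = 1 for an observed event.
group1 = [(6, 1), (13, 1), (21, 1), (30, 0), (37, 1)]
group2 = [(10, 1), (19, 0), (32, 1), (42, 0), (50, 1)]

def logrank_chi2(g1, g2):
    pooled = [(t, d, 1) for t, d in g1] + [(t, d, 2) for t, d in g2]
    o_minus_e = 0.0   # running sum of d_1j - e_1j
    variance = 0.0    # running sum of V_1j
    for t in sorted({t for t, d, _ in pooled if d == 1}):
        n1 = sum(1 for tt, _, g in pooled if tt >= t and g == 1)
        n2 = sum(1 for tt, _, g in pooled if tt >= t and g == 2)
        n = n1 + n2
        d = sum(1 for tt, dd, _ in pooled if tt == t and dd == 1)
        d1 = sum(1 for tt, dd, g in pooled if tt == t and dd == 1 and g == 1)
        o_minus_e += d1 - n1 * d / n
        if n > 1:  # hypergeometric variance is undefined when n_j = 1
            variance += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    return o_minus_e ** 2 / variance

chi2 = logrank_chi2(group1, group2)
print(round(chi2, 3))   # compare against a chi-square with g - 1 = 1 df
```

With two groups, values of the statistic above about 3.84 correspond to p < 0.05 on 1 degree of freedom.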

9.2 Weighted Log-Rank Tests

The log-rank test gives equal weight to all event times. Alternative tests weight earlier or later event times more heavily:

| Test | Weight w_j | Emphasises | Best When |
|---|---|---|---|
| Log-Rank (Mantel-Cox) | 1 | All times equally | Groups differ throughout |
| Breslow (Wilcoxon) | n_j | Early times | Groups differ early |
| Peto-Peto | \tilde{S}(t_j) | Early times | Groups differ early (modified) |
| Fleming-Harrington | [\hat{S}(t_j^-)]^{\rho}(1-\hat{S}(t_j^-))^{\gamma} | Flexible | User-specified weighting |

The Fleming-Harrington test with parameters \rho and \gamma is the most flexible:

- \rho = 0, \gamma = 0 reduces to the standard log-rank test
- \rho > 0, \gamma = 0 up-weights early differences
- \rho = 0, \gamma > 0 up-weights late differences

9.3 Stratified Log-Rank Test

When groups differ on a confounding variable (a variable that affects survival and is unevenly distributed across groups), the stratified log-rank test controls for the confounder by computing the log-rank statistic separately within each stratum and then combining:

\chi^2_{\text{stratified}} = \frac{\left(\sum_s \sum_j (d_{1js} - e_{1js})\right)^2}{\sum_s \sum_j V_{1js}}

Where s indexes the strata.

9.4 Interpreting the Log-Rank Test

| Result | Interpretation |
|---|---|
| p < 0.05 | Reject H_0 — significant evidence that survival curves differ between groups |
| p \geq 0.05 | Fail to reject H_0 — insufficient evidence of a difference |

⚠️ The log-rank test is most powerful when hazard ratios are constant over time (proportional hazards). If survival curves cross (non-proportional hazards), the log-rank test may be underpowered. In this case, use the Fleming-Harrington test with appropriate weights, or report the weighted tests alongside the standard log-rank.

9.5 Pairwise Comparisons After a Significant Log-Rank Test

When comparing g > 2 groups and the overall log-rank test is significant, pairwise log-rank tests can identify which specific groups differ. Apply a Bonferroni correction (or similar multiple comparison adjustment) to control the family-wise error rate:

\alpha^* = \frac{\alpha}{\binom{g}{2}} = \frac{0.05}{g(g-1)/2}

For example, with g = 3 groups: \alpha^* = 0.05 / 3 = 0.0167.


10. Semi-Parametric Methods: Cox Proportional Hazards Model

10.1 The Cox Model

The Cox proportional hazards model (Cox, 1972) is the most widely used regression model in survival analysis. It relates the hazard at time t for individual i to their covariate values \mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T:

h(t \mid \mathbf{x}_i) = h_0(t) \cdot \exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip})

Or equivalently:

h(t \mid \mathbf{x}_i) = h_0(t) \cdot \exp(\boldsymbol{\beta}^T \mathbf{x}_i)

Where:

- h_0(t) is the baseline hazard — the hazard for an individual with all covariates equal to zero. It is left completely unspecified (the non-parametric part).
- \boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^T are the regression coefficients (the parametric part), estimated from the data.

10.2 The Proportional Hazards Property

The ratio of hazards between any two individuals i and j is constant over time:

\frac{h(t \mid \mathbf{x}_i)}{h(t \mid \mathbf{x}_j)} = \frac{h_0(t) \exp(\boldsymbol{\beta}^T \mathbf{x}_i)}{h_0(t) \exp(\boldsymbol{\beta}^T \mathbf{x}_j)} = \exp[\boldsymbol{\beta}^T(\mathbf{x}_i - \mathbf{x}_j)]

The h_0(t) terms cancel — the hazard ratio depends only on the covariate difference, not on time. This is the proportional hazards assumption: hazard functions of any two individuals are proportional to each other (they never cross).

10.3 The Hazard Ratio (HR)

For a single binary covariate x (e.g., treatment vs. control, coded 1 vs. 0):

\text{HR} = \frac{h(t \mid x = 1)}{h(t \mid x = 0)} = e^{\beta}

Interpretation of the hazard ratio:

| HR | Interpretation |
|---|---|
| \text{HR} > 1 | Covariate increases the hazard (increases risk of event) |
| \text{HR} = 1 | Covariate has no effect on the hazard |
| \text{HR} < 1 | Covariate decreases the hazard (protective effect) |

For a continuous covariate x: e^{\beta} is the HR for a one-unit increase in x, holding all other covariates constant.

The 95% confidence interval for the HR is:

\left[e^{\hat{\beta} - 1.96 \cdot SE(\hat{\beta})}, \quad e^{\hat{\beta} + 1.96 \cdot SE(\hat{\beta})}\right]

If the CI does not include 1, the covariate is statistically significant at the 5% level.
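
A sketch of the back-transformation, assuming illustrative fitted values (beta_hat = -0.45 and SE = 0.18 are invented, not from a real model):

```python
import math

# Illustrative fitted values for a treatment coefficient (invented numbers)
beta_hat, se = -0.45, 0.18

hr = math.exp(beta_hat)                   # hazard ratio
ci_low = math.exp(beta_hat - 1.96 * se)   # lower 95% bound
ci_high = math.exp(beta_hat + 1.96 * se)  # upper 95% bound
print(round(hr, 3), round(ci_low, 3), round(ci_high, 3))   # 0.638 0.448 0.907
```

Here the HR is below 1 and the CI excludes 1, so the covariate would be read as a protective effect, significant at the 5% level.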

10.4 The Partial Likelihood

Because the baseline hazard h_0(t) is left unspecified, standard MLE cannot be applied directly to the Cox model. Instead, Cox (1972) proposed the partial likelihood — a likelihood that eliminates h_0(t) and depends only on \boldsymbol{\beta}.

The partial likelihood is constructed by considering, at each event time t_j, the probability that it was individual i_j who experienced the event, given that exactly one event occurred at t_j among all individuals at risk:

L_j(\boldsymbol{\beta}) = \frac{\exp(\boldsymbol{\beta}^T \mathbf{x}_{i_j})}{\sum_{l \in \mathcal{R}(t_j)} \exp(\boldsymbol{\beta}^T \mathbf{x}_l)}

The full partial likelihood (product over all event times):

PL(\boldsymbol{\beta}) = \prod_{j=1}^{D} \frac{\exp(\boldsymbol{\beta}^T \mathbf{x}_{i_j})}{\sum_{l \in \mathcal{R}(t_j)} \exp(\boldsymbol{\beta}^T \mathbf{x}_l)}

The log partial likelihood:

\ell_{PL}(\boldsymbol{\beta}) = \sum_{j=1}^{D} \left[\boldsymbol{\beta}^T \mathbf{x}_{i_j} - \ln\sum_{l \in \mathcal{R}(t_j)} \exp(\boldsymbol{\beta}^T \mathbf{x}_l)\right]

Where D is the total number of events.

The coefficient estimates \hat{\boldsymbol{\beta}} are found by maximising \ell_{PL} using iterative numerical methods (e.g., Newton-Raphson).
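
For intuition, the log partial likelihood can be evaluated directly on a tiny invented dataset with one binary covariate and no tied event times; a crude grid search stands in for Newton-Raphson here:

```python
import math

# Invented data: (time, delta, x) with a single binary covariate x
data = [(2, 1, 1), (4, 0, 1), (5, 1, 0), (8, 1, 1), (9, 0, 0)]

def log_partial_lik(beta):
    """Sum over event times of beta*x_i - ln(sum over the risk set of exp(beta*x_l))."""
    ll = 0.0
    for t, delta, x in data:
        if delta == 1:
            risk_set = [xl for tl, _, xl in data if tl >= t]
            ll += beta * x - math.log(sum(math.exp(beta * xl) for xl in risk_set))
    return ll

# Crude grid search in place of Newton-Raphson
grid = [b / 100 for b in range(-300, 301)]
beta_hat = max(grid, key=log_partial_lik)
print(round(beta_hat, 2), round(math.exp(beta_hat), 3))   # estimate and fitted HR
```

Censored subjects never contribute a numerator term but do appear in the risk-set denominators of earlier event times — the partial-likelihood analogue of their role in the full likelihood of Section 3.6.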

10.5 Handling Ties

When multiple events occur at the same time (tied event times), the exact partial likelihood becomes computationally intractable. Three commonly used approximations are:

Breslow's approximation (default in most software):

L_j^{\text{Breslow}} = \frac{\exp(\boldsymbol{\beta}^T \mathbf{s}_j)}{\left[\sum_{l \in \mathcal{R}(t_j)} \exp(\boldsymbol{\beta}^T \mathbf{x}_l)\right]^{d_j}}

Where \mathbf{s}_j = \sum_{k \in D_j} \mathbf{x}_k is the sum of covariate vectors for all d_j individuals who experienced the event at t_j.

Efron's approximation (more accurate with many ties):

L_j^{\text{Efron}} = \frac{\exp(\boldsymbol{\beta}^T \mathbf{s}_j)}{\prod_{k=1}^{d_j}\left[\sum_{l \in \mathcal{R}(t_j)} \exp(\boldsymbol{\beta}^T \mathbf{x}_l) - \frac{k-1}{d_j}\sum_{l \in D_j} \exp(\boldsymbol{\beta}^T \mathbf{x}_l)\right]}

Exact (discrete) method: Computes the exact combinatorial probability — accurate but computationally very expensive for large numbers of ties.

💡 Efron's approximation is generally preferred over Breslow's when there are many tied event times. DataStatPro uses Efron's approximation by default.

10.6 Testing the Proportional Hazards Assumption

The proportional hazards (PH) assumption is critical. If violated, hazard ratio estimates are biased and not meaningful. Multiple methods exist to test it:

Method 1: Schoenfeld Residuals Test

The Schoenfeld residual for individual $i$ at event time $t_j$ is the difference between the observed covariate value and its expected value under the model:

$$r_{ij} = x_{ij} - \hat{x}_{ij} = x_{ij} - \frac{\sum_{l \in \mathcal{R}(t_j)} x_{lj} \exp(\hat{\boldsymbol{\beta}}^T \mathbf{x}_l)}{\sum_{l \in \mathcal{R}(t_j)} \exp(\hat{\boldsymbol{\beta}}^T \mathbf{x}_l)}$$

If PH holds, the Schoenfeld residuals for covariate $j$ should be uncorrelated with time.

The Grambsch-Therneau test formally tests this:

$$H_0: \text{Schoenfeld residuals for covariate } j \text{ are uncorrelated with time}$$

A significant test ($p < 0.05$) for covariate $j$ indicates a violation of the PH assumption for that covariate.

Graphically: Plot the scaled Schoenfeld residuals against time (or log-time). A flat, horizontal smoothed line supports PH; a clear trend suggests violation.

Method 2: Log-Log Survival Plot

Plot $\ln[-\ln\hat{S}(t)]$ (the complementary log-log of the KM survival estimate) against $\ln t$ for each group. Under proportional hazards, these lines should be approximately parallel:

$$\ln H_g(t) = \ln H_0(t) + \boldsymbol{\beta}^T \mathbf{x}_g$$

If the log-log curves are parallel, the PH assumption holds for that grouping variable.
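The plot coordinates are easy to compute from the KM estimates. A minimal pure-Python sketch on invented toy data; in practice you would feed each group's times and censoring indicators from your dataset:

```python
import math

def km_estimate(times, events):
    """Kaplan-Meier survival estimates at each distinct event time."""
    order = sorted(set(t for t, d in zip(times, events) if d == 1))
    s, out = 1.0, []
    for t in order:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, di in zip(times, events) if ti == t and di == 1)
        s *= (at_risk - deaths) / at_risk
        out.append((t, s))
    return out

def cloglog_points(times, events):
    """(ln t, ln(-ln S(t))) coordinates for the log-log diagnostic plot."""
    return [(math.log(t), math.log(-math.log(s)))
            for t, s in km_estimate(times, events) if 0 < s < 1]

group_a = cloglog_points([2, 4, 6, 9, 12, 15], [1, 1, 0, 1, 1, 0])
group_b = cloglog_points([1, 3, 5, 7, 10, 14], [1, 1, 1, 0, 1, 0])
# Under PH, the two point clouds trace approximately parallel lines.
print(group_a)
print(group_b)
```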

Method 3: Time-Covariate Interaction

Add an interaction between each covariate and time (or $\ln t$) to the Cox model:

$$h(t \mid \mathbf{x}_i) = h_0(t) \exp[\boldsymbol{\beta}^T \mathbf{x}_i + \boldsymbol{\gamma}^T (\mathbf{x}_i \cdot g(t))]$$

Where $g(t)$ is a function of time (e.g., $t$ or $\ln t$). A significant $\hat{\gamma}_j$ indicates that the effect of covariate $j$ changes over time (PH violated).

Remedies When PH is Violated

| Violation Type | Remedy |
|---|---|
| One covariate violates PH | Stratify by that covariate (stratified Cox model) |
| PH violated for all covariates | Use parametric AFT model instead |
| Time-varying effect | Add time × covariate interaction; use time-varying Cox model |
| Crossing hazards | Use restricted mean survival time (RMST) as effect measure |

10.7 Residuals in the Cox Model

Several types of residuals are available for diagnosing the Cox model:

Martingale Residuals:

$$\hat{M}_i = \delta_i - \hat{H}_0(t_i)\exp(\hat{\boldsymbol{\beta}}^T \mathbf{x}_i)$$

Martingale residuals range from $-\infty$ to $1$. They are used to:

- Assess the functional form of a continuous covariate (plot the residuals against the covariate; the smoothed trend suggests the appropriate transformation)
- Identify subjects who experienced the event much earlier or later than the model predicts

Deviance Residuals:

$$D_i = \text{sign}(\hat{M}_i) \cdot \sqrt{-2[\hat{M}_i + \delta_i \ln(\delta_i - \hat{M}_i)]}$$

Deviance residuals are approximately normally distributed around 0. Values with $|D_i| > 2$ suggest potential outliers.

Score (Schoenfeld) Residuals:

As described in Section 10.6. Used for testing the PH assumption.

Dfbeta Residuals (Influence Diagnostics):

$$\widehat{df\beta}_{ij} \approx -r_{ij} \cdot \frac{\widehat{\text{Var}}(\hat{\beta}_j)}{n}$$

This estimates how much $\hat{\beta}_j$ would change if observation $i$ were deleted. Large absolute values indicate influential observations.

10.8 The Baseline Survival Function

After estimating $\hat{\boldsymbol{\beta}}$, the baseline survival function $\hat{S}_0(t)$ is estimated using the Breslow estimator:

$$\hat{H}_0(t) = \sum_{j: t_j \leq t} \frac{d_j}{\sum_{l \in \mathcal{R}(t_j)} \exp(\hat{\boldsymbol{\beta}}^T \mathbf{x}_l)}$$

$$\hat{S}_0(t) = \exp(-\hat{H}_0(t))$$

The survival function for an individual with covariate vector $\mathbf{x}$ is then:

$$\hat{S}(t \mid \mathbf{x}) = [\hat{S}_0(t)]^{\exp(\hat{\boldsymbol{\beta}}^T \mathbf{x})}$$
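These two formulas translate directly into code. A minimal single-covariate sketch with illustrative names and invented toy data, assuming $\hat{\beta}$ has already been fitted:

```python
import math

def breslow_baseline(times, events, x, beta):
    """Breslow estimate of the baseline cumulative hazard H0(t),
    returned as (event_time, cumulative_hazard) pairs."""
    event_times = sorted(set(t for t, d in zip(times, events) if d == 1))
    H0, cum = [], 0.0
    for t in event_times:
        d_j = sum(1 for ti, di in zip(times, events) if ti == t and di == 1)
        denom = sum(math.exp(beta * x[l])
                    for l in range(len(times)) if times[l] >= t)
        cum += d_j / denom
        H0.append((t, cum))
    return H0

def predicted_survival(H0, beta, x_new, t):
    """S(t | x) = S0(t)^{exp(beta*x)} = exp(-H0(t) * exp(beta*x))."""
    h = 0.0
    for tj, cum in H0:          # step function: last event time <= t
        if tj <= t:
            h = cum
    return math.exp(-h * math.exp(beta * x_new))

times  = [2, 4, 5, 7, 9, 11]
events = [1, 1, 0, 1, 1, 0]
x      = [1, 0, 1, 0, 1, 0]
H0 = breslow_baseline(times, events, x, beta=0.5)
s = predicted_survival(H0, beta=0.5, x_new=1, t=8)
print(f"S(8 | x=1) = {s:.3f}")
```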

10.9 The Stratified Cox Model

When a covariate violates the PH assumption, it can be used as a stratifying variable instead of a covariate. The stratified Cox model allows a separate baseline hazard for each stratum $s$, while constraining the regression coefficients to be equal across strata:

$$h(t \mid \mathbf{x}_i, s) = h_{0s}(t) \cdot \exp(\boldsymbol{\beta}^T \mathbf{x}_i)$$

This approach:

- Fully adjusts for the stratifying variable without assuming PH for it
- Does not produce a hazard ratio for the stratifying variable (its effect is absorbed into the baseline hazards)
- Still assumes the effects of the remaining covariates are the same in every stratum

11. Parametric Survival Models

11.1 Parametric Proportional Hazards (PH) Formulation

In parametric PH models, the hazard function is:

$$h(t \mid \mathbf{x}_i) = h_0(t; \boldsymbol{\alpha}) \cdot \exp(\boldsymbol{\beta}^T \mathbf{x}_i)$$

Where $h_0(t; \boldsymbol{\alpha})$ is a fully specified baseline hazard with parameters $\boldsymbol{\alpha}$ (e.g., the Weibull shape parameter $\kappa$). Both $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are estimated simultaneously via MLE.

Weibull PH model (the most common parametric PH model):

$$h(t \mid \mathbf{x}_i) = \frac{\kappa}{\lambda}\left(\frac{t}{\lambda}\right)^{\kappa-1} \exp(\boldsymbol{\beta}^T \mathbf{x}_i)$$

$$S(t \mid \mathbf{x}_i) = \exp\left[-\left(\frac{t}{\lambda}\right)^{\kappa} \exp(\boldsymbol{\beta}^T \mathbf{x}_i)\right]$$

11.2 The Accelerated Failure Time (AFT) Model

An alternative parametric formulation is the Accelerated Failure Time (AFT) model, which models the effect of covariates on the log of the survival time directly:

$$\ln T_i = \boldsymbol{\beta}^T \mathbf{x}_i + \sigma \epsilon_i$$

Or equivalently:

$$T_i = \exp(\boldsymbol{\beta}^T \mathbf{x}_i) \cdot T_0$$

Where $T_0$ is the baseline survival time and $\sigma$ is a scale parameter.

The survival function in the AFT model:

$$S(t \mid \mathbf{x}_i) = S_0\left(\frac{t}{\exp(\boldsymbol{\beta}^T \mathbf{x}_i)}\right)$$

The acceleration factor for covariate $x_j$:

$$\text{AF}_j = e^{\beta_j}$$

Interpretation of AFT coefficients:

| $e^{\beta_j}$ | Interpretation |
|---|---|
| $> 1$ | Covariate slows down the event (time is stretched — beneficial) |
| $= 1$ | No effect on survival time |
| $< 1$ | Covariate speeds up the event (time is compressed — harmful) |

💡 The AFT model is more intuitive when covariates do not satisfy the PH assumption. Instead of multiplicative effects on the hazard, it gives multiplicative effects on time itself.

11.3 Relationship Between PH and AFT Formulations

The Weibull model has the unique property of satisfying both the PH and AFT formulations:

| Formulation | Coefficient Relationship |
|---|---|
| PH form: $\beta_{PH}$ | Effect on $\ln h(t)$ |
| AFT form: $\beta_{AFT}$ | Effect on $\ln T$ |
| Relationship | $\beta_{AFT} = -\beta_{PH} / \kappa$ |

Where $\kappa$ is the Weibull shape parameter.

For the exponential, log-normal, log-logistic, and generalised gamma, the AFT formulation is the standard one. For the Weibull and Gompertz, both formulations are available.
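For the Weibull case, the coefficient conversion is a one-liner; the helper names below are illustrative:

```python
def aft_from_ph(beta_ph, kappa):
    """Weibull: beta_AFT = -beta_PH / kappa."""
    return -beta_ph / kappa

def ph_from_aft(beta_aft, kappa):
    """Inverse conversion: beta_PH = -beta_AFT * kappa."""
    return -beta_aft * kappa

# A PH log-hazard-ratio of 0.9 with shape kappa = 1.8 corresponds to an
# AFT coefficient of -0.5: the covariate shortens survival times.
print(aft_from_ph(0.9, 1.8))   # -0.5
print(ph_from_aft(-0.5, 1.8))  # 0.9
```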

11.4 Parametric Model Likelihood

For parametric models, the full MLE log-likelihood for $n$ subjects is:

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[\delta_i \ln h(t_i; \boldsymbol{\theta}) - H(t_i; \boldsymbol{\theta})\right]$$

Where $\boldsymbol{\theta} = (\boldsymbol{\alpha}, \boldsymbol{\beta})$ contains all model parameters.

For the Weibull model with shape $\kappa$ and scale $\lambda(\mathbf{x}_i) = \exp(\boldsymbol{\beta}^T\mathbf{x}_i)$:

$$\ell = \sum_{i=1}^{n}\left[\delta_i \ln\left(\frac{\kappa}{\lambda_i}\right) + \delta_i(\kappa-1)\ln\left(\frac{t_i}{\lambda_i}\right) - \left(\frac{t_i}{\lambda_i}\right)^{\kappa}\right]$$
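The Weibull log-likelihood above can be evaluated directly. A minimal single-covariate sketch with invented data; in practice this expression is what the optimiser maximises over $\kappa$ and $\boldsymbol{\beta}$:

```python
import math

def weibull_loglik(kappa, beta, times, events, x):
    """Censored Weibull log-likelihood with scale lambda_i = exp(beta * x_i),
    matching the expression above (single covariate for simplicity)."""
    ll = 0.0
    for t, d, xi in zip(times, events, x):
        lam = math.exp(beta * xi)
        ll += (d * math.log(kappa / lam)                 # delta_i * ln(kappa/lambda_i)
               + d * (kappa - 1) * math.log(t / lam)     # delta_i * (kappa-1) * ln(t/lambda_i)
               - (t / lam) ** kappa)                     # -(t/lambda_i)^kappa
    return ll

times  = [3.0, 6.0, 7.5, 10.0, 14.0]
events = [1, 1, 0, 1, 0]
x      = [0, 1, 0, 1, 1]
ll = weibull_loglik(kappa=1.5, beta=0.4, times=times, events=events, x=x)
print(ll)
```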

11.5 Predicting Survival Times From Parametric Models

A major advantage of parametric models over the Cox model is the ability to predict survival times for new individuals.

For an individual with covariate vector $\mathbf{x}^*$:

Predicted survival function:

$$\hat{S}(t \mid \mathbf{x}^*) = \exp\left[-\left(\frac{t}{\exp(\hat{\boldsymbol{\beta}}^T\mathbf{x}^*)}\right)^{\hat{\kappa}}\right] \quad \text{(Weibull)}$$

Predicted median survival time: solve $\hat{S}(\hat{t}_{0.5} \mid \mathbf{x}^*) = 0.50$:

$$\hat{t}_{0.5} = \exp(\hat{\boldsymbol{\beta}}^T\mathbf{x}^*) \cdot [\ln 2]^{1/\hat{\kappa}} \quad \text{(Weibull)}$$

Predicted mean survival time (Weibull):

$$\hat{E}(T \mid \mathbf{x}^*) = \exp(\hat{\boldsymbol{\beta}}^T\mathbf{x}^*) \cdot \Gamma\left(1 + \frac{1}{\hat{\kappa}}\right)$$

Where $\Gamma(\cdot)$ is the gamma function.
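These prediction formulas are straightforward to compute with Python's standard library (`math.gamma` supplies $\Gamma(\cdot)$); the coefficient values below are invented for illustration:

```python
import math

def weibull_median(beta, x_new, kappa):
    """Median: exp(beta^T x) * (ln 2)^(1/kappa)."""
    return math.exp(beta * x_new) * math.log(2) ** (1 / kappa)

def weibull_mean(beta, x_new, kappa):
    """Mean: exp(beta^T x) * Gamma(1 + 1/kappa)."""
    return math.exp(beta * x_new) * math.gamma(1 + 1 / kappa)

# Sanity check: with kappa = 1 (exponential special case) the median is
# lambda * ln 2 and the mean is lambda, where lambda = exp(beta * x).
print(weibull_median(1.2, 1.0, kappa=1.0))
print(weibull_mean(1.2, 1.0, kappa=1.0))
```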


12. Model Fit and Evaluation

12.1 Graphical Diagnostics for Parametric Distribution Selection

Before fitting a parametric model, graphical methods help identify which distribution is most appropriate.

Log-Log (Weibull) Plot:

If the Weibull distribution is appropriate, a plot of $\ln[-\ln\hat{S}(t)]$ against $\ln t$ should be approximately linear:

$$\ln H(t) = \kappa \ln t - \kappa \ln \lambda$$

This is a straight line with slope $\kappa$ and intercept $-\kappa \ln \lambda$.

Log-Normal Plot:

If the log-normal distribution is appropriate, a plot of $\Phi^{-1}(1 - \hat{S}(t))$ against $\ln t$ should be approximately linear.

Log-Logistic Plot:

If the log-logistic distribution is appropriate, a plot of $\ln[(1 - \hat{S}(t))/\hat{S}(t)]$ against $\ln t$ should be approximately linear.

Cumulative Hazard Plot:

Plot $\hat{H}(t)$ (the Nelson-Aalen estimate) against $t$ or $\ln t$:

- An approximately straight line through the origin when plotted against $t$ suggests an exponential model (constant hazard)
- An approximately straight line of $\ln \hat{H}(t)$ against $\ln t$ suggests a Weibull model
- Curvature in both plots points towards the log-normal, log-logistic, or other non-monotone-hazard families

12.2 AIC and BIC for Model Selection

The AIC and BIC are used to compare parametric models with different distributional assumptions (or different numbers of parameters):

$$\text{AIC} = -2\hat{\ell}(\hat{\boldsymbol{\theta}}) + 2q$$

$$\text{BIC} = -2\hat{\ell}(\hat{\boldsymbol{\theta}}) + q \ln(n^*)$$

Where $q$ is the number of estimated parameters and $n^*$ is the number of events (rather than the total number of observations, in some implementations).

Lower AIC/BIC indicates a better-fitting, more parsimonious model.

💡 AIC and BIC can compare models from different distributional families (e.g., Weibull vs. log-normal) as long as they use the same outcome and data — unlike the likelihood ratio test, which requires nested models.
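A quick sketch of the comparison, reusing the exponential and Weibull log-likelihoods from the worked example in Section 14; the event count is an assumed value for illustration:

```python
import math

def aic(loglik, q):
    """Akaike Information Criterion: -2*loglik + 2q."""
    return -2 * loglik + 2 * q

def bic(loglik, q, n_events):
    """Bayesian Information Criterion, using the event count as n*."""
    return -2 * loglik + q * math.log(n_events)

# Candidate fits on the same data: (log-likelihood, number of parameters)
fits = {"exponential": (-241.3, 2), "weibull": (-233.8, 3)}
n_events = 95  # hypothetical event count, for illustration only
for name, (ll, q) in fits.items():
    print(name, round(aic(ll, q), 1), round(bic(ll, q, n_events), 1))
```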

12.3 Likelihood Ratio Test for Nested Models

When comparing nested parametric models (one model is a restricted version of another), the likelihood ratio test (LRT) formally tests whether the additional parameters significantly improve fit:

$$\Lambda = -2[\ell(\hat{\boldsymbol{\theta}}_{\text{restricted}}) - \ell(\hat{\boldsymbol{\theta}}_{\text{full}})]$$

Under $H_0$, $\Lambda$ follows a $\chi^2$ distribution with $\Delta q$ degrees of freedom, where $\Delta q$ is the number of additional parameters in the full model.

Example: Testing whether the Weibull model fits significantly better than the exponential (which is a special case with $\kappa = 1$):

$$\Lambda = -2(\ell_{\text{Exp}} - \ell_{\text{Weibull}}) \sim \chi^2(1)$$
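For a single extra parameter, the p-value needs no statistics library: if $X \sim \chi^2(1)$, then $P(X > \Lambda) = \operatorname{erfc}(\sqrt{\Lambda/2})$. A sketch using the worked-example numbers from Section 14:

```python
import math

def lrt_pvalue_df1(ll_restricted, ll_full):
    """LRT statistic and p-value for one additional parameter (df = 1).
    For chi-square with 1 df, P(X > L) = erfc(sqrt(L / 2))."""
    lam = -2 * (ll_restricted - ll_full)
    p = math.erfc(math.sqrt(lam / 2))
    return lam, p

# Weibull vs. exponential: ll_exp = -241.3, ll_weibull = -233.8
lam, p = lrt_pvalue_df1(-241.3, -233.8)
print(f"Lambda = {lam:.1f}, p = {p:.2e}")
```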

12.4 Cox-Snell Residuals for Overall Fit

The Cox-Snell residuals can be used to assess the overall fit of any survival model (Cox or parametric):

$$\hat{r}_i^{CS} = \hat{H}(t_i \mid \mathbf{x}_i) = -\ln \hat{S}(t_i \mid \mathbf{x}_i)$$

If the model fits well, the Cox-Snell residuals should follow a unit exponential distribution (i.e., $\hat{H}(\hat{r}_i^{CS}) \approx \hat{r}_i^{CS}$).

Check: Plot the Nelson-Aalen estimate of the cumulative hazard of the Cox-Snell residuals against the residuals themselves. If the model fits well, the points should fall approximately on the line $y = x$ (the 45-degree line).

12.5 Concordance Index (C-statistic)

The concordance index (C-index) measures the discriminative ability of a survival model — how well it ranks individuals by their predicted risk. It is the survival analysis analogue of the AUC for binary outcomes:

$$C = \frac{\sum_{i < j} \mathbf{1}[t_i < t_j] \cdot \mathbf{1}[\hat{\eta}_i > \hat{\eta}_j] \cdot \delta_i}{\sum_{i < j} \mathbf{1}[t_i < t_j] \cdot \delta_i}$$

Where $\hat{\eta}_i = \hat{\boldsymbol{\beta}}^T \mathbf{x}_i$ is the predicted linear predictor (risk score) for individual $i$.
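A direct O(n²) pure-Python computation of the C-index, counting only usable pairs where the earlier time is an observed event; toy data, with score ties counted as 1/2:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among usable pairs (earlier time is an observed
    event), the fraction where the earlier failure has the higher
    risk score; ties in score count as 1/2."""
    num, den = 0.0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if times[i] < times[j] and events[i] == 1:
                den += 1
                if risk_scores[i] > risk_scores[j]:
                    num += 1
                elif risk_scores[i] == risk_scores[j]:
                    num += 0.5
    return num / den

times  = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk   = [2.1, 1.7, 0.4, 0.1, 0.2]  # higher score = higher predicted hazard
c = concordance_index(times, events, risk)
print(c)  # 7 of 8 usable pairs concordant -> 0.875
```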

| C-index | Interpretation |
|---|---|
| $0.50$ | No discrimination (random) |
| $0.50 - 0.60$ | Poor |
| $0.60 - 0.70$ | Acceptable |
| $0.70 - 0.80$ | Good |
| $0.80 - 0.90$ | Excellent |
| $> 0.90$ | Outstanding |

12.6 The Global Likelihood Ratio Test (Cox Model)

For the Cox model, three omnibus tests assess whether any covariate is significantly associated with survival:

Likelihood Ratio (LR) Test:

$$\chi^2_{LR} = -2[\ell_{PL}(\mathbf{0}) - \ell_{PL}(\hat{\boldsymbol{\beta}})] \sim \chi^2(p)$$

Wald Test:

$$\chi^2_W = \hat{\boldsymbol{\beta}}^T [\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})]^{-1} \hat{\boldsymbol{\beta}} \sim \chi^2(p)$$

Score (Log-Rank) Test:

$$\chi^2_S = \mathbf{U}(\mathbf{0})^T [\mathbf{I}(\mathbf{0})]^{-1} \mathbf{U}(\mathbf{0}) \sim \chi^2(p)$$

Where $\mathbf{U}(\mathbf{0})$ is the score vector and $\mathbf{I}(\mathbf{0})$ is the information matrix, both evaluated at $\boldsymbol{\beta} = \mathbf{0}$.

All three tests are asymptotically equivalent. The LR test is generally considered the most reliable in finite samples.

12.7 Calibration: Observed vs. Predicted Survival

Calibration assesses whether the predicted survival probabilities match the observed survival rates. A well-calibrated model should show that, among individuals predicted to have (for example) a 70% chance of surviving 5 years, approximately 70% actually do survive 5 years.

The Hosmer-Lemeshow-style calibration test for survival groups subjects by predicted risk (deciles) and compares observed and expected survival in each group. Formally, the D'Agostino-Nam test extends this to survival data:

$$\chi^2_{HLS} = \sum_{g=1}^G \frac{(O_g - E_g)^2}{E_g(1 - E_g/n_g)} \sim \chi^2(G-2)$$


13. Advanced Topics

13.1 Time-Varying Covariates

In some studies, covariate values change during follow-up (e.g., treatment dose, blood pressure, smoking status). These are time-varying covariates and must be incorporated into the Cox model using a counting process data format.

In the counting process format, each period of follow-up during which a covariate value is constant becomes a separate row:

| ID | Start | Stop | Event | Treatment | Dose |
|---|---|---|---|---|---|
| 1 | 0 | 6 | 0 | A | 50mg |
| 1 | 6 | 12 | 1 | A | 75mg |
| 2 | 0 | 10 | 0 | B | 100mg |
| 2 | 10 | 24 | 0 | B | 80mg |

The counting process Cox model for time-varying covariates $\mathbf{x}_i(t)$:

$$h(t \mid \mathbf{x}_i(t)) = h_0(t) \exp(\boldsymbol{\beta}^T \mathbf{x}_i(t))$$

⚠️ Careful: Internal time-varying covariates (values affected by the individual's own disease process — e.g., a biomarker) can introduce bias. Only external covariates (unaffected by the subject's status) can be safely included as time-varying covariates without special considerations.
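Restructuring data into counting-process rows can be automated. A minimal sketch with illustrative names: it splits one subject's follow-up at each covariate change and attributes the event (if any) only to the final interval:

```python
def to_counting_process(subject_id, follow_up, event, dose_changes):
    """Expand one subject into (id, start, stop, event, dose) rows.
    dose_changes: list of (time, dose) pairs giving the dose in effect
    from that time onward; the first entry must start at time 0."""
    rows = []
    for k, (start, dose) in enumerate(dose_changes):
        last = (k == len(dose_changes) - 1)
        stop = follow_up if last else dose_changes[k + 1][0]
        # The event indicator belongs only to the final interval
        rows.append((subject_id, start, stop, event if last else 0, dose))
    return rows

# Subject 1: followed 12 months, event at the end, dose raised at month 6
rows = to_counting_process(1, 12, 1, [(0, "50mg"), (6, "75mg")])
print(rows)
# [(1, 0, 6, 0, '50mg'), (1, 6, 12, 1, '75mg')]
```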

13.2 Frailty Models (Random Effects for Survival)

Frailty models extend the Cox model to account for unobserved heterogeneity among subjects or clustering (e.g., patients within hospitals, multiple events per subject). A random frailty term $Z_i$ is introduced:

$$h(t \mid \mathbf{x}_i, Z_i) = Z_i \cdot h_0(t) \exp(\boldsymbol{\beta}^T \mathbf{x}_i)$$

Where $Z_i$ is a random variable (the "frailty") assumed to follow a specific distribution:

- Gamma frailty (the most common choice, with mean 1 and variance $\theta$)
- Log-normal frailty (convenient for multivariate random effects)

The frailty variance $\theta$ (or $\sigma^2$) quantifies the degree of unobserved heterogeneity. If $\theta = 0$, the frailty model reduces to the standard Cox model.

The marginal survival function (Gamma frailty):

$$S(t \mid \mathbf{x}_i) = E_{Z_i}[S(t \mid \mathbf{x}_i, Z_i)] = \left[1 + \theta H_0(t)\exp(\boldsymbol{\beta}^T\mathbf{x}_i)\right]^{-1/\theta}$$

13.3 Competing Risks

Competing risks arise when multiple mutually exclusive event types are possible, and the occurrence of one event precludes the observation of others. For example, in a study of cancer-specific mortality, deaths from other causes are competing risks.

In the presence of competing risks, the standard KM estimator overestimates the probability of the event of interest because it treats competing events as independent censoring (which they are not — they actually prevent the event of interest from occurring).

The Cause-Specific Hazard for event type $k$:

$$h_k(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t, K = k \mid T \geq t)}{\Delta t}$$

The Cumulative Incidence Function (CIF) (also called the subdistribution or cause-specific CIF) gives the probability of experiencing event $k$ by time $t$ in the presence of competing risks:

$$F_k(t) = \int_0^t h_k(u) \cdot S(u) \, du$$

Where $S(u) = \exp\left(-\sum_{k'} \int_0^u h_{k'}(v) \, dv\right)$ is the overall (all-cause) survival function.

Note that $\sum_k F_k(\infty) = 1$ (if all events are eventual) and the individual CIFs do not sum to 1 at any finite time $t$.
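The CIF integral has a simple discrete (Aalen-Johansen-style) estimate: accumulate $\hat{S}(t_j^-) \cdot d_{kj}/n_j$ over the event times. A pure-Python sketch on invented data without censoring, where the invariant $\sum_k \hat{F}_k + \hat{S} = 1$ is easy to verify:

```python
def cumulative_incidence(times, causes):
    """Aalen-Johansen CIF estimates for each cause. causes[i] is a
    positive integer for an event type, 0 for censored.
    Returns {cause: [(t, F_k(t)), ...]}."""
    event_times = sorted(set(t for t, c in zip(times, causes) if c != 0))
    all_causes = sorted(set(c for c in causes if c != 0))
    cif = {k: [] for k in all_causes}
    surv, totals = 1.0, {k: 0.0 for k in all_causes}
    for t in event_times:
        n_risk = sum(1 for ti in times if ti >= t)
        d_all = 0
        for k in all_causes:
            d_k = sum(1 for ti, ci in zip(times, causes) if ti == t and ci == k)
            totals[k] += surv * d_k / n_risk   # h_k(t) * S(t-)
            d_all += d_k
        surv *= (n_risk - d_all) / n_risk      # all-cause KM update
        for k in all_causes:
            cif[k].append((t, totals[k]))
    return cif

times  = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 1, 1, 2, 1]   # two competing event types, no censoring
cif = cumulative_incidence(times, causes)
# With no censoring, F_1 + F_2 at the last event time equals 1.
print(cif[1][-1][1] + cif[2][-1][1])
```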

The Gray Test compares CIFs between groups in the presence of competing risks (analogous to the log-rank test for the standard case).

Fine and Gray's Subdistribution Hazard Model directly models the effect of covariates on the CIF for event type kk:

$$\log\left[-\log\left(1 - F_k(t \mid \mathbf{x})\right)\right] = \log\left[-\log(1 - F_{k0}(t))\right] + \boldsymbol{\beta}_k^T \mathbf{x}$$

13.4 Restricted Mean Survival Time (RMST)

When the proportional hazards assumption is violated and hazard ratios are not interpretable, the Restricted Mean Survival Time (RMST) provides an alternative summary measure:

$$\text{RMST}(\tau) = E[\min(T, \tau)] = \int_0^{\tau} S(t) \, dt$$

Where $\tau$ is a pre-specified restriction time (the maximum follow-up horizon of interest).

RMST is the area under the survival curve up to time $\tau$. It has a direct interpretation as the average event-free time up to $\tau$.

Difference in RMST between two groups (e.g., treatment vs. control):

$$\Delta\text{RMST}(\tau) = \text{RMST}_1(\tau) - \text{RMST}_0(\tau)$$

This represents the average additional event-free time gained by the treatment group up to time $\tau$, regardless of proportionality.

Estimated RMST and its variance:

$$\widehat{\text{RMST}}(\tau) = \int_0^{\tau} \hat{S}(t) \, dt \approx \sum_{j: t_j \leq \tau} \hat{S}(t_j)(t_{j+1} - t_j)$$

The variance is estimated using the delta method applied to Greenwood's formula.
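Because $\hat{S}(t)$ is a step function, the RMST integral reduces to a sum of rectangle areas. A minimal sketch (the step-curve values are invented):

```python
def rmst(km_times, km_surv, tau):
    """Area under a KM step curve up to tau. km_times: event times in
    increasing order; km_surv: S(t) just after each event time.
    The curve equals 1 before the first event time."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(km_times, km_surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)     # final partial rectangle
    return area

# Step curve: S = 1 on [0,2), 0.8 on [2,5), 0.5 on [5,9), 0.3 afterwards
# Area up to tau = 10: 1*2 + 0.8*3 + 0.5*4 + 0.3*1 = 6.7
print(rmst([2, 5, 9], [0.8, 0.5, 0.3], tau=10))
```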


14. Worked Examples

Example 1: Kaplan-Meier Estimation and Log-Rank Test — Two Treatment Groups

A clinical trial of $n = 40$ patients with advanced lung cancer randomises participants to either a new chemotherapy regimen (Group A, $n = 20$) or standard care (Group B, $n = 20$). Survival time (in months) is recorded.

Study Summary:

Step 1 — Compute the Kaplan-Meier Estimates (abbreviated)

Group A (selected event times):

| Time (m) | $n_j$ | $d_j$ | $\hat{p}_j$ | $\hat{S}(t)$ | 95% CI |
|---|---|---|---|---|---|
| 2 | 20 | 1 | 0.950 | 0.950 | (0.828, 1.000) |
| 5 | 18 | 2 | 0.889 | 0.844 | (0.674, 0.944) |
| 9 | 15 | 1 | 0.933 | 0.788 | (0.608, 0.906) |
| 14 | 12 | 1 | 0.917 | 0.722 | (0.528, 0.862) |
| 22 | 8 | 2 | 0.750 | 0.542 | (0.337, 0.723) |
| 31 | 4 | 1 | 0.750 | 0.406 | (0.197, 0.636) |

Group B (selected event times):

| Time (m) | $n_j$ | $d_j$ | $\hat{p}_j$ | $\hat{S}(t)$ | 95% CI |
|---|---|---|---|---|---|
| 1 | 20 | 2 | 0.900 | 0.900 | (0.755, 0.978) |
| 4 | 17 | 2 | 0.882 | 0.794 | (0.617, 0.912) |
| 7 | 14 | 2 | 0.857 | 0.680 | (0.489, 0.824) |
| 12 | 10 | 2 | 0.800 | 0.544 | (0.347, 0.724) |
| 18 | 6 | 2 | 0.667 | 0.363 | (0.182, 0.586) |
| 25 | 3 | 1 | 0.667 | 0.242 | (0.071, 0.524) |

Median Survival Time:

- Group A: 26 months (first time at which $\hat{S}(t) \leq 0.50$ in the full data)
- Group B: 16 months

Step 2 — Log-Rank Test

$H_0$: The survival distributions are identical in Groups A and B.

Computing observed and expected events at each event time in the combined dataset:

$\sum_j d_{Aj} = 14$ (observed events in Group A)

$\sum_j e_{Aj} = 10.2$ (expected events in Group A under $H_0$)

$$\chi^2_{LR} = \frac{(14 - 10.2)^2}{\sum_j V_{Aj}} = \frac{(3.8)^2}{4.91} = \frac{14.44}{4.91} \approx 2.94$$

$$df = 1, \quad p = 0.086$$

Interpretation: The log-rank test is not statistically significant at the 5% level ($p = 0.086$), though there is a trend favouring Group A. The median survival time is 10 months longer for Group A (26 vs. 16 months). With a larger sample, this difference might reach statistical significance. The Kaplan-Meier curves show higher survival probabilities for Group A at all time points.


Example 2: Cox Proportional Hazards Model — Predicting Time to Readmission

A hospital study follows $n = 300$ patients after discharge from heart failure hospitalisation. The event of interest is 30-day readmission. Predictors include age (years), sex (0 = female, 1 = male), number of comorbidities (count), and treatment type (A = reference, B, C).

Model Results:

| Variable | $\hat{\beta}$ | SE | z | p | HR = $e^{\hat{\beta}}$ | 95% CI for HR |
|---|---|---|---|---|---|---|
| Age (years) | 0.031 | 0.011 | 2.82 | 0.005 | 1.031 | (1.009, 1.054) |
| Sex (Male) | 0.287 | 0.142 | 2.02 | 0.043 | 1.332 | (1.009, 1.759) |
| Comorbidities | 0.198 | 0.063 | 3.14 | 0.002 | 1.219 | (1.078, 1.379) |
| Treatment B | -0.412 | 0.158 | -2.61 | 0.009 | 0.662 | (0.487, 0.900) |
| Treatment C | -0.681 | 0.171 | -3.98 | <0.001 | 0.506 | (0.362, 0.707) |

Global LR Test: $\chi^2(5) = 38.2$, $p < 0.001$ — at least one covariate significantly predicts readmission.

C-index: $\hat{C} = 0.72$ — good discriminative ability.

Interpretation of Key Coefficients:

- Age: each additional year of age is associated with a 3.1% higher hazard of readmission (HR = 1.031)
- Comorbidities: each additional comorbidity is associated with a 21.9% higher hazard (HR = 1.219)
- Treatment C: patients on Treatment C have a 49.4% lower hazard of readmission than those on Treatment A (HR = 0.506)

Proportional Hazards Assessment:

Grambsch-Therneau test results:

| Variable | $\rho$ (correlation with time) | $\chi^2$ | p |
|---|---|---|---|
| Age | 0.041 | 0.28 | 0.597 |
| Sex | -0.063 | 0.67 | 0.413 |
| Comorbidities | 0.082 | 1.12 | 0.290 |
| Treatment B | 0.051 | 0.44 | 0.507 |
| Treatment C | 0.097 | 1.58 | 0.209 |
| GLOBAL | | 4.82 | 0.437 |

All individual tests and the global test are non-significant — the proportional hazards assumption is supported for all covariates.

Predicted Survival:

For a 65-year-old male with 3 comorbidities on Treatment C:

$$\hat{\eta} = 0.031(65) + 0.287(1) + 0.198(3) + (-0.681)(1) = 2.015 + 0.287 + 0.594 - 0.681 = 2.215$$

$$\hat{S}(30 \mid \mathbf{x}) = [\hat{S}_0(30)]^{\exp(2.215)}$$

Where $\hat{S}_0(30)$ is the estimated baseline 30-day survival probability.
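The prediction above can be reproduced in a few lines. The coefficients come from the model table; the baseline survival value is an assumed placeholder, since $\hat{S}_0(30)$ is not reported in the example:

```python
import math

# Coefficients from the readmission model above
beta = {"age": 0.031, "male": 0.287, "comorbid": 0.198, "trt_C": -0.681}

def linear_predictor(age, male, comorbid, trt_c):
    """eta = beta^T x for the covariate pattern of interest."""
    return (beta["age"] * age + beta["male"] * male
            + beta["comorbid"] * comorbid + beta["trt_C"] * trt_c)

def predicted_survival(s0_t, eta):
    """S(t | x) = S0(t)^{exp(eta)}."""
    return s0_t ** math.exp(eta)

eta = linear_predictor(age=65, male=1, comorbid=3, trt_c=1)
print(round(eta, 3))  # 2.215, matching the hand calculation
# S0(30) = 0.95 is an assumed value purely to illustrate the final step
print(round(predicted_survival(0.95, eta), 3))
```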


Example 3: Weibull Parametric Model — Time to Equipment Failure

An engineering study monitors $n = 120$ industrial machines for failure. Follow-up is 18 months. The predictor is machine type (Standard = reference, Enhanced).

Distribution Selection (AIC comparison):

| Distribution | Log-Likelihood | Parameters | AIC |
|---|---|---|---|
| Exponential | -241.3 | 2 | 486.6 |
| Weibull | -233.8 | 3 | 473.6 |
| Log-Normal | -237.1 | 3 | 480.2 |
| Log-Logistic | -235.9 | 3 | 477.8 |
| Gompertz | -234.5 | 3 | 475.0 |

Decision: Weibull has the lowest AIC (473.6) → select Weibull model.

LRT: Weibull vs. Exponential:

$$\Lambda = -2(-241.3 - (-233.8)) = -2(-7.5) = 15.0 \sim \chi^2(1), \quad p < 0.001$$

The Weibull fits significantly better than the exponential — the hazard is not constant.

Weibull Model Results:

| Parameter | Estimate | SE | 95% CI |
|---|---|---|---|
| Intercept ($\ln \lambda$) | 2.841 | 0.201 | (2.447, 3.235) |
| Enhanced vs. Standard ($\beta$) | 0.624 | 0.183 | (0.265, 0.983) |
| Shape ($\kappa$) | 1.82 | 0.189 | (1.481, 2.230) |

Shape parameter: $\hat{\kappa} = 1.82 > 1$ → increasing hazard — failure risk increases over time, consistent with mechanical wear-out.

Acceleration Factor (AFT interpretation):

$$e^{\hat{\beta}} = e^{0.624} = 1.867$$

Enhanced machines have failure times that are 86.7% longer than Standard machines on average (95% CI: 1.303, 2.674). In other words, Enhanced machines last approximately 1.87 times as long.

Predicted Median Failure Times:

$$\hat{t}_{0.5}^{\text{Standard}} = \exp(2.841) \cdot [\ln 2]^{1/1.82} = 17.13 \times 0.818 = 14.0 \text{ months}$$

$$\hat{t}_{0.5}^{\text{Enhanced}} = 14.0 \times 1.867 = 26.2 \text{ months}$$

Standard machines have a predicted median failure time of 14.0 months, while Enhanced machines are predicted to last about 26.2 months before failure, on average.


15. Common Mistakes and How to Avoid Them

Mistake 1: Treating Censored Observations as Events or Excluding Them

Problem: A common error among beginners is either incorrectly coding censored observations as events (which dramatically overestimates the hazard) or excluding censored observations entirely (which introduces severe selection bias — the excluded subjects are typically the longest-surviving ones).
Solution: Include all observations and code censored subjects correctly (status = 0). The statistical methods are specifically designed to use the partial information provided by censored observations.

Mistake 2: Using the Wrong Time Origin

Problem: Using different time origins for different subjects (e.g., some measured from diagnosis, others from treatment start) without harmonising the time scale. This creates an inconsistent and uninterpretable survival analysis.
Solution: Define the time origin before data collection and apply it consistently to all subjects. Document the chosen time origin clearly in the methods section.

Mistake 3: Confusing Censoring and Missing Data

Problem: Treating censored observations as missing (removing them) or treating truly missing data as censored (including them with a specific censoring time) are both errors that lead to biased results.
Solution:

- Censored observations (the subject was observed event-free up to a known time) should be kept, with status = 0 and the last observed time.
- Truly missing times or statuses should be handled with standard missing-data methods (e.g., multiple imputation) or documented exclusions; never assign an arbitrary censoring time.

Mistake 4: Not Checking the Proportional Hazards Assumption

Problem: Reporting Cox model hazard ratios without testing the PH assumption. If the assumption is violated, the hazard ratio is a biased average of a time-varying effect and is not meaningfully interpretable.
Solution: Always perform and report Schoenfeld residual tests and log-log plots for all covariates. If PH is violated, use stratification, time-varying coefficients, or switch to an AFT model.

Mistake 5: Applying the Standard KM Estimator With Competing Risks

Problem: Using the Kaplan-Meier estimator to estimate the probability of an event when competing risks are present. KM treats competing events as independent censoring, which overestimates the cumulative incidence of the event of interest.
Solution: Use the Cumulative Incidence Function (CIF) estimator (Aalen-Johansen method) when competing risks are present. Use the Gray test for group comparisons and the Fine-Gray model for regression.

Mistake 6: Using Too Few Events for the Cox Model (Overfitting)

Problem: Fitting a Cox model with many covariates but few events (e.g., 5 covariates with only 20 events). This leads to overfitting — coefficients and hazard ratios are unstable and will not replicate.
Solution: Follow the 10 Events Per Variable (EPV) rule. With 20 events, include at most 2 covariates. For small event counts, use penalised regression (ridge, LASSO) or Bayesian methods.

Mistake 7: Treating the Log-Rank Test as the Only Comparison Method

Problem: Reporting only the log-rank test when survival curves are non-proportional (crossing curves). The log-rank test is least powerful precisely when hazard ratios are not constant.
Solution: Always plot the KM curves first. If curves cross or show non-proportional patterns, report the Fleming-Harrington test with appropriate weights, or compare using RMST differences, which do not require the PH assumption.

Mistake 8: Ignoring the Effect of Left Truncation

Problem: In prevalence or registry studies, subjects are often only enrolled after surviving a certain threshold (e.g., alive at time of registration). Ignoring this creates immortal time bias — subjects who died early are systematically excluded.
Solution: Use left-truncated survival data methods. In the counting process format, specify the entry time (left truncation time) for each subject. This correctly adjusts the risk sets to only include subjects who were observable at each event time.

Mistake 9: Extrapolating Parametric Models Without Caution

Problem: Using parametric survival models to predict survival far beyond the observed follow-up period without acknowledging the uncertainty of such extrapolation.
Solution: Always explicitly state the observed follow-up range. Clearly label any predictions beyond observed follow-up as extrapolations with wide uncertainty. Validate predictions in external data when possible. Use sensitivity analysis with alternative distributions.

Mistake 10: Not Reporting Confidence Intervals for Survival Estimates

Problem: Reporting point estimates of $\hat{S}(t)$ or hazard ratios without confidence intervals, giving a false impression of precision.
Solution: Always report confidence intervals for all survival estimates: KM survival probabilities (Greenwood/log-log CIs), median survival times (Brookmeyer-Crowley CIs), and hazard ratios (Wald CIs). Include the number at risk at key time points on KM plots.


16. Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| Negative or zero survival times | Data entry errors; incorrect time origin | Clean data; check for recording errors; redefine time origin |
| KM curve never reaches 0.50 | Fewer than 50% of subjects experience the event | Report that median is not estimable; report restricted mean or 75th percentile |
| KM curve drops abruptly to 0 | Last few observations are all events with no censored subjects near the end | Usually fine; report last observed survival probability |
| Log-rank test significant but KM curves overlap early | Late separation of curves (e.g., Weibull with increasing hazard) | Use Fleming-Harrington test with $\gamma > 0$ to emphasise late differences |
| Log-rank test non-significant but curves look different | Crossing curves; small sample | Report RMST; test for specific crossing patterns; increase sample size |
| Cox model fails to converge | Perfect separation (a covariate perfectly predicts events); very small event count | Remove problematic covariate; use Firth's penalised partial likelihood; reduce covariates |
| Hazard ratios extremely large ($> 10$) or extremely small ($< 0.10$) | Perfect or near-perfect separation; very few events in one category | Check event rates by category; collapse sparse categories; use penalised likelihood |
| Schoenfeld test significant for one covariate | PH assumption violated for that covariate | Stratify by that covariate; add time × covariate interaction; use AFT model |
| All Schoenfeld tests significant | Fundamental PH violation across all covariates | Switch to AFT parametric model or use RMST |
| Deviance residuals show large outliers | Influential observations with unexpected event timing | Investigate outliers clinically; check data quality; use dfbeta diagnostics |
| Very high C-index ($> 0.90$) with few events | Overfitting; possible data leakage | Validate in independent sample; apply shrinkage; check for inadvertent inclusion of outcome-proximate predictors |
| AIC similar across all parametric distributions | Data are not informative enough to discriminate between distributions | Use the most theoretically appropriate distribution; report all models; prefer the most parsimonious |
| Competing risks suspected but ignored | Multiple event types recorded but not modelled | Use cumulative incidence functions and the Fine-Gray model |
| Left truncation not accounted for | Registry or prevalent cohort study design | Restructure data with entry time; use counting process format |
| Large number of tied event times | Discrete-time outcome (e.g., daily data recorded as monthly) | Use Efron or exact partial likelihood; consider discrete-time survival models |

17. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
| --- | --- |
| $S(t) = P(T > t) = 1 - F(t)$ | Survival function |
| $h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\ln S(t)$ | Hazard function |
| $H(t) = \int_0^t h(u)\,du = -\ln S(t)$ | Cumulative hazard |
| $S(t) = e^{-H(t)} = \exp\left(-\int_0^t h(u)\,du\right)$ | Survival from cumulative hazard |
| $\hat{S}(t) = \prod_{j: t_j \leq t} \frac{n_j - d_j}{n_j}$ | Kaplan-Meier estimator |
| $\hat{H}(t) = \sum_{j: t_j \leq t} \frac{d_j}{n_j}$ | Nelson-Aalen estimator |
| $\widehat{\text{Var}}[\hat{S}(t)] = [\hat{S}(t)]^2\sum_{j: t_j \leq t}\frac{d_j}{n_j(n_j - d_j)}$ | Greenwood's formula |
| $\chi^2_{LR} = \frac{(\sum_j(d_{1j} - e_{1j}))^2}{\sum_j V_{1j}}$ | Log-rank test statistic |
| $h(t \mid \mathbf{x}) = h_0(t)\exp(\boldsymbol{\beta}^T\mathbf{x})$ | Cox PH model |
| $\text{HR} = e^{\hat{\beta}}$ | Hazard ratio |
| $\ell_{PL}(\boldsymbol{\beta}) = \sum_j\left[\boldsymbol{\beta}^T\mathbf{x}_{i_j} - \ln\sum_{l \in \mathcal{R}(t_j)}\exp(\boldsymbol{\beta}^T\mathbf{x}_l)\right]$ | Cox partial log-likelihood |
| $\hat{S}(t \mid \mathbf{x}) = [\hat{S}_0(t)]^{\exp(\hat{\boldsymbol{\beta}}^T\mathbf{x})}$ | Cox predicted survival |
| $L_i = [h(t_i;\boldsymbol{\theta})]^{\delta_i} \cdot S(t_i;\boldsymbol{\theta})$ | Parametric likelihood contribution |
| $\text{RMST}(\tau) = \int_0^{\tau} S(t)\,dt$ | Restricted mean survival time |
| $\text{AIC} = -2\hat{\ell} + 2q$ | Akaike Information Criterion |
| $\text{BIC} = -2\hat{\ell} + q\ln(n^*)$ | Bayesian Information Criterion |
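As a concrete illustration of the Kaplan-Meier and Nelson-Aalen formulas above, here is a minimal pure-Python sketch (an illustrative implementation, not DataStatPro's internal routine) that computes both estimators from `(time, event)` pairs:

```python
from collections import Counter

def km_nelson_aalen(times, events):
    """Kaplan-Meier S(t) and Nelson-Aalen H(t) at each distinct event time.

    times  : observed follow-up times
    events : 1 = event observed, 0 = right-censored
    Returns a list of (t_j, S_hat, H_hat) tuples.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    s_hat, h_hat, out = 1.0, 0.0, []
    for tj in sorted(deaths):
        nj = sum(1 for t in times if t >= tj)   # n_j: at risk just before t_j
        dj = deaths[tj]                         # d_j: events at t_j
        s_hat *= (nj - dj) / nj                 # product-limit step
        h_hat += dj / nj                        # Nelson-Aalen increment
        out.append((tj, s_hat, h_hat))
    return out
```

With `times = [1, 2, 3, 4]` and `events = [1, 1, 0, 1]`, the curve steps at 1, 2, and 4; the censored observation at 3 leaves the risk set without producing a step, exactly as the product-limit formula prescribes.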

Parametric Distribution Quick Reference

| Distribution | $S(t)$ | Hazard Shape | Parameters |
| --- | --- | --- | --- |
| Exponential | $e^{-\lambda t}$ | Constant | $\lambda$ |
| Weibull | $\exp[-(t/\lambda)^{\kappa}]$ | Monotone ↑/↓/flat | $\lambda, \kappa$ |
| Log-Normal | $1 - \Phi\left(\frac{\ln t - \mu}{\sigma}\right)$ | Hump-shaped | $\mu, \sigma$ |
| Log-Logistic | $\frac{1}{1 + (t/\alpha)^{\gamma}}$ | Hump or decreasing | $\alpha, \gamma$ |
| Gompertz | $\exp\left[-\frac{\lambda}{\gamma}(e^{\gamma t} - 1)\right]$ | Monotone ↑ | $\lambda, \gamma$ |
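Each survival function in the table can be evaluated directly. The sketch below (standard library only; parameter names follow the table, and any specific values are illustrative) can be useful for plotting curves or sanity-checking a fitted model:

```python
import math

def S_exponential(t, lam):
    """Constant hazard: S(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

def S_weibull(t, lam, kappa):
    """Monotone hazard; kappa = 1 reduces to the exponential."""
    return math.exp(-(t / lam) ** kappa)

def S_lognormal(t, mu, sigma):
    """Hump-shaped hazard; Phi is the standard normal CDF."""
    z = (math.log(t) - mu) / sigma
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - phi

def S_loglogistic(t, alpha, gamma):
    """Hump-shaped or decreasing hazard; alpha is the median survival time."""
    return 1.0 / (1.0 + (t / alpha) ** gamma)

def S_gompertz(t, lam, gamma):
    """Monotonically increasing hazard (gamma > 0)."""
    return math.exp(-(lam / gamma) * (math.exp(gamma * t) - 1.0))
```

Note the built-in checks these forms admit: the Weibull with $\kappa = 1$ equals the exponential, the log-normal at $t = e^{\mu}$ gives $S(t) = 0.5$, and the log-logistic at $t = \alpha$ likewise gives $0.5$.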

Hazard Ratio Interpretation

| HR | Meaning |
| --- | --- |
| $> 1$ | Higher hazard (increased risk) relative to reference |
| $= 1$ | No effect on hazard |
| $< 1$ | Lower hazard (protective effect) relative to reference |
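Because $\text{HR} = e^{\hat\beta}$, a Wald confidence interval built on the coefficient scale exponentiates directly to an interval for the hazard ratio. A small illustrative helper (a sketch, not a DataStatPro API):

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Hazard ratio with a 95% Wald confidence interval.

    beta : Cox regression coefficient (log hazard ratio)
    se   : standard error of beta
    Returns (HR, lower, upper).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# A coefficient of 0 maps to HR = 1 (no effect); whether the interval
# contains 1 corresponds to whether the Wald test is significant.
```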

C-Index Benchmarks

| C-Index | Discrimination |
| --- | --- |
| $0.50$ | None (random) |
| $0.50 - 0.60$ | Poor |
| $0.60 - 0.70$ | Acceptable |
| $0.70 - 0.80$ | Good |
| $0.80 - 0.90$ | Excellent |
| $> 0.90$ | Outstanding |
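Harrell's C-index behind these benchmarks counts, among usable pairs, how often the subject with the higher model risk score fails first; a pair is usable when the shorter follow-up time ends in an observed event. A minimal $O(n^2)$ sketch of the simplest form of the statistic (ignoring tied event times):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index.

    A pair (i, j) is usable when subject i has an observed event before
    subject j's follow-up ends; ties in risk score count as 1/2.
    """
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0      # higher risk failed first
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5      # tied scores
    return concordant / usable
```

Perfectly ordered risk scores give 1.0, perfectly reversed scores give 0.0, and a constant score (no discrimination) gives exactly 0.5, matching the "random" benchmark in the table.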

Method Selection Guide

| Scenario | Recommended Method |
| --- | --- |
| Describe survival in one group | Kaplan-Meier |
| Compare survival across $\geq 2$ groups (unadjusted) | Log-Rank Test |
| Compare when groups differ on a confounder | Stratified Log-Rank |
| Non-proportional hazards in group comparison | Fleming-Harrington / RMST |
| Covariate-adjusted survival analysis | Cox PH Model |
| PH assumption violated | Stratified Cox or AFT Model |
| Predict survival times | Parametric Model |
| Competing events present | Cumulative Incidence Function + Fine-Gray Model |
| Clustered or repeated events | Frailty Model |
| Covariates change over time | Time-Varying Cox Model |

Proportional Hazards Testing Summary

| Method | Tool | PH Holds If |
| --- | --- | --- |
| Schoenfeld residuals | Grambsch-Therneau test | $p > 0.05$ for all covariates |
| Log-log plot | Visual inspection | Lines approximately parallel |
| Time × covariate interaction | LR or Wald test | $p > 0.05$ for all interactions |
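The log-log check can be computed directly from Kaplan-Meier estimates: plot $\ln(-\ln \hat S(t))$ for each group and look for a roughly constant vertical gap between the curves. A small sketch of the transformation (illustrative, not the application's internal routine):

```python
import math

def log_log_curve(km_estimates):
    """ln(-ln S(t)) at each event time, for visual PH checking.

    km_estimates : list of (t, S_hat) pairs; points with S_hat of 0 or 1
    are dropped because the transform is undefined there.
    """
    return [(t, math.log(-math.log(s))) for t, s in km_estimates if 0 < s < 1]

# Under exact proportional hazards with HR = c, S2(t) = S1(t)**c, so the
# two transformed curves differ by the constant ln(c) at every t.
```

This also explains why the curves should be *parallel* rather than identical: a genuine group effect shifts the curve vertically by the log hazard ratio, while non-proportionality makes the gap change over time.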

Events Per Variable (EPV) Requirements

| Model | Minimum EPV | Recommended EPV |
| --- | --- | --- |
| Cox PH | 10 | 15–20 |
| Parametric | 10 | 15–20 |
| Kaplan-Meier (stable curve) | 20–30 events total | 50+ events total |
| Fine-Gray (competing risks) | 10 per event type | 15–20 per event type |
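EPV is simple arithmetic, but the denominator is easy to miscount: it is the number of *fitted parameters*, so a $k$-level categorical factor contributes $k - 1$, not 1. An illustrative check (example numbers are hypothetical):

```python
def events_per_variable(n_events, n_parameters):
    """EPV = observed events / candidate model parameters.

    Count fitted parameters, not raw variables: a k-level factor
    entered as dummies contributes k - 1 parameters.
    """
    return n_events / n_parameters

# Example: 60 events with 3 continuous covariates plus one 3-level factor
# gives 3 + 2 = 5 parameters, so EPV = 12 -- above the minimum of 10 but
# below the recommended 15-20 for a Cox model.
```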

Censoring Types Summary

| Censoring Type | Description | Example |
| --- | --- | --- |
| Right | Event not yet occurred at end of follow-up | Study ends; patient still alive |
| Left | Event occurred before observation began | Pre-existing infection |
| Interval | Event occurred between two assessment points | Screening study |
| Left truncation | Subject only enters study after surviving to entry time | Registry study |
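In data terms, right censoring is usually encoded as a `(time, event)` pair per subject, and left truncation adds an entry time, giving the counting-process format. A hypothetical record layout sketching this encoding (interval censoring, which instead needs explicit lower/upper bounds on the event time, is not covered here):

```python
from dataclasses import dataclass

@dataclass
class SurvivalRecord:
    """One subject in counting-process form (entry, time] with an event flag.

    entry : delayed-entry (left-truncation) time; 0.0 for an ordinary cohort
    time  : end of follow-up
    event : 1 if the event was observed at `time`, 0 if right-censored
    """
    entry: float
    time: float
    event: int

    def at_risk(self, t):
        # Contributes to the risk set at t only after entering the study
        return self.entry < t <= self.time

# Right-censored subject followed from 0 to 5 years, still event-free:
censored = SurvivalRecord(entry=0.0, time=5.0, event=0)
# Registry subject who entered after already surviving 2 years, event at 7:
truncated = SurvivalRecord(entry=2.0, time=7.0, event=1)
```

The `at_risk` predicate captures why ignoring left truncation biases estimates: the truncated subject must not be counted in risk sets before their entry time, or early survival is overstated.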

This tutorial provides a comprehensive foundation for understanding, performing, and interpreting Survival Analysis using the DataStatPro application. For further reading, consult Kleinbaum & Klein's "Survival Analysis: A Self-Learning Text" (3rd ed., 2012), Hosmer, Lemeshow & May's "Applied Survival Analysis" (2nd ed., 2008), or Therneau & Grambsch's "Modeling Survival Data: Extending the Cox Model" (2000). For feature requests or support, contact the DataStatPro team.