Survival Analysis: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Survival Analysis all the way through advanced model specification, estimation, evaluation, and practical usage within the DataStatPro application. Whether you are encountering survival analysis for the first time or looking to deepen your understanding of time-to-event methods, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is Survival Analysis?
- The Mathematics Behind Survival Analysis
- Assumptions of Survival Analysis
- Types of Survival Analysis Methods
- Using the Survival Analysis Component
- Data Structure and Censoring
- Non-Parametric Methods: Kaplan-Meier Estimation
- Comparing Survival Curves: Log-Rank and Related Tests
- Semi-Parametric Methods: Cox Proportional Hazards Model
- Parametric Survival Models
- Model Fit and Evaluation
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into survival analysis, it is helpful to be familiar with the following foundational concepts. Each is briefly reviewed below.
1.1 Probability and Probability Distributions
A probability distribution describes the likelihood of all possible outcomes of a random variable. In survival analysis, the primary random variable is time — specifically, the time $T$ until a specific event occurs.

The probability density function (PDF) $f(t)$ of a continuous random variable satisfies:

$$f(t) \ge 0 \quad \text{for all } t, \qquad \int_0^{\infty} f(t)\,dt = 1$$

The cumulative distribution function (CDF) $F(t)$ gives the probability that the event has occurred by time $t$:

$$F(t) = P(T \le t) = \int_0^{t} f(u)\,du$$
1.2 The Complement: Survival Probability
The survival function $S(t)$ is the complement of the CDF — it gives the probability of surviving (not experiencing the event) beyond time $t$:

$$S(t) = P(T > t) = 1 - F(t)$$

Key properties:
- $S(0) = 1$: At the start, everyone has survived (not yet experienced the event).
- $\lim_{t \to \infty} S(t) = 0$: Eventually, all individuals experience the event (assuming all eventually do).
- $S(t)$ is a non-increasing function of time: it can only fall or stay flat, never rise.
1.3 Rates and Conditional Probability
A rate measures how frequently an event occurs per unit of time. The conditional probability $P(A \mid B)$ is the probability of event $A$ given that event $B$ has already occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

In survival analysis, we often ask: "Given that an individual has survived until time $t$, what is the probability they experience the event in the next small interval of time?" This is the hazard — one of the central concepts in survival analysis.
1.4 Likelihood and Maximum Likelihood Estimation
The likelihood function measures how probable the observed data are, given a set of model parameters $\theta$:

$$L(\theta) = \prod_{i=1}^{n} f(t_i \mid \theta)$$

Maximum Likelihood Estimation (MLE) finds the parameter values $\hat\theta$ that maximise $L(\theta)$ (or equivalently, maximise the log-likelihood $\ell(\theta) = \log L(\theta)$). Most survival models are estimated via MLE or partial MLE.
1.5 The Exponential Distribution — A Simple Survival Model
The exponential distribution is the simplest parametric model for survival data. If $T \sim \text{Exponential}(\lambda)$:

$$f(t) = \lambda e^{-\lambda t}, \qquad S(t) = e^{-\lambda t}, \qquad h(t) = \lambda$$

Where $\lambda$ is the rate parameter (events per unit time). The mean survival time is $E[T] = 1/\lambda$.
The exponential distribution has the memoryless property: the probability of experiencing the event in the next moment does not depend on how long an individual has already survived. This is a very strong and often unrealistic assumption — more flexible distributions are usually needed.
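The memoryless property can be checked numerically. The following minimal Python sketch (with an illustrative rate, not DataStatPro code) verifies that $P(T > s + t \mid T > s) = S(s+t)/S(s)$ equals $S(t)$:

```python
import math

def exp_survival(t, lam):
    """Survival function S(t) = exp(-lambda * t) of the exponential distribution."""
    return math.exp(-lam * t)

lam = 0.1  # illustrative rate: 0.1 events per unit time, so mean survival = 10

# Memoryless property: P(T > s + t | T > s) = S(s + t) / S(s) = S(t)
s, t = 5.0, 10.0
conditional = exp_survival(s + t, lam) / exp_survival(s, lam)
unconditional = exp_survival(t, lam)
print(conditional, unconditional)  # both equal e^{-1} ~ 0.3679
```

Having already survived 5 time units changes nothing: the conditional survival over the next 10 units matches the unconditional one exactly.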
1.6 Integration and Differentiation (Brief Review)
Survival analysis involves calculus. The key relationships to remember:

$$F(t) = \int_0^{t} f(u)\,du, \qquad f(t) = \frac{dF(t)}{dt} = -\frac{dS(t)}{dt}, \qquad S(t) = 1 - F(t)$$

These relationships connect the PDF, CDF, and survival function and are used extensively in deriving hazard functions and likelihood contributions.
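These relationships can be sanity-checked numerically. For instance, integrating a hazard function recovers the survival function via $S(t) = e^{-H(t)}$ (formalised in Section 3). A short Python sketch with hypothetical Weibull parameters integrates the hazard by the trapezoidal rule and compares against the closed form:

```python
import math

def weibull_hazard(t, lam, k):
    """Weibull hazard h(t) = (k/lam) * (t/lam)^(k-1)."""
    return (k / lam) * (t / lam) ** (k - 1)

def cumulative_hazard(h, t, steps=10_000):
    """Approximate H(t) = integral of h(u) du from 0 to t by the trapezoidal rule."""
    dt = t / steps
    total = 0.0
    for i in range(steps):
        a, b = i * dt, (i + 1) * dt
        total += 0.5 * (h(a) + h(b)) * dt
    return total

lam, k, t = 10.0, 2.0, 5.0  # hypothetical scale/shape, evaluation time
H = cumulative_hazard(lambda u: weibull_hazard(u, lam, k), t)
S_from_hazard = math.exp(-H)               # S(t) = exp(-H(t))
S_closed_form = math.exp(-(t / lam) ** k)  # Weibull survival directly
print(H, S_from_hazard, S_closed_form)     # H = 0.25; the two survivals agree
```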
2. What is Survival Analysis?
2.1 The Core Question
Survival analysis is a branch of statistics that analyses the time until a specific event of interest occurs. The defining questions are:
- How long until the event occurs?
- What is the probability of surviving beyond a given time?
- Do different groups have different survival experiences?
- What factors are associated with shorter or longer time to the event?
Despite the name "survival analysis," the event of interest does not have to be death. It can be any well-defined, non-repeating event:
| Field | Event of Interest | Time Variable |
|---|---|---|
| Medicine | Death, disease recurrence, hospital discharge | Months from diagnosis |
| Engineering | Machine failure, component breakdown | Hours of operation |
| Finance | Loan default, customer churn, bankruptcy | Days since account opening |
| Psychology | Relapse after treatment, onset of disorder | Weeks since therapy |
| Marketing | Customer purchase, subscription cancellation | Days since sign-up |
| Social Science | Marriage, unemployment ending, graduation | Years from event start |
| Manufacturing | Product defect, warranty claim | Cycles of use |
2.2 What Makes Survival Data Special?
Survival data has two unique characteristics that make it unsuitable for standard regression or t-tests:
1. Censoring: Not all individuals experience the event during the observation period. An individual who is still alive (or event-free) at the end of the study has a censored observation — we know they survived at least until their censoring time $c$, but we do not know their full survival time.
2. Skewed, positive-only time distributions: Survival times are always positive and often highly right-skewed (many short times, few very long times). Standard methods that assume normality are inappropriate.
These characteristics require special methods — the tools of survival analysis — to produce valid estimates and inferences.
2.3 The Three Central Functions
Every survival analysis revolves around three mathematically related functions. Understanding all three is essential:
| Function | Symbol | Question Answered | Range |
|---|---|---|---|
| Survival Function | $S(t)$ | What is the probability of surviving past time $t$? | $[0, 1]$ |
| Hazard Function | $h(t)$ | At time $t$, how fast are events occurring among survivors? | $[0, \infty)$ |
| Cumulative Hazard | $H(t)$ | Total accumulated risk of the event up to time $t$ | $[0, \infty)$ |
These three functions are mathematically equivalent — knowing one fully determines the other two.
2.4 Survival Analysis vs. Standard Regression
| Feature | Standard Regression | Survival Analysis |
|---|---|---|
| Outcome variable | Continuous, binary, count | Time to event |
| Censoring handled? | No | Yes — fundamental feature |
| Distribution assumed | Normal (OLS) | Exponential, Weibull, etc. |
| Negative outcomes | Possible | Never (time $t > 0$) |
| Primary interest | Mean or probability | Survival function, hazard |
| Key model | Linear / Logistic | Kaplan-Meier, Cox, Parametric |
3. The Mathematics Behind Survival Analysis
3.1 The Survival Function
The survival function $S(t)$ is the probability that the event time $T$ exceeds a specified time $t$:

$$S(t) = P(T > t)$$
Key mathematical properties:
- $S(0) = 1$ (all subjects start event-free).
- $\lim_{t \to \infty} S(t) = 0$ (all eventually experience the event, assuming no cure).
- $S(t)$ is right-continuous and non-increasing.
- $\hat S(t)$ is a step function when estimated non-parametrically (Kaplan-Meier).
3.2 The Hazard Function
The hazard function $h(t)$ (also called the hazard rate or instantaneous failure rate) is the conditional rate at which the event occurs at time $t$, given survival up to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$

This can be re-expressed using the PDF and survival function:

$$h(t) = \frac{f(t)}{S(t)}$$

Interpretation: $h(t)\,\Delta t \approx P(\text{event in } [t, t + \Delta t) \mid T \ge t)$ for small $\Delta t$.
Key properties of the hazard function:
- $h(t) \ge 0$ for all $t$ (it is a rate, never negative).
- It has no upper bound (unlike a probability).
- It is not a probability — it is a rate (events per unit time among those at risk).
- Different distributions produce different hazard shapes:
| Distribution | Hazard Shape |
|---|---|
| Exponential | Constant (flat) |
| Weibull ($k > 1$) | Monotonically increasing |
| Weibull ($k < 1$) | Monotonically decreasing |
| Weibull ($k = 1$) | Constant (= Exponential) |
| Log-normal | Initially increases then decreases (hump-shaped) |
| Log-logistic | Initially increases then decreases (hump-shaped) |
| Gompertz | Monotonically increasing (exponentially) |
| Piecewise constant | Constant within intervals, steps between |
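The Weibull rows of this table can be illustrated directly. A small Python sketch (hypothetical scale $\lambda = 10$) evaluates the Weibull hazard $h(t) = (k/\lambda)(t/\lambda)^{k-1}$ for three shape values:

```python
def weibull_hazard(t, lam=10.0, k=1.0):
    """Weibull hazard; k < 1 decreasing, k = 1 constant, k > 1 increasing."""
    return (k / lam) * (t / lam) ** (k - 1)

times = [1.0, 2.0, 4.0, 8.0]
decreasing = [weibull_hazard(t, k=0.5) for t in times]  # early-failure shape
constant   = [weibull_hazard(t, k=1.0) for t in times]  # exponential special case
increasing = [weibull_hazard(t, k=2.0) for t in times]  # wear-out shape
print(decreasing)  # falls toward 0
print(constant)    # flat at 1/lam = 0.1
print(increasing)  # rises linearly in t
```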
3.3 The Cumulative Hazard Function
The cumulative hazard function $H(t)$ is the integral of the hazard function over time:

$$H(t) = \int_0^{t} h(u)\,du$$

It represents the total accumulated risk of the event from time 0 to time $t$.
3.4 The Fundamental Relationship Between $S(t)$, $h(t)$, and $H(t)$
The three functions are mathematically equivalent through the following relationships:
From hazard to survival:

$$S(t) = \exp\left(-\int_0^{t} h(u)\,du\right) = e^{-H(t)}$$

From survival to hazard:

$$h(t) = -\frac{d}{dt}\,\log S(t)$$

From survival to cumulative hazard:

$$H(t) = -\log S(t)$$

From PDF to survival:

$$f(t) = h(t)\,S(t)$$
These relationships form the mathematical backbone of survival analysis. Every parametric model defines one of these functions — the others follow automatically.
3.5 The Mean and Median Survival Time
The mean survival time (restricted to time $\tau$) is the area under the survival curve:

$$\mu_{\tau} = \int_0^{\tau} S(t)\,dt$$

The median survival time is the time at which exactly half of the subjects have experienced the event (i.e., where the survival curve crosses 0.50):

$$t_{0.5} = \inf\{t : S(t) \le 0.50\}$$

Similarly, the $p$-th quantile of survival is:

$$t_{p} = \inf\{t : S(t) \le 1 - p\}$$
💡 The median is usually preferred over the mean for survival data because the mean is sensitive to a few very long survival times (right tail), and may be undefined if the survival curve never reaches zero (i.e., if not all subjects experience the event).
3.6 The Likelihood Contribution of Censored Observations
A key mathematical challenge in survival analysis is constructing the likelihood function correctly for censored observations. Each observation contributes to the likelihood differently:
Uncensored (event occurred at time $t_i$):

$$L_i = f(t_i)$$

Right-censored (event not observed; known to have survived to $t_i$):

$$L_i = S(t_i)$$

General likelihood contribution:

$$L_i = f(t_i)^{\delta_i}\, S(t_i)^{1 - \delta_i}$$

Where $\delta_i = 1$ if the event was observed (uncensored) and $\delta_i = 0$ if censored.

Full likelihood for $n$ independent observations:

$$L(\theta) = \prod_{i=1}^{n} f(t_i)^{\delta_i}\, S(t_i)^{1 - \delta_i}$$

Log-likelihood:

$$\ell(\theta) = \sum_{i=1}^{n} \left[\delta_i \log f(t_i) + (1 - \delta_i) \log S(t_i)\right] = \sum_{i=1}^{n} \left[\delta_i \log h(t_i) - H(t_i)\right]$$
This formulation correctly accounts for the information provided by both events and censored observations.
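For the exponential model this likelihood has a closed-form maximum: $\ell(\lambda) = d \log \lambda - \lambda \sum_i t_i$, so $\hat\lambda = d / \sum_i t_i$, i.e. observed events divided by total follow-up time. A minimal Python sketch with made-up data:

```python
import math

def exponential_mle(times, events):
    """Closed-form MLE for the exponential rate under right-censoring:
    lambda_hat = (number of events) / (total follow-up time)."""
    d = sum(events)          # observed events (delta_i = 1)
    total_time = sum(times)  # events and censored observations both contribute time
    lam_hat = d / total_time
    # log-likelihood: sum_i [delta_i * log(lam)] - lam * sum_i t_i
    loglik = d * math.log(lam_hat) - lam_hat * total_time
    return lam_hat, loglik

times  = [3, 5, 7, 10, 14]   # illustrative follow-up times
events = [1, 0, 1, 1, 0]     # 1 = event observed, 0 = censored
lam_hat, ll = exponential_mle(times, events)
print(lam_hat)  # 3 events / 39 time units ~ 0.0769
```

Note how the censored subjects (5 and 14) lower the estimated rate: they add follow-up time to the denominator without adding events to the numerator.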
3.7 Common Parametric Survival Distributions
Exponential Distribution
$$f(t) = \lambda e^{-\lambda t}, \qquad S(t) = e^{-\lambda t}, \qquad h(t) = \lambda$$

Parameter: $\lambda$ (rate). Mean $= 1/\lambda$.
Weibull Distribution
The Weibull is the most widely used parametric survival model due to its flexibility:

$$S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^{k}\right), \qquad h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}$$

Parameters: $\lambda$ (scale), $k$ (shape).
- $k = 1$: Reduces to exponential (constant hazard).
- $k > 1$: Increasing hazard (e.g., ageing, wear-out).
- $k < 1$: Decreasing hazard (e.g., infant mortality, early failure).
Log-Normal Distribution
$$S(t) = 1 - \Phi\left(\frac{\log t - \mu}{\sigma}\right)$$

Parameters: $\mu$ (log-scale mean), $\sigma$ (log-scale standard deviation); $\Phi$ is the standard normal CDF. Produces a hump-shaped hazard (increases then decreases).
Log-Logistic Distribution
$$S(t) = \frac{1}{1 + (t/\alpha)^{\beta}}$$

Parameters: $\alpha$ (scale), $\beta$ (shape). Produces a hump-shaped hazard when $\beta > 1$ and a monotone decreasing hazard when $\beta \le 1$.
Gompertz Distribution
$$h(t) = \lambda e^{\gamma t}$$

Parameters: $\lambda$ (initial hazard), $\gamma$ (rate of increase). Widely used in actuarial science and demography. Produces a monotonically increasing hazard.
3.8 Summary Table of Parametric Distributions
| Distribution | Parameters | Hazard Shape | Best For |
|---|---|---|---|
| Exponential | $\lambda$ | Constant | Simple baseline; memoryless events |
| Weibull | $\lambda, k$ | Monotone (increasing/decreasing/flat) | Most general-purpose analyses |
| Log-Normal | $\mu, \sigma$ | Hump (up then down) | Medical remission, biological processes |
| Log-Logistic | $\alpha, \beta$ | Hump (or decreasing) | Similar to log-normal; has closed-form $S(t)$ |
| Gompertz | $\lambda, \gamma$ | Monotone increasing | Mortality, ageing studies |
| Generalised Gamma | 3 parameters | Very flexible | Model selection; nests Weibull, log-normal |
4. Assumptions of Survival Analysis
4.1 Non-Informative Censoring
The most critical assumption in survival analysis is that censoring is non-informative — the reason an observation is censored must be independent of the true (unobserved) event time.
Examples of non-informative censoring (acceptable):
- Study ends on a fixed calendar date and the participant has not yet experienced the event.
- Participant is still being followed at the time of analysis.
- Participant was lost to follow-up for administrative reasons (e.g., moved country).
Examples of informative censoring (problematic):
- Participants drop out of the study because they are getting sicker (their censoring time is related to their imminent event time).
- A machine is removed from service because it is showing signs of failure.
⚠️ Informative censoring leads to biased estimates of the survival function and hazard. There is no purely statistical remedy — the study design must ensure non-informative censoring.
4.2 Independence of Observations
Each individual's survival time must be independent of all others. This is violated in:
- Clustered data: Patients treated at the same hospital, family members, same litter animals.
- Recurrent events: Multiple events per individual.
- Matched data: Matched pairs or matched sets.
Solutions include frailty models (random effects), marginal models, and conditional models.
4.3 Proportional Hazards (Cox Model)
When using the Cox proportional hazards model (Section 10), the key assumption is that the hazard ratio between any two individuals is constant over time:

$$\frac{h_i(t)}{h_j(t)} = \text{constant for all } t$$

This means the hazard functions of any two individuals are proportional — they never cross. This is a strong assumption that must be checked (see Section 10.6).
4.4 Correct Parametric Form (Parametric Models)
When using parametric models, the assumed distribution must adequately describe the data. Choosing an incorrect parametric family leads to biased estimates. This must be checked via:
- Graphical diagnostics (log-log plots, hazard plots).
- Formal goodness-of-fit tests.
- AIC/BIC comparison between distributions.
4.5 Consistent Definition of Time Origin and Event
The time origin (time zero) must be consistently and clearly defined for all subjects:
- Date of diagnosis.
- Date of surgery.
- Date of study entry.
- Date of birth.
Similarly, the event must be defined unambiguously — the same event criterion must be applied to all subjects.
⚠️ Inconsistent definitions of time origin or event across subjects will produce biased survival estimates and hazard ratios. This is a design issue, not a statistical one.
4.6 Sufficient Follow-Up and Events
Survival analysis requires:
- Sufficient follow-up time so that a meaningful proportion of subjects experience the event.
- A minimum number of observed events (not total subjects) for reliable estimation:
  - Kaplan-Meier curves: At least 20–30 events for stable estimates.
  - Cox model: At least 10 events per predictor variable (EPV).
  - Parametric models: At least 50–100 events for reliable parameter estimation.
5. Types of Survival Analysis Methods
Survival analysis methods are broadly classified by the degree of parametric assumption they make:
5.1 Non-Parametric Methods
Non-parametric methods make no assumptions about the shape of the survival distribution or hazard function. They let the data speak for themselves.
| Method | Purpose |
|---|---|
| Kaplan-Meier Estimator | Estimate the survival curve for one or more groups |
| Nelson-Aalen Estimator | Estimate the cumulative hazard function |
| Log-Rank Test | Compare survival curves between two or more groups |
| Wilcoxon (Breslow) Test | Compare survival curves with emphasis on early times |
| Peto-Peto Test | Compare survival curves with modified weighting |
Advantages: No distributional assumptions; easy to visualise.
Disadvantages: Cannot adjust for multiple covariates simultaneously; no regression
coefficients.
5.2 Semi-Parametric Methods
Semi-parametric methods make some parametric assumptions (about the covariate effects) but leave the baseline hazard unspecified.
| Method | Purpose |
|---|---|
| Cox Proportional Hazards Model | Assess covariate effects on survival while controlling for confounders |
| Cox Model with Time-Varying Covariates | Covariates that change in value over follow-up |
| Stratified Cox Model | Allow baseline hazard to differ across strata |
| Frailty Models (Semi-Parametric) | Account for unobserved heterogeneity and clustering |
Advantages: Handles multiple covariates; yields hazard ratios; robust (no distribution
assumption).
Disadvantages: Cannot directly predict survival times; hazard ratios assume proportionality.
5.3 Parametric Methods
Parametric methods assume a specific distributional form for the survival times. They are more efficient (better precision) when the assumed distribution is correct.
| Method | Purpose |
|---|---|
| Exponential Regression | Simplest parametric model; constant hazard |
| Weibull Regression | Flexible monotone hazard; generalises exponential |
| Log-Normal Regression | Hump-shaped hazard; log-transformed normal model |
| Log-Logistic Regression | Hump-shaped hazard; has closed-form survival function |
| Gompertz Regression | Exponentially increasing hazard; ageing/mortality |
| Generalised Gamma | Very flexible; nests several other distributions |
| Accelerated Failure Time (AFT) Models | Covariate effects accelerate or decelerate time |
Advantages: Can predict absolute survival times; extrapolate beyond observed data; more
efficient with correct specification.
Disadvantages: Wrong distributional assumption leads to biased results; more sensitive to
model misspecification.
5.4 The Method Selection Framework
```
Subject 1: |—————————————X|                    Event at t=12
Subject 2: |————————————————————O|             Censored at t=24
Subject 3: |——————X|                           Event at t=6
Subject 4: |————————————————————————————O|     Censored at t=36
Subject 5: |—————————————————X|                Event at t=18

X = Event observed    O = Censored
```
💡 Always plot the event timeline before analysis. It reveals data quality issues such as negative times, very short follow-up, or suspiciously high censoring proportions.
7.4 The Risk Set
The risk set at time $t$ is the set of all subjects who are:
- Still under observation at time $t$ (have not been censored before $t$), AND
- Have not yet experienced the event before time $t$.
The number at risk decreases over time as subjects experience the event or are censored. A number-at-risk table is routinely displayed below Kaplan-Meier plots to communicate how many subjects contribute information at each time point.
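A risk set is simple to compute from observed times. The sketch below (Python, with illustrative data matching the worked example of Section 8.2) counts the subjects whose observed time, whether event or censoring, is at least $t$:

```python
def risk_set(t, observed_times):
    """Indices of subjects still at risk just before time t:
    those whose observed time (event or censoring) is >= t."""
    return [i for i, obs in enumerate(observed_times) if obs >= t]

# Illustrative observed times for 10 subjects (events and censorings mixed)
observed = [3, 5, 7, 10, 10, 14, 18, 22, 25, 30]
print(len(risk_set(3, observed)))   # 10: everyone is at risk at the first event
print(len(risk_set(10, observed)))  # 7
print(len(risk_set(25, observed)))  # 2
```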
8. Non-Parametric Methods: Kaplan-Meier Estimation
8.1 The Kaplan-Meier (Product-Limit) Estimator
The Kaplan-Meier (KM) estimator is the most widely used method for estimating the survival function non-parametrically. It is also called the product-limit estimator because it computes $\hat S(t)$ as a product of conditional survival probabilities at each event time.

Let $t_1 < t_2 < \dots < t_k$ be the distinct ordered event times (only times at which events occur, not censoring times). At each event time $t_j$, let:
- $d_j$ = number of events (deaths) occurring at $t_j$.
- $n_j$ = number of subjects at risk just before $t_j$ (the risk set size).

The conditional probability of the event at $t_j$ (given survival to $t_j$):

$$q_j = \frac{d_j}{n_j}$$

The conditional probability of surviving past $t_j$ (given survival to $t_j$):

$$p_j = 1 - \frac{d_j}{n_j}$$

The Kaplan-Meier survival estimate at time $t$ is the product of all conditional survival probabilities at event times up to and including $t$:

$$\hat S(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$$
8.2 Worked Computation of the KM Estimator
Suppose $n = 10$ subjects are followed, with the following event/censoring times (sorted); the tenth subject is still event-free when follow-up ends after $t = 25$:

| Time $t_j$ | Status | $n_j$ | $d_j$ | $1 - d_j/n_j$ | $\hat S(t_j)$ |
|---|---|---|---|---|---|
| 3 | 1 (event) | 10 | 1 | 0.9000 | 0.9000 |
| 5 | 0 (censor) | — | — | — | 0.9000 |
| 7 | 1 (event) | 8 | 1 | 0.8750 | 0.7875 |
| 10 | 1 (event) | 7 | 2 | 0.7143 | 0.5625 |
| 14 | 0 (censor) | — | — | — | 0.5625 |
| 18 | 1 (event) | 4 | 1 | 0.7500 | 0.4219 |
| 22 | 0 (censor) | — | — | — | 0.4219 |
| 25 | 1 (event) | 2 | 1 | 0.5000 | 0.2109 |
Key notes:
- The subject censored at $t = 5$ contributed to the risk set at $t = 3$ but not at $t = 7$ (they left before $t = 7$).
- The KM estimate changes only at event times, not at censoring times.
- Between event times, $\hat S(t)$ remains constant (step function).
- At $t = 10$, two events occurred simultaneously ($d_j = 2$); $n_j = 7$ is the number at risk just before $t = 10$.
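The computation above can be reproduced in a few lines of Python. This is a minimal product-limit sketch for illustration, not DataStatPro's implementation; it follows the convention that subjects censored exactly at an event time remain in that risk set, and the tenth subject's censoring time is taken as 30 purely to complete the data:

```python
def kaplan_meier(times, events):
    """Product-limit estimate; returns {event_time: S_hat(event_time)}."""
    s, curve = 1.0, {}
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for obs in times if obs >= t)  # at risk just before t
        d = sum(1 for obs, e in zip(times, events) if obs == t and e == 1)
        s *= 1 - d / n                           # multiply in the conditional survival
        curve[t] = s
    return curve

# Data matching the worked example: events at 3, 7, 10 (x2), 18, 25
times  = [3, 5, 7, 10, 10, 14, 18, 22, 25, 30]
events = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]
km = kaplan_meier(times, events)
print(km)  # {3: 0.9, 7: 0.7875, 10: 0.5625, 18: 0.421875, 25: 0.2109375}
```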
8.3 Confidence Intervals for the KM Estimator
Greenwood's Formula for the variance of $\hat S(t)$:

$$\widehat{\operatorname{Var}}\left[\hat S(t)\right] = \hat S(t)^2 \sum_{j:\, t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$

Plain (linear) confidence interval:

$$\hat S(t) \pm 1.96\, \sqrt{\widehat{\operatorname{Var}}\left[\hat S(t)\right]}$$

This can produce intervals outside $[0, 1]$ for extreme times. The log-log transformed confidence interval is preferred because it always stays within $[0, 1]$:

Log-log (complementary log-log) transformation:

Let $v(t) = \log\left(-\log \hat S(t)\right)$. The variance is approximated by:

$$\widehat{\operatorname{Var}}\left[v(t)\right] = \frac{\sum_{j:\, t_j \le t} \frac{d_j}{n_j (n_j - d_j)}}{\left[\log \hat S(t)\right]^2}$$

The 95% CI for $v(t)$ is $v(t) \pm 1.96\,\widehat{\operatorname{SE}}[v(t)]$, which transforms back to a CI for $\hat S(t)$ that stays within $[0, 1]$:

$$\left[\hat S(t)^{\exp\left(1.96\,\widehat{\operatorname{SE}}[v(t)]\right)},\;\; \hat S(t)^{\exp\left(-1.96\,\widehat{\operatorname{SE}}[v(t)]\right)}\right]$$
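Greenwood's variance and the log-log interval can be folded into the same pass over the event times. A Python sketch (illustrative only, using the data from the worked example of Section 8.2):

```python
import math

def km_with_loglog_ci(times, events, z=1.96):
    """KM estimate with Greenwood sum and log-log 95% CI at each event time.
    Returns {event_time: (S_hat, lower, upper)}."""
    s, gw_sum, out = 1.0, 0.0, {}
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for obs in times if obs >= t)
        d = sum(1 for obs, e in zip(times, events) if obs == t and e == 1)
        s *= 1 - d / n
        gw_sum += d / (n * (n - d))                   # Greenwood accumulator
        if 0 < s < 1:
            se_v = math.sqrt(gw_sum) / abs(math.log(s))  # SE on log(-log S) scale
            lo = s ** math.exp(z * se_v)              # transform back: stays in (0, 1)
            hi = s ** math.exp(-z * se_v)
            out[t] = (s, lo, hi)
    return out

times  = [3, 5, 7, 10, 10, 14, 18, 22, 25, 30]
events = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]
ci = km_with_loglog_ci(times, events)
for t, (s, lo, hi) in ci.items():
    print(t, round(s, 4), round(lo, 4), round(hi, 4))
```

Both CI bounds remain inside $(0, 1)$ by construction, which is exactly why the log-log form is preferred.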
8.4 The Nelson-Aalen Estimator of the Cumulative Hazard
An alternative non-parametric estimator is the Nelson-Aalen estimator of the cumulative hazard function:

$$\hat H(t) = \sum_{j:\, t_j \le t} \frac{d_j}{n_j}$$

The corresponding survival estimate derived from the Nelson-Aalen estimator is:

$$\tilde S(t) = e^{-\hat H(t)}$$
The Nelson-Aalen estimator is less biased than the KM estimator in small samples and is particularly useful for:
- Plotting the cumulative hazard to assess parametric distribution fit (see Section 12).
- Computing the baseline cumulative hazard in the Cox model.
- Assessing the proportional hazards assumption graphically.
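A Nelson-Aalen sketch in Python (same illustrative data as the Kaplan-Meier example), together with the survival estimate $e^{-\hat H(t)}$ derived from it:

```python
import math

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard: H_hat(t) = sum over event times <= t of d_j/n_j."""
    H, curve = 0.0, {}
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for obs in times if obs >= t)  # at risk just before t
        d = sum(1 for obs, e in zip(times, events) if obs == t and e == 1)
        H += d / n
        curve[t] = H
    return curve

times  = [3, 5, 7, 10, 10, 14, 18, 22, 25, 30]
events = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]
na = nelson_aalen(times, events)
S_fh = math.exp(-na[25])   # survival derived from the cumulative hazard
print(na[25], S_fh)        # H(25) = 1/10 + 1/8 + 2/7 + 1/4 + 1/2 ~ 1.261
```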
8.5 Median Survival Time and Confidence Interval
The KM estimate of the median survival time is the smallest time $t$ such that $\hat S(t) \le 0.50$:

$$\hat t_{0.5} = \min\{t : \hat S(t) \le 0.50\}$$

The 95% confidence interval for the median is computed using the Brookmeyer-Crowley method, which inverts the confidence band for $\hat S(t)$ to find the times corresponding to the upper and lower CI bounds crossing $0.50$.
⚠️ The median survival time is undefined (cannot be estimated) if the KM curve never drops to or below 0.50. This happens when fewer than half the subjects experience the event.
8.6 Reading and Reporting a Kaplan-Meier Curve
A complete KM plot includes:
- The step function — drops at each event time.
- Confidence band (typically 95%) shown as a shaded region or dashed lines.
- Censoring marks — small vertical tick marks at censored times on the survival curve.
- Number at risk table — shown below the x-axis at selected time points.
- Median survival time — horizontal dashed line at $\hat S(t) = 0.50$, with the median read off the x-axis.
9. Comparing Survival Curves: Log-Rank and Related Tests
9.1 The Log-Rank Test
The log-rank test (also called the Mantel-Cox test) is the standard non-parametric test for comparing survival curves between two or more groups. It tests the null hypothesis:

$$H_0: S_1(t) = S_2(t) = \dots = S_k(t) \quad \text{for all } t$$

At each distinct event time in the combined dataset, the test compares the observed number of events in each group to the expected number under $H_0$ (if all groups had the same survival function).

For each event time $t_j$ and group $g$, the expected number of events under $H_0$ is:

$$E_{gj} = \frac{n_{gj}\, d_j}{n_j}$$

Where:
- $n_{gj}$ = number at risk in group $g$ at time $t_j$.
- $d_j$ = total number of events at $t_j$ across all groups.
- $n_j$ = total number at risk at $t_j$ across all groups.

The log-rank test statistic for two groups ($g = 1, 2$) is:

$$\chi^2_{\text{LR}} = \frac{\left[\sum_j \left(O_{1j} - E_{1j}\right)\right]^2}{\sum_j V_j}$$

Where $V_j$ is the hypergeometric variance at event time $t_j$:

$$V_j = \frac{n_{1j}\, n_{2j}\, d_j\, (n_j - d_j)}{n_j^2\, (n_j - 1)}$$

Under $H_0$, the statistic follows approximately a $\chi^2$ distribution with $k - 1$ degrees of freedom (where $k$ is the number of groups).
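The two-group statistic can be assembled directly from these formulas. A self-contained Python sketch with hypothetical data; the 1-df $\chi^2$ tail probability uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$:

```python
import math

def logrank_two_groups(times1, events1, times2, events2):
    """Two-group log-rank test. Returns (chi2, p) with 1 degree of freedom."""
    all_event_times = sorted(
        {t for t, e in zip(times1 + times2, events1 + events2) if e == 1})
    O1 = E1 = V = 0.0
    for t in all_event_times:
        n1 = sum(1 for obs in times1 if obs >= t)   # at risk, group 1
        n2 = sum(1 for obs in times2 if obs >= t)   # at risk, group 2
        d1 = sum(1 for obs, e in zip(times1, events1) if obs == t and e == 1)
        d2 = sum(1 for obs, e in zip(times2, events2) if obs == t and e == 1)
        n, d = n1 + n2, d1 + d2
        O1 += d1
        E1 += n1 * d / n                            # expected events in group 1
        if n > 1:
            V += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))  # hypergeometric variance
    chi2 = (O1 - E1) ** 2 / V
    p = math.erfc(math.sqrt(chi2 / 2))              # chi-square tail, 1 df
    return chi2, p

# Hypothetical groups: group 1 fails early, group 2 later (all events observed)
chi2, p = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(chi2, p)  # O1 = 3, E1 = 1.15, V = 0.6775 -> chi2 ~ 5.05, p < 0.05
```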
9.2 Weighted Log-Rank Tests
The log-rank test gives equal weight to all event times. Alternative tests weight earlier or later event times more heavily:
| Test | Weight $w_j$ | Emphasises | Best When |
|---|---|---|---|
| Log-Rank (Mantel-Cox) | $w_j = 1$ | All times equally | Groups differ throughout |
| Breslow (Wilcoxon) | $w_j = n_j$ | Early times | Groups differ early |
| Peto-Peto | $w_j = \tilde S(t_j)$ | Early times | Groups differ early (modified) |
| Fleming-Harrington | $w_j = \hat S(t_j)^{p}\left(1 - \hat S(t_j)\right)^{q}$ | Flexible | User-specified weighting |
The Fleming-Harrington test with parameters $p$ and $q$ is the most flexible:
- $p = 0, q = 0$: Log-rank test.
- $p = 1, q = 0$: Breslow (Wilcoxon)-type test.
- $p = 0, q = 1$: Emphasises late differences.
- $p = 1, q = 1$: Emphasises middle differences.
9.3 Stratified Log-Rank Test
When groups differ on a confounding variable (a variable that affects survival and is unevenly distributed across groups), the stratified log-rank test controls for the confounder by computing the log-rank statistic separately within each stratum and then combining:

$$\chi^2_{\text{strat}} = \frac{\left[\sum_s \sum_j \left(O_{1js} - E_{1js}\right)\right]^2}{\sum_s \sum_j V_{js}}$$

Where $s$ indexes the strata.
9.4 Interpreting the Log-Rank Test
| Result | Interpretation |
|---|---|
| $p < 0.05$ | Reject $H_0$ — significant evidence that survival curves differ between groups |
| $p \ge 0.05$ | Fail to reject $H_0$ — insufficient evidence of a difference |
⚠️ The log-rank test is most powerful when hazard ratios are constant over time (proportional hazards). If survival curves cross (non-proportional hazards), the log-rank test may be underpowered. In this case, use the Fleming-Harrington test with appropriate weights, or report the weighted tests alongside the standard log-rank.
9.5 Pairwise Comparisons After a Significant Log-Rank Test
When comparing $k > 2$ groups and the overall log-rank test is significant, pairwise log-rank tests can identify which specific groups differ. Apply a Bonferroni correction (or similar multiple comparison adjustment) to control the family-wise error rate:

$$\alpha^{*} = \frac{\alpha}{m}, \qquad m = \frac{k(k-1)}{2} \text{ pairwise comparisons}$$

For example, with $k = 4$ groups: $m = 6$, so $\alpha^{*} = 0.05 / 6 \approx 0.0083$.
10. Semi-Parametric Methods: Cox Proportional Hazards Model
10.1 The Cox Model
The Cox proportional hazards model (Cox, 1972) is the most widely used regression model in survival analysis. It relates the hazard at time $t$ for individual $i$ to their covariate values $x_{i1}, \dots, x_{ip}$:

$$h_i(t) = h_0(t)\, \exp\left(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\right)$$

Or equivalently:

$$\log \frac{h_i(t)}{h_0(t)} = \beta_1 x_{i1} + \dots + \beta_p x_{ip} = \boldsymbol\beta^\top \mathbf{x}_i$$

Where:
- $h_0(t)$ is the baseline hazard function — the hazard for an individual with all covariates equal to zero. It is left completely unspecified (non-parametric) — this is what makes the Cox model semi-parametric.
- $\boldsymbol\beta = (\beta_1, \dots, \beta_p)$ is the vector of regression coefficients to be estimated.
- $\exp(\boldsymbol\beta^\top \mathbf{x}_i)$ is the relative risk (risk multiplier) for individual $i$ relative to the baseline.
10.2 The Proportional Hazards Property
The ratio of hazards between any two individuals $i$ and $j$ is constant over time:

$$\frac{h_i(t)}{h_j(t)} = \frac{h_0(t)\, e^{\boldsymbol\beta^\top \mathbf{x}_i}}{h_0(t)\, e^{\boldsymbol\beta^\top \mathbf{x}_j}} = e^{\boldsymbol\beta^\top (\mathbf{x}_i - \mathbf{x}_j)}$$

The $h_0(t)$ terms cancel — the hazard ratio depends only on the covariate difference, not on time. This is the proportional hazards assumption: the hazard functions of any two individuals are proportional to each other (they never cross).
10.3 The Hazard Ratio (HR)
For a single binary covariate $x$ (e.g., treatment vs. control, coded 1 vs. 0):

$$\text{HR} = \frac{h(t \mid x = 1)}{h(t \mid x = 0)} = e^{\beta}$$
Interpretation of the hazard ratio:
| HR | Interpretation |
|---|---|
| $\text{HR} > 1$ | Covariate increases the hazard (increases risk of event) |
| $\text{HR} = 1$ | Covariate has no effect on the hazard |
| $\text{HR} < 1$ | Covariate decreases the hazard (protective effect) |
For a continuous covariate $x$: $e^{\beta}$ is the HR for a one-unit increase in $x$, holding all other covariates constant.

The confidence interval for the HR is:

$$\exp\left(\hat\beta \pm 1.96\, \widehat{\operatorname{SE}}(\hat\beta)\right)$$

If the CI does not include 1, the covariate is statistically significant at the 5% level.
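Converting a fitted coefficient into a hazard ratio with its CI is a single exponentiation. A tiny Python sketch with hypothetical estimates ($\hat\beta = -0.693$, $\text{SE} = 0.2$):

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Hazard ratio and 95% CI from a Cox coefficient and its standard error."""
    hr = math.exp(beta)
    return hr, math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical Cox output for a treatment indicator
hr, lo, hi = hazard_ratio_ci(-0.693, 0.2)
print(hr, lo, hi)  # HR ~ 0.50, CI ~ (0.34, 0.74): excludes 1, protective effect
```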
10.4 The Partial Likelihood
Because the baseline hazard $h_0(t)$ is left unspecified, standard MLE cannot be applied directly to the Cox model. Instead, Cox (1972) proposed the partial likelihood — a likelihood that eliminates $h_0(t)$ and depends only on $\boldsymbol\beta$.

The partial likelihood is constructed by considering, at each event time $t_{(i)}$, the probability that it was individual $(i)$ who experienced the event, given that exactly one event occurred at $t_{(i)}$ among all individuals at risk:

$$L_{(i)}(\boldsymbol\beta) = \frac{e^{\boldsymbol\beta^\top \mathbf{x}_{(i)}}}{\sum_{j \in R(t_{(i)})} e^{\boldsymbol\beta^\top \mathbf{x}_j}}$$

The full partial likelihood (product over all event times):

$$L_P(\boldsymbol\beta) = \prod_{i=1}^{D} \frac{e^{\boldsymbol\beta^\top \mathbf{x}_{(i)}}}{\sum_{j \in R(t_{(i)})} e^{\boldsymbol\beta^\top \mathbf{x}_j}}$$

The log partial likelihood:

$$\ell_P(\boldsymbol\beta) = \sum_{i=1}^{D} \left[\boldsymbol\beta^\top \mathbf{x}_{(i)} - \log \sum_{j \in R(t_{(i)})} e^{\boldsymbol\beta^\top \mathbf{x}_j}\right]$$

Where $D$ is the total number of events and $R(t_{(i)})$ is the risk set at $t_{(i)}$.
The coefficient estimates $\hat{\boldsymbol\beta}$ are found by maximising $\ell_P(\boldsymbol\beta)$ using iterative numerical methods (e.g., Newton-Raphson).
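For a single covariate without tied event times, the Newton-Raphson iteration is short enough to write out in full. The Python sketch below (toy data, illustrative only) maximises the log partial likelihood by iterating $\beta \leftarrow \beta + U(\beta)/I(\beta)$, where $U$ is the score (sum of covariate residuals at event times) and $I$ the observed information:

```python
import math

def cox_newton(times, events, x, iters=25):
    """Newton-Raphson maximisation of the Cox log partial likelihood
    for a single covariate, assuming no tied event times."""
    def score_info(beta):
        U, I = 0.0, 0.0
        for ti, ei, xi in zip(times, events, x):
            if ei != 1:
                continue                      # censored subjects enter risk sets only
            risk = [(xj, math.exp(beta * xj))
                    for tj, xj in zip(times, x) if tj >= ti]
            w = sum(r for _, r in risk)
            xbar  = sum(xj * r for xj, r in risk) / w        # risk-weighted mean of x
            x2bar = sum(xj * xj * r for xj, r in risk) / w
            U += xi - xbar                    # score contribution at this event
            I += x2bar - xbar ** 2            # information (weighted variance of x)
        return U, I

    beta = 0.0
    for _ in range(iters):
        U, I = score_info(beta)
        beta += U / I                         # Newton step
    return beta, score_info(beta)[0]

times  = [1, 2, 3, 4, 5, 6]   # toy data, no ties
events = [1, 1, 1, 1, 1, 1]
x      = [1, 0, 1, 0, 1, 0]
beta_hat, score_at_max = cox_newton(times, events, x)
print(beta_hat, score_at_max)  # beta_hat ~ 0.63; score ~ 0 at the maximum
```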
10.5 Handling Ties
When multiple events occur at the same time (tied event times), the exact partial likelihood becomes computationally intractable. Three commonly used approximations are:
Breslow's approximation (default in most software):

$$L_B(\boldsymbol\beta) = \prod_{j} \frac{e^{\boldsymbol\beta^\top \mathbf{s}_j}}{\left[\sum_{l \in R(t_j)} e^{\boldsymbol\beta^\top \mathbf{x}_l}\right]^{d_j}}$$

Where $\mathbf{s}_j$ is the sum of covariate vectors for all individuals who experienced the event at $t_j$.

Efron's approximation (more accurate with many ties):

$$L_E(\boldsymbol\beta) = \prod_{j} \frac{e^{\boldsymbol\beta^\top \mathbf{s}_j}}{\prod_{r=0}^{d_j - 1} \left[\sum_{l \in R(t_j)} e^{\boldsymbol\beta^\top \mathbf{x}_l} - \frac{r}{d_j} \sum_{l \in D_j} e^{\boldsymbol\beta^\top \mathbf{x}_l}\right]}$$

Where $D_j$ is the set of individuals experiencing the event at $t_j$.
Exact (discrete) method: Computes the exact combinatorial probability — accurate but computationally very expensive for large numbers of ties.
💡 Efron's approximation is generally preferred over Breslow's when there are many tied event times. DataStatPro uses Efron's approximation by default.
10.6 Testing the Proportional Hazards Assumption
The proportional hazards (PH) assumption is critical. If violated, hazard ratio estimates are biased and not meaningful. Multiple methods exist to test it:
Method 1: Schoenfeld Residuals Test
The Schoenfeld residual for individual $i$ (with an event at time $t_i$) and covariate $k$ is the difference between the observed covariate value and its expected value under the model:

$$r_{ik} = x_{ik} - \frac{\sum_{j \in R(t_i)} x_{jk}\, e^{\hat{\boldsymbol\beta}^\top \mathbf{x}_j}}{\sum_{j \in R(t_i)} e^{\hat{\boldsymbol\beta}^\top \mathbf{x}_j}}$$

If PH holds, the Schoenfeld residuals for covariate $k$ should be uncorrelated with time.

The Grambsch-Therneau test formally tests this by assessing the correlation between the scaled Schoenfeld residuals and (a transformation of) time.

A significant test ($p < 0.05$) for covariate $k$ indicates a violation of the PH assumption for that covariate.

Graphically: Plot the scaled Schoenfeld residuals against time (or log-time). A flat, horizontal smoothed line supports PH; a clear trend suggests violation.
Method 2: Log-Log Survival Plot
Plot $\log\left(-\log \hat S(t)\right)$ (the complementary log-log of the KM survival estimate) against $\log t$ for each group. Under proportional hazards, these lines should be approximately parallel:

$$\log\left(-\log S(t \mid x = 1)\right) - \log\left(-\log S(t \mid x = 0)\right) = \beta \quad \text{(constant over } t\text{)}$$

If the log-log curves are parallel, the PH assumption holds for that grouping variable.
Method 3: Time-Covariate Interaction
Add an interaction between each covariate and time (or $\log t$) to the Cox model:

$$h_i(t) = h_0(t)\, \exp\left(\beta_1 x_i + \beta_2\, x_i \cdot g(t)\right)$$

Where $g(t)$ is a function of time (e.g., $t$ or $\log t$). A significant $\beta_2$ indicates that the effect of covariate $x$ changes over time (PH violated).
Remedies When PH is Violated
| Violation Type | Remedy |
|---|---|
| One covariate violates PH | Stratify by that covariate (stratified Cox model) |
| PH violated for all covariates | Use parametric AFT model instead |
| Time-varying effect | Add time × covariate interaction; use time-varying Cox model |
| Crossing hazards | Use restricted mean survival time (RMST) as effect measure |
10.7 Residuals in the Cox Model
Several types of residuals are available for diagnosing the Cox model:
Martingale Residuals:

$$\hat M_i = \delta_i - \hat H_0(t_i)\, e^{\hat{\boldsymbol\beta}^\top \mathbf{x}_i}$$

Range from $-\infty$ to $1$. Used to:
- Assess functional form of continuous covariates (plot $\hat M_i$ vs. $x_k$ — a non-linear pattern suggests a non-linear transformation is needed).
- Identify outliers (very negative values indicate subjects whose event occurred much later than the model predicted).

Deviance Residuals:

$$D_i = \operatorname{sign}(\hat M_i)\, \sqrt{-2\left[\hat M_i + \delta_i \log\left(\delta_i - \hat M_i\right)\right]}$$

Approximately normally distributed around 0. Values with $|D_i|$ greater than roughly 2 suggest potential outliers.

Score (Schoenfeld) Residuals:

As described in Section 10.6. Used for testing the PH assumption.

Dfbeta Residuals (Influence Diagnostics):

Estimate how much $\hat\beta_k$ would change if observation $i$ were deleted. Large absolute values indicate influential observations.
10.8 The Baseline Survival Function
After estimating $\hat{\boldsymbol\beta}$, the baseline cumulative hazard is estimated using the Breslow estimator, and the baseline survival function follows:

$$\hat H_0(t) = \sum_{j:\, t_j \le t} \frac{d_j}{\sum_{l \in R(t_j)} e^{\hat{\boldsymbol\beta}^\top \mathbf{x}_l}}, \qquad \hat S_0(t) = e^{-\hat H_0(t)}$$

The survival function for an individual with covariates $\mathbf{x}$ is then:

$$\hat S(t \mid \mathbf{x}) = \hat S_0(t)^{\exp\left(\hat{\boldsymbol\beta}^\top \mathbf{x}\right)}$$
10.9 The Stratified Cox Model
When a covariate violates the PH assumption, it can be used as a stratifying variable instead of a covariate. The stratified Cox model allows a separate baseline hazard $h_{0s}(t)$ for each stratum $s$, while constraining the regression coefficients $\boldsymbol\beta$ to be equal across strata:

$$h_{is}(t) = h_{0s}(t)\, e^{\boldsymbol\beta^\top \mathbf{x}_i}$$
This approach:
- Relaxes the PH assumption for the stratifying variable.
- Still assumes PH for all included covariates.
- Does not estimate a coefficient for the stratifying variable (it is absorbed into the baseline).
11. Parametric Survival Models
11.1 Parametric Proportional Hazards (PH) Formulation
In parametric PH models, the hazard function is:

$$h(t \mid \mathbf{x}) = h_0(t; \boldsymbol\alpha)\, e^{\boldsymbol\beta^\top \mathbf{x}}$$

Where $h_0(t; \boldsymbol\alpha)$ is a fully specified baseline hazard with parameters $\boldsymbol\alpha$ (e.g., the Weibull shape parameter $k$). Both $\boldsymbol\alpha$ and $\boldsymbol\beta$ are estimated simultaneously via MLE.

Weibull PH model (the most common parametric PH model):

$$h(t \mid \mathbf{x}) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{\boldsymbol\beta^\top \mathbf{x}}$$
11.2 The Accelerated Failure Time (AFT) Model
An alternative parametric formulation is the Accelerated Failure Time (AFT) model, which models the effect of covariates on the log of the survival time directly:

$$\log T_i = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_p x_{ip} + \sigma\, \varepsilon_i$$

Or equivalently:

$$T_i = T_{0i}\, e^{\boldsymbol\alpha^\top \mathbf{x}_i}$$

Where $T_{0i}$ is the baseline survival time and $\sigma$ is a scale parameter.

The survival function in the AFT model:

$$S(t \mid \mathbf{x}) = S_0\left(t\, e^{-\boldsymbol\alpha^\top \mathbf{x}}\right)$$

The acceleration factor for covariate $x_k$:

$$\text{AF}_k = e^{\alpha_k}$$
Interpretation of AFT coefficients:
| Coefficient | Interpretation |
|---|---|
| $\alpha_k > 0$ | Covariate slows down the event (time is stretched — beneficial) |
| $\alpha_k = 0$ | No effect on survival time |
| $\alpha_k < 0$ | Covariate speeds up the event (time is compressed — harmful) |
💡 The AFT model is more intuitive when covariates do not satisfy the PH assumption. Instead of multiplicative effects on the hazard, it gives multiplicative effects on time itself.
11.3 Relationship Between PH and AFT Formulations
The Weibull model has the unique property of satisfying both the PH and AFT formulations:

| Formulation | Coefficient Relationship |
|---|---|
| PH form: $h(t \mid x) = h_0(t)\, e^{\beta x}$ | $\beta$ acts on the log hazard |
| AFT form: $\log T = \alpha x + \dots$ | $\alpha$ acts on $\log(\text{time})$ |
| Relationship | $\beta = -k\, \alpha$ |

Where $k$ is the Weibull shape parameter.
For the exponential, log-normal, log-logistic, and generalised gamma, the AFT formulation is the standard one. For the Weibull and Gompertz, both formulations are available.
11.4 Parametric Model Likelihood
For parametric models, the full MLE log-likelihood for $n$ subjects is:

$$\ell(\boldsymbol\theta) = \sum_{i=1}^{n} \left[\delta_i \log h(t_i; \boldsymbol\theta) - H(t_i; \boldsymbol\theta)\right]$$

Where $\boldsymbol\theta$ contains all model parameters.

For the Weibull model with shape $k$ and scale $\lambda$:

$$\ell(\lambda, k) = \sum_{i=1}^{n} \left[\delta_i \left(\log k - k \log \lambda + (k - 1) \log t_i\right) - \left(\frac{t_i}{\lambda}\right)^{k}\right]$$
11.5 Predicting Survival Times From Parametric Models
A major advantage of parametric models over the Cox model is the ability to predict survival times for new individuals.
For an individual with covariates $x$:

Predicted survival function:

$$\hat S(t \mid x) = \exp\!\left(-\hat\lambda\, e^{\hat\beta^\top x}\, t^{\hat k}\right)$$

Predicted median survival time:

Solve $\hat S(t_{\mathrm{med}} \mid x) = 0.5$:

$$\hat t_{\mathrm{med}} = \left(\frac{\log 2}{\hat\lambda\, e^{\hat\beta^\top x}}\right)^{1/\hat k}$$

Predicted mean survival time (Weibull):

$$\hat E[T \mid x] = \left(\hat\lambda\, e^{\hat\beta^\top x}\right)^{-1/\hat k}\, \Gamma(1 + 1/\hat k)$$

Where $\Gamma(\cdot)$ is the gamma function.
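These prediction formulas can be computed directly with the standard library; a minimal sketch assuming the Weibull PH parameterisation above, with hypothetical fitted values:

```python
import math

def weibull_predict(lam, k, beta, x, t=None):
    """Predicted quantities from a Weibull PH model with
    S(t|x) = exp(-lam * exp(beta.x) * t**k)."""
    rate = lam * math.exp(sum(b * xi for b, xi in zip(beta, x)))
    median = (math.log(2) / rate) ** (1 / k)          # solve S(t) = 0.5
    mean = rate ** (-1 / k) * math.gamma(1 + 1 / k)   # Weibull mean
    surv = math.exp(-rate * t ** k) if t is not None else None
    return {"median": median, "mean": mean, "survival": surv}

# Hypothetical fitted values: lam = 0.05, shape k = 1.5, one covariate with beta = -0.4
pred = weibull_predict(lam=0.05, k=1.5, beta=[-0.4], x=[1], t=12)
```

With $k = 1$ the formulas collapse to the exponential case (median $= \log 2 / \lambda$, mean $= 1/\lambda$), which is a convenient sanity check.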
12. Model Fit and Evaluation
12.1 Graphical Diagnostics for Parametric Distribution Selection
Before fitting a parametric model, graphical methods help identify which distribution is most appropriate.
Log-Log (Weibull) Plot:

If the Weibull distribution is appropriate, a plot of $\log[-\log \hat S(t)]$ against $\log t$ should be approximately linear:

$$\log[-\log S(t)] = \log \lambda + k \log t$$

This is a straight line with slope $k$ and intercept $\log \lambda$.

Log-Normal Plot:

If the log-normal distribution is appropriate, a plot of $\Phi^{-1}[1 - \hat S(t)]$ against $\log t$ should be approximately linear.

Log-Logistic Plot:

If the log-logistic distribution is appropriate, a plot of $\log\left[\frac{1 - \hat S(t)}{\hat S(t)}\right]$ against $\log t$ should be approximately linear.

Cumulative Hazard Plot:

Plot $\hat H(t)$ (Nelson-Aalen estimate) against $t$, or $\log \hat H(t)$ against $\log t$:

- $\hat H(t)$ linear against $t$ → Exponential.
- $\log \hat H(t)$ linear against $\log t$ → Weibull.
- S-shaped → Log-logistic or log-normal.
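As a sanity check on the log-log diagnostic, survival values generated from an exact Weibull survival function give a perfectly linear $\log[-\log S(t)]$ vs. $\log t$ relationship whose least-squares slope recovers the shape parameter. A minimal sketch (the $\lambda$ and $k$ values are illustrative):

```python
import math

# Exact Weibull survival values: S(t) = exp(-lam * t**k), lam = 0.02, k = 1.8
lam, k = 0.02, 1.8
times = [2, 4, 8, 16, 32]
S = [math.exp(-lam * t ** k) for t in times]

# Transform: y = log(-log S(t)) should be linear in x = log t with slope k
xs = [math.log(t) for t in times]
ys = [math.log(-math.log(s)) for s in S]

# Ordinary least-squares slope and intercept
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx  # should recover log(lam)
```

In practice $\hat S(t)$ comes from the Kaplan-Meier estimate, so the points will scatter around a line rather than fall exactly on it.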
12.2 AIC and BIC for Model Selection
The AIC and BIC are used to compare parametric models with different distributional assumptions (or different numbers of parameters):

$$\mathrm{AIC} = -2\ell(\hat\theta) + 2p \qquad \mathrm{BIC} = -2\ell(\hat\theta) + p \log n$$

Where $p$ is the number of estimated parameters and $n$ is the sample size (the number of events rather than total observations, in some implementations).
Lower AIC/BIC indicates a better-fitting, more parsimonious model.
💡 AIC and BIC can compare models from different distributional families (e.g., Weibull vs. log-normal) as long as they use the same outcome and data — unlike the likelihood ratio test, which requires nested models.
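A minimal computation of AIC/BIC and the resulting model ranking; the log-likelihood and sample-size values below are illustrative:

```python
import math

def aic(loglik, p):
    """Akaike Information Criterion: -2*loglik + 2*p."""
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    """Bayesian Information Criterion: -2*loglik + p*log(n)."""
    return -2 * loglik + p * math.log(n)

# Hypothetical fits to the same data: (log-likelihood, number of parameters)
candidates = {"exponential": (-241.3, 2), "weibull": (-233.8, 3)}
aics = {name: aic(ll, p) for name, (ll, p) in candidates.items()}
best = min(aics, key=aics.get)  # lowest AIC wins
```

Note that both models must be fitted to exactly the same observations for the comparison to be valid.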
12.3 Likelihood Ratio Test for Nested Models
When comparing nested parametric models (one model is a restricted version of another), the likelihood ratio test (LRT) formally tests whether the additional parameters significantly improve fit:

$$\Lambda = -2\left[\ell(\hat\theta_{\text{restricted}}) - \ell(\hat\theta_{\text{full}})\right]$$

Under $H_0$, $\Lambda$ follows a $\chi^2_q$ distribution with $q$ degrees of freedom ($q$ = number of additional parameters in the full model).

Example: Testing whether the Weibull model fits significantly better than the exponential (which is a special case with $k = 1$):

$$\Lambda = -2\left[\ell_{\text{exponential}} - \ell_{\text{Weibull}}\right] \sim \chi^2_1 \ \text{under } H_0$$
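For a single extra parameter the LRT p-value needs no statistics library, since the $\chi^2_1$ survival function is $\operatorname{erfc}(\sqrt{x/2})$. A sketch with illustrative log-likelihoods:

```python
import math

def lrt_statistic(loglik_restricted, loglik_full):
    """Likelihood ratio statistic: -2 * (l_restricted - l_full)."""
    return -2 * (loglik_restricted - loglik_full)

def chi2_sf_df1(x):
    """Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

# Exponential (restricted, k = 1) vs. Weibull (full): hypothetical log-likelihoods
stat = lrt_statistic(-241.3, -233.8)
p = chi2_sf_df1(stat)
```

For more than one restricted parameter, use a proper $\chi^2_q$ survival function (e.g., `scipy.stats.chi2.sf`).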
12.4 Cox-Snell Residuals for Overall Fit
The Cox-Snell residuals can be used to assess the overall fit of any survival model (Cox or parametric):

$$r_i = \hat H(t_i \mid x_i)$$

If the model fits well, the Cox-Snell residuals should follow a unit exponential distribution (i.e., $\mathrm{Exp}(1)$).

Check: Plot the Nelson-Aalen estimate of the cumulative hazard of the Cox-Snell residuals against the residuals themselves. If the model fits well, this plot should fall approximately on the line $y = x$ (the 45-degree line).
12.5 Concordance Index (C-statistic)
The concordance index (C-index) measures the discriminative ability of a survival model — how well it ranks individuals by their predicted risk. It is the survival analysis analogue of the AUC for binary outcomes:

$$C = \frac{\sum_{i,j} \mathbb{1}(t_i < t_j)\, \mathbb{1}(\hat\eta_i > \hat\eta_j)\, \delta_i}{\sum_{i,j} \mathbb{1}(t_i < t_j)\, \delta_i}$$

Where $\hat\eta_i$ is the predicted linear predictor (risk score) for individual $i$.

| C-index | Interpretation |
|---|---|
| $0.5$ | No discrimination (random) |
| $0.5$–$0.6$ | Poor |
| $0.6$–$0.7$ | Acceptable |
| $0.7$–$0.8$ | Good |
| $0.8$–$0.9$ | Excellent |
| $> 0.9$ | Outstanding |
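Harrell's C can be computed by counting usable pairs directly; a minimal sketch on toy data (real implementations also handle tied times and weighting):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among usable pairs (the earlier time is an observed event),
    count pairs where the higher risk score failed first; ties count half."""
    concordant = ties = usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:  # pair is usable
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / usable

# Toy data: higher risk score should mean earlier event; subject 3 is censored
c = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1],
                      risk_scores=[3.0, 2.0, 1.5, 1.0])
```

Here the ranking is perfect, so `c` is 1.0; a model that ranks subjects backwards would score 0.0.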
12.6 The Global Likelihood Ratio Test (Cox Model)
For the Cox model, three omnibus tests assess whether any covariate is significantly associated with survival:
Likelihood Ratio (LR) Test:

$$\Lambda_{LR} = -2\left[\ell(0) - \ell(\hat\beta)\right] \sim \chi^2_p$$

Wald Test:

$$\Lambda_W = \hat\beta^\top \left[\widehat{\mathrm{Var}}(\hat\beta)\right]^{-1} \hat\beta \sim \chi^2_p$$

Score (Log-Rank) Test:

$$\Lambda_S = U(0)^\top\, \mathcal{I}(0)^{-1}\, U(0) \sim \chi^2_p$$

Where $U(\beta)$ is the score vector and $\mathcal{I}(\beta)$ is the information matrix, both evaluated at $\beta = 0$.

All three tests are asymptotically equivalent. The LR test is generally considered the most reliable in finite samples.
12.7 Calibration: Observed vs. Predicted Survival
Calibration assesses whether the predicted survival probabilities match the observed survival rates. A well-calibrated model should show that, among individuals predicted to have (for example) a 70% chance of surviving 5 years, approximately 70% actually do survive 5 years.
The Hosmer-Lemeshow-style calibration test for survival groups subjects by predicted risk (deciles) and compares observed and expected survival in each group; the D'Agostino-Nam test formalises this comparison for censored survival data.
13. Advanced Topics
13.1 Time-Varying Covariates
In some studies, covariate values change during follow-up (e.g., treatment dose, blood pressure, smoking status). These are time-varying covariates and must be incorporated into the Cox model using a counting process data format.
In the counting process format, each period of follow-up during which a covariate value is constant becomes a separate row:
| ID | Start | Stop | Event | Treatment | Dose |
|---|---|---|---|---|---|
| 1 | 0 | 6 | 0 | A | 50mg |
| 1 | 6 | 12 | 1 | A | 75mg |
| 2 | 0 | 10 | 0 | B | 100mg |
| 2 | 10 | 24 | 0 | B | 80mg |
The counting process Cox model for time-varying covariates $x(t)$:

$$h(t \mid x(t)) = h_0(t)\, \exp\!\left(\beta^\top x(t)\right)$$
⚠️ Careful: Internal time-varying covariates (values affected by the individual's own disease process — e.g., a biomarker) can introduce bias. Only external covariates (unaffected by the subject's status) can be safely included as time-varying covariates without special considerations.
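Restructuring a subject's record into counting-process rows can be sketched as follows; the helper name and input layout are illustrative, not a library API:

```python
def to_counting_process(subject_id, end_time, status, covariate_changes):
    """Expand one subject into (id, start, stop, event, value) rows.
    covariate_changes: [(change_time, value), ...] sorted, first entry at time 0.
    The event flag is set only on the final interval."""
    rows = []
    for i, (start, value) in enumerate(covariate_changes):
        last = i + 1 == len(covariate_changes)
        stop = end_time if last else covariate_changes[i + 1][0]
        event = status if last else 0
        rows.append((subject_id, start, stop, event, value))
    return rows

# Subject 1 from the table above: dose raised at month 6, event at month 12
rows = to_counting_process(1, 12, 1, [(0, "50mg"), (6, "75mg")])
```

The resulting rows match the first two lines of the counting-process table: `(1, 0, 6, 0, "50mg")` and `(1, 6, 12, 1, "75mg")`.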
13.2 Frailty Models (Random Effects for Survival)
Frailty models extend the Cox model to account for unobserved heterogeneity among subjects or clustering (e.g., patients within hospitals, multiple events per subject). A random frailty term $u_i$ is introduced:

$$h_i(t \mid x_i, u_i) = u_i\, h_0(t)\, e^{\beta^\top x_i}$$

Where $u_i$ is a random variable (the "frailty") assumed to follow a specific distribution:

- Gamma frailty (most common): $u_i \sim \mathrm{Gamma}(1/\theta, 1/\theta)$, with $E[u_i] = 1$ and $\mathrm{Var}(u_i) = \theta$. The marginal survival function has a closed form.
- Log-normal frailty: $\log u_i \sim N(0, \sigma^2)$.
- Inverse-Gaussian frailty.

The frailty variance $\theta$ (or $\sigma^2$) quantifies the degree of unobserved heterogeneity. If $\theta = 0$, the frailty model reduces to the standard Cox model.

The marginal survival function (Gamma frailty):

$$S(t \mid x) = \left[1 + \theta\, H_0(t)\, e^{\beta^\top x}\right]^{-1/\theta}$$
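The closed-form gamma-frailty marginal survival is easy to evaluate, and letting $\theta \to 0$ recovers the ordinary Cox survival; a minimal sketch with illustrative inputs:

```python
import math

def marginal_survival_gamma_frailty(H0_t, theta, lin_pred=0.0):
    """Marginal S(t|x) = (1 + theta * H0(t) * exp(beta.x)) ** (-1/theta).
    As theta -> 0, this tends to exp(-H0(t) * exp(beta.x)) (no frailty)."""
    z = H0_t * math.exp(lin_pred)
    if theta == 0:
        return math.exp(-z)
    return (1 + theta * z) ** (-1 / theta)

# Illustrative values: baseline cumulative hazard 0.5, frailty variance 1
s_frail = marginal_survival_gamma_frailty(H0_t=0.5, theta=1.0)
s_cox = marginal_survival_gamma_frailty(H0_t=0.5, theta=0.0)
```

Note `s_frail > s_cox` here: marginalising over frailty flattens the population-level hazard, because high-frailty subjects fail early and leave a hardier risk set.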
13.3 Competing Risks
Competing risks arise when multiple mutually exclusive event types are possible, and the occurrence of one event precludes the observation of others. For example, in a study of cancer-specific mortality, deaths from other causes are competing risks.
In the presence of competing risks, the standard KM estimator overestimates the probability of the event of interest because it treats competing events as independent censoring (which they are not — they actually prevent the event of interest from occurring).
The Cause-Specific Hazard for event type $k$:

$$h_k(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ D = k \mid T \ge t)}{\Delta t}$$

The Cumulative Incidence Function (CIF) (also called the subdistribution or cause-specific CIF) gives the probability of experiencing event $k$ by time $t$ in the presence of competing risks:

$$\mathrm{CIF}_k(t) = \int_0^t S(u)\, h_k(u)\, du$$

Where $S(u)$ is the overall survival function.

Note that $\sum_k \mathrm{CIF}_k(\infty) = 1$ (if all events are eventual) and individual CIFs do not sum to 1 at any finite time $t$.
The Gray Test compares CIFs between groups in the presence of competing risks (analogous to the log-rank test for the standard case).
Fine and Gray's Subdistribution Hazard Model directly models the effect of covariates on the CIF for event type $k$ via the subdistribution hazard:

$$h_k^{\mathrm{sd}}(t \mid x) = h_{k,0}^{\mathrm{sd}}(t)\, e^{\beta^\top x}$$
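At discrete event times the CIF integral becomes a running sum, $\mathrm{CIF}_k(t) = \sum_{t_i \le t} \hat S(t_{i-1})\, d_{ki}/n_i$, where the overall $\hat S$ uses events of all types. A minimal Aalen-Johansen-style sketch on a toy risk table (two event types, ten subjects):

```python
def cumulative_incidence(records):
    """CIF for two competing event types from (time, n_at_risk, d_event1, d_event2)
    rows sorted by time; censoring shows up only through shrinking risk sets."""
    surv = 1.0            # overall survival just before each event time
    cif1 = cif2 = 0.0
    out = []
    for t, n, d1, d2 in records:
        cif1 += surv * d1 / n          # CIF increment: S(t-) * d_k / n
        cif2 += surv * d2 / n
        surv *= 1 - (d1 + d2) / n      # overall KM step uses all event types
        out.append((t, cif1, cif2, surv))
    return out

# Toy risk table: 10 subjects, event type 1 at t=1 and t=3, type 2 at t=2
table = cumulative_incidence([(1, 10, 1, 0), (2, 9, 0, 2), (3, 7, 1, 0)])
```

With no censoring before the last event time, the two CIFs plus overall survival sum to 1, which is the identity the naive one-cause KM estimator violates.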
13.4 Restricted Mean Survival Time (RMST)
When the proportional hazards assumption is violated and hazard ratios are not interpretable, the Restricted Mean Survival Time (RMST) provides an alternative summary measure:

$$\mathrm{RMST}(\tau) = \int_0^\tau S(t)\, dt$$

Where $\tau$ is a pre-specified restriction time (the maximum follow-up horizon of interest).

RMST is the area under the survival curve up to time $\tau$. It has a direct interpretation as the average event-free time up to $\tau$.

Difference in RMST between two groups (e.g., treatment vs. control):

$$\Delta(\tau) = \mathrm{RMST}_{\text{treatment}}(\tau) - \mathrm{RMST}_{\text{control}}(\tau)$$

This represents the average additional event-free time gained by the treatment group up to time $\tau$, regardless of proportionality.

Estimated RMST:

$$\widehat{\mathrm{RMST}}(\tau) = \int_0^\tau \hat S(t)\, dt$$

The variance is estimated using the delta method applied to Greenwood's formula.
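Because the KM curve is a step function, the RMST integral is just a sum of rectangle areas; a minimal sketch with a hypothetical two-step curve:

```python
def rmst(times, surv, tau):
    """Area under a step survival curve up to tau.
    times/surv: KM drop times and the survival values after each drop;
    S(t) = 1 before the first drop."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)     # final rectangle up to tau
    return area

# Toy KM curve: drops to 0.8 at t = 2 and to 0.5 at t = 5; restriction tau = 8
value = rmst([2, 5], [0.8, 0.5], tau=8)
```

Here the area is $1 \times 2 + 0.8 \times 3 + 0.5 \times 3 = 5.9$ event-free months out of a possible 8.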
14. Worked Examples
Example 1: Kaplan-Meier Estimation and Log-Rank Test — Two Treatment Groups
A clinical trial of patients with advanced lung cancer randomises participants to either a new chemotherapy regimen (Group A, $n = 20$) or standard care (Group B, $n = 20$). Survival time (in months) is recorded.
Study Summary:
- Group A: 14 events, 6 censored.
- Group B: 17 events, 3 censored.
Step 1 — Compute the Kaplan-Meier Estimates (abbreviated)
Group A (selected event times):
| Time (m) | $n_i$ | $d_i$ | $1 - d_i/n_i$ | $\hat S(t)$ | 95% CI |
|---|---|---|---|---|---|
| 2 | 20 | 1 | 0.950 | 0.950 | (0.828, 1.000) |
| 5 | 18 | 2 | 0.889 | 0.844 | (0.674, 0.944) |
| 9 | 15 | 1 | 0.933 | 0.788 | (0.608, 0.906) |
| 14 | 12 | 1 | 0.917 | 0.722 | (0.528, 0.862) |
| 22 | 8 | 2 | 0.750 | 0.542 | (0.337, 0.723) |
| 31 | 4 | 1 | 0.750 | 0.406 | (0.197, 0.636) |
Group B (selected event times):
| Time (m) | $n_i$ | $d_i$ | $1 - d_i/n_i$ | $\hat S(t)$ | 95% CI |
|---|---|---|---|---|---|
| 1 | 20 | 2 | 0.900 | 0.900 | (0.755, 0.978) |
| 4 | 17 | 2 | 0.882 | 0.794 | (0.617, 0.912) |
| 7 | 14 | 2 | 0.857 | 0.680 | (0.489, 0.824) |
| 12 | 10 | 2 | 0.800 | 0.544 | (0.347, 0.724) |
| 18 | 6 | 2 | 0.667 | 0.363 | (0.182, 0.586) |
| 25 | 3 | 1 | 0.667 | 0.242 | (0.071, 0.524) |
Median Survival Time:
- Group A: $\hat t_{\mathrm{med}} = 26$ months (95% CI: 18, —)
- Group B: $\hat t_{\mathrm{med}} = 16$ months (95% CI: 10, 24)
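The product-limit values in the Group A table can be reproduced with a few lines; a minimal sketch using the (time, at-risk, events) triples from the table above:

```python
def kaplan_meier(risk_table):
    """Product-limit estimate from (time, n_at_risk, d_events) rows,
    sorted by event time. Censoring enters via the shrinking n_at_risk."""
    s = 1.0
    out = []
    for t, n, d in risk_table:
        s *= 1 - d / n              # multiply in the conditional survival
        out.append((t, round(s, 3)))
    return out

# Group A risk table from the worked example
group_a = kaplan_meier([(2, 20, 1), (5, 18, 2), (9, 15, 1),
                        (14, 12, 1), (22, 8, 2), (31, 4, 1)])
```

The output matches the $\hat S(t)$ column: 0.950, 0.844, 0.788, 0.722, 0.542, 0.406.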
Step 2 — Log-Rank Test
: The survival distributions are identical in Groups A and B.
Computing observed and expected events at each event time in the combined dataset:

$$O_A = \sum_i d_{Ai} \quad \text{(observed events in Group A)}$$

$$E_A = \sum_i \frac{n_{Ai}\, d_i}{n_i} \quad \text{(expected events in Group A under } H_0\text{)}$$

Interpretation: The log-rank test is not statistically significant at the 5% level ($p > 0.05$), though there is a trend favouring Group A. The median survival time is 10 months longer for Group A (26 vs. 16 months). With a larger sample, this difference might reach statistical significance. The Kaplan-Meier curves show higher survival probabilities for Group A at all time points.
Example 2: Cox Proportional Hazards Model — Predicting Time to Readmission
A hospital study follows patients after discharge from heart failure hospitalisation. The event of interest is 30-day readmission. Predictors include age (years), sex (0 = female, 1 = male), number of comorbidities (count), and treatment type (A = reference, B, C).
Model Results:
| Variable | $\hat\beta$ | SE | $z$ | $p$ | HR $= e^{\hat\beta}$ | 95% CI for HR |
|---|---|---|---|---|---|---|
| Age (years) | 0.031 | 0.011 | 2.82 | 0.005 | 1.031 | (1.009, 1.054) |
| Sex (Male) | 0.287 | 0.142 | 2.02 | 0.043 | 1.332 | (1.009, 1.759) |
| Comorbidities | 0.198 | 0.063 | 3.14 | 0.002 | 1.219 | (1.078, 1.379) |
| Treatment B | -0.412 | 0.158 | -2.61 | 0.009 | 0.662 | (0.487, 0.900) |
| Treatment C | -0.681 | 0.171 | -3.98 | 0.001 | 0.506 | (0.362, 0.707) |
Global LR Test ($\chi^2$ with 5 df): statistically significant — at least one covariate significantly predicts readmission.
C-index: good discriminative ability.
Interpretation of Key Coefficients:
- Age: For each additional year of age, the hazard of readmission increases by 3.1% (95% CI: 0.9%, 5.4%), holding all other covariates constant. This effect is statistically significant ($p = 0.005$).
- Sex (Male): Males have a 33.2% higher hazard of readmission than females (HR $= 1.332$, 95% CI: 1.009, 1.759; $p = 0.043$), after adjusting for age, comorbidities, and treatment.
- Comorbidities: Each additional comorbidity increases the hazard of readmission by 21.9% (HR $= 1.219$, 95% CI: 1.078, 1.379; $p = 0.002$).
- Treatment B: Patients on Treatment B have a 33.8% lower hazard of readmission compared to Treatment A (HR $= 0.662$, 95% CI: 0.487, 0.900; $p = 0.009$). Treatment B is significantly protective relative to Treatment A.
- Treatment C: Patients on Treatment C have a 49.4% lower hazard compared to Treatment A (HR $= 0.506$, 95% CI: 0.362, 0.707; $p = 0.001$). Treatment C provides the strongest protection against readmission among the three treatments.
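The HR and CI columns in the results table follow directly from each $\hat\beta$ and its SE; a minimal check using the age coefficient from the table above:

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Wald-type HR and 95% CI: exp(beta), exp(beta - z*se), exp(beta + z*se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Age row: beta = 0.031, SE = 0.011
hr, lo, hi = hazard_ratio_ci(0.031, 0.011)
```

Rounded to three decimals this reproduces the reported HR of 1.031 with CI (1.009, 1.054); the same one-liner checks every other row.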
Proportional Hazards Assessment:
Grambsch-Therneau test results:
| Variable | $\hat\rho$ (correlation with time) | $\chi^2$ | $p$ |
|---|---|---|---|
| Age | 0.041 | 0.28 | 0.597 |
| Sex | -0.063 | 0.67 | 0.413 |
| Comorbidities | 0.082 | 1.12 | 0.290 |
| Treatment B | 0.051 | 0.44 | 0.507 |
| Treatment C | 0.097 | 1.58 | 0.209 |
| GLOBAL | — | 4.82 | 0.437 |
All individual tests and the global test are non-significant — the proportional hazards assumption is supported for all covariates.
Predicted Survival:
For a 65-year-old male with 3 comorbidities on Treatment C:

$$\hat\eta = 0.031(65) + 0.287(1) + 0.198(3) - 0.681 = 2.215$$

$$\hat S(30 \mid x) = \hat S_0(30)^{\exp(2.215)}$$

Where $\hat S_0(30)$ is the estimated baseline 30-day survival probability.
Example 3: Weibull Parametric Model — Time to Equipment Failure
An engineering study monitors industrial machines for failure. Follow-up is 18 months. The predictor is machine type (Standard = reference, Enhanced).
Distribution Selection (AIC comparison):
| Distribution | Log-Likelihood | Parameters | AIC |
|---|---|---|---|
| Exponential | -241.3 | 2 | 486.6 |
| Weibull | -233.8 | 3 | 473.6 |
| Log-Normal | -237.1 | 3 | 480.2 |
| Log-Logistic | -235.9 | 3 | 477.8 |
| Gompertz | -234.5 | 3 | 475.0 |
Decision: Weibull has the lowest AIC (473.6) → select Weibull model.
LRT: Weibull vs. Exponential:

$$\Lambda = -2\left[(-241.3) - (-233.8)\right] = 15.0, \quad p < 0.001 \ (\chi^2_1)$$

The Weibull fits significantly better than the exponential — the hazard is not constant.
Weibull Model Results:
| Parameter | Estimate | SE | 95% CI |
|---|---|---|---|
| Intercept ($\mu$) | 2.841 | 0.201 | (2.447, 3.235) |
| Enhanced vs. Standard ($\alpha$) | 0.624 | 0.183 | (0.265, 0.983) |
| Shape ($k$) | 1.82 | 0.189 | (1.481, 2.230) |
Shape parameter: $\hat k = 1.82 > 1$ → Increasing hazard — failure risk increases over time, consistent with mechanical wear-out.
Acceleration Factor (AFT interpretation):

$$\widehat{\mathrm{AF}} = e^{\hat\alpha} = e^{0.624} = 1.867$$

Enhanced machines have failure times that are 86.7% longer than Standard machines on average (95% CI: 1.303, 2.674). In other words, enhanced machines last approximately 1.87 times as long.
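The acceleration factor and its CI follow directly from the AFT coefficient and SE reported above; a minimal check:

```python
import math

# Enhanced vs. Standard from the Weibull results table: alpha = 0.624, SE = 0.183
alpha, se = 0.624, 0.183
af = math.exp(alpha)                                  # acceleration factor
ci = (math.exp(alpha - 1.96 * se), math.exp(alpha + 1.96 * se))
```

Exponentiating the coefficient's Wald CI endpoints reproduces the reported time-ratio CI to rounding.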
Predicted Median Failure Times:
- Standard: $\hat t_{\mathrm{med}} = 11.7$ months
- Enhanced: $\hat t_{\mathrm{med}} = 21.8$ months

Standard machines have a predicted median failure time of 11.7 months, while Enhanced machines have a predicted median of 21.8 months.
15. Common Mistakes and How to Avoid Them
Mistake 1: Treating Censored Observations as Events or Excluding Them
Problem: A common error among beginners is either incorrectly coding censored observations as
events (which dramatically overestimates the hazard) or excluding censored observations entirely
(which introduces severe selection bias — the excluded subjects are typically the longest-surviving
ones).
Solution: Include all observations and code censored subjects correctly (status = 0). The
statistical methods are specifically designed to use the partial information provided by censored
observations.
Mistake 2: Using the Wrong Time Origin
Problem: Using different time origins for different subjects (e.g., some measured from
diagnosis, others from treatment start) without harmonising the time scale. This creates an
inconsistent and uninterpretable survival analysis.
Solution: Define the time origin before data collection and apply it consistently to all
subjects. Document the chosen time origin clearly in the methods section.
Mistake 3: Confusing Censoring and Missing Data
Problem: Treating censored observations as missing (removing them) or treating truly missing
data as censored (including them with a specific censoring time) are both errors that lead to
biased results.
Solution:
- Censored = subject left the study without experiencing the event during follow-up → include with status = 0.
- Missing = event status or time is unknown for administrative/data quality reasons → treat as missing data (impute or exclude with justification), not censored.
Mistake 4: Not Checking the Proportional Hazards Assumption
Problem: Reporting Cox model hazard ratios without testing the PH assumption. If the assumption
is violated, the hazard ratio is a biased average of a time-varying effect and is not meaningfully
interpretable.
Solution: Always perform and report Schoenfeld residual tests and log-log plots for
all covariates. If PH is violated, use stratification, time-varying coefficients, or switch to
an AFT model.
Mistake 5: Applying the Standard KM Estimator With Competing Risks
Problem: Using the Kaplan-Meier estimator to estimate the probability of an event when
competing risks are present. KM treats competing events as independent censoring, which
overestimates the cumulative incidence of the event of interest.
Solution: Use the Cumulative Incidence Function (CIF) estimator (Aalen-Johansen method)
when competing risks are present. Use the Gray test for group comparisons and the Fine-Gray model
for regression.
Mistake 6: Using Too Few Events for the Cox Model (Overfitting)
Problem: Fitting a Cox model with many covariates but few events (e.g., 5 covariates with only
20 events). This leads to overfitting — coefficients and hazard ratios are unstable and will not
replicate.
Solution: Follow the 10 Events Per Variable (EPV) rule. With 20 events, include at most 2
covariates. For small event counts, use penalised regression (ridge, LASSO) or Bayesian methods.
Mistake 7: Treating the Log-Rank Test as the Only Comparison Method
Problem: Reporting only the log-rank test when survival curves are non-proportional (crossing
curves). The log-rank test is least powerful precisely when hazard ratios are not constant.
Solution: Always plot the KM curves first. If curves cross or show non-proportional patterns,
report the Fleming-Harrington test with appropriate weights, or compare using RMST
differences, which do not require the PH assumption.
Mistake 8: Ignoring the Effect of Left Truncation
Problem: In prevalence or registry studies, subjects are often only enrolled after surviving
a certain threshold (e.g., alive at time of registration). Ignoring this creates immortal time
bias — subjects who died early are systematically excluded.
Solution: Use left-truncated survival data methods. In the counting process format, specify the
entry time (left truncation time) for each subject. This correctly adjusts the risk sets to only
include subjects who were observable at each event time.
Mistake 9: Extrapolating Parametric Models Without Caution
Problem: Using parametric survival models to predict survival far beyond the observed
follow-up period without acknowledging the uncertainty of such extrapolation.
Solution: Always explicitly state the observed follow-up range. Clearly label any predictions
beyond observed follow-up as extrapolations with wide uncertainty. Validate predictions in
external data when possible. Use sensitivity analysis with alternative distributions.
Mistake 10: Not Reporting Confidence Intervals for Survival Estimates
Problem: Reporting point estimates of $\hat S(t)$ or hazard ratios without confidence
intervals, giving a false impression of precision.
Solution: Always report confidence intervals for all survival estimates: KM survival
probabilities (Greenwood/log-log CIs), median survival times (Brookmeyer-Crowley CIs), and
hazard ratios (Wald CIs). Include the number at risk at key time points on KM plots.
16. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Negative or zero survival times | Data entry errors; incorrect time origin | Clean data; check for recording errors; redefine time origin |
| KM curve never reaches 0.50 | Fewer than 50% of subjects experience the event | Report that median is not estimable; report restricted mean or 75th percentile |
| KM curve drops abruptly to 0 at the end | Last few observations are all events with no censored subjects near the end | Usually fine; report last observed survival probability |
| Log-rank test significant but KM curves overlap early | Late separation of curves (Weibull with increasing hazard) | Use Fleming-Harrington test with $\rho = 0, \gamma = 1$ to emphasise late differences |
| Log-rank test non-significant but curves look different | Crossing curves; small sample | Report RMST; test for specific crossing patterns; increase sample size |
| Cox model fails to converge | Perfect separation (a covariate perfectly predicts events); very small event count | Remove problematic covariate; use Firth's penalised partial likelihood; reduce covariates |
| Hazard ratios are extremely large or extremely small | Perfect or near-perfect separation; very few events in one category | Check event rates by category; collapse sparse categories; use penalised likelihood |
| Schoenfeld test significant for one covariate | PH assumption violated for that covariate | Stratify by that covariate; add time × covariate interaction; use AFT model |
| All Schoenfeld tests significant | Fundamental PH violation across all covariates | Switch to AFT parametric model or use RMST |
| Deviance residuals show large outliers | Influential observations with unexpected event timing | Investigate outliers clinically; check data quality; use dfbeta diagnostics |
| Very high C-index with few events | Overfitting; possible data leakage | Validate in independent sample; apply shrinkage; check for inadvertent inclusion of outcome-proximate predictors |
| AIC similar across all parametric distributions | Data are not informative enough to discriminate between distributions | Use the most theoretically appropriate distribution; report all models; use the simplest (most parsimonious) model |
| Competing risks suspected but ignored | Multiple event types recorded but not modelled | Use cumulative incidence functions and the Fine-Gray model |
| Left truncation not accounted for | Registry or prevalent cohort study design | Restructure data with entry time; use counting process format |
| Large number of tied event times | Discrete-time outcome (e.g., daily data recorded as monthly) | Use Efron or exact partial likelihood; consider discrete-time survival models |
17. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
| $S(t) = P(T > t) = 1 - F(t)$ | Survival function |
| $h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$ | Hazard function |
| $H(t) = \int_0^t h(u)\, du$ | Cumulative hazard |
| $S(t) = e^{-H(t)}$ | Survival from cumulative hazard |
| $\hat S(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$ | Kaplan-Meier estimator |
| $\hat H(t) = \sum_{t_i \le t} \frac{d_i}{n_i}$ | Nelson-Aalen estimator |
| $\widehat{\mathrm{Var}}[\hat S(t)] = \hat S(t)^2 \sum_{t_i \le t} \frac{d_i}{n_i (n_i - d_i)}$ | Greenwood's formula |
| $\chi^2 = \frac{(O_1 - E_1)^2}{\widehat{\mathrm{Var}}(O_1 - E_1)}$ | Log-rank test statistic |
| $h(t \mid x) = h_0(t)\, e^{\beta^\top x}$ | Cox PH model |
| $\mathrm{HR} = e^{\beta_j}$ | Hazard ratio |
| $\ell(\beta) = \sum_{i: \delta_i = 1} \left[\beta^\top x_i - \log \sum_{j \in R(t_i)} e^{\beta^\top x_j}\right]$ | Cox partial log-likelihood |
| $\hat S(t \mid x) = \hat S_0(t)^{\exp(\hat\beta^\top x)}$ | Cox predicted survival |
| $\delta_i \log f(t_i) + (1 - \delta_i) \log S(t_i)$ | Parametric likelihood contribution |
| $\mathrm{RMST}(\tau) = \int_0^\tau S(t)\, dt$ | Restricted mean survival time |
| $\mathrm{AIC} = -2\ell + 2p$ | Akaike Information Criterion |
| $\mathrm{BIC} = -2\ell + p \log n$ | Bayesian Information Criterion |
Parametric Distribution Quick Reference
| Distribution | Hazard $h(t)$ | Hazard Shape | Parameters |
|---|---|---|---|
| Exponential | $\lambda$ | Constant | $\lambda$ |
| Weibull | $\lambda k t^{k-1}$ | Monotone ↑/↓/flat | $\lambda, k$ |
| Log-Normal | $f(t)/S(t)$ (no closed form) | Hump-shaped | $\mu, \sigma$ |
| Log-Logistic | $\frac{\lambda k t^{k-1}}{1 + \lambda t^k}$ | Hump or decreasing | $\lambda, k$ |
| Gompertz | $\lambda e^{\gamma t}$ | Monotone ↑ | $\lambda, \gamma$ |
Hazard Ratio Interpretation
| HR | Meaning |
|---|---|
| $> 1$ | Higher hazard (increased risk) relative to reference |
| $= 1$ | No effect on hazard |
| $< 1$ | Lower hazard (protective effect) relative to reference |
C-Index Benchmarks
| C-Index | Discrimination |
|---|---|
| $0.5$ | None (random) |
| $0.5$–$0.6$ | Poor |
| $0.6$–$0.7$ | Acceptable |
| $0.7$–$0.8$ | Good |
| $0.8$–$0.9$ | Excellent |
| $> 0.9$ | Outstanding |
Method Selection Guide
| Scenario | Recommended Method |
|---|---|
| Describe survival in one group | Kaplan-Meier |
| Compare survival across groups (unadjusted) | Log-Rank Test |
| Compare when groups differ on a confounder | Stratified Log-Rank |
| Non-proportional hazards in group comparison | Fleming-Harrington / RMST |
| Covariate-adjusted survival analysis | Cox PH Model |
| PH assumption violated | Stratified Cox or AFT Model |
| Predict survival times | Parametric Model |
| Competing events present | Cumulative Incidence Function + Fine-Gray Model |
| Clustered or repeated events | Frailty Model |
| Covariates change over time | Time-Varying Cox Model |
Proportional Hazards Testing Summary
| Method | Tool | PH Holds If |
|---|---|---|
| Schoenfeld residuals | Grambsch-Therneau test | $p > 0.05$ for all covariates |
| Log-log plot | Visual inspection | Lines approximately parallel |
| Time × covariate interaction | LR or Wald test | $p > 0.05$ for all interactions |
Events Per Variable (EPV) Requirements
| Model | Minimum EPV | Recommended EPV |
|---|---|---|
| Cox PH | 10 | 15–20 |
| Parametric | 10 | 15–20 |
| Kaplan-Meier (stable curve) | 20–30 events total | 50+ events total |
| Fine-Gray (competing risks) | 10 per event type | 15–20 per event type |
Censoring Types Summary
| Censoring Type | Description | Example |
|---|---|---|
| Right | Event not yet occurred at end of follow-up | Study ends; patient still alive |
| Left | Event occurred before observation began | Pre-existing infection |
| Interval | Event occurred between two assessment points | Screening study |
| Left truncation | Subject only enters study after surviving to entry time | Registry study |
This tutorial provides a comprehensive foundation for understanding, performing, and interpreting Survival Analysis using the DataStatPro application. For further reading, consult Kleinbaum & Klein's "Survival Analysis: A Self-Learning Text" (3rd ed., 2012), Hosmer, Lemeshow & May's "Applied Survival Analysis" (2nd ed., 2008), or Therneau & Grambsch's "Modeling Survival Data: Extending the Cox Model" (2000). For feature requests or support, contact the DataStatPro team.