
Logistic Regression Tutorial

Understand and use the Logistic Regression analysis feature.

Logistic Regression: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of logistic regression all the way through advanced interpretation, model diagnostics, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Logistic Regression?
  3. The Mathematics Behind Logistic Regression
  4. Assumptions of Logistic Regression
  5. Types of Logistic Regression
  6. Using the Logistic Regression Component
  7. Computational and Formula Details
  8. Model Fit and Evaluation
  9. Classification Metrics and Confusion Matrix
  10. ROC Curve and AUC
  11. Prediction Tool
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into logistic regression, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Probability

Probability is a number between 0 and 1 that describes the likelihood of an event occurring:

- p = 0 means the event never occurs
- p = 0.5 means the event is as likely to occur as not
- p = 1 means the event always occurs

1.2 Odds

Odds are another way to express the likelihood of an event. Instead of asking "what fraction of the time does this happen?", odds ask "how many times more likely is success compared to failure?":

\text{Odds} = \frac{p}{1 - p}

For example, if p = 0.75 (a 75% probability of success):

\text{Odds} = \frac{0.75}{0.25} = 3

This means success is 3 times more likely than failure, often expressed as "3 to 1 odds".

1.3 Log Odds (Logit)

The log odds (also called the logit) is simply the natural logarithm of the odds:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)

| Probability (p) | Odds | Log Odds (Logit) |
|---|---|---|
| 0.10 | 0.111 | -2.197 |
| 0.25 | 0.333 | -1.099 |
| 0.50 | 1.000 | 0.000 |
| 0.75 | 3.000 | 1.099 |
| 0.90 | 9.000 | 2.197 |

Key insight: Log odds range from negative infinity to positive infinity, which makes them suitable for a linear model.
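The probability/odds/logit conversions above can be sketched in a few lines of Python (a minimal illustration for checking the table, not part of DataStatPro):

```python
import math

def prob_to_odds(p):
    """Odds = p / (1 - p)."""
    return p / (1.0 - p)

def prob_to_logit(p):
    """Log odds (logit) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def logit_to_prob(z):
    """Inverse logit: convert log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `prob_to_odds(0.75)` gives 3.0 and `prob_to_logit(0.75)` gives roughly 1.099, matching the table.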

1.4 Why Not Use Linear Regression for Binary Outcomes?

A natural first question is: why can't we just use linear regression to predict a 0/1 outcome?

Linear regression predicts values using:

\hat{Y} = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n

The problem is that linear regression can produce predicted values outside the range [0, 1] — for example, -0.3 or 1.7 — which are meaningless as probabilities. Logistic regression solves this by transforming its output through the logistic (sigmoid) function, which always produces values between 0 and 1.


2. What is Logistic Regression?

Logistic Regression is a statistical method used to model the probability of a binary outcome — that is, an outcome that takes one of exactly two values (e.g., Yes/No, 1/0, True/False, Disease/No Disease).

Despite its name containing "regression," logistic regression is fundamentally a classification algorithm. It estimates the probability that an observation belongs to a particular class.

2.1 Real-World Applications

Logistic regression is one of the most widely used methods in statistics and machine learning. Common applications include:

- Medicine: predicting whether a patient has a disease based on symptoms and test results
- Finance: estimating the probability that a borrower defaults on a loan
- Marketing: predicting whether a customer will respond to a campaign or churn
- Email filtering: classifying messages as spam or not spam

2.2 Binary Outcome Variable

The dependent variable (also called the response or outcome variable) in logistic regression must be binary. By convention:

- 1 represents the event of interest (e.g., disease present, customer clicked)
- 0 represents the absence of the event

The choice of which class is "1" and which is "0" is meaningful — it affects the direction of coefficients. Make sure to define this mapping clearly before running the model.

2.3 Logistic Regression vs. Linear Regression: A Summary

| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Outcome Type | Continuous (e.g., income) | Binary (e.g., yes/no) |
| Predicted Value | Any real number | Probability between 0 and 1 |
| Link Function | Identity | Logit (log odds) |
| Model Fitting Method | Ordinary Least Squares (OLS) | Maximum Likelihood Estimation (MLE) |
| Goodness-of-Fit Metric | R² | Pseudo R², AIC, Log-Likelihood |
| Error Distribution | Normal | Binomial |

3. The Mathematics Behind Logistic Regression

This section builds up the full mathematical framework of logistic regression from scratch.

3.1 The Logistic (Sigmoid) Function

The heart of logistic regression is the logistic function (also called the sigmoid function):

\sigma(z) = \frac{1}{1 + e^{-z}}

Where z is any real number. The logistic function maps any real-valued input z to a value in the range (0, 1), making it ideal for representing a probability.

Key properties of the logistic function:

- \sigma(0) = 0.5 — an input of zero maps to a 50% probability
- \sigma(z) \to 1 as z \to +\infty and \sigma(z) \to 0 as z \to -\infty
- Symmetry: \sigma(-z) = 1 - \sigma(z)
- Its graph is the characteristic S-shaped (sigmoid) curve

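A numerically stable sigmoid is easy to write; this sketch (illustrative only, not DataStatPro's internal code) branches on the sign of z to avoid overflowing exp for large negative inputs:

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z)), numerically stable."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # for very negative z, exp(-z) would overflow; use the equivalent form
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A quick check confirms the properties listed above: `sigmoid(0)` is 0.5, and `sigmoid(-z)` equals `1 - sigmoid(z)`.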
3.2 The Logistic Regression Model

In logistic regression, z is replaced by the linear combination of predictors:

z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

So the predicted probability that the outcome Y = 1 given the predictors is:

p = P(Y=1 \mid X_1, X_2, \dots, X_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}

Where:

- \beta_0 is the intercept (the log odds when all predictors are zero)
- \beta_1, \dots, \beta_n are the coefficients of the predictors
- X_1, \dots, X_n are the independent variables

3.3 The Logit Transformation

Taking the logit (log odds) of the predicted probability linearises the model:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

This is the fundamental equation of logistic regression. On the left side is the log odds; on the right side is a linear combination of predictors — exactly like a linear regression model. This is why it is called logistic regression.

3.4 From Log Odds Back to Probability

Given the logit, you can always convert back to a probability using:

p = \frac{e^{\text{logit}(p)}}{1 + e^{\text{logit}(p)}} = \frac{1}{1 + e^{-\text{logit}(p)}}

3.5 The Likelihood Function and Maximum Likelihood Estimation (MLE)

Logistic regression coefficients are estimated by Maximum Likelihood Estimation (MLE) — the method that finds the values of \beta_0, \beta_1, \dots, \beta_n that make the observed data most probable.

For a dataset of m observations, the likelihood function is:

L(\boldsymbol{\beta}) = \prod_{i=1}^{m} p_i^{y_i} (1 - p_i)^{1 - y_i}

Where:

- y_i is the observed outcome (0 or 1) for observation i
- p_i is the model's predicted probability that y_i = 1

It is more convenient to work with the log-likelihood (since logarithms turn products into sums):

\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]

MLE finds the coefficient vector \boldsymbol{\beta} that maximises \ell(\boldsymbol{\beta}). This cannot be solved analytically (unlike OLS), so iterative algorithms are used.
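The log-likelihood sum translates directly into code. This minimal helper (illustrative, not DataStatPro's implementation) evaluates it for given outcomes and predicted probabilities:

```python
import math

def log_likelihood(y, p):
    """Binary log-likelihood: sum over i of y_i ln(p_i) + (1 - y_i) ln(1 - p_i)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))
```

Probabilities closer to the observed outcomes give a higher (less negative) log-likelihood, which is exactly what MLE exploits.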

3.6 The IRLS Algorithm

The most common algorithm for maximising the log-likelihood is Iteratively Reweighted Least Squares (IRLS), a special case of Newton-Raphson optimisation.

At each iteration t, the algorithm updates the coefficient estimates using:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}^{(t)})

Where:

- \mathbf{X} is the design matrix of predictor values (with a leading column of ones for the intercept)
- \mathbf{W}^{(t)} is a diagonal weight matrix with entries p_i^{(t)}(1 - p_i^{(t)})
- \mathbf{y} is the vector of observed outcomes
- \mathbf{p}^{(t)} is the vector of predicted probabilities at iteration t

The algorithm continues until the change in log-likelihood or coefficients is smaller than a convergence threshold (e.g., 10^{-8}).
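The update rule above can be sketched with NumPy in a few lines. This is a bare-bones illustration of IRLS under the stated formula, not DataStatPro's production code (which also computes standard errors and handles edge cases):

```python
import numpy as np

def fit_logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Fit binary logistic regression by IRLS (Newton-Raphson).

    X: (m, n) predictor matrix WITHOUT an intercept column; y: (m,) array of 0/1.
    Returns the coefficient vector, with beta[0] as the intercept.
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, float)])  # prepend intercept
    y = np.asarray(y, float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current predicted probabilities
        w = p * (1.0 - p)                        # diagonal of the weight matrix W
        H = X.T @ (w[:, None] * X)               # X^T W X
        step = np.linalg.solve(H, X.T @ (y - p))  # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

One useful property for testing: with an intercept in the model, the MLE solution makes the mean fitted probability equal the observed event rate.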

3.7 The Cost (Cross-Entropy) Function

An equivalent way to frame MLE is as minimising the binary cross-entropy loss:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]

This is simply the negative log-likelihood divided by m. Minimising J is equivalent to maximising \ell.


4. Assumptions of Logistic Regression

Logistic regression is a powerful tool, but its results are valid only when certain assumptions are reasonably met. Understanding these assumptions helps you avoid misuse and misinterpretation.

4.1 Binary (or Ordinal/Multinomial) Dependent Variable

The outcome must be binary (two categories). If your outcome has more than two unordered categories, use Multinomial Logistic Regression. If the categories are ordered, use Ordinal Logistic Regression.

4.2 Independence of Observations

Each observation in the dataset must be independent of all others. For example:

- Repeated measurements on the same subject violate this assumption
- Clustered data (e.g., patients within the same hospital) may require mixed or multilevel models instead

4.3 No (or Little) Multicollinearity

Independent variables should not be highly correlated with each other. Severe multicollinearity inflates standard errors, making it harder to determine the individual effect of each predictor. Check using the Variance Inflation Factor (VIF): generally, VIF > 10 is a concern.
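VIF can be computed by regressing each predictor on all the others and using VIF_j = 1 / (1 - R²_j). A small NumPy sketch (illustrative; statistics packages provide equivalent built-ins):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (shape: n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # regress column j on the rest
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2) if r2 < 1 else float("inf"))
    return out
```

Two nearly duplicated columns produce very large VIFs, while an unrelated column stays close to 1.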

4.4 Linearity of Independent Variables and the Log Odds

For continuous predictors, logistic regression assumes a linear relationship between the predictor and the log odds (not the probability). This can be checked with a Box-Tidwell test or by plotting the log odds against each continuous predictor.

4.5 No Extreme Outliers

Logistic regression can be sensitive to extreme outliers in the continuous independent variables. Influential observations should be identified and examined.

4.6 Large Sample Size

Logistic regression requires a reasonably large sample. A common rule of thumb is:

- At least 10–20 events per predictor variable (EPV) — that is, the count of the rarer outcome class divided by the number of predictors should be 10 or more

4.7 No Perfect Separation (Complete Separation)

If a predictor or combination of predictors perfectly separates the two outcome groups, the MLE algorithm will fail to converge (the coefficient estimates grow without bound). This is called complete separation and is a sign that the model is too good, often due to a small sample or a predictor that essentially duplicates the outcome.


5. Types of Logistic Regression

| Type | Outcome Variable | Example |
|---|---|---|
| Binary | Two categories (0 or 1) | Disease: Yes / No |
| Multinomial | Three or more unordered categories | Colour Preference: Red / Green / Blue |
| Ordinal | Three or more ordered categories | Severity: Low / Medium / High |

The DataStatPro application implements Binary Logistic Regression, which is the most common type and the focus of this tutorial.


6. Using the Logistic Regression Component

The Logistic Regression component in the DataStatPro application provides a full end-to-end workflow for performing binary logistic regression on your datasets.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset you want to analyse from the "Dataset" dropdown. The dataset should have at least one binary variable and one or more predictor variables.

Step 2 — Select Independent Variables (X) Select one or more predictor variables from the "Independent Variables (X)" dropdown. These can be:

- Numeric (continuous) variables, such as age or income
- Categorical variables, such as region or treatment group

💡 Tip: Start with variables you have a theoretical reason to believe are associated with the outcome. Avoid blindly throwing in many unrelated predictors.

Step 3 — Select Dependent Variable (Y — Binary) Select the binary outcome variable from the "Dependent Variable (Y — Binary)" dropdown.

⚠️ Important: Make sure you correctly assign which category is 1 and which is 0. This directly affects the direction and interpretation of all coefficients.

Step 4 — Select Base Categories (for Categorical Predictors) For each categorical independent variable with more than two categories, you must specify a base (reference) category. The base category is the group against which all other groups are compared.

💡 Tip: Choose the most natural reference group (e.g., "Placebo" in a drug trial, "Rural" in a location study, or simply the most common category).

Step 5 — Select Confidence Level Choose the desired confidence level for confidence intervals (e.g., 95% is the standard). This affects the confidence intervals reported for each coefficient.

Step 6 — Display Options Select which visualisations and outputs you wish to display (for example, the confusion matrix and the ROC curve).

Step 7 — Run the Analysis Click the "Run Logistic Regression" button. The application will:

  1. Encode categorical variables using dummy coding.
  2. Fit the model using the IRLS algorithm.
  3. Calculate coefficients, standard errors, z-values, p-values, and confidence intervals.
  4. Compute model fit statistics (Log-Likelihood, AIC, Pseudo R²).
  5. Generate the confusion matrix and classification metrics.
  6. Plot the ROC curve and calculate the AUC.

7. Computational and Formula Details

7.1 The Logistic Function and Logit Transformation

The full logistic regression model expresses the log odds as a linear function of predictors:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

Where:

- p is the probability that Y = 1
- \beta_0 is the intercept
- \beta_1, \dots, \beta_n are the coefficients of the predictors X_1, \dots, X_n

Solving for p gives the predicted probability:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}

7.2 Handling Categorical Independent Variables (Dummy Coding)

Categorical independent variables with more than two categories cannot be entered directly into a regression equation. They are converted into a set of dummy (indicator) variables using dummy coding (also known as one-hot encoding with a reference category).

For a categorical variable with k categories, k - 1 dummy variables are created. One category is designated the base (reference) category, and it receives no dummy variable. Each dummy variable is defined as:

D_j = \begin{cases} 1 & \text{if the observation belongs to category } j \\ 0 & \text{otherwise} \end{cases}

Example:

A categorical variable "Region" has three categories: Urban, Suburban, Rural (base).

| Region | D_Urban | D_Suburban |
|---|---|---|
| Urban | 1 | 0 |
| Suburban | 0 | 1 |
| Rural | 0 | 0 |

The model becomes:

\text{logit}(p) = \beta_0 + \beta_1 (\text{Age}) + \beta_2 D_\text{Urban} + \beta_3 D_\text{Suburban}

The coefficients \beta_2 and \beta_3 represent the difference in log odds relative to the Rural base category.

⚠️ Never include all k dummy variables — this creates perfect multicollinearity (the "dummy variable trap"). Always omit one category as the base.
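The Region example can be reproduced with pandas, which many readers will recognise. Dropping the `Region_Rural` column explicitly makes Rural the base category (a sketch of the encoding step, not DataStatPro's internal code):

```python
import pandas as pd

df = pd.DataFrame({"Region": ["Urban", "Suburban", "Rural", "Rural", "Urban"]})

# k = 3 categories -> k - 1 dummy variables; drop the base category explicitly
dummies = pd.get_dummies(df["Region"], prefix="Region")
dummies = dummies.drop(columns=["Region_Rural"])  # Rural becomes the reference
```

An Urban row then has Region_Urban = 1 and Region_Suburban = 0, while a Rural row is all zeros, exactly as in the table above.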

7.3 Interpretation of Coefficients

Log Odds Interpretation

Each coefficient \beta_i represents the change in the log odds of Y = 1 for a one-unit increase in X_i, holding all other variables constant:

\Delta \text{logit}(p) = \beta_i \quad \text{for a one-unit increase in } X_i

Odds Ratio Interpretation

The odds ratio (OR) is obtained by exponentiating the coefficient:

\text{OR}_i = e^{\beta_i}

The odds ratio is the factor by which the odds multiply for a one-unit increase in X_i, holding all other variables constant:

| Odds Ratio Value | Interpretation |
|---|---|
| e^{\beta_i} > 1 | Odds of Y = 1 increase as X_i increases |
| e^{\beta_i} = 1 | Odds of Y = 1 are unchanged (X_i has no effect) |
| e^{\beta_i} < 1 | Odds of Y = 1 decrease as X_i increases |

Example: If \beta_\text{Age} = 0.05, then e^{0.05} \approx 1.051. For each additional year of age, the odds of the event increase by approximately 5.1%, holding other variables constant.

Converting Between Scales

| Scale | Formula | Range |
|---|---|---|
| Log Odds | \beta_i | (-\infty, +\infty) |
| Odds Ratio | e^{\beta_i} | (0, +\infty) |
| Probability (at mean X) | \frac{1}{1 + e^{-\hat{z}}} | (0, 1) |

7.4 Confidence Intervals for Coefficients

A (1 - \alpha) \times 100\% confidence interval for coefficient \beta_i is:

\left[\hat{\beta}_i - z_{\alpha/2} \cdot SE(\hat{\beta}_i), \quad \hat{\beta}_i + z_{\alpha/2} \cdot SE(\hat{\beta}_i)\right]

Where:

- z_{\alpha/2} is the standard normal critical value (1.96 for a 95% interval)
- SE(\hat{\beta}_i) is the standard error of the coefficient estimate

The corresponding confidence interval for the odds ratio is obtained by exponentiating the endpoints:

\left[e^{\hat{\beta}_i - z_{\alpha/2} \cdot SE(\hat{\beta}_i)}, \quad e^{\hat{\beta}_i + z_{\alpha/2} \cdot SE(\hat{\beta}_i)}\right]

💡 If the confidence interval for the odds ratio does not include 1, the predictor is statistically significant at the chosen level.
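The Wald interval and its exponentiation are simple enough to verify by hand. A small helper (illustrative; it hard-codes z critical values for common confidence levels rather than computing them):

```python
import math

def odds_ratio_ci(beta, se, conf=0.95):
    """Wald confidence interval for a coefficient, exponentiated to the OR scale.

    Returns (odds_ratio, ci_lower, ci_upper). Supports conf in {0.90, 0.95, 0.99}.
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[conf]
    lo, hi = beta - z * se, beta + z * se
    return math.exp(beta), math.exp(lo), math.exp(hi)
```

With beta = 0.05 and SE = 0.012 (the Age coefficient from the worked example later in this tutorial), the 95% interval for the odds ratio is roughly (1.027, 1.076), which excludes 1 and is therefore significant.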

7.5 Statistical Significance Testing

For each coefficient, the application reports:

Standard Error (SE(\hat{\beta}_i)): The estimated variability of the coefficient estimate. Derived from the square root of the diagonal elements of the Fisher Information Matrix (inverse of the Hessian of the log-likelihood):

SE(\hat{\beta}_i) = \sqrt{\left[(\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1}\right]_{ii}}

z-value (Wald Statistic): Tests the null hypothesis H_0: \beta_i = 0 against H_1: \beta_i \neq 0:

z_i = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}

Under H_0, z_i approximately follows a standard normal distribution \mathcal{N}(0, 1) for large samples.

p-value: The probability of observing a |z| at least as large as the calculated value, under H_0:

p\text{-value} = 2 \times P(Z > |z_i|) = 2 \times (1 - \Phi(|z_i|))

Where \Phi is the standard normal CDF. A small p-value (typically p < 0.05) provides evidence to reject H_0 and conclude that X_i is a statistically significant predictor.

Decision Rule:

| p-value | Interpretation |
|---|---|
| p < 0.001 | Extremely strong evidence against H_0 |
| 0.001 ≤ p < 0.01 | Very strong evidence against H_0 |
| 0.01 ≤ p < 0.05 | Strong evidence against H_0 |
| 0.05 ≤ p < 0.10 | Weak evidence against H_0 (marginal) |
| p ≥ 0.10 | Insufficient evidence against H_0 |

8. Model Fit and Evaluation

Unlike linear regression (which uses R²), logistic regression relies on likelihood-based measures to assess model quality.

8.1 Log-Likelihood

The log-likelihood measures how well the fitted model explains the observed data:

\ell(\hat{\boldsymbol{\beta}}) = \sum_{i=1}^{m} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]

8.2 Deviance

Deviance is defined as -2 times the log-likelihood:

D = -2\,\ell(\hat{\boldsymbol{\beta}})

Lower deviance indicates a better-fitting model. The null deviance (D_0) and residual deviance (D_r) are often compared:

\chi^2 = D_0 - D_r = -2(\ell_0 - \ell_{\hat{\boldsymbol{\beta}}})

This statistic follows a chi-squared distribution with degrees of freedom equal to the number of predictors, and can be used for a likelihood ratio test (LRT) of the overall model significance.

8.3 AIC (Akaike Information Criterion)

AIC penalises the log-likelihood for model complexity (number of parameters k):

\text{AIC} = -2\ell(\hat{\boldsymbol{\beta}}) + 2k

Where k is the number of estimated parameters (coefficients plus the intercept).

8.4 BIC (Bayesian Information Criterion)

Similar to AIC but with a stronger penalty for model complexity:

\text{BIC} = -2\ell(\hat{\boldsymbol{\beta}}) + k \ln(m)

Where m is the number of observations. BIC tends to favour more parsimonious (simpler) models than AIC.

8.5 Pseudo R² Measures

Since ordinary R² is not directly applicable to logistic regression, several pseudo R² measures have been developed. They all attempt to quantify "how much better" the fitted model is compared to the null model.

McFadden's Pseudo R²:

R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}

Where \ell_0 is the log-likelihood of the null model (intercept only).

Cox & Snell Pseudo R²:

R^2_{\text{CS}} = 1 - \left(\frac{L_0}{L_{\hat{\boldsymbol{\beta}}}}\right)^{2/m}

Where L_0 = e^{\ell_0} and L_{\hat{\boldsymbol{\beta}}} = e^{\ell(\hat{\boldsymbol{\beta}})} are the likelihoods (not log-likelihoods).

Nagelkerke's Pseudo R² (scaled Cox & Snell):

R^2_{\text{Nagelkerke}} = \frac{R^2_{\text{CS}}}{1 - L_0^{2/m}}

This is scaled so that it can reach a maximum of 1.
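All three measures can be computed directly from the fitted and null log-likelihoods; working on the log scale avoids under-flowing the raw likelihoods. A minimal sketch (illustrative only):

```python
import math

def pseudo_r2(ll_model, ll_null, m):
    """McFadden, Cox & Snell, and Nagelkerke pseudo R-squared.

    ll_model / ll_null: log-likelihoods of the fitted and intercept-only models;
    m: number of observations.
    """
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp((2.0 / m) * (ll_null - ll_model))  # 1 - (L0/L1)^(2/m)
    nagelkerke = cox_snell / (1.0 - math.exp((2.0 / m) * ll_null))  # scale to max 1
    return mcfadden, cox_snell, nagelkerke
```

For instance, with ll_null = -100, ll_model = -80 and m = 150, McFadden's R² is 0.20 (a good fit by the guidelines below), while Nagelkerke's value is somewhat higher.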

Interpretation Guidelines (McFadden's):

| McFadden's R² | Interpretation |
|---|---|
| 0.00 – 0.10 | Poor fit |
| 0.10 – 0.20 | Acceptable fit |
| 0.20 – 0.30 | Good fit |
| 0.30 – 0.40 | Excellent fit |
| > 0.40 | Outstanding fit (may warrant scrutiny) |

⚠️ Pseudo R² values are not directly comparable to R² from linear regression. A McFadden's R² of 0.20 generally indicates a good-fitting logistic regression model, whereas an R² of 0.20 in linear regression would typically be considered poor.

8.6 Hosmer–Lemeshow Goodness-of-Fit Test

The Hosmer–Lemeshow test assesses whether the observed event rates match predicted probabilities across deciles (or groups) of the predicted probability:

\chi^2_{HL} = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g (1 - E_g / n_g)}

Where:

- G is the number of groups (typically 10)
- O_g is the observed number of events in group g
- E_g is the expected number of events in group g
- n_g is the number of observations in group g

A non-significant p-value (p > 0.05) indicates that the model fits the data well (observed and expected values are close). A significant p-value suggests poor calibration.


9. Classification Metrics and Confusion Matrix

9.1 The Decision Threshold

After estimating predicted probabilities \hat{p}_i, a decision threshold \tau (default = 0.5) is used to classify each observation:

\hat{Y}_i = \begin{cases} 1 & \text{if } \hat{p}_i \geq \tau \\ 0 & \text{if } \hat{p}_i < \tau \end{cases}

The choice of threshold \tau involves a trade-off:

- Lowering \tau catches more true positives (higher sensitivity) but produces more false positives
- Raising \tau reduces false positives (higher specificity) but misses more true positives

9.2 The Confusion Matrix

The Confusion Matrix cross-tabulates the actual outcomes against the predicted outcomes:

| | Predicted Ŷ = 0 | Predicted Ŷ = 1 |
|---|---|---|
| Actual Y = 0 | True Negatives (TN) | False Positives (FP) — Type I Error |
| Actual Y = 1 | False Negatives (FN) — Type II Error | True Positives (TP) |

9.3 Classification Metrics

From the confusion matrix, a rich set of performance metrics can be derived:

Accuracy: The overall proportion of correct predictions:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

⚠️ Accuracy can be misleading with imbalanced classes. If 95% of outcomes are 0, a model that always predicts 0 achieves 95% accuracy but is useless.

Precision (Positive Predictive Value): Of all predicted positives, what proportion are actually positive?

\text{Precision} = \frac{TP}{TP + FP}

Recall (Sensitivity / True Positive Rate): Of all actual positives, what proportion were correctly identified?

\text{Recall} = \frac{TP}{TP + FN}

Specificity (True Negative Rate): Of all actual negatives, what proportion were correctly identified?

\text{Specificity} = \frac{TN}{TN + FP}

False Positive Rate: Of all actual negatives, what proportion were incorrectly classified as positive?

\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}

F1 Score: The harmonic mean of Precision and Recall. Balances both metrics, useful for imbalanced datasets:

F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Matthews Correlation Coefficient (MCC): A balanced metric even when class sizes are very unequal:

\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

MCC ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 indicating random chance.
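All of these metrics follow mechanically from the four confusion-matrix counts, as this small helper shows (a sketch; it does not guard against empty rows or columns in the matrix):

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
               / (((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5),
    }
```

For example, TP = 40, TN = 45, FP = 5, FN = 10 gives accuracy 0.85, recall 0.80 and specificity 0.90.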

9.4 Summary of Metrics

| Metric | Formula | Best Value | What It Emphasises |
|---|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | 1 | Overall correctness |
| Precision | TP/(TP+FP) | 1 | Avoiding false alarms |
| Recall | TP/(TP+FN) | 1 | Catching all positives |
| Specificity | TN/(TN+FP) | 1 | Catching all negatives |
| F1 Score | 2 × (P × R)/(P + R) | 1 | Balance of P and R |
| MCC | (see above) | 1 | Balanced (imbalanced data) |

10. ROC Curve and AUC

10.1 What is the ROC Curve?

The ROC (Receiver Operating Characteristic) Curve is a graphical tool that evaluates the performance of a binary classifier across all possible decision thresholds \tau \in [0, 1].

For each threshold \tau:

- Every observation is classified as 1 or 0 using that threshold
- The True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity) are computed

The ROC curve plots TPR (y-axis) against FPR (x-axis) for each threshold.

Interpretation:

- A curve along the 45° diagonal corresponds to random guessing (AUC = 0.5)
- The closer the curve bends toward the top-left corner (0, 1), the better the model separates the two classes
10.2 AUC (Area Under the ROC Curve)

The AUC summarises the entire ROC curve into a single number:

\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})

Probabilistic Interpretation: AUC equals the probability that the model ranks a randomly chosen positive instance higher (assigns a higher predicted probability) than a randomly chosen negative instance.
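The probabilistic interpretation translates directly into a (deliberately naive, O(n²)) pairwise computation, shown here as a sketch for small datasets; production code would use a rank-based formula instead:

```python
def auc_pairwise(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties in the scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that scores every positive above every negative gets AUC = 1.0; identical scores for everyone give 0.5.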

| AUC Value | Model Discrimination |
|---|---|
| 0.5 | No discrimination (random chance) |
| 0.5 – 0.6 | Poor |
| 0.6 – 0.7 | Fair |
| 0.7 – 0.8 | Acceptable |
| 0.8 – 0.9 | Excellent |
| 0.9 – 1.0 | Outstanding |
| 1.0 | Perfect discrimination |

10.3 Choosing the Optimal Threshold from the ROC Curve

Several methods exist for selecting the best operating threshold:

Youden's J Statistic: Maximises the sum of sensitivity and specificity:

J = \text{Sensitivity} + \text{Specificity} - 1 = \text{TPR} - \text{FPR}

The optimal threshold is the one at which J is maximised.

Closest to Top-Left: Minimise the Euclidean distance from the ROC curve point to the perfect point (0, 1):

d = \sqrt{\text{FPR}^2 + (1 - \text{TPR})^2}
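Youden's J can be found with a simple threshold sweep over the observed scores. This brute-force sketch (illustrative; it assumes both classes are present in `y_true`) recomputes the confusion matrix at each candidate threshold:

```python
def best_threshold_youden(y_true, scores):
    """Sweep candidate thresholds and return (threshold, J) maximising J = TPR - FPR."""
    best_tau, best_j = None, -1.0
    for tau in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= tau)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < tau)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= tau)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < tau)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_tau, best_j = tau, j
    return best_tau, best_j
```

Only the observed scores need to be tried as thresholds, since TPR and FPR change only when the threshold crosses a score.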


11. Prediction Tool

11.1 Point Prediction

The Prediction Tool allows you to input specific values for the independent variables and obtain a predicted probability of the outcome being 1.

The predicted probability is calculated using the estimated coefficients:

\hat{p} = \frac{1}{1 + e^{-(\hat{\beta}_0 + \sum_{i=1}^{n} \hat{\beta}_i X_{i,\text{input}})}}

Steps:

  1. Enter a value for each independent variable (numeric inputs are entered directly; categorical inputs are selected from a dropdown).
  2. The app automatically applies dummy coding to categorical inputs.
  3. The predicted probability \hat{p} is displayed.
  4. Based on the threshold (\tau = 0.5 by default), the predicted class is also shown.

11.2 Confidence Interval for Predicted Probability

For a model with predictor vector \mathbf{x}^* = (1, X_1^*, X_2^*, \dots, X_n^*)^T, the variance of the linear predictor \hat{z}^* = \hat{\boldsymbol{\beta}}^T \mathbf{x}^* is:

\text{Var}(\hat{z}^*) = {\mathbf{x}^*}^T \, \text{Cov}(\hat{\boldsymbol{\beta}}) \, \mathbf{x}^*

Where \text{Cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} is the variance-covariance matrix of the coefficients.

A (1 - \alpha) \times 100\% confidence interval for the linear predictor is:

\hat{z}^* \pm z_{\alpha/2} \sqrt{\text{Var}(\hat{z}^*)}

Converting back to probability:

\left[\frac{1}{1 + e^{-(\hat{z}^* - z_{\alpha/2} \sqrt{\text{Var}(\hat{z}^*)})}}, \quad \frac{1}{1 + e^{-(\hat{z}^* + z_{\alpha/2} \sqrt{\text{Var}(\hat{z}^*)})}}\right]

⚠️ Confidence intervals for predicted probabilities in multiple predictor models require the full variance-covariance matrix. The DataStatPro app currently computes prediction confidence intervals for single numeric predictor models; multi-predictor CIs will be added in a future release.


12. Worked Examples

Example 1: Single Predictor — Age and Ad Click Prediction

Suppose we model the probability of clicking on an ad (1 = clicked, 0 = not clicked) based on age (a single numeric predictor).

After fitting:

\text{logit}(\hat{p}) = -4.2 + 0.085 \times \text{Age}

Interpretation:

- The intercept (-4.2) is the log odds of clicking for a (hypothetical) person of age 0
- Each additional year of age increases the log odds of clicking by 0.085; the odds ratio is e^{0.085} \approx 1.089, i.e., the odds increase by roughly 8.9% per year

Prediction for Age = 30:

z = -4.2 + 0.085 \times 30 = -4.2 + 2.55 = -1.65

\hat{p} = \frac{1}{1 + e^{1.65}} \approx \frac{1}{1 + 5.207} \approx \frac{1}{6.207} \approx 0.161

A 30-year-old has approximately a 16.1% predicted probability of clicking the ad. The model classifies this as "not clicked" (below the 0.5 threshold).
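The arithmetic in this example can be checked with a short snippet. The coefficients here are the hypothetical fitted values from the example above, not output from a real model:

```python
import math

def predicted_probability(age):
    """Predicted click probability under the example model logit(p) = -4.2 + 0.085*Age."""
    z = -4.2 + 0.085 * age
    return 1.0 / (1.0 + math.exp(-z))
```

Evaluating `predicted_probability(30)` reproduces the hand calculation (about 0.161), and the probability rises monotonically with age, as the positive coefficient implies.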


Example 2: Multiple Predictors — Age and Location

Suppose you are predicting the probability of a customer clicking on an ad (1 = clicked, 0 = not clicked) based on age (numeric) and location (categorical: Urban, Suburban, Rural — Rural is the base category).

After fitting, the results are:

| Parameter | Estimate | Std. Error | z-value | p-value | Odds Ratio |
|---|---|---|---|---|---|
| Intercept | -3.5000 | 0.4500 | -7.778 | < 0.001 | 0.0302 |
| Age | 0.0500 | 0.0120 | 4.167 | < 0.001 | 1.0513 |
| Location (Urban) | 1.2000 | 0.4900 | 2.449 | 0.015 | 3.3201 |
| Location (Suburban) | 0.8000 | 0.4600 | 1.739 | 0.082 | 2.2255 |

Model Equation:

\text{logit}(\hat{p}) = -3.5000 + 0.0500 \times \text{Age} + 1.2000 \times D_\text{Urban} + 0.8000 \times D_\text{Suburban}

Coefficient Interpretation:

- Age: each additional year multiplies the odds of clicking by e^{0.05} \approx 1.051 (about a 5.1% increase), holding location constant
- Location (Urban): urban customers have e^{1.20} \approx 3.32 times the odds of clicking compared with rural customers of the same age
- Location (Suburban): suburban customers have e^{0.80} \approx 2.23 times the odds compared with rural customers, but this effect is not significant at the 5% level (p = 0.082)

Prediction Example — 40-year-old in Suburban Location:

z = -3.5000 + (0.0500 \times 40) + (1.2000 \times 0) + (0.8000 \times 1)

z = -3.5000 + 2.0000 + 0 + 0.8000 = -0.7000

\hat{p} = \frac{1}{1 + e^{0.7000}} \approx \frac{1}{1 + 2.0138} \approx \frac{1}{3.0138} \approx 0.3318

The predicted probability is approximately 0.3318 (33.18%). Since \hat{p} < 0.5, the model classifies this individual as "not clicked".

95% Confidence Interval for Odds Ratios:

| Parameter | OR | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| Age | 1.0513 | e^{0.05 - 1.96 × 0.012} ≈ 1.027 | e^{0.05 + 1.96 × 0.012} ≈ 1.076 |
| Location (Urban) | 3.3201 | e^{1.20 - 1.96 × 0.49} ≈ 1.271 | e^{1.20 + 1.96 × 0.49} ≈ 8.675 |

13. Common Mistakes and How to Avoid Them

Mistake 1: Using Logistic Regression With a Non-Binary Outcome

Problem: Applying binary logistic regression to a continuous or multi-category outcome.
Solution: For continuous outcomes, use linear regression. For k > 2 unordered categories, use multinomial logistic regression. For ordered categories, use ordinal logistic regression.

Mistake 2: Ignoring Class Imbalance

Problem: When one class (e.g., Y = 1) is very rare (e.g., 2% of data), the model may predict 0 for all observations and still achieve high accuracy.
Solution: Use precision, recall, F1, or AUC as primary metrics instead of accuracy. Consider oversampling the minority class (SMOTE), undersampling the majority class, or using class weights.

Mistake 3: Including Too Many Predictors (Overfitting)

Problem: With too many predictors relative to the number of events, the model overfits the training data and performs poorly on new data.
Solution: Follow the EPV rule (10–20 events per predictor). Use regularisation (L1/Lasso, L2/Ridge) or cross-validation for model selection.

Mistake 4: Multicollinearity

Problem: Highly correlated predictors inflate standard errors, making individual coefficient estimates unreliable (even if the overall model is fine).
Solution: Check pairwise correlations and VIF. Remove redundant variables or use dimensionality reduction (PCA) as a preprocessing step.

Mistake 5: Incorrect Reference Category

Problem: Choosing a reference category for a dummy variable arbitrarily, leading to confusing interpretations.
Solution: Choose a reference category that is scientifically meaningful (e.g., control group, most common category). Document the choice clearly.

Mistake 6: Interpreting Coefficients as Probability Changes

Problem: Saying "an increase in age by 1 year increases the probability of clicking by 0.05."
Solution: Coefficients in logistic regression are changes in log odds, not probabilities. The probability change depends on the current value of all predictors (it is non-linear). Always use odds ratios or calculate predicted probabilities at specific values.

Mistake 7: Ignoring Complete Separation

Problem: If a predictor perfectly predicts the outcome, MLE fails to converge and produces extremely large coefficients with enormous standard errors.
Solution: Look for warning messages about convergence. If separation exists, consider removing the problematic predictor, collapsing categories, or using Firth's penalised logistic regression.

Mistake 8: Not Checking the Linearity Assumption

Problem: Treating non-linear relationships between a continuous predictor and the log odds as linear, leading to a mis-specified model.
Solution: Plot smoothed log odds against each continuous predictor. Apply transformations (e.g., log, square root) or use polynomial terms or splines if needed.


14. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| Model fails to converge | Complete/quasi-complete separation; too many predictors; too few observations | Check for perfect predictors; reduce predictors; collect more data; use Firth regression |
| Very large coefficients (> 10) | Complete separation | Examine which predictor perfectly splits the outcomes |
| Very large standard errors | Multicollinearity or separation | Check VIF; examine correlation matrix |
| AUC = 0.5 | Model has no predictive power | Review variable selection; check data quality; consider non-linear models |
| All predictions = 0 or all = 1 | Severe class imbalance or separation | Check class distribution; adjust threshold; address imbalance |
| p-values all non-significant | Insufficient sample size; weak predictors | Increase sample size; reconsider predictor selection |
| Pseudo R² very high (> 0.9) | Possible overfitting or separation | Cross-validate; check for separation; reduce predictors |
| Confidence interval includes 1 (for OR) | Non-significant predictor | Variable may not contribute meaningfully; consider removing |

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
|---|---|
| \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum \beta_i X_i | Log odds equation |
| p = \frac{1}{1 + e^{-(\beta_0 + \sum \beta_i X_i)}} | Predicted probability |
| \text{OR}_i = e^{\beta_i} | Odds ratio for predictor i |
| z_i = \hat{\beta}_i / SE(\hat{\beta}_i) | Wald z-statistic |
| \text{AIC} = -2\ell + 2k | Akaike Information Criterion |
| R^2_{\text{McFadden}} = 1 - \ell_{\hat{\beta}} / \ell_0 | McFadden's Pseudo R² |
| \text{Accuracy} = (TP + TN)/(TP + TN + FP + FN) | Overall accuracy |
| \text{Precision} = TP/(TP + FP) | Positive predictive value |
| \text{Recall} = TP/(TP + FN) | Sensitivity |
| \text{Specificity} = TN/(TN + FP) | True negative rate |
| F_1 = 2 \times \frac{P \times R}{P + R} | F1 Score |

Odds Ratio Interpretation

| OR Value | Meaning |
|---|---|
| > 1 | Predictor increases odds of outcome |
| = 1 | Predictor has no effect on odds |
| < 1 | Predictor decreases odds of outcome |

Model Comparison Guide

| Scenario | Recommended Metric |
|---|---|
| Comparing nested models | Likelihood Ratio Test (χ²) |
| Comparing non-nested models | AIC or BIC |
| Evaluating discrimination ability | AUC |
| Evaluating calibration | Hosmer-Lemeshow test |
| Imbalanced classes | F1, MCC, AUC (not Accuracy) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Logistic Regression using the DataStatPro application. For further reading, consult Hosmer & Lemeshow's "Applied Logistic Regression" or Agresti's "Categorical Data Analysis". For feature requests or support, contact the DataStatPro team.