
Logistic Regression Tutorial

Understand and use the Logistic Regression analysis feature.

Logistic Regression: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of logistic regression all the way through advanced interpretation, model diagnostics, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Logistic Regression?
  3. The Mathematics Behind Logistic Regression
  4. Assumptions of Logistic Regression
  5. Types of Logistic Regression
  6. Using the Logistic Regression Component
  7. Computational and Formula Details
  8. Model Fit and Evaluation
  9. Classification Metrics and Confusion Matrix
  10. ROC Curve and AUC
  11. Prediction Tool
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into logistic regression, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Probability

Probability is a number between 0 and 1 that describes the likelihood of an event occurring:

- p = 0 means the event never occurs
- p = 0.5 means the event is as likely to occur as not
- p = 1 means the event always occurs

1.2 Odds

Odds are another way to express the likelihood of an event. Instead of asking "what fraction of the time does this happen?", odds ask "how many times more likely is success compared to failure?":

\text{Odds} = \frac{p}{1 - p}

For example, if p = 0.75 (a 75% probability of success):

\text{Odds} = \frac{0.75}{0.25} = 3

This means success is 3 times more likely than failure, often expressed as "3 to 1 odds".

1.3 Log Odds (Logit)

The log odds (also called the logit) is simply the natural logarithm of the odds:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)

| Probability (p) | Odds | Log Odds (Logit) |
|---|---|---|
| 0.10 | 0.111 | -2.197 |
| 0.25 | 0.333 | -1.099 |
| 0.50 | 1.000 | 0.000 |
| 0.75 | 3.000 | 1.099 |
| 0.90 | 9.000 | 2.197 |

Key insight: Log odds range from negative infinity to positive infinity, which makes them suitable for a linear model.
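The probability/odds/logit conversions above can be sketched in a few lines of Python (a minimal illustration for checking the table, not part of DataStatPro):

```python
import math

def prob_to_odds(p):
    """Odds = p / (1 - p)."""
    return p / (1.0 - p)

def prob_to_logit(p):
    """Log odds (logit) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def logit_to_prob(z):
    """Inverse logit: convert log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `prob_to_odds(0.75)` gives 3.0 and `prob_to_logit(0.75)` gives roughly 1.099, matching the table.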

1.4 Why Not Use Linear Regression for Binary Outcomes?

A natural first question is: why can't we just use linear regression to predict a 0/1 outcome?

Linear regression predicts values using:

\hat{Y} = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n

The problem is that linear regression can produce predicted values outside the range [0, 1] — for example, -0.3 or 1.7 — which are meaningless as probabilities. Logistic regression solves this by transforming its output through the logistic (sigmoid) function, which always produces values between 0 and 1.


2. What is Logistic Regression?

Logistic Regression is a statistical method used to model the probability of a binary outcome — that is, an outcome that takes one of exactly two values (e.g., Yes/No, 1/0, True/False, Disease/No Disease).

Despite its name containing "regression," logistic regression is fundamentally a classification algorithm. It estimates the probability that an observation belongs to a particular class.

2.1 Real-World Applications

Logistic regression is one of the most widely used methods in statistics and machine learning. Common applications include:

- Medicine: predicting whether a patient has a disease based on symptoms and test results
- Finance: estimating the probability that a borrower defaults on a loan
- Marketing: predicting whether a customer will respond to a campaign or churn
- Email filtering: classifying messages as spam or not spam

2.2 Binary Outcome Variable

The dependent variable (also called the response or outcome variable) in logistic regression must be binary. By convention:

- 1 represents the event of interest (e.g., disease present, customer clicked)
- 0 represents the absence of the event

The choice of which class is "1" and which is "0" is meaningful — it affects the direction of coefficients. Make sure to define this mapping clearly before running the model.

2.3 Logistic Regression vs. Linear Regression: A Summary

| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Outcome Type | Continuous (e.g., income) | Binary (e.g., yes/no) |
| Predicted Value | Any real number | Probability between 0 and 1 |
| Link Function | Identity | Logit (log odds) |
| Model Fitting Method | Ordinary Least Squares (OLS) | Maximum Likelihood Estimation (MLE) |
| Goodness-of-Fit Metric | R² | Pseudo R², AIC, Log-Likelihood |
| Error Distribution | Normal | Binomial |

3. The Mathematics Behind Logistic Regression

This section builds up the full mathematical framework of logistic regression from scratch.

3.1 The Logistic (Sigmoid) Function

The heart of logistic regression is the logistic function (also called the sigmoid function):

\sigma(z) = \frac{1}{1 + e^{-z}}

Where z is any real number. The logistic function maps any real-valued input z to a value in the range (0, 1), making it ideal for representing a probability.

Key properties of the logistic function:

- \sigma(0) = 0.5 — an input of zero maps to a 50% probability
- \sigma(z) \to 1 as z \to +\infty and \sigma(z) \to 0 as z \to -\infty
- Symmetry: \sigma(-z) = 1 - \sigma(z)
- Its graph is the characteristic S-shaped (sigmoid) curve

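A numerically stable sigmoid is easy to write; this sketch (illustrative only, not DataStatPro's internal code) branches on the sign of z to avoid overflowing exp for large negative inputs:

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z)), numerically stable."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # for very negative z, exp(-z) would overflow; use the equivalent form
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A quick check confirms the properties listed above: `sigmoid(0)` is 0.5, and `sigmoid(-z)` equals `1 - sigmoid(z)`.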
3.2 The Logistic Regression Model

In logistic regression, z is replaced by the linear combination of predictors:

z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

So the predicted probability that the outcome Y = 1 given the predictors is:

p = P(Y=1 \mid X_1, X_2, \dots, X_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}

Where:

- \beta_0 is the intercept (the log odds when all predictors are zero)
- \beta_1, \dots, \beta_n are the coefficients of the predictors
- X_1, \dots, X_n are the independent variables

3.3 The Logit Transformation

Taking the logit (log odds) of the predicted probability linearises the model:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

This is the fundamental equation of logistic regression. On the left side is the log odds; on the right side is a linear combination of predictors — exactly like a linear regression model. This is why it is called logistic regression.

3.4 From Log Odds Back to Probability

Given the logit, you can always convert back to a probability using:

p = \frac{e^{\text{logit}(p)}}{1 + e^{\text{logit}(p)}} = \frac{1}{1 + e^{-\text{logit}(p)}}

3.5 The Likelihood Function and Maximum Likelihood Estimation (MLE)

Logistic regression coefficients are estimated by Maximum Likelihood Estimation (MLE) — the method that finds the values of \beta_0, \beta_1, \dots, \beta_n that make the observed data most probable.

For a dataset of m observations, the likelihood function is:

L(\boldsymbol{\beta}) = \prod_{i=1}^{m} p_i^{y_i} (1 - p_i)^{1 - y_i}

Where:

- y_i is the observed outcome (0 or 1) for observation i
- p_i is the model's predicted probability that y_i = 1

It is more convenient to work with the log-likelihood (since logarithms turn products into sums):

\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]

MLE finds the coefficient vector \boldsymbol{\beta} that maximises \ell(\boldsymbol{\beta}). This cannot be solved analytically (unlike OLS), so iterative algorithms are used.
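The log-likelihood sum translates directly into code. This minimal helper (illustrative, not DataStatPro's implementation) evaluates it for given outcomes and predicted probabilities:

```python
import math

def log_likelihood(y, p):
    """Binary log-likelihood: sum over i of y_i ln(p_i) + (1 - y_i) ln(1 - p_i)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))
```

Probabilities closer to the observed outcomes give a higher (less negative) log-likelihood, which is exactly what MLE exploits.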

3.6 The IRLS Algorithm

The most common algorithm for maximising the log-likelihood is Iteratively Reweighted Least Squares (IRLS), a special case of Newton-Raphson optimisation.

At each iteration t, the algorithm updates the coefficient estimates using:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}^{(t)})

Where:

- \mathbf{X} is the design matrix of predictor values (with a leading column of ones for the intercept)
- \mathbf{W}^{(t)} is a diagonal weight matrix with entries p_i^{(t)}(1 - p_i^{(t)})
- \mathbf{y} is the vector of observed outcomes
- \mathbf{p}^{(t)} is the vector of predicted probabilities at iteration t

The algorithm continues until the change in log-likelihood or coefficients is smaller than a convergence threshold (e.g., 10^{-8}).
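The update rule above can be sketched with NumPy in a few lines. This is a bare-bones illustration of IRLS under the stated formula, not DataStatPro's production code (which also computes standard errors and handles edge cases):

```python
import numpy as np

def fit_logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Fit binary logistic regression by IRLS (Newton-Raphson).

    X: (m, n) predictor matrix WITHOUT an intercept column; y: (m,) array of 0/1.
    Returns the coefficient vector, with beta[0] as the intercept.
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, float)])  # prepend intercept
    y = np.asarray(y, float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current predicted probabilities
        w = p * (1.0 - p)                        # diagonal of the weight matrix W
        H = X.T @ (w[:, None] * X)               # X^T W X
        step = np.linalg.solve(H, X.T @ (y - p))  # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

One useful property for testing: with an intercept in the model, the MLE solution makes the mean fitted probability equal the observed event rate.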

3.7 The Cost (Cross-Entropy) Function

An equivalent way to frame MLE is as minimising the binary cross-entropy loss:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]

This is simply the negative log-likelihood divided by m. Minimising J is equivalent to maximising \ell.


4. Assumptions of Logistic Regression

Logistic regression is a powerful tool, but its results are valid only when certain assumptions are reasonably met. Understanding these assumptions helps you avoid misuse and misinterpretation.

4.1 Binary (or Ordinal/Multinomial) Dependent Variable

The outcome must be binary (two categories). If your outcome has more than two unordered categories, use Multinomial Logistic Regression. If the categories are ordered, use Ordinal Logistic Regression.

4.2 Independence of Observations

Each observation in the dataset must be independent of all others. For example:

- Repeated measurements on the same subject violate this assumption
- Clustered data (e.g., patients within the same hospital) may require mixed or multilevel models instead

4.3 No (or Little) Multicollinearity

Independent variables should not be highly correlated with each other. Severe multicollinearity inflates standard errors, making it harder to determine the individual effect of each predictor. Check using the Variance Inflation Factor (VIF): generally, VIF > 10 is a concern.
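VIF can be computed by regressing each predictor on all the others and using VIF_j = 1 / (1 - R²_j). A small NumPy sketch (illustrative; statistics packages provide equivalent built-ins):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (shape: n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # regress column j on the rest
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2) if r2 < 1 else float("inf"))
    return out
```

Two nearly duplicated columns produce very large VIFs, while an unrelated column stays close to 1.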

4.4 Linearity of Independent Variables and the Log Odds

For continuous predictors, logistic regression assumes a linear relationship between the predictor and the log odds (not the probability). This can be checked with a Box-Tidwell test or by plotting the log odds against each continuous predictor.

4.5 No Extreme Outliers

Logistic regression can be sensitive to extreme outliers in the continuous independent variables. Influential observations should be identified and examined.

4.6 Large Sample Size

Logistic regression requires a reasonably large sample. A common rule of thumb is:

- At least 10–20 events per predictor variable (EPV) — that is, the count of the rarer outcome class divided by the number of predictors should be 10 or more

4.7 No Perfect Separation (Complete Separation)

If a predictor or combination of predictors perfectly separates the two outcome groups, the MLE algorithm will fail to converge (the coefficient estimates grow without bound). This is called complete separation and is a sign that the model is too good, often due to a small sample or a predictor that essentially duplicates the outcome.


5. Types of Logistic Regression

| Type | Outcome Variable | Example |
|---|---|---|
| Binary | Two categories (0 or 1) | Disease: Yes / No |
| Multinomial | Three or more unordered categories | Colour Preference: Red / Green / Blue |
| Ordinal | Three or more ordered categories | Severity: Low / Medium / High |

The DataStatPro application implements Binary Logistic Regression, which is the most common type and the focus of this tutorial.


6. Using the Logistic Regression Component

The Logistic Regression component in the DataStatPro application provides a full end-to-end workflow for performing binary logistic regression on your datasets.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset you want to analyse from the "Dataset" dropdown. The dataset should have at least one binary variable and one or more predictor variables.

Step 2 — Select Independent Variables (X) Select one or more predictor variables from the "Independent Variables (X)" dropdown. These can be:

- Numeric (continuous) variables, such as age or income
- Categorical variables, such as region or treatment group

💡 Tip: Start with variables you have a theoretical reason to believe are associated with the outcome. Avoid blindly throwing in many unrelated predictors.

Step 3 — Select Dependent Variable (Y — Binary) Select the binary outcome variable from the "Dependent Variable (Y — Binary)" dropdown.

⚠️ Important: Make sure you correctly assign which category is 1 and which is 0. This directly affects the direction and interpretation of all coefficients.

Step 4 — Select Base Categories (for Categorical Predictors) For each categorical independent variable with more than two categories, you must specify a base (reference) category. The base category is the group against which all other groups are compared.

💡 Tip: Choose the most natural reference group (e.g., "Placebo" in a drug trial, "Rural" in a location study, or simply the most common category).

Step 5 — Select Confidence Level Choose the desired confidence level for confidence intervals (e.g., 95% is the standard). This affects the confidence intervals reported for each coefficient.

Step 6 — Display Options Select which visualisations and outputs you wish to display (for example, the confusion matrix and the ROC curve).

Step 7 — Run the Analysis Click the "Run Logistic Regression" button. The application will:

  1. Encode categorical variables using dummy coding.
  2. Fit the model using the IRLS algorithm.
  3. Calculate coefficients, standard errors, z-values, p-values, and confidence intervals.
  4. Compute model fit statistics (Log-Likelihood, AIC, Pseudo R²).
  5. Generate the confusion matrix and classification metrics.
  6. Plot the ROC curve and calculate the AUC.

7. Computational and Formula Details

7.1 The Logistic Function and Logit Transformation

The full logistic regression model expresses the log odds as a linear function of predictors:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

Where:

- p is the probability that Y = 1
- \beta_0 is the intercept
- \beta_1, \dots, \beta_n are the coefficients of the predictors X_1, \dots, X_n

Solving for p gives the predicted probability:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}

7.2 Handling Categorical Independent Variables (Dummy Coding)

Categorical independent variables with more than two categories cannot be entered directly into a regression equation. They are converted into a set of dummy (indicator) variables using dummy coding (also known as one-hot encoding with a reference category).

For a categorical variable with k categories, k - 1 dummy variables are created. One category is designated the base (reference) category, and it receives no dummy variable. Each dummy variable is defined as:

D_j = \begin{cases} 1 & \text{if the observation belongs to category } j \\ 0 & \text{otherwise} \end{cases}

Example:

A categorical variable "Region" has three categories: Urban, Suburban, Rural (base).

| Region | D_Urban | D_Suburban |
|---|---|---|
| Urban | 1 | 0 |
| Suburban | 0 | 1 |
| Rural | 0 | 0 |

The model becomes:

\text{logit}(p) = \beta_0 + \beta_1 (\text{Age}) + \beta_2 D_\text{Urban} + \beta_3 D_\text{Suburban}

The coefficients \beta_2 and \beta_3 represent the difference in log odds relative to the Rural base category.

⚠️ Never include all k dummy variables — this creates perfect multicollinearity (the "dummy variable trap"). Always omit one category as the base.
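The Region example can be reproduced with pandas, which many readers will recognise. Dropping the `Region_Rural` column explicitly makes Rural the base category (a sketch of the encoding step, not DataStatPro's internal code):

```python
import pandas as pd

df = pd.DataFrame({"Region": ["Urban", "Suburban", "Rural", "Rural", "Urban"]})

# k = 3 categories -> k - 1 dummy variables; drop the base category explicitly
dummies = pd.get_dummies(df["Region"], prefix="Region")
dummies = dummies.drop(columns=["Region_Rural"])  # Rural becomes the reference
```

An Urban row then has Region_Urban = 1 and Region_Suburban = 0, while a Rural row is all zeros, exactly as in the table above.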

7.3 Interpretation of Coefficients

Log Odds Interpretation

Each coefficient \beta_i represents the change in the log odds of Y = 1 for a one-unit increase in X_i, holding all other variables constant:

\Delta \text{logit}(p) = \beta_i \quad \text{for a one-unit increase in } X_i

Odds Ratio Interpretation

The odds ratio (OR) is obtained by exponentiating the coefficient:

\text{OR}_i = e^{\beta_i}

The odds ratio is the factor by which the odds multiply for a one-unit increase in X_i, holding all other variables constant:

| Odds Ratio Value | Interpretation |
|---|---|
| e^{\beta_i} > 1 | Odds of Y = 1 increase as X_i increases |
| e^{\beta_i} = 1 | Odds of Y = 1 are unchanged (X_i has no effect) |
| e^{\beta_i} < 1 | Odds of Y = 1 decrease as X_i increases |

Example: If \beta_\text{Age} = 0.05, then e^{0.05} \approx 1.051. For each additional year of age, the odds of the event increase by approximately 5.1%, holding other variables constant.

Converting Between Scales

| Scale | Formula | Range |
|---|---|---|
| Log Odds | \beta_i | (-\infty, +\infty) |
| Odds Ratio | e^{\beta_i} | (0, +\infty) |
| Probability (at mean X) | \frac{1}{1 + e^{-\hat{z}}} | (0, 1) |

7.4 Confidence Intervals for Coefficients

A (1 - \alpha) \times 100\% confidence interval for coefficient \beta_i is:

\left[\hat{\beta}_i - z_{\alpha/2} \cdot SE(\hat{\beta}_i), \quad \hat{\beta}_i + z_{\alpha/2} \cdot SE(\hat{\beta}_i)\right]

Where:

- z_{\alpha/2} is the standard normal critical value (1.96 for a 95% interval)
- SE(\hat{\beta}_i) is the standard error of the coefficient estimate

The corresponding confidence interval for the odds ratio is obtained by exponentiating the endpoints:

\left[e^{\hat{\beta}_i - z_{\alpha/2} \cdot SE(\hat{\beta}_i)}, \quad e^{\hat{\beta}_i + z_{\alpha/2} \cdot SE(\hat{\beta}_i)}\right]

💡 If the confidence interval for the odds ratio does not include 1, the predictor is statistically significant at the chosen level.
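The Wald interval and its exponentiation are simple enough to verify by hand. A small helper (illustrative; it hard-codes z critical values for common confidence levels rather than computing them):

```python
import math

def odds_ratio_ci(beta, se, conf=0.95):
    """Wald confidence interval for a coefficient, exponentiated to the OR scale.

    Returns (odds_ratio, ci_lower, ci_upper). Supports conf in {0.90, 0.95, 0.99}.
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[conf]
    lo, hi = beta - z * se, beta + z * se
    return math.exp(beta), math.exp(lo), math.exp(hi)
```

With beta = 0.05 and SE = 0.012 (the Age coefficient from the worked example later in this tutorial), the 95% interval for the odds ratio is roughly (1.027, 1.076), which excludes 1 and is therefore significant.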

7.5 Statistical Significance Testing

For each coefficient, the application reports:

Standard Error (SE(\hat{\beta}_i)): The estimated variability of the coefficient estimate. Derived from the square root of the diagonal elements of the Fisher Information Matrix (inverse of the Hessian of the log-likelihood):

SE(\hat{\beta}_i) = \sqrt{\left[(\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1}\right]_{ii}}

z-value (Wald Statistic): Tests the null hypothesis H_0: \beta_i = 0 against H_1: \beta_i \neq 0:

z_i = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}

Under H_0, z_i approximately follows a standard normal distribution \mathcal{N}(0, 1) for large samples.

p-value: The probability of observing a |z| at least as large as the calculated value, under H_0:

p\text{-value} = 2 \times P(Z > |z_i|) = 2 \times (1 - \Phi(|z_i|))

Where \Phi is the standard normal CDF. A small p-value (typically p < 0.05) provides evidence to reject H_0 and conclude that X_i is a statistically significant predictor.

Decision Rule:

| p-value | Interpretation |
|---|---|
| p < 0.001 | Extremely strong evidence against H_0 |
| 0.001 ≤ p < 0.01 | Very strong evidence against H_0 |
| 0.01 ≤ p < 0.05 | Strong evidence against H_0 |
| 0.05 ≤ p < 0.10 | Weak evidence against H_0 (marginal) |
| p ≥ 0.10 | Insufficient evidence against H_0 |

8. Model Fit and Evaluation

Unlike linear regression (which uses R²), logistic regression relies on likelihood-based measures to assess model quality.

8.1 Log-Likelihood

The log-likelihood measures how well the fitted model explains the observed data:

\ell(\hat{\boldsymbol{\beta}}) = \sum_{i=1}^{m} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]

8.2 Deviance

Deviance is defined as -2 times the log-likelihood:

D = -2\,\ell(\hat{\boldsymbol{\beta}})

Lower deviance indicates a better-fitting model. The null deviance (D_0) and residual deviance (D_r) are often compared:

\chi^2 = D_0 - D_r = -2(\ell_0 - \ell_{\hat{\boldsymbol{\beta}}})

This statistic follows a chi-squared distribution with degrees of freedom equal to the number of predictors, and can be used for a likelihood ratio test (LRT) of the overall model significance.

8.3 AIC (Akaike Information Criterion)

AIC penalises the log-likelihood for model complexity (number of parameters k):

\text{AIC} = -2\ell(\hat{\boldsymbol{\beta}}) + 2k

Where k is the number of estimated parameters (coefficients plus the intercept).

8.4 BIC (Bayesian Information Criterion)

Similar to AIC but with a stronger penalty for model complexity:

\text{BIC} = -2\ell(\hat{\boldsymbol{\beta}}) + k \ln(m)

Where m is the number of observations. BIC tends to favour more parsimonious (simpler) models than AIC.

8.5 Pseudo R² Measures

Since ordinary R² is not directly applicable to logistic regression, several pseudo R² measures have been developed. They all attempt to quantify "how much better" the fitted model is compared to the null model.

McFadden's Pseudo R²:

R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}

Where \ell_0 is the log-likelihood of the null model (intercept only).

Cox & Snell Pseudo R²:

R^2_{\text{CS}} = 1 - \left(\frac{L_0}{L_{\hat{\boldsymbol{\beta}}}}\right)^{2/m}

Where L_0 = e^{\ell_0} and L_{\hat{\boldsymbol{\beta}}} = e^{\ell(\hat{\boldsymbol{\beta}})} are the likelihoods (not log-likelihoods).

Nagelkerke's Pseudo R² (scaled Cox & Snell):

R^2_{\text{Nagelkerke}} = \frac{R^2_{\text{CS}}}{1 - L_0^{2/m}}

This is scaled so that it can reach a maximum of 1.
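All three measures can be computed directly from the fitted and null log-likelihoods; working on the log scale avoids under-flowing the raw likelihoods. A minimal sketch (illustrative only):

```python
import math

def pseudo_r2(ll_model, ll_null, m):
    """McFadden, Cox & Snell, and Nagelkerke pseudo R-squared.

    ll_model / ll_null: log-likelihoods of the fitted and intercept-only models;
    m: number of observations.
    """
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp((2.0 / m) * (ll_null - ll_model))  # 1 - (L0/L1)^(2/m)
    nagelkerke = cox_snell / (1.0 - math.exp((2.0 / m) * ll_null))  # scale to max 1
    return mcfadden, cox_snell, nagelkerke
```

For instance, with ll_null = -100, ll_model = -80 and m = 150, McFadden's R² is 0.20 (a good fit by the guidelines below), while Nagelkerke's value is somewhat higher.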

Interpretation Guidelines (McFadden's):

| McFadden's R² | Interpretation |
|---|---|
| 0.00 – 0.10 | Poor fit |
| 0.10 – 0.20 | Acceptable fit |
| 0.20 – 0.30 | Good fit |
| 0.30 – 0.40 | Excellent fit |
| > 0.40 | Outstanding fit (may warrant scrutiny) |

⚠️ Pseudo R² values are not directly comparable to R² from linear regression. A McFadden's R² of 0.20 generally indicates a good-fitting logistic regression model, whereas an R² of 0.20 in linear regression would typically be considered poor.

8.6 Hosmer–Lemeshow Goodness-of-Fit Test

The Hosmer–Lemeshow test assesses whether the observed event rates match predicted probabilities across deciles (or groups) of the predicted probability:

\chi^2_{HL} = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g (1 - E_g / n_g)}

Where:

- G is the number of groups (typically 10)
- O_g is the observed number of events in group g
- E_g is the expected number of events in group g
- n_g is the number of observations in group g

A non-significant p-value (p > 0.05) indicates that the model fits the data well (observed and expected values are close). A significant p-value suggests poor calibration.


9. Classification Metrics and Confusion Matrix

9.1 The Decision Threshold

After estimating predicted probabilities \hat{p}_i, a decision threshold \tau (default = 0.5) is used to classify each observation:

\hat{Y}_i = \begin{cases} 1 & \text{if } \hat{p}_i \geq \tau \\ 0 & \text{if } \hat{p}_i < \tau \end{cases}

The choice of threshold \tau involves a trade-off:

- Lowering \tau catches more true positives (higher sensitivity) but produces more false positives
- Raising \tau reduces false positives (higher specificity) but misses more true positives

9.2 The Confusion Matrix

The Confusion Matrix cross-tabulates the actual outcomes against the predicted outcomes:

| | Predicted Ŷ = 0 | Predicted Ŷ = 1 |
|---|---|---|
| Actual Y = 0 | True Negatives (TN) | False Positives (FP) — Type I Error |
| Actual Y = 1 | False Negatives (FN) — Type II Error | True Positives (TP) |

9.3 Classification Metrics

From the confusion matrix, a rich set of performance metrics can be derived:

Accuracy: The overall proportion of correct predictions:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

⚠️ Accuracy can be misleading with imbalanced classes. If 95% of outcomes are 0, a model that always predicts 0 achieves 95% accuracy but is useless.

Precision (Positive Predictive Value): Of all predicted positives, what proportion are actually positive?

\text{Precision} = \frac{TP}{TP + FP}

Recall (Sensitivity / True Positive Rate): Of all actual positives, what proportion were correctly identified?

\text{Recall} = \frac{TP}{TP + FN}

Specificity (True Negative Rate): Of all actual negatives, what proportion were correctly identified?

\text{Specificity} = \frac{TN}{TN + FP}

False Positive Rate: Of all actual negatives, what proportion were incorrectly classified as positive?

\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}

F1 Score: The harmonic mean of Precision and Recall. Balances both metrics, useful for imbalanced datasets:

F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Matthews Correlation Coefficient (MCC): A balanced metric even when class sizes are very unequal:

\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

MCC ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 indicating random chance.
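All of these metrics follow mechanically from the four confusion-matrix counts, as this small helper shows (a sketch; it does not guard against empty rows or columns in the matrix):

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
               / (((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5),
    }
```

For example, TP = 40, TN = 45, FP = 5, FN = 10 gives accuracy 0.85, recall 0.80 and specificity 0.90.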

9.4 Summary of Metrics

| Metric | Formula | Best Value | What It Emphasises |
|---|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | 1 | Overall correctness |
| Precision | TP/(TP+FP) | 1 | Avoiding false alarms |
| Recall | TP/(TP+FN) | 1 | Catching all positives |
| Specificity | TN/(TN+FP) | 1 | Catching all negatives |
| F1 Score | 2 × (P × R)/(P + R) | 1 | Balance of P and R |
| MCC | (see above) | 1 | Balanced (imbalanced data) |

10. ROC Curve and AUC

10.1 What is the ROC Curve?

The ROC (Receiver Operating Characteristic) Curve is a graphical tool that evaluates the performance of a binary classifier across all possible decision thresholds \tau \in [0, 1].

For each threshold \tau:

- Every observation is classified as 1 or 0 using that threshold
- The True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity) are computed

The ROC curve plots TPR (y-axis) against FPR (x-axis) for each threshold.

Interpretation:

- A curve along the 45° diagonal corresponds to random guessing (AUC = 0.5)
- The closer the curve bends toward the top-left corner (0, 1), the better the model separates the two classes
10.2 AUC (Area Under the ROC Curve)

The AUC summarises the entire ROC curve into a single number:

\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})

Probabilistic Interpretation: AUC equals the probability that the model ranks a randomly chosen positive instance higher (assigns a higher predicted probability) than a randomly chosen negative instance.
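The probabilistic interpretation translates directly into a (deliberately naive, O(n²)) pairwise computation, shown here as a sketch for small datasets; production code would use a rank-based formula instead:

```python
def auc_pairwise(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties in the scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that scores every positive above every negative gets AUC = 1.0; identical scores for everyone give 0.5.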

| AUC Value | Model Discrimination |
|---|---|
| 0.5 | No discrimination (random chance) |
| 0.5 – 0.6 | Poor |
| 0.6 – 0.7 | Fair |
| 0.7 – 0.8 | Acceptable |
| 0.8 – 0.9 | Excellent |
| 0.9 – 1.0 | Outstanding |
| 1.0 | Perfect discrimination |

10.3 Choosing the Optimal Threshold from the ROC Curve

Several methods exist for selecting the best operating threshold:

Youden's J Statistic: Maximises the sum of sensitivity and specificity:

J = \text{Sensitivity} + \text{Specificity} - 1 = \text{TPR} - \text{FPR}

The optimal threshold is the one at which J is maximised.

Closest to Top-Left: Minimise the Euclidean distance from the ROC curve point to the perfect point (0, 1):

d = \sqrt{\text{FPR}^2 + (1 - \text{TPR})^2}
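Youden's J can be found with a simple threshold sweep over the observed scores. This brute-force sketch (illustrative; it assumes both classes are present in `y_true`) recomputes the confusion matrix at each candidate threshold:

```python
def best_threshold_youden(y_true, scores):
    """Sweep candidate thresholds and return (threshold, J) maximising J = TPR - FPR."""
    best_tau, best_j = None, -1.0
    for tau in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= tau)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < tau)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= tau)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < tau)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_tau, best_j = tau, j
    return best_tau, best_j
```

Only the observed scores need to be tried as thresholds, since TPR and FPR change only when the threshold crosses a score.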


11. Prediction Tool

11.1 Point Prediction

The Prediction Tool allows you to input specific values for the independent variables and obtain a predicted probability of the outcome being 1.

The predicted probability is calculated using the estimated coefficients:

\hat{p} = \frac{1}{1 + e^{-(\hat{\beta}_0 + \sum_{i=1}^{n} \hat{\beta}_i X_{i,\text{input}})}}

Steps:

  1. Enter a value for each independent variable (numeric inputs are entered directly; categorical inputs are selected from a dropdown).
  2. The app automatically applies dummy coding to categorical inputs.
  3. The predicted probability \hat{p} is displayed.
  4. Based on the threshold (\tau = 0.5 by default), the predicted class is also shown.

11.2 Confidence Interval for Predicted Probability

For a model with predictor vector \mathbf{x}^* = (1, X_1^*, X_2^*, \dots, X_n^*)^T, the variance of the linear predictor \hat{z}^* = \hat{\boldsymbol{\beta}}^T \mathbf{x}^* is:

\text{Var}(\hat{z}^*) = {\mathbf{x}^*}^T \, \text{Cov}(\hat{\boldsymbol{\beta}}) \, \mathbf{x}^*

Where \text{Cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} is the variance-covariance matrix of the coefficients.

A (1 - \alpha) \times 100\% confidence interval for the linear predictor is:

\hat{z}^* \pm z_{\alpha/2} \sqrt{\text{Var}(\hat{z}^*)}

Converting back to probability:

\left[\frac{1}{1 + e^{-(\hat{z}^* - z_{\alpha/2} \sqrt{\text{Var}(\hat{z}^*)})}}, \quad \frac{1}{1 + e^{-(\hat{z}^* + z_{\alpha/2} \sqrt{\text{Var}(\hat{z}^*)})}}\right]

⚠️ Confidence intervals for predicted probabilities in multiple predictor models require the full variance-covariance matrix. The DataStatPro app currently computes prediction confidence intervals for single numeric predictor models; multi-predictor CIs will be added in a future release.


12. Worked Examples

Example 1: Single Predictor — Age and Ad Click Prediction

Suppose we model the probability of clicking on an ad (1 = clicked, 0 = not clicked) based on age (a single numeric predictor).

After fitting:

\text{logit}(\hat{p}) = -4.2 + 0.085 \times \text{Age}

Interpretation:

- The intercept (-4.2) is the log odds of clicking for a (hypothetical) person of age 0
- Each additional year of age increases the log odds of clicking by 0.085; the odds ratio is e^{0.085} \approx 1.089, i.e., the odds increase by roughly 8.9% per year

Prediction for Age = 30:

z = -4.2 + 0.085 \times 30 = -4.2 + 2.55 = -1.65

\hat{p} = \frac{1}{1 + e^{1.65}} \approx \frac{1}{1 + 5.207} \approx \frac{1}{6.207} \approx 0.161

A 30-year-old has approximately a 16.1% predicted probability of clicking the ad. The model classifies this as "not clicked" (below the 0.5 threshold).
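The arithmetic in this example can be checked with a short snippet. The coefficients here are the hypothetical fitted values from the example above, not output from a real model:

```python
import math

def predicted_probability(age):
    """Predicted click probability under the example model logit(p) = -4.2 + 0.085*Age."""
    z = -4.2 + 0.085 * age
    return 1.0 / (1.0 + math.exp(-z))
```

Evaluating `predicted_probability(30)` reproduces the hand calculation (about 0.161), and the probability rises monotonically with age, as the positive coefficient implies.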


Example 2: Multiple Predictors — Age and Location

Suppose you are predicting the probability of a customer clicking on an ad (1 = clicked, 0 = not clicked) based on age (numeric) and location (categorical: Urban, Suburban, Rural — Rural is the base category).

After fitting, the results are:

| Parameter | Estimate | Std. Error | z-value | p-value | Odds Ratio |
|---|---|---|---|---|---|
| Intercept | -3.5000 | 0.4500 | -7.778 | < 0.001 | 0.0302 |
| Age | 0.0500 | 0.0120 | 4.167 | < 0.001 | 1.0513 |
| Location (Urban) | 1.2000 | 0.4900 | 2.449 | 0.015 | 3.3201 |
| Location (Suburban) | 0.8000 | 0.4600 | 1.739 | 0.082 | 2.2255 |

Model Equation:

\text{logit}(\hat{p}) = -3.5000 + 0.0500 \times \text{Age} + 1.2000 \times D_\text{Urban} + 0.8000 \times D_\text{Suburban}

Coefficient Interpretation:

- Age: each additional year multiplies the odds of clicking by e^{0.05} \approx 1.051 (about a 5.1% increase), holding location constant
- Location (Urban): urban customers have e^{1.20} \approx 3.32 times the odds of clicking compared with rural customers of the same age
- Location (Suburban): suburban customers have e^{0.80} \approx 2.23 times the odds compared with rural customers, but this effect is not significant at the 5% level (p = 0.082)

Prediction Example — 40-year-old in Suburban Location:

z = -3.5000 + (0.0500 \times 40) + (1.2000 \times 0) + (0.8000 \times 1)

z = -3.5000 + 2.0000 + 0 + 0.8000 = -0.7000

\hat{p} = \frac{1}{1 + e^{0.7000}} \approx \frac{1}{1 + 2.0138} \approx \frac{1}{3.0138} \approx 0.3318

The predicted probability is approximately 0.3318 (33.18%). Since \hat{p} < 0.5, the model classifies this individual as "not clicked".

95% Confidence Interval for Odds Ratios:

| Parameter | OR | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| Age | 1.0513 | e^{0.05 - 1.96 × 0.012} ≈ 1.027 | e^{0.05 + 1.96 × 0.012} ≈ 1.076 |
| Location (Urban) | 3.3201 | e^{1.20 - 1.96 × 0.49} ≈ 1.271 | e^{1.20 + 1.96 × 0.49} ≈ 8.675 |

13. Common Mistakes and How to Avoid Them

Mistake 1: Using Logistic Regression With a Non-Binary Outcome

Problem: Applying binary logistic regression to a continuous or multi-category outcome.
Solution: For continuous outcomes, use linear regression. For k > 2 unordered categories, use multinomial logistic regression. For ordered categories, use ordinal logistic regression.

Mistake 2: Ignoring Class Imbalance

Problem: When one class (e.g., Y = 1) is very rare (e.g., 2% of data), the model may predict 0 for all observations and still achieve high accuracy.
Solution: Use precision, recall, F1, or AUC as primary metrics instead of accuracy. Consider oversampling the minority class (SMOTE), undersampling the majority class, or using class weights.

Mistake 3: Including Too Many Predictors (Overfitting)

Problem: With too many predictors relative to the number of events, the model overfits the training data and performs poorly on new data.
Solution: Follow the EPV rule (10–20 events per predictor). Use regularisation (L1/Lasso, L2/Ridge) or cross-validation for model selection.

Mistake 4: Multicollinearity

Problem: Highly correlated predictors inflate standard errors, making individual coefficient estimates unreliable (even if the overall model is fine).
Solution: Check pairwise correlations and VIF. Remove redundant variables or use dimensionality reduction (PCA) as a preprocessing step.

Mistake 5: Incorrect Reference Category

Problem: Choosing a reference category for a dummy variable arbitrarily, leading to confusing interpretations.
Solution: Choose a reference category that is scientifically meaningful (e.g., control group, most common category). Document the choice clearly.

Mistake 6: Interpreting Coefficients as Probability Changes

Problem: Saying "an increase in age by 1 year increases the probability of clicking by 0.05."
Solution: Coefficients in logistic regression are changes in log odds, not probabilities. The probability change depends on the current value of all predictors (it is non-linear). Always use odds ratios or calculate predicted probabilities at specific values.

Mistake 7: Ignoring Complete Separation

Problem: If a predictor perfectly predicts the outcome, MLE fails to converge and produces extremely large coefficients with enormous standard errors.
Solution: Look for warning messages about convergence. If separation exists, consider removing the problematic predictor, collapsing categories, or using Firth's penalised logistic regression.

Mistake 8: Not Checking the Linearity Assumption

Problem: Treating non-linear relationships between a continuous predictor and the log odds as linear, leading to a mis-specified model.
Solution: Plot smoothed log odds against each continuous predictor. Apply transformations (e.g., log, square root) or use polynomial terms or splines if needed.


14. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| Model fails to converge | Complete/quasi-complete separation; too many predictors; too few observations | Check for perfect predictors; reduce predictors; collect more data; use Firth regression |
| Very large coefficients (> 10) | Complete separation | Examine which predictor perfectly splits the outcomes |
| Very large standard errors | Multicollinearity or separation | Check VIF; examine correlation matrix |
| AUC = 0.5 | Model has no predictive power | Review variable selection; check data quality; consider non-linear models |
| All predictions = 0 or all = 1 | Severe class imbalance or separation | Check class distribution; adjust threshold; address imbalance |
| p-values all non-significant | Insufficient sample size; weak predictors | Increase sample size; reconsider predictor selection |
| Pseudo R² very high (> 0.9) | Possible overfitting or separation | Cross-validate; check for separation; reduce predictors |
| Confidence interval includes 1 (for OR) | Non-significant predictor | Variable may not contribute meaningfully; consider removing |

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
|---|---|
| \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum \beta_i X_i | Log odds equation |
| p = \frac{1}{1 + e^{-(\beta_0 + \sum \beta_i X_i)}} | Predicted probability |
| \text{OR}_i = e^{\beta_i} | Odds ratio for predictor i |
| z_i = \hat{\beta}_i / SE(\hat{\beta}_i) | Wald z-statistic |
| \text{AIC} = -2\ell + 2k | Akaike Information Criterion |
| R^2_{\text{McFadden}} = 1 - \ell_{\hat{\beta}} / \ell_0 | McFadden's Pseudo R² |
| \text{Accuracy} = (TP + TN)/(TP + TN + FP + FN) | Overall accuracy |
| \text{Precision} = TP/(TP + FP) | Positive predictive value |
| \text{Recall} = TP/(TP + FN) | Sensitivity |
| \text{Specificity} = TN/(TN + FP) | True negative rate |
| F_1 = 2 \times \frac{P \times R}{P + R} | F1 Score |

Odds Ratio Interpretation

| OR Value | Meaning |
|---|---|
| > 1 | Predictor increases odds of outcome |
| = 1 | Predictor has no effect on odds |
| < 1 | Predictor decreases odds of outcome |

Model Comparison Guide

| Scenario | Recommended Metric |
|---|---|
| Comparing nested models | Likelihood Ratio Test (χ²) |
| Comparing non-nested models | AIC or BIC |
| Evaluating discrimination ability | AUC |
| Evaluating calibration | Hosmer-Lemeshow test |
| Imbalanced classes | F1, MCC, AUC (not Accuracy) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Logistic Regression using the DataStatPro application. For further reading, consult Hosmer & Lemeshow's "Applied Logistic Regression" or Agresti's "Categorical Data Analysis". For feature requests or support, contact the DataStatPro team.