Logistic Regression: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of logistic regression all the way through advanced interpretation, model diagnostics, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.
Table of Contents
- Prerequisites and Background Concepts
- What is Logistic Regression?
- The Mathematics Behind Logistic Regression
- Assumptions of Logistic Regression
- Types of Logistic Regression
- Using the Logistic Regression Component
- Computational and Formula Details
- Model Fit and Evaluation
- Classification Metrics and Confusion Matrix
- ROC Curve and AUC
- Prediction Tool
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into logistic regression, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.
1.1 Probability
Probability is a number between 0 and 1 that describes the likelihood of an event occurring:
- P = 0 means the event will definitely not occur.
- P = 1 means the event will definitely occur.
- P = 0.7 means the event has a 70% chance of occurring.
1.2 Odds
Odds are another way to express the likelihood of an event. Instead of asking "what fraction of the time does this happen?", odds ask "how many times more likely is success compared to failure?":

Odds = P / (1 − P)

For example, if P = 0.75 (75% probability of success):

Odds = 0.75 / 0.25 = 3

This means success is 3 times more likely than failure, often expressed as "3 to 1 odds".
1.3 Log Odds (Logit)
The log odds (also called the logit) is simply the natural logarithm of the odds:

Logit(P) = ln(Odds) = ln(P / (1 − P))
| Probability (P) | Odds | Log Odds (Logit) |
|---|---|---|
| 0.10 | 0.111 | -2.197 |
| 0.25 | 0.333 | -1.099 |
| 0.50 | 1.000 | 0.000 |
| 0.75 | 3.000 | 1.099 |
| 0.90 | 9.000 | 2.197 |
Key insight: Log odds range from −∞ to +∞, which makes them suitable for a linear model.
1.4 Why Not Use Linear Regression for Binary Outcomes?
A natural first question is: why can't we just use linear regression to predict a 0/1 outcome?
Linear regression predicts values using:

ŷ = β₀ + β₁x₁ + … + βₖxₖ
The problem is that linear regression can produce predicted values outside the range [0, 1] — for example, -0.3 or 1.7 — which are meaningless as probabilities. Logistic regression solves this by transforming its output through the logistic (sigmoid) function, which always produces values between 0 and 1.
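The bounding behaviour is easy to demonstrate numerically. The sketch below (the function name `sigmoid` is ours, not part of DataStatPro) passes the same out-of-range linear predictions through the logistic function:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A raw linear predictor can wander outside [0, 1]...
linear_preds = [-0.3, 0.4, 1.7]
# ...but passing those values through the sigmoid always yields
# a valid probability strictly between 0 and 1.
probs = [sigmoid(z) for z in linear_preds]
print(probs)
```

Whatever real number the linear part produces, the sigmoid squashes it into a usable probability.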
2. What is Logistic Regression?
Logistic Regression is a statistical method used to model the probability of a binary outcome — that is, an outcome that takes one of exactly two values (e.g., Yes/No, 1/0, True/False, Disease/No Disease).
Despite its name containing "regression," logistic regression is fundamentally a classification algorithm. It estimates the probability that an observation belongs to a particular class.
2.1 Real-World Applications
Logistic regression is one of the most widely used methods in statistics and machine learning. Common applications include:
- Medicine: Predicting the likelihood of a disease (e.g., cancer, diabetes) based on patient characteristics (age, blood pressure, test results).
- Marketing: Predicting whether a customer will purchase a product, click on an ad, or churn from a subscription.
- Finance: Predicting the probability of loan default or credit card fraud.
- Social Sciences: Predicting voting behaviour, employment outcomes, or educational attainment.
- Engineering: Predicting the probability of a component failure.
2.2 Binary Outcome Variable
The dependent variable (also called the response or outcome variable) in logistic regression must be binary. By convention:
- The value 1 represents the "event of interest" (the positive class), e.g., Disease Present, Purchase Made, Default.
- The value 0 represents the "non-event" (the negative class), e.g., No Disease, No Purchase, No Default.
The choice of which class is "1" and which is "0" is meaningful — it affects the direction of coefficients. Make sure to define this mapping clearly before running the model.
2.3 Logistic Regression vs. Linear Regression: A Summary
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Outcome Type | Continuous (e.g., income) | Binary (e.g., yes/no) |
| Predicted Value | Any real number | Probability between 0 and 1 |
| Link Function | Identity | Logit (log odds) |
| Model Fitting Method | Ordinary Least Squares (OLS) | Maximum Likelihood Estimation (MLE) |
| Goodness-of-Fit Metric | R² | Pseudo R², AIC, Log-Likelihood |
| Error Distribution | Normal | Binomial |
3. The Mathematics Behind Logistic Regression
This section builds up the full mathematical framework of logistic regression from scratch.
3.1 The Logistic (Sigmoid) Function
The heart of logistic regression is the logistic function (also called the sigmoid function):

σ(z) = 1 / (1 + e^(−z))

Where z is any real number. The logistic function maps any real-valued input to a value in the open interval (0, 1), making it ideal for representing a probability.
Key properties of the logistic function:
- σ(0) = 0.5 — when the linear combination equals zero, the predicted probability is 50%.
- As z → +∞, σ(z) → 1.
- As z → −∞, σ(z) → 0.
- The function is S-shaped (sigmoidal), centred at σ(0) = 0.5.
3.2 The Logistic Regression Model
In logistic regression, z is replaced by the linear combination of predictors:

z = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

So the predicted probability that the outcome Y = 1 given the predictors is:

p̂ = P(Y = 1 | X) = 1 / (1 + e^(−(β₀ + β₁x₁ + … + βₖxₖ)))

Where:
- p̂ is the predicted probability of the outcome being 1.
- β₀ is the intercept (bias term).
- β₁, …, βₖ are the regression coefficients for each predictor x₁, …, xₖ.
- k is the number of independent variables.
3.3 The Logit Transformation
Taking the logit (log odds) of the predicted probability linearises the model:

ln(p̂ / (1 − p̂)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ
This is the fundamental equation of logistic regression. On the left side is the log odds; on the right side is a linear combination of predictors — exactly like a linear regression model. This is why it is called logistic regression.
3.4 From Log Odds Back to Probability
Given the logit, you can always convert back to a probability using:

p̂ = e^logit / (1 + e^logit) = 1 / (1 + e^(−logit))
3.5 The Likelihood Function and Maximum Likelihood Estimation (MLE)
Logistic regression coefficients are estimated by Maximum Likelihood Estimation (MLE) — the method that finds the values of β₀, β₁, …, βₖ that make the observed data most probable.

For a dataset of n observations, the likelihood function is:

L(β) = ∏ᵢ p̂ᵢ^yᵢ (1 − p̂ᵢ)^(1 − yᵢ)

Where:
- yᵢ ∈ {0, 1} is the actual outcome for observation i.
- p̂ᵢ is the model's predicted probability of Y = 1 for observation i.

It is more convenient to work with the log-likelihood (since logarithms turn products into sums):

ℓ(β) = Σᵢ [ yᵢ ln(p̂ᵢ) + (1 − yᵢ) ln(1 − p̂ᵢ) ]

MLE finds the coefficient vector β that maximises ℓ(β). This cannot be solved analytically (unlike OLS), so iterative algorithms are used.
3.6 The IRLS Algorithm
The most common algorithm for maximising the log-likelihood is Iteratively Reweighted Least Squares (IRLS), a special case of Newton-Raphson optimisation.
At each iteration t, the algorithm updates the coefficient estimates using:

β⁽ᵗ⁺¹⁾ = β⁽ᵗ⁾ + (Xᵀ W X)⁻¹ Xᵀ (y − p⁽ᵗ⁾)

Where:
- X is the design matrix of predictors (with a column of 1s for the intercept).
- W is a diagonal weight matrix with entries wᵢᵢ = p̂ᵢ(1 − p̂ᵢ).
- y is the vector of actual outcomes.
- p⁽ᵗ⁾ is the vector of predicted probabilities at iteration t.

The algorithm continues until the change in log-likelihood or coefficients is smaller than a convergence threshold (e.g., 10⁻⁸).
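The update rule above can be sketched in a few lines of NumPy. This is a teaching sketch of IRLS under the assumptions stated in the comments (no safeguards against separation or step-halving), not DataStatPro's actual implementation:

```python
import numpy as np

def fit_logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Fit binary logistic regression via IRLS (Newton-Raphson).

    X: (n, k) predictor matrix WITHOUT an intercept column.
    y: (n,) array of 0/1 outcomes.
    Returns the coefficient vector, intercept first.
    """
    X = np.column_stack([np.ones(len(y)), X])   # prepend column of 1s
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # current predicted probabilities
        w = p * (1.0 - p)                       # diagonal of the weight matrix W
        # Newton step: (X' W X)^-1 X' (y - p)
        step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # converged
            break
    return beta

# Tiny demo on simulated data with true coefficients (0.5, 1.5)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.5 * x)))).astype(float)
beta_hat = fit_logistic_irls(x[:, None], y)
print(beta_hat)
```

On simulated data like this, the recovered coefficients land close to the true values used to generate the outcomes.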
3.7 The Cost (Cross-Entropy) Function
An equivalent way to frame MLE is as minimising the binary cross-entropy loss:

J(β) = −(1/n) Σᵢ [ yᵢ ln(p̂ᵢ) + (1 − yᵢ) ln(1 − p̂ᵢ) ]

This is simply the negative log-likelihood divided by n. Minimising J(β) is equivalent to maximising ℓ(β).
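The equivalence between the log-likelihood and the cross-entropy loss is easy to check numerically; a minimal sketch (function names are ours):

```python
import math

def log_likelihood(y, p):
    """Binary log-likelihood: sum of y*ln(p) + (1-y)*ln(1-p)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def cross_entropy(y, p):
    """Binary cross-entropy: the negative log-likelihood divided by n."""
    return -log_likelihood(y, p) / len(y)

# Four observations with hypothetical predicted probabilities
y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.7, 0.6]
print(cross_entropy(y, p))  # a small positive loss; lower is better
```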
4. Assumptions of Logistic Regression
Logistic regression is a powerful tool, but its results are valid only when certain assumptions are reasonably met. Understanding these assumptions helps you avoid misuse and misinterpretation.
4.1 Binary (or Ordinal/Multinomial) Dependent Variable
The outcome must be binary (two categories). If your outcome has more than two unordered categories, use Multinomial Logistic Regression. If the categories are ordered, use Ordinal Logistic Regression.
4.2 Independence of Observations
Each observation in the dataset must be independent of all others. For example:
- ✅ Each row represents a different, unrelated individual.
- ❌ Repeated measurements from the same individual without accounting for it.
4.3 No (or Little) Multicollinearity
Independent variables should not be highly correlated with each other. Severe multicollinearity inflates standard errors, making it harder to determine the individual effect of each predictor. Check using the Variance Inflation Factor (VIF): generally, VIF > 10 is a concern.
4.4 Linearity of Independent Variables and the Log Odds
For continuous predictors, logistic regression assumes a linear relationship between the predictor and the log odds (not the probability). This can be checked with a Box-Tidwell test or by plotting the log odds against each continuous predictor.
4.5 No Extreme Outliers
Logistic regression can be sensitive to extreme outliers in the continuous independent variables. Influential observations should be identified and examined.
4.6 Large Sample Size
Logistic regression requires a reasonably large sample. A common rule of thumb is:
- At least 10–20 events per predictor variable (EPV rule).
- For example, if you have 5 predictors, you need at least 50–100 observations where Y = 1.
4.7 No Perfect Separation (Complete Separation)
If a predictor or combination of predictors perfectly separates the two outcome groups, the MLE algorithm will fail to converge (coefficients go to ±∞). This is called complete separation and is a sign that the model is too good — often due to a small sample or a predictor that essentially duplicates the outcome.
5. Types of Logistic Regression
| Type | Outcome Variable | Example |
|---|---|---|
| Binary | Two categories (0 or 1) | Disease: Yes / No |
| Multinomial | Three or more unordered categories | Colour Preference: Red / Green / Blue |
| Ordinal | Three or more ordered categories | Severity: Low / Medium / High |
The DataStatPro application implements Binary Logistic Regression, which is the most common type and the focus of this tutorial.
6. Using the Logistic Regression Component
The Logistic Regression component in the DataStatPro application provides a full end-to-end workflow for performing binary logistic regression on your datasets.
Step-by-Step Guide
Step 1 — Select Dataset Choose the dataset you want to analyse from the "Dataset" dropdown. The dataset should have at least one binary variable and one or more predictor variables.
Step 2 — Select Independent Variables (X) Select one or more predictor variables from the "Independent Variables (X)" dropdown. These can be:
- Numeric (e.g., age, income, test score)
- Categorical (e.g., gender, region, product type)
💡 Tip: Start with variables you have a theoretical reason to believe are associated with the outcome. Avoid blindly throwing in many unrelated predictors.
Step 3 — Select Dependent Variable (Y — Binary) Select the binary outcome variable from the "Dependent Variable (Y — Binary)" dropdown.
- The variable must have exactly two distinct values (e.g., 0/1, Yes/No, True/False).
- If you select a categorical variable with two values, you will be prompted to map which value corresponds to 1 (the event of interest) and which to 0.
⚠️ Important: Make sure you correctly assign which category is 1 and which is 0. This directly affects the direction and interpretation of all coefficients.
Step 4 — Select Base Categories (for Categorical Predictors) For each categorical independent variable with more than two categories, you must specify a base (reference) category. The base category is the group against which all other groups are compared.
💡 Tip: Choose the most natural reference group (e.g., "Placebo" in a drug trial, "Rural" in a location study, or simply the most common category).
Step 5 — Select Confidence Level Choose the desired confidence level for confidence intervals (e.g., 95% is the standard). This affects the confidence intervals reported for each coefficient.
Step 6 — Display Options Select which visualisations and outputs you wish to display:
- ✅ Regression Curve (for single numeric predictor models)
- ✅ ROC Curve
- ✅ Equation
- ✅ Confusion Matrix
- ✅ Coefficient Table
Step 7 — Run the Analysis Click the "Run Logistic Regression" button. The application will:
- Encode categorical variables using dummy coding.
- Fit the model using the IRLS algorithm.
- Calculate coefficients, standard errors, z-values, p-values, and confidence intervals.
- Compute model fit statistics (Log-Likelihood, AIC, Pseudo R²).
- Generate the confusion matrix and classification metrics.
- Plot the ROC curve and calculate the AUC.
7. Computational and Formula Details
7.1 The Logistic Function and Logit Transformation
The full logistic regression model expresses the log odds as a linear function of predictors:

ln(p / (1 − p)) = β₀ + β₁x₁ + … + βₖxₖ

Where:
- p is the probability of Y = 1.
- β₀ is the intercept.
- β₁, …, βₖ are the coefficients for predictors x₁, …, xₖ.

Solving for p gives the predicted probability:

p = 1 / (1 + e^(−(β₀ + β₁x₁ + … + βₖxₖ)))
7.2 Handling Categorical Independent Variables (Dummy Coding)
Categorical independent variables with more than two categories cannot be entered directly into a regression equation. They are converted into a set of dummy (indicator) variables using dummy coding (also known as one-hot encoding with a reference category).
For a categorical variable with m categories, m − 1 dummy variables are created. One category is designated the base (reference) category, and it receives no dummy variable. Each dummy variable is defined as:

Dⱼ = 1 if the observation belongs to category j, and 0 otherwise.
Example:
A categorical variable "Region" has three categories: Urban, Suburban, Rural (base).
| Region | D_Urban | D_Suburban |
|---|---|---|
| Urban | 1 | 0 |
| Suburban | 0 | 1 |
| Rural | 0 | 0 |
The model becomes:

ln(p / (1 − p)) = β₀ + β₁ · D_Urban + β₂ · D_Suburban

The coefficients β₁ and β₂ represent the difference in log odds relative to the Rural base category.
⚠️ Never include all dummy variables — this creates perfect multicollinearity (the "dummy variable trap"). Always omit one category as the base.
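The dummy-coding step can be sketched in plain Python. Note one assumption in this sketch: the non-base categories are ordered alphabetically, so the column order may differ from the table above (the `dummy_code` helper is ours, not a DataStatPro function):

```python
def dummy_code(values, base):
    """Dummy-code a categorical column: one 0/1 column per non-base category.

    Returns (column_names, rows). The base (reference) category gets no
    column, so it maps to an all-zero row -- avoiding the dummy variable trap.
    """
    categories = [c for c in sorted(set(values)) if c != base]
    names = [f"D_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

names, rows = dummy_code(["Urban", "Suburban", "Rural"], base="Rural")
print(names)  # ['D_Suburban', 'D_Urban']
print(rows)   # [[0, 1], [1, 0], [0, 0]] -- Rural is all zeros
```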
7.3 Interpretation of Coefficients
Log Odds Interpretation
Each coefficient βⱼ represents the change in the log odds of Y = 1 for a one-unit increase in xⱼ, holding all other variables constant.
Odds Ratio Interpretation
The odds ratio (OR) is obtained by exponentiating the coefficient:

ORⱼ = e^βⱼ

The odds ratio is the factor by which the odds multiply for a one-unit increase in xⱼ, holding all other variables constant:
| Odds Ratio Value | Interpretation |
|---|---|
| OR > 1 | Odds of Y = 1 increase as x increases |
| OR = 1 | Odds of Y = 1 are unchanged (x has no effect) |
| OR < 1 | Odds of Y = 1 decrease as x increases |
Example: If β_age = 0.05, then OR = e^0.05 ≈ 1.051. For each additional year of age, the odds of the event increase by approximately 5.1%, holding other variables constant.
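The example above is a one-line calculation; a quick sketch for checking it yourself (variable names are ours):

```python
import math

beta_age = 0.05                     # fitted coefficient for Age (log-odds scale)
odds_ratio = math.exp(beta_age)     # multiplicative effect on the odds
pct_change = (odds_ratio - 1) * 100
print(round(odds_ratio, 4), round(pct_change, 1))  # 1.0513 5.1
```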
Converting Between Scales
| Scale | Formula | Range |
|---|---|---|
| Log Odds | βⱼ | (−∞, +∞) |
| Odds Ratio | e^βⱼ | (0, +∞) |
| Probability (at mean X) | 1 / (1 + e^(−(β₀ + β₁x̄₁ + … + βₖx̄ₖ))) | (0, 1) |
7.4 Confidence Intervals for Coefficients
A (1 − α) × 100% confidence interval for coefficient βⱼ is:

βⱼ ± z_(α/2) × SE(βⱼ)

Where:
- z_(α/2) is the critical value from the standard normal distribution (e.g., 1.96 for a 95% CI).
- SE(βⱼ) is the standard error of the coefficient.

The corresponding confidence interval for the odds ratio is obtained by exponentiating the endpoints:

( e^(βⱼ − z_(α/2) · SE(βⱼ)), e^(βⱼ + z_(α/2) · SE(βⱼ)) )
💡 If the confidence interval for the odds ratio does not include 1, the predictor is statistically significant at the chosen level.
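The exponentiate-the-endpoints step can be sketched directly (the `odds_ratio_ci` helper and the z = 1.96 default are ours; the example numbers are the Urban coefficient from the worked example later in this tutorial):

```python
import math

def odds_ratio_ci(beta, se, z_crit=1.96):
    """Wald confidence interval for an odds ratio: build the CI on the
    log-odds scale, then exponentiate the endpoints."""
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    return math.exp(beta), lower, upper

or_, lo, hi = odds_ratio_ci(beta=1.2, se=0.49)
print(round(or_, 3), round(lo, 3), round(hi, 3))
# The interval excludes 1, so this predictor is significant at the 5% level
```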
7.5 Statistical Significance Testing
For each coefficient, the application reports:
Standard Error (SE(βⱼ)): The estimated variability of the coefficient estimate, derived from the square root of the diagonal elements of the inverse Fisher Information Matrix (the negative inverse Hessian of the log-likelihood):

SE(βⱼ) = √( [(Xᵀ W X)⁻¹]ⱼⱼ )

z-value (Wald Statistic): Tests the null hypothesis H₀: βⱼ = 0 against H₁: βⱼ ≠ 0:

z = βⱼ / SE(βⱼ)

Under H₀, z follows an approximately standard normal distribution for large samples.

p-value: The probability of observing a |z| at least as large as the calculated value, under H₀:

p = 2 × (1 − Φ(|z|))

Where Φ is the standard normal CDF. A small p-value (typically < 0.05) provides evidence to reject H₀ and conclude that xⱼ is a statistically significant predictor.
Decision Rule:
| p-value | Interpretation |
|---|---|
| p < 0.001 | Extremely strong evidence against H₀ |
| 0.001 ≤ p < 0.01 | Very strong evidence against H₀ |
| 0.01 ≤ p < 0.05 | Strong evidence against H₀ |
| 0.05 ≤ p < 0.10 | Weak evidence against H₀ (marginal) |
| p ≥ 0.10 | Insufficient evidence against H₀ |
8. Model Fit and Evaluation
Unlike linear regression (which uses R²), logistic regression relies on likelihood-based measures to assess model quality.
8.1 Log-Likelihood
The log-likelihood measures how well the fitted model explains the observed data:
- Always negative (since 0 < p̂ᵢ < 1, so ln(p̂ᵢ) < 0 and ln(1 − p̂ᵢ) < 0).
- Higher (less negative) values indicate a better-fitting model.
- The null model log-likelihood is computed using only an intercept (no predictors).
- The fitted model log-likelihood uses all predictors.
8.2 Deviance
Deviance is defined as −2 times the log-likelihood:

D = −2ℓ

Lower deviance indicates a better-fitting model. The null deviance (D_null = −2ℓ_null) and residual deviance (D_model = −2ℓ_model) are often compared:

G = D_null − D_model = 2(ℓ_model − ℓ_null)
This statistic follows a chi-squared distribution with degrees of freedom equal to the number of predictors, and can be used for a likelihood ratio test (LRT) of the overall model significance.
8.3 AIC (Akaike Information Criterion)
AIC penalises the log-likelihood for model complexity (number of parameters k):

AIC = −2ℓ + 2k

Where k = number of estimated parameters (coefficients + intercept).
- Lower AIC = better model (better fit relative to complexity).
- Useful for comparing non-nested models (models with different sets of predictors).
- AIC does not have an absolute scale — only comparisons between models are meaningful.
8.4 BIC (Bayesian Information Criterion)
Similar to AIC but with a stronger penalty for model complexity:

BIC = −2ℓ + k ln(n)

Where n is the number of observations. BIC tends to favour more parsimonious (simpler) models than AIC.
8.5 Pseudo R² Measures
Since ordinary R² is not directly applicable to logistic regression, several pseudo R² measures have been developed. They all attempt to quantify "how much better" the fitted model is compared to the null model.
McFadden's Pseudo R²:

R²_McF = 1 − (ℓ_model / ℓ_null)

Where ℓ_null is the log-likelihood of the null model (intercept only).

Cox & Snell Pseudo R²:

R²_CS = 1 − (L_null / L_model)^(2/n)

Where L_null and L_model are the likelihoods (not log-likelihoods).

Nagelkerke's Pseudo R² (scaled Cox & Snell):

R²_N = R²_CS / (1 − L_null^(2/n))

This is scaled so that it can reach a maximum of 1.
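The three measures are simple functions of the model and null log-likelihoods; a sketch using hypothetical log-likelihood values (the numbers are illustrative, not from a real fit):

```python
import math

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R^2 from model and null log-likelihoods."""
    return 1.0 - ll_model / ll_null

def nagelkerke_r2(ll_model, ll_null, n):
    """Cox & Snell pseudo R^2, rescaled so its maximum is 1 (Nagelkerke)."""
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    max_cs = 1.0 - math.exp((2.0 / n) * ll_null)
    return cox_snell / max_cs

# Hypothetical values for illustration only
ll_null, ll_model, n = -138.6, -110.3, 200
print(round(mcfadden_r2(ll_model, ll_null), 3))   # ~0.204 -> a "good" fit
print(round(nagelkerke_r2(ll_model, ll_null, n), 3))
```

Note that both L^(2/n) terms are computed on the log scale via `math.exp`, which avoids under/overflow with the raw likelihoods.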
Interpretation Guidelines (McFadden's):
| McFadden's R² | Interpretation |
|---|---|
| < 0.10 | Poor fit |
| 0.10–0.20 | Acceptable fit |
| 0.20–0.40 | Good fit |
| 0.40–0.60 | Excellent fit |
| > 0.60 | Outstanding fit (may warrant scrutiny) |
⚠️ Pseudo R² values are not directly comparable to R² from linear regression. A McFadden's of 0.20 is generally considered a good-fitting logistic regression model, whereas an R² of 0.20 in linear regression would typically be considered poor.
8.6 Hosmer–Lemeshow Goodness-of-Fit Test
The Hosmer–Lemeshow test assesses whether the observed event rates match predicted probabilities across deciles (or groups) of the predicted probability:

χ²_HL = Σ_{g=1}^{G} (O_g − E_g)² / ( E_g (1 − E_g / n_g) )

Where:
- G is the number of groups (typically 10 deciles).
- O_g is the observed number of events in group g.
- E_g is the expected number of events in group g (the sum of the predicted probabilities in that group).
- n_g is the total number of observations in group g.

A non-significant p-value (p > 0.05) indicates that the model fits the data well (observed and expected values are close). A significant p-value suggests poor calibration.
9. Classification Metrics and Confusion Matrix
9.1 The Decision Threshold
After estimating predicted probabilities p̂ᵢ, a decision threshold c (default c = 0.5) is used to classify each observation:

ŷᵢ = 1 if p̂ᵢ ≥ c, otherwise ŷᵢ = 0
The choice of threshold involves a trade-off:
- Higher threshold: Fewer false positives but more false negatives (more conservative).
- Lower threshold: More true positives but also more false positives (more liberal).
- The optimal threshold depends on the cost of each type of error in your application.
9.2 The Confusion Matrix
The Confusion Matrix cross-tabulates the actual outcomes against the predicted outcomes:
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | True Negatives (TN) | False Positives (FP) — Type I Error |
| Actual 1 | False Negatives (FN) — Type II Error | True Positives (TP) |
- True Positive (TP): Model correctly predicted the event (predicted 1, actual 1).
- True Negative (TN): Model correctly predicted the non-event (predicted 0, actual 0).
- False Positive (FP): Model incorrectly predicted the event (predicted 1, actual 0). Also called a Type I error.
- False Negative (FN): Model missed the event (predicted 0, actual 1). Also called a Type II error.
9.3 Classification Metrics
From the confusion matrix, a rich set of performance metrics can be derived:
Accuracy: The overall proportion of correct predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
⚠️ Accuracy can be misleading with imbalanced classes. If 95% of outcomes are 0, a model that always predicts 0 achieves 95% accuracy but is useless.
Precision (Positive Predictive Value): Of all predicted positives, what proportion are actually positive?

Precision = TP / (TP + FP)
Recall (Sensitivity / True Positive Rate): Of all actual positives, what proportion were correctly identified?

Recall = TP / (TP + FN)
Specificity (True Negative Rate): Of all actual negatives, what proportion were correctly identified?

Specificity = TN / (TN + FP)
False Positive Rate: Of all actual negatives, what proportion were incorrectly classified as positive?

FPR = FP / (FP + TN) = 1 − Specificity
F1 Score: The harmonic mean of Precision and Recall. Balances both metrics, useful for imbalanced datasets:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Matthews Correlation Coefficient (MCC): A balanced metric even when class sizes are very unequal:

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC ranges from −1 (total disagreement) to +1 (perfect prediction), with 0 indicating random chance.
9.4 Summary of Metrics
| Metric | Formula | Best Value | What It Emphasises |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | 1 | Overall correctness |
| Precision | TP / (TP + FP) | 1 | Avoiding false alarms |
| Recall | TP / (TP + FN) | 1 | Catching all positives |
| Specificity | TN / (TN + FP) | 1 | Catching all negatives |
| F1 Score | 2PR / (P + R) | 1 | Balance of P and R |
| MCC | (see above) | 1 | Balanced (imbalanced data) |
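All of the metrics above derive from the four confusion-matrix counts; a sketch (the `classification_metrics` helper and the example counts are ours, and the sketch assumes every denominator is non-zero):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Derive the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity / TPR
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}

# Example counts: 30 TP, 50 TN, 10 FP, 10 FN (100 observations)
m = classification_metrics(tp=30, tn=50, fp=10, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```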
10. ROC Curve and AUC
10.1 What is the ROC Curve?
The ROC (Receiver Operating Characteristic) Curve is a graphical tool that evaluates the performance of a binary classifier across all possible decision thresholds c.

For each threshold c:
- Compute the True Positive Rate (Sensitivity / Recall): TPR = TP / (TP + FN)
- Compute the False Positive Rate: FPR = FP / (FP + TN)

The ROC curve plots TPR (y-axis) against FPR (x-axis) for each threshold.
Interpretation:
- A diagonal line from (0, 0) to (1, 1) represents a random classifier (AUC = 0.5).
- A curve that bows toward the top-left corner represents a good classifier.
- The top-left corner (FPR = 0, TPR = 1) represents a perfect classifier (AUC = 1.0).
10.2 AUC (Area Under the ROC Curve)
The AUC summarises the entire ROC curve into a single number:

AUC = ∫₀¹ TPR d(FPR)
Probabilistic Interpretation: AUC equals the probability that the model ranks a randomly chosen positive instance higher (assigns a higher predicted probability) than a randomly chosen negative instance.
| AUC Value | Model Discrimination |
|---|---|
| = 0.5 | No discrimination (random chance) |
| 0.5–0.6 | Poor |
| 0.6–0.7 | Fair |
| 0.7–0.8 | Acceptable |
| 0.8–0.9 | Excellent |
| 0.9–1.0 | Outstanding |
| = 1.0 | Perfect discrimination |
10.3 Choosing the Optimal Threshold from the ROC Curve
Several methods exist for selecting the best operating threshold:
Youden's J Statistic: Maximises the sum of sensitivity and specificity:

J = Sensitivity + Specificity − 1 = TPR − FPR

The optimal threshold is where J is maximised.

Closest to Top-Left: Minimise the Euclidean distance from the ROC curve point to the perfect point (0, 1):

d = √( (1 − TPR)² + FPR² )
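The threshold sweep, trapezoidal AUC, and Youden's J can all be sketched from first principles. This is a teaching sketch (function names are ours; it assumes both classes are present in `y_true`), not DataStatPro's implementation:

```python
def roc_points(y_true, y_prob):
    """(FPR, TPR, threshold) triples, sweeping the threshold over each
    distinct predicted probability, plus the (0, 0) origin."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0, None)]
    for c in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= c and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= c and y == 0)
        points.append((fp / neg, tp / pos, c))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    total = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        total += (x1 - x0) * (y0 + y1) / 2.0
    return total

def youden_threshold(points):
    """Threshold maximising J = TPR - FPR (skipping the origin point)."""
    return max(points[1:], key=lambda p: p[1] - p[0])[2]

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, y_prob)
print(auc(pts))  # 0.75 for this tiny example
print(youden_threshold(pts))
```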
11. Prediction Tool
11.1 Point Prediction
The Prediction Tool allows you to input specific values for the independent variables and obtain a predicted probability of the outcome being 1.
The predicted probability is calculated using the estimated coefficients:

p̂ = 1 / (1 + e^(−(β̂₀ + β̂₁x₁ + … + β̂ₖxₖ)))
Steps:
- Enter a value for each independent variable (numeric inputs are entered directly; categorical inputs are selected from a dropdown).
- The app automatically applies dummy coding to categorical inputs.
- The predicted probability is displayed.
- Based on the threshold (0.5 by default), the predicted class is also shown.
11.2 Confidence Interval for Predicted Probability
For a model with predictor vector x = (1, x₁, …, xₖ)ᵀ, the variance of the linear predictor η̂ = xᵀβ̂ is:

Var(η̂) = xᵀ Σ x

Where Σ = (Xᵀ W X)⁻¹ is the variance-covariance matrix of the coefficients.

A (1 − α) × 100% confidence interval for the linear predictor is:

η̂ ± z_(α/2) × √Var(η̂)

Converting back to probability:

( 1 / (1 + e^(−η_lower)), 1 / (1 + e^(−η_upper)) )
⚠️ Confidence intervals for predicted probabilities in multiple predictor models require the full variance-covariance matrix. The DataStatPro app currently computes prediction confidence intervals for single numeric predictor models; multi-predictor CIs will be added in a future release.
12. Worked Examples
Example 1: Single Predictor — Age and Ad Click Prediction
Suppose we model the probability of clicking on an ad (1 = clicked, 0 = not clicked) based on age (a single numeric predictor).
After fitting:

ln(p̂ / (1 − p̂)) = −4.20 + 0.085 × Age

Interpretation:
- For each additional year of age, the log odds of clicking increase by 0.085.
- Odds Ratio = e^0.085 ≈ 1.089: odds of clicking increase by about 8.9% per year.

Prediction for Age = 30:

z = −4.20 + 0.085 × 30 = −1.65
p̂ = 1 / (1 + e^1.65) ≈ 0.161

A 30-year-old has approximately a 16.1% predicted probability of clicking the ad. The model classifies this as "not clicked" (below the 0.5 threshold).
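The prediction above can be verified with a few lines (variable names are ours):

```python
import math

# Coefficients from Example 1
b0, b1 = -4.20, 0.085
age = 30
z = b0 + b1 * age                  # linear predictor (log odds)
p = 1.0 / (1.0 + math.exp(-z))     # predicted probability
print(round(p, 3))                 # 0.161 -> classified as "not clicked"
```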
Example 2: Multiple Predictors — Age and Location
Suppose you are predicting the probability of a customer clicking on an ad (1 = clicked, 0 = not clicked) based on age (numeric) and location (categorical: Urban, Suburban, Rural — Rural is the base category).
After fitting, the results are:
| Parameter | Estimate | Std. Error | z-value | p-value | Odds Ratio |
|---|---|---|---|---|---|
| Intercept | -3.5000 | 0.4500 | -7.778 | < 0.001 | 0.0302 |
| Age | 0.0500 | 0.0120 | 4.167 | < 0.001 | 1.0513 |
| Location (Urban) | 1.2000 | 0.4900 | 2.449 | 0.015 | 3.3201 |
| Location (Suburban) | 0.8000 | 0.4600 | 1.739 | 0.082 | 2.2255 |
Model Equation:

ln(p̂ / (1 − p̂)) = −3.5000 + 0.0500 × Age + 1.2000 × D_Urban + 0.8000 × D_Suburban
Coefficient Interpretation:
- Intercept (β₀ = −3.5000): When Age = 0 and Location = Rural, the log odds of clicking are −3.5, corresponding to p ≈ 0.03 (3%). This is the baseline for a hypothetical Rural individual with Age = 0 — note that Age = 0 is outside the realistic range of the data, so the intercept alone is not typically meaningful.
- Age (β = 0.0500, OR = 1.0513): For each one-year increase in age, the log odds of clicking increase by 0.0500. The odds ratio of 1.0513 means the odds of clicking increase by approximately 5.13% per additional year of age, holding location constant. This effect is statistically significant (p < 0.001).
- Location — Urban (β = 1.2000, OR = 3.3201): Compared to Rural (base), being Urban increases the log odds of clicking by 1.2000. The odds ratio of 3.3201 means the odds of clicking are approximately 3.32 times higher in Urban areas compared to Rural areas, holding age constant. This effect is statistically significant (p = 0.015).
- Location — Suburban (β = 0.8000, OR = 2.2255): Compared to Rural (base), being Suburban increases the log odds of clicking by 0.8000. The odds ratio of 2.2255 means the odds of clicking are approximately 2.23 times higher in Suburban areas compared to Rural areas, holding age constant. However, this effect is not statistically significant at the 5% level (p = 0.082).
Prediction Example — 40-year-old in Suburban Location:

z = −3.5000 + 0.0500 × 40 + 1.2000 × 0 + 0.8000 × 1 = −0.7000
p̂ = 1 / (1 + e^0.7) ≈ 0.3318

The predicted probability is approximately 0.3318 (33.18%). Since 0.3318 < 0.5, the model classifies this individual as "not clicked".
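The multi-predictor prediction follows the same recipe, with the dummy variables set to match the chosen location (variable names are ours):

```python
import math

# Coefficients from Example 2's table
b0, b_age, b_urban, b_suburban = -3.5, 0.05, 1.2, 0.8

# 40-year-old in a Suburban location: D_Urban = 0, D_Suburban = 1
z = b0 + b_age * 40 + b_urban * 0 + b_suburban * 1
p = 1.0 / (1.0 + math.exp(-z))
print(round(z, 2), round(p, 4))  # -0.7 0.3318 -> "not clicked"
```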
95% Confidence Interval for Odds Ratios:
| Parameter | OR | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| Age | 1.0513 | e^(0.0500 − 1.96 × 0.0120) = 1.028 | e^(0.0500 + 1.96 × 0.0120) = 1.075 |
| Location (Urban) | 3.3201 | e^(1.2000 − 1.96 × 0.4900) = 1.270 | e^(1.2000 + 1.96 × 0.4900) = 8.685 |
13. Common Mistakes and How to Avoid Them
Mistake 1: Using Logistic Regression With a Non-Binary Outcome
Problem: Applying binary logistic regression to a continuous or multi-category outcome.
Solution: For continuous outcomes, use linear regression. For unordered categories, use multinomial logistic regression. For ordered categories, use ordinal logistic regression.
Mistake 2: Ignoring Class Imbalance
Problem: When one class (e.g., ) is very rare (e.g., 2% of data), the model may predict 0 for all observations and still achieve high accuracy.
Solution: Use precision, recall, F1, or AUC as primary metrics instead of accuracy. Consider oversampling the minority class (SMOTE), undersampling the majority class, or using class weights.
Mistake 3: Including Too Many Predictors (Overfitting)
Problem: With too many predictors relative to the number of events, the model overfits the training data and performs poorly on new data.
Solution: Follow the EPV rule (10–20 events per predictor). Use regularisation (L1/Lasso, L2/Ridge) or cross-validation for model selection.
Mistake 4: Multicollinearity
Problem: Highly correlated predictors inflate standard errors, making individual coefficient estimates unreliable (even if the overall model is fine).
Solution: Check pairwise correlations and VIF. Remove redundant variables or use dimensionality reduction (PCA) as a preprocessing step.
Mistake 5: Incorrect Reference Category
Problem: Choosing a reference category for a dummy variable arbitrarily, leading to confusing interpretations.
Solution: Choose a reference category that is scientifically meaningful (e.g., control group, most common category). Document the choice clearly.
Mistake 6: Interpreting Coefficients as Probability Changes
Problem: Saying "an increase in age by 1 year increases the probability of clicking by 0.05."
Solution: Coefficients in logistic regression are changes in log odds, not probabilities. The probability change depends on the current value of all predictors (it is non-linear). Always use odds ratios or calculate predicted probabilities at specific values.
Mistake 7: Ignoring Complete Separation
Problem: If a predictor perfectly predicts the outcome, MLE fails to converge and produces extremely large coefficients with enormous standard errors.
Solution: Look for warning messages about convergence. If separation exists, consider removing the problematic predictor, collapsing categories, or using Firth's penalised logistic regression.
Mistake 8: Not Checking the Linearity Assumption
Problem: Treating a non-linear relationship between a continuous predictor and the log odds as linear, leading to a mis-specified model.
Solution: Plot smoothed log odds against each continuous predictor. Apply transformations (e.g., log, square root) or use polynomial terms or splines if needed.
14. Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| Model fails to converge | Complete/quasi-complete separation; too many predictors; too few observations | Check for perfect predictors; reduce predictors; collect more data; use Firth regression |
| Very large coefficients (e.g., |β| > 10) | Complete separation | Examine which predictor perfectly splits the outcomes |
| Very large standard errors | Multicollinearity or separation | Check VIF; examine correlation matrix |
| AUC = 0.5 | Model has no predictive power | Review variable selection; check data quality; consider non-linear models |
| All predictions = 0 or all = 1 | Severe class imbalance or separation | Check class distribution; adjust threshold; address imbalance |
| p-values all non-significant | Insufficient sample size; weak predictors | Increase sample size; reconsider predictor selection |
| Pseudo R² is very high (> 0.9) | Possible overfitting or separation | Cross-validate; check for separation; reduce predictors |
| Confidence interval includes 1 (for OR) | Non-significant predictor | Variable may not contribute meaningfully; consider removing |
15. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
| ln(p / (1 − p)) = β₀ + β₁x₁ + … + βₖxₖ | Log odds equation |
| p̂ = 1 / (1 + e^(−z)) | Predicted probability |
| ORⱼ = e^βⱼ | Odds ratio for predictor j |
| z = β̂ⱼ / SE(β̂ⱼ) | Wald z-statistic |
| AIC = −2ℓ + 2k | Akaike Information Criterion |
| R²_McF = 1 − ℓ_model / ℓ_null | McFadden's Pseudo R² |
| (TP + TN) / Total | Overall accuracy |
| TP / (TP + FP) | Positive predictive value |
| TP / (TP + FN) | Sensitivity |
| TN / (TN + FP) | True negative rate |
| 2PR / (P + R) | F1 Score |
Odds Ratio Interpretation
| OR Value | Meaning |
|---|---|
| OR > 1 | Predictor increases odds of outcome |
| OR = 1 | Predictor has no effect on odds |
| OR < 1 | Predictor decreases odds of outcome |
Model Comparison Guide
| Scenario | Recommended Metric |
|---|---|
| Comparing nested models | Likelihood Ratio Test (χ² on the deviance difference G) |
| Comparing non-nested models | AIC or BIC |
| Evaluating discrimination ability | AUC |
| Evaluating calibration | Hosmer-Lemeshow test |
| Imbalanced classes | F1, MCC, AUC (not Accuracy) |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Logistic Regression using the DataStatPro application. For further reading, consult Hosmer & Lemeshow's "Applied Logistic Regression" or Agresti's "Categorical Data Analysis". For feature requests or support, contact the DataStatPro team.