How to Test Statistical Assumptions and Apply Remedies Using DataStatPro
Learning Objectives
By the end of this tutorial, you will be able to:
- Understand the importance of statistical assumptions in data analysis
- Test key assumptions for common statistical procedures
- Identify violations of statistical assumptions
- Apply appropriate remedies when assumptions are violated
- Use DataStatPro's diagnostic tools effectively
- Make informed decisions about alternative analytical approaches
Why Statistical Assumptions Matter
The Foundation of Valid Inference
Statistical tests are built on specific assumptions about the data. When these assumptions are violated, the results may be:
- Inaccurate: Wrong p-values and confidence intervals
- Unreliable: Results that don't replicate
- Misleading: False conclusions about relationships
- Invalid: Inappropriate statistical inferences
Consequences of Assumption Violations
Type I Error Inflation
Example: t-test with non-normal data
- Nominal α = 0.05
- Actual α may be 0.08-0.12
- Increased false positive rate
Type II Error Inflation
Example: ANOVA with unequal variances
- Reduced power to detect true effects
- Increased false negative rate
- Missed important findings
Biased Estimates
Example: Regression with heteroscedasticity
- Unbiased coefficients but wrong standard errors
- Invalid confidence intervals
- Incorrect significance tests
Core Statistical Assumptions
1. Normality
What It Means
The data (or residuals) follow a normal distribution:
- Bell-shaped, symmetric distribution
- Mean = median = mode
- Approximately 68% of values within 1 SD, 95% within 2 SD
- Extends infinitely in both directions
When It's Required
Critical for:
- t-tests (especially with small samples)
- ANOVA (residuals should be normal)
- Linear regression (residuals)
- Parametric correlation tests
Less critical for:
- Large sample tests (Central Limit Theorem)
- Situations where non-parametric alternatives are available
Testing Normality
Visual Methods:
1. Histograms
- Look for bell shape
- Check for skewness
- Identify outliers
2. Q-Q Plots (Quantile-Quantile)
- Points should fall on diagonal line
- Deviations indicate non-normality
- Most sensitive visual method
3. Box Plots
- Check for symmetry
- Identify outliers
- Assess skewness
Statistical Tests:
1. Shapiro-Wilk Test (recommended for n < 50)
- H₀: Data are normally distributed
- Most powerful for small samples
- Sensitive to outliers
2. Kolmogorov-Smirnov Test
- H₀: Data follows specified distribution
- Less powerful than Shapiro-Wilk
- Can test against any distribution
3. Anderson-Darling Test
- More sensitive to tail deviations
- Good for moderate sample sizes
- Weighted toward tails
4. Jarque-Bera Test
- Based on skewness and kurtosis
- Good for large samples
- Less sensitive to outliers
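For readers who want to see these four tests run directly, here is a minimal sketch using SciPy's implementations on simulated data (the sample and all variable names are illustrative, not DataStatPro internals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=40)  # simulated test scores

# Shapiro-Wilk: H0 = data are normal; powerful for small samples
w, p_sw = stats.shapiro(sample)

# Kolmogorov-Smirnov against a normal with estimated parameters
# (strictly, estimating the parameters first makes the test approximate)
d, p_ks = stats.kstest(sample, "norm",
                       args=(sample.mean(), sample.std(ddof=1)))

# Anderson-Darling: returns a statistic and critical values, not a p-value
ad = stats.anderson(sample, dist="norm")

# Jarque-Bera: based on skewness and kurtosis, best with large n
jb, p_jb = stats.jarque_bera(sample)

print(f"Shapiro-Wilk     W = {w:.3f}, p = {p_sw:.3f}")
print(f"Kolmogorov-Smirnov D = {d:.3f}, p = {p_ks:.3f}")
print(f"Anderson-Darling A2 = {ad.statistic:.3f}")
print(f"Jarque-Bera      JB = {jb:.3f}, p = {p_jb:.3f}")
```

Running all four on the same sample is a useful habit: when they disagree, the visual methods above usually explain why.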
2. Homogeneity of Variance (Homoscedasticity)
What It Means
Equal variances across groups or conditions:
- Constant spread of data
- No systematic changes in variability
- Uniform scatter around regression line
When It's Required
Critical for:
- Independent samples t-test
- ANOVA (between groups)
- Linear regression
- Pooled variance procedures
Testing Homoscedasticity
Visual Methods:
1. Residual Plots
- Plot residuals vs fitted values
- Look for constant spread
- Check for funnel patterns
2. Box Plots by Group
- Compare box sizes
- Similar IQRs indicate equal variances
- Check for outliers
Statistical Tests:
1. Levene's Test
- H₀: Variances are equal
- Robust to non-normality
- Most commonly used
2. Bartlett's Test
- H₀: Variances are equal
- Sensitive to non-normality
- More powerful if normality holds
3. Brown-Forsythe Test
- Modification of Levene's test
- Uses medians instead of means
- More robust to non-normality
4. Fligner-Killeen Test
- Non-parametric alternative
- Based on ranks
- Very robust to non-normality
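All four variance tests are available in SciPy, which makes their differing sensitivities easy to compare. The sketch below uses simulated groups in which the third group deliberately has a larger spread; group names and parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
g1 = rng.normal(0, 1.0, 30)
g2 = rng.normal(0, 1.0, 30)
g3 = rng.normal(0, 2.5, 30)   # deliberately larger spread

# Levene's test (mean-centered) vs Brown-Forsythe (median-centered)
lev = stats.levene(g1, g2, g3, center="mean")
bf = stats.levene(g1, g2, g3, center="median")

# Bartlett's test: more powerful under normality, fragile otherwise
bart = stats.bartlett(g1, g2, g3)

# Fligner-Killeen: rank-based, most robust to non-normality
flig = stats.fligner(g1, g2, g3)

for name, res in [("Levene", lev), ("Brown-Forsythe", bf),
                  ("Bartlett", bart), ("Fligner-Killeen", flig)]:
    print(f"{name:16s} stat = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```

With a genuine variance difference this large, all four tests should flag the violation; their p-values diverge more when the data are non-normal.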
3. Independence
What It Means
Observations are independent of each other:
- No systematic relationships between cases
- One observation doesn't influence another
- Random sampling from population
Common Violations
1. Clustered Data
- Students within schools
- Patients within hospitals
- Repeated measures on same subjects
2. Time Series Data
- Temporal autocorrelation
- Seasonal patterns
- Trend effects
3. Spatial Data
- Geographic clustering
- Neighborhood effects
- Distance-based correlations
Testing Independence
1. Durbin-Watson Test (time series)
- Tests for autocorrelation in residuals
- Values near 2 indicate independence
- Values < 1.5 or > 2.5 suggest problems
2. Runs Test
- Tests randomness of sequence
- Counts runs of consecutive values
- Non-parametric approach
3. Ljung-Box Test
- Tests for autocorrelation in time series
- More general than Durbin-Watson
- Can test multiple lags
4. Linearity
What It Means
Relationship between variables is linear:
- Straight-line relationship
- Constant rate of change
- No curved or U-shaped patterns
When It's Required
Critical for:
- Linear regression
- Pearson correlation
- ANCOVA
- Linear mixed models
Testing Linearity
1. Scatterplots
- Plot Y vs X
- Look for straight-line pattern
- Check for curves or patterns
2. Residual Plots
- Plot residuals vs fitted values
- Should show random scatter
- Patterns indicate non-linearity
3. Component + Residual Plots
- Partial regression plots
- Show relationship after controlling for other variables
- Useful in multiple regression
Using DataStatPro for Assumption Testing
Accessing Diagnostic Tools
Navigate to Diagnostics:
- Go to Analysis → Diagnostics
- Select assumption type
- Choose appropriate test
Available Tests
Normality:
- Shapiro-Wilk test
- Kolmogorov-Smirnov test
- Anderson-Darling test
- Q-Q plots and histograms
Homoscedasticity:
- Levene's test
- Bartlett's test
- Brown-Forsythe test
- Residual plots
Independence:
- Durbin-Watson test
- Runs test
- Autocorrelation plots
Linearity:
- Scatterplots
- Residual analysis
- Component plots
Step-by-Step: Testing t-test Assumptions
1. Load and Examine Data
Scenario: Comparing test scores between two groups
Group 1: n = 25, Group 2: n = 30
Outcome: Test score (0-100 scale)
2. Test Normality
DataStatPro Steps:
1. Select "Normality Tests"
2. Choose both groups
3. Run Shapiro-Wilk test
4. Generate Q-Q plots
Results:
Group 1: W = 0.94, p = 0.18 (no evidence against normality)
Group 2: W = 0.89, p = 0.004 (non-normal)
3. Test Homoscedasticity
DataStatPro Steps:
1. Select "Variance Tests"
2. Choose Levene's test
3. Compare group variances
Results:
Levene's test: F = 2.34, p = 0.13 (equal variances)
4. Assess Independence
Design Review:
- Random sampling? Yes
- Independent groups? Yes
- No repeated measures? Correct
- No clustering? Correct
Conclusion: Independence assumption met
5. Decision Making
Assumption Status:
✓ Independence: Met
✓ Homoscedasticity: Met
✗ Normality: Violated (Group 2)
Options:
1. Use Welch's t-test (robust to non-normality)
2. Apply data transformation
3. Use Mann-Whitney U test (non-parametric)
4. Bootstrap confidence intervals
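Three of these fallback options can be sketched in a few lines of SciPy. The groups below are simulated to mimic the scenario (n₁ = 25, n₂ = 30, with Group 2 right-skewed); the distributions and seed are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(70, 10, 25)
group2 = rng.exponential(10, 30) + 55   # right-skewed, non-normal

# Option 1: Welch's t-test (equal_var=False drops the equal-variance assumption)
t_w, p_w = stats.ttest_ind(group1, group2, equal_var=False)

# Option 3: Mann-Whitney U, rank-based with no normality assumption
u, p_u = stats.mannwhitneyu(group1, group2, alternative="two-sided")

# Option 4: bootstrap CI for the difference in means (scipy.stats.bootstrap)
res = stats.bootstrap(
    (group1, group2),
    lambda a, b, axis: a.mean(axis=axis) - b.mean(axis=axis),
    n_resamples=2000, random_state=0)

print(f"Welch t = {t_w:.2f} (p = {p_w:.3f}); U = {u:.0f} (p = {p_u:.3f})")
print("95% bootstrap CI for the mean difference:", res.confidence_interval)
```

Option 2 (transformation) is covered in the Remedies section below.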
ANOVA Assumption Testing Example
Study Design
One-way ANOVA: 4 treatment groups
Outcome: Continuous measure
Sample sizes: n₁ = 20, n₂ = 22, n₃ = 18, n₄ = 25
Comprehensive Testing
1. Normality of Residuals
DataStatPro Process:
1. Run initial ANOVA
2. Extract residuals
3. Test residual normality
4. Create Q-Q plot
Results:
Shapiro-Wilk: W = 0.97, p = 0.08
Conclusion: Normality assumption met
2. Homogeneity of Variances
Levene's Test Results:
F(3, 81) = 1.89, p = 0.14
Conclusion: Equal variances assumption met
3. Independence
Design Check:
- Random assignment: Yes
- No repeated measures: Correct
- No clustering: Correct
Conclusion: Independence assumption met
4. Final Decision
All assumptions met → Proceed with standard ANOVA
F(3, 81) = 5.67, p = 0.001
Conclusion: Significant group differences
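The same three checks and the final F-test can be reproduced outside DataStatPro. The sketch below uses SciPy on simulated groups of the stated sizes; the group means and seed are invented for illustration, so the printed p-values will not match the worked example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(m, 5, n) for m, n in
          zip([50, 50, 54, 56], [20, 22, 18, 25])]  # group means differ

# 1. normality of residuals (pooled within-group deviations)
resid = np.concatenate([g - g.mean() for g in groups])
w, p_norm = stats.shapiro(resid)

# 2. homogeneity of variances
lev_stat, p_lev = stats.levene(*groups)

# 4. omnibus F-test; df = (k - 1, N - k) = (3, 81) for these sizes
f, p_f = stats.f_oneway(*groups)

print(f"Shapiro-Wilk on residuals: p = {p_norm:.3f}")
print(f"Levene's test:             p = {p_lev:.3f}")
print(f"ANOVA: F(3, 81) = {f:.2f}, p = {p_f:.4f}")
```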
Remedies for Assumption Violations
Normality Violations
Data Transformations
1. Log Transformation
When to use:
- Right-skewed data
- Multiplicative relationships
- Proportional data
Formula: Y' = log(Y), or Y' = log(Y + 1) when zeros are present
Example:
Original: 1, 2, 5, 10, 20, 50, 100
Log-transformed: 0, 0.69, 1.61, 2.30, 3.00, 3.91, 4.61
2. Square Root Transformation
When to use:
- Poisson-distributed data
- Count data
- Moderate right skew
Formula: Y' = √Y or Y' = √(Y + 0.5)
Example:
Original: 0, 1, 4, 9, 16, 25
Square root: 0, 1, 2, 3, 4, 5
3. Reciprocal Transformation
When to use:
- Severe right skew
- Rate or time data
- Hyperbolic relationships
Formula: Y' = 1/Y or Y' = 1/(Y + 1)
Caution: Changes interpretation of results
4. Box-Cox Transformation
General power transformation:
Y' = (Y^λ - 1)/λ for λ ≠ 0
Y' = log(Y) for λ = 0
Advantage: Optimal λ chosen by maximum likelihood
DataStatPro: Automatically finds best λ
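For reference, SciPy exposes the same maximum-likelihood fit. The sketch below applies it to simulated right-skewed (lognormal) data, for which the fitted λ should land near 0, i.e. close to a log transformation; the data are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=2.0, sigma=0.8, size=200)  # right-skewed, positive

# boxcox requires strictly positive data and returns the transformed
# values along with the maximum-likelihood lambda
y_bc, lam = stats.boxcox(y)

print(f"fitted lambda = {lam:.3f}")
print(f"skewness before: {stats.skew(y):.2f}, after: {stats.skew(y_bc):.2f}")
```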
Alternative Approaches
1. Non-parametric Tests
Instead of t-test → Mann-Whitney U test
Instead of ANOVA → Kruskal-Wallis test
Instead of correlation → Spearman's rank correlation
Advantages:
- No normality assumption
- Robust to outliers
- Valid for ordinal data
Disadvantages:
- Less powerful when normality holds
- Limited post-hoc options
- Different interpretation
2. Bootstrap Methods
Resampling approach:
- Generate many bootstrap samples
- Calculate statistic for each sample
- Create empirical distribution
- Construct confidence intervals
Advantages:
- No distributional assumptions
- Works with complex statistics
- Provides uncertainty estimates
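The four steps above can be written from scratch in a dozen lines; this percentile-bootstrap sketch for the mean uses simulated skewed data, with variable names chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=4.0, size=60)  # skewed sample

n_boot = 5000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # step 1-2: resample with replacement, compute the statistic
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# steps 3-4: empirical distribution -> percentile confidence interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.2f}, "
      f"95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```

The percentile interval shown here is the simplest variant; bias-corrected (BCa) intervals are generally preferred for skewed statistics.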
3. Robust Statistics
Methods less sensitive to outliers:
- Trimmed means
- Median-based tests
- M-estimators
- Winsorized statistics
Example: 20% trimmed mean
- Remove top and bottom 10% of data
- Calculate mean of remaining 80%
- More robust to extreme values
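SciPy implements this directly. Note that `proportiontocut` is the fraction cut from each tail, so 0.10 produces the 20% trimmed mean described above; the tiny dataset is invented to make the effect obvious:

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 4, 5, 5, 6, 6, 98])  # one extreme value

print(f"ordinary mean:    {data.mean():.1f}")              # 13.6, pulled up by 98
print(f"20% trimmed mean: {stats.trim_mean(data, 0.10):.1f}")  # 4.5
```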
Homoscedasticity Violations
Welch's Corrections
For t-tests:
- Use Welch's t-test instead of Student's t-test
- Adjusts degrees of freedom
- No assumption of equal variances
For ANOVA:
- Use Welch's ANOVA
- Adjusts for unequal variances
- More conservative approach
Weighted Least Squares
For regression:
- Weight observations by inverse variance
- Gives less weight to high-variance observations
- Requires variance estimates
Formula: Weight = 1/σᵢ²
where σᵢ² is the variance for observation i
Robust Standard Errors
Sandwich estimators:
- Huber-White robust standard errors
- Consistent even with heteroscedasticity
- Widely available in software
Advantage: Keeps original coefficients
Disadvantage: May be less efficient
Transformations for Variance Stabilization
1. Log transformation
- When variance proportional to mean²
- Common in biological data
2. Square root transformation
- When variance proportional to mean
- Poisson-type data
3. Arcsine transformation
- For proportion data
- Formula: Y' = arcsin(√Y)
- Note: Y must be a proportion between 0 and 1
Independence Violations
Mixed-Effects Models
For clustered data:
- Random intercepts for clusters
- Accounts for within-cluster correlation
- Appropriate standard errors
Example: Students within schools
- Fixed effects: Treatment, student characteristics
- Random effects: School-level variation
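The students-within-schools example can be sketched with statsmodels' MixedLM on simulated data; the school-level effect, the treatment coefficient of 3, and all names are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_schools, n_per = 20, 15
school = np.repeat(np.arange(n_schools), n_per)
u = rng.normal(0, 2.0, n_schools)[school]      # school-level random effect
treat = rng.integers(0, 2, n_schools)[school]  # treatment assigned by school
score = 50 + 3.0 * treat + u + rng.normal(0, 4.0, n_schools * n_per)

df = pd.DataFrame({"score": score, "treat": treat, "school": school})

# random intercept per school; fixed effect for treatment
model = smf.mixedlm("score ~ treat", df, groups=df["school"]).fit()
print(model.params)  # Intercept, treat, and the school-level variance
```

Fitting the same data with plain OLS would understate the treatment standard error, because the 15 students per school are not independent.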
Time Series Methods
For temporal dependence:
- ARIMA models
- Autoregressive terms
- Moving average components
- Seasonal adjustments
Generalized Estimating Equations (GEE)
For correlated data:
- Specifies correlation structure
- Robust standard errors
- Population-averaged effects
Correlation structures:
- Exchangeable (equal correlations)
- AR(1) (autoregressive)
- Unstructured (all different)
Linearity Violations
Polynomial Regression
Add polynomial terms:
Y = β₀ + β₁X + β₂X² + β₃X³ + ε
Cautions:
- Higher-order terms can overfit
- Interpretation becomes complex
- Extrapolation problems
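A quick numpy sketch shows why the quadratic term matters: on data with a genuinely curved relationship (simulated here), the straight-line fit leaves large, patterned residuals that the quadratic fit removes:

```python
import numpy as np

rng = np.random.default_rng(12)
x = np.linspace(-3, 3, 120)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1.0, x.size)

lin = np.polyfit(x, y, deg=1)    # straight line: misses the curvature
quad = np.polyfit(x, y, deg=2)   # includes the X^2 term

resid_lin = y - np.polyval(lin, x)
resid_quad = y - np.polyval(quad, x)
print(f"residual SD, linear fit:    {resid_lin.std():.2f}")
print(f"residual SD, quadratic fit: {resid_quad.std():.2f}")
```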
Spline Regression
Piecewise polynomials:
- Flexible curves
- Smooth connections at knots
- Better for complex relationships
Types:
- Linear splines
- Cubic splines
- B-splines
- Smoothing splines
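As one concrete example of the last type, SciPy's `UnivariateSpline` fits a cubic smoothing spline whose flexibility is controlled by the smoothing factor `s`; the sine-plus-noise data and the choice of `s` are illustrative:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(13)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.2, x.size)

# s bounds the residual sum of squares: smaller s = wigglier curve
spline = UnivariateSpline(x, y, k=3, s=len(x) * 0.04)
fitted = spline(x)
print(f"residual SD around the spline: {(y - fitted).std():.3f}")
```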
Non-linear Regression
Specific functional forms:
- Exponential: Y = ae^(bX)
- Power: Y = aX^b
- Logistic: Y = a/(1 + be^(-cX))
Requires:
- Subject matter knowledge
- Good starting values
- Iterative fitting procedures
Generalized Additive Models (GAM)
Flexible approach:
- Smooth functions of predictors
- Automatic smoothness selection
- Can handle multiple non-linear terms
Formula: Y = β₀ + f₁(X₁) + f₂(X₂) + ... + ε
where f₁, f₂, ... are smooth functions
Decision Trees for Assumption Violations
Normality Violation Decision Tree
Normality violated?
├─ Yes
│ ├─ Sample size large (n > 30)?
│ │ ├─ Yes → Proceed (CLT applies)
│ │ └─ No → Consider alternatives
│ │ ├─ Transformation possible?
│ │ │ ├─ Yes → Transform and retest
│ │ │ └─ No → Use non-parametric test
│ │ └─ Bootstrap available?
│ │ ├─ Yes → Use bootstrap CI
│ │ └─ No → Use robust methods
└─ No → Proceed with parametric test
Homoscedasticity Violation Decision Tree
Equal variances violated?
├─ Yes
│ ├─ Ratio of largest to smallest variance < 4?
│ │ ├─ Yes → Proceed with caution
│ │ └─ No → Use correction
│ │ ├─ t-test → Use Welch's t-test
│ │ ├─ ANOVA → Use Welch's ANOVA
│ │ └─ Regression → Use robust SE
└─ No → Proceed with standard test
Practical Guidelines
Sample Size Considerations
Small Samples (n < 30)
Assumptions are critical:
- Test all assumptions carefully
- Consider non-parametric alternatives
- Use exact tests when available
- Be conservative in interpretation
Large Samples (n > 100)
Central Limit Theorem helps:
- Normality less critical for means
- Focus on homoscedasticity and independence
- Statistical tests may be overpowered
- Emphasize practical significance
Multiple Assumption Violations
Priority Order
1. Independence (most critical)
2. Homoscedasticity (affects inference)
3. Normality (least critical with large n)
4. Linearity (affects model validity)
Combined Approaches
Example: Non-normal + unequal variances
- Option 1: Transform data (may fix both)
- Option 2: Use non-parametric test
- Option 3: Bootstrap with robust statistics
- Option 4: Generalized linear model
Reporting Assumption Tests
Methods Section
"Statistical assumptions were assessed prior to analysis.
Normality was evaluated using Shapiro-Wilk tests and Q-Q plots.
Homogeneity of variance was tested using Levene's test.
Independence was ensured through study design. When assumptions
were violated, [specific remedy] was applied."
Results Section
"Assumption testing revealed [specific findings].
[Describe any violations and remedies applied].
Subsequent analyses used [modified approach] to account
for assumption violations."
Advanced Diagnostic Techniques
Regression Diagnostics
Residual Analysis
1. Standardized residuals
- Should be approximately N(0,1)
- Values > |3| are potential outliers
2. Studentized residuals
- Account for leverage
- More sensitive to outliers
3. Deleted residuals
- Residuals with observation removed
- Detect influential points
Influence Measures
1. Cook's Distance
- Measures overall influence
- Values > 4/n suggest high influence
2. Leverage (hat values)
- Measures X-space influence
- Values > 2p/n are high leverage (p = number of model parameters)
3. DFBETAS
- Change in coefficients
- Standardized measure of influence
Multivariate Normality
Mardia's Test
Tests multivariate skewness and kurtosis:
- More comprehensive than univariate tests
- Required for multivariate procedures
- Available in specialized software
Henze-Zirkler Test
Based on empirical characteristic function:
- Good power properties
- Works with moderate sample sizes
- Less sensitive to outliers
Troubleshooting Common Issues
Problem: All Normality Tests Significant
Cause: Large sample size makes tests overpowered
Solution: Focus on effect size and visual inspection
Action: Use Q-Q plots and consider practical impact
Problem: Transformation Doesn't Help
Cause: Severe non-normality or multiple issues
Solution: Consider non-parametric or robust methods
Action: Use rank-based tests or bootstrap procedures
Problem: Conflicting Test Results
Cause: Different tests have different sensitivities
Solution: Consider multiple sources of evidence
Action: Use visual methods alongside statistical tests
Problem: Assumptions Met But Results Don't Make Sense
Cause: May have other issues (outliers, model misspecification)
Solution: Conduct comprehensive diagnostic analysis
Action: Check for outliers, influential points, and model fit
Frequently Asked Questions
Q: How strict should I be about assumption violations?
A: It depends on the severity of violation, sample size, and consequences of errors. Minor violations with large samples are often acceptable, while severe violations require remedial action.
Q: Should I test assumptions before every analysis?
A: Yes, especially for parametric tests. Make it part of your standard analytical workflow.
Q: Can I use multiple remedies simultaneously?
A: Yes, but be careful about over-correcting. Document all modifications and their rationale.
Q: What if transformation changes my research question?
A: Consider whether the transformed scale is meaningful. Sometimes it's better to use non-parametric methods that preserve the original scale.
Q: How do I report assumption violations in publications?
A: Be transparent about violations found and remedies applied. Include diagnostic plots in supplementary materials if space allows.
Related Tutorials
- How to Create Publication-Ready Statistical Reports
- How to Handle Multiple Comparisons
- How to Perform Independent Samples t-Test
- How to Perform One-Way ANOVA
Next Steps
After mastering assumption testing, consider exploring:
- Advanced regression diagnostics
- Robust statistical methods
- Generalized linear models
- Non-parametric alternatives
This tutorial is part of DataStatPro's comprehensive statistical analysis guide. For more advanced techniques and personalized support, explore our Pro features.