
How to Test Statistical Assumptions and Apply Remedies

Learn to test statistical assumptions and apply appropriate remedies.


Learning Objectives

By the end of this tutorial, you will be able to:

- Explain why assumption violations distort Type I and Type II error rates
- Test normality, homogeneity of variance, independence, and linearity with both visual and statistical methods
- Choose an appropriate remedy, from transformations to non-parametric, robust, and bootstrap approaches
- Report assumption tests and remedies transparently

Why Statistical Assumptions Matter

The Foundation of Valid Inference

Statistical tests are built on specific assumptions about the data. When these assumptions are violated, the results may be:

- Invalid: p-values that do not mean what they claim
- Misleading: inflated false positive or false negative rates
- Biased: estimates or standard errors that are systematically wrong

Consequences of Assumption Violations

Type I Error Inflation

Example: t-test with non-normal data
- Nominal α = 0.05
- Actual α may be 0.08-0.12
- Increased false positive rate

Type II Error Inflation

Example: ANOVA with unequal variances
- Reduced power to detect true effects
- Increased false negative rate
- Missed important findings

Biased Estimates

Example: Regression with heteroscedasticity
- Unbiased coefficients but wrong standard errors
- Invalid confidence intervals
- Incorrect significance tests

Core Statistical Assumptions

1. Normality

What It Means

The data (or residuals) follow a normal distribution:
- Bell-shaped, symmetric distribution
- Mean = median = mode
- About 68% within 1 SD, about 95% within 2 SD
- Extends infinitely in both directions

When It's Required

Critical for:
- t-tests (especially with small samples)
- ANOVA (residuals should be normal)
- Linear regression (residuals)
- Parametric correlation tests

Less critical for:
- Large sample tests (Central Limit Theorem)
- Non-parametric alternatives available

Testing Normality

Visual Methods:

1. Histograms
   - Look for bell shape
   - Check for skewness
   - Identify outliers

2. Q-Q Plots (Quantile-Quantile)
   - Points should fall on diagonal line
   - Deviations indicate non-normality
   - Most sensitive visual method

3. Box Plots
   - Check for symmetry
   - Identify outliers
   - Assess skewness

Statistical Tests:

1. Shapiro-Wilk Test (n < 50)
   - H₀: Data is normally distributed
   - Most powerful for small samples
   - Sensitive to outliers

2. Kolmogorov-Smirnov Test
   - H₀: Data follows specified distribution
   - Less powerful than Shapiro-Wilk
   - Can test against any distribution

3. Anderson-Darling Test
   - More sensitive to tail deviations
   - Good for moderate sample sizes
   - Weighted toward tails

4. Jarque-Bera Test
   - Based on skewness and kurtosis
   - Good for large samples
   - Less sensitive to outliers
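The four tests above are all available in SciPy. A minimal sketch, using a simulated right-skewed sample for illustration (substitute your own data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=40)  # right-skewed, non-normal

# Shapiro-Wilk: H0 = data are normally distributed
sw_stat, sw_p = stats.shapiro(sample)

# Kolmogorov-Smirnov against a normal fitted to the sample
ks_stat, ks_p = stats.kstest(sample, "norm",
                             args=(sample.mean(), sample.std(ddof=1)))

# Anderson-Darling: returns a statistic plus critical values, not a p-value
ad = stats.anderson(sample, dist="norm")

# Jarque-Bera: based on skewness and kurtosis
jb_stat, jb_p = stats.jarque_bera(sample)
```

Small p-values (or an Anderson-Darling statistic above its critical value) indicate departure from normality; always pair these with a Q-Q plot before deciding.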

2. Homogeneity of Variance (Homoscedasticity)

What It Means

Equal variances across groups or conditions:
- Constant spread of data
- No systematic changes in variability
- Uniform scatter around regression line

When It's Required

Critical for:
- Independent samples t-test
- ANOVA (between groups)
- Linear regression
- Pooled variance procedures

Testing Homoscedasticity

Visual Methods:

1. Residual Plots
   - Plot residuals vs fitted values
   - Look for constant spread
   - Check for funnel patterns

2. Box Plots by Group
   - Compare box sizes
   - Similar IQRs indicate equal variances
   - Check for outliers

Statistical Tests:

1. Levene's Test
   - H₀: Variances are equal
   - Robust to non-normality
   - Most commonly used

2. Bartlett's Test
   - H₀: Variances are equal
   - Sensitive to non-normality
   - More powerful if normality holds

3. Brown-Forsythe Test
   - Modification of Levene's test
   - Uses medians instead of means
   - More robust to non-normality

4. Fligner-Killeen Test
   - Non-parametric alternative
   - Based on ranks
   - Very robust to non-normality
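All four variance tests can be run with SciPy; note that SciPy implements Brown-Forsythe as Levene's test with `center="median"`. A sketch on simulated groups with deliberately unequal spreads:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(50, 5, size=30)
g2 = rng.normal(50, 5, size=30)
g3 = rng.normal(50, 12, size=30)   # deliberately larger spread

lev_stat, lev_p = stats.levene(g1, g2, g3, center="mean")
bf_stat, bf_p = stats.levene(g1, g2, g3, center="median")  # Brown-Forsythe
bart_stat, bart_p = stats.bartlett(g1, g2, g3)
fk_stat, fk_p = stats.fligner(g1, g2, g3)
```

A small p-value on any of these suggests unequal variances; the median-centered and Fligner-Killeen versions are safer choices when normality is doubtful.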

3. Independence

What It Means

Observations are independent of each other:
- No systematic relationships between cases
- One observation doesn't influence another
- Random sampling from population

Common Violations

1. Clustered Data
   - Students within schools
   - Patients within hospitals
   - Repeated measures on same subjects

2. Time Series Data
   - Temporal autocorrelation
   - Seasonal patterns
   - Trend effects

3. Spatial Data
   - Geographic clustering
   - Neighborhood effects
   - Distance-based correlations

Testing Independence

1. Durbin-Watson Test (time series)
   - Tests for autocorrelation in residuals
   - Values near 2 indicate independence
   - Values < 1.5 or > 2.5 suggest problems

2. Runs Test
   - Tests randomness of sequence
   - Counts runs of consecutive values
   - Non-parametric approach

3. Ljung-Box Test
   - Tests for autocorrelation in time series
   - More general than Durbin-Watson
   - Can test multiple lags
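The Durbin-Watson statistic is simple enough to compute directly from residuals (statsmodels also provides it); a NumPy sketch contrasting independent noise with strongly autocorrelated residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals.
    Near 2 -> no first-order autocorrelation; toward 0 -> positive;
    toward 4 -> negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
white = rng.normal(size=200)        # independent noise
dw_white = durbin_watson(white)     # expected near 2

# AR(1) residuals with strong positive autocorrelation
ar = np.zeros(200)
for t in range(1, 200):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()
dw_ar = durbin_watson(ar)           # expected well below 1.5
```

For AR(1) residuals with coefficient ρ, DW is approximately 2(1 − ρ), which is why values far from 2 signal dependence.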

4. Linearity

What It Means

Relationship between variables is linear:
- Straight-line relationship
- Constant rate of change
- No curved or U-shaped patterns

When It's Required

Critical for:
- Linear regression
- Pearson correlation
- ANCOVA
- Linear mixed models

Testing Linearity

1. Scatterplots
   - Plot Y vs X
   - Look for straight-line pattern
   - Check for curves or patterns

2. Residual Plots
   - Plot residuals vs fitted values
   - Should show random scatter
   - Patterns indicate non-linearity

3. Component + Residual Plots
   - Partial regression plots
   - Show relationship after controlling for other variables
   - Useful in multiple regression

Using DataStatPro for Assumption Testing

Accessing Diagnostic Tools

  1. Navigate to Diagnostics

    • Go to Analysis → Diagnostics
    • Select assumption type
    • Choose appropriate test
  2. Available Tests

    Normality:
    - Shapiro-Wilk test
    - Kolmogorov-Smirnov test
    - Anderson-Darling test
    - Q-Q plots and histograms
    
    Homoscedasticity:
    - Levene's test
    - Bartlett's test
    - Brown-Forsythe test
    - Residual plots
    
    Independence:
    - Durbin-Watson test
    - Runs test
    - Autocorrelation plots
    
    Linearity:
    - Scatterplots
    - Residual analysis
    - Component plots
    

Step-by-Step: Testing t-test Assumptions

1. Load and Examine Data

Scenario: Comparing test scores between two groups
Group 1: n = 25, Group 2: n = 30
Outcome: Test score (0-100 scale)

2. Test Normality

DataStatPro Steps:
1. Select "Normality Tests"
2. Choose both groups
3. Run Shapiro-Wilk test
4. Generate Q-Q plots

Results:
Group 1: W = 0.94, p = 0.18 (normal)
Group 2: W = 0.89, p = 0.004 (non-normal)

3. Test Homoscedasticity

DataStatPro Steps:
1. Select "Variance Tests"
2. Choose Levene's test
3. Compare group variances

Results:
Levene's test: F = 2.34, p = 0.13 (equal variances)

4. Assess Independence

Design Review:
- Random sampling? Yes
- Independent groups? Yes
- No repeated measures? Correct
- No clustering? Correct

Conclusion: Independence assumption met

5. Decision Making

Assumption Status:
✓ Independence: Met
✓ Homoscedasticity: Met
✗ Normality: Violated (Group 2)

Options:
1. Use Welch's t-test (drops the equal-variance assumption; with n ≥ 25 per group the CLT also softens the normality requirement)
2. Apply data transformation
3. Use Mann-Whitney U test (non-parametric)
4. Bootstrap confidence intervals
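Options 1 and 3 are one-liners in SciPy. A sketch with simulated data mimicking the scenario (Group 1 roughly normal, Group 2 right-skewed; the values are illustrative, not the results quoted above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(72, 10, size=25)          # roughly normal scores
group2 = rng.lognormal(4.2, 0.25, size=30)    # right-skewed scores

# Option 1: Welch's t-test (equal_var=False drops the equal-variance assumption)
t_stat, t_p = stats.ttest_ind(group1, group2, equal_var=False)

# Option 3: Mann-Whitney U, a rank-based test with no normality assumption
u_stat, u_p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
```

If the two tests disagree, remember they answer slightly different questions: Welch compares means, while Mann-Whitney compares whole distributions via ranks.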

ANOVA Assumption Testing Example

Study Design

One-way ANOVA: 4 treatment groups
Outcome: Continuous measure
Sample sizes: n₁ = 20, n₂ = 22, n₃ = 18, n₄ = 25

Comprehensive Testing

1. Normality of Residuals

DataStatPro Process:
1. Run initial ANOVA
2. Extract residuals
3. Test residual normality
4. Create Q-Q plot

Results:
Shapiro-Wilk: W = 0.97, p = 0.08
Conclusion: Normality assumption met

2. Homogeneity of Variances

Levene's Test Results:
F(3, 81) = 1.89, p = 0.14
Conclusion: Equal variances assumption met

3. Independence

Design Check:
- Random assignment: Yes
- No repeated measures: Correct
- No clustering: Correct
Conclusion: Independence assumption met

4. Final Decision

All assumptions met → Proceed with standard ANOVA
F(3, 81) = 5.67, p = 0.001
Conclusion: Significant group differences
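The residual-extraction workflow above can be sketched in a few lines: fit the one-way ANOVA, compute residuals as each observation minus its group mean, then test the residuals. The data here are simulated with the same group sizes as the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sizes = [20, 22, 18, 25]
groups = [rng.normal(50 + 2 * i, 8, size=n) for i, n in enumerate(sizes)]

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

sw_stat, sw_p = stats.shapiro(residuals)   # normality of residuals
lev_stat, lev_p = stats.levene(*groups)    # homogeneity of variances
f_stat, f_p = stats.f_oneway(*groups)      # the ANOVA itself
```

Testing the pooled residuals, rather than each group's raw data separately, is the recommended way to check ANOVA's normality assumption.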

Remedies for Assumption Violations

Normality Violations

Data Transformations

1. Log Transformation

When to use:
- Right-skewed data
- Multiplicative relationships
- Proportional data

Formula: Y' = log(Y) or Y' = log(Y + 1)

Example:
Original: 1, 2, 5, 10, 20, 50, 100
Log-transformed: 0, 0.69, 1.61, 2.30, 3.00, 3.91, 4.61

2. Square Root Transformation

When to use:
- Poisson-distributed data
- Count data
- Moderate right skew

Formula: Y' = √Y or Y' = √(Y + 0.5)

Example:
Original: 0, 1, 4, 9, 16, 25
Square root: 0, 1, 2, 3, 4, 5
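Both transformations above can be checked numerically. This sketch uses the example values from the text and verifies that the log transform pulls in the right skew:

```python
import numpy as np
from scipy.stats import skew

original = np.array([1, 2, 5, 10, 20, 50, 100], dtype=float)
logged = np.log(original)   # natural log: 0, 0.69, 1.61, 2.30, 3.00, 3.91, 4.61

counts = np.array([0, 1, 4, 9, 16, 25], dtype=float)
rooted = np.sqrt(counts)    # 0, 1, 2, 3, 4, 5

# The log transform sharply reduces right skew
assert skew(logged) < skew(original)
```

Note the original series is log-symmetric (1·100 = 2·50 = 5·20 = 10·10 = 100), so its log is perfectly symmetric, which is exactly the behavior the transformation exploits.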

3. Reciprocal Transformation

When to use:
- Severe right skew
- Rate or time data
- Hyperbolic relationships

Formula: Y' = 1/Y or Y' = 1/(Y + 1)

Caution: Changes interpretation of results

4. Box-Cox Transformation

General power transformation:
Y' = (Y^λ - 1)/λ for λ ≠ 0
Y' = log(Y) for λ = 0

Advantage: Optimal λ chosen by maximum likelihood
DataStatPro: Automatically finds best λ
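SciPy's `boxcox` performs the same maximum-likelihood search for λ. A sketch on simulated lognormal data, where the optimal λ should land near 0 (the log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
y = rng.lognormal(mean=1.0, sigma=0.5, size=200)  # strictly positive, right-skewed

# boxcox returns the transformed data and the ML-estimated lambda
transformed, lam = stats.boxcox(y)
```

Box-Cox requires strictly positive values; shift the data first (or use the Yeo-Johnson variant) if zeros or negatives are present.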

Alternative Approaches

1. Non-parametric Tests

Instead of t-test → Mann-Whitney U test
Instead of ANOVA → Kruskal-Wallis test
Instead of correlation → Spearman's rank correlation

Advantages:
- No normality assumption
- Robust to outliers
- Valid for ordinal data

Disadvantages:
- Less powerful when normality holds
- Limited post-hoc options
- Different interpretation

2. Bootstrap Methods

Resampling approach:
- Generate many bootstrap samples
- Calculate statistic for each sample
- Create empirical distribution
- Construct confidence intervals

Advantages:
- No distributional assumptions
- Works with complex statistics
- Provides uncertainty estimates
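The four resampling steps above translate directly into a percentile bootstrap confidence interval for the mean; a NumPy sketch on a simulated skewed sample:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=10.0, size=50)   # skewed sample

n_boot = 5000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Step 1-2: resample with replacement and compute the statistic
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# Steps 3-4: empirical 95% CI from the bootstrap distribution
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

The same loop works for any statistic (median, trimmed mean, correlation) simply by swapping out `resample.mean()`.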

3. Robust Statistics

Methods less sensitive to outliers:
- Trimmed means
- Median-based tests
- M-estimators
- Winsorized statistics

Example: 20% trimmed mean
- Remove top and bottom 10% of data
- Calculate mean of remaining 80%
- More robust to extreme values
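The 20% trimmed mean described above is `scipy.stats.trim_mean` with `proportiontocut=0.1` (10% cut from each tail). A sketch with one extreme value:

```python
import numpy as np
from scipy import stats

data = np.array([4, 5, 5, 6, 6, 7, 7, 8, 8, 95], dtype=float)  # one outlier

raw_mean = data.mean()                 # 15.1, dragged upward by the outlier
trimmed = stats.trim_mean(data, 0.1)   # drops lowest and highest value -> 6.5
```

The trimmed mean sits where most of the data sit, while the raw mean is pulled far toward the single extreme value.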

Homoscedasticity Violations

Welch's Corrections

For t-tests:
- Use Welch's t-test instead of Student's t-test
- Adjusts degrees of freedom
- No assumption of equal variances

For ANOVA:
- Use Welch's ANOVA
- Adjusts for unequal variances
- More conservative approach

Weighted Least Squares

For regression:
- Weight observations by inverse variance
- Gives less weight to high-variance observations
- Requires variance estimates

Formula: Weight = 1/σᵢ²
where σᵢ² is the variance for observation i

Robust Standard Errors

Sandwich estimators:
- Huber-White robust standard errors
- Consistent even with heteroscedasticity
- Widely available in software

Advantage: Keeps original coefficients
Disadvantage: May be less efficient

Transformations for Variance Stabilization

1. Log transformation
   - When variance proportional to mean²
   - Common in biological data

2. Square root transformation
   - When variance proportional to mean
   - Poisson-type data

3. Arcsine transformation
   - For proportion data
   - Formula: Y' = arcsin(√Y)

Independence Violations

Mixed-Effects Models

For clustered data:
- Random intercepts for clusters
- Accounts for within-cluster correlation
- Appropriate standard errors

Example: Students within schools
- Fixed effects: Treatment, student characteristics
- Random effects: School-level variation

Time Series Methods

For temporal dependence:
- ARIMA models
- Autoregressive terms
- Moving average components
- Seasonal adjustments

Generalized Estimating Equations (GEE)

For correlated data:
- Specifies correlation structure
- Robust standard errors
- Population-averaged effects

Correlation structures:
- Exchangeable (equal correlations)
- AR(1) (autoregressive)
- Unstructured (all different)

Linearity Violations

Polynomial Regression

Add polynomial terms:
Y = β₀ + β₁X + β₂X² + β₃X³ + ε

Cautions:
- Higher-order terms can overfit
- Interpretation becomes complex
- Extrapolation problems
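Adding the squared term can be sketched with `numpy.polyfit`; the data here are simulated with genuine curvature, so the quadratic should beat the straight-line fit:

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 10, 60)
y = 2.0 + 1.5 * x - 0.12 * x**2 + rng.normal(0, 0.5, size=x.size)

linear = np.polyfit(x, y, deg=1)      # straight-line fit
quadratic = np.polyfit(x, y, deg=2)   # adds the X^2 term

# Compare residual sums of squares: lower is a better fit
rss_lin = np.sum((y - np.polyval(linear, x)) ** 2)
rss_quad = np.sum((y - np.polyval(quadratic, x)) ** 2)
```

In practice, prefer a formal comparison (nested-model F-test or an information criterion) over raw residual sums of squares, since adding terms always reduces RSS on the fitted data.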

Spline Regression

Piecewise polynomials:
- Flexible curves
- Smooth connections at knots
- Better for complex relationships

Types:
- Linear splines
- Cubic splines
- B-splines
- Smoothing splines

Non-linear Regression

Specific functional forms:
- Exponential: Y = ae^(bX)
- Power: Y = aX^b
- Logistic: Y = a/(1 + be^(-cX))

Requires:
- Subject matter knowledge
- Good starting values
- Iterative fitting procedures

Generalized Additive Models (GAM)

Flexible approach:
- Smooth functions of predictors
- Automatic smoothness selection
- Can handle multiple non-linear terms

Formula: Y = β₀ + f₁(X₁) + f₂(X₂) + ... + ε
where f₁, f₂, ... are smooth functions

Decision Trees for Assumption Violations

Normality Violation Decision Tree

Normality violated?
├─ Yes
│  ├─ Sample size large (n > 30)?
│  │  ├─ Yes → Proceed (CLT applies)
│  │  └─ No → Consider alternatives
│  │     ├─ Transformation possible?
│  │     │  ├─ Yes → Transform and retest
│  │     │  └─ No → Use non-parametric test
│  │     └─ Bootstrap available?
│  │        ├─ Yes → Use bootstrap CI
│  │        └─ No → Use robust methods
└─ No → Proceed with parametric test

Homoscedasticity Violation Decision Tree

Equal variances violated?
├─ Yes
│  ├─ Ratio of largest to smallest variance < 4?
│  │  ├─ Yes → Proceed with caution
│  │  └─ No → Use correction
│  │     ├─ t-test → Use Welch's t-test
│  │     ├─ ANOVA → Use Welch's ANOVA
│  │     └─ Regression → Use robust SE
└─ No → Proceed with standard test

Practical Guidelines

Sample Size Considerations

Small Samples (n < 30)

Assumptions are critical:
- Test all assumptions carefully
- Consider non-parametric alternatives
- Use exact tests when available
- Be conservative in interpretation

Large Samples (n > 100)

Central Limit Theorem helps:
- Normality less critical for means
- Focus on homoscedasticity and independence
- Statistical tests may be overpowered
- Emphasize practical significance

Multiple Assumption Violations

Priority Order

1. Independence (most critical)
2. Homoscedasticity (affects inference)
3. Normality (least critical with large n)
4. Linearity (affects model validity)

Combined Approaches

Example: Non-normal + unequal variances
- Option 1: Transform data (may fix both)
- Option 2: Use non-parametric test
- Option 3: Bootstrap with robust statistics
- Option 4: Generalized linear model

Reporting Assumption Tests

Methods Section

"Statistical assumptions were assessed prior to analysis. 
Normality was evaluated using Shapiro-Wilk tests and Q-Q plots. 
Homogeneity of variance was tested using Levene's test. 
Independence was ensured through study design. When assumptions 
were violated, [specific remedy] was applied."

Results Section

"Assumption testing revealed [specific findings]. 
[Describe any violations and remedies applied]. 
Subsequent analyses used [modified approach] to account 
for assumption violations."

Advanced Diagnostic Techniques

Regression Diagnostics

Residual Analysis

1. Standardized residuals
   - Should be approximately N(0,1)
   - Absolute values > 3 flag potential outliers

2. Studentized residuals
   - Account for leverage
   - More sensitive to outliers

3. Deleted residuals
   - Residuals with observation removed
   - Detect influential points

Influence Measures

1. Cook's Distance
   - Measures overall influence
   - Values > 4/n suggest high influence

2. Leverage (hat values)
   - Measures X-space influence
   - Values > 2p/n are high leverage

3. DFBETAS
   - Change in coefficients
   - Standardized measure of influence
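Leverage and Cook's distance can be computed by hand for a simple regression, using the cutoffs quoted above (2p/n and 4/n). The data are simulated with one deliberately high-leverage point:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.normal(0, 1, size=n)
x[0] = 6.0                                  # one high-leverage point
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverage (hat) values
p = X.shape[1]                              # number of parameters

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                            # residuals
s2 = e @ e / (n - p)                        # residual variance estimate

# Cook's distance: D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2
cooks_d = (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)

high_leverage = h > 2 * p / n               # rule of thumb from the text
influential = cooks_d > 4 / n
```

A useful sanity check: the leverage values always sum to p, the number of fitted parameters.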

Multivariate Normality

Mardia's Test

Tests multivariate skewness and kurtosis:
- More comprehensive than univariate tests
- Required for multivariate procedures
- Available in specialized software

Henze-Zirkler Test

Based on empirical characteristic function:
- Good power properties
- Works with moderate sample sizes
- Less sensitive to outliers

Troubleshooting Common Issues

Problem: All Normality Tests Significant

Cause: Large sample size makes tests overpowered
Solution: Focus on effect size and visual inspection
Action: Use Q-Q plots and consider practical impact

Problem: Transformation Doesn't Help

Cause: Severe non-normality or multiple issues
Solution: Consider non-parametric or robust methods
Action: Use rank-based tests or bootstrap procedures

Problem: Conflicting Test Results

Cause: Different tests have different sensitivities
Solution: Consider multiple sources of evidence
Action: Use visual methods alongside statistical tests

Problem: Assumptions Met But Results Don't Make Sense

Cause: May have other issues (outliers, model misspecification)
Solution: Conduct comprehensive diagnostic analysis
Action: Check for outliers, influential points, and model fit

Frequently Asked Questions

Q: How strict should I be about assumption violations?

A: It depends on the severity of violation, sample size, and consequences of errors. Minor violations with large samples are often acceptable, while severe violations require remedial action.

Q: Should I test assumptions before every analysis?

A: Yes, especially for parametric tests. Make it part of your standard analytical workflow.

Q: Can I use multiple remedies simultaneously?

A: Yes, but be careful about over-correcting. Document all modifications and their rationale.

Q: What if transformation changes my research question?

A: Consider whether the transformed scale is meaningful. Sometimes it's better to use non-parametric methods that preserve the original scale.

Q: How do I report assumption violations in publications?

A: Be transparent about violations found and remedies applied. Include diagnostic plots in supplementary materials if space allows.

Related Tutorials

Next Steps

After mastering assumption testing, consider exploring:


This tutorial is part of DataStatPro's comprehensive statistical analysis guide. For more advanced techniques and personalized support, explore our Pro features.