Misconceptions About Linear Regression Assumptions
- Andrew Yan

- Apr 21
Updated: May 13
I recently came across a LinkedIn post discussing the statistical assumptions of linear regression. Because the misconceptions in that post seem to be quite common, even among statisticians, I feel strongly compelled to write about them. The author claimed that the validity of linear regression depends on several key assumptions, namely:
Linearity: The relationship between the dependent variable Y and the independent variable(s) X must be linear.
Independence: The observations in the dataset should be independent of each other.
Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variable(s).
Normality: The residuals (errors) of the model should be normally distributed.
No multicollinearity: The independent variables should not be highly correlated.
Surprisingly, all of these assumptions, or at least the way they are interpreted, are either inaccurate or misleading. Here is why.
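Linearity: The author misstates this. Linear regression assumes linearity in the model parameters, not necessarily a linear relationship between the dependent variable and the independent variable(s). For example, a model with both X and X² as regressors is nonlinear in X but is still a linear regression.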
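Independence: Dependent observations can still yield valid inference if the dependence structure is properly taken into account, for example through generalized least squares, mixed-effects models, or cluster-robust standard errors.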
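Homoscedasticity: Homoscedasticity is not required for unbiasedness. Ordinary least squares (OLS) estimators remain unbiased and consistent even under heteroscedasticity, although they are generally no longer efficient. What breaks is the usual standard error formula, and heteroscedasticity-robust standard errors restore valid large-sample inference.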
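Normality: The importance of normality is overstated here. Normality of the errors is not needed for estimation: OLS coefficients are unbiased without it. It matters only for exact finite-sample inference (t-tests, F-tests, and confidence intervals). In large samples, these procedures remain approximately valid under mild conditions, thanks to the central limit theorem.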
No multicollinearity: Multicollinearity does not invalidate a linear regression model, and the Gauss-Markov theorem still holds in its presence, provided the collinearity is not perfect. Although it can inflate the variance of parameter estimates, it does not inherently reduce the model’s predictive performance.
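To make the linearity point concrete, here is a minimal sketch in Python with NumPy. The data are simulated and the coefficient values, sample size, and seed are my own assumptions for illustration. Y depends on X through a quadratic curve, yet OLS recovers the coefficients because the model is linear in the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical data: Y is a quadratic (nonlinear) function of X plus noise.
x = np.linspace(-3, 3, 200)
true_beta = np.array([1.0, -2.0, 0.5])  # intercept, X, X^2 (assumed values)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2 + rng.normal(0, 0.1, x.size)

# The design matrix contains nonlinear functions of X, but the model is
# still linear in the coefficients, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("true coefficients:     ", true_beta)
print("estimated coefficients:", beta_hat)
```

The estimates should land close to the assumed true values, even though a scatter plot of Y against X would be clearly curved.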
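The homoscedasticity point can be checked with a small Monte Carlo sketch. Everything here is an assumption chosen for illustration: the true intercept and slope, the error standard deviation growing in proportion to x, and the number of replications. If OLS is unbiased under heteroscedasticity, the average of the estimated slopes should sit near the true slope.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

n, n_sims = 200, 2000
true_intercept, true_slope = 2.0, 3.0   # assumed values for the simulation
x = rng.uniform(1.0, 5.0, n)            # regressor held fixed across replications
X = np.column_stack([np.ones(n), x])

slopes = np.empty(n_sims)
for i in range(n_sims):
    # Error standard deviation grows with x, so the errors are heteroscedastic.
    eps = rng.normal(0.0, 0.5 * x)
    y = true_intercept + true_slope * x + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes[i] = beta_hat[1]

mean_slope = slopes.mean()
print("true slope:            ", true_slope)
print("mean of OLS estimates: ", mean_slope)
```

This sketch only speaks to unbiasedness of the point estimate. It does not compute standard errors, which is where heteroscedasticity does its damage.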
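The multicollinearity point can be illustrated the same way. The sketch below is a simulation under my own assumptions: two Gaussian regressors with correlation either 0 or 0.99, unit error variance, and test data drawn from the same distribution as the training data. It compares the spread of one coefficient estimate with the out-of-sample mean squared error.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n, n_sims = 100, 1000
beta = np.array([1.0, 2.0, -1.0])  # intercept, x1, x2 (assumed values)

def simulate(rho):
    """Return (std of the x1 coefficient, mean test MSE) for corr(x1, x2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = np.empty(n_sims)
    mses = np.empty(n_sims)
    for i in range(n_sims):
        # First n rows are training data, last n rows are test data,
        # both drawn from the same joint distribution.
        Z = rng.multivariate_normal([0.0, 0.0], cov, size=2 * n)
        X = np.column_stack([np.ones(2 * n), Z])
        y = X @ beta + rng.normal(0.0, 1.0, 2 * n)
        b, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)
        coefs[i] = b[1]
        mses[i] = np.mean((y[n:] - X[n:] @ b) ** 2)
    return coefs.std(), mses.mean()

sd_low, mse_low = simulate(0.0)
sd_high, mse_high = simulate(0.99)
print(f"rho=0.00: coefficient std = {sd_low:.3f}, test MSE = {mse_low:.3f}")
print(f"rho=0.99: coefficient std = {sd_high:.3f}, test MSE = {mse_high:.3f}")
```

Under these assumptions the coefficient estimates should be far noisier in the highly correlated case, while the test error stays about the same. The caveat is that the new data must share the correlation structure of the training data; predictions for combinations of regressors unlike those seen in training can be much less reliable.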