Misconceptions About Linear Regression Assumptions
- Andrew Yan

- 3 days ago
- 2 min read
Updated: 2 days ago
I recently came across a LinkedIn post discussing the statistical assumptions of linear regression. Because the misconceptions in that post seem to be quite common, even among statisticians, I feel compelled to address them here. The author claimed that the validity of linear regression depends on several key assumptions, namely:
Linearity: The relationship between the dependent variable Y and the independent variable(s) X must be linear.
Independence: The observations in the dataset should be independent of each other.
Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variable(s).
Normality: The residuals (errors) of the model should be normally distributed.
No multicollinearity: The independent variables should not be highly correlated.
Surprisingly, all of these statements, or at least the way they are commonly interpreted, are either inaccurate or misleading. Here is why.
Linearity: This is misstated by the author. Linear regression assumes linearity in model parameters, not necessarily a linear relationship between the dependent variable and the independent variable(s).
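To make this concrete, here is a minimal numpy sketch (the data and coefficients are made up for illustration): the relationship between Y and X is quadratic, yet the model is still a linear regression because it is linear in the parameters.

```python
import numpy as np

# Hypothetical example: Y depends on X through a quadratic curve,
# but the model is linear in (b0, b1, b2), so OLS applies directly.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, 500)

# Design matrix with a squared column: still a *linear* model.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to the true values [1.0, 2.0, 0.5]
```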
Independence: Dependent observations can still yield valid inference if the dependence structure is properly accounted for, for example with generalized least squares, cluster-robust standard errors, or mixed-effects models.
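As one illustration (a simulated, hypothetical setup), here is a sketch of cluster-robust "sandwich" standard errors: the observations within each cluster share an error component, and the robust covariance repairs the inference that naive OLS standard errors would get wrong.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, per = 40, 10
n = n_clusters * per
cluster = np.repeat(np.arange(n_clusters), per)
x = rng.normal(size=n)
# Each cluster shares a common error component, so observations
# within a cluster are dependent.
u = rng.normal(size=n_clusters)[cluster] + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Naive OLS standard errors, which (wrongly) assume independence.
s2 = resid @ resid / (n - 2)
naive_se = np.sqrt(s2 * np.diag(XtX_inv))

# Cluster-robust (sandwich) standard errors: sum score contributions
# within each cluster before forming the "meat" of the sandwich.
meat = np.zeros((2, 2))
for g in range(n_clusters):
    s = X[cluster == g].T @ resid[cluster == g]
    meat += np.outer(s, s)
robust_se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(naive_se, robust_se)
```

The slope estimate itself is still fine; what changes is the standard error, and in this setup the intercept's robust standard error comes out noticeably larger than the naive one.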
Homoscedasticity: Homoscedasticity is not required for unbiasedness. Ordinary least squares (OLS) estimators remain unbiased and consistent even under heteroscedasticity, although they may no longer be efficient.
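A quick Monte Carlo sketch (with made-up parameters) makes the unbiasedness point visible: even when the noise scale grows with x, the OLS estimates average out to the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
true = np.array([1.0, 2.0])
estimates = []
for _ in range(2000):
    x = rng.uniform(0, 1, 200)
    # Heteroscedastic errors: the noise scale grows with x.
    y = true[0] + true[1] * x + rng.normal(0, 0.2 + 2.0 * x)
    X = np.column_stack([np.ones_like(x), x])
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
mean_est = np.mean(estimates, axis=0)
print(mean_est)  # close to [1.0, 2.0] despite heteroscedasticity
```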
Normality: The importance of normality is overstated here. It is not needed for estimation at all, and even for hypothesis tests and confidence intervals it is only required for exact finite-sample inference; in large samples, the central limit theorem justifies the usual tests and intervals under mild conditions.
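A simulation sketch of that large-sample point (hypothetical numbers throughout): with heavily skewed, decidedly non-normal errors, the usual 95% confidence interval for the slope still covers the truth at close to its nominal rate.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims, cover = 300, 2000, 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    # Heavily skewed (centered exponential) errors: clearly non-normal.
    y = 1.0 + 2.0 * x + (rng.exponential(1.0, n) - 1.0)
    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    se = np.sqrt((resid @ resid / (n - 2)) * XtX_inv[1, 1])
    # Count how often the usual 95% interval covers the true slope.
    cover += abs(beta[1] - 2.0) < 1.96 * se
coverage = cover / n_sims
print(coverage)  # close to the nominal 0.95, courtesy of the CLT
```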
No multicollinearity: Multicollinearity does not invalidate a linear regression model; as long as the correlation is not perfect, the Gauss-Markov theorem still holds. Although multicollinearity inflates the variance of individual parameter estimates, it does not inherently reduce the model's predictive performance.
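Both halves of that claim show up in a small simulation (again with invented data): the variance inflation factor is enormous, so individual coefficients are unstable, yet the fitted values track the response closely.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, n)  # nearly a copy of x1: severe multicollinearity
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Variance inflation factor for x1: huge, so individual coefficient
# standard errors blow up ...
r2 = np.corrcoef(x1, x2)[0, 1] ** 2
vif = 1.0 / (1.0 - r2)

# ... yet the fitted values still track y closely.
pred = X @ beta
corr = np.corrcoef(pred, y)[0, 1]
print(vif, corr)
```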
Even more surprising, the author's LinkedIn profile lists the title "Professor of Data Science and Machine Learning," which appears to be inflated or, at the very least, presented in a misleading way to attract clients to his consulting business.