The Propensity Score Controversy

Andrew Yan
May 20

Propensity score (PS) methods are among the most widely used statistical tools for causal inference in observational studies. In medical research, epidemiology, economics, political science, and, increasingly, data science, they are often perceived as a principled way to adjust for confounding when randomization is not available.

Despite their popularity, propensity scores remain deeply controversial among statisticians. Some view them as an elegant design-based framework for reducing bias in observational studies. Others regard them as overused, misunderstood, and, at times, fundamentally misguided. The controversy is often not about the mathematics itself. Rather, it centers on a much deeper question: Can statistical modeling and adjustment meaningfully substitute for randomization?

What Is a Propensity Score?

The propensity score, introduced by Paul Rosenbaum and Donald Rubin (1983), is the probability of receiving treatment conditional on observed covariates. Formally,

𝑒(𝑋) = Pr(𝑇 = 1 | 𝑋),

where 𝑇 is the treatment indicator and 𝑋 represents observed baseline covariates.

The key theoretical result is that, if treatment assignment is strongly ignorable conditional on 𝑋, that is,

treatment assignment is independent of the potential outcomes given observed covariates, and
each subject has a positive probability of receiving either treatment,

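then treatment assignment is also strongly ignorable conditional on the scalar 𝑒(𝑋) alone. Because 𝑒(𝑋) is a balancing score (subjects with the same 𝑒(𝑋) have the same distribution of 𝑋 in both treatment groups), adjustment based on 𝑒(𝑋) balances the observed covariates and, under strong ignorability, removes the confounding they induce.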
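In symbols, with 𝑒(𝑋) the probability of treatment given 𝑋, the two properties and the identification result they support can be sketched as follows. The potential-outcome notation 𝑌(0) and 𝑌(1) is an assumption introduced here for illustration; the text above does not name these quantities.

```latex
% Balancing property: holds for the true propensity score, with or without ignorability.
X \perp T \mid e(X)

% Strong ignorability given X carries over to the scalar e(X).
\bigl(Y(0), Y(1)\bigr) \perp T \mid X, \quad 0 < e(X) < 1
\;\Longrightarrow\;
\bigl(Y(0), Y(1)\bigr) \perp T \mid e(X)

% Consequence: the average treatment effect is identified by conditioning on e(X).
E\bigl[Y(1) - Y(0)\bigr]
  = E_{e(X)}\Bigl[\, E\bigl[Y \mid T = 1, e(X)\bigr] - E\bigl[Y \mid T = 0, e(X)\bigr] \Bigr]
```

Note that only the first line is checkable from data. The ignorability premise in the second line is an assumption about variables that may not have been measured.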

This balancing property is the foundation of several commonly used methods, including

PS matching (PSM),
inverse probability weighting,
stratification, and
doubly robust estimation.
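As a rough illustration of two of these methods, the sketch below applies inverse probability weighting and stratification to simulated data. Everything in it is an assumption made for the example: a single binary confounder, a true treatment effect of 1.0, and a propensity score estimated as the treated fraction within each covariate level rather than by a fitted regression model. The function names are hypothetical, and the sketch does not cover matching or doubly robust estimation.

```python
import random

def simulate(n=20000, seed=0):
    """Simulate (x, t, y) rows with one binary confounder x; the true effect is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)                   # hypothetical "sicker patient" flag
        t = int(rng.random() < (0.8 if x else 0.2))   # sicker patients are treated more often
        y = 1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)   # outcome depends on both t and x
        rows.append((x, t, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_difference(rows):
    """Unadjusted difference in mean outcome, treated minus untreated."""
    return mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)

def estimate_propensity(rows):
    """Estimate e(x) as the treated fraction within each level of x."""
    levels = {x for x, _, _ in rows}
    return {lv: mean(t for x, t, _ in rows if x == lv) for lv in levels}

def ipw_ate(rows, ps):
    """Normalized (Hajek) inverse probability weighted estimate of the average effect."""
    num1 = sum(y / ps[x] for x, t, y in rows if t == 1)
    den1 = sum(1.0 / ps[x] for x, t, y in rows if t == 1)
    num0 = sum(y / (1.0 - ps[x]) for x, t, y in rows if t == 0)
    den0 = sum(1.0 / (1.0 - ps[x]) for x, t, y in rows if t == 0)
    return num1 / den1 - num0 / den0

def stratified_ate(rows, ps):
    """Average the within-stratum differences over strata of the estimated PS."""
    total = 0.0
    for score in set(ps.values()):
        stratum = [(t, y) for x, t, y in rows if ps[x] == score]
        diff = (mean(y for t, y in stratum if t == 1)
                - mean(y for t, y in stratum if t == 0))
        total += diff * len(stratum) / len(rows)
    return total

rows = simulate()
ps = estimate_propensity(rows)
print("naive:     ", round(naive_difference(rows), 3))
print("IPW:       ", round(ipw_ate(rows, ps), 3))
print("stratified:", round(stratified_ate(rows, ps), 3))
```

In this setup the naive difference is expected to land near 2.2, because treated subjects are sicker on average, while both adjusted estimates should fall close to the true value of 1.0. With a single discrete covariate and a propensity score estimated within each level, the two adjusted estimators coincide algebraically. That agreement is a feature of this toy design, not a general property.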

In principle, PS methods attempt to mimic certain aspects of randomized experiments using observational data. This is precisely where the controversy begins.

The Appeal of Propensity Scores

The attraction of PS methods is easy to understand. Observational studies often suffer from severe baseline imbalance. For example,

older patients may be more likely to receive one treatment over another;
sicker subjects may be treated more aggressively; and
physicians may prescribe therapies based on prognosis.

Direct between-group comparisons can therefore be badly biased. Propensity scores appear to offer a solution: estimate the probability of treatment assignment, balance covariates, then compare outcomes among “similar” subjects.

The approach is intuitive and operationally attractive. In practice, researchers often produce balance tables, love plots, matched cohorts, stabilized weights, and adjusted treatment effect estimates that visually resemble the outputs of randomized studies. Unfortunately, resemblance is not equivalence.

The Central Criticism

The most important criticism of PS methods is simple: balancing observed covariates does not create randomization. Randomization is fundamentally a design mechanism. It protects against:

both known and unknown confounding,
conscious and unconscious selection bias, and
dependence on modeling assumptions.

PS methods cannot accomplish this because they only operate on observed variables. If an important confounder is omitted, measured poorly, or completely unknown, the resulting analysis may still be seriously biased, regardless of how impressive the balance diagnostics appear.
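The point can be made concrete with a small simulation. The sketch below is a toy design, not a general result: it assumes one observed binary covariate `x`, one unobserved binary covariate `u`, and a true effect of 1.0. All names and coefficients are hypothetical. It weights on `x` alone, checks balance, and compares the result with an "oracle" analysis that also uses `u`, which no real analyst could run.

```python
import random

def simulate(n=20000, seed=1):
    """Rows (x, u, t, y): x is observed, u is not; the true effect is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        u = int(rng.random() < 0.5)
        t = int(rng.random() < 0.2 + 0.3 * x + 0.4 * u)   # both x and u drive treatment
        y = 1.0 * t + 1.0 * x + 2.0 * u + rng.gauss(0.0, 1.0)
        rows.append((x, u, t, y))
    return rows

def ipw_fit(rows, key):
    """Estimate the PS within each stratum defined by key(row); return (weight, row) pairs."""
    counts = {}
    for row in rows:
        n, n_treated = counts.get(key(row), (0, 0))
        counts[key(row)] = (n + 1, n_treated + row[2])
    weighted = []
    for row in rows:
        n, n_treated = counts[key(row)]
        e = n_treated / n
        w = 1.0 / e if row[2] == 1 else 1.0 / (1.0 - e)
        weighted.append((w, row))
    return weighted

def weighted_gap(weighted, value):
    """Weighted mean of value(row) among treated minus among untreated."""
    def wmean(arm):
        num = sum(w * value(r) for w, r in weighted if r[2] == arm)
        den = sum(w for w, r in weighted if r[2] == arm)
        return num / den
    return wmean(1) - wmean(0)

rows = simulate()
x_only = ipw_fit(rows, key=lambda r: r[0])           # what an analyst can actually do
oracle = ipw_fit(rows, key=lambda r: (r[0], r[1]))   # impossible in practice: uses u

print("imbalance in x after weighting:", round(weighted_gap(x_only, lambda r: r[0]), 6))
print("imbalance in u after weighting:", round(weighted_gap(x_only, lambda r: r[1]), 3))
print("effect estimate, x only:       ", round(weighted_gap(x_only, lambda r: r[3]), 3))
print("effect estimate, x and u:      ", round(weighted_gap(oracle, lambda r: r[3]), 3))
```

The observed covariate is balanced essentially perfectly after weighting, yet the estimate based on `x` alone remains well above the true effect because `u` is still imbalanced. Nothing in the balance diagnostics for `x` reveals the problem.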

This concern was repeatedly emphasized by Paul Rosenbaum, who argued that hidden bias remains the central problem in observational research. In other words, propensity scores may reduce bias, but they cannot eliminate the fundamental uncertainty created by nonrandomized treatment assignment.

Below are several influential critiques and perspectives.

Donald Rubin – "Propensity Scores Do Not Create Randomization"

Ironically, Rubin helped popularize propensity scores, yet he repeatedly emphasized their limitations.

A key point from Rubin is that propensity scores can balance observed covariates, but they cannot balance unobserved confounders. This may sound obvious, but it is often forgotten in practice. Rubin stressed that PS methods are fundamentally design tools, not magic corrections for bias.

A common misuse is to fit a PS model, obtain “balanced” tables, and then speak as if causal inference has been fully justified. Rubin repeatedly argued that causal validity depends primarily on the study design, not on the sophistication of the PS model. He also emphasized that overlap, covariate balance, and sensitivity to hidden bias must all be carefully assessed.

Paul Rosenbaum – "Hidden Bias Remains the Central Problem"

Rosenbaum’s critique is perhaps the most philosophically important. His central message is that observational studies remain vulnerable to hidden bias, regardless of how well propensity scores balance observed variables. Rosenbaum strongly advocated sensitivity analysis, design-based thinking, and careful matching, rather than blind reliance on PS estimation.

One of his recurring themes is that two subjects with identical propensity scores may still differ systematically in important unmeasured ways. This directly challenges the tendency to interpret PS-adjusted analyses as "quasi-randomized".

Rosenbaum often emphasized that observational studies should be viewed as opportunities for careful bias reduction, not substitutes for randomized experiments.
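One simple form of the sensitivity analysis Rosenbaum advocates applies to the sign test on matched pairs. The sketch below assumes a study with no tied pairs and a one-sided test, and it uses a hypothetical function name and made-up counts. The parameter `gamma` bounds how much more likely one subject in a pair could be to receive treatment than the other because of an unmeasured covariate. Under the null hypothesis, the chance that the treated subject has the higher outcome is then at most `gamma / (1 + gamma)`.

```python
from math import comb

def sign_test_upper_pvalue(n_pairs, n_treated_higher, gamma):
    """Upper bound on the one-sided sign-test p-value when hidden bias is at most gamma.

    gamma = 1 corresponds to no hidden bias (treatment as good as random within pairs).
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    # Largest null probability that the treated unit has the higher outcome.
    p_plus = gamma / (1.0 + gamma)
    # Binomial upper tail: P(count >= n_treated_higher) with success probability p_plus.
    return sum(
        comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_treated_higher, n_pairs + 1)
    )

# Hypothetical matched study: 100 pairs without ties, treated unit higher in 70 of them.
for gamma in (1.0, 1.5, 2.0):
    print(gamma, sign_test_upper_pvalue(100, 70, gamma))
```

The value of `gamma` at which the bound first exceeds the chosen significance level gives a rough measure of how much hidden bias would be needed to explain away the finding. It does not say whether that much bias is actually present.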

Judea Pearl – "Propensity Scores Can Obscure Causal Structure"

Pearl has been sharply critical of purely statistical approaches to causal inference, including naive use of propensity scores. His critique goes deeper than the observation that "unmeasured confounding exists". Instead, he argues that propensity score methods alone do not make causal assumptions explicit, while causal diagrams, such as DAGs, do.

A major criticism from Pearl is that balancing variables without understanding causal structure can introduce bias rather than remove it. For example, adjusting for colliders, mediators, or variables affected by treatment can create serious distortions.
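A small simulation illustrates the collider case. The sketch below is an assumed toy design with hypothetical names: treatment `t` is randomized, so the unadjusted comparison is already unbiased for the true effect of 1.0, and `c` is a variable influenced by both treatment and outcome. Stratifying on `c` then distorts an estimate that needed no adjustment.

```python
import random

def simulate(n=20000, seed=2):
    """Rows (t, y, c): t is randomized, y depends on t, c is a collider of t and y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = int(rng.random() < 0.5)
        y = 1.0 * t + rng.gauss(0.0, 1.0)            # true effect of t on y is 1.0
        c = int(t + y + rng.gauss(0.0, 1.0) > 1.0)   # c is caused by both t and y
        rows.append((t, y, c))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def difference(rows):
    """Difference in mean outcome, treated minus untreated."""
    return mean(y for t, y, _ in rows if t == 1) - mean(y for t, y, _ in rows if t == 0)

def difference_adjusted_for_c(rows):
    """Stratify on the collider c and average the within-stratum differences."""
    total = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[2] == level]
        total += difference(stratum) * len(stratum) / len(rows)
    return total

rows = simulate()
print("unadjusted:     ", round(difference(rows), 3))
print("adjusted for c: ", round(difference_adjusted_for_c(rows), 3))
```

In this design the adjusted estimate falls well below the true effect. Including `c` in a PS model would tend to improve its apparent balance while making the causal estimate worse, which is why the choice of adjustment variables has to come from causal reasoning and not from the balance table.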

Pearl argued that causal identification should come before statistical adjustment. From this perspective, propensity scores are computational tools, not causal theory.

Stephen Senn – Propensity Scores Are "Illogical, Incoherent, Inadmissible, and Irrelevant"

This is probably one of the harshest critiques of propensity scores. Senn has often argued that observational adjustments are frequently over-trusted relative to randomization. A recurring theme in his writing is that statistical adjustment is not a substitute for randomization. In clinical trials, Senn has emphasized that randomization justifies inference, whereas observational adjustment relies heavily on assumptions that are fundamentally unverifiable.

He has also criticized the tendency to treat increasingly complex adjustment procedures as if they could compensate for poor design. This aligns with his broader philosophy: design dominates analysis.

Gary King and Richard Nielsen – "Why Propensity Scores Should Not Be Used for Matching"

Their 2019 paper generated substantial discussion, in part because its title was deliberately provocative. They argued that PSM can actually increase imbalance, increase model dependence, and reduce efficiency.

Their key insight was that matching on a scalar PS may discard useful multivariate information. In many settings, simpler matching methods, such as Mahalanobis distance matching or direct covariate matching, perform better. One of their strongest claims was that PSM approximates a completely randomized experiment rather than a more efficient fully blocked one, so that pruning observations beyond a certain point can increase imbalance instead of reducing it. This directly challenged the widespread routine use of PSM in applied research.

Miguel Hernán – Criticism of "Black-box Causal Inference"

Hernán has criticized the tendency to apply PS methods mechanically, without explicitly defining the target causal question. He emphasizes target trial emulation, causal estimands, time zero alignment, and careful longitudinal design.

A key critique is that many PS analyses fail because the causal question itself is poorly specified. For example, immortal time bias, time-varying confounding, and selection bias cannot be fixed merely by fitting a PS model.
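Immortal time bias is easy to reproduce in a simulation with no confounding at all. The sketch below is an assumed toy design with hypothetical names: survival times and treatment start times are independent exponential draws, and treatment has no effect. Subjects count as "ever treated" only if they survive long enough to start treatment, which is exactly the misalignment of time zero described above.

```python
import random

def simulate(n=20000, seed=3):
    """Rows (death_time, start_time); treatment has no effect on survival at all."""
    rng = random.Random(seed)
    return [(rng.expovariate(1.0), rng.expovariate(1.0)) for _ in range(n)]

def naive_survival_gap(rows):
    """Mean survival of 'ever treated' minus 'never treated', both counted from time zero."""
    treated = [d for d, s in rows if s < d]      # lived long enough to start treatment
    untreated = [d for d, s in rows if s >= d]   # died before treatment could start
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

def person_time_rate_ratio(rows):
    """Death rate while treated divided by death rate while untreated."""
    treated_time = sum(d - s for d, s in rows if s < d)
    untreated_time = sum(min(d, s) for d, s in rows)   # everyone starts out untreated
    treated_deaths = sum(1 for d, s in rows if s < d)
    untreated_deaths = sum(1 for d, s in rows if s >= d)
    return (treated_deaths / treated_time) / (untreated_deaths / untreated_time)

rows = simulate()
print("naive survival gap:    ", round(naive_survival_gap(rows), 3))
print("person-time rate ratio:", round(person_time_rate_ratio(rows), 3))
```

The naive comparison shows treated subjects living about one time unit longer even though treatment does nothing, while the analysis that assigns follow-up time to the correct exposure state gives a rate ratio near 1. There are no covariates here, so a PS model would have nothing to adjust for, and the bias comes entirely from how the comparison is defined.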

Common Practical Critiques from Modern Statisticians

Even among researchers who routinely use PS methods, several recurring concerns arise.

Extreme weights: Inverse probability weighting can become highly unstable when propensity scores are close to 0 or 1. This can lead to large variance, sensitivity to model misspecification, and highly influential observations.
Model dependence: Results can change substantially depending on variable selection, functional forms, interactions, trimming rules, and matching algorithms.
Balance ≠ causal validity: Covariate balance diagnostics only assess observed variables. They provide no guarantee against hidden confounding, measurement error, or selection bias.
Large-sample illusion: PS methods may create an illusion of rigor because tables look balanced, plots appear convincing, and standard errors become small. But systematic bias does not shrink as the sample grows; a larger sample only yields a more precise estimate of the wrong quantity.
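The extreme-weights concern can be checked with a simple diagnostic. The sketch below uses hypothetical function names and made-up propensity scores. It computes inverse probability weights, optionally clips the scores first, and reports Kish's effective sample size, one common way to summarize how much a few large weights dominate an analysis.

```python
def ipw_weights(ps, treated, clip=None):
    """Inverse probability weights; optionally clip scores to [clip, 1 - clip] first."""
    weights = []
    for e, t in zip(ps, treated):
        if clip is not None:
            e = min(max(e, clip), 1.0 - clip)
        weights.append(1.0 / e if t == 1 else 1.0 / (1.0 - e))
    return weights

def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical treated subjects; the last one has a propensity score near zero.
ps = [0.5, 0.5, 0.5, 0.01]
treated = [1, 1, 1, 1]

raw = ipw_weights(ps, treated)
clipped = ipw_weights(ps, treated, clip=0.05)

print("largest raw weight:", max(raw))
print("ESS with raw weights:    ", round(effective_sample_size(raw), 3))
print("ESS with clipped weights:", round(effective_sample_size(clipped), 3))
```

In this example a single subject carries almost all of the weight, so four observations behave like barely more than one. Clipping improves stability but changes the weights away from their theoretical values, so it trades variance for potential bias and is itself one of the analyst choices that feed model dependence.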

A Deeper Philosophical Divide

Much of the debate reflects two different philosophies.

The adjustment perspective holds that, if enough covariates are measured and modeled properly, causal effects may be estimated from observational data.
The design-first perspective holds that, without randomization, causal conclusions always remain vulnerable to unverifiable assumptions.

Many prominent critics fall closer to the second view.

A Concise Summary

A common misconception is that propensity scores make observational studies equivalent to randomized trials. Most leading experts would reject that perspective. A more defensible view is that PS methods can reduce confounding by observed covariates under strong assumptions, but they cannot reproduce the inferential foundation created by randomization.

References

Rosenbaum, P. R. and Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70 (1), 41-55.

King, G. and Nielsen, R. (2019). Why Propensity Scores Should Not Be Used for Matching. Political Analysis, 27 (4), 435-454.
