Article Title: A framework for understanding selection bias in real-world healthcare data
Authors & Year: R. Kundu, X. Shi, J. Morrison, J. Barrett, and B. Mukherjee (2024)
Journal: Journal of the Royal Statistical Society Series A: Statistics in Society [DOI:10.1093/jrsssa/qnae039]
Review Prepared by Peter A. Gao
Electronic health record (EHR) databases compile hundreds of thousands, or even millions, of patients’ medical histories, enabling researchers to study large populations and observe how their health evolves over time. These databases present an opportunity to identify risk factors for certain diseases, evaluate the efficacy of treatments for people of different backgrounds, and map health disparities. However, individuals are rarely included in such datasets at random, meaning the observed sample may not be representative of the target population. If certain groups are underrepresented in EHR data, using those data to measure the prevalence of a condition or to assess the association between a risk factor and a condition may result in biased estimates. This bias is known as selection bias, since it is caused by the non-random selection of individuals into the sample.
Selection bias is especially problematic for large-scale health datasets because the typical sources of non-representativeness are not removed by simply increasing the sample size. For example, patients typically enter EHR datasets when they receive medical care, meaning that individuals without access to healthcare may not be included. Regardless of the size of the dataset, findings based on EHR data may not be readily generalizable to these populations. In fact, if used without addressing selection bias, EHR datasets can give researchers a false sense of security, as their large size can lead to estimators with high estimated precision (but non-negligible bias). In 2018, Harvard University professor Xiao-Li Meng coined a name for this phenomenon: the “big data paradox,” in which having more data leads us to be overly confident in our results when, in reality, selection bias and other data quality issues may produce spurious findings [1].
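As a toy illustration of the big data paradox (a sketch of our own, not taken from Meng’s paper), consider estimating a population mean from a sample in which individuals with larger outcome values are more likely to be included; all values and names below are illustrative assumptions. Increasing the sample size shrinks the naive standard error but leaves the selection bias essentially untouched.

```python
# Toy sketch of the big data paradox (illustrative assumptions throughout):
# individuals with larger outcomes are more likely to enter the sample, so
# more data shrinks the naive standard error but not the bias.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000                                    # hypothetical population size
y = rng.normal(loc=10.0, scale=5.0, size=N)      # outcome of interest
true_mean = y.mean()

# Inclusion is more likely for large y (non-random selection).
p_include = 1 / (1 + np.exp(-(y - 10.0)))
p_include /= p_include.sum()

for n in (1_000, 10_000, 100_000):
    sample = y[rng.choice(N, size=n, p=p_include)]   # biased sampling scheme
    est = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)             # naive standard error
    print(f"n={n:>7,}: estimate={est:.2f}  naive SE={se:.4f}  "
          f"bias={est - true_mean:+.2f}")
```

In runs of this kind, the bias stays roughly constant while the naive standard error keeps shrinking, so the estimate looks increasingly precise while remaining systematically off target.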
A further problem is that what it means for a sample to be “representative” of a population is often loosely defined. Not only is selection bias a pervasive issue, but its effects can also be nuanced and specific to the situation at hand. The impact it has on a given statistical analysis is determined by (1) the questions we are interested in answering and (2) how we formalize those questions in terms of statistical models. A recent paper from researchers at the University of Michigan and the University of Cambridge [2] works through the logic of several such scenarios, illustrating the various ways selection bias can affect a study seeking to identify risk factors for a disease.
Consider the following example: suppose we are interested in identifying risk factors for Condition A in a well-defined target population. In our sample, we have information on every person’s (a) height and (b) age. Assuming our sample is a random selection of individuals from our population, we can use a logistic regression model to quantify the impact of both height and age on the risk of having Condition A. However, consider an alternative scenario in which all recruitment into the sample takes place at a local fitness center. Members of this sample likely exercise far more often than the average individual in our population, and there’s a good chance that the sampled individuals also have a different age distribution than our target population. By contrast, it’s reasonable to assume that a person’s height does not affect their likelihood of visiting a fitness center, so the distribution of heights may not differ substantially between the sample and the population. As a result, under this recruitment scheme, our estimate of the effect of age on an individual’s risk of having Condition A may be biased, though our estimate of the effect of height should be unbiased.
For any study using EHR data, it is thus critical to understand how the variables of interest may affect an individual’s likelihood of being selected into the dataset. Using graphical models, Kundu et al. provide a concise framework for formalizing our understanding of numerous selection bias scenarios, giving researchers a tool for comprehensively evaluating the potential impacts of selection bias on their analyses. Graphical models describe the dependence structure among the variables in our dataset. For example, in a graphical model of our fitness center sampling scheme, age influences inclusion in our sample through its association with exercise frequency, and ignoring this relationship will bias our estimate of the effect of age on an individual’s risk of having Condition A.
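To make the fitness-center scenario concrete, here is a minimal simulation sketch (our own illustration, not code or parameter values from the paper): Condition A depends on age, height, and exercise; exercise declines with age; and recruitment depends only on exercise frequency. Fitting the same logistic regression of Condition A on age and height in the full population and in the recruited sample shows the age coefficient shifting while the height coefficient is roughly unchanged.

```python
# Minimal simulation sketch of the fitness-center scenario (all names and
# coefficients are illustrative assumptions, not taken from the paper).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N = 200_000

age = rng.uniform(20, 80, N)
height = rng.normal(170, 10, N)                  # independent of age and exercise
exercise = np.clip(6 - 0.1 * (age - 20) + rng.normal(0, 1.5, N), 0, None)

# Risk of Condition A rises with age, falls with exercise, small height effect.
logit_risk = -2 + 0.05 * age - 0.5 * exercise + 0.02 * (height - 170)
condition_a = rng.binomial(1, 1 / (1 + np.exp(-logit_risk)))

# Recruitment at the fitness center depends only on exercise frequency.
p_recruit = 1 / (1 + np.exp(-2 * (exercise - 4)))
selected = rng.binomial(1, p_recruit).astype(bool)

# Same logistic regression (age and height only) in the population vs. the sample.
X = sm.add_constant(np.column_stack([age, height]))
population_fit = sm.Logit(condition_a, X).fit(disp=0)
sample_fit = sm.Logit(condition_a[selected], X[selected]).fit(disp=0)

print("population coefficients (const, age, height):", population_fit.params.round(3))
print("fitness-center sample coefficients:          ", sample_fit.params.round(3))
```

Because height is independent of exercise and therefore of recruitment, its coefficient is largely unaffected; the age coefficient changes because recruitment depends on exercise, which is itself associated with age and with Condition A.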
The authors explore other possible models for how selection into a sample may depend on the variables of interest and suggest statistical approaches suited to each selection bias scenario, using a simulation study to show which methods perform best in each of their chosen situations. Kundu et al. thus provide actionable advice for first proposing a model for selection bias and then selecting an appropriate method for carrying out an association study. Even if researchers disagree about exactly how selection bias operates in a given situation, having a common language is crucial for making that disagreement explicit.
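To give a flavor of what such a correction can look like, the sketch below applies inverse probability of selection weighting, one standard remedy discussed in this literature; it is our own hedged illustration with assumed-known recruitment probabilities and a milder recruitment mechanism than above, not necessarily the method Kundu et al. recommend for any particular scenario. Reweighting the recruited sample by the inverse of its selection probabilities should move the estimates back toward the population-level fit.

```python
# Hedged sketch of inverse probability of selection weighting (IPW) on a
# simulated fitness-center-style population; recruitment probabilities are
# assumed known here, which is rarely true in practice.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
N = 200_000
age = rng.uniform(20, 80, N)
height = rng.normal(170, 10, N)
exercise = np.clip(6 - 0.1 * (age - 20) + rng.normal(0, 1.5, N), 0, None)
risk = 1 / (1 + np.exp(-(-2 + 0.05 * age - 0.5 * exercise + 0.02 * (height - 170))))
condition_a = rng.binomial(1, risk)

p_sel = 1 / (1 + np.exp(-(exercise - 3)))        # recruitment probability (known here)
selected = rng.binomial(1, p_sel).astype(bool)

X = sm.add_constant(np.column_stack([age, height]))
naive = sm.GLM(condition_a[selected], X[selected],
               family=sm.families.Binomial()).fit()
ipw = sm.GLM(condition_a[selected], X[selected],
             family=sm.families.Binomial(),
             freq_weights=1 / p_sel[selected]).fit()   # point estimates only
target = sm.GLM(condition_a, X, family=sm.families.Binomial()).fit()

print("unweighted (selected sample): ", naive.params.round(3))
print("IPW-weighted (selected sample):", ipw.params.round(3))
print("full population (target):      ", target.params.round(3))
```

In practice the selection probabilities must themselves be modeled or estimated from external information, and proper variance estimation requires survey-style methods; the point here is only to show how a proposed selection model translates into an adjustment.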
References
[1] X.-L. Meng, “Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election,” The Annals of Applied Statistics, vol. 12, pp. 685–726, June 2018.
[2] R. Kundu, X. Shi, J. Morrison, J. Barrett, and B. Mukherjee, “A framework for understanding selection bias in real-world healthcare data,” Journal of the Royal Statistical Society Series A: Statistics in Society, May 2024.