How epidemiologists try to fool us with flawed statistical practices
by S. Stanley Young and Warren Kindzierski
Climate Etc. recently carried several insightful posts about How we fool ourselves. One of the posts – Part II: Scientific consensus building – was right on the money given our experience! The post pointed out that… ‘researcher degrees of freedom’… allow researchers to extract statistical significance or other meaningful information out of almost any data set. Along similar lines, we offer some thoughts on how others try to fool us using statistics (aka how to lie with statistics), the others here being epidemiologists and government bureaucrats.
We have just completed a study for the National Association of Scholars that took a deep dive into flawed statistical practices used in the field of environmental epidemiology. The study focused on air quality−health effect claims; more specifically PM2.5−health effect claims. However, the flawed practices apply to all aspects of risk factor−chronic disease research. The study also looked at how government bureaucrats use these claims to skew policy in favor of PM2.5 regulation and their own positions.
All that we discuss below is drawn from our study. Americans need to be aware that current statistical practices being used at the EPA for setting policy and regulations are flawed and obviously expensive. Viewers can download and read our study to decide the extent of the problem for themselves.
Unbeknownst to the public and far too many academic scientists, modern science suffers from an irreproducibility crisis in a wide range of disciplines, from medicine to social psychology. Far too frequently scientists are unable to reproduce claims made in research.
Given the irreproducible science crisis, we completed a study for the National Association of Scholars (NAS) in New York as part of the Shifting Sands project. The project—Shifting Sands: Unsound Science and Unsafe Regulation—examines how irreproducible science negatively affects select areas of government policy and regulation in different federal agencies.
Our study investigated portions of research in the field of epidemiology used for US Environmental Protection Agency (EPA) regulation of PM2.5. This research claims that particulate matter smaller than 2.5 microns (PM2.5) in outdoor air is harmful to humans in many ways. But is the research on PM2.5 and the claims made in the research misleading?
2. Bias in academic research
Academic researcher incentives reward exciting research with new positive (significant association) claims—but not reproducible research. This encourages epidemiologists – who are mainly academics – to wittingly or negligently use various flawed statistical practices to produce positive, but (we show) likely false, claims.
There are numerous key biases that epidemiologists continue to unintentionally (or intentionally) ignore in studies of air quality and health effects. This is done to make positive, but likely false, research claims. Some examples are:
- multiple testing and multiple modeling
- omitting predictors and confounders
- not controlling for residual confounding
- neglecting interactions among variables
- not properly testing model assumptions
- neglecting exposure uncertainties
- making unjustified interventional causal interpretation of regression coefficients
Our study focused on the multiple testing and multiple modeling bias to assess whether a body of research has been affected by flawed statistical practices. We subjected research claiming that PM2.5 is harmful to a series of simple but severe statistical tests.
3. How epidemiologists skew research
Our study found strong circumstantial evidence that claims made about PM2.5 causing mortality, heart attacks and asthma are compromised by flawed statistical practices. These flawed practices make the research untrustworthy as it favors producing false claims that would not reproduce if done properly. This is discussed further below.
Estimating the number of statistical tests in a study – There is known flexibility available to epidemiology researchers to undertake a range of statistical tests and use different statistical models on observational data sets. The researchers then can select, use and report (cherry pick) a portion of the test and model results that favor a narrative.
One form of simple but severe testing we employed was counting. Specifically, we estimated the number of statistical hypothesis tests conducted in 70 different published epidemiology studies that make PM2.5−health effect claims. These results are presented in our study. The counting procedures are straightforward, and readers can learn and use them to count statistical tests in published observational studies. In our case, the median number of statistical tests performed in these 70 studies was over 13,000.
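For readers who want a feel for how such counts grow, here is a minimal sketch. The counting rule in it is an assumption for illustration and is not spelled out in this post: it treats every outcome, predictor and lag combination as a question, and every subset of optional covariates as a separate model. The function name and the example numbers are hypothetical and do not describe any particular study.

```python
def estimated_test_count(n_outcomes: int, n_predictors: int,
                         n_lags: int, n_covariates: int) -> int:
    """Rough size of the analysis search space in one observational study.

    Assumed rule (illustrative only):
      questions = outcomes x predictors x lags
      models    = 2 ** covariates  (each covariate is either in or out)
      tests     = questions x models
    """
    for value in (n_outcomes, n_predictors, n_lags):
        if value < 1:
            raise ValueError("outcomes, predictors and lags must be at least 1")
    if n_covariates < 0:
        raise ValueError("n_covariates cannot be negative")
    questions = n_outcomes * n_predictors * n_lags
    models = 2 ** n_covariates
    return questions * models


if __name__ == "__main__":
    # Hypothetical study: 2 outcomes, 6 air quality predictors,
    # 7 lag choices and 7 optional covariates.
    print(estimated_test_count(2, 6, 7, 7))
```

Even modest-looking design choices multiply quickly under this kind of rule, which is why counts in the thousands are plausible for a single paper.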
Epidemiologists typically use a Relative Risk (RR) or Odds Ratio (OR) lower confidence limit > 1 (or a p-value < 0.05) as decision criteria to justify a significant PM2.5−health effect claim in a statistical test. However, when no real effect exists, about 5% of the statistical tests performed on the same data set are still expected to yield a significant, but false, result. A study with 13,000 statistical tests could therefore be expected to produce about 0.05 x 13,000 = 650 significant, but false results!
Given advanced statistical software, epidemiologists today can easily perform this many or more statistical tests on a set of data in an observational study. They can then cherry pick 10 or 20 of their most interesting findings and write up a nice, tight research paper around these findings, which are most likely to be false, irreproducible findings. We have yet to see an air quality−health effects study that reports as many as 650 results. How exactly is one supposed to tell the difference between a false positive and a possible true positive result when so many tests are performed and so few results are presented?
Diagnosing evidence of publication bias, p-hacking and/or HARKing – Publication bias is the failure to publish the results of a study unless they are positive results that show significant associations. P-hacking is reanalyzing data in many different ways to yield a target result. HARKing (Hypothesizing After Results are Known) is using the data to generate a hypothesis and then pretending the hypothesis was stated first.
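The arithmetic can be checked with a short sketch. It assumes, purely for illustration, that no real effect exists in any of the tests and that each test behaves like an independent draw; neither assumption comes from a real study, and the function names are ours.

```python
import random


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of significant-but-false results when every null is true."""
    return n_tests * alpha


def simulate_null_study(n_tests: int, alpha: float = 0.05, seed: int = 0) -> int:
    """Count 'significant' results in a simulated study with no real effects.

    Under a true null hypothesis a p-value is uniformly distributed on [0, 1],
    so each simulated test is just a uniform random draw (an assumption that
    also treats the tests as independent).
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


if __name__ == "__main__":
    print(expected_false_positives(13_000))   # 0.05 x 13,000
    print(simulate_null_study(13_000))        # one simulated study, close to the expectation
```

The simulated count varies from run to run with the seed, but it stays in the hundreds, far more than the handful of results a typical paper reports.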
It is traditional in epidemiology to use confidence intervals instead of p-values from a hypothesis test to demonstrate statistical significance. As both confidence intervals and p-values are constructed from the same data, they are interchangeable, and one can be calculated from the other.
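One standard way to make this conversion for a ratio measure such as RR or OR is sketched below. It assumes a 95% confidence interval that is symmetric on the log scale and a normal approximation; the post does not state the exact formula used in our study, so treat the function name and defaults as illustrative.

```python
import math


def p_value_from_ci(estimate: float, lower: float, upper: float,
                    z_crit: float = 1.959964) -> float:
    """Two-sided p-value for a ratio (RR or OR) from its confidence interval.

    Assumptions: the interval is a 95% CI (z_crit = 1.96) built on the log
    scale, and the log of the estimate is approximately normal.
    """
    if estimate <= 0 or not (0 < lower < upper):
        raise ValueError("need estimate > 0 and 0 < lower < upper")
    # Standard error of log(estimate), recovered from the width of the CI.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    # Test statistic against the null value of 1 (log(1) = 0).
    z = math.log(estimate) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical result: RR = 1.20 with a 95% CI of 1.00 to 1.44.
    print(round(p_value_from_ci(1.20, 1.00, 1.44), 3))
```

A lower confidence limit sitting exactly at 1 corresponds to a p-value of about 0.05, which is why the two decision criteria are treated as equivalent.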
We first calculated p-values from confidence intervals for data from meta-analysis studies that make PM2.5−health effect claims. A meta-analysis is a systematic procedure for statistically combining data from multiple studies that address a common research question—for example, whether PM2.5 is a likely cause of a specific health effect (e.g., mortality). We looked at meta-analysis studies claiming that PM2.5 causes: i) mortality, ii) heart attacks and iii) asthma.
We then used a simple but novel statistical method—p-value plotting—as a severe test to diagnose evidence of publication bias, p-hacking and/or HARKing in this data. More specifically, after calculating p-values from confidence intervals we then plotted the distribution of rank ordered p-values (a p-value plot).
Conceptually, a p-value plot allows us to examine a specific premise that factor A causes outcome B using data combined from multiple observational studies in meta-analysis. What should a p-value plot of the data look like?
- a plot that forms an approximate 45-degree line provides evidence of randomness—supporting the null hypothesis of no significant association between factor A & outcome B (Figure 1)
- a plot that forms approximately a line with slope < 1, where most of the p-values are small (less than 0.05), provides evidence for a real effect—supporting a statistically significant association between factor A & outcome B (Figure 2)
- a plot that exhibits bilinearity—that divides into two lines—provides evidence of publication bias, p-hacking and/or HARKing (Figure 3)
Figure 1. P-value plot of a meta-analysis of observational data sets analyzing associations between elderly long-term exercise training (factor A) and mortality & morbidity (injury) (outcome B); data points drawn from 40 observational studies.
Figure 2. P-value plot of a meta-analysis of observational data sets analyzing associations between smoking (factor A) and squamous cell carcinoma of the lungs (outcome B); data points drawn from 102 observational studies.
Figure 3. P-value plot of a meta-analysis of observational data sets analyzing associations between PM2.5 (factor A) and all−cause mortality (outcome B); data points drawn from 29 observational studies.
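The construction of a p-value plot is simple enough to sketch in a few lines. The code below is an illustration only, with hypothetical function names and made-up p-values; it rank-orders the p-values and compares them with the straight line expected when nothing is going on. It does not attempt to detect bilinearity automatically, which we judge by looking at the plot.

```python
def p_value_plot_points(p_values):
    """Return (rank, p-value) pairs with the p-values sorted smallest to largest."""
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a valid p-value: {p}")
    return list(enumerate(sorted(p_values), start=1))


def uniform_reference(n):
    """Expected ordered p-values under the null: rank / (n + 1).

    Plotted against rank this is a straight line, the '45-degree line'
    once the two axes are scaled to the same length.
    """
    return [rank / (n + 1) for rank in range(1, n + 1)]


def max_gap_from_uniform(p_values):
    """Largest vertical distance between the plotted points and the null line."""
    points = p_value_plot_points(p_values)
    reference = uniform_reference(len(points))
    return max(abs(p - ref) for (_, p), ref in zip(points, reference))


if __name__ == "__main__":
    # Made-up p-values: a cluster of tiny values plus a scattered remainder.
    mixed = [0.001] * 15 + [0.6, 0.7, 0.8, 0.9, 0.95]
    for rank, p in p_value_plot_points(mixed):
        print(rank, p)
    print(max_gap_from_uniform(mixed))
```

The (rank, p-value) pairs can be handed to any plotting tool. A small gap from the reference line is what randomness looks like; a large gap calls for a look at the shape of the plot.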
We show over a dozen p-value plots in our study for meta-analysis data of associations between PM2.5 (and other air quality components) and mortality, heart attacks and asthma. All these plots exhibit bilinearity!
This provides compelling circumstantial evidence that the literature on PM2.5 (and other air quality components)—specifically for mortality, heart attack and asthma claims—has been affected by statistical practices that have rendered the underlying research untrustworthy.
Our findings are consistent with the general claim that false-positive results from publication bias, p-hacking and/or HARKing are common features of the medical science literature today, including the broad range of risk factor−chronic disease research.
4. How government bureaucrats skew policy
The process is further derailed with government involvement. The EPA has relied on statistical analyses to show significant PM2.5−health effect associations. EPA bureaucrats who fund this type of research depend on regulations to support their existence. The EPA has slowly imposed increasingly restrictive regulation over the past 40 years.
However, the EPA appears to have acted selectively in its approach to the health effects of PM2.5. This has been done by paying more attention to research that supports regulation (i.e., shows significant PM2.5−health effect associations) and ignoring or downplaying research that shows no significant PM2.5−health effect associations. This latter research exists; it is simply ignored or downplayed by the bureaucrats! Nor are the researchers who find negative results funded.
It is apparent to us that bureaucrats lack an understanding of, or willfully ignore, flawed statistical practices and other biases identified above in PM2.5−health effects research. They, along with environmental activists, continuously push for tighter air quality regulation based on flawed practices and false findings.
5. Can this mess be fixed?
Epidemiologists and government bureaucrats collectively skew results of medical science towards justifying regulation of PM2.5, while almost always keeping their data sets private. Far too many of these types, and a distressingly large share of the public, believe that academic (university) science is superior to industry science. However, as epidemiology evidence is largely based on university research, we should treat it with the same skepticism as we would industry research.
Mainstream media appear clueless and uninterested in glaring biases in epidemiology research that cause false findings—flawed statistical practices, analysis manipulation, cherry picking results, selective reporting, broken peer review.
Epidemiologists, and government bureaucrats who depend on their work to justify PM2.5 regulation, proceed with far too much self-confidence. They show too little awareness of how much statistics must remain an exercise in measuring uncertainty rather than establishing certainty. This mess plagues government policy by providing a false level of certainty to a body of research that justifies PM2.5 regulation.
In our study we make several recommendations to the Biden administration for fixing this mess. However, we do not hold our breath that they will be considered. Some of these include:
- the administration needs to support statistically sound and reproducible science
- unsound statistical practices silently supported by the EPA need to stop
- the building and analysis of data sets should be separately funded
- these data sets should be made available for public scrutiny
Most importantly, Americans need to be aware that current statistical practices being used at the EPA for setting policy and regulations are flawed and obviously expensive.
S. Stanley Young (email@example.com) is the CEO of CGStat in Raleigh, North Carolina and is the Director of the National Association of Scholars’ Shifting Sands Project. Warren Kindzierski (firstname.lastname@example.org) is an Adjunct Professor in the School of Public Health at the University of Alberta in Edmonton, Alberta.
 Young SS, Kindzierski W, Randall D. 2021. Shifting Sands: Unsound Science and Unsafe Regulation. Keeping Count of Government Science: P-Value Plotting, P-Hacking, and PM2.5 Regulation. National Association of Scholars, New York, NY. https://www.nas.org/reports/shifting-sands