As you might imagine, not everyone was happy with my post the other day discussing another paper by Peter Doshi attempting to “prove” that the randomized controlled trials used to justify the emergency use authorization (EUA) for mRNA-based vaccines actually showed more serious adverse events in the vaccine groups than in the placebo groups. Of course, as I discussed, it took p-hacking, cherry picking of data, and comparing apples and oranges to reach that conclusion, making the preprint by Peter Doshi and at least two other “COVID-19 contrarians” (Joseph Fraiman and Patrick Whelan) highly suspect at best, disinformation at worst.
One of the people not happy with my deconstruction of this study and, even less so, the deconstruction by Susan Oliver in a 12-minute YouTube video, is Norman Fenton, Professor of Risk Information Management at Queen Mary University of London. At the time I encountered his initial objections on Twitter, I didn’t recall ever having heard of Prof. Fenton before, but he is clearly unhappy with the criticism leveled at Doshi. Indeed, yesterday he posted a rant on his blog entitled Response to Susan Oliver video “Antivaxxers fooled by p-hacking and apples to oranges comparison.” (Let’s just say that he’s not particularly creative coming up with titles for his posts.) His key point?
Interestingly, despite the video title, Susan spends less than 30 seconds describing what p-hacking is and instead refers to a paper about it (we agree entirely with the general concerns raised about p-hacking and show how it is avoided using Bayesian hypothesis testing). But the key flaw in Susan’s criticism is that the “Doshi paper” is not an example of p-hacking at all. They do not use p-values and, also contrary to the continued assertions of Susan, they make no claims at all of statistical significance. Rather, the paper provides risk differences and risk ratios with 95% confidence intervals (CIs) for the various different comparisons of vaccine v placebo.
If the authors had been “p-hacking” they would have chosen a p-value like 0.05 and would have added, for each comparison of vaccine v placebo, a ‘significance statistic’ and arrived at at least one example where the statistic was less than 0.05. Then they would claim, for example, that the increased SAE rate was ‘significant’. They do nothing like that at all.
I think I just lost neurons reading that passage. No, seriously, Prof. Fenton’s argument is that it can’t be p-hacking if you don’t use p-values, an argument that is just plain nonsense. He also claims that Joseph Fraiman, Peter Doshi, and the rest of the authors—whom I mention in passing mainly because this paper is clearly primarily Fraiman and Doshi’s, given that they serve as first author and corresponding author, respectively—make no claims of statistical significance at all, even though they used confidence intervals. Let’s revisit the chart (which, conveniently enough, Fenton also reproduces in his post):
So why did Fraiman and Doshi calculate risk differences per 10,000 participants with confidence intervals, as well as risk ratios (RRs), again with confidence intervals? I will thank Fenton for one thing. Dealing with his post, I rather quickly came to realize that the description in this preprint of how the statistics were done was so minimal as to be almost nonexistent. There isn’t even a statistics section of the sort found in most papers. On the one hand, one might argue that this strengthens Fenton’s retort that they never really claimed statistical significance. While that might be true, presenting confidence intervals in this way implies to any scientist that the authors are looking at which comparisons are and are not statistically significant. They are implicitly assessing statistical significance, even though they’re very careful in their manuscript to avoid the words “statistical significance” (or, to be honest, “significance,” “statistics,” or “statistical”) wherever possible. I rather suspect that this was done to preempt exactly the sort of argument (such as it is) that they were engaged in p-hacking, a charge any decent reviewer would likely bring up.
But what about Fenton’s argument that it can’t be p-hacking if you don’t use p-values? This, too, is a nonsensical argument. “P-hacking” is the commonly used term, but because it focuses so narrowly on p-values, scientists now tend to prefer broader terms, with p-hacking being one subset of a general family of techniques for manufacturing “statistical significance”: multiple comparisons, mixing and combining categories, and multiple hypothesis testing designed to produce a “positive” result. More general terms include:
Inflation bias, also known as “p-hacking” or “selective reporting,” is the misreporting of true effect sizes in published studies (Box 1). It occurs when researchers try out several statistical analyses and/or data eligibility specifications and then selectively report those that produce significant results [12–15]. Common practices that lead to p-hacking include: conducting analyses midway through experiments to decide whether to continue collecting data [15,16]; recording many response variables and deciding which to report postanalysis [16,17]; deciding whether to include or drop outliers postanalysis; excluding, combining, or splitting treatment groups postanalysis; including or excluding covariates postanalysis; and stopping data exploration if an analysis yields a significant p-value [18,19].
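To see why none of this depends on anyone explicitly quoting a p-value, consider a toy simulation (entirely hypothetical data, my own illustration, not anything from the trials): two identical groups with identical true event rates, compared on 40 different outcomes using only 95% confidence intervals for the risk ratio. On average, a couple of those intervals will exclude 1 by chance alone, ready to be reported as an elevated risk:

```python
import math
import random

random.seed(1)

def rr_ci(a, n1, b, n2):
    """95% CI for the risk ratio (a/n1)/(b/n2) via the standard log-RR method."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Two arms with the SAME true 0.5% event rate, so any CI excluding 1
# is a false positive produced purely by running many comparisons.
n, rate = 20000, 0.005
hits = 0
for outcome in range(40):  # 40 separate outcome comparisons
    a = sum(random.random() < rate for _ in range(n))
    b = sum(random.random() < rate for _ in range(n))
    rr, lo, hi = rr_ci(a, n, b, n)
    if lo > 1 or hi < 1:   # interval excludes 1: "significant" by chance
        hits += 1
print(f"{hits} of 40 null comparisons have a 95% CI excluding 1")
```

Run it with different seeds and you will typically see one to four “elevated risks” out of 40 comparisons of pure noise, which is the multiple-comparisons problem in a nutshell, p-values or no p-values.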
I also like the term “data dredging,” which I used in my previous post on this study. Citing the paper above, I’ll further point out that, unfortunately, p-hacking, data dredging, or whatever you want to call it is widespread. It’s a major problem under normal circumstances in the scientific literature, but a less appreciated aspect of it is that it can be weaponized in the service of portraying vaccines, in this case the Pfizer and Moderna mRNA-based COVID-19 vaccines, as more dangerous than they are and the RCTs used to garner their EUAs in December 2020 as flawed and not showing the “true extent” of serious AEs attributable to them.
Also, as pointed out by one of my readers:
Interesting comment: they say in the paper, “In contrast to the FDA analysis, we found an increased risk of all cause SAEs in the Pfizer trial.” What is that based on? If they are only looking at the point estimates then they can say that about their samples, but there is no way to determine whether the size difference is due to chance or something else. If they rest the statement on their confidence intervals then they are essentially using p-values, despite the denials by you and Norman Fenton (who is on record as “questioning” covid vaccine effectiveness and safety).
Elsewhere in Fraiman and Doshi’s study:
In the Pfizer trial, 52 serious AESI (27.7 per 10,000) were reported in the vaccine group and 33 (17.6 per 10,000) in the placebo group. This difference corresponds to a 57% increased risk of serious AESI (RR 1.57 95% CI 0.98 to 2.54) and an absolute risk increase of 10.1 serious AESI per 10,000 vaccinated participants (95% CI -0.4 to 20.6). In the Moderna trial, 87 serious AESI (57.3 per 10,000) were reported in the vaccine group and 64 (42.2 per 10,000) in the placebo group. This difference corresponds to a 36% increased risk of serious AESI (RR 1.36 95% CI 0.93 to 1.99) and an absolute risk increase of 15.1 serious AESI per 10,000 vaccinated participants (95% CI -3.6 to 33.8). Combining the trials, there was a 43% increased risk of serious AESI (RR 1.43; 95% CI 1.07 to 1.92) and an absolute risk increase of 12.5 serious AESI per 10,000 vaccinated participants (95% CI 2.1 to 22.9). (Table 2) Of the 236 serious AESIs occurring across the Pfizer and Moderna trials, 97% (230/236) were adverse event types included as AESIs because they are seen with COVID-19. In both Pfizer and Moderna trials, the largest increase in absolute risk occurred amongst the Brighton category of coagulation disorders. Cardiac disorders have been of central concern for mRNA vaccines; more cardiovascular AESIs occurred in the vaccine group in the Pfizer trial, but cardiovascular AESI events were balanced in the Moderna trial. (Tables 3 and 4)
It’s utterly ridiculous to claim that Fraiman and Doshi weren’t looking for statistical significance here, given that they use confidence intervals and explicitly claim elevated absolute risk and relative risk for their chosen adverse events (AEs). They do appear to have been engaging in data dredging, inflation bias, p-hacking, or whatever you want to call it, under the guise of an “exploratory” reanalysis of the RCT data from Pfizer and Moderna.
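To make concrete the point that a 95% confidence interval is an implicit significance test, here is a quick back-of-the-envelope sketch of my own (the denominators are back-calculated from the quoted per-10,000 rates, so approximate) computing the Pfizer serious-AESI risk ratio with the standard log-RR method. A 95% CI that excludes 1 corresponds exactly to a two-sided p-value below 0.05 computed from the same standard error; the preprint’s wider interval (0.98 to 2.54) presumably reflects the authors’ standard-error adjustment for clustering:

```python
import math

def rr_with_ci(a, n1, b, n2):
    """Risk ratio with a 95% CI via the standard log-RR Wald method,
    plus the two-sided p-value implied by the same standard error."""
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    z = abs(math.log(rr)) / se
    p = math.erfc(z / math.sqrt(2))  # two-sided normal p-value
    return rr, lo, hi, p

# Pfizer serious AESI counts from the preprint; denominators are my
# back-calculation from the quoted per-10,000 rates (approximate).
rr, lo, hi, p = rr_with_ci(52, 18801, 33, 18785)
print(f"RR {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}, implied p = {p:.3f}")
```

The point estimate reproduces the paper’s RR of 1.57, and the unadjusted interval and p-value move in lockstep: whenever the interval excludes 1, the implied p-value is below 0.05, which is why presenting such intervals while disclaiming any interest in statistical significance is a distinction without a difference.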
Next up, Fenton argues:
Susan’s final criticisms of the Doshi paper concerns the selection of SAEs and the possibility of ‘double counting’. Regarding selection, the events included and not included are governed by the WHO endorsed Brighton scheme, and are not decided by the authors, so this is a critical error Susan makes. The Brighton list was created a priori, based on data before the any results were released from the trials. Any double counting, such as with the diarrhoea and abdominal pain example she uses, are a direct effect of the fact that the data are not public. There’s merit to both measures – counting number of participants (with any SAE) and number of events. If one person has two SAEs that is worse than one person having one SAE. “Double counting” sounds bad, but this is not double counting. Doshi et al are measuring how many SAEs occur in the vaccine group versus the placebo group. If Diarrhoea and abdominal pain were each recorded as a SAE, then that is two SAEs. We don’t know which ones were in the same person as Pfizer/Moderna have not released IPD. In any case, the authors recognise the issue that, because some SAEs occur in the same person, the SAEs are not all independent events; they note it in the paper, and introduce an adjustment to standard error to account for it. It is unclear whether the adjustment is sufficient, but it actually weakens their case (it increases the size of the confidence intervals) – so they can hardly be accused of bias.
Note how Fenton doesn’t even deny the charge of double counting. (How could he?) Instead, he tries to claim that it is justified and a “good thing” because “if one person has two SAEs that is worse than one person having one SAE.” Clearly, Fenton is not a clinician. His argument is refuted by the observation that the SAEs as defined in the trial often cluster in a single patient, an observation the authors themselves used to justify applying a “correction” that enlarges their standard error estimates. It’s hard not to respond by pointing out that there is no reason to use standard errors at all, much less to introduce an arbitrary “adjustment” to account for multiple counting of SAEs in individual patients, unless you are interested in using some sort of statistical test to show a “statistically significant” difference in your comparisons, or at least to imply that there is a real difference.
As for his defense of the Brighton criteria and how they were mapped, this lets me come back to a criticism of the paper that I had missed, namely how various adverse events (AEs) were “mapped” to the Brighton serious adverse events of special interest (serious AESIs or SAESIs). I did touch on it a bit, but I hadn’t realized at the time that the dataset for the analysis was available via a link near the end of the manuscript to a file-hosting platform with a Microsoft Excel file. One thing stood out: all chest pain, cardiac and non-cardiac, was mapped to myocarditis/pericarditis, while all upper abdominal pain was mapped to colitis/enteritis.
Next up, Fenton digs himself in deeper:
Further regarding double counting, SAEs are counted individually to avoid them being hidden. So, if you get renal failure and then your penis drops off that should be two SAEs, not one. One person having three SAEs (renal failure, penis drops off, stroke) could be considered as serious as three people having a stroke; so, although some clinicians disagree, it is entirely reasonable to count SAEs separately. But Susan does not appear to understand what a SAE is. She assumes something like diarrhoea cannot be a SAE because lots of diarrhoea happens to be mild. But most covid is not serious, either. So diarrhoea can be a SAE if it’s serious enough and meets the regulatory criteria. And it’s a leading cause of death in some places
First of all, it is not a general principle that “one person having three SAEs” should be considered as serious as “three people having a stroke.” It depends on the specific SAEs and how serious they are. As I discussed last time, AEs are graded from 1 to 5, with serious adverse events (SAEs) being grades 3 and above. Let’s review again briefly. According to the standard terminology used to rate AEs in clinical trials, grade 3 events and above (on a five-point scale) are rated severe. If you look at the list of specific AEs, you’ll see that some grade 3 AEs require hospitalization; some don’t.
A grade 3 is defined as an AE that:
- Is severe or medically significant but not immediately life-threatening; OR
- Requires hospitalization or prolongation of hospitalization indicated; OR
- Limits self care/activities of daily living (ADL)
For completeness, I’ll mention now as I did the other day that grade 4 AEs are by definition life-threatening events that require urgent medical or surgical intervention and that grade 5 events are by definition AEs that result in death.
So, no, as a general principle, it is not true that someone suffering three SAEs is necessarily as big a deal as three people each suffering a single SAE. It might be less severe. It might be more severe. It might be of similar severity. Which of these three possibilities is the case depends entirely on the specific SAEs in the clinical trial subjects being considered. I would argue that it is not Susan Oliver who doesn’t understand AEs and SAEs, but rather Norman Fenton, who concludes with what he claims to be a “Bayesian analysis” of AEs in the Moderna and Pfizer clinical trials that is not really Bayesian at all, as you will see:
The benefits of applying a Bayesian analysis to the data is that we are able to ‘learn’ the full probability distributions of the adverse reaction rates for vaccine and placebo. This enables us not just to compute the risk ratios and CIs (we get slightly different results to Doshi) but, crucially, also to make explicit probabilistic statements about whether the vaccine SAE rate is higher than that of the placebo (this approach is the Bayesian alternative to the flawed p-value approach). The results (which we provide below) do indeed provide explicit support for the hypothesis that the SAE rate for vaccine is higher than that of the placebo.
You’ll recall that Bayesian analysis (which I’m a fan of, as are my coauthors at my not-so-secret other blog) involves considering prior probability in determining the posterior probability; i.e., the estimated probability that the null hypothesis is correctly rejected. More simply, this is the probability that a difference observed between the groups being compared is due to something other than random chance alone, with the threshold for “significance” of this difference (or for rejecting the null hypothesis) generally chosen as 95%, which corresponds to a p-value of 0.05, or a 5% chance that the results observed could be due to random chance alone. I’ve discussed this in detail a number of times, most recently in considering the case of ivermectin for COVID-19. I’ll refer you to those discussions (as well as to discussions by Dr. Kimball Atwood of frequentist versus Bayesian statistics) if you want more information, noting that there is a rich irony in my doing so, because the chart that I like to use in my talks and blog posts about the differences between science-based medicine and evidence-based medicine comes from a work by Sander Greenland, one of the co-authors of Fraiman and Doshi’s paper. The CliffsNotes version of the discussion is that the posterior probability (the probability that the null hypothesis has been correctly rejected) can be strongly affected by the prior probability when the prior probability is very low.
Oh, heck, I’ll just post the chart again:
Note how, for a p-value of 0.05 (which normally indicates a 95% probability that the difference observed is not due to random chance), if the prior probability is low, the posterior probability becomes much lower, even for “highly significant” p-values.
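The arithmetic behind that chart is simple Bayes’ theorem. Here’s a minimal sketch of my own (the values for statistical power and the significance threshold are illustrative, not taken from any trial) showing how the posterior probability that a “significant” finding reflects a real effect collapses as the prior shrinks:

```python
def posterior_real(prior, power=0.8, alpha=0.05):
    """P(effect is real | 'significant' result) via Bayes' theorem:
    true positives / (true positives + false positives)."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:>4}: posterior {posterior_real(prior):.2f}")
```

With 80% power and a 0.05 threshold, priors of 0.5, 0.1, and 0.01 yield posterior probabilities of roughly 0.94, 0.64, and 0.14, respectively. In other words, a “significant” result for a hypothesis that was very unlikely to begin with is still probably a false positive.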
So a true Bayesian analysis would involve estimating the prior probability that a given AE (or, in the case of pooled AEs, such as the serious AESIs examined by Fraiman and Doshi, the total) is due to the vaccine and then incorporating that estimate into the calculation of the posterior probability after comparing control and vaccine groups. Of course, it really isn’t possible, a priori, to come up with an estimate of the probability that the total of serious AESIs was due to the vaccines, although it is somewhat possible to make such estimates for individual AEs. Also, one criticism of Bayesian approaches is that there is always some subjectivity involved in estimating prior probabilities, which is why I tend only to invoke Bayesian analysis when the prior probability is inarguably incredibly low (or even zero), as is the case for alternative medicine interventions like homeopathy or reiki, although it was arguably very low for ivermectin for COVID-19 as well, based just on pharmacological considerations.
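To be clear about what a genuinely Bayesian comparison of two event rates looks like, here is a minimal Beta-Binomial sketch of my own (not the method of Fenton or of the preprint’s authors; the function name, the flat priors, and the approximate denominators back-calculated from the per-10,000 rates are all my choices):

```python
import random

random.seed(0)

def p_vaccine_rate_higher(a, n1, b, n2, prior_a=1.0, prior_b=1.0, draws=50_000):
    """Monte Carlo estimate of P(vaccine event rate > placebo event rate)
    under a Beta-Binomial model: each arm's rate gets an independent
    Beta(prior_a, prior_b) prior, updated by the observed counts."""
    wins = 0
    for _ in range(draws):
        rate_vax = random.betavariate(prior_a + a, prior_b + n1 - a)
        rate_pla = random.betavariate(prior_a + b, prior_b + n2 - b)
        wins += rate_vax > rate_pla
    return wins / draws

# Pfizer serious-AESI counts from the preprint; flat Beta(1,1) priors.
flat = p_vaccine_rate_higher(52, 18801, 33, 18785)
print(f"P(vaccine rate > placebo rate), flat prior: {flat:.2f}")
```

Note that the flat prior here is itself a choice, and exactly the kind of subjective choice I just described: an informative prior concentrated near equal rates in the two arms would pull the resulting posterior probability down, which is why a Bayesian analysis that never states and defends its prior hasn’t really done the Bayesian part.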
So what did Fenton do? Looking at his chart, it appears to me that he just compared AEs in the control and vaccine groups, calculated a 95% confidence interval in the standard manner, and then estimated (again, in a standard manner) the probability that the null hypothesis had been rejected; i.e., that the differences observed were not due to random chance alone. This is, in fact, standard frequentist statistics, whether you use a p-value or 95% confidence intervals calculated with some other statistical test, not Bayesian statistics, and it ignores that “statistical significance” is defined in frequentist statistics as a p-value less than or equal to 0.05. (I could go into how many statisticians now advocate even stricter p-value thresholds, because bias and poor experimental design often mean that a p-value under 0.05 maps not to a 95% probability that the null hypothesis has been correctly rejected but to a much lower one, but I’ll leave it at that observation.) In any event, Fenton’s “Bayesian” analysis is not Bayesian and just compounds the errors of Fraiman and Doshi’s paper by using the paper’s confidence intervals to calculate the probability that the null hypothesis is rejected. In other words, he does exactly what he claims Fraiman and Doshi did not do, even as he denies that they engaged in p-hacking.
Finally, Prof. Fenton seems pretty cheesed at some of the criticism of Peter Doshi, citing a Tweet by yours truly pointing out that Doshi had signed a document supporting HIV/AIDS denial, a Tweet in which Steve Salzberg offered to write a letter to The BMJ over Doshi’s continued employment there as a senior editor, and a Tweet by bioethicist Arthur Caplan concurring. Given Doshi’s very long history of playing footsie with antivaxxers and even serving as an expert witness for an antivaccine organization’s lawsuit challenging the University of California system’s influenza vaccine mandate, to the point where I vacillate between considering him antivaccine, “antivax-adjacent,” or just a useful idiot for the antivaccine movement, it is not at all unreasonable to ask why The BMJ continues to employ him.
Even leaving Doshi’s history aside, Prof. Fenton’s defense of Fraiman and Doshi’s paper, which was indeed an exercise in p-hacking and misleading comparisons, is a clear misfire, and I still shake my head in disappointment that Sander Greenland is a coauthor. First John Ioannidis, and now (maybe) Sander Greenland. Truly, if there’s anything COVID-19 has taught me, it’s that I should have no scientific heroes.