Is there a reproducibility "crisis" in biomedical research?

Most scientists I know get a chuckle out of the Journal of Irreproducible Results (JIR), a humor journal that often parodies scientific papers. Back in the day, we used to chuckle at articles like “Any Eye for an Eye for an Arm and a Leg: Applied Dysfunctional Measurement” and “A Double Blind Efficacy Trial of Placebos, Extra Strength Placebos and Generic Placebos.” (What saddens me is that this is basically what research into so-called “complementary and alternative medicine,” now more frequently referred to as “integrative medicine,” boils down to.) Unfortunately, these days, reporting on science is giving the impression that the JIR is a little too close to the truth, at least when it comes to reproducibility, so much so that the issue even has its own name and Wikipedia entry: Replication (or reproducibility) crisis. It’s a topic I had been meaning to write about again for a while. Fortunately, a recent survey published in Nature under the clickbait title “1,500 scientists lift the lid on reproducibility” finally prodded me to look into this question again. Before I get to the survey itself, though, I can’t help but do my usual pontificating to provide a bit of background.

The spectrum of reproducibility and the inherent messiness of science

Having been a PhD-holding scientist now for well over 20 years, and a physician since 1988, I’ve done a lot of experiments, published a fair number of papers in the peer-reviewed biomedical literature (but significantly fewer than I wish I had, because otherwise I would have reached the rank of Professor years ago instead of this year), and grappled with the problem of reproducibility in my areas of research. One of the first things I learned in graduate school is that, as nice and neat as the science sounds when it’s taught in the classroom, it’s anything but nice and neat. Indeed, when it comes to some molecular biology techniques, we used to joke about sacrificing goats to appease the gods of molecular biology to get experiments to work. Then, of course, the more complicated the experiment, the more ways there are for it to go awry. There is a difference, though, between a technique, such as Western blot or PCR, not working and an experiment giving a result that can’t be reproduced. Both, however, happen, and it can be devilishly difficult to track down the cause. Indeed, I like to point out that one of the most difficult aspects of science to convey to the general public, particularly about science-based medicine, is just how messy it can be. One of the first lessons graduate students learn as they embark on their doctoral research is that early reports in the peer-reviewed literature are by their very nature tentative and have a high probability of ultimately being found to be incorrect or, more often, only partially correct.

Unfortunately, that is not science as it is imbibed by the public. Fed by too-trite tales of simple linear progressions from observation to theory to observation to better theory taught in school, as well as media portrayals of scientists as finding answers fast, most people seem to think that science is able to generate results virtually on demand. This sort of impression is fed even by shows that I used to watch a few years ago, and even liked, for their ability to excite people about science, for instance CSI: Crime Scene Investigation and its offspring. In their heyday, these shows portrayed beautiful people wearing beautiful pristine lab coats, backlit in beautiful labs, using perfectly styled multicolored Eppendorf tubes, doing various assays and getting answers in minutes that normally take hours, days, or sometimes weeks. Often these assays are all done over a backing soundtrack consisting of classic rock or newer (but still relatively safe) “alternative” rock. And that’s just for applied science, in which no new ground is broken and no new discoveries made. And don’t even get me started on the stereotypical medical examiners in so many crime show dramas who can generate DNA results or detailed chemical analyses seemingly almost instantly. Real scientists know that life (and science) are complicated, much more than they are on television and in other works of fiction.

Still, the messy nature of scientific research doesn’t automatically mean that reproducibility isn’t a problem. At the very least, we as scientists should do all that we can to minimize the difficulty of reproducing the experiments that we do. One major reason, aside from wanting to get our results right, is that the first step in building on any scientific finding is to reproduce the experiments that led to that finding before going on to do more experiments to expand on or more deeply investigate that finding. When experiments can’t be reproduced, other labs waste a lot of time, effort, and resources. On the other hand, some experimental results are wrong, and there will always be experimental results that turn out to be wrong. Attempts to reproduce those results are how scientists find out they’re wrong. The question is: How much irreproducible science is unavoidable? Is there really a “replication crisis” or “reproducibility crisis” or whatever you want to call it? And if there is such a crisis, what should we scientists do to address it?

As a scientist quoted in the Nature article about the survey notes:

Failing to reproduce results is a rite of passage, says Marcus Munafo, a biological psychologist at the University of Bristol, UK, who has a long-standing interest in scientific reproducibility. When he was a student, he says, “I tried to replicate what looked simple from the literature, and wasn’t able to. Then I had a crisis of confidence, and then I learned that my experience wasn’t uncommon.”

Indeed. I myself went through just such a rite of passage.

A personal anecdote on reproducing scientific results

One of the key results trumpeted by this survey (more details on the survey and its findings, strengths, and shortcomings in a moment) is that more than 70% of researchers have tried and failed to replicate another scientist’s experiments. If anything, I suspect that number is probably low. In any case, I can certainly say that I’m one of those researchers who’s tried and failed to replicate another researcher’s results. No, maybe that’s not quite correct, as you will see.

Back in 1996, as part of my surgical oncology fellowship at the University of Chicago, I did research in the laboratory of the chair of the Department of Radiation and Cellular Oncology. It was there that I first learned of the work of one of my all-time most admired scientists, surgeon-scientist Dr. Judah Folkman, who was basically the father of modern tumor angiogenesis research. Angiogenesis is the normal physiologic process of growing new blood vessels. This process is critical to many normal bodily functions, such as wound healing, the menstrual cycle, and others, but tumors hijack the process to supply themselves with blood and the nutrients it brings. Blocking angiogenesis, Folkman hypothesized, could therefore be an effective anticancer strategy. In any case, I admired Folkman so much that I posted a tribute to him after his sudden death from a heart attack at age 74. The way we first became acquainted was through our laboratory’s collaboration with him to study the effect of combining two of his angiogenesis inhibitors, angiostatin and endostatin, with radiation therapy in rodent models of cancer.

Dr. Folkman’s discovery of angiostatin and endostatin began with an ingenious strategy that grew out of the clinical observation that sometimes tumor metastases appear shortly after the operation to remove the primary tumor. Folkman found a mouse tumor model that mimicked this behavior and in the early 1990s did a series of pioneering experiments. With a strain of Lewis lung carcinoma cells of low metastatic potential (LLC-LM), when cells are injected into C57BL/6 mice and allowed to grow subcutaneously, if the tumor is left alone, mice develop only microscopic lung metastases. These metastases do not grow and kill the mouse. If, however, the primary cancer is removed, then many large lung metastases grow rapidly. The results of the experiment above strongly implied that the primary tumor was secreting something that suppresses the growth of microscopic metastases. After this, the Folkman group did what we like to call “brute force” science, collecting mouse urine and analyzing it for tumor suppressive activity until they were able to purify a single 38 kDa peptide, which they designated angiostatin. This involved analyzing literally gallons of mouse urine. (Who said science isn’t glamorous?) Once Folkman’s group had a bunch of angiostatin on hand, it performed the following experiment. Two groups of mice were injected with LLC-LM and the tumors allowed to grow to a certain size, after which they were surgically removed. One group was treated with angiostatin, and the control group with saline. The result was that the control group developed massive lung metastases and died, while the group treated with angiostatin had microscopic lung metastases that never grew beyond a ball of cells. Dr. Folkman then demonstrated that it was the inhibition of angiogenesis by the angiostatin that kept these tumors in check. Ultimately, he used a similar method to discover endostatin, and later he demonstrated that endostatin could induce tumor dormancy in mice.

You can see why Ralph Weichselbaum, my research mentor, wanted to test combining angiostatin with radiation therapy. Our results were ultimately published in Nature, the only Nature paper on my CV (and, alas, not even as first author). Science works, right? Yes, it does, but the path to these results was not straight. It was widely known at the time that other laboratories were having difficulty reproducing Folkman’s striking results. In our case, we were not observing nearly as potent an antitumor effect as Folkman had described with angiostatin. We wondered if it was something to do with the angiostatin itself, which was being made in bacteria from a plasmid by our collaborators at Northwestern University. Given that Folkman was one of the best scientists I ever met (and I did have the opportunity to meet him on three occasions), none of us doubted his results, and we assumed that the problem must be something we were doing.

So what happened next? Weichselbaum contacted Folkman, who provided reagents, protocols, and advice, as well as some angiostatin made in his laboratory. It turns out that the peptide we were making was easily denatured (unfolded), which was why it was not as potent as Folkman had reported. Now here’s why I say we couldn’t replicate his results. It’s because we couldn’t fully replicate his results. Our angiostatin inhibited the growth of a wide variety of tumors, but, even after applying the tweaks to our angiostatin production suggested by Folkman, in our hands angiostatin never inhibited tumor growth as potently as Folkman had reported. So in other words, there could easily have been something else going on that we never figured out. Be that as it may, Folkman had the best attitude I’ve ever seen in a scientist regarding reproducibility, as we learned later when we heard of how he had done the same thing for several other labs, even to the point of dispatching one of his postdocs to help other investigators to get angiostatin and endostatin to work. Still, few investigators could ever quite replicate Folkman’s initial results, although many demonstrated that angiostatin and endostatin were potent angiogenesis inhibitors.

Eventually, angiogenesis inhibitors were clinically validated, in particular Avastin, which is simply a humanized monoclonal antibody against vascular endothelial growth factor (VEGF). (I also played with anti-VEGF antibodies back in the day.) Unfortunately, no angiogenesis inhibitor in humans has ever been as potent as angiostatin and endostatin were in mice. Angiogenesis inhibitors were a useful addition to our anticancer armamentarium, but, contrary to how they were portrayed in 1998, they were no magic bullet.

The point of this anecdote is that reproducibility falls on a spectrum. Did I fail to reproduce Folkman’s results? Yes, I reproduced the general result that angiostatin inhibited tumor growth by blocking angiogenesis, but, no, I didn’t reproduce an effect anywhere near as powerful as the one Folkman had reported. Replication of a result can range from total failure to replicate (e.g., had I failed to show any antitumor effect of angiostatin at all) to partial failure to replicate (e.g., what actually happened) to success at replication (e.g., had I shown angiostatin to block tumor growth as powerfully as Folkman had).

Survey says: Reproducibility is a crisis!

Now let’s take a look at the Nature survey. It’s not a scientific survey or even a poll, really, which made me think of dismissing it almost out of hand. Basically, Nature e-mailed the survey to its readers and advertised it on affiliated websites and social media outlets as being “about reproducibility.” So, in other words, this is nothing even resembling a sampling designed to mirror the scientific community, as political polls are designed to mirror the population being polled. Nature itself even blithely notes that the survey “probably selected for respondents who are more receptive to and aware of concerns about reproducibility.” (“Probably”?) Even so, given that it’s basically an Internet poll, I don’t think the survey is without merit, as it does suggest that there is at least a widespread perception among scientists that there is a problem.

For example:

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Those are some of the telling figures that emerged from Nature’s survey of 1,576 researchers who took a brief online questionnaire on reproducibility in research.

The data reveal sometimes-contradictory attitudes towards reproducibility. Although 52% of those surveyed agree that there is a significant ‘crisis’ of reproducibility, less than 31% think that failure to reproduce published results means that the result is probably wrong, and most say that they still trust the published literature.

The average lay person might find it odd that half of the respondents had failed to reproduce their own research, but it makes more sense if you look at the actual survey questions, to which the answers were Yes, No, or “I can’t remember”:

  • Tried and failed to reproduce one of your own experiments
  • Tried and failed to reproduce someone else’s experiment
  • Published a successful attempt to reproduce someone else’s work
  • Published a failed attempt to reproduce someone else’s work
  • Tried and failed to publish a successful reproduction
  • Tried and failed to publish an unsuccessful reproduction

I’ve failed to reproduce my own experiments before. When that happens, I either figure out why I couldn’t reproduce the results of that experiment, or I don’t publish and move on to something else. (Of course, that ignores the question of how hard and long I keep plugging away to find out why I can’t reproduce before giving up and moving on, a question impacted by multiple factors.) That’s how science works, and spurious results are not uncommon. It’s why we often do experiments in triplicate and repeat the same experiment.
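A rough calculation shows why failing to reproduce even a real effect is unsurprising when experiments are small. The sketch below is an illustration, not anything taken from the Nature survey: it assumes a simple two-group comparison with a known, equal standard deviation (so a z-test applies), and the function name `power_two_group` and the one-standard-deviation effect size are hypothetical choices. The power it computes is the chance that a single repeat of the experiment reaches p < 0.05 when the effect is genuinely there.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group(effect_sd, n_per_group, alpha=0.05):
    """Power of a two-sided z-test comparing two group means.

    effect_sd   -- true difference between the groups, in units of the
                   standard deviation (assumed known and equal in both groups)
    n_per_group -- number of independent replicates in each group
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the effect is real.
    shift = effect_sd * sqrt(n_per_group / 2)
    # Probability the statistic lands beyond either critical value.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Assumed true effect of one standard deviation; vary the replicate count.
for n in (3, 6, 12, 24):
    chance = power_two_group(effect_sd=1.0, n_per_group=n)
    print(f"n={n:2d} per group: chance a repeat reaches p<0.05 = {chance:.2f}")
```

Under these assumptions, an experiment run in triplicate detects a true one-standard-deviation effect well under half the time, so a failed repeat does not by itself mean the original result was wrong.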

One interesting result of this survey is which specialties view their published results as being the most reliable. Of note, physicists and chemists have the most confidence in their published literature, with medicine being among the lowest. I can think of potential reasons for that. Physicists, for instance, can usually do many, many more replicates of their measurements than is possible in medicine. Indeed, in medicine, given the ethics of clinical trials, there are often just a handful of trials addressing major questions, sometimes only one. Physicists can also control their experimental conditions far more precisely than we ever can in medicine, even for cell culture, much less for animal experiments or clinical trials.

Is reproducibility in science a crisis?

In a word, no.

I don’t like the word crisis to describe what are in fact problems. The word crisis implies an acute time of difficulty or danger, a turning point, or a time when a difficult decision must be made. Reproducibility in science is a problem, a chronic problem, in fact. It is not a crisis, and there is nothing in this survey that suggests we’re coming to a turning point or something horrible is about to happen if we don’t act. In fact, I’m not even convinced that the problem is quite as severe as it is being portrayed. For example, consider one of the studies frequently cited as evidence that only 10% of biomedical science can be reproduced. As I discussed when I analyzed that claim when it was first made, that survey was written by Lee M. Ellis and a former Amgen executive (C. Glenn Begley) and looked at whether pharmaceutical company scientists could reproduce preclinical results from 53 “landmark” studies as they searched for a way to target new molecular mechanisms those results revealed. As I pointed out at the time, preclinical research is, by definition, preclinical. It’s the groundwork, the preliminary research, that needs to be done to determine the plausibility and feasibility of a new treatment before testing it out in humans. As such, preclinical research encompasses basic research and translational research and can include biochemical, cell culture, and animal experiments.

It’s worth reiterating that what was being discussed was, essentially, frontier science published in very high impact journals, which is why it struck me at the time as rather strange that the authors found it so amazing and deplorable that much of the science at the very frontiers turns out not to be correct when tested further. As I’ve discussed on multiple occasions, the science that is published in the highest profile, most prestigious journals is almost by definition the most tentative science. Given that, it is surprising how much of what is published in such journals actually does stand the test of time, but it should not be surprising that much of it does not. However, the very prestige of such journals gives such research seemingly more authority than research published in less prestigious journals. Moreover, the Amgen executive who co-authored this report led a group that scoured high impact journals for cutting edge studies that appeared to have identified promising molecular targets. Then he had a veritable army of scientists, about 100 of them in the Amgen replication team according to this news report, who were ready to pounce on any published study that suggested a molecular target the company deemed promising. No wonder he could replicate only 11% of the results, particularly given that their definition of “non-reproduced” was assigned “on the basis of findings not being sufficiently robust to drive a drug-development programme.” So in reality, the failure of reproducibility in this oft-cited article is a failure of being able to extend the results sufficiently to justify the resources needed to translate the result to humans, which is a very different thing than a failure to reproduce the experimental results. Indeed, the underlying assumption seems to be that, if there isn’t an immediate practical payoff to a scientific discovery, it’s pointless crap.
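The claim that frontier findings are the most tentative can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only and is not drawn from the Amgen report: it uses the standard positive predictive value formula for a significant result, and the prior probabilities and power values fed into it are assumed numbers chosen to contrast a long-shot frontier hypothesis with a well-grounded one. The function name `positive_predictive_value` is hypothetical.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Fraction of statistically significant findings that reflect a true effect.

    prior -- assumed probability that the tested hypothesis is true
    power -- probability of detecting the effect when it is real
    alpha -- false positive rate when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Assumed scenarios: (description, prior, power). None of these numbers
# come from the post; they only illustrate how the fraction moves.
scenarios = [
    ("well-grounded hypothesis, well powered", 0.5, 0.8),
    ("frontier hypothesis, well powered", 0.1, 0.8),
    ("frontier hypothesis, underpowered", 0.1, 0.2),
]
for label, prior, power in scenarios:
    ppv = positive_predictive_value(prior, power)
    print(f"{label}: {ppv:.0%} of significant results are true")
```

With these assumed inputs, the share of significant results that are real drops sharply as the hypothesis becomes more speculative and the experiment less well powered, even with no misconduct anywhere in the process.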

Lost in the discussion of the reproducibility problem are a lot of these nuances, but, again, that doesn’t mean there isn’t a problem, and it doesn’t mean that scientists shouldn’t do something, but what?

Defining the reproducibility problem

One major issue that reformers run head on into whenever discussing reproducibility is that there is no consensus on what, exactly, constitutes adequate reproducibility in science. An accompanying editorial notes this very issue:

What does ‘reproducibility’ mean? Those who study the science of science joke that the definition of reproducibility itself is not reproducible. Reproducibility can occur across different realms: empirical, computational and statistical. Replication can be analytical, direct, systematic or conceptual. Different people use reproducibility to mean repeatability, robustness, reliability and generalizability.

Economists and social scientists often use the term to mean that computer code and data are available so that someone would be able, if so inclined, to redo the same analysis using the same data. For bench scientists, who made up most of our respondents, it usually means that another scientist using the same methods gets similar results and can draw the same conclusions. We asked respondents to use this definition.

Even with a fixed definition, the criteria for reproducibility can vary dramatically between scientists. Senior scientists will not expect each tumour sample they examine under a microscope to look exactly like the images presented in a scientific publication; less experienced scientists might worry that such a result shows lack of reproducibility.

In an article from last year about scientific reproducibility, John Ioannidis and C. Glenn Begley observed in a much better discussion of the problem of reproducibility than the one Begley co-authored that I mentioned above:

There is no clear consensus as to what constitutes a reproducible study. The inherent variability in biological systems means there is no expectation that results will necessarily be precisely replicated. So it is not reasonable to expect that each component of a research report will be replicated in perfect detail. However, it seems completely reasonable that the one or two big ideas or major conclusions that emerge from a scientific report should be validated and withstand close interrogation.

I guess by that definition, I did replicate Judah Folkman’s results after all!

Addressing the causes of the reproducibility problem

The problem of defining what constitutes adequate reproducibility is not a trivial one, and until there is a consensus, there will only be so much that can be done. Still, the Nature survey also asked scientists what they have done and what should be done to make research more reproducible. It turns out that two-thirds of the respondents had instituted procedures to increase reproducibility, one-third within the last five years.

As for the causes of lack of reproducibility, the scientists surveyed listed the usual suspects, such as selective reporting, “publish or perish” pressure, low statistical power, insufficient replications in the original lab, poor experimental design, variable technical expertise, variability in reagents, and even fraud. To show how subtle these problems can be, it’s useful to relate another anecdote, this time from the survey:

Consolidating methods is a project unto itself, says Laura Shankman, a postdoc studying smooth muscle cells at the University of Virginia, Charlottesville. After several postdocs and graduate students left her lab within a short time, remaining members had trouble getting consistent results in their experiments. The lab decided to take some time off from new questions to repeat published work, and this revealed that lab protocols had gradually diverged. She thinks that the lab saved money overall by getting synchronized instead of troubleshooting failed experiments piecemeal, but that it was a long-term investment.

This is not an uncommon tale. The “institutional” memory of a laboratory is something that is very hard to maintain, given that, other than the principal investigator and (sometimes) a permanent technician and/or lab manager, most personnel in labs are only there for at most a few years to get their PhD or do a postdoctoral fellowship. Turnover is high by design. Often there are little “tricks” or nuances to various experimental techniques to get them to work well that are lost when someone leaves a lab. That’s why maintaining protocol notebooks is so important, but few labs do this as rigorously as they should.

The scientists surveyed identified a number of interventions that could improve reproducibility, including a better understanding of statistics, better mentoring, more robust experimental design, more within-lab validation, more time checking notebooks, journals enforcing standards, and incentives for formal replication. These are similar to proposals surveyed by Ioannidis and Begley last year:

  • Editors solicit replication bids
  • Plea to improve editorial standards
  • Reward quality rather than quantity
  • Emphasis on hypothesis testing research
  • Prospective, rigorous experimental plan
  • Improved understanding of statistics
  • Improved experimental design
  • Systematic reviews of animal studies
  • Use clinically relevant concentrations
  • Consider litter effects
  • Recommendations to improve computational biology
  • Focus on reproducibility in training, grants, journals
  • Pathology: Biospecimen quality control
  • Microarray analyses: Provide data access
  • Psychology: open data, methods and workflow
  • Meta-analyses of animal data
  • Judge academics on quality, reproducibility, sharing
  • Greater institutional responsibility
  • Apply greater skepticism to new technologies

Ioannidis and Begley do note, though:

The fundamental problem with most, if not all, of these proposals is the requirement for investigators, institutions, and journals to willingly comply: it is not at all clear how reasonable recommendations will be implemented or monitored while they remain voluntary. Conversely, were they to be mandatory, then one has to examine carefully how they would be enforced and by whom. The details of how to make these changes work can have a major effect on their efficiency.

Indeed. One can easily envision selective enforcement of such rules that give famous and influential scientists an easier time. Making such a mandatory system fair would be a major challenge.

This, of course, brings us to a major part of the problem, namely incentivization and cost. Scientists have long bemoaned that there is little incentive to publish a positive replication of another scientist’s experiment. That’s why getting “scooped” by another scientist can be so disastrous. Because the result is no longer novel, the scientist who gets scooped will have a hard time publishing his results in the better journals. If that result took a lot of time and resources to obtain, not being first to publish can be devastating to future publication and future funding. Worse, there’s even less incentive to publish negative replications.

In the meantime, as these issues are hashed out in the scientific community, the NIH has acted, instituting new requirements for reproducibility and rigor for research grants and mentored career development award applications. The NIH now requires a discussion of the strengths and weaknesses of previous research and the scientific premise, a description of rigorous experimental design and how bias will be eliminated, consideration of sex and other key variables, and validation of key biological and/or chemical reagents. Regarding the latter, believe it or not, a lot of cancer cell lines out there, when tested, turn out not to be the cell line they were thought to be. As for eliminating bias, again, believe it or not, results from many preclinical experiments are not measured in a blinded fashion, allowing observer bias to taint the results.
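Blinded measurement is mostly a matter of bookkeeping. As a minimal sketch, assuming a simple two-arm animal experiment (the sample names, group names, seed, and the function `blind_assignments` are all hypothetical and are not part of any NIH requirement), one person can randomize samples to groups and hand the person doing the measuring nothing but opaque codes:

```python
import random


def blind_assignments(sample_ids, groups, seed):
    """Randomly assign samples to groups and hide the assignment behind codes.

    Returns (key, blinded_codes):
      key           -- dict mapping code -> (sample_id, group); held by someone
                       who does NOT take the measurements
      blinded_codes -- sorted list of codes, all the measurer ever sees
    """
    if len(sample_ids) % len(groups) != 0:
        raise ValueError("sample count must divide evenly among groups")
    rng = random.Random(seed)

    # Shuffle the samples so group membership is random.
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    # Shuffle the codes too, so a code's number says nothing about its group.
    codes = [f"S{i:03d}" for i in range(1, len(shuffled) + 1)]
    rng.shuffle(codes)

    per_group = len(shuffled) // len(groups)
    key = {}
    for index, sample in enumerate(shuffled):
        group = groups[index // per_group]
        key[codes[index]] = (sample, group)
    return key, sorted(key)


key, blinded_codes = blind_assignments(
    [f"mouse-{i}" for i in range(1, 9)], ["saline", "treatment"], seed=2016
)
print(blinded_codes)  # what the person measuring tumors works from
```

The key is consulted only after every measurement has been recorded against its code, which is what keeps the observer’s expectations out of the numbers.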

Conclusion: Maybe not a crisis, certainly an opportunity

The new NIH rules are a step in the right direction but clearly don’t go far enough. I don’t believe that reproducibility in science is in “crisis,” as so many are claiming, but I do believe it’s a significant problem that needs to be addressed in a thoughtful way. I also have to concede that it’s scientists’ fault that we’re in the mess we’re in and that we haven’t addressed problems with reproducibility more robustly before now, given that this problem has been festering for a while. If it takes labeling the problem as a “crisis” to get some action, I suppose I can live with that.

In considering how to encourage good science and discourage bad science, it is important to note that not all science, particularly biomedical science, should be assumed or expected to result in findings that have direct applications or to result in treatments for humans. As Ioannidis and Begley put it, an efficacy “of 100% and waste of 0% is unlikely to be achievable”, even as they note that there is “probably substantial room for improvement.” It is also important to note that, contrary to the way some paint this problem, the concerns about reproducibility in science don’t invalidate the scientific method itself nor disprove “scientism.” Science-based medicine has yielded incredible benefits to human health over the last 150 years. Indeed, the solutions to this problem being proposed are intended to enhance the rigorous application of science, not to abandon it. Finally, I can’t help but note that it is scientists themselves who are being openly self-critical and debating how to fix perceived problems in science. That is a major strength, not weakness, of science.

By Orac

Orac is the nom de blog of a humble surgeon/scientist who has an ego just big enough to delude himself that someone, somewhere might actually give a rodent's posterior about his copious verbal meanderings, but just barely small enough to admit to himself that few probably will. That surgeon is otherwise known as David Gorski.

That this particular surgeon has chosen his nom de blog based on a rather cranky and arrogant computer shaped like a clear box of blinking lights that he originally encountered when he became a fan of a 35 year old British SF television show whose special effects were renowned for their BBC/Doctor Who-style low budget look, but whose stories nonetheless resulted in some of the best, most innovative science fiction ever televised, should tell you nearly all that you need to know about Orac. (That, and the length of the preceding sentence.)

DISCLAIMER: The various written meanderings here are the opinions of Orac and Orac alone, written on his own time. They should never be construed as representing the opinions of any other person or entity, especially Orac's cancer center, department of surgery, medical school, or university. Also note that Orac is nonpartisan; he is more than willing to criticize the statements of anyone, regardless of political leanings, if that anyone advocates pseudoscience or quackery. Finally, medical commentary is not to be construed in any way as medical advice.

23 replies on “Is there a reproducibility "crisis" in biomedical research?”

Things would be simpler if we could replace hypothesis testing by poll, theory by opinion of journal editors, and evaluation of science by productivity in high impact factor journals.
Certainly, the problem of reproducibility is a concern, but not as much as management of science by evaluation of productivity.

It is important here to note the distinction between honest error and fraud. I see no reason to believe that the reproducibility failures Orac has personally witnessed were due to anything other than honest error, and most of the time honest error accounts for the failure to replicate. To take a famous recent example from physics: the experiment that reported faster-than-light neutrinos turned out to have a dodgy cable connection in a crucial part of the apparatus. I am less familiar with the ways a biomedical experiment can go wrong, but ISTM that there are more variables that are hard to control for (inevitable when you are dealing with live subjects, but not limited to such studies).

But fraud does happen, and the so-called “glamour mags”–Nature, Science, and one other that depends on field (Cell for biomedical researchers, Physical Review Letters for physicists, etc.)–have a business model that makes them particularly susceptible to a cutting-edge fraudster. Jan-Hendrik Schön published most of his fabricated results in Nature or Science. And of course many researchers tried and failed to replicate his results.

@ Eric
The problem is that, unlike fraud, “honest” error is rewarded by the academic productivity system. Before submitting a paper, you sort your data in order to make sense of the results you have obtained. If you work with a hypothesis, you will have difficulty publishing, and may sometimes give up publishing, if some of your data go against your hypothesis. If you are not honest, you may conceal those data, publish, and be rewarded without risk. If you don’t have any a priori hypothesis, it can be even better: you can make a story with all the data, and present them as if it were hypothesis testing. This is very worrying, first because your experimental approach is not the best to test the hypothesis, and second because your statistical tests do not carry the same significance as they would in genuine hypothesis testing, leading to “false” discoveries.
The academic system is not suited anymore for honest hypothesis testing, which now represents a real risk in a career.

@Daniel Corcos #3: so maybe the solution, with regard to tenure, is to require evidence of reproducibility, or that “failed” experiments be published in an open access database as a public service to prevent other researchers from wasting time treading water on old ground?

Not all fraud is willful; some results from self-deception. The Utah cold fusion announcement was one such case, and the classic example is Blondlot and his N-rays.
There is another kind of fraud as well. Modern re-analysis of Gregor Mendel’s statistics raised suspicions that they were not merely correct, but in fact a little too perfect, bringing up the possibility that he had sanded the rough edges off his data to conform more closely to his hypothesis. That one is still being worked over after some 80 years.
There is also the problem of being right for the wrong reason, obtaining a result that accidentally appears to validate the premise, but actually occurs due to a cause not visualized.
Just a few more reasons why reproducibility can be a minefield.

@ Panacea
Nobody asked for Einstein’s reproducibility. Data are just one part of science, and putting too much emphasis on them because they account for grant funding does not make sense.

#5: Sorry, but fraud is by definition willful. I might give a pass to using the term in “wishful thinking” cases, which the cold fusion case probably was.

But Mendel a fraud? He invented genetics in his spare time, not that anyone cared. And he tallied his results pretty much as any biologist of his time would — or those few who did any quantitative work. So he was supposed to invent statistics as well….?

“Nobody asked for Einstein’s reproducibility.”

There’s a good reason for this. You should know it.

ORD@5: Self-deception is a serious problem in science, but in most cases it’s not fraud. Fraud requires that the experimenter knew or should have known that the results are not real. Blondlot, up until the moment Robert Wood revealed that he had palmed the prism during a demonstration, had no reason to know that his results weren’t real–nothing in the theoretical physics of the time excluded the possibility. Schön knew he was faking his results–the reason he was caught was that a postdoc who was having trouble reproducing one of his experimental results happened to notice that two graphs purporting to show the results of different experiments were identical–but he was sufficiently in tune with theorists’ expectations that his deceptions were plausible, and in some cases it turned out his guesses were coincidentally close to the actual behavior of the system. Pons and Fleischmann fall in the middle: their initial results were plausibly self-deception, but there came a point at which they should have known that their results weren’t real.

@Daniel. Fair enough. I was just asking a question. I actually agree that the publish or perish mentality is detrimental to academia and good scholarship, in both the arts and the sciences. Fortunately for me, I teach at a college where excellence in teaching is more important than research for obtaining tenure.

But that won’t work at the research institutions.

There is also a lot to be said about the “inherent variability of biological systems” and what a total hash that can make of your experiments if you don’t take it into account (and even if you do).
All this week I’ll be trying to present data on donor variability, and how that variability is so huge in humans (compared to say, chemistry, or even mice).
Out-bred free-living humans are a pain in the butt for doing even relatively simple research. Had a fried pie before your blood donation? Well, that’s thrown off 3 of my process parameters (but I can’t measure by how much).

Reasons I have seen for being unable to reproduce results:
“That has to be in the fridge.”
“Oh, we dropped a letter in your sequence when making the clone.”
“Oh, the settings are RPM, not G.”
“Oh, those mice need to be on antibiotics.”

Seriously, if we weren’t humans no one would ever study them in lab sciences. We’re too freaking annoying.

lk: A pretty fair summation of what’s come to be known as the Mendel-Fisher controversy is here –
Fraud can be unintentional, at least in US law, where the term is “constructive fraud”, conduct that adds up to misleading others even without intent. It comes up in securities, taxes, and receiving benefits more often, as near as I can tell. If you are on the receiving end it costs you just as much as if intended.
While he didn’t use the word fraud, something of the sense of it comes in the famous Richard Feynman quote, “The first principle is that you must not fool yourself — and you are the easiest person to fool.”
I might also bring up the well-known skeptic (Memory for proper names often fails me.) who learned to do cold reading for a fair (I think), and got good enough at it that he started to believe that he really did have some kind of psychic power. If he communicated that to others then that would be a kind of unintentional fraud.

I see no reason to believe that the reproducibility failures Orac has personally witnessed were due to anything other than honest error, and most of the time honest error accounts for the failure to replicate. To take a famous recent example from physics: the experiment that reported faster-than-light neutrinos turned out to have a dodgy cable connection in a crucial part of the apparatus. I am less familiar with the ways a biomedical experiment can go wrong, but ISTM that there are more variables that are hard to control for (inevitable when you are dealing with live subjects, but not limited to such studies).

I’m having a hard time extrapolating the two boldfaced words (and OPERA) to the instant context. Are biomedical papers in any position to present simple “(sys.)” and “(stat.)” numbers?


As if cholesterol were solely responsible for heart disease…

Congrat SN 😛

Narad @14: I’m not sure what you mean by “(sys.)” and “(stat.)” numbers, but for most instruments there is (or should be) a known amount of measurement error.
As for biological experiments going *wrong* (rather than not getting the results you expected), that’s also possible. Usually it’s related to doing something wrong in the procedure (either of the experiment itself or of one of the assays you are using to measure your results).
So your laser could be out of alignment, or your calibration could be off. Or your incubator could fail and freeze (or cook) your sample. Your reagents could have expired or have been improperly stored. Or your mice could eat each other. Or you could use the wrong centrifuge settings.

Some of these are easy to catch, and others take a lot of detective work. And that’s just on the things you control, not counting biological variability at all.

Is that what you were asking about, or am I totally off-base?


A link to a piece in The Telegraph which doesn’t even have a link to the actual paper never fills me with any confidence… As I have pointed out before, our media are uniformly rubbish at reporting medical and scientific research: you are lucky if you get a link to the actual paper; more usual is a press release of the abstract or a report of someone else’s report of the press release of the abstract (yes, Guardian, I’m looking at you!).

At least that article does contain some criticism of the paper from an epidemiologist and a cardiologist.

JustaTech: Your list of reasons for failure reminds me of a couple of famous fails in the US space program. In one case, a rocket went off course and had to be ordered to self-destruct. The reason: a missing hyphen.
More recently, a Mars probe was lost due to the use of both English and metric units in the instructions.
It’s not just the US, either. A Canadian airliner had to make an emergency landing on a runway turned into a dragstrip, gliding the last few miles, because (IIRC) there was confusion due to an improper conversion of metric to Imperial units.

I just wanted to say how happy I was to read about the angiostatin trials and results. Back in the late 90s I went to a seminar on it and came out (along with everyone else) very excited. Then the lab I was in closed, I moved into a different research field, life stuff happened, and I never found out how it went. And now I know!

@ Old Rockin’ Dave

the use of both English and metric units

I believe it’s “imperial” units, not English, as a number of former Commonwealth countries switched to the international unit system (well, at least Canada – most of the time*).
Or just say inch vs metric.

* at a scientific conference I attended a decade ago, the Canadian lecturer described the gizmo he designed as being “1-cm wide and 1-inch long”.

**The Nit Picks Back**
English measurement is today used mainly to designate the British-derived units of measure used in the USA. It’s a little ambiguous because it can also mean imperial units, but it’s used less commonly for that.
The UK didn’t give up all its traditional units, though. You don’t go into the pub and order a half-liter of lager.

ORD @18: Ah, the infamous (and famous) Gimli Glider. One of the most exciting W*kipedia articles I’ve ever read.
Although I thought it was more that the crew thought they’d measured in gallons but they’d actually measured in liters?
Anyway, a masterful rescue of a stupid mistake.

JustaTech, for a long time US civil aviation has measured fuel in pounds because of the need to calculate takeoff weight, and I think that was the standard. More recently they’ve been switching over to kilograms. In calculating the weight of fuel from its volume there are standard conversion factors both for pounds and kilos. A large part of the problem was that they used the pounds conversion factor of 1.77 pounds per liter, not realizing they needed to use the kilogram factor of about 0.8 kilograms per liter. The story is more complicated than that, but confusion over pounds/kilograms was at the root of it.
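
To see how much that mix-up matters, here is a minimal sketch of the arithmetic. It uses the two factors mentioned above (1.77 and 0.8), treated here as pounds per liter and kilograms per liter, which is an assumption about the units. The 10,000-liter tank reading and the function name are hypothetical, chosen only for illustration and not taken from the actual incident.

```python
# Approximate conversion factors mentioned in the comment above.
# Treating them as mass per liter is an assumption about the units.
LB_PER_LITER = 1.77   # pounds of jet fuel per liter
KG_PER_LITER = 0.8    # kilograms of jet fuel per liter


def fuel_mass(volume_liters: float, mass_per_liter: float) -> float:
    """Return the fuel mass for a measured volume.

    The result is in whatever mass unit `mass_per_liter` is expressed in.
    """
    return volume_liters * mass_per_liter


# Hypothetical tank reading, for illustration only.
volume_liters = 10_000.0

# Using the pounds factor but reading the answer as kilograms:
assumed_kg = fuel_mass(volume_liters, LB_PER_LITER)

# Using the kilogram factor, as intended:
actual_kg = fuel_mass(volume_liters, KG_PER_LITER)

# How many times more fuel the crew would believe they had.
overestimate = assumed_kg / actual_kg

print(f"Believed on board: {assumed_kg:.0f} kg")
print(f"Actually on board: {actual_kg:.0f} kg")
print(f"Overestimate factor: {overestimate:.2f}")
```

With these assumed factors, anyone reading the pounds figure as kilograms would believe they had a bit more than twice the fuel actually on board, whatever the tank reading.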

