The Importance of Replication in Behavioral Operations Management

Scientific findings are based on reproducible evidence because fragile results cannot support general conclusions and so are of little interest. That's why a recent article published in Science, "Estimating the reproducibility of psychological science," has caused such a stir. The article reported a replication study of 100 published psychology studies drawn from three important psychology journals. The news that only 36% of replications produced statistically significant results has caused a great deal of discussion and soul-searching among social scientists. The Science article comes on the heels of similar findings on the lack of reproducibility of experiments in cancer biology, and of incidents of research fraud in political science as well as psychology. It appears, then, that reproducibility is a problem in a number of social and natural sciences. We decided to share some of our thoughts about the causes of the problem, as well as some ideas for solutions.

Some scientific results, even published ones, do not reproduce; this in itself is not news. Random error plays a role in every study, no matter how carefully designed. Still, standard statistical analysis implies that random chance alone might lead to failure rates of 5 to 10%, far short of the 64% rate reported in the Science paper. So it is likely that additional, systematic sources of error are at work. After all, every laboratory differs slightly in its procedures, so there is always a possibility that a finding is an artifact of the experimental apparatus rather than evidence of the causal relationship the experimenter believes has been established.
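To see where a chance failure rate of 5 to 10% could come from, consider a back-of-the-envelope calculation (our own illustration, not taken from the Science article): if every original finding reflected a true effect and each replication were run at 90 to 95% power, roughly 5 to 10% of replications would still miss statistical significance by chance alone. A minimal sketch, using an assumed standardized effect size of 0.5 and a two-sided test at the 5% level:

```python
from math import sqrt
from statistics import NormalDist

def replication_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized mean difference `effect_size`, with `n_per_group`
    subjects per arm (normal approximation; lower tail ignored)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Noncentrality of the test statistic under the true effect.
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return 1 - NormalDist(mu=noncentrality).cdf(z_crit)

# Illustrative sample sizes (assumptions, not from any cited study).
for n in (50, 100, 200):
    p = replication_power(effect_size=0.5, n_per_group=n)
    print(f"n={n:>3} per group: power={p:.2f}, chance failure rate={1 - p:.2f}")
```

The point of the sketch is only that even well-powered replications of true effects fail occasionally, but nowhere near 64% of the time; closing that gap requires the systematic explanations discussed below.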

But other practices, such as not reporting all the treatments, dropping "outlier" data, or crafting hypotheses ex post, can exacerbate the problem and decrease the probability that a study will replicate. These practices are common, and there is no avoiding the fact that subjective judgment calls must be made about which data make it into a paper. Many, if not most, investigators dutifully report these calls, and any dropped data, in their papers. Even so, because these calls are subjective, they can reduce the reproducibility of a study. As the cases in psychology and political science show, outright fraud also exists. While we believe it is rare, when it does happen it causes serious damage to the credibility of the profession as well as of the researchers involved.

The incentives inherent in the publishing process, as presently constituted, contribute to the problem. On the one hand, results deemed surprising are more likely to be published in good journals. But by definition, results are surprising because they were a priori counterintuitive: the prior probability of finding them was deemed low, and so, to the extent that our priors are correct, surprising results are less likely to reproduce. On the other hand, studies that seek to reproduce a surprising result, whether they succeed or not, are far less likely to be published. Even a careful study that reports a failure to replicate is likely to end up in a second-tier journal at best. The same usually goes for a successful replication: why publish something that has already been done? Thus, researchers have little incentive to attempt to replicate studies.

So what is to be done? A number of proposed solutions involve some combination of imposing bureaucratic rules on how research is conducted and educating researchers in proper ways to conduct and report research. Some of these ideas, such as requiring authors to make the data, software code, and experimental protocols behind published articles easily accessible, are excellent innovations. But in our opinion they are not enough. Subjective judgments by researchers will still need to be made, and on their own these solutions do nothing to change the incentives in the publication process.

We think journals, and especially top journals, should reconsider the value they place on replications. If top journals were to start publishing successful and failed replications, especially of the surprising and counterintuitive results that fill their pages, it would create incentives for researchers to conduct replication and robustness studies. It would also encourage more careful, cautious reporting of results in the first place. Most importantly, the scientific community would greatly benefit from learning which results are robust and replicable and which are fragile, avoiding the waste of future resources on projects built on reported results that are not robust.

So what should be the criteria for publishing replications?  Successful replications are published naturally, as part of studies that extend existing results.  Studies that build on the findings of earlier published work should include treatments that reproduce the original results.   Editors and referees should insist on this sort of demonstration as a criterion for publication. 

Papers that report that published results do not reproduce or are not robust should be held to the same high standards of rigor and thoroughness applied to other research work.  The research result in question should be important to the field and the investigation of its replication rigorous and thorough. When a published result fails to reproduce or proves not robust, a central question is why.  To provide an informative answer, the study should include systematic manipulations to try to isolate the reasons.  Understanding reasons for lack of reproducibility may well have important methodological and practical implications.  That said, it is not always possible to uncover the precise reasons a result does not replicate.  A good paper should convince the reader that the attempt to find one was exhaustive.

What does all this have to do with Behavioral Operations Management? Much of BOM research involves laboratory experiments, and ease of replication is a major strength of the laboratory method. Even though BOM is a young field, the top OM journals that publish BOM work (Management Science, M&SOM, and POM) have been open to publishing studies aimed at replicating published results. Behavioral research on the "newsvendor" problem is a case study in the success of this approach. Since the seminal Schweitzer and Cachon (2000) paper came out, top journals have published a number of research articles stress-testing the laboratory findings (demand distributions, critical fractiles, subject pool effects, payment protocols, payoff saliency, learning, framing, gender, feedback information), the associated behavioral models (loss aversion, mental accounting, reference dependence, overconfidence, prospect theory, impulse balance equilibrium, social preferences), and model extensions (trust in forecasting, and a variety of different contracts).
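The benchmark these experiments test subjects against is the classic critical-fractile solution to the newsvendor problem. A minimal sketch, with parameter values chosen for illustration rather than taken from any particular study:

```python
def critical_fractile(price, cost, salvage=0.0):
    """Optimal in-stock probability Cu / (Cu + Co), where Cu is the
    underage cost (margin lost per unit of unmet demand) and Co is
    the overage cost (loss per unit left unsold)."""
    underage = price - cost
    overage = cost - salvage
    return underage / (underage + overage)

def optimal_order_uniform(price, cost, lo, hi, salvage=0.0):
    """Expected-profit-maximizing order quantity when demand is
    Uniform(lo, hi): the critical fractile of the demand CDF."""
    cf = critical_fractile(price, cost, salvage)
    return lo + cf * (hi - lo)

# Illustrative high-margin product: optimal order lies above mean demand.
q_high = optimal_order_uniform(price=12, cost=3, lo=0, hi=300)
# Illustrative low-margin product: optimal order lies below mean demand.
q_low = optimal_order_uniform(price=12, cost=9, lo=0, hi=300)
print(q_high, q_low)  # 225.0 75.0
```

The central behavioral finding in this literature, the "pull-to-center" effect, is that subjects' average orders fall between mean demand (150 here) and these optima: too low for high-margin products and too high for low-margin ones. The stress tests listed above probe how robust that pattern is to changes in the experimental setup.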

The newsvendor research conducted and published over the last two decades, which includes both positive and negative results, is a success story and an example of the value of replicating published work. But this should not lead to complacency. The field of experimental economics recently launched a new journal (the Journal of the Economic Science Association) devoted in part to publishing replication and robustness studies, out of a sense that the field had moved away from its earlier commitment to the practice. Our top OM journals should continue their commitment to publishing a certain number of replication studies. This would provide a solid first step toward ensuring that the behavioral work in our journals is reproducible.