Reproducible (Operations) Research

A primer on reproducible research and why the O.R. community should care about it.

By Scott Nestler

“Let’s be clear: the work of science has nothing whatever to do with consensus, which is the business of politics. What is relevant is reproducible results. The greatest scientists in history are great precisely because they broke with the consensus.”
– Michael Crichton

Buried in the middle of physician-author Michael Crichton’s quote about the place (or lack thereof) of consensus in science is an alliteration – reproducible results – that he claims (and I suspect most would agree) is relevant to science [1]. The traditional meaning of reproducible results addresses the verification of a scientific experiment by other researchers using an independent experiment. However, computational science poses new challenges to the scientific tradition [2]. Science often proceeds by iterative refinements where the works themselves (i.e., the explicit computations) are seldom published, and it is difficult for others to refine or improve them [3]. As expressed by the Yale Law School Roundtable on Data and Code Sharing, “Generating verifiable knowledge has long been scientific discovery’s central goal, yet today it’s impossible to verify most of the computational results that scientists present at conferences and in papers” [4].

Using the definition of Baggerly and Berry, reproducible research (RR) generally means that conclusions from a single experiment can be reproduced based on the measurements from that single experiment [5]. Note that it does not mean that anyone conducting the experiment again (but recording different measurements) would get the same results. A more precise definition for use in the computational sciences might be: Reproducible research refers to the idea that the ultimate product of research is the paper along with the full computational environment used to produce the results in the paper, such as the code, data, etc., necessary for reproduction of the results and building upon the research [6]. While the majority of the discussions regarding RR have been in some intersection of the medical, bioinformatics/computational biology, signal processing and statistical communities, this is a topic that operations researchers should care about as well.

Evidence of a Problem?

A number of scientific journals have recently had to publish retractions or statements of concern [7]. These have included highly regarded journals such as Lancet, New England Journal of Medicine, and Annals of Internal Medicine. Some awareness of this issue extends beyond scholarly journals, due to reporting in the popular media. For example, an article in The New Yorker last year attempted to answer the question, “Is there something wrong with the scientific method?” While reproducibility (or replicability, as termed in this instance) was not the primary focus of the article, the subject is addressed [8]. The author provides a number of examples from medicine (e.g., anti-psychotic drugs, cardiac stents, Vitamin E), psychology and ecology. An article in the New York Times declared: “Reporters Find Science Journals Harder to Trust, but Not Easy to Verify” [9].

Robert Gentleman, a well-known statistician and bioinformatician, points out that in much modern research, “methodology is often so complicated and computationally expensive that the standard ... journal paper is no longer adequate. ... Most statistics papers, as published, no longer satisfy the conventional scientific criterion of reproducibility: could a reasonably competent and adequately equipped reader obtain equivalent results if the experiment or analysis were repeated?” [10]. Reading through the various journals published by INFORMS (e.g., Operations Research, Management Science and INFORMS Journal on Computing) or the American Statistical Association’s Technometrics, it appears the same could be said about many of the articles appearing in our professional publications. While many articles in these journals are theoretical in nature, they often include results from simulations or other computational experiments that could be prime candidates for RR, if the authors were: (1) aware of the advantages of RR, (2) familiar with helpful techniques and technological solutions, and (3) committed to “working reproducibly.”

History of and More Details About RR

The foundations of reproducibility can be found in Aristotle’s Dictum about there being no scientific knowledge in the individual [11]. Jon Claerbout, a geophysics professor at Stanford University, observes, “From Euclid’s reasoning and Galileo’s experiments, it took hundreds of years for the theoretical and experimental branches of science to develop the standards for publication and peer review that are in use today. Computational science, rightly regarded as the third branch, can walk the same road much faster” [12]. Most sources credit Claerbout as being the first to champion RR in the computational sciences [2, 13]. In 1995, Buckheit and Donoho summarized Claerbout’s ideas as follows: “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures” [14]. In recent work, David Donoho suggests that the word “knowledge” be substituted for “scholarship” [13].

Gentleman and Temple Lang take these suggestions a step further. Besides the figures in a published document, they advocate that authors should provide “explicit inputs and code that can be used to replicate the results,” i.e., tables, figures, etc. based on computation and data analysis [15]. Figure 1 shows the “research pipeline” model that Peng and Eckel, who work in bioinformatics, use to describe RR [16]. Some of the key steps are: (1) the pipeline begins with measured data that is transformed by processing code into analytic data; (2) the analytic code turns the analytic data into computational results; (3) these are then summarized by the presentation code into figures and tables in the text. Note that authors and readers use the pipeline in different directions. Authors start with data, generate analysis and produce the paper itself; readers start at the other end with the text of the paper, and if they are interested, examine the analysis by acquiring the data and code that the authors have provided.

Figure 1: The research pipeline as a model for reproducible research [16].
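The three stages of the pipeline can be sketched in a few lines of Python. This is only an illustrative toy, not code from Peng and Eckel; the function names (processing_code, analytic_code, presentation_code) and the sample data are assumptions chosen to mirror the labels in the pipeline description.

```python
import statistics


def processing_code(measured):
    """Measured data -> analytic data (here: drop missing readings)."""
    return [x for x in measured if x is not None]


def analytic_code(analytic):
    """Analytic data -> computational results."""
    return {
        "n": len(analytic),
        "mean": statistics.mean(analytic),
        "sd": statistics.stdev(analytic),  # sample standard deviation
    }


def presentation_code(results):
    """Computational results -> a line of text for the paper."""
    return f"n = {results['n']}, mean = {results['mean']:.2f}, sd = {results['sd']:.2f}"


def run_pipeline(measured):
    """The author's direction: data -> analysis -> text."""
    return presentation_code(analytic_code(processing_code(measured)))


# Hypothetical measurements; None marks a missing reading.
measured_data = [4.0, None, 6.0, 8.0, None, 10.0]
print(run_pipeline(measured_data))  # n = 4, mean = 7.00, sd = 2.58
```

A reader who obtains measured_data and these three functions can rerun each stage separately, which is the reader's direction through the pipeline.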

Motivations to Work Reproducibly

Donoho recently provided this list of reasons for working reproducibly [13]. While his target audience is primarily biostatisticians, many of his suggestions apply to O.R. analysts with little or no modification.

  1. Improved work products and habits. Because we know that our scripts will be available to others, we will improve them to a higher level of quality than we would if they were only for our own consumption. Researchers who later return to their own earlier work also benefit. (I know this is true in my case and suspect it holds for others.)
  2. Improved teamwork. When working as part of a team, our colleagues can see what we are doing in a more transparent manner and may be more likely to propose improvements. Additionally, confidence in results produced by other team members will be higher.
  3. Greater impact. If we make it easier for other researchers to use the methods we have developed, it should lead to more acknowledgement (by way of increased citations) from other researchers using our computationally reproducible work.
  4. Greater continuity and cumulative impact. For ongoing, longer-term projects, working reproducibly can ease the integration of other researchers and students into the team and better preserve the efforts of team members after they depart the project.

Donoho also points out that taxpayers should want publicly funded research efforts to result in computational reproducibility. Besides providing good stewardship of work product purchased with public funds, working reproducibly also: (1) ensures that access to the work continues after the project is over; and (2) increases the availability of publicly-funded sponsored research to other researchers and the general public. He states in summary, “I believe anyone who understands the process and the benefits (of RR) will eventually be moved to practice it” [13].

Interested Parties and Shared Responsibilities

Scientists and researchers themselves are not the only ones with important roles in RR; significant responsibility also rests with others. Editors of scholarly journals play a key role in establishing reproducibility standards in their fields. They can do this in a number of ways, including: (1) implementing policies for the provision of stable URLs for open data and code associated with published papers; (2) requiring the replication of computational results prior to publication; and (3) requiring appropriate code and data citations.

Funding agencies and grant reviewers can also influence reproducibility through a variety of means, including: (1) requiring some projects to fully implement reproducibility in their workflow and publications; (2) funding the creation of tools that better support reproducibility in their field; and (3) others as outlined by the Yale Law School Roundtable on Data and Code Sharing [4].

Common Objections to RR

Researchers give a number of reasons for not “working reproducibly.” Some of these are simply resistance to change or “knee-jerk” objections, while others are more considered objections and deserve more thoughtful responses. Here are some of the protests that Donoho and his colleagues have encountered when encouraging others to practice RR, along with some possible responses [17]:

  1. It takes extra work. This is indeed true, especially when starting to do RR, and breaking your old, informal, non-reproducible habits.
  2. Nobody else does it. If you actually work reproducibly, and make your code and data available, it will: (a) get noticed, (b) get used, and (c) become a reliable tool.
  3. My work is too complicated. This is unlikely, but even if true, why should anyone believe what you write and publish? Don’t you exercise your computations on test data to verify them? Let others see what you have done. Reproducibility may even be more important in this case.
  4. It undermines the creation of intellectual capital. Because tools are given away before they ripen, the researcher cannot develop a toolkit over a career. Maybe, but is the purpose of your publication scholarship, personal aggrandizement and/or financial gain? Repeating prior advice, working reproducibly can improve teamwork and get your work noticed.
  5. Legal issues. There are indeed concerns with copyrights, patents and licenses. Victoria Stodden has several articles on this issue that might be of interest to those with concerns about this aspect of RR, such as the Reproducible Research Standard (RRS), discussed shortly [18, 19].

Possible Solutions and Ways to Ease the Pain

A literate program, as defined by Donald Knuth in 1992, is a document that contains both code and text segments [20]. The text provides an explanation of what the code actually does when it is executed. Literate programs support two types of transformations, for different audiences. Weaving a literate program creates the document for a human reader, while tangling the same file hides the text and allows the code to be compiled or evaluated by a computer. Within the statistical community, the focus has been on using literate programming or close variants. The most common implementation is Sweave [21], from Friedrich Leisch, which combines the statistical programming language R and the typesetting markup language LaTeX.

As shown in Figure 2, “weaving” a Sweave document results in a LaTeX file that can be processed into a PDF or other format; “tangling” the same document yields code that has been extracted for use within R. I am aware of a number of other weavers, for use with various programs and languages, including: odfWeave (for R with OpenOffice documents), R2wd (R and Microsoft Word), SASweave (SAS with LaTeX), and StatWeave (R, SAS, Stata, and Maple) [6].

Figure 2: Results of weaving and tangling a Sweave document [22].
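To make the two transformations concrete, here is a minimal sketch of weaving and tangling in Python. It is a toy, not how Sweave is implemented: it assumes the noweb-style convention in which a line of the form <<label>>= opens a code chunk and a line containing only @ closes it, and its weave step merely typesets the code in a LaTeX verbatim environment rather than running it and inserting the output as Sweave does. The helper names (split_chunks, tangle, weave) and the sample document are hypothetical.

```python
def split_chunks(source):
    """Split a noweb-style document into ('text' or 'code', lines) chunks."""
    chunks, kind, buf = [], "text", []
    for line in source.splitlines():
        stripped = line.strip()
        if kind == "text" and stripped.startswith("<<") and stripped.endswith(">>="):
            if buf:
                chunks.append((kind, buf))
            kind, buf = "code", []          # chunk header opens a code chunk
        elif kind == "code" and stripped == "@":
            chunks.append((kind, buf))
            kind, buf = "text", []          # '@' closes the code chunk
        else:
            buf.append(line)
    if buf:
        chunks.append((kind, buf))
    return chunks


def tangle(source):
    """Keep only the code, for the computer."""
    return "\n".join(
        line for kind, lines in split_chunks(source) if kind == "code" for line in lines
    )


def weave(source):
    """Keep text and code together, for the human reader (code shown verbatim)."""
    out = []
    for kind, lines in split_chunks(source):
        if kind == "code":
            out.append("\\begin{verbatim}")
            out.extend(lines)
            out.append("\\end{verbatim}")
        else:
            out.extend(lines)
    return "\n".join(out)


# A hypothetical literate document with one code chunk.
DOC = """The sample mean of the demand data is computed below.
<<mean>>=
demand = [12, 15, 9]
avg = sum(demand) / len(demand)
@
The value of avg is then reported in the text."""

print(tangle(DOC))
print(weave(DOC))
```

Because both outputs come from the same source file, the code a reader sees in the woven document is exactly the code that was run.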

However, one need not be a statistician using R to benefit from this approach. Donoho and his colleagues provide extensive Matlab-based tools, including the Wavelab and Sparselab packages, which have been developed over a 15-year period [17]. LeVeque supplies a Python interface to Fortran code [23]. Peng & Eckel have created a “cacher” package for R, which enables modular reproducibility by storing results of intermediate computations in a database [16]. Instead of LaTeX, another biostatistics researcher proposes using Extensible Markup Language (XML) with R [24]. For use in computational biology, Mesirov provides GenePattern, an add-in to Microsoft Office [2].
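The idea of storing intermediate results in a database can be illustrated with a short Python sketch. It is not the cacher package and makes no claim about that package's design; the ResultCache class, its key scheme (a hash of the step name and inputs) and the total_cost example are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


class ResultCache:
    """Store results of intermediate computations in a small SQLite database."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT)"
        )
        self.computed = 0  # how many times a step actually had to be run

    def _key(self, step_name, inputs):
        # Identical step name and inputs always map to the same key.
        payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, step_name, func, inputs):
        key = self._key(step_name, inputs)
        row = self.db.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])       # reuse the stored result
        value = func(**inputs)
        self.computed += 1
        self.db.execute(
            "INSERT INTO results (key, value) VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()
        return value


def total_cost(unit_costs, quantities):
    """A stand-in for an expensive intermediate computation."""
    return sum(c * q for c, q in zip(unit_costs, quantities))


cache = ResultCache()
inputs = {"unit_costs": [2.0, 3.5], "quantities": [10, 4]}
first = cache.get_or_compute("total_cost", total_cost, inputs)
second = cache.get_or_compute("total_cost", total_cost, inputs)
print(first, second, cache.computed)  # 34.0 34.0 1
```

Passing a file path instead of ":memory:" would let the database be shipped alongside a paper, so that a reader could inspect one stored step without rerunning the whole analysis.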

A commercial effort called “Inference,” from a company called Blue Reference, attempted to integrate R and Matlab code into Microsoft Office documents. Even though this effort evidently ceased development in 2009, demonstration copies of the software are still available at no cost from the company website [25].

One proposal to assist with RR legal issues is Stodden’s RRS, which plays a role similar to that of the GNU General Public License (GPL) for open-source software. RRS, however, goes beyond software to include all other data and procedures necessary for replicating a computational experiment [18].

One Size Does Not Fit All

In addition to giving examples of successes working reproducibly with Matlab-based toolboxes, Donoho and his colleagues identify three areas where RR has failed [17].

  1. Postdocs. Postdocs are typically in a rush to publish and prefer to work in a manner with which they are already comfortable. Do you blame them?
  2. Theorists. A theoretical paper with very few diagrams and calculations may not benefit from or look any different as a result of application of RR.
  3. One-off Projects. Some projects are either so small or short-lived that the additional work just isn’t worth it.

Some of these exceptions might apply within O.R. as well. The first and third are likely directly transferable. While one could find instances of theoretical papers fitting the description of the second item on this list, the majority of papers in O.R. journals do include multiple figures and numerous computational results that could benefit from RR.

One Journal’s Efforts

Over the past couple of years, Biostatistics, a peer-reviewed journal published by Oxford University Press, has encouraged authors to employ RR. The editors, Peter Diggle and Scott Zeger, recently provided the following explanation of why they chose this route [26]:

“Our aim was not to police the technical correctness of published work but rather to recognize that a nontrivial statistical analysis involves many decisions by the analyst that are open to debate. This is especially true in more complex analyses, for example, in a Bayesian analysis using Markov chain Monte Carlo that involves choices about the prior, the sampler, the burn-in, and the convergence criteria. All these may affect the inference drawn from the data, and the reader would be well served by giving the ability to find out.”

The editors further point out that, by making their work reproducible, researchers can render state-of-the-art methods more accessible to others and help preclude the abuse of methods in situations where the assumptions on which they rest are not likely to be satisfied. Similarly, in O.R. we often make modeling assumptions and decisions. While we may state these in words, translating them unambiguously into an implementable form for modeling purposes by other researchers can be difficult.

Searching for Evidence of RR in O.R.

My efforts to discover RR in O.R. began with querying proponents in other disciplines, including several researchers previously mentioned. When that failed, my endeavor moved to the Internet. A search of the INFORMS website for the term “reproducible” yielded eight results. Of these, three were germane. First, the Finance department editor’s statement for Management Science says, “Authors of empirical and quantitative papers should provide or make available enough information and data so that the results are reproducible” [27]. The second result was in an article about Office of Management and Budget Circular A-4 in a Decision Analysis Society Newsletter from 2004 [28]. The last was in the title and abstract for a talk at ICS 2009, the 11th INFORMS Computing Society Conference, held in Charleston, SC. The presentation, “GAMSWorld and the Growing Demand for Reproducible Computational Experiments,” attempted to draw the attention of the mathematical programming community away from a myopic focus on performance testing and benchmarking. The GAMSWorld website (http://www.gamsworld.org) offers “well-focused, tested and maintained components (e.g., model libraries, tools for generating, collecting and analyzing results) to use as building blocks in making reproducible experiments” [29]. A similar search for the word “replicable” produced 11 more results; none related to RR. Searches for related terms might yield other instances I did not discover.

Some Examples From O.R.

For a relatively recent example of work that is not easily reproducible, look at a paper published in Naval Research Logistics in 2009 on which I was a co-author [30]. This article studies the Shewhart chart of Q statistics proposed for the detection of process mean shifts in start-up processes and short runs. Exact expressions for the run-length distribution are derived and evaluated using an efficient computational procedure that can be considerably faster than using direct simulation. While we provided pseudo-code for both our procedure and the direct simulation methods, and included the somewhat standard phrase, “... are available upon request from the authors,” I suspect it would be difficult for other researchers (and perhaps even ourselves) to replicate the results based solely on what was provided in the paper itself.

A much more highly referenced paper (618 cites according to Google Scholar) published in Interfaces in 1995, “Global Supply Chain Management at Digital Equipment Corporation,” was volunteered by one of the article’s authors (Gerald Brown) as a poor instance of RR [31]. He explains, “There was no way Digital Equipment would release their strategic plan, and we did not think to produce a pilot.” Brown was also part of a team that wrote an earlier paper “Design and Implementation of Large Scale Primal Transshipment Algorithms” in 1977 that was much closer to being considered RR. This work included statements such as, “a set of standard test problems [from NETGEN, 43] that have also been solved by other contemporary codes.” ... “The FORTRAN program GNET/Depth [6], 1975, is distributed to researchers for a nominal handling charge on an exclusive use basis. For further information write...” [32]. He reports that they shipped hundreds of these, worldwide. So, it appears that this approach can work.

The Internet is potentially a significant aid to those who desire to make their results reproducible by making available the underlying data and algorithm code. An example of this can be seen at http://faculty.nps.edu/awashburn/, where the “Downloads” link includes additional hyperlinks to numerous zipped applications and Excel workbooks that accompany many of Alan Washburn’s recent publications [33]. While this doesn’t fully meet the definition of RR provided earlier, it is evidence of steps toward reproducibility taken by one operations researcher and his colleagues.

The previously described tools for producing RR in statistics will be of use to some O.R. analysts, but they clearly are not sufficient for all purposes. In particular, the various methods of combining R and LaTeX are of limited use to those in optimization. However, ubiquitous mathematical modeling languages (e.g., GAMS) and high-quality commercial optimization packages (e.g., IBM’s CPLEX, free for academic research) are available and can make models portable, supporting RR. The GAMSWorld site previously introduced appears to be worth consideration for those working in math programming or optimization. Also, COmputational Infrastructure for Operations Research (COIN-OR) is an initiative for open-source software in O.R.; the items in their mission statement indicate that their projects support RR, without explicitly mentioning the term.

So What Can We Do About It?

As mentioned earlier, the responsibilities for RR are shared. As individual researchers, we can work through the challenges involved in making research reproducible; there are ways to overcome these obstacles. A few examples are presented here; there are probably more, perhaps more applicable to some areas of O.R. but not known to me. Others may yet need to be developed. If you are already using LaTeX and either R or Matlab for statistical analysis and simulation, the move to using Sweave or a similar option is painless. If not, consider an interim step: use scripted code rather than spreadsheets or other GUI-based analysis tools, which are inherently less reproducible.
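As one illustration of what scripted analysis can look like, the Python sketch below estimates the mean waiting time in a single-server queue using Lindley's recursion, with every parameter and the random seed stated explicitly in the code. The model, the function name simulate_mean_wait, the parameter values and the seed are assumptions chosen for illustration; they are not drawn from any of the papers discussed here.

```python
import random


def simulate_mean_wait(seed, n_customers=1000, arrival_rate=0.9, service_rate=1.0):
    """Estimate the mean wait in queue for a single-server queue.

    Uses Lindley's recursion: W[n+1] = max(0, W[n] + S[n] - A[n+1]),
    with exponential service times S and interarrival times A.
    """
    rng = random.Random(seed)  # a private, seeded generator: same seed, same stream
    wait, total = 0.0, 0.0
    for _ in range(n_customers):
        total += wait
        service = rng.expovariate(service_rate)
        interarrival = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - interarrival)
    return total / n_customers


SEED = 2011  # recorded in the script so that anyone can rerun the same experiment
run_a = simulate_mean_wait(SEED)
run_b = simulate_mean_wait(SEED)
print(run_a == run_b)  # True: rerunning the script repeats the computation exactly
```

Unlike a sequence of mouse clicks in a spreadsheet, the script itself is a complete record of what was computed, and it can be published alongside the paper.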

If you aren’t yet convinced, but want to know more about RR, visit Reproducible Research Planet, http://www.rrplanet.org. There is an active Google Group on the topic available at http://groups.google.com/group/reproducible-research. Additionally, the website http://www.reproducibleresearch.org contains links to a number of useful resources, including many of the papers cited in this article. If you are working in optimization, visit the GAMSWorld site, its companion Google Groups site, http://groups.google.com/group/gamsworld, and the COIN-OR repository, http://www.coin-or.org.

For those serving on editorial boards of scientific journals, consider whether the articles that appear in your publication, and your readers, could benefit from authors who worked in a reproducible manner. When was the last time your editorial statement was reviewed and updated? Perhaps it is time for a revision that encourages or even promises to reward RR. The same applies to organizations that sponsor scientific publications. If INFORMS, with 12 scholarly journals, were to follow the lead of Biostatistics, imagine how far reproducibility in O.R. could progress in a relatively short time.

That’s a Wrap!

Baggerly and Berry caution us, “RR is necessary, but not sufficient, for good science. It needn’t contain the motivation for what was done, and the motivation may be data-dependent. ... perhaps the data we used were ‘cleaned’ before we got them. These potentially fatal biases will not be known by someone checking reproducibility, and they may not be known to the primary analyst” [5]. David Banks of Duke University, a former editor of the Journal of the American Statistical Association, is skeptical of the RR movement to date. He writes, “My own sense is that very few applied papers are perfectly reproducible.” However, he also suggests that a reproducibility standard is a noble aspiration [34].

The question that Sergey Fomel and Jon Claerbout (the “father” of the RR movement in the past two decades) suggest we ask ourselves before we publish our next paper is, “Have I done enough to allow the readers of my paper to verify and reproduce my computational experiments?” [12]. If your answer is “no,” or even “not quite,” consider the ideas presented here and how they might help you in an effort to make your work an exemplar of RR in O.R., or ... Reproducible Operations Research (ROR).

Scott Nestler is a lieutenant colonel in the U.S. Army and an assistant professor in the Operations Research Department at the Naval Postgraduate School. Previously, he was on the faculty at the U.S. Military Academy at West Point, served as the Chief of Strategic Assessments, Multi-National Force Iraq (MNF-I) at the U.S. Embassy in Baghdad, Iraq, and has worked as an O.R. Analyst on the Army Staff in the Pentagon. Nestler has a Ph.D. in Management Science from the Robert H. Smith School of Business, University of Maryland-College Park; he is also an Accredited Professional Statistician™.

Acknowledgments

The author thanks Professor Emeritus Gordon Bradley, Distinguished Professor Gerald Brown and Distinguished Professor Emeritus Alan Washburn, his colleagues at the Naval Postgraduate School, for reviewing drafts of this article and providing constructive feedback that led to significant improvements.

References

  1. Crichton, M., speech at California Institute of Technology, Pasadena, Calif., Jan. 17, 2003.
  2. Mesirov, J., “Accessible Reproducible Research,” Science Magazine, Vol. 327, Jan. 22, 2010.
  3. Gentleman, R., “Reproducible Research: A Bioinformatics Case Study,” Statistical Applications in Genetics and Molecular Biology, Vol. 4, No. 1, pp. 1-23, 2005.
  4. Yale Law School Roundtable on Data and Code Sharing, “Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science,” Computing in Science & Engineering, Vol. 12, No. 5, September/October 2010.
  5. Baggerly, K., and D. Berry, “Reproducible Research,” AMSTAT NEWS, January 2011.
  6. Reproducible Research Planet!, http://www.rrplanet.com/reproducible-research/reproducible-research.html.
  7. Laine, C. and others, “Reproducible Research: Moving Toward Research the Public Can Really Trust,” Annals of Internal Medicine, pp. 450-454, March 20, 2007.
  8. Lehrer, J., “The Truth Wears Off,” The New Yorker, Dec. 13, 2010.
  9. Bosman, J., “Reporters Find Science Journals Harder to Trust, But Not Easy to Verify,” New York Times, Feb. 13, 2006.
  10. Green, P., “Diversities of Gifts, But the Same Spirit,” The Statistician, pp. 423-438, 2003.
  11. Turner, W., “History of Philosophy,” Ginn and Company, Ch. 11, 1903.
  12. Fomel, S., and J. Claerbout, “Reproducible Research,” Computing in Science & Engineering, pp. 5-7, January/February 2009.
  13. Donoho, D., “An Invitation to Reproducible Computational Research,” Biostatistics, Vol. 11, No. 3, pp. 385-388, 2010.
  14. Buckheit, J., and D. Donoho, “Wavelab and Reproducible Research,” “Wavelets and Statistics,” Springer-Verlag, 1995.
  15. Gentleman, R., and D. Temple Lang, “Statistical Analyses and Reproducible Research,” Bioconductor Project Working Papers, Paper 2, 2004.
  16. Peng, R., and S. Eckel, “Distributed Reproducible Research Using Cached Computations,” Computing in Science & Engineering, pp. 28-34, January/February 2009.
  17. Donoho, D., and others, “Reproducible Research in Computational Harmonic Analysis,” Computing in Science & Engineering, pp. 8-18, January/February 2009.
  18. Stodden, V., “The Legal Framework for Reproducible Scientific Research: Licensing and Copyright,” Computing in Science & Engineering, pp. 35-40, January/February 2009.
  19. Stodden, V., “Enabling Reproducible Research: Licensing for Scientific Innovation,” International Journal of Communications Law & Policy, pp. 1-25, winter 2009.
  20. Knuth, D., “Literate Programming,” Center for the Study of Language and Information, Stanford, Calif., 1992.
  21. Shotwell, M., BioStatMatt, http://biostatmatt.com/archives/1402, 2011.
  22. Leisch, F., “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis,” Compustat 2002 – Proceedings in Computational Statistics, pp. 575-580, 2002.
  23. LeVeque, R., “Python Tools for Reproducible Research on Hyperbolic Problems,” Computing in Science & Engineering, pp. 19-27, January/February 2009.
  24. Temple Lang, D., “Embedding S in Other Languages and Environments,” Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, 2002.
  25. Blue Reference, “Inference: A Solution Platform for Business Professionals,” http://inference.us, 2009.
  26. Diggle, P., and S. Zeger, “Editorial,” Biostatistics, Vol. 11, No. 3, p. 375, 2010.
  27. Editorial Statements, Management Science, INFORMS, http://www.informs.org/content/download/15885/182755/file/Department Editorial Statements.pdf, p. 2.
  28. Morgan, K., “Opportunity for DA,” Decision Analysis Newsletter, Vol. 23, No. 1, pp. 6-7, March 2004.
  29. Dirkse, S., and others, “GAMSWorld and the Growing Demand for Reproducible Computational Experiments,” Conference Program, ICS2009, 2009.
  30. Zantek, P., and S. Nestler, “Properties of Q-Statistic Monitoring Schemes for Start-Up Processes and Short Runs,” Naval Research Logistics, April 2009.
  31. Arntzen, B., and others, “Global Supply Chain Management at Digital Equipment Corporation,” Interfaces, Vol. 25, No. 1, pp. 69-93, January/February 1995.
  32. Bradley, G., and others, “Design and Implementation of Large Scale Primal Transshipment Algorithms,” Management Science, Vol. 24, No. 1, pp. 1-34, September 1977.
  33. Washburn, A., Naval Postgraduate School, http://faculty.nps.edu/awashburn/.
  34. Banks, D., “Reproducible Research: A Range of Response,” Statistics, Politics and Policy, Vol. 2, No. 1, Art. 4, 2011.