VIEWPOINT

Unpacking the true cost of ‘free’ statistical software

By Bradley C. Boehmke and Ross A. Jackson

Pollack, Klimberg and Boklage (PKB) [1] recently asked in OR/MS Today whether open-source statistical software is really free. Their appeal to a “true cost” and their sketch of hidden costs suggest their answer is no. In a sense, we agree. Economic thinking acknowledges there is no free lunch. However, when it comes to the relative risks and merits of software, much remains to be discussed. Despite their titular assertion, PKB did not present “the true cost of ‘free’ statistical software.” Rather, they provided an alternative conceptualization of costs, particularly those of the R programming software. We present a critique of that conceptualization and rebut their seven main claims. Taking a postmodern turn, we leave notions of truth to consumers.

Claim 1: Open-source is a craze. Initial framing of a topic is consequential. PKB introduced open-source software as “one of the newest crazes.” Their use of craze is problematic. Craze conveys something that is popular but short-lived. That the rising prominence of open-source software will prove short-lived is, for now, conjecture. In fact, evidence suggests otherwise, as some of the largest proprietary software developers are open-sourcing their languages (e.g., Swift by Apple, Go by Google). Furthermore, organizations evidently value open-source capabilities. A recent O’Reilly survey revealed that analysts focusing on open-source technologies make more money than those dealing in proprietary technologies. Time is needed to determine if this is a “craze” or a software market shift.

Claim 2: Technical ability demands. PKB stated that R requires more technical ability. Although presented as a disadvantage, this is also the foundation of potential benefits. The data analysis process is rarely restricted to a handful of tasks with predictable inputs and outputs that can be pre-defined by a fixed user interface. In proprietary software, only the developer has access to the underlying software to modify the interface. Open-source software, such as R, blurs the distinction between developer and user, which provides the ability to extend and modify the analytic functionality to suit the organization’s needs. This allows the data analyst to manage how data are being transformed, manipulated and modeled.

Does this take more technical ability? Yes, but through the process one gains a beneficial autonomy, which organizations have proven to value.

Claim 3: Inferior data handling capability. Are R’s data handling capabilities inferior to those of SPSS/SAS? Answering this requires clarifying what “data handling” means. We define data handling as the ability to collect, hold and process data. Between commercial software and R, the basics of importing data and the size of data each can process are comparable. Where the two differ is in harvesting data from online sources. Few proprietary statistical software products enable scraping data from Web pages. R does. While differences exist in how these programs connect to distributed data storage and processing capabilities, no evidence to date compares these capabilities. PKB speculated here; traveling along with them means, in J.K. Rowling’s words, “We shall be leaving the firm foundation of fact and journeying together through the murky marshes of memory into thickets of wildest guesswork.”

Claim 4: Inferior user support. For PKB, “one of R’s most serious challenges” is a lack of official help. They acknowledged commercial help comes at “a significant financial cost,” but value “expert” insight. While this assistance is likely more definitive, the transaction occurs within a web of commercial relationships in which the pushing of product and support is ubiquitous. Further, commercial assistance often presents a single solution, whereas the R community provides multiple solutions. Learning to deal with complexity is essential. This ability is underdeveloped when one simply implements direction. Lastly, the “qualified professional support” from commercial software companies likely comes from customer support personnel. With R, questions regarding algorithms or package code are often answered by programmers.

Claim 5: Lack of quality, scientific controls and rigor. PKB cautioned users that R packages “lack the quality, scientific controls and rigor” of proprietary software. This is a common misperception. Recent Coverity Scan Open-Source Reports (an accepted standard for measuring open-source quality) have found that open-source code quality surpasses proprietary code quality. PKB’s statement minimizes the review process of publishing packages and the development of best practices in the R community. Lastly, many R packages originate from academic research, from institutions or from programmers who hold doctorates. It is questionable to establish a position on the implication that the quality, scientific controls and rigor of industry are inherently superior to those of our academic institutions.

Claim 6: Greater hidden costs. Hidden costs are essential to PKB’s argument. Their warnings referenced how using R could result in “serious financial costs,” “dismantling costs” and “great risks to the credibility and reputation of any user.” It would seem using R could “immanentize the eschaton.” These hypothetical situations are probably more like R.E.M.’s “It’s the End of the World as We Know It (And I Feel Fine).”

While botched analytics could cause the existential collapse of organizations, the typical result would simply be irksome. PKB supported their claim with an anecdote of how two of them were involved in a project in which R was dropped because it lacked adequate statistical details related to goodness-of-fit testing and odds ratios for logistic regression. Through a Google search one can obtain these functions [2]. R does not lack the capabilities PKB listed. Locating and using the functions does require a willing operator.
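To make the point concrete, the following is a minimal sketch of how these quantities might be obtained using only base R. The built-in mtcars data set and the model am ~ wt are assumptions chosen purely for illustration; they are not the data or model from the project PKB described.

```r
# Illustrative only: mtcars and the model am ~ wt are assumptions,
# not the data or model from the project PKB described.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# summary() reports the null and residual deviance alongside the coefficients
fit_summary <- summary(fit)

# Likelihood-ratio (model chi-square) test against the intercept-only model
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Log likelihood of the fitted model
log_lik <- logLik(fit)

# Odds ratios with Wald 95 percent confidence intervals
odds_ratios <- exp(coef(fit))
or_ci <- exp(confint.default(fit))

print(fit_summary)
print(p_value)
print(log_lik)
print(cbind(OR = odds_ratios, or_ci))
```

The odds ratios here are simply the exponentiated coefficients, and the intervals are Wald intervals from confint.default(). Other interval methods and additional goodness-of-fit tests are available, some through add-on packages.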

Claim 7: Appropriateness in academia. With regard to educational use, PKB explained R should not be used “if the main objective in a course is to learn the statistical techniques of data mining/predictive analytics.” There are issues to unpack here. Learning a software interface is far from thinking. Using R requires the development of higher-order thinking and helps one integrate abstract knowledge with praxis. If R is harder to use, the student who learns R should be able to make quick use of commercial software; the opposite does not necessarily hold. Further, college seems like an ideal place to learn challenging things, especially if through the process one is less constrained and more critical. Limiting analysts by the idiosyncratic purchases of organizations is suboptimal when one could maximize their cognitive and technical abilities.

Conclusion

It seems doubtful that those adroit at dealing with the complexities of statistical software would easily confuse the absence of a purchasing fee with a situation in which there are no externalities. PKB wrote an article directed at warning would-be consumers of risks associated with free statistical software. When unpacked, a more timely warning might be directed to providers of commercial products and their legions of service representatives and consultants. The apparent “craze” suggests they may need to become more efficient to remain relevant. Time will tell which approach prevails. In the meantime, people will determine if R is right for them. Maybe it is; maybe it isn’t. But wouldn’t rational consumers try the free solution first?

Bradley C. Boehmke, Ph.D., is an operations research analyst whose research primarily focuses on strategic cost analytics across the Air Force. Dr. Boehmke is also an adjunct professor at the Air Force Institute of Technology, Department of Operational Sciences. Ross A. Jackson, Ph.D., is a “poet-analyst” engaged in the exploration of linguistic and existential facets of the military-industrial complex. Additionally, Dr. Jackson is an adjunct instructor of economics at Antioch College in Yellow Springs, Ohio.

Disclaimer: The views expressed are those of the authors, and do not represent the official policy or position of the United States Air Force, Department of Defense or the United States government.

References

  1. Pollack, R. D., Klimberg, R. K., & Boklage, S. H., 2015, “The true cost of ‘free’ statistical software,” OR/MS Today, Vol. 42, No. 5, pp. 34-35.
  2. Null and residual deviance are reported by summary(); the model’s p-value and log likelihood can be calculated using the pchisq() and logLik() functions, and odds ratios follow from exponentiating the coefficients returned by coef(). Additional evaluation measures (e.g., Hosmer-Lemeshow) are also available.