VIEWPOINT: The true cost of ‘free’ statistical software

An examination of the open-source software package R.

By Richard D. Pollack, Ronald K. Klimberg and Susan H. Boklage


Open-source software is typically free, highly customizable and widely accessible to the public. The Android operating system, Hadoop, Linux, Wikipedia and the Firefox browser are well-known examples. In the statistical realm, R is the dominant open-source player. But what happens when something goes wrong while using R? Is it really free? Where do you go for support? Are there significant costs associated with using R? Further, to what degree should open-source statistical software, such as R, be used and taught in academia?

This article explores these questions, as well as others, in discussing the hidden costs of using the open-source statistical software package R.

Open-source software is one of the newest crazes, finding its way into almost any environment where software is used. As in any craze, there are those who extol the virtues of the movement to an extreme. Many of R’s proponents, while keenly aware of its advantages, often ignore its disadvantages, sometimes with serious consequences. The motivation behind this article is to paint a more realistic and balanced picture of R’s strengths and weaknesses. To be comprehensive, the discussion will bring in (where appropriate) R’s commercial big brothers, including IBM’s SPSS and Modeler, SAS and SAS’s Enterprise Miner and JMP.

R’s Numerous Advantages

R has numerous advantages. It’s free, widely available and works on a variety of platforms. Moreover, it can integrate with some of its commercial big brothers (e.g., IBM SPSS) and read their files (e.g., SAS, Excel). R’s acceptance and presence are growing significantly, and its well-known users include Google, Facebook, Twitter, Zillow, Trulia and the New York Times. Because R is open source, the latest cutting-edge statistical techniques can be added to its packages/libraries in a timely fashion. These attributes, along with the large, active online user community, have great appeal, especially to those who do not want to pay for or simply cannot afford SAS or IBM SPSS.
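
For example, reading files produced by commercial tools typically relies on contributed packages such as haven and readxl. A brief sketch follows; the file names are purely illustrative:

    library(haven)    # contributed package for SPSS and SAS files
    library(readxl)   # contributed package for Excel workbooks

    spss_data  <- read_sav("survey.sav")        # IBM SPSS file (illustrative name)
    sas_data   <- read_sas("claims.sas7bdat")   # SAS data set (illustrative name)
    excel_data <- read_excel("budget.xlsx")     # Excel workbook (illustrative name)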

Despite these advantages, R still has chinks in its armor. Many first-time users, especially those who have not used other statistical software packages, report a steep learning curve. R requires more technical ability – there is much more writing of “code” – and lacks the friendlier user interfaces of its commercial competitors, such as IBM SPSS Statistics or SAS’s JMP. The lack of a more intuitive, visually based interface is especially apparent when comparing R to major data mining tools, such as IBM SPSS Modeler and SAS Enterprise Miner. Additionally, R’s data handling capabilities are inferior to those of SPSS and SAS, which can cause severe problems, especially with large data sets.

One of R’s most serious shortcomings is the lack of a truly designated, official point of contact for help. With commercial software (albeit at significant financial cost), one can always get support from an expert who is proficient with the software and can usually query extensive databases of cases and their resolutions. There are also advanced levels of help, where one can work with high-level research statisticians who are also experts with the software (both SAS and SPSS offer this, for instance).

This raises a key question concerning R: Is the user-support community, usually found on the Internet, equivalent to professional commercial assistance? No. Anyone, qualified or not, can provide a bug fix, a new routine or an answer to any type of question about the software. This can happen with commercial software as well, but one always has recourse to qualified support from the software vendor. Further, users should be cautious because anybody can contribute a package to R, and such contributions lack the quality controls and scientific rigor that the commercial alternatives provide.

It is difficult to assess the quality and accuracy of Internet-based open-source support. This is particularly challenging for more complex and sophisticated issues; some fixes can appear to work when it is hard to verify that they actually do. The costs associated with a “fix” that does not actually work, sometimes discovered late or not at all, can be devastating. This can occur with commercial software as well, but the probability is clearly lower given access to qualified professional support and the scientific rigor with which algorithms are tested before being deployed to the user community. Ironically, in such situations it is not uncommon in the “free” open-source space to end up paying for more qualified support, especially in business settings with tight deadlines, where help is needed on a timely and accurate basis.

Hidden Costs

Time is money. The fact that R is more difficult to use, and that reliable and accurate help is harder to get, can translate into serious financial costs for any user, especially in a world of tight deadlines and a growing need for analysts to be up and running in a minimum amount of time. Moreover, these conditions increase the probability of error, which can pose great risks to the credibility and reputation of any user and/or enterprise. What is free upfront may carry substantial costs down the road, costs that could far exceed the price of commercial software.

Two of this article’s authors were involved in an enterprise-wide project where R was summarily dropped as a tool and replaced by commercial software, for many of the reasons stated above. Specifically, basic statistical output, such as that from logistic regression, lacked details one would typically need to evaluate a model, such as goodness-of-fit testing and odds ratios. Furthermore, advanced statistical models, such as decision tree algorithms, offered limited, basic methods compared to their commercial cousins and produced output that was cumbersome to evaluate, synthesize and leverage in other presentations and documentation.
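
As an illustration of the extra steps involved, here is a minimal sketch (using R’s built-in mtcars data purely as stand-in data) showing that odds ratios, confidence intervals and a goodness-of-fit test are not part of the default logistic regression summary and must be assembled by the user:

    # Logistic regression in base R; mtcars serves only as stand-in data
    fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
    summary(fit)   # default output: coefficients, standard errors, deviance (no odds ratios)

    # Odds ratios and their confidence intervals must be derived by hand
    exp(cbind(OR = coef(fit), confint(fit)))

    # A Hosmer-Lemeshow goodness-of-fit test requires a contributed package, e.g.:
    # ResourceSelection::hoslem.test(fit$y, fitted(fit))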

Many users’ first exposure to statistical software is in academia. In light of R’s deficiencies, one has to question whether teaching R, particularly as the only statistical software taught, is a suitable approach. If the main objective of a course is to learn the statistical techniques of data mining/predictive analytics, R should not be used: the difficulty of R programming would detract from learning the techniques themselves. Additionally, IBM’s and SAS’s products provide easy-to-use interfaces, and their academic prices are not prohibitive.

On the other hand, business analytics (BA) programs should offer students opportunities to keep their technology toolkits up to date and diverse. Learning how to program, and being exposed to the full range of tools available for applying those skills, is imperative. BA programs should not teach only the commercial statistical packages or only open-source software such as R; they should teach and expose their students to both.

R certainly has its strong points. Its weaknesses, however, should not be overlooked. Although commercial software is pricey (SPSS and JMP cost significantly less than Modeler or Enterprise Miner), its user-friendliness and timely professional support overcome many of the disadvantages of R. Commercial software may be a worthwhile investment, especially for those first introduced to statistical software, to avoid potentially even greater costs – both financial and professional – down the road.

Richard D. Pollack, Ph.D., is the owner of Advanced Analytic Solutions (AdvancedStat.com), a statistical consulting firm that helps companies segment, profile and predict the behavior of their key consumers through a variety of data mining techniques.  

Ronald (Ron) K. Klimberg, Ph.D., is a professor in the Decision and System Sciences Department of the Haub School of Business at Saint Joseph’s University and a longtime member of INFORMS.

Susan H. Boklage, MS, MPH, is a director with Regeneron Pharmaceuticals with responsibility for health economics and outcomes research. She has been doing statistical programming in SAS, SPSS and Visual Basic for Applications for more than 20 years.