Software Survey: Statistical Analysis
Surrounded by uncertainty
In the kingdom of the blind, the one-eyed man is king
– Erasmus, 1510
By James J. Swain
There are no sure things in an uncertain world, as the Seattle Seahawks dramatically demonstrated to the heavily favored New Orleans Saints during the recent NFL playoffs. At about the same time the 7-9 Seahawks knocked out the defending Super Bowl champ Saints, snow and ice surprisingly blanketed every state of the union except Florida.
The volatility of the stock market, vagaries of travel and delivery times, commodity prices, life expectancy, the reliability of our cars and other infrastructure, the weather; we are surrounded by uncertainty and beset with risk. Most of the time the risks are limited, as when we get drenched by the unexpected shower or miss a meeting due to traffic, while other times the potential risks could be catastrophic, as illustrated by Y2K, Hurricane Katrina, the H1N1 pandemic and the current “200-year” floods in Australia.
Only in the last century has statistical understanding developed to the point that we can study and model uncertainty, estimate risk and make inferences in the presence of “noisy” data. The world is more interconnected by commerce and communication than ever, creating dependencies so that no one is completely isolated from events elsewhere, while our ability to deal with risk has grown with our exposure to it.
Some strategies for risk mitigation, such as insurance in which risk is shared among a pool of the insured, are relatively ancient, while others, such as the creation of fire departments, date to relatively recent times. Soon after WWII, operations research (O.R.) modeling and optimization tools began to be extended to civilian operations, including business and emergency services. As early as the 1960s, large cities such as New York used O.R. consultants to make capital investment decisions for emergency equipment, optimize the selection of fire station locations and arrange operations to maximize fire-fighting effectiveness. Statistical estimates of fire frequency and magnitude were combined with geographic and demographic information to quantify the demand and response to fire as the basis for optimization. Similar combinations of statistical data and O.R. tools have been used to guide investment through the selection of robust portfolios that balance risk and reward, to schedule flight operations to minimize delay, to set airline pricing to maximize yield, and so on.
From the very beginning, statistical application was limited by the amount of computation necessary to make decisions. For instance, David Salsburg estimates that Sir Ronald Fisher’s computations for one table in his “Studies in Crop Variation. I,” would have required a total of 185 hours of effort on the hand-cranked “millionaire” calculator available to him. Some of Fisher’s success at Rothamsted Research in popularizing statistics owed as much to his ability to simplify computations as to his framework for the analysis. He was justly celebrated for an ability to save experiments compromised by missing data or other defects through computational adjustments.
Practical statistical analysis as we know it is made possible by the type of software in the accompanying survey. These tools represent the remarkable convergence of mathematical theory, the computational power that became possible with the electronic computer and then the personal computer, and the development of programming tools and interfaces to make these tools available to any user. In addition, the refinement of graphical interfaces has also sparked a revolution in the display and exploration of data.
The computer is now critical for the collection and storage of data in virtually unlimited amounts, and for the computational power to perform the necessary calculations. As processor speeds have increased, classical parametric procedures have been supplemented with computer-intensive, nonparametric methods such as resampling. Many programs can also provide Monte Carlo methods to evaluate the sensitivity of results to assumptions. Likewise, without computer software most Bayesian methods would simply not be feasible.
The computer and Internet have revolutionized the acquisition and storage of data and access to it, and they provide the computational power to process the data numerically. Public and private sources exist for a wide range of data: demographic, production, investment, monetary, infection, disease, mortality and all manner of consumer and operational data. Whereas the classical statistical problem was to make the most of limited data (e.g., “Student’s” 1908 paper made it possible to deal with small samples), today the challenge is to process large data samples. This has given rise to new techniques in multivariate dimension reduction, data mining and projection-pursuit methods.
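To make the idea of resampling concrete, here is a minimal sketch of a percentile bootstrap interval for a sample mean, written in Python with only the standard library. The function name `bootstrap_ci`, its parameters and the delivery-time figures are illustrative assumptions, not taken from any product in the survey.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(sample).

    Hypothetical helper for illustration; the fixed seed only makes
    the sketch reproducible.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Draw n values with replacement, recompute the statistic, repeat.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read the interval end points off the sorted resampled statistics.
    lower = estimates[round((alpha / 2) * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up delivery times in days (illustrative data only).
delivery_days = [2.1, 3.4, 2.8, 5.9, 3.1, 2.5, 4.2, 3.7, 2.9, 6.3]
low, high = bootstrap_ci(delivery_days)
print(f"sample mean: {statistics.mean(delivery_days):.2f}")
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

The same loop works for a median or a trimmed mean by passing a different `stat`, which is much of the appeal: no distributional formula is needed, only repeated computation.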
The computer has made it possible to expand the range of what is meant by data, from purely numerical data to text and graphics. Documents and e-mails can be compared using word frequencies and even stochastic models, and, of course, user choices online can be logged and analyzed for patterns. Prediction of user choices is useful both to accelerate searching and to match advertising to users. To obtain an idea of the magnitude of data available, consider that the Netflix prize competition to predict user movie ratings was based on a training data set of more than 100 million ratings for over 480,000 users. The qualifying data used to evaluate the proposed algorithms consisted of more than 2.8 million user and movie combinations whose ratings had to be predicted. Even these huge samples are but a fraction of the data collected on the Web.
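As a rough illustration of comparing documents by word frequencies, the sketch below counts words in two short texts and computes the cosine similarity of the resulting count vectors. It is only one simple way to make such a comparison; the function names and sample messages are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors.

    0.0 means no words in common; values near 1.0 mean the words
    occur in nearly the same proportions.
    """
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up e-mail snippets (illustrative only).
msg_a = word_counts("The shipment was delayed by snow and ice")
msg_b = word_counts("Snow and ice delayed the shipment again")
msg_c = word_counts("Quarterly portfolio returns exceeded forecasts")

print(f"a vs b: {cosine_similarity(msg_a, msg_b):.2f}")
print(f"a vs c: {cosine_similarity(msg_a, msg_c):.2f}")
```

Messages that share most of their vocabulary score close to 1, while messages with no words in common score 0.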
Since many O.R. applications involve public and business applications, both government and census (demographic) data are often critical inputs to the analysis. The U.S. decennial census for 2010 has just been completed and its results announced. As specified by the U.S. Constitution, it is to be a direct enumeration of every person in the country. This requirement ensures that congressional representation follows changes in the population, but direct enumeration both greatly increases the cost of the census and has reduced the amount of information collected, in part to increase the direct response rate. The overall census cost is driven by the difficulty of enumerating all, including those who do not respond to the census forms. The difficulty has become so great that the Canadian government has decided that responding to the census will be optional rather than mandatory. This has led to considerable controversy in Canada among statisticians who fear that the validity of the data will be compromised and advocacy groups concerned that policy decisions will be adversely affected for groups whose participation is low.
The accompanying survey of products is an update of the survey published in 2009. The biennial statistical software survey in this issue provides capsule information about 20 products selected from 15 vendors. The tools range from general tools that cover the standard techniques of inference and estimation to tools for specialized activities such as nonlinear regression, forecasting and design of experiments. The product information contained in the survey was obtained from product vendors and is summarized in the following tables to highlight general features, capabilities and computing requirements and to provide contact information. Many of the vendors have extensive Web sites for further detailed information, and many provide demo programs that can be downloaded from these sites. No attempt is made to evaluate or rank the products, and the information provided comes from the vendors themselves. The survey will be available and updated on the Lionheart Publishing Web site (www.lionhrtpub.com). Vendors that were unable to make the publishing deadline will be added to the online survey.
Statistical add-ins for spreadsheets remain common and provide enhanced specialized capabilities. The spreadsheet is the primary computational tool in a wide variety of settings, familiar and accessible to all. Many procedures of data summarization, estimation, inference, basic graphics and even regression modeling can be added to spreadsheets in this way. An example is the Unistat add-in for Excel. The functionality of products for use with spreadsheets continues to grow and now includes risk analysis and Monte Carlo sampling, as in Oracle Crystal Ball.
Dedicated general and special-purpose statistical software generally has a wider variety and depth of analysis than is available in the add-in software. For many specialized techniques such as forecasting, design of experiments and so forth, a statistical package is appropriate. Moreover, new procedures are likely to become available first in the statistical software and only later be added to the add-in software. In general, statistical software plays a distinct role on the analyst’s desktop and, provided that data can be freely exchanged among applications, each part of an analysis can be made with the most appropriate (or convenient) software tool.
An important feature of statistical programs is the importation of data from as many sources as possible, to eliminate the need for data entry when data is already available from another source. Most programs have the ability to read from spreadsheets and selected data storage formats. Also highly visible in this survey is the growth of data warehousing and “data mining” capabilities, programs and training. Data mining tools attempt to integrate and analyze data from a variety of sources (and purposes) to look for relations that could not be found in the individual data sets alone. Within the survey we observe several specialized products, such as ExpertFit and STAT::FIT, which are more narrowly focused on distribution fitting than general statistics but are of particular use to developers of stochastic models and simulations.
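For readers unfamiliar with distribution fitting, the following Python sketch shows the idea in its simplest form: estimate the rate of an exponential distribution by maximum likelihood and measure how far the fitted curve lies from the data. It does not reproduce the method of ExpertFit, STAT::FIT or any other surveyed product; the function names and service-time data are assumptions made for illustration.

```python
import math


def fit_exponential(data):
    """Maximum-likelihood rate for an exponential model: 1 / sample mean."""
    return len(data) / sum(data)


def ks_distance(data, rate):
    """Largest gap between the empirical CDF and the fitted exponential CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap


# Made-up service times in minutes (illustrative only).
service_minutes = [0.4, 1.7, 0.9, 3.2, 0.2, 2.5, 1.1, 0.6, 4.8, 1.3]
rate = fit_exponential(service_minutes)
print(f"fitted rate: {rate:.3f} per minute")
print(f"Kolmogorov-Smirnov distance: {ks_distance(service_minutes, rate):.3f}")
```

A simulation developer would typically repeat this for several candidate distributions and compare the resulting distances, which is the kind of search the specialized fitting tools automate.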
James J. Swain (email@example.com) is professor and chair, Department of Industrial and Systems Engineering and Engineering Management, at the University of Alabama in Huntsville. He is a member of INFORMS, IIE, ASA and ASEE.
- David Salsburg, 2001, “The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century,” W. H. Freeman.
- “Student” (pseudonym), 1908, “The Probable Error of a Mean,” Biometrika, Vol. 6, No. 1, pp. 1-25.