P-value Primer: P ∈ OR (P-values in operations research)

You don’t need a license to download R, but you should have a good understanding of p-values.

By Scott Nestler and Harrison Schramm

Several of our colleagues asked, “Why is this article necessary? Shouldn’t the American Statistical Association’s (ASA) statement on p-values be enough?” Our answer is affirmative: the ASA’s statement is one of the best-written documents we have seen this year, and nothing we contribute here should be seen as a substitute for reading the source article in its entirety. Our purpose is to draw the O.R. community’s attention to an issue that affects us every day in practice, but that we may have missed in our finite journal attention spans. It is of particular importance to our community, which tends to have strong computer programming skills and a “can do” attitude.

In summary, you don’t need a license to download R – but you should have an understanding of what a statement such as “p-value: .096” really means. That understanding should go deeper than a simple comparison with a threshold value.

The p-value is a workhorse of inferential statistics.

If you look for “O.R.” in the alphabet you find the letter “p” squarely in the middle. This circumstance of language happens to reflect that statistical hypothesis testing, based on a “p-value,” is central to much of what we do. Many of the advanced analytic methods employed by operations researchers are based in, or draw on, statistical analyses. It is therefore no coincidence that the first content chapter of Morse and Kimball’s “Methods of Operations Research” (1946) is devoted to probability.

The p-value concept was introduced in the late 18th century by the great Pierre Laplace – the father of many transformative ideas. Formalization by Karl Pearson and advocacy by Ronald Fisher in the early 20th century (e.g., “The Lady Tasting Tea” experiment) led to establishment of the p-value as a workhorse of inferential statistics and the notion of .05 (or 1 in 20) as a commonly accepted surrogate for statistical significance [1, 2]. The advantage – and disadvantage – of the p-value is that it reduces complex hypothesis tests to a single diagnostic value, and this value is “level” across diverse methods; we interpret the meaning of the p-value the same way regardless of the underlying mathematics.

Earlier this year, the ASA published “ASA Statement on Statistical Significance and P-values,” the culmination of a two-year endeavor [3]. The use of p-values is an issue of statistical practice as opposed to theory. It affects each of us in at least one way or another—as a teacher, a student, a researcher or a practitioner, or simply as a citizen who consumes the products and policies that are evaluated by it. In addition to the statement itself, the published version in The American Statistician (TAS) includes commentary from more than two dozen contributors that is also worth reading.

Definitions Matter

What’s in a name? Many of us can likely recite the definition of a p-value from a stats course we took at some point along the way. Our response, if put on the spot, would be, “a p-value is the probability of observing a result at least this extreme, given that the null hypothesis is true.” The official, though informal, definition (per the ASA) is:

The probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
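To make the definition concrete, here is a minimal R sketch of our own (made-up data, and a permutation test we chose for illustration; nothing here is prescribed by the ASA). It computes a p-value exactly as the definition reads: the proportion of data summaries, generated under a specified null model in which the group labels are exchangeable, that are equal to or more extreme than the observed summary.

    # Illustrative only: estimate a p-value by permutation for the difference
    # in sample means between two hypothetical groups.
    set.seed(1)
    a <- c(5.1, 4.8, 5.6, 5.0, 5.3)   # hypothetical measurements, group A
    b <- c(4.4, 4.9, 4.2, 4.7, 4.5)   # hypothetical measurements, group B
    observed <- mean(a) - mean(b)

    pooled <- c(a, b)
    perm_diffs <- replicate(10000, {
      s <- sample(pooled)             # reshuffle labels under the null model
      mean(s[1:5]) - mean(s[6:10])
    })

    # "equal to or more extreme than its observed value" (two-sided)
    p_value <- mean(abs(perm_diffs) >= abs(observed))
    p_value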

Figure 1: Two data sets exhibiting p = .0499 (top) and p = .0501 (bottom) for simple linear regression. It is obvious why the data set on the bottom has a problem (outlier at x = 5); it is not obvious why the data set on the top should be “acceptable.”

Our statistics professors were strict adherents to what the ASA calls the “bright-line” method of hypothesis testing:

  1. Choose a significance level α (alpha).
  2. Compute p.
  3. If p < α, reject the null hypothesis.

There was explicitly no allowance for the “amount” of difference between p and α, for consideration of the underlying process from which the data were collected, or for the ramifications of the decision to be made. The ASA has recommended that we reconsider this procedure.
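For reference, the bright-line recipe takes only a few lines of R. The sketch below uses simulated data and a one-sample t-test purely for illustration; the point is that too often only the final true/false comparison gets reported, rather than the p-value itself.

    # The "bright-line" recipe, mechanically applied to illustrative data.
    set.seed(2)
    x <- rnorm(30, mean = 0.4)       # a hypothetical sample
    alpha <- 0.05                    # step 1: choose a significance level
    p <- t.test(x, mu = 0)$p.value   # step 2: compute p
    p                                # the full p-value, worth reporting on its own
    p < alpha                        # step 3: the binary verdict the ASA cautions against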

Six Principles for Proper Use and Interpretation of P-Values

The most important part of the ASA statement is the listing and subsequent discussion of six principles. In the following section, italicized words are the verbatim principles as they appear in the ASA statement; bold highlights are our attempt to emphasize key points in the principles; and normal text contains our commentary.

  1. P-values can indicate how incompatible the data are with a specified statistical model. As the authors indicate, the (in)compatibility identified by a p-value is between the data and the null hypothesis (something like “the population means are all equal”) of a specific model, IF that null hypothesis and any supporting assumptions actually are true.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. While some students in introductory statistics courses make the error mentioned in the first part of this principle, more experienced purveyors of statistical knowledge (authors included) have been known to make the second error. The ASA authors sum this point up as follows: “The p-value … is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.” Some rumination may be advisable here.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. Of all the stated principles, this is the one that we think is most deserving of our attention. Deciding whether an effect is “statistically significant” based on some pre-specified level of significance (like α = 0.05) is not magical. There is little practical difference between a p-value of .049 and a p-value of .051. Instead of reporting the result of a t-test as t(59) = 1.84, p < 0.05, we should report the actual p-value: t(59) = 1.84, p = 0.0354. Then, allow the person interpreting the result of the test to consider it in terms of practical, rather than statistical, significance.
  4. Proper inference requires full reporting and transparency. Consider a case study on ethics and data in which a researcher omits a data point because it does not strengthen the argument. Even students in a first-year statistics course see problems with that. However, selective use of inference through “p-hacking” is rampant in the literature [4]. The ASA authors emphatically state, “Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted and all p-values computed.” This would prevent the “Green jelly beans cause acne” finding presented in XKCD 882, titled “Significant” [5]. P-hacking is prevalent enough, and causes enough concern, to have been addressed in a recent 20-minute segment of John Oliver’s “Last Week Tonight” [6].
    By “p-hacking” we mean the following generalized procedure [7] (a runnable version appears just after this list):
    repeat {
      p <- Experiment()   # placeholder: collect new data and test again
      if (p < alpha) break
    }
    This is more than unethical; it is just plain wrong, and it erodes public confidence in science.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. As previously mentioned under principle No. 3, statistical significance and practical (e.g., scientific, human, economic) significance are not the same thing. A smaller p-value does not necessarily imply the presence of a larger effect. As observed with recent “big data” sets, with a large enough sample size even a minuscule effect of no practical consequence can generate a tiny p-value [8].
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. A p-value is just one of many tools in the quiver of statisticians or other analysts employing statistical methods. While the ASA is not suggesting abandoning the use of p-values, they do propose using them in conjunction with other approaches, such as: confidence, credibility and prediction intervals; Bayesian methods; decision-theoretic modeling; and false discovery rates. Of course, many of these approaches rely on additional assumptions, but they may also more directly get at measuring the size of an effect or the degree of correctness of hypotheses.
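As a postscript to principle No. 4, the following simulation (our own illustration in R, in the spirit of note 7) makes the p-hacking loop sketched above runnable. It repeatedly “experiments” on pure noise, where no effect exists by construction, and stops the moment p < α.

    # P-hacking on pure noise: no real effect exists, yet "significance" is
    # guaranteed if we simply keep experimenting until the p-value cooperates.
    set.seed(3)
    alpha <- 0.05
    tries <- 0
    repeat {
      tries <- tries + 1
      x <- rnorm(20)                   # a fresh "experiment" on noise alone
      p <- t.test(x, mu = 0)$p.value
      if (p < alpha) break             # stop as soon as the result looks significant
    }
    tries   # number of experiments needed to manufacture a false discovery

Because each independent try has probability α of crossing the threshold when the null hypothesis is true, the loop terminates after roughly 1/α ≈ 20 tries on average, which is exactly the false-positive rate the .05 convention was meant to control.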

Implications for Operations Research Professionals

The ASA statement should give us pause to consider what it means for operations research analysts, management scientists and analytics professionals. Don’t just take our word for it that this is worthy of your consideration: take 30 minutes to read the ASA statement in TAS and the associated commentary. Then, consider how to apply the six principles discussed above. Ensure that we (collectively) do not perpetuate improper usage of, or reliance on, p-values in the courses we teach and in our research and practice activities. We are not suggesting that p-values be banned, as the editors of Basic and Applied Social Psychology have done; they remain a very useful tool, and we both plan to continue using them in our respective practices. Additionally, consider how the ASA statement and principles are interconnected with the issues of reproducibility and replicability.

The ASA is also suggesting that statistics may not be as exact a science as many consider it to be. We have now been given permission to answer questions about matters of statistical practice with “it depends” and “too close to call.” These are answers we have previously been culturally hesitant to provide. We wonder how much of this shift in thinking applies to the other sub-disciplines of O.R. This is an open question that we (and we hope you as well) will ponder in the coming months. Please share your thoughts on this subject with our community through INFORMS Connect. We look forward to seeing how this discussion develops.

Scott Nestler, PhD, CAP, PStat, is an associate teaching professor in the newly formed Department of Information Technology, Analytics and Operations (ITAO), Mendoza College of Business, at the University of Notre Dame.

Harrison Schramm, CAP, PStat, is a principal operations research analyst at CANA Advisors, LLC.

Notes & References

  1. http://www.phil.vt.edu/dmayo/PhilStatistics/b%20Fisher%20design%20of%20experiments.pdf
  2. http://www.radford.edu/~jaspelme/611/Spring-2007/Cowles-n-Davis_Am-Psyc_orignis-of-05-level.pdf
  3. http://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108
  4. http://www.ncbi.nlm.nih.gov/pubmed/22006061
  5. http://xkcd.com/882/
  6. http://www.washingtonpost.com/news/speaking-of-science/wp/2016/05/09/john-oliver-explains-why-so-much-science-you-read-about-is-bogus/
  7. You can p-hack at home! Execute the following command in R: shapiro.test(rexp(10)). See how many tries it takes to get the computer to conclude that data drawn from an exponential distribution are actually normal.
  8. http://dx.doi.org/10.1287/isre.2013.0480