Predictive Analytics: Sound data science

Avoiding the most pernicious prediction pitfall.

By Eric Siegel

Are orange cars the least likely lemons?

Data science and predictive analytics’ explosive popularity promises meteoric value, but a common misapplication readily backfires. The number crunching only delivers if a fundamental – yet often omitted – fail-safe is applied.

Prediction is booming. Data scientists have the “sexiest job of the 21st century” (as Thomas Davenport and U.S. Chief Data Scientist D.J. Patil declared in 2012). Fueled by the data tsunami, we’ve entered a golden age of predictive discoveries. A frenzy of analysis churns out a bonanza of colorful, valuable and sometimes surprising insights [1]:

  • People who “like” curly fries on Facebook are more intelligent.
  • Typing with proper capitalization indicates creditworthiness.
  • Users of the Chrome and Firefox browsers make better employees.
  • Men who skip breakfast are at greater risk for coronary heart disease.
  • Credit card holders who go to the dentist are better credit risks.
  • High-crime neighborhoods demand more Uber rides.

Look like fun? Before you dive in, be warned: This spree of data exploration must be tamed with strict quality control. It’s easy to get it wrong, crash and burn – or at least end up with egg on your face.

In 2012, a Seattle Times article led with an eye-catching predictive discovery: “An orange used car is least likely to be a lemon” [2]. This insight came from a predictive analytics competition to detect which used cars are bad buys (lemons). While insights also emerged pertaining to other car attributes – such as make, model, year, trim level and size – the apparent advantage of being orange caught the most attention. Responding to quizzical expressions, data wonks offered creative explanations, such as the idea that owners who select an unusual car color tend to have more of a “connection” to and take better care of their vehicle.

Figure 1: Are orange cars really less likely to turn into lemons?

Examined alone, the “orange lemon” discovery appeared sound from a mathematical perspective. The specific result is shown in Figure 1.

According to Figure 1, orange cars turn out to be lemons one-third less often than average. Put another way, if you buy a car that’s not orange, you increase your risk by 50 percent.

Well-established statistics appeared to back up this “colorful” discovery. A formal assessment indicated it was statistically significant, meaning that the chances were slim this pattern would have appeared only by random chance. It seemed safe to assume the finding was sound. To be more specific, a standard mathematical test indicated there was less than a 1 percent chance this trend would show up in the data if orange cars weren’t actually more reliable.

But something had gone terribly wrong. The “orange car” insight later proved inconclusive. The statistical test had been applied in a flawed manner; the press had run with the finding prematurely. As data gets bigger, so does a potential pitfall in the application of common, established statistical methods.

The Little Gotcha of Big Data

“The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.”
 – Bertrand Russell

Big data brings big potential – but also big danger. With more data, a unique pitfall often dupes even the brightest of data scientists. This hidden hazard can undermine the process that evaluates for statistical significance, the gold standard of scientific soundness. And what a hazard it is! A bogus discovery can spell disaster. You may buy an orange car – or undergo an ineffective medical procedure – for no good reason. As the aphorisms tell us, bad information is worse than no information at all; misplaced confidence is seldom found again.

This peril seems paradoxical. If data’s so valuable, why should we suffer from obtaining more and more of it? Statistics has long advised that having more examples is better. A longer list of cases provides the means to more scrupulously assess a trend. Can you imagine what the downside of more data might be? As you’ll see in a moment, it’s a thought-provoking, dramatic plot twist.

The fate of science – and sleeping well at night – depends on deterring the danger. The very notion of empirical discovery is at stake. To leverage the extraordinary opportunity of today’s data explosion, we need a surefire way to determine whether an observed trend is real, rather than a random artifact of the data. How can we reaffirm science’s trustworthy reputation?

Statistics approaches this challenge in a very particular way. It tells us the chances the observed trend could randomly appear even if the effect were not real. That is, it answers this question [3]:

Question that statistics can answer: If orange cars were actually no more reliable than used cars in general, what would be the probability that a trend this strong – depicting orange cars as more reliable – would show in data anyway, just by random chance?

With any discovery in data, there’s always some possibility we’ve been “Fooled by Randomness,” as Nassim Taleb titled his compelling book. The book reveals the dangerous tendency people have to subscribe to unfounded explanations for their own successes and failures, rather than correctly attributing many happenings to sheer randomness. The scientific antidote to this failing is probability, which Taleb affectionately dubs “a branch of applied skepticism.”

Statistics is the resource we rely on to gauge probability. It answers the orange car question above by calculating the probability that what’s been observed in data would occur randomly if orange cars actually held no advantage. The calculation takes data size into account – in this case, there were 72,983 used cars varying across 15 colors, of which 415 were orange [4]. The calculated answer: under 0.68 percent.

Looks like a safe bet. Common practice considers this risk acceptably remote, low enough to at least tentatively believe the data. But don’t buy an orange car just yet – or write about the finding in a newspaper for that matter.

What Went Wrong: Accumulating Risk

“In China when you’re one in a million, there are 1,300 people just like you.”
– Bill Gates

So if there had only been a 1 percent long shot that we’d be misled by randomness, what went wrong?

The experimenters’ mistake was to not account for running many small risks, which had added up to one big one. In addition to checking whether being orange is predictive of car reliability, they also checked each of the other 14 colors, as well as the make, model, year, trim level, type of transmission, size and more. For each of these factors, they repeatedly ran the risk of being fooled by randomness.

Probability is relative, affected entirely by context. With additional background information, a seemingly unlikely event turns out to be not so special after all. Imagine your friend calls to tell you, “I won the jackpot at hundred-to-one odds!” You might get a little excited. “Wow!”

Now imagine your friend adds, “By the way, I’m only talking about one of 70 times that I spun the jackpot wheel.” The occurrence that had at first seemed special suddenly has a new context, positioned alongside a number of less remarkable episodes. Instead of exclaiming “wow,” you might do some arithmetic. The probability of losing a spin is 99 percent. If you spin twice, the chance of losing both is 99 percent x 99 percent, which is about 98 percent. Although you’ll probably lose both spins, why stop at two? The more times you spin, the lower the chances of never winning once. To figure out the probability of losing 70 times in a row, multiply together 70 copies of 99 percent, aka 0.99 raised to the power of 70. That comes to just under 0.5. Let your friend know that nothing special happened – the odds of winning at least once were about 50/50.
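The jackpot arithmetic is easy to check directly. The short Python sketch below assumes every spin is independent and carries the same 1 percent chance of winning; the function name is made up for illustration.

```python
def prob_at_least_one_win(win_prob: float, spins: int) -> float:
    """Chance of winning at least once in `spins` independent tries."""
    prob_all_losses = (1.0 - win_prob) ** spins  # e.g. 0.99 ** 70
    return 1.0 - prob_all_losses


for spins in (1, 2, 70):
    chance = prob_at_least_one_win(0.01, spins)
    print(f"{spins:>2} spins: {chance:.1%} chance of at least one win")
```

With 70 spins the chance of at least one win lands close to one half, which is the 50/50 figure quoted to the friend.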

Special cases aren’t so special after all. By the same sort of reasoning, we might be skeptical about the merits of the famed and fortuned. Do the most successful elite hold talents as elevated as their singular status? As Taleb put it in “Fooled by Randomness,” “I am not saying that Warren Buffett is not skilled; only that a large population of random investors will almost necessarily produce someone with his track records just by luck.”

Play enough and you’ll eventually win. Likewise, press your luck repeatedly and you’ll eventually lose. Imagine your same well-intentioned friend calls to tell you, “I discovered that orange cars are more reliable, and the stats say there’s only a 1 percent chance this phenomenon would appear in the data if it weren’t true.” You might get a little impressed. “Interesting discovery!”

Now imagine your friend adds, “By the way, I’m only talking about one among dozens of car factors – my computer program systematically went through and checked each one.” Both of your friend’s stories enthusiastically led with a “remarkable” event – a jackpot win or a predictive discovery. But the numerous other less remarkable attempts – that often go unmentioned – are just as pertinent to each story’s conclusion.

Wake up and smell the probability. Imagine we test 70 characteristics of cars that in reality are not predictive of lemons. But each test suffers, say, a 1 percent risk the data will falsely show a predictive effect just by random chance. The accumulated risk piles up. As with the jackpot wheel, there’s a 50/50 chance the unlikely event will eventually take place – that you will stumble upon a random perturbation that, considered in isolation, is compelling enough to mislead.
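The same accumulation can be watched in a small simulation. This is a minimal sketch under two simplifying assumptions that the article does not state: the 70 tests are independent of one another, and each useless factor triggers a false alarm with probability exactly 1 percent. All names are illustrative.

```python
import random


def chance_of_bogus_finding(num_tests: int, alpha: float,
                            trials: int = 20_000, seed: int = 0) -> float:
    """Estimate how often at least one of `num_tests` useless factors
    looks significant at level `alpha` purely by chance."""
    rng = random.Random(seed)
    fooled = 0
    for _ in range(trials):
        # For a factor with no real effect, a false alarm is modeled as
        # a random draw falling below alpha.
        if any(rng.random() < alpha for _ in range(num_tests)):
            fooled += 1
    return fooled / trials


simulated = chance_of_bogus_finding(num_tests=70, alpha=0.01)
closed_form = 1 - (1 - 0.01) ** 70
print(f"simulated:   {simulated:.3f}")
print(f"closed form: {closed_form:.3f}")
```

The simulated share of "searches" that turn up at least one bogus finding should sit near the closed-form value, roughly the coin toss described above.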

The Potential and Danger of Automating Science: Vast Search

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but rather ‘Hmm… that’s funny’…”
– Isaac Asimov

A tremendous potential inspires us to face this peril: Predictive modeling automates scientific discovery. Although it may seem like an obvious thing to do in this computer age, trying out each predictor variable is a dramatic departure from the classic scientific method of developing a single hypothesis and then testing it. Your computer essentially acts as hundreds or even thousands of scientists by conducting a broad, exploratory analysis, automatically evaluating an entire batch of predictors. This aggressive hunt for any novel source of predictive information leaves no stone unturned. The process is key to uncovering valuable, unforeseen insights.

Automating this search for valuable predictors empowers science, lessening its dependence on ever-elusive serendipity. Instead of waiting to inadvertently stumble upon revelations or racking our brains for hypotheses, we rely less on luck and hunches by systematically testing many factors.

But as exciting a proposition as it is, this automation of data exploration builds up the risk of eventually being fooled – at one time or another – by randomness. This inflation of risk comes as a consequence of assessing many characteristics of used cars, for example. The power of automatically testing a batch of predictors may serve us well, but it also exposes us to the very real risk of bogus discoveries.

Let’s call this issue vast search – the term that industry leader John Elder coined for this form of automated exploration and its associated peril. Repeatedly identified anew across industries and fields of science, this issue is also called p-hacking or the multiple comparisons trap [5]. Elder warns, “The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!”

Statistics darling Nate Silver jumped straight to the issue of vast search when asked generally about the topic of big data on Freakonomics Radio: “You have so many lottery tickets when you can run an analysis on a [large data set] that you’re going to have some one-in-a-million coincidences just by chance alone.”

Bigger data isn’t the problem; more specifically, it’s wider data. When prepared for predictive analytics, data grows in two dimensions – it’s a table (see Table 1).

Table 1: Data for predicting bad buys among used cars. The complete data is both wider and longer.

As you accrue more examples of cars, people or whatever you’re predicting, the table grows longer (more rows, aka training cases). That’s always a good thing. The more training cases to analyze, the more statistically sound [6]. Expanding in the other dimension, each row widens (more columns) as more factors – aka predictor variables – are accrued. A certain factor such as car color may only amount to a single column in the data, but since we look at each possible color individually, it has the virtual effect of adding 15 columns to the width, one per color. Overall, the sample data in Table 1 is not nearly as wide as data often gets, but even in this case the vast search effect is at play. With wider and wider data, we can only tap the potential if we can avoid the booby trap set by vast search.
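To see how a single column quietly widens the table, here is a small Python sketch that expands a categorical column into one yes/no column per value. The rows, column names and colors are made up for illustration and are not taken from Table 1.

```python
def one_hot(rows, column):
    """Replace one categorical column with one 0/1 indicator column per value."""
    values = sorted({row[column] for row in rows})
    expanded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}={value}"] = int(row[column] == value)
        expanded.append(new_row)
    return expanded


cars = [  # hypothetical rows, for illustration only
    {"color": "orange", "year": 2006, "is_lemon": 0},
    {"color": "silver", "year": 2004, "is_lemon": 1},
    {"color": "white", "year": 2005, "is_lemon": 0},
]
wide = one_hot(cars, "color")
print(len(cars[0]), "columns before,", len(wide[0]), "columns after")
```

Each indicator column is a separate candidate to test, so a 15-color column contributes 15 chances to be fooled rather than one.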

A Fail-Safe for Sound Results

“There are three kinds of lies: lies, damned lies and statistics.”
– Benjamin Disraeli

“If you torture the data long enough, it will confess.”
– Ronald Coase

To understand what sort of fail-safe mechanism we need, let’s revisit the misleading “orange lemons” discovery (Figure 1). The 12.3 percent vs. 8.2 percent result (the lemon rate overall vs. among orange cars) is calculated from four numbers: There were 72,983 cars, of which 8,976 were lemons. There were 415 orange cars, of which 34 were lemons.

The standard method – the one that misled researchers as well as the press – evaluates for statistical significance based only on those four numbers. When fed these as input, the test provides a positive result, calculating there was only a 0.68 percent chance we would witness a difference that extreme among orange cars if they were in actuality no less prone to be lemons than cars of other colors.
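For readers who want to retrace the calculation, the sketch below runs a one-sided test of equal proportions on the four numbers, comparing orange cars with all other cars. It uses a pooled z-test with a normal approximation, which is an assumption about the exact variant; details such as a continuity correction shift the result a little, so expect a value under 1 percent rather than the precise published figure.

```python
import math


def one_sided_two_proportion_p(hits_a, n_a, hits_b, n_b):
    """P-value for 'group A's true rate is lower than group B's',
    using a pooled two-proportion z-test (normal approximation)."""
    rate_a, rate_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Standard normal CDF at z, i.e. the lower-tail probability.
    return 0.5 * math.erfc(-z / math.sqrt(2))


total_cars, total_lemons = 72_983, 8_976
orange_cars, orange_lemons = 415, 34

overall_rate = total_lemons / total_cars
orange_rate = orange_lemons / orange_cars
p_value = one_sided_two_proportion_p(
    orange_lemons, orange_cars,
    total_lemons - orange_lemons, total_cars - orange_cars,
)
print(f"overall lemon rate: {overall_rate:.1%}")
print(f"orange lemon rate:  {orange_rate:.1%}")
print(f"orange vs. overall: {orange_rate / overall_rate - 1:+.0%}")
print(f"one-sided p-value:  {p_value:.4f}")
```

Note that nothing in this calculation knows how many other factors were examined; it treats the orange comparison as if it were the only one ever run.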

But these four numbers alone do not tell the whole story – the context of the discovery also matters. How vast was the search for such discoveries? How many other factors were also checked for a correlation with whether a car is a lemon?

In other words, if a data scientist hands you these four numbers as “proof” of a discovery, you should ask what it took to find it. Inquire, “How many other things did you also try that came up dry?”

With the breadth of search taken into account, the “orange lemon” discovery collapses. Confidence diminishes, and it shows as inconclusive. Even if we assume the other 14 colors were the only other factors examined, statistical methods estimate a much less impressive 7.2 percent probability of stumbling by chance alone upon a bogus finding that appears this compelling [7]. Although 7.2 percent is lower odds than a coin toss, it’s no long shot; by common standards, this is not a publishable result. Moreover, 7.2 percent is an optimistic estimate. We can assume the risk was even higher than that (i.e., worse) since other factors such as car make, model and year were also available, rendering the search even wider and the opportunities to be duped even more plentiful.
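The general idea of shuffling the outcome column can be sketched in a few lines of Python. This is an illustration of the principle on synthetic data, not a reproduction of the analysis cited in note 7: the data set, the 15 placeholder color labels, the z-score statistic and all function names are assumptions made for the example.

```python
import math
import random


def lowest_z(colors, is_lemon):
    """Most negative z-score of any color's lemon rate vs. the overall rate."""
    overall = sum(is_lemon) / len(is_lemon)
    size, lemons = {}, {}
    for color, lemon in zip(colors, is_lemon):
        size[color] = size.get(color, 0) + 1
        lemons[color] = lemons.get(color, 0) + lemon
    return min(
        (lemons[c] / size[c] - overall) / math.sqrt(overall * (1 - overall) / size[c])
        for c in size
    )


def shuffled_search_p(colors, is_lemon, n_shuffles=200, seed=0):
    """Fraction of label shuffles whose best-looking color is at least as
    extreme as the best-looking color in the real data."""
    rng = random.Random(seed)
    observed = lowest_z(colors, is_lemon)
    labels = list(is_lemon)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # breaks any real link between color and outcome
        if lowest_z(colors, labels) <= observed:
            as_extreme += 1
    return as_extreme / n_shuffles


# Synthetic data in which color has no effect on being a lemon.
data_rng = random.Random(42)
colors = [f"color_{data_rng.randrange(15)}" for _ in range(3000)]
is_lemon = [int(data_rng.random() < 0.12) for _ in range(3000)]
print("search-adjusted probability:", shuffled_search_p(colors, is_lemon))
```

Because every shuffle repeats the whole search across all colors, the resulting fraction reflects the breadth of the search rather than a single comparison taken in isolation.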

Inconclusive results must not be overstated. It may still be true that orange cars are less likely to be lemons, but the likelihood this would have appeared in the data by chance alone is too high to put a lot of faith in it. There’s not enough evidence to rigorously support the hypothesis. It is, at least for now, relegated to “a fascinating possibility,” only provisionally distinct from any untested theories one might think up.

Want conclusive results? Then get longer data, i.e., more rows of examples. Adequately rigorous fail-safes that account for the breadth of search set a higher bar. They serve as a more scrupulous filter to eliminate inconclusive findings before they get applied or published. To compensate for this strictness and increase the opportunity to nonetheless attain conclusive results, the best recourse is elongating the list of cases. If the search is vast – that is, if the data is wide – then findings will need to be more compelling in order to pass through the filter. To that end, if there are ample examples with which to confirm findings – in other words, if the data makes up for its width by also being longer – then legitimate findings will have the empirical support they need to be validated.
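A rough illustration of why length compensates for width: the sketch below holds the observed lemon rates fixed and asks what the single-test p-value would be with two or four times as many rows, judged against a stricter bar. The bar used here is a simple Bonferroni-style threshold (1 percent divided by the number of factors searched), a stand-in chosen for brevity rather than the method described in this article. Holding the rates fixed is purely hypothetical; if orange cars hold no true advantage, the gap would be expected to shrink as rows are added.

```python
import math


def one_sided_p(rate_a, n_a, rate_b, n_b):
    """Lower-tail p-value of a pooled two-proportion z-test."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.5 * math.erfc((rate_b - rate_a) / std_err / math.sqrt(2))


NUM_FACTORS_SEARCHED = 15                  # the 15 colors; the real search was wider
strict_bar = 0.01 / NUM_FACTORS_SEARCHED   # Bonferroni-style threshold (assumption)

orange_rate, orange_n = 34 / 415, 415
other_rate, other_n = 8_942 / 72_568, 72_568

for multiplier in (1, 2, 4):
    p = one_sided_p(orange_rate, orange_n * multiplier,
                    other_rate, other_n * multiplier)
    verdict = "clears" if p < strict_bar else "misses"
    print(f"{multiplier}x rows: p = {p:.2e}, {verdict} the bar of {strict_bar:.1e}")
```

At the original size the finding misses the stricter bar; only with more rows showing the same gap would it pass.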

The potential of data will prevail so long as there are enough training examples to correctly discern which predictive discoveries are authentic. In this big data tsunami, you’ve got to either sharpen your surfing skills or get out of the water.

Eric Siegel, Ph.D., is the founder of the Predictive Analytics World conference series (cross-sector events), executive editor of The Predictive Analytics Times and a former computer science professor at Columbia University. This article was adapted and reprinted with permission of the publisher, Wiley, from the book, “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die,” revised and updated edition, by Eric Siegel (Wiley, January 2016).

Notes & References

  1. For more details on these findings, see the section on “Bizarre and Surprising Insights” within the Notes for the book, “Predictive Analytics,” available as a PDF online. For further reading on this article’s overall topic, look in the section, “Further Reading on Vast Search” within the same document.
  2. This discovery was also featured by The Huffington Post, The New York Times, National Public Radio, The Wall Street Journal and The New York Times bestseller, “Big Data: A Revolution That Will Transform How We Live, Work and Think.”
  3. The notion that orange cars have no advantage is called the null hypothesis. The probability the observed effect would occur in data if the null hypothesis were true is called the p-value. If the p-value is low enough – e.g., below 1 percent or 5 percent – then a researcher will typically reject the null hypothesis as too unlikely, and view this as support for the discovery, which is thereby considered statistically significant.
  4. The applicable statistical method is a one-sided equality of proportions hypothesis test, which calculated the p-value as under 0.0068.
  5. More synonyms: multiple hypothesis testing, researcher degrees of freedom and cherry-picking findings.
  6. This only holds true under the assumption you have a representative sample, e.g., an unbiased, random selection of cases.
  7. This probability was estimated with a method called target shuffling. For details, see “Are Orange Cars Really not Lemons?” by John Elder and Ben Bullard.