The evolution of analytics

A historical perspective on a fast-developing field and its expected impact on science, now and in the future.

By Harsha Rao and Deepali Jain

Science entered the 19th century with a strong philosophical vision known as the “clockwork universe,” which compares the universe to a perfect, ticking, mechanical clock. Precise mathematical formulas as determined by Newtonian physics were used to describe relationships such as the motion of planets, comets and moons. Some success had been achieved in finding laws of chemistry. Darwin’s law of natural selection provided a start to the understanding of evolution.

However, given the very imprecise nature of measurement, Pierre Simon Laplace proposed in 1820 that the errors associated with scientific observations followed a pattern. This simple thought started the statistical revolution. While it made sense to expect that more precise measurements would lead to diminishing errors, the errors in fact grew as measurements became more precise, challenging the deterministic view of the world.

By the end of the 19th century the deterministic approach to science had collapsed, and the scientific community was ready to embrace a probabilistic approach proposed by Karl Pearson. He demonstrated that the measurements themselves, and not just the errors associated with them, are probabilistic in nature and have inherent randomness. Before Pearson, science dealt with precise “ideas” such as the mathematical laws that described the motion of planets. Pearson proposed that the “ideas” of science are not observable and hence not precise; they can only be estimated using mathematical functions and data.

This transformative shift from a deterministic perspective to a probabilistic perspective laid the foundation for data analysis, which evolved further during the 20th century. Starting in 1921, Sir R. A. Fisher, often called the “Father of Statistics,” published a series of remarkable papers that illustrated how to design experiments and factor out external effects that are not part of the experimental design. He developed most of the methods of “significance” testing and defined the probability, now known as the p-value, that allows one to declare a result significant.

The idea of significance testing was revolutionized in 1933 when Jerzy Neyman and Egon Pearson formulated hypothesis testing and formalized the process of decision-making. However, data analysis was not recognized in the mainstream statistical community until John Tukey showed how to explore data and draw inferences from it.

Advances in Analytics

Through the early part of the 20th century, the disciplines of statistics and mathematics started to mature. Advances in both helped solve complex real-life problems, paving the way for science to be revolutionized. In operations research, a class of meta-heuristic algorithms simplified complex optimization routines. Machine learning improved the ability to predict outcomes from many different parameters, since its algorithms are not bound by the rules of probability and inference. However, these algorithms and techniques were used minimally outside of academia until the 21st century because of their high computing requirements, which only universities could afford.

In the last 20 years, two significant achievements have made the use of analytics mainstream. First, the introduction of personal computers drastically reduced computing costs, giving companies and individuals access to heavy-duty computing. Second, the Internet and other technologies such as sensors have made it possible to measure many more activities, bringing about an explosion of data. The combined impact of improved data generation and collection, lower computing costs and advances in statistics has led to significant progress in science, specifically in physics and medicine, where major breakthroughs are charting the roadmap for the future.

Understanding the Universe

The field of physics has advanced over the past three centuries with the help of statistics. It has also contributed significantly to the development of statistics (e.g., the discovery of ordinary least squares, which grew out of astronomical work done by Gauss). More recently, the growth in astrophysics, condensed matter physics and particle physics has been aided by improvements in data capture and measurement. The Standard Model of particle physics is the result of 70 years of research. It explains what the world is made of and what holds it together, describing the way subatomic particles are bound together to create atoms and matter. However, the most critical piece of the Standard Model was still missing: the coveted Higgs boson, an elementary particle theorized in 1964 to explain why fundamental particles have mass.

Scientists have been working to uncover the mysteries of this particle for almost 40 years by colliding stable particles such as electrons, protons and positrons inside particle accelerators. However, building the particle accelerator was just the first puzzle to solve. Advanced analytics was required to find evidence of the Higgs boson, popularly known as the “God particle,” which is responsible for giving mass to fundamental particles.

When scientists first reported the discovery of the Higgs boson, they had analyzed more than 800 trillion collisions to confirm their hypothesis. During this process they amassed more than 200 petabytes of data, which was scrutinized billions of times through statistical analysis to confirm and corroborate that the particle whose traces were found was indeed the Higgs boson. The scientists concluded that there was only a one-in-550-million chance that the results were a statistical coincidence. Amazingly, the accuracy of this discovery was calculated using the same probabilistic approach proposed by Karl Pearson.
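
To give a sense of where a figure like one in 550 million can come from, the short Python sketch below converts a significance level, expressed in standard deviations (“sigma”), into odds against a chance fluctuation. It rests on assumptions that are not stated in this article: that the test statistic follows a normal distribution and that the probability is one-sided. Under those assumptions, a significance of roughly 5.9 sigma gives odds of the same order as the figure quoted above. The function names are illustrative only, and this is not the analysis the physicists actually ran.

```python
import math


def one_sided_tail(sigma: float) -> float:
    """Probability that a standard normal variable exceeds `sigma`.

    Assumption: a Gaussian test statistic and a one-sided test.
    """
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def odds_against_chance(sigma: float) -> float:
    """Return N such that the chance of a pure fluctuation is about 1 in N."""
    return 1.0 / one_sided_tail(sigma)


if __name__ == "__main__":
    # 5.9 is an assumed value chosen for illustration, not taken from the text.
    for sigma in (3.0, 5.0, 5.9):
        odds = odds_against_chance(sigma)
        print(f"{sigma:.1f} sigma -> about 1 in {odds:,.0f}")
```

The complementary error function is used instead of subtracting a cumulative probability from 1, because the tail probabilities involved are so small that the subtraction would lose precision.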

Today, scientists have leveraged the power of analytics to deepen the understanding of particle physics. While this has gained its fair share of publicity, there have been other significant deep dives into problems in astrophysics and condensed matter physics, among other areas. For example, the origins of the universe are being studied through the cosmic microwave background using cutting-edge methods in Bayesian analysis. Measurements made in deep space carry significant measurement errors, which makes this field a unique challenge for statisticians and astronomers alike. Experimental design has been instrumental in understanding the physical properties of various elements, which has in turn led to advances in nanotechnology and magnetic resonance.

What Makes Humans Tick?

Beyond moving the needle in particle physics, analytics has demonstrated tremendous potential to extend and improve human life by facilitating better diagnosis, promoting preventive treatment and introducing a future of personalized medicine. DNA sequencing, IBM’s supercomputer Watson and “intelligent pills” are a few breakthroughs in data analytics that have the potential to revolutionize medicine in the 21st century.

The vision of personalized and preventive treatment was the main driving force behind the 13-year, $3 billion Human Genome Project. The project commenced in 1990 with the aim of mapping and sequencing the complete human genome. Sequencing and analyzing genomes at scale requires big data technology such as Hadoop, intensive statistical analysis and extensive computational power. With advances in technology, the cost of sequencing the human genome is falling, and this will revolutionize our understanding of health and disease. Personalized treatment seeks to develop newer treatments and identify the most suitable one for each patient. It can also determine which groups of patients are more prone to developing certain diseases and trigger an era of preventive treatment, thereby delaying the onset of disease or reducing its impact.

Further technological innovations are leading to better-quality data collection, which in turn triggers the application of real-time analytics in innovative ways. The development of “intelligent pills,” digestible microchips with embedded sensors that are activated by acids in the stomach, will facilitate an era of enhanced data collection. These pills can capture information and transmit it to a mobile phone. The sensors can capture vital information such as blood pressure, heart rate, body temperature and the patient’s response to a drug. This technology can tremendously improve the monitoring of elderly patients and patients with chronic illness.

Healthcare in the early 20th century was dominated by a heuristic and generic “one size fits all” approach adopted by doctors. Experience and intuition were central, leading to heavy dependence on the individual doctor and only moderate accuracy. In the near future, healthcare will evolve with the development of “tailor made” pharmaceutical drugs and more accurate clinical decision support systems. One day every individual may have wearable or implantable technology that can monitor health and measure environmental impact. This could transform healthcare by promoting an era of preventive treatment. It’s not hard to imagine that data collection, technology and machine learning will advance to the extent that robotic surgery becomes commonplace.

Future of Analytics and Science

Even today, Pearson’s statistical revolution dominates all of modern science. Medical investigations use mathematical models of distributions to determine the possible effects of treatments. Sociologists and economists use mathematical distributions to describe the behavior of human societies. In particle physics, physicists use mathematical distributions to describe subatomic particles.

While science in the 20th century was revolutionized by the adoption of a statistical perspective, it seems likely that science in the 21st century will be revolutionized by advances in statistics, innovation in big data technology and growth in computational power. A significant element of self-learning in science will come with advances in these areas as data collection and machine learning algorithms push the boundaries of artificial intelligence. This evolution of analytics is destined to accelerate the pace at which we discover ourselves and our environment, leading to a future that has tremendous potential for humanity.

Harsha Rao is a client partner and Deepali Jain is an associate manager at Mu Sigma (www.mu-sigma.com).
