ISSUES IN EDUCATION

Teaching MBA students about big data

Daniel Fylstra (daniel@solver.com)

Since the INFORMS Big Data Conference in June 2014, I’ve had many conversations with instructors teaching OR/MS and analytics to MBA students. They’ve told me they don’t really know much about big data, and are skeptical about whether there’s anything new to teach. Few have taken steps to incorporate big data topics into their courses. With developments in big data accelerating, I’ve become concerned that instructors may miss something truly important. This article offers a brief “crash course” in big data, and argues that exposure to big data needs to be part of the MBA analytics curriculum.

Big Data vs. ‘Ordinary’ Data

“Big data” deals with massive data sets, such as the terabyte of trade data generated by the New York Stock Exchange each day or the 1 million customer transactions handled by Walmart every hour. Big data architecture and technology are a response to the physical limits of hard disks. Over the last 20 years, read speed has improved by about 23 times, but capacity has increased by more than 700 times [1], so the time taken to read the entire contents of a typical hard drive has increased from about 5 minutes to 2.5 hours.

What separates big data from “ordinary” data is that a large data set is spread across many computers and hard disks and processed in parallel, which is the only way to process this much data in reasonable time. But with hundreds to thousands of machines involved, some will inevitably fail. Hadoop and its distributed file system, HDFS, solve the problem of scheduling work and recovering from failures across a cluster of computers and hard disks.
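The read-time figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers cited from [1]; the 100-drive cluster at the end is a hypothetical illustration, not a figure from this article, and it ignores scheduling and network overhead.

```python
# Figures quoted in the text (from reference [1]).
read_speed_gain = 23    # read speed improved about 23 times
capacity_gain = 700     # capacity grew more than 700 times
old_read_minutes = 5    # time to read a full drive 20 years earlier

# Time to read a full drive scales with capacity divided by read speed.
slowdown = capacity_gain / read_speed_gain
new_read_hours = old_read_minutes * slowdown / 60

# Assumption for illustration only: the same data spread evenly over
# 100 drives that are read in parallel, with no coordination overhead.
hypothetical_drives = 100
parallel_read_minutes = new_read_hours * 60 / hypothetical_drives

print(f"slowdown factor: {slowdown:.1f}x")
print(f"full-drive read time today: {new_read_hours:.1f} hours")
print(f"read time across {hypothetical_drives} drives: {parallel_read_minutes:.1f} minutes")
```

Under these assumptions, a read that takes about 2.5 hours on one drive takes under two minutes when split across the cluster, which is the motivation for the distributed design.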

Hadoop, Hive and Spark

In big data’s early years (from Google’s 2003 GFS paper [2] to about 2009), the focus was on basic data processing, using the disk-based MapReduce paradigm. Hadoop, developed and heavily used at Yahoo, became an Apache Software Foundation open-source project in 2008. But big data users soon moved beyond basic data processing. Hive, developed at Facebook [3] starting in 2009 and now part of the Hadoop ecosystem, made it possible to treat a big data cluster like a data warehouse and query it using a variant of SQL called HiveQL (HQL).
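For readers who have not seen the MapReduce paradigm mentioned above, the following is a minimal single-machine sketch of its three stages (map, shuffle, reduce) in plain Python. It is not Hadoop code: the records, the carrier codes and the function names are all hypothetical, and on a real cluster each chunk would be processed on a different machine.

```python
from collections import defaultdict

# Hypothetical records: (carrier code, departure delay in minutes).
records = [("AA", 12), ("UA", 0), ("AA", 45), ("DL", 5), ("UA", 30), ("DL", 0)]

def split(data, n):
    """Divide the records into n chunks, standing in for blocks on separate machines."""
    return [data[i::n] for i in range(n)]

def map_phase(chunk):
    """Map: emit (key, value) pairs, here (carrier, (delay, 1))."""
    return [(carrier, (delay, 1)) for carrier, delay in chunk]

def shuffle(mapped_chunks):
    """Shuffle: gather the values for each key from every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into an average delay."""
    result = {}
    for key, values in groups.items():
        total = sum(delay for delay, _ in values)
        count = sum(n for _, n in values)
        result[key] = total / count
    return result

chunks = split(records, 3)
mapped = [map_phase(chunk) for chunk in chunks]  # parallel on a real cluster
average_delay = reduce_phase(shuffle(mapped))
print(average_delay)
```

In a SQL-style tool such as Hive, the same computation would typically be written as a single aggregate query that groups by carrier, which is why such tools lower the barrier for students who already know SQL.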

In late 2013, Hadoop 2.0 made it easier to go beyond MapReduce on HDFS clusters and to use main memory as well as disk. This quickly led to distributed analytics algorithms and to the explosive growth of the Spark project [4], which was started at U.C. Berkeley’s AMPLab and became an Apache Software Foundation open-source project in early 2014. Spark now has more contributors than Hadoop itself, ranging from Yahoo, Netflix and Intel to Cloudera and Hortonworks.

Large Companies Embracing Big Data

Introducing the 2014 NewVantage Partners Big Data Executive Survey [5] of 125 senior corporate executives representing 59 Fortune 1000 companies, Tom Davenport wrote that 82 percent of executives surveyed say that big data is “important or mission critical” to their organizations. Two-thirds (67 percent) of executives reported big data initiatives running in production within the corporation. In June 2015, IBM announced [6] a major ($300 million) commitment to Apache Spark, involving more than 3,500 IBM researchers and developers.

Big Data Education at Scale

Perhaps more important, IBM said it would “educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.” It appears that IBM, seeking to move quickly, is working with the new “ed tech” competitors rather than traditional university programs. I recently participated in a MOOC on Apache Spark [7] taught by U.C. Berkeley through edX – with more than 70,000 other students.

Compare these numbers with the oft-cited 2011 McKinsey study [8] predicting “a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” How long will the shortage last?

How and What to Teach MBA Students?

Most data science or analytics courses offered by “ed tech” firms, and by university master’s in analytics programs, emphasize programming in a language such as R, Python, Java or Scala. This could prove challenging for many business students without an engineering background.

Fortunately, big data tools have progressed far enough that this is no longer necessary. It’s now possible to work with big data using only SQL, Excel (with the latest version of Frontline’s Solvers, supporting Apache Spark) and data visualization tools such as Tableau and Microsoft Power BI. With some imagination and determination, it’s possible to give MBA students a “hands-on experience” working with big data using these tools.

The main challenge for instructors lies in setting up a big data cluster of computers and hard disks, loaded with appropriate data sets. This sounds daunting, but the ability to run Hadoop and Apache Spark on a cluster of virtual machines on Amazon Web Services (AWS), Microsoft Azure and soon IBM’s Bluemix makes it far more feasible. Frontline Systems is currently seeking a small number of instructors to share use of our Spark big data cluster on AWS, pre-loaded with our own test data sets, such as the ASA’s airline data, New York City’s taxi cab trip and fare data, and the UCI Higgs Boson data set.

We want students to be able to ask and answer key business questions by exploring the data. For example, with 20 years of airline data [9] (29 airlines, 3,376 airports, 120 million records), we can study planned and actual flight departure and arrival times, seeking to predict delays. Doing this involves steps such as preprocessing, feature selection, adding weather information and using logistic regression to build a predictive model.
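To make the modeling step concrete, here is a small, hedged sketch of logistic regression for delay prediction, written in plain Python so it runs without any big data tooling. The toy records, the two features (scheduled departure hour and a bad-weather flag) and the function names are all hypothetical; they are not drawn from the airline data set [9], and a real analysis would run on the cluster with far more features and records.

```python
import math

# Hypothetical toy records: (scheduled departure hour, bad-weather flag, delayed?).
flights = [
    (6, 0, 0), (7, 0, 0), (8, 0, 0), (9, 1, 0), (10, 0, 0),
    (12, 1, 1), (14, 0, 0), (16, 1, 1), (17, 0, 1), (18, 1, 1),
    (19, 0, 0), (20, 1, 1), (21, 0, 1), (22, 1, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(hour, bad_weather):
    """Preprocessing: scale the hour to [0, 1] and prepend an intercept term."""
    return [1.0, hour / 24.0, float(bad_weather)]

def fit(rows, steps=5000, rate=0.5):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for hour, bad_weather, delayed in rows:
            x = features(hour, bad_weather)
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - delayed
            for j, xi in enumerate(x):
                grad[j] += error * xi
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def predict_delay(weights, hour, bad_weather):
    """Return the modeled probability that a flight is delayed."""
    x = features(hour, bad_weather)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

weights = fit(flights)
p_morning_clear = predict_delay(weights, 7, 0)
p_evening_storm = predict_delay(weights, 21, 1)
print(f"7 a.m., clear weather: {p_morning_clear:.2f}")
print(f"9 p.m., bad weather:   {p_evening_storm:.2f}")
```

On this toy data the model assigns a higher delay probability to a late flight in bad weather than to an early flight in clear weather. The point for students is the workflow (prepare features, fit, then ask a business question of the model) rather than the specific numbers.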

In the late 1990s, OR/MS courses in MBA programs experienced a “crisis of relevance.” It was ultimately overcome by embracing new tools, updating the curriculum and improving teaching methods. I believe that big data presents a new challenge that, if not addressed quickly, could lead to another “crisis of relevance.” Now is the time to dig in, embrace change and give MBA students the big data analytics experience they want and need for their future careers.

Daniel Fylstra (daniel@solver.com) is the president of Frontline Systems, Inc.

References

  1. White, Tom, 2012, “Hadoop: The Definitive Guide, 3rd Edition,” O’Reilly Media.
  2. http://labs.google.com/papers/gfs.html.
  3. https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919.
  4. https://spark.apache.org/.
  5. http://newvantage.com/wp-content/uploads/2012/12/Big-Data-Survey-2014-Executive-Summary-110314.pdf.
  6. https://www-03.ibm.com/press/us/en/pressrelease/47107.wss.
  7. https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x.
  8. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation.
  9. http://stat-computing.org/dataexpo/2009/the-data.html.