INNOVATIVE EDUCATION: ‘Big data’ and the analytics course

Given student expectations and industry demands, it’s time to re-frame content to make the course big data friendly.

By Peter C. Bell


I’ve been struggling with “big data.” I am fully aware that there is enormous interest in big data out there in the business world, including among my students who see firms advertising jobs in big data and realize that students who can claim to be knowledgeable about big data are very marketable. I am also acutely aware that my core analytics course is the only place in the program where students will be exposed to the technical details of big data, so if I don’t do something on big data, our students will likely lack big data literacy. I am also aware that many OR/MS/Analytics (hereafter “analytics”) faculty think “big data” is mostly puff; that is, there is not much novelty or substance behind most of the hype about big data appearing in the press. So what to do?

Like most instructors I have talked to who teach the core analytics course in a business school, I would like to keep the course focused on advanced analytics. The challenge, then, is not to massively re-engineer the course, but rather to maintain the traditional course content while re-framing it to be more big data friendly.

The Easy Fix: Guest Speakers

The easy solution is to devote a couple of class sessions to guest speakers. Invite in a big data expert or two to lecture about big data. Guests I have invited introduce the big data “Vs” (volume, variety, velocity, veracity, etc.), talk about terabytes and petabytes, and provide examples of external (to the firm) sources of big data such as social media or e-mail streams, as well as more focused examples such as soccer (or hockey or basketball) games and golf tournaments. They discuss the kinds of data streams that events such as these can provide and spend time worrying about whether or not these are examples of big data. They then move on to the firm’s internal sources of big data, citing examples such as customer and point-of-sale data, financial transactions, machine or supply chain status data, and vehicle location/routing/load data. 

Again, the discussion of examples such as these swirls around the issue of bigness: At what point does the usual kind of data become big data? The speakers see analytics as mostly a vehicle to achieving understanding from data by reducing big data to something much more manageable. For example, a video of a hockey game is a stream of unstructured big data, but if you want to start analyzing tactics or player performance, you have to quickly focus attention on a subset of this data – maybe the movement of a single player. Consequently, the ability to filter data streams to isolate relevant data for analysis is a critical task and may not be easy. 
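
For instructors who want to make this filtering point concrete in class, a minimal sketch along the following lines can help. The tracking-record format, the player identifiers and the sample data are invented for illustration and do not come from any particular tracking system.

```python
# Minimal sketch: filtering a raw tracking stream down to one player's movement.
# The record format (player_id, t, x, y) and the sample data are hypothetical.

from typing import Iterable, Iterator, NamedTuple

class TrackingRecord(NamedTuple):
    player_id: str
    t: float   # seconds from the start of the game
    x: float   # rink/field coordinates in meters
    y: float

def isolate_player(stream: Iterable[TrackingRecord], player_id: str) -> Iterator[TrackingRecord]:
    """Yield only the records for one player, leaving the rest of the stream untouched."""
    return (rec for rec in stream if rec.player_id == player_id)

# Tiny illustrative stream; a real game would produce millions of such records.
stream = [
    TrackingRecord("p7", 0.0, 10.2, 5.1),
    TrackingRecord("p9", 0.0, 40.5, 7.3),
    TrackingRecord("p7", 0.1, 10.6, 5.4),
]

for rec in isolate_player(stream, "p7"):
    print(rec)
```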

Guest lecturers have also been very good at presenting the business side of big data and pointing out the complexity of the big data industry, particularly in the case of Internet marketing, where many providers of analytical and data analysis services exist in the space between the firm that owns the product and the site where the ads appear. The example of Lime-a-rita and Straw-ber-rita (lime and strawberry flavored Bud Light beer) is often cited: analytics picked up the fact that the initial ads were reaching loyal Budweiser drinkers who were not at all amused by flavored Bud Light. Guest speakers have also been very good at presenting big data as an opportunity area where new products and businesses that we haven’t thought of yet will appear. 

The downside of the guest lecturer for me has been the absence of a link to advanced analytics and to the materials of my course. Big data without analytics is just a cost item, but most big data people can lecture for an hour about big data with only a perfunctory mention of analytics. I think it is important that when someone says analytics was able to pick up the fact that the ads were going to the wrong audience, our students have some idea of the analytics required to reach that conclusion so that they can intelligently challenge the bearer of this bad news. 

A big data guest speaker (or you could deliver this “descriptive” lecture yourself) early in the course provides a useful introduction to big data, but the challenge that remains is to integrate some of these big data issues into the remainder of the course and to relate analytics to big data. In my own courses, I have found it helpful to identify talking points where issues surrounding big data can be woven into discussions of “small” data that we were going to have anyway. 

Collecting Data

In the past, firms collected their own data, so we discussed questionnaire and survey design and sampling methods, and validity and veracity were big issues. Much internal data is now collected electronically, so validity and veracity are less of an issue, but the availability of vast amounts of public data makes data selection, interpretation and sampling methods more important than ever. Analysts are used to working with structured (numeric) data, but much big data is now unstructured (such as video or textual data like tweet streams), raising the issue of how to acquire such data and process it into something that is useful in a decision-making context. 

A quick look at the GNIP site (http://gnip.com/sources/twitter/), where you can acquire tweet streams in various forms and samples, provides an introduction to what is available in the way of raw and processed social media data and to some of the complexity involved in acquiring it.
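
To give students a feel for what the first processing step on purchased social media data might look like, here is a minimal sketch that assumes the vendor delivers tweets as newline-delimited JSON; the file name and the field names ("text") are hypothetical and vary by provider.

```python
# Minimal sketch of a first pass over a purchased tweet file.
# Assumes newline-delimited JSON; file name and field names are hypothetical.

import json

def load_tweets(path: str):
    """Parse one JSON record per line, skipping lines that fail to parse."""
    tweets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # malformed records are common in raw feeds
    return tweets

if __name__ == "__main__":
    tweets = load_tweets("tweets.jsonl")          # hypothetical file
    mentions = [t for t in tweets if "budlight" in t.get("text", "").lower()]
    print(len(tweets), "records,", len(mentions), "mentioning the brand")
```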

Storing Data

Organizations all over the world are boasting about how much data they are collecting, but it is apparent that the great majority of this data will never be used, in large part because of the shortage of people who have the skills to process this bulk data into meaningful information. A colleague recently asked me to take a look at a new case in which a firm was trying to choose an IT solution to store huge amounts of data. I suggested that the firm might want to ask what it was going to do with the data before deciding on a technology to warehouse it. There is not much point storing data without a plan and the ability to extract useful information from it. While the comment was seen as relevant, it was ignored going forward. However, this is a point that analysts have always made: Don’t start the data collection until you know what you are going to do with the data. This extends to public data: Don’t buy the data unless you have a plan and the ability to use it. People (our graduates) who understand both big data and analytics have a role to play in trying to reduce the cost of storing data that is unlikely ever to be used.

Cleaning/Verifying Data

Cleaning and verifying data seem to be more important than ever when someone you don’t know collected the data. When examining data, analysts want to first understand exactly what the data represents, or, in statistical terms, what exactly the measure is. It is also important to understand the accuracy of the data. We see many numbers reported at Excel precision (15 significant digits) even though our measuring methods lack anything like this level of precision. Different technologies may be used to collect the data, and the data you get can depend on the technology. For example, many different technologies are used to collect vehicular traffic flow data (mechanical counters, magnetometers in the road, imaging methods, GPS-based methods and so on), and the same traffic flow can produce different data. I ran into an example where vehicles appeared and/or disappeared between counters because the technologies were different at different points on the same highway! In the world of big data, the old adage not to begin the analytics until you thoroughly understand the data seems more appropriate than ever.
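
A simple in-class exercise is to have students run a consistency check before any analysis. The sketch below compares invented daily counts from two counters on the same stretch of highway and flags days where they disagree by more than an assumed tolerance; the counts and the 5 percent threshold are made up for illustration.

```python
# Sanity check before any analysis: compare daily vehicle counts from two
# counters on the same highway and flag days with suspicious disagreement.

upstream   = {"Mon": 41200, "Tue": 40850, "Wed": 43010, "Thu": 42500}
downstream = {"Mon": 41050, "Tue": 36400, "Wed": 42800, "Thu": 42650}

TOLERANCE = 0.05  # assumed acceptable relative difference

for day in upstream:
    diff = abs(upstream[day] - downstream[day]) / upstream[day]
    status = "OK" if diff <= TOLERANCE else "INVESTIGATE"
    print(f"{day}: upstream={upstream[day]}, downstream={downstream[day]}, "
          f"diff={diff:.1%} -> {status}")
```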

Cleaning data often focuses on finding and eliminating “outliers.” With internal data, local knowledge can be used to confirm that an observation should be dropped, but with external data, this knowledge may be lacking. Algorithmic methods to eliminate outliers reduce standard deviations and so may systematically underestimate risk. If you are buying data, you want to know whether it has been cleaned and, if so, how. Firms have a history of underestimating risks by using poor risk calculations, so this discussion can also be related to current events. 
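
The effect is easy to demonstrate with simulated data. The sketch below draws a heavy-tailed sample, trims observations beyond three standard deviations (one common convention, assumed here) and shows how both the standard deviation and the estimated tail loss shrink.

```python
# Demonstration: mechanically trimming "outliers" from a heavy-tailed sample
# shrinks the standard deviation and the estimated tail loss. Data are simulated.

import numpy as np

rng = np.random.default_rng(0)
losses = rng.standard_t(df=3, size=100_000)        # heavy-tailed "loss" data

mu, sigma = losses.mean(), losses.std()
trimmed = losses[np.abs(losses - mu) <= 3 * sigma]  # drop beyond 3 sigma

for name, x in [("raw", losses), ("trimmed", trimmed)]:
    print(f"{name:8s} std = {x.std():.3f}   99th percentile loss = {np.percentile(x, 99):.3f}")
```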

Summarizing Data

At the start of my executive courses, managers consistently tell me, “Just give me the data and I can make the decision.” Some executives feel that they have the ability to process data intuitively, but it doesn’t take long to get them thinking about how they want the data summarized and presented. As the sample size grows, rigorous and useful methods to summarize data matter even more, so the methods of descriptive statistics seem more important than ever. 
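
One way to make this concrete in class is a one-pass (Welford) running mean and variance, which can summarize a data stream too large to hold in memory; the input below is a small made-up list standing in for such a stream.

```python
# One-pass (Welford) summary statistics for data fed record by record.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0      # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

stats = RunningStats()
for value in [12.0, 15.5, 9.8, 14.2, 11.1]:   # stand-in for a large data stream
    stats.update(value)

print(f"n={stats.n}, mean={stats.mean:.2f}, variance={stats.variance:.2f}")
```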

Business students will be expected to be able to actually do some analytics when they start employment, so summarizing and presenting big data is an area where they could be immediately useful. In the past we have focused on summarizing numerical data, but having student teams make short presentations on how video and textual data is being processed to extract information useful to decision-makers (with examples) is one way to broaden the discussion to include unstructured data. 

Modeling with Big Data

The choice of which models to cover in the course is becoming more complex as specialized big data techniques continue to emerge. We cover multiple regression, and this provides an opportunity to connect with data mining at a descriptive level and to include some rich big data content. 
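
A small sketch can show students that the mechanics of multiple regression do not change as the sample grows, only the engineering around it. The data below are simulated, and ordinary least squares is solved directly with NumPy; nothing here is specific to any one big data source.

```python
# Multiple regression on a large simulated sample via ordinary least squares.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000                                  # large but still in-memory
X = rng.normal(size=(n, 3))                    # three explanatory variables
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("estimated coefficients:", np.round(beta_hat, 3))
```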

Should we add neural networks, search techniques and other specific big data tools to the analytics course? In our case-based courses that focus on decision-making, we try to have an important decision issue in front of the class at all times, and our classes are less technical than most. When we introduce a model as a useful approach to a case issue, I try to have the class think about the scalability of the approach. Which of our models scale up to a big data input? What publicly available data might be useful in this decision situation? Who has this data? How accessible is this data? How might it affect our analysis?

Applications of Big Data

Students are always interested in how analytics is being used, so it is very helpful that some recent Edelman prize competitions have featured big data applications. Dell was founded on and prospered with a configure-to-order business model in which customers chose the various components of their PC, which was then assembled and shipped. Around 2007, Dell decided to expand by offering a specific product line of “off-the-shelf” pre-configured PCs. Dell used prescriptive big data analysis of its billions of records of past customer purchases to evaluate millions of possible product configurations and define a product line of pre-configured PCs that (allowing for some trade-ups) captured almost three quarters of Dell’s notebook and desktop sales. Today, these configurations make up about half of Dell’s sales [1].

Industrial and Commercial Bank of China (ICBC) has some 16,000 branches throughout China and is by many measures (market capitalization, deposits) the world’s largest bank. The branch network is a key strategic asset of ICBC, and with China’s economy expanding and modernizing, it is critical for ICBC to quickly identify high-potential locations for new branches, as well as opportunities to move or reconfigure existing branches. In 2006, ICBC partnered with IBM Research to begin development of ICBC’s branch network optimization system. The system was driven by an analysis in which each city was divided into tens of thousands of 100-meter square cells, and the business activity and demographic data for each cell were identified from geographic information system databases. This data was used with human opinion, expert judgment and large-scale optimization to find the cells that offered the best locations for branches, taking into account market potential, competitors’ locations and other ICBC branches in the neighborhood. This process has been implemented with great effect in more than 40 major cities in China [2]. 
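
For classroom purposes, a highly simplified stand-in for this kind of cell-based analysis can be sketched as follows: score each grid cell from invented demographic and business-activity indices, then greedily pick high-scoring cells that are not too close to an already-chosen branch. The weights, the spacing rule and the data are all assumptions; the real system combined far richer data, expert judgment and large-scale optimization.

```python
# Toy cell-scoring and greedy branch selection; all numbers are invented.

import itertools

CELL = 100          # meters per cell side
MIN_SPACING = 300   # assumed minimum distance between branches, in meters

# cell -> (business_activity_index, population_index); values are made up
cells = {(i, j): ((i * 7 + j * 3) % 10, (i + j) % 5)
         for i, j in itertools.product(range(20), range(20))}

def score(activity: float, population: float) -> float:
    return 0.6 * activity + 0.4 * population     # assumed weighting

def far_enough(c, chosen):
    return all(CELL * ((c[0] - o[0])**2 + (c[1] - o[1])**2) ** 0.5 >= MIN_SPACING
               for o in chosen)

chosen = []
for cell in sorted(cells, key=lambda c: score(*cells[c]), reverse=True):
    if far_enough(cell, chosen):
        chosen.append(cell)
    if len(chosen) == 5:
        break

print("candidate branch cells:", chosen)
```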

The Twitter “Who-to-Follow” system [3] extracts relationship data from tweet streams from more than 240 million users and processes this data to recommend accounts for users to follow. Other Edelman prize entries that can be examined through a big data lens and seen as potential big data applications include Procter & Gamble’s supply chain optimizations [4] and ABB Electric’s customer choice modeling [5]. 

Opportunity for Teachers of Analytics

The analytics space that used to be well defined is now growing fast, driven by technology and the ability to record and store vast amounts of data in many different forms about almost everything. Some of this data may be relevant to an important decision, but the challenge is to find the data, extract it and process it into useful input to the decision-maker. The need for these skills and knowledge provides a huge opportunity for analytics teachers, but seizing this opportunity carries considerable costs and may well be quite disruptive. 

Whether we like it or not, our colleagues will expect us to cover “big data,” so the basic choice that we have is whether to re-engineer the basic analytics course into a big data course or to keep our traditional advanced analytics, model-based outline but reframe the materials to make them more big data friendly. Data analysts (“data scientists?”) are in high demand, but the more difficult hires are people who can grow and lead an analytics team in a major organization. Graduates with a good understanding of business, analytics and big data have the potential to fill this role and appear to have a very bright future.

Peter C. Bell (pbell@ivey.ca) is a professor of management science at the Ivey School of Business at Western University in Ontario, Canada. He served as chair of the 2013 and 2014 INFORMS Franz Edelman Prize Competition.

References

1. K. Martin, P. Chitalia, M. Pugalenthi, K. Raghava Rau, S. Maity, R. Kumar, R. Saksena, R. Hebbar, M. Krishnan, G. Hegde, C. Kesanapally, T. Kaur Bimbraw and S. Subramanian, 2014, “Dell’s Channel Transformation – Leveraging Operations Research to Unleash Potential across the Value Chain,” Interfaces, Vol. 44, No. 1, pp. 55-69.

2. Xiquan Wang, Xingdong Zhang, Xiaohu Liu, Lijie Guo, Thomas Li, Jin Dong, Wenjun Yin, Ming Xie, and Bin Zhang, 2012, “Branch Reconfiguration Practice Through Operations Research in Industrial and Commercial Bank of China,” Interfaces, Vol. 42, No. 1, pp. 33-44.

3. A. Goel, P. Gupta, J. Sirois, D. Wang, A. Sharma and S. Gurumurthy, 2014, “The ‘Who-to-Follow’ System at Twitter: Strategy, Algorithms and Revenue Impact,” Franz Edelman finalist presentation, 2014 INFORMS Conference on Business Analytics & Operations Research, Boston, March 31, 2014 (forthcoming in Interfaces, Vol. 45, No. 1).

4. J.D. Camm, T.E. Chorman, F.A. Dill, J.R. Evans, D.J. Sweeney and G.W. Wegryn, 1997, “Blending OR/MS judgment and GIS: Restructuring P&G’s supply chain,” Interfaces, Vol. 27, No. 1, pp. 128-142.

5. D.H. Gensch, N. Aversa and S. P. Moore, 1990, “A Choice-Modeling Market Information System That Enabled ABB Electric to Expand its Market Share,” Interfaces, Vol. 20, No. 1, pp. 6-25.