Integrating data mining and forecasting

Leveraging time-series data, both internal and external, yields more accurate and more explanatory forecasting models.


By Tim Rey and Chip Wells

Big data means different things to different people. In the context of forecasting, the savvy decision-maker needs to find ways to derive value from big data. Data mining for forecasting offers the opportunity to turn the numerous sources of time-series data now readily available to the business decision-maker, both internal and external, into actionable strategies that can directly impact profitability. Deciding what to make, when to make it and for whom is a complex process. Understanding which factors drive demand, and how those factors (e.g., raw materials, logistics, labor) interact with production processes and change over time, is key to deriving value in this context.

Traditional data mining processes, methods and technology oriented to static data (data without a time-series framework) have grown immensely in the last quarter century (Fayyad et al., 1996; Cabena et al., 1998; Berry, 2000; Pyle, 2003; Duling and Thompson, 2005; Rey and Kalos, 2005; Kurgan and Musilek, 2006; Han and Kamber, 2012). These references speak to the process, as well as the myriad methods, aimed at building prediction models on data that does not have a time-series framework. Significant value can be found in the interdisciplinary notion of data mining for forecasting – the use of time-series-based methods to mine data collected over time.

This value comes in many forms. Being more accurate about what to make, when and for whom helps immensely with inventory cost reduction and revenue optimization, not to mention customer loyalty. There is also value in capturing a subject matter expert’s knowledge of the company’s market dynamics. Doing so in terms of mathematical models helps institutionalize corporate knowledge. When done properly, the ensuing equations become intellectual property that can be leveraged across the company. This is true even if the data sources are public: how the data is used creates the intellectual property, and that use is proprietary.

Three prerequisites need to be considered for the successful implementation of a data mining for forecasting approach: 1) understanding the usefulness of forecasts at different time horizons, 2) differentiating planning from forecasting, and 3) getting all stakeholders on the same page in forecast implementation.

One primary difference between traditional and time-series data mining is that, in the latter, the time horizon of the prediction plays a key role. For reference purposes, short-range forecasts are defined herein as one to three years, medium-range forecasts as three to five years, and long-range forecasts as greater than five years. Anything more than 10 years out should be considered a scenario rather than a forecast.

Finance groups generally control the “planning” roll-up process for corporations and deliver “the” number that the company plans against and reports to Wall Street. Strategy groups are always in need of medium- to long-range forecasts for strategic planning. Executive sales and operations planning (ESOP) processes demand medium-range forecasts for resource and asset planning. Marketing and sales organizations need short- to medium-range forecasts for planning purposes. New business development incorporates medium- to long-range forecasts in the net present value (NPV) process for evaluating new business opportunities. Business managers rely heavily on short- and medium-term forecasts for their own business’s data, but they also need to know the same about the market. And since every penny a purchasing organization saves a company goes straight to the bottom line, it behooves purchasing to develop and support high-quality forecasts for raw materials, logistics, materials and supplies, and services.

However, regardless of the needs and aims of various stakeholder groups, differentiating a “planning” process from a “forecasting” process is critical. Companies need to aspire to a “plan,” and business leaders have to be responsible for that plan. But claiming that the plan is a “forecast” can be disastrous. Plans are what we feel we can do; forecasts are mathematical estimates of what is most likely. They are not the same, but both should be maintained, and the accuracy of both should be tracked over a long period of time. When reporting to Wall Street, accuracy matters more than precision: being precisely wrong does not help.

Given that so many groups within an organization have similar forecasting needs, a best practice is to move toward a “one number” framework for the whole company. If the finance, strategy, marketing/sales, ESOP, new business development, supply chain and purchasing functions are not using the same numbers, tremendous waste can result in the form of rework and mismanagement. This calls for a more centralized approach to delivering forecasts for a corporation, balanced with input from the business planning function. Charles Chase presents such a corporate framework for centralized forecasting in his book “Demand-Driven Forecasting” (Chase, 2009).

Big Data in Data Mining for Forecasting

Over the last 15 years or so, there has been an explosion in the amount of external time-series data available to businesses. The list of providers includes Global Insights, Euromonitor, CMAI, Bloomberg, Nielsen, Moody’s Economy.com and Economagic, as well as government and academic sources such as www.census.gov, www.statistics.gov.uk/statbase, the IQSS data base, research.stlouisfed.org, imf.org, stat.wto.org, www2.lib.udel.edu and sunsite.berkeley.edu. All provide some sort of time-series data – that is, data collected over time inclusive of a time stamp. Many of these services are available for a fee; some are free. Global Insights (ihs.com) alone contains more than 30 million time series.

This wealth of additional information changes the way a company should approach the time-series forecasting problem: new methods are necessary to determine which of the potentially thousands of useful time-series variables should be considered as exogenous variables. Business managers do not have the time to “scan” and plot every series for use in decision-making.

Many of these external sources offer data bases of historical time-series data but do not offer forecasts of the variables. Leading or forecasted values of a model’s exogenous variables are necessary to create forecasts for the dependent, or target, variable. Some services, such as Global Insights and CMAI, do offer such lead forecasts.


Concerning internal data, IT systems for collecting and managing data, such as SAP and others, have truly opened the door for businesses to get a handle on detailed historical data for revenue, volume, price and costs – even the entire product income statement. That is, the system architecture is designed to save historical data. Twenty-five years ago, IT managers worried about storage limitations and thus would “design out of the system” historical detail that is useful for forecasting purposes. Today, with storage so cheap, IT architectures retain various levels of historical detail so that companies can take full advantage of this wealth of information.

A couple of key distinctions about time-series modeling are important at this point. The first thing that differentiates time-series data from simple, static data is that time-series data can be related to “itself” over time. This is called serial correlation. If simple regression or correlation techniques are used to relate one time-series variable to another while ignoring possible serial correlation, the business person can be misled. Thus, rigorous statistical handling of this serial correlation is important.
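
To see why this matters, consider the following sketch (Python with numpy and statsmodels; the data is synthetic, and the tooling choice is ours, not the article’s). Two unrelated random walks often appear strongly “correlated” simply because each series is serially correlated:

```python
# Spurious correlation between two unrelated random walks (synthetic data).
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
x = rng.normal(size=200).cumsum()  # random walk: strongly serially correlated
y = rng.normal(size=200).cumsum()  # a second, unrelated random walk

print(np.corrcoef(x, y)[0, 1])                    # often misleadingly large
print(np.corrcoef(np.diff(x), np.diff(y))[0, 1])  # near zero after differencing

# Durbin-Watson on residuals of a naive regression of y on x: values far
# from 2 flag serial correlation, warning that the apparent relationship
# between the levels may be spurious.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
print(durbin_watson(residuals))
```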

The second distinction is that two main classes of statistical forecasting approaches need to be considered.

In the case of a “univariate” forecasting approach, only the variable to be forecast (the “Y” or dependent variable) is considered in the modeling exercise. Historical trends, cycles and seasonality of the Y itself are the only structures considered when building the forecasting model. There is no need for data mining in this context.
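
For concreteness, here is a minimal “Y only” sketch (Python with statsmodels on synthetic monthly data; the article names no tooling here, so the library choice is an assumption):

```python
# A "Y only" (univariate) forecast: only the history of Y is used.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with trend and seasonality (stand-in for a real Y).
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
rng = np.random.default_rng(0)
y = pd.Series(100 + 0.5 * np.arange(96)
              + 10 * np.sin(2 * np.pi * np.arange(96) / 12)
              + rng.normal(0, 2, size=96), index=idx)

# Trend and seasonality of Y itself are the only structures modeled.
fit = ExponentialSmoothing(y, trend="add", seasonal="add",
                           seasonal_periods=12).fit()
print(fit.forecast(12))  # 12-month-ahead univariate forecast
```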

In the second approach – where the plethora of time-series data sources comes in – various “Xs,” or independent (exogenous) variables, are used to help forecast the Y, or dependent variable, of interest. This is exogenous variable forecast model building. Businesses typically consider this approach value-added because it seeks to understand the “drivers,” or “leading indicators,” of the Y. The exogenous variable approach is what creates the need for data mining in forecasting problems.
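
A hedged sketch of the exogenous-variable approach (Python/statsmodels on synthetic data; the one-period lead of X and the ARIMA order are illustrative assumptions). Note that forecasting Y forward requires lead values of the X, which is why the lead forecasts mentioned above are valuable:

```python
# Exogenous-variable ("multivariate in X") forecasting with SARIMAX.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
n = 96
x = pd.Series(rng.normal(size=n).cumsum())         # candidate driver X
y = 2.0 * x.shift(1) + rng.normal(0, 0.5, size=n)  # Y led by X one period

# Drop the NaN created by the lag and fit Y on lagged X.
y_fit = y.iloc[1:].to_numpy()
x_fit = x.shift(1).iloc[1:].to_numpy().reshape(-1, 1)
res = SARIMAX(y_fit, exog=x_fit, order=(1, 0, 0)).fit(disp=False)

# Forecasting ahead requires lead values of X; here the last observed X
# naively stands in for its own lead forecast.
future_x = np.full((3, 1), x.iloc[-1])
print(res.forecast(steps=3, exog=future_x))
```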

Though univariate, or “Y only,” forecasts are often very useful and can be quite accurate in the short run, there are two things they cannot do as well as “multivariate” forecasts. First and foremost is providing an understanding of the drivers of the forecast. Business managers always want to know which variables (in this case, which other time series) drive the series they are trying to forecast; “Y only” forecasts cannot identify these drivers. Second, by using these drivers, exogenous variable models can often forecast further ahead, and more accurately, than univariate forecasting models.

The 2008/2009 recession is evidence of a situation where the use of proper Xs in an exogenous variable “leading indicator” framework would have given some companies more warning of the dilemma ahead. Univariate forecasts were not able to capture this phenomenon as well as exogenous variable forecasts.

The external data bases introduced above not only offer the “Ys” that businesses are trying to model (such as those in the NAICS or ISIC data bases), but also provide potential “Xs” (hypothesized drivers) for the multivariate (in X) forecasting problem. Joseph Ellis, in his book “Ahead of the Curve,” does a nice job of laying out a structure for determining what “mega level” X variables to consider in such a problem. Ellis provides a thought process that, when complemented with the data mining for forecasting process proposed herein, will help the business forecaster do a better job of identifying key drivers and building useful forecasting models.

The use of exogenous variable forecasting not only manifests itself in potentially more accurate future values for price, demand, costs, etc., but also provides a basis for understanding the timing of changes in economic activity. Achuthan and Banerji (2004), in their book “Beating the Business Cycle,” along with Banerji (1999), present a compelling approach for determining potential Xs to consider as leading indicators in forecasting models. Evans et al. (2002), as well as www.nber.org and www.conference-board.org, have developed frameworks for indicating large turns in economic activity for large regional economies and specific industries, and in doing so have identified key drivers. In the end, much of this work speaks to the concept that, if studied over a long enough time frame, many of the structural relations between Ys and Xs do not actually change. This offers solace to the business decision-maker and forecaster willing to learn how to use data mining techniques to mine the time-series relationships in the data.

Many large companies have decided to include external data, such as that found in Global Insights, as part of their overall data architecture. Small internal computer systems are built to automatically move data from the external source to an internal data base. This, accompanied by tools such as SAS’s Data Surveyor for SAP, allows users to bring the external Y and X data alongside the internal data. Often the internal Y data is still in transactional form. Once properly processed and aggregated – e.g., by summing over a consistent time interval (the month) and attaching the corresponding monthly time stamp – this transactional data becomes time-series data. The resulting data base has the proper time stamp, includes both internal and external Y and X data, and sits all in one place. It is the starting point for the data mining for forecasting multivariate modeling process.
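
A minimal sketch of that aggregation step (pandas; the table and column names are hypothetical):

```python
# Rolling transactional records up to a monthly time stamp.
import pandas as pd

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-01-03", "2023-01-17", "2023-02-02", "2023-02-21"]),
    "volume": [120.0, 80.0, 95.0, 60.0],
})

# Sum over a consistent interval (the month) and attach the month-start
# time stamp: the transactional data becomes time-series data.
monthly = (transactions.set_index("timestamp")
                       .resample("MS")["volume"]
                       .sum())
print(monthly)
```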

Process and Methods for Data Mining for Forecasting

Various authors have defined the difference between “data mining” and classical statistical inference; Hand (1998), Glymour et al. (1997) and Kantardzic (2011) are notable examples. In a classical statistical framework, the scientific method (Cohen and Nagel, 1934) drives the approach. First, a particular research objective is sought, often driven by first principles or the physics of the problem. This objective is specified in the form of a hypothesis; from there a particular statistical model is proposed, which is then reflected in a particular experimental design. These experimental designs make the ensuing analysis much easier in that the Xs are independent, or orthogonal to one another. This orthogonality leads to perfect separation of the effects of the “drivers.” The data is then collected, the model is fit, and all previously specified hypotheses are tested using specific statistical approaches. Thus, very clean and specific cause-and-effect models can be built.

In contrast, in many business settings a set of data contains many Ys and Xs that were collected with no particular modeling objective or hypothesis in mind. This lack of an original objective often leaves the data with irrelevant and redundant candidate explanatory variables. Redundancy of explanatory variables is also known as “multicollinearity” – that is, the Xs are actually related to one another. This makes building cause-and-effect models much more difficult.
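
One common diagnostic for such redundancy is the variance inflation factor, shown below in a brief sketch (statsmodels on synthetic data; a diagnostic we chose for illustration, not one the article prescribes):

```python
# Detecting redundant (multicollinear) Xs with variance inflation factors.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)        # nearly redundant with x1
x3 = rng.normal(size=200)                        # unrelated to x1 and x2
X = np.column_stack([np.ones(200), x1, x2, x3])  # include an intercept column

# A VIF well above 10 is a common rule of thumb for problematic redundancy.
for i in (1, 2, 3):
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.1f}")
```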

Data mining practitioners will “mine” this type of data in the sense that various statistical and machine-learning methods are applied to the data in search of specific Xs that predict the Y with a certain level of accuracy. Data mining on static data, then, is the process of determining what set of Xs best predicts the Y(s). This is a different approach from classical statistical inference using the scientific method; building an adequate “prediction” model does not necessarily mean an adequate “cause-and-effect” model was built.

A similar framework applies to time-series data. The scientific method in time-series problems is driven by the economics or physics of the problem. Various structural forms may be hypothesized. Often a small, limited set of Xs is then used to build multivariate time-series forecasting models, or small sets of linear models that are solved as a set of simultaneous equations. Data mining for forecasting follows a process similar to the static data mining process: given a set of Ys and Xs in a time-series data base, which Xs do the best job of forecasting the Ys? In an industrial setting, unlike traditional data mining, a data set is not normally readily available for this exercise. There are particular approaches that in some sense follow the scientific method discussed earlier; the main difference is that time-series data cannot be laid out in a designed-experiment fashion.


With regard to process, various authors have reported on the process for mining static data. Azevedo and Santos (2008) compared the KDD process, SAS Institute’s SEMMA process (sample, explore, modify, model, assess) and the CRISP-DM process. Rey and Kalos (2005) review the data mining and modeling process used at The Dow Chemical Company. A common theme in all of these processes is that there are many Xs, so some methodology is necessary to reduce the number of Xs provided as input to the particular modeling method of choice. This reduction is often referred to as variable or feature selection. Many researchers have studied and proposed approaches for variable selection on static data (Koller and Sahami, 1996; Guyon, 2003). One focus of this article is an evolving area of research: variable selection for time-series data. A simple example of such screening appears below.
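
As one simple illustration of what time-series variable selection can look like (a hedged sketch, not the authors’ method): rank each candidate X by its strongest cross-correlation with Y over a range of leads, computed on differenced series to damp the serial-correlation effects discussed earlier.

```python
# Screen a candidate driver X for its best "lead" against Y.
import numpy as np
import pandas as pd

def best_lead(y: pd.Series, x: pd.Series, max_lead: int = 12):
    """Return the (lead, correlation) pair maximizing
    |corr(diff(Y_t), diff(X_{t-lead}))| over leads 1..max_lead."""
    dy, dx = y.diff(), x.diff()
    scored = [(k, dy.corr(dx.shift(k))) for k in range(1, max_lead + 1)]
    return max(scored, key=lambda pair: abs(pair[1]))

# Synthetic check: X truly leads Y by three periods.
rng = np.random.default_rng(3)
x = pd.Series(rng.normal(size=120).cumsum())
y = 1.5 * x.shift(3) + pd.Series(rng.normal(scale=0.5, size=120))
print(best_lead(y, x))  # expect a lead near 3 with a strong correlation
```

Ranking every candidate X this way, then handing the top-scoring, lead-shifted Xs to the modeling step, is one plausible reduction strategy; variable clustering and similarity analysis (Lee et al., 2008; Leonard et al., 2008) are others.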

The process for developing time-series forecasting models with exogenous variables starts with understanding the strategic objectives of the business leadership sponsoring the project.

This is often secured via a written charter so as to document key objectives, scope, ownership, decisions, value, deliverables, timing and costs. Understanding the system under study, with the aid of the business subject matter experts, provides the proper environment for focusing on and solving the right problem. Determining what data helps describe the system previously defined can take some time. In the end, the most time-consuming step in any data mining prediction or forecasting problem is the data-processing step, where data is defined, extracted, cleaned, harmonized and prepared for modeling (see accompanying article).
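
One small example of the harmonization work (pandas; the series, names and frequencies are hypothetical): mixed-frequency sources must be aligned to one consistent time stamp before modeling.

```python
# Harmonize a weekly external driver with a monthly internal Y.
import numpy as np
import pandas as pd

weekly_x = pd.Series(np.arange(26, dtype=float),
                     index=pd.date_range("2023-01-01", periods=26, freq="W"))
monthly_y = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 15.0],
                      index=pd.date_range("2023-01-01", periods=6, freq="MS"))

# Average the weekly series within each month, then join both series
# on the shared month-start time stamp.
aligned = pd.DataFrame({"y": monthly_y,
                        "x": weekly_x.resample("MS").mean()})
print(aligned)
```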

The reason for integrating data mining and forecasting is straightforward: producing a high-quality forecast. The unique advantage of this approach lies in having access to literally thousands of potential independent variables (Xs), along with a process and technology that enable data mining on time-series data in an efficient and effective manner. In the end, the business receives the best explanatory forecasting model possible.

Tim Rey (TDRey@dow.com) is director of Advanced Analytics at The Dow Chemical Company. Fenton (Chip) Wells (Chip.Wells@sas.com) is a statistical services specialist in SAS Education at SAS. They are co-authors of the book, “Applied Data Mining and Forecasting Using SAS.”

References

  1. Achuthan, L. and Banerji, A., “Beating the Business Cycle,” Doubleday, 2004.
  2. Antunes, C. and Oliveira, A., “Temporal Data Mining: An Overview,” KDD Workshop on Temporal Data Mining, 2001.
  3. Azevedo, A. and Santos, M., “KDD, SEMMA and CRISP-DM: A Parallel Overview,” Proceedings of the IADIS, 2008.
  4. Banerji, A., “The Lead Profile and Other Nonparametric Tools to Evaluate Survey Series as Leading Indicators,” 24th CIRET Conference, 1999.
  5. Berry, M., “Data Mining Techniques and Algorithms,” John Wiley and Sons, 2000.
  6. Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A., “Discovering Data Mining: From Concept to Implementation,” Prentice Hall, 1998.
  7. Chase, C., “Demand-Driven Forecasting: A Structured Approach to Forecasting,” SAS Institute Inc., 2009.
  8. Cohen, M. and Nagel, E., “An Introduction to Logic and Scientific Method,” Harcourt, Brace, 1934.
  9. “CRISP-DM 1.0,” SPSS Inc., 2000.
  10. “Data Mining Using SAS Enterprise Miner: A Case Study Approach,” SAS Institute, 2003.
  11. Duling, D. and Thompson, W., “What’s New in SAS® Enterprise Miner 5.2,” SUGI 31, Paper 082-31, 2005.
  12. Ellis, J., “Ahead of the Curve: A Common Sense Guide to Forecasting Business and Market Cycles,” Harvard Business School Press, 2005.
  13. Engle, R. and Granger, C. (eds.), “Long-Run Economic Relationships: Readings in Cointegration,” Oxford University Press, 1992.
  14. Evans, C., Liu, C.T. and Pham-Kanter, G., “The 2001 Recession and the Chicago Fed National Activity Index: Identifying Business Cycle Turning Points,” Federal Reserve Bank of Chicago, 2002.
  15. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (eds.), “Advances in Knowledge Discovery and Data Mining,” AAAI Press, 1996.
  16. Glymour, C., Madigan, D., Pregibon, D. and Smyth, P., “Statistical Themes and Lessons for Data Mining,” Data Mining and Knowledge Discovery, Vol. 1, pp. 11-28, Kluwer Academic Publishers, 1997.
  17. Guyon, I. and Elisseeff, A., “An Introduction to Variable and Feature Selection,” The Journal of Machine Learning Research, Vol. 3, pp. 1157-1182, 2003.
  18. Han, J., Kamber, M. and Pei, J., “Data Mining: Concepts and Techniques,” Elsevier Inc., 2012.
  19. Hand, D., “Data Mining: Statistics and More?” The American Statistician, Vol. 52, No. 2, May 1998.
  20. Kantardzic, M., “Data Mining: Concepts, Models, Methods, and Algorithms,” Wiley, 2011.
  21. Koller, D. and Sahami, M., “Toward Optimal Feature Selection,” Proceedings of the International Conference on Machine Learning, pp. 284-292, 1996.
  22. Kurgan, L. and Musilek, P., “A Survey of Knowledge Discovery and Data Mining Process Models,” The Knowledge Engineering Review, Vol. 21, No. 1, pp. 1-24, 2006.
  23. Lee, T. and Schubert, S., “Time Series Data Mining with SAS Enterprise Miner,” Paper 160-2011, SAS Institute Inc., Cary, N.C., 2011.
  24. Lee, T., et al., “Two-Stage Variable Clustering for Large Data Sets,” SAS Global Forum 2008, Paper 320-2008, SAS Institute Inc., Cary, N.C.
  25. Leonard, M., Lee, T., Sloan, J. and Elsheimer, B., “An Introduction to Similarity Analysis Using SAS,” SAS Institute white paper, 2008.
  26. Leonard, M. and Wolfe, B., “Mining Transactional and Time Series Data,” International Symposium of Forecasting, 2002.
  27. Mitsa, T., “Temporal Data Mining,” Taylor and Francis Group, LLC, 2010.
  28. Pankratz, A., “Forecasting with Dynamic Regression Models,” Wiley, 1991.
  29. Pyle, D., “Business Modeling and Data Mining,” Elsevier Science, 2003.
  30. Rey, T. and Kalos, A., “Data Mining in the Chemical Industry,” Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005.