The ‘D’ factor

INFORMS President

Anne Robinson

As I was thinking about this issue’s column, colleagues and friends sent me a flurry of topics, from updates on INFORMS initiatives to interesting new fields where analytics is taking a more prominent position. One underlying thread surfaced throughout all the suggestions – data and its availability, readiness and role in the analytics process.

A few years ago, my colleague Teresa Wong and I were discussing at length how understanding data and data quality was the missing piece in advanced analytics education. In graduate school, we often ignore the importance of checking for good data. We are afforded the luxury of assuming away the data, supposing that it will always follow nice, well-known distributions. The harsh reality is that data is ugly … really, really ugly. In practice, data is rarely well-behaved, never in a single place, often has gaps and is generally missing the attributes required to solve your problem. In this era of big data, it is more critical than ever to ensure that we understand the quality of the data we employ in our models.

Like a diamond, data quality can also be measured by Cs, in this case five – completeness, correctness, consistency, currency and collaboration. (There are many flavors of data quality models from expert groups, but they all essentially speak to the same dimensions.)

Completeness refers to data sets in which all the expected fields are present (not to be confused with the mathematical definition of completeness). However, not every part of a data record is necessarily required. For example, in the INFORMS database, while we ask members to identify their employer and job title, members are not required to complete these fields. As such, this information is not needed for the data set to be considered complete.
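A completeness check along these lines can be sketched in a few lines of Python. This is only an illustration: the field names and the choice of which fields count as required are assumptions, not the actual INFORMS schema.

```python
# Hypothetical schema: which fields are required is an assumption made
# for illustration only.
REQUIRED_FIELDS = ("member_id", "name", "email")


def missing_required(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in one record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def completeness_rate(records, required=REQUIRED_FIELDS):
    """Fraction of records in which every required field is populated."""
    if not records:
        return 1.0
    complete = sum(1 for r in records if not missing_required(r, required))
    return complete / len(records)


# Made-up sample records. A blank optional field (employer) does not
# count against completeness; a blank required field (email) does.
members = [
    {"member_id": 1, "name": "A. Analyst", "email": "a@example.org", "employer": ""},
    {"member_id": 2, "name": "B. Modeler", "email": "", "employer": "Acme"},
]
```

Keeping the list of required fields separate from the check itself makes it easy to adjust when the definition of a complete record changes.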

Correctness or accuracy is the degree to which data correctly reflects the real-world object or event being described. For example, imagine you had a numeric date code. The month field should never exceed the number “12.” If a value higher than this appeared in the field, it would denote an accuracy or correctness problem in the data.
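As a rough sketch of such a test, the function below validates a numeric date code. It assumes the code is laid out as YYYYMMDD, which is one common convention and not necessarily the layout in any particular system.

```python
from datetime import date


def check_date_code(code):
    """Return a list of problems found in a numeric YYYYMMDD date code.

    The YYYYMMDD layout is an assumption for this example.
    """
    text = str(code)
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return ["not an 8-digit number"]
    year, month, day = int(text[:4]), int(text[4:6]), int(text[6:])
    if year < 1:
        return ["year is out of range"]
    if not 1 <= month <= 12:
        return [f"month {month} is outside 1-12"]
    try:
        # date() rejects impossible days such as February 30.
        date(year, month, day)
    except ValueError:
        return [f"day {day} is not valid for {year}-{month:02d}"]
    return []
```

An empty list means the code passed; a code such as 20131305 is flagged because its month field is 13.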

Consistency of data ensures that the state of the data at any given point in time is in sync with the agreed-upon definition. This includes ensuring that data housed in different subscribing systems is, in fact, the same. Good consistency in data is paramount for the data to be credible and trustworthy. Well-defined metadata management and master data management (the processes for defining and managing the definitions of data) help drive consistency across different uses.

Please be aware, though, that multiple-source data can have a different definition depending on how that data is being used. Perhaps you are looking at shipment data from two different sources – one held by sales and one by manufacturing. These numbers may be different for a variety of reasons; for example, sales may only be interested in shipments that generate revenue while manufacturing is interested in all the units that need to go out, including demo units, sample units, etc. However, the foundational source for both of these data sets should be the same, and the business rules that guide the variants should also be captured in the master data.
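The shipment example can be made concrete with a small reconciliation sketch. Everything here is hypothetical: the record layout, the purpose tags and the unit counts are invented for illustration. The point is only that both views derive from one foundational source plus an explicit business rule.

```python
# Hypothetical master shipment records; field names and purpose tags are
# assumptions for illustration, not an actual schema.
MASTER_SHIPMENTS = [
    {"order": "A1", "units": 100, "purpose": "revenue"},
    {"order": "A2", "units": 5, "purpose": "demo"},
    {"order": "A3", "units": 40, "purpose": "revenue"},
    {"order": "A4", "units": 2, "purpose": "sample"},
]

# Business rule kept alongside the master data: purposes that do not
# generate revenue.
NON_REVENUE_PURPOSES = {"demo", "sample"}


def manufacturing_units(rows):
    """Manufacturing view: every unit that needs to go out."""
    return sum(row["units"] for row in rows)


def sales_units(rows):
    """Sales view: only units from revenue-generating shipments."""
    return sum(
        row["units"] for row in rows
        if row["purpose"] not in NON_REVENUE_PURPOSES
    )


def unexplained_gap(sales_total, manufacturing_total, rows):
    """Difference between the two reported totals that the business rule
    does not account for. Zero means the sources are consistent."""
    non_revenue = sum(
        row["units"] for row in rows
        if row["purpose"] in NON_REVENUE_PURPOSES
    )
    return manufacturing_total - non_revenue - sales_total
```

A nonzero gap signals that the two systems disagree for a reason the documented rule does not explain, which is the consistency problem worth investigating.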

When it comes to currency or timeliness of data, two things need to be considered: how fresh is this data and is it all in the same time frame? The former ensures that you are using the most up-to-date and relevant information as is required for your model. The latter ensures that all of your data sources are from the same time period and are in the same phase (all monthly data, for example). If one of your sources is quarterly and you have to interpolate to monthly data, that is fine, but please recognize that that is an assumption within your model and may influence the outcome.
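One way to keep that assumption visible is to flag every interpolated value. The sketch below uses simple linear interpolation between quarter-end observations, which is an assumption of this example and only one of several possible methods. It suits level-type measures such as inventory on hand; a quarterly total such as units shipped would need to be split across months instead.

```python
def quarterly_to_monthly(quarter_end_values):
    """Linearly interpolate quarter-end observations to monthly values.

    Input: values observed at the end of consecutive quarters.
    Output: (value, is_interpolated) pairs, one per month, running from
    the first quarter end through the last.
    """
    monthly = []
    for start, end in zip(quarter_end_values, quarter_end_values[1:]):
        step = (end - start) / 3
        monthly.append((start, False))            # observed quarter end
        monthly.append((start + step, True))      # interpolated month
        monthly.append((start + 2 * step, True))  # interpolated month
    if quarter_end_values:
        monthly.append((quarter_end_values[-1], False))
    return monthly
```

Carrying the flag through to the model makes it straightforward to test how sensitive the results are to the interpolated points.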

In this era of social interactions, one of the best ways of ensuring complete, correct, consistent and current data is for it to be collaborative. Collaboration drives consensus on definitions and other standards that everyone across functions can agree to. Furthermore, many experts (such as Professor Scott E. Page from the University of Michigan, who spoke previously at an INFORMS Roundtable meeting) maintain that crowd-sourced data is often better than that solely presented by “the expert.”

You can do simple tests to check the health of your data, and several tools such as Informatica or DataFlux are available to assist you. In fact, for data that is used on a regular basis, it is possible to automatically fix data issues using these technologies.

Only when you have good data quality can the modeling process truly begin. Modelers in practice often spend up to 80 percent of their time reviewing and prepping the data.

What’s the moral of this story? Don’t underestimate the data – because no matter how good a model you build or how strong a hypothesis you have, if the data is bad, the results are worthless. To quote the late Joseph M. Juran, father and evangelist for quality and quality management, “Data are of high quality if they are fit for their intended uses in operations, decision making, and planning.”