Text analytics

Bridging the gap between quantitative and qualitative information.


By Douglas A. Samuelson

New computer software and analytical methods offer promising ways to combine two kinds of data traditionally separated: quantitative and qualitative information. What data mining became in the 1990s, text mining/text analytics may well become in the current decade – a powerful way to find patterns not previously suspected. Statistical analysis and understanding natural language can go together. New methods and technology now make it possible to store, index, search and retrieve free-form text more effectively and efficiently, and on a much larger scale, than was possible just a few years ago.

Buoyed by these advances, text mining offers great promise for OR/MS and general analytics. Recent successful applications include fraud and abuse detection, market analyses, evaluation of the effectiveness of sales and marketing campaigns, national security and law enforcement, work process improvement, and information technology support for healthcare decisions and patient education, along with the success of IBM’s Watson program on the television game show “Jeopardy!”

Computerized storage and keyword-based retrieval of free-form text is not new. Directed search, “mining” the stored text for patterns, is also well-known. In the 1980s, some young computer-savvy analysts used computerized document storage and retrieval, including mining methods, to great effect to identify key defendants in the Argentine “dirty war” after the junta there was deposed [5]. This was a pioneering use of document clustering: one key component was finding reports with certain common elements such as the type and color of automobile the suspects used. Other human rights applications quickly followed.

In 1991, this reporter applied text-mining methods to demonstrate the ability to detect and discover patterns of causation in general aviation crashes [5]. Around that same time, no less a personage than Nobel Laureate Herbert Simon was exploring the cognitive processes behind people’s responses to free-form questions and explaining how to convert these anecdotal statements into quantitative data [2].

As with data mining, text mining can be “supervised,” looking for a pre-specified pattern, or “unsupervised,” noting whatever stands out from the background. Supervised searches tend to be more effective and are easier to explain later, but they are open to criticism that the analyst merely found what he already suspected was there. For this reason, leading text analysts have expanded the method to include evaluating relationships and trying to find and use context, as humans do in interpreting language.
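The distinction can be made concrete with a minimal sketch. The corpus, document names and word-frequency heuristic below are invented for illustration: the supervised search looks for a pre-specified pattern, while the unsupervised pass simply flags terms that stand out from the background.

```python
from collections import Counter

# Hypothetical corpus; document names and contents are illustrative only.
docs = {
    "report1": "engine failure during climb caused the crash",
    "report2": "pilot error on approach in heavy fog",
    "report3": "engine failure shortly after takeoff",
    "report4": "routine flight completed without incident",
}

# Supervised: look for a pre-specified pattern (here, a fixed phrase).
def supervised_search(docs, pattern):
    return [name for name, text in docs.items() if pattern in text]

# Unsupervised: flag terms that stand out against the background
# (here, words that appear in only one document).
def unsupervised_anomalies(docs):
    counts = Counter(w for text in docs.values() for w in set(text.split()))
    return sorted(w for w, c in counts.items() if c == 1)

print(supervised_search(docs, "engine failure"))  # → ['report1', 'report3']
print(unsupervised_anomalies(docs))
```

The supervised result is easy to explain afterward, but, as noted above, it only ever finds what the analyst already suspected; the anomaly list surfaces terms no one thought to ask about.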

Expanding the Method: Key Definitional Questions

In a recent book [3], John Elder and Andrew Fast of Elder Research enumerated five questions for finding the right scope and practice area:

  • focus, especially the question of whether to emphasize search or information extraction – that is, finding specific words and documents versus characterizing the entire data set;
  • granularity, the desired level of detail of focus;
  • available information, that is, whether there is enough information to establish a pattern and search for it or, with less information, merely to look for anomalies;
  • syntax versus semantics, which means deciding whether the analysis is more about what words literally mean or what they connote in context; and
  • Web information vs. traditional document-based text.

They then proceeded to categorize text mining into seven practice areas:

  • search and information retrieval: the classic early use of the technique, consisting simply of finding words or phrases and retrieving blocks of text containing those terms;
  • document clustering: finding sets of documents related by one or more search criteria – the Argentine human rights analysis is an early and striking example;
  • document classification: grouping documents into sets with assigned labels characterizing what those documents have in common, typically working from some pre-specified sets of documents already labeled (i.e., a “supervised” method);
  • Web mining: drawing on older techniques, particularly document classification and natural language understanding, to exploit the structured, linked format of information on the Web, which offers opportunities and challenges different from standard text;
  • information extraction: deriving or discovering structured data from unstructured data, which often requires specialized algorithms and software, considerable training and tuning, and substantial customization, and may require continuing involvement by human subject matter experts;
  • natural language processing: supplementing the long history of this approach in linguistics and computer science by expanding the use of statistical methods to help, and by supporting other techniques by supplying grammatical cues and phrase boundaries; and
  • concept extraction: the newest, most powerful and probably trickiest component, as it requires the continuing combination of human and machine intelligence to try to determine meaning within context, including implications that may be extremely difficult to explain consistently, much less to derive by any logical scheme.
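Document clustering, the practice area behind the Argentine example above, can be sketched in a few lines: link reports that share common elements, such as the type and color of an automobile. The report texts and the similarity threshold below are invented for illustration.

```python
# A minimal sketch of document clustering by shared terms, in the spirit
# of the Argentine human rights example: grouping reports that mention
# the same details. Texts and the 0.2 threshold are invented.
from itertools import combinations

reports = {
    "r1": "suspects fled in a green ford falcon near the plaza",
    "r2": "witnesses saw a green ford falcon leaving the scene",
    "r3": "victim last seen boarding a city bus downtown",
}

def jaccard(a, b):
    """Word-set overlap: |A & B| / |A | B|."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Link any pair of reports whose word overlap exceeds the threshold.
links = [(x, y) for x, y in combinations(reports, 2)
         if jaccard(reports[x], reports[y]) > 0.2]
print(links)  # r1 and r2 are linked by 'green ford falcon'
```

Real systems replace raw word overlap with weighted terms (e.g., TF-IDF) so that distinctive details like “ford falcon” count for more than common words, but the grouping principle is the same.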

Of course, these practice areas all overlap and interact to a substantial degree. As Miner, Elder et al. stated, document classification, clustering and concept extraction offer the best opportunities for the non-specialist to learn and produce results quickly, while search and information retrieval, thanks to Google and similar engines, have become standardized enough that everyone does them routinely as part of other efforts. “Information extraction and natural language are the most distinct areas technically,” they explained, “often requiring specialized software to achieve strong performance,” and also requiring “significant amounts of domain expertise.”

The book includes many case studies and tutorials based on actual applications, in many instances keyed to specific software with instructions about how to use the software to replicate the results. Elder Research’s software includes visual displays of social network structures, multi-level clustering and revealed preferences in markets, all derived from text mining and related analytics.


Watson and the Paris Hilton Problem

IBM’s Watson is one of the most widely noted text analytics packages, mostly because of its striking success competing against top human opponents in the television game “Jeopardy!” This effort required not only highly accurate, extremely fast search and retrieval of information, but also numerous advances in interpreting the meaning of questions, recognizing and discarding irrelevant and misleading information, and learning from mistakes. IBM devoted the current issue of the IBM Journal of Research and Development [4] to the detailed development and exposition of the variety of techniques and approaches involved. The technically advanced reader with some time on his or her hands can become immersed to daunting depth in issues such as assessing the performance of alternative relevance-scoring models and language interpretation.

To illustrate the challenge, one example IBM presenters like to mention is “the Paris Hilton problem.” Given that phrase, how much additional information is needed to determine whether the subject is the famous (or infamous) young woman or the hotel, and what methods offer the best balance of speed and reliability to resolve ambiguities like this one? Another set of examples is the panoply of English phrases in which a noun modifies another noun, and word order may or may not make a big difference. “Drunk driving” means exactly the same thing as “driving drunk.” “Assistant physician” means much the same thing, but not quite, as “physician’s assistant,” and “practical nurse” is similar to, but again not quite the same as, “nurse practitioner.” However, “house cat” means something very different from “cat house.”
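A toy sketch of one common approach to this kind of ambiguity: score each candidate sense by how many context words match a profile associated with that sense. The sense profiles and the example sentence below are invented; real systems use far richer statistical models of context.

```python
# Toy context-based disambiguation for "Paris Hilton": score each
# candidate sense by overlap with the surrounding words.
# Sense profiles and the sentence are invented for illustration.
senses = {
    "person": {"actress", "celebrity", "heiress", "film", "reality"},
    "hotel": {"rooms", "suite", "booking", "guests", "lobby"},
}

def disambiguate(context_words, senses):
    scores = {label: len(profile & set(context_words))
              for label, profile in senses.items()}
    return max(scores, key=scores.get)

sentence = "the paris hilton lobby offers guests a grand suite experience"
print(disambiguate(sentence.split(), senses))  # → 'hotel'
```

The hard part, of course, is where the profiles come from and what to do when the overlap is zero or tied; that is where speed-versus-reliability trade-offs arise.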

One can only imagine how much more difficult such context-dependent, knowledge-dependent distinctions become in a highly idiomatic language such as French. For instance, “un chèvre coiffé,” literally “a goat with a hairdo,” is a commonly used phrase for “ugly person,” connoting someone so unattractive that all efforts to improve his or her appearance are futile.

The Internet provided an unusually amusing example of this problem a few years ago when actress Sharon Stone struck a scantily clad pose for the Paris-based newspaper Le Monde. The newspaper captioned the photo, “50 ans, et.... alors!” This is best translated as “50 years old, and look here!” Someone with a dictionary but limited knowledge of French usage relied on “then,” the literal meaning of “alors,” and rendered it for American media as “50 years old, and then some!” Whether the translation was produced by a computer, or who was responsible for it at all, was never made clear.

Thus the challenges of inferring context and inferring from context remain critical to text analytics, and these are precisely the areas in which some of the most exciting research is now taking place. Watson does not generate single “best” answers to “Jeopardy!” questions; it produces multiple suggested answers, each with an estimate of its confidence that the answer is right. It interprets natural language questions and learns improved question interpretation, not just improved answer inference, from wrong answers. To do this, Watson also supports iterative dialogue to refine results.
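The multiple-answers-with-confidence idea described above can be sketched simply. The candidate answers, raw evidence scores and the decision threshold below are invented for illustration; the point is that raw scores are normalized into confidences and the system commits only when the leading candidate is confident enough.

```python
import math

# Sketch of confidence-ranked candidate answers, loosely modeled on the
# description of Watson above; scores and the 0.5 threshold are invented.
candidates = {"Toronto": 1.2, "Chicago": 3.1, "Boston": 0.4}

def confidences(scores):
    """Softmax: turn raw evidence scores into probabilities summing to 1."""
    z = {a: math.exp(s) for a, s in scores.items()}
    total = sum(z.values())
    return {a: v / total for a, v in z.items()}

conf = confidences(candidates)
best = max(conf, key=conf.get)
# Commit to an answer only when confident enough to "buzz in".
if conf[best] > 0.5:
    print(best, round(conf[best], 2))  # prints: Chicago 0.82
```

Feedback on wrong answers can then adjust the scoring model itself, which is how learning improves question interpretation and not just answer selection.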

Watson was constrained to quick response times, static content and the assumption of a single questioner, but contemplated enhancements include multiple users, more varied training data, more dynamic content updates and varying required response times. Additional future enhancements desired include large amounts of structured data, greatly increased predictive and statistical capabilities and social media analysis.

Doctor Watson?

IBM has declared that one of the highest-priority, highest-consequence applications it has in mind is support for medical diagnoses. The DeepQA inference engine has shown some promise in hypothesizing and assessing differential diagnoses from the volumes of medical texts it has ingested. To do this, Watson had to be modified to make use of temporal reasoning (when symptoms occurred and how long they persisted are significant), geospatial reasoning (where the pain started and how it spread or traveled is also important), and statistical paraphrasing to map more reliably between lay terms and medical terminology [1].

This does not mean having Watson diagnose, as many analysts still have fresh and painful memories of earlier attempts by various technical providers to develop machine-based diagnoses. Rather, the design is for Watson to provide the physician a list of possibilities to consider, evidence from which these possibilities were inferred and tests that would help to elevate or rule out some of those possibilities. The judgment and the responsibility remain with the physician; intelligent augmentation of human capabilities is the idea, not artificial intelligence displacing the human.

Another exciting component of the potential for this type of medical decision support system is the ability to recognize much more quickly the emergence of similar symptoms in a number of patients in diverse locations at about the same time. This capability could greatly enhance the early detection of incipient epidemics, one of the most vital and stubborn challenges in public health [1], [7].


The adaptation will not necessarily be smooth or easy, however. An important difference between “Jeopardy!” and healthcare is the relative cost of different types of error and the consequent effect on learning protocols. In the game, timing is critical: most contestants know most of the answers, so buzzing in ahead of one’s opponents, but not so early as to mishear the question, is a key element. In medical diagnosis, a delay of several seconds or even a few minutes is far less harmful than in the game, but being egregiously wrong is much worse.

Still another difference is the thorny set of issues around sharing information effectively while protecting the privacy and confidentiality of individually identifiable information. Methods for establishing and maintaining the proper balance continue to evolve, which means available data sets and access protocols also continue to evolve. Merely ensuring that tests do not need to be repeated (some are both costly and dangerous) and that decision-makers have the information they need when they need it is an ongoing challenge, and an opportunity for major improvement in both safety and cost-effectiveness of healthcare [6]. Increasing involvement of computers in decision support may help to push institutions toward better information handling, a benefit in itself, but it also raises concerns about the extent to which human know-how can safely be supplanted.

The Human Element

Another element of IBM’s experience illustrates why human judgment remains critical. Arnie Greenland, a Distinguished Engineer in IBM’s Global Business Services Federal practice, tells of a project he led a few years ago for the Social Security Administration, trying to use text mining methods to improve processing of disability claims. As often happens in analytical projects, the biggest breakthrough was the discovery that redefining the problem could produce a dramatic effect. Rather than trying to distinguish approved claims from disapproved claims, the team found that they could greatly reduce backlogs and expedite the vast majority of claims by distinguishing simple cases from complicated ones. By sorting the claims in this way, and having human resources reallocated accordingly, they were able to generate better service for most beneficiaries and substantial cost savings for the agency.

So far, even the most advanced computer-based analytical methods, in text mining and in other fields, cannot approach humans’ ability to recognize when redefining the problem could be beneficial. Perhaps there is some way of processing the language by which goals were stated in order to suggest alternative objectives and priorities that would work better than those originally stated, but this capability seems to be still well beyond reach. (It is difficult even for the most experienced human analysts!) Considering what computers have been able to accomplish, however, this may yet be a promising and productive area for future research.


Advances in methods for combining qualitative and quantitative data, especially via expanded tools and techniques for text mining and related activities, offer great potential for improved analysis and decision-making. This appears likely to be one of the hottest analytical areas this decade. Many parts of this rapidly growing subject are ripe for further research, ranging from enhancing inference, to developing more efficient storage, in-stream categorization and retrieval of data, to improving understanding of and interaction with natural language questions. However, the need for human judgment and know-how persists, and in some areas is even more critical. The importance and information value of context continues to raise difficult issues. OR/MS analysts who learn about these methods and their potential and limitations are likely to see numerous and growing opportunities to put these capabilities to good use.

Douglas A. Samuelson (samuelsondoug@yahoo.com) is president of InfoLogix, Inc., a research and consulting company in Annandale, Va., and a contributing editor of OR/MS Today.


  1. Basit Chaudhry, “Putting Watson to Work in Healthcare,” presentation to the 2nd Annual Robert H. Smith School of Business and IBM Business Analytics workshop, May 17, 2012, www.rhsmith.umd.edu/AnalyticsWorkshop/agenda.aspx.
  2. K. Anders Ericsson and Herbert A. Simon, “Protocol Analysis: Verbal Reports as Data,” MIT Press, 1984 (revised 1993).
  3. Gary Miner, Dursun Delen, John Elder, Andrew Fast, Thomas Hill and Robert A. Nisbet, “Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications,” Academic Press, Waltham, Mass., 2012.
  4. IBM Journal of Research and Development, Vol. 56, No. 3/4, May/July 2012.
  5. Douglas A. Samuelson, “Use and Retrieval of Anecdotal Information,” Proceedings of the Social Science Section, American Statistical Association Annual Meetings, 1992; also available at www.asksam.com under “User Stories.”
  6. Douglas A. Samuelson, “Diagnosing the Real Health Care Villain,” OR/MS Today, February 1995.
  7. Douglas A. Samuelson, “Can We Detect ‘The Coming Plague’?: How Emerging Health Threats Are Sneaking Up on Us,” OR/MS Today, June 2008.