Recent advancements in unstructured text analytics.

By Charlie Greenbacker

Text Analytics

According to some estimates, unstructured data accounts for more than 90 percent of the digital universe. Photo credit: “Paperwork 2” by Issac Bowen (CC-BY-SA 2.0).

We exist in a disorderly universe brimming with entropy. The digital information we want to use to inform and improve our decision-making does not always arrive in a format optimized for quantitative analysis. Very often, a large part of the analytic process involves simply cleaning, re-formatting or otherwise pre-processing data to support these analyses. This is particularly the case when working with large volumes of unstructured text data, which requires special methods and approaches to exploit quantitatively.

Some estimates claim unstructured data accounts for more than 90 percent of the digital universe [4], much of it coming in the form of text. Digital publishing, social media and other forms of electronic communication all contribute to the deluge of text data from which countless organizations are now seeking to derive insights and extract value. Fortunately, many new tools and techniques have been developed recently that facilitate the analysis of vast amounts of unstructured text documents.

Brief History of Natural Language Processing

One field of study seeking to enable machines to automatically distill meaning from text data is natural language processing (NLP). Operating at the intersection of computer science, artificial intelligence and computational linguistics, people working in NLP are continually advancing the state of the art in text analytics by inventing new means of algorithmically understanding human language.

NLP research dates back to at least the 1950s, with the famous (or perhaps now infamous) Turing Test and the first efforts at automatically translating Russian text into English [6]. Early work in NLP was typically characterized by rule-based systems, often inspired by theories developed within traditional linguistics. However, initial enthusiasm about applications as diverse as voice-controlled robots and simulated psychotherapists eventually gave way to frustration, as researchers ultimately realized it would be nearly impossible to manually encode all of the rules necessary for a computer to participate in human-like interaction via natural language.

Over the past few decades, most people working in NLP have adjusted their strategy to rely on more statistically driven approaches, such as methods involving machine learning. This shift has enabled some incredible advances in NLP technology, notably in the areas of automated question answering and machine translation. IBM’s success with “Watson” on the TV game show Jeopardy! is a noteworthy example [16]. Problems that previously defied computational solutions, including tasks as seemingly trivial as sentence segmentation, can now be tackled as relatively straightforward applications of machine learning. For example, it is extremely difficult to hand-write a step-by-step algorithm that accurately segments English-language text documents into discrete sentences, since a period can mark an abbreviation, an initial or a decimal point as easily as the end of a sentence. With a sufficiently large amount of annotated training data, however, this can now be accomplished by a relatively simple classifier.

These recent successes have resulted in an explosion of open source software packages and consumer applications based on NLP entering the market, leading to the widespread use of tools supporting automated analysis and processing of human language. Many high-quality open source software libraries for text analytics have emerged in the past decade, often (but not exclusively) coming out of academia, putting advanced NLP capabilities into the hands of analysts and developers everywhere. These include Stanford CoreNLP, Apache OpenNLP, GATE and NLTK for Python. NLP-based consumer products, like Google Translate and Apple’s intelligent personal assistant Siri, have captured the public’s attention and opened up a whole new world of human-computer interaction.

Recent Advancements

NLP-based consumer products like Apple’s intelligent personal assistant Siri (above) have opened a new world of human-computer interaction. Photo credit: “Siri, welcome to” by Vasile Cotovanu (CC-BY 2.0).

Several NLP-based analytic resources stand out for their ease of use and direct applicability to text analysis. These software tools and datasets enable powerful insights to be derived from large volumes of unstructured text data, all without devoting endless time and energy to reviewing and meticulously coding data samples by hand. Additionally, each is available free of charge for any academic or commercial purpose under the terms of an open source software license or otherwise generously unrestrictive terms of use.

Topic modeling. In the age of “big data,” it’s a fairly common experience to somehow obtain a large number of text documents that you’re expected to analyze in a relatively short period of time. No single person, or group of people for that matter, would want to read 10,000 individual e-mail messages just to get a sense of the broad themes of discussion within an organization’s internal communication channels, for example. One approach to solving this problem would be to build topic models using latent Dirichlet allocation (LDA).

An algorithmic descendant of term frequency-inverse document frequency (TF-IDF) and probabilistic latent semantic analysis (pLSA), LDA operates on the assumption that text documents are a mixture of a certain number of topics, and that each word used by the author represents one or more of those topics. Given a set of input documents, the LDA algorithm will automatically identify a collection of topics distributed within the document set without having to be told what the topics are or which words belong to which topics. The identified topics are represented as sets of related words, each assigned a score indicating how strongly the word corresponds to the topic (the same word can belong to multiple topics with different strengths). Documents are then represented as distributions over one or more topics, with the specific distribution calculated from the relative frequency of occurrence of words associated with each topic. LDA topic modeling allows an analyst to quickly determine which topics exist in a large collection of documents, which words best characterize each topic, which topics exist within each document, which documents are most strongly associated with each topic, and so on.
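
To make the inputs and outputs of LDA concrete, here is a minimal sketch in Python that fits a two-topic model to a tiny, made-up corpus using collapsed Gibbs sampling, one common way of estimating LDA. It is an illustration only, not the implementation used by any particular package; the function name fit_lda, the hyperparameter values alpha and beta, and the four toy documents are all assumptions chosen for the example.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists.

    alpha and beta are assumed smoothing values for this sketch.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]       # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                      # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: how much the doc likes topic k, times how much
                # topic k likes this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates: topic mixture per document, word scores per topic.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    return theta, phi


# Made-up documents for illustration only.
docs = [
    "budget forecast revenue budget quarter".split(),
    "revenue quarter forecast earnings".split(),
    "server outage network server patch".split(),
    "network patch outage firewall".split(),
]
theta, phi = fit_lda(docs, num_topics=2)

for k, dist in enumerate(phi):
    top_words = sorted(dist, key=dist.get, reverse=True)[:3]
    print(f"topic {k}: {top_words}")
for d, row in enumerate(theta):
    print(f"doc {d}: {[round(p, 2) for p in row]}")
```

The two return values correspond to the outputs described above: phi holds a score for every word under every topic, and theta holds each document's distribution over topics. On a real collection, a tested library implementation would be the better choice.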

A standard implementation of LDA in the Java programming language is provided in the MALLET software package developed by UMass Amherst [12], while a more scalable implementation based on Hadoop and MapReduce is offered by the Mr. LDA package [15].

Geoparsing documents. As the demand for geospatial analytics continues to grow, most geographic information remains “trapped” in unstructured text. Tools like entity extractors and named entity recognizers can identify place names mentioned in text documents (“Springfield,” for example), but these tools cannot determine which “Springfield” was intended by the author (Massachusetts? Illinois? Missouri?), nor what “Springfield” actually means in the real world (i.e., a city located at a specific latitude and longitude, and with a specific population, elevation, etc.). Resolving place names to actual geospatial objects requires a type of tool called a geoparser.

CLAVIN (cartographic location and vicinity indexer) is an award-winning open source software package for document geotagging and geoparsing that’s fast, accurate, easy to use and scales to accommodate big data [3]. It automatically extracts location names from unstructured text using a model based on machine learning, and then intelligently resolves these names against a worldwide gazetteer [5] to produce data-rich geographic entities representing the locations mentioned in the input text. CLAVIN does not simply “look up” location names; it uses heuristic techniques to solve “the Springfield problem” based on the semantic context within the document itself. CLAVIN also employs fuzzy search to handle misspelled location names, and it recognizes alternative names (e.g., “Ivory Coast” and “Côte d’Ivoire”) as referring to the same geographic entity.
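
The resolution step can be illustrated with a small Python sketch. This is not CLAVIN's algorithm or API; it is a simplified stand-in that assumes a toy gazetteer, a hypothetical resolve function, fuzzy matching via the standard library's difflib, and a hypothetical "prominence" score used only to break ties. The coordinates are rough values included for illustration.

```python
import difflib

# Toy gazetteer (hypothetical entries). Coordinates are rough illustrative
# values and "prominence" is a made-up tie-break score, not real data.
GAZETTEER = [
    {"name": "Springfield", "admin": "Illinois", "lat": 39.80, "lon": -89.65, "prominence": 3},
    {"name": "Springfield", "admin": "Massachusetts", "lat": 42.10, "lon": -72.59, "prominence": 2},
    {"name": "Springfield", "admin": "Missouri", "lat": 37.21, "lon": -93.29, "prominence": 1},
    {"name": "Chicago", "admin": "Illinois", "lat": 41.88, "lon": -87.63, "prominence": 5},
    {"name": "Boston", "admin": "Massachusetts", "lat": 42.36, "lon": -71.06, "prominence": 5},
    {"name": "Côte d'Ivoire", "admin": "Côte d'Ivoire", "alt_names": ["Ivory Coast"],
     "lat": 7.54, "lon": -5.55, "prominence": 4},
]


def resolve(mention, context, gazetteer=GAZETTEER):
    """Resolve a place-name mention to one gazetteer entry, or None."""
    # Index every primary and alternative name, lowercased.
    by_name = {}
    for entry in gazetteer:
        for name in [entry["name"], *entry.get("alt_names", [])]:
            by_name.setdefault(name.lower(), []).append(entry)

    # Fuzzy lookup tolerates small misspellings in the mention.
    match = difflib.get_close_matches(mention.lower(), list(by_name), n=1, cutoff=0.8)
    if not match:
        return None
    candidates = by_name[match[0]]

    context_lower = context.lower()

    def score(entry):
        # Context hints: the candidate's region is named in the text, or
        # another known place from the same region is mentioned.
        same_region = [
            e["name"].lower() for e in gazetteer
            if e["admin"] == entry["admin"] and e is not entry
        ]
        hints = (entry["admin"].lower() in context_lower) + sum(
            name in context_lower for name in same_region
        )
        return (hints, entry["prominence"])

    return max(candidates, key=score)


place = resolve("Sprinfield", "Flights from Boston to Sprinfield were delayed.")
print(place["name"], place["admin"], place["lat"], place["lon"])
```

In this sketch the misspelled mention is matched to “Springfield” by fuzzy search, and the nearby mention of Boston tips the choice toward Massachusetts. A production geoparser draws on a full gazetteer and far richer context than these two hints.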

By enriching documents with structured geospatial data derived from the semantic content, the CLAVIN geoparser renders unstructured text into a new data source for traditional geospatial analysis, and it also enables the addition of a geospatial perspective for traditional text analytics.

Global event detection & monitoring. Each day, countless important events take place all around the world. Many thousands of these events are recorded and reported in various media sources in virtually every country. Identifying trends and patterns in all of this information has traditionally required a labor-intensive effort to painstakingly code each media report and add the corresponding records to a massive event database. With the sheer volume of global media output and its rapidly accelerating rate of growth, it’s unlikely that any research organization of any size would ever have the manpower and resources necessary to keep up with everything using purely manual methods.

GDELT is the Global Database of Events, Language and Tone [9]. Each day, a series of software algorithms operating without human intervention scours thousands of broadcast, print and online news sources around the globe, automatically extracting comprehensive event records capturing what’s happening around the world, who’s involved and what everyone thinks about it. With nearly 250 million event records in total, the GDELT database goes back to 1979 and covers all countries. Each record contains detailed information about the actors, locations, affiliations and other important context for the event. Data updates are published every day, with everything being fully open and 100 percent available for unlimited use and distribution.
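
Once events are available as structured records, trend analysis reduces to ordinary aggregation. The Python sketch below shows the idea on a handful of made-up rows. The field names (year, country, category, tone), the placeholder country codes and the values are all hypothetical and deliberately simplified; they do not reproduce the actual GDELT file format.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows in a simplified, hypothetical schema (not the real GDELT format).
EVENTS = [
    {"year": 2012, "country": "AAA", "category": "conflict", "tone": -4.0},
    {"year": 2012, "country": "AAA", "category": "cooperation", "tone": 2.0},
    {"year": 2013, "country": "AAA", "category": "conflict", "tone": -6.0},
    {"year": 2013, "country": "BBB", "category": "conflict", "tone": -2.0},
    {"year": 2013, "country": "BBB", "category": "cooperation", "tone": 3.0},
]


def conflict_counts_by_year(events):
    """Count conflict events per year to expose a simple trend."""
    counts = defaultdict(int)
    for event in events:
        if event["category"] == "conflict":
            counts[event["year"]] += 1
    return dict(counts)


def mean_tone_by_country(events):
    """Average the tone score of all events associated with each country."""
    tones = defaultdict(list)
    for event in events:
        tones[event["country"]].append(event["tone"])
    return {country: mean(values) for country, values in tones.items()}


counts = conflict_counts_by_year(EVENTS)
tones = mean_tone_by_country(EVENTS)
print(counts)
print(tones)
```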

GDELT was created to provide an open database of quantitative information about societal-scale human behavior and worldwide events for use by the global research community. In particular, GDELT data has been used to analyze trends in global conflict, as well as to monitor the tone of news coverage about each of the world’s heads of state.

Fusing structured & unstructured data. More often than not, analysts and researchers find themselves working with many different kinds of data. They’re not looking solely at text documents or working exclusively with a single database; they’re working with both unstructured text and structured data, plus images, video and potentially many other forms of multimedia. All of this needs to be combined, cross-referenced and integrated into a single common operating picture in order to run analyses that encompass the entire spectrum of source material. This presents a significant challenge, as it is rarely straightforward to perform a complex “join” operation across disparate data that lack common “keys.”

Lumify is an open source project to create a big data fusion, analysis and visualization platform designed to help users discover connections and explore relationships in their data [11]. It can ingest anything from spreadsheets and text documents to images and video files, representing this diverse data as a collection of entities, properties and relationships between entities. Several different open source tools (including OpenNLP and CLAVIN) are used to enrich the data, increase its discoverability and uncover hidden connections. Text documents ingested by Lumify (and any additional text it’s able to extract from other data types) are processed by an entity extractor, a geoparser and other analytics to identify key elements in the unstructured data, which are then integrated automatically with other sources of structured and unstructured data based on common properties as outlined in a user-defined ontology. For example, Lumify can identify a person’s name mentioned in the text of a news report, link that mention to a specific record in an employee database, and match both of those instances to the transcript of a conversation automatically generated from a YouTube video.
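
The entity, property and relationship representation can be sketched in a few lines of Python. The example below is not Lumify's data model or API; it assumes hypothetical records from three sources, a crude normalize_name function standing in for a shared property, and a plain dictionary and list in place of a graph database.

```python
def normalize_name(name):
    """Crude join key: lowercase, drop periods, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def fuse(records):
    """Build a tiny property graph linking source items to person entities."""
    vertices = {}  # vertex id -> properties
    edges = []     # (source vertex id, label, target vertex id)
    for rec in records:
        person_id = "person:" + normalize_name(rec["person_name"])
        # Create the person entity the first time any source mentions it.
        vertices.setdefault(person_id, {"type": "person", "name": rec["person_name"]})
        vertices[rec["id"]] = {"type": rec["source"]}
        edges.append((rec["id"], "mentions", person_id))
    return vertices, edges


def items_mentioning(edges, person_id):
    """List every source item connected to the given person entity."""
    return sorted(
        src for src, label, dst in edges
        if label == "mentions" and dst == person_id
    )


# Hypothetical records: one extracted mention or database row per item.
RECORDS = [
    {"id": "news-1", "source": "news_report", "person_name": "Jane Q. Public"},
    {"id": "hr-42", "source": "employee_db", "person_name": "jane q public"},
    {"id": "video-7", "source": "video_transcript", "person_name": "Jane Q Public"},
    {"id": "news-2", "source": "news_report", "person_name": "John Doe"},
]

vertices, edges = fuse(RECORDS)
linked = items_mentioning(edges, "person:jane q public")
print(linked)
```

Joining on a normalized name is the simplest possible choice and breaks down as soon as two people share a name, which is exactly the difficulty described under “Current Challenges.”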

Everything in Lumify is stored in a scalable and secure graph database to enable advanced social network analysis and complex graph traversals. Built on proven open source technologies for big data, it supports a variety of mission-critical use cases centered around the emerging concepts of activity-based intelligence, object-based production and human geography. Its intuitive Web-based user interface provides a suite of analytic options with multiple views on the data, including 2D and 3D graphs, full-text faceted search, histograms with aggregate statistics and an interactive geographic map exploration feature.

Screenshot of Lumify being used to investigate the terrorist network threatening the 2014 Winter Olympics in Sochi, Russia.

Current Challenges

Despite more than 50 years of academic research and industry efforts in NLP, many challenges and unsolved problems remain. Among these are the task areas of coreference resolution and sentiment analysis, both of which are major obstacles preventing machines from achieving reading comprehension scores even at the level of an average elementary school student.

Coreference resolution. Unfortunately for practitioners of text analytics, proper names are generally not unique identifiers. As we saw earlier with the “Springfield problem,” multiple distinct entities can share the same name. Resolving ambiguous names for locations is difficult enough, but doing the same with other entity types such as persons and organizations is even more challenging. There is no single authoritative and exhaustive listing of names for every person in the world; many millions of people are born and die each year; and the level of ambiguity for people’s names is significantly higher (compare the number of cities named “Springfield” to the number of people named “John”). Solving the problem of coreference resolution would enable machines to determine that “President Obama,” “Barack Obama,” and “Mr. Obama” all refer to the same person, but that “President George Bush” may in fact refer to two different people.
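
A deliberately naive Python sketch shows why simple string rules are not enough. The cluster_by_surname function is a hypothetical baseline invented for this illustration: it groups mentions by their final token, which handles the Obama mentions correctly but also merges mentions of two different presidents named Bush.

```python
from collections import defaultdict


def cluster_by_surname(mentions):
    """Naive baseline: treat the last token of each mention as its identity."""
    clusters = defaultdict(list)
    for mention in mentions:
        surname = mention.split()[-1].lower()
        clusters[surname].append(mention)
    return dict(clusters)


mentions = [
    "President Obama",
    "Barack Obama",
    "Mr. Obama",
    "President George Bush",
    "George H. W. Bush",
    "George W. Bush",
]
clusters = cluster_by_surname(mentions)

print(clusters["obama"])  # three mentions of one person, grouped correctly
print(clusters["bush"])   # two different people, wrongly merged into one group
```

Separating the two Bush presidents requires information that is not in the name itself, such as dates or surrounding context, and “President George Bush” alone remains ambiguous even then.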

Sentiment analysis. It’s a cliché in various works of science fiction to portray a struggle by artificial intelligence to understand complex human emotions. There’s been much interest recently in using methods for sentiment analysis to identify and extract opinions and other forms of subjective information from text. In some circumstances where the topic is clear and unambiguous, there has been some success in analyzing the sentiment expressed in text, such as with online movie reviews [13]. However, in other domains where the subject of the sentiment is uncertain and there may be multiple levels of obfuscation, such as newspaper editorials, the results have been far less impressive. Another complicating factor is the rampant use of sarcasm in online communications. Consider two somewhat similar tweets, one that reads, “This concert was awesome!” and another saying, “My car got rear-ended. Awesome!” [14]. Without a significant amount of world knowledge at its disposal, it would be virtually impossible for an algorithm to distinguish the very different sentiment expressed in those two posts.
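
The difficulty is easy to reproduce with a word-counting approach. The Python sketch below scores text against a tiny, hypothetical sentiment lexicon (the four words and their weights are assumptions made up for this example) and gives both tweets the same positive score.

```python
import re

# Tiny hypothetical lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"awesome": 1, "great": 1, "terrible": -1, "awful": -1}


def lexicon_score(text):
    """Sum the lexicon weights of the words in the text (unknown words count 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


concert = lexicon_score("This concert was awesome!")
car = lexicon_score("My car got rear-ended. Awesome!")
print(concert, car)  # the two scores are identical
```

Nothing in the words themselves signals that the second tweet is sarcastic; the scorer would need to know that being rear-ended is a bad experience.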

Additional Resources

Founded in 1962, the Association for Computational Linguistics (ACL) is the international scientific and professional society for NLP and related disciplines [1]. The ACL publishes two quarterly journals and sponsors several annual conferences and workshops, with the 2014 meeting of the main ACL conference set for June 22-27 in Baltimore. Can’t make it to ACL 2014? The proceedings will be published online at the ACL Anthology, where more than 25,000 open-access research papers on NLP and text analysis are hosted, covering every topic from part-of-speech tagging and lemmatization to automatic speech recognition and machine translation.

An easy, hands-on introduction to NLP programming with Python’s NLTK package can be found in the book “Natural Language Processing with Python” [2]. “Taming Text” provides a guidebook of practical examples for working with text in real-world analytic applications [7]. A more advanced look at algorithms for text analysis that scale to “big data” is available in “Data-Intensive Text Processing with MapReduce” [10]. Finally, the book “Data Mining Methods for the Content Analyst” offers an excellent introduction to NLP-based computational analysis of text content geared towards researchers in the humanities and social sciences [8].

Charlie Greenbacker (@greenbacker on Twitter) is director of Data Science at Altamira Technologies Corporation in McLean, Va., and founder of the Washington, D.C., Natural Language Processing meetup group.

References

  1. Association for Computational Linguistics.
  2. Steven Bird, Ewan Klein and Edward Loper, “Natural Language Processing with Python,” O’Reilly, Sebastopol, Calif., 2009.
  3. CLAVIN, Berico Technologies.
  4. John Gantz and David Reinsel, “Extracting Value from Chaos,” IDC Study sponsored by EMC Corporation, June 2011.
  5. GeoNames geographical database.
  6. John W. Hutchins, “The Georgetown-IBM Experiment Demonstrated in January 1954,” in “Machine Translation: From Real Users to Research” (Robert E. Frederking and Kathryn B. Taylor, editors), Springer Berlin Heidelberg, 2004.
  7. Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris, “Taming Text,” Manning, Shelter Island, N.Y., 2013.
  8. Kalev Hannes Leetaru, “Data Mining Methods for the Content Analyst,” Routledge, New York, N.Y., 2012.
  9. Kalev Leetaru and Philip Schrodt, “GDELT: Global Data on Events, Language and Tone, 1979-2012,” Proceedings of the International Studies Association Annual Conference, 2013.
  10. Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce,” Morgan & Claypool, Lexington, Ky., 2010.
  11. Lumify, Altamira Technologies Corporation.
  12. Andrew Kachites McCallum, “MALLET: A Machine Learning for Language Toolkit,” 2002.
  13. Bo Pang and Lillian Lee, “Opinion Mining and Sentiment Analysis,” Foundations and Trends in Information Retrieval, Vol. 2, Issue 1-2, January 2008.
  14. Maksim Tsvetovat, “Implicit Sentiment Mining in Twitter Streams,” online presentation, Nov. 20, 2012.
  15. Ke Zhai, Jordan Boyd-Graber, Nima Asadi and Mohamad Alkhouja, “Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce,” Proceedings of the 21st International World Wide Web Conference, 2012.
  16. Douglas A. Samuelson, “Text Analytics: Bridging the Gap Between Quantitative and Qualitative Information,” OR/MS Today, June 2012.