Data visualization

The important role visualization plays in presenting data in a powerful and credible way.

By Navneet Kesher

Data science is more than just building machine learning models; it’s also about explaining the models and using them to drive decisions. In the journey from analysis to data-driven outcomes, data visualization plays a very important role by presenting data in a powerful and credible way.

Why Unstructured Data?

Structured data only accounts for about 20 percent of stored information. The rest is unstructured data – texts, blogs, documents, photos, videos, etc. Unstructured data, also known as dark data, includes information assets that organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Unstructured data is the hidden part of the massive iceberg that has yet to be analyzed for useful decision-making.

In many circles, unstructured data is considered a burden that should be sorted and stored away. In reality, it contains valuable business insights that can significantly augment the business understanding that we have today from structured data.

Figure 1: Unstructured data: the hidden part of the massive iceberg.

Although machine learning can analyze any type of data (structured or unstructured), unstructured data is virtually useless without machine learning algorithms (including natural language processing (NLP) algorithms, text-mining algorithms, pattern/classification algorithms, etc.). While machine learning algorithms have seen significant advancements, the tools and processes for visualizing their results for a general audience have not kept pace.

Visualization tools are extremely valuable, but they have traditionally operated mostly on highly structured data, such as stock prices and sales records. As we create and consume more unstructured data, we have to extend our visualization efforts to include it.

Importance of Data Visualization

As a data scientist, I always question the amount of time I put into data visualization. Throughout my early analytics career, I observed that the prettier my graph, the more skeptical my audience was of the quality of my analysis. While I loved data visualization, I also feared coming across as the person who puts more emphasis and effort into making the graphs pretty than into ensuring a thorough analysis (Figure 2).

Figure 2: Can pretty graphs make an audience more skeptical of the quality of the analysis?

As I progressed in my career, I realized that data analysis and data visualization are not mutually exclusive – they co-exist and feed off each other (that is, you can produce pretty graphs and still come across as someone with analytical prowess). The rest of this article will focus on representing highly complex analysis on a sheet of paper (or slide) for someone who may not need to understand the underlying details.

Representing Unstructured Data

Below are the three broad guidelines that I follow while building visualizations for unstructured data:

1. Start with a goal. Goals are the fundamental bonding agent that connects the purpose of the analysis to the visualization of results. Whether the goal is to arrive at a decision or to kick off an exploration of next steps, the data scientist should aim to identify and convey the results and corresponding visualization that best support a well-defined goal.

For example, if the goal is to analyze a call center’s audio recordings to determine the type and corresponding volume of complaints, a Cubism horizon graph [1] may be very useful. Cubism.js is a very effective time series visualization tool that uses stacked area graphs to help analyze output from audio-video streaming data. In the case of call center audio recordings, the horizon graph can help determine the intensity of the customer conversations (along with time series data) without having to transcribe the audio into text.
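To make the idea behind a horizon graph concrete, here is a minimal Python sketch of the band-folding step: a series is sliced into equal-height bands that a chart would then draw on top of one another in progressively darker shades. This is not Cubism.js code; the function name `horizon_bands` and the sample intensity scores are assumptions made up for illustration.

```python
def horizon_bands(values, n_bands=3):
    """Fold a series into n_bands layers, as a horizon graph does.

    Returns (band_height, layers), where layers[b][t] is the height in
    [0, band_height] drawn for band b at time t. Negative values are
    mirrored (absolute value); a real chart would also color them differently.
    """
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0:
        return 0.0, [[0.0] * len(values) for _ in range(n_bands)]
    band_height = peak / n_bands
    layers = []
    for b in range(n_bands):
        floor = b * band_height
        # Keep only the part of each value that falls inside this band.
        layers.append([min(max(abs(v) - floor, 0.0), band_height) for v in values])
    return band_height, layers


# Hypothetical per-minute "intensity" scores for one call (illustrative data).
intensity = [0.0, 3.0, 6.0, 9.0, 4.5]
band_height, layers = horizon_bands(intensity, n_bands=3)
for b, layer in enumerate(layers):
    print(f"band {b}: {layer}")
```

Because every band is drawn in the same narrow strip, many calls can be stacked on one page, and the darkest regions show where a conversation was most intense.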

While we are on the topic of call center customer service recordings, if the goal is to understand the differences between subscription customers and free-tier customers, then a text analysis along with a scatter text visualization [2] may make more sense. Of course, this will require transcription and annotation of the media files.

Figure 3: Visualization is most effective when it is simple to understand and can stand by itself.

Call centers use analytics to analyze thousands (or millions) of hours of recorded calls. The main goals include gaining insight into customer behavior and identifying product/service issues. The analysis method that I have found particularly useful for these goals is self-organizing maps (SOM) [3], which, along with classification, have the added benefit of dimensionality reduction. SOMs are also good for visualizing multidimensional data as a 2-D planar diffusion map.
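As a rough illustration of how a SOM places multidimensional records on a 2-D grid, the following is a minimal pure-Python sketch. It is a teaching toy under stated assumptions, not a production implementation: the class name `TinySOM`, the decay schedule and the synthetic three-feature "call" vectors are all invented for this example.

```python
import math
import random


class TinySOM:
    """A minimal self-organizing map on a rows x cols grid (illustrative only)."""

    def __init__(self, rows, cols, dim, seed=0):
        self.rng = random.Random(seed)
        # One weight vector per grid cell, initialized at random in [0, 1).
        self.weights = {
            (r, c): [self.rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)
        }

    def bmu(self, x):
        """Best-matching unit: the grid cell whose weights are closest to x."""
        return min(self.weights, key=lambda cell: math.dist(self.weights[cell], x))

    def train(self, data, epochs=50, lr0=0.5, sigma0=2.0):
        for epoch in range(epochs):
            # Learning rate and neighborhood radius both shrink over time.
            decay = 1.0 - epoch / epochs
            lr = lr0 * decay + 0.01
            sigma = sigma0 * decay + 0.3
            for x in self.rng.sample(data, len(data)):
                br, bc = self.bmu(x)
                for (r, c), w in self.weights.items():
                    # Cells near the BMU on the grid are pulled harder toward x.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    pull = lr * math.exp(-grid_d2 / (2 * sigma ** 2))
                    for i in range(len(w)):
                        w[i] += pull * (x[i] - w[i])

    def quantization_error(self, data):
        """Mean distance from each sample to its best-matching unit."""
        return sum(math.dist(self.weights[self.bmu(x)], x) for x in data) / len(data)


# Synthetic data: two hypothetical groups of calls, each described by 3 features.
noise = random.Random(1)
calm = [[0.1 + noise.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(20)]
heated = [[0.9 + noise.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(20)]
calls = calm + heated

som = TinySOM(rows=4, cols=4, dim=3)
error_before = som.quantization_error(calls)
som.train(calls)
error_after = som.quantization_error(calls)
print("calm call lands on cell", som.bmu([0.1, 0.1, 0.1]))
print("heated call lands on cell", som.bmu([0.9, 0.9, 0.9]))
```

Each call ends up on a grid cell, so a three-feature record becomes a 2-D coordinate that can be plotted directly, and similar calls land on the same or neighboring cells.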

Having and understanding the goal is the most crucial aspect of any data visualization process. Always ask yourself and your stakeholders: What will this data be used for? List the data points that will be vital for answering strategic questions for your business, and then create a wireframe of the story that is going to engage your audience.

2. Simplicity for the win. The very reason we analyze unstructured data is to provide structure to it. Data visualization plays a very important role in conveying the results of the analysis, and visualization is most effective when it is simple to understand and can stand by itself without a lot of subtext or metadata. One of my favorite examples is this visualization on “How Families Interact on Facebook” [4] by the Facebook Data Science Team. This is a very simple yet powerful way to reveal the results of a very complex text analysis.

Figure 4: Classic example: radar/spider charts vs. bar/line charts.

Another classic example is the use of bar/line charts vs. radar/spider charts (Figure 4). I am a big proponent of easy-to-read charts, that is, charts that convey the maximum amount of information in the least amount of time.

Here are a couple of other ways I like for simple visualization of unstructured data:

Word clouds. Word clouds help visualize the occurrence of words within a corpus, with the size of the text representing the number of times the word or phrase occurs in the larger text collection. Word clouds are very effective for visualizing the results of tf-idf (term frequency-inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus). They can also be very effective in uncovering the topic areas of discussion in social media content or feedback surveys/comments. If you use Python, you may want to bookmark an awesome word cloud library [5] by Andreas Mueller. For an interesting application, see “Inauguration Word Clouds with tf-idf” [6].
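For readers who want to see what the tf-idf weights behind such a word cloud look like, here is a minimal sketch using only the Python standard library. It assumes whitespace tokenization and the plain log(N/df) form of idf (libraries often use smoothed variants); the function name `tfidf` and the sample comments are made up for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency in this document times inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical feedback comments (illustrative data).
comments = [
    "the app keeps crashing after the update",
    "the billing page is confusing",
    "love the new update",
]
weights = tfidf(comments)
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

A word such as "the" that appears in every comment gets a weight of zero, while a word unique to one comment rises to the top. A dictionary of positive weights like this is the kind of input a word cloud can be sized by; the word cloud library mentioned above offers `generate_from_frequencies` for that purpose, though its documentation should be checked for details.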

Chord diagrams. A chord diagram is a powerful tool for representing the contextual meaning of words (especially when analyzing text with latent semantic analysis [7]). If there are fewer than 10 topics, Seaborn heat maps [8] (or even a stacked bar chart) may be a better alternative; with a larger set of topics, however, chord diagrams give a better visual representation.

In the Python examples of a chord diagram [9] and a filled chord diagram [10], lines connecting nodes on a circle indicate the relationships between those nodes/words. The color of a line can denote a positive or negative relation, and the thickness of the line quantifies the extent of the relationship.
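Both a chord diagram and a heat map start from the same thing: a square matrix of how strongly each pair of words or topics is related. The sketch below builds one such matrix by counting how many documents mention both words of a pair. Plain co-occurrence is an assumption chosen for simplicity (it is not latent semantic analysis), and the function name, topic list and sample tickets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_matrix(documents, vocabulary):
    """Count, for each pair of vocabulary words, the documents containing both."""
    index = {word: i for i, word in enumerate(vocabulary)}
    pair_counts = Counter()
    for doc in documents:
        # Vocabulary words present in this document, in a stable order.
        present = sorted(set(doc.lower().split()) & index.keys())
        pair_counts.update(combinations(present, 2))
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for (a, b), count in pair_counts.items():
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count  # relationships are undirected
    return matrix


# Hypothetical support tickets and topics of interest (illustrative data).
topics = ["billing", "refund", "login", "password"]
tickets = [
    "refund for billing error",
    "billing refund still pending",
    "login fails after password reset",
    "password reset email missing",
]
matrix = cooccurrence_matrix(tickets, topics)
for topic, row in zip(topics, matrix):
    print(f"{topic:10s} {row}")
```

Each nonzero cell would become one chord, with the count setting its thickness. With only four topics, as here, the same matrix reads just as well as a heat map.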

Don’t overload your visualization with data, and present clear contrasts wherever applicable.

Figure 5: The four quadrants of good analytics insights.

3. Know your audience. Knowing your audience and tailoring visualizations for optimal consumption will go a long way toward making a successful presentation. It’s always good to understand how the data translates into strategic direction for the product. Don’t work in a silo – involve your stakeholders and get their feedback as you do your analysis and create the corresponding visualizations. Iterate! If you cannot get feedback from everyone, make sure that you think about who will actually be looking at these visualizations, what’s important to them and, most importantly, how much time they will have to look at your graphs.

These are the most important things you should do to understand your audience. Technical jargon won’t work if your audience doesn’t know what it means. No matter how beautiful your graphs are, if you don’t deliver meaningful and actionable insights, your work will not qualify as impactful.

For example, if you present your data in the form of a network graph [10] (network graphs are designed to measure and quantify the relationships between different vertices or nodes on a graph), take some time to explain how the graph works. In a social data context, network graphs can be a powerful tool in telling a story on the health of your product’s ecosystem.
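When explaining a network graph to an audience, it often helps to show one or two of the numbers behind the picture. Here is a minimal standard-library sketch that computes each node's degree and the overall density of a small interaction network; the user names and edges are invented, and a dedicated graph library would normally be used for anything larger.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def density(adjacency):
    """Share of all possible node pairs that are actually connected."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    n_edges = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    return n_edges / (n * (n - 1) / 2)


# Hypothetical "who interacted with whom" pairs (illustrative data).
interactions = [("ana", "ben"), ("ana", "cy"), ("ben", "cy"), ("cy", "dee")]
graph = build_graph(interactions)
degree = {user: len(neighbors) for user, neighbors in graph.items()}

print("degree:", degree)
print("most connected:", max(degree, key=degree.get))
print("density:", round(density(graph), 2))
```

Degree tells the audience who the hubs are, and density gives a single number for how tightly knit the ecosystem is, which is usually easier to explain than the full graph.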

Data visualization is an art that data scientists need to be good at in order to tell a compelling story from their analysis. Figure 5 best depicts the four quadrants of good analytics insights.

Navneet Kesher is head of Platform Data Sciences at Facebook. Prior to joining Facebook, he served as a manager of analytics for Amazon. Based in the Greater Seattle Area, he holds an MBA from the University of Southern California.

Note: A version of this article appeared in Analytics magazine.