
A Gentleman's Guide To Good (Data) Sets: It's All About Proportion

By Abigail Rosenson

Anyone who has tried to make a decision from a spreadsheet full of numbers knows that visualizing a data set is more than aesthetically pleasing: it makes the information easier to understand and process, which leads to better-informed decisions.


Once you are on board with visualizing your data, the next step is to choose the best method. As demonstrated by the Periodic Table of Visualization Methods, you have many options! The key to making the right selection is understanding how each type of analysis works and matching it with your data and decision-making objectives.

In this post, we will discuss the concept of proportion, why it is an important consideration, and how to incorporate it into your data analysis.

Venn diagrams are very powerful data analysis tools for representing sets and their relationships. There is truly no finer way to describe two or three sets, demonstrate overlap, and show which items are or aren't at the intersections. Unfortunately, Venn diagrams rely solely on circles, which makes them poor demonstrators of proportionality when dealing with a large number of sets, complex sets, and/or empty sets.

Euler diagrams are like Venn diagrams, but they aren’t limited to the classic three-ring structure or to circles (some use squares, triangles, or ellipses). For this reason, they can represent proportion much more accurately, and can show both subsets and disjoint sets.
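To make the subset, disjoint, and overlap cases concrete, here is a minimal Python sketch of the numbers a proportional diagram has to encode for two sets. The function name `describe_sets` and the car identifiers are hypothetical, chosen only for illustration; this is not taken from any particular diagramming tool.

```python
def describe_sets(a: set, b: set) -> dict:
    """Summarize how two sets relate and how large each region is."""
    only_a, only_b, both = a - b, b - a, a & b
    total = len(a | b)

    # Classify the relationship an Euler diagram would need to draw.
    # Note: an empty intersection is reported as "disjoint", even if one set is empty.
    if not both:
        relation = "disjoint"
    elif a == b:
        relation = "equal"
    elif a <= b or b <= a:
        relation = "subset"
    else:
        relation = "overlap"

    return {
        "relation": relation,
        "only_a": len(only_a),
        "only_b": len(only_b),
        "both": len(both),
        # Share of all items that fall in the intersection.
        "share_both": len(both) / total if total else 0.0,
    }


# Hypothetical example data: which cars are sedans, which have leather seats.
sedans = {"car1", "car2", "car3", "car4"}
leather = {"car3", "car4", "car5"}
print(describe_sets(sedans, leather))
```

A proportional diagram would size each region according to counts like these, which is exactly what fixed, equal-sized circles cannot do.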

This illuminating Euler diagram, found on Wikipedia, was drawn by Daniel Glasser. You can now count yourself as a member of an exclusive and tiny minority represented in the third circle:

napoleans march

Next up is the Sankey diagram, commonly used to show flows of, for example, money or energy. The first was drawn by Captain Matthew Henry Phineas Riall Sankey to present energy flows in a steam engine. The benefit over a diagram that shows flows with plain arrows is the ability to demonstrate proportion, which is represented by the width of the lines. The most famous Sankey diagram shows the horrible attrition of Napoleon’s army as it marched into Russia:


It is featured prominently in The Visual Display of Quantitative Information, the bible for those interested in visualization. The author, Edward Tufte, says, “It may well be the best statistical graphic ever drawn.” With no sacrifice in clarity, it simultaneously displays “the size of the army, its location on a two-dimensional surface ... direction ... and temperature.”

The same Sankey approach was also used by the Department of Energy to show energy use in the United States:

LLNL_US_Energy_Flow_2009.png

Finally, something a little off the beaten path. The research lab DensityDesign has created a visualization approach that combines all of the above aspects and more. It’s called Raw.

If we want to stick with this kind of proportional visualization, the first step is to create a dataset with rows of categorical data. For the example here, I searched for data on cars with a Google trick - narrowing my results to spreadsheets by adding "filetype:xls" to my search. I found a usable dataset and uploaded it to the Raw website.

The interactive result lets you view the relationships between several sets simultaneously. For example, this screenshot shows the proportion of each make, type, and interior material (leather vs. cloth).

proportionscreenshot.png

Here’s another example, but with flight data comparing origin cities, airlines, and routes by the number of cancellations in October 2015:

flightdatascreenshot.png

These two examples, while beautiful and useful, are not true Sankey diagrams because the lines don't represent a continuous flow. The connections are only between groups, and the width of each connection corresponds to the number of rows that share both values. (As an experiment, users could trick Raw into charting percentage data by duplicating rows in proportion: to represent 50%, copy the row 50 times, and repeat each of the other rows according to its own percentage. To keep the math simple, the finished spreadsheet should have 100 rows.)
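The row-duplication trick can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`make`, `interior`, `percent`), the sample figures, and the helper functions are all hypothetical, the percentages are assumed to be whole numbers that sum to 100, and the counting step mimics the "width equals number of rows" idea rather than reproducing how Raw works internally.

```python
import csv
import io
from collections import Counter


def expand_percentages(rows, pct_key="percent"):
    """Repeat each row once per percentage point so row counts mirror the percentages."""
    expanded = []
    for row in rows:
        copies = int(row[pct_key])  # assumes whole-number percentages
        categories = {k: v for k, v in row.items() if k != pct_key}
        expanded.extend(dict(categories) for _ in range(copies))
    return expanded


def link_widths(rows, source, target):
    """Count the rows shared by each (source, target) pair of category values."""
    return Counter((row[source], row[target]) for row in rows)


def to_csv(rows):
    """Serialize the rows as CSV text, ready to paste into a row-counting tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical summary table: three combinations whose percentages sum to 100.
summary = [
    {"make": "Ford", "interior": "leather", "percent": 50},
    {"make": "Ford", "interior": "cloth", "percent": 30},
    {"make": "Honda", "interior": "cloth", "percent": 20},
]

expanded = expand_percentages(summary)
widths = link_widths(expanded, "make", "interior")
print(len(expanded), "rows;", dict(widths))
```

Because the expanded table has exactly 100 rows, each link width can be read directly as a percentage.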

Raw, like Venn diagrams, Euler diagrams, and Sankey diagrams, can help you visualize and understand data relationships in terms of proportion. Can a scatterplot do that?
