United Nations World Statistics Day 2015, Data Visualization Challenge:

Use your creativity and imagination to build an infographic or dynamic visualization featuring the latest data from the 2015 Millennium Development Goals report.

**My entry:**

Indicator Correlation Explorer.

**Abstract**

This data visualization allows you to explore the relationships between MDG indicators via two interactive graphics.

The first graphic is a correlation matrix. The bottom triangle of the matrix displays the raw correlation between any two indicators. The top triangle of the matrix displays *partial correlations*, which control for a variable selected by the user, giving more insight into the reason for each correlation.

To examine a correlation more closely, the user can select any pairwise correlation in the matrix and view the underlying data points as a scatterplot.

**Problem and Motivation**

The purpose of this analysis is to identify correlations between seemingly unrelated indicators, as well as provide the tools for a closer examination of the reasons for the correlations.

In general, if two indicators are correlated, it means one of two things.

1. There is a causal relationship, in which case the actions that are meant to influence one indicator have the unintended consequence of affecting another indicator. It is important to be aware of such relationships so that the unintended results can be factored into decision-making.

2. The correlation is due to a *lurking* variable, and is not the result of any action related to the MDG. In this case, knowing about the correlation is unlikely to affect decision-making, but it is still important to be aware of when interpreting the outcomes.

**Underlying Data Manipulation**

First, I reduced the data to what I thought would make for the most meaningful sample:

● Only MDG countries

● Three representative indicators for each of the Millennium Development Goals

● Only data from 2000 or later

My intention was to examine the changes over time for each indicator. Calculating the changes would have been straightforward if not for two issues relating to the completeness of the data.

Not all of the indicators examined were available for every country. So for each indicator, the analysis is limited to the countries for which data is available.

Also, when the data was available for a country, there was little consistency in the years for which it was reported. Ideally, I would have chosen fixed start and end dates for all cases, and just calculated the change. Instead, for each country and each indicator I chose the start date to be the earliest date (post-2000) for which data was available. Likewise, I chose the end date to be the most recent date for which data was available.

In each case, I calculated the change in value between the start and end dates. Then, to normalize the data, I divided by the number of years, giving the **average annual change** for each indicator for each country.
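The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Excel workbook used for the entry: the function name `average_annual_change`, the dictionary layout (year mapped to value), and the sample numbers are all hypothetical.

```python
from typing import Dict, Optional


def average_annual_change(series: Dict[int, float],
                          first_year: int = 2000) -> Optional[float]:
    """Average annual change between the earliest and latest observations
    reported in `first_year` or later.

    `series` maps a year to the indicator value for one country
    (a hypothetical layout chosen for this sketch).
    """
    years = sorted(year for year in series if year >= first_year)
    if len(years) < 2:
        # One observation (or none) cannot define a change.
        return None
    start, end = years[0], years[-1]
    # Total change divided by the number of years it spans.
    return (series[end] - series[start]) / (end - start)


# Hypothetical observations for one country and one indicator.
observations = {1998: 40.0, 2002: 50.0, 2007: 56.0, 2012: 60.0}

# The 1998 value is ignored; the change is (60 - 50) / (2012 - 2002).
print(average_annual_change(observations))
```

Returning `None` when fewer than two usable years exist is a design choice for this sketch; it keeps such countries out of later calculations instead of treating them as having zero change.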

Finally, for the sake of consistency, I changed the sign for some indicators so that a positive change always corresponds to an improvement. For example, for the indicator “Population below $1.25 (PPP) per day, percentage,” a lower value implies improvement, so a decrease in that indicator is shown as a positive number in the output of my analysis.
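As a rough sketch of this sign convention, assuming the indicators where lower is better are kept in a hand-maintained set (the names `LOWER_IS_BETTER` and `as_improvement` are hypothetical, and the numeric values are made up):

```python
# Indicators for which a falling value means progress. Only the poverty
# indicator is named in the text; the full set is an assumption.
LOWER_IS_BETTER = {
    "Population below $1.25 (PPP) per day, percentage",
}


def as_improvement(indicator: str, change: float) -> float:
    """Flip the sign where needed so that positive always means improvement."""
    return -change if indicator in LOWER_IS_BETTER else change


# A hypothetical drop of 0.8 percentage points per year in poverty
# is reported as an improvement of +0.8.
print(as_improvement("Population below $1.25 (PPP) per day, percentage", -0.8))

# Indicators where higher is better pass through unchanged.
print(as_improvement(
    "Proportion of the population using improved drinking water sources", 1.5))
```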

**Analysis**

The input data consists of the average annual changes of 24 indicators for 200 countries, calculated as described above.

The visualization for analyzing this data consists of two graphics.

The first graphic is a correlation matrix. In the bottom triangle of the matrix are the raw pairwise correlations between the average annual changes of 24 indicators calculated across the 200 countries. Each correlation is represented as a circle, with larger circles corresponding to larger absolute correlations. White = positive correlation. Black = negative correlation.
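For readers who want to reproduce one cell of the bottom triangle, here is a minimal sketch in plain Python. It assumes the Pearson correlation and assumes that each pair of indicators is compared only across countries that report both; the write-up does not state either choice explicitly. The function names, the linear radius scaling, and the sample values are hypothetical.

```python
import math
from typing import Dict, List, Optional


def pearson(xs: List[float], ys: List[float]) -> Optional[float]:
    """Pearson correlation of two equal-length lists (None if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlation(a: Dict[str, float],
                         b: Dict[str, float]) -> Optional[float]:
    """Correlate two indicators over the countries that report both."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 3:
        return None  # too few shared countries to be meaningful
    return pearson([a[c] for c in shared], [b[c] for c in shared])


def circle_style(r: float, max_radius: float = 10.0) -> Dict[str, object]:
    """Map a correlation to a circle: size from |r|, colour from the sign."""
    return {"radius": abs(r) * max_radius,
            "fill": "white" if r >= 0 else "black"}


# Hypothetical average annual changes, keyed by country code.
water = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
mortality = {"A": 2.0, "B": 4.0, "C": 6.0, "E": 1.0}

r = pairwise_correlation(water, mortality)  # uses countries A, B and C only
print(r, circle_style(r))
```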

What does it mean for two indicators to be correlated?

In simple terms, if two indicators are positively correlated, the countries making the most progress on one indicator tend to be the ones making the most progress on the other.

For example, this is the scatter plot for “Proportion of the population using improved drinking water sources” vs “Children under five mortality rate per 1,000 live births.”

It’s clear from the scatter plot that there is a positive correlation between these two indicators. In the matrix, this correlation is shown as a large, white circle (highlighted with a large grey circle behind it).

The upper triangle of the matrix displays *partial correlations*, pairwise correlations that control for a given variable. The user may select one of two options for the control variable: the “x-axis value” or the “y-axis value.” For each pairwise correlation, these values correspond to the values of the two indicators on the start date.

For example, take the correlation between the change in “Internet users” (on the x-axis) and the change in “Child mortality” (on the y-axis).

If the user selects “y-axis value” as a control variable, the partial correlation will be:

The correlation between:

**Average annual change in Internet users**

and

**Average annual change in Child mortality**

controlling for

**The Child mortality rate in the year 2000**

As shown in the screenshot, these variables have a negative raw correlation. However, when you control for the starting child mortality rate, the correlation goes away. Presumably, a country with a high child mortality rate puts its resources toward solving that problem rather than toward the internet.
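The standard first-order partial correlation can be computed from the three pairwise correlations as r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The sketch below applies that textbook formula; it is an assumption that the entry computes its partial correlations this way. The data are invented and deliberately constructed so that a negative raw correlation disappears once the control variable is taken into account. They are not MDG figures.

```python
import math
from typing import List, Optional


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs: List[float], ys: List[float],
                        zs: List[float]) -> Optional[float]:
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        return None  # the control variable fully determines x or y
    return (r_xy - r_xz * r_yz) / denominator


# Invented values for four hypothetical countries (not MDG data).
start_mortality = [-3.0, -1.0, 1.0, 3.0]   # control variable z (centred)
internet_change = [4.0, 0.0, -2.0, -2.0]   # x: falls as z rises
mortality_change = [-4.0, 2.0, -2.0, 4.0]  # y: rises as z rises

print(pearson(internet_change, mortality_change))   # negative raw correlation
print(partial_correlation(internet_change, mortality_change,
                          start_mortality))         # approximately zero
```

In this constructed example, both changes depend on the control variable and are otherwise unrelated, which is the pattern the Internet users and Child mortality example describes.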

If a user wishes to examine a correlation more closely, clicking on its circle will display a scatterplot of the underlying values for each country.

**Tools**

The primary tool for analyzing the raw data was Excel, making use of the built-in statistical functions for the calculations. I also did some Visual Basic scripting to automate the repetitive tasks.

To create the graphics, I used JavaScript, HTML, and CSS, along with the D3.js visualization library. Additionally, I used Photoshop and Inkscape to create some of the images.

**Results**

In addition to the examples above (drinking water vs child mortality, and child mortality vs internet use), here is one more example I found interesting.

There is a strong negative correlation between literacy rate and employment as a % of population, and the correlation remains even when you control for various other variables. Having spent some time looking into it, it appears that promoting education to improve literacy rates keeps the people being educated out of the workforce.

Whether or not this phenomenon affects any decision-making, it is important to be aware of it when assessing a country’s progress. If a country is making great improvements in terms of literacy and its employment improvement is merely average, that is likely a positive reflection on the effectiveness of both initiatives.

The examples described here demonstrate the insight that can be gained through data.

This analysis is limited to only 24 indicators and only two control variables. A more thorough analysis that includes a broader range of data, more advanced statistical methods, and more time for a deeper investigation would likely yield an order of magnitude more information about such relationships between MDG indicators.