By Nicholas Quisler
This project investigates possible correlations between a country's economic output, measured as gross domestic product (GDP, in USD), and the life expectancy at birth of its citizens.
Here are a few questions this project seeks to answer:
Data sources
GDP Source: World Bank national accounts data, and OECD National Accounts data files.
Life expectancy Data Source: World Health Organization
We start by importing the preliminary Python modules:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
To look for connections between GDP and life expectancy, load the dataset into a DataFrame so that it can be visualized.
Here all_data.csv is read into a DataFrame called df, followed by a quick inspection of the DataFrame using .head() to check its contents.
df = pd.read_csv('all_data.csv')
df.head()
| | Country | Year | Life expectancy at birth (years) | GDP |
---|---|---|---|---|
0 | Chile | 2000 | 77.3 | 7.786093e+10 |
1 | Chile | 2001 | 77.3 | 7.097992e+10 |
2 | Chile | 2002 | 77.8 | 6.973681e+10 |
3 | Chile | 2003 | 77.9 | 7.564346e+10 |
4 | Chile | 2004 | 78.0 | 9.921039e+10 |
Here we inspect the data to make sure the data was imported as the correct data type, without any missing or null values.
df.dtypes
Country                              object
Year                                  int64
Life expectancy at birth (years)    float64
GDP                                 float64
dtype: object
df.count()
Country                             96
Year                                96
Life expectancy at birth (years)    96
GDP                                 96
dtype: int64
df.describe()
| | Year | Life expectancy at birth (years) | GDP |
---|---|---|---|
count | 96.000000 | 96.000000 | 9.600000e+01 |
mean | 2007.500000 | 72.789583 | 3.880499e+12 |
std | 4.633971 | 10.672882 | 5.197561e+12 |
min | 2000.000000 | 44.300000 | 4.415703e+09 |
25% | 2003.750000 | 74.475000 | 1.733018e+11 |
50% | 2007.500000 | 76.750000 | 1.280220e+12 |
75% | 2011.250000 | 78.900000 | 4.067510e+12 |
max | 2015.000000 | 81.000000 | 1.810000e+13 |
Every column has the expected data type and the same non-null count of 96 (count() ignores nulls), which matches the 6 countries × 16 years in the dataset, and the summary statistics look plausible. We can therefore conclude that there are no missing or null values.
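The conclusion above can also be checked directly rather than inferred from the counts. The following is a minimal sketch, assuming a small stand-in DataFrame (three of the Chile rows shown earlier, with one GDP value deliberately blanked out for illustration); the names `sample`, `missing_per_column`, and `has_no_missing` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Stand-in data: three Chile rows from the head() output above.
# The None is inserted on purpose to show what a missing value looks like.
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Chile"],
    "Year": [2000, 2001, 2002],
    "GDP": [7.786093e10, None, 6.973681e10],
})

# Number of missing (NaN/None) cells in each column.
missing_per_column = sample.isna().sum()

# count() only counts non-null cells, so it must equal the row count
# in every column for the data to be complete.
has_no_missing = bool((sample.count() == len(sample)).all())

print(missing_per_column)
print("No missing values:", has_no_missing)
```

Applied to the project's df, the same two checks would be expected to report zero missing cells per column if the conclusion above holds.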
We want to know about the countries and years represented in the dataset and their respective sample sizes.
df1 = df.groupby(["Country"])["Country"].count()
df1
Country
Chile                       16
China                       16
Germany                     16
Mexico                      16
United States of America    16
Zimbabwe                    16
Name: Country, dtype: int64
df2 = df.groupby(["Year"])["Year"].count()
df2
Year
2000    6
2001    6
2002    6
2003    6
2004    6
2005    6
2006    6
2007    6
2008    6
2009    6
2010    6
2011    6
2012    6
2013    6
2014    6
2015    6
Name: Year, dtype: int64
There are six countries represented: Chile, China, Germany, Mexico, the United States of America, and Zimbabwe. A data point for each country exists for each year from 2000-2015, for a total of 96 data points.
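The two separate counts above suggest, but do not strictly prove, that every country appears exactly once in every year. A minimal sketch of a direct check follows, assuming a miniature panel with two placeholder countries ("A" and "B") and three years; the data and the names `panel`, `pair_counts`, and `is_balanced` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature panel: two placeholder countries, three years each.
panel = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
})

# Number of rows for each (Country, Year) pair that occurs in the data.
pair_counts = panel.groupby(["Country", "Year"]).size()

n_countries = panel["Country"].nunique()
n_years = panel["Year"].nunique()

# Balanced means: no duplicated pair, and no pair missing.
is_balanced = bool((pair_counts == 1).all()) and len(panel) == n_countries * n_years

print("Balanced panel:", is_balanced)
```

For the project's data, the equivalent condition would be 6 countries × 16 years = 96 rows with no repeated country-year pair.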
The Life expectancy at birth (years) column name is long and inconsistent with the other, single-word column names. It is easier to proceed with its abbreviation, LEAB. The rename function is used to make future coding easier.
df = df.rename({"Life expectancy at birth (years)": "LEAB"}, axis = 'columns')
df.head()
| | Country | Year | LEAB | GDP |
---|---|---|---|---|
0 | Chile | 2000 | 77.3 | 7.786093e+10 |
1 | Chile | 2001 | 77.3 | 7.097992e+10 |
2 | Chile | 2002 | 77.8 | 6.973681e+10 |
3 | Chile | 2003 | 77.9 | 7.564346e+10 |
4 | Chile | 2004 | 78.0 | 9.921039e+10 |
Below, the distribution of LEAB is shown. The data is strongly left-skewed: most of the values sit on the right-hand side, with a long tail toward lower life expectancies. Note that this is not a power law distribution, which has its long tail on the right rather than the left. A further look might also identify different modes or smaller groupings of distributions within the range.
ax1 = sns.histplot(x='LEAB', data=df, stat="percent", binwidth=1)
ax1.set_xlabel('Life Expectancy at Birth (years)');
Next, the distribution of GDP is examined. The distribution is strongly right-skewed: most of the values are on the left-hand side. This is almost the opposite of what was observed in the LEAB column.
ax2 = sns.histplot(x='GDP', data=df, stat="percent", binwidth=0.05e13)
ax2.set_xlabel('GDP (U.S. Dollars)');
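The direction of skew described for the two histograms can be expressed as a number. The sketch below is only illustrative: it assumes the six per-country means from the summary table later in this write-up as stand-in data (not the full 96 rows), and the variable names are hypothetical. The sign of pandas' sample skewness indicates which tail is longer.

```python
import pandas as pd

# Per-country means copied from the summary table in this write-up.
# This is six values per column, NOT the full 96-row dataset.
means = pd.DataFrame({
    "LEAB": [78.94375, 74.26250, 79.65625, 75.71875, 78.06250, 50.09375],
    "GDP": [1.697888e11, 4.957714e12, 3.094776e12,
            9.766506e11, 1.407500e13, 9.062580e09],
})

# Sample skewness: negative means a longer left tail,
# positive means a longer right tail.
leab_skew = means["LEAB"].skew()
gdp_skew = means["GDP"].skew()

print("LEAB skewness:", leab_skew)
print("GDP skewness:", gdp_skew)
```

On the project's data, calling .skew() on df["LEAB"] and df["GDP"] would give the figures for the full distributions.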
The previous plots did not break up the data by country, so the next task is to find the average LEAB and GDP by country.
dfMeans = df.drop("Year", axis=1).groupby("Country").mean().reset_index()
dfMeans
| | Country | LEAB | GDP |
---|---|---|---|
0 | Chile | 78.94375 | 1.697888e+11 |
1 | China | 74.26250 | 4.957714e+12 |
2 | Germany | 79.65625 | 3.094776e+12 |
3 | Mexico | 75.71875 | 9.766506e+11 |
4 | United States of America | 78.06250 | 1.407500e+13 |
5 | Zimbabwe | 50.09375 | 9.062580e+09 |
Now that the data is broken down by Country and the average values for LEAB and GDP are computed, bar plots showing the mean value of each variable are created below.
The first plot shows life expectancy: every country except Zimbabwe has a mean in the mid-to-high 70s. This probably explains the skew in the distribution seen earlier!
ax3 = sns.barplot(y="Country", x="LEAB", data=dfMeans)
ax3.set_xlabel('Mean Life Expectancy at Birth (years)');
For the average GDP by Country, the US has a much higher value than the rest of the countries. In this bar plot, Zimbabwe is not even visible and Chile is just barely seen. China, Germany and Mexico are relatively close to one another by comparison.
ax4 = sns.barplot(y="Country", x="GDP", data=dfMeans)
ax4.set_xlabel('Mean GDP (U.S. Dollars)');
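To see why some bars nearly vanish, it can help to rescale the means. The following is a small sketch, assuming the mean GDP values from the table above; the names `gdp_trillions` and `ratio_to_largest` are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Mean GDP per country, copied from the table above (values in U.S. dollars).
means = pd.DataFrame({
    "Country": ["Chile", "China", "Germany", "Mexico",
                "United States of America", "Zimbabwe"],
    "GDP": [1.697888e11, 4.957714e12, 3.094776e12,
            9.766506e11, 1.407500e13, 9.062580e09],
})

gdp = means.set_index("Country")["GDP"]

# Express each mean in trillions of dollars for easier reading.
gdp_trillions = gdp / 1e12

# Each country's mean as a fraction of the largest mean,
# which is roughly the relative bar length in the plot.
ratio_to_largest = gdp / gdp.max()

print(gdp_trillions.round(3))
print(ratio_to_largest.round(4))
```

Zimbabwe's mean is well under a tenth of a percent of the largest bar, which is why it does not register on a linear axis. A column scaled this way could also be plotted directly if an axis labeled in trillions is preferred.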
Another way to show distributions is the strip plot. Strip plots are useful because they draw every individual data point, so the density of dots around a value is visible along with the overall spread. The black squares mark each country's mean.
In the case of the GDP plot, Chile and Zimbabwe have a dense cluster of dots that illustrates how many data points fall around their values. This detail would be lost in a box plot, unless the reader is very adept at reading data visualizations.
ax5 = sns.stripplot(x="GDP", y="Country", data=df, hue='Country', alpha=0.5)
sns.stripplot(x='GDP', y='Country', data=dfMeans, hue='Country', palette='dark:black', marker='s')
ax5.set_xlabel('GDP (U.S. Dollars)')
plt.legend([],[], frameon=False);
The LEAB plot shows that most of the countries, except Zimbabwe, have a fairly consistent life expectancy.
ax6 = sns.stripplot(x="LEAB", y="Country", data=df, hue='Country', alpha=0.5)
sns.stripplot(x='LEAB', y='Country', data=dfMeans, hue='Country', palette='dark:black', marker='s')
ax6.set_xlabel('Life Expectancy at Birth (years)')
plt.legend([],[], frameon=False);
Next, LEAB and GDP are explored over the years through line charts. Below, the countries are separated by color, and one can see that every country increased its life expectancy between 2000 and 2015, but Zimbabwe saw the greatest increase, after a bit of a dip around 2004.
ax7 = sns.lineplot(x='Year', y="LEAB", data=df, hue='Country')
ax7.set_ylabel('Life Expectancy at Birth (years)')
plt.legend();
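Claims such as "greatest increase" and "a dip" can be backed by a small table of endpoints per country. The sketch below uses made-up numbers for two placeholder countries ("A" and "B"), so none of the values come from the dataset; the names `toy`, `endpoints`, and `largest_gain` are likewise hypothetical.

```python
import pandas as pd

# Made-up values for two placeholder countries; NOT taken from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "LEAB": [70.0, 70.5, 71.0, 45.0, 44.0, 50.0],
})

# Sort so that "first" and "last" really are the earliest and latest years.
ordered = toy.sort_values(["Country", "Year"])

# First, last, and lowest value per country.
endpoints = ordered.groupby("Country")["LEAB"].agg(["first", "last", "min"])
endpoints["change"] = endpoints["last"] - endpoints["first"]

# A minimum below the first value indicates a dip at some point.
endpoints["dipped"] = endpoints["min"] < endpoints["first"]

largest_gain = endpoints["change"].idxmax()
print(endpoints)
print("Largest gain:", largest_gain)
```

Running the same aggregation on df would give the actual first-to-last change in LEAB for each of the six countries.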
The faceted line charts by country offer a closer look. In the individual plots, each country has its own y-axis, which makes it easier to compare the shape of LEAB over the years without forcing a shared scale. This view makes it easier to see that Chile and Mexico had dips in their life expectancy around the same time, which could be looked into further.
ax8 = sns.relplot(
data=df, x="Year", y="LEAB", col="Country", hue='Country',
kind="line", col_wrap=3, facet_kws={'sharey': False}
)
ax8.set_ylabels('Life Expectancy at Birth (years)');
The chart below looks at GDP over the years. Note that the y-axis is scaled by 1e13. China's GDP grew the most in relative terms, roughly tenfold, from on the order of one trillion dollars in 2000 to on the order of ten trillion dollars by 2015. The rest of the countries did not see increases of this magnitude.
ax9 = sns.lineplot(x='Year', y="GDP", data=df, hue='Country')
ax9.set_ylabel('GDP (U.S. Dollars)')
plt.legend();
Much like the breakdown of LEAB by country before, the plot below breaks out GDP by country. It is apparent that all of the countries have seen increases. In the chart above, the other countries' GDP growth looked modest compared to China and the US, but every country did experience growth compared to the year 2000. This type of plotting proves useful since many of these nuances were lost when the y-axis was shared among the countries. Also, the seemingly linear changes were in reality not as smooth for some of the countries.
ax10 = sns.relplot(
data=df, x="Year", y="GDP", col="Country", hue='Country',
kind="line", col_wrap=3, facet_kws={'sharey': False}
)
ax10.set_ylabels('GDP (U.S. Dollars)');
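The observation that growth was not smooth can be checked with year-over-year percentage changes. Here is a minimal sketch using only the five Chile rows shown in the head() output earlier; the column name `GDP_pct_change` and the variable `decline_years` are hypothetical.

```python
import pandas as pd

# The five Chile rows shown in the head() output earlier in this write-up.
chile = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004],
    "GDP": [7.786093e10, 7.097992e10, 6.973681e10, 7.564346e10, 9.921039e10],
})

# Year-over-year change in percent; the first year has no predecessor (NaN).
chile["GDP_pct_change"] = chile["GDP"].pct_change() * 100

# Years in which GDP fell relative to the previous year.
decline_years = chile.loc[chile["GDP_pct_change"] < 0, "Year"].tolist()

print(chile)
print("Years with a decline:", decline_years)
```

Even in this short window Chile's GDP falls in two consecutive years before rising again, which a shared y-axis would flatten out. On the full DataFrame, the same calculation would need to be done within each country, for example after grouping by Country.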
Is there a correlation between GDP and life expectancy of a country?
The next two charts explore the relationship between GDP and LEAB. The chart below echoes the previous ones: Zimbabwe's GDP stays flat while its life expectancy goes up. The other countries exhibit a rise in life expectancy as GDP goes up. The US and China seem to have very similar slopes in their relationship between GDP and life expectancy.
ax11 = sns.scatterplot(x='LEAB', y="GDP", data=df, hue='Country')
ax11.set_xlabel('Life Expectancy at Birth (years)')
ax11.set_ylabel('GDP (U.S. Dollars)')
plt.legend();
As in the previous plots, each country is broken out into its own scatter plot facet. Looking at the individual countries, most of them, such as the US, Mexico and Zimbabwe, show roughly linear relationships between GDP and life expectancy. China, on the other hand, has a slightly exponential curve, and Chile's looks a bit logarithmic. In general, though, GDP and life expectancy increase together, exhibiting a positive correlation.
ax12 = sns.relplot(
data=df, x="LEAB", y="GDP", col="Country", hue='Country',
kind="scatter", col_wrap=3, facet_kws={'sharex': False, 'sharey': False}
)
ax12.set_xlabels('Life Expectancy at Birth (years)')
ax12.set_ylabels('GDP (U.S. Dollars)');
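The visual impression of a positive relationship can be complemented with a correlation coefficient. The sketch below is illustrative only: it assumes just the five Chile rows from the head() output earlier, which is far too few to draw conclusions from, and the variable name `r` is hypothetical. Pearson correlation measures linear association, so it would understate a curved relationship such as the one described for China.

```python
import pandas as pd

# The five Chile rows shown in the head() output earlier in this write-up.
chile = pd.DataFrame({
    "LEAB": [77.3, 77.3, 77.8, 77.9, 78.0],
    "GDP": [7.786093e10, 7.097992e10, 6.973681e10, 7.564346e10, 9.921039e10],
})

# Pearson correlation coefficient (the default method of Series.corr).
# Values near +1 indicate a strong positive linear relationship.
r = chile["LEAB"].corr(chile["GDP"])

print("Pearson r for the Chile sample:", r)
```

On the full DataFrame, the same call could be applied within each country after grouping by Country, which keeps Zimbabwe's very different level of life expectancy from dominating a single pooled coefficient.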
This project was able to make quite a few data visualizations with the data even though there were only 96 rows and 4 columns.
The project was also able to answer some of the questions posed in the beginning: