Installing and Attaching Packages
Reading and Cleaning Data
Long vs Wide Data
Joining Tables
Defining Functions
Row Filtering and Static Line Graphs
Visualizing Distributions and Correlations
Creating Interactive Visualizations
Creating Animatied Visualizations

Installing and Attaching Packages

We will be using librarian for package management as it greatly simplifies the process of installing and attaching packages. It needs to be installed using install.packages() unless already installed.

install.packages("librarian")

The librarian::shelf() function checks whether all listed packages are installed and installs them if needed. Then it loads all listed packages into the environment. You can think of it as a combination of install.packages() and library() except it will not re-install a package if it is already installed. Here are the packages we will be using:

readr is a tidyverse package that allows for fast reading and writing of rectangular text data
janitor is a package commonly used to clean messy data and quickly format column names
dplyr is a tidyverse package allowing for easy data manipulation and variable grouping
magrittr is a tidyverse package that offers a set of operators useful for constructing pipelines
tidyr is a tidyverse package intended for data restructuring with the goal of achieving tidy data
tidyselect is a tidyverse package that allows for the easy selection of columns based on criteria
ggplot2 is a tidyverse package and arguably the most popular R data visualization package
plotly allows for the easy generation of various interactive data visualizations including maps
gganimate and gifski can be used to generate and export various animated visualizations

librarian::shelf(readr, janitor, dplyr, magrittr, tidyr, tidyselect, ggplot2,
                 plotly, gganimate, gifski)

Reading and Cleaning Data

We will be using the following data files from the data directory to investigate the relationship between health and wealth:

gdp.csv – World Bank gross domestic product (GDP) estimates (in USD) from 1960 until 2022
life-expectancy.csv – World Bank life expectancy at birth estimates from 1960 until 2021
m49.csv – United Nations M49 Standard Country or Area Codes for Statistical Use
population.csv – World Bank country and region population estimates from 1960 until 2022

All the data are in IEFT RFC 4180 CSV (comma-separated values) format and the first four rows of the World Bank data files contain metadata with the actual data table starting on row five.

Let us start with the population data. CSV data files can be easily read in using the readr::read_csv() function. The function reads the contents of the file into a tibble object (which is basically an advanced data frame) and supports various additional arguments. For example, we can utilize the skip argument to skip the first four rows of the CSV file as the data table does not start until row five.

population <- readr::read_csv(file = "data/population.csv",
                              skip = 4,
                              show_col_types = FALSE)

Upon file read, the function outputs a summary outlining the column names and types of the data by default. This can be silenced by specifying show_col_types=FALSE when calling the function.

Now the World Bank population data is stored in a tibble named population. Calling the name of the variable will display the data in a neat interactive table viewer.

population

We see that the tibble appears to have the following columns:

Country Name – English name of the country
Country Code – ISO 3166-1 alpha-3 country code
Indicator Name – name of the indicator represented by the data
Indicator Code – World Bank code for the indicator
1960 … 2022 – population estimates by year

We also see that the table has 266 rows and 67 columns. The number of rows and columns of a tibble or other data-frame-like object can also be programmatically extracted by calling the dim() function.

dim(population)

## [1] 266  67

The list of column names can be programmatically extracted via the names() function.

names(population)

##  [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
##  [5] "1960"           "1961"           "1962"           "1963"          
##  [9] "1964"           "1965"           "1966"           "1967"          
## [13] "1968"           "1969"           "1970"           "1971"          
## [17] "1972"           "1973"           "1974"           "1975"          
## [21] "1976"           "1977"           "1978"           "1979"          
## [25] "1980"           "1981"           "1982"           "1983"          
## [29] "1984"           "1985"           "1986"           "1987"          
## [33] "1988"           "1989"           "1990"           "1991"          
## [37] "1992"           "1993"           "1994"           "1995"          
## [41] "1996"           "1997"           "1998"           "1999"          
## [45] "2000"           "2001"           "2002"           "2003"          
## [49] "2004"           "2005"           "2006"           "2007"          
## [53] "2008"           "2009"           "2010"           "2011"          
## [57] "2012"           "2013"           "2014"           "2015"          
## [61] "2016"           "2017"           "2018"           "2019"          
## [65] "2020"           "2021"           "2022"

Note how the first four column names contain a space and the rest of the column names are all numbers. This is bad practice as various functions do not work well with column names that either contain a space or start with a number. It is common to replace all spaces with either periods or underscores and add a letter prefix to any column names consisting solely of numbers.

This can be easily achieved by using the janitor::clean_names() function, which takes the original table as input and outputs a new table where the column names have been properly formatted.

population <- janitor::clean_names(population)
names(population)

##  [1] "country_name"   "country_code"   "indicator_name" "indicator_code"
##  [5] "x1960"          "x1961"          "x1962"          "x1963"         
##  [9] "x1964"          "x1965"          "x1966"          "x1967"         
## [13] "x1968"          "x1969"          "x1970"          "x1971"         
## [17] "x1972"          "x1973"          "x1974"          "x1975"         
## [21] "x1976"          "x1977"          "x1978"          "x1979"         
## [25] "x1980"          "x1981"          "x1982"          "x1983"         
## [29] "x1984"          "x1985"          "x1986"          "x1987"         
## [33] "x1988"          "x1989"          "x1990"          "x1991"         
## [37] "x1992"          "x1993"          "x1994"          "x1995"         
## [41] "x1996"          "x1997"          "x1998"          "x1999"         
## [45] "x2000"          "x2001"          "x2002"          "x2003"         
## [49] "x2004"          "x2005"          "x2006"          "x2007"         
## [53] "x2008"          "x2009"          "x2010"          "x2011"         
## [57] "x2012"          "x2013"          "x2014"          "x2015"         
## [61] "x2016"          "x2017"          "x2018"          "x2019"         
## [65] "x2020"          "x2021"          "x2022"

We know that the population table stores population values, so the columns indicator_name and indicator_code are redundant. These can be dropped using the dplyr::select() function. The first argument to the function is the table variable itself and following arguments specify which columns to keep or drop. For example, explicitly specifying a column name will result in only the listed column being selected and adding a minus sign in front will reverse the selection, meaning any specified column name preceded with a minus sign will cause it to be dropped and all remaining columns to be selected.

The output is a new table with the specified selection applied. There are numerous different ways one can select columns using dplyr::select(). For example, one could use a tidyselect function to select a set of columns based on a criterion. Refer to the function documentation for more information.

population <- dplyr::select(population, -indicator_name, -indicator_code)

We can validate that the desired columns have been removed by either listing the column names via the names() function again or taking a quick peek at the table using the head() function. It displays the first six rows of the table by default, but a this can be changed by specifying the n argument.

head(population)

Knowing that the World Bank GDP data follows the exact same format as the World Bank population data , we can read the CSV file, clean the column names, and drop the unneeded columns all in one go by combining the readr::read_csv(), janitor::clean_names(), and dplyr::select() functions using the forward pipe operator %>% from the magrittr package. The %>% operator takes either a variable or the output of a function and feeds it into the following function as its first argument.

gdp <- readr::read_csv(file = "data/gdp.csv",
                       skip = 4,
                       show_col_types = FALSE) %>%
    janitor::clean_names() %>%
    dplyr::select(-indicator_name, -indicator_code)

In the pipeline above, the tibble outputted by the readr::read_csv() function gets piped into the janitor::clean_names() function, the output of which gets passed into the dplyr::select() function. The output of the final function is saved into the variable gdp. The head() function can be used to take a quick look at the new table and ensure the constructed pipeline worked as expected.

head(gdp)

Long vs Wide Data

GDP on its own is not a good indicator of the wealth of a country as countries with more people tend to have higher GDP. But if we were to normalize GDP by population, then the resulting GDP per capita values can be compared across countries and used as a proxy for wealth. To do so, we must be able to match up the GDP and population values for each unique combination of country and year.

The GDP and population tables are currently in wide format – each row represents a unique country and each column represents a unique year with the cell values representing GDP or population estimates. While this wide format has many advantages and is commonly used in geospatial applications, it does complicate joining various data sets. One option would be to treat both tables as matrices and calculate GDP per capita by dividing the GDP matrix with the population matrix. However, both tables need to have the exact same layout with the same number of countries and years in the same exact order for this to work and the result to be reliable. Ensuring this is not a trivial task, so this method would involve a lot of work to produce reliable results.

Alternatively the two tables could be joined by country. Then we will have an extra-wide table with two sets of year columns – one set of year columns for population and another set of year columns for GDP. Then we would need to create another new column for each year by dividing the corresponding GDP column with the corresponding population column, resulting in another new set of year columns. As you can see, this approach would quickly lead to a very messy and difficult to manage table and would also involve a lot of work, making it far from preferred.

The easiest option for calculating GDP per capita would involve converting both data tables into a long format, where each row represents a single unique observation (estimation). Instead of having countries in rows and years in columns, each row would instead represent a unique country and year combination. This would allow us to easily combine data on both country and year, ensuring that the population and GDP values for each country-year combination get matched appropriately.

We can use the tidyr::pivot_longer() function to covert wide format tables into long format tables. The following arguments are of interest to us when calling the function:

data – the tibble object to convert from wide to long format
cols – the columns to pivot into longer format (the columns containing the data values)
names_to – name of the column in the long table that stores the column names from the wide table
names_prefix – the prefix to remove from the specified wide table column names (if any)
values_to – name of the column in the long table that sores the data values from the wide table

The columns we would like to convert to long format are the year columns. These are all prefixed with an x character. We can use the tidyselect::starts_with() function to select all the columns beginning with an x character and pass those as the cols argument. The column names should be stripped of the preceding x character, so we pass that as the names_prefix argument. The stripped column names represent years and the cell values represent population represent population estimates, so finally we specify names_to = "year" and values_to = "population" as the final arguments.

population_long <- tidyr::pivot_longer(data = population,
                                       cols = tidyselect::starts_with("x"),
                                       names_to = "year",
                                       names_prefix = "x",
                                       values_to = "population")
population_long

Now we have a new long population tibble called population_long, where each row represents a unique country and year combination. But note how the data type of the year column appears to be listed as chr, implying that the data is in textual character format instead of numbers. Let us confirm this by extracting the column using the $ operator and checking its data type via the class() function.

class(population_long$year)

## [1] "character"

The dplyr::mutate() function can be combined with the base as.integer() function to convert the year column back into numeric format. The first argument of the dpylr::mutate() function is the data table and subsequent arguments specify how to create new columns or modify existing columns.

For example, to apply the as.integer() function on the whole year column and then replace the year column with the new values, we would specify year = as.integer(year) as an argument.

population_long <- dplyr::mutate(population_long,
                                 year = as.integer(year))
class(population_long$year)

## [1] "integer"

Note how now the data type of the year column is listed as int or integer, meaning it is numeric.

Now let us convert the GDP table into long format as well. As with the population table, we will first need to use tidyr::pivot_longer() to pivot the table and then as.numeric() with dplyr::mutate() to fix the data type of the year column. These can be combined into a pipeline using the %>% operator.

gdp_long <- gdp %>%
    tidyr::pivot_longer(cols = tidyselect::starts_with("x"),
                        names_to = "year",
                        names_prefix = "x",
                        values_to = "gdp") %>%
    dplyr::mutate(year = as.integer(year))

gdp_long

Joining Tables

Finally we are ready to combine the population and GDP tables. The dplyr package has four different join functions we could utilize:

left_join() – include all rows from the left table and only matching rows from the right table
right_join() – include all rows from the right table and only matching tows from the left table
full_join() – include all rows from both tables regardless of whether they have a match
inner_join() – only include rows from both tables that have matches in the other table

All the aforementioned functions require at least three arguments:

x – the left table
y – the right table
by – the column(s) to join on

We would like to join on each unique country and year combination. As spellings of country names might differ between tables, it is good practice to always use the ISO 3166-1 alpha-3 country code or some other analogous unique identifier to distinguish between countries. The country code for each country is determined by an international standard and should not differ between tables, allowing us to reliably join the data. Hence we will specify by = c("country_code", "year") to perform the join on unique country-year combinations using the dplyr::inner_join() function to only keep country-year combinations that are present in both tables.

data <- dplyr::inner_join(x = population_long,
                          y = gdp_long,
                          by = c("country_code", "year"))
head(data)

Note how now the data are joined but the resulting table has two different country_name columns. That is because this column was present in both of the joined tables and was not omitted before the join. We can use the dplyr::select() to drop one of the columns by adding a minus sign in front of its name and rename the other one using the new_name = old_name convention. Then we can use dplyr::mutate() to add a new gdp_per_capita column by dividing the gdp column with the population column. We can combine these two operations into a pipeline using the %>% operator.

data <- data %>%
    dplyr::select(-country_name.y,
                  country_name = country_name.x) %>%
    dplyr::mutate(gdp_per_capita = gdp / population)

data

Defining Functions

Now we would also like to add life expectancy information to this joined table. Knowing that all World Bank data tables follow the same format, we can easily convert the workflow from before into a function that reads in a World Bank data table, drops unneeded columns, converts it to long format, and ensures the year is in numeric format. That function would only need two inputs – the path of the CSV file and the name of the indicator represented by the data. (This name will be used as the column name for the values column in the long format table.) Let us define this function and use it to read in the World Bank life expectancy table and convert it to long format.

read_world_bank_data <- function(file_path, variable_name) {
    readr::read_csv(file_path,
                    skip = 4,
                    show_col_types = FALSE) %>%
        janitor::clean_names() %>%
        dplyr::select(-indicator_name,
                      -indicator_code) %>%
        tidyr::pivot_longer(cols = starts_with("x"),
                            names_to = "year",
                            names_prefix = "x",
                            values_to = variable_name) %>%
        dplyr::mutate(year = as.integer(year)) %>%
        return()
}

life_expectancy <- read_world_bank_data("data/life-expectancy.csv",
                                        "life_expectancy")
head(life_expectancy)

Now we can use a pipeline to drop the redundant country_name column and then join the life expectancy table with the rest of our data. Remember that the output of the last function in a pipeline is specified as the first argument of the following function by default. We can override this by referring to the output of the previous function as . and specifying it elsewhere int the following function call.

data <- life_expectancy %>%
    dplyr::select(-country_name) %>%
    dplyr::inner_join(x = data,
                      y = .,
                      by = c("country_code", "year"))
data

Finally we would also like to know which United Nations regional geoscheme the country belongs to. Information on this is available in the United Nations M49 table. We can utilize a pipeline to read in and clean the table all in one go. Note that as this is a standard CSV table, there is no need to skip any rows.

m49 <- readr::read_csv(file = "data/m49.csv",
                       show_col_types = FALSE) %>%
    janitor::clean_names()

head(m49)

Note how this table contains a lot of information on the various groups and codes assigned to each country. We are only interested in the name of the region the country belongs into and need the ISO 3166-1 alpha-3 code assigned to the country to join the data. Let us use the %>% operator to create a pipeline that selects the desired columns and then performs the join.

data <- m49 %>%
    dplyr::select(region_name,
                  country_code = iso_alpha3_code) %>%
    dplyr::inner_join(data,
                      by = "country_code") %>%
    dplyr::select(country_name,
                  country_code,
                  region_name,
                  year,
                  tidyselect::everything())
data

Note how we can use dplyr::select() along with tidyselect:everything() to rearrange the order of specified columns.

Row Filtering and Static Line Graphs

If we wanted to extract data for a specific country, we could utilize boolean indexing, or we could use the dplyr::filter() function, which behaves similarly, but is more intuitive to use and works well within pipelines. For example, to extract all data for the United States, we could combiner dplyr::filter() with country_code == "USA" to extract all the rows where the country_code column is equal to "USA".

data %>%
    dplyr::filter(country_code == "USA") %>%
    head()

One can also use more complex logical statements within dplyr::filter(). For example, to extract data for all North American countries, we could use country_code %in% c("USA", "CAN", "MEX"). To confirm this works, we can select all columns that do not depend on the year from the result and utilize dplyr::distinct() to drop all duplicate rows. The result should be one row for each selected country.

data %>%
    dplyr::filter(country_code %in% c("USA", "CAN", "MEX")) %>%
    dplyr::select(country_name, country_code, region_name) %>%
    dplyr::distinct()

The dplyr::filter() function can be piped together with the ggplot2::ggplot() function to easily visualize data for a specific selection. For example, the life expectancy for the United States over time could be displayed as follows. Note that we can specify na.rm = TRUE to drop any missing observations before creating the plot. The visualization would still work otherwise, but we would get warnings regarding the missing data.

data %>%
    dplyr::filter(country_code == "USA") %>%
    ggplot2::ggplot(mapping = ggplot2::aes(x = year, y = life_expectancy)) +
    ggplot2::geom_line(na.rm = TRUE)

Following the example above, the GDP per capita growth over time can be compared across all North American countries as follows.

data %>%
    dplyr::filter(country_code %in% c("USA", "CAN", "MEX")) %>%
    ggplot2::ggplot(mapping = ggplot2::aes(x = year,
                                           y = gdp_per_capita,
                                           color = country_name)) +
    ggplot2::geom_line()

Be careful when using ggplot2 functions within pipelines as ggplot2 functions must be combined using the + operator instead of the %>% pipe operator. This is due to overly complicated technical reasons and just something we have to live with.

Visualizing Distributions and Correlations

Let us return to our original goal of exploring the relationship between health and wealth. We will use GDP per capita as a proxy for wealth and life expectancy as an indicator of health. We can simplify the analysis by looking only at one point in time and focus our analysis on 2021, which is the latest year we have both GDP per capita and life expectancy data available. We can use dplyr::filter() to extract all 2021 data and combine it with tidyr::drop_na() to remove any countries that do not have data for 2021.

data2021 <- data %>%
    dplyr::filter(year == 2021) %>%
    tidyr::drop_na()

head(data2021)

How is wealth distributed among the global population? Let us get a vague idea by visualizing the distribution of GDP per capita among world countries in 2021. We can easily create an histogram using ggplot2 by defining data2021 as the data set and specifying that gdp_per_capita should be used for the X axis. Then we add a histogram layer to the visualization using ggplot2::geom_histogram(). The histogram will have 30 bins by default, which is rarely a good number. Use either bins to specify the number of bins or bindwith to define the width of each bin.

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita)) +
    ggplot2::geom_histogram(binwidth = 10000)

But what about life expectancy? As an alternative to a histogram, we can create a kernel density estimation (KDE) visualization by following the same procedure as before but using ggplot2::geom_density() instead of ggplot2::geom_histogram(). As KDE is continuous, there is no need to specify the number or width of bins.

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = life_expectancy)) +
    ggplot2::geom_density()

A scatter plot is the best way to investigate the relationship between two variables and can be easily generated using ggplot2::geom_point().

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point()

The relationship appears to be logarithmic. This is likely due to the distribution of GDP per capita being heavily skewed. To make better sense of this potentially logarithmic the relationship, we should apply a logarithmic transformation to the axis corresponding to GDP per capita. In our example this is the X axis and we can apply a logarithmic transformation on the X axis by adding ggplot2::scale_x_log10() to the plot call.

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10()

As a scatter plot alternative, we could also create a two-dimensional kernel density estimate (KDE) surface to get an even better understanding of the data distribution and any potential relationship. This can be done by replacing the ggplot2::geom_point() function call with either ggplot2::geom_density_2d() for simple contour lines or ggplot2::geom_density_2d_filled() for a beautiful color gradient surface.

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_density2d_filled() +
    ggplot2::scale_x_log10()

But does the size of a country play a role in this relationship? What about the region? We can investigate this by specifying color = region_name and size = population in the ggplot2::aes() call and creating a colored bubble chart.

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy,
                                       color = region_name,
                                       size = population)) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10()

The population of world countries varies vastly. To illustrate this better, we can add ggplot2::scale_size() to our visualization and use the range argument to specify the smallest and largest possible bubble size.

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy,
                                       color = region_name,
                                       size = population)) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10))

A smoothed trend line can be added using ggplot2::geom_smooth() and specifyingy ~ x as the formula. The method argument is used to specify the statistical function to use for the trend line and can be one of the following:

"lm" for a simple linear model
"glm" for a generalized linear model
"gam" for a generalized additive model
"loess" for locally estimated scatter plot smoothing

ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
                                               size = population)) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10)) +
    ggplot2::geom_smooth(formula = y ~ x, method = "lm")

Additional arguments can be added to ggplot2::geom_smooth() to modify the appearance of the trend line and ggplot2::labs() can be used to add a title and label the axes and legend.

plot <- ggplot2::ggplot(data = data2021,
                mapping = ggplot2::aes(x = gdp_per_capita,
                                       y = life_expectancy)) +
    ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
                                               size = population)) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10)) +
    ggplot2::geom_smooth(formula = y ~ x,
                         method = "lm",
                         se = FALSE,
                         color = "black",
                         linetype = "dashed") +
    ggplot2::labs(title = "Health and Wealth by Country in 2021",
                  x = "GDP per Capita (USD)",
                  y = "Life Expectancy at Birth",
                  color = "Region",
                  size = "Population")
plot

Creating Interactive Visualizations

While the static scatter plot above is quite pretty to look at, it is not the most informative. We have no idea which points represent which countries and many countries appear clustered together, which makes it harder to tell them apart. An interactive visualization would allow for better exploration and investigation of the data. The easiest way of creating an interactive visualization in R is via the plotly package.

Creating an interactive visualization using plotly is somewhat reminiscent of using ggplot2 but with some key differences. The core aspects of the visualization are first defined using the plotly::plot_ly() function, the output of which is then piped into plotly::layout() to customize the appearance of the visualization by adding axis labels and transformations. Note that column names must be prefixed with a tilde character (~) when passing them as arguments in the plotly::plot_ly() function call.

plotly::plot_ly(data = data2021,
                x = ~gdp_per_capita,
                y = ~life_expectancy,
                color = ~region_name,
                text = ~country_name,
                size = ~population,
                type = "scatter",
                mode = "markers",
                sizes = c(5, 50),
                marker = list(symbol = "circle",
                              sizemode = "diameter")) %>%
    plotly::layout(title = "Health and Wealth by Country in 2021",
                   xaxis = list(title = "GDP per Capita (USD)",
                                type = "log"),
                   yaxis = list(title = "Life Expectancy at Birth"))

If construction an interactive visualization using the procedure above seems overly complicated, worry not. The plotly::ggplotly() function can be used to transform any static ggplot2 visualization into an interactive plotly visualization. For example, we can convert the previous static visualization saved into the plot variable into an interactive one as follows.

plotly::ggplotly(plot)

Creating Animatied Visualizations

Would it not be nice if we could see how the health and wealth of world countries has changed over time? Luckily R allows us to create animated visualizations! But before we get animating, we must preprocess the data a little. Countries appearing and disappearing throughout the animation would be very distracting, hence it is best to remove countries that do not have data for all the years we are interested in visualizing.

To do this, we first utilize tidyr::drop_na() to remove any country-year combinations that have missing data. Then we group the data by country using dplyr::group_by() and cleverly utilize dplyr::filter() to remove countries that do not have data for all the years. When used with grouped data, dplyr::n() gives us the number of observations in each group. In our example, this would be the number of years each country has data for. Using range(data$year) we can get the minimum and maximum years in the data and combining that with diff() gives us the number of years the data spans. Hence dplyr::filter(dplyr::n() == diff(range(data$year))) will extract all countries that have data for every possible year present in the data set.

animation_data <- data %>%
    tidyr::drop_na() %>%
    dplyr::group_by(country_code) %>%
    dplyr::filter(dplyr::n() == diff(range(data$year)))

animation_data

To create an animated visualization, we first use ggplot2 functions to define the layout for a single animation frame and then we specify the time variable using gganimate::transition_time() and the animation type via gganimate::ease_eas(). Finally we use gganimate::animate() to create the animation and define its characteristics.

animated_plot <- ggplot2::ggplot(data = animation_data,
                                 mapping = ggplot2::aes(x = gdp_per_capita,
                                                        y = life_expectancy)) +
    ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
                                               size = population)) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_size(range = c(1,10)) +
    ggplot2::geom_smooth(formula = y ~ x,
                         method = "lm",
                         se = FALSE,
                         color = "black",
                         linetype = "dashed") +
    ggplot2::labs(title = "Health and Wealth by Country in {frame_time}",
                  x = "GDP per Capita (USD)",
                  y = "Life Expectancy at Birth",
                  color = "Region",
                  size = "Population") +
    gganimate::transition_time(year) +
    gganimate::ease_aes("linear")

animation <- gganimate::animate(plot = animated_plot,
                                nframes = diff(range(data$year)),
                                fps = 4,
                                renderer = gganimate::gifski_renderer(),
                                width = 5,
                                height = 5,
                                units = "in",
                                res = 72)

The generated animation can be exported as a GIF file using gganimate::anim_save().

gganimate::anim_save(filename = "animation.gif",
                     animation = animation)

We must use kintr::include_graphics() to display the animation in the notebook.

knitr::include_graphics("animation.gif")

While GIF animations are quite cool, they cannot be easily paused to investigate a single time stamp and as you might have noticed, creating them using R is somewhat convoluted. Luckily we can easily animate an interactive plotly visualization by specifying a frame argument.

plotly::plot_ly(data = animation_data,
                x = ~gdp_per_capita,
                y = ~life_expectancy,
                color = ~region_name,
                text = ~country_name,
                size = ~population,
                frame = ~year,
                type = "scatter",
                mode = "markers",
                sizes = c(5, 50),
                marker = list(symbol = "circle",
                              sizemode = "diameter")) %>%
    plotly::layout(xaxis = list(title = "GDP per Capita (USD)",
                                type = "log"),
                   yaxis = list(title = "Life Expectancy at Birth")) %>%
    plotly::animation_opts(frame = 100,
                           transition = 0)

R for Data Manipulation and Visualization

Uku-Kaspar Uustalu

March 6, 2024