We will be using librarian
for package management as it
greatly simplifies the process of installing and attaching packages. It
needs to be installed using install.packages()
unless
already installed.
install.packages("librarian")
The librarian::shelf()
function checks whether all
listed packages are installed and installs them if needed. Then it loads
all listed packages into the environment. You can think of it as a
combination of install.packages()
and
library()
except it will not re-install a package if it is
already installed. Here are the packages we will be using:
readr is a tidyverse package that allows for fast reading and writing of rectangular text data
janitor is a package commonly used to clean messy data and quickly format column names
dplyr is a tidyverse package allowing for easy data manipulation and variable grouping
magrittr is a tidyverse package that offers a set of operators useful for constructing pipelines
tidyr is a tidyverse package intended for data restructuring with the goal of achieving tidy data
tidyselect is a tidyverse package that allows for the easy selection of columns based on criteria
ggplot2 is a tidyverse package and arguably the most popular R data visualization package
plotly allows for the easy generation of various interactive data visualizations including maps
gganimate and gifski can be used to generate and export various animated visualizations
librarian::shelf(readr, janitor, dplyr, magrittr, tidyr, tidyselect, ggplot2,
                 plotly, gganimate, gifski)
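To make the "install if needed, then attach" behaviour concrete, here is a rough base-R approximation. It is only a sketch of the idea and not how librarian is actually implemented; the helper name attach_or_install is hypothetical.

```r
# Hypothetical helper: a rough base-R approximation of librarian::shelf()
attach_or_install <- function(packages) {
  for (package in packages) {
    # Install only when the package cannot be found in the library
    if (!requireNamespace(package, quietly = TRUE)) {
      install.packages(package)
    }
    # Attach the package, passing its name as a string
    library(package, character.only = TRUE)
  }
  invisible(packages)
}

# "stats" and "utils" ship with R, so nothing needs to be installed here
attach_or_install(c("stats", "utils"))
```

Unlike this sketch, librarian::shelf() accepts unquoted package names, as in the call above.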
We will be using the following data files from the data
directory to investigate the relationship between health and wealth:
gdp.csv – World Bank gross domestic product (GDP) estimates (in USD) from 1960 until 2022
life-expectancy.csv – World Bank life expectancy at birth estimates from 1960 until 2021
m49.csv – United Nations M49 Standard Country or Area Codes for Statistical Use
population.csv – World Bank country and region population estimates from 1960 until 2022
All the data are in IETF RFC 4180 CSV (comma-separated values) format. The first four rows of the World Bank data files contain metadata, with the actual data table starting on row five.
Let us start with the population data. CSV data files can be easily
read in using the readr::read_csv()
function. The function
reads the contents of the file into a tibble
object (which
is basically an advanced data frame) and supports various additional
arguments. For example, we can utilize the skip
argument to
skip the first four rows of the CSV file as the data table does not
start until row five.
population <- readr::read_csv(file = "data/population.csv",
skip = 4,
show_col_types = FALSE)
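The effect of the skip argument can be tried out without the real data. The sketch below writes a made-up miniature file with four metadata lines to a temporary location and reads it back; the file contents and the toy names are invented purely for illustration.

```r
# Made-up miniature file: four metadata lines, then the header and the data
toy_file <- tempfile(fileext = ".csv")
writeLines(c("metadata line 1",
             "metadata line 2",
             "metadata line 3",
             "metadata line 4",
             "Country Name,Country Code,1960,1961",
             "Country A,AAA,10,11",
             "Country B,BBB,20,22"),
           toy_file)

# Skipping four lines makes the fifth line the header row
toy <- readr::read_csv(file = toy_file,
                       skip = 4,
                       show_col_types = FALSE)
names(toy)
```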
Upon file read, the function outputs a summary outlining the column
names and types of the data by default. This can be silenced by
specifying show_col_types=FALSE
when calling the
function.
Now the World Bank population data is stored in a tibble
named population
. Calling the name of the variable will
display the data in a neat interactive table viewer.
population
We see that the tibble
appears to have the following
columns:
Country Name – English name of the country
Country Code – ISO 3166-1 alpha-3 country code
Indicator Name – name of the indicator represented by the data
Indicator Code – World Bank code for the indicator
1960 … 2022 – population estimates by year
We also see that the table has 266 rows and 67 columns. The number of
rows and columns of a tibble
or other data-frame-like
object can also be programmatically extracted by calling the
dim()
function.
dim(population)
## [1] 266 67
The list of column names can be programmatically extracted via the
names()
function.
names(population)
## [1] "Country Name" "Country Code" "Indicator Name" "Indicator Code"
## [5] "1960" "1961" "1962" "1963"
## [9] "1964" "1965" "1966" "1967"
## [13] "1968" "1969" "1970" "1971"
## [17] "1972" "1973" "1974" "1975"
## [21] "1976" "1977" "1978" "1979"
## [25] "1980" "1981" "1982" "1983"
## [29] "1984" "1985" "1986" "1987"
## [33] "1988" "1989" "1990" "1991"
## [37] "1992" "1993" "1994" "1995"
## [41] "1996" "1997" "1998" "1999"
## [45] "2000" "2001" "2002" "2003"
## [49] "2004" "2005" "2006" "2007"
## [53] "2008" "2009" "2010" "2011"
## [57] "2012" "2013" "2014" "2015"
## [61] "2016" "2017" "2018" "2019"
## [65] "2020" "2021" "2022"
Note how the first four column names contain a space and the rest of the column names are all numbers. This is bad practice as various functions do not work well with column names that either contain a space or start with a number. It is common to replace all spaces with either periods or underscores and add a letter prefix to any column names consisting solely of numbers.
This can be easily achieved by using the
janitor::clean_names()
function, which takes the original
table as input and outputs a new table where the column names have been
properly formatted.
population <- janitor::clean_names(population)
names(population)
## [1] "country_name" "country_code" "indicator_name" "indicator_code"
## [5] "x1960" "x1961" "x1962" "x1963"
## [9] "x1964" "x1965" "x1966" "x1967"
## [13] "x1968" "x1969" "x1970" "x1971"
## [17] "x1972" "x1973" "x1974" "x1975"
## [21] "x1976" "x1977" "x1978" "x1979"
## [25] "x1980" "x1981" "x1982" "x1983"
## [29] "x1984" "x1985" "x1986" "x1987"
## [33] "x1988" "x1989" "x1990" "x1991"
## [37] "x1992" "x1993" "x1994" "x1995"
## [41] "x1996" "x1997" "x1998" "x1999"
## [45] "x2000" "x2001" "x2002" "x2003"
## [49] "x2004" "x2005" "x2006" "x2007"
## [53] "x2008" "x2009" "x2010" "x2011"
## [57] "x2012" "x2013" "x2014" "x2015"
## [61] "x2016" "x2017" "x2018" "x2019"
## [65] "x2020" "x2021" "x2022"
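A small made-up table shows why the cleaned names are more convenient. Before cleaning, a column whose name contains a space or starts with a digit has to be wrapped in backticks. The table and its values below are invented for illustration.

```r
# Made-up table with the same kind of awkward column names
toy <- data.frame("Country Name" = c("Country A", "Country B"),
                  "1960" = c(10, 20),
                  check.names = FALSE)

# Without cleaning, such names must be wrapped in backticks
toy$`Country Name`

# After cleaning, the columns can be referred to directly
toy_cleaned <- janitor::clean_names(toy)
names(toy_cleaned)
toy_cleaned$x1960
```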
We know that the population
table stores population
values, so the columns indicator_name
and
indicator_code
are redundant. These can be dropped using
the dplyr::select()
function. The first argument to the
function is the table variable itself and following arguments specify
which columns to keep or drop. For example, explicitly specifying a
column name will result in only the listed column being selected and
adding a minus sign in front will reverse the selection, meaning any
specified column name preceded with a minus sign will cause it to be
dropped and all remaining columns to be selected.
The output is a new table with the specified selection applied. There
are numerous different ways one can select columns using
dplyr::select()
. For example, one could use a
tidyselect
function to select a set of columns based on a
criterion. Refer to the function documentation for more information.
population <- dplyr::select(population, -indicator_name, -indicator_code)
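As a sketch of the criterion-based selection mentioned above, the same two columns could also be dropped with tidyselect::starts_with(), since both names begin with "indicator". The table below is made up and only mimics the cleaned column names.

```r
# Made-up one-row table that mimics the cleaned column names
toy <- tibble::tibble(country_name = "Country A",
                      country_code = "AAA",
                      indicator_name = "Toy indicator",
                      indicator_code = "TOY",
                      x1960 = 10)

# Drop every column whose name starts with "indicator"
toy_trimmed <- dplyr::select(toy, -tidyselect::starts_with("indicator"))
names(toy_trimmed)
```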
We can validate that the desired columns have been removed by either
listing the column names via the names()
function again or
taking a quick peek at the table using the head()
function.
It displays the first six rows of the table by default, but this can
be changed by specifying the n
argument.
head(population)
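For instance, the n argument could be used as follows. The built-in mtcars data set stands in for our table here, so the example runs without the data files.

```r
# Show only the first three rows instead of the default six
first_rows <- head(mtcars, n = 3)
nrow(first_rows)
```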
Knowing that the World Bank GDP data follows the exact same format as
the World Bank population data, we can read the CSV file, clean the
column names, and drop the unneeded columns all in one go by combining
the readr::read_csv()
, janitor::clean_names()
,
and dplyr::select()
functions using the forward pipe
operator %>%
from the magrittr
package. The
%>%
operator takes either a variable or the output of a
function and feeds it into the following function as its first
argument.
gdp <- readr::read_csv(file = "data/gdp.csv",
skip = 4,
show_col_types = FALSE) %>%
janitor::clean_names() %>%
dplyr::select(-indicator_name, -indicator_code)
In the pipeline above, the tibble
outputted by the
readr::read_csv()
function gets piped into the
janitor::clean_names()
function, the output of which gets
passed into the dplyr::select()
function. The output of the
final function is saved into the variable gdp
. The
head()
function can be used to take a quick look at the new
table and ensure the constructed pipeline worked as expected.
head(gdp)
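If the pipe is new to you, it may help to see that a pipeline is just another way of writing nested function calls. The following minimal sketch uses made-up numbers.

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The equivalent pipeline is read from left to right
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```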
GDP on its own is not a good indicator of the wealth of a country as countries with more people tend to have higher GDP. But if we were to normalize GDP by population, then the resulting GDP per capita values can be compared across countries and used as a proxy for wealth. To do so, we must be able to match up the GDP and population values for each unique combination of country and year.
The GDP and population tables are currently in wide format – each row represents a unique country and each column represents a unique year with the cell values representing GDP or population estimates. While this wide format has many advantages and is commonly used in geospatial applications, it does complicate joining various data sets. One option would be to treat both tables as matrices and calculate GDP per capita by dividing the GDP matrix with the population matrix. However, both tables need to have the exact same layout with the same number of countries and years in the same exact order for this to work and the result to be reliable. Ensuring this is not a trivial task, so this method would involve a lot of work to produce reliable results.
Alternatively the two tables could be joined by country. Then we will have an extra-wide table with two sets of year columns – one set of year columns for population and another set of year columns for GDP. Then we would need to create another new column for each year by dividing the corresponding GDP column with the corresponding population column, resulting in another new set of year columns. As you can see, this approach would quickly lead to a very messy and difficult to manage table and would also involve a lot of work, making it far from preferred.
The easiest option for calculating GDP per capita would involve converting both data tables into a long format, where each row represents a single unique observation (estimation). Instead of having countries in rows and years in columns, each row would instead represent a unique country and year combination. This would allow us to easily combine data on both country and year, ensuring that the population and GDP values for each country-year combination get matched appropriately.
We can use the tidyr::pivot_longer()
function to convert
wide format tables into long format tables. The following arguments are
of interest to us when calling the function:
data – the tibble object to convert from wide to long format
cols – the columns to pivot into longer format (the columns containing the data values)
names_to – name of the column in the long table that stores the column names from the wide table
names_prefix – the prefix to remove from the specified wide table column names (if any)
values_to – name of the column in the long table that stores the data values from the wide table
The columns we would like to convert to long format are the year
columns. These are all prefixed with an x
character. We can
use the tidyselect::starts_with()
function to select all
the columns beginning with an x
character and pass those as
the cols
argument. The column names should be stripped of
the preceding x
character, so we pass that as the
names_prefix
argument. The stripped column names represent
years and the cell values represent population
estimates, so finally we specify names_to = "year"
and
values_to = "population"
as the final arguments.
population_long <- tidyr::pivot_longer(data = population,
cols = tidyselect::starts_with("x"),
names_to = "year",
names_prefix = "x",
values_to = "population")
population_long
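The reshaping is easier to follow on a tiny table. The sketch below pivots a made-up wide table with two countries and two years; all names prefixed with toy_ and all values are invented for illustration.

```r
# Made-up wide table: one row per country, one column per year
toy_wide <- tibble::tibble(country_code = c("AAA", "BBB"),
                           x1960 = c(10, 20),
                           x1961 = c(11, 22))

# Long table: one row per country-year combination
toy_long <- tidyr::pivot_longer(data = toy_wide,
                                cols = tidyselect::starts_with("x"),
                                names_to = "year",
                                names_prefix = "x",
                                values_to = "population")
toy_long
```

Two countries and two year columns become four rows, and the year column still holds text at this point.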
Now we have a new long population tibble
called
population_long
, where each row represents a unique country
and year combination. But note how the data type of the
year
column appears to be listed as chr
,
implying that the data is in textual character
format
instead of numbers. Let us confirm this by extracting the column using
the $
operator and checking its data type via the
class()
function.
class(population_long$year)
## [1] "character"
The dplyr::mutate()
function can be combined with the
base as.integer()
function to convert the year
column back into numeric format. The first argument of the
dplyr::mutate()
function is the data table and subsequent
arguments specify how to create new columns or modify existing
columns.
For example, to apply the as.integer()
function on the
whole year
column and then replace the year
column with the new values, we would specify
year = as.integer(year)
as an argument.
population_long <- dplyr::mutate(population_long,
year = as.integer(year))
class(population_long$year)
## [1] "integer"
Note how now the data type of the year
column is listed
as int
or integer
, meaning it is numeric.
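As an aside, tidyr::pivot_longer() also has a names_transform argument that can apply the conversion during the pivot itself, which would make the separate dplyr::mutate() step unnecessary. This is an alternative to the approach used here, sketched on a made-up table.

```r
# Made-up wide table with prefixed year columns
toy_wide <- tibble::tibble(country_code = c("AAA", "BBB"),
                           x1960 = c(10, 20),
                           x1961 = c(11, 22))

# Convert the year names to integers while pivoting
toy_long <- tidyr::pivot_longer(data = toy_wide,
                                cols = tidyselect::starts_with("x"),
                                names_to = "year",
                                names_prefix = "x",
                                names_transform = list(year = as.integer),
                                values_to = "population")
class(toy_long$year)
```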
Now let us convert the GDP table into long format as well. As with
the population table, we will first need to use
tidyr::pivot_longer()
to pivot the table and then
as.integer()
with dplyr::mutate()
to fix the
data type of the year
column. These can be combined into a
pipeline using the %>%
operator.
gdp_long <- gdp %>%
tidyr::pivot_longer(cols = tidyselect::starts_with("x"),
names_to = "year",
names_prefix = "x",
values_to = "gdp") %>%
dplyr::mutate(year = as.integer(year))
gdp_long
Finally we are ready to combine the population and GDP tables. The
dplyr
package has four different join functions we could
utilize:
left_join() – include all rows from the left table and only matching rows from the right table
right_join() – include all rows from the right table and only matching rows from the left table
full_join() – include all rows from both tables regardless of whether they have a match
inner_join() – only include rows from both tables that have matches in the other table
All the aforementioned functions accept the following three arguments:
x – the left table
y – the right table
by – the column(s) to join on
We would like to join on each unique country and year combination. As
spellings of country names might differ between tables, it is good
practice to always use the ISO 3166-1
alpha-3 country code or some other analogous unique identifier to
distinguish between countries. The country code for each country is
determined by an international standard and should not differ between
tables, allowing us to reliably join the data. Hence we will specify
by = c("country_code", "year")
to perform the join on
unique country-year combinations using the
dplyr::inner_join()
function to only keep country-year
combinations that are present in both tables.
data <- dplyr::inner_join(x = population_long,
y = gdp_long,
by = c("country_code", "year"))
head(data)
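The difference between the four join types is easiest to see by counting rows. In the made-up tables below, country AAA only has a population value and country DDD only has a GDP value; all names and numbers are invented for illustration.

```r
# Made-up tables that share two of their three countries
toy_population <- tibble::tibble(country_code = c("AAA", "BBB", "CCC"),
                                 year = 2000L,
                                 population = c(100, 200, 300))
toy_gdp <- tibble::tibble(country_code = c("BBB", "CCC", "DDD"),
                          year = 2000L,
                          gdp = c(2000, 6000, 8000))
keys <- c("country_code", "year")

# Count the rows kept by each join type
join_rows <- c(
  left  = nrow(dplyr::left_join(toy_population, toy_gdp, by = keys)),
  right = nrow(dplyr::right_join(toy_population, toy_gdp, by = keys)),
  full  = nrow(dplyr::full_join(toy_population, toy_gdp, by = keys)),
  inner = nrow(dplyr::inner_join(toy_population, toy_gdp, by = keys))
)
join_rows
```

Only the inner join guarantees that every remaining row has both a population and a GDP entry, which is what we need for computing GDP per capita.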
Note how now the data are joined but the resulting table has two
different country_name
columns. That is because this column
was present in both of the joined tables and was not omitted before the
join. We can use dplyr::select()
to drop one of the
columns by adding a minus sign in front of its name and rename the other
one using the new_name = old_name
convention. Then we can
use dplyr::mutate()
to add a new
gdp_per_capita
column by dividing the gdp
column by the population
column. We can combine these two
operations into a pipeline using the %>%
operator.
data <- data %>%
dplyr::select(-country_name.y,
country_name = country_name.x) %>%
dplyr::mutate(gdp_per_capita = gdp / population)
data
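The duplicated column can be reproduced on a made-up example. The sketch below first shows the .x and .y suffixes that the join adds, and then an alternative that avoids them by dropping the shared column from one table before joining. All toy_ names and values are invented.

```r
# Made-up tables that both carry a country_name column
toy_population <- tibble::tibble(country_name = "Country A",
                                 country_code = "AAA",
                                 year = 2000L,
                                 population = 100)
toy_gdp <- tibble::tibble(country_name = "Country A",
                          country_code = "AAA",
                          year = 2000L,
                          gdp = 5000)
keys <- c("country_code", "year")

# Shared columns that are not join keys receive .x and .y suffixes
toy_joined <- dplyr::inner_join(x = toy_population, y = toy_gdp, by = keys)
names(toy_joined)

# Alternative: drop the duplicate column from one table before the join
toy_clean <- dplyr::inner_join(x = toy_population,
                               y = dplyr::select(toy_gdp, -country_name),
                               by = keys)
toy_clean <- dplyr::mutate(toy_clean, gdp_per_capita = gdp / population)
toy_clean
```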
Now we would also like to add life expectancy information to this joined table. Knowing that all World Bank data tables follow the same format, we can easily convert the workflow from before into a function that reads in a World Bank data table, drops unneeded columns, converts it to long format, and ensures the year is in numeric format. That function would only need two inputs – the path of the CSV file and the name of the indicator represented by the data. (This name will be used as the column name for the values column in the long format table.) Let us define this function and use it to read in the World Bank life expectancy table and convert it to long format.
read_world_bank_data <- function(file_path, variable_name) {
  # Read the CSV file, skipping the four metadata rows above the header
  readr::read_csv(file_path,
                  skip = 4,
                  show_col_types = FALSE) %>%
    janitor::clean_names() %>%
    dplyr::select(-indicator_name,
                  -indicator_code) %>%
    # Pivot the year columns into long format
    tidyr::pivot_longer(cols = tidyselect::starts_with("x"),
                        names_to = "year",
                        names_prefix = "x",
                        values_to = variable_name) %>%
    # The value of the last step is returned by the function
    dplyr::mutate(year = as.integer(year))
}
life_expectancy <- read_world_bank_data("data/life-expectancy.csv",
"life_expectancy")
head(life_expectancy)
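To check such a function without the real data files, it can be run on a tiny file in the same layout. The sketch below repeats the function definition so that it runs on its own, and writes a made-up file to a temporary location; the file contents and the toy_value column name are invented for illustration.

```r
library(magrittr)

read_world_bank_data <- function(file_path, variable_name) {
  readr::read_csv(file_path, skip = 4, show_col_types = FALSE) %>%
    janitor::clean_names() %>%
    dplyr::select(-indicator_name, -indicator_code) %>%
    tidyr::pivot_longer(cols = tidyselect::starts_with("x"),
                        names_to = "year",
                        names_prefix = "x",
                        values_to = variable_name) %>%
    dplyr::mutate(year = as.integer(year))
}

# Made-up file: four metadata lines, then a header in the World Bank layout
toy_file <- tempfile(fileext = ".csv")
writeLines(c("metadata line 1",
             "metadata line 2",
             "metadata line 3",
             "metadata line 4",
             "Country Name,Country Code,Indicator Name,Indicator Code,1960,1961",
             "Country A,AAA,Toy indicator,TOY,10,11",
             "Country B,BBB,Toy indicator,TOY,20,22"),
           toy_file)

toy_long <- read_world_bank_data(toy_file, "toy_value")
toy_long
```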
Now we can use a pipeline to drop the redundant
country_name
column and then join the life expectancy table
with the rest of our data. Remember that the output of the previous function
in a pipeline is passed as the first argument of the following
function by default. We can override this by referring to the output of
the previous function as .
and specifying it elsewhere in
the following function call.
data <- life_expectancy %>%
dplyr::select(-country_name) %>%
dplyr::inner_join(x = data,
y = .,
by = c("country_code", "year"))
data
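The placeholder behaviour can be seen in isolation with a made-up one-line example, where the piped value is used as the digits argument instead of the first argument.

```r
library(magrittr)

# The piped value 2 is inserted where the dot is, so this is
# the same as calling round(3.14159, digits = 2)
rounded <- 2 %>% round(3.14159, digits = .)
rounded
```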
Finally we would also like to know which United Nations regional geoscheme the country belongs to. Information on this is available in the United Nations M49 table. We can utilize a pipeline to read in and clean the table all in one go. Note that as this is a standard CSV table, there is no need to skip any rows.
m49 <- readr::read_csv(file = "data/m49.csv",
show_col_types = FALSE) %>%
janitor::clean_names()
head(m49)
Note how this table contains a lot of information on the various
groups and codes assigned to each country. We are only interested in the
name of the region the country belongs to and need the ISO 3166-1
alpha-3 code assigned to the country to join the data. Let us use
the %>%
operator to create a pipeline that selects the
desired columns and then performs the join.
data <- m49 %>%
dplyr::select(region_name,
country_code = iso_alpha3_code) %>%
dplyr::inner_join(data,
by = "country_code") %>%
dplyr::select(country_name,
country_code,
region_name,
year,
tidyselect::everything())
data
Note how we can use dplyr::select()
along with
tidyselect::everything()
to rearrange the order of specified
columns.
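A minimal sketch of this reordering pattern on a made-up table: the columns listed first are moved to the front and tidyselect::everything() then appends the remaining columns in their existing order.

```r
# Made-up table with the year column in last position
toy <- tibble::tibble(population = 100,
                      gdp = 5000,
                      year = 2000L)

# Move year to the front and keep the other columns in their current order
toy_reordered <- dplyr::select(toy, year, tidyselect::everything())
names(toy_reordered)
```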
If we wanted to extract data for a specific country, we could utilize
Boolean indexing, or we could use the dplyr::filter()
function, which behaves similarly, but is more intuitive to use and
works well within pipelines. For example, to extract all data for the
United States, we could combine dplyr::filter()
with
country_code == "USA"
to extract all the rows where the
country_code
column is equal to "USA"
.
data %>%
dplyr::filter(country_code == "USA") %>%
head()
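One difference between the two approaches is worth knowing: when the condition evaluates to NA, base R indexing of a data frame returns a row of missing values, whereas dplyr::filter() drops the row. The sketch below uses a made-up data frame to show this.

```r
# Made-up table where one country code is missing
toy <- data.frame(country_code = c("USA", "CAN", NA),
                  value = c(1, 2, 3))

# Boolean indexing keeps an all-NA row for the missing code
toy_indexed <- toy[toy$country_code == "USA", ]
nrow(toy_indexed)

# dplyr::filter() only keeps rows where the condition is TRUE
toy_filtered <- dplyr::filter(toy, country_code == "USA")
nrow(toy_filtered)
```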
One can also use more complex logical statements within
dplyr::filter()
. For example, to extract data for the three largest North
American countries, we could use
country_code %in% c("USA", "CAN", "MEX")
. To confirm this
works, we can select all columns that do not depend on the year from the
result and utilize dplyr::distinct()
to drop all duplicate
rows. The result should be one row for each selected country.
data %>%
dplyr::filter(country_code %in% c("USA", "CAN", "MEX")) %>%
dplyr::select(country_name, country_code, region_name) %>%
dplyr::distinct()
The dplyr::filter()
function can be piped together with
the ggplot2::ggplot()
function to easily visualize data for
a specific selection. For example, the life expectancy for the United
States over time could be displayed as follows. Note that we can specify
na.rm = TRUE
to drop any missing observations before
creating the plot. The visualization would still work otherwise, but we
would get warnings regarding the missing data.
data %>%
dplyr::filter(country_code == "USA") %>%
ggplot2::ggplot(mapping = ggplot2::aes(x = year, y = life_expectancy)) +
ggplot2::geom_line(na.rm = TRUE)
Following the example above, the GDP per capita growth over time can be compared across these three North American countries as follows.
data %>%
dplyr::filter(country_code %in% c("USA", "CAN", "MEX")) %>%
ggplot2::ggplot(mapping = ggplot2::aes(x = year,
y = gdp_per_capita,
color = country_name)) +
ggplot2::geom_line()
Be careful when using ggplot2
functions within pipelines
as ggplot2
functions must be combined using the
+
operator instead of the %>%
pipe
operator. This is because ggplot2 builds a visualization by adding layer
objects to a plot object rather than by passing a table from one function to the next.
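The following sketch on a made-up table shows the two operators side by side: the pipe hands the table to ggplot2::ggplot(), and from that point on layers are added to the resulting plot object with +. The plot can also be stored in a variable and extended later.

```r
library(magrittr)

# Made-up table with three yearly observations
toy <- data.frame(year = c(2000, 2001, 2002),
                  value = c(1, 3, 2))

# The pipe passes the table in; layers are then added with +
toy_plot <- toy %>%
  ggplot2::ggplot(mapping = ggplot2::aes(x = year, y = value)) +
  ggplot2::geom_line()

# A stored plot can be extended with further layers later on
toy_plot_with_points <- toy_plot + ggplot2::geom_point()
```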
Let us return to our original goal of exploring the relationship
between health and wealth. We will use GDP per capita as a proxy for
wealth and life expectancy as an indicator of health. We can simplify
the analysis by looking only at one point in time and focus our analysis
on 2021, which is the latest year we have both GDP per capita and life
expectancy data available. We can use dplyr::filter()
to
extract all 2021 data and combine it with tidyr::drop_na()
to remove any countries that do not have data for 2021.
data2021 <- data %>%
dplyr::filter(year == 2021) %>%
tidyr::drop_na()
head(data2021)
How is wealth distributed among the global population? Let us get a
vague idea by visualizing the distribution of GDP per capita among world
countries in 2021. We can easily create a histogram using
ggplot2
by defining data2021
as the data set
and specifying that gdp_per_capita
should be used for the X
axis. Then we add a histogram layer to the visualization using
ggplot2::geom_histogram()
. The histogram will have 30 bins
by default, which is rarely a good number. Use either bins
to specify the number of bins or binwidth
to define the
width of each bin.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita)) +
ggplot2::geom_histogram(binwidth = 10000)
But what about life expectancy? As an alternative to a histogram, we
can create a kernel density estimation (KDE) visualization by following
the same procedure as before but using
ggplot2::geom_density()
instead of
ggplot2::geom_histogram()
. As KDE is continuous, there is
no need to specify the number or width of bins.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = life_expectancy)) +
ggplot2::geom_density()
A scatter plot is a natural way to investigate the relationship
between two variables and can be easily generated using
ggplot2::geom_point()
.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy)) +
ggplot2::geom_point()
The relationship appears to be logarithmic. This is likely due to the
distribution of GDP per capita being heavily skewed. To make better
sense of this potentially logarithmic relationship, we should apply
a logarithmic transformation to the axis corresponding to GDP per
capita. In our example this is the X axis and we can apply a logarithmic
transformation on the X axis by adding
ggplot2::scale_x_log10()
to the plot call.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy)) +
ggplot2::geom_point() +
ggplot2::scale_x_log10()
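Why a logarithmic axis helps can be checked numerically. The sketch below uses made-up data in which life expectancy is an exact linear function of the base-10 logarithm of GDP per capita; the numbers are invented and are not estimates from the World Bank files.

```r
# Made-up data: y is exactly linear in log10(x)
toy_gdp_per_capita <- 10^seq(from = 2, to = 5, length.out = 20)
toy_life_expectancy <- 40 + 8 * log10(toy_gdp_per_capita)

# On the raw scale the relationship is curved, so the correlation is below 1
cor_raw <- cor(toy_gdp_per_capita, toy_life_expectancy)

# On the log scale the same points fall on a straight line
cor_log <- cor(log10(toy_gdp_per_capita), toy_life_expectancy)

c(raw = cor_raw, log = cor_log)
```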
As a scatter plot alternative, we could also create a two-dimensional
kernel density estimate (KDE) surface to get an even better
understanding of the data distribution and any potential relationship.
This can be done by replacing the ggplot2::geom_point()
function call with either ggplot2::geom_density_2d()
for
simple contour lines or ggplot2::geom_density_2d_filled()
for a beautiful color gradient surface.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy)) +
ggplot2::geom_density_2d_filled() +
ggplot2::scale_x_log10()
But does the size of a country play a role in this relationship? What
about the region? We can investigate this by specifying
color = region_name
and size = population
in
the ggplot2::aes()
call and creating a colored bubble
chart.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy,
color = region_name,
size = population)) +
ggplot2::geom_point() +
ggplot2::scale_x_log10()
The population of world countries varies vastly. To illustrate this
better, we can add ggplot2::scale_size()
to our
visualization and use the range
argument to specify the
smallest and largest possible bubble size.
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy,
color = region_name,
size = population)) +
ggplot2::geom_point() +
ggplot2::scale_x_log10() +
ggplot2::scale_size(range = c(1,10))
A smoothed trend line can be added using
ggplot2::geom_smooth()
and specifying y ~ x
as
the formula
. The method
argument is used to
specify the statistical function to use for the trend line and can be
one of the following:
"lm" for a simple linear model
"glm" for a generalized linear model
"gam" for a generalized additive model
"loess" for locally estimated scatter plot smoothing
ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy)) +
ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
size = population)) +
ggplot2::scale_x_log10() +
ggplot2::scale_size(range = c(1,10)) +
ggplot2::geom_smooth(formula = y ~ x, method = "lm")
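Because ggplot2 applies scale transformations before computing statistics, the linear trend line above is fitted against the log-transformed X values. Under that reading, the fit corresponds to the base R model sketched below, shown here on made-up data rather than on data2021.

```r
# Made-up data: y is exactly linear in log10(x)
toy <- data.frame(gdp_per_capita = 10^seq(from = 2, to = 5, length.out = 20))
toy$life_expectancy <- 40 + 8 * log10(toy$gdp_per_capita)

# Linear model of life expectancy against log-transformed GDP per capita
toy_fit <- lm(life_expectancy ~ log10(gdp_per_capita), data = toy)
coef(toy_fit)
```

The slope can be read as the change in life expectancy associated with a tenfold increase in GDP per capita.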
Additional arguments can be added to
ggplot2::geom_smooth()
to modify the appearance of the
trend line and ggplot2::labs()
can be used to add a title
and label the axes and legend.
plot <- ggplot2::ggplot(data = data2021,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy)) +
ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
size = population)) +
ggplot2::scale_x_log10() +
ggplot2::scale_size(range = c(1,10)) +
ggplot2::geom_smooth(formula = y ~ x,
method = "lm",
se = FALSE,
color = "black",
linetype = "dashed") +
ggplot2::labs(title = "Health and Wealth by Country in 2021",
x = "GDP per Capita (USD)",
y = "Life Expectancy at Birth",
color = "Region",
size = "Population")
plot
While the static scatter plot above is quite pretty to look at, it is
not the most informative. We have no idea which points represent which
countries and many countries appear clustered together, which makes it
harder to tell them apart. An interactive visualization would allow for
better exploration and investigation of the data. The easiest way of
creating an interactive visualization in R is via the
plotly
package.
Creating an interactive visualization using plotly
is
somewhat reminiscent of using ggplot2
but with some key
differences. The core aspects of the visualization are first defined
using the plotly::plot_ly()
function, the output of which
is then piped into plotly::layout()
to customize the
appearance of the visualization by adding axis labels and
transformations. Note that column names must be prefixed with a tilde
character (~
) when passing them as arguments in the
plotly::plot_ly()
function call.
plotly::plot_ly(data = data2021,
x = ~gdp_per_capita,
y = ~life_expectancy,
color = ~region_name,
text = ~country_name,
size = ~population,
type = "scatter",
mode = "markers",
sizes = c(5, 50),
marker = list(symbol = "circle",
sizemode = "diameter")) %>%
plotly::layout(title = "Health and Wealth by Country in 2021",
xaxis = list(title = "GDP per Capita (USD)",
type = "log"),
yaxis = list(title = "Life Expectancy at Birth"))
If constructing an interactive visualization using the procedure
above seems overly complicated, worry not. The
plotly::ggplotly()
function can be used to transform any
static ggplot2
visualization into an interactive
plotly
visualization. For example, we can convert the
previous static visualization saved into the plot
variable
into an interactive one as follows.
plotly::ggplotly(plot)
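A converted plot does not show country names on hover unless they are mapped to an aesthetic. One possible approach, sketched below on a made-up table, is to map the name to a text aesthetic and pass tooltip = "text" to plotly::ggplotly(). Note that ggplot2 itself does not use the text aesthetic and may warn that it is being ignored.

```r
# Made-up table with two countries
toy <- data.frame(country_name = c("Country A", "Country B"),
                  gdp_per_capita = c(1000, 20000),
                  life_expectancy = c(60, 80))

# The text aesthetic is only used by plotly for the hover label
toy_plot <- ggplot2::ggplot(data = toy,
                            mapping = ggplot2::aes(x = gdp_per_capita,
                                                   y = life_expectancy)) +
  ggplot2::geom_point(mapping = ggplot2::aes(text = country_name)) +
  ggplot2::scale_x_log10()

# Show only the mapped text in the hover tooltip
toy_widget <- plotly::ggplotly(toy_plot, tooltip = "text")
```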
Would it not be nice if we could see how the health and wealth of world countries have changed over time? Luckily R allows us to create animated visualizations! But before we get animating, we must preprocess the data a little. Countries appearing and disappearing throughout the animation would be very distracting, so it is best to remove countries that do not have data for all the years we are interested in visualizing.
To do this, we first utilize tidyr::drop_na()
to remove
any country-year combinations that have missing data. Then we group the
data by country using dplyr::group_by()
and cleverly
utilize dplyr::filter()
to remove countries that do not
have data for all the years. When used with grouped data,
dplyr::n()
gives us the number of observations in each
group. In our example, this would be the number of years each country
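has data for. Using range(data$year)
we can get the minimum
and maximum years in the data and combining that with
diff()
gives us the difference between the two, which is one less than the number of years in that range. This equals the number of years with complete data only if exactly one year in the range (here 2022, for which there are no life expectancy estimates) is removed entirely by tidyr::drop_na(). Under that condition
dplyr::filter(dplyr::n() == diff(range(data$year)))
will
extract all countries that have data for every year with complete data in
the data set.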
animation_data <- data %>%
tidyr::drop_na() %>%
dplyr::group_by(country_code) %>%
dplyr::filter(dplyr::n() == diff(range(data$year)))
animation_data
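A variant that does not depend on how many years drop out is to count the distinct years that remain after removing missing values. The sketch below demonstrates this on a made-up table; it is an alternative to the filter above, not the approach used in the rest of this tutorial.

```r
library(magrittr)

# Made-up table: country BBB has no data for 2001 and a missing value in 2002
toy <- tibble::tibble(country_code = c("AAA", "AAA", "AAA", "BBB", "BBB"),
                      year = c(2000L, 2001L, 2002L, 2000L, 2002L),
                      life_expectancy = c(70, 71, 72, 60, NA))

toy_complete <- tidyr::drop_na(toy)

# Number of distinct years left after dropping missing values
n_years <- dplyr::n_distinct(toy_complete$year)

# The difference between the last and first year is one less than that
year_span <- diff(range(toy_complete$year))

# Keep only the countries that have a row for every remaining year
toy_animation <- toy_complete %>%
  dplyr::group_by(country_code) %>%
  dplyr::filter(dplyr::n() == n_years) %>%
  dplyr::ungroup()
unique(toy_animation$country_code)
```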
To create an animated visualization, we first use
ggplot2
functions to define the layout for a single
animation frame and then we specify the time variable using
gganimate::transition_time()
and the animation type via
gganimate::ease_aes()
. Finally we use
gganimate::animate()
to create the animation and define its
characteristics.
animated_plot <- ggplot2::ggplot(data = animation_data,
mapping = ggplot2::aes(x = gdp_per_capita,
y = life_expectancy)) +
ggplot2::geom_point(mapping = ggplot2::aes(color = region_name,
size = population)) +
ggplot2::scale_x_log10() +
ggplot2::scale_size(range = c(1,10)) +
ggplot2::geom_smooth(formula = y ~ x,
method = "lm",
se = FALSE,
color = "black",
linetype = "dashed") +
ggplot2::labs(title = "Health and Wealth by Country in {frame_time}",
x = "GDP per Capita (USD)",
y = "Life Expectancy at Birth",
color = "Region",
size = "Population") +
gganimate::transition_time(year) +
gganimate::ease_aes("linear")
animation <- gganimate::animate(plot = animated_plot,
nframes = diff(range(data$year)),
fps = 4,
renderer = gganimate::gifski_renderer(),
width = 5,
height = 5,
units = "in",
res = 72)
The generated animation can be exported as a GIF file using
gganimate::anim_save()
.
gganimate::anim_save(filename = "animation.gif",
animation = animation)
We must use knitr::include_graphics()
to display the
animation in the notebook.
knitr::include_graphics("animation.gif")
While GIF animations are quite cool, they cannot be easily paused to
investigate a single time stamp and as you might have noticed, creating
them using R is somewhat convoluted. Luckily we can easily animate an
interactive plotly
visualization by specifying a
frame
argument.
plotly::plot_ly(data = animation_data,
x = ~gdp_per_capita,
y = ~life_expectancy,
color = ~region_name,
text = ~country_name,
size = ~population,
frame = ~year,
type = "scatter",
mode = "markers",
sizes = c(5, 50),
marker = list(symbol = "circle",
sizemode = "diameter")) %>%
plotly::layout(xaxis = list(title = "GDP per Capita (USD)",
type = "log"),
yaxis = list(title = "Life Expectancy at Birth")) %>%
plotly::animation_opts(frame = 100,
transition = 0)