# Load packages
library(tidyverse)
library(gapminder)
# Load data
data(gapminder)
1 Learning Objectives
- Understand importance of data frames;
- Explore Gapminder dataset;
- Use basic data frame functions.
2 Start a Script
For this lab or project, begin by:
- Starting a new
R
script - Create a good header section and table of contents
- Save the script file with an informative name
- set your working directory
Aim to make the script a future reference for doing things in R
!
3 Introduction
dplyr
is an R
package for data manipulation, developed by Hadley Wickham and Romain Francois. It is built to be fast, highly expressive, and open-minded about how your data is stored. It is installed as part of the tidyverse
meta-package and, as a core package, it is among those loaded via library(tidyverse)
. dplyr
is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
dplyr
’s roots are in an earlier package called plyr
, which implements the “split-apply-combine” strategy for data analysis. Where plyr
covers a diverse set of inputs and outputs (e.g., arrays, data frames, lists), dplyr
has a laser-like focus on data frames. This makes dplyr
blazing fast, much faster than plyr
for data frames, and it also makes it easier to learn since you only have to learn one set of tools.
Have no idea what I’m talking about? Not sure if you care? If you use these base R
functions: subset()
, apply()
, [sl]apply()
, tapply()
, aggregate()
, split()
, do.call()
, with()
, within()
then you will be interested in dplyr
.
4 Packages and Data
We are going to load the tidyverse
, which will load dplyr
, as well as gapminder
:
As a reminder, the gapminder
data frame is a special kind of data frame: a tibble.
# Print data
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
It’s tibble-ness is why we get nice compact printing. For a reminder of the problems with base data frame printing, go type iris
in the R
Console or, better yet, print a data frame to screen that has lots of columns. Note how gapminder
’s class()
includes tbl_df
; the “tibble” terminology is a nod to this:
# Identify class of gapminder
class(gapminder)
[1] "tbl_df" "tbl" "data.frame"
There will be some functions, like print()
, that know about tibbles and do something special. There will others that do not, like summary()
. In which case the regular data frame treatment will happen, because every tibble is also a regular data frame.
To turn any data frame into a tibble use as_tibble()
:
# Convert iris to tibble
as_tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
5 Subsetting Data
If you ever feel the urge to store a little snippet of your data:
# Create a new data frame containing only the data for Canada
<- gapminder[241:252, ]) (canada
# A tibble: 12 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Canada Americas 1952 68.8 14785584 11367.
2 Canada Americas 1957 70.0 17010154 12490.
3 Canada Americas 1962 71.3 18985849 13462.
4 Canada Americas 1967 72.1 20819767 16077.
5 Canada Americas 1972 72.9 22284500 18971.
6 Canada Americas 1977 74.2 23796400 22091.
7 Canada Americas 1982 75.8 25201900 22899.
8 Canada Americas 1987 76.9 26549700 26627.
9 Canada Americas 1992 78.0 28523502 26343.
10 Canada Americas 1997 78.6 30305843 28955.
11 Canada Americas 2002 79.8 31902268 33329.
12 Canada Americas 2007 80.7 33390141 36319.
Stop and ask yourself …
Do I want to create mini datasets for each level of some factor (or unique combination of several factors) … in order to compute or graph something?
If YES, use proper data aggregation techniques – don’t subset the data. Or, more realistic, only subset the data as a temporary measure while you develop your elegant code for computing on or visualising these data subsets.
If NO, then maybe you really do need to store a copy of a subset of the data. But seriously consider whether you can achieve your goals by simply using the subset =
argument of, e.g., the lm()
function, to limit computation to your excerpt of choice. Lots of functions offer a subset =
argument!
Copies and excerpts of your data clutter your workspace, invite mistakes, and sow general confusion. Avoid whenever possible.
Reality can also lie somewhere in between. You will find the workflows presented below can help you accomplish your goals with minimal creation of temporary, intermediate objects.
5.1 Logical Expressions
Logical operations to access specific parts of the data that match our specification. There are many logical operators you can use to subset your data and these can be combined in different ways:
==
equal to!=
not equal to>
greater than<
less than>=
greater than or equal to<=
less than or equal to%in%
is in a vector!
not&
and|
or
We will cover these in more detail!
5.2 filter()
Rows
filter()
takes logical expressions and returns the rows for which all are TRUE
.
# All rows where lifeExp < 29
filter(gapminder, lifeExp < 29)
# A tibble: 2 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Rwanda Africa 1992 23.6 7290203 737.
# All rows where country is Rwanda and year is greater than 1979
filter(gapminder, country == "Rwanda", year > 1979)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Rwanda Africa 1982 46.2 5507565 882.
2 Rwanda Africa 1987 44.0 6349365 848.
3 Rwanda Africa 1992 23.6 7290203 737.
4 Rwanda Africa 1997 36.1 7212583 590.
5 Rwanda Africa 2002 43.4 7852401 786.
6 Rwanda Africa 2007 46.2 8860588 863.
# All rows where country is Rwanda or Afghanistan
filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
# A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 14 more rows
Compare with some base R
code to accomplish the same things:
# All rows where lifeExp < 29
$lifeExp < 29, ] # repeat `gapminder`, [i, j] indexing is distracting
gapminder[gapminder
# All rows where country is Rwanda
subset(gapminder, country == "Rwanda") ## almost same as filter
Under no circumstances should you subset your data the way I did at first:
# Create a new data frame containing only the data for Canada
<- gapminder[241:252, ] excerpt
Why is this a terrible idea?
- It is not self-documenting. What is so special about rows 241 through 252?
- It is fragile. This line of code will produce different results if someone changes the row order of
gapminder
, e.g., sorts the data earlier in the script.
# Filter for Canada
filter(gapminder, country == "Canada")
This call explains itself and is fairly robust.
5.3 The Pipe Operator
Before we go any further, we should exploit the pipe operator that the tidyverse imports from the magrittr
package by Stefan Bache. This is going to change your data analytical life. You no longer need to enact multi-operation commands by nesting them inside each other, like so many Russian nesting dolls. This syntax leads to code that is much easier to write and to read.
Here’s what it looks like: %>%
. The RStudio keyboard shortcut: Ctrl+Shift+M (Windows), Cmd+Shift+M (Mac).
Let’s demo then I’ll explain.
# Using the pipe operator to chain together commands
%>%
gapminder head() # same as head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
This is equivalent to head(gapminder)
. The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.
Never fear, you can still specify other arguments to this function! To see the first 3 rows of gapminder
, we could say head(gapminder, 3)
or this:
# Using the pipe operator to chain together commands
%>%
gapminder head(3)
# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
You should think “then” whenever you see the pipe operator, %>%
.
5.4 select()
Columns
Use select()
to subset the data on variables or columns. Here’s a conventional call:
# Select the year and lifeExp columns
select(gapminder, year, lifeExp)
# A tibble: 1,704 × 2
year lifeExp
<int> <dbl>
1 1952 28.8
2 1957 30.3
3 1962 32.0
4 1967 34.0
5 1972 36.1
6 1977 38.4
7 1982 39.9
8 1987 40.8
9 1992 41.7
10 1997 41.8
# ℹ 1,694 more rows
And here’s the same operation, but written with the pipe operator and piped through head()
:
# Using the pipe operator to chain together commands
%>%
gapminder select(year, lifeExp) %>% # select columns
head(4) # show first 4 rows
# A tibble: 4 × 2
year lifeExp
<int> <dbl>
1 1952 28.8
2 1957 30.3
3 1962 32.0
4 1967 34.0
Think: “Take gapminder
, then select the variables year
and lifeExp
, then show the first 4 rows.”
5.5 Convenience!
Here’s the data for Cambodia, but only certain variables:
# Using the pipe operator to chain together commands
%>%
gapminder filter(country == "Cambodia") %>% # filter for Cambodia
select(year, lifeExp) # select columns
# A tibble: 12 × 2
year lifeExp
<int> <dbl>
1 1952 39.4
2 1957 41.4
3 1962 43.4
4 1967 45.4
5 1972 40.3
6 1977 31.2
7 1982 51.0
8 1987 53.9
9 1992 55.8
10 1997 56.5
11 2002 56.8
12 2007 59.7
and what a typical base R
call would look like:
$country == "Cambodia", c("year", "lifeExp")] gapminder[gapminder
# A tibble: 12 × 2
year lifeExp
<int> <dbl>
1 1952 39.4
2 1957 41.4
3 1962 43.4
4 1967 45.4
5 1972 40.3
6 1977 31.2
7 1982 51.0
8 1987 53.9
9 1992 55.8
10 1997 56.5
11 2002 56.8
12 2007 59.7
6 Pure, Predictable, Pipeable
We’ve barely scratched the surface of dplyr
but I want to point out key principles you may start to appreciate. dplyr
’s verbs, such as filter()
and select()
, are what’s called pure functions. To quote from Wickham’s book:
The functions that are the easiest to understand and reason about are pure functions: functions that always map the same input to the same output and have no other impact on the workspace. In other words, pure functions have no side effects: they don’t affect the state of the world in any way apart from the value they return.
In fact, these verbs are a special case of pure functions: they take the same flavor of object as input and output. Namely, a data frame or one of the other data receptacles dplyr supports. And finally, the data is always the very first argument of the verb functions. This set of deliberate design choices, together with the new pipe operator, produces a highly effective, low friction domain-specific language for data analysis.
As we move on to look at some of dplyr
’s other verbs, you will see that they all follow these same principles. This makes them easy to learn and easy to use. Because we are now going to be making changes to the gapminder
tibble. To eliminate any fear that you’re damaging the data that comes with the package, we create an explicit copy of gapminder
for our experiments.
# Create a new object called my_gap
<- gapminder my_gap
Pay close attention to when we evaluate statements but let the output just print to screen:
## Let output print to screen, but do not store
%>%
my_gap filter(country == "Canada") # filter for Canada
… versus when we assign the output to an object, possibly overwriting an existing object.
## Store the output as an R object
<- my_gap %>%
my_precious filter(country == "Canada") # filter for Canada
7 mutate()
Variables
Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together. mutate()
is a function that defines and inserts new variables into a data frame. You can refer to existing variables by name.
# Using the pipe operator to chain together commands
%>%
my_gap mutate(gdp = pop * gdpPercap) # create new variable
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
# ℹ 1,694 more rows
Hmmmm … those GDP numbers are almost uselessly large and abstract. Maybe it would be more meaningful to consumers of my tables and figures to stick with GDP per capita. But what if I reported GDP per capita, relative to some benchmark country. Since we’re in the UK, we’ll go with that.
I need to create a new variable that is gdpPercap
divided by UK gdpPercap
, taking care that I always divide two numbers that pertain to the same year.
How I achieve this:
- Filter down to the rows for United Kingdom.
- Create a new temporary variable in
my_gap
:- Extract the
gdpPercap
variable from the UK data. - Replicate it once per country in the data set, so it has the right length.
- Extract the
- Divide raw
gdpPercap
by this UK figure. - Discard the temporary variable of replicated UK
gdpPercap
.
# Filter data for the UK
<- my_gap %>%
ctib filter(country == "United Kingdom") # filter for UK
# Create a new colimn for GDP per capita relative to the UK
<- my_gap %>%
my_gap mutate(tmp = rep(ctib$gdpPercap, nlevels(country)), # Create a temporary variable
gdpPercapRel = gdpPercap / tmp, # Create the new variable
tmp = NULL) # Discard the temporary variable
Note that, mutate()
builds new variables sequentially so you can reference earlier ones (like tmp
) when defining later ones (like gdpPercapRel
). Also, you can get rid of a variable by setting it to NULL
. How could we sanity check that this worked? The Canadian values for gdpPercapRel
should all be 1!
# Check it worked!
%>%
my_gap filter(country == "United Kingdom") %>% # Filter for UK
select(country, year, gdpPercapRel) # Select only these columns
# A tibble: 12 × 3
country year gdpPercapRel
<fct> <int> <dbl>
1 United Kingdom 1952 1
2 United Kingdom 1957 1
3 United Kingdom 1962 1
4 United Kingdom 1967 1
5 United Kingdom 1972 1
6 United Kingdom 1977 1
7 United Kingdom 1982 1
8 United Kingdom 1987 1
9 United Kingdom 1992 1
10 United Kingdom 1997 1
11 United Kingdom 2002 1
12 United Kingdom 2007 1
I perceive the UK to be a “high GDP” country, so I predict that the distribution of gdpPercapRel
is located below 1, possibly even well below. Check your intuition!
# Check the distribution of gdpPercapRel
summary(my_gap$gdpPercapRel)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.008181 0.071340 0.197620 0.378300 0.531490 10.860490
The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to the UK, across the entire time period.
Remember: Trust No One. Including (especially?) yourself. Always try to find a way to check that you’ve done what meant to. Prepare to be horrified.
8 arrange()
Data
arrange()
reorders the rows in a data frame. Imagine you wanted this data ordered by year then country, as opposed to by country then year.
# Arrange by year then country
%>%
my_gap arrange(year, country) # arrange by year then country
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdpPercapRel
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 0.0781
2 Albania Europe 1952 55.2 1282697 1601. 0.160
3 Algeria Africa 1952 43.1 9279525 2449. 0.245
4 Angola Africa 1952 30.0 4232095 3521. 0.353
5 Argentina Americas 1952 62.5 17876956 5911. 0.592
6 Australia Oceania 1952 69.1 8691212 10040. 1.01
7 Austria Europe 1952 66.8 6927772 6137. 0.615
8 Bahrain Asia 1952 50.9 120447 9867. 0.989
9 Bangladesh Asia 1952 37.5 46886859 684. 0.0686
10 Belgium Europe 1952 68 8730405 8343. 0.836
# ℹ 1,694 more rows
Or maybe you want just the data from 2007, sorted on life expectancy?
# Arrange by year then life expectancy
%>%
my_gap filter(year == 2007) %>% # filter for 2007
arrange(lifeExp) # arrange by year then life expectancy
# A tibble: 142 × 7
country continent year lifeExp pop gdpPercap gdpPercapRel
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Swaziland Africa 2007 39.6 1.13e6 4513. 0.136
2 Mozambique Africa 2007 42.1 2.00e7 824. 0.0248
3 Zambia Africa 2007 42.4 1.17e7 1271. 0.0383
4 Sierra Leone Africa 2007 42.6 6.14e6 863. 0.0260
5 Lesotho Africa 2007 42.6 2.01e6 1569. 0.0473
6 Angola Africa 2007 42.7 1.24e7 4797. 0.144
7 Zimbabwe Africa 2007 43.5 1.23e7 470. 0.0141
8 Afghanistan Asia 2007 43.8 3.19e7 975. 0.0294
9 Central African Republ… Africa 2007 44.7 4.37e6 706. 0.0213
10 Liberia Africa 2007 45.7 3.19e6 415. 0.0125
# ℹ 132 more rows
Oh, you’d like to sort on life expectancy in descending order? Then use desc()
.
# Arramhe by year then life expectancy
%>%
my_gap filter(year == 2007) %>% # filter for 2007
arrange(desc(lifeExp)) # arrange by year then life expectancy
# A tibble: 142 × 7
country continent year lifeExp pop gdpPercap gdpPercapRel
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Japan Asia 2007 82.6 127467972 31656. 0.953
2 Hong Kong, China Asia 2007 82.2 6980412 39725. 1.20
3 Iceland Europe 2007 81.8 301931 36181. 1.09
4 Switzerland Europe 2007 81.7 7554661 37506. 1.13
5 Australia Oceania 2007 81.2 20434176 34435. 1.04
6 Spain Europe 2007 80.9 40448191 28821. 0.868
7 Sweden Europe 2007 80.9 9031088 33860. 1.02
8 Israel Asia 2007 80.7 6426679 25523. 0.769
9 France Europe 2007 80.7 61083916 30470. 0.918
10 Canada Americas 2007 80.7 33390141 36319. 1.09
# ℹ 132 more rows
I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.
9 rename()
Variables
Many of the variables in the Gapminder data set are camelCase
, but I personally like snake_case
. Let’s rename some variables to fix this:
# Rename variables
%>%
my_gap rename(life_exp = lifeExp, # rename lifeExp
gdp_percap = gdpPercap, # rename gdpPercap
gdp_percap_rel = gdpPercapRel) # rename gdpPercapRel
# A tibble: 1,704 × 7
country continent year life_exp pop gdp_percap gdp_percap_rel
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 0.0781
2 Afghanistan Asia 1957 30.3 9240934 821. 0.0728
3 Afghanistan Asia 1962 32.0 10267083 853. 0.0684
4 Afghanistan Asia 1967 34.0 11537966 836. 0.0591
5 Afghanistan Asia 1972 36.1 13079460 740. 0.0466
6 Afghanistan Asia 1977 38.4 14880372 786. 0.0451
7 Afghanistan Asia 1982 39.9 12881816 978. 0.0536
8 Afghanistan Asia 1987 40.8 13867957 852. 0.0393
9 Afghanistan Asia 1992 41.7 16317921 649. 0.0286
10 Afghanistan Asia 1997 41.8 22227415 635. 0.0244
# ℹ 1,694 more rows
I did NOT assign the post-rename object back to my_gap
because that would make the chunks in this computer lab harder to copy/paste and run out of order. In real life, I would probably assign this back to my_gap
, in a data preparation script, and proceed with the new variable names.
select()
can also rename and reposition variables. You’ve already seen simple use of select()
. There are two tricks you might enjoy:
select()
can rename the variables you request to keep;select()
can be used witheverything()
to hoist a variable up to the front of the tibble.
# filter() and select() together
%>%
my_gap filter(country == "Burundi", year > 1996) %>% # filter for Burundi after 1996
select(yr = year, lifeExp, gdpPercap) %>% # rename year to yr
select(gdpPercap, everything()) # hoist gdpPercap to the front
# A tibble: 3 × 3
gdpPercap yr lifeExp
<dbl> <int> <dbl>
1 463. 1997 45.3
2 446. 2002 47.4
3 430. 2007 49.6
everything()
is one of several helpers for variable selection. Read its help to see the rest.
10 group_by()
Many people love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer. dplyr
offers powerful tools to solve this class of problem:
group_by()
adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.summarise()
takes a dataset with \(n\) observations, computes requested summaries, and returns a dataset with 1 observation.- Window functions take a dataset with \(n\) observations and return a dataset with \(n\) observations.
mutate()
andsummarise()
will honor groups.
Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.
10.1 Counting
Let’s start with simple counting. How many observations do we have per continent?
# Count observations per continent
%>%
my_gap group_by(continent) %>% # group by continent
summarise(n = n()) # count observations per continent
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 624
2 Americas 300
3 Asia 396
4 Europe 360
5 Oceania 24
Let us pause here to think about the tidyverse
. You could get these same frequencies using table()
from base R.
# Count observations per continent
table(gapminder$continent)
Africa Americas Asia Europe Oceania
624 300 396 360 24
# Or, equivalently
str(table(gapminder$continent))
'table' int [1:5(1d)] 624 300 396 360 24
- attr(*, "dimnames")=List of 1
..$ : chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
But the object of class table
that is returned makes downstream computation a bit fiddlier than you’d like. For example, it’s too bad the continent levels come back only as names and not as a proper factor, with the original set of levels. This is an example of how the tidyverse smooths transitions where you want the output of step i
to become the input of step i + 1
.
The tally()
function is a convenience function that knows to count rows. It honours groups.
# Count observations per continent
%>%
my_gap group_by(continent) %>% # group by continent
tally() # count observations per continent
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 624
2 Americas 300
3 Asia 396
4 Europe 360
5 Oceania 24
The count()
function is an even more convenient function that does both grouping and counting.
# Count observations per continent
%>%
my_gap count(continent)
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 624
2 Americas 300
3 Asia 396
4 Europe 360
5 Oceania 24
What if we wanted to add the number of unique countries for each continent? You can compute multiple summaries inside summarise()
. Use the n_distinct()
function to count the number of distinct countries within each continent.
# Count observations per country
%>%
my_gap group_by(continent) %>% # group by continent
summarise(n = n(), # count observations per continent
n_countries = n_distinct(country)) # count distinct countries per continent
# A tibble: 5 × 3
continent n n_countries
<fct> <int> <int>
1 Africa 624 52
2 Americas 300 25
3 Asia 396 33
4 Europe 360 30
5 Oceania 24 2
10.2 General Summarisation
The functions you’ll apply within summarise()
include classical statistical summaries, like mean()
, median()
, var()
, sd()
, mad()
, IQR()
, min()
, and max()
. Remember they are functions that take \(n\) inputs and distill them down into 1 output. Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.
# Average life expectancy by continent
%>%
my_gap group_by(continent) %>% # group by continent
summarise(avg_lifeExp = mean(lifeExp)) # Calculate average life expectancy
# A tibble: 5 × 2
continent avg_lifeExp
<fct> <dbl>
1 Africa 48.9
2 Americas 64.7
3 Asia 60.1
4 Europe 71.9
5 Oceania 74.3
10.3 Grouped Mutate
Sometimes you don’t want to collapse the \(n\) rows for each group into one row. You want to keep your groups, but compute within them.
Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use mutate()
to make a new variable. The first()
function extracts the first value from a vector. Notice that first()
is operating on the vector of life expectancies within each country group.
# Calculate life expectancy gain
%>%
my_gap group_by(country) %>% # group by country
select(country, year, lifeExp) %>% # select country, year, and lifeExp
mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>% # calculate life expectancy gain
filter(year < 1963) # `filter()` for years before 1963
# A tibble: 426 × 4
# Groups: country [142]
country year lifeExp lifeExp_gain
<fct> <int> <dbl> <dbl>
1 Afghanistan 1952 28.8 0
2 Afghanistan 1957 30.3 1.53
3 Afghanistan 1962 32.0 3.20
4 Albania 1952 55.2 0
5 Albania 1957 59.3 4.05
6 Albania 1962 64.8 9.59
7 Algeria 1952 43.1 0
8 Algeria 1957 45.7 2.61
9 Algeria 1962 48.3 5.23
10 Angola 1952 30.0 0
# ℹ 416 more rows
Within country, we take the difference between life expectancy in year \(i\) and life expectancy in 1952. Therefore, we always see zeroes for 1952 and, for most countries, a sequence of positive and increasing numbers.
11 Activities
For these activities you will be using the iris
dataset. This is a classic dataset that contains measurements of three different species of iris flowers. The dataset contains 150 observations and 5 variables. You can load the dataset by running the following code:
# Load the iris dataset
data(iris)
11.1 Filtering
Filter the dataset to only include observations where the
Sepal.Length
is greater than 5.5.Filter the dataset to only include observations where the
Sepal.Length
is greater than 5.5 and theSepal.Width
is greater than 3.5. Assign this to a new object callediris_filtered
.
💡 Click here to view a solution
# Filter the dataset to only include observations where the Sepal.Length is greater than 5.5
%>%
iris filter(Sepal.Length > 5.5)
# Filter the dataset to only include observations where the Sepal.Length is greater than 5.5 and the Sepal.Width is greater than 3.5
<- iris %>%
iris_filtered filter(Sepal.Length > 5.5,
> 3.5) Sepal.Width
11.2 Selecting
Select the
Sepal.Length
andSepal.Width
variables from theiris
dataset.Select all variables except for
Sepal.Length
andSepal.Width
from theiris
dataset.
💡 Click here to view a solution
# Select the Sepal.Length and Sepal.Width variables from the iris dataset
%>%
iris select(Sepal.Length, Sepal.Width)
# Select all variables except for Sepal.Length and Sepal.Width from the iris dataset
%>%
iris select(-Sepal.Length, -Sepal.Width)
11.3 Arranging
Arrange the
iris
dataset bySepal.Length
in descending order.Arrange the
iris
dataset bySepal.Length
in descending order and then bySepal.Width
in ascending order.
💡 Click here to view a solution
# Arrange the iris dataset by Sepal.Length in descending order
%>%
iris arrange(desc(Sepal.Length))
# Arrange the iris dataset by Sepal.Length in descending order and then by Sepal.Width in ascending order
%>%
iris arrange(desc(Sepal.Length), Sepal.Width)
11.4 Mutating
Create a new variable called
Sepal.Area
that is equal to the product ofSepal.Length
andSepal.Width
.Create a new variable called
Sepal.Area
that is equal to the product ofSepal.Length
andSepal.Width
. Then, create a new variable calledSepal.Area.Scaled
that is equal toSepal.Area
divided by 100.
💡 Click here to view a solution
# Create a new variable called Sepal.Area that is equal to the product of Sepal.Length and Sepal.Width
%>%
iris mutate(Sepal.Area = Sepal.Length * Sepal.Width)
# Create a new variable called Sepal.Area that is equal to the product of Sepal.Length and Sepal.Width. Then, create a new variable called Sepal.Area.Scaled that is equal to Sepal.Area divided by 100
%>%
iris mutate(Sepal.Area = Sepal.Length * Sepal.Width,
Sepal.Area.Scaled = Sepal.Area / 100)
12 Recap
dplyr
is a package that provides a grammar of data manipulation;- You should aggregate data rather than fully subset it;
filter()
subsets rows;select()
subsets columns;arrange()
sorts rows;mutate()
creates new columns;summarise()
reduces multiple values down to a single summary.