Data Manipulation

1 Learning Objectives

  • Understand importance of data frames;
  • Explore Gapminder dataset;
  • Use basic data frame functions.

2 Start a Script

For this lab or project, begin by:

  • Starting a new R script
  • Create a good header section and table of contents
  • Save the script file with an informative name
  • set your working directory

Aim to make the script a future reference for doing things in R!

3 Introduction

dplyr is an R package for data manipulation, developed by Hadley Wickham and Romain Francois. It is built to be fast, highly expressive, and open-minded about how your data is stored. It is installed as part of the tidyverse meta-package and, as a core package, it is among those loaded via library(tidyverse). dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

dplyr’s roots are in an earlier package called plyr, which implements the “split-apply-combine” strategy for data analysis. Where plyr covers a diverse set of inputs and outputs (e.g., arrays, data frames, lists), dplyr has a laser-like focus on data frames. This makes dplyr blazing fast, much faster than plyr for data frames, and it also makes it easier to learn since you only have to learn one set of tools.

Have no idea what I’m talking about? Not sure if you care? If you use these base R functions: subset(), apply(), [sl]apply(), tapply(), aggregate(), split(), do.call(), with(), within() then you will be interested in dplyr.

4 Packages and Data

We are going to load the tidyverse, which will load dplyr, as well as gapminder:

# Load packages
library(tidyverse)
library(gapminder)

# Load data
data(gapminder)

As a reminder, the gapminder data frame is a special kind of data frame: a tibble.

# Print data
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

It’s tibble-ness is why we get nice compact printing. For a reminder of the problems with base data frame printing, go type iris in the R Console or, better yet, print a data frame to screen that has lots of columns. Note how gapminder’s class() includes tbl_df; the “tibble” terminology is a nod to this:

# Identify class of gapminder
class(gapminder) 
[1] "tbl_df"     "tbl"        "data.frame"

There will be some functions, like print(), that know about tibbles and do something special. There will others that do not, like summary(). In which case the regular data frame treatment will happen, because every tibble is also a regular data frame.

To turn any data frame into a tibble use as_tibble():

# Convert iris to tibble
as_tibble(iris)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

5 Subsetting Data

If you ever feel the urge to store a little snippet of your data:

# Create a new data frame containing only the data for Canada
(canada <- gapminder[241:252, ]) 
# A tibble: 12 × 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Canada  Americas   1952    68.8 14785584    11367.
 2 Canada  Americas   1957    70.0 17010154    12490.
 3 Canada  Americas   1962    71.3 18985849    13462.
 4 Canada  Americas   1967    72.1 20819767    16077.
 5 Canada  Americas   1972    72.9 22284500    18971.
 6 Canada  Americas   1977    74.2 23796400    22091.
 7 Canada  Americas   1982    75.8 25201900    22899.
 8 Canada  Americas   1987    76.9 26549700    26627.
 9 Canada  Americas   1992    78.0 28523502    26343.
10 Canada  Americas   1997    78.6 30305843    28955.
11 Canada  Americas   2002    79.8 31902268    33329.
12 Canada  Americas   2007    80.7 33390141    36319.

Stop and ask yourself …

Do I want to create mini datasets for each level of some factor (or unique combination of several factors) … in order to compute or graph something?

If YES, use proper data aggregation techniquesdon’t subset the data. Or, more realistic, only subset the data as a temporary measure while you develop your elegant code for computing on or visualising these data subsets.

If NO, then maybe you really do need to store a copy of a subset of the data. But seriously consider whether you can achieve your goals by simply using the subset = argument of, e.g., the lm() function, to limit computation to your excerpt of choice. Lots of functions offer a subset = argument!

Copies and excerpts of your data clutter your workspace, invite mistakes, and sow general confusion. Avoid whenever possible.

Reality can also lie somewhere in between. You will find the workflows presented below can help you accomplish your goals with minimal creation of temporary, intermediate objects.

5.1 Logical Expressions

Logical operations to access specific parts of the data that match our specification. There are many logical operators you can use to subset your data and these can be combined in different ways:

  • == equal to
  • != not equal to
  • > greater than
  • < less than
  • >= greater than or equal to
  • <= less than or equal to
  • %in% is in a vector
  • ! not
  • & and
  • | or

We will cover these in more detail!

5.2 filter() Rows

filter() takes logical expressions and returns the rows for which all are TRUE.

# All rows where lifeExp < 29
filter(gapminder, lifeExp < 29) 
# A tibble: 2 × 6
  country     continent  year lifeExp     pop gdpPercap
  <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
1 Afghanistan Asia       1952    28.8 8425333      779.
2 Rwanda      Africa     1992    23.6 7290203      737.
# All rows where country is Rwanda and year is greater than 1979
filter(gapminder, country == "Rwanda", year > 1979)
# A tibble: 6 × 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Rwanda  Africa     1982    46.2 5507565      882.
2 Rwanda  Africa     1987    44.0 6349365      848.
3 Rwanda  Africa     1992    23.6 7290203      737.
4 Rwanda  Africa     1997    36.1 7212583      590.
5 Rwanda  Africa     2002    43.4 7852401      786.
6 Rwanda  Africa     2007    46.2 8860588      863.
# All rows where country is Rwanda or Afghanistan
filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
# A tibble: 24 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 14 more rows

Compare with some base R code to accomplish the same things:

# All rows where lifeExp < 29
gapminder[gapminder$lifeExp < 29, ]  # repeat `gapminder`, [i, j] indexing is distracting

# All rows where country is Rwanda
subset(gapminder, country == "Rwanda") ## almost same as filter

Under no circumstances should you subset your data the way I did at first:

# Create a new data frame containing only the data for Canada
excerpt <- gapminder[241:252, ] 

Why is this a terrible idea?

  • It is not self-documenting. What is so special about rows 241 through 252?
  • It is fragile. This line of code will produce different results if someone changes the row order of gapminder, e.g., sorts the data earlier in the script.
# Filter for Canada
filter(gapminder, country == "Canada")

This call explains itself and is fairly robust.

5.3 The Pipe Operator

Before we go any further, we should exploit the pipe operator that the tidyverse imports from the magrittr package by Stefan Bache. This is going to change your data analytical life. You no longer need to enact multi-operation commands by nesting them inside each other, like so many Russian nesting dolls. This syntax leads to code that is much easier to write and to read.

Here’s what it looks like: %>%. The RStudio keyboard shortcut: Ctrl+Shift+M (Windows), Cmd+Shift+M (Mac).

Let’s demo then I’ll explain.

# Using the pipe operator to chain together commands
gapminder %>% 
  head()  # same as head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

This is equivalent to head(gapminder). The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.

Never fear, you can still specify other arguments to this function! To see the first 3 rows of gapminder, we could say head(gapminder, 3) or this:

# Using the pipe operator to chain together commands
gapminder %>% 
  head(3)
# A tibble: 3 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.

You should think “then” whenever you see the pipe operator, %>%.

5.4 select() Columns

Use select() to subset the data on variables or columns. Here’s a conventional call:

# Select the year and lifeExp columns
select(gapminder, year, lifeExp)
# A tibble: 1,704 × 2
    year lifeExp
   <int>   <dbl>
 1  1952    28.8
 2  1957    30.3
 3  1962    32.0
 4  1967    34.0
 5  1972    36.1
 6  1977    38.4
 7  1982    39.9
 8  1987    40.8
 9  1992    41.7
10  1997    41.8
# ℹ 1,694 more rows

And here’s the same operation, but written with the pipe operator and piped through head():

# Using the pipe operator to chain together commands
gapminder %>%
  select(year, lifeExp) %>% # select columns
  head(4) # show first 4 rows
# A tibble: 4 × 2
   year lifeExp
  <int>   <dbl>
1  1952    28.8
2  1957    30.3
3  1962    32.0
4  1967    34.0

Think: “Take gapminder, then select the variables year and lifeExp, then show the first 4 rows.”

5.5 Convenience!

Here’s the data for Cambodia, but only certain variables:

# Using the pipe operator to chain together commands
gapminder %>%
  filter(country == "Cambodia") %>% # filter for Cambodia
  select(year, lifeExp) # select columns
# A tibble: 12 × 2
    year lifeExp
   <int>   <dbl>
 1  1952    39.4
 2  1957    41.4
 3  1962    43.4
 4  1967    45.4
 5  1972    40.3
 6  1977    31.2
 7  1982    51.0
 8  1987    53.9
 9  1992    55.8
10  1997    56.5
11  2002    56.8
12  2007    59.7

and what a typical base R call would look like:

gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]
# A tibble: 12 × 2
    year lifeExp
   <int>   <dbl>
 1  1952    39.4
 2  1957    41.4
 3  1962    43.4
 4  1967    45.4
 5  1972    40.3
 6  1977    31.2
 7  1982    51.0
 8  1987    53.9
 9  1992    55.8
10  1997    56.5
11  2002    56.8
12  2007    59.7

6 Pure, Predictable, Pipeable

We’ve barely scratched the surface of dplyr but I want to point out key principles you may start to appreciate. dplyr’s verbs, such as filter() and select(), are what’s called pure functions. To quote from Wickham’s book:

The functions that are the easiest to understand and reason about are pure functions: functions that always map the same input to the same output and have no other impact on the workspace. In other words, pure functions have no side effects: they don’t affect the state of the world in any way apart from the value they return.

In fact, these verbs are a special case of pure functions: they take the same flavor of object as input and output. Namely, a data frame or one of the other data receptacles dplyr supports. And finally, the data is always the very first argument of the verb functions. This set of deliberate design choices, together with the new pipe operator, produces a highly effective, low friction domain-specific language for data analysis.

As we move on to look at some of dplyr’s other verbs, you will see that they all follow these same principles. This makes them easy to learn and easy to use. Because we are now going to be making changes to the gapminder tibble. To eliminate any fear that you’re damaging the data that comes with the package, we create an explicit copy of gapminder for our experiments.

# Create a new object called my_gap
my_gap <- gapminder

Pay close attention to when we evaluate statements but let the output just print to screen:

## Let output print to screen, but do not store
my_gap %>% 
  filter(country == "Canada") # filter for Canada

… versus when we assign the output to an object, possibly overwriting an existing object.

## Store the output as an R object
my_precious <- my_gap %>% 
  filter(country == "Canada") # filter for Canada

7 mutate() Variables

Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together. mutate() is a function that defines and inserts new variables into a data frame. You can refer to existing variables by name.

# Using the pipe operator to chain together commands
my_gap %>%
  mutate(gdp = pop * gdpPercap) # create new variable
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap          gdp
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
 7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
 8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
 9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ℹ 1,694 more rows

Hmmmm … those GDP numbers are almost uselessly large and abstract. Maybe it would be more meaningful to consumers of my tables and figures to stick with GDP per capita. But what if I reported GDP per capita, relative to some benchmark country. Since we’re in the UK, we’ll go with that.

I need to create a new variable that is gdpPercap divided by UK gdpPercap, taking care that I always divide two numbers that pertain to the same year.

How I achieve this:

  1. Filter down to the rows for United Kingdom.
  2. Create a new temporary variable in my_gap:
    1. Extract the gdpPercap variable from the UK data.
    2. Replicate it once per country in the data set, so it has the right length.
  3. Divide raw gdpPercap by this UK figure.
  4. Discard the temporary variable of replicated UK gdpPercap.
# Filter data for the UK
ctib <- my_gap %>%
  filter(country == "United Kingdom") # filter for UK

# Create a new colimn for GDP per capita relative to the UK
my_gap <- my_gap %>%
  mutate(tmp = rep(ctib$gdpPercap, nlevels(country)),  # Create a temporary variable
         gdpPercapRel = gdpPercap / tmp,  # Create the new variable
         tmp = NULL)  # Discard the temporary variable

Note that, mutate() builds new variables sequentially so you can reference earlier ones (like tmp) when defining later ones (like gdpPercapRel). Also, you can get rid of a variable by setting it to NULL. How could we sanity check that this worked? The Canadian values for gdpPercapRel should all be 1!

# Check it worked!
my_gap %>% 
  filter(country == "United Kingdom") %>%  # Filter for UK
  select(country, year, gdpPercapRel) # Select only these columns
# A tibble: 12 × 3
   country         year gdpPercapRel
   <fct>          <int>        <dbl>
 1 United Kingdom  1952            1
 2 United Kingdom  1957            1
 3 United Kingdom  1962            1
 4 United Kingdom  1967            1
 5 United Kingdom  1972            1
 6 United Kingdom  1977            1
 7 United Kingdom  1982            1
 8 United Kingdom  1987            1
 9 United Kingdom  1992            1
10 United Kingdom  1997            1
11 United Kingdom  2002            1
12 United Kingdom  2007            1

I perceive the UK to be a “high GDP” country, so I predict that the distribution of gdpPercapRel is located below 1, possibly even well below. Check your intuition!

# Check the distribution of gdpPercapRel
summary(my_gap$gdpPercapRel)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.008181  0.071340  0.197620  0.378300  0.531490 10.860490 

The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to the UK, across the entire time period.

Remember: Trust No One. Including (especially?) yourself. Always try to find a way to check that you’ve done what meant to. Prepare to be horrified.

8 arrange() Data

arrange() reorders the rows in a data frame. Imagine you wanted this data ordered by year then country, as opposed to by country then year.

# Arrange by year then country
my_gap %>%
  arrange(year, country) # arrange by year then country
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap gdpPercapRel
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.       0.0781
 2 Albania     Europe     1952    55.2  1282697     1601.       0.160 
 3 Algeria     Africa     1952    43.1  9279525     2449.       0.245 
 4 Angola      Africa     1952    30.0  4232095     3521.       0.353 
 5 Argentina   Americas   1952    62.5 17876956     5911.       0.592 
 6 Australia   Oceania    1952    69.1  8691212    10040.       1.01  
 7 Austria     Europe     1952    66.8  6927772     6137.       0.615 
 8 Bahrain     Asia       1952    50.9   120447     9867.       0.989 
 9 Bangladesh  Asia       1952    37.5 46886859      684.       0.0686
10 Belgium     Europe     1952    68    8730405     8343.       0.836 
# ℹ 1,694 more rows

Or maybe you want just the data from 2007, sorted on life expectancy?

# Arrange by year then life expectancy
my_gap %>%
  filter(year == 2007) %>% # filter for 2007
  arrange(lifeExp) # arrange by year then life expectancy
# A tibble: 142 × 7
   country                 continent  year lifeExp    pop gdpPercap gdpPercapRel
   <fct>                   <fct>     <int>   <dbl>  <int>     <dbl>        <dbl>
 1 Swaziland               Africa     2007    39.6 1.13e6     4513.       0.136 
 2 Mozambique              Africa     2007    42.1 2.00e7      824.       0.0248
 3 Zambia                  Africa     2007    42.4 1.17e7     1271.       0.0383
 4 Sierra Leone            Africa     2007    42.6 6.14e6      863.       0.0260
 5 Lesotho                 Africa     2007    42.6 2.01e6     1569.       0.0473
 6 Angola                  Africa     2007    42.7 1.24e7     4797.       0.144 
 7 Zimbabwe                Africa     2007    43.5 1.23e7      470.       0.0141
 8 Afghanistan             Asia       2007    43.8 3.19e7      975.       0.0294
 9 Central African Republ… Africa     2007    44.7 4.37e6      706.       0.0213
10 Liberia                 Africa     2007    45.7 3.19e6      415.       0.0125
# ℹ 132 more rows

Oh, you’d like to sort on life expectancy in descending order? Then use desc().

# Arramhe by year then life expectancy
my_gap %>%
  filter(year == 2007) %>% # filter for 2007
  arrange(desc(lifeExp)) # arrange by year then life expectancy
# A tibble: 142 × 7
   country          continent  year lifeExp       pop gdpPercap gdpPercapRel
   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>        <dbl>
 1 Japan            Asia       2007    82.6 127467972    31656.        0.953
 2 Hong Kong, China Asia       2007    82.2   6980412    39725.        1.20 
 3 Iceland          Europe     2007    81.8    301931    36181.        1.09 
 4 Switzerland      Europe     2007    81.7   7554661    37506.        1.13 
 5 Australia        Oceania    2007    81.2  20434176    34435.        1.04 
 6 Spain            Europe     2007    80.9  40448191    28821.        0.868
 7 Sweden           Europe     2007    80.9   9031088    33860.        1.02 
 8 Israel           Asia       2007    80.7   6426679    25523.        0.769
 9 France           Europe     2007    80.7  61083916    30470.        0.918
10 Canada           Americas   2007    80.7  33390141    36319.        1.09 
# ℹ 132 more rows

I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.

9 rename() Variables

Many of the variables in the Gapminder data set are camelCase, but I personally like snake_case. Let’s rename some variables to fix this:

# Rename variables
my_gap %>%
  rename(life_exp = lifeExp, # rename lifeExp
         gdp_percap = gdpPercap, # rename gdpPercap
         gdp_percap_rel = gdpPercapRel) # rename gdpPercapRel
# A tibble: 1,704 × 7
   country     continent  year life_exp      pop gdp_percap gdp_percap_rel
   <fct>       <fct>     <int>    <dbl>    <int>      <dbl>          <dbl>
 1 Afghanistan Asia       1952     28.8  8425333       779.         0.0781
 2 Afghanistan Asia       1957     30.3  9240934       821.         0.0728
 3 Afghanistan Asia       1962     32.0 10267083       853.         0.0684
 4 Afghanistan Asia       1967     34.0 11537966       836.         0.0591
 5 Afghanistan Asia       1972     36.1 13079460       740.         0.0466
 6 Afghanistan Asia       1977     38.4 14880372       786.         0.0451
 7 Afghanistan Asia       1982     39.9 12881816       978.         0.0536
 8 Afghanistan Asia       1987     40.8 13867957       852.         0.0393
 9 Afghanistan Asia       1992     41.7 16317921       649.         0.0286
10 Afghanistan Asia       1997     41.8 22227415       635.         0.0244
# ℹ 1,694 more rows

I did NOT assign the post-rename object back to my_gap because that would make the chunks in this computer lab harder to copy/paste and run out of order. In real life, I would probably assign this back to my_gap, in a data preparation script, and proceed with the new variable names.

select() can also rename and reposition variables. You’ve already seen simple use of select(). There are two tricks you might enjoy:

  1. select() can rename the variables you request to keep;
  2. select() can be used with everything() to hoist a variable up to the front of the tibble.
# filter() and select() together
my_gap %>%
  filter(country == "Burundi", year > 1996) %>% # filter for Burundi after 1996
  select(yr = year, lifeExp, gdpPercap) %>% # rename year to yr
  select(gdpPercap, everything()) # hoist gdpPercap to the front
# A tibble: 3 × 3
  gdpPercap    yr lifeExp
      <dbl> <int>   <dbl>
1      463.  1997    45.3
2      446.  2002    47.4
3      430.  2007    49.6

everything() is one of several helpers for variable selection. Read its help to see the rest.

10 group_by()

Many people love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer. dplyr offers powerful tools to solve this class of problem:

  • group_by() adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.
  • summarise() takes a dataset with \(n\) observations, computes requested summaries, and returns a dataset with 1 observation.
  • Window functions take a dataset with \(n\) observations and return a dataset with \(n\) observations.
  • mutate() and summarise() will honor groups.

Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.

10.1 Counting

Let’s start with simple counting. How many observations do we have per continent?

# Count observations per continent
my_gap %>%
  group_by(continent) %>% # group by continent
  summarise(n = n()) # count observations per continent
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa      624
2 Americas    300
3 Asia        396
4 Europe      360
5 Oceania      24

Let us pause here to think about the tidyverse. You could get these same frequencies using table() from base R.

# Count observations per continent
table(gapminder$continent) 

  Africa Americas     Asia   Europe  Oceania 
     624      300      396      360       24 
# Or, equivalently
str(table(gapminder$continent))
 'table' int [1:5(1d)] 624 300 396 360 24
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:5] "Africa" "Americas" "Asia" "Europe" ...

But the object of class table that is returned makes downstream computation a bit fiddlier than you’d like. For example, it’s too bad the continent levels come back only as names and not as a proper factor, with the original set of levels. This is an example of how the tidyverse smooths transitions where you want the output of step i to become the input of step i + 1.

The tally() function is a convenience function that knows to count rows. It honours groups.

# Count observations per continent
my_gap %>%
  group_by(continent) %>% # group by continent
  tally() # count observations per continent
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa      624
2 Americas    300
3 Asia        396
4 Europe      360
5 Oceania      24

The count() function is an even more convenient function that does both grouping and counting.

# Count observations per continent
my_gap %>% 
  count(continent)
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa      624
2 Americas    300
3 Asia        396
4 Europe      360
5 Oceania      24

What if we wanted to add the number of unique countries for each continent? You can compute multiple summaries inside summarise(). Use the n_distinct() function to count the number of distinct countries within each continent.

# Count observations per country
my_gap %>%
  group_by(continent) %>% # group by continent
  summarise(n = n(), # count observations per continent
            n_countries = n_distinct(country)) # count distinct countries per continent
# A tibble: 5 × 3
  continent     n n_countries
  <fct>     <int>       <int>
1 Africa      624          52
2 Americas    300          25
3 Asia        396          33
4 Europe      360          30
5 Oceania      24           2

10.2 General Summarisation

The functions you’ll apply within summarise() include classical statistical summaries, like mean(), median(), var(), sd(), mad(), IQR(), min(), and max(). Remember they are functions that take \(n\) inputs and distill them down into 1 output. Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.

# Average life expectancy by continent
my_gap %>%
  group_by(continent) %>% # group by continent
  summarise(avg_lifeExp = mean(lifeExp)) # Calculate average life expectancy
# A tibble: 5 × 2
  continent avg_lifeExp
  <fct>           <dbl>
1 Africa           48.9
2 Americas         64.7
3 Asia             60.1
4 Europe           71.9
5 Oceania          74.3

10.3 Grouped Mutate

Sometimes you don’t want to collapse the \(n\) rows for each group into one row. You want to keep your groups, but compute within them.

Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use mutate() to make a new variable. The first() function extracts the first value from a vector. Notice that first() is operating on the vector of life expectancies within each country group.

# Calculate life expectancy gain
my_gap %>% 
  group_by(country) %>% # group by country
  select(country, year, lifeExp) %>% # select country, year, and lifeExp
  mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>% # calculate life expectancy gain
  filter(year < 1963) # `filter()` for years before 1963
# A tibble: 426 × 4
# Groups:   country [142]
   country      year lifeExp lifeExp_gain
   <fct>       <int>   <dbl>        <dbl>
 1 Afghanistan  1952    28.8         0   
 2 Afghanistan  1957    30.3         1.53
 3 Afghanistan  1962    32.0         3.20
 4 Albania      1952    55.2         0   
 5 Albania      1957    59.3         4.05
 6 Albania      1962    64.8         9.59
 7 Algeria      1952    43.1         0   
 8 Algeria      1957    45.7         2.61
 9 Algeria      1962    48.3         5.23
10 Angola       1952    30.0         0   
# ℹ 416 more rows

Within country, we take the difference between life expectancy in year \(i\) and life expectancy in 1952. Therefore, we always see zeroes for 1952 and, for most countries, a sequence of positive and increasing numbers.

11 Activities

For these activities you will be using the iris dataset. This is a classic dataset that contains measurements of three different species of iris flowers. The dataset contains 150 observations and 5 variables. You can load the dataset by running the following code:

# Load the iris dataset
data(iris)

11.1 Filtering

  1. Filter the dataset to only include observations where the Sepal.Length is greater than 5.5.

  2. Filter the dataset to only include observations where the Sepal.Length is greater than 5.5 and the Sepal.Width is greater than 3.5. Assign this to a new object called iris_filtered.

💡 Click here to view a solution
# Filter the dataset to only include observations where the Sepal.Length is greater than 5.5
iris %>%
  filter(Sepal.Length > 5.5) 

# Filter the dataset to only include observations where the Sepal.Length is greater than 5.5 and the Sepal.Width is greater than 3.5
iris_filtered <- iris %>%
  filter(Sepal.Length > 5.5,
         Sepal.Width > 3.5)

11.2 Selecting

  1. Select the Sepal.Length and Sepal.Width variables from the iris dataset.

  2. Select all variables except for Sepal.Length and Sepal.Width from the iris dataset.

💡 Click here to view a solution
# Select the Sepal.Length and Sepal.Width variables from the iris dataset
iris %>%
  select(Sepal.Length, Sepal.Width)

# Select all variables except for Sepal.Length and Sepal.Width from the iris dataset
iris %>%
  select(-Sepal.Length, -Sepal.Width)

11.3 Arranging

  1. Arrange the iris dataset by Sepal.Length in descending order.

  2. Arrange the iris dataset by Sepal.Length in descending order and then by Sepal.Width in ascending order.

💡 Click here to view a solution
# Arrange the iris dataset by Sepal.Length in descending order
iris %>%
  arrange(desc(Sepal.Length))

# Arrange the iris dataset by Sepal.Length in descending order and then by Sepal.Width in ascending order
iris %>%
  arrange(desc(Sepal.Length), Sepal.Width)

11.4 Mutating

  1. Create a new variable called Sepal.Area that is equal to the product of Sepal.Length and Sepal.Width.

  2. Create a new variable called Sepal.Area that is equal to the product of Sepal.Length and Sepal.Width. Then, create a new variable called Sepal.Area.Scaled that is equal to Sepal.Area divided by 100.

💡 Click here to view a solution
# Create a new variable called Sepal.Area that is equal to the product of Sepal.Length and Sepal.Width
iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# Create a new variable called Sepal.Area that is equal to the product of Sepal.Length and Sepal.Width. Then, create a new variable called Sepal.Area.Scaled that is equal to Sepal.Area divided by 100
iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Sepal.Area.Scaled = Sepal.Area / 100)

12 Recap

  • dplyr is a package that provides a grammar of data manipulation;
  • You should aggregate data rather than fully subset it;
  • filter() subsets rows;
  • select() subsets columns;
  • arrange() sorts rows;
  • mutate() creates new columns;
  • summarise() reduces multiple values down to a single summary.