Tidy Data

1 Learning Objectives

  • Review key rules for tidy data;
  • Review key tidyr and dplyr functions for wrangling data;
  • Manually tidy some data in Excel

2 Introduction

Data recorded in the field or in the lab is often very different to the way you want your data entered into R. During data collection tables are usually drawn up ahead of time and completed in situ, and you will have extra notes and information in addition to the data you want to analyse. In this lab you will learn a consistent way to organise your data in R known as tidy data. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data, you will spend much less time wrangling data into different formats.

3 Rules for Tidy Data

Why ensure that your data is tidy? There are two main advantages:

  1. There is a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

  2. There is a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column;
  • Each observation must have its own row;
  • Each value must have its own cell.

Rules for tidy data

These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:

  • Put each dataset in a data frame;
  • Put each variable in a column.

4 Start a Script

For this lab or project, begin by:

  • Starting a new R script
  • Create a good header section and table of contents
  • Save the script file with an informative name
  • set your working directory

Aim to make the script a future reference for doing things in R!

5 Packages and Data

We will predominantly be using two core packages from the tidyverse called tidyr and dplyr but load the entire ecosystem for convenience. The data set we are using represents annual increments in crowberry shrub (Empetrum hermaphroditum) stem growth in a sand dune ecosystem. You can download this data here.

# Check whether a package is installed and install if not.
# This will also load the package if already installed!
if(!require("tidyverse")) install.packages("tidyverse")
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load data
elongation <- read.csv("data/empetrum_elongation.csv", header = TRUE)

6 Tidying Data

The elongation dataset we previously loaded isn’t tidy - the same variable (i.e., stem length) is spread across multiple columns representing different years. We can use the gather() function from the tidyr package to convert this wide-format, untidy data frame to a long-format, tidy data frame. We want to create a single column Year that will have the years currently in the columns (2007-2012) repeated for each individual. From this, you should be able to work out that the data frame will be six times longer than the original. We also want a column Length where all the growth data associated to each year and individual will go.

# Let's take a look at a summary of our original data frame
glimpse(elongation)
Rows: 114
Columns: 8
$ Zone  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3…
$ Indiv <int> 373, 379, 383, 389, 390, 395, 396, 408, 412, 421, 425, 429, 431,…
$ X2007 <dbl> 5.1, 8.1, 9.3, 15.0, 3.5, 6.1, 7.2, 6.1, 4.6, 7.2, 6.4, 8.9, 3.5…
$ X2008 <dbl> 5.1, 13.3, 8.5, 10.3, 6.2, 5.6, 5.9, 8.1, 6.7, 5.8, 8.1, 11.2, 5…
$ X2009 <dbl> 4.8, 8.6, 11.7, 6.8, 4.7, 4.4, 5.7, 7.6, 4.5, 5.8, 7.8, 8.6, 5.4…
$ X2010 <dbl> 8.7, 4.9, 7.9, 6.9, 3.8, 4.5, 5.4, 6.2, 2.8, 5.6, 6.2, 8.4, 4.9,…
$ X2011 <dbl> 6.3, 5.9, 8.0, 5.9, 3.5, 4.5, 5.5, 9.6, 4.2, 4.3, 6.4, 8.2, 3.5,…
$ X2012 <dbl> 3.2, 6.3, 6.3, 7.6, 3.0, 7.6, 4.7, 10.1, 5.2, 3.4, 7.9, 6.6, 4.9…
# Here we want the lengths (value) to be gathered by year (key) 
elongation_long <- gather(elongation, Year, Length, # data frame, key, value                          
                          c(X2007, X2008, X2009, X2010, X2011, X2012)) # specify columns 

# Let's take a look at a summary of our new data frame
glimpse(elongation_long) 
Rows: 684
Columns: 4
$ Zone   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, …
$ Indiv  <int> 373, 379, 383, 389, 390, 395, 396, 408, 412, 421, 425, 429, 431…
$ Year   <chr> "X2007", "X2007", "X2007", "X2007", "X2007", "X2007", "X2007", …
$ Length <dbl> 5.1, 8.1, 9.3, 15.0, 3.5, 6.1, 7.2, 6.1, 4.6, 7.2, 6.4, 8.9, 3.…
# spread() is the inverse function, allowing you to go from long to wide format
elongation_wide <- spread(elongation_long, Year, Length) 

Notice how we used the column names to tell gather() which columns to reshape. This is handy if you only have a few, and if the columns change order eventually, the function will still work. However, if you have a dataset with columns for 100 genes, for instance, you are better off specifying the column numbers.

7 Wrangling Data with Base R

Data frames are R objects made of rows and columns containing observations of different variables: you will often be importing your data that way. Sometimes, you might notice some mistakes after importing, need to rename a variable, or keep only a subset of the data that meets some conditions. Let’s dive right in and do that on the EmpetrumElongation.csv data set.

7.1 Subsetting

We can access variables in R by using the dollar sign $. This is already one way of subsetting, as it essentially reduces your data frame (2 dimensions) to a vector (1 dimension). You can also access parts of a data frame using square brackets [ , ]. The first number you put will get the row number, and the second the column. Leave one blank to keep all rows or all columns.

# Check import and preview data
head(elongation)   # first few observations
  Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
1    2   373   5.1   5.1   4.8   8.7   6.3   3.2
2    2   379   8.1  13.3   8.6   4.9   5.9   6.3
3    2   383   9.3   8.5  11.7   7.9   8.0   6.3
4    2   389  15.0  10.3   6.8   6.9   5.9   7.6
5    2   390   3.5   6.2   4.7   3.8   3.5   3.0
6    2   395   6.1   5.6   4.4   4.5   4.5   7.6
str(elongation)    # types of variables
'data.frame':   114 obs. of  8 variables:
 $ Zone : int  2 2 2 2 2 2 2 2 2 2 ...
 $ Indiv: int  373 379 383 389 390 395 396 408 412 421 ...
 $ X2007: num  5.1 8.1 9.3 15 3.5 6.1 7.2 6.1 4.6 7.2 ...
 $ X2008: num  5.1 13.3 8.5 10.3 6.2 5.6 5.9 8.1 6.7 5.8 ...
 $ X2009: num  4.8 8.6 11.7 6.8 4.7 4.4 5.7 7.6 4.5 5.8 ...
 $ X2010: num  8.7 4.9 7.9 6.9 3.8 4.5 5.4 6.2 2.8 5.6 ...
 $ X2011: num  6.3 5.9 8 5.9 3.5 4.5 5.5 9.6 4.2 4.3 ...
 $ X2012: num  3.2 6.3 6.3 7.6 3 7.6 4.7 10.1 5.2 3.4 ...
# Let's get some information out of this object!
elongation$Indiv   # prints out all the ID codes in the dataset
  [1]  373  379  383  389  390  395  396  408  412  421  425  429  431  442  444
 [16]  447  458  464  486  512  530  534  538  542  549  569  572  577  581  584
 [31]  597  616  647  660  591  595  603  612  618  619  623  632  641  645  646
 [46]  648  654  657  661  663  677  678  682    1   37   38   41   49   64   71
 [61]   72   85   88  101  114  120  126  145  150  155  166  177  206  129  257
 [76]  276  303  322  349  350  355  370  454  725  766  812  817  844  863  868
 [91]  896  899  901  917  924  963  970  977  979 1000 1069 1073 1087 1109 1140
[106] 1183 1188 1247 1286 1340 1358 1392 1477 1481
length(unique(elongation$Indiv))   # returns the number of distinct shrubs in the data
[1] 114
# Here's how we get the value in the second row and fifth column
elongation[2,5]
[1] 8.6
# Here's how we get all the info for row number 6
elongation[6, ]
  Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
6    2   395   6.1   5.6   4.4   4.5   4.5   7.6
# And of course you can mix it all together!
elongation[6, ]$Indiv   # returns the value in the column Indiv for the sixth observation
[1] 395
# (much easier calling columns by their names than figuring out where they are!)

7.2 Logical operators

Subsetting with brackets using row and column numbers can be quite tedious if you have a large data set and you don’t know where the observations you’re looking for are situated! And it’s never recommended anyway, because if you hard-code a number in your script and you add some rows later on, you might not be selecting the same observations anymore! That’s why we can use logical operations to access specific parts of the data that match our specification.

# Let's access the values for Individual number 603
elongation[elongation$Indiv == 603, ]
   Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
37    4   603   2.5   3.8   3.2   2.3   3.2   5.4

There’s a lot to unpack here! We’re saying: “Take this dataframe (elongation), subset it ([ , ]) so as to keep the rows (writing the expression on the left-hand of the comma) for which the value in the column Indiv ($Indiv) is exactly (==) 603”. There are many logical operators you can use to subset your data:

  • == equal to
  • != not equal to
  • > greater than
  • < less than
  • >= greater than or equal to
  • <= less than or equal to
  • %in% is in a vector
  • ! not
  • & and

7.3 Subsetting with logical operators

# Subsetting with one condition

elongation[elongation$Zone < 4, ]    # returns only the data for zones 2-3
   Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
1     2   373   5.1   5.1   4.8   8.7   6.3   3.2
2     2   379   8.1  13.3   8.6   4.9   5.9   6.3
3     2   383   9.3   8.5  11.7   7.9   8.0   6.3
4     2   389  15.0  10.3   6.8   6.9   5.9   7.6
5     2   390   3.5   6.2   4.7   3.8   3.5   3.0
6     2   395   6.1   5.6   4.4   4.5   4.5   7.6
7     2   396   7.2   5.9   5.7   5.4   5.5   4.7
8     2   408   6.1   8.1   7.6   6.2   9.6  10.1
9     2   412   4.6   6.7   4.5   2.8   4.2   5.2
10    2   421   7.2   5.8   5.8   5.6   4.3   3.4
11    2   425   6.4   8.1   7.8   6.2   6.4   7.9
12    2   429   8.9  11.2   8.6   8.4   8.2   6.6
13    2   431   3.5   5.9   5.4   4.9   3.5   4.9
14    2   442   5.3   9.9   9.2   6.4   5.5   5.0
15    2   444   8.7   5.6   5.9   5.2   6.6   7.4
16    3   447   8.3   6.5   8.1   7.0   7.6   8.3
17    3   458   5.4   8.6   6.1   5.4   7.2   9.4
18    3   464   9.2   9.7   7.8   5.8   7.2   6.2
19    3   486   9.1   6.7   6.8   6.1   8.4   7.9
20    3   512   9.0  10.8   7.1   5.8   8.3   6.7
21    3   530   9.2  11.4   4.5   3.6   6.9   5.7
22    3   534   7.4   6.8   7.9   3.7   5.5   6.7
23    3   538   8.9  12.5  10.5   6.3   7.3   7.4
24    3   542   5.3   6.0   5.8   8.6   6.4   7.9
25    3   549   4.5   5.6   5.8   4.4   7.4   4.9
26    3   569   4.2   6.3   6.2   6.3   6.4   6.4
27    3   572   7.1  10.1  10.0  10.1   9.8  13.1
28    3   577   6.8  12.1  11.1   8.6   6.1   8.2
29    3   581   6.0   6.9   6.3   3.7   5.6   5.3
30    3   584   5.6   6.6   5.9   4.8   6.0   6.1
31    3   597   8.9   7.9   6.4   7.1   8.0   9.4
32    3   616   9.7  10.8   8.8   6.4   9.5   8.0
33    3   647   8.1   6.9   6.9   4.0   7.2   7.3
34    3   660   6.7   5.4   6.0   4.9   6.8   6.4
elongation[elongation$Zone <= 4, ]   # returns only the data for zones 2-3-4
   Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
1     2   373   5.1   5.1   4.8   8.7   6.3   3.2
2     2   379   8.1  13.3   8.6   4.9   5.9   6.3
3     2   383   9.3   8.5  11.7   7.9   8.0   6.3
4     2   389  15.0  10.3   6.8   6.9   5.9   7.6
5     2   390   3.5   6.2   4.7   3.8   3.5   3.0
6     2   395   6.1   5.6   4.4   4.5   4.5   7.6
7     2   396   7.2   5.9   5.7   5.4   5.5   4.7
8     2   408   6.1   8.1   7.6   6.2   9.6  10.1
9     2   412   4.6   6.7   4.5   2.8   4.2   5.2
10    2   421   7.2   5.8   5.8   5.6   4.3   3.4
11    2   425   6.4   8.1   7.8   6.2   6.4   7.9
12    2   429   8.9  11.2   8.6   8.4   8.2   6.6
13    2   431   3.5   5.9   5.4   4.9   3.5   4.9
14    2   442   5.3   9.9   9.2   6.4   5.5   5.0
15    2   444   8.7   5.6   5.9   5.2   6.6   7.4
16    3   447   8.3   6.5   8.1   7.0   7.6   8.3
17    3   458   5.4   8.6   6.1   5.4   7.2   9.4
18    3   464   9.2   9.7   7.8   5.8   7.2   6.2
19    3   486   9.1   6.7   6.8   6.1   8.4   7.9
20    3   512   9.0  10.8   7.1   5.8   8.3   6.7
21    3   530   9.2  11.4   4.5   3.6   6.9   5.7
22    3   534   7.4   6.8   7.9   3.7   5.5   6.7
23    3   538   8.9  12.5  10.5   6.3   7.3   7.4
24    3   542   5.3   6.0   5.8   8.6   6.4   7.9
25    3   549   4.5   5.6   5.8   4.4   7.4   4.9
26    3   569   4.2   6.3   6.2   6.3   6.4   6.4
27    3   572   7.1  10.1  10.0  10.1   9.8  13.1
28    3   577   6.8  12.1  11.1   8.6   6.1   8.2
29    3   581   6.0   6.9   6.3   3.7   5.6   5.3
30    3   584   5.6   6.6   5.9   4.8   6.0   6.1
31    3   597   8.9   7.9   6.4   7.1   8.0   9.4
32    3   616   9.7  10.8   8.8   6.4   9.5   8.0
33    3   647   8.1   6.9   6.9   4.0   7.2   7.3
34    3   660   6.7   5.4   6.0   4.9   6.8   6.4
35    4   591   6.5  10.9   7.8   6.1  11.0   5.0
36    4   595   9.8   9.2   8.0   6.7  10.2  10.5
37    4   603   2.5   3.8   3.2   2.3   3.2   5.4
38    4   612   5.4   6.5   8.0   5.7   4.2   6.6
39    4   618   6.1   7.5   6.1   4.2   5.4   5.5
40    4   619   8.4  10.5   6.9   5.2  11.1   7.4
41    4   623   5.4   7.5   4.4   3.4   5.2   6.8
42    4   632   7.3   7.3   6.9   6.3   6.9   6.6
43    4   641   7.1  13.2   9.5   6.0   7.6   6.7
44    4   645   8.0  11.5   7.2   8.5   9.7   9.9
45    4   646  11.5   7.0   4.1   5.2   7.8   6.8
46    4   648  10.7   9.2   6.9   5.5   8.8   9.2
47    4   654   9.5  10.4   8.9   8.6  11.1  10.2
48    4   657   6.6  14.0  10.1   6.6   9.7   6.2
49    4   661   8.2   8.9   6.1   5.7  11.2   8.7
50    4   663   5.7  10.6   9.5   6.9   8.3   7.8
51    4   677   6.0   8.6   6.1   7.0   7.2   7.8
52    4   678   3.6   3.7   8.2   4.4   4.6   7.9
53    4   682   5.0   6.8   8.7   5.1   5.6   6.0
# This is completely equivalent to the last statement
elongation[!elongation$Zone >= 5, ]   # the ! means exclude
   Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
1     2   373   5.1   5.1   4.8   8.7   6.3   3.2
2     2   379   8.1  13.3   8.6   4.9   5.9   6.3
3     2   383   9.3   8.5  11.7   7.9   8.0   6.3
4     2   389  15.0  10.3   6.8   6.9   5.9   7.6
5     2   390   3.5   6.2   4.7   3.8   3.5   3.0
6     2   395   6.1   5.6   4.4   4.5   4.5   7.6
7     2   396   7.2   5.9   5.7   5.4   5.5   4.7
8     2   408   6.1   8.1   7.6   6.2   9.6  10.1
9     2   412   4.6   6.7   4.5   2.8   4.2   5.2
10    2   421   7.2   5.8   5.8   5.6   4.3   3.4
11    2   425   6.4   8.1   7.8   6.2   6.4   7.9
12    2   429   8.9  11.2   8.6   8.4   8.2   6.6
13    2   431   3.5   5.9   5.4   4.9   3.5   4.9
14    2   442   5.3   9.9   9.2   6.4   5.5   5.0
15    2   444   8.7   5.6   5.9   5.2   6.6   7.4
16    3   447   8.3   6.5   8.1   7.0   7.6   8.3
17    3   458   5.4   8.6   6.1   5.4   7.2   9.4
18    3   464   9.2   9.7   7.8   5.8   7.2   6.2
19    3   486   9.1   6.7   6.8   6.1   8.4   7.9
20    3   512   9.0  10.8   7.1   5.8   8.3   6.7
21    3   530   9.2  11.4   4.5   3.6   6.9   5.7
22    3   534   7.4   6.8   7.9   3.7   5.5   6.7
23    3   538   8.9  12.5  10.5   6.3   7.3   7.4
24    3   542   5.3   6.0   5.8   8.6   6.4   7.9
25    3   549   4.5   5.6   5.8   4.4   7.4   4.9
26    3   569   4.2   6.3   6.2   6.3   6.4   6.4
27    3   572   7.1  10.1  10.0  10.1   9.8  13.1
28    3   577   6.8  12.1  11.1   8.6   6.1   8.2
29    3   581   6.0   6.9   6.3   3.7   5.6   5.3
30    3   584   5.6   6.6   5.9   4.8   6.0   6.1
31    3   597   8.9   7.9   6.4   7.1   8.0   9.4
32    3   616   9.7  10.8   8.8   6.4   9.5   8.0
33    3   647   8.1   6.9   6.9   4.0   7.2   7.3
34    3   660   6.7   5.4   6.0   4.9   6.8   6.4
35    4   591   6.5  10.9   7.8   6.1  11.0   5.0
36    4   595   9.8   9.2   8.0   6.7  10.2  10.5
37    4   603   2.5   3.8   3.2   2.3   3.2   5.4
38    4   612   5.4   6.5   8.0   5.7   4.2   6.6
39    4   618   6.1   7.5   6.1   4.2   5.4   5.5
40    4   619   8.4  10.5   6.9   5.2  11.1   7.4
41    4   623   5.4   7.5   4.4   3.4   5.2   6.8
42    4   632   7.3   7.3   6.9   6.3   6.9   6.6
43    4   641   7.1  13.2   9.5   6.0   7.6   6.7
44    4   645   8.0  11.5   7.2   8.5   9.7   9.9
45    4   646  11.5   7.0   4.1   5.2   7.8   6.8
46    4   648  10.7   9.2   6.9   5.5   8.8   9.2
47    4   654   9.5  10.4   8.9   8.6  11.1  10.2
48    4   657   6.6  14.0  10.1   6.6   9.7   6.2
49    4   661   8.2   8.9   6.1   5.7  11.2   8.7
50    4   663   5.7  10.6   9.5   6.9   8.3   7.8
51    4   677   6.0   8.6   6.1   7.0   7.2   7.8
52    4   678   3.6   3.7   8.2   4.4   4.6   7.9
53    4   682   5.0   6.8   8.7   5.1   5.6   6.0
# Subsetting with two conditions
elongation[elongation$Zone == 2 | elongation$Zone == 7, ]    # returns only data for zones 2 and 7
    Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
1      2   373   5.1   5.1   4.8   8.7   6.3   3.2
2      2   379   8.1  13.3   8.6   4.9   5.9   6.3
3      2   383   9.3   8.5  11.7   7.9   8.0   6.3
4      2   389  15.0  10.3   6.8   6.9   5.9   7.6
5      2   390   3.5   6.2   4.7   3.8   3.5   3.0
6      2   395   6.1   5.6   4.4   4.5   4.5   7.6
7      2   396   7.2   5.9   5.7   5.4   5.5   4.7
8      2   408   6.1   8.1   7.6   6.2   9.6  10.1
9      2   412   4.6   6.7   4.5   2.8   4.2   5.2
10     2   421   7.2   5.8   5.8   5.6   4.3   3.4
11     2   425   6.4   8.1   7.8   6.2   6.4   7.9
12     2   429   8.9  11.2   8.6   8.4   8.2   6.6
13     2   431   3.5   5.9   5.4   4.9   3.5   4.9
14     2   442   5.3   9.9   9.2   6.4   5.5   5.0
15     2   444   8.7   5.6   5.9   5.2   6.6   7.4
95     7   924   8.1   7.8   7.1   6.8   7.0   6.6
96     7   963   6.8   7.2   3.7   3.5   4.0   4.5
97     7   970   6.2   5.2   5.9   6.2   7.7   6.3
98     7   977   5.5   5.3   6.4   5.7   7.7   7.9
99     7   979   6.6   6.3   6.3   7.7   6.6   7.1
100    7  1000   4.3   5.5   3.6   3.1   6.7   5.8
101    7  1069   6.2   3.9   4.1   3.8   5.3   7.2
102    7  1073   7.3   6.3   5.3   6.7   6.0   8.7
103    7  1087   5.9   6.7   7.4   5.1   5.3   6.2
104    7  1109   4.5   4.3   3.4   4.3   2.9   2.4
105    7  1140   4.7   3.9   5.5   5.2   5.7   3.9
106    7  1183   7.2   9.3   6.7   7.8   8.3   7.7
107    7  1188   9.4   8.9   7.0   6.9   9.5   7.8
108    7  1247   5.5   5.0   4.2   3.9   6.0   4.8
109    7  1286   7.1   9.5   7.7   5.6   8.0   8.2
110    7  1340   7.1   5.7   7.2   6.9   7.3   8.4
111    7  1358   9.0  10.0   8.6   8.5   8.3   8.3
112    7  1392   3.9   4.7   2.9   3.9   3.8   4.8
113    7  1477   5.3   5.9   6.2   4.2   5.5   6.1
114    7  1481   5.4   6.7   6.0   6.0   7.2   7.0
elongation[elongation$Zone == 2 & elongation$Indiv %in% c(300:400), ]    # returns data for shrubs in zone 2 whose ID numbers are between 300 and 400
  Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012
1    2   373   5.1   5.1   4.8   8.7   6.3   3.2
2    2   379   8.1  13.3   8.6   4.9   5.9   6.3
3    2   383   9.3   8.5  11.7   7.9   8.0   6.3
4    2   389  15.0  10.3   6.8   6.9   5.9   7.6
5    2   390   3.5   6.2   4.7   3.8   3.5   3.0
6    2   395   6.1   5.6   4.4   4.5   4.5   7.6
7    2   396   7.2   5.9   5.7   5.4   5.5   4.7

8 Wrangling Data with dplyr

The dplyr package is a useful bundle of intuitive functions for data wrangling that are named after the action they perform. A big advantage of these functions is that they take your data frame as a first argument, so that you can refer to columns without explicitly having to refer to the full object (so you can drop those $ signs!). Let’s take a look at the most common and useful functions by working on the long format object we just created (elongation_long).

8.1 rename() variables

This lets you change the name(s) of a column or columns. The first argument is the data frame, the second (and third, etc.) takes the form New name = Old name.

# Changes the column names of the columns - getting rid of capital letters
elongation_long <- rename(elongation_long, 
                          zone = Zone, 
                          indiv = Indiv, 
                          year = Year,
                          length = Length)     

# View the first six data rows
head(elongation_long)
  zone indiv  year length
1    2   373 X2007    5.1
2    2   379 X2007    8.1
3    2   383 X2007    9.3
4    2   389 X2007   15.0
5    2   390 X2007    3.5
6    2   395 X2007    6.1
# Base R equivalent 
names(elongation_long) <- c("zone", "indiv", "year", "length")

8.2 filter() rows and select() columns

These are some of the most routine functions that let you reduce your data frame to just the rows and columns you need. The filter() function is great for subsetting rows with logical operations while the select() function lets you specify which columns to keep.

# Filter rows

# Let's keep observations from zones 2 and 3 only, and from years 2009 to 2011
elong_subset <- filter(elongation_long,     # data frame
                       zone %in% c(2, 3),   # keep rows where zone is 2 or 3
                       year %in% c("X2009", "X2010", "X2011")) # keep rows where year is 2009, 2010 or 2011

# You can use multiple different conditions separated by commas

# Base R equivalent 
elong_subset2 <- elongation_long[elongation_long$zone %in% c(2,3) & 
                                 elongation_long$year %in% c("X2009", "X2010", "X2011"), ]

Note that here, we use %in% as a logical operator because we are looking to match a list of exact (character) values. If you want to keep observations within a range of numeric values, you either need two logical statements in your filter() function, e.g. length > 4 & length <= 6.5 or you can use the convenient between() function, e.g. between(length, 4, 6.5).

# Select columns

# Let's remove the zone column (just as an example)
elong_no.zone <- select(elongation_long, 
                        indiv, 
                        year, 
                        length)   

# or alternatively...
elong_no.zone <- select(elongation_long, -zone) # the minus sign removes the column

head(elong_no.zone)
  indiv  year length
1   373 X2007    5.1
2   379 X2007    8.1
3   383 X2007    9.3
4   389 X2007   15.0
5   390 X2007    3.5
6   395 X2007    6.1
# Base R equivalent 
elong_no.zone2 <- elongation_long[ , -1]  # removes first column

# A nice hack! select() lets you rename and reorder columns on the fly
elong_no.zone <- dplyr::select(elongation_long, 
                               Year = year, 
                               Shrub.ID = indiv, 
                               Growth = length)

8.3 mutate() your dataset by creating new columns

Creating new columns is useful when you want to perform an operation on multiple columns, or perhaps reclassify a factor. The mutate() function does just that as well as allowing you to define the column name. Let’s use our old wide-format object elongation and create a column representing total growth for the period 2007-2012.

elong_total <- mutate(elongation, 
                      total.growth = X2007 + X2008 + X2009 + X2010 + X2011 + X2012)

head(elong_total)
  Zone Indiv X2007 X2008 X2009 X2010 X2011 X2012 total.growth
1    2   373   5.1   5.1   4.8   8.7   6.3   3.2         33.2
2    2   379   8.1  13.3   8.6   4.9   5.9   6.3         47.1
3    2   383   9.3   8.5  11.7   7.9   8.0   6.3         51.7
4    2   389  15.0  10.3   6.8   6.9   5.9   7.6         52.5
5    2   390   3.5   6.2   4.7   3.8   3.5   3.0         24.7
6    2   395   6.1   5.6   4.4   4.5   4.5   7.6         32.7

Now, let’s see how we could accomplish the same thing on our long-format data elongation_long by using two functions that pair extremely well together: group_by() and summarise().

8.4 group_by() factors to perform operations on data chunks

The most important thing to understand about this function is that you don’t see any visible change to your data frame. It creates an internal grouping structure, which means that every subsequent function you run on it will use these groups, and not the whole dataset, as an input. It’s very useful when you want to compute summary statistics for different sites, treatments, species, etc.

# Grouping data by individual
elong_grouped <- group_by(elongation_long, indiv)   

8.5 summarise() data with a range of summary statistics

This function will always aggregate your original data frame, i.e. the output data frame will be shorter than the input. Here, let’s contrast summing growth increments over the study period on the original dataset vs our new grouped dataset.

summary_1 <- summarise(elongation_long, total.growth = sum(length))
summary_2 <- summarise(elong_grouped, total.growth = sum(length))

The first summary corresponds to the sum of all growth increments in the dataset (all individuals and years). The second one gives us a breakdown of total growth per individual, our grouping variable. Amazing! We can compute all sorts of summary statistics, too, like the mean or standard deviation of growth across years.

summary_3 <- summarise(elong_grouped, total.growth = sum(length),
                                     mean.growth = mean(length),
                                     sd.growth = sd(length))

head(summary_3)
# A tibble: 6 × 4
  indiv total.growth mean.growth sd.growth
  <int>        <dbl>       <dbl>     <dbl>
1     1         42.6        7.1      1.35 
2    37         42.8        7.13     0.781
3    38         29          4.83     1.14 
4    41         28.2        4.7      2.05 
5    49         39.1        6.52     1.41 
6    64         41.5        6.92     0.868

Unfortunately we lose all the other columns not specified at the grouping stage or in a summary operation. For instance, we lost the column year because there are 5 years for each individual, and we’re summarising to get one single growth value per individual. Always create a new object for summarised data, so that your full data set doesn’t get overwritten!

9 Activities

Download the biological_experiment_untidy_data.xlsx file from here. This dataset simulates the growth measurements of two different plant species (Species A and Species B) under varying light conditions, recorded over a period of 30 days.

9.1 Identify ‘untidy’ features of the dataset

Open this data set in Excel and identify the ‘untidy’ features within it. Make a note of what the problems are and what you would need to do to make it tidy?

💡 Click here to view a solution

These are the key features of the dataset that make it untidy:

  • Duplicate Columns with Slightly Different Names: There are two sets of growth measurements for each species, with slightly different names and values (Species A Growth (cm) vs Sp. A Growth (cm), and Species B Growth (cm) vs Sp. B Growth).
  • Inconsistent Formatting: The Light Condition column has values in uppercase, which may need standardisation.
  • Unnecessary Columns: Columns like Notes and Experiment ID may not be relevant for analysis and could be considered for removal or cleaning.

9.2 Manually tidy the dataset

Using the notes you made in the previous step, manually tidy the dataset in Excel. Save the tidied dataset as a new Excel file.

9.3 Tidy the dataset using dplyr

Now, let’s tidy a dataset using dplyr functions. First, load the tidyverse package then download and import this data file into R. This data was collected by a student from the entomology department. Their experiment was looking at the effect of diet on alarm pheromone production for two aphid species. This was measured by taking 100 individuals from each diet treatment and crushing them. Analytical chemistry techniques were used to quantify aphid alarm pheromone production.Then use dplyr functions to do the following on the tidy data frame:

  • Convert from wide to long format;
  • Filter rows to remove the partial artificial diet data;
  • Mutate the replicate column to turn it into a factor;
  • Calculate the mean and standard error for each treatment and aphid species.

Remember standard error of the mean is calculated as: sd(x)/sqrt(length(x)).

💡 Click here to view a solution
# Load the tidyverse package
library(tidyverse, warn.conflicts = FALSE, quietly = TRUE)
library(openxlsx, warn.conflicts = FALSE, quietly = TRUE)
Warning: package 'openxlsx' was built under R version 4.3.2
# Import the data
untidy_data <- read.xlsx("data/untidy_data.xlsx", # File name
                         sheet = 1)               # Sheet name

# Convert from wide to long format
untidy_data <- gather(untidy_data,                             # Data frame
                      key = "replicate",                       # Name of new column
                      value = "pheromone_conc", rep_1:rep_5)   # Columns to gather

# Filter rows to remove the partial artificial diet data
untidy_data <- filter(untidy_data,             # Data frame
                      !is.na(aphid_species))   # Condition to filter by

# Mutate the replicate column to turn it into a factor
untidy_data <- mutate(untidy_data,                       # Data frame
                      replicate = as.factor(replicate))  # Column to mutate

# Calculate the mean and standard error for each treatment and aphid species
untidy_data <- untidy_data %>%                    # Data frame
  group_by(diet, aphid_species) %>%               # Grouping variables
  summarise(mean = mean(pheromone_conc),          # Calculate mean
            se = sd(pheromone_conc)/sqrt(n()))    # Calculate standard error
`summarise()` has grouped output by 'diet'. You can override using the
`.groups` argument.

9.4 Tidy the dataset using Base R

We can also use Base R functions to tidy the dataset, though it is much more complicated! Using the same data set as in the previous step, do the following:

  • Convert from wide to long format;
  • Filter rows to remove the partial artificial diet data;
  • Mutate the replicate column to turn it into a factor;
  • Calculate the mean and standard error for each treatment and aphid species.
💡 Click here to view a solution
# Load the openxlsx package for reading Excel files
library(openxlsx, warn.conflicts = FALSE, quietly = TRUE)

# Import the data
untidy_data <- read.xlsx("data/untidy_data.xlsx", sheet = 1)

# Convert from wide to long format
untidy_data <- reshape(untidy_data, 
                       varying = list(c("rep_1", "rep_2", "rep_3", "rep_4", "rep_5")),
                       v.names = "pheromone_conc",
                       timevar = "replicate",
                       idvar = setdiff(names(untidy_data), c("rep_1", "rep_2", "rep_3", "rep_4", "rep_5")),
                       direction = "long")

# Filter rows to remove the partial artificial diet data
untidy_data <- untidy_data[untidy_data$diet != "artificial_part",]

# Calculate the mean and standard error for each treatment and aphid species
aggregate_result <- aggregate(pheromone_conc ~ diet + aphid_species, data = untidy_data, function(x) {
    c(mean = mean(x, na.rm = TRUE), se = sd(x, na.rm = TRUE) / sqrt(length(na.omit(x))))
})