Tidy Data

1 Learning Objectives

  • Review key rules for tidy data;
  • Review key tidyr functions for wrangling data;
  • Manually tidy some data in Excel.

2 Start a Script

For this lab or project, begin by:

  • Starting a new R script
  • Create a good header section and table of contents
  • Save the script file with an informative name
  • set your working directory

Aim to make the script a future reference for doing things in R!

3 Introduction

Data recorded in the field or in the lab is often very different to the way you want your data entered into R. During data collection tables are usually drawn up ahead of time and completed in situ, and you will have extra notes and information in addition to the data you want to analyse. In this lab you will learn a consistent way to organise your data in R known as tidy data. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data, you will spend much less time wrangling data into different formats.

4 Rules for Tidy Data

Why ensure that your data is tidy? There are two main advantages:

  1. There is a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

  2. There is a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

There are three interrelated rules which make a data set tidy:

  • Each variable must have its own column;
  • Each observation must have its own row;
  • Each value must have its own cell.

Rules for tidy data

These three rules are interrelated because it’s impossible to only satisfy two of the three. That inter-relationship leads to an even simpler set of practical instructions:

  • Put each data set in a data frame;
  • Put each variable in a column.

5 Packages and Data

We will predominantly be using two core packages from the tidyverse called tidyr and dplyr but load the entire ecosystem for convenience. The data set we are using represents annual increments in crowberry shrub (Empetrum hermaphroditum) stem growth in a sand dune ecosystem. You can download this data here.

# Check whether a package is installed and install if not.
# This will also load the package if already installed!
if(!require("tidyverse")) install.packages("tidyverse")
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load data
elongation <- read.csv("data/empetrum_elongation.csv", header = TRUE)

6 Tidying Data

The elongation dataset we previously loaded isn’t tidy - the same variable (i.e., stem length) is spread across multiple columns representing different years. We can use the gather() function from the tidyr package to convert this wide-format, untidy data frame to a long-format, tidy data frame. We want to create a single column Year that will have the years currently in the columns (2007-2012) repeated for each individual. From this, you should be able to work out that the data frame will be six times longer than the original. We also want a column Length where all the growth data associated to each year and individual will go.

# Let's take a look at a summary of our original data frame
glimpse(elongation)
Rows: 114
Columns: 8
$ Zone  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3…
$ Indiv <int> 373, 379, 383, 389, 390, 395, 396, 408, 412, 421, 425, 429, 431,…
$ X2007 <dbl> 5.1, 8.1, 9.3, 15.0, 3.5, 6.1, 7.2, 6.1, 4.6, 7.2, 6.4, 8.9, 3.5…
$ X2008 <dbl> 5.1, 13.3, 8.5, 10.3, 6.2, 5.6, 5.9, 8.1, 6.7, 5.8, 8.1, 11.2, 5…
$ X2009 <dbl> 4.8, 8.6, 11.7, 6.8, 4.7, 4.4, 5.7, 7.6, 4.5, 5.8, 7.8, 8.6, 5.4…
$ X2010 <dbl> 8.7, 4.9, 7.9, 6.9, 3.8, 4.5, 5.4, 6.2, 2.8, 5.6, 6.2, 8.4, 4.9,…
$ X2011 <dbl> 6.3, 5.9, 8.0, 5.9, 3.5, 4.5, 5.5, 9.6, 4.2, 4.3, 6.4, 8.2, 3.5,…
$ X2012 <dbl> 3.2, 6.3, 6.3, 7.6, 3.0, 7.6, 4.7, 10.1, 5.2, 3.4, 7.9, 6.6, 4.9…
# Here we want the lengths (value) to be gathered by year (key) 
elongation_long <- gather(elongation, Year, Length, # data frame, key, value                          
                          c(X2007, X2008, X2009, X2010, X2011, X2012)) # specify columns 

# Let's take a look at a summary of our new data frame
glimpse(elongation_long) 
Rows: 684
Columns: 4
$ Zone   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, …
$ Indiv  <int> 373, 379, 383, 389, 390, 395, 396, 408, 412, 421, 425, 429, 431…
$ Year   <chr> "X2007", "X2007", "X2007", "X2007", "X2007", "X2007", "X2007", …
$ Length <dbl> 5.1, 8.1, 9.3, 15.0, 3.5, 6.1, 7.2, 6.1, 4.6, 7.2, 6.4, 8.9, 3.…
# spread() is the inverse function, allowing you to go from long to wide format
elongation_wide <- spread(elongation_long, Year, Length) 

Notice how we used the column names to tell gather() which columns to reshape. This is handy if you only have a few, and if the columns change order eventually, the function will still work. However, if you have a dataset with columns for 100 genes, for instance, you are better off specifying the column numbers.

7 Activities

Download the biological_experiment_untidy_data.xlsx file from here. This data set simulates the growth measurements of two different plant species (Species A and Species B) under varying light conditions, recorded over a period of 30 days.

7.1 Identify ‘untidy’ features of the data set

Open this data set in Excel and identify the ‘untidy’ features within it. Make a note of what the problems are and what you would need to do to make it tidy?

💡 Click here to view a solution

These are the key features of the data set that make it untidy:

  • Duplicate Columns with Slightly Different Names: There are two sets of growth measurements for each species, with slightly different names and values (Species A Growth (cm) vs Sp. A Growth (cm), and Species B Growth (cm) vs Sp. B Growth).
  • Inconsistent Formatting: The Light Condition column has values in uppercase, which may need standardisation.
  • Unnecessary Columns: Columns like Notes and Experiment ID may not be relevant for analysis and could be considered for removal or cleaning.

7.2 Manually tidy the data set

Using the notes you made in the previous step, manually tidy the data set in Excel. Save the tidied data set as a new Excel file.

7.3 Tidy the data set using dplyr

Now, let’s tidy a data set using dplyr functions. First, load the tidyverse package then download and import this data file into R. This data was collected by a student from the entomology department. Their experiment was looking at the effect of diet on alarm pheromone production for two aphid species. This was measured by taking 100 individuals from each diet treatment and crushing them. Analytical chemistry techniques were used to quantify aphid alarm pheromone production.Then use dplyr functions to do the following on the tidy data frame:

  • Convert from wide to long format;
  • Filter rows to remove the partial artificial diet data;
  • Mutate the replicate column to turn it into a factor;
  • Calculate the mean and standard error for each treatment and aphid species.

Remember standard error of the mean is calculated as: sd(x)/sqrt(length(x)).

💡 Click here to view a solution
# Load the tidyverse package
library(tidyverse, warn.conflicts = FALSE, quietly = TRUE)
library(openxlsx, warn.conflicts = FALSE, quietly = TRUE)
Warning: package 'openxlsx' was built under R version 4.3.2
# Import the data
untidy_data <- read.xlsx("data/untidy_data.xlsx", # File name
                         sheet = 1)               # Sheet name

# Convert from wide to long format
untidy_data <- gather(untidy_data,                             # Data frame
                      key = "replicate",                       # Name of new column
                      value = "pheromone_conc", rep_1:rep_5)   # Columns to gather

# Filter rows to remove the partial artificial diet data
untidy_data <- filter(untidy_data,             # Data frame
                      !is.na(aphid_species))   # Condition to filter by

# Mutate the replicate column to turn it into a factor
untidy_data <- mutate(untidy_data,                       # Data frame
                      replicate = as.factor(replicate))  # Column to mutate

# Calculate the mean and standard error for each treatment and aphid species
untidy_data <- untidy_data %>%                    # Data frame
  group_by(diet, aphid_species) %>%               # Grouping variables
  summarise(mean = mean(pheromone_conc),          # Calculate mean
            se = sd(pheromone_conc)/sqrt(n()))    # Calculate standard error
`summarise()` has grouped output by 'diet'. You can override using the
`.groups` argument.

8 Recap

  • Tidy data format ensures each variable has its own column, each observation its own row, and each value its own cell;
  • tidyr can be used for data manipulation and to make data tidy;
  • gather() and spread() are used to convert between wide and long formats.