Introduction

This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data
4. Tidy data

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Essential commmands | Data science for psychologists
## 2018 07 09
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##

## Preparations: ----- 

library(tidyverse)

## Tidy data: tidyr ----- 

# ...

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- 

Chapter 12: Tidy data teaches a consistent way to organise tabular data. It introduces commands of the tidyr package, which is a core member of the tidyverse.

Tidy data

Tabular data

In R, rectangular data is often organized in tibbles or data frames. Importantly, each column is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of one type, every row can contain values of different variables and types.

The same set of data (values of variables) can be organised in many different ways. For instance, the following tables (or tibbles) all provide the number of TB cases documented by the World Health Organization in 3 countries (Afghanistan, Brazil, and China) in 2 years (1999 and 2000):

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

library(tidyverse)

## Example of the same data organised in 4 different ways:
# ?table1 # for semantics and source of data

tidyr::table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

tidyr::table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

tidyr::table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

tidyr::table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
tidyr::table4b
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

tidyr::table5
#> # A tibble: 6 x 4
#>   country     century year  rate             
#> * <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

Practice: Recreate the above bar plot using ggplot2 with data = table1.

Defining tidy data

Definition: A tidy dataset conforms to 3 interrelated rules:

  1. Each variable must have its own column.

  2. Each case/observation must have its own row.

  3. Each value must have its own cell.

See http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.

The 3 rules defining tidy data are connected, as it is impossible to only satisfy 2 of the 3. This leads to a simpler set of practical instructions for tidying a messy set of data:

  1. turn each dataset into a tibble.
  2. put each variable into a column.

Note that we need to interpret the semantics of the variables to understand whether a data set is tidy.

Practice: Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?

Advantages of tidy data

  1. Consistency: Consistent data structures make it easier to learn the tools that work with it because they have an underlying uniformity.

  2. Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic verbs of dplyr (and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.

  3. Matching data and tools: Packages like dplyr, ggplot2, and many others are designed to work with tidy data.

Commands and examples

We consider 2 pairs of 2 complementary commands as essential:

  1. separate splits 1 variable into 2 variables;
  2. unite combines 2 variables into 1 variable;
  3. gather makes wide data longer (by gathering many variables into 1);
  4. spread makes long data wider (by spreading 1 variable into many).

separate is the complement/opposite of unite and spread is the complement/opposite of gather.

Here are some basic examples for using these 4 commands:

1. separate 1 variable into 2

separate splits 1 variable (column) into multiple variables (columns) – at a position where some separator character appears – and is the complement to unite. Using separate requires the following arguments:

  • some tibble/data frame data;
  • the variable (column) col to be separated (specified by its name or column number);
  • the names of the new variables (columns) into which col is to be split (specified as a character vector);
  • the separator character sep (as a character/regular expression).

An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.

# Data to use: 
tidyr::table3  # Note that column rate contains 2 numbers, separated by "/". 
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

## Basics: ----- 

# Full separate command:
separate(data = table3, col = rate, into = c("cases", "population"), sep = "/")
#> # A tibble: 6 x 4
#>   country      year cases  population
#> * <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
# Note that "/" disappears from output tibble.

# Shorter versions of the same command:
separate(table3, rate, c("cases", "population"))
#> # A tibble: 6 x 4
#>   country      year cases  population
#> * <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# Using the pipe: 
table3 %>% 
  separate(rate, c("cases", "population"))
#> # A tibble: 6 x 4
#>   country      year cases  population
#> * <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

## Variants: ----- 

# Specifying the variable to be split (rate) by its column number (3):
table3 %>% 
  separate(3, c("cases", "population"))
#> # A tibble: 6 x 4
#>   country      year cases  population
#> * <chr>       <int> <chr>  <chr>     
#> 1 Afghanistan  1999 745    19987071  
#> 2 Afghanistan  2000 2666   20595360  
#> 3 Brazil       1999 37737  172006362 
#> 4 Brazil       2000 80488  174504898 
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# Not dropping the original rate variable:
table3 %>% 
  separate(rate, c("cases", "population"), remove = FALSE)
#> # A tibble: 6 x 5
#>   country      year rate              cases  population
#> * <chr>       <int> <chr>             <chr>  <chr>     
#> 1 Afghanistan  1999 745/19987071      745    19987071  
#> 2 Afghanistan  2000 2666/20595360     2666   20595360  
#> 3 Brazil       1999 37737/172006362   37737  172006362 
#> 4 Brazil       2000 80488/174504898   80488  174504898 
#> 5 China        1999 212258/1272915272 212258 1272915272
#> 6 China        2000 213766/1280428583 213766 1280428583

The example shows that the argument names (data, col, and into) can be left out (but still require appropriate arguments in the correct order) and sep can be left unspecified when tidyr can make a good guess what the separator character might be.

However, consider the following table6, which is available online and can be read into R via read_csv("http://rpository.com/ds4psy/data/table6.csv"):

## Load data (as comma-separated file): 
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv")  # from online source

## Alternatively (from local source "data/table6.csv"): 
# table6 <- read_csv("data/table6.csv")  # from local directory

table6
#> # A tibble: 6 x 2
#>   country     when_what              
#>   <chr>       <chr>                  
#> 1 Afghanistan 19_99.745/19987071     
#> 2 Afghanistan 20_00.2666/20595360    
#> 3 Brazil      19_99.37737/172006362  
#> 4 Brazil      20_00.80488/174504898  
#> 5 China       19_99.212258/1272915272
#> 6 China       20_00.213766/1280428583

Here, the variable when_what contains several plausible separation characters: _, ., and /. Let’s first see what happens when we fail to provide a separating character sep, and then split the variable when_what in three different ways:

# Data to use: 
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
table6    # Note that column when_what contains several splitting options. 
#> # A tibble: 6 x 2
#>   country     when_what              
#>   <chr>       <chr>                  
#> 1 Afghanistan 19_99.745/19987071     
#> 2 Afghanistan 20_00.2666/20595360    
#> 3 Brazil      19_99.37737/172006362  
#> 4 Brazil      20_00.80488/174504898  
#> 5 China       19_99.212258/1272915272
#> 6 China       20_00.213766/1280428583

# What happens when we do not specify "sep"? 
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"))  # sep is not provided!
#> # A tibble: 6 x 3
#>   country     var_1 var_2
#> * <chr>       <chr> <chr>
#> 1 Afghanistan 19    99   
#> 2 Afghanistan 20    00   
#> 3 Brazil      19    99   
#> 4 Brazil      20    00   
#> 5 China       19    99   
#> 6 China       20    00

# => when_what is split at 1st option (_), but Warning (and loss of data)!

# Specifying different splitting characters:
# (a) split at "_": 
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "_")  # 
#> # A tibble: 6 x 3
#>   country     var_1 var_2               
#> * <chr>       <chr> <chr>               
#> 1 Afghanistan 19    99.745/19987071     
#> 2 Afghanistan 20    00.2666/20595360    
#> 3 Brazil      19    99.37737/172006362  
#> 4 Brazil      20    00.80488/174504898  
#> 5 China       19    99.212258/1272915272
#> 6 China       20    00.213766/1280428583

# (b) split at "." (specified as a regular expression "\\."):
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "\\.")  
#> # A tibble: 6 x 3
#>   country     var_1 var_2            
#> * <chr>       <chr> <chr>            
#> 1 Afghanistan 19_99 745/19987071     
#> 2 Afghanistan 20_00 2666/20595360    
#> 3 Brazil      19_99 37737/172006362  
#> 4 Brazil      20_00 80488/174504898  
#> 5 China       19_99 212258/1272915272
#> 6 China       20_00 213766/1280428583

# (c) split at "/":
table6 %>%
  separate(col = when_what, into = c("var_1", "var_2"), sep = "/")
#> # A tibble: 6 x 3
#>   country     var_1        var_2     
#> * <chr>       <chr>        <chr>     
#> 1 Afghanistan 19_99.745    19987071  
#> 2 Afghanistan 20_00.2666   20595360  
#> 3 Brazil      19_99.37737  172006362 
#> 4 Brazil      20_00.80488  174504898 
#> 5 China       19_99.212258 1272915272
#> 6 China       20_00.213766 1280428583

Note that using the point or period (.) as a splitting character sep = "." would not work. Instead, we need to use the corresponding regular expression sep = "\\.". (See Chapter 14: Strings for details.)

Practice: Split the when_what variable of table6 3 times to create a tibble table6a that contains 5 variables (columns) and reasonable variable names:

#> # A tibble: 6 x 5
#>   country     century year  cases  population
#> * <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583

2. unite 2 variables into 1

unite combines 2 variables (columns) into 1 variable (column) – adding an optional separator character – and is the complement to separate. Using unite requires the following arguments:

  • some tibble/data frame data;
  • the name of the new compound variable (column) col (specified as a character);
  • the names of the variables (columns) to be combined (specified by their names or column numbers);
  • an optional separator character sep (as a character/regular expression).

An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.

# Data to use: 
tidyr::table5  # Note that columns 2 and 3 contain 2 values (as characters!) that belong together. 
#> # A tibble: 6 x 4
#>   country     century year  rate             
#> * <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

## Basics: ----- 

# Full separate command:
unite(data = table5, col = "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#>   country     yr    rate             
#> * <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583
# Note that century and year variables disappear from output tibble.

# Shorter versions of the same command:
unite(table5, "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#>   country     yr    rate             
#> * <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

# Using the pipe: 
table5 %>%
  unite("yr", century, year, sep = "")
#> # A tibble: 6 x 3
#>   country     yr    rate             
#> * <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

## Variants: ----- 

# Providing a different separation character:
table5 %>%
  unite("yr", century, year, sep = "<--|-->")
#> # A tibble: 6 x 3
#>   country     yr          rate             
#> * <chr>       <chr>       <chr>            
#> 1 Afghanistan 19<--|-->99 745/19987071     
#> 2 Afghanistan 20<--|-->00 2666/20595360    
#> 3 Brazil      19<--|-->99 37737/172006362  
#> 4 Brazil      20<--|-->00 80488/174504898  
#> 5 China       19<--|-->99 212258/1272915272
#> 6 China       20<--|-->00 213766/1280428583

# Specifying the variables to be combined () by their column numbers (2 & 3):
table5 %>% 
  unite("yr", 2, 3, sep = "")
#> # A tibble: 6 x 3
#>   country     yr    rate             
#> * <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

# Not dropping the original variables:
table5 %>%
  unite("yr", century, year, sep = "", remove = FALSE)
#> # A tibble: 6 x 5
#>   country     yr    century year  rate             
#> * <chr>       <chr> <chr>   <chr> <chr>            
#> 1 Afghanistan 1999  19      99    745/19987071     
#> 2 Afghanistan 2000  20      00    2666/20595360    
#> 3 Brazil      1999  19      99    37737/172006362  
#> 4 Brazil      2000  20      00    80488/174504898  
#> 5 China       1999  19      99    212258/1272915272
#> 6 China       2000  20      00    213766/1280428583

Practice: Take the data from dplyr::storms and unite the variables year, month, day into 1 variable date.

#> # A tibble: 6 x 11
#>   name  date   hour   lat  long status category  wind pressure ts_diameter
#>   <chr> <chr> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>       <dbl>
#> 1 Amy   1975…  0     27.5 -79.0 tropi… -1          25     1013          NA
#> 2 Amy   1975…  6.00  28.5 -79.0 tropi… -1          25     1013          NA
#> 3 Amy   1975… 12.0   29.5 -79.0 tropi… -1          25     1013          NA
#> 4 Amy   1975… 18.0   30.5 -79.0 tropi… -1          25     1013          NA
#> 5 Amy   1975…  0     31.5 -78.8 tropi… -1          25     1012          NA
#> 6 Amy   1975…  6.00  32.4 -78.7 tropi… -1          25     1012          NA
#> # ... with 1 more variable: hu_diameter <dbl>

Practice: Read the data from read_csv("http://rpository.com/ds4psy/data/table7.csv") into a tibble table7 and inspect its dimension and contents.

  1. Use multiple (4) separate commands to split table7 into a tibble table7a with multiple (5) columns.

  2. Use multiple (4) unite commands on table7a to re-create a tibble table7b that contains all data in 1 column.

Examples of table7 and possible solutions for table7a and table7b:

#> # A tibble: 6 x 1
#>   where_when_what                   
#>   <chr>                             
#> 1 "Afghanistan@19:99$745\\19987071" 
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"   
#> 4 "Brazil@20:00$80488\\174504898"   
#> 5 "China@19:99$212258\\1272915272"  
#> 6 "China@20:00$213766\\1280428583"
#> # A tibble: 6 x 5
#>   country     century year  rate   population
#> * <chr>       <chr>   <chr> <chr>  <chr>     
#> 1 Afghanistan 19      99    745    19987071  
#> 2 Afghanistan 20      00    2666   20595360  
#> 3 Brazil      19      99    37737  172006362 
#> 4 Brazil      20      00    80488  174504898 
#> 5 China       19      99    212258 1272915272
#> 6 China       20      00    213766 1280428583
#> # A tibble: 6 x 1
#>   where_when_what               
#> * <chr>                         
#> 1 Afghanistan:1999_745/19987071 
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362   
#> 4 Brazil:2000_80488/174504898   
#> 5 China:1999_212258/1272915272  
#> 6 China:2000_213766/1280428583

3. gather makes wide data longer

Gathering is the opposite of spreading and used when observations that are distributed over multiple columns should be contained in 1 variable (column). More specifically, gather moves the values of several variables (columns) into 1 column value and describes this value by the value of a new key variable. When gathering more than 2 variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).1

Using gather requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or name of the independent variable);
  • value is the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
  • ... or var_x:var_y is a list of variables (columns) to be gathered.
# ?gather # provides documentation

## Data to use: 
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
# Note that counts of cases is distributed over 2 variables (columns) for each country.

## Basics: -----

# gather 2 variables into 1 variable:
gather(data = table4a, 
       key = year, value = cases, 
       `1999`:`2000`)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

# The same command using the pipe:
table4a %>%
  gather(key = year, value = cases, 
         `1999`:`2000`)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

## Variants: ----- 

# The same command with in different order of arguments:
table4a %>%
  gather(`1999`:`2000`, key = year, value = cases)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

# The same command specifying the numbers of the columns to gather:
table4a %>%
  gather(2:3, key = year, value = cases)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766

Note that year is of type character in the above example. If we wanted our key variable to be converted into a number (here: an integer), we can add the optional argument convert = TRUE:

## Default: convert = FALSE: 
table4a %>%
  gather(key = year, value = cases, `1999`:`2000`, convert = FALSE)
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Brazil      1999   37737
#> 3 China       1999  212258
#> 4 Afghanistan 2000    2666
#> 5 Brazil      2000   80488
#> 6 China       2000  213766
# => year is a character vector.

## Converting year into an integer:
table4a %>%
  gather(key = year, value = cases, `1999`:`2000`, convert = TRUE)
#> # A tibble: 6 x 3
#>   country      year  cases
#>   <chr>       <int>  <int>
#> 1 Afghanistan  1999    745
#> 2 Brazil       1999  37737
#> 3 China        1999 212258
#> 4 Afghanistan  2000   2666
#> 5 Brazil       2000  80488
#> 6 China        2000 213766
# => year is a vector of integers. 

Practice: Save the following data as a tibble de and then turn it into tidy data (by using gather to create a single variable share and listing the election year as an additional variable).

party share_2013 share_2017
CDU/CSU 0.415 0.330
SPD 0.257 0.205
Others 0.328 0.465
## (a) Data saved as a tibble (see above): 
de
#> # A tibble: 3 x 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

## (b) Converting de into a tidy data table:
de_2 <- de %>%
  gather(share_2013:share_2017, key = "election", value = "share") %>%
  separate(col = election, into = c("dummy", "year")) %>%
  select(year, party, share)

de_2
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

4. spread makes long data wider

Spreading is the opposite of gathering and used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider).2

Using spread requires the following arguments:

  • data is a data frame or tibble;
  • key is the name of the variable that describes the values of the gathered columns (or the names of the independent variables which become the names of the new columns);
  • value is the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable);

Note that we do not need to specify a range of new columns. The number of new columns is determined by the number of different values in the key variable.

# ?spread # provides documentation

## Data to use: 
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583
# Note that count contains 2 DVs which are described by the values of type. 

## Basics: -----

# spread 2 rows into 2 columns of 1 row:
spread(data = table2, 
       key = type, value = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#> * <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# The same command using the pipe:
table2 %>% 
  spread(key = type, value = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#> * <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# The same shorter: 
table2 %>% 
  spread(type, count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#> * <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583


## Variants: -----

# Use <key><sep><value> to create new column names:
table2 %>% 
  spread(key = type, value = count, sep = ":")
#> # A tibble: 6 x 4
#>   country      year `type:cases` `type:population`
#> * <chr>       <int>        <int>             <int>
#> 1 Afghanistan  1999          745          19987071
#> 2 Afghanistan  2000         2666          20595360
#> 3 Brazil       1999        37737         172006362
#> 4 Brazil       2000        80488         174504898
#> 5 China        1999       212258        1272915272
#> 6 China        2000       213766        1280428583

Practice: Take the 6 x 3 tibble de_2 (from above) and use spread to create a 3 x 3 tibble de_3 that re-creates the original tibble de from it.

## (a) Data from above: 
de_2
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## (b) Using spread to put share by year into 2 columns/variables:
de_3 <- de_2 %>% 
  spread(key = year, value = share) %>%
  rename(share_2013 = `2013`,  # restore original variable names
         share_2017 = `2017`)

de_3
#> # A tibble: 3 x 3
#>   party   share_2013 share_2017
#> * <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

## (c) Comparing de_3 to de: 
de
#> # A tibble: 3 x 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465
all.equal(de_3, de)
#> [1] TRUE

Practice: Moving stocks from wide to long to wide.

The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):

Stock data example showing the start and end prices of the shares of 3 companies on 3 days.
stock d1_start d1_end d2_start d2_end d3_start d3_end
Amada 2.5 3.6 3.5 4.2 4.4 2.8
Betix 3.3 2.9 3.0 2.1 2.3 2.5
Cevis 4.2 4.8 4.6 3.1 3.2 3.7

a. Create a tibble st that contains this data in this (wide) format.

b. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns.

c. Create a (line) graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis).

d. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day.

# library(tidyverse)

## (a) Enter stock data (in wide format) as a tibble:
st <- tribble(
  ~stock, ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,  
  #-----|----------|--------|----------|--------|----------|--------|
  "Amada",   2.5,     3.6,    3.5,       4.2,      4.4,       2.8,            
  "Betix",   3.3,     2.9,    3.0,       2.1,      2.3,       2.5,  
  "Cevis",   4.2,     4.8,    4.6,       3.1,      3.2,       3.7     
)
dim(st)
#> [1] 3 7

## Note data structure: 
## 2 nested factors: day (1 to 3), type (start or end).

## (b) Change from wide to long format 
##     that contains the day (d1, d2, d3) and type (start vs. end) as separate columns:
st_long <- st %>%
  gather(d1_start:d3_end, key = "key", value = "val") %>%
  separate(key, into = c("day", "time")) %>%
  arrange(stock, day, time) # optional: arrange rows
st_long
#> # A tibble: 18 x 4
#>    stock day   time    val
#>    <chr> <chr> <chr> <dbl>
#>  1 Amada d1    end    3.60
#>  2 Amada d1    start  2.50
#>  3 Amada d2    end    4.20
#>  4 Amada d2    start  3.50
#>  5 Amada d3    end    2.80
#>  6 Amada d3    start  4.40
#>  7 Betix d1    end    2.90
#>  8 Betix d1    start  3.30
#>  9 Betix d2    end    2.10
#> 10 Betix d2    start  3.00
#> 11 Betix d3    end    2.50
#> 12 Betix d3    start  2.30
#> 13 Cevis d1    end    4.80
#> 14 Cevis d1    start  4.20
#> 15 Cevis d2    end    3.10
#> 16 Cevis d2    start  4.60
#> 17 Cevis d3    end    3.70
#> 18 Cevis d3    start  3.20

## (c) Plot the end values (on the y-axis) of the 3 stocks over 3 days (x-axis):
st_long %>% 
  filter(time == "end") %>%
  ggplot(aes(x = day, y = val, color = stock, shape = stock)) +
  geom_point(size = 4) + 
  geom_line(aes(group = stock)) +
  ## Pimping plot: 
  labs(title = "End prices of stocks", 
       x = "Day", y = "End price", 
       shape = "Stock:", color = "Stock:") +
  theme_bw()

## (d) Change st_long into a wider format that lists start and end as 2 distinct variables (columns):
st_long %>%
  spread(key = time, value = val) %>%
  mutate(day_nr = parse_integer(str_sub(day, 2, 2))) # optional: get day_nr as integer variable
#> # A tibble: 9 x 5
#>   stock day     end start day_nr
#>   <chr> <chr> <dbl> <dbl>  <int>
#> 1 Amada d1     3.60  2.50      1
#> 2 Amada d2     4.20  3.50      2
#> 3 Amada d3     2.80  4.40      3
#> 4 Betix d1     2.90  3.30      1
#> 5 Betix d2     2.10  3.00      2
#> 6 Betix d3     2.50  2.30      3
#> 7 Cevis d1     4.80  4.20      1
#> 8 Cevis d2     3.10  4.60      2
#> 9 Cevis d3     3.70  3.20      3

More on tidy data

Conclusion

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data
4. Tidy data

[Last update on 2018-07-09 19:37:53 by hn.]


  1. The length and width of a data set are relative terms here: gathering tends to decrease data width by increasing length, spreading tends to decrease data length by increasing width.

  2. Again, the length and width of data sets are relative terms.