Introduction
This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
| 4. | Tidy data |
Course coordinates
- Course Data Science for Psychologists (ds4psy).
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Spring/summer 2018: Mondays, 13:30–15:00, C511.
- Links to ZeUS and Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Essential commmands | Data science for psychologists
## 2018 07 09
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## Preparations: -----
library(tidyverse)
## Tidy data: tidyr -----
# ...
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- Chapter 12: Tidy data teaches a consistent way to organise tabular data. It introduces commands of the tidyr package, which is a core member of the tidyverse.
Tidy data
Tabular data
In R, rectangular data is often organized in tibbles or data frames. Importantly, each column is a vector (of a particular type) that contains the values of a variable. Thus, whereas every column must be of one type, every row can contain values of different variables and types.
The same set of data (values of variables) can be organised in many different ways. For instance, the following tables (or tibbles) all provide the number of TB cases documented by the World Health Organization in 3 countries (Afghanistan, Brazil, and China) in 2 years (1999 and 2000):
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
library(tidyverse)
## Example of the same data organised in 4 different ways:
# ?table1 # for semantics and source of data
tidyr::table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
tidyr::table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
tidyr::table3
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
tidyr::table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
tidyr::table4b
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
tidyr::table5
#> # A tibble: 6 x 4
#> country century year rate
#> * <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583Practice: Recreate the above bar plot using ggplot2 with data = table1.
Defining tidy data
Definition: A tidy dataset conforms to 3 interrelated rules:
Each variable must have its own column.
Each case/observation must have its own row.
Each value must have its own cell.
See http://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure for a graphical illustration of these rules.
The 3 rules defining tidy data are connected, as it is impossible to only satisfy 2 of the 3. This leads to a simpler set of practical instructions for tidying a messy set of data:
- turn each dataset into a tibble.
- put each variable into a column.
Note that we need to interpret the semantics of the variables to understand whether a data set is tidy.
Practice: Which of the data tables in the above example (table1 to table5) are tidy? Why or why not?
Advantages of tidy data
Consistency: Consistent data structures make it easier to learn the tools that work with it because they have an underlying uniformity.
Vectorization: Placing variables in columns allows R’s vectorised nature to shine. For instance, the basic verbs of
dplyr(and most built-in R functions) work with vectors of values. That makes transforming tidy data easy and natural.Matching data and tools: Packages like
dplyr,ggplot2, and many others are designed to work with tidy data.
Commands and examples
We consider 2 pairs of 2 complementary commands as essential:
separatesplits 1 variable into 2 variables;
unitecombines 2 variables into 1 variable;
gathermakes wide data longer (by gathering many variables into 1);
spreadmakes long data wider (by spreading 1 variable into many).
separate is the complement/opposite of unite and spread is the complement/opposite of gather.
Here are some basic examples for using these 4 commands:
1. separate 1 variable into 2
separate splits 1 variable (column) into multiple variables (columns) – at a position where some separator character appears – and is the complement to unite. Using separate requires the following arguments:
- some tibble/data frame
data; - the variable (column)
colto be separated (specified by its name or column number); - the names of the new variables (columns)
intowhichcolis to be split (specified as a character vector); - the separator character
sep(as a character/regular expression).
An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.
# Data to use:
tidyr::table3 # Note that column rate contains 2 numbers, separated by "/".
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
## Basics: -----
# Full separate command:
separate(data = table3, col = rate, into = c("cases", "population"), sep = "/")
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Note that "/" disappears from output tibble.
# Shorter versions of the same command:
separate(table3, rate, c("cases", "population"))
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Using the pipe:
table3 %>%
separate(rate, c("cases", "population"))
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
## Variants: -----
# Specifying the variable to be split (rate) by its column number (3):
table3 %>%
separate(3, c("cases", "population"))
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <chr> <chr>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Not dropping the original rate variable:
table3 %>%
separate(rate, c("cases", "population"), remove = FALSE)
#> # A tibble: 6 x 5
#> country year rate cases population
#> * <chr> <int> <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071 745 19987071
#> 2 Afghanistan 2000 2666/20595360 2666 20595360
#> 3 Brazil 1999 37737/172006362 37737 172006362
#> 4 Brazil 2000 80488/174504898 80488 174504898
#> 5 China 1999 212258/1272915272 212258 1272915272
#> 6 China 2000 213766/1280428583 213766 1280428583The example shows that the argument names (data, col, and into) can be left out (but still require appropriate arguments in the correct order) and sep can be left unspecified when tidyr can make a good guess what the separator character might be.
However, consider the following table6, which is available online and can be read into R via read_csv("http://rpository.com/ds4psy/data/table6.csv"):
## Load data (as comma-separated file):
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
## Alternatively (from local source "data/table6.csv"):
# table6 <- read_csv("data/table6.csv") # from local directory
table6
#> # A tibble: 6 x 2
#> country when_what
#> <chr> <chr>
#> 1 Afghanistan 19_99.745/19987071
#> 2 Afghanistan 20_00.2666/20595360
#> 3 Brazil 19_99.37737/172006362
#> 4 Brazil 20_00.80488/174504898
#> 5 China 19_99.212258/1272915272
#> 6 China 20_00.213766/1280428583Here, the variable when_what contains several plausible separation characters: _, ., and /. Let’s first see what happens when we fail to provide a separating character sep, and then split the variable when_what in three different ways:
# Data to use:
table6 <- read_csv("http://rpository.com/ds4psy/data/table6.csv") # from online source
table6 # Note that column when_what contains several splitting options.
#> # A tibble: 6 x 2
#> country when_what
#> <chr> <chr>
#> 1 Afghanistan 19_99.745/19987071
#> 2 Afghanistan 20_00.2666/20595360
#> 3 Brazil 19_99.37737/172006362
#> 4 Brazil 20_00.80488/174504898
#> 5 China 19_99.212258/1272915272
#> 6 China 20_00.213766/1280428583
# What happens when we do not specify "sep"?
table6 %>%
separate(col = when_what, into = c("var_1", "var_2")) # sep is not provided!
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19 99
#> 2 Afghanistan 20 00
#> 3 Brazil 19 99
#> 4 Brazil 20 00
#> 5 China 19 99
#> 6 China 20 00
# => when_what is split at 1st option (_), but Warning (and loss of data)!
# Specifying different splitting characters:
# (a) split at "_":
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "_") #
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19 99.745/19987071
#> 2 Afghanistan 20 00.2666/20595360
#> 3 Brazil 19 99.37737/172006362
#> 4 Brazil 20 00.80488/174504898
#> 5 China 19 99.212258/1272915272
#> 6 China 20 00.213766/1280428583
# (b) split at "." (specified as a regular expression "\\."):
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "\\.")
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19_99 745/19987071
#> 2 Afghanistan 20_00 2666/20595360
#> 3 Brazil 19_99 37737/172006362
#> 4 Brazil 20_00 80488/174504898
#> 5 China 19_99 212258/1272915272
#> 6 China 20_00 213766/1280428583
# (c) split at "/":
table6 %>%
separate(col = when_what, into = c("var_1", "var_2"), sep = "/")
#> # A tibble: 6 x 3
#> country var_1 var_2
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19_99.745 19987071
#> 2 Afghanistan 20_00.2666 20595360
#> 3 Brazil 19_99.37737 172006362
#> 4 Brazil 20_00.80488 174504898
#> 5 China 19_99.212258 1272915272
#> 6 China 20_00.213766 1280428583Note that using the point or period (.) as a splitting character sep = "." would not work. Instead, we need to use the corresponding regular expression sep = "\\.". (See Chapter 14: Strings for details.)
Practice: Split the when_what variable of table6 3 times to create a tibble table6a that contains 5 variables (columns) and reasonable variable names:
#> # A tibble: 6 x 5
#> country century year cases population
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
2. unite 2 variables into 1
unite combines 2 variables (columns) into 1 variable (column) – adding an optional separator character – and is the complement to separate. Using unite requires the following arguments:
- some tibble/data frame
data; - the name of the new compound variable (column)
col(specified as a character); - the names of the variables (columns) to be combined (specified by their names or column numbers);
- an optional separator character
sep(as a character/regular expression).
An additional argument remove regulates whether the original columns are dropped from the output tibble. By default, remove = TRUE.
# Data to use:
tidyr::table5 # Note that columns 2 and 3 contain 2 values (as characters!) that belong together.
#> # A tibble: 6 x 4
#> country century year rate
#> * <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
## Basics: -----
# Full separate command:
unite(data = table5, col = "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Note that century and year variables disappear from output tibble.
# Shorter versions of the same command:
unite(table5, "yr", century, year, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Using the pipe:
table5 %>%
unite("yr", century, year, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
## Variants: -----
# Providing a different separation character:
table5 %>%
unite("yr", century, year, sep = "<--|-->")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 19<--|-->99 745/19987071
#> 2 Afghanistan 20<--|-->00 2666/20595360
#> 3 Brazil 19<--|-->99 37737/172006362
#> 4 Brazil 20<--|-->00 80488/174504898
#> 5 China 19<--|-->99 212258/1272915272
#> 6 China 20<--|-->00 213766/1280428583
# Specifying the variables to be combined () by their column numbers (2 & 3):
table5 %>%
unite("yr", 2, 3, sep = "")
#> # A tibble: 6 x 3
#> country yr rate
#> * <chr> <chr> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Not dropping the original variables:
table5 %>%
unite("yr", century, year, sep = "", remove = FALSE)
#> # A tibble: 6 x 5
#> country yr century year rate
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 1999 19 99 745/19987071
#> 2 Afghanistan 2000 20 00 2666/20595360
#> 3 Brazil 1999 19 99 37737/172006362
#> 4 Brazil 2000 20 00 80488/174504898
#> 5 China 1999 19 99 212258/1272915272
#> 6 China 2000 20 00 213766/1280428583Practice: Take the data from dplyr::storms and unite the variables year, month, day into 1 variable date.
#> # A tibble: 6 x 11
#> name date hour lat long status category wind pressure ts_diameter
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <ord> <int> <int> <dbl>
#> 1 Amy 1975… 0 27.5 -79.0 tropi… -1 25 1013 NA
#> 2 Amy 1975… 6.00 28.5 -79.0 tropi… -1 25 1013 NA
#> 3 Amy 1975… 12.0 29.5 -79.0 tropi… -1 25 1013 NA
#> 4 Amy 1975… 18.0 30.5 -79.0 tropi… -1 25 1013 NA
#> 5 Amy 1975… 0 31.5 -78.8 tropi… -1 25 1012 NA
#> 6 Amy 1975… 6.00 32.4 -78.7 tropi… -1 25 1012 NA
#> # ... with 1 more variable: hu_diameter <dbl>
Practice: Read the data from read_csv("http://rpository.com/ds4psy/data/table7.csv") into a tibble table7 and inspect its dimension and contents.
Use multiple (4)
separatecommands to splittable7into a tibbletable7awith multiple (5) columns.Use multiple (4)
unitecommands ontable7ato re-create a tibbletable7bthat contains all data in 1 column.
Examples of table7 and possible solutions for table7a and table7b:
#> # A tibble: 6 x 1
#> where_when_what
#> <chr>
#> 1 "Afghanistan@19:99$745\\19987071"
#> 2 "Afghanistan@20:00$2666\\20595360"
#> 3 "Brazil@19:99$37737\\172006362"
#> 4 "Brazil@20:00$80488\\174504898"
#> 5 "China@19:99$212258\\1272915272"
#> 6 "China@20:00$213766\\1280428583"
#> # A tibble: 6 x 5
#> country century year rate population
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745 19987071
#> 2 Afghanistan 20 00 2666 20595360
#> 3 Brazil 19 99 37737 172006362
#> 4 Brazil 20 00 80488 174504898
#> 5 China 19 99 212258 1272915272
#> 6 China 20 00 213766 1280428583
#> # A tibble: 6 x 1
#> where_when_what
#> * <chr>
#> 1 Afghanistan:1999_745/19987071
#> 2 Afghanistan:2000_2666/20595360
#> 3 Brazil:1999_37737/172006362
#> 4 Brazil:2000_80488/174504898
#> 5 China:1999_212258/1272915272
#> 6 China:2000_213766/1280428583
3. gather makes wide data longer
Gathering is the opposite of spreading and used when observations that are distributed over multiple columns should be contained in 1 variable (column). More specifically, gather moves the values of several variables (columns) into 1 column value and describes this value by the value of a new key variable. When gathering more than 2 variables, this reduces the number of columns by increasing the number of rows (i.e., makes a wide data set longer).1
Using gather requires the following arguments:
datais a data frame or tibble;
keyis the name of the variable that describes the values of the gathered columns (or name of the independent variable);
valueis the name of the variable that is contained in the gathered columns (or the name of the dependent variable);
...orvar_x:var_yis a list of variables (columns) to be gathered.
# ?gather # provides documentation
## Data to use:
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
# Note that counts of cases is distributed over 2 variables (columns) for each country.
## Basics: -----
# gather 2 variables into 1 variable:
gather(data = table4a,
key = year, value = cases,
`1999`:`2000`)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# The same command using the pipe:
table4a %>%
gather(key = year, value = cases,
`1999`:`2000`)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
## Variants: -----
# The same command with in different order of arguments:
table4a %>%
gather(`1999`:`2000`, key = year, value = cases)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# The same command specifying the numbers of the columns to gather:
table4a %>%
gather(2:3, key = year, value = cases)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766Note that year is of type character in the above example. If we wanted our key variable to be converted into a number (here: an integer), we can add the optional argument convert = TRUE:
## Default: convert = FALSE:
table4a %>%
gather(key = year, value = cases, `1999`:`2000`, convert = FALSE)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# => year is a character vector.
## Converting year into an integer:
table4a %>%
gather(key = year, value = cases, `1999`:`2000`, convert = TRUE)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <int> <int>
#> 1 Afghanistan 1999 745
#> 2 Brazil 1999 37737
#> 3 China 1999 212258
#> 4 Afghanistan 2000 2666
#> 5 Brazil 2000 80488
#> 6 China 2000 213766
# => year is a vector of integers. Practice: Save the following data as a tibble de and then turn it into tidy data (by using gather to create a single variable share and listing the election year as an additional variable).
| party | share_2013 | share_2017 |
|---|---|---|
| CDU/CSU | 0.415 | 0.330 |
| SPD | 0.257 | 0.205 |
| Others | 0.328 | 0.465 |
## (a) Data saved as a tibble (see above):
de
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
## (b) Converting de into a tidy data table:
de_2 <- de %>%
gather(share_2013:share_2017, key = "election", value = "share") %>%
separate(col = election, into = c("dummy", "year")) %>%
select(year, party, share)
de_2
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.4654. spread makes long data wider
Spreading is the opposite of gathering and used when an observation that should be in 1 row is distributed over multiple rows (in 1 column). More specifically, spread puts the values of several cases (rows) into different variables (columns) of 1 row. When spreading more than 2 rows per case, this decreases the number of rows by increasing the number of columns (i.e., makes a long data set wider).2
Using spread requires the following arguments:
datais a data frame or tibble;
keyis the name of the variable that describes the values of the gathered columns (or the names of the independent variables which become the names of the new columns);
valueis the name of the variable whose values should be spread over multiple columns (or the name of the dependent variable);
Note that we do not need to specify a range of new columns. The number of new columns is determined by the number of different values in the key variable.
# ?spread # provides documentation
## Data to use:
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
# Note that count contains 2 DVs which are described by the values of type.
## Basics: -----
# spread 2 rows into 2 columns of 1 row:
spread(data = table2,
key = type, value = count)
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# The same command using the pipe:
table2 %>%
spread(key = type, value = count)
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# The same shorter:
table2 %>%
spread(type, count)
#> # A tibble: 6 x 4
#> country year cases population
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
## Variants: -----
# Use <key><sep><value> to create new column names:
table2 %>%
spread(key = type, value = count, sep = ":")
#> # A tibble: 6 x 4
#> country year `type:cases` `type:population`
#> * <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583Practice: Take the 6 x 3 tibble de_2 (from above) and use spread to create a 3 x 3 tibble de_3 that re-creates the original tibble de from it.
## (a) Data from above:
de_2
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## (b) Using spread to put share by year into 2 columns/variables:
de_3 <- de_2 %>%
spread(key = year, value = share) %>%
rename(share_2013 = `2013`, # restore original variable names
share_2017 = `2017`)
de_3
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> * <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
## (c) Comparing de_3 to de:
de
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
all.equal(de_3, de)
#> [1] TRUEPractice: Moving stocks from wide to long to wide.
The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):
| stock | d1_start | d1_end | d2_start | d2_end | d3_start | d3_end |
|---|---|---|---|---|---|---|
| Amada | 2.5 | 3.6 | 3.5 | 4.2 | 4.4 | 2.8 |
| Betix | 3.3 | 2.9 | 3.0 | 2.1 | 2.3 | 2.5 |
| Cevis | 4.2 | 4.8 | 4.6 | 3.1 | 3.2 | 3.7 |
a. Create a tibble st that contains this data in this (wide) format.
b. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns.
c. Create a (line) graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis).
d. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day.
# library(tidyverse)
## (a) Enter stock data (in wide format) as a tibble:
st <- tribble(
~stock, ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,
#-----|----------|--------|----------|--------|----------|--------|
"Amada", 2.5, 3.6, 3.5, 4.2, 4.4, 2.8,
"Betix", 3.3, 2.9, 3.0, 2.1, 2.3, 2.5,
"Cevis", 4.2, 4.8, 4.6, 3.1, 3.2, 3.7
)
dim(st)
#> [1] 3 7
## Note data structure:
## 2 nested factors: day (1 to 3), type (start or end).
## (b) Change from wide to long format
## that contains the day (d1, d2, d3) and type (start vs. end) as separate columns:
st_long <- st %>%
gather(d1_start:d3_end, key = "key", value = "val") %>%
separate(key, into = c("day", "time")) %>%
arrange(stock, day, time) # optional: arrange rows
st_long
#> # A tibble: 18 x 4
#> stock day time val
#> <chr> <chr> <chr> <dbl>
#> 1 Amada d1 end 3.60
#> 2 Amada d1 start 2.50
#> 3 Amada d2 end 4.20
#> 4 Amada d2 start 3.50
#> 5 Amada d3 end 2.80
#> 6 Amada d3 start 4.40
#> 7 Betix d1 end 2.90
#> 8 Betix d1 start 3.30
#> 9 Betix d2 end 2.10
#> 10 Betix d2 start 3.00
#> 11 Betix d3 end 2.50
#> 12 Betix d3 start 2.30
#> 13 Cevis d1 end 4.80
#> 14 Cevis d1 start 4.20
#> 15 Cevis d2 end 3.10
#> 16 Cevis d2 start 4.60
#> 17 Cevis d3 end 3.70
#> 18 Cevis d3 start 3.20
## (c) Plot the end values (on the y-axis) of the 3 stocks over 3 days (x-axis):
st_long %>%
filter(time == "end") %>%
ggplot(aes(x = day, y = val, color = stock, shape = stock)) +
geom_point(size = 4) +
geom_line(aes(group = stock)) +
## Pimping plot:
labs(title = "End prices of stocks",
x = "Day", y = "End price",
shape = "Stock:", color = "Stock:") +
theme_bw()
## (d) Change st_long into a wider format that lists start and end as 2 distinct variables (columns):
st_long %>%
spread(key = time, value = val) %>%
mutate(day_nr = parse_integer(str_sub(day, 2, 2))) # optional: get day_nr as integer variable
#> # A tibble: 9 x 5
#> stock day end start day_nr
#> <chr> <chr> <dbl> <dbl> <int>
#> 1 Amada d1 3.60 2.50 1
#> 2 Amada d2 4.20 3.50 2
#> 3 Amada d3 2.80 4.40 3
#> 4 Betix d1 2.90 3.30 1
#> 5 Betix d2 2.10 3.00 2
#> 6 Betix d3 2.50 2.30 3
#> 7 Cevis d1 4.80 4.20 1
#> 8 Cevis d2 3.10 4.60 2
#> 9 Cevis d3 3.70 3.20 3More on tidy data
Study the vignette on
vignette("tidy-data")and the RStudio cheatsheet on Data Import for essentialtidyrcommands.Read Chapter 12: Tidy data and complete its exercises.
For background information on the notion of tidy data, see
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23.
available at http://www.jstatsoft.org/v59/i10/paper.Follow the links on https://tidyr.tidyverse.org. for additional information.
Conclusion
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
| 4. | Tidy data |
[Last update on 2018-07-09 19:37:53 by hn.]