Introduction
This file contains essential commands from the chapters of r4ds, with corresponding examples. A command counts as “essential” if you need to know it, and know how to use it, to succeed in this course.
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
| 4. | Tidy data |
Course coordinates
- Course Data Science for Psychologists (ds4psy).
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Spring/summer 2018: Mondays, 13:30–15:00, C511.
- Links to ZeUS and Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Essential commands | Data science for psychologists
## 2018 06 18
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## Preparations: -----
library(tidyverse)
## Tibbles: -----
# ...
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. -----
Tibbles
Whenever working with rectangular data structures – data consisting of multiple cases (rows) and variables (columns) – our first step is to create or transform the data into a tibble (i.e., a simple version of a data frame).
Creating tibbles
Basic commands
There are 3 basic commands for creating tibbles:
- as_tibble converts (or coerces) an existing data frame into a tibble.
- tibble converts several vectors into (the columns of) a tibble.
- tribble converts a table (entered row-by-row) into a tibble.
Check: The 3 commands yield the same type of output (i.e., a tibble), but require different inputs. Ask yourself which kind of input each command takes and how this input needs to be structured and formatted (e.g., with commas).
1. as_tibble
Use as_tibble when the data to be used already is in a data frame (or matrix):
## Using the data frame `sleep`: ------
# ?datasets::sleep # provides background information on the data set.
# Save the sleep data frame as df:
df <- datasets::sleep
# Convert df into a tibble tb:
tb <- as_tibble(df)
# Inspect the data frame df:
dim(df)
#> [1] 20 3
is.data.frame(df)
#> [1] TRUE
head(df)
#> extra group ID
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
str(df)
#> 'data.frame': 20 obs. of 3 variables:
#> $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#> $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
# Inspect the tibble tb:
dim(tb)
#> [1] 20 3
is_tibble(tb) # is_tibble() replaces the deprecated is.tibble()
#> [1] TRUE
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 x 3
#> extra group ID
#> <dbl> <fctr> <fctr>
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
glimpse(tb)
#> Observations: 20
#> Variables: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1....
#> $ group <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, ...
Practice: Convert the data frames datasets::attitude and datasets::iris into tibbles and inspect their dimensions and contents. What types of variables do they contain?
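The "(or matrix)" case mentioned above can be sketched in the same way. This is a minimal illustration, not part of the original course material: the matrix m and its column names are made up, and the column names are set explicitly so that as_tibble does not have to repair missing names.

```r
library(tibble)

# Hypothetical example data: a 2 x 3 integer matrix with column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(NULL, c("a", "b", "c")))

mt <- as_tibble(m)  # convert the matrix into a tibble

is_tibble(mt)  # checks that the result is a tibble
dim(mt)        # number of rows and columns
names(mt)      # column names are taken from the matrix
```

Each matrix column becomes one tibble column of the same type as the matrix.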
2. tibble
Use tibble when the data to be used appears as a collection of columns. For instance, imagine we have the following information about a family:
| id | name | age | gender | drives | married_2 |
|---|---|---|---|---|---|
| 1 | Adam | 46 | male | TRUE | Eva |
| 2 | Eva | 48 | female | TRUE | Adam |
| 3 | Xaxi | 21 | female | FALSE | Zenon |
| 4 | Yota | 19 | female | TRUE | NA |
| 5 | Zack | 17 | male | FALSE | NA |
One way of viewing this table is as a series of columns. Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).
The tibble command expects that each column of the table is entered as a vector:
## Create a tibble from vectors (column-by-column):
fm <- tibble(
id = c(1, 2, 3, 4, 5), # OR: id = 1:5,
name = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
age = c(46, 48, 21, 19, 17),
gender = c("male", rep("female", 3), "male"),
drives = c(TRUE, TRUE, FALSE, TRUE, FALSE),
married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)
fm # prints the tibble:
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- Each vector is labeled by the variable (column) name, which is not put into quotes;
- Avoid spaces within variable (column) names (or enclose names in single quotes if you really must use spaces);
- All vectors need to have the same length;
- Each vector is of a single type (numeric, character, or Boolean truth values);
- Consecutive vectors are separated by commas (but there is no comma after the final vector).
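Two of these details can be tried out in a small sketch. The tibble tb2 and its values are hypothetical. The sketch uses backticks, which are the usual R way of writing names that contain spaces, and it relies on one exception to the same-length rule: a vector of length 1 is recycled to the length of the other columns.

```r
library(tibble)

# Hypothetical example: a column name with a space and a length-1 vector
tb2 <- tibble(
  `first name` = c("Adam", "Eva", "Xaxi"),  # non-syntactic name in backticks
  family = "Smith"                          # length 1: recycled to 3 rows
)

names(tb2)        # the name keeps its space
tb2$`first name`  # backticks are needed again when accessing the column
nrow(tb2)         # number of rows after recycling
```

Vectors of any other mismatched length (e.g., 2 values next to 3) are not recycled and cause an error.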
When using tibble, later vectors may use the values of earlier vectors:
# Using earlier vectors when defining later ones:
abc <- tibble(
ltr = LETTERS[1:5],
num = 1:5,
l_n = paste(ltr, num, sep = "_"), # combining ltr with num
nsq = num^2 # squaring num
)
abc # prints the tibble:
#> # A tibble: 5 x 4
#> ltr num l_n nsq
#> <chr> <int> <chr> <dbl>
#> 1 A 1 A_1 1
#> 2 B 2 B_2 4
#> 3 C 3 C_3 9
#> 4 D 4 D_4 16
#> 5 E 5 E_5 25
Practice: Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble.
3. tribble
Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).
For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it into a tibble:
## Create a tibble from tabular data (row-by-row):
fm2 <- tribble(
~id, ~name, ~age, ~gender, ~drives, ~married_2,
#--|------|-----|--------|----------|----------|
1, "Adam", 46, "male", TRUE, "Eva",
2, "Eva", 48, "female", TRUE, "Adam",
3, "Xaxi", 21, "female", FALSE, "Zenon",
4, "Yota", 19, "female", TRUE, NA,
5, "Zack", 17, "male", FALSE, NA )
fm2 # prints the tibble:
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- The column names are preceded by ~;
- Consecutive entries are separated by a comma (but there is no comma after the final entry);
- The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;
- The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
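The last point (column types follow from the cell values) can be checked with a small sketch. The tibble tr is a made-up example: its column y mixes a character value with NA, and its column z contains only NA values.

```r
library(tibble)

# Hypothetical example: how tribble determines column types
tr <- tribble(
  ~x, ~y,  ~z,
  1,  "a", NA,
  2,  NA,  NA
)

sapply(tr, class)  # class of each column
is.na(tr$y)        # the NA in y is a missing character value
```

A column that contains nothing but NA has no other values to take its type from, so it ends up as a logical column.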
Check: If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:
# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE
Practice: Enter the tibble abc by using tribble.
Accessing parts of a tibble
Once we have an R object that is a tibble, we often want to access individual parts of it. We can distinguish between 3 simple cases:
1. Variables (columns)
As each column of a tibble is a vector, obtaining a column amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Get the name column of fm:
fm$name # by label (with $)
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[["name"]] # by label (with [[ ]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[[2]] # by number (with [[ ]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
# Get the age column of fm:
fm$age # by name (with $)
#> [1] 46 48 21 19 17
fm[["age"]] # by name (with [[ ]])
#> [1] 46 48 21 19 17
fm[[3]] # by number (with [[ ]])
#> [1] 46 48 21 19 17
# Note: The following all yield the same vectors as a tibble:
fm[ , 2] # yields the name vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, 2)
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, name)
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
fm[ , 3] # yields the age vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
select(fm, 3)
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
select(fm, age)
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
Practice: Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.
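One difference between tibbles and ordinary data frames matters when accessing columns by name with $. The following sketch uses a made-up two-column tibble (fm_demo) and its data frame counterpart (df_demo). It also shows dplyr::pull as a further way of extracting a column as a vector.

```r
library(tibble)
library(dplyr)

# Hypothetical example data (a shortened version of the family table)
fm_demo <- tibble(name = c("Adam", "Eva"), age = c(46, 48))
df_demo <- as.data.frame(fm_demo)

df_demo$na  # data frame: "na" partially matches the column "name"
fm_demo$na  # tibble: no partial matching, returns NULL (with a warning)

pull(fm_demo, age)  # dplyr alternative to fm_demo$age or fm_demo[["age"]]
```

The stricter behavior of tibbles makes typos in column names easier to notice.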
2. Cases (rows)
Extracting specific rows of a tibble amounts to filtering a tibble and typically yields smaller tibbles (as a row may contain entries of different types). The best way of filtering specific rows of a tibble is using dplyr::filter. However, it’s also possible to specify the desired rows by subsetting (i.e., specifying a condition that results in a Boolean value) and by row number:
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Filter specific rows (by condition):
filter(fm, id > 2)
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
filter(fm, age < 18)
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm %>% filter(drives == TRUE)
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
# The same filters by using Boolean vectors (subsetting):
fm[fm$id > 2, ]
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
# The same filters by providing specific row numbers:
fm[3:5, ] # getting rows 3 to 5 of fm
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[5, ] # getting row 5 of fm
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[c(1, 2, 4), ] # getting rows 1, 2, and 4 of fm
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
Practice: Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?
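The examples above use one condition at a time. As a minimal sketch (with fm_rows as a made-up, reduced copy of the family data), several conditions can be combined either by listing them in filter or by joining Boolean vectors with &. For selecting rows by number, dplyr::slice is an alternative to indexing with [ , ].

```r
library(tibble)
library(dplyr)

# Hypothetical example data (reduced version of the family table)
fm_rows <- tibble(
  name   = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  gender = c("male", "female", "female", "female", "male"),
  drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Two conditions: female AND drives
by_filter <- filter(fm_rows, gender == "female", drives)
by_subset <- fm_rows[fm_rows$gender == "female" & fm_rows$drives, ]

# Rows 3 to 5 by row number
by_slice <- slice(fm_rows, 3:5)

by_filter
by_subset
by_slice
```

In filter, conditions separated by commas are combined with a logical AND, which is why both versions select the same rows here.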
3. Cells
Accessing the values of individual tibble cells is relatively rare, but can be achieved by
a. explicitly providing both row number `r` and column number `c` (as `[r, c]`), or by
b. first extracting the column (as a vector `v`) and then providing the desired row number `r` (`v[r]`).
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Getting specific cell values:
fm$name[4] # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2] # getting the same name by row and column numbers
#> # A tibble: 1 x 1
#> name
#> <chr>
#> 1 Yota
# Note: What if we don't know the row number?
which(fm$name == "Yota") # getting the row number that contains the name "Yota"
#> [1] 4
In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.
# Checking and changing cell values:
# Check: "Who is Xaxi's spouse?" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 x 1
#> married_2
#> <chr>
#> 1 Zenon
# Change: "Zenon" is actually "Zeus" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"
# Check for successful change:
fm
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
By contrast, a relatively common task is to check an entire tibble for missing values, count them, or replace them by some other value:
# Checking for, counting, and changing missing values:
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# (a) Check for missing values:
is.na(fm) # checks each cell value for being NA
#> id name age gender drives married_2
#> [1,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE FALSE TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE TRUE
# (b) Count the number of missing values:
sum(is.na(fm)) # counts missing values (by adding up all TRUE values)
#> [1] 2
# (c) Change all missing values:
fm[is.na(fm)] <- "A MISSING value!"
# Check for successful change:
fm
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE A MISSING value!
#> 5 5 Zack 17 male FALSE A MISSING value!
Practice: Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.
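Counting missing values per column, or as a proportion of all cells, follows the same logic as sum(is.na(fm)) above. The sketch below uses a made-up, reduced copy of the family data (fm_na). It also shows tidyr::replace_na as one possible alternative for replacing missing values in a specific column. The replacement value "unknown" is an arbitrary choice for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical example data (reduced version of the family table)
fm_na <- tibble(
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, 19, 17),
  married_2 = c("Eva", "Adam", "Zeus", NA, NA)
)

colSums(is.na(fm_na))  # number of missing values per column
mean(is.na(fm_na))     # proportion of missing cells (here: 2 of 15)

# Replace missing values in one column only:
fm_filled <- replace_na(fm_na, list(married_2 = "unknown"))
sum(is.na(fm_filled))  # number of missing values left
```

Naming the column in replace_na keeps the replacement confined to a column of matching type, so numeric columns such as age are left untouched.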
More advanced operations on tibbles are covered in Chapter 5: Data transformation and involve using the dplyr commands arrange, filter, and select.
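As a brief preview, and only a sketch under assumptions (fm_dt is a made-up, reduced copy of the family data), the three dplyr commands can be chained with the pipe operator:

```r
library(tibble)
library(dplyr)

# Hypothetical example data (reduced version of the family table)
fm_dt <- tibble(
  name   = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age    = c(46, 48, 21, 19, 17),
  drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

drivers <- fm_dt %>%
  filter(drives) %>%      # keep only rows of people who drive
  select(name, age) %>%   # keep only the name and age columns
  arrange(age)            # sort rows by increasing age

drivers
```

Each step takes a tibble and returns a tibble, which is what allows the commands to be combined in this way.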
More on tibbles
For more details on tibbles,
- study vignette("tibble") and the documentation for ?tibble;
- study https://tibble.tidyverse.org/ and its examples;
- read Chapter 10: Tibbles and complete its exercises.
Conclusion
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
| 4. | Tidy data |
[Last update on 2018-07-09 19:08:44 by hn.]