Introduction

This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you need to know it, and how to use it, to succeed in this course.

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data
4. Tidy data

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Essential commands | Data science for psychologists
## 2018 06 18
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##

## Preparations: ----- 

library(tidyverse)

## Tibbles: ----- 

# ...

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- 

Tibbles

Whenever working with rectangular data structures – data consisting of multiple cases (rows) and variables (columns) – our first step is to create or transform the data into a tibble (i.e., a simple version of a data frame).

Creating tibbles

Basic commands

There are 3 basic commands for creating tibbles:

  1. as_tibble converts (or coerces) an existing data frame into a tibble.

  2. tibble converts several vectors into (the columns of) a tibble.

  3. tribble converts a table (entered row-by-row) into a tibble.

Check: The 3 commands yield the same type of output (i.e., a tibble), but require different inputs. Ask yourself which kind of input each command takes and how this input needs to be structured and formatted (e.g., with commas).
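
As a minimal sketch of this check, the following builds the same small table with each of the 3 commands. The toy data and the object names (df_in, t1, t2, t3) are made up for illustration and are not part of the course materials:

```r
library(tibble)

# Hypothetical toy data (not part of the course materials):
df_in <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

t1 <- as_tibble(df_in)                  # input: an existing data frame
t2 <- tibble(x = 1:2, y = c("a", "b"))  # input: one vector per column
t3 <- tribble(                          # input: cell values, row by row
  ~x, ~y,
  1L, "a",
  2L, "b"
)

# All 3 results are tibbles with the same column contents:
c(is_tibble(t1), is_tibble(t2), is_tibble(t3))
all.equal(t1, t2)
all.equal(t2, t3)
```

Note how the commas differ: tibble separates whole columns, whereas tribble separates individual cells.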

1. as_tibble

Use as_tibble when the data already exists as a data frame (or matrix):

## Using the data frame `sleep`: ------ 

# ?datasets::sleep # provides background information on the data set.

# Save the sleep data frame as df: 
df <- datasets::sleep

# Convert df into a tibble tb: 
tb <- as_tibble(df)

# Inspect the data frame df: 
dim(df)
#> [1] 20  3
is.data.frame(df)
#> [1] TRUE
head(df)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2
#> 3  -0.2     1  3
#> 4  -1.2     1  4
#> 5  -0.1     1  5
#> 6   3.4     1  6
str(df)
#> 'data.frame':    20 obs. of  3 variables:
#>  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#>  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

# Inspect the tibble tb:
dim(tb)
#> [1] 20  3
is_tibble(tb)  # is.tibble() is deprecated in newer versions of tibble
#> [1] TRUE
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 x 3
#>   extra  group     ID
#>   <dbl> <fctr> <fctr>
#> 1   0.7      1      1
#> 2  -1.6      1      2
#> 3  -0.2      1      3
#> 4  -1.2      1      4
#> 5  -0.1      1      5
#> 6   3.4      1      6
glimpse(tb)
#> Observations: 20
#> Variables: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1....
#> $ group <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID    <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, ...
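
The example above covers the data frame case. As a hedged sketch of the matrix case, the following converts a small matrix with named columns. The matrix m and its column names are hypothetical; column names are supplied here because recent versions of tibble are stricter about matrices without them:

```r
library(tibble)

# Hypothetical 3 x 2 matrix with column names "a" and "b":
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# Convert the matrix into a tibble (one column per matrix column):
tb_m <- as_tibble(m)

is_tibble(tb_m)  # check the type of the result
dim(tb_m)        # same dimensions as the matrix
names(tb_m)      # column names are taken from the matrix
```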

Practice: Convert the data frames datasets::attitude and datasets::iris into tibbles and inspect their dimensions and contents. What types of variables do they contain?

2. tibble

Use tibble when the data to be used appears as a collection of columns. For instance, imagine we have the following information about a family:

Example data of some family.

id  name  age  gender  drives  married_2
1   Adam  46   male    TRUE    Eva
2   Eva   48   female  TRUE    Adam
3   Xaxi  21   female  FALSE   Zenon
4   Yota  19   female  TRUE    NA
5   Zack  17   male    FALSE   NA

One way of viewing this table is as a series of columns. Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).

The tibble command expects that each column of the table is entered as a vector:

## Create a tibble from vectors (column-by-column): 
fm <- tibble(
  id       = c(1, 2, 3, 4, 5), # OR: id = 1:5, 
  name     = c("Adam", "Eva", "Xaxi", "Yota", "Zack"), 
  age      = c(46, 48, 21, 19, 17), 
  gender   = c("male", rep("female", 3), "male"), 
  drives   = c(TRUE, TRUE, FALSE, TRUE, FALSE), 
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
  )

fm  # prints the tibble: 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

Note some details:

  • Each vector is labeled by the variable (column) name, which is not put into quotes;

  • Avoid spaces within variable (column) names (or enclose names in backticks or quotes if you really must use spaces);

  • All vectors need to have the same length;

  • Each vector is of a single type (numeric, character, or Boolean truth values);

  • Consecutive vectors are separated by commas (but there is no comma after the final vector).
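
Two of these details can be tried out directly. The following is only a sketch with made-up data (the objects ok and res are hypothetical): it creates a column whose name contains a space, and it catches the error that tibble raises for vectors of unequal length:

```r
library(tibble)

# A column name with a space must be enclosed (here: in backticks):
ok <- tibble(
  `first name` = c("Adam", "Eva"),
  age          = c(46, 48)
)
names(ok)

# Vectors of unequal length (here: 2 vs. 3 values) raise an error,
# which tryCatch() turns into a short message:
res <- tryCatch(
  tibble(x = 1:2, y = 1:3),
  error = function(e) "error: incompatible lengths"
)
res
```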

When using tibble, later vectors may use the values of earlier vectors:

# Using earlier vectors when defining later ones:
abc <- tibble(
  ltr = LETTERS[1:5],
  num = 1:5,
  l_n = paste(ltr, num, sep = "_"),  # combining ltr with num
  nsq = num^2                        # squaring num
  )

abc  # prints the tibble: 
#> # A tibble: 5 x 4
#>     ltr   num   l_n   nsq
#>   <chr> <int> <chr> <dbl>
#> 1     A     1   A_1     1
#> 2     B     2   B_2     4
#> 3     C     3   C_3     9
#> 4     D     4   D_4    16
#> 5     E     5   E_5    25

Practice: Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble.

3. tribble

Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).

For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it into a tibble:

## Create a tibble from tabular data (row-by-row): 
fm2 <- tribble(
  ~id, ~name, ~age, ~gender, ~drives, ~married_2,   
  #--|------|-----|--------|----------|----------|
  1,  "Adam", 46,  "male",    TRUE,     "Eva",    
  2,  "Eva",  48,  "female",  TRUE,     "Adam",  
  3,  "Xaxi", 21,  "female",  FALSE,    "Zenon",    
  4,  "Yota", 19,  "female",  TRUE,      NA, 
  5,  "Zack", 17,  "male",    FALSE,     NA      )

fm2  # prints the tibble: 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

Note some details:

  • The column names are preceded by ~;

  • Consecutive entries are separated by a comma (but there is no comma after the final entry);

  • The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;

  • The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
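
The last point can be checked on a small example. This is a sketch with a hypothetical tibble tr, in which an NA appears in the first row of a column that otherwise holds character values:

```r
library(tibble)

# Hypothetical data: the spouse column starts with a missing value.
tr <- tribble(
  ~name,  ~spouse, ~age,
  "Yota", NA,      19,
  "Adam", "Eva",   46
)

# The column types follow from the non-missing cell values:
sapply(tr, class)

# The NA in spouse is a missing character value:
is.character(tr$spouse)
is.na(tr$spouse)
```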

Check: If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:

# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE

Practice: Enter the tibble abc by using tribble.

Accessing parts of a tibble

Once we have an R object that is a tibble, we often want to access individual parts of it. We can distinguish between 3 simple cases:

1. Variables (columns)

As each column of a tibble is a vector, obtaining a column amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Get the name column of fm:
fm$name       # by name (with $)
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
fm[["name"]]  # by name (with [[ ]])
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"
fm[[2]]       # by number (with [[ ]])
#> [1] "Adam" "Eva"  "Xaxi" "Yota" "Zack"

# Get the age column of fm: 
fm$age        # by name (with $)
#> [1] 46 48 21 19 17
fm[["age"]]   # by name (with [[ ]])
#> [1] 46 48 21 19 17
fm[[3]]       # by number (with [[ ]])
#> [1] 46 48 21 19 17
#> [1] 46 48 21 19 17

# Note: The following all return the same columns, but as (5 x 1) tibbles rather than as vectors:
fm[ , 2] # yields the name vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack
select(fm, 2) 
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack
select(fm, name)
#> # A tibble: 5 x 1
#>    name
#>   <chr>
#> 1  Adam
#> 2   Eva
#> 3  Xaxi
#> 4  Yota
#> 5  Zack

fm[ , 3] # yields the age vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
select(fm, 3)
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
select(fm, age)
#> # A tibble: 5 x 1
#>     age
#>   <dbl>
#> 1    46
#> 2    48
#> 3    21
#> 4    19
#> 5    17
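
The difference between getting a vector and getting a 1-column tibble matters when a function expects a vector. The following sketch uses a reduced, hypothetical version of the family data (fm_small) and also shows dplyr::pull, which is not used elsewhere in this file but returns a column as a vector:

```r
library(tibble)
library(dplyr)

# Hypothetical reduced version of the family data:
fm_small <- tibble(name = c("Adam", "Eva"), age = c(46, 48))

v      <- fm_small[["age"]]     # a numeric vector
as_tbl <- select(fm_small, age) # a (2 x 1) tibble
p      <- pull(fm_small, age)   # a numeric vector again

is.numeric(v)
is_tibble(as_tbl)
identical(v, p)

mean(v)  # works on the vector; here: 47
```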

Practice: Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.

2. Cases (rows)

Extracting specific rows of a tibble amounts to filtering it and typically yields a smaller tibble (rather than a vector, as a row may contain entries of different types). The best way to filter specific rows is dplyr::filter. However, it is also possible to select rows by logical subsetting (i.e., supplying a condition that evaluates to a vector of Boolean values, one per row) or by row number:

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Filter specific rows (by condition):
filter(fm, id > 2)
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
filter(fm, age < 18)
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm %>% filter(drives == TRUE) 
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>
  
# The same filters by using Boolean vectors (subsetting):
fm[fm$id > 2, ]
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>

# The same filters by providing specific row numbers:
fm[3:5, ]  # getting rows 3 to 5 of fm
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     3  Xaxi    21 female  FALSE     Zenon
#> 2     4  Yota    19 female   TRUE      <NA>
#> 3     5  Zack    17   male  FALSE      <NA>
fm[5, ]    # getting row 5 of fm
#> # A tibble: 1 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     5  Zack    17   male  FALSE      <NA>
fm[c(1, 2, 4), ]  # getting rows 1, 2, and 4 of fm
#> # A tibble: 3 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     4  Yota    19 female   TRUE      <NA>
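
One case in which the approaches above can differ is a condition that evaluates to NA. As a sketch with hypothetical data (the tibble d is made up), filter drops rows for which the condition is NA, and wrapping the condition in which() gives the same behavior for subsetting:

```r
library(tibble)
library(dplyr)

# Hypothetical data with one missing score:
d <- tibble(id = 1:3, score = c(10, NA, 30))

d$score > 5                       # the condition is NA for row 2

by_filter <- filter(d, score > 5) # filter() keeps only rows that are TRUE
by_which  <- d[which(d$score > 5), ]  # which() also ignores the NA

nrow(by_filter)
nrow(by_which)
```

Subsetting with the bare logical vector (without which) treats the NA differently, so it is worth checking for missing values before relying on it.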

Practice: Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?

3. Cells

Accessing the values of individual tibble cells is relatively rare, but can be achieved by

  a. explicitly providing both row number r and column number c (as [r, c]), or by

  b. first extracting the column (as a vector v) and then providing the desired row number r (as v[r]).

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE     Zenon
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# Getting specific cell values:
fm$name[4]  # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2]    # getting the same name by row and column numbers
#> # A tibble: 1 x 1
#>    name
#>   <chr>
#> 1  Yota

# Note: What if we don't know the row number? 
which(fm$name == "Yota") # getting the row number that contains the name "Yota"
#> [1] 4
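
Note that [r, c] returns a (1 x 1) tibble, whereas v[r] returns the bare value. As a sketch with a hypothetical 2-row tibble (cells), double brackets with a row and a column index also return the bare value:

```r
library(tibble)

# Hypothetical reduced data:
cells <- tibble(name = c("Xaxi", "Yota"), age = c(21, 19))

as_tibble_cell <- cells[2, 1]    # a (1 x 1) tibble
as_value       <- cells[[2, 1]]  # the bare value
via_vector     <- cells$name[2]  # the bare value (column first, then row)

is_tibble(as_tibble_cell)
as_value
identical(as_value, via_vector)
```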

In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.

# Checking and changing cell values:

# Check: "Who is Xaxi's spouse?" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 x 1
#>   married_2
#>       <chr>
#> 1     Zenon

# Change: "Zenon" is actually "Zeus" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"

# Check for successful change:
fm
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE      Zeus
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

By contrast, a relatively common task is to check an entire tibble for missing values, count them, or replace them by some other value:

# Checking for, counting, and changing missing values:

fm  # family tibble (defined above): 
#> # A tibble: 5 x 6
#>      id  name   age gender drives married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>     <chr>
#> 1     1  Adam    46   male   TRUE       Eva
#> 2     2   Eva    48 female   TRUE      Adam
#> 3     3  Xaxi    21 female  FALSE      Zeus
#> 4     4  Yota    19 female   TRUE      <NA>
#> 5     5  Zack    17   male  FALSE      <NA>

# (a) Check for missing values:
is.na(fm)       # checks each cell value for being NA
#>         id  name   age gender drives married_2
#> [1,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [2,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [3,] FALSE FALSE FALSE  FALSE  FALSE     FALSE
#> [4,] FALSE FALSE FALSE  FALSE  FALSE      TRUE
#> [5,] FALSE FALSE FALSE  FALSE  FALSE      TRUE

# (b) Count the number of missing values: 
sum(is.na(fm))  # counts missing values (by adding up all TRUE values)
#> [1] 2

# (c) Change all missing values: 
fm[is.na(fm)] <- "A MISSING value!"

# Check for successful change: 
fm
#> # A tibble: 5 x 6
#>      id  name   age gender drives        married_2
#>   <dbl> <chr> <dbl>  <chr>  <lgl>            <chr>
#> 1     1  Adam    46   male   TRUE              Eva
#> 2     2   Eva    48 female   TRUE             Adam
#> 3     3  Xaxi    21 female  FALSE             Zeus
#> 4     4  Yota    19 female   TRUE A MISSING value!
#> 5     5  Zack    17   male  FALSE A MISSING value!
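
Counting missing values per column and expressing them as a percentage follows the same logic. The following sketch uses a hypothetical tibble mv with 8 cells, 3 of which are missing:

```r
library(tibble)

# Hypothetical data with 3 missing values in 8 cells:
mv <- tibble(
  x = c(1, NA, 3, NA),
  y = c("a", "b", NA, "d")
)

n_na   <- sum(is.na(mv))         # total number of missing values: 3
per_col <- colSums(is.na(mv))    # missing values per column: x = 2, y = 1
pc_na  <- mean(is.na(mv)) * 100  # percentage of missing cells: 37.5

n_na
per_col
pc_na
```

Using mean works because TRUE counts as 1 and FALSE as 0, so the mean of the logical matrix is the proportion of TRUE values.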

Practice: Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.

More advanced operations on tibbles are covered in Chapter 5: Data transformation and involve using the dplyr commands arrange, filter, and select.
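
As a brief preview, and only as a sketch, the 3 dplyr commands can be chained with the pipe. The tibble fam below is a reduced copy of the family data from above, and the particular combination of steps is chosen for illustration:

```r
library(tibble)
library(dplyr)

# Reduced copy of the family data (3 of the 6 columns):
fam <- tibble(
  name   = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age    = c(46, 48, 21, 19, 17),
  drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

drivers <- fam %>%
  filter(drives) %>%      # keep only rows of people who drive
  select(name, age) %>%   # keep only the name and age columns
  arrange(age)            # sort rows by age (ascending)

drivers$name  # "Yota" "Adam" "Eva"
```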

More on tibbles

For more details on tibbles,

[Last update on 2018-07-09 19:08:44 by hn.]