Introduction

This file contains essential commands from Chapter 5: Data transformation of the textbook r4ds and corresponding examples and exercises. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.

All ds4psy essentials so far:

Nr. Topic
0. Syllabus
1. Basic R concepts and commands
2. Visualizing data
3. Transforming data
4. Exploring data (EDA)
+. Datasets

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Transforming data | ds4psy
## 2018 11 19
## ----------------------------

## Preparations: ----------

library(tidyverse)

## 1. Topic: ----------

# etc.

## End of file (eof). ----------  

Visualizing data

In the following, we assume that you have read and worked through Chapter 5: Data transformation.
Based on this background, we examine some essential commands of dplyr in the context of examples. However, the dplyr package extends beyond this introduction – and will appear again later (e.g., in Chapter 13: Relational data).

Essential commands and examples

Data transformation includes re-arranging, selecting, changing, or aggregating data.

Overview

When we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data. Six essential dplyr commands are:

  1. arrange sorts cases (rows);
  2. filter selects cases (rows) by logical conditions;
  3. select selects and reorders variables (columns);
  4. mutate computes new variables (columns) and adds them to existing ones;
  5. summarise collapses multiple values of a variable (rows of a column) to a single one;
  6. group_by changes the unit of aggregation (in combination with mutate and summarise).

Not quite as essential but still useful dplyr commands include:

  • slice selects (ranges of) cases (rows) by number;
  • rename renames variables (columns) and keeps others;
  • transmute computes new variables (columns) and drops existing ones;
  • sample_n and sample_frac draw random samples of cases (rows).

Data

We save the dplyr::starwars data as a tibble sw and use it to illustrate the essential dplyr commands.

library(tidyverse)
sw <- dplyr::starwars

sw  # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg…    183    84 black      light      brown           24   male  
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

Standard questions to ask of every new data file include:

  • What are the dimensions of the data file?

  • How many variables (columns) are there in sw and of which type are they?

  • How many missing (NA) values are there?

## Data:
# sw

## Standard checks: ------ 
dim(sw)     # 87 x 13 variables
#> [1] 87 13
names(sw)   # variable names
#>  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
#>  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
#> [11] "films"      "vehicles"   "starships"
glimpse(sw) #    
#> Observations: 87
#> Variables: 13
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
#> $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
#> $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
#> $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
#> $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

## Missing (NA) values: ------ 
sum(is.na(sw))      # 101 missing values
#> [1] 101
colSums(is.na(sw))  # missing by column (variable) 
#>       name     height       mass hair_color skin_color  eye_color 
#>          0          6         28          5          0          0 
#> birth_year     gender  homeworld    species      films   vehicles 
#>         44          3         10          5          0          0 
#>  starships 
#>          0

1. arrange to sort rows

Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):

# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Ackb…    180    83 none       brown mot… orange          41   male  
#>  2 Adi …    184    50 none       dark       blue            NA   female
#>  3 Anak…    188    84 blond      fair       blue            41.9 male  
#>  4 Arve…     NA    NA brown      fair       brown           NA   male  
#>  5 Ayla…    178    55 none       blue       hazel           48   female
#>  6 Bail…    191    NA black      tan        brown           67   male  
#>  7 Barr…    166    50 black      yellow     blue            40   female
#>  8 BB8       NA    NA none       none       black           NA   none  
#>  9 Ben …    163    65 none       grey, gre… orange          NA   male  
#> 10 Beru…    165    75 brown      light      blue            47   female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# The same command using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  arrange(name) 
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Ackb…    180    83 none       brown mot… orange          41   male  
#>  2 Adi …    184    50 none       dark       blue            NA   female
#>  3 Anak…    188    84 blond      fair       blue            41.9 male  
#>  4 Arve…     NA    NA brown      fair       brown           NA   male  
#>  5 Ayla…    178    55 none       blue       hazel           48   female
#>  6 Bail…    191    NA black      tan        brown           67   male  
#>  7 Barr…    166    50 black      yellow     blue            40   female
#>  8 BB8       NA    NA none       none       black           NA   none  
#>  9 Ben …    163    65 none       grey, gre… orange          NA   male  
#> 10 Beru…    165    75 brown      light      blue            47   female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# Sort rows in descending order:
sw %>% 
  arrange(desc(name)) 
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Zam …    168    55 blonde     fair, gre… yellow            NA female
#>  2 Yoda      66    17 white      green      brown            896 male  
#>  3 Yara…    264    NA none       white      yellow            NA male  
#>  4 Wilh…    180    NA auburn, g… fair       blue              64 male  
#>  5 Wick…     88    20 brown      brown      brown              8 male  
#>  6 Wedg…    170    77 brown      fair       hazel             21 male  
#>  7 Watto    137    NA black      blue, grey yellow            NA male  
#>  8 Wat …    193    48 none       green, gr… unknown           NA male  
#>  9 Tion…    206    80 none       grey       black             NA male  
#> 10 Taun…    213    NA none       grey       black             NA female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# Sort by multiple variables:
sw %>% 
  arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Taun…    213    NA none       grey       black             NA female
#>  2 Shaa…    178    57 none       red, blue… black             NA female
#>  3 Lama…    229    88 none       grey       black             NA male  
#>  4 Tion…    206    80 none       grey       black             NA male  
#>  5 Kit …    196    87 none       green      black             NA male  
#>  6 Plo …    188    80 none       orange     black             22 male  
#>  7 Gree…    173    74 <NA>       green      black             44 male  
#>  8 Nien…    160    68 none       grey       black             NA male  
#>  9 Gasg…    122    NA none       white, bl… black             NA male  
#> 10 BB8       NA    NA none       none       black             NA none  
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

## Note: See 
# ?dplyr::arrange  # for more help and examples.

Note some details:

  • All basic dplyr commands can be called as verb(data, ...) or – using the pipe from magrittr – as data %>% verb(...) (see vignette("magrittr") for details).

  • Variable names are unquoted.

  • The order of variable names (x, y, ...) specifies the order or priority of operations (first by x, then by y, etc.).

Practice: Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA values in sorted variables placed?

2. filter to select rows

Using filter selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE and drops all rows for which the conditions are FALSE or NA.

# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 Dart…    202   136 none       white      yellow          41.9 male  
#>  3 Leia…    150    49 brown      light      brown           19   female
#>  4 Owen…    178   120 brown, gr… light      blue            52   male  
#>  5 Beru…    165    75 brown      light      blue            47   female
#>  6 Bigg…    183    84 black      light      brown           24   male  
#>  7 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#>  8 Anak…    188    84 blond      fair       blue            41.9 male  
#>  9 Wilh…    180    NA auburn, g… fair       blue            64   male  
#> 10 Han …    180    80 brown      fair       brown           29   male  
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# The same command using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  filter(species == "Human")
#> # A tibble: 35 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 Dart…    202   136 none       white      yellow          41.9 male  
#>  3 Leia…    150    49 brown      light      brown           19   female
#>  4 Owen…    178   120 brown, gr… light      blue            52   male  
#>  5 Beru…    165    75 brown      light      blue            47   female
#>  6 Bigg…    183    84 black      light      brown           24   male  
#>  7 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#>  8 Anak…    188    84 blond      fair       blue            41.9 male  
#>  9 Wilh…    180    NA auburn, g… fair       blue            64   male  
#> 10 Han …    180    80 brown      fair       brown           29   male  
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# Filter by multiple (additive) conditions: 
sw %>%
  filter(height > 180, mass <= 75)  # tall and light individuals
#> # A tibble: 3 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Jar …    196    66 none       orange     orange            52 male  
#> 2 Adi …    184    50 none       dark       blue              NA female
#> 3 Wat …    193    48 none       green, gr… unknown           NA male  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# The same command using the logical operator (&): 
sw %>%
  filter(height > 180 & mass <= 75)  # tall and light individuals
#> # A tibble: 3 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Jar …    196    66 none       orange     orange            52 male  
#> 2 Adi …    184    50 none       dark       blue              NA female
#> 3 Wat …    193    48 none       green, gr… unknown           NA male  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter for a range of a specific variable:
sw %>%
  filter(height >= 150, height <= 165)  # (a) using height twice
#> # A tibble: 9 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Leia…    150    49 brown      light      brown             19 female
#> 2 Beru…    165    75 brown      light      blue              47 female
#> 3 Mon …    150    NA auburn     fair       blue              48 female
#> 4 Nien…    160    68 none       grey       black             NA male  
#> 5 Shmi…    163    NA black      fair       brown             72 female
#> 6 Ben …    163    65 none       grey, gre… orange            NA male  
#> 7 Cordé    157    NA brown      light      brown             NA female
#> 8 Dormé    165    NA brown      light      brown             NA female
#> 9 Padm…    165    45 brown      light      brown             46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

sw %>%
  filter(between(height, 150, 165))     # (b) using between(...)
#> # A tibble: 9 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Leia…    150    49 brown      light      brown             19 female
#> 2 Beru…    165    75 brown      light      blue              47 female
#> 3 Mon …    150    NA auburn     fair       blue              48 female
#> 4 Nien…    160    68 none       grey       black             NA male  
#> 5 Shmi…    163    NA black      fair       brown             72 female
#> 6 Ben …    163    65 none       grey, gre… orange            NA male  
#> 7 Cordé    157    NA brown      light      brown             NA female
#> 8 Dormé    165    NA brown      light      brown             NA female
#> 9 Padm…    165    45 brown      light      brown             46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter by multiple (alternative) conditions: 
sw %>%
  filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Chew…    228   112 brown      unknown    blue             200 male  
#> 2 Gree…    173    74 <NA>       green      black             44 male  
#> 3 Yoda      66    17 white      green      brown            896 male  
#> 4 Bossk    190   113 none       green      red               53 male  
#> 5 Rugo…    206    NA none       green      orange            NA male  
#> 6 Kit …    196    87 none       green      black             NA male  
#> 7 Pogg…    183    80 none       green      yellow            NA male  
#> 8 Tarf…    234   136 brown      brown      blue              NA male  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter cases with missing (NA) values on specific variables:
sw %>%
  filter(is.na(gender))
#> # A tibble: 3 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 C-3PO    167    75 <NA>       gold       yellow           112 <NA>  
#> 2 R2-D2     96    32 <NA>       white, bl… red               33 <NA>  
#> 3 R5-D4     97    32 <NA>       white, red red               NA <NA>  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Filter cases with existing (non-NA) values on specific variables:
sw %>%
  filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 Bigg…    183    84 black      light      brown           24   male  
#>  9 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> 10 Anak…    188    84 blond      fair       blue            41.9 male  
#> # ... with 26 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

## Note: See 
# ?dplyr::filter  # for more help and examples.

Note some details:

  • Separating multiple conditions by commas is the same as the logical AND (&).

  • Variable names are unquoted.

  • The comma between conditions or tests (x, y, ...) means the same as & (logical AND), as each test results in a vector of Boolean values.

  • Unlike in base R, rows for which the condition evaluates to NA are dropped.

  • Additional filter functions include near() for testing numerical (near-)identity.

Practice: Use filter on the dplyr::starwars data (sw) to select very diverse or narrow subsets of individuals. For instance,

  • which individual with blond hair and blue eyes has an unknown mass?
  • of which species are individuals that are over 2m tall and have brown hair?
  • which individuals from Tatooine are not male (but may be NA)?
  • which individuals are neither male nor female OR heavier than 130kg?

3. select to select columns

Using select selects variables (columns) by their names or numbers:

# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#>    name               species birth_year gender
#>    <chr>              <chr>        <dbl> <chr> 
#>  1 Luke Skywalker     Human         19   male  
#>  2 C-3PO              Droid        112   <NA>  
#>  3 R2-D2              Droid         33   <NA>  
#>  4 Darth Vader        Human         41.9 male  
#>  5 Leia Organa        Human         19   female
#>  6 Owen Lars          Human         52   male  
#>  7 Beru Whitesun lars Human         47   female
#>  8 R5-D4              Droid         NA   <NA>  
#>  9 Biggs Darklighter  Human         24   male  
#> 10 Obi-Wan Kenobi     Human         57   male  
#> # ... with 77 more rows

# The same when using the pipe:
sw %>%           # Note: %>% is NOT + (used in ggplot) 
  select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#>    name               species birth_year gender
#>    <chr>              <chr>        <dbl> <chr> 
#>  1 Luke Skywalker     Human         19   male  
#>  2 C-3PO              Droid        112   <NA>  
#>  3 R2-D2              Droid         33   <NA>  
#>  4 Darth Vader        Human         41.9 male  
#>  5 Leia Organa        Human         19   female
#>  6 Owen Lars          Human         52   male  
#>  7 Beru Whitesun lars Human         47   female
#>  8 R5-D4              Droid         NA   <NA>  
#>  9 Biggs Darklighter  Human         24   male  
#> 10 Obi-Wan Kenobi     Human         57   male  
#> # ... with 77 more rows

# The same when providing a vector of variable names: 
sw %>%
  select(c(name, species, birth_year, gender)) 
#> # A tibble: 87 x 4
#>    name               species birth_year gender
#>    <chr>              <chr>        <dbl> <chr> 
#>  1 Luke Skywalker     Human         19   male  
#>  2 C-3PO              Droid        112   <NA>  
#>  3 R2-D2              Droid         33   <NA>  
#>  4 Darth Vader        Human         41.9 male  
#>  5 Leia Organa        Human         19   female
#>  6 Owen Lars          Human         52   male  
#>  7 Beru Whitesun lars Human         47   female
#>  8 R5-D4              Droid         NA   <NA>  
#>  9 Biggs Darklighter  Human         24   male  
#> 10 Obi-Wan Kenobi     Human         57   male  
#> # ... with 77 more rows

# The same when providing column numbers:
sw %>%
  select(1, 10, 7, 8) 
#> # A tibble: 87 x 4
#>    name               species birth_year gender
#>    <chr>              <chr>        <dbl> <chr> 
#>  1 Luke Skywalker     Human         19   male  
#>  2 C-3PO              Droid        112   <NA>  
#>  3 R2-D2              Droid         33   <NA>  
#>  4 Darth Vader        Human         41.9 male  
#>  5 Leia Organa        Human         19   female
#>  6 Owen Lars          Human         52   male  
#>  7 Beru Whitesun lars Human         47   female
#>  8 R5-D4              Droid         NA   <NA>  
#>  9 Biggs Darklighter  Human         24   male  
#> 10 Obi-Wan Kenobi     Human         57   male  
#> # ... with 77 more rows

# The same when providing a vector of column numbers: 
sw %>%
  select(c(1, 10, 7, 8)) 
#> # A tibble: 87 x 4
#>    name               species birth_year gender
#>    <chr>              <chr>        <dbl> <chr> 
#>  1 Luke Skywalker     Human         19   male  
#>  2 C-3PO              Droid        112   <NA>  
#>  3 R2-D2              Droid         33   <NA>  
#>  4 Darth Vader        Human         41.9 male  
#>  5 Leia Organa        Human         19   female
#>  6 Owen Lars          Human         52   male  
#>  7 Beru Whitesun lars Human         47   female
#>  8 R5-D4              Droid         NA   <NA>  
#>  9 Biggs Darklighter  Human         24   male  
#> 10 Obi-Wan Kenobi     Human         57   male  
#> # ... with 77 more rows

# Select ranges of variables with ":":
sw %>%
  select(name:mass, films:starships)
#> # A tibble: 87 x 6
#>    name               height  mass films     vehicles  starships
#>    <chr>               <int> <dbl> <list>    <list>    <list>   
#>  1 Luke Skywalker        172    77 <chr [5]> <chr [2]> <chr [2]>
#>  2 C-3PO                 167    75 <chr [6]> <chr [0]> <chr [0]>
#>  3 R2-D2                  96    32 <chr [7]> <chr [0]> <chr [0]>
#>  4 Darth Vader           202   136 <chr [4]> <chr [0]> <chr [1]>
#>  5 Leia Organa           150    49 <chr [5]> <chr [1]> <chr [0]>
#>  6 Owen Lars             178   120 <chr [3]> <chr [0]> <chr [0]>
#>  7 Beru Whitesun lars    165    75 <chr [3]> <chr [0]> <chr [0]>
#>  8 R5-D4                  97    32 <chr [1]> <chr [0]> <chr [0]>
#>  9 Biggs Darklighter     183    84 <chr [1]> <chr [0]> <chr [1]>
#> 10 Obi-Wan Kenobi        182    77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows

# Select to re-order variables (columns) with everything():
sw %>%
  select(species, name, gender, everything())
#> # A tibble: 87 x 13
#>    species name  gender height  mass hair_color skin_color eye_color
#>    <chr>   <chr> <chr>   <int> <dbl> <chr>      <chr>      <chr>    
#>  1 Human   Luke… male      172    77 blond      fair       blue     
#>  2 Droid   C-3PO <NA>      167    75 <NA>       gold       yellow   
#>  3 Droid   R2-D2 <NA>       96    32 <NA>       white, bl… red      
#>  4 Human   Dart… male      202   136 none       white      yellow   
#>  5 Human   Leia… female    150    49 brown      light      brown    
#>  6 Human   Owen… male      178   120 brown, gr… light      blue     
#>  7 Human   Beru… female    165    75 brown      light      blue     
#>  8 Droid   R5-D4 <NA>       97    32 <NA>       white, red red      
#>  9 Human   Bigg… male      183    84 black      light      brown    
#> 10 Human   Obi-… male      182    77 auburn, w… fair       blue-gray
#> # ... with 77 more rows, and 5 more variables: birth_year <dbl>,
#> #   homeworld <chr>, films <list>, vehicles <list>, starships <list>

# Select variables with helper functions:
sw %>%
  select(starts_with("s"))
#> # A tibble: 87 x 3
#>    skin_color  species starships
#>    <chr>       <chr>   <list>   
#>  1 fair        Human   <chr [2]>
#>  2 gold        Droid   <chr [0]>
#>  3 white, blue Droid   <chr [0]>
#>  4 white       Human   <chr [1]>
#>  5 light       Human   <chr [0]>
#>  6 light       Human   <chr [0]>
#>  7 light       Human   <chr [0]>
#>  8 white, red  Droid   <chr [0]>
#>  9 light       Human   <chr [1]>
#> 10 fair        Human   <chr [5]>
#> # ... with 77 more rows

sw %>%
  select(ends_with("s"))
#> # A tibble: 87 x 5
#>     mass species films     vehicles  starships
#>    <dbl> <chr>   <list>    <list>    <list>   
#>  1    77 Human   <chr [5]> <chr [2]> <chr [2]>
#>  2    75 Droid   <chr [6]> <chr [0]> <chr [0]>
#>  3    32 Droid   <chr [7]> <chr [0]> <chr [0]>
#>  4   136 Human   <chr [4]> <chr [0]> <chr [1]>
#>  5    49 Human   <chr [5]> <chr [1]> <chr [0]>
#>  6   120 Human   <chr [3]> <chr [0]> <chr [0]>
#>  7    75 Human   <chr [3]> <chr [0]> <chr [0]>
#>  8    32 Droid   <chr [1]> <chr [0]> <chr [0]>
#>  9    84 Human   <chr [1]> <chr [0]> <chr [1]>
#> 10    77 Human   <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows

sw %>%
  select(contains("_"))
#> # A tibble: 87 x 4
#>    hair_color    skin_color  eye_color birth_year
#>    <chr>         <chr>       <chr>          <dbl>
#>  1 blond         fair        blue            19  
#>  2 <NA>          gold        yellow         112  
#>  3 <NA>          white, blue red             33  
#>  4 none          white       yellow          41.9
#>  5 brown         light       brown           19  
#>  6 brown, grey   light       blue            52  
#>  7 brown         light       blue            47  
#>  8 <NA>          white, red  red             NA  
#>  9 black         light       brown           24  
#> 10 auburn, white fair        blue-gray       57  
#> # ... with 77 more rows

sw %>%
  select(matches("or"))
#> # A tibble: 87 x 4
#>    hair_color    skin_color  eye_color homeworld
#>    <chr>         <chr>       <chr>     <chr>    
#>  1 blond         fair        blue      Tatooine 
#>  2 <NA>          gold        yellow    Tatooine 
#>  3 <NA>          white, blue red       Naboo    
#>  4 none          white       yellow    Tatooine 
#>  5 brown         light       brown     Alderaan 
#>  6 brown, grey   light       blue      Tatooine 
#>  7 brown         light       blue      Tatooine 
#>  8 <NA>          white, red  red       Tatooine 
#>  9 black         light       brown     Tatooine 
#> 10 auburn, white fair        blue-gray Stewjon  
#> # ... with 77 more rows

# Renaming variables:
sw %>%
  rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#>    creature height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke Sk…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO       167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2        96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  
#>  5 Leia Or…    150    49 brown      light      brown           19   female
#>  6 Owen La…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru Wh…    165    75 brown      light      blue            47   female
#>  8 R5-D4        97    32 <NA>       white, red red             NA   <NA>  
#>  9 Biggs D…    183    84 black      light      brown           24   male  
#> 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  
#> # ... with 77 more rows, and 5 more variables: from_planet <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

## Note: See 
# ?dplyr::select  # for more help and examples. 
?dplyr::select_if  # for more help and examples. 

Note some details:

  • select works both by specifying variable (column) names and by specifying column numbers.

  • Variable names are unquoted.

  • The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.

  • Selecting and adding everything() allows re-ordering.

  • Various helper functions (e.g., starts_with, ends_with, contains, matches, num_range) refer to (parts of) variable names.

  • rename renames specified variables (without quotes) and keeps all other variables.

Practice: Use select on the dplyr::starwars data (sw) to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).

4. mutate to compute new variables

Using mutate computes new variables (columns) from scratch or existing ones:

# Preparation: Save only a subset variables of sw as sws:   
sws <- select(sw, name:mass, birth_year:species) 
sws    # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#>    name               height  mass birth_year gender homeworld species
#>    <chr>               <int> <dbl>      <dbl> <chr>  <chr>     <chr>  
#>  1 Luke Skywalker        172    77       19   male   Tatooine  Human  
#>  2 C-3PO                 167    75      112   <NA>   Tatooine  Droid  
#>  3 R2-D2                  96    32       33   <NA>   Naboo     Droid  
#>  4 Darth Vader           202   136       41.9 male   Tatooine  Human  
#>  5 Leia Organa           150    49       19   female Alderaan  Human  
#>  6 Owen Lars             178   120       52   male   Tatooine  Human  
#>  7 Beru Whitesun lars    165    75       47   female Tatooine  Human  
#>  8 R5-D4                  97    32       NA   <NA>   Tatooine  Droid  
#>  9 Biggs Darklighter     183    84       24   male   Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77       57   male   Stewjon   Human  
#> # ... with 77 more rows

# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#>    name height  mass birth_year gender homeworld species    id height_feet
#>    <ch>  <int> <dbl>      <dbl> <chr>  <chr>     <chr>   <int>       <dbl>
#>  1 Luk…    172    77       19   male   Tatooine  Human       1        5.64
#>  2 C-3…    167    75      112   <NA>   Tatooine  Droid       2        5.48
#>  3 R2-…     96    32       33   <NA>   Naboo     Droid       3        3.15
#>  4 Dar…    202   136       41.9 male   Tatooine  Human       4        6.63
#>  5 Lei…    150    49       19   female Alderaan  Human       5        4.92
#>  6 Owe…    178   120       52   male   Tatooine  Human       6        5.84
#>  7 Ber…    165    75       47   female Tatooine  Human       7        5.41
#>  8 R5-…     97    32       NA   <NA>   Tatooine  Droid       8        3.18
#>  9 Big…    183    84       24   male   Tatooine  Human       9        6.00
#> 10 Obi…    182    77       57   male   Stewjon   Human      10        5.97
#> # ... with 77 more rows

# The same using the pipe:
sws %>%
  mutate(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#>    name height  mass birth_year gender homeworld species    id height_feet
#>    <ch>  <int> <dbl>      <dbl> <chr>  <chr>     <chr>   <int>       <dbl>
#>  1 Luk…    172    77       19   male   Tatooine  Human       1        5.64
#>  2 C-3…    167    75      112   <NA>   Tatooine  Droid       2        5.48
#>  3 R2-…     96    32       33   <NA>   Naboo     Droid       3        3.15
#>  4 Dar…    202   136       41.9 male   Tatooine  Human       4        6.63
#>  5 Lei…    150    49       19   female Alderaan  Human       5        4.92
#>  6 Owe…    178   120       52   male   Tatooine  Human       6        5.84
#>  7 Ber…    165    75       47   female Tatooine  Human       7        5.41
#>  8 R5-…     97    32       NA   <NA>   Tatooine  Droid       8        3.18
#>  9 Big…    183    84       24   male   Tatooine  Human       9        6.00
#> 10 Obi…    182    77       57   male   Stewjon   Human      10        5.97
#> # ... with 77 more rows

# Transmute commputes and only keeps new variables:
sws %>%
  transmute(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#>       id height_feet
#>    <int>       <dbl>
#>  1     1        5.64
#>  2     2        5.48
#>  3     3        3.15
#>  4     4        6.63
#>  5     5        4.92
#>  6     6        5.84
#>  7     7        5.41
#>  8     8        3.18
#>  9     9        6.00
#> 10    10        5.97
#> # ... with 77 more rows

# Compute variables based on multiple others (including computed ones):
sws %>%
  mutate(BMI = mass / ((height / 100)  ^ 2),  # compute body mass index (kg/m^2)
         BMI_low  = BMI < 18.5,               # classify low BMI values
         BMI_high = BMI > 30,                 # classify high BMI values
         BMI_norm = !BMI_low & !BMI_high      # classify normal BMI values 
         )
#> # A tibble: 87 x 11
#>    name  height  mass birth_year gender homeworld species   BMI BMI_low
#>    <chr>  <int> <dbl>      <dbl> <chr>  <chr>     <chr>   <dbl> <lgl>  
#>  1 Luke…    172    77       19   male   Tatooine  Human    26.0 FALSE  
#>  2 C-3PO    167    75      112   <NA>   Tatooine  Droid    26.9 FALSE  
#>  3 R2-D2     96    32       33   <NA>   Naboo     Droid    34.7 FALSE  
#>  4 Dart…    202   136       41.9 male   Tatooine  Human    33.3 FALSE  
#>  5 Leia…    150    49       19   female Alderaan  Human    21.8 FALSE  
#>  6 Owen…    178   120       52   male   Tatooine  Human    37.9 FALSE  
#>  7 Beru…    165    75       47   female Tatooine  Human    27.5 FALSE  
#>  8 R5-D4     97    32       NA   <NA>   Tatooine  Droid    34.0 FALSE  
#>  9 Bigg…    183    84       24   male   Tatooine  Human    25.1 FALSE  
#> 10 Obi-…    182    77       57   male   Stewjon   Human    23.2 FALSE  
#> # ... with 77 more rows, and 2 more variables: BMI_high <lgl>,
#> #   BMI_norm <lgl>

## Note: See 
# ?dplyr::mutate  # for more help and examples. 

Note some details:

  • mutate computes new variables (columns) and adds them to existing ones, while transmute drops existing ones.

  • Each mutate command specifies a new variable name (without quotes), followed by = and a rule for computing the new variable from existing ones.

  • Variable names are unquoted.

  • Multiple mutate steps are separated by commas, each of which creates a new variable.

  • See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.

Practice: Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)

5. summarise to compute summaries

summarise computes a function for a specified variable and collapses the values of the specified variable (i.e., the rows of a specified columns) to a single value. It provides many different summary statistics by itself, but is even more useful in combination with group_by (discussed next).

# Summarise allows computing a function for a variable (column): 
summarise(sw, mn_mass = mean(mass, na.rm = TRUE))  # => 97.31 kg 
#> # A tibble: 1 x 1
#>   mn_mass
#>     <dbl>
#> 1    97.3

# The same using the pipe: 
sw %>%
  summarise(mn_mass = mean(mass, na.rm = TRUE))  # => 97.31 kg 
#> # A tibble: 1 x 1
#>   mn_mass
#>     <dbl>
#> 1    97.3

# Multiple summarise steps allow applying 
# different functions for 1 dependent variable: 
sw %>%
  summarise(n_mass = sum(!is.na(mass)), 
            mn_mass = mean(mass, na.rm = TRUE),
            md_mass = median(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE),
            max_mass = max(mass, na.rm = TRUE),
            big_mass = any(mass > 1000)
            )
#> # A tibble: 1 x 6
#>   n_mass mn_mass md_mass sd_mass max_mass big_mass
#>    <int>   <dbl>   <dbl>   <dbl>    <dbl> <lgl>   
#> 1     59    97.3      79    169.     1358 TRUE
            
# Multiple summarise steps also allow applying 
# different functions to different dependent variables: 
sw %>%
  summarise(# Descriptives of height:  
            n_height = sum(!is.na(height)), 
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE), 
            # Descriptives of mass:
            n_mass = sum(!is.na(mass)), 
            mn_mass = mean(mass, na.rm = TRUE),
            sd_mass = sd(mass, na.rm = TRUE),
            # Counts of character variables:
            n_names = n(), 
            n_species = n_distinct(species),
            n_worlds = n_distinct(homeworld)
            )
#> # A tibble: 1 x 9
#>   n_height mn_height sd_height n_mass mn_mass sd_mass n_names n_species
#>      <int>     <dbl>     <dbl>  <int>   <dbl>   <dbl>   <int>     <int>
#> 1       81      174.      34.8     59    97.3    169.      87        38
#> # ... with 1 more variable: n_worlds <int>

## Note: See 
# ?dplyr::summarise  # for more help and examples. 

Note some details:

  • summarise collapses multiple values into one value and returns a new tibble with as many rows as values computed.

  • Each summarise step specifies a new variable name (without quotes), followed by =, and a function for computing the new variable from existing ones.

  • Multiple summarise steps are separated by commas.

  • Variable names are unquoted.

  • See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with summarise.

Practice: Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.

6. group_by to aggregate variables

Using group_by does not change the data, but the unit of aggregation for other commands, which is very useful in combination with mutate and summarise.

# Grouping does not change the data, but lists its groups: 
group_by(sws, species)  # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups:   species [38]
#>    name               height  mass birth_year gender homeworld species
#>    <chr>               <int> <dbl>      <dbl> <chr>  <chr>     <chr>  
#>  1 Luke Skywalker        172    77       19   male   Tatooine  Human  
#>  2 C-3PO                 167    75      112   <NA>   Tatooine  Droid  
#>  3 R2-D2                  96    32       33   <NA>   Naboo     Droid  
#>  4 Darth Vader           202   136       41.9 male   Tatooine  Human  
#>  5 Leia Organa           150    49       19   female Alderaan  Human  
#>  6 Owen Lars             178   120       52   male   Tatooine  Human  
#>  7 Beru Whitesun lars    165    75       47   female Tatooine  Human  
#>  8 R5-D4                  97    32       NA   <NA>   Tatooine  Droid  
#>  9 Biggs Darklighter     183    84       24   male   Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77       57   male   Stewjon   Human  
#> # ... with 77 more rows

# The same using the pipe: 
sws %>%
  group_by(species)  # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups:   species [38]
#>    name               height  mass birth_year gender homeworld species
#>    <chr>               <int> <dbl>      <dbl> <chr>  <chr>     <chr>  
#>  1 Luke Skywalker        172    77       19   male   Tatooine  Human  
#>  2 C-3PO                 167    75      112   <NA>   Tatooine  Droid  
#>  3 R2-D2                  96    32       33   <NA>   Naboo     Droid  
#>  4 Darth Vader           202   136       41.9 male   Tatooine  Human  
#>  5 Leia Organa           150    49       19   female Alderaan  Human  
#>  6 Owen Lars             178   120       52   male   Tatooine  Human  
#>  7 Beru Whitesun lars    165    75       47   female Tatooine  Human  
#>  8 R5-D4                  97    32       NA   <NA>   Tatooine  Droid  
#>  9 Biggs Darklighter     183    84       24   male   Tatooine  Human  
#> 10 Obi-Wan Kenobi        182    77       57   male   Stewjon   Human  
#> # ... with 77 more rows

# group_by is ineffective by itself, but very powerful 
# (a) in combination with `mutate` and 
# (b) in combination with `summarise`. 

# ad (a):
# In combination with mutate and an aggregation function, 
# group_by changes the unit of aggregation:

sws %>%
  mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>%  # aggregates over ALL cases
  group_by(species) %>%
  mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>%  # aggregates over current group (species)
  group_by(gender) %>%
  mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>%  # aggregates over current group (gender)
  group_by(name) %>%
  mutate(mn_height_4 = mean(height, na.rm = TRUE))      # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups:   name [87]
#>    name  height  mass birth_year gender homeworld species mn_height_1
#>    <chr>  <int> <dbl>      <dbl> <chr>  <chr>     <chr>         <dbl>
#>  1 Luke…    172    77       19   male   Tatooine  Human          174.
#>  2 C-3PO    167    75      112   <NA>   Tatooine  Droid          174.
#>  3 R2-D2     96    32       33   <NA>   Naboo     Droid          174.
#>  4 Dart…    202   136       41.9 male   Tatooine  Human          174.
#>  5 Leia…    150    49       19   female Alderaan  Human          174.
#>  6 Owen…    178   120       52   male   Tatooine  Human          174.
#>  7 Beru…    165    75       47   female Tatooine  Human          174.
#>  8 R5-D4     97    32       NA   <NA>   Tatooine  Droid          174.
#>  9 Bigg…    183    84       24   male   Tatooine  Human          174.
#> 10 Obi-…    182    77       57   male   Stewjon   Human          174.
#> # ... with 77 more rows, and 3 more variables: mn_height_2 <dbl>,
#> #   mn_height_3 <dbl>, mn_height_4 <dbl>

# ad (b):
# group_by is particularly useful in combination 
# with summarise:

sws %>%
  group_by(homeworld) %>%
  summarise(count = n(),
            mn_height = mean(height, na.rm = TRUE),
            mn_mass = mean(mass, na.rm = TRUE)
            )
#> # A tibble: 49 x 4
#>    homeworld      count mn_height mn_mass
#>    <chr>          <int>     <dbl>   <dbl>
#>  1 Alderaan           3      176.    64  
#>  2 Aleen Minor        1       79     15  
#>  3 Bespin             1      175     79  
#>  4 Bestine IV         1      180    110  
#>  5 Cato Neimoidia     1      191     90  
#>  6 Cerea              1      198     82  
#>  7 Champala           1      196    NaN  
#>  8 Chandrila          1      150    NaN  
#>  9 Concord Dawn       1      183     79  
#> 10 Corellia           2      175     78.5
#> # ... with 39 more rows

# Note that this pipe returns a new tibble, 
# with 49 rows (= different levels of homeworld) and 
# - 1 column of the group variable (homeworld) and 
# - 3 columns of the 3 newly summarised variables.


# group_by used with multiple variables yields a tibble 
# containing the combination of all variable levels: 
sw %>%
  group_by(hair_color, eye_color)  # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups:   hair_color, eye_color [35]
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke…    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
#>  4 Dart…    202   136 none       white      yellow          41.9 male  
#>  5 Leia…    150    49 brown      light      brown           19   female
#>  6 Owen…    178   120 brown, gr… light      blue            52   male  
#>  7 Beru…    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg…    183    84 black      light      brown           24   male  
#> 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# Counting the frequency of cases in groups:
sw %>%
  group_by(hair_color, eye_color) %>%
  count() %>%
  arrange(desc(n))  
#> # A tibble: 35 x 3
#> # Groups:   hair_color, eye_color [35]
#>    hair_color eye_color     n
#>    <chr>      <chr>     <int>
#>  1 black      brown         9
#>  2 brown      brown         9
#>  3 none       black         9
#>  4 brown      blue          7
#>  5 none       orange        7
#>  6 none       yellow        6
#>  7 blond      blue          3
#>  8 none       blue          3
#>  9 none       red           3
#> 10 black      blue          2
#> # ... with 25 more rows

# The same using summarise:
sw %>%
  group_by(hair_color, eye_color) %>%
  summarise(n = n()) %>%
  arrange(desc(n))  
#> # A tibble: 35 x 3
#> # Groups:   hair_color [13]
#>    hair_color eye_color     n
#>    <chr>      <chr>     <int>
#>  1 black      brown         9
#>  2 brown      brown         9
#>  3 none       black         9
#>  4 brown      blue          7
#>  5 none       orange        7
#>  6 none       yellow        6
#>  7 blond      blue          3
#>  8 none       blue          3
#>  9 none       red           3
#> 10 black      blue          2
#> # ... with 25 more rows

## Note: See 
# ?dplyr::group_by  # for more help and examples. 

Note some details:

  • group_by changes the unit of aggregation for other commands (mutate and summarise).

  • Variable names are unquoted.

  • When using group_by with multiple variables, they are separated by commas.

  • Using group_by with mutate results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, using group_by with summarise results in a new tibble with all combinations of variable levels as its cases (rows).

Powerful pipes

The essential dplyr commands are quite simple by themselves, but form the verbs of a language for basic data manipulation. These commands become particularly powerful when they are combined into pipes (by using the %>% operator, as in the examples above). Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly. For instance, pipes allow selecting and sorting sub-groups of data, computing descriptive statistics (n, mean, median, standard deviation, etc.), and answering detailed questions about specific variables.

Examples

Each of the following questions can be answered by a pipe of dplyr commands:

  • What is the number and mean height and mass of individuals from Tatooine by species and gender?

  • Which humans are more than 5cm taller then the average human overall?

  • Which humans are more than 5cm taller than the average human of their own gender?

Exercises (WPA03)

The following exercises practice the essential dplyr commands and aim to show that show that simple pipes of them can solve quite intriguing puzzles about data.

Exercise 1

Star and R wars

Let’s tackle the universe/tidyverse by uncovering even more facts about the dplyr::starwars dataset. Answer the following questions by using pipes of basic dplyr commands (i.e., arranging, filtering, selecting, grouping, counting, summarizing).

  • Save the tibble dplyr::starwars as sw and report its dimensions.
## Data: ----- 
# ?dplyr::starwars

## (a) Basic data properties: ---- 
sw <- dplyr::starwars
dim(sw)  # => 87 rows (denoting individuals) x 13 columns (variables) 
#> [1] 87 13

Known unknowns

  • How many missing (NA) values does sw contain?

  • Which variable (column) has the most missing values?

  • Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?

## Missing data: ----- 

# How many missing data points?
sum(is.na(sw))  # => 101 missing values.
#> [1] 101

# Which individuals come from an unknown (missing) homeworld 
# but have a known birth_year or mass? 
sw %>% 
  filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Yoda      66    17 white      green      brown            896 male  
#> 2 IG-88    200   140 none       metal      red               15 none  
#> 3 Qui-…    193    89 brown      fair       blue              92 male  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# Which variable (column) has the most missing values?
colSums(is.na(sw))  # => birth_year has 44 missing values
#>       name     height       mass hair_color skin_color  eye_color 
#>          0          6         28          5          0          0 
#> birth_year     gender  homeworld    species      films   vehicles 
#>         44          3         10          5          0          0 
#>  starships 
#>          0
colMeans(is.na(sw)) #    (amounting to 50.1% of all cases). 
#>       name     height       mass hair_color skin_color  eye_color 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 
#> birth_year     gender  homeworld    species      films   vehicles 
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000 
#>  starships 
#> 0.00000000

## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald": 
# sw$hair_color[is.na(sw$hair_color)] <- "bald"

Gender issues

  • How many humans are contained in sw overall and by gender?

  • How many and which individuals in sw are neither male nor female?

  • Of which species in sw exist at least 2 different gender values?

## (c) Gender issues: ----- 

# (+) How many humans are there of each gender?
sw %>% 
  filter(species == "Human") %>%
  group_by(gender) %>%
  count()
#> # A tibble: 2 x 2
#> # Groups:   gender [2]
#>   gender     n
#>   <chr>  <int>
#> 1 female     9
#> 2 male      26

## Answer: 35 Humans in total: 9 females, 26 male.

# (+) How many and which individuals are neither male nor female?
sw %>% 
  filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Jabb…    175  1358 <NA>       green-tan… orange           600 herma…
#> 2 IG-88    200   140 none       metal      red               15 none  
#> 3 BB8       NA    NA none       none       black             NA none  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# (+) Of which species are there at least 2 different gender values?
sw %>%
  group_by(species, gender) %>%
  count() %>%  # table shows species by gender: 
  group_by(species) %>%  # Which species appear more than once in this table? 
  count() %>%
  filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups:   species [5]
#>   species     nn
#>   <chr>    <int>
#> 1 Droid        2
#> 2 Human        2
#> 3 Kaminoan     2
#> 4 Twi'lek      2
#> 5 <NA>         2

# alternative (and shorter) solution:
sw %>%
  group_by(species)%>%
  summarise(n_gender_vals = n_distinct(gender)) %>%
  filter(n_gender_vals >= 2)
#> # A tibble: 5 x 2
#>   species  n_gender_vals
#>   <chr>            <int>
#> 1 Droid                2
#> 2 Human                2
#> 3 Kaminoan             2
#> 4 Twi'lek              2
#> 5 <NA>                 2

Size and mass issues

  • Compute the median, mean, and standard deviation of height for all droids.

  • Compute the average height and mass by species and save the result as h_m.

  • Sort h_m to list the 3 species with the smallest individuals (in terms of mean height).

  • Sort h_m to list the 3 species with the heaviest individuals (in terms of median mass).

## Size and mass issues (group summaries): ----- 

# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
  filter(species == "Droid") %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            md_height = median(height, na.rm = TRUE),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#>       n not_NA_h md_height mn_height sd_height
#>   <int>    <int>     <dbl>     <dbl>     <dbl>
#> 1     5        4       132       140      52.0

# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
  group_by(species) %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            not_NA_m = sum(!is.na(mass)),
            md_mass = median(mass, na.rm = TRUE)
            )
h_m
#> # A tibble: 38 x 6
#>    species       n not_NA_h mn_height not_NA_m md_mass
#>    <chr>     <int>    <int>     <dbl>    <int>   <dbl>
#>  1 Aleena        1        1       79         1    15  
#>  2 Besalisk      1        1      198         1   102  
#>  3 Cerean        1        1      198         1    82  
#>  4 Chagrian      1        1      196         0    NA  
#>  5 Clawdite      1        1      168         1    55  
#>  6 Droid         5        4      140         4    53.5
#>  7 Dug           1        1      112         1    40  
#>  8 Ewok          1        1       88         1    20  
#>  9 Geonosian     1        1      183         1    80  
#> 10 Gungan        3        3      209.        2    74  
#> # ... with 28 more rows

# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height)?
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#>   species            n not_NA_h mn_height not_NA_m md_mass
#>   <chr>          <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Yoda's species     1        1        66        1      17
#> 2 Aleena             1        1        79        1      15
#> 3 Ewok               1        1        88        1      20

# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass)?
h_m %>% arrange(desc(md_mass)) %>%  slice(1:3)
#> # A tibble: 3 x 6
#>   species     n not_NA_h mn_height not_NA_m md_mass
#>   <chr>   <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Hutt        1        1       175        1    1358
#> 2 Kaleesh     1        1       216        1     159
#> 3 Wookiee     2        2       231        2     124

Bonus tasks

  • How many individuals come from the 3 most frequent (known) species?

  • Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?

## Bonus questions: ----- 

# How many individuals come from the 3 most frequent (known) species?
sw %>%
  group_by(species) %>%
  count %>%
  arrange(desc(n)) %>%
  filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups:   species [9]
#>   species      n
#>   <chr>    <int>
#> 1 Human       35
#> 2 Droid        5
#> 3 <NA>         5
#> 4 Gungan       3
#> 5 Kaminoan     2
#> 6 Mirialan     2
#> 7 Twi'lek      2
#> 8 Wookiee      2
#> 9 Zabrak       2

# Which individuals are more than 20% lighter (in terms of mass) 
# than the average mass of individuals of their own homeworld?
sw %>%
  select(name, homeworld, mass) %>%
  group_by(homeworld) %>%
  mutate(n_notNA_mass = sum(!is.na(mass)),  
         mn_mass = mean(mass, na.rm = TRUE),
         lighter = mass < (mn_mass - (.20 * mn_mass))
         ) %>%
  filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups:   homeworld [4]
#>   name          homeworld  mass n_notNA_mass mn_mass lighter
#>   <chr>         <chr>     <dbl>        <int>   <dbl> <lgl>  
#> 1 R2-D2         Naboo        32            6    64.2 TRUE   
#> 2 Leia Organa   Alderaan     49            2    64   TRUE   
#> 3 R5-D4         Tatooine     32            8    85.4 TRUE   
#> 4 Yoda          <NA>         17            3    82   TRUE   
#> 5 Padmé Amidala Naboo        45            6    64.2 TRUE

Exercise 2

Sleeping mammals

The dataset ggplot2::msleep contains a mammals sleep dataset (see ?msleep for details and the definition of variables).

  • Save the data as sp and check the dimensions, variable types, and number of missing values in the dataset.
## Data: 
# ?msleep  # check variables     
sp <- ggplot2::msleep
# Check: 
dim(sp)        # 83 x 11 variables
#> [1] 83 11
glimpse(sp)    # 5 <chr> and 6 <dbl>
#> Observations: 83
#> Variables: 11
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
sum(is.na(sp)) # 136 missing values
#> [1] 136

Arranging and filtering data

Use the dplyr-verbs arrange, group_by, and filter to answer the following questions by creating ordered subsets of the data:

  • Arrange the rows (alphabetically) by vore, order, and name, and report the genus of the top 3 mammals.

  • What is the most common type of vore in the data? How many omnivores are there?

  • What is the most common order in the dataset? Are there more exemplars of the order “Carnivora” or “Primates”?

  • Which 2 mammals of the order “Primates” have the longest and shortest sleep_total times?

# Arranging rows:
sp1 <- sp %>%
  arrange(vore, order, name)
sp1$genus[1:3]  # => top 3: "Vulpes"   "Phoca"    "Acinonyx"
#> [1] "Vulpes"   "Phoca"    "Acinonyx"

# Counting common vores:
# (a) short solution:
sp %>%
  count(vore) %>%
  arrange(desc(n))  
#> # A tibble: 5 x 2
#>   vore        n
#>   <chr>   <int>
#> 1 herbi      32
#> 2 omni       20
#> 3 carni      19
#> 4 <NA>        7
#> 5 insecti     5

# (a) is the same as (b): 
sp %>% 
  group_by(vore) %>%
  tally() %>%
  arrange(desc(n))
#> # A tibble: 5 x 2
#>   vore        n
#>   <chr>   <int>
#> 1 herbi      32
#> 2 omni       20
#> 3 carni      19
#> 4 <NA>        7
#> 5 insecti     5

# (b) is the same as (c):  
sp %>% 
  group_by(vore) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
#> # A tibble: 5 x 2
#>   vore        n
#>   <chr>   <int>
#> 1 herbi      32
#> 2 omni       20
#> 3 carni      19
#> 4 <NA>        7
#> 5 insecti     5

# => 32 herbivores, 20 omnivores

# Counting common orders: 
sp %>% 
  count(order) %>%
  arrange(desc(n)) 
#> # A tibble: 19 x 2
#>    order               n
#>    <chr>           <int>
#>  1 Rodentia           22
#>  2 Carnivora          12
#>  3 Primates           12
#>  4 Artiodactyla        6
#>  5 Soricomorpha        5
#>  6 Cetacea             3
#>  7 Hyracoidea          3
#>  8 Perissodactyla      3
#>  9 Chiroptera          2
#> 10 Cingulata           2
#> 11 Didelphimorphia     2
#> 12 Diprotodontia       2
#> 13 Erinaceomorpha      2
#> 14 Proboscidea         2
#> 15 Afrosoricida        1
#> 16 Lagomorpha          1
#> 17 Monotremata         1
#> 18 Pilosa              1
#> 19 Scandentia          1

# OR: 
sp %>% 
  group_by(order) %>%
  count() %>%
  arrange(desc(n)) 
#> # A tibble: 19 x 2
#> # Groups:   order [19]
#>    order               n
#>    <chr>           <int>
#>  1 Rodentia           22
#>  2 Carnivora          12
#>  3 Primates           12
#>  4 Artiodactyla        6
#>  5 Soricomorpha        5
#>  6 Cetacea             3
#>  7 Hyracoidea          3
#>  8 Perissodactyla      3
#>  9 Chiroptera          2
#> 10 Cingulata           2
#> 11 Didelphimorphia     2
#> 12 Diprotodontia       2
#> 13 Erinaceomorpha      2
#> 14 Proboscidea         2
#> 15 Afrosoricida        1
#> 16 Lagomorpha          1
#> 17 Monotremata         1
#> 18 Pilosa              1
#> 19 Scandentia          1
# => 22 Rodentia, 12 Carnivora = 12 Primates

# Primates with longest and shortest sleep_total times:
sp %>% 
  filter(order == "Primates") %>%
  arrange(desc(sleep_total)) 
#> # A tibble: 12 x 11
#>    name  genus vore  order conservation sleep_total sleep_rem sleep_cycle
#>    <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
#>  1 Owl … Aotus omni  Prim… <NA>                17         1.8      NA    
#>  2 Slow… Nyct… carni Prim… <NA>                11        NA        NA    
#>  3 Potto Pero… omni  Prim… lc                  11        NA        NA    
#>  4 Pata… Eryt… omni  Prim… lc                  10.9       1.1      NA    
#>  5 Maca… Maca… omni  Prim… <NA>                10.1       1.2       0.75 
#>  6 Griv… Cerc… omni  Prim… lc                  10         0.7      NA    
#>  7 Gala… Gala… omni  Prim… <NA>                 9.8       1.1       0.55 
#>  8 Chim… Pan   omni  Prim… <NA>                 9.7       1.4       1.42 
#>  9 Squi… Saim… omni  Prim… <NA>                 9.6       1.4      NA    
#> 10 Mong… Lemur herbi Prim… vu                   9.5       0.9      NA    
#> 11 Babo… Papio omni  Prim… <NA>                 9.4       1         0.667
#> 12 Human Homo  omni  Prim… <NA>                 8         1.9       1.5  
#> # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
# => max: Owl monkey: 17 hours, min: Human 8 hours.

Computing new variables

Solve the following tasks by mutate, group_by, and summarise:

  • Compute a variable sleep_awake_sum that adds the sleep_total time and the awake time of each mammal. What result do you expect and get?

  • Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?

  • What is the minimum, average (mean), and maximum sleep cycle length for each vore? (Hint: First group the data by group_by, then use summarise on the sleep_cycle variable, but also count the number of NA values for each vore. When computing grouped summaries, NA values can be removed by na.rm = TRUE.)

  • Replace your summarise verb in the previous task by mutate. What do you get as a result? (Hint: The last two tasks illustrate the difference between mutate and grouped mutate commands.)

# Computing `sleep_awake_sum`: 
sp2 <- sp %>%
  mutate(sleep_awake_sum = sleep_total + awake)
sp2$sleep_awake_sum  # => all 24 hours 
#>  [1] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [12] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [23] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.05 24.00 24.00
#> [34] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [45] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [56] 24.00 24.00 24.00 24.00 24.05 24.00 24.00 24.00 24.00 24.00 24.00
#> [67] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#>  [ reached getOption("max.print") -- omitted 8 entries ]

# Computing brain to body ratios:
sp3 <- sp %>%
  mutate(brain_to_body_wt = brainwt / bodywt) %>%
  arrange(brain_to_body_wt)

sp3$name[1] # => smallest ratio: Cow
#> [1] "Cow"

sp3 %>% 
  arrange(desc(brain_to_body_wt))
#> # A tibble: 83 x 12
#>    name  genus vore  order conservation sleep_total sleep_rem sleep_cycle
#>    <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
#>  1 Thir… Sper… herbi Rode… lc                  13.8       3.4       0.217
#>  2 Owl … Aotus omni  Prim… <NA>                17         1.8      NA    
#>  3 Less… Cryp… omni  Sori… lc                   9.1       1.4       0.15 
#>  4 Squi… Saim… omni  Prim… <NA>                 9.6       1.4      NA    
#>  5 Maca… Maca… omni  Prim… <NA>                10.1       1.2       0.75 
#>  6 Litt… Myot… inse… Chir… <NA>                19.9       2         0.2  
#>  7 Gala… Gala… omni  Prim… <NA>                 9.8       1.1       0.55 
#>  8 Mole… Spal… <NA>  Rode… <NA>                10.6       2.4      NA    
#>  9 Tree… Tupa… omni  Scan… <NA>                 8.9       2.6       0.233
#> 10 Human Homo  omni  Prim… <NA>                 8         1.9       1.5  
#> # ... with 73 more rows, and 4 more variables: awake <dbl>, brainwt <dbl>,
#> #   bodywt <dbl>, brain_to_body_wt <dbl>
# => largest ratio: Thirteen-lined ground squirrel, 
#    9 mammals have larger brain_to_body_wt than Human.

# Computing sleep cycle length for each `vore`: 
sp %>%
  group_by(vore) %>%
  summarise(n = n(),
            n_NA = sum(is.na(sleep_cycle)), 
            non_NA = sum(!is.na(sleep_cycle)),
            min_cyc = min(sleep_cycle, na.rm = TRUE),
            mn_cyc = mean(sleep_cycle, na.rm = TRUE),
            max_cyc = max(sleep_cycle, na.rm = TRUE)
            )
#> # A tibble: 5 x 7
#>   vore        n  n_NA non_NA min_cyc mn_cyc max_cyc
#>   <chr>   <int> <int>  <int>   <dbl>  <dbl>   <dbl>
#> 1 carni      19    14      5   0.333  0.373   0.417
#> 2 herbi      32    20     12   0.117  0.418   1    
#> 3 insecti     5     2      3   0.117  0.161   0.2  
#> 4 omni       20     9     11   0.133  0.592   1.5  
#> 5 <NA>        7     6      1   0.183  0.183   0.183

# Replacing summarise by mutate: 
sp4 <- sp %>%
  group_by(vore) %>%
  mutate(n = n(),
         n_NA = sum(is.na(sleep_cycle)), 
         non_NA = sum(!is.na(sleep_cycle)),
         min_cyc = min(sleep_cycle, na.rm = TRUE),
         mn_cyc = mean(sleep_cycle, na.rm = TRUE),
         max_cyc = max(sleep_cycle, na.rm = TRUE)
  )

# => A tibble that contains grouped summaries as 6 new variables: 
sp4 %>% 
  select(vore, name, sleep_cycle, n:max_cyc) %>%
  arrange(vore)
#> # A tibble: 83 x 9
#> # Groups:   vore [5]
#>    vore  name        sleep_cycle     n  n_NA non_NA min_cyc mn_cyc max_cyc
#>    <chr> <chr>             <dbl> <int> <int>  <int>   <dbl>  <dbl>   <dbl>
#>  1 carni Cheetah          NA        19    14      5   0.333  0.373   0.417
#>  2 carni Northern f…       0.383    19    14      5   0.333  0.373   0.417
#>  3 carni Dog               0.333    19    14      5   0.333  0.373   0.417
#>  4 carni Long-nosed…       0.383    19    14      5   0.333  0.373   0.417
#>  5 carni Domestic c…       0.417    19    14      5   0.333  0.373   0.417
#>  6 carni Pilot whale      NA        19    14      5   0.333  0.373   0.417
#>  7 carni Gray seal        NA        19    14      5   0.333  0.373   0.417
#>  8 carni Thick-tail…      NA        19    14      5   0.333  0.373   0.417
#>  9 carni Slow loris       NA        19    14      5   0.333  0.373   0.417
#> 10 carni Northern g…      NA        19    14      5   0.333  0.373   0.417
#> # ... with 73 more rows

Exercise 3

Outliers

This exercise examines different possibilities for defining outliers and uses a generated dataset (entitled out.csv and available at http://rpository.com/ds4psy/data/out.csv) to illustate and compare them. It also helps to further disentangle the difference between mutate and grouped mutate commands.

Data on outliers

Use the following read_csv() command to obtain and load the data into an R object entitled outliers:

## Load data (as comma-separated file): 
outliers <- read_csv("http://rpository.com/ds4psy/data/out.csv")  # from online source

## Alternatively (from local source): 
# outliers <- read_csv("out.csv")  # from current directory

Not all outliers are alike

An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., 2 standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.

Basic tasks

  • Save the data into a tibble outliers and report its number of observations and variables, and their types.

  • How many missing data values are there in outliers?

  • What is the gender (or sex) distribution in this sample?

  • Create a plot that shows the distribution of height values for each gender.

# Load and inspect data:
# outliers <- read_csv("out.csv") # read in csv-file
# outliers <- as_tibble(data)   # if data is not already a tibble
dim(outliers)  # => 1000 observations (rows) x 3 variables (columns)
#> [1] 1000    3

# Missing data points: 
sum(is.na(outliers))  # => 18 missing values
#> [1] 18

# Gender distribution: 
outliers %>% 
  group_by(sex) %>% 
  count()
#> # A tibble: 2 x 2
#> # Groups:   sex [2]
#>   sex        n
#>   <chr>  <int>
#> 1 female   507
#> 2 male     493
# => 50.7% females, 49.3% males.

# Distributions of `height` as density plot: 
ggplot(outliers, aes(x = height)) +
  geom_density(fill = "gold", alpha = 2/3) +
  geom_density(aes(fill = sex), alpha = 2/5) +
  labs(title = "Distribution of heights overall and by gender", 
       fill = "Gender") + 
  # scale_fill_manual(values = c("firebrick", "steelblue3")) +
  scale_fill_brewer(palette = "Set1") +  # using Brewer palette
  theme_bw()


# Note: To avoid the warning about removing 18 cases with NA-values, 
#       we could first filter out those cases:
# non_NA_data <- filter(outliers, !is.na(height))

# Alternative solution as 2 histograms: 
ggplot(outliers) +
  facet_wrap(~sex) + 
  geom_histogram(aes(x = height, fill = sex), binwidth = 5, color = "grey10") +
  labs(title = "Distribution of heights by gender",
       x = "Height", y = "Frequency") +
  scale_fill_brewer(name = "Gender:", palette = "Set1") +  # using Brewer palette
  # scale_fill_manual(name = "Gender:", values = c("firebrick", "steelblue3")) + 
  theme_bw()

Defining different outliers

Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:

  1. outliers relative to the height of the overall sample (i.e., individuals with height values deviating more than 2 SD from the overall mean of height);

  2. outliers relative to the height of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with height values deviating more than 2 SD from the mean height of their own gender).

Hints: As both variable signal whether or not someone is an outlier they should be defined as logicals (being either TRUE or FALSE) and added as new columns to data (via appropriate mutate commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is provided by abs(x - y).

Relative outliers

Now use the 2 new outlier variables to define (or filter) 2 subsets of the data that contain 2 subgroups of people:

  1. out_1: Individuals (females and males) with height values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in outliers?

  2. out_2: Individuals (females and males) with height values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are in outliers?

## Defining different outliers: ----- 

# Included in data_out (below), but also possible to do separately:  
# Compute the number, means and SD of height values in 2 ways: 

# 1. overall: 
outliers %>%
  summarise(n = n(),
            n_not_NA = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 4
#>       n n_not_NA mn_height sd_height
#>   <int>    <int>     <dbl>     <dbl>
#> 1  1000      982      175.      11.3

# 2. by gender:
outliers %>%
  group_by(sex) %>%
  summarise(n = n(),
            n_not_NA = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 2 x 5
#>   sex        n n_not_NA mn_height sd_height
#>   <chr>  <int>    <int>     <dbl>     <dbl>
#> 1 female   507      501      169.      8.19
#> 2 male     493      481      181.     11.0


# Detecting and marking outliers (by logical variables): 
# Compute the means, SDs, and corresponding outliers in 2 ways:
crit <- 2  # criterion value for detecting outliers (in SD units)

data_out <- outliers %>%      
  # 1. Compute means, SD, and outliers for overall sample: 
  mutate(mn_height  = mean(height, na.rm = TRUE),  
         sd_height  = sd(height, na.rm = TRUE),
         out_height = abs(height - mn_height) > (crit * sd_height)) %>%
  group_by(sex) %>%       
  # 2. Compute same metrics for subgroups (by sex):
  mutate(mn_sex_height  = mean(height, na.rm = TRUE), 
         sd_sex_height  = sd(height, na.rm = TRUE),
         out_sex_height = abs(height - mn_sex_height) > (crit * sd_sex_height))

knitr::kable(head(data_out))
id sex height mn_height sd_height out_height mn_sex_height sd_sex_height out_sex_height
nr.1 female 164 174.7006 11.26015 FALSE 169.0679 8.193619 FALSE
nr.2 male 178 174.7006 11.26015 FALSE 180.5676 11.026677 FALSE
nr.3 female 177 174.7006 11.26015 FALSE 169.0679 8.193619 FALSE
nr.4 male 188 174.7006 11.26015 FALSE 180.5676 11.026677 FALSE
nr.5 male 193 174.7006 11.26015 FALSE 180.5676 11.026677 FALSE
nr.6 female 168 174.7006 11.26015 FALSE 169.0679 8.193619 FALSE

## Relative outliers: ----- 

# Filter specific combinations of outliers:

# 1. Outliers relative to the entire population AND to their own gender: 
out_1 <- data_out %>%
  filter(out_height & out_sex_height) %>%
  arrange(sex, height)

nrow(out_1) # => 21 individuals. 
#> [1] 21

# 2. Outliers relative to their own gender, but NOT relative to the entire population:
out_2 <- data_out %>%
  filter(!out_height & out_sex_height) %>%
  arrange(sex, height)  

nrow(out_2) # => 24 individuals.
#> [1] 24

Bonus plots

  • Visualize the raw values and distributions of height for both types of outliers (out_1 and out_2) in 2 separate plots.

  • Interpret both plots by describing the height and sex combination of the individuals shown in each plot.

# Visualization and interpretation of both types of outliers: 

# 1. Showing out_1: 
ggplot(out_1, aes(x = sex, y = height)) +
  geom_violin(aes(fill = sex)) + 
  geom_jitter(size = 4, alpha = 2/3) + 
  scale_fill_manual(values = c("firebrick", "steelblue3")) +
  labs(title = "Outliers relative to both overall sample and gender", 
       x = "Gender", y = "Height (in cm)", 
       fill = "Gender:") +
  theme_bw()


# Interpretation: 
# `out_1` contains mostly short women (except for 1 tall woman) 
#  and mostly tall men (except for 2 short men). 

# 2. Showing out_2: 
ggplot(out_2, aes(x = sex, y = height)) +
  geom_violin(aes(fill = sex)) + 
  geom_jitter(size = 4, alpha = 2/3) + 
  scale_fill_manual(values = c("firebrick", "steelblue3")) +
  labs(title = "Outliers relative to gender but not overall sample", 
       x = "Gender", y = "Height (in cm)", 
       fill = "Gender:") +
  theme_bw()


# Interpretation: 
# `out_2` contains individuals which are either tall women or short men.

Exercise 4

In Exercise 6 of WPA01 and Exercise 5 of WPA02 you used the p_info data (available at http://rpository.com/ds4psy/data/posPsy_participants.csv) from

  • Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R. and Schüz, B. (2018). Data from, ‘Web-based positive psychology interventions: A reexamination of effectiveness’. Journal of Open Psychology Data, 6: 1. DOI: https://doi.org/10.5334/jopd.35

to explore the participant information and create some corresponding plots.

library(readr)

# Read data (from online source):
p_info <- read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")

# p_info
dim(p_info)      # 295 rows, 6 columns
#> [1] 295   6

Answer the same questions as in those exercises by transforming your earlier base R commands and graphs (created by using ggplot2 commands) into pipes of dplyr commands that provide the same information.

From WPA01: Exercise 6

Questions from Exercise 6 of WPA01:

Examine the participant information in p_info by describing each of its variables:

  1. How many individuals are contained in the dataset?
  2. What percentage of them is female (i.e., has a sex value of 1)?
  3. How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?
  4. What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?
  5. What is the age range (min to max) of participants? What is the average (mean and median) age?
  6. Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as a below-average income (i.e., an income value of 1)?
# 1. How many individuals are contained in the dataset? 
N <- nrow(p_info) # OR 
N  # 295
#> [1] 295

# Note: 
p_info %>% count()
#> # A tibble: 1 x 1
#>       n
#>   <int>
#> 1   295
# would yield a tibble (with only 1 element: 295).

# 2. What percentage of them is female (i.e., has a `sex` value of 1)?
p_info %>%
  group_by(sex) %>% 
  summarise(n_sex = n(),            # number 
            pc_sex = n_sex/N * 100  # percentage
            )
#> # A tibble: 2 x 3
#>     sex n_sex pc_sex
#>   <int> <int>  <dbl>
#> 1     1   251   85.1
#> 2     2    44   14.9

# 3. How many participants were in one of the 3 treatment groups (i.e., have an `intervention` value of 1, 2, or 3)? 
p_info %>%
  filter(intervention < 4) %>%
  count()  
#> # A tibble: 1 x 1
#>       n
#>   <int>
#> 1   222
# => 222 individuals

# OR: 
t_iv <- p_info %>%
  group_by(intervention) %>% 
  summarise(n_iv = n(),           # number 
            pc_iv = n_iv/N * 100  # percentage
            ) 
t_iv
#> # A tibble: 4 x 3
#>   intervention  n_iv pc_iv
#>          <int> <int> <dbl>
#> 1            1    72  24.4
#> 2            2    76  25.8
#> 3            3    74  25.1
#> 4            4    73  24.7
sum(t_iv$n_iv[1:3]) 
#> [1] 222
# => 222 in elements 1:3 of vector n_iv 


# 4. What is the participants' mean education level? 
p_info %>%
  summarise(mn_edu = mean(educ))
#> # A tibble: 1 x 1
#>   mn_edu
#>    <dbl>
#> 1   3.98

# What percentage has a university degree (i.e., an `educ` value of at least 4)?
p_info %>%
  group_by(educ) %>% 
  summarise(n_edu = n(),            # number 
            pc_edu = n_edu/N * 100  # percentage
            ) 
#> # A tibble: 5 x 3
#>    educ n_edu pc_edu
#>   <int> <int>  <dbl>
#> 1     1    14   4.75
#> 2     2    21   7.12
#> 3     3    39  13.2 
#> 4     4   104  35.3 
#> 5     5   117  39.7

# 5. What is the age range (`min` to `max`) of participants? 
#    What is the average (mean and median) age?
p_info %>%
  summarise(n = n(),
            min_age = min(age),
            mn_age = mean(age),
            max_age = max(age)) 
#> # A tibble: 1 x 4
#>       n min_age mn_age max_age
#>   <int>   <dbl>  <dbl>   <dbl>
#> 1   295      18   43.8      83

# 6. Describe the range of `income` levels present in this sample of participants. 
#    What percentage of participants self-identifies as a below-average income 
#    (i.e., an `income` value of 1)?
p_info %>%
  group_by(income) %>%
  summarise(n_income = n(),
            pc_income = n_income/N * 100) 
#> # A tibble: 3 x 3
#>   income n_income pc_income
#>    <int>    <int>     <dbl>
#> 1      1       73      24.7
#> 2      2      136      46.1
#> 3      3       86      29.2

From WPA02: Exercise 5

Questions from Exercise 5 of WPA02

Use the p_info data to create some plots that descripte the sample of participants:

  • A histogram that shows the distribution of participant age in 3 ways:
    • overall,
    • separately for each sex, and
    • separately for each intervention.

When using dplyr instead of ggplot2, we can replace the information contained in the histogram by a table of descriptives:

# Age distribution (as a table): 
age_overall <- p_info %>%
  summarise(n = n(),              # number 
            min_age = min(age),   # minimum
            mn_age = mean(age),   # mean
            md_age = median(age), # median           
            sd_age = sd(age),     # standard deviation
            max_age = max(age)    # maximum
  )
age_overall
#> # A tibble: 1 x 6
#>       n min_age mn_age md_age sd_age max_age
#>   <int>   <dbl>  <dbl>  <int>  <dbl>   <dbl>
#> 1   295      18   43.8     44   12.4      83

# Age distribution by sex (as a table): 
age_by_sex <- p_info %>%
  group_by(sex) %>% 
  summarise(n = n(),              # number 
            min_age = min(age),   # minimum
            mn_age = mean(age),   # mean
            md_age = median(age), # median           
            sd_age = sd(age),     # standard deviation
            max_age = max(age)    # maximum
  )
age_by_sex
#> # A tibble: 2 x 7
#>     sex     n min_age mn_age md_age sd_age max_age
#>   <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1     1   251      19   43.9     44   12.1      83
#> 2     2    44      18   43.0     44   14.1      71

# Age distribution by intervention (as a table): 
age_by_iv <- p_info %>%
  group_by(intervention) %>% 
  summarise(n = n(),              # number 
            min_age = min(age),   # minimum
            mn_age = mean(age),   # mean
            md_age = median(age), # median           
            sd_age = sd(age),     # standard deviation
            max_age = max(age)    # maximum
  )
age_by_iv
#> # A tibble: 4 x 7
#>   intervention     n min_age mn_age md_age sd_age max_age
#>          <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1            1    72      22   44.6   45     12.1      71
#> 2            2    76      18   45.4   45.5   12.5      83
#> 3            3    74      19   43.3   44     12.2      71
#> 4            4    73      19   41.7   40     12.8      75
  • A bar plot that
    • shows how many participants took part in each intervention; or
    • shows how many participants of each sex took part in each intervention.

Using dplyr instead of ggplot2:

# Number of participants per intervention:
# (was contained in age_by_iv above): 
age_by_iv %>%
  select(intervention, n)
#> # A tibble: 4 x 2
#>   intervention     n
#>          <int> <int>
#> 1            1    72
#> 2            2    76
#> 3            3    74
#> 4            4    73

# N and percentage by intervention and sex:
p_info %>%
  group_by(intervention, sex) %>% 
  summarise(n_iv_sex  = n(),              # number 
            pc_iv_sex = n_iv_sex/N * 100  # percentage 
            ) 
#> # A tibble: 8 x 4
#> # Groups:   intervention [?]
#>   intervention   sex n_iv_sex pc_iv_sex
#>          <int> <int>    <int>     <dbl>
#> 1            1     1       62     21.0 
#> 2            1     2       10      3.39
#> 3            2     1       66     22.4 
#> 4            2     2       10      3.39
#> 5            3     1       62     21.0 
#> 6            3     2       12      4.07
#> 7            4     1       61     20.7 
#> 8            4     2       12      4.07

More on data transformation

For more details on dplyr,

Conclusion

All ds4psy essentials so far:

Nr. Topic
0. Syllabus
1. Basic R concepts and commands
2. Visualizing data
3. Transforming data
4. Exploring data (EDA)
+. Datasets

[Last update on 2018-11-27 22:57:45 by hn.]