Introduction
This file contains essential commands from Chapter 5: Data transformation of the textbook r4ds and corresponding examples and exercises. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials so far:
Nr. | Topic |
---|---|
0. | Syllabus |
1. | Basic R concepts and commands |
2. | Visualizing data |
3. | Transforming data |
4. | Exploring data (EDA) |
+. | Datasets |
Course coordinates
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Winter 2018/2019: Mondays, 13:30–15:00, C511.
- Links to current course syllabus | ZeUS | Ilias
Preparations
Create an R script (.R
) or an R-Markdown file (.Rmd
) and load the R packages of the tidyverse
. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Transforming data | ds4psy
## 2018 11 19
## ----------------------------
## Preparations: ----------
library(tidyverse)
## 1. Topic: ----------
# etc.
## End of file (eof). ----------
Visualizing data
In the following, we assume that you have read and worked through Chapter 5: Data transformation.
Based on this background, we examine some essential commands of dplyr
in the context of examples. However, the dplyr
package extends beyond this introduction – and will appear again later (e.g., in Chapter 13: Relational data).
Essential commands and examples
Data transformation includes re-arranging, selecting, changing, or aggregating data.
Overview
When we have data in the form of a tibble or data frame, dplyr
provides a range of simple tools to transform this data. Six essential dplyr
commands are:
arrange
sorts cases (rows);filter
selects cases (rows) by logical conditions;select
selects and reorders variables (columns);mutate
computes new variables (columns) and adds them to existing ones;summarise
collapses multiple values of a variable (rows of a column) to a single one;
group_by
changes the unit of aggregation (in combination withmutate
andsummarise
).
Not quite as essential but still useful dplyr
commands include:
slice
selects (ranges of) cases (rows) by number;rename
renames variables (columns) and keeps others;transmute
computes new variables (columns) and drops existing ones;sample_n
andsample_frac
draw random samples of cases (rows).
Data
We save the dplyr::starwars
data as a tibble sw
and use it to illustrate the essential dplyr
commands.
library(tidyverse)
sw <- dplyr::starwars
sw # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg… 183 84 black light brown 24 male
#> 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
Standard questions to ask of every new data file include:
What are the dimensions of the data file?
How many variables (columns) are there in
sw
and of which type are they?How many missing (
NA
) values are there?
## Data:
# sw
## Standard checks: ------
dim(sw) # 87 x 13 variables
#> [1] 87 13
names(sw) # variable names
#> [1] "name" "height" "mass" "hair_color" "skin_color"
#> [6] "eye_color" "birth_year" "gender" "homeworld" "species"
#> [11] "films" "vehicles" "starships"
glimpse(sw) #
#> Observations: 87
#> Variables: 13
#> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
#> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
#> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
#> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
#> $ gender <chr> "male", NA, NA, "male", "female", "male", "female",...
#> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
#> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
#> $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
#> $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
#> $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...
## Missing (NA) values: ------
sum(is.na(sw)) # 101 missing values
#> [1] 101
colSums(is.na(sw)) # missing by column (variable)
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 0
1. arrange
to sort rows
Using arrange
sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):
# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Ackb… 180 83 none brown mot… orange 41 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Anak… 188 84 blond fair blue 41.9 male
#> 4 Arve… NA NA brown fair brown NA male
#> 5 Ayla… 178 55 none blue hazel 48 female
#> 6 Bail… 191 NA black tan brown 67 male
#> 7 Barr… 166 50 black yellow blue 40 female
#> 8 BB8 NA NA none none black NA none
#> 9 Ben … 163 65 none grey, gre… orange NA male
#> 10 Beru… 165 75 brown light blue 47 female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
arrange(name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Ackb… 180 83 none brown mot… orange 41 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Anak… 188 84 blond fair blue 41.9 male
#> 4 Arve… NA NA brown fair brown NA male
#> 5 Ayla… 178 55 none blue hazel 48 female
#> 6 Bail… 191 NA black tan brown 67 male
#> 7 Barr… 166 50 black yellow blue 40 female
#> 8 BB8 NA NA none none black NA none
#> 9 Ben … 163 65 none grey, gre… orange NA male
#> 10 Beru… 165 75 brown light blue 47 female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Sort rows in descending order:
sw %>%
arrange(desc(name))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Zam … 168 55 blonde fair, gre… yellow NA female
#> 2 Yoda 66 17 white green brown 896 male
#> 3 Yara… 264 NA none white yellow NA male
#> 4 Wilh… 180 NA auburn, g… fair blue 64 male
#> 5 Wick… 88 20 brown brown brown 8 male
#> 6 Wedg… 170 77 brown fair hazel 21 male
#> 7 Watto 137 NA black blue, grey yellow NA male
#> 8 Wat … 193 48 none green, gr… unknown NA male
#> 9 Tion… 206 80 none grey black NA male
#> 10 Taun… 213 NA none grey black NA female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Sort by multiple variables:
sw %>%
arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Taun… 213 NA none grey black NA female
#> 2 Shaa… 178 57 none red, blue… black NA female
#> 3 Lama… 229 88 none grey black NA male
#> 4 Tion… 206 80 none grey black NA male
#> 5 Kit … 196 87 none green black NA male
#> 6 Plo … 188 80 none orange black 22 male
#> 7 Gree… 173 74 <NA> green black 44 male
#> 8 Nien… 160 68 none grey black NA male
#> 9 Gasg… 122 NA none white, bl… black NA male
#> 10 BB8 NA NA none none black NA none
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
## Note: See
# ?dplyr::arrange # for more help and examples.
Note some details:
All basic
dplyr
commands can be called asverb(data, ...)
or – using the pipe frommagrittr
– asdata %>% verb(...)
(seevignette("magrittr")
for details).Variable names are unquoted.
The order of variable names (
x, y, ...
) specifies the order or priority of operations (first byx
, then byy
, etc.).
Practice: Arrange the sw
data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA
values in sorted variables placed?
2. filter
to select rows
Using filter
selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE
and drops all rows for which the conditions are FALSE
or NA
.
# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 Dart… 202 136 none white yellow 41.9 male
#> 3 Leia… 150 49 brown light brown 19 female
#> 4 Owen… 178 120 brown, gr… light blue 52 male
#> 5 Beru… 165 75 brown light blue 47 female
#> 6 Bigg… 183 84 black light brown 24 male
#> 7 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> 8 Anak… 188 84 blond fair blue 41.9 male
#> 9 Wilh… 180 NA auburn, g… fair blue 64 male
#> 10 Han … 180 80 brown fair brown 29 male
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
filter(species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 Dart… 202 136 none white yellow 41.9 male
#> 3 Leia… 150 49 brown light brown 19 female
#> 4 Owen… 178 120 brown, gr… light blue 52 male
#> 5 Beru… 165 75 brown light blue 47 female
#> 6 Bigg… 183 84 black light brown 24 male
#> 7 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> 8 Anak… 188 84 blond fair blue 41.9 male
#> 9 Wilh… 180 NA auburn, g… fair blue 64 male
#> 10 Han … 180 80 brown fair brown 29 male
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Filter by multiple (additive) conditions:
sw %>%
filter(height > 180, mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jar … 196 66 none orange orange 52 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Wat … 193 48 none green, gr… unknown NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# The same command using the logical operator (&):
sw %>%
filter(height > 180 & mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jar … 196 66 none orange orange 52 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Wat … 193 48 none green, gr… unknown NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter for a range of a specific variable:
sw %>%
filter(height >= 150, height <= 165) # (a) using height twice
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Leia… 150 49 brown light brown 19 female
#> 2 Beru… 165 75 brown light blue 47 female
#> 3 Mon … 150 NA auburn fair blue 48 female
#> 4 Nien… 160 68 none grey black NA male
#> 5 Shmi… 163 NA black fair brown 72 female
#> 6 Ben … 163 65 none grey, gre… orange NA male
#> 7 Cordé 157 NA brown light brown NA female
#> 8 Dormé 165 NA brown light brown NA female
#> 9 Padm… 165 45 brown light brown 46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
sw %>%
filter(between(height, 150, 165)) # (b) using between(...)
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Leia… 150 49 brown light brown 19 female
#> 2 Beru… 165 75 brown light blue 47 female
#> 3 Mon … 150 NA auburn fair blue 48 female
#> 4 Nien… 160 68 none grey black NA male
#> 5 Shmi… 163 NA black fair brown 72 female
#> 6 Ben … 163 65 none grey, gre… orange NA male
#> 7 Cordé 157 NA brown light brown NA female
#> 8 Dormé 165 NA brown light brown NA female
#> 9 Padm… 165 45 brown light brown 46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter by multiple (alternative) conditions:
sw %>%
filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Chew… 228 112 brown unknown blue 200 male
#> 2 Gree… 173 74 <NA> green black 44 male
#> 3 Yoda 66 17 white green brown 896 male
#> 4 Bossk 190 113 none green red 53 male
#> 5 Rugo… 206 NA none green orange NA male
#> 6 Kit … 196 87 none green black NA male
#> 7 Pogg… 183 80 none green yellow NA male
#> 8 Tarf… 234 136 brown brown blue NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter cases with missing (NA) values on specific variables:
sw %>%
filter(is.na(gender))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 2 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 3 R5-D4 97 32 <NA> white, red red NA <NA>
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter cases with existing (non-NA) values on specific variables:
sw %>%
filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 Bigg… 183 84 black light brown 24 male
#> 9 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> 10 Anak… 188 84 blond fair blue 41.9 male
#> # ... with 26 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
## Note: See
# ?dplyr::filter # for more help and examples.
Note some details:
Separating multiple conditions by commas is the same as the logical AND (
&
).Variable names are unquoted.
The comma between conditions or tests (
x, y, ...
) means the same as&
(logical AND), as each test results in a vector of Boolean values.Unlike in base R, rows for which the condition evaluates to
NA
are dropped.Additional filter functions include
near()
for testing numerical (near-)identity.
Practice: Use filter
on the dplyr::starwars
data (sw
) to select very diverse or narrow subsets of individuals. For instance,
- which individual with blond hair and blue eyes has an unknown mass?
- of which species are individuals that are over 2m tall and have brown hair?
- which individuals from Tatooine are not male (but may be
NA
)? - which individuals are neither male nor female OR heavier than 130kg?
3. select
to select columns
Using select
selects variables (columns) by their names or numbers:
# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when providing a vector of variable names:
sw %>%
select(c(name, species, birth_year, gender))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when providing column numbers:
sw %>%
select(1, 10, 7, 8)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when providing a vector of column numbers:
sw %>%
select(c(1, 10, 7, 8))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# Select ranges of variables with ":":
sw %>%
select(name:mass, films:starships)
#> # A tibble: 87 x 6
#> name height mass films vehicles starships
#> <chr> <int> <dbl> <list> <list> <list>
#> 1 Luke Skywalker 172 77 <chr [5]> <chr [2]> <chr [2]>
#> 2 C-3PO 167 75 <chr [6]> <chr [0]> <chr [0]>
#> 3 R2-D2 96 32 <chr [7]> <chr [0]> <chr [0]>
#> 4 Darth Vader 202 136 <chr [4]> <chr [0]> <chr [1]>
#> 5 Leia Organa 150 49 <chr [5]> <chr [1]> <chr [0]>
#> 6 Owen Lars 178 120 <chr [3]> <chr [0]> <chr [0]>
#> 7 Beru Whitesun lars 165 75 <chr [3]> <chr [0]> <chr [0]>
#> 8 R5-D4 97 32 <chr [1]> <chr [0]> <chr [0]>
#> 9 Biggs Darklighter 183 84 <chr [1]> <chr [0]> <chr [1]>
#> 10 Obi-Wan Kenobi 182 77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
# Select to re-order variables (columns) with everything():
sw %>%
select(species, name, gender, everything())
#> # A tibble: 87 x 13
#> species name gender height mass hair_color skin_color eye_color
#> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Human Luke… male 172 77 blond fair blue
#> 2 Droid C-3PO <NA> 167 75 <NA> gold yellow
#> 3 Droid R2-D2 <NA> 96 32 <NA> white, bl… red
#> 4 Human Dart… male 202 136 none white yellow
#> 5 Human Leia… female 150 49 brown light brown
#> 6 Human Owen… male 178 120 brown, gr… light blue
#> 7 Human Beru… female 165 75 brown light blue
#> 8 Droid R5-D4 <NA> 97 32 <NA> white, red red
#> 9 Human Bigg… male 183 84 black light brown
#> 10 Human Obi-… male 182 77 auburn, w… fair blue-gray
#> # ... with 77 more rows, and 5 more variables: birth_year <dbl>,
#> # homeworld <chr>, films <list>, vehicles <list>, starships <list>
# Select variables with helper functions:
sw %>%
select(starts_with("s"))
#> # A tibble: 87 x 3
#> skin_color species starships
#> <chr> <chr> <list>
#> 1 fair Human <chr [2]>
#> 2 gold Droid <chr [0]>
#> 3 white, blue Droid <chr [0]>
#> 4 white Human <chr [1]>
#> 5 light Human <chr [0]>
#> 6 light Human <chr [0]>
#> 7 light Human <chr [0]>
#> 8 white, red Droid <chr [0]>
#> 9 light Human <chr [1]>
#> 10 fair Human <chr [5]>
#> # ... with 77 more rows
sw %>%
select(ends_with("s"))
#> # A tibble: 87 x 5
#> mass species films vehicles starships
#> <dbl> <chr> <list> <list> <list>
#> 1 77 Human <chr [5]> <chr [2]> <chr [2]>
#> 2 75 Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 32 Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 136 Human <chr [4]> <chr [0]> <chr [1]>
#> 5 49 Human <chr [5]> <chr [1]> <chr [0]>
#> 6 120 Human <chr [3]> <chr [0]> <chr [0]>
#> 7 75 Human <chr [3]> <chr [0]> <chr [0]>
#> 8 32 Droid <chr [1]> <chr [0]> <chr [0]>
#> 9 84 Human <chr [1]> <chr [0]> <chr [1]>
#> 10 77 Human <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
sw %>%
select(contains("_"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color birth_year
#> <chr> <chr> <chr> <dbl>
#> 1 blond fair blue 19
#> 2 <NA> gold yellow 112
#> 3 <NA> white, blue red 33
#> 4 none white yellow 41.9
#> 5 brown light brown 19
#> 6 brown, grey light blue 52
#> 7 brown light blue 47
#> 8 <NA> white, red red NA
#> 9 black light brown 24
#> 10 auburn, white fair blue-gray 57
#> # ... with 77 more rows
sw %>%
select(matches("or"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color homeworld
#> <chr> <chr> <chr> <chr>
#> 1 blond fair blue Tatooine
#> 2 <NA> gold yellow Tatooine
#> 3 <NA> white, blue red Naboo
#> 4 none white yellow Tatooine
#> 5 brown light brown Alderaan
#> 6 brown, grey light blue Tatooine
#> 7 brown light blue Tatooine
#> 8 <NA> white, red red Tatooine
#> 9 black light brown Tatooine
#> 10 auburn, white fair blue-gray Stewjon
#> # ... with 77 more rows
# Renaming variables:
sw %>%
rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#> creature height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Darth V… 202 136 none white yellow 41.9 male
#> 5 Leia Or… 150 49 brown light brown 19 female
#> 6 Owen La… 178 120 brown, gr… light blue 52 male
#> 7 Beru Wh… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Biggs D… 183 84 black light brown 24 male
#> 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: from_planet <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
## Note: See
# ?dplyr::select # for more help and examples.
?dplyr::select_if # for more help and examples.
Note some details:
select
works both by specifying variable (column) names and by specifying column numbers.Variable names are unquoted.
The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.
Selecting and adding
everything()
allows re-ordering.Various helper functions (e.g.,
starts_with
,ends_with
,contains
,matches
,num_range
) refer to (parts of) variable names.rename
renames specified variables (without quotes) and keeps all other variables.
Practice: Use select
on the dplyr::starwars
data (sw
) to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).
4. mutate
to compute new variables
Using mutate
computes new variables (columns) from scratch or existing ones:
# Preparation: Save only a subset variables of sw as sws:
sws <- select(sw, name:mass, birth_year:species)
sws # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19 male Tatooine Human
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19 female Alderaan Human
#> 6 Owen Lars 178 120 52 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57 male Stewjon Human
#> # ... with 77 more rows
# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species id height_feet
#> <ch> <int> <dbl> <dbl> <chr> <chr> <chr> <int> <dbl>
#> 1 Luk… 172 77 19 male Tatooine Human 1 5.64
#> 2 C-3… 167 75 112 <NA> Tatooine Droid 2 5.48
#> 3 R2-… 96 32 33 <NA> Naboo Droid 3 3.15
#> 4 Dar… 202 136 41.9 male Tatooine Human 4 6.63
#> 5 Lei… 150 49 19 female Alderaan Human 5 4.92
#> 6 Owe… 178 120 52 male Tatooine Human 6 5.84
#> 7 Ber… 165 75 47 female Tatooine Human 7 5.41
#> 8 R5-… 97 32 NA <NA> Tatooine Droid 8 3.18
#> 9 Big… 183 84 24 male Tatooine Human 9 6.00
#> 10 Obi… 182 77 57 male Stewjon Human 10 5.97
#> # ... with 77 more rows
# The same using the pipe:
sws %>%
mutate(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species id height_feet
#> <ch> <int> <dbl> <dbl> <chr> <chr> <chr> <int> <dbl>
#> 1 Luk… 172 77 19 male Tatooine Human 1 5.64
#> 2 C-3… 167 75 112 <NA> Tatooine Droid 2 5.48
#> 3 R2-… 96 32 33 <NA> Naboo Droid 3 3.15
#> 4 Dar… 202 136 41.9 male Tatooine Human 4 6.63
#> 5 Lei… 150 49 19 female Alderaan Human 5 4.92
#> 6 Owe… 178 120 52 male Tatooine Human 6 5.84
#> 7 Ber… 165 75 47 female Tatooine Human 7 5.41
#> 8 R5-… 97 32 NA <NA> Tatooine Droid 8 3.18
#> 9 Big… 183 84 24 male Tatooine Human 9 6.00
#> 10 Obi… 182 77 57 male Stewjon Human 10 5.97
#> # ... with 77 more rows
# Transmute commputes and only keeps new variables:
sws %>%
transmute(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#> id height_feet
#> <int> <dbl>
#> 1 1 5.64
#> 2 2 5.48
#> 3 3 3.15
#> 4 4 6.63
#> 5 5 4.92
#> 6 6 5.84
#> 7 7 5.41
#> 8 8 3.18
#> 9 9 6.00
#> 10 10 5.97
#> # ... with 77 more rows
# Compute variables based on multiple others (including computed ones):
sws %>%
mutate(BMI = mass / ((height / 100) ^ 2), # compute body mass index (kg/m^2)
BMI_low = BMI < 18.5, # classify low BMI values
BMI_high = BMI > 30, # classify high BMI values
BMI_norm = !BMI_low & !BMI_high # classify normal BMI values
)
#> # A tibble: 87 x 11
#> name height mass birth_year gender homeworld species BMI BMI_low
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <dbl> <lgl>
#> 1 Luke… 172 77 19 male Tatooine Human 26.0 FALSE
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid 26.9 FALSE
#> 3 R2-D2 96 32 33 <NA> Naboo Droid 34.7 FALSE
#> 4 Dart… 202 136 41.9 male Tatooine Human 33.3 FALSE
#> 5 Leia… 150 49 19 female Alderaan Human 21.8 FALSE
#> 6 Owen… 178 120 52 male Tatooine Human 37.9 FALSE
#> 7 Beru… 165 75 47 female Tatooine Human 27.5 FALSE
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid 34.0 FALSE
#> 9 Bigg… 183 84 24 male Tatooine Human 25.1 FALSE
#> 10 Obi-… 182 77 57 male Stewjon Human 23.2 FALSE
#> # ... with 77 more rows, and 2 more variables: BMI_high <lgl>,
#> # BMI_norm <lgl>
## Note: See
# ?dplyr::mutate # for more help and examples.
Note some details:
mutate
computes new variables (columns) and adds them to existing ones, whiletransmute
drops existing ones.Each
mutate
command specifies a new variable name (without quotes), followed by=
and a rule for computing the new variable from existing ones.Variable names are unquoted.
Multiple
mutate
steps are separated by commas, each of which creates a new variable.See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.
Practice: Compute a new variable mass_pound
from mass
(in kg) and the age of each individual in sw
relative to Yoda’s age. (Note that the variable birth_year
is provided in years BBY, i.e., Before Battle of Yavin.)
5. summarise
to compute summaries
summarise
computes a function for a specified variable and collapses the values of the specified variable (i.e., the rows of a specified columns) to a single value. It provides many different summary statistics by itself, but is even more useful in combination with group_by
(discussed next).
# Summarise allows computing a function for a variable (column):
summarise(sw, mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.3
# The same using the pipe:
sw %>%
summarise(mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.3
# Multiple summarise steps allow applying
# different functions for 1 dependent variable:
sw %>%
summarise(n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
md_mass = median(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
max_mass = max(mass, na.rm = TRUE),
big_mass = any(mass > 1000)
)
#> # A tibble: 1 x 6
#> n_mass mn_mass md_mass sd_mass max_mass big_mass
#> <int> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 59 97.3 79 169. 1358 TRUE
# Multiple summarise steps also allow applying
# different functions to different dependent variables:
sw %>%
summarise(# Descriptives of height:
n_height = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
# Descriptives of mass:
n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
# Counts of character variables:
n_names = n(),
n_species = n_distinct(species),
n_worlds = n_distinct(homeworld)
)
#> # A tibble: 1 x 9
#> n_height mn_height sd_height n_mass mn_mass sd_mass n_names n_species
#> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <int>
#> 1 81 174. 34.8 59 97.3 169. 87 38
#> # ... with 1 more variable: n_worlds <int>
## Note: See
# ?dplyr::summarise # for more help and examples.
Note some details:
summarise
collapses multiple values into one value and returns a new tibble with as many rows as values computed.Each
summarise
step specifies a new variable name (without quotes), followed by=
, and a function for computing the new variable from existing ones.Multiple
summarise
steps are separated by commas.Variable names are unquoted.
See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with
summarise
.
Practice: Apply all summary functions mentioned in ?dplyr::summarise
to the sw
dataset.
6. group_by
to aggregate variables
Using group_by
does not change the data, but the unit of aggregation for other commands, which is very useful in combination with mutate
and summarise
.
# Grouping does not change the data, but lists its groups:
group_by(sws, species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19 male Tatooine Human
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19 female Alderaan Human
#> 6 Owen Lars 178 120 52 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57 male Stewjon Human
#> # ... with 77 more rows
# The same using the pipe:
sws %>%
group_by(species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19 male Tatooine Human
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19 female Alderaan Human
#> 6 Owen Lars 178 120 52 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57 male Stewjon Human
#> # ... with 77 more rows
# group_by is ineffective by itself, but very powerful
# (a) in combination with `mutate` and
# (b) in combination with `summarise`.
# ad (a):
# In combination with mutate and an aggregation function,
# group_by changes the unit of aggregation:
sws %>%
mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>% # aggregates over ALL cases
group_by(species) %>%
mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (species)
group_by(gender) %>%
mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (gender)
group_by(name) %>%
mutate(mn_height_4 = mean(height, na.rm = TRUE)) # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups: name [87]
#> name height mass birth_year gender homeworld species mn_height_1
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Luke… 172 77 19 male Tatooine Human 174.
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid 174.
#> 3 R2-D2 96 32 33 <NA> Naboo Droid 174.
#> 4 Dart… 202 136 41.9 male Tatooine Human 174.
#> 5 Leia… 150 49 19 female Alderaan Human 174.
#> 6 Owen… 178 120 52 male Tatooine Human 174.
#> 7 Beru… 165 75 47 female Tatooine Human 174.
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid 174.
#> 9 Bigg… 183 84 24 male Tatooine Human 174.
#> 10 Obi-… 182 77 57 male Stewjon Human 174.
#> # ... with 77 more rows, and 3 more variables: mn_height_2 <dbl>,
#> # mn_height_3 <dbl>, mn_height_4 <dbl>
# ad (b):
# group_by is particularly useful in combination
# with summarise:
sws %>%
group_by(homeworld) %>%
summarise(count = n(),
mn_height = mean(height, na.rm = TRUE),
mn_mass = mean(mass, na.rm = TRUE)
)
#> # A tibble: 49 x 4
#> homeworld count mn_height mn_mass
#> <chr> <int> <dbl> <dbl>
#> 1 Alderaan 3 176. 64
#> 2 Aleen Minor 1 79 15
#> 3 Bespin 1 175 79
#> 4 Bestine IV 1 180 110
#> 5 Cato Neimoidia 1 191 90
#> 6 Cerea 1 198 82
#> 7 Champala 1 196 NaN
#> 8 Chandrila 1 150 NaN
#> 9 Concord Dawn 1 183 79
#> 10 Corellia 2 175 78.5
#> # ... with 39 more rows
# Note that this pipe returns a new tibble,
# with 49 rows (= different levels of homeworld) and
# - 1 column of the group variable (homeworld) and
# - 3 columns of the 3 newly summarised variables.
# group_by used with multiple variables yields a tibble
# containing the combination of all variable levels:
sw %>%
group_by(hair_color, eye_color) # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups: hair_color, eye_color [35]
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg… 183 84 black light brown 24 male
#> 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Counting the frequency of cases in groups:
sw %>%
group_by(hair_color, eye_color) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color, eye_color [35]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
# The same using summarise:
sw %>%
group_by(hair_color, eye_color) %>%
summarise(n = n()) %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color [13]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
## Note: See
# ?dplyr::group_by # for more help and examples.
Note some details:
group_by
changes the unit of aggregation for other commands (mutate
andsummarise
).Variable names are unquoted.
When using
group_by
with multiple variables, they are separated by commas.Using
group_by
withmutate
results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, usinggroup_by
withsummarise
results in a new tibble with all combinations of variable levels as its cases (rows).
Powerful pipes
The essential dplyr
commands are quite simple by themselves, but form the verbs of a language for basic data manipulation. These commands become particularly powerful when they are combined into pipes (by using the %>%
operator, as in the examples above). Stringing together several dplyr
commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly. For instance, pipes allow selecting and sorting sub-groups of data, computing descriptive statistics (n, mean, median, standard deviation, etc.), and answering detailed questions about specific variables.
Examples
Each of the following questions can be answered by a pipe of dplyr
commands:
What is the number and mean height and mass of individuals from Tatooine by species and gender?
Which humans are more than 5cm taller then the average human overall?
Which humans are more than 5cm taller than the average human of their own gender?
Exercises (WPA03)
The following exercises practice the essential dplyr
commands and aim to show that show that simple pipes of them can solve quite intriguing puzzles about data.
Exercise 1
Star and R wars
Let’s tackle the universe/tidyverse by uncovering even more facts about the dplyr::starwars
dataset. Answer the following questions by using pipes of basic dplyr
commands (i.e., arranging, filtering, selecting, grouping, counting, summarizing).
- Save the tibble
dplyr::starwars
assw
and report its dimensions.
## Data: -----
# ?dplyr::starwars
## (a) Basic data properties: ----
sw <- dplyr::starwars
dim(sw) # => 87 rows (denoting individuals) x 13 columns (variables)
#> [1] 87 13
Known unknowns
How many missing (
NA
) values doessw
contain?Which variable (column) has the most missing values?
Which individuals come from an unknown (missing)
homeworld
but have a knownbirth_year
or knownmass
?
## Missing data: -----
# How many missing data points?
sum(is.na(sw)) # => 101 missing values.
#> [1] 101
# Which individuals come from an unknown (missing) homeworld
# but have a known birth_year or mass?
sw %>%
filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Yoda 66 17 white green brown 896 male
#> 2 IG-88 200 140 none metal red 15 none
#> 3 Qui-… 193 89 brown fair blue 92 male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Which variable (column) has the most missing values?
colSums(is.na(sw)) # => birth_year has 44 missing values
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 0
colMeans(is.na(sw)) # (amounting to 50.1% of all cases).
#> name height mass hair_color skin_color eye_color
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000
#> birth_year gender homeworld species films vehicles
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000
#> starships
#> 0.00000000
## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald":
# sw$hair_color[is.na(sw$hair_color)] <- "bald"
Gender issues
How many humans are contained in
sw
overall and by gender?How many and which individuals in
sw
are neither male nor female?Of which species in
sw
exist at least 2 different gender values?
## (c) Gender issues: -----
# (+) How many humans are there of each gender?
sw %>%
filter(species == "Human") %>%
group_by(gender) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: gender [2]
#> gender n
#> <chr> <int>
#> 1 female 9
#> 2 male 26
## Answer: 35 Humans in total: 9 females, 26 male.
# (+) How many and which individuals are neither male nor female?
sw %>%
filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
#> 2 IG-88 200 140 none metal red 15 none
#> 3 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) Of which species are there at least 2 different gender values?
sw %>%
group_by(species, gender) %>%
count() %>% # table shows species by gender:
group_by(species) %>% # Which species appear more than once in this table?
count() %>%
filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups: species [5]
#> species nn
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
# alternative (and shorter) solution:
sw %>%
group_by(species)%>%
summarise(n_gender_vals = n_distinct(gender)) %>%
filter(n_gender_vals >= 2)
#> # A tibble: 5 x 2
#> species n_gender_vals
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
Popular homes and heights
From which
homeworld
do the most indidividuals (rows) come from?What is the mean
height
of all individuals with orange eyes from the most popular homeworld?
## (d) Homeworld issues: -----
# (+) Popular homes: From which homeworld do the most indidividuals (rows) come from?
sw %>%
group_by(homeworld) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups: homeworld [49]
#> homeworld n
#> <chr> <int>
#> 1 Naboo 11
#> 2 Tatooine 10
#> 3 <NA> 10
#> 4 Alderaan 3
#> 5 Coruscant 3
#> 6 Kamino 3
#> 7 Corellia 2
#> 8 Kashyyyk 2
#> 9 Mirial 2
#> 10 Ryloth 2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)
# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld?
sw %>%
filter(homeworld == "Naboo", eye_color == "orange") %>%
summarise(n = n(),
mn_height = mean(height))
#> # A tibble: 1 x 2
#> n mn_height
#> <int> <dbl>
#> 1 3 209.
## Note:
sw %>%
filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
#> 2 Ackb… 180 83 none brown mot… orange 41 male
#> 3 Jar … 196 66 none orange orange 52 male
#> 4 Roos… 224 82 none grey orange NA male
#> 5 Rugo… 206 NA none green orange NA male
#> 6 Sebu… 112 40 none grey, red orange NA male
#> 7 Ben … 163 65 none grey, gre… orange NA male
#> 8 Saes… 188 NA none pale orange NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) What is the mass and homeworld of the smallest droid?
sw %>%
filter(species == "Droid") %>%
arrange(height)
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 2 R5-D4 97 32 <NA> white, red red NA <NA>
#> 3 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 4 IG-88 200 140 none metal red 15 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Size and mass issues
Compute the median, mean, and standard deviation of
height
for all droids.Compute the average height and mass by species and save the result as
h_m
.Sort
h_m
to list the 3 species with the smallest individuals (in terms of mean height).Sort
h_m
to list the 3 species with the heaviest individuals (in terms of median mass).
## Size and mass issues (group summaries): -----
# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
filter(species == "Droid") %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
md_height = median(height, na.rm = TRUE),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#> n not_NA_h md_height mn_height sd_height
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 5 4 132 140 52.0
# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
group_by(species) %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
not_NA_m = sum(!is.na(mass)),
md_mass = median(mass, na.rm = TRUE)
)
h_m
#> # A tibble: 38 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Aleena 1 1 79 1 15
#> 2 Besalisk 1 1 198 1 102
#> 3 Cerean 1 1 198 1 82
#> 4 Chagrian 1 1 196 0 NA
#> 5 Clawdite 1 1 168 1 55
#> 6 Droid 5 4 140 4 53.5
#> 7 Dug 1 1 112 1 40
#> 8 Ewok 1 1 88 1 20
#> 9 Geonosian 1 1 183 1 80
#> 10 Gungan 3 3 209. 2 74
#> # ... with 28 more rows
# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height)?
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Yoda's species 1 1 66 1 17
#> 2 Aleena 1 1 79 1 15
#> 3 Ewok 1 1 88 1 20
# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass)?
h_m %>% arrange(desc(md_mass)) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Hutt 1 1 175 1 1358
#> 2 Kaleesh 1 1 216 1 159
#> 3 Wookiee 2 2 231 2 124
Bonus tasks
How many individuals come from the 3 most frequent (known) species?
Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?
## Bonus questions: -----
# How many individuals come from the 3 most frequent (known) species?
sw %>%
group_by(species) %>%
count %>%
arrange(desc(n)) %>%
filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups: species [9]
#> species n
#> <chr> <int>
#> 1 Human 35
#> 2 Droid 5
#> 3 <NA> 5
#> 4 Gungan 3
#> 5 Kaminoan 2
#> 6 Mirialan 2
#> 7 Twi'lek 2
#> 8 Wookiee 2
#> 9 Zabrak 2
# Which individuals are more than 20% lighter (in terms of mass)
# than the average mass of individuals of their own homeworld?
sw %>%
select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(n_notNA_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
lighter = mass < (mn_mass - (.20 * mn_mass))
) %>%
filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups: homeworld [4]
#> name homeworld mass n_notNA_mass mn_mass lighter
#> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 R2-D2 Naboo 32 6 64.2 TRUE
#> 2 Leia Organa Alderaan 49 2 64 TRUE
#> 3 R5-D4 Tatooine 32 8 85.4 TRUE
#> 4 Yoda <NA> 17 3 82 TRUE
#> 5 Padmé Amidala Naboo 45 6 64.2 TRUE
Exercise 2
Sleeping mammals
The dataset ggplot2::msleep
contains a mammals sleep dataset (see ?msleep
for details and the definition of variables).
- Save the data as
sp
and check the dimensions, variable types, and number of missing values in the dataset.
## Data:
# ?msleep # check variables
sp <- ggplot2::msleep
# Check:
dim(sp) # 83 x 11 variables
#> [1] 83 11
glimpse(sp) # 5 <chr> and 6 <dbl>
#> Observations: 83
#> Variables: 11
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
sum(is.na(sp)) # 136 missing values
#> [1] 136
Arranging and filtering data
Use the dplyr
-verbs arrange
, group_by
, and filter
to answer the following questions by creating ordered subsets of the data:
Arrange the rows (alphabetically) by
vore
,order
, andname
, and report thegenus
of the top 3 mammals.What is the most common type of
vore
in the data? How many omnivores are there?What is the most common
order
in the dataset? Are there more exemplars of theorder
“Carnivora” or “Primates”?Which 2 mammals of the order “Primates” have the longest and shortest
sleep_total
times?
# Arranging rows:
sp1 <- sp %>%
arrange(vore, order, name)
sp1$genus[1:3] # => top 3: "Vulpes" "Phoca" "Acinonyx"
#> [1] "Vulpes" "Phoca" "Acinonyx"
# Counting common vores:
# (a) short solution:
sp %>%
count(vore) %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> vore n
#> <chr> <int>
#> 1 herbi 32
#> 2 omni 20
#> 3 carni 19
#> 4 <NA> 7
#> 5 insecti 5
# (a) is the same as (b):
sp %>%
group_by(vore) %>%
tally() %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> vore n
#> <chr> <int>
#> 1 herbi 32
#> 2 omni 20
#> 3 carni 19
#> 4 <NA> 7
#> 5 insecti 5
# (b) is the same as (c):
sp %>%
group_by(vore) %>%
summarise(n = n()) %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> vore n
#> <chr> <int>
#> 1 herbi 32
#> 2 omni 20
#> 3 carni 19
#> 4 <NA> 7
#> 5 insecti 5
# => 32 herbivores, 20 omnivores
# Counting common orders:
sp %>%
count(order) %>%
arrange(desc(n))
#> # A tibble: 19 x 2
#> order n
#> <chr> <int>
#> 1 Rodentia 22
#> 2 Carnivora 12
#> 3 Primates 12
#> 4 Artiodactyla 6
#> 5 Soricomorpha 5
#> 6 Cetacea 3
#> 7 Hyracoidea 3
#> 8 Perissodactyla 3
#> 9 Chiroptera 2
#> 10 Cingulata 2
#> 11 Didelphimorphia 2
#> 12 Diprotodontia 2
#> 13 Erinaceomorpha 2
#> 14 Proboscidea 2
#> 15 Afrosoricida 1
#> 16 Lagomorpha 1
#> 17 Monotremata 1
#> 18 Pilosa 1
#> 19 Scandentia 1
# OR:
sp %>%
group_by(order) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 19 x 2
#> # Groups: order [19]
#> order n
#> <chr> <int>
#> 1 Rodentia 22
#> 2 Carnivora 12
#> 3 Primates 12
#> 4 Artiodactyla 6
#> 5 Soricomorpha 5
#> 6 Cetacea 3
#> 7 Hyracoidea 3
#> 8 Perissodactyla 3
#> 9 Chiroptera 2
#> 10 Cingulata 2
#> 11 Didelphimorphia 2
#> 12 Diprotodontia 2
#> 13 Erinaceomorpha 2
#> 14 Proboscidea 2
#> 15 Afrosoricida 1
#> 16 Lagomorpha 1
#> 17 Monotremata 1
#> 18 Pilosa 1
#> 19 Scandentia 1
# => 22 Rodentia, 12 Carnivora = 12 Primates
# Primates with longest and shortest sleep_total times:
sp %>%
filter(order == "Primates") %>%
arrange(desc(sleep_total))
#> # A tibble: 12 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Owl … Aotus omni Prim… <NA> 17 1.8 NA
#> 2 Slow… Nyct… carni Prim… <NA> 11 NA NA
#> 3 Potto Pero… omni Prim… lc 11 NA NA
#> 4 Pata… Eryt… omni Prim… lc 10.9 1.1 NA
#> 5 Maca… Maca… omni Prim… <NA> 10.1 1.2 0.75
#> 6 Griv… Cerc… omni Prim… lc 10 0.7 NA
#> 7 Gala… Gala… omni Prim… <NA> 9.8 1.1 0.55
#> 8 Chim… Pan omni Prim… <NA> 9.7 1.4 1.42
#> 9 Squi… Saim… omni Prim… <NA> 9.6 1.4 NA
#> 10 Mong… Lemur herbi Prim… vu 9.5 0.9 NA
#> 11 Babo… Papio omni Prim… <NA> 9.4 1 0.667
#> 12 Human Homo omni Prim… <NA> 8 1.9 1.5
#> # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
# => max: Owl monkey: 17 hours, min: Human 8 hours.
Computing new variables
Solve the following tasks by mutate
, group_by
, and summarise
:
Compute a variable
sleep_awake_sum
that adds thesleep_total
time and theawake
time of each mammal. What result do you expect and get?Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?
What is the minimum, average (mean), and maximum sleep cycle length for each
vore
? (Hint: First group the data bygroup_by
, then usesummarise
on thesleep_cycle
variable, but also count the number ofNA
values for eachvore
. When computing grouped summaries,NA
values can be removed byna.rm = TRUE
.)Replace your
summarise
verb in the previous task bymutate
. What do you get as a result? (Hint: The last two tasks illustrate the difference betweenmutate
and groupedmutate
commands.)
# Computing `sleep_awake_sum`:
sp2 <- sp %>%
mutate(sleep_awake_sum = sleep_total + awake)
sp2$sleep_awake_sum # => all 24 hours
#> [1] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [12] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [23] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.05 24.00 24.00
#> [34] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [45] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [56] 24.00 24.00 24.00 24.00 24.05 24.00 24.00 24.00 24.00 24.00 24.00
#> [67] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [ reached getOption("max.print") -- omitted 8 entries ]
# Computing brain to body ratios:
sp3 <- sp %>%
mutate(brain_to_body_wt = brainwt / bodywt) %>%
arrange(brain_to_body_wt)
sp3$name[1] # => smallest ratio: Cow
#> [1] "Cow"
sp3 %>%
arrange(desc(brain_to_body_wt))
#> # A tibble: 83 x 12
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Thir… Sper… herbi Rode… lc 13.8 3.4 0.217
#> 2 Owl … Aotus omni Prim… <NA> 17 1.8 NA
#> 3 Less… Cryp… omni Sori… lc 9.1 1.4 0.15
#> 4 Squi… Saim… omni Prim… <NA> 9.6 1.4 NA
#> 5 Maca… Maca… omni Prim… <NA> 10.1 1.2 0.75
#> 6 Litt… Myot… inse… Chir… <NA> 19.9 2 0.2
#> 7 Gala… Gala… omni Prim… <NA> 9.8 1.1 0.55
#> 8 Mole… Spal… <NA> Rode… <NA> 10.6 2.4 NA
#> 9 Tree… Tupa… omni Scan… <NA> 8.9 2.6 0.233
#> 10 Human Homo omni Prim… <NA> 8 1.9 1.5
#> # ... with 73 more rows, and 4 more variables: awake <dbl>, brainwt <dbl>,
#> # bodywt <dbl>, brain_to_body_wt <dbl>
# => largest ratio: Thirteen-lined ground squirrel,
# 9 mammals have larger brain_to_body_wt than Human.
# Computing sleep cycle length for each `vore`:
sp %>%
group_by(vore) %>%
summarise(n = n(),
n_NA = sum(is.na(sleep_cycle)),
non_NA = sum(!is.na(sleep_cycle)),
min_cyc = min(sleep_cycle, na.rm = TRUE),
mn_cyc = mean(sleep_cycle, na.rm = TRUE),
max_cyc = max(sleep_cycle, na.rm = TRUE)
)
#> # A tibble: 5 x 7
#> vore n n_NA non_NA min_cyc mn_cyc max_cyc
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 carni 19 14 5 0.333 0.373 0.417
#> 2 herbi 32 20 12 0.117 0.418 1
#> 3 insecti 5 2 3 0.117 0.161 0.2
#> 4 omni 20 9 11 0.133 0.592 1.5
#> 5 <NA> 7 6 1 0.183 0.183 0.183
# Replacing summarise by mutate:
sp4 <- sp %>%
group_by(vore) %>%
mutate(n = n(),
n_NA = sum(is.na(sleep_cycle)),
non_NA = sum(!is.na(sleep_cycle)),
min_cyc = min(sleep_cycle, na.rm = TRUE),
mn_cyc = mean(sleep_cycle, na.rm = TRUE),
max_cyc = max(sleep_cycle, na.rm = TRUE)
)
# => A tibble that contains grouped summaries as 6 new variables:
sp4 %>%
select(vore, name, sleep_cycle, n:max_cyc) %>%
arrange(vore)
#> # A tibble: 83 x 9
#> # Groups: vore [5]
#> vore name sleep_cycle n n_NA non_NA min_cyc mn_cyc max_cyc
#> <chr> <chr> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 carni Cheetah NA 19 14 5 0.333 0.373 0.417
#> 2 carni Northern f… 0.383 19 14 5 0.333 0.373 0.417
#> 3 carni Dog 0.333 19 14 5 0.333 0.373 0.417
#> 4 carni Long-nosed… 0.383 19 14 5 0.333 0.373 0.417
#> 5 carni Domestic c… 0.417 19 14 5 0.333 0.373 0.417
#> 6 carni Pilot whale NA 19 14 5 0.333 0.373 0.417
#> 7 carni Gray seal NA 19 14 5 0.333 0.373 0.417
#> 8 carni Thick-tail… NA 19 14 5 0.333 0.373 0.417
#> 9 carni Slow loris NA 19 14 5 0.333 0.373 0.417
#> 10 carni Northern g… NA 19 14 5 0.333 0.373 0.417
#> # ... with 73 more rows
Exercise 3
Outliers
This exercise examines different possibilities for defining outliers and uses a generated dataset (entitled out.csv
and available at http://rpository.com/ds4psy/data/out.csv) to illustate and compare them. It also helps to further disentangle the difference between mutate
and grouped mutate
commands.
Data on outliers
Use the following read_csv()
command to obtain and load the data into an R object entitled outliers
:
## Load data (as comma-separated file):
outliers <- read_csv("http://rpository.com/ds4psy/data/out.csv") # from online source
## Alternatively (from local source):
# outliers <- read_csv("out.csv") # from current directory
Not all outliers are alike
An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., 2 standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.
Basic tasks
Save the data into a tibble
outliers
and report its number of observations and variables, and their types.How many missing data values are there in
outliers
?What is the gender (or
sex
) distribution in this sample?Create a plot that shows the distribution of
height
values for each gender.
# Load and inspect data:
# outliers <- read_csv("out.csv") # read in csv-file
# outliers <- as_tibble(data) # if data is not already a tibble
dim(outliers) # => 1000 observations (rows) x 3 variables (columns)
#> [1] 1000 3
# Missing data points:
sum(is.na(outliers)) # => 18 missing values
#> [1] 18
# Gender distribution:
outliers %>%
group_by(sex) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: sex [2]
#> sex n
#> <chr> <int>
#> 1 female 507
#> 2 male 493
# => 50.7% females, 49.3% males.
# Distributions of `height` as density plot:
ggplot(outliers, aes(x = height)) +
geom_density(fill = "gold", alpha = 2/3) +
geom_density(aes(fill = sex), alpha = 2/5) +
labs(title = "Distribution of heights overall and by gender",
fill = "Gender") +
# scale_fill_manual(values = c("firebrick", "steelblue3")) +
scale_fill_brewer(palette = "Set1") + # using Brewer palette
theme_bw()
# Note: To avoid the warning about removing 18 cases with NA-values,
# we could first filter out those cases:
# non_NA_data <- filter(outliers, !is.na(height))
# Alternative solution as 2 histograms:
ggplot(outliers) +
facet_wrap(~sex) +
geom_histogram(aes(x = height, fill = sex), binwidth = 5, color = "grey10") +
labs(title = "Distribution of heights by gender",
x = "Height", y = "Frequency") +
scale_fill_brewer(name = "Gender:", palette = "Set1") + # using Brewer palette
# scale_fill_manual(name = "Gender:", values = c("firebrick", "steelblue3")) +
theme_bw()
Defining different outliers
Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height
:
outliers relative to the
height
of the overall sample (i.e., individuals withheight
values deviating more than 2 SD from the overall mean ofheight
);outliers relative to the
height
of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals withheight
values deviating more than 2 SD from the meanheight
of their own gender).
Hints: As both variable signal whether or not someone is an outlier they should be defined as logicals (being either TRUE
or FALSE
) and added as new columns to data
(via appropriate mutate
commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers
by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x
and y
is provided by abs(x - y)
.
Relative outliers
Now use the 2 new outlier variables to define (or filter
) 2 subsets of the data that contain 2 subgroups of people:
out_1
: Individuals (females and males) withheight
values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are inoutliers
?out_2
: Individuals (females and males) withheight
values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are inoutliers
?
## Defining different outliers: -----
# Included in data_out (below), but also possible to do separately:
# Compute the number, means and SD of height values in 2 ways:
# 1. overall:
outliers %>%
summarise(n = n(),
n_not_NA = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 4
#> n n_not_NA mn_height sd_height
#> <int> <int> <dbl> <dbl>
#> 1 1000 982 175. 11.3
# 2. by gender:
outliers %>%
group_by(sex) %>%
summarise(n = n(),
n_not_NA = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 2 x 5
#> sex n n_not_NA mn_height sd_height
#> <chr> <int> <int> <dbl> <dbl>
#> 1 female 507 501 169. 8.19
#> 2 male 493 481 181. 11.0
# Detecting and marking outliers (by logical variables):
# Compute the means, SDs, and corresponding outliers in 2 ways:
crit <- 2 # criterion value for detecting outliers (in SD units)
data_out <- outliers %>%
# 1. Compute means, SD, and outliers for overall sample:
mutate(mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
out_height = abs(height - mn_height) > (crit * sd_height)) %>%
group_by(sex) %>%
# 2. Compute same metrics for subgroups (by sex):
mutate(mn_sex_height = mean(height, na.rm = TRUE),
sd_sex_height = sd(height, na.rm = TRUE),
out_sex_height = abs(height - mn_sex_height) > (crit * sd_sex_height))
knitr::kable(head(data_out))
id | sex | height | mn_height | sd_height | out_height | mn_sex_height | sd_sex_height | out_sex_height |
---|---|---|---|---|---|---|---|---|
nr.1 | female | 164 | 174.7006 | 11.26015 | FALSE | 169.0679 | 8.193619 | FALSE |
nr.2 | male | 178 | 174.7006 | 11.26015 | FALSE | 180.5676 | 11.026677 | FALSE |
nr.3 | female | 177 | 174.7006 | 11.26015 | FALSE | 169.0679 | 8.193619 | FALSE |
nr.4 | male | 188 | 174.7006 | 11.26015 | FALSE | 180.5676 | 11.026677 | FALSE |
nr.5 | male | 193 | 174.7006 | 11.26015 | FALSE | 180.5676 | 11.026677 | FALSE |
nr.6 | female | 168 | 174.7006 | 11.26015 | FALSE | 169.0679 | 8.193619 | FALSE |
## Relative outliers: -----
# Filter specific combinations of outliers:
# 1. Outliers relative to the entire population AND to their own gender:
out_1 <- data_out %>%
filter(out_height & out_sex_height) %>%
arrange(sex, height)
nrow(out_1) # => 21 individuals.
#> [1] 21
# 2. Outliers relative to their own gender, but NOT relative to the entire population:
out_2 <- data_out %>%
filter(!out_height & out_sex_height) %>%
arrange(sex, height)
nrow(out_2) # => 24 individuals.
#> [1] 24
Bonus plots
Visualize the raw values and distributions of
height
for both types of outliers (out_1
andout_2
) in 2 separate plots.Interpret both plots by describing the
height
andsex
combination of the individuals shown in each plot.
# Visualization and interpretation of both types of outliers:
# 1. Showing out_1:
ggplot(out_1, aes(x = sex, y = height)) +
geom_violin(aes(fill = sex)) +
geom_jitter(size = 4, alpha = 2/3) +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
labs(title = "Outliers relative to both overall sample and gender",
x = "Gender", y = "Height (in cm)",
fill = "Gender:") +
theme_bw()
# Interpretation:
# `out_1` contains mostly short women (except for 1 tall woman)
# and mostly tall men (except for 2 short men).
# 2. Showing out_2:
ggplot(out_2, aes(x = sex, y = height)) +
geom_violin(aes(fill = sex)) +
geom_jitter(size = 4, alpha = 2/3) +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
labs(title = "Outliers relative to gender but not overall sample",
x = "Gender", y = "Height (in cm)",
fill = "Gender:") +
theme_bw()
# Interpretation:
# `out_2` contains individuals which are either tall women or short men.
Exercise 4
In Exercise 6 of WPA01 and Exercise 5 of WPA02 you used the p_info
data (available at http://rpository.com/ds4psy/data/posPsy_participants.csv) from
- Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R. and Schüz, B. (2018). Data from, ‘Web-based positive psychology interventions: A reexamination of effectiveness’. Journal of Open Psychology Data, 6: 1. DOI: https://doi.org/10.5334/jopd.35
to explore the participant information and create some corresponding plots.
library(readr)
# Read data (from online source):
p_info <- read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
# p_info
dim(p_info) # 295 rows, 6 columns
#> [1] 295 6
Answer the same questions as in those exercises by transforming your earlier base R commands and graphs (created by using ggplot2
commands) into pipes of dplyr
commands that provide the same information.
From WPA01: Exercise 6
Questions from Exercise 6 of WPA01:
Examine the participant information in p_info
by describing each of its variables:
- How many individuals are contained in the dataset?
- What percentage of them is female (i.e., has a
sex
value of 1)? - How many participants were in one of the 3 treatment groups (i.e., have an
intervention
value of 1, 2, or 3)? - What is the participants’ mean education level? What percentage has a university degree (i.e., an
educ
value of at least 4)? - What is the age range (
min
tomax
) of participants? What is the average (mean and median) age? - Describe the range of
income
levels present in this sample of participants. What percentage of participants self-identifies as a below-average income (i.e., anincome
value of 1)?
# 1. How many individuals are contained in the dataset?
N <- nrow(p_info) # OR
N # 295
#> [1] 295
# Note:
p_info %>% count()
#> # A tibble: 1 x 1
#> n
#> <int>
#> 1 295
# would yield a tibble (with only 1 element: 295).
# 2. What percentage of them is female (i.e., has a `sex` value of 1)?
p_info %>%
group_by(sex) %>%
summarise(n_sex = n(), # number
pc_sex = n_sex/N * 100 # percentage
)
#> # A tibble: 2 x 3
#> sex n_sex pc_sex
#> <int> <int> <dbl>
#> 1 1 251 85.1
#> 2 2 44 14.9
# 3. How many participants were in one of the 3 treatment groups (i.e., have an `intervention` value of 1, 2, or 3)?
p_info %>%
filter(intervention < 4) %>%
count()
#> # A tibble: 1 x 1
#> n
#> <int>
#> 1 222
# => 222 individuals
# OR:
t_iv <- p_info %>%
group_by(intervention) %>%
summarise(n_iv = n(), # number
pc_iv = n_iv/N * 100 # percentage
)
t_iv
#> # A tibble: 4 x 3
#> intervention n_iv pc_iv
#> <int> <int> <dbl>
#> 1 1 72 24.4
#> 2 2 76 25.8
#> 3 3 74 25.1
#> 4 4 73 24.7
sum(t_iv$n_iv[1:3])
#> [1] 222
# => 222 in elements 1:3 of vector n_iv
# 4. What is the participants' mean education level?
p_info %>%
summarise(mn_edu = mean(educ))
#> # A tibble: 1 x 1
#> mn_edu
#> <dbl>
#> 1 3.98
# What percentage has a university degree (i.e., an `educ` value of at least 4)?
p_info %>%
group_by(educ) %>%
summarise(n_edu = n(), # number
pc_edu = n_edu/N * 100 # percentage
)
#> # A tibble: 5 x 3
#> educ n_edu pc_edu
#> <int> <int> <dbl>
#> 1 1 14 4.75
#> 2 2 21 7.12
#> 3 3 39 13.2
#> 4 4 104 35.3
#> 5 5 117 39.7
# 5. What is the age range (`min` to `max`) of participants?
# What is the average (mean and median) age?
p_info %>%
summarise(n = n(),
min_age = min(age),
mn_age = mean(age),
max_age = max(age))
#> # A tibble: 1 x 4
#> n min_age mn_age max_age
#> <int> <dbl> <dbl> <dbl>
#> 1 295 18 43.8 83
# 6. Describe the range of `income` levels present in this sample of participants.
# What percentage of participants self-identifies as a below-average income
# (i.e., an `income` value of 1)?
p_info %>%
group_by(income) %>%
summarise(n_income = n(),
pc_income = n_income/N * 100)
#> # A tibble: 3 x 3
#> income n_income pc_income
#> <int> <int> <dbl>
#> 1 1 73 24.7
#> 2 2 136 46.1
#> 3 3 86 29.2
From WPA02: Exercise 5
Questions from Exercise 5 of WPA02
Use the p_info
data to create some plots that descripte the sample of participants:
- A histogram that shows the distribution of participant
age
in 3 ways:- overall,
- separately for each
sex
, and - separately for each
intervention
.
When using dplyr
instead of ggplot2
, we can replace the information contained in the histogram by a table of descriptives:
# Age distribution (as a table):
age_overall <- p_info %>%
summarise(n = n(), # number
min_age = min(age), # minimum
mn_age = mean(age), # mean
md_age = median(age), # median
sd_age = sd(age), # standard deviation
max_age = max(age) # maximum
)
age_overall
#> # A tibble: 1 x 6
#> n min_age mn_age md_age sd_age max_age
#> <int> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 295 18 43.8 44 12.4 83
# Age distribution by sex (as a table):
age_by_sex <- p_info %>%
group_by(sex) %>%
summarise(n = n(), # number
min_age = min(age), # minimum
mn_age = mean(age), # mean
md_age = median(age), # median
sd_age = sd(age), # standard deviation
max_age = max(age) # maximum
)
age_by_sex
#> # A tibble: 2 x 7
#> sex n min_age mn_age md_age sd_age max_age
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 251 19 43.9 44 12.1 83
#> 2 2 44 18 43.0 44 14.1 71
# Age distribution by intervention (as a table):
age_by_iv <- p_info %>%
group_by(intervention) %>%
summarise(n = n(), # number
min_age = min(age), # minimum
mn_age = mean(age), # mean
md_age = median(age), # median
sd_age = sd(age), # standard deviation
max_age = max(age) # maximum
)
age_by_iv
#> # A tibble: 4 x 7
#> intervention n min_age mn_age md_age sd_age max_age
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 72 22 44.6 45 12.1 71
#> 2 2 76 18 45.4 45.5 12.5 83
#> 3 3 74 19 43.3 44 12.2 71
#> 4 4 73 19 41.7 40 12.8 75
- A bar plot that
- shows how many participants took part in each
intervention
; or - shows how many participants of each
sex
took part in eachintervention
.
- shows how many participants took part in each
Using dplyr
instead of ggplot2
:
# Number of participants per intervention:
# (was contained in age_by_iv above):
age_by_iv %>%
select(intervention, n)
#> # A tibble: 4 x 2
#> intervention n
#> <int> <int>
#> 1 1 72
#> 2 2 76
#> 3 3 74
#> 4 4 73
# N and percentage by intervention and sex:
p_info %>%
group_by(intervention, sex) %>%
summarise(n_iv_sex = n(), # number
pc_iv_sex = n_iv_sex/N * 100 # percentage
)
#> # A tibble: 8 x 4
#> # Groups: intervention [?]
#> intervention sex n_iv_sex pc_iv_sex
#> <int> <int> <int> <dbl>
#> 1 1 1 62 21.0
#> 2 1 2 10 3.39
#> 3 2 1 66 22.4
#> 4 2 2 10 3.39
#> 5 3 1 62 21.0
#> 6 3 2 12 4.07
#> 7 4 1 61 20.7
#> 8 4 2 12 4.07
More on data transformation
For more details on dplyr
,
- study
vignette("dplyr")
and the documentation for?arrange
,?filter
,?select
, etc. - study https://dplyr.tidyverse.org/ and its examples;
- see the cheat sheet on data transformation;
- read Chapter 5: Data transformation and complete its exercises.
Conclusion
All ds4psy essentials so far:
Nr. | Topic |
---|---|
0. | Syllabus |
1. | Basic R concepts and commands |
2. | Visualizing data |
3. | Transforming data |
4. | Exploring data (EDA) |
+. | Datasets |
[Last update on 2018-11-27 22:57:45 by hn.]