Introduction
This file contains essential commands from Chapter 5: Data transformation of the textbook r4ds and corresponding examples and exercises. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials so far:
| Nr. | Topic |
|---|---|
| 0. | Syllabus |
| 1. | Basic R concepts and commands |
| 2. | Visualizing data |
| 3. | Transforming data |
| 4. | Exploring data (EDA) |
| +. | Datasets |
Course coordinates
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Winter 2018/2019: Mondays, 13:30–15:00, C511.
- Links to current course syllabus | ZeUS | Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Transforming data | ds4psy
## 2018 11 19
## ----------------------------
## Preparations: ----------
library(tidyverse)
## 1. Topic: ----------
# etc.
## End of file (eof). ---------- Visualizing data
In the following, we assume that you have read and worked through Chapter 5: Data transformation.
Based on this background, we examine some essential commands of dplyr in the context of examples. However, the dplyr package extends beyond this introduction – and will appear again later (e.g., in Chapter 13: Relational data).
Essential commands and examples
Data transformation includes re-arranging, selecting, changing, or aggregating data.
Overview
When we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data. Six essential dplyr commands are:
arrangesorts cases (rows);filterselects cases (rows) by logical conditions;selectselects and reorders variables (columns);mutatecomputes new variables (columns) and adds them to existing ones;summarisecollapses multiple values of a variable (rows of a column) to a single one;
group_bychanges the unit of aggregation (in combination withmutateandsummarise).
Not quite as essential but still useful dplyr commands include:
sliceselects (ranges of) cases (rows) by number;renamerenames variables (columns) and keeps others;transmutecomputes new variables (columns) and drops existing ones;sample_nandsample_fracdraw random samples of cases (rows).
Data
We save the dplyr::starwars data as a tibble sw and use it to illustrate the essential dplyr commands.
library(tidyverse)
sw <- dplyr::starwars
sw # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg… 183 84 black light brown 24 male
#> 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>Standard questions to ask of every new data file include:
What are the dimensions of the data file?
How many variables (columns) are there in
swand of which type are they?How many missing (
NA) values are there?
## Data:
# sw
## Standard checks: ------
dim(sw) # 87 x 13 variables
#> [1] 87 13
names(sw) # variable names
#> [1] "name" "height" "mass" "hair_color" "skin_color"
#> [6] "eye_color" "birth_year" "gender" "homeworld" "species"
#> [11] "films" "vehicles" "starships"
glimpse(sw) #
#> Observations: 87
#> Variables: 13
#> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
#> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
#> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
#> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
#> $ gender <chr> "male", NA, NA, "male", "female", "male", "female",...
#> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
#> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
#> $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
#> $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
#> $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...
## Missing (NA) values: ------
sum(is.na(sw)) # 101 missing values
#> [1] 101
colSums(is.na(sw)) # missing by column (variable)
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 01. arrange to sort rows
Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):
# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Ackb… 180 83 none brown mot… orange 41 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Anak… 188 84 blond fair blue 41.9 male
#> 4 Arve… NA NA brown fair brown NA male
#> 5 Ayla… 178 55 none blue hazel 48 female
#> 6 Bail… 191 NA black tan brown 67 male
#> 7 Barr… 166 50 black yellow blue 40 female
#> 8 BB8 NA NA none none black NA none
#> 9 Ben … 163 65 none grey, gre… orange NA male
#> 10 Beru… 165 75 brown light blue 47 female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
arrange(name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Ackb… 180 83 none brown mot… orange 41 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Anak… 188 84 blond fair blue 41.9 male
#> 4 Arve… NA NA brown fair brown NA male
#> 5 Ayla… 178 55 none blue hazel 48 female
#> 6 Bail… 191 NA black tan brown 67 male
#> 7 Barr… 166 50 black yellow blue 40 female
#> 8 BB8 NA NA none none black NA none
#> 9 Ben … 163 65 none grey, gre… orange NA male
#> 10 Beru… 165 75 brown light blue 47 female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Sort rows in descending order:
sw %>%
arrange(desc(name))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Zam … 168 55 blonde fair, gre… yellow NA female
#> 2 Yoda 66 17 white green brown 896 male
#> 3 Yara… 264 NA none white yellow NA male
#> 4 Wilh… 180 NA auburn, g… fair blue 64 male
#> 5 Wick… 88 20 brown brown brown 8 male
#> 6 Wedg… 170 77 brown fair hazel 21 male
#> 7 Watto 137 NA black blue, grey yellow NA male
#> 8 Wat … 193 48 none green, gr… unknown NA male
#> 9 Tion… 206 80 none grey black NA male
#> 10 Taun… 213 NA none grey black NA female
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Sort by multiple variables:
sw %>%
arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Taun… 213 NA none grey black NA female
#> 2 Shaa… 178 57 none red, blue… black NA female
#> 3 Lama… 229 88 none grey black NA male
#> 4 Tion… 206 80 none grey black NA male
#> 5 Kit … 196 87 none green black NA male
#> 6 Plo … 188 80 none orange black 22 male
#> 7 Gree… 173 74 <NA> green black 44 male
#> 8 Nien… 160 68 none grey black NA male
#> 9 Gasg… 122 NA none white, bl… black NA male
#> 10 BB8 NA NA none none black NA none
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
## Note: See
# ?dplyr::arrange # for more help and examples.Note some details:
All basic
dplyrcommands can be called asverb(data, ...)or – using the pipe frommagrittr– asdata %>% verb(...)(seevignette("magrittr")for details).Variable names are unquoted.
The order of variable names (
x, y, ...) specifies the order or priority of operations (first byx, then byy, etc.).
Practice: Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA values in sorted variables placed?
2. filter to select rows
Using filter selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE and drops all rows for which the conditions are FALSE or NA.
# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 Dart… 202 136 none white yellow 41.9 male
#> 3 Leia… 150 49 brown light brown 19 female
#> 4 Owen… 178 120 brown, gr… light blue 52 male
#> 5 Beru… 165 75 brown light blue 47 female
#> 6 Bigg… 183 84 black light brown 24 male
#> 7 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> 8 Anak… 188 84 blond fair blue 41.9 male
#> 9 Wilh… 180 NA auburn, g… fair blue 64 male
#> 10 Han … 180 80 brown fair brown 29 male
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
filter(species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 Dart… 202 136 none white yellow 41.9 male
#> 3 Leia… 150 49 brown light brown 19 female
#> 4 Owen… 178 120 brown, gr… light blue 52 male
#> 5 Beru… 165 75 brown light blue 47 female
#> 6 Bigg… 183 84 black light brown 24 male
#> 7 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> 8 Anak… 188 84 blond fair blue 41.9 male
#> 9 Wilh… 180 NA auburn, g… fair blue 64 male
#> 10 Han … 180 80 brown fair brown 29 male
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Filter by multiple (additive) conditions:
sw %>%
filter(height > 180, mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jar … 196 66 none orange orange 52 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Wat … 193 48 none green, gr… unknown NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# The same command using the logical operator (&):
sw %>%
filter(height > 180 & mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jar … 196 66 none orange orange 52 male
#> 2 Adi … 184 50 none dark blue NA female
#> 3 Wat … 193 48 none green, gr… unknown NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter for a range of a specific variable:
sw %>%
filter(height >= 150, height <= 165) # (a) using height twice
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Leia… 150 49 brown light brown 19 female
#> 2 Beru… 165 75 brown light blue 47 female
#> 3 Mon … 150 NA auburn fair blue 48 female
#> 4 Nien… 160 68 none grey black NA male
#> 5 Shmi… 163 NA black fair brown 72 female
#> 6 Ben … 163 65 none grey, gre… orange NA male
#> 7 Cordé 157 NA brown light brown NA female
#> 8 Dormé 165 NA brown light brown NA female
#> 9 Padm… 165 45 brown light brown 46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
sw %>%
filter(between(height, 150, 165)) # (b) using between(...)
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Leia… 150 49 brown light brown 19 female
#> 2 Beru… 165 75 brown light blue 47 female
#> 3 Mon … 150 NA auburn fair blue 48 female
#> 4 Nien… 160 68 none grey black NA male
#> 5 Shmi… 163 NA black fair brown 72 female
#> 6 Ben … 163 65 none grey, gre… orange NA male
#> 7 Cordé 157 NA brown light brown NA female
#> 8 Dormé 165 NA brown light brown NA female
#> 9 Padm… 165 45 brown light brown 46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter by multiple (alternative) conditions:
sw %>%
filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Chew… 228 112 brown unknown blue 200 male
#> 2 Gree… 173 74 <NA> green black 44 male
#> 3 Yoda 66 17 white green brown 896 male
#> 4 Bossk 190 113 none green red 53 male
#> 5 Rugo… 206 NA none green orange NA male
#> 6 Kit … 196 87 none green black NA male
#> 7 Pogg… 183 80 none green yellow NA male
#> 8 Tarf… 234 136 brown brown blue NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter cases with missing (NA) values on specific variables:
sw %>%
filter(is.na(gender))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 2 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 3 R5-D4 97 32 <NA> white, red red NA <NA>
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter cases with existing (non-NA) values on specific variables:
sw %>%
filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 Bigg… 183 84 black light brown 24 male
#> 9 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> 10 Anak… 188 84 blond fair blue 41.9 male
#> # ... with 26 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
## Note: See
# ?dplyr::filter # for more help and examples.Note some details:
Separating multiple conditions by commas is the same as the logical AND (
&).Variable names are unquoted.
The comma between conditions or tests (
x, y, ...) means the same as&(logical AND), as each test results in a vector of Boolean values.Unlike in base R, rows for which the condition evaluates to
NAare dropped.Additional filter functions include
near()for testing numerical (near-)identity.
Practice: Use filter on the dplyr::starwars data (sw) to select very diverse or narrow subsets of individuals. For instance,
- which individual with blond hair and blue eyes has an unknown mass?
- of which species are individuals that are over 2m tall and have brown hair?
- which individuals from Tatooine are not male (but may be
NA)? - which individuals are neither male nor female OR heavier than 130kg?
3. select to select columns
Using select selects variables (columns) by their names or numbers:
# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when providing a vector of variable names:
sw %>%
select(c(name, species, birth_year, gender))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when providing column numbers:
sw %>%
select(1, 10, 7, 8)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# The same when providing a vector of column numbers:
sw %>%
select(c(1, 10, 7, 8))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19 male
#> 2 C-3PO Droid 112 <NA>
#> 3 R2-D2 Droid 33 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19 female
#> 6 Owen Lars Human 52 male
#> 7 Beru Whitesun lars Human 47 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24 male
#> 10 Obi-Wan Kenobi Human 57 male
#> # ... with 77 more rows
# Select ranges of variables with ":":
sw %>%
select(name:mass, films:starships)
#> # A tibble: 87 x 6
#> name height mass films vehicles starships
#> <chr> <int> <dbl> <list> <list> <list>
#> 1 Luke Skywalker 172 77 <chr [5]> <chr [2]> <chr [2]>
#> 2 C-3PO 167 75 <chr [6]> <chr [0]> <chr [0]>
#> 3 R2-D2 96 32 <chr [7]> <chr [0]> <chr [0]>
#> 4 Darth Vader 202 136 <chr [4]> <chr [0]> <chr [1]>
#> 5 Leia Organa 150 49 <chr [5]> <chr [1]> <chr [0]>
#> 6 Owen Lars 178 120 <chr [3]> <chr [0]> <chr [0]>
#> 7 Beru Whitesun lars 165 75 <chr [3]> <chr [0]> <chr [0]>
#> 8 R5-D4 97 32 <chr [1]> <chr [0]> <chr [0]>
#> 9 Biggs Darklighter 183 84 <chr [1]> <chr [0]> <chr [1]>
#> 10 Obi-Wan Kenobi 182 77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
# Select to re-order variables (columns) with everything():
sw %>%
select(species, name, gender, everything())
#> # A tibble: 87 x 13
#> species name gender height mass hair_color skin_color eye_color
#> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Human Luke… male 172 77 blond fair blue
#> 2 Droid C-3PO <NA> 167 75 <NA> gold yellow
#> 3 Droid R2-D2 <NA> 96 32 <NA> white, bl… red
#> 4 Human Dart… male 202 136 none white yellow
#> 5 Human Leia… female 150 49 brown light brown
#> 6 Human Owen… male 178 120 brown, gr… light blue
#> 7 Human Beru… female 165 75 brown light blue
#> 8 Droid R5-D4 <NA> 97 32 <NA> white, red red
#> 9 Human Bigg… male 183 84 black light brown
#> 10 Human Obi-… male 182 77 auburn, w… fair blue-gray
#> # ... with 77 more rows, and 5 more variables: birth_year <dbl>,
#> # homeworld <chr>, films <list>, vehicles <list>, starships <list>
# Select variables with helper functions:
sw %>%
select(starts_with("s"))
#> # A tibble: 87 x 3
#> skin_color species starships
#> <chr> <chr> <list>
#> 1 fair Human <chr [2]>
#> 2 gold Droid <chr [0]>
#> 3 white, blue Droid <chr [0]>
#> 4 white Human <chr [1]>
#> 5 light Human <chr [0]>
#> 6 light Human <chr [0]>
#> 7 light Human <chr [0]>
#> 8 white, red Droid <chr [0]>
#> 9 light Human <chr [1]>
#> 10 fair Human <chr [5]>
#> # ... with 77 more rows
sw %>%
select(ends_with("s"))
#> # A tibble: 87 x 5
#> mass species films vehicles starships
#> <dbl> <chr> <list> <list> <list>
#> 1 77 Human <chr [5]> <chr [2]> <chr [2]>
#> 2 75 Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 32 Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 136 Human <chr [4]> <chr [0]> <chr [1]>
#> 5 49 Human <chr [5]> <chr [1]> <chr [0]>
#> 6 120 Human <chr [3]> <chr [0]> <chr [0]>
#> 7 75 Human <chr [3]> <chr [0]> <chr [0]>
#> 8 32 Droid <chr [1]> <chr [0]> <chr [0]>
#> 9 84 Human <chr [1]> <chr [0]> <chr [1]>
#> 10 77 Human <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
sw %>%
select(contains("_"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color birth_year
#> <chr> <chr> <chr> <dbl>
#> 1 blond fair blue 19
#> 2 <NA> gold yellow 112
#> 3 <NA> white, blue red 33
#> 4 none white yellow 41.9
#> 5 brown light brown 19
#> 6 brown, grey light blue 52
#> 7 brown light blue 47
#> 8 <NA> white, red red NA
#> 9 black light brown 24
#> 10 auburn, white fair blue-gray 57
#> # ... with 77 more rows
sw %>%
select(matches("or"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color homeworld
#> <chr> <chr> <chr> <chr>
#> 1 blond fair blue Tatooine
#> 2 <NA> gold yellow Tatooine
#> 3 <NA> white, blue red Naboo
#> 4 none white yellow Tatooine
#> 5 brown light brown Alderaan
#> 6 brown, grey light blue Tatooine
#> 7 brown light blue Tatooine
#> 8 <NA> white, red red Tatooine
#> 9 black light brown Tatooine
#> 10 auburn, white fair blue-gray Stewjon
#> # ... with 77 more rows
# Renaming variables:
sw %>%
rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#> creature height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Darth V… 202 136 none white yellow 41.9 male
#> 5 Leia Or… 150 49 brown light brown 19 female
#> 6 Owen La… 178 120 brown, gr… light blue 52 male
#> 7 Beru Wh… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Biggs D… 183 84 black light brown 24 male
#> 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: from_planet <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
## Note: See
# ?dplyr::select # for more help and examples.
?dplyr::select_if # for more help and examples. Note some details:
selectworks both by specifying variable (column) names and by specifying column numbers.Variable names are unquoted.
The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.
Selecting and adding
everything()allows re-ordering.Various helper functions (e.g.,
starts_with,ends_with,contains,matches,num_range) refer to (parts of) variable names.renamerenames specified variables (without quotes) and keeps all other variables.
Practice: Use select on the dplyr::starwars data (sw) to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).
4. mutate to compute new variables
Using mutate computes new variables (columns) from scratch or existing ones:
# Preparation: Save only a subset variables of sw as sws:
sws <- select(sw, name:mass, birth_year:species)
sws # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19 male Tatooine Human
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19 female Alderaan Human
#> 6 Owen Lars 178 120 52 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57 male Stewjon Human
#> # ... with 77 more rows
# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species id height_feet
#> <ch> <int> <dbl> <dbl> <chr> <chr> <chr> <int> <dbl>
#> 1 Luk… 172 77 19 male Tatooine Human 1 5.64
#> 2 C-3… 167 75 112 <NA> Tatooine Droid 2 5.48
#> 3 R2-… 96 32 33 <NA> Naboo Droid 3 3.15
#> 4 Dar… 202 136 41.9 male Tatooine Human 4 6.63
#> 5 Lei… 150 49 19 female Alderaan Human 5 4.92
#> 6 Owe… 178 120 52 male Tatooine Human 6 5.84
#> 7 Ber… 165 75 47 female Tatooine Human 7 5.41
#> 8 R5-… 97 32 NA <NA> Tatooine Droid 8 3.18
#> 9 Big… 183 84 24 male Tatooine Human 9 6.00
#> 10 Obi… 182 77 57 male Stewjon Human 10 5.97
#> # ... with 77 more rows
# The same using the pipe:
sws %>%
mutate(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species id height_feet
#> <ch> <int> <dbl> <dbl> <chr> <chr> <chr> <int> <dbl>
#> 1 Luk… 172 77 19 male Tatooine Human 1 5.64
#> 2 C-3… 167 75 112 <NA> Tatooine Droid 2 5.48
#> 3 R2-… 96 32 33 <NA> Naboo Droid 3 3.15
#> 4 Dar… 202 136 41.9 male Tatooine Human 4 6.63
#> 5 Lei… 150 49 19 female Alderaan Human 5 4.92
#> 6 Owe… 178 120 52 male Tatooine Human 6 5.84
#> 7 Ber… 165 75 47 female Tatooine Human 7 5.41
#> 8 R5-… 97 32 NA <NA> Tatooine Droid 8 3.18
#> 9 Big… 183 84 24 male Tatooine Human 9 6.00
#> 10 Obi… 182 77 57 male Stewjon Human 10 5.97
#> # ... with 77 more rows
# Transmute commputes and only keeps new variables:
sws %>%
transmute(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#> id height_feet
#> <int> <dbl>
#> 1 1 5.64
#> 2 2 5.48
#> 3 3 3.15
#> 4 4 6.63
#> 5 5 4.92
#> 6 6 5.84
#> 7 7 5.41
#> 8 8 3.18
#> 9 9 6.00
#> 10 10 5.97
#> # ... with 77 more rows
# Compute variables based on multiple others (including computed ones):
sws %>%
mutate(BMI = mass / ((height / 100) ^ 2), # compute body mass index (kg/m^2)
BMI_low = BMI < 18.5, # classify low BMI values
BMI_high = BMI > 30, # classify high BMI values
BMI_norm = !BMI_low & !BMI_high # classify normal BMI values
)
#> # A tibble: 87 x 11
#> name height mass birth_year gender homeworld species BMI BMI_low
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <dbl> <lgl>
#> 1 Luke… 172 77 19 male Tatooine Human 26.0 FALSE
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid 26.9 FALSE
#> 3 R2-D2 96 32 33 <NA> Naboo Droid 34.7 FALSE
#> 4 Dart… 202 136 41.9 male Tatooine Human 33.3 FALSE
#> 5 Leia… 150 49 19 female Alderaan Human 21.8 FALSE
#> 6 Owen… 178 120 52 male Tatooine Human 37.9 FALSE
#> 7 Beru… 165 75 47 female Tatooine Human 27.5 FALSE
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid 34.0 FALSE
#> 9 Bigg… 183 84 24 male Tatooine Human 25.1 FALSE
#> 10 Obi-… 182 77 57 male Stewjon Human 23.2 FALSE
#> # ... with 77 more rows, and 2 more variables: BMI_high <lgl>,
#> # BMI_norm <lgl>
## Note: See
# ?dplyr::mutate # for more help and examples. Note some details:
mutatecomputes new variables (columns) and adds them to existing ones, whiletransmutedrops existing ones.Each
mutatecommand specifies a new variable name (without quotes), followed by=and a rule for computing the new variable from existing ones.Variable names are unquoted.
Multiple
mutatesteps are separated by commas, each of which creates a new variable.See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.
Practice: Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)
5. summarise to compute summaries
summarise computes a function for a specified variable and collapses the values of the specified variable (i.e., the rows of a specified columns) to a single value. It provides many different summary statistics by itself, but is even more useful in combination with group_by (discussed next).
# Summarise allows computing a function for a variable (column):
summarise(sw, mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.3
# The same using the pipe:
sw %>%
summarise(mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.3
# Multiple summarise steps allow applying
# different functions for 1 dependent variable:
sw %>%
summarise(n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
md_mass = median(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
max_mass = max(mass, na.rm = TRUE),
big_mass = any(mass > 1000)
)
#> # A tibble: 1 x 6
#> n_mass mn_mass md_mass sd_mass max_mass big_mass
#> <int> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 59 97.3 79 169. 1358 TRUE
# Multiple summarise steps also allow applying
# different functions to different dependent variables:
sw %>%
summarise(# Descriptives of height:
n_height = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
# Descriptives of mass:
n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
# Counts of character variables:
n_names = n(),
n_species = n_distinct(species),
n_worlds = n_distinct(homeworld)
)
#> # A tibble: 1 x 9
#> n_height mn_height sd_height n_mass mn_mass sd_mass n_names n_species
#> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <int>
#> 1 81 174. 34.8 59 97.3 169. 87 38
#> # ... with 1 more variable: n_worlds <int>
## Note: See
# ?dplyr::summarise # for more help and examples. Note some details:
summarisecollapses multiple values into one value and returns a new tibble with as many rows as values computed.Each
summarisestep specifies a new variable name (without quotes), followed by=, and a function for computing the new variable from existing ones.Multiple
summarisesteps are separated by commas.Variable names are unquoted.
See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with
summarise.
Practice: Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.
6. group_by to aggregate variables
Using group_by does not change the data, but the unit of aggregation for other commands, which is very useful in combination with mutate and summarise.
# Grouping does not change the data, but lists its groups:
group_by(sws, species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19 male Tatooine Human
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19 female Alderaan Human
#> 6 Owen Lars 178 120 52 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57 male Stewjon Human
#> # ... with 77 more rows
# The same using the pipe:
sws %>%
group_by(species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19 male Tatooine Human
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19 female Alderaan Human
#> 6 Owen Lars 178 120 52 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57 male Stewjon Human
#> # ... with 77 more rows
# group_by is ineffective by itself, but very powerful
# (a) in combination with `mutate` and
# (b) in combination with `summarise`.
# ad (a):
# In combination with mutate and an aggregation function,
# group_by changes the unit of aggregation:
sws %>%
mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>% # aggregates over ALL cases
group_by(species) %>%
mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (species)
group_by(gender) %>%
mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (gender)
group_by(name) %>%
mutate(mn_height_4 = mean(height, na.rm = TRUE)) # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups: name [87]
#> name height mass birth_year gender homeworld species mn_height_1
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Luke… 172 77 19 male Tatooine Human 174.
#> 2 C-3PO 167 75 112 <NA> Tatooine Droid 174.
#> 3 R2-D2 96 32 33 <NA> Naboo Droid 174.
#> 4 Dart… 202 136 41.9 male Tatooine Human 174.
#> 5 Leia… 150 49 19 female Alderaan Human 174.
#> 6 Owen… 178 120 52 male Tatooine Human 174.
#> 7 Beru… 165 75 47 female Tatooine Human 174.
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid 174.
#> 9 Bigg… 183 84 24 male Tatooine Human 174.
#> 10 Obi-… 182 77 57 male Stewjon Human 174.
#> # ... with 77 more rows, and 3 more variables: mn_height_2 <dbl>,
#> # mn_height_3 <dbl>, mn_height_4 <dbl>
# ad (b):
# group_by is particularly useful in combination
# with summarise:
sws %>%
group_by(homeworld) %>%
summarise(count = n(),
mn_height = mean(height, na.rm = TRUE),
mn_mass = mean(mass, na.rm = TRUE)
)
#> # A tibble: 49 x 4
#> homeworld count mn_height mn_mass
#> <chr> <int> <dbl> <dbl>
#> 1 Alderaan 3 176. 64
#> 2 Aleen Minor 1 79 15
#> 3 Bespin 1 175 79
#> 4 Bestine IV 1 180 110
#> 5 Cato Neimoidia 1 191 90
#> 6 Cerea 1 198 82
#> 7 Champala 1 196 NaN
#> 8 Chandrila 1 150 NaN
#> 9 Concord Dawn 1 183 79
#> 10 Corellia 2 175 78.5
#> # ... with 39 more rows
# Note that this pipe returns a new tibble,
# with 49 rows (= different levels of homeworld) and
# - 1 column of the group variable (homeworld) and
# - 3 columns of the 3 newly summarised variables.
# group_by used with multiple variables yields a tibble
# containing the combination of all variable levels:
sw %>%
group_by(hair_color, eye_color) # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups: hair_color, eye_color [35]
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> 6 Owen… 178 120 brown, gr… light blue 52 male
#> 7 Beru… 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg… 183 84 black light brown 24 male
#> 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
# Counting the frequency of cases in groups:
sw %>%
group_by(hair_color, eye_color) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color, eye_color [35]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
# The same using summarise:
sw %>%
group_by(hair_color, eye_color) %>%
summarise(n = n()) %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color [13]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
## Note: See
# ?dplyr::group_by # for more help and examples. Note some details:
group_bychanges the unit of aggregation for other commands (mutateandsummarise).Variable names are unquoted.
When using
group_bywith multiple variables, they are separated by commas.Using
group_bywithmutateresults in a tibble that has the same number of cases (rows) as the original tibble. By contrast, usinggroup_bywithsummariseresults in a new tibble with all combinations of variable levels as its cases (rows).
Powerful pipes
The essential dplyr commands are quite simple by themselves, but form the verbs of a language for basic data manipulation. These commands become particularly powerful when they are combined into pipes (by using the %>% operator, as in the examples above). Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly. For instance, pipes allow selecting and sorting sub-groups of data, computing descriptive statistics (n, mean, median, standard deviation, etc.), and answering detailed questions about specific variables.
Examples
Each of the following questions can be answered by a pipe of dplyr commands:
What is the number and mean height and mass of individuals from Tatooine by species and gender?
Which humans are more than 5cm taller then the average human overall?
Which humans are more than 5cm taller than the average human of their own gender?
Exercises (WPA03)
The following exercises practice the essential dplyr commands and aim to show that show that simple pipes of them can solve quite intriguing puzzles about data.
Exercise 1
Star and R wars
Let’s tackle the universe/tidyverse by uncovering even more facts about the dplyr::starwars dataset. Answer the following questions by using pipes of basic dplyr commands (i.e., arranging, filtering, selecting, grouping, counting, summarizing).
- Save the tibble
dplyr::starwarsasswand report its dimensions.
## Data: -----
# ?dplyr::starwars
## (a) Basic data properties: ----
sw <- dplyr::starwars
dim(sw) # => 87 rows (denoting individuals) x 13 columns (variables)
#> [1] 87 13Known unknowns
How many missing (
NA) values doesswcontain?Which variable (column) has the most missing values?
Which individuals come from an unknown (missing)
homeworldbut have a knownbirth_yearor knownmass?
## Missing data: -----
# How many missing data points?
sum(is.na(sw)) # => 101 missing values.
#> [1] 101
# Which individuals come from an unknown (missing) homeworld
# but have a known birth_year or mass?
sw %>%
filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Yoda 66 17 white green brown 896 male
#> 2 IG-88 200 140 none metal red 15 none
#> 3 Qui-… 193 89 brown fair blue 92 male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Which variable (column) has the most missing values?
colSums(is.na(sw)) # => birth_year has 44 missing values
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 0
colMeans(is.na(sw)) # (amounting to 50.1% of all cases).
#> name height mass hair_color skin_color eye_color
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000
#> birth_year gender homeworld species films vehicles
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000
#> starships
#> 0.00000000
## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald":
# sw$hair_color[is.na(sw$hair_color)] <- "bald"Gender issues
How many humans are contained in
swoverall and by gender?How many and which individuals in
sware neither male nor female?Of which species in
swexist at least 2 different gender values?
## (c) Gender issues: -----
# (+) How many humans are there of each gender?
sw %>%
filter(species == "Human") %>%
group_by(gender) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: gender [2]
#> gender n
#> <chr> <int>
#> 1 female 9
#> 2 male 26
## Answer: 35 Humans in total: 9 females, 26 male.
# (+) How many and which individuals are neither male nor female?
sw %>%
filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
#> 2 IG-88 200 140 none metal red 15 none
#> 3 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) Of which species are there at least 2 different gender values?
sw %>%
group_by(species, gender) %>%
count() %>% # table shows species by gender:
group_by(species) %>% # Which species appear more than once in this table?
count() %>%
filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups: species [5]
#> species nn
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
# alternative (and shorter) solution:
sw %>%
group_by(species)%>%
summarise(n_gender_vals = n_distinct(gender)) %>%
filter(n_gender_vals >= 2)
#> # A tibble: 5 x 2
#> species n_gender_vals
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2Popular homes and heights
From which
homeworlddo the most indidividuals (rows) come from?What is the mean
heightof all individuals with orange eyes from the most popular homeworld?
## (d) Homeworld issues: -----
# (+) Popular homes: From which homeworld do the most indidividuals (rows) come from?
sw %>%
group_by(homeworld) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups: homeworld [49]
#> homeworld n
#> <chr> <int>
#> 1 Naboo 11
#> 2 Tatooine 10
#> 3 <NA> 10
#> 4 Alderaan 3
#> 5 Coruscant 3
#> 6 Kamino 3
#> 7 Corellia 2
#> 8 Kashyyyk 2
#> 9 Mirial 2
#> 10 Ryloth 2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)
# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld?
sw %>%
filter(homeworld == "Naboo", eye_color == "orange") %>%
summarise(n = n(),
mn_height = mean(height))
#> # A tibble: 1 x 2
#> n mn_height
#> <int> <dbl>
#> 1 3 209.
## Note:
sw %>%
filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
#> 2 Ackb… 180 83 none brown mot… orange 41 male
#> 3 Jar … 196 66 none orange orange 52 male
#> 4 Roos… 224 82 none grey orange NA male
#> 5 Rugo… 206 NA none green orange NA male
#> 6 Sebu… 112 40 none grey, red orange NA male
#> 7 Ben … 163 65 none grey, gre… orange NA male
#> 8 Saes… 188 NA none pale orange NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) What is the mass and homeworld of the smallest droid?
sw %>%
filter(species == "Droid") %>%
arrange(height)
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 2 R5-D4 97 32 <NA> white, red red NA <NA>
#> 3 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 4 IG-88 200 140 none metal red 15 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>Size and mass issues
Compute the median, mean, and standard deviation of
heightfor all droids.Compute the average height and mass by species and save the result as
h_m.Sort
h_mto list the 3 species with the smallest individuals (in terms of mean height).Sort
h_mto list the 3 species with the heaviest individuals (in terms of median mass).
## Size and mass issues (group summaries): -----
# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
filter(species == "Droid") %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
md_height = median(height, na.rm = TRUE),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#> n not_NA_h md_height mn_height sd_height
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 5 4 132 140 52.0
# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
group_by(species) %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
not_NA_m = sum(!is.na(mass)),
md_mass = median(mass, na.rm = TRUE)
)
h_m
#> # A tibble: 38 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Aleena 1 1 79 1 15
#> 2 Besalisk 1 1 198 1 102
#> 3 Cerean 1 1 198 1 82
#> 4 Chagrian 1 1 196 0 NA
#> 5 Clawdite 1 1 168 1 55
#> 6 Droid 5 4 140 4 53.5
#> 7 Dug 1 1 112 1 40
#> 8 Ewok 1 1 88 1 20
#> 9 Geonosian 1 1 183 1 80
#> 10 Gungan 3 3 209. 2 74
#> # ... with 28 more rows
# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height)?
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Yoda's species 1 1 66 1 17
#> 2 Aleena 1 1 79 1 15
#> 3 Ewok 1 1 88 1 20
# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass)?
h_m %>% arrange(desc(md_mass)) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Hutt 1 1 175 1 1358
#> 2 Kaleesh 1 1 216 1 159
#> 3 Wookiee 2 2 231 2 124Bonus tasks
How many individuals come from the 3 most frequent (known) species?
Which individuals are more than 20% lighter (in terms of mass) than the average mass of individuals of their own homeworld?
## Bonus questions: -----
# How many individuals come from the 3 most frequent (known) species?
sw %>%
group_by(species) %>%
count %>%
arrange(desc(n)) %>%
filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups: species [9]
#> species n
#> <chr> <int>
#> 1 Human 35
#> 2 Droid 5
#> 3 <NA> 5
#> 4 Gungan 3
#> 5 Kaminoan 2
#> 6 Mirialan 2
#> 7 Twi'lek 2
#> 8 Wookiee 2
#> 9 Zabrak 2
# Which individuals are more than 20% lighter (in terms of mass)
# than the average mass of individuals of their own homeworld?
sw %>%
select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(n_notNA_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
lighter = mass < (mn_mass - (.20 * mn_mass))
) %>%
filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups: homeworld [4]
#> name homeworld mass n_notNA_mass mn_mass lighter
#> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 R2-D2 Naboo 32 6 64.2 TRUE
#> 2 Leia Organa Alderaan 49 2 64 TRUE
#> 3 R5-D4 Tatooine 32 8 85.4 TRUE
#> 4 Yoda <NA> 17 3 82 TRUE
#> 5 Padmé Amidala Naboo 45 6 64.2 TRUEExercise 2
Sleeping mammals
The dataset ggplot2::msleep contains a mammals sleep dataset (see ?msleep for details and the definition of variables).
- Save the data as
spand check the dimensions, variable types, and number of missing values in the dataset.
## Data:
# ?msleep # check variables
sp <- ggplot2::msleep# Check:
dim(sp) # 83 x 11 variables
#> [1] 83 11
glimpse(sp) # 5 <chr> and 6 <dbl>
#> Observations: 83
#> Variables: 11
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
sum(is.na(sp)) # 136 missing values
#> [1] 136Arranging and filtering data
Use the dplyr-verbs arrange, group_by, and filter to answer the following questions by creating ordered subsets of the data:
Arrange the rows (alphabetically) by
vore,order, andname, and report thegenusof the top 3 mammals.What is the most common type of
vorein the data? How many omnivores are there?What is the most common
orderin the dataset? Are there more exemplars of theorder“Carnivora” or “Primates”?Which 2 mammals of the order “Primates” have the longest and shortest
sleep_totaltimes?
# Arranging rows:
sp1 <- sp %>%
arrange(vore, order, name)
sp1$genus[1:3] # => top 3: "Vulpes" "Phoca" "Acinonyx"
#> [1] "Vulpes" "Phoca" "Acinonyx"
# Counting common vores:
# (a) short solution:
sp %>%
count(vore) %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> vore n
#> <chr> <int>
#> 1 herbi 32
#> 2 omni 20
#> 3 carni 19
#> 4 <NA> 7
#> 5 insecti 5
# (a) is the same as (b):
sp %>%
group_by(vore) %>%
tally() %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> vore n
#> <chr> <int>
#> 1 herbi 32
#> 2 omni 20
#> 3 carni 19
#> 4 <NA> 7
#> 5 insecti 5
# (b) is the same as (c):
sp %>%
group_by(vore) %>%
summarise(n = n()) %>%
arrange(desc(n))
#> # A tibble: 5 x 2
#> vore n
#> <chr> <int>
#> 1 herbi 32
#> 2 omni 20
#> 3 carni 19
#> 4 <NA> 7
#> 5 insecti 5
# => 32 herbivores, 20 omnivores
# Counting common orders:
sp %>%
count(order) %>%
arrange(desc(n))
#> # A tibble: 19 x 2
#> order n
#> <chr> <int>
#> 1 Rodentia 22
#> 2 Carnivora 12
#> 3 Primates 12
#> 4 Artiodactyla 6
#> 5 Soricomorpha 5
#> 6 Cetacea 3
#> 7 Hyracoidea 3
#> 8 Perissodactyla 3
#> 9 Chiroptera 2
#> 10 Cingulata 2
#> 11 Didelphimorphia 2
#> 12 Diprotodontia 2
#> 13 Erinaceomorpha 2
#> 14 Proboscidea 2
#> 15 Afrosoricida 1
#> 16 Lagomorpha 1
#> 17 Monotremata 1
#> 18 Pilosa 1
#> 19 Scandentia 1
# OR:
sp %>%
group_by(order) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 19 x 2
#> # Groups: order [19]
#> order n
#> <chr> <int>
#> 1 Rodentia 22
#> 2 Carnivora 12
#> 3 Primates 12
#> 4 Artiodactyla 6
#> 5 Soricomorpha 5
#> 6 Cetacea 3
#> 7 Hyracoidea 3
#> 8 Perissodactyla 3
#> 9 Chiroptera 2
#> 10 Cingulata 2
#> 11 Didelphimorphia 2
#> 12 Diprotodontia 2
#> 13 Erinaceomorpha 2
#> 14 Proboscidea 2
#> 15 Afrosoricida 1
#> 16 Lagomorpha 1
#> 17 Monotremata 1
#> 18 Pilosa 1
#> 19 Scandentia 1
# => 22 Rodentia, 12 Carnivora = 12 Primates
# Primates with longest and shortest sleep_total times:
sp %>%
filter(order == "Primates") %>%
arrange(desc(sleep_total))
#> # A tibble: 12 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Owl … Aotus omni Prim… <NA> 17 1.8 NA
#> 2 Slow… Nyct… carni Prim… <NA> 11 NA NA
#> 3 Potto Pero… omni Prim… lc 11 NA NA
#> 4 Pata… Eryt… omni Prim… lc 10.9 1.1 NA
#> 5 Maca… Maca… omni Prim… <NA> 10.1 1.2 0.75
#> 6 Griv… Cerc… omni Prim… lc 10 0.7 NA
#> 7 Gala… Gala… omni Prim… <NA> 9.8 1.1 0.55
#> 8 Chim… Pan omni Prim… <NA> 9.7 1.4 1.42
#> 9 Squi… Saim… omni Prim… <NA> 9.6 1.4 NA
#> 10 Mong… Lemur herbi Prim… vu 9.5 0.9 NA
#> 11 Babo… Papio omni Prim… <NA> 9.4 1 0.667
#> 12 Human Homo omni Prim… <NA> 8 1.9 1.5
#> # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
# => max: Owl monkey: 17 hours, min: Human 8 hours.Computing new variables
Solve the following tasks by mutate, group_by, and summarise:
Compute a variable
sleep_awake_sumthat adds thesleep_totaltime and theawaketime of each mammal. What result do you expect and get?Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?
What is the minimum, average (mean), and maximum sleep cycle length for each
vore? (Hint: First group the data bygroup_by, then usesummariseon thesleep_cyclevariable, but also count the number ofNAvalues for eachvore. When computing grouped summaries,NAvalues can be removed byna.rm = TRUE.)Replace your
summariseverb in the previous task bymutate. What do you get as a result? (Hint: The last two tasks illustrate the difference betweenmutateand groupedmutatecommands.)
# Computing `sleep_awake_sum`:
sp2 <- sp %>%
mutate(sleep_awake_sum = sleep_total + awake)
sp2$sleep_awake_sum # => all 24 hours
#> [1] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [12] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [23] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.05 24.00 24.00
#> [34] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [45] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [56] 24.00 24.00 24.00 24.00 24.05 24.00 24.00 24.00 24.00 24.00 24.00
#> [67] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
#> [ reached getOption("max.print") -- omitted 8 entries ]
# Computing brain to body ratios:
sp3 <- sp %>%
mutate(brain_to_body_wt = brainwt / bodywt) %>%
arrange(brain_to_body_wt)
sp3$name[1] # => smallest ratio: Cow
#> [1] "Cow"
sp3 %>%
arrange(desc(brain_to_body_wt))
#> # A tibble: 83 x 12
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Thir… Sper… herbi Rode… lc 13.8 3.4 0.217
#> 2 Owl … Aotus omni Prim… <NA> 17 1.8 NA
#> 3 Less… Cryp… omni Sori… lc 9.1 1.4 0.15
#> 4 Squi… Saim… omni Prim… <NA> 9.6 1.4 NA
#> 5 Maca… Maca… omni Prim… <NA> 10.1 1.2 0.75
#> 6 Litt… Myot… inse… Chir… <NA> 19.9 2 0.2
#> 7 Gala… Gala… omni Prim… <NA> 9.8 1.1 0.55
#> 8 Mole… Spal… <NA> Rode… <NA> 10.6 2.4 NA
#> 9 Tree… Tupa… omni Scan… <NA> 8.9 2.6 0.233
#> 10 Human Homo omni Prim… <NA> 8 1.9 1.5
#> # ... with 73 more rows, and 4 more variables: awake <dbl>, brainwt <dbl>,
#> # bodywt <dbl>, brain_to_body_wt <dbl>
# => largest ratio: Thirteen-lined ground squirrel,
# 9 mammals have larger brain_to_body_wt than Human.
# Computing sleep cycle length for each `vore`:
sp %>%
group_by(vore) %>%
summarise(n = n(),
n_NA = sum(is.na(sleep_cycle)),
non_NA = sum(!is.na(sleep_cycle)),
min_cyc = min(sleep_cycle, na.rm = TRUE),
mn_cyc = mean(sleep_cycle, na.rm = TRUE),
max_cyc = max(sleep_cycle, na.rm = TRUE)
)
#> # A tibble: 5 x 7
#> vore n n_NA non_NA min_cyc mn_cyc max_cyc
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 carni 19 14 5 0.333 0.373 0.417
#> 2 herbi 32 20 12 0.117 0.418 1
#> 3 insecti 5 2 3 0.117 0.161 0.2
#> 4 omni 20 9 11 0.133 0.592 1.5
#> 5 <NA> 7 6 1 0.183 0.183 0.183
# Replacing summarise by mutate:
sp4 <- sp %>%
group_by(vore) %>%
mutate(n = n(),
n_NA = sum(is.na(sleep_cycle)),
non_NA = sum(!is.na(sleep_cycle)),
min_cyc = min(sleep_cycle, na.rm = TRUE),
mn_cyc = mean(sleep_cycle, na.rm = TRUE),
max_cyc = max(sleep_cycle, na.rm = TRUE)
)
# => A tibble that contains grouped summaries as 6 new variables:
sp4 %>%
select(vore, name, sleep_cycle, n:max_cyc) %>%
arrange(vore)
#> # A tibble: 83 x 9
#> # Groups: vore [5]
#> vore name sleep_cycle n n_NA non_NA min_cyc mn_cyc max_cyc
#> <chr> <chr> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 carni Cheetah NA 19 14 5 0.333 0.373 0.417
#> 2 carni Northern f… 0.383 19 14 5 0.333 0.373 0.417
#> 3 carni Dog 0.333 19 14 5 0.333 0.373 0.417
#> 4 carni Long-nosed… 0.383 19 14 5 0.333 0.373 0.417
#> 5 carni Domestic c… 0.417 19 14 5 0.333 0.373 0.417
#> 6 carni Pilot whale NA 19 14 5 0.333 0.373 0.417
#> 7 carni Gray seal NA 19 14 5 0.333 0.373 0.417
#> 8 carni Thick-tail… NA 19 14 5 0.333 0.373 0.417
#> 9 carni Slow loris NA 19 14 5 0.333 0.373 0.417
#> 10 carni Northern g… NA 19 14 5 0.333 0.373 0.417
#> # ... with 73 more rowsExercise 3
Outliers
This exercise examines different possibilities for defining outliers and uses a generated dataset (entitled out.csv and available at http://rpository.com/ds4psy/data/out.csv) to illustate and compare them. It also helps to further disentangle the difference between mutate and grouped mutate commands.
Data on outliers
Use the following read_csv() command to obtain and load the data into an R object entitled outliers:
## Load data (as comma-separated file):
outliers <- read_csv("http://rpository.com/ds4psy/data/out.csv") # from online source
## Alternatively (from local source):
# outliers <- read_csv("out.csv") # from current directoryNot all outliers are alike
An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., 2 standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.
Basic tasks
Save the data into a tibble
outliersand report its number of observations and variables, and their types.How many missing data values are there in
outliers?What is the gender (or
sex) distribution in this sample?Create a plot that shows the distribution of
heightvalues for each gender.
# Load and inspect data:
# outliers <- read_csv("out.csv") # read in csv-file
# outliers <- as_tibble(data) # if data is not already a tibble
dim(outliers) # => 1000 observations (rows) x 3 variables (columns)
#> [1] 1000 3
# Missing data points:
sum(is.na(outliers)) # => 18 missing values
#> [1] 18
# Gender distribution:
outliers %>%
group_by(sex) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: sex [2]
#> sex n
#> <chr> <int>
#> 1 female 507
#> 2 male 493
# => 50.7% females, 49.3% males.
# Distributions of `height` as density plot:
ggplot(outliers, aes(x = height)) +
geom_density(fill = "gold", alpha = 2/3) +
geom_density(aes(fill = sex), alpha = 2/5) +
labs(title = "Distribution of heights overall and by gender",
fill = "Gender") +
# scale_fill_manual(values = c("firebrick", "steelblue3")) +
scale_fill_brewer(palette = "Set1") + # using Brewer palette
theme_bw()
# Note: To avoid the warning about removing 18 cases with NA-values,
# we could first filter out those cases:
# non_NA_data <- filter(outliers, !is.na(height))
# Alternative solution as 2 histograms:
ggplot(outliers) +
facet_wrap(~sex) +
geom_histogram(aes(x = height, fill = sex), binwidth = 5, color = "grey10") +
labs(title = "Distribution of heights by gender",
x = "Height", y = "Frequency") +
scale_fill_brewer(name = "Gender:", palette = "Set1") + # using Brewer palette
# scale_fill_manual(name = "Gender:", values = c("firebrick", "steelblue3")) +
theme_bw()Defining different outliers
Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:
outliers relative to the
heightof the overall sample (i.e., individuals withheightvalues deviating more than 2 SD from the overall mean ofheight);outliers relative to the
heightof some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals withheightvalues deviating more than 2 SD from the meanheightof their own gender).
Hints: As both variable signal whether or not someone is an outlier they should be defined as logicals (being either TRUE or FALSE) and added as new columns to data (via appropriate mutate commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is provided by abs(x - y).
Relative outliers
Now use the 2 new outlier variables to define (or filter) 2 subsets of the data that contain 2 subgroups of people:
out_1: Individuals (females and males) withheightvalues that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are inoutliers?out_2: Individuals (females and males) withheightvalues that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are inoutliers?
## Defining different outliers: -----
# Included in data_out (below), but also possible to do separately:
# Compute the number, means and SD of height values in 2 ways:
# 1. overall:
outliers %>%
summarise(n = n(),
n_not_NA = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 4
#> n n_not_NA mn_height sd_height
#> <int> <int> <dbl> <dbl>
#> 1 1000 982 175. 11.3
# 2. by gender:
outliers %>%
group_by(sex) %>%
summarise(n = n(),
n_not_NA = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 2 x 5
#> sex n n_not_NA mn_height sd_height
#> <chr> <int> <int> <dbl> <dbl>
#> 1 female 507 501 169. 8.19
#> 2 male 493 481 181. 11.0
# Detecting and marking outliers (by logical variables):
# Compute the means, SDs, and corresponding outliers in 2 ways:
crit <- 2 # criterion value for detecting outliers (in SD units)
data_out <- outliers %>%
# 1. Compute means, SD, and outliers for overall sample:
mutate(mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
out_height = abs(height - mn_height) > (crit * sd_height)) %>%
group_by(sex) %>%
# 2. Compute same metrics for subgroups (by sex):
mutate(mn_sex_height = mean(height, na.rm = TRUE),
sd_sex_height = sd(height, na.rm = TRUE),
out_sex_height = abs(height - mn_sex_height) > (crit * sd_sex_height))
knitr::kable(head(data_out))| id | sex | height | mn_height | sd_height | out_height | mn_sex_height | sd_sex_height | out_sex_height |
|---|---|---|---|---|---|---|---|---|
| nr.1 | female | 164 | 174.7006 | 11.26015 | FALSE | 169.0679 | 8.193619 | FALSE |
| nr.2 | male | 178 | 174.7006 | 11.26015 | FALSE | 180.5676 | 11.026677 | FALSE |
| nr.3 | female | 177 | 174.7006 | 11.26015 | FALSE | 169.0679 | 8.193619 | FALSE |
| nr.4 | male | 188 | 174.7006 | 11.26015 | FALSE | 180.5676 | 11.026677 | FALSE |
| nr.5 | male | 193 | 174.7006 | 11.26015 | FALSE | 180.5676 | 11.026677 | FALSE |
| nr.6 | female | 168 | 174.7006 | 11.26015 | FALSE | 169.0679 | 8.193619 | FALSE |
## Relative outliers: -----
# Filter specific combinations of outliers:
# 1. Outliers relative to the entire population AND to their own gender:
out_1 <- data_out %>%
filter(out_height & out_sex_height) %>%
arrange(sex, height)
nrow(out_1) # => 21 individuals.
#> [1] 21
# 2. Outliers relative to their own gender, but NOT relative to the entire population:
out_2 <- data_out %>%
filter(!out_height & out_sex_height) %>%
arrange(sex, height)
nrow(out_2) # => 24 individuals.
#> [1] 24Bonus plots
Visualize the raw values and distributions of
heightfor both types of outliers (out_1andout_2) in 2 separate plots.Interpret both plots by describing the
heightandsexcombination of the individuals shown in each plot.
# Visualization and interpretation of both types of outliers:
# 1. Showing out_1:
ggplot(out_1, aes(x = sex, y = height)) +
geom_violin(aes(fill = sex)) +
geom_jitter(size = 4, alpha = 2/3) +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
labs(title = "Outliers relative to both overall sample and gender",
x = "Gender", y = "Height (in cm)",
fill = "Gender:") +
theme_bw()
# Interpretation:
# `out_1` contains mostly short women (except for 1 tall woman)
# and mostly tall men (except for 2 short men).
# 2. Showing out_2:
ggplot(out_2, aes(x = sex, y = height)) +
geom_violin(aes(fill = sex)) +
geom_jitter(size = 4, alpha = 2/3) +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
labs(title = "Outliers relative to gender but not overall sample",
x = "Gender", y = "Height (in cm)",
fill = "Gender:") +
theme_bw()
# Interpretation:
# `out_2` contains individuals which are either tall women or short men.Exercise 4
In Exercise 6 of WPA01 and Exercise 5 of WPA02 you used the p_info data (available at http://rpository.com/ds4psy/data/posPsy_participants.csv) from
- Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R. and Schüz, B. (2018). Data from, ‘Web-based positive psychology interventions: A reexamination of effectiveness’. Journal of Open Psychology Data, 6: 1. DOI: https://doi.org/10.5334/jopd.35
to explore the participant information and create some corresponding plots.
library(readr)
# Read data (from online source):
p_info <- read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
# p_info
dim(p_info) # 295 rows, 6 columns
#> [1] 295 6Answer the same questions as in those exercises by transforming your earlier base R commands and graphs (created by using ggplot2 commands) into pipes of dplyr commands that provide the same information.
From WPA01: Exercise 6
Questions from Exercise 6 of WPA01:
Examine the participant information in p_info by describing each of its variables:
- How many individuals are contained in the dataset?
- What percentage of them is female (i.e., has a
sexvalue of 1)? - How many participants were in one of the 3 treatment groups (i.e., have an
interventionvalue of 1, 2, or 3)? - What is the participants’ mean education level? What percentage has a university degree (i.e., an
educvalue of at least 4)? - What is the age range (
mintomax) of participants? What is the average (mean and median) age? - Describe the range of
incomelevels present in this sample of participants. What percentage of participants self-identifies as a below-average income (i.e., anincomevalue of 1)?
# 1. How many individuals are contained in the dataset?
N <- nrow(p_info) # OR
N # 295
#> [1] 295
# Note:
p_info %>% count()
#> # A tibble: 1 x 1
#> n
#> <int>
#> 1 295
# would yield a tibble (with only 1 element: 295).
# 2. What percentage of them is female (i.e., has a `sex` value of 1)?
p_info %>%
group_by(sex) %>%
summarise(n_sex = n(), # number
pc_sex = n_sex/N * 100 # percentage
)
#> # A tibble: 2 x 3
#> sex n_sex pc_sex
#> <int> <int> <dbl>
#> 1 1 251 85.1
#> 2 2 44 14.9
# 3. How many participants were in one of the 3 treatment groups (i.e., have an `intervention` value of 1, 2, or 3)?
p_info %>%
filter(intervention < 4) %>%
count()
#> # A tibble: 1 x 1
#> n
#> <int>
#> 1 222
# => 222 individuals
# OR:
t_iv <- p_info %>%
group_by(intervention) %>%
summarise(n_iv = n(), # number
pc_iv = n_iv/N * 100 # percentage
)
t_iv
#> # A tibble: 4 x 3
#> intervention n_iv pc_iv
#> <int> <int> <dbl>
#> 1 1 72 24.4
#> 2 2 76 25.8
#> 3 3 74 25.1
#> 4 4 73 24.7
sum(t_iv$n_iv[1:3])
#> [1] 222
# => 222 in elements 1:3 of vector n_iv
# 4. What is the participants' mean education level?
p_info %>%
summarise(mn_edu = mean(educ))
#> # A tibble: 1 x 1
#> mn_edu
#> <dbl>
#> 1 3.98
# What percentage has a university degree (i.e., an `educ` value of at least 4)?
p_info %>%
group_by(educ) %>%
summarise(n_edu = n(), # number
pc_edu = n_edu/N * 100 # percentage
)
#> # A tibble: 5 x 3
#> educ n_edu pc_edu
#> <int> <int> <dbl>
#> 1 1 14 4.75
#> 2 2 21 7.12
#> 3 3 39 13.2
#> 4 4 104 35.3
#> 5 5 117 39.7
# 5. What is the age range (`min` to `max`) of participants?
# What is the average (mean and median) age?
p_info %>%
summarise(n = n(),
min_age = min(age),
mn_age = mean(age),
max_age = max(age))
#> # A tibble: 1 x 4
#> n min_age mn_age max_age
#> <int> <dbl> <dbl> <dbl>
#> 1 295 18 43.8 83
# 6. Describe the range of `income` levels present in this sample of participants.
# What percentage of participants self-identifies as a below-average income
# (i.e., an `income` value of 1)?
p_info %>%
group_by(income) %>%
summarise(n_income = n(),
pc_income = n_income/N * 100)
#> # A tibble: 3 x 3
#> income n_income pc_income
#> <int> <int> <dbl>
#> 1 1 73 24.7
#> 2 2 136 46.1
#> 3 3 86 29.2From WPA02: Exercise 5
Questions from Exercise 5 of WPA02
Use the p_info data to create some plots that descripte the sample of participants:
- A histogram that shows the distribution of participant
agein 3 ways:- overall,
- separately for each
sex, and - separately for each
intervention.
When using dplyr instead of ggplot2, we can replace the information contained in the histogram by a table of descriptives:
# Age distribution (as a table):
age_overall <- p_info %>%
summarise(n = n(), # number
min_age = min(age), # minimum
mn_age = mean(age), # mean
md_age = median(age), # median
sd_age = sd(age), # standard deviation
max_age = max(age) # maximum
)
age_overall
#> # A tibble: 1 x 6
#> n min_age mn_age md_age sd_age max_age
#> <int> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 295 18 43.8 44 12.4 83
# Age distribution by sex (as a table):
age_by_sex <- p_info %>%
group_by(sex) %>%
summarise(n = n(), # number
min_age = min(age), # minimum
mn_age = mean(age), # mean
md_age = median(age), # median
sd_age = sd(age), # standard deviation
max_age = max(age) # maximum
)
age_by_sex
#> # A tibble: 2 x 7
#> sex n min_age mn_age md_age sd_age max_age
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 251 19 43.9 44 12.1 83
#> 2 2 44 18 43.0 44 14.1 71
# Age distribution by intervention (as a table):
age_by_iv <- p_info %>%
group_by(intervention) %>%
summarise(n = n(), # number
min_age = min(age), # minimum
mn_age = mean(age), # mean
md_age = median(age), # median
sd_age = sd(age), # standard deviation
max_age = max(age) # maximum
)
age_by_iv
#> # A tibble: 4 x 7
#> intervention n min_age mn_age md_age sd_age max_age
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 72 22 44.6 45 12.1 71
#> 2 2 76 18 45.4 45.5 12.5 83
#> 3 3 74 19 43.3 44 12.2 71
#> 4 4 73 19 41.7 40 12.8 75- A bar plot that
- shows how many participants took part in each
intervention; or - shows how many participants of each
sextook part in eachintervention.
- shows how many participants took part in each
Using dplyr instead of ggplot2:
# Number of participants per intervention:
# (was contained in age_by_iv above):
age_by_iv %>%
select(intervention, n)
#> # A tibble: 4 x 2
#> intervention n
#> <int> <int>
#> 1 1 72
#> 2 2 76
#> 3 3 74
#> 4 4 73
# N and percentage by intervention and sex:
p_info %>%
group_by(intervention, sex) %>%
summarise(n_iv_sex = n(), # number
pc_iv_sex = n_iv_sex/N * 100 # percentage
)
#> # A tibble: 8 x 4
#> # Groups: intervention [?]
#> intervention sex n_iv_sex pc_iv_sex
#> <int> <int> <int> <dbl>
#> 1 1 1 62 21.0
#> 2 1 2 10 3.39
#> 3 2 1 66 22.4
#> 4 2 2 10 3.39
#> 5 3 1 62 21.0
#> 6 3 2 12 4.07
#> 7 4 1 61 20.7
#> 8 4 2 12 4.07More on data transformation
For more details on dplyr,
- study
vignette("dplyr")and the documentation for?arrange,?filter,?select, etc. - study https://dplyr.tidyverse.org/ and its examples;
- see the cheat sheet on data transformation;
- read Chapter 5: Data transformation and complete its exercises.
Conclusion
All ds4psy essentials so far:
| Nr. | Topic |
|---|---|
| 0. | Syllabus |
| 1. | Basic R concepts and commands |
| 2. | Visualizing data |
| 3. | Transforming data |
| 4. | Exploring data (EDA) |
| +. | Datasets |
[Last update on 2018-11-27 22:57:45 by hn.]