Introduction
All ds4psy essentials so far:
Nr. | Topic |
---|---|
0. | Syllabus |
1. | Basic R concepts and commands |
Course coordinates
- PSY-15150, at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Summer 2019: Mondays, 15:15–16:45, D435.
- Links to current course syllabus | ZeUS | Ilias
Preliminaries
This session provides some background knowledge and basic facts that are required for learning data science (in R or any other programming language). It assumes the following:
Software: You have installed the software prerequisites mentioned in Chapter 1.4 of r4ds. Specifically,
Readings: You have read the following chapters of r4ds:
- Introductions (Chapter 1 & Chapter 2);
- Workflow: basics (Chapter 4) & scripts (Chapter 6).
This implies that you understand and can do the following:
- Enter and run R commands at the prompt in the Console window of RStudio, and check their results;
- Use R as a calculator for simple arithmetic;
- Assign numeric values and characters to named objects;
- Call simple R functions on objects;
- Enter and run R scripts in the Editor window of RStudio.
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## R basics | ds4psy
## 2019 04 15
## ----------------------------
## Preparations: ----------
library(tidyverse)
## Topic: ----------
# ...
## End of file (eof). ----------
Basic R Concepts and Commands
Orientation
This chapter provides a brief introduction to basic R concepts and commands (with examples and exercises).
Strictly speaking, knowing some R is not a necessary precondition for reading and learning data science with our textbook (r4ds). However, it is certainly helpful – partly to appreciate how various tidyverse commands let you solve many problems in simpler and more transparent ways. Thus, please work through the examples, try to understand them, and then solve the corresponding exercises below.
After working through this chapter, you should be able to:
- categorize R objects into data vs. functions;
- distinguish between different shapes (e.g., scalars, vectors, rectangles) and types (e.g., numbers, text, logical values) of data;
- create and change R objects by assignment;
- apply arithmetic functions to objects;
- select elements from vectors or rectangular data structures (by indexing).
Data vs. functions
“To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.”
John Chambers
Using R and most other programming languages consists in
- defining or loading data (material, values), and
- evaluating functions (actions, verbs).
Confusingly, both data and functions in R are objects (stuff) and evaluating functions typically returns or creates new data objects (more stuff). To distinguish data from functions, think of data as matter (i.e., material or values that are being measured or manipulated) and functions as procedures (i.e., actions, operations, or verbs, that measure or manipulate data).
In the following, we will introduce some ways to describe data (by shape and type), learn to create new data objects (by assigning them with <-), and apply some pre-defined functions to these data objects.
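The distinction between data objects and function objects can be checked directly. The following is a minimal sketch using only base R; the object names x and total are chosen for illustration:

```r
# Data objects and function objects can be inspected with the same tools.
x <- c(1, 2, 3)          # a data object (numeric vector)

is.function(x)           # FALSE: x holds data
is.function(sum)         # TRUE: sum is a function object
class(sum)               # "function"

# Evaluating a function on data returns a new data object:
total <- sum(x)
is.function(total)       # FALSE: the result is data again
total                    # 6
```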
Data
In R, different data objects are characterized by their shape and by their type.
The most common shapes of data are:
- scalars: atomic objects (i.e., a data point, with a length of 1);
- vectors: a chain/sequence of objects of the same type (i.e., extending in 1 dimension: length);
- data frames/matrices/tibbles: rectangular data (i.e., tables with 2 dimensions: rows vs. columns).
The most common types of data are:
- numeric data (of type double or integer);
- text data (of type character);
- logical data (of type logical).
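As a small illustration of these types, the base R function typeof() reports the type of any value. The object name types is only used here to collect the results:

```r
typeof(1)        # "double": numbers are doubles by default
typeof(1L)       # "integer": the L suffix requests an integer
typeof("1")      # "character": quotes turn the digit into text
typeof(TRUE)     # "logical"

# Collect all 4 results in one character vector:
types <- c(typeof(1), typeof(1L), typeof("1"), typeof(TRUE))
types
```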
Functions
In R, functions are ‘action objects’ that are applied to ‘data objects’ (a function’s so-called arguments, e.g., a1 and a2) by specifying the function name and providing the arguments in round parentheses (i.e., function(a1, a2)). Think of functions as a process that takes some input (its arguments) and transforms it into output (the result returned by the function). An example of a function with 2 arguments is:
sum(1, 2)
#> [1] 3
Here, the function sum is applied to 2 numeric arguments 1 and 2 (or a data structure that consists of 2 numeric scalars). Evaluating the expression sum(1, 2) returns a new data object 3, which is the sum of the 2 original arguments.
Practice
Evaluate the following expressions and describe what they are doing (in terms of applying functions to arguments to obtain new data):
min(1, 2, -3, 4)
#> [1] -3
paste0("ab", "cd")
#> [1] "abcd"
substr("television", start = 5, stop = 10)
#> [1] "vision"
substr("television", 20, 30) # yields ""
#> [1] ""
Defining objects
To define a new object o as x, use the assignment o <- x. Note that object names are case-sensitive (i.e., a and A are different object names).
For example, we can assign the output of a function to some object:
s <- sum(1, 2)
s
#> [1] 3
Here, we created a new object (named s) and assigned the sum of 1 and 2 to it. As s is a numeric object (with the value 3), we can now apply any numeric function to it:
s * 2
#> [1] 6
s^2
#> [1] 9
Here is another example of creating 2 simple objects and applying simple arithmetic functions to them:
o <- 10 # assign/set o to 10
O <- 5 # assign/set O to 5
# Computing with objects:
o * O
#> [1] 50
o * 0
#> [1] 0
o / O * 0
#> [1] 0
In R, objects can be described by their length and type.
- Objects with a length of 1 are called scalars, longer objects (i.e., length(object) > 1) are vectors or lists.
- The type of an object is determined by the value that has been assigned to it (e.g., numeric, character, or logical).
Hence, we can change an object’s type by re-assigning it:
o <- 100
o
#> [1] 100
# Check type:
is.numeric(o)
#> [1] TRUE
is.character(o)
#> [1] FALSE
is.logical(o)
#> [1] FALSE
# Re-assigning o:
o <- paste0("ene", " ", "mene", " ", "mu")
o
#> [1] "ene mene mu"
# Check type:
is.numeric(o)
#> [1] FALSE
is.character(o)
#> [1] TRUE
is.logical(o)
#> [1] FALSE
Practice
Assign o to 1 > 2 and check its type.
# Re-assign o:
o <- 1 > 2
o
#> [1] FALSE
# Check type:
is.numeric(o)
#> [1] FALSE
is.character(o)
#> [1] FALSE
is.logical(o)
#> [1] TRUE
Naming objects
Beware of the following characteristics and constraints:
- R is case-sensitive (so tea_pot, Tea_pot and tea_Pot are different names that denote 3 different objects);
- No spaces inside names (even though a name wrapped in backticks, like `tea pot`, is possible);
- No special keywords that are reserved for R commands (TRUE, FALSE, function, for, in, next, break, if, else, repeat, while, NULL, Inf, NaN, NA and some variants like NA_character_, NA_integer_, …).
Recommendations:
- Aim for short and consistent names;
- Avoid dots and special characters in names;
- Use snake_case (with underscores) or camelCase (with capitalised first letters) for combined names.
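The naming constraints above can be tried out in a short sketch. The object names and the text "not allowed" are chosen for illustration; tryCatch() is used so that the failed assignment does not stop the script:

```r
tea_pot <- "lower case"
Tea_pot <- "upper case T"    # a different object, as R is case-sensitive
identical(tea_pot, Tea_pot)  # FALSE

# A name with a space only works when wrapped in backticks (not recommended):
`tea pot` <- "name with a space"
`tea pot`

# Reserved words cannot be used as object names. Here, the failed
# assignment is caught and replaced by a short text:
result <- tryCatch(
  eval(parse(text = "TRUE <- 1")),
  error = function(e) "not allowed"
)
result   # "not allowed"
```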
Scalars
Defining scalars
We have learned that scalars are objects of length 1 (aka. atomic objects) and how to define objects by assigning names to them. So let’s define some scalar objects and then use some generic functions to check their length and type:
a <- 1 # assign/set a to 1
a # print a
#> [1] 1
a + 2 # evaluate a + 2
#> [1] 3
sum(a, 2) # evaluate the function sum() with arguments a and 2
#> [1] 3
b <- 2
b
#> [1] 2
b * b
#> [1] 4
prod(b, b)
#> [1] 4
b ^ 2
#> [1] 4
b^3
#> [1] 8
c <- a + b # assign c to the sum of a + b
c
#> [1] 3
length(c) # 1 could indicate a scalar OR the number of digits...
#> [1] 1
length(1000) # Check: also 1 (i.e., NOT number of digits)
#> [1] 1
typeof(c) # numbers are of type "integer" or "double"
#> [1] "double"
typeof(3.14159) # decimal numbers are of type "double"
#> [1] "double"
d <- "word" # note the quotes ("")
d
#> [1] "word"
length(d) # also a scalar
#> [1] 1
typeof(d) # but of type "character"
#> [1] "character"
b
#> [1] 2
a
#> [1] 1
b > a # result is neither number nor character:
#> [1] TRUE
typeof(a > b) # yet another type: "logical"
#> [1] "logical"
e <- b > a
e
#> [1] TRUE
length(e) # also a scalar
#> [1] 1
typeof(e) # of type "logical"
#> [1] "logical"
Thus, our exploration has shown that objects a, b and c are numeric objects (which can be of type integer or of type double), whereas d is a text object (of type character), and e is the result of a test that is either TRUE or FALSE (of type logical).
Changing objects
To change an existing object, we need to re-assign it. Thus, changing an object works just like creating it:
# Check values (defined above):
a
#> [1] 1
b
#> [1] 2
a/b
#> [1] 0.5
a <- 100 # changes a
a # a has changed
#> [1] 100
a/b # a/b changes when a has been changed
#> [1] 50
b <- 200 # changes b
b # b has changed
#> [1] 200
a/b # a/b changes when b has been changed
#> [1] 0.5
d
#> [1] "word"
d <- "weird" # changes d
d
#> [1] "weird"
This implies that the order of evaluations matters: The same object or expression (e.g., a or a/b) has different contents at different locations and at different times. (Note that the line numbers to the left of your editor window mark locations and that R scripts are typically evaluated in a top-down fashion.)
Applying functions to scalars
We have applied some simple functions to data arguments above, but not all functions can be applied to all data. Importantly, most functions require specific types of arguments to work (i.e., the actual argument types must match the required argument types of the function).
When viewing this requirement from the perspective of existing objects, the type of an object determines which functions can be applied to it:
# Start with numeric objects:
a
#> [1] 100
typeof(a) # a generic function (working with all object types)
#> [1] "double"
length(a) # a scalar
#> [1] 1
a + b
#> [1] 300
sum(a, b) # an arithmetic function (requiring numeric object types)
#> [1] 300
# Start with character objects:
d
#> [1] "weird"
typeof(d)
#> [1] "character"
length(d) # a scalar
#> [1] 1
nchar(d) # the "length" of a character object
#> [1] 5
# Start with logical objects:
e
#> [1] TRUE
typeof(e)
#> [1] "logical"
!e # negation (reverses logical value)
#> [1] FALSE
!!e
#> [1] TRUE
isTRUE(e) # tests a logical expression
#> [1] TRUE
isTRUE(!e)
#> [1] FALSE
e == !!e # tests equality
#> [1] TRUE
In case of a mismatch between function and object types, an error may occur:
# Evaluate the following (and explain the error):
a + d
sum(a, d)
d^2
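When exploring such mismatches in a script, it can help to capture the error instead of letting it stop the evaluation. The following sketch assumes the values of a and d from above and uses the base R functions tryCatch() and conditionMessage(); the name msg is chosen for illustration:

```r
a <- 100
d <- "weird"

# tryCatch() evaluates an expression and hands any error to a handler,
# which here simply returns the error message as text:
msg <- tryCatch(a + d, error = function(e) conditionMessage(e))
msg

# Arithmetic works once both arguments are numeric:
a + nchar(d)   # 105
```

The exact wording of the message depends on the R version and language settings, so it is not shown here.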
Arithmetic functions
For numeric objects, we can compute new numeric values by applying arithmetic functions:
## (A) Arithmetic operators: ----
+ 2 # keeping sign
#> [1] 2
- 3 # reversing sign
#> [1] -3
1 + 2 # addition
#> [1] 3
3 - 1 # subtraction
#> [1] 2
2 * 3 # multiplication
#> [1] 6
5 / 2 # division
#> [1] 2.5
5 %/% 2 # integer division
#> [1] 2
5 %% 2 # remainder of integer division (x mod y)
#> [1] 1
## (B) Operator precedence: ----
1 / 2 * 3 # left to right
#> [1] 1.5
1 + 2 * 3 # precedence: */ before +-
#> [1] 7
(1 + 2) * 3 # changing order by parentheses
#> [1] 9
# "BEDMAS" order:
# - brackets (),
# - exponents ^,
# - division / and multiplication *,
# - addition + and subtraction -
# See
# ?Syntax
# for complete rules.
2 * 2 * 2
#> [1] 8
2^3
#> [1] 8
## (C) Arithmetic with scalar objects: ----
x <- 2
y <- 3
+ x # keeping sign
#> [1] 2
- y # reversing sign
#> [1] -3
x + y # addition
#> [1] 5
x - y # subtraction
#> [1] -1
x * y # multiplication
#> [1] 6
x / y # division
#> [1] 0.6666667
x ^ y # exponentiation
#> [1] 8
x %/% y # integer division
#> [1] 0
x %% y # remainder of integer division (x mod y)
#> [1] 2
The same arithmetic operators also work with numeric vectors (see Exercise 3 below). (See ?Arithmetic for help on arithmetic operators and ?Syntax for a full list of precedence groups.)
Logical values and operators
By comparing numbers and combining the results with logical operators, we obtain logical values (i.e., scalars of type logical that are either TRUE or FALSE):
## Logical comparisons:
2 > 1 # larger than
#> [1] TRUE
2 >= 2 # larger than or equal to
#> [1] TRUE
2 < 1 # smaller than
#> [1] FALSE
2 <= 1 # smaller than or equal to
#> [1] FALSE
1 == 1 # == ... equality
#> [1] TRUE
1 != 1 # != ... inequality
#> [1] FALSE
## Logical operators:
(2 > 1) & (1 > 2) # & ... logical AND
#> [1] FALSE
(2 < 1) | (1 < 2) # | ... logical OR
#> [1] TRUE
(1 < 1) | !(1 < 1) # ! ... logical negation
#> [1] TRUE
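Comparisons and logical operators are typically combined to test whether a value lies inside or outside some range. A minimal sketch, in which the value of age and the object names are hypothetical:

```r
age <- 30   # hypothetical value for illustration

# Both conditions must hold (logical AND):
in_range <- (age >= 18) & (age <= 64)
in_range                     # TRUE

# At least one condition must hold (logical OR):
out_of_range <- (age < 18) | (age > 64)
out_of_range                 # FALSE

# Negating one test yields the other:
(!in_range) == out_of_range  # TRUE
```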
Vectors
Vectors are the most common and most important data structure in R. A vector is a sequence of objects of the same type.
Creating vectors
To create a new vector, we can combine several objects of the same type with the c() function, separating vector elements by commas:
# Creating vectors:
c(1, 2, 3)
#> [1] 1 2 3
c(a, b)
#> [1] 100 200
v <- c(a, b, c)
v
#> [1] 100 200 3
v <- c(c, c, c) # vectors can have repeated elements
v
#> [1] 3 3 3
v <- c(a, b, v) # Note that vectors can contain vectors, ...
v
#> [1] 100 200 3 3 3
v <- c(v, v) # but the result is only 1 vector, not 2.
v
#> [1] 100 200 3 3 3 100 200 3 3 3
# Character vectors:
w <- c("one", "two", "three")
w
#> [1] "one" "two" "three"
w <- c(w, "four", "5", "many")
w
#> [1] "one" "two" "three" "four" "5" "many"
# Applying functions to vectors:
length(v)
#> [1] 10
typeof(v)
#> [1] "double"
sum(v)
#> [1] 618
length(w)
#> [1] 6
typeof(w)
#> [1] "character"
# sum(w) # would yield an error
# Combining different types:
x <- c(1, "two", 3)
x
#> [1] "1" "two" "3"
typeof(x) # converted 1 to "1" (as all vector elements must be of the same type)
#> [1] "character"
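The conversion seen in the last example follows a fixed order: when types are mixed, logical values are turned into numbers, and numbers into text. A short sketch with base R only:

```r
typeof(c(TRUE, 2))      # "double":    TRUE becomes 1
typeof(c(2, "two"))     # "character": 2 becomes "2"
typeof(c(TRUE, "two"))  # "character": TRUE becomes "TRUE"

c(TRUE, 2)              # 1 2
c(TRUE, "two")          # "TRUE" "two"
```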
Scalar objects are vectors
Actually, R has no dedicated type of scalar objects. Instead, individual numbers (of type integer or double), text strings (of type character), and logical values (of type logical) are vectors of length 1:
a
#> [1] 100
is.vector(a)
#> [1] TRUE
length(a)
#> [1] 1
d
#> [1] "weird"
is.vector(d)
#> [1] TRUE
length(d)
#> [1] 1
e
#> [1] TRUE
is.vector(e)
#> [1] TRUE
length(e)
#> [1] 1
Special vector creation functions
For creating vectors with more than just a few elements (i.e., with larger length values), the c function becomes impractical. Some useful functions and shortcuts are:
# Sequences (with seq):
s1 <- seq(0, 100, 1) # is short for:
s2 <- seq(from = 0, to = 100, by = 1)
s2
#> [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
#> [24] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
#> [47] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#> [70] 69 70 71 72 73 74
#> [ reached getOption("max.print") -- omitted 26 entries ]
all.equal(s1, s2)
#> [1] TRUE
# Shorter version (with by = 1):
s3 <- 0:100
all.equal(s1, s3)
#> [1] TRUE
# But seq allows different step sizes:
s4 <- seq(0, 100, by = 25)
s4
#> [1] 0 25 50 75 100
# Replicating (with rep):
s5 <- rep(c(0, 1), 3) # is short for:
s5 <- rep(x = c(0, 1), times = 3)
s5
#> [1] 0 1 0 1 0 1
# Sampling vector elements (with sample):
sample(1:3, 10, replace = TRUE)
#> [1] 2 3 2 2 2 1 1 2 2 1
# Note:
# sample(1:3, 10, replace = FALSE) # would yield an error
coin <- c("H", "T") # 2 events: Heads or Tails
sample(coin, 5, TRUE) # is short for:
#> [1] "H" "H" "H" "T" "T"
sample(x = coin, size = 5, replace = TRUE) # flip coin 5 times
#> [1] "H" "H" "T" "T" "H"
sample(x = coin, size = 1000, replace = TRUE) # flip coin 1000 times
#> [1] "H" "H" "H" "T" "H" "H" "H" "H" "T" "T" "H" "H" "H" "T" "T" "H" "H"
#> [18] "H" "H" "H" "H" "T" "H" "H" "T" "H" "T" "H" "T" "T" "H" "H" "H" "H"
#> [35] "H" "T" "H" "H" "T" "T" "H" "H" "T" "T" "H" "H" "T" "H" "T" "H" "T"
#> [52] "H" "H" "H" "T" "H" "H" "H" "H" "H" "H" "T" "H" "H" "T" "H" "T" "H"
#> [69] "T" "T" "T" "H" "H" "H" "H"
#> [ reached getOption("max.print") -- omitted 925 entries ]
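As sample() draws elements at random, the results above will differ every time the code is run. To make random results reproducible, we can fix the state of the random number generator with set.seed() before sampling. A minimal sketch (the seed value 123 and the object names are arbitrary):

```r
coin <- c("H", "T")

set.seed(123)   # fix the random number generator
flips_1 <- sample(coin, size = 10, replace = TRUE)

set.seed(123)   # same seed again
flips_2 <- sample(coin, size = 10, replace = TRUE)

identical(flips_1, flips_2)   # TRUE: same seed, same "random" flips
table(flips_1)                # counts of heads and tails
```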
Indexing vectors
We often store a lot of values in vectors (e.g., the age of 1000 participants), but only need some of them for answering specific questions (e.g., what is the average age of all male participants?). To select only a subset of elements from a vector v, we can specify the condition or criterion for our selection in (square) brackets v[...].
# Example 1: Indexing numeric vectors
x <- 1:10
x
#> [1] 1 2 3 4 5 6 7 8 9 10
crit <- x > 5 # Condition: Which values of x are larger than 5?
crit
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x[crit] # using crit to select values of x (for which crit is TRUE)
#> [1] 6 7 8 9 10
x[x > 5] # all in 1 step
#> [1] 6 7 8 9 10
## Example 2: Indexing character vectors
spices <- c("salt", "pepper", "cinnamon", "lemongrass", "mint", "mustard", "wasabi")
spices[nchar(spices) == 4] # spices with exactly 4 letters
#> [1] "salt" "mint"
spices[substr(spices, 2, 2) == "i"] # spices with an "i" at 2nd position
#> [1] "cinnamon" "mint"
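Besides logical conditions, the square brackets also accept numeric positions. The following sketch re-uses the spices vector from above and only relies on base R:

```r
spices <- c("salt", "pepper", "cinnamon", "lemongrass", "mint", "mustard", "wasabi")

spices[1]               # first element: "salt"
spices[c(2, 4)]         # 2nd and 4th elements: "pepper" "lemongrass"
spices[5:7]             # elements 5 to 7: "mint" "mustard" "wasabi"
spices[-1]              # everything except the first element
spices[length(spices)]  # last element: "wasabi"

# which() turns a logical condition into numeric positions:
which(nchar(spices) == 4)   # 1 5
```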
Rectangular data
Vectors are 1-dimensional objects (i.e., have a length, but no width). By combining several vectors, we get a rectangular data structure.
Matrices
When a rectangle of data contains data of the same type in all cells, we get a matrix of data:
x <- 1:3
y <- 4:6
z <- 7:9
# Combining vectors (of the same length): ----
r1 <- rbind(x, y, z) # combine as rows
r1
#> [,1] [,2] [,3]
#> x 1 2 3
#> y 4 5 6
#> z 7 8 9
r2 <- cbind(x, y, z) # combine as columns
r2
#> x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
# Putting a vector into a rectangular matrix:
r3 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)
r3
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
#> [2,] 5 6 7 8
#> [3,] 9 10 11 12
#> [4,] 13 14 15 16
#> [5,] 17 18 19 20
r4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)
r4
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 11 16
#> [2,] 2 7 12 17
#> [3,] 3 8 13 18
#> [4,] 4 9 14 19
#> [5,] 5 10 15 20
# Selecting cells, rows, or columns of matrices: ----
r1[2, 3] # in r1: select row 2, column 3
#> y
#> 6
r2[3, 1] # in r2: select row 3, column 1
#> x
#> 3
r1[2, ] # in r1: select row 2, all columns
#> [1] 4 5 6
r2[ , 1] # in r2: select column 1, all rows
#> [1] 1 2 3
r3
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
#> [2,] 5 6 7 8
#> [3,] 9 10 11 12
#> [4,] 13 14 15 16
#> [5,] 17 18 19 20
r3[2, 3:4] # in r3: select row 2, columns 3 to 4
#> [1] 7 8
r3[3:5, 2] # in r3: select rows 3 to 5, column 2
#> [1] 10 14 18
r4[] # in r4: select all rows and all columns (i.e., all of r4)
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 11 16
#> [2,] 2 7 12 17
#> [3,] 3 8 13 18
#> [4,] 4 9 14 19
#> [5,] 5 10 15 20
# Applying functions to matrices: ----
is.matrix(r1)
#> [1] TRUE
typeof(r2)
#> [1] "integer"
dim(r1) # dimensions of r1: 3 rows and 3 columns
#> [1] 3 3
nrow(r2) # number of rows of r2
#> [1] 3
ncol(r3) # number of columns of r3
#> [1] 4
sum(r1)
#> [1] 45
max(r2)
#> [1] 9
mean(r3)
#> [1] 10.5
colSums(r3) # column sums of r3
#> [1] 45 50 55 60
rowSums(r4) # row sums of r4
#> [1] 34 38 42 46 50
r4 > 10 # returns a matrix of logical values
#> [,1] [,2] [,3] [,4]
#> [1,] FALSE FALSE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE TRUE TRUE
#> [5,] FALSE FALSE TRUE TRUE
typeof(r4 > 10)
#> [1] "logical"
r4[r4 > 10] # indexing of matrices
#> [1] 11 12 13 14 15 16 17 18 19 20
Data frames / tibbles
As matrices contain data of only 1 type (e.g., all cells are numeric, character, or logical data), we need another data structure for more diverse and interesting datasets. The most common rectangular data structure in R is a data frame (or tibble, which is a simpler version of a data frame used in the tidyverse).
Let’s create a data frame from vectors:
# Create some vectors (of different types, but same length): -----
name <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)
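# Combine 4 vectors (of equal length) into a data frame
# (stringsAsFactors = TRUE stores text as factors, which was the default before R 4.0.0):
df <- data.frame(name, gender, age, height, stringsAsFactors = TRUE)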
df # Note: Vectors are the columns of the data frame!
#> name gender age height
#> 1 Adam male 21 165
#> 2 Bertha female 23 170
#> 3 Cecily female 22 168
#> 4 Dora female 19 172
#> 5 Eve female 21 158
#> 6 Nero male 18 185
#> 7 Zeno male 24 182
is.matrix(df)
#> [1] FALSE
is.data.frame(df)
#> [1] TRUE
dim(df) # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4
# Note that
# sum(df) # would yield an error
We can easily turn any data frame into a tibble (by using the as_tibble command of the tidyverse package tibble):
tb <- tibble::as_tibble(df)
dim(tb) # 7 cases (rows) x 4 variables (columns), as df
#> [1] 7 4
We will learn more about tibbles later. For now, a tibble is just a simpler and more convenient type of data frame. For instance, printing a tibble always shows its dimensions (as in dim(tb)) and the type of each variable (column):
tb
#> # A tibble: 7 x 4
#> name gender age height
#> <fct> <fct> <dbl> <dbl>
#> 1 Adam male 21 165
#> 2 Bertha female 23 170
#> 3 Cecily female 22 168
#> 4 Dora female 19 172
#> 5 Eve female 21 158
#> 6 Nero male 18 185
#> 7 Zeno male 24 182
Working with data frames / tibbles
Selecting cells, cases (rows), or variables (columns) of a data frame:
# Selecting cells, rows or columns: -----
df[5, 3] # cell in row 5, column 3: 21 (age of Eve)
#> [1] 21
df[6, ] # row 6: Nero etc.
#> name gender age height
#> 6 Nero male 18 185
df[ , 4] # column 4: height values
#> [1] 165 170 168 172 158 185 182
names(df) # yields the names of all variables (columns), as a vector
#> [1] "name" "gender" "age" "height"
names(df)[4] # the name of the 4th variable
#> [1] "height"
# Selecting variables (columns) by their name (with $ operator):
df$gender # returns gender vector
#> [1] male female female female female male male
#> Levels: female male
df$age # returns age vector
#> [1] 21 23 22 19 21 18 24
Applying functions to variables (columns):
# Applying functions to columns of df:
df$gender == "male"
#> [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE
sum(df$gender == "male")
#> [1] 3
df$age < 21
#> [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
df$age[df$age < 21]
#> [1] 19 18
df$name[df$age < 21]
#> [1] Dora Nero
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno
mean(df$height)
#> [1] 171.4286
df$height < 170
#> [1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE
df$gender[df$height < 170]
#> [1] male female female
#> Levels: female male
Creating new variables: To create a new variable, we simply assign something to a new variable name.
df
#> name gender age height
#> 1 Adam male 21 165
#> 2 Bertha female 23 170
#> 3 Cecily female 22 168
#> 4 Dora female 19 172
#> 5 Eve female 21 158
#> 6 Nero male 18 185
#> 7 Zeno male 24 182
dim(df) # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4
# Create a new variable:
df$may_drink <- NA # initialize a new variable (column) with unknown (NA) values
df # => may_drink was added as a new column to df, all instances are NA
#> name gender age height may_drink
#> 1 Adam male 21 165 NA
#> 2 Bertha female 23 170 NA
#> 3 Cecily female 22 168 NA
#> 4 Dora female 19 172 NA
#> 5 Eve female 21 158 NA
#> 6 Nero male 18 185 NA
#> 7 Zeno male 24 182 NA
# Assign values: A person may drink (alcohol, in the US),
df$may_drink <- (df$age >= 21) # if s/he is 21 (or older)
df
#> name gender age height may_drink
#> 1 Adam male 21 165 TRUE
#> 2 Bertha female 23 170 TRUE
#> 3 Cecily female 22 168 TRUE
#> 4 Dora female 19 172 FALSE
#> 5 Eve female 21 158 TRUE
#> 6 Nero male 18 185 FALSE
#> 7 Zeno male 24 182 TRUE
# Note:
# - we did not use an if-then statement
# - we did not specify separate TRUE vs. FALSE cases
# - we can assign and set new variables in 1 step:
df$is_female <- (df$gender == "female")
df
#> name gender age height may_drink is_female
#> 1 Adam male 21 165 TRUE FALSE
#> 2 Bertha female 23 170 TRUE TRUE
#> 3 Cecily female 22 168 TRUE TRUE
#> 4 Dora female 19 172 FALSE TRUE
#> 5 Eve female 21 158 TRUE TRUE
#> 6 Nero male 18 185 FALSE FALSE
#> 7 Zeno male 24 182 TRUE FALSE
Changing variable types
When working with vectors or rectangles of data, we often need or want to convert the type of a variable into another one. To convert a variable, we wrap a type conversion function (as.character, as.integer, as.numeric or as.factor) around it and assign the result back to the same variable:
levels(df$gender) # currently a so-called "factor" variable
#> [1] "female" "male"
typeof(df$gender) # of type "integer"
#> [1] "integer"
df$gender <- as.character(df$gender) # convert into a character variable
typeof(df$gender) # now of type "character"
#> [1] "character"
df$gender <- as.factor(df$gender) # convert from "character" into a "factor"
df$gender
#> [1] male female female female female male male
#> Levels: female male
typeof(df$gender) # again of type "integer"
#> [1] "integer"
typeof(df$age) # numeric "double"
#> [1] "double"
df$age <- as.integer(df$age) # convert from "double" to "integer"
typeof(df$age) # "integer"
#> [1] "integer"
df$age <- as.numeric(df$age) # convert from "integer" to numeric "double"
typeof(df$age) # numeric "double"
#> [1] "double"
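One conversion deserves special care: As factors are stored as integer codes (see typeof(df$gender) above), applying as.numeric directly to a factor returns these codes rather than the numbers shown as labels. A small sketch with a hypothetical factor f:

```r
f <- factor(c("10", "20", "10", "30"))   # numbers stored as factor levels

as.numeric(f)                 # 1 2 1 3: the internal level codes
as.numeric(as.character(f))   # 10 20 10 30: the values shown as labels
```

Converting the factor into a character variable first, and then into a numeric variable, recovers the intended values.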
Importing data
In most cases, we don’t generate the data that we analyze, but obtain it from somewhere (e.g., online). For instance, Woodworth et al. (2018, DOI: https://doi.org/10.5334/jopd.35) examined the long-term effectiveness of different web-based positive psychology interventions. We can load their participant data into R with the following command (from the package readr, which is part of the tidyverse):
p_info <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
dim(p_info) # 295 rows, 6 columns
#> [1] 295 6
p_info # prints a summary of the table/tibble
#> # A tibble: 295 x 6
#> id intervention sex age educ income
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4 2 35 5 3
#> 2 2 1 1 59 1 1
#> 3 3 4 1 51 4 3
#> 4 4 3 1 50 5 2
#> 5 5 2 2 58 5 2
#> 6 6 1 1 31 5 1
#> 7 7 3 1 44 5 2
#> 8 8 2 1 57 4 2
#> 9 9 1 1 36 4 3
#> 10 10 2 1 45 4 3
#> # … with 285 more rows
glimpse(p_info) # shows the first values for 6 variables (columns)
#> Observations: 295
#> Variables: 6
#> $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ intervention <dbl> 4, 1, 4, 3, 2, 1, 3, 2, 1, 2, 2, 2, 4, 4, 4, 4, 3, …
#> $ sex <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
#> $ age <dbl> 35, 59, 51, 50, 58, 31, 44, 57, 36, 45, 56, 46, 34,…
#> $ educ <dbl> 5, 1, 4, 5, 5, 5, 5, 4, 4, 4, 5, 4, 5, 1, 2, 1, 4, …
#> $ income <dbl> 3, 1, 3, 2, 2, 1, 2, 2, 3, 3, 1, 3, 3, 2, 2, 1, 2, …
When analyzing a data file from a remote source, it’s crucial to also obtain a description of the variables and values contained in the file (often called a Codebook). For the file posPsy_participants.csv this could look like:
posPsy_participants.csv contains demographic information on participants:
- id: participant ID
- intervention: 3 positive psychology interventions, plus 1 control condition:
  - 1 = “Using Signature Strengths”,
  - 2 = “Three Good Things”,
  - 3 = “Gratitude Visit”,
  - 4 = “Recording early memories” (control condition).
- sex:
  - 1 = female,
  - 2 = male.
- age: participant’s age (in years).
- educ: level of education:
  - 1 = Less than Year 12,
  - 2 = Year 12,
  - 3 = Vocational training,
  - 4 = Bachelor’s degree,
  - 5 = Postgraduate degree.
- income:
  - 1 = below average,
  - 2 = average,
  - 3 = above average.
We will examine these variables in Exercise 6 (below).
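A codebook can also be used to replace numeric codes by readable labels. The following sketch uses a short hypothetical vector sex_code (not taken from the data file) and the base R function factor() with its levels and labels arguments:

```r
# Hypothetical codes for 5 participants (NOT taken from the data file):
sex_code <- c(2, 1, 1, 1, 2)

# Map the codes 1 and 2 to the labels given in the codebook:
sex <- factor(sex_code, levels = c(1, 2), labels = c("female", "male"))
sex          # male female female female male
table(sex)   # female: 3, male: 2
```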
Exercises (WPA01)
The following exercises are your first weekly programming assignment (WPA01).
Please submit your solutions (as an .R or .Rmd script that is named LastFirstname_WPA01.R) into the corresponding folder on Ilias by Thursday, May 2nd, 2019.
Exercise 1
This first exercise assumes that your current working environment contains the following objects and assignments:
a <- 100
b <- 200
d <- "weird"
e <- TRUE
o <- FALSE
O <- 5
Evaluate and explain the following results (and correct any errors that may occur):
# Note: The following assume the object definitions from above.
a
b
b <- a + a
a + a == b
!!a
sqrt(2) # see ?sqrt
sqrt(2)^2
sqrt(2)^2 == 2 # Why FALSE?
# Hint: Compute the difference sqrt(2)^2 - 2
sqrt(2)^2 - 2 # is not 0
o / O / 0 # (using o and O from above)
0 / (o * O)
0 / (o * 0)
a + b + C # are all objects defined?
sum(a, b) - sum(a + b)
b:a # divide b by a
length(b:a)
i <- i + 1 # increment i by 1
nchar(d) - length(d)
e
e + e + !!e
e <- stuff
paste(d, e) # paste "adds" 2 character objects
Exercise 2
With only a little knowledge of R you can perform quite fancy financial arithmetic. Assume that you have won an amount of EUR 1000 and are considering depositing this amount into a new bank account that offers an annual interest rate of 0.1%.
- How much would your account be worth after waiting for 2 full years?
- What would be the time value of your money after 2 full years if the annual inflation rate is 2%?
- What would be the answers to the first two questions if you waited for 100 years?
Answer these questions by defining well-named objects and performing simple arithmetic computations on them.
Exercise 3
When introducing arithmetic functions above, we showed that they can be used with numeric scalars (i.e., numeric objects with a length of 1).
- Demonstrate that the same arithmetic functions also work with 2 numeric vectors x and y (of the same length).
- What happens when x and y have different lengths?
Exercise 4
Predict the result of the arithmetic expression x %/% y * y + x %% y. Then test your prediction by assigning some numbers to x and y and evaluating the expression. Finally, explain why the result occurs.
Exercise 5
Assume the following definitions for a survey:
- A person with an age from 1 to 17 years is classified as a minor,
- a person with an age from 18 to 64 years is classified as an adult,
- a person with an age from 65 to 99 years is classified as a senior.
Generate a vector with 100 random samples that specifies the age of 100 people (in years), but contains exactly 20 minors, 50 adults, and 30 seniors.
Now use some functions on your age vector to answer the following questions:
- What is the average (mean), minimum, and maximum age in this sample?
- How many people are younger than 25 years?
- What is the average (mean) age of people older than 50 years?
- How many people have a round age (i.e., an age that is divisible by 10)? What is their mean age?
Exercise 6
Examine the participant information in p_info by describing each of its variables:
- How many individuals are contained in the dataset?
- What percentage of them is female (i.e., has a sex value of 1)?
- How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?
- What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?
- What is the age range (min to max) of participants? What is the average (mean and median) age?
- Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as having a below-average income (i.e., an income value of 1)?
References
The participant data p_info (used in Exercise 6) is from:
- Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R. and Schüz, B. (2018). Data from, ‘Web-based positive psychology interventions: A reexamination of effectiveness’. Journal of Open Psychology Data, 6: 1. doi: https://doi.org/10.5334/jopd.35
[Last update on 2019-04-29 19:20:42 by hn.]