Introduction

All ds4psy essentials so far:

ds4psy

Nr. Topic
0. Syllabus
1. Basic R concepts and commands

Course coordinates

Preliminaries

This session provides some background knowledge and basic facts that are required for learning data science (in R or any other programming language). It assumes the following:

  1. Software: You have installed the software prerequisites mentioned in Chapter 1.4 of r4ds. Specifically,

    1. A current version of R;
    2. A current version of RStudio;
    3. The R packages of the tidyverse.
  2. Readings: You have read the following chapters of r4ds:

    1. Introductions (Chapter 1 & Chapter 2);
    2. Workflow: basics (Chapter 4) & scripts (Chapter 6).

This implies that you understand and can do the following:

  • Enter and run R commands at the prompt in the Console window of RStudio, and check their results;
  • Use R as a calculator for simple arithmetic;
  • Assign numeric values and characters to named objects;
  • Call simple R functions on objects;
  • Enter and run R scripts in the Editor window of RStudio.

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## R basics | ds4psy
## 2019 04 15
## ----------------------------

## Preparations: ----------

library(tidyverse)

## Topic: ----------

# ...

## End of file (eof). ----------  

ds4psy: (1) R basics

Basic R Concepts and Commands

Orientation

This chapter contains a brief introduction to the basic concepts and commands of R (with examples and exercises).

Strictly speaking, knowing some R is not a necessary precondition for reading and learning data science with our textbook (r4ds). However, it is certainly helpful – partly to appreciate how various tidyverse commands let you solve many problems in simpler and more transparent ways. Thus, please work through the examples, try to understand them, and then solve the corresponding exercises below.

After working through this chapter, you should be able to:

  1. categorize R objects into data vs. functions;
  2. distinguish between different shapes (e.g., scalars, vectors, rectangles) and types (e.g., numbers, text, logical values) of data;
  3. create and change R objects by assignment;
  4. apply arithmetic functions to objects;
  5. select elements from vectors or rectangular data structures (by indexing).

Data vs. functions

“To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.”
John Chambers

Using R and most other programming languages consists in

  1. defining or loading data (material, values), and

  2. evaluating functions (actions, verbs).

Confusingly, both data and functions in R are objects (stuff) and evaluating functions typically returns or creates new data objects (more stuff). To distinguish data from functions, think of data as matter (i.e., material or values that are being measured or manipulated) and functions as procedures (i.e., actions, operations, or verbs, that measure or manipulate data).

In the following, we will introduce some ways to describe data (by shape and type), learn to create new data objects (by assigning them with <-), and apply some pre-defined functions to these data objects.

Data

In R, different data objects are characterized by their shape and by their type.

  1. The most common shapes of data are:

    • scalars: atomic objects (i.e., a data point, with a length of 1);
    • vectors: a chain/sequence of objects of the same type (i.e., extending in 1 dimension: length);
    • data frames/matrices/tibbles: rectangular data (i.e., tables with 2 dimensions: rows vs. columns).
  2. The most common types of data are:

    • numeric data (of type double or integer);
    • text data (of type character);
    • logical data (of type logical).
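
As a minimal sketch of these distinctions (the object names scalar, vec, and rect are hypothetical and not used elsewhere in this chapter), the functions length(), dim(), and typeof() report the shape and type of an object:

```r
scalar <- 42                                     # 1 value
vec    <- c(1.5, 2.5, 3.5)                       # 3 values of the same type
rect   <- data.frame(id = 1:3,                   # a table with 3 rows
                     ok = c(TRUE, FALSE, TRUE))  # and 2 columns

length(scalar)  # 1
length(vec)     # 3
dim(rect)       # 3 2 (rows, columns)

typeof(vec)     # "double"
typeof(2L)      # "integer" (the suffix L marks an integer)
typeof("text")  # "character"
typeof(TRUE)    # "logical"
```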

Functions

In R, functions are ‘action objects’ that are applied to ‘data objects’ (a function’s so-called arguments, e.g., a1 and a2) by specifying the function name and providing the arguments in round parentheses (i.e., function(a1, a2)). Think of functions as a process that takes some input (its arguments) and transforms it into output (the result returned by the function). An example of a function with 2 arguments is:

sum(1, 2)
#> [1] 3

Here, the function sum is applied to 2 numeric arguments 1 and 2 (or a data structure that consists of 2 numeric scalars). Evaluating the expression sum(1, 2) returns a new data object 3, which is the sum of the 2 original arguments.
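
Arguments can be matched by position or by name. The following sketch uses the base R function round() (not used elsewhere in this chapter) with its arguments x and digits; the object names by_position and by_name are hypothetical:

```r
by_position <- round(3.14159, 2)               # arguments matched by position
by_name     <- round(digits = 2, x = 3.14159)  # named arguments, in any order

by_position                      # 3.14
identical(by_position, by_name)  # TRUE: both calls do the same thing
```

Naming arguments makes longer function calls easier to read and protects against mixing up their order.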

Practice

Evaluate the following expressions and describe what they are doing (in terms of applying functions to arguments to obtain new data):

min(1, 2, -3, 4)
#> [1] -3
paste0("ab", "cd")
#> [1] "abcd"
substr("television", start = 5, stop = 10)
#> [1] "vision"
substr("television", 20,  30)  # yields ""
#> [1] ""

Defining objects

To define a new object o as x, use the assignment function o <- x and note that object names are case-sensitive (i.e., a and A are different object names).

For example, we can assign the output of a function to some object:

s <- sum(1, 2)
s
#> [1] 3

Here, we created a new object (named s) and assigned the sum of 1 and 2 to it. As s is a numeric object (with the value 3), we can now apply any numeric function to it:

s * 2
#> [1] 6
s^2
#> [1] 9

Here is another example of creating 2 simple objects and applying simple arithmetic functions to them:

o <- 10  # assign/set o to 10
O <-  5  # assign/set O to  5

# Computing with objects: 
o * O
#> [1] 50
o * 0
#> [1] 0
o / O * 0
#> [1] 0

In R, objects can be described by their length and type.

  • Objects with a length of 1 are called scalars; longer objects (i.e., length(object) > 1) are vectors or lists.

  • The type of an object corresponds to the type of the value that has been assigned to it (e.g., numeric, character, or logical).

Hence, we can change an object’s type by re-assigning it:

o <- 100
o
#> [1] 100

# Check type: 
is.numeric(o)
#> [1] TRUE
is.character(o)
#> [1] FALSE
is.logical(o)
#> [1] FALSE

# Re-assigning o:
o <- paste0("ene", " ", "mene", " ", "mu")
o
#> [1] "ene mene mu"

# Check type: 
is.numeric(o)
#> [1] FALSE
is.character(o)
#> [1] TRUE
is.logical(o)
#> [1] FALSE

Practice

Assign the result of 1 > 2 to o and check its type.

# Re-assign o:
o <- 1 > 2
o
#> [1] FALSE

# Check type: 
is.numeric(o)
#> [1] FALSE
is.character(o)
#> [1] FALSE
is.logical(o)
#> [1] TRUE

Naming objects

Beware of the following characteristics and constraints:

  • R is case sensitive (so tea_pot, Tea_pot and tea_Pot are different names that denote 3 different objects)
  • No spaces inside variable names (even though a name like tea pot is possible when it is enclosed in backticks)
  • No special keywords that are reserved for R commands (TRUE, FALSE, function, for, in, next, break, if, else, repeat, while, NULL, Inf, NaN, NA and some variants like NA_character_, NA_integer_, …)

Recommendations:

  • Aim for short and consistent names;
  • Avoid dots and special characters in names;
  • Use snake_case (with underscores) or camelCase (with capitalised first letters) for combined names.
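
A short sketch of these naming rules (all object names are hypothetical examples; make.names() is a base R helper that suggests a syntactically valid name):

```r
tea_pot <- 1    # snake_case name
Tea_pot <- 2    # a different object (R is case-sensitive)
teaPot  <- 3    # camelCase name, yet another object

`tea pot` <- 4  # a name with a space works only inside backticks
`tea pot` + 1   # 5

make.names("tea pot")  # "tea.pot"
```

Names that need backticks are possible, but cumbersome to type, which is why the recommendations above avoid them.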

Scalars

Defining scalars

We have learned that scalars are objects of length 1 (aka. atomic objects) and how to define objects by assigning names to them. So let’s define some scalar objects and then use some generic functions to check their length and type:

a <- 1     # assign/set a to 1
a          # print a
#> [1] 1
a + 2      # evaluate a + 2
#> [1] 3
sum(a, 2)  # evaluate the function sum() with arguments a and 2
#> [1] 3

b <- 2
b
#> [1] 2
b * b
#> [1] 4
prod(b, b)
#> [1] 4
b ^ 2
#> [1] 4
b^3
#> [1] 8

c <- a + b  # assign c to the sum of a + b 
c
#> [1] 3
length(c)       # 1 could indicate a scalar OR the number of digits...
#> [1] 1
length(1000)    # Check: also 1 (i.e., NOT number of digits)
#> [1] 1
typeof(c)       # numbers are of type "integer" or "double" 
#> [1] "double"
typeof(3.14159) # decimal numbers are of type "double"
#> [1] "double"

d <- "word" # note the quotes ("")
d
#> [1] "word"
length(d)  # also a scalar
#> [1] 1
typeof(d)  # but of type "character"
#> [1] "character"

b
#> [1] 2
a
#> [1] 1
b > a          # result is neither number nor character:
#> [1] TRUE
typeof(a > b)  # yet another type: "logical"
#> [1] "logical"
e <- b > a
e
#> [1] TRUE
length(e)  # also a scalar
#> [1] 1
typeof(e)  # of type "logical"
#> [1] "logical"

Thus, our exploration has shown that objects a, b and c are numeric objects (which can be of type integer or of type double), whereas d is a text object (of type character), and e is the result of a test that is either TRUE or FALSE (of type logical).

Changing objects

To change an existing object, we need to re-assign it. Thus, changing an object works just like creating it:

# Check values (defined above):
a
#> [1] 1
b
#> [1] 2
a/b
#> [1] 0.5

a <- 100  # changes a
a         # a has changed
#> [1] 100
a/b       # a/b changes when a has been changed
#> [1] 50

b <- 200  # changes b
b         # b has changed
#> [1] 200
a/b       # a/b changes when b has been changed
#> [1] 0.5

d
#> [1] "word"
d <- "weird"  # changes d
d 
#> [1] "weird"

This implies that the order of evaluations matters: The same object (e.g., a or a/b) has different contents at different locations and at different times. (Note that the line numbers to the left of your editor window mark locations and that R scripts are typically evaluated in a top-down fashion.)
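
The following sketch illustrates a consequence of this (with the hypothetical objects price, quantity, and total): An object that was computed from other objects is not updated automatically when those objects change later.

```r
price    <- 10
quantity <- 2
total    <- price * quantity  # total is computed once, here
total                         # 20

price <- 15                   # changing price later ...
total                         # ... leaves total unchanged: 20

total <- price * quantity     # re-evaluate the assignment to update total
total                         # 30
```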

Applying functions to scalars

We have applied some simple functions to data arguments above, but not all functions can be applied to all data. Importantly, most functions require specific types of arguments to work (i.e., the actual argument types must match the required argument types of the function).

When viewing this requirement from the perspective of existing objects, the type of an object determines which functions can be applied to it:

# Start with numeric objects:
a
#> [1] 100
typeof(a)  # a generic function (working with all object types)
#> [1] "double"
length(a)  # a scalar
#> [1] 1
a + b      
#> [1] 300
sum(a, b)  # an arithmetic function (requiring numeric object types)
#> [1] 300

# Start with character objects:
d
#> [1] "weird"
typeof(d)
#> [1] "character"
length(d) # a scalar 
#> [1] 1
nchar(d)  # the "length" of a character object
#> [1] 5

# Start with logical objects:
e
#> [1] TRUE
typeof(e)
#> [1] "logical"
!e          # negation (reverses logical value)
#> [1] FALSE
!!e
#> [1] TRUE
isTRUE(e)   # tests a logical expression
#> [1] TRUE
isTRUE(!e)
#> [1] FALSE
e == !!e    # tests equality
#> [1] TRUE

In case of a mismatch between function and object types, an error may occur:

# Evaluate the following (and explain the error):
a + d
sum(a, d)
d^2
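
One way to inspect such an error without interrupting a script is the base R function tryCatch(). The sketch below uses the hypothetical objects num, txt, and msg and shows that converting the text to a number first makes the types match:

```r
num <- 100
txt <- "5"

# tryCatch() evaluates an expression and, if an error occurs, calls the handler:
msg <- tryCatch(num + txt, error = function(err) conditionMessage(err))
is.character(msg)      # TRUE: the addition failed and msg holds the error message

# Converting the character object into a number resolves the mismatch:
num + as.numeric(txt)  # 105
```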

Arithmetic functions

For numeric objects, we can compute new numeric values by applying arithmetic functions:

## (A) Arithmetic operators: ---- 
+ 2    # keeping sign
#> [1] 2
- 3    # reversing sign
#> [1] -3
1 + 2  # addition
#> [1] 3
3 - 1  # subtraction
#> [1] 2
2 * 3  # multiplication
#> [1] 6
5 / 2  # division
#> [1] 2.5

5 %/% 2  # integer division
#> [1] 2
5 %% 2   # remainder of integer division (x mod y)
#> [1] 1

## (B) Operator precedence: ---- 
1 / 2 * 3   # left to right
#> [1] 1.5
1 + 2 * 3   # precedence: */ before +-
#> [1] 7
(1 + 2) * 3 # changing order by parentheses
#> [1] 9
# "BEDMAS" order:  
# - brackets (), 
# - exponents ^, 
# - division / and multiplication *, 
# - addition + and subtraction -
# See 
# ?Syntax
# for complete rules.

2 * 2 * 2
#> [1] 8
2^3
#> [1] 8

## (C) Arithmetic with scalar objects: ---- 
x <- 2
y <- 3

+ x     # keeping sign 
#> [1] 2
- y     # reversing sign
#> [1] -3
x + y   # addition
#> [1] 5
x - y   # subtraction
#> [1] -1
x * y   # multiplication
#> [1] 6
x / y   # division
#> [1] 0.6666667
x ^ y   # exponentiation
#> [1] 8
x %/% y # integer division
#> [1] 0
x %% y  # remainder of integer division (x mod y)
#> [1] 2

The same arithmetic operators also work with numeric vectors (see Exercise 3 below). (See ?Arithmetic for help on arithmetic operators and ?Syntax for a full list of precedence groups.)
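
Two precedence rules from ?Syntax tend to surprise newcomers, so here is a brief sketch of them: Exponentiation binds more tightly than the unary minus, and a chain of exponents is evaluated from right to left.

```r
-2^2     # -4: ^ is evaluated before the unary minus
(-2)^2   #  4: parentheses change the order
2^3^2    # 512: evaluated from right to left, i.e., 2^(3^2)
(2^3)^2  # 64
-1:3     # -1 0 1 2 3: the unary minus is evaluated before the colon operator
-(1:3)   # -1 -2 -3
```

When in doubt, adding parentheses makes the intended order explicit.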

Logical values and operators

By comparing numeric values and combining the results with logical operators, we obtain logical values (i.e., scalars of type logical that are either TRUE or FALSE):

## Logical comparisons:
2 > 1   # larger than
#> [1] TRUE
2 >= 2  # larger than or equal to
#> [1] TRUE
2 < 1   # smaller than
#> [1] FALSE
2 <= 1  # smaller than or equal to
#> [1] FALSE

1 == 1  # == ... equality
#> [1] TRUE
1 != 1  # != ... inequality 
#> [1] FALSE

## Logical operators:
(2 > 1) & (1 > 2)   # & ... logical AND
#> [1] FALSE
(2 < 1) | (1 < 2)   # | ... logical OR
#> [1] TRUE
(1 < 1) | !(1 < 1)  # ! ... logical negation  
#> [1] TRUE
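
Comparisons and logical operators are typically combined to test whether a value lies within a range. A minimal sketch, assuming a hypothetical object yrs that holds a person's age:

```r
yrs <- 30  # hypothetical value

yrs >= 18 & yrs <= 64     # TRUE: both conditions hold
yrs < 18 | yrs > 64       # FALSE: neither condition holds
!(yrs >= 18 & yrs <= 64)  # FALSE: negation of the first test
xor(yrs > 18, yrs > 64)   # TRUE: exactly one of the two tests is TRUE
```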

Vectors

Vectors are the most common and most important data structure in R. A vector is a sequence of objects of the same type.

Creating vectors

To create a new vector, we can combine several objects of the same type with the c() function, separating vector elements by commas:

# Creating vectors: 
c(1, 2, 3)
#> [1] 1 2 3
c(a, b)
#> [1] 100 200

v <- c(a, b, c)
v
#> [1] 100 200   3

v <- c(c, c, c)  # vectors can have repeated elements
v
#> [1] 3 3 3

v <- c(a, b, v)  # Note that vectors can contain vectors, ...
v
#> [1] 100 200   3   3   3

v <- c(v, v)     # but the result is only 1 vector, not 2.
v
#>  [1] 100 200   3   3   3 100 200   3   3   3

# Character vectors:
w <- c("one", "two", "three")
w
#> [1] "one"   "two"   "three"

w <- c(w, "four", "5", "many")
w
#> [1] "one"   "two"   "three" "four"  "5"     "many"

# Applying functions to vectors:
length(v)
#> [1] 10
typeof(v)
#> [1] "double"
sum(v)
#> [1] 618

length(w)
#> [1] 6
typeof(w)
#> [1] "character"
# sum(w)  # would yield an error

# Combining different types:
x <- c(1, "two", 3)
x
#> [1] "1"   "two" "3"
typeof(x)  # converted 1 to "1" (as all vector elements must be of the same type)
#> [1] "character"
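
The conversion above follows a fixed order: logical values become numbers, and numbers become text, whenever the more general type is present. The following sketch shows a few cases and what happens when we convert text back into numbers:

```r
c(TRUE, 2)      # 1 2: logical becomes numeric
c(1, "two")     # "1" "two": numeric becomes character
c(TRUE, "two")  # "TRUE" "two": logical becomes character

typeof(c(TRUE, 2L))  # "integer"
typeof(c(2L, 3.5))   # "double"

# Explicit conversion yields NA where no number can be read
# (suppressWarnings() hides the warning that R would otherwise show):
suppressWarnings(as.numeric(c("1", "two", "3")))  # 1 NA 3
```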

Scalar objects are vectors

Actually, R has no dedicated type of scalar objects. Instead, individual numbers (of type integer or double) or text strings (of type character) are actually vectors of length 1:

a
#> [1] 100
is.vector(a)
#> [1] TRUE
length(a)
#> [1] 1

d
#> [1] "weird"
is.vector(d)
#> [1] TRUE
length(d)
#> [1] 1

e
#> [1] TRUE
is.vector(e)
#> [1] TRUE
length(e)
#> [1] 1

Special vector creation functions

For creating vectors with more than just a few elements (i.e., with larger length values), the c function becomes impractical. Some useful functions and shortcuts are:

# Sequences (with seq):
s1 <- seq(0, 100, 1)  # is short for: 
s2 <- seq(from = 0, to = 100, by = 1)
s2
#>  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
#> [24] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
#> [47] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#> [70] 69 70 71 72 73 74
#>  [ reached getOption("max.print") -- omitted 26 entries ]
all.equal(s1, s2)
#> [1] TRUE

# Shorter version (with by = 1):
s3 <- 0:100
all.equal(s1, s3)
#> [1] TRUE

# But seq allows different step sizes:
s4 <- seq(0, 100, by = 25)
s4
#> [1]   0  25  50  75 100

# Replicating (with rep):
s5 <- rep(c(0, 1), 3)  # is short for:
s5 <- rep(x = c(0, 1), times = 3)
s5
#> [1] 0 1 0 1 0 1

# Sampling vector elements (with sample):
sample(1:3, 10, replace = TRUE)
#>  [1] 2 3 2 2 2 1 1 2 2 1
# Note:
# sample(1:3, 10, replace = FALSE)  # would yield an error

coin <- c("H", "T")    # 2 events: Heads or Tails
sample(coin, 5, TRUE)  # is short for: 
#> [1] "H" "H" "H" "T" "T"
sample(x = coin, size = 5, replace = TRUE)     # flip coin 5 times
#> [1] "H" "H" "T" "T" "H"
sample(x = coin, size = 1000, replace = TRUE)  # flip coin 1000 times
#>  [1] "H" "H" "H" "T" "H" "H" "H" "H" "T" "T" "H" "H" "H" "T" "T" "H" "H"
#> [18] "H" "H" "H" "H" "T" "H" "H" "T" "H" "T" "H" "T" "T" "H" "H" "H" "H"
#> [35] "H" "T" "H" "H" "T" "T" "H" "H" "T" "T" "H" "H" "T" "H" "T" "H" "T"
#> [52] "H" "H" "H" "T" "H" "H" "H" "H" "H" "H" "T" "H" "H" "T" "H" "T" "H"
#> [69] "T" "T" "T" "H" "H" "H" "H"
#>  [ reached getOption("max.print") -- omitted 925 entries ]
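
The functions above have further arguments that are often useful. The sketch below shows seq() with length.out, rep() with each, and set.seed() for making random samples reproducible (the seed value 42 and the names flips_1 and flips_2 are arbitrary choices for illustration):

```r
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00 (5 equally spaced values)
rep(c(0, 1), each = 3)     # 0 0 0 1 1 1 (each element repeated 3 times)

coin <- c("H", "T")

set.seed(42)  # fix the state of the random number generator
flips_1 <- sample(coin, size = 10, replace = TRUE)
set.seed(42)  # the same seed ...
flips_2 <- sample(coin, size = 10, replace = TRUE)
identical(flips_1, flips_2)  # ... yields the same "random" result: TRUE

table(flips_1)  # counts of "H" and "T" among the 10 flips
```

Without set.seed(), every call to sample() returns a different result, which is why the outputs shown above will differ from yours.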

Indexing vectors

We often store a lot of values in vectors (e.g., the age of 1000 participants), but only need some of them for answering specific questions (e.g., what is the average age of all male participants?). To select only a subset of elements from a vector v we can specify the condition or criterion for our selection in (square) brackets v[...].

# Example 1: Indexing numeric vectors
x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

crit <- x > 5  # Condition: Which values of x are larger than 5?
crit
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

x[crit]   # using crit to select values of x (for which crit is TRUE)
#> [1]  6  7  8  9 10
x[x > 5]  # all in 1 step 
#> [1]  6  7  8  9 10


# Example 2: Indexing character vectors
spices <- c("salt", "pepper", "cinnamon", "lemongrass", "mint", "mustard", "wasabi")

spices[nchar(spices) == 4]           # spices with exactly 4 letters
#> [1] "salt" "mint"
spices[substr(spices, 2, 2) == "i"]  # spices with an "i" at 2nd position
#> [1] "cinnamon" "mint"
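
Besides logical conditions, the brackets also accept numeric positions. A minimal sketch, using a hypothetical vector named scores:

```r
scores <- 11:20

scores[3]               # 13: the 3rd element
scores[c(1, 10)]        # 11 20: the 1st and 10th elements
scores[2:4]             # 12 13 14: elements 2 to 4
scores[-1]              # 12 13 14 15 16 17 18 19 20: all but the 1st element
scores[length(scores)]  # 20: the last element
which(scores > 15)      # 6 7 8 9 10: the positions (not values) where the test is TRUE
```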

Rectangular data

Vectors are 1-dimensional objects (i.e., have a length, but no width). By combining several vectors, we get a rectangular data structure.

Matrices

When a rectangle of data contains data of the same type in all cells, we get a matrix of data:

x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ---- 
r1 <- rbind(x, y, z)  # combine as rows
r1
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9

r2 <- cbind(x, y, z)  # combine as columns
r2
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9

# Putting a vector into a rectangular matrix:
r3 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)
r3
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20

r4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)
r4
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

# Selecting cells, rows, or columns of matrices: ---- 
r1[2, 3]  # in r1: select row 2, column 3
#> y 
#> 6
r2[3, 1]  # in r2: select row 3, column 1
#> x 
#> 3

r1[2,  ]  # in r1: select row 2, all columns
#> [1] 4 5 6
r2[ , 1]  # in r2: select column 1, all rows
#> [1] 1 2 3

r3
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20
r3[2, 3:4] # in r3: select row 2, columns 3 to 4
#> [1] 7 8
r3[3:5, 2] # in r3: select rows 3 to 5, column 2
#> [1] 10 14 18

r4[]  # in r4: select all rows and all columns (i.e., all of r4)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20


# Applying functions to matrices: ---- 
is.matrix(r1)
#> [1] TRUE
typeof(r2)
#> [1] "integer"

dim(r1)   # dimensions of r1: 3 rows and 3 columns
#> [1] 3 3
nrow(r2)  # number of rows of r2
#> [1] 3
ncol(r3)  # number of columns of r3
#> [1] 4

sum(r1)
#> [1] 45
max(r2)
#> [1] 9
mean(r3)
#> [1] 10.5
colSums(r3)  # column sums of r3
#> [1] 45 50 55 60
rowSums(r4)  # row sums of r4
#> [1] 34 38 42 46 50

r4 > 10  # returns a matrix of logical values
#>       [,1]  [,2] [,3] [,4]
#> [1,] FALSE FALSE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE TRUE TRUE
#> [5,] FALSE FALSE TRUE TRUE
typeof(r4 > 10)
#> [1] "logical"
r4[r4 > 10]  # indexing of matrices
#>  [1] 11 12 13 14 15 16 17 18 19 20
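
Because all cells of a matrix share 1 type, combining vectors of different types converts them to a common type. A short sketch (with the hypothetical objects m_num and m_mix):

```r
m_num <- cbind(1:3, 4:6)               # 2 numeric columns
m_mix <- cbind(1:3, c("a", "b", "c"))  # 1 numeric and 1 character column

typeof(m_num)  # "integer"
typeof(m_mix)  # "character": the numbers were converted into text
m_mix[1, 1]    # "1"
```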

Data frames / tibbles

As matrices contain data of only 1 type (e.g., all cells are numeric, character, or logical data), we need another data structure for more diverse and interesting datasets. The most common rectangular data structure in R is a data frame (or tibble, which is a simpler version of a data frame used in the tidyverse).

Let’s create a data frame from vectors:

# Create some vectors (of different types, but same length): -----  
name <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)

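# Combine 4 vectors (of equal length) into a data frame: 
# (Since R 4.0.0, stringsAsFactors = TRUE is needed to store text as factors,
#  which the outputs below assume.)
df <- data.frame(name, gender, age, height, stringsAsFactors = TRUE)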
df    # Note: Vectors are the columns of the data frame!
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182

is.matrix(df)
#> [1] FALSE
is.data.frame(df)
#> [1] TRUE
dim(df)  # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4

# Note that 
# sum(df)  # would yield an error

We can easily turn any data frame into a tibble (by using the as_tibble command of the tidyverse package tibble):

tb <- tibble::as_tibble(df)
dim(tb)  # 7 cases (rows) x 4 variables (columns), as df
#> [1] 7 4

We will learn more about tibbles later. For now, a tibble is just a simpler and more convenient type of data frame. For instance, printing a tibble always shows its dimensions (as in dim(tb)) and the type of each variable (column):

tb
#> # A tibble: 7 x 4
#>   name   gender   age height
#>   <fct>  <fct>  <dbl>  <dbl>
#> 1 Adam   male      21    165
#> 2 Bertha female    23    170
#> 3 Cecily female    22    168
#> 4 Dora   female    19    172
#> 5 Eve    female    21    158
#> 6 Nero   male      18    185
#> 7 Zeno   male      24    182

Working with data frames / tibbles

Selecting cells, cases (rows), or variables (columns) of a data frame:

# Selecting cells, rows or columns: ----- 
df[5, 3]  # cell in row 5, column 3: 21 (age of Eve)
#> [1] 21
df[6, ]   # row 6: Nero etc.
#>   name gender age height
#> 6 Nero   male  18    185
df[ , 4]  # column 4: height values
#> [1] 165 170 168 172 158 185 182

names(df)    # yields the names of all variables (columns), as a vector
#> [1] "name"   "gender" "age"    "height"
names(df)[4] # the name of the 4th variable
#> [1] "height"

# Selecting variables (columns) by their name (with $ operator):
df$gender  # returns gender vector
#> [1] male   female female female female male   male  
#> Levels: female male
df$age     # returns age vector
#> [1] 21 23 22 19 21 18 24

Applying functions to variables (columns):

# Applying functions to columns of df:
df$gender == "male"
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
sum(df$gender == "male")
#> [1] 3

df$age < 21
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
df$age[df$age < 21]
#> [1] 19 18
df$name[df$age < 21]
#> [1] Dora Nero
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno

mean(df$height)
#> [1] 171.4286
df$height < 170
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
df$gender[df$height < 170]
#> [1] male   female female
#> Levels: female male
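
Conditions can also select entire rows and can be combined with logical operators. The following sketch rebuilds the example data under the hypothetical name people, so that it runs on its own:

```r
people <- data.frame(
  name   = c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno"),
  gender = c("male", "female", "female", "female", "female", "male", "male"),
  age    = c(21, 23, 22, 19, 21, 18, 24),
  height = c(165, 170, 168, 172, 158, 185, 182)
)

# Rows (cases) that satisfy a condition, with all columns:
people[people$age < 21, ]  # the rows of Dora and Nero

# Combining 2 conditions with & (logical AND):
people$name[people$gender == "female" & people$height < 170]  # Cecily and Eve

# Summarising a variable for a subset of cases:
mean(people$height[people$gender == "female"])  # 167
```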

Creating new variables: To create a new variable, we simply assign something to a new variable name.

df
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182
dim(df)  # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4

# Create a new variable:
df$may_drink <- NA  # initialize a new variable (column) with unknown (NA) values
df # => may_drink was added as a new column to df, all instances are NA
#>     name gender age height may_drink
#> 1   Adam   male  21    165        NA
#> 2 Bertha female  23    170        NA
#> 3 Cecily female  22    168        NA
#> 4   Dora female  19    172        NA
#> 5    Eve female  21    158        NA
#> 6   Nero   male  18    185        NA
#> 7   Zeno   male  24    182        NA

# Assign values:  A person may drink (alcohol, in the US),  
df$may_drink <- (df$age >= 21)  # if s/he is 21 (or older)
df
#>     name gender age height may_drink
#> 1   Adam   male  21    165      TRUE
#> 2 Bertha female  23    170      TRUE
#> 3 Cecily female  22    168      TRUE
#> 4   Dora female  19    172     FALSE
#> 5    Eve female  21    158      TRUE
#> 6   Nero   male  18    185     FALSE
#> 7   Zeno   male  24    182      TRUE

# Note:
# - we did not use an if-then statement
# - we did not specify separate TRUE vs. FALSE cases
# - we can assign and set new variables in 1 step:

df$is_female <- (df$gender == "female")
df
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 4   Dora female  19    172     FALSE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
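
New variables are not limited to logical values. As a sketch, the base R function ifelse() assigns 1 of 2 labels depending on a test, and arithmetic on a column yields a new numeric column (the data frame d2 and its new variables are hypothetical, and we assume that height is measured in cm):

```r
d2 <- data.frame(age    = c(21, 23, 22, 19, 21, 18, 24),
                 height = c(165, 170, 168, 172, 158, 185, 182))

d2$height_m  <- d2$height / 100                          # height in m (assuming cm)
d2$age_group <- ifelse(d2$age >= 21, "21+", "under 21")  # character labels
d2
```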

Changing variable types

When working with vectors or rectangles of data, we often need or want to convert the type of a variable into another one. To convert a variable, we simply assign it to itself (so that all its values will be preserved) and wrap a type conversion function (as.character, as.integer, as.numeric, or as.factor) around it:

levels(df$gender)  # currently a so-called "factor" variable
#> [1] "female" "male"
typeof(df$gender)  # of type "integer"
#> [1] "integer"
df$gender <- as.character(df$gender)  # convert into a character variable
typeof(df$gender)  # now of type "character"
#> [1] "character"

df$gender <- as.factor(df$gender)  # convert from "character" into a "factor"
df$gender
#> [1] male   female female female female male   male  
#> Levels: female male
typeof(df$gender)  # again of type "integer"
#> [1] "integer"

typeof(df$age)  # numeric "double"
#> [1] "double"
df$age <- as.integer(df$age)  # convert from "double" to "integer"
typeof(df$age)  # "integer"
#> [1] "integer"
df$age <- as.numeric(df$age)  # convert from "integer" to numeric "double"
typeof(df$age)  # numeric "double"
#> [1] "double"
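
A common pitfall concerns factors whose levels look like numbers: As factors are stored as integers, converting them directly returns the internal level codes rather than the values. A minimal sketch (with a hypothetical factor f):

```r
f <- factor(c("10", "20", "10"))  # a factor whose levels look like numbers

as.numeric(f)                # 1 2 1: the internal level codes, not the values
as.numeric(as.character(f))  # 10 20 10: convert into character first
```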

Importing data

In most cases, we don’t generate the data that we analyze, but obtain it from somewhere (e.g., online). For instance, Woodworth et al. (2018, DOI: https://doi.org/10.5334/jopd.35) examined the long-term effectiveness of different web-based positive psychology interventions (see the article for details). We can load their participant data into R with the following command (from the package readr, which is part of the tidyverse):

p_info <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")

dim(p_info)      # 295 rows, 6 columns
#> [1] 295   6
p_info           # prints a summary of the table/tibble
#> # A tibble: 295 x 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1            4     2    35     5      3
#>  2     2            1     1    59     1      1
#>  3     3            4     1    51     4      3
#>  4     4            3     1    50     5      2
#>  5     5            2     2    58     5      2
#>  6     6            1     1    31     5      1
#>  7     7            3     1    44     5      2
#>  8     8            2     1    57     4      2
#>  9     9            1     1    36     4      3
#> 10    10            2     1    45     4      3
#> # … with 285 more rows
glimpse(p_info)  # shows the first values for 6 variables (columns)
#> Observations: 295
#> Variables: 6
#> $ id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
#> $ intervention <dbl> 4, 1, 4, 3, 2, 1, 3, 2, 1, 2, 2, 2, 4, 4, 4, 4, 3, …
#> $ sex          <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
#> $ age          <dbl> 35, 59, 51, 50, 58, 31, 44, 57, 36, 45, 56, 46, 34,…
#> $ educ         <dbl> 5, 1, 4, 5, 5, 5, 5, 4, 4, 4, 5, 4, 5, 1, 2, 1, 4, …
#> $ income       <dbl> 3, 1, 3, 2, 2, 1, 2, 2, 3, 3, 1, 3, 3, 2, 2, 1, 2, …

When analyzing a data file from a remote source, it’s crucial to also obtain a description of the variables and values contained in the file (often called a Codebook). For the file posPsy_participants.csv this could look like:

posPsy_participants.csv contains demographic information on participants:

  • id: participant ID

  • intervention: 3 positive psychology interventions, plus 1 control condition:

    • 1 = “Using Signature Strengths”,
    • 2 = “Three Good Things”,
    • 3 = “Gratitude Visit”,
    • 4 = “Recording early memories” (control condition).
  • sex:

    • 1 = female,
    • 2 = male.
  • age: participant’s age (in years).

  • educ: level of education:

    • 1 = Less than Year 12,
    • 2 = Year 12,
    • 3 = Vocational training,
    • 4 = Bachelor’s degree,
    • 5 = Postgraduate degree.
  • income:

    • 1 = below average,
    • 2 = average,
    • 3 = above average.
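
A codebook like this allows us to turn numeric codes into readable labels. The following sketch uses a short made-up vector of codes (not the actual data) and the base R function factor() with its levels and labels arguments:

```r
sex_code <- c(1, 2, 2, 1, 1)  # made-up example codes (not the actual data)

sex <- factor(sex_code, levels = c(1, 2), labels = c("female", "male"))
sex         # female male male female female
table(sex)  # counts per label: 3 female, 2 male
```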

We will examine these variables in Exercise 6 (below).

Exercises (WPA01)

The following exercises are your first weekly programming assignment (WPA01).
Please submit your solutions (as an .R or .Rmd script that is named LastFirstname_WPA01.R) into the corresponding folder on Ilias by Thursday, May 2nd, 2019.

Exercise 1

This first exercise assumes that your current working environment contains the following objects and assignments:

a <- 100
b <- 200
d <- "weird"
e <- TRUE
o <- FALSE
O <- 5

Evaluate and explain the following results (and correct any errors that may occur):

# Note: The following assume the object definitions from above.
a
b
b <- a + a
a + a == b
!!a

sqrt(2)  # see ?sqrt
sqrt(2)^2
sqrt(2)^2 == 2  # Why FALSE? 
# Hint: Compute the difference sqrt(2)^2 - 2
sqrt(2)^2 - 2   # is not 0

o / O / 0   # (using o and O from above)
0 / (o * O)
0 / (o * 0)

a + b + C   # are all objects defined?

sum(a, b) - sum(a + b)

b:a  # divide b by a
length(b:a)

i <- i + 1  # increment i by 1

nchar(d) - length(d)

e
e + e + !!e

e <- stuff
paste(d, e)  # paste "adds" 2 character objects

Exercise 2

With only a little knowledge of R you can perform quite fancy financial arithmetic. Assume that you have won an amount of EUR 1000 and are considering depositing this amount into a new bank account that offers an annual interest rate of 0.1%.

  1. How much would your account be worth after waiting for 2 full years?
  2. What would be the time value of your money after 2 full years if the annual inflation rate is 2%?
  3. What would be the results to 1. and 2. if you waited for 100 years?

Answer these questions by defining well-named objects and performing simple arithmetic computations on them.

Exercise 3

When introducing arithmetic functions above, we showed that they can be used with numeric scalars (i.e., numeric objects with a length of 1).

  1. Demonstrate that the same arithmetic functions also work with 2 numeric vectors x and y (of the same length).
  2. What happens when x and y have different lengths?

Exercise 4

Predict the result of the arithmetic expression x %/% y * y + x %% y. Then test your prediction by assigning some number to x and y and evaluating the expression. Finally, explain why the result occurs.

Exercise 5

Assume the following definitions for a survey:

  • A person with an age from 1 to 17 years is classified as a minor,
  • a person with an age from 18 to 64 years is classified as an adult,
  • a person with an age from 65 to 99 years is classified as a senior.

Generate a vector with 100 random samples that specifies the age of 100 people (in years), but contains exactly 20 minors, 50 adults, and 30 seniors.

Now use some functions on your age vector to answer the following questions:

  • What is the average (mean), minimum, and maximum age in this sample?
  • How many people are younger than 25 years?
  • What is the average (mean) age of people older than 50 years?
  • How many people have a round age (i.e., an age that is divisible by 10)? What is their mean age?

Exercise 6

Examine the participant information in p_info by describing each of its variables:

  1. How many individuals are contained in the dataset?
  2. What percentage of them is female (i.e., has a sex value of 1)?
  3. How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?
  4. What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?
  5. What is the age range (min to max) of participants? What is the average (mean and median) age?
  6. Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as having a below-average income (i.e., an income value of 1)?

References

The participant data p_info (used in Exercise 6) is from:

  • Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R. and Schüz, B. (2018). Data from, ‘Web-based positive psychology interventions: A reexamination of effectiveness’. Journal of Open Psychology Data, 6: 1. doi: https://doi.org/10.5334/jopd.35


[Last update on 2019-04-29 19:20:42 by hn.]