WPA02 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
Instructions
To complete and submit these exercises, please remember and do the following:
Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats).
A simple .Rmd template is provided here. Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.
Also enter the current assignment (e.g., WPA02), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.
Here is an example of how your file JacksonJill_WPA02_171106.Rmd could look:
# Assignment: WPA 02
# Name: Jackson, Jill
# Date: 2017 November 6
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# A. In Class
# Numerical indexing:
# Exercise 1:
a <- letters[1:3] # a vector of the 1st 3 letters of the alphabet
rev(rev(a)) # reverse vector a twice
# Exercise 1a:
# ...
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that repeat the basic concepts of the current chapter:
Numerical indexing
1a. Use numerical indexing to print the 18th, 9th, 19th, 3rd, 15th, 15th, and 12th letters of the alphabet. (Hint: The letters of the alphabet are stored in a vector letters.)
1b. Use numerical indexing to print the alphabet (a, b, c, etc.) except for its 18th, 9th, 19th, 3rd, 15th, 15th, and 12th letters. (Hint: Use your solution to 1a.)
1c. Use numerical indexing to print every 5th letter of the alphabet.
1d. Use numerical indexing to print the alphabet in reversed order (without using the function rev()).
1e. Use numerical indexing to print the last 10 letters of the alphabet (in reversed order).
1f. Use numerical indexing to print all odd numbers between 1 and 100.
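The main forms of numerical indexing can be sketched on a different vector. This is only an illustration, not a solution: it assumes R's built-in vector month.name (the twelve month names) instead of letters.

```r
# Toy example on the built-in vector month.name (not part of the exercises)
m <- month.name

m[c(3, 1, 1)]           # select elements by position; positions may repeat
m[-c(1, 2)]             # negative indices drop the elements at these positions
m[seq(2, 12, by = 2)]   # a sequence of positions selects every 2nd element
m[12:1]                 # a descending sequence returns the elements in reversed order
```

Positive and negative indices cannot be mixed within a single pair of brackets.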
Logical indexing
2a. Use logical indexing to print all odd numbers between 1 and 100. (Hint: \(x\) is an odd number iff the remainder of dividing \(x\) by 2 is 1: x %% 2 == 1.)
2b. Use logical indexing to print all numbers between 1000 and 2000 that are divisible by 17.
2c. Use logical indexing to print all numbers between 1 and 100 that are not multiples of 2 or 3.
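The hint about the remainder operator %% can be sketched as follows. This is a toy illustration with a made-up vector x and different conditions than those asked for above.

```r
# Toy example: logical indexing with the remainder operator %%
x <- 1:20

is_even <- x %% 2 == 0     # logical vector: TRUE where the remainder is 0
x[is_even]                 # keeps only the elements where is_even is TRUE

x[x %% 2 == 0 & x > 10]    # combine two conditions with & (and)
x[!(x %% 5 == 0)]          # negate a condition with ! (not)
```

The logical vector inside the brackets should have the same length as the vector being indexed.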
Indexing of pets and people
4. The following vectors describe a group of people and their beloved pets:
name <- c("Anna", "Boris", "Cloe", "David", "Emma", "Fred", "Gundula", "Heidi", "Ian", "John", "Ken", "Ludmilla", "Mary", "Nathan")
age <- c(21, 22, 29, 27, 22, 19, 26, 17, NA, 28, 35, 21, 23, 23)
pet <- c("cat", "dog", "cat", "rabbit", "dog", "cat", "cat", "horse", "dog", "hamster", "snake", "cat", "guinea pig", "nintendo")
Copy this data into your environment to do the following exercises.
4a. What are the names of the people younger than 21? (Solve this once without and once with using which().)
4b. Who has a horse as a pet?
4c. Who has neither a cat nor a dog as a pet? What pets do these people have instead? (Try solving this once with and once without using the %in% operator.)
4d. Which cat owners are older than 21?
4e. What is the average (or mean) age of all dog owners? (Exclude any people with unknown age from this analysis.)
4f. How many people own a rodent as a pet? (Hint: Create a vector of all rodents first. Then determine the number of rodent owners by using (a) logical or (b) numerical indexing, without and with using which().)
4g. Print a list of all unique pets (once using unique() and once using duplicated()).
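The functions mentioned in these exercises can be sketched on toy data. The vectors fruit and price below are invented for illustration and are not part of the exercise data.

```r
# Toy data (made up for illustration)
fruit <- c("apple", "kiwi", "apple", "plum", "kiwi", "fig")
price <- c(1.2, 0.5, 1.0, 0.8, 0.6, 2.0)

fruit[price < 1]                        # logical indexing
fruit[which(price < 1)]                 # which() turns TRUE values into positions

fruit[fruit %in% c("kiwi", "fig")]      # elements that match any value in a set
fruit[!(fruit %in% c("kiwi", "fig"))]   # elements that match none of them

unique(fruit)                           # each value once, in order of first appearance
fruit[!duplicated(fruit)]               # same result: drop all repeated occurrences
```

One difference matters when a vector contains NA values: indexing with a logical vector that contains NA returns NA at that position, whereas which() ignores NA values.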
Changing vector values
5a. Nathan broke his Nintendo and got a fish as a pet. Change the pet vector accordingly.
5b. Ian’s previously unknown age is 38. Adjust the age vector accordingly and check how this changes the average age of dog owners.
5c. It turns out that the age of all cat owners was wrong and they actually are 3 years older. Change the age vector accordingly.
5d. According to new regulations by the Ministry of Pet Ownership (MoPO), a person under the age of 18 may only own a hamster. Thus, the two people currently owning a horse and a hamster agree to swap their pets. Determine their names, change the pet vector accordingly, and verify that the new assignment is legal.
Checkpoint 1
At this point you have completed all warm-up exercises. This is good, but please keep going…
B. At Home
Bar Survey
The following vectors contain (fictional) data from a survey of 200 people at one of two bars in Konstanz (Casba and Klimperkasten) last Friday night at 1:30am. Each person was asked their age and which brand of perfume (cologne) they were wearing. After each person had answered these questions, a (very busy) researcher recorded how long they spent talking to other people at the bar. All data is stored in the following 6 vector objects:
age: The participant’s age (in years)
bar: Which bar the person went to (casba or klimperkasten)
cologne: Which cologne the person wore (gio or calvinklein)
gender: The person’s gender (male or female)
id: An ID code indicating the participant in the form x.n, where x is the name of the bar the participant was at, and n is a random indexing number
talk.time: The amount of time the person spent talking to other people (in minutes)
Thankfully, you don’t need to type in the collected data yourself. The objects are stored in an RData file online.
A. Load the vectors into your current R session by running the following code:
load(file = url("http://Rpository.com/down/data/WPA02.RData")) # from online source
# load(file = "WPA02.RData") # from a local file
B. The str() function provides basic information about the structure of objects. Familiarize yourself with the objects (bar, id, gender, age, cologne, and talk.time) by running the str() function on each of the 6 vectors. Of what types are they? (Note that this information is also provided in the ‘Environment’ tab of RStudio.)
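As a rough sketch of what str() reports, consider three small vectors of different types. The names v_num, v_chr and v_lgl are made up for illustration.

```r
# Toy vectors of three different types (made up for illustration)
v_num <- c(1.5, 2, 3)
v_chr <- c("a", "b")
v_lgl <- c(TRUE, FALSE)

str(v_num)     # prints the type (num), the index range and the first values
str(v_chr)     # chr
str(v_lgl)     # logi

class(v_num)   # returns the class as a character string
```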
Reviewing the data
6a. Get the average, minimum and maximum age of people and describe the distribution of the age values. (Hint: Use hist() to plot a histogram.)
6b. How many people were of each gender? (Hint: Use table().)
6c. What was the average time a person spent talking? (Hint: Compute the mean of talk.time.)
6d. What was the standard deviation of the talking times?
6e. Create talk.time.z, a \(z\)-score transformation of talk.time, and verify that the mean and standard deviation of talk.time.z are as expected. (Hint: The \(z\)-score of \(x\) is defined as \(\frac{x - mean(x)}{sd(x)}\).)
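The \(z\)-score formula from the hint can be sketched on a small made-up vector x (not the survey data):

```r
# Toy example: z-score transformation of a made-up vector
x <- c(2, 4, 4, 5, 7, 8)

x.z <- (x - mean(x)) / sd(x)   # subtract the mean, then divide by the standard deviation

mean(x.z)   # approximately 0 (up to floating-point rounding)
sd(x.z)     # 1
```

Because of floating-point rounding, the mean of a \(z\)-transformed vector is typically a very small number rather than exactly 0.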
Numerical indexing
7a. What was the value of the very first talk.time?
7b. Of what gender were the first ten participants?
7c. Which brand of cologne were the 10th through 20th participants wearing?
7d. Which bar did the last participant go to? (Hint: Don’t write the indexing number directly; instead, index the vector using the length() function with the appropriate argument.)
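Indexing with length() can be sketched as follows, using a made-up vector x:

```r
# Toy example: use length() instead of a hard-coded position
x <- c(10, 20, 30, 40)

x[length(x)]                   # the last element, whatever the length of x
x[(length(x) - 1):length(x)]   # the last two elements
```

Note the parentheses around length(x) - 1: the colon operator binds more tightly than subtraction, so length(x) - 1:length(x) would compute something else.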
Logical indexing (1 variable)
8a. How many people went to Casba? How many went to Klimperkasten?
8b. What percentage of all people went to Casba? (Hint: Use mean() combined with a logical vector.)
8c. How many people were less than 18 years old? How many were over 50?
8d. What percentage of people was at least 20 but not older than 30?
8e. How many people wore Gio? How many people wore Calvin Klein?
8f. How many people talked to others for less than an hour?
8g. What percentage of talking times were longer than three hours?
8h. What percentage of talking times were longer than 20 minutes but less than one hour?
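The hint about combining mean() with a logical vector works because R treats TRUE as 1 and FALSE as 0 in arithmetic. Here is a minimal sketch on a made-up vector x:

```r
# Toy example: counting and computing proportions with logical vectors
x <- c(3, 8, 12, 15, 21, 30)

big <- x > 10        # logical vector
sum(big)             # number of TRUE values, i.e. how many elements exceed 10
mean(big)            # proportion of TRUE values
mean(big) * 100      # the same as a percentage

mean(x > 10 & x <= 21) * 100   # percentage of values within a range
```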
Logical indexing (2 variables)
9a. What were the IDs of all people who went to Casba?
9b. What was the age of the youngest and oldest person at Klimperkasten?
9c. At which of the bars did people talk for longer? (Hint: Compare the average talking time of people who went to Casba vs. Klimperkasten.)
9d. Did people with different perfumes talk for different amounts of time? (Hint: Compare the average talking times of people wearing Gio vs. Calvin Klein.)
9e. Based on what you’ve learned so far, if someone wants to talk as much (or for as long) as possible, what brand of cologne should they wear?
Changing vector values by indexing
In the following exercises, we’ll use indexing and assignment to change some values within a vector. Because we typically don’t want to change the original data, we’ll make all of our adjustments on new vectors.
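As a sketch of this approach, the following toy example copies two made-up vectors (size and score, not part of the survey data) and changes values only in the copies:

```r
# Toy example: change values in a copy, leaving the original untouched
size  <- c("small", "large", "small", "large")
score <- c(5, 12, 48, 99)

size.r  <- size    # copies of the original vectors
score.r <- score

size.r[size.r == "small"] <- "S"   # replace all elements that meet a condition
size.r[size.r == "large"] <- "L"
score.r[score.r > 50] <- 50        # cap all values above 50 at 50

size           # the original still contains "small" and "large"
max(score.r)   # check the cap: 50
```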
10a. Create new objects age.r, bar.r, cologne.r and talk.time.r that are copies of the original age, bar, cologne and talk.time objects. (Hint: Just assign the existing vectors to new objects.)
10b. In the bar.r vector, change all "casba" values to "c" and change all "klimperkasten" values to "k".
10c. In the cologne.r vector, change the "gio" values to "G" and change the "calvinklein" values to "C".
10d. As a measure against age-based discrimination, change all values in the age.r vector that are lower than 21 to 21 and all age values above 39 to 39. (Check and describe the new age distribution with hist().)
10e. In the talk.time.r vector, change all talk time values greater than 280 to 280. Confirm that you did this correctly by checking the maximum talking time in talk.time.r.
Checkpoint 2
If you got this far you’re doing very well. But as things are just getting more interesting, you shouldn’t stop just yet…
Solving a paradox…
Remember the question about the average talking times of people wearing different brands of cologne? Perhaps there is a (causal or correlational?) relationship between both of these variables?
11a. Make a prediction: Based on what you’ve learned so far, if someone wanted to talk to people for as long as possible, what brand of cologne should they wear?
Let’s see if your prediction holds up…
11b. What was the average talking time of people who went to Casba and wore Gio?
11c. What was the average talking time of people who went to Casba and wore Calvin Klein?
11d. What was the average talking time of people who went to Klimperkasten and wore Gio?
11e. What was the average talking time of people who went to Klimperkasten and wore Calvin Klein?
11f. Based on what you’ve learned now, if someone’s goal was to talk to people for as long as possible, what brand of cologne should they wear?
You can visualize the data using the following code:
# Combine the relevant vectors in a dataframe:
survey.df <- data.frame(bar, cologne, talk.time)
# Create pirateplots to visualize the data (requires the yarrr package):
yarrr::pirateplot(talk.time ~ cologne, data = survey.df,
                  main = "Talk times by brand of cologne")
yarrr::pirateplot(talk.time ~ cologne + bar, data = survey.df,
                  main = "Talk times by brand of cologne at each bar")
What you’ve just seen is an example of Simpson’s Paradox. If you want to learn more about this, check out its Wikipedia page.
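To see how such a reversal can arise, here is a minimal sketch with invented numbers. The vectors place, brand and value are hypothetical and unrelated to the survey data. Brand X scores higher than brand Y within each place, but because most X observations come from the low-scoring place A, brand Y scores higher overall.

```r
# Toy data (invented for illustration; not the survey data)
place <- c("A", "A", "A", "A", "B", "B", "B", "B")
brand <- c("X", "X", "X", "Y", "X", "Y", "Y", "Y")
value <- c(20, 20, 20, 10, 60, 50, 50, 50)

# Overall means by brand: Y looks better
mean(value[brand == "X"])   # 30
mean(value[brand == "Y"])   # 40

# Means by brand within each place: X is better in both places
mean(value[brand == "X" & place == "A"])   # 20
mean(value[brand == "Y" & place == "A"])   # 10
mean(value[brand == "X" & place == "B"])   # 60
mean(value[brand == "Y" & place == "B"])   # 50

# The same four means as a table (rows: place, columns: brand)
tapply(value, list(place, brand), mean)
```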
Checkpoint 3
If you got this far you’re doing great. Let’s see whether you can also solve the following challenges…
C. Bonus challenges
12a. What were the mean ages of people of either gender at Casba?
12b. What percentage of women wore Calvin Klein?
13a. Let’s examine so-called tireless talkers, which are defined as people who talked for at least 100 minutes. Compare the median talking times of this group for both bars.
13b. Compare the median talking times of tireless talkers wearing Gio for both bars.
13c. What percentage of participants either went to Casba and talked for less than 2 hours or went to Klimperkasten and talked for more than 2 hours, but no longer than 3 hours?
14. Let’s make the Calvin Klein wearers look better by adding some bonus to their talking times.
14a. For all of the Calvin Klein wearers, add a random sample from a normal distribution (with a mean of 100 and a standard deviation of 10) to obtain their revised talking times. (Hint: Copy the original talk.time vector into a new object and add the bonus to Calvin Klein wearers by using logical indexing.)
14b. Verify that the averages of revised talking times for Calvin Klein wearers exceed the talking times for Gio wearers for each bar.
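The combination of logical indexing and random sampling from the hint in 14a can be sketched on toy data. The vectors group and score and the seed value are made up for illustration.

```r
# Toy example: add normally distributed noise to a subset of a vector
set.seed(1)   # arbitrary seed, only to make the random numbers reproducible

group <- c("a", "b", "a", "b", "b")
score <- c(10, 20, 30, 40, 50)

score.r <- score      # work on a copy
is_b <- group == "b"  # logical vector that selects the subset

# Draw exactly one random value per selected element and add it
score.r[is_b] <- score.r[is_b] + rnorm(n = sum(is_b), mean = 100, sd = 10)

score.r   # elements of group "a" are unchanged
```

Using sum(is_b) as the sample size ensures that the number of random values matches the number of selected elements.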
That’s it – now it’s time to submit your assignment!
Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.
[WPA02.Rmd updated on 2017-11-15 15:54:54 by hn.]