Answers for WPA04 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats). A simple .Rmd template is provided here. Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.
Also enter the current assignment (e.g., WPA04), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.
Here is an example of how your file JamesJoy_WPA04_171120.R could look:
# Assignment: WPA 04
# Name: Joy, James
# Date: 2017 Nov 20
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# A. In Class
# Combining vectors to matrices and data frames:
# Exercise 1:
a <- letters[1:3] # define some vector
# ...
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review the basic concepts of the current chapter:
Checking and changing data
1. This exercise practices previous commands and skills (mostly indexing and subsetting of data frames) that are highly relevant for the current and future chapters. It uses a fictitious data set clothes.csv which describes 600 items of clothing by the following variables:
- idnr: a unique number identifying each item
- group: for whom this item is designed (kids, men, or women)
- type of clothing (dress, pants, shirt, shoes, or suit)
- brand of the item (6 popular labels)
- color of the item (8 different colors)
- price: the item’s recommended retail price
The data is available at http://Rpository.com/down/data/WPA04_clothes.csv.
1a. Load the data into a data frame called clothes.df and inspect it by using the head(), View(), str(), or summary() functions. (Hint: Use the read.table() function and note that the data file is comma-delimited and contains a header row.)
clothes.df <- read.table(file = "http://Rpository.com/down/data/WPA04_clothes.csv",
header = TRUE, sep = ",") # read from online url
# clothes.df <- read.table(file = "data/WPA04_clothes.csv",
# header = TRUE, sep = ",") # read from local file
# head(clothes.df)
# View(clothes.df)
# summary(clothes.df)
# str(clothes.df)
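The roles of the header and sep arguments can be tried out without any download. The following sketch reads a tiny comma-delimited table from an in-line string via the text argument of read.table(). The three rows and the names toy.txt, toy.df, and toy.df2 are made up for illustration and are not part of clothes.csv.

```r
# Hypothetical mini data (NOT the real clothes.csv), comma-delimited with a header row:
toy.txt <- "idnr,group,type,price
1,women,shirt,98.75
2,women,dress,125.75
3,men,pants,89.95"

toy.df <- read.table(text = toy.txt, header = TRUE, sep = ",")
str(toy.df)    # structure: 3 observations of 4 variables
names(toy.df)  # column names are taken from the header row

# read.csv() wraps read.table() with header = TRUE and sep = "," as defaults:
toy.df2 <- read.csv(text = toy.txt)
all.equal(toy.df, toy.df2)
```

For a local file or a URL, the text argument would be replaced by file, as in the solution above.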
1b. How many items of clothing exist of each color? How many pairs of shoes exist of each color? How many women’s dresses exist of each color? (Hint: Use the table() function on appropriate subsets of the data.)
# Using table() on the appropriate vector (selected via subsetting):
table(clothes.df$color)
#>
#> black blue brown green grey red silver white
#> 176 73 45 35 34 50 32 155
table(clothes.df$color[clothes.df$type == "shoes"])
#>
#> black blue brown green grey red silver white
#> 49 15 15 0 9 0 0 35
table(clothes.df$color[clothes.df$type == "dress" & clothes.df$group == "women"])
#>
#> black blue brown green grey red silver white
#> 29 10 6 7 0 12 9 28
# alternative solutions:
t.color.by.type <- with(clothes.df, table(color, type)) # create table
t.color.by.type
#> type
#> color dress pants shirt shoes suit
#> black 36 33 34 49 24
#> blue 12 17 23 15 6
#> brown 6 10 11 15 3
#> green 7 9 19 0 0
#> grey 0 7 15 9 3
#> red 15 13 22 0 0
#> silver 10 9 12 0 1
#> white 34 22 52 35 12
rowSums(t.color.by.type) # sums of rows
#> black blue brown green grey red silver white
#> 176 73 45 35 34 50 32 155
colSums(t.color.by.type) # sums of cols
#> dress pants shirt shoes suit
#> 120 120 188 123 49
# in 2 steps:
women <- clothes.df[(clothes.df$group == "women"), ] # create subset
with(women, table(color, type)) # table on subset
#> type
#> color dress pants shirt shoes suit
#> black 29 12 20 24 9
#> blue 10 6 12 5 3
#> brown 6 9 7 6 1
#> green 7 4 8 0 0
#> grey 0 3 7 2 3
#> red 12 4 8 0 0
#> silver 9 5 8 0 0
#> white 28 9 21 20 6
1c. How many pairs of Adidas shoes are in this data set? What’s their average price?
# (1) Step by step:
log.vec <- clothes.df$brand == "Adidas" & clothes.df$type == "shoes"
indices <- which(log.vec)
length(indices)
#> [1] 85
mean(clothes.df$price[indices])
#> [1] 106.8835
# (2) 1 step solutions:
length(which(clothes.df$brand == "Adidas" & clothes.df$type == "shoes"))
#> [1] 85
nrow(clothes.df[(clothes.df$brand == "Adidas" & clothes.df$type == "shoes"), ])
#> [1] 85
mean(clothes.df$price[clothes.df$brand == "Adidas" & clothes.df$type == "shoes"])
#> [1] 106.8835
mean(clothes.df[clothes.df$brand == "Adidas" & clothes.df$type == "shoes", ]$price)
#> [1] 106.8835
1d. What are the cheapest and the most expensive items of clothing in the data set?
# Some solutions:
# (1) Step by step:
min.price <- min(clothes.df$price); min.price
#> [1] 10.1
ix.min.price <- which(clothes.df$price == min.price)
clothes.df[ix.min.price, ]
#> idnr group type brand color price
#> 25 25 kids shirt Adidas red 10.1
max.price <- max(clothes.df$price); max.price
#> [1] 199.9
ix.max.price <- which(clothes.df$price == max.price)
clothes.df[ix.max.price, ]
#> idnr group type brand color price
#> 112 112 men suit Calvin Klein silver 199.9
# (2) Alternative solution:
sorted.clothes.df <- clothes.df[order(clothes.df$price), ] # a. sort clothes by (increasing) price
sorted.clothes.df[1, ] # b. first item: cheapest item
#> idnr group type brand color price
#> 25 25 kids shirt Adidas red 10.1
sorted.clothes.df[nrow(sorted.clothes.df), ] # c. last item: most expensive item
#> idnr group type brand color price
#> 112 112 men suit Calvin Klein silver 199.9
# (3) Two 1 step solutions:
clothes.df[which(clothes.df$price == min(clothes.df$price)), ]
#> idnr group type brand color price
#> 25 25 kids shirt Adidas red 10.1
clothes.df[which(clothes.df$price == max(clothes.df$price)), ]
#> idnr group type brand color price
#> 112 112 men suit Calvin Klein silver 199.9
# (4) 1 step solution (with %in%, as == would recycle the 2 values of range() along price):
clothes.df[clothes.df$price %in% range(clothes.df$price), ]
#> idnr group type brand color price
#> 25 25 kids shirt Adidas red 10.1
#> 112 112 men suit Calvin Klein silver 199.9
# (5) Using dplyr:
library(dplyr)
clothes.df %>% filter(price %in% range(price))
#> idnr group type brand color price
#> 1 25 kids shirt Adidas red 10.1
#> 2 112 men suit Calvin Klein silver 199.9
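When selecting the rows that match range(), it matters whether the comparison uses == or %in%. With ==, R recycles the two values of range() along the longer vector, so each price is compared with only one of them, depending on its position. The toy vector below (made-up prices, not from the data set) sketches the difference.

```r
# Made-up prices (for illustration only): minimum at position 2, maximum at position 3
price <- c(50, 10, 200, 80)
range(price)                    # 10 200

# == recycles c(10, 200) as 10, 200, 10, 200 and compares element by element:
price == range(price)           # FALSE FALSE FALSE FALSE: both extremes are missed

# %in% tests each price against BOTH values, regardless of its position:
price %in% range(price)         # FALSE TRUE TRUE FALSE
which(price %in% range(price))  # 2 3
```

A comparison with == only finds both extremes when the minimum happens to sit at an odd position and the maximum at an even one, so %in% is the safer choice.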
1e. H&M is having a sale on men’s items. Deduct 20% from the prices of all corresponding items. (Hint: Determine the appropriate subset of prices \(x\) to change. Deducting 20% from \(x\) is identical to multiplying \(x\) by a factor of .80.)
# (1) Step by step:
ix.HM.men <- which(clothes.df$brand == "H&M" & clothes.df$group == "men")
mean(clothes.df$price[ix.HM.men]) # check mean
#> [1] 99.07955
clothes.df$price[ix.HM.men] <- clothes.df$price[ix.HM.men] * .80
mean(clothes.df$price[ix.HM.men]) # check mean again
#> [1] 79.26364
## (2) same in 1 step:
# clothes.df$price[clothes.df$brand == "H&M" & clothes.df$group == "men"] <-
# clothes.df$price[clothes.df$brand == "H&M" & clothes.df$group == "men"] * .80
Merging data
2. A group of fashion activists has collected the true street prices of all items and saved them in a file that is available at http://Rpository.com/down/data/WPA04_streetprices.tab.
2a. Load the online data on street prices into a new data frame called street.df. (Hint: This data file is tab-delimited and contains a header row.)
street.df <- read.table(file = "http://Rpository.com/down/data/WPA04_streetprices.tab",
sep = "\t", header = TRUE) # read from online url
# street.df <- read.table(file = "data/WPA04_streetprices_2017.tab",
# sep = "\t", header = TRUE) # read from local file
head(street.df) # Note that cases/rows are NOT sorted by idnr:
#> idnr street.price
#> 545 545 96.7500
#> 14 14 64.6325
#> 164 164 62.6350
#> 271 271 152.6175
#> 312 312 149.7000
#> 281 281 87.9750
# str(street.df)
2b. Are the actual street prices typically cheaper or more expensive than the recommended retail prices (in the original data of clothes.df)? By what percentage do the two types of prices differ on average? (Hint: Compare the corresponding means.)
org.price.mn <- mean(clothes.df$price); org.price.mn
#> [1] 98.25127
str.price.mn <- mean(street.df$street.price); str.price.mn
#> [1] 80.94269
str.price.mn < org.price.mn # Are street prices cheaper?
#> [1] TRUE
str.price.mn/org.price.mn * 100 # pc of avg. street.price of org.price:
#> [1] 82.38335
(org.price.mn - str.price.mn)/org.price.mn * 100 # pc of reduction relative to org.price.mn:
#> [1] 17.61665
2c. Combine the price data in street.df with the original data stored in clothes.df. (Hint: You could first use order() to sort street.df by increasing idnr values and then use cbind() to add the street.price variable to clothes.df. However, combining two data frames that share a common column is simpler by using the merge() function.)
# (1) Step by step:
# a) Sort street.df by idnr:
head(street.df) # verify that idnr are unordered:
#> idnr street.price
#> 545 545 96.7500
#> 14 14 64.6325
#> 164 164 62.6350
#> 271 271 152.6175
#> 312 312 149.7000
#> 281 281 87.9750
ord.street.df <- street.df[order(street.df$idnr), ] # sort by idnr
head(ord.street.df) # verify that df is now ordered:
#> idnr street.price
#> 1 1 128.9600
#> 2 2 122.5500
#> 3 3 76.8825
#> 4 4 125.5295
#> 5 5 99.7520
#> 6 6 58.9250
# b) Check whether all idnr values are equal in both data frames:
all.equal(clothes.df$idnr, ord.street.df$idnr)
#> [1] TRUE
# c) Add street.price column from street.df to clothes.df:
clothes.1.df <- cbind(clothes.df, ord.street.df$street.price)
# head(clothes.1.df)
# d) Set the name of the last column back to "street.price":
names(clothes.1.df)[ncol(clothes.1.df)] <- "street.price"
head(clothes.1.df)
#> idnr group type brand color price street.price
#> 1 1 women shirt Marc O'Polo green 98.75 128.9600
#> 2 2 women dress Benetton black 125.75 122.5500
#> 3 3 men pants H&M black 71.96 76.8825
#> 4 4 women shirt Calvin Klein brown 116.75 125.5295
#> 5 5 kids shirt Zara blue 85.78 99.7520
#> 6 6 kids shirt Adidas red 85.70 58.9250
# str(clothes.1.df)
# --------------------
# (2) 1 step solution using merge():
clothes.2.df <- merge(x = clothes.df, y = street.df, by = "idnr")
head(clothes.2.df)
#> idnr group type brand color price street.price
#> 1 1 women shirt Marc O'Polo green 98.75 128.9600
#> 2 2 women dress Benetton black 125.75 122.5500
#> 3 3 men pants H&M black 71.96 76.8825
#> 4 4 women shirt Calvin Klein brown 116.75 125.5295
#> 5 5 kids shirt Zara blue 85.78 99.7520
#> 6 6 kids shirt Adidas red 85.70 58.9250
# str(clothes.2.df)
# --------------------
# Verify that both solutions yield equal street.price columns:
all.equal(clothes.1.df$street.price, clothes.2.df$street.price)
#> [1] TRUE
# Select one of the solutions:
clothes.df <- clothes.2.df
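The reason for sorting before cbind(), and for preferring merge(), can be seen in a small sketch. The two data frames below are made up for illustration (items.df and prices.df are hypothetical names) and only mimic the situation of a shared idnr column with rows in different orders.

```r
# Two made-up data frames sharing the key column idnr (rows in different order):
items.df  <- data.frame(idnr = 1:4,
                        type = c("shirt", "dress", "pants", "shoes"))
prices.df <- data.frame(idnr = c(3, 1, 4, 2),
                        street.price = c(30, 10, 40, 20))

# cbind() pairs rows by position, so unsorted rows get mismatched:
wrong.df <- cbind(items.df, street.price = prices.df$street.price)
wrong.df$street.price   # 30 10 40 20: item 1 wrongly receives the price of item 3

# merge() pairs rows by the values of the key column, whatever their order:
merged.df <- merge(x = items.df, y = prices.df, by = "idnr")
merged.df$street.price  # 10 20 30 40
```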
2d. Recompute the mean original and street prices by using the colMeans() function on the appropriate columns/variables of clothes.df. (Hint: Either use head() to determine the appropriate column numbers or use a test on names() to obtain their column numbers. The means should match those computed in 2b.)
# (1) Note that "price" and "street.price" are the 6th and 7th columns of clothes.df:
colMeans(clothes.df[ , 6:7])
#> price street.price
#> 98.25127 80.94269
# (2) (by first obtaining the column numbers):
column.nrs <- which(names(clothes.df) == "price" | names(clothes.df) == "street.price")
colMeans(clothes.df[ , column.nrs])
#> price street.price
#> 98.25127 80.94269
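As a possible third variant, columns can also be selected directly by name, which avoids counting positions altogether. The sketch below uses a made-up data frame (toy.df) rather than clothes.df, so its values are for illustration only.

```r
# Made-up data frame (for illustration only):
toy.df <- data.frame(idnr = 1:3,
                     price = c(10, 20, 30),
                     street.price = c(8, 18, 31))

# Selecting columns by name avoids counting column positions:
col.means <- colMeans(toy.df[ , c("price", "street.price")])
col.means

# The same selection via a test on names(), here written with %in%:
column.nrs <- which(names(toy.df) %in% c("price", "street.price"))
column.nrs
colMeans(toy.df[ , column.nrs])
```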
Aggregating data
3. This exercise practices different ways of aggregating data (i.e., computing summary statistics over groups of cases/rows that are defined by levels of categorical variables).
3a. Do the average recommended retail prices of clothing items vary according to the group for which they are designed? (Hint: Use the aggregate() function to aggregate the price over the categorical variable group.)
aggregate(price ~ group, FUN = mean, data = clothes.df)
#> group price
#> 1 kids 77.02476
#> 2 men 94.56718
#> 3 women 105.00348
# The order of average prices is: women > men > kids.
3b. Do the different brands differ in their policies of changing the average recommended retail prices to the actual street prices? (Hint: Use the aggregate() function twice to aggregate over the categorical variable brand.)
aggregate(price ~ brand, FUN = mean, data = clothes.df)
#> brand price
#> 1 Adidas 93.76625
#> 2 Benetton 116.01182
#> 3 Calvin Klein 113.25938
#> 4 H&M 88.82891
#> 5 Marc O'Polo 118.10909
#> 6 Zara 99.89463
aggregate(street.price ~ brand, FUN = mean, data = clothes.df)
#> brand street.price
#> 1 Adidas 63.67688
#> 2 Benetton 106.56618
#> 3 Calvin Klein 113.64964
#> 4 H&M 70.00339
#> 5 Marc O'Polo 128.00773
#> 6 Zara 81.19432
# Most brands have cheaper street prices than recommended prices, with 2 exceptions:
# Calvin Klein's prices are about equal, Marc O'Polo more expensive on the street.
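Base R can also aggregate both price variables in a single call by placing cbind() on the left-hand side of the formula. The following is a minimal sketch on made-up data (toy.df with two hypothetical brands A and B), not a replacement for the solution above.

```r
# Made-up data (for illustration only; not the clothes data):
toy.df <- data.frame(brand = c("A", "A", "B", "B"),
                     price = c(100, 120, 50, 70),
                     street.price = c(90, 100, 55, 75))

# cbind() on the left-hand side aggregates several variables at once:
means.df <- aggregate(cbind(price, street.price) ~ brand, data = toy.df, FUN = mean)

# Percentage change from recommended to street price (negative = cheaper on the street):
means.df$pc.change <- round((means.df$street.price - means.df$price) /
                              means.df$price * 100, 1)
means.df
```

In this toy example, brand A is cheaper on the street than recommended, whereas brand B is more expensive.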
3c. Do the average recommended retail prices of different kinds (or type) of clothing items vary according to the group for which they are designed? (Hint: Use the aggregate() function to aggregate over two categorical variables.)
aggregate(price ~ group + type, FUN = mean, data = clothes.df)
#> group type price
#> 1 kids dress 89.54882
#> 2 women dress 108.91830
#> 3 kids pants 67.90786
#> 4 men pants 82.98595
#> 5 women pants 101.74466
#> 6 kids shirt 69.69231
#> 7 men shirt 81.90082
#> 8 women shirt 93.33361
#> 9 kids shoes 91.36000
#> 10 men shoes 102.61500
#> 11 women shoes 112.71986
#> 12 men suit 129.73680
#> 13 women suit 126.40250
# OR
aggregate(price ~ type + group, FUN = mean, data = clothes.df)
#> type group price
#> 1 dress kids 89.54882
#> 2 pants kids 67.90786
#> 3 shirt kids 69.69231
#> 4 shoes kids 91.36000
#> 5 pants men 82.98595
#> 6 shirt men 81.90082
#> 7 shoes men 102.61500
#> 8 suit men 129.73680
#> 9 dress women 108.91830
#> 10 pants women 101.74466
#> 11 shirt women 93.33361
#> 12 shoes women 112.71986
#> 13 suit women 126.40250
# Items for kids are typically the cheapest, for women the most expensive (except for suits).
# Overall, kids' pants are cheapest and men's suits are most expensive.
3d. Solve Exercise 3b. again, but now by using one command that allows computing summary statistics for multiple dependent variables. (Hint: Load and use the dplyr package.)
library("dplyr")
# Template for using dplyr:
df %>% # dataframe to use
filter(var > n) %>% # filter condition
group_by(iv1, iv2) %>% # grouping variable(s)
summarise(
n = n(), # number of cases per group
a = mean(dv1), # calculate mean of dv1 in df
b = sd(dv2), # calculate sd of dv2 in df
c = max(dv3)) # calculate max of dv3 in df.
library("dplyr")
clothes.df %>%
group_by(brand) %>%
summarise(n = n(),
price.mn = mean(price),
street.mn = mean(street.price),
pc.reduced = round((street.mn - price.mn)/price.mn * 100, 1)
)
#> # A tibble: 6 × 5
#> brand n price.mn street.mn pc.reduced
#> <fctr> <int> <dbl> <dbl> <dbl>
#> 1 Adidas 160 93.76625 63.67688 -32.1
#> 2 Benetton 55 116.01182 106.56618 -8.1
#> 3 Calvin Klein 48 113.25938 113.64964 0.3
#> 4 H&M 211 88.82891 70.00339 -21.2
#> 5 Marc O'Polo 44 118.10909 128.00773 8.4
#> 6 Zara 82 99.89463 81.19432 -18.7
# Same results as above, but in 1 table (or "tibble").
3e. Solve Exercise 3c. again, but now excluding all items for kids and computing not only average prices, but also average street prices, their standard deviations, and the number of items in each category. (Hint: Aggregating cases over multiple categories, filtering data, and computing multiple dependent statistics clearly calls for the dplyr package.)
clothes.df %>%
filter(group != "kids") %>%
group_by(type, group) %>%
summarise(n = n(),
price.mn = mean(price),
street.mn = mean(street.price),
price.sd = sd(price),
street.sd = sd(street.price)
)
#> Source: local data frame [9 x 7]
#> Groups: type [?]
#>
#> type group n price.mn street.mn price.sd street.sd
#> <fctr> <fctr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 dress women 112 108.91830 90.75598 28.77324 33.09061
#> 2 pants men 37 82.98595 72.19331 31.43043 37.57264
#> 3 pants women 58 101.74466 83.14464 26.58351 38.64289
#> 4 shirt men 61 81.90082 74.77431 30.16340 37.17407
#> 5 shirt women 97 93.33361 76.98979 29.07833 36.27525
#> 6 shoes men 40 102.61500 76.59618 25.37502 25.64484
#> 7 shoes women 70 112.71986 83.68536 26.43946 26.87633
#> 8 suit men 25 129.73680 122.06558 28.64541 38.25193
#> 9 suit women 16 126.40250 110.89238 25.30128 35.99659
# Same results on mean price as before (plus others):
# Men's items are cheaper than women's, except for suits.
Checkpoint 1
At this point, you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…
B. At Home
JDM Study: Why do we overestimate others’ willingness to pay?
In this WPA, we will analyze data from the following study:
- Matthews, W. J., Gheorghiu, A. I., & Callan, M. J. (2016). Why do we overestimate others’ willingness to pay? Judgment and Decision Making, 11(1), 21–39.
The purpose of this research was to test if our beliefs about other people’s affluence (or wealth) affect how much we think they will be willing to pay for items.
You can find the full paper in the journal Judgment and Decision Making (in html or pdf format).
In this WPA, we will analyze some data from their first study. In Study 1, participants indicated the proportion of other people taking part in the survey who have more (or, depending on the task version, less) money than themselves, and then whether other people would be willing to pay more than them for each of 10 products.
Products and proportions
The following table shows the 10 products and the proportion p of participants who indicated that others would be willing to pay more for the product than themselves (cf. Table 1 in Matthews et al., 2016, p. 23).
Table 1: Proportion of participants who indicated that the “typical participant” would pay more than they would for each product in Study 1.
Product # | Product | Reported p(other > self) |
---|---|---|
1 | A freshly-squeezed glass of apple juice | .695 |
2 | A Parker ballpoint pen | .863 |
3 | A pair of Bose noise-cancelling headphones | .705 |
4 | A voucher giving dinner for two at Applebee’s | .853 |
5 | A 16 oz jar of Planters dry-roasted peanuts | .774 |
6 | A one-month movie pass | .800 |
7 | An Ikea desk lamp | .863 |
8 | A Casio digital watch | .900 |
9 | A large, ripe pineapple | .674 |
10 | A handmade wooden chess set | .732 |
Variable description
Here are descriptions of the data variables (taken from the authors’ dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt):
- id: participant id code.
- gender: participants’ gender, 1 = male, 2 = female.
- age: participants’ age.
- income: participants’ annual household income on a categorical scale with 8 options: Less than $15,000; $15,001–$25,000; $25,001–$35,000; $35,001–$50,000; $50,001–$75,000; $75,001–$100,000; $100,001–$150,000; greater than $150,000.
- p1-p10: whether the “typical” survey respondent would pay more (coded as 1) or less (coded as 0) than oneself, for each of the 10 products.
- task: whether the participant had to judge the proportion of other people who “have more money than you do” (coded as 1) or the proportion who “have less money than you do” (coded as 0).
- havemore: participant’s response when task = 1.
- haveless: participant’s response when task = 0.
- pcmore: participant’s estimate of the proportion of people who have more than they do (calculated as 100 - haveless when task = 0).
Managing your workspace
4. Navigate to a dedicated R project (in case you have not already done so) and start there with a clean slate (without any R objects from previous tasks):
4a. Open your R project from last week (which you called RCourse or something similar). There should be at least one subdirectory called data in this working directory.
# ok.
4b. Move or save your current R script (entitled LastFirst_WPA##_yymmdd.R) into the main folder of your project directory.
# ok.
4c. Delete any R objects still stored in your current session and set your working directory to the path of your project directory. (Hint: Check your file browser for your current path and note that path descriptions vary for different operating systems.)
rm(list = ls()) # clean all R objects
setwd("~/Desktop/RCourse/") # set to your working directory
Getting and saving data
5. Let’s get the original data set of study 1 and store it as a text file in our data directory:
5a. The original data are available at http://journal.sjdm.org/15/15909/data1.csv. Load this data set into a new R object called matthews.df. (Hint: Use the read.table() function and use the URL as the file name, provided you have a working internet connection.)
# (1) Get data from http://journal.sjdm.org/15/15909/data1.csv online:
matthews.df <- read.table(file = "http://journal.sjdm.org/15/15909/data1.csv",
sep = ",", header = TRUE)
# (2) from http://rpository.com/down/data/data1.csv online:
# matthews.df2 <- read.table(file = "http://rpository.com/down/data/data1.csv", sep = ",", header = TRUE)
# all.equal(matthews.df, matthews.df2)
# (3) from local file data/data1.csv:
# matthews.df3 <- read.table(file = "data/data1.csv", sep = ",", header = TRUE)
# all.equal(matthews.df, matthews.df3)
5b. Save the data set as a tab-delimited text file called matthews_study1.txt in the data directory of your working directory. (Hint: Use the write.table() function.)
write.table(x = matthews.df,
file = "data/matthews_study1.txt",
sep = "\t")
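To check that a file written by write.table() can be read back in, a round trip can be sketched on a made-up data frame. The example below writes to a temporary file (so no project file is touched) and sets row.names = FALSE, which is an optional choice not made in the solution above: by default, write.table() also writes the row names as an additional first column.

```r
# Made-up data frame (for illustration only):
toy.df <- data.frame(id = c("a", "b", "c"),
                     age = c(26, 32, 25))

# Write to a temporary file (hypothetical location chosen by tempfile()):
tmp.file <- tempfile(fileext = ".txt")
write.table(x = toy.df, file = tmp.file,
            sep = "\t",         # tab-delimited
            row.names = FALSE)  # omit the column of row names

# Read the file back and compare it with the original:
toy2.df <- read.table(file = tmp.file, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)
all.equal(toy.df$age, toy2.df$age)
identical(as.character(toy.df$id), toy2.df$id)

unlink(tmp.file)  # remove the temporary file
```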
Exploring data
6a. Explore the first few rows and the contents of matthews.df using the head(), View(), str(), and summary() commands.
head(matthews.df)
#> id gender age income p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 task
#> 1 R_3PtNn51LmSFdLNM 2 26 7 1 1 1 1 1 1 1 1 1 1 0
#> 2 R_2AXrrg62pgFgtMV 2 32 4 1 1 1 1 1 1 1 1 1 1 0
#> 3 R_cwEOX3HgnMeVQHL 1 25 2 0 1 1 1 1 1 1 1 0 0 0
#> 4 R_d59iPwL4W6BH8qx 1 33 5 1 1 1 1 1 1 1 1 1 1 0
#> havemore haveless pcmore
#> 1 NA 50 50
#> 2 NA 25 75
#> 3 NA 10 90
#> 4 NA 50 50
#> [ reached getOption("max.print") -- omitted 2 rows ]
# View(matthews.df)
# str(matthews.df)
# summary(matthews.df)
6b. What are the variable/column names of the data frame?
names(matthews.df)
#> [1] "id" "gender" "age" "income" "p1" "p2"
#> [7] "p3" "p4" "p5" "p6" "p7" "p8"
#> [13] "p9" "p10" "task" "havemore" "haveless" "pcmore"
6c. What was the participants’ mean age?
mean(matthews.df$age)
#> [1] 31.71579
6d. Currently, participants’ gender is coded as either 1 or 2. Create a new character column called gender.c that recodes these values as male and female, respectively. (Hint: Create a new column gender.c and use logical indexing to define its value based on the value of the gender column.)
# Create a new column called gender.c that recodes gender as a character string:
matthews.df$gender.c <- NA # initialize all values to NA.
matthews.df$gender.c[matthews.df$gender == 1] <- "male"
matthews.df$gender.c[matthews.df$gender == 2] <- "female"
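Two other common ways of recoding a numeric code into character labels are ifelse() and factor(). The sketch below applies both to a made-up vector of codes (not the study data) and checks that they agree. Note that ifelse() as used here would label every non-missing value other than 1 as "female", so it assumes that only the codes 1 and 2 occur.

```r
# Made-up gender codes (for illustration only):
gender <- c(2, 2, 1, 1, 2)

# (a) ifelse() chooses between 2 values based on a logical test:
gender.a <- ifelse(gender == 1, "male", "female")

# (b) factor() maps codes to labels; as.character() turns the factor into strings:
gender.b <- as.character(factor(gender, levels = c(1, 2),
                                labels = c("male", "female")))

gender.a
identical(gender.a, gender.b)
```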
6e. What percent of participants were male?
# 1)
mean(matthews.df$gender == 1)
#> [1] 0.6263158
# 2) with our new variable:
mean(matthews.df$gender.c == "male")
#> [1] 0.6263158
# 3) with the table() function:
table(matthews.df$gender.c)["male"]/sum(table(matthews.df$gender.c))
#> male
#> 0.6263158
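The division by the sum of a table can also be left to prop.table(), and multiplying by 100 turns a proportion into a percentage. A minimal sketch on a made-up vector (not the study data):

```r
# Made-up vector (for illustration only):
gender.c <- c("male", "female", "male", "male")

t.gender <- table(gender.c)
p.gender <- prop.table(t.gender)  # proportions of all levels
p.gender
p.gender["male"]                  # select a level by name rather than by position

pc.male <- round(100 * mean(gender.c == "male"), 1)  # as a percentage
pc.male
```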
Computing row and column statistics
7. Let’s compute some means for columns or rows of data frames:
7a. Create a new dataframe called product.df that only contains the 10 columns from p1 to p10 from matthews.df by running the following code.
# Create product.df containing only columns p1, p2, ... p10:
product.df <- matthews.df[ , paste("p", 1:10, sep = "")]
7b. The colMeans() function takes a dataframe as an argument, and returns a vector showing means across all rows for each column of data. Using colMeans(), calculate the percentage of participants who indicated that the ‘typical’ participant would be willing to pay more than them for each item. Do your values match with those reported in Table 1?
colMeans(product.df)
#> p1 p2 p3 p4 p5 p6 p7
#> 0.6947368 0.8631579 0.7052632 0.8526316 0.7736842 0.8000000 0.8631579
#> p8 p9 p10
#> 0.9000000 0.6736842 0.7315789
# Yes, the numbers match.
7c. The rowMeans() function is like colMeans(), but for calculating means across columns for every row of data. Using rowMeans(), calculate for each participant the percentage of the 10 items that the participant believed other people would spend more on. Save this data as a new vector called p.all.
p.all <- rowMeans(product.df)
7d. Add the p.all vector as a new column called p.all to the matthews.df dataframe.
matthews.df$p.all <- p.all
7e. What was the average value of p.all
across all 190 participants? This value is the answer to the question: “How often does the average participant think that someone else would pay more for an item than themselves?”
mean(matthews.df$p.all) # mean of all participant means
#> [1] 0.7857895
7f. How does the value just computed (i.e., a mean across the means of 190 participants) compare to the mean value of the 10 product means? (Hint: The 10 product means were computed in 7b. above.)
mean(colMeans(product.df)) # mean of the 10 product means
#> [1] 0.7857895
# They are equal.
# Note: Testing for exact equality with == can fail due to floating-point rounding:
mean(matthews.df$p.all) == mean(colMeans(product.df)) # returns FALSE here.
#> [1] FALSE
# To test for equality of numeric objects:
all.equal(mean(matthews.df$p.all), mean(colMeans(product.df))) # returns TRUE.
#> [1] TRUE
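The same effect can be reproduced with two simple numbers, which may make the reason clearer: most decimal fractions cannot be represented exactly as binary doubles, so two computations of the "same" value can differ in their last bits. A minimal sketch:

```r
x <- 0.1 + 0.2
y <- 0.3

x == y                   # FALSE: the two doubles differ in their last bits
abs(x - y)               # a tiny, but non-zero difference
isTRUE(all.equal(x, y))  # TRUE: equal up to a small numeric tolerance
```

Since all.equal() returns a description of the difference (rather than FALSE) when two objects are not equal, wrapping it in isTRUE() is advisable whenever the result is used in a condition.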
Merging and subsetting data
8. Let’s add a new table containing fictional demographic information about each participant. The data are stored in a text file at http://Rpository.com/down/data/matthews_demographics.txt.
8a. Load this data into an R object called demo.df. (Hint: Use the read.table() function, but first check the file for the appropriate column delimiter and the absence or presence of a variable header.)
demo.df <- read.table(file = "http://Rpository.com/down/data/matthews_demographics.txt",
sep = "\t", # data file is tab-delimited
header = TRUE) # data file contains a header row
head(demo.df)
#> id height race
#> 1 R_rcnylCRo9oKoClP 154 asian
#> 2 R_2SGkwNjBl2Ymiw5 170 white
#> 3 R_3Ok7Aza3vfRMKJX 171 black
#> 4 R_Xp767d7ZRcqMqTD 175 asian
#> 5 R_d5vg3yc1rC2g1JX 167 black
#> 6 R_cZ0mJRzPtTqDJ5f 167 asian
8b. Get the average height and the frequency of participants’ race from this new data frame.
mean(demo.df$height)
#> [1] 171.3842
table(demo.df$race)
#>
#> asian black hispanic white
#> 41 41 15 93
8c. Combine the original data in matthews.df with the new demographic data. (Hint: Use the merge() function with an appropriate argument specifying a common column.)
matthews.df <- merge(x = matthews.df,
y = demo.df,
by = "id")
8d. Using either indexing or subset(), calculate the mean age of all males and of all females:
# 1) logical indexing:
mean(matthews.df$age[matthews.df$gender.c == "male"])
#> [1] 29.76471
mean(matthews.df$age[matthews.df$gender.c == "female"])
#> [1] 34.98592
# 2) with subset():
male.df <- subset(matthews.df, subset = gender.c == "male")
mean(male.df$age)
#> [1] 29.76471
mean(subset(matthews.df, subset = gender.c == "female")$age)
#> [1] 34.98592
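Computing one statistic per group is also what tapply() does in a single call. The following sketch uses a made-up data frame (toy.df) rather than matthews.df, so the values are for illustration only.

```r
# Made-up data (for illustration only):
toy.df <- data.frame(gender.c = c("male", "female", "male", "female"),
                     age      = c(30, 40, 20, 36))

# tapply() applies a function to a vector, separately for each level of a grouping variable:
age.by.gender <- tapply(X = toy.df$age, INDEX = toy.df$gender.c, FUN = mean)
age.by.gender            # one mean per level: female 38, male 25
age.by.gender["female"]  # select a group by its name
```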
Checkpoint 2
At this point you are doing very well. But rather than separately computing each summary statistic, let’s try to aggregate over one or two categorical variables…
Aggregating data
9a. Calculate the mean age of male and female participants by aggregating age over gender.c (using the aggregate() function). Do you get the same results as before?
# 3) with aggregate():
aggregate(age ~ gender.c, FUN = mean, data = matthews.df)
#> gender.c age
#> 1 female 34.98592
#> 2 male 29.76471
# Yes, the results are the same.
9b. Using aggregate(), calculate the mean p.all value for male and female participants. Which gender is more likely to think that others would pay more for products than themselves?
aggregate(p.all ~ gender.c, FUN = mean, data = matthews.df)
#> gender.c p.all
#> 1 female 0.8014085
#> 2 male 0.7764706
# Females are more likely to think that others will pay more than themselves
# relative to males.
9c. Using aggregate(), calculate the mean p.all value of participants for each level of income. Is there a consistent relationship between p.all and income?
aggregate(p.all ~ income, FUN = mean, data = matthews.df)
#> income p.all
#> 1 1 0.9037037
#> 2 2 0.8044444
#> 3 3 0.7370370
#> 4 4 0.7862069
#> 5 5 0.7500000
#> 6 6 0.6958333
#> 7 7 0.8142857
#> 8 8 0.8666667
# The values tend to decrease from income = 1 to income = 6 (except for income = 3),
# then they go up again (for income = 7 and 8)!
9d. Now repeat the previous analysis, but only for females. (Hint: Use the subset argument within the aggregate() function.)
aggregate(p.all ~ income, FUN = mean, data = matthews.df,
subset = gender.c == "female")
#> income p.all
#> 1 1 0.8875000
#> 2 2 0.8294118
#> 3 3 0.8857143
#> 4 4 0.8625000
#> 5 5 0.7500000
#> 6 6 0.6785714
#> 7 7 0.7600000
#> 8 8 0.9000000
# For females, the group of income = 2 violates the overall trend.
9e. What was the mean age for participants for each combination of gender and income? (Hint: Use the aggregate() function with 2 independent variables.)
aggregate(age ~ income + gender.c, FUN = mean, data = matthews.df)
#> income gender.c age
#> 1 1 female 31.12500
#> 2 2 female 36.35294
#> 3 3 female 33.14286
#> 4 4 female 36.75000
#> 5 5 female 35.00000
#> 6 6 female 34.00000
#> 7 7 female 37.60000
#> 8 8 female 38.50000
#> 9 1 male 28.73684
#> 10 2 male 30.14286
#> 11 3 male 29.45000
#> 12 4 male 28.52381
#> 13 5 male 31.00000
#> 14 6 male 29.60000
#> 15 7 male 43.50000
#> 16 8 male 23.00000
9f. The variable pcmore reflects participants’ answer to the question: “What percentage of people taking part in this survey do you think earn more than you do?”. Using aggregate(), calculate the median value of this variable for each level of income. What does the result tell you?
aggregate(pcmore ~ income, FUN = median, data = matthews.df)
#> income pcmore
#> 1 1 80
#> 2 2 75
#> 3 3 50
#> 4 4 60
#> 5 5 50
#> 6 6 45
#> 7 7 50
#> 8 8 50
# The higher one's income, the less people tend to think that
# other people earn more than themselves (which makes sense).
# However, even people in the top 4 income levels think that
# about 50% earn more than themselves.
Checkpoint 3
At this point you are doing great, well done! But if you liked using the aggregate() function, you will love the dplyr package in the following exercises.
C. Challenges
Using dplyr
10. The remaining exercises focus on the dplyr package, which allows combining multiple independent and dependent variables in one command.
10a. Load the dplyr package (if not already loaded).
library(dplyr)
10b. For each level of gender, calculate the summary statistics in the following table:
Variable | Description |
---|---|
n | Number of participants |
age.mean | Mean age |
age.sd | Standard deviation of age |
income.mean | Mean income |
pcmore.mean | Mean value of pcmore |
p.all.mean | Mean value of p.all |
Save the computed summary statistics to an object called gender.df. (Hint: Use dplyr with appropriate group_by and summarise arguments.)
gender.df <- matthews.df %>%
group_by(gender.c) %>%
summarise(
n = n(),
age.mean = mean(age),
age.sd = sd(age),
income.mean = mean(income),
pcmore.mean = mean(pcmore),
p.all.mean = mean(p.all)
)
gender.df
#> # A tibble: 2 × 7
#> gender.c n age.mean age.sd income.mean pcmore.mean p.all.mean
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 female 71 34.98592 10.430029 3.943662 58.80282 0.8014085
#> 2 male 119 29.76471 7.648757 3.285714 62.25210 0.7764706
10c. For each level of income, calculate the summary statistics in the following table (but only for participants older than 21) and save them to a new object called income.df:
Variable | Description |
---|---|
n | Number of participants |
age.min | Minimum age |
age.mean | Mean age |
male.pc | Percentage of males |
female.pc | Percentage of females |
pcmore.mean | Mean value of pcmore |
p.all.mean | Mean value of p.all |
(Hint: Use dplyr with appropriate filter, group_by and summarise arguments.)
income.df <- matthews.df %>%
filter(age > 21) %>%
group_by(income) %>%
summarise(
n = n(),
age.mean = mean(age),
male.pc = mean(gender == 1),
female.pc = mean(gender == 2),
pcmore.mean = mean(pcmore),
p.all.mean = mean(p.all)
)
income.df
#> # A tibble: 8 × 7
#> income n age.mean male.pc female.pc pcmore.mean p.all.mean
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 29.76923 0.6923077 0.3076923 74.88462 0.9038462
#> 2 2 43 33.06977 0.6279070 0.3720930 70.09302 0.8069767
#> 3 3 25 31.20000 0.7600000 0.2400000 54.60000 0.7400000
#> 4 4 27 31.62963 0.7037037 0.2962963 61.25926 0.7814815
#> 5 5 26 33.30769 0.6153846 0.3846154 53.65385 0.7500000
#> 6 6 23 32.78261 0.3913043 0.6086957 46.00000 0.6826087
#> 7 7 7 39.28571 0.2857143 0.7142857 41.42857 0.8142857
#> 8 8 3 33.33333 0.3333333 0.6666667 33.33333 0.8666667
10d. Calculate several summary statistics (you choose which ones) aggregated at each level of race and gender. Save the results to an object called racegender.df.
racegender.df <- matthews.df %>%
group_by(race, gender.c) %>%
summarise(
n = n(), # N
age.min = min(age), # youngest person
height.mean = mean(height), # mean height
income.mean = mean(income), # mean income
pcmore.mean = mean(pcmore), # mean pcmore
p.all.mean = mean(p.all) # mean p.all
)
racegender.df
#> Source: local data frame [8 x 8]
#> Groups: race [?]
#>
#> race gender.c n age.min height.mean income.mean pcmore.mean
#> <fctr> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 asian female 20 22 169.2000 3.700000 57.10000
#> 2 asian male 21 20 169.8571 3.142857 58.33333
#> 3 black female 17 23 174.6471 5.294118 53.52941
#> 4 black male 24 18 174.1250 3.000000 77.54167
#> 5 hispanic female 4 24 163.5000 2.000000 76.25000
#> 6 hispanic male 11 22 172.3636 3.181818 58.09091
#> 7 white female 30 20 171.7000 3.600000 60.60000
#> 8 white male 63 19 170.8413 3.460317 58.46032
#> # ... with 1 more variables: p.all.mean <dbl>
10e. Using dplyr, calculate several summary statistics (you choose which ones) aggregated at each level of 3 independent variables of your choice. Save the results to an object called XYZ.df, where XYZ contains the names of the 3 variables over which you aggregated.
genderTaskIncome.df <- matthews.df %>%
group_by(gender.c, task, income) %>%
summarise(
n = n(), # N
pcmore.mean = mean(pcmore), # mean pcmore
pc.asian = mean(race == "asian"), # percent asians
p.all.mean = mean(p.all) # mean p.all
)
genderTaskIncome.df
#> Source: local data frame [29 x 7]
#> Groups: gender.c, task [?]
#>
#> gender.c task income n pcmore.mean pc.asian p.all.mean
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 female 0 1 3 73.33333 0.0000000 0.8666667
#> 2 female 0 2 11 67.27273 0.3636364 0.8363636
#> 3 female 0 3 3 56.66667 0.3333333 0.8666667
#> 4 female 0 4 5 71.00000 0.4000000 0.8600000
#> 5 female 0 5 4 63.75000 0.5000000 0.7000000
#> 6 female 0 6 8 50.62500 0.2500000 0.6500000
#> 7 female 0 7 2 50.00000 0.0000000 0.9000000
#> 8 female 0 8 2 50.00000 0.0000000 0.9000000
#> 9 female 1 1 5 55.00000 0.0000000 0.9000000
#> 10 female 1 2 6 66.66667 0.5000000 0.8166667
#> # ... with 19 more rows
10f. Save matthews.df, gender.df, income.df, racegender.df, and your XYZ.df objects to a file called matthews_dfs.RData in the data subdirectory of your working directory. (Hint: Use the save() function with an appropriate file argument.)
save(matthews.df, gender.df, income.df, racegender.df, genderTaskIncome.df,
file = "data/matthews_dfs.RData")
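To see what save() stores and how the objects come back, a round trip with load() can be sketched on two made-up objects. The example writes to a temporary file (so no project file is touched). The names a.df, b.df, tmp.file, and restored are hypothetical.

```r
# Made-up objects (for illustration only):
a.df <- data.frame(x = 1:3)
b.df <- data.frame(y = c("u", "v"))

tmp.file <- tempfile(fileext = ".RData")  # temporary file (location chosen by tempfile())
save(a.df, b.df, file = tmp.file)

rm(a.df, b.df)                     # remove the objects from the workspace
restored <- load(file = tmp.file)  # load() re-creates the objects and returns their names
restored                           # names of the loaded objects
exists("a.df")                     # TRUE: the object is available again

unlink(tmp.file)                   # remove the temporary file
```

Because load() restores objects under their original names, it silently overwrites any objects of the same name in the current workspace.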
That’s it – hope you enjoyed working on this assignment!
[WPA04_answers.Rmd updated on 2017-11-23 14:38:55 by hn.]