Answers for WPA04 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.

To complete and submit these exercises, please remember and do the following:

Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats).
- A simple .Rmd template is provided here.
- Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.
Also enter the current assignment (e.g., WPA04), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.

Here is an example of how your file JamesJoy_WPA04_171120.R could look:

# Assignment: WPA 04
# Name: Joy, James
# Date: 2017 Nov 20
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# A. In Class

# Combining vectors to matrices and data frames:

# Exercise 1:
a <- letters[1:3] # define some vector

# ...

Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.

A. In Class

Here are some warm-up exercises that review the basic concepts of the current chapter:

Checking and changing data

1. This exercise practices previous commands and skills (mostly indexing and subsetting of data frames) that are highly relevant for the current and future chapters. It uses a fictitious data set clothes.csv, which describes 600 items of clothing by the following variables:

idnr: a unique number identifying each item
group: the group for whom this item is designed (kids, men, or women)
type: the type of clothing (dress, pants, shirt, shoes, or suit)
brand: the brand of the item (6 popular labels)
color: the color of the item (8 different colors)
price: the item’s recommended retail price

The data is available at http://Rpository.com/down/data/WPA04_clothes.csv.

1a. Load the data into a data frame called clothes.df and inspect it by using the head(), View(), str(), or summary() functions. (Hint: Use the read.table() function and note that the data file is comma-delimited and contains a header row.)

clothes.df <- read.table(file = "http://Rpository.com/down/data/WPA04_clothes.csv", 
                         header = TRUE, sep = ",")          # read from online url
# clothes.df <- read.table(file = "data/WPA04_clothes.csv", 
#                          header = TRUE, sep = ",")        # read from local file

# head(clothes.df)
# View(clothes.df)
# summary(clothes.df)
# str(clothes.df)
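
For comma-delimited files with a header row, read.csv() can serve as a shortcut for read.table() with header = TRUE and sep = ",". The following is a minimal sketch that uses a small hypothetical file written to a temporary location (the file contents and the names toy.file, toy.1, and toy.2 are made up for illustration and are not part of the course data):

```r
# Hypothetical toy data, written to a temporary file (not the course data):
toy.file <- tempfile(fileext = ".csv")
writeLines(c("idnr,group,price",
             "1,kids,10.5",
             "2,men,20.0",
             "3,women,30.25"), con = toy.file)

# read.table() needs the header and the delimiter spelled out:
toy.1 <- read.table(file = toy.file, header = TRUE, sep = ",")

# read.csv() uses header = TRUE and sep = "," by default:
toy.2 <- read.csv(file = toy.file)

all.equal(toy.1, toy.2)  # do both calls yield the same data frame?
str(toy.1)               # inspect the column types
```

For this simple file both calls should return the same data frame. The two functions differ in some other defaults (e.g., for quotes and comment characters), so checking the result with str() remains a good habit.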

1b. How many items of clothing exist of each color? How many pairs of shoes exist of each color? How many women’s dresses exist of each color? (Hint: Use the table() function on appropriate subsets of the data.)

# Using table() on the appropriate vector (selected via subsetting):
table(clothes.df$color)
#> 
#>  black   blue  brown  green   grey    red silver  white 
#>    176     73     45     35     34     50     32    155
table(clothes.df$color[clothes.df$type == "shoes"])
#> 
#>  black   blue  brown  green   grey    red silver  white 
#>     49     15     15      0      9      0      0     35
table(clothes.df$color[clothes.df$type == "dress" & clothes.df$group == "women"])
#> 
#>  black   blue  brown  green   grey    red silver  white 
#>     29     10      6      7      0     12      9     28

# alternative solutions:
t.color.by.type <- with(clothes.df, table(color, type)) # create table
t.color.by.type
#>         type
#> color    dress pants shirt shoes suit
#>   black     36    33    34    49   24
#>   blue      12    17    23    15    6
#>   brown      6    10    11    15    3
#>   green      7     9    19     0    0
#>   grey       0     7    15     9    3
#>   red       15    13    22     0    0
#>   silver    10     9    12     0    1
#>   white     34    22    52    35   12
rowSums(t.color.by.type) # sums of rows
#>  black   blue  brown  green   grey    red silver  white 
#>    176     73     45     35     34     50     32    155
colSums(t.color.by.type) # sums of cols
#> dress pants shirt shoes  suit 
#>   120   120   188   123    49

# in 2 steps:
women <- clothes.df[(clothes.df$group == "women"), ]  # create subset
with(women, table(color, type)) # table on subset
#>         type
#> color    dress pants shirt shoes suit
#>   black     29    12    20    24    9
#>   blue      10     6    12     5    3
#>   brown      6     9     7     6    1
#>   green      7     4     8     0    0
#>   grey       0     3     7     2    3
#>   red       12     4     8     0    0
#>   silver     9     5     8     0    0
#>   white     28     9    21    20    6
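
When a two-way table is needed together with its totals, addmargins() can add the row and column sums in one step, and prop.table() converts counts into proportions. A minimal sketch on a hypothetical toy data frame (toy and t.toy are made-up names, and the values are not from clothes.csv):

```r
# Hypothetical toy data (6 items described by color and type):
toy <- data.frame(color = c("black", "black", "white", "red", "white", "black"),
                  type  = c("shoes", "shirt", "shoes", "shirt", "shirt", "shoes"))

t.toy <- with(toy, table(color, type))   # two-way table of counts
t.toy

addmargins(t.toy)                        # same table, plus row and column sums

round(prop.table(t.toy, margin = 2), 2)  # proportions within each type (column)
```

The margin argument of prop.table() determines the reference: 1 yields proportions within rows, 2 within columns.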

1c. How many pairs of Adidas shoes are in this data set? What’s their average price?

# (1) Step by step:
log.vec <- clothes.df$brand == "Adidas" & clothes.df$type == "shoes"
indices <- which(log.vec)
length(indices) 
#> [1] 85
mean(clothes.df$price[indices])
#> [1] 106.8835

# (2) 1 step solutions: 
length(which(clothes.df$brand == "Adidas" & clothes.df$type == "shoes"))
#> [1] 85
nrow(clothes.df[(clothes.df$brand == "Adidas" & clothes.df$type == "shoes"), ])
#> [1] 85
mean(clothes.df$price[clothes.df$brand == "Adidas" & clothes.df$type == "shoes"])
#> [1] 106.8835
mean(clothes.df[clothes.df$brand == "Adidas" & clothes.df$type == "shoes", ]$price)
#> [1] 106.8835

1d. What are the cheapest and the most expensive items of clothing in the data set?

# Some solutions: 
# (1) Step by step:
min.price <- min(clothes.df$price); min.price
#> [1] 10.1
ix.min.price <- which(clothes.df$price == min.price)
clothes.df[ix.min.price, ]
#>    idnr group  type  brand color price
#> 25   25  kids shirt Adidas   red  10.1

max.price <- max(clothes.df$price); max.price
#> [1] 199.9
ix.max.price <- which(clothes.df$price == max.price)
clothes.df[ix.max.price, ]
#>     idnr group type        brand  color price
#> 112  112   men suit Calvin Klein silver 199.9

# (2) Alternative solution:
sorted.clothes.df <- clothes.df[order(clothes.df$price), ] # a. sort clothes by (increasing) price
sorted.clothes.df[1, ]                       # b. first item: cheapest item 
#>    idnr group  type  brand color price
#> 25   25  kids shirt Adidas   red  10.1
sorted.clothes.df[nrow(sorted.clothes.df), ] # c. last item: most expensive item
#>     idnr group type        brand  color price
#> 112  112   men suit Calvin Klein silver 199.9

# (3) Two 1 step solutions:
clothes.df[which(clothes.df$price == min(clothes.df$price)), ]
#>    idnr group  type  brand color price
#> 25   25  kids shirt Adidas   red  10.1
clothes.df[which(clothes.df$price == max(clothes.df$price)), ]
#>     idnr group type        brand  color price
#> 112  112   men suit Calvin Klein silver 199.9

# (4) 1 step solution (note: %in% tests membership, whereas == would recycle the 2 values of range() along the rows):
clothes.df[clothes.df$price %in% range(clothes.df$price), ]
#>     idnr group  type        brand  color price
#> 25    25  kids shirt       Adidas    red  10.1
#> 112  112   men  suit Calvin Klein silver 199.9

# (5) Using dplyr:
library(dplyr)
clothes.df %>% filter(price %in% range(price))
#>   idnr group  type        brand  color price
#> 1   25  kids shirt       Adidas    red  10.1
#> 2  112   men  suit Calvin Klein silver 199.9
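
A note on selecting rows that match several values at once (such as both elements of range()): the operators == and %in% behave differently here. With ==, R recycles the shorter vector and compares position by position, so whether both extremes are found depends on where they happen to sit in the data. With %in%, each element is tested for membership in the set of values. A minimal sketch with a hypothetical toy vector p (not the course data):

```r
# Hypothetical toy prices:
p <- c(5, 1, 9, 3)     # min = 1 (at position 2), max = 9 (at position 3)
range(p)               # 1 9

# == recycles c(1, 9) to 1, 9, 1, 9 and compares position by position:
p == range(p)          # FALSE FALSE FALSE FALSE (both extremes are missed)

# %in% asks whether each element equals ANY of the values in range(p):
p %in% range(p)        # FALSE TRUE TRUE FALSE
p[p %in% range(p)]     # 1 9
```

In this toy example the == comparison finds neither extreme, because the minimum sits at an even position and the maximum at an odd one.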

1e. H&M is having a sale on men’s items. Deduct 20% from the prices of all corresponding items. (Hint: Determine the appropriate subset of prices $x$ to change. Deducting 20% from $x$ is identical to multiplying $x$ by a factor of .80.)

# (1) Step by step:
ix.HM.men <- which(clothes.df$brand == "H&M" & clothes.df$group == "men") 
mean(clothes.df$price[ix.HM.men]) # check mean
#> [1] 99.07955
clothes.df$price[ix.HM.men] <- clothes.df$price[ix.HM.men] * .80 
mean(clothes.df$price[ix.HM.men]) # check mean again
#> [1] 79.26364

## (2) same in 1 step:
# clothes.df$price[clothes.df$brand == "H&M" & clothes.df$group == "men"] <- 
#    clothes.df$price[clothes.df$brand == "H&M" & clothes.df$group == "men"] * .80

Merging data

2. A group of fashion activists has collected the true street prices of all items and saved them in a file that is available at http://Rpository.com/down/data/WPA04_streetprices.tab.

2a. Load the online data on street prices into a new data frame called street.df. (Hint: This data file is tab-delimited and contains a header row.)

street.df <- read.table(file = "http://Rpository.com/down/data/WPA04_streetprices.tab", 
                        sep = "\t", header = TRUE)              # read from online url
# street.df <- read.table(file = "data/WPA04_streetprices_2017.tab", 
#                        sep = "\t", header = TRUE)             # read from local file

head(street.df) # Note that cases/rows are NOT sorted by idnr:
#>     idnr street.price
#> 545  545      96.7500
#> 14    14      64.6325
#> 164  164      62.6350
#> 271  271     152.6175
#> 312  312     149.7000
#> 281  281      87.9750
# str(street.df)

2b. Are the actual street prices typically cheaper or more expensive than the recommended retail prices (in the original data of clothes.df)? By how many percent do the two types of prices differ on average? (Hint: Compare the corresponding means.)

org.price.mn <- mean(clothes.df$price); org.price.mn
#> [1] 98.25127
str.price.mn <- mean(street.df$street.price); str.price.mn
#> [1] 80.94269
str.price.mn < org.price.mn # Are street prices cheaper?
#> [1] TRUE

str.price.mn/org.price.mn * 100 # pc of avg. street.price of org.price:
#> [1] 82.38335
(org.price.mn - str.price.mn)/org.price.mn * 100 # pc of reduction relative to org.price.mn:
#> [1] 17.61665

2c. Combine the price data in street.df with the original data stored in clothes.df. (Hint: You could first use order() to sort street.df by increasing idnr values and then use cbind() to add the street.price variable to clothes.df. However, combining two data frames that share a common column is simpler by using the merge() function.)

# (1) Step by step: 
# a) Sort street.df by idnr:
head(street.df) # verify that idnr are unordered:
#>     idnr street.price
#> 545  545      96.7500
#> 14    14      64.6325
#> 164  164      62.6350
#> 271  271     152.6175
#> 312  312     149.7000
#> 281  281      87.9750
ord.street.df <- street.df[order(street.df$idnr), ] # sort by idnr
head(ord.street.df) # verify that df is now ordered:
#>   idnr street.price
#> 1    1     128.9600
#> 2    2     122.5500
#> 3    3      76.8825
#> 4    4     125.5295
#> 5    5      99.7520
#> 6    6      58.9250

# b) Check whether all idnr values are equal in both data frames:
all.equal(clothes.df$idnr, ord.street.df$idnr)
#> [1] TRUE

# c) Add street.price column from street.df to clothes.df:
clothes.1.df <- cbind(clothes.df, ord.street.df$street.price)
# head(clothes.1.df)

# d) Set the name of the last column back to "street.price":
names(clothes.1.df)[ncol(clothes.1.df)] <- "street.price"
head(clothes.1.df)
#>   idnr group  type        brand color  price street.price
#> 1    1 women shirt  Marc O'Polo green  98.75     128.9600
#> 2    2 women dress     Benetton black 125.75     122.5500
#> 3    3   men pants          H&M black  71.96      76.8825
#> 4    4 women shirt Calvin Klein brown 116.75     125.5295
#> 5    5  kids shirt         Zara  blue  85.78      99.7520
#> 6    6  kids shirt       Adidas   red  85.70      58.9250
# str(clothes.1.df)

# --------------------
# (2) 1 step solution using merge():
clothes.2.df <- merge(x = clothes.df, y = street.df, by = "idnr")
head(clothes.2.df)
#>   idnr group  type        brand color  price street.price
#> 1    1 women shirt  Marc O'Polo green  98.75     128.9600
#> 2    2 women dress     Benetton black 125.75     122.5500
#> 3    3   men pants          H&M black  71.96      76.8825
#> 4    4 women shirt Calvin Klein brown 116.75     125.5295
#> 5    5  kids shirt         Zara  blue  85.78      99.7520
#> 6    6  kids shirt       Adidas   red  85.70      58.9250
# str(clothes.2.df)

# --------------------
# Verify that both solutions yield equal street.price columns:
all.equal(clothes.1.df$street.price, clothes.2.df$street.price)
#> [1] TRUE

# Select one of the solutions:
clothes.df <- clothes.2.df
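
Why sorting matters for cbind() but not for merge() can be seen on a small example. The sketch below uses two hypothetical toy data frames (items and prices, with made-up values) in which the rows of the second are not ordered by idnr:

```r
# Hypothetical toy data frames (names and values are made up):
items  <- data.frame(idnr = 1:4,
                     type = c("shirt", "dress", "pants", "shoes"))
prices <- data.frame(idnr = c(3L, 1L, 4L, 2L),
                     street.price = c(30, 10, 40, 20))

# cbind() pastes columns side by side and ignores idnr, so rows get mismatched:
wrong <- cbind(items, street.price = prices$street.price)
wrong$street.price   # 30 10 40 20 (item 1 receives the price of item 3)

# merge() matches rows by the shared key column, whatever their order:
right <- merge(x = items, y = prices, by = "idnr")
right$street.price   # 10 20 30 40
```

This is why the step-by-step solution first sorts by idnr and checks the keys with all.equal() before calling cbind().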

2d. Recompute the mean original and street prices by using the colMeans() function on the appropriate columns/variables of clothes.df. (Hint: Either use head() to determine the appropriate column numbers or use a test on names() to obtain their column numbers. The means should match those computed in 2b.)

# (1) Note that "price" and "street.price" are the 6th and 7th columns of clothes.df:
colMeans(clothes.df[ , 6:7])
#>        price street.price 
#>     98.25127     80.94269

# (2) (by first obtaining the column numbers):
column.nrs <- which(names(clothes.df) == "price" | names(clothes.df) == "street.price")
colMeans(clothes.df[ , column.nrs])
#>        price street.price 
#>     98.25127     80.94269
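
Columns can also be selected directly by name, which avoids counting column positions and keeps working if columns are added or reordered later. A minimal sketch on a hypothetical toy data frame (toy and mns are made-up names):

```r
# Hypothetical toy data (not the course data):
toy <- data.frame(idnr = 1:3,
                  price = c(10, 20, 30),
                  street.price = c(8, 18, 31))

# Select the two columns by name and compute their means:
mns <- colMeans(toy[ , c("price", "street.price")])
mns   # price = 20, street.price = 19
```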

Aggregating data

3. This exercise practices different ways of aggregating data (i.e., computing summary statistics over groups of cases/rows that are defined by levels of categorical variables).

3a. Do the average recommended retail prices of clothing items vary according to the group for which they are designed? (Hint: Use the aggregate() function to aggregate the price over the categorical variable group.)

aggregate(price ~ group, FUN = mean, data = clothes.df)
#>   group     price
#> 1  kids  77.02476
#> 2   men  94.56718
#> 3 women 105.00348

# The order of average prices is: women > men > kids.
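
To see what aggregate() does, it can help to compare it with tapply() on a small example. The sketch below uses a hypothetical toy data frame (toy, agg, and tap are made-up names). The formula is passed without an argument name here because, as far as we know, the name of the first argument of aggregate()'s formula method differs between R versions (formula in older versions, x in newer ones), so leaving it unnamed should be the more portable choice:

```r
# Hypothetical toy data (not the course data):
toy <- data.frame(group = c("kids", "men", "kids", "women", "men", "women"),
                  price = c(10, 40, 20, 70, 60, 50))

# aggregate() returns a data frame with one row per group:
agg <- aggregate(price ~ group, data = toy, FUN = mean)
agg         # kids: 15, men: 50, women: 60

# tapply() computes the same means, but returns a named vector (1-d array):
tap <- tapply(X = toy$price, INDEX = toy$group, FUN = mean)
tap
```

The data frame returned by aggregate() is convenient for further processing, whereas the named vector returned by tapply() is handy for quick look-ups such as tap["men"].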

3b. Do the different brands differ in their policies of changing the average recommended retail prices to the actual street prices? (Hint: Use the aggregate() function twice to aggregate over the categorical variable brand.)

aggregate(price ~ brand, FUN = mean, data = clothes.df)
#>          brand     price
#> 1       Adidas  93.76625
#> 2     Benetton 116.01182
#> 3 Calvin Klein 113.25938
#> 4          H&M  88.82891
#> 5  Marc O'Polo 118.10909
#> 6         Zara  99.89463
aggregate(street.price ~ brand, FUN = mean, data = clothes.df)
#>          brand street.price
#> 1       Adidas     63.67688
#> 2     Benetton    106.56618
#> 3 Calvin Klein    113.64964
#> 4          H&M     70.00339
#> 5  Marc O'Polo    128.00773
#> 6         Zara     81.19432

# Most brands have cheaper street prices than recommended prices, with 2 exceptions: 
# Calvin Klein's prices are about equal, Marc O'Polo more expensive on the street.

3c. Do the average recommended retail prices of different kinds (or type) of clothing items vary according to the group for which they are designed? (Hint: Use the aggregate() function to aggregate over two categorical variables).

aggregate(price ~ group + type, FUN = mean, data = clothes.df)
#>    group  type     price
#> 1   kids dress  89.54882
#> 2  women dress 108.91830
#> 3   kids pants  67.90786
#> 4    men pants  82.98595
#> 5  women pants 101.74466
#> 6   kids shirt  69.69231
#> 7    men shirt  81.90082
#> 8  women shirt  93.33361
#> 9   kids shoes  91.36000
#> 10   men shoes 102.61500
#> 11 women shoes 112.71986
#> 12   men  suit 129.73680
#> 13 women  suit 126.40250
# OR 
aggregate(price ~ type + group, FUN = mean, data = clothes.df)
#>     type group     price
#> 1  dress  kids  89.54882
#> 2  pants  kids  67.90786
#> 3  shirt  kids  69.69231
#> 4  shoes  kids  91.36000
#> 5  pants   men  82.98595
#> 6  shirt   men  81.90082
#> 7  shoes   men 102.61500
#> 8   suit   men 129.73680
#> 9  dress women 108.91830
#> 10 pants women 101.74466
#> 11 shirt women  93.33361
#> 12 shoes women 112.71986
#> 13  suit women 126.40250

# Items for kids are typically the cheapest, for women the most expensive, with 
# the exception of suits (men's suits cost slightly more than women's).
# Overall, kids' pants are cheapest and men's suits are most expensive.

3d. Solve Exercise 3b. again, but now by using one command that allows computing summary statistics for multiple dependent variables. (Hint: Load and use the dplyr package.)

library("dplyr")

# Template for using dplyr (df, var, iv1, dv1, etc. are placeholders, so this will not run as is):
df %>%                   # dataframe to use
  filter(var > n) %>%    # filter condition
  group_by(iv1, iv2) %>% # grouping variable(s)
  summarise(
    n = n(),             # number of cases per group
    a = mean(dv1),       # calculate mean of dv1 in df
    b = sd(dv2),         # calculate sd of dv2 in df
    c = max(dv3))        # calculate max of dv3 in df.

clothes.df %>%
  group_by(brand) %>%
  summarise(n = n(),
            price.mn = mean(price),
            street.mn = mean(street.price), 
            pc.reduced = round((street.mn - price.mn)/price.mn * 100, 1)
            )
#> # A tibble: 6 × 5
#>          brand     n  price.mn street.mn pc.reduced
#>         <fctr> <int>     <dbl>     <dbl>      <dbl>
#> 1       Adidas   160  93.76625  63.67688      -32.1
#> 2     Benetton    55 116.01182 106.56618       -8.1
#> 3 Calvin Klein    48 113.25938 113.64964        0.3
#> 4          H&M   211  88.82891  70.00339      -21.2
#> 5  Marc O'Polo    44 118.10909 128.00773        8.4
#> 6         Zara    82  99.89463  81.19432      -18.7

# Same results as above, but in 1 table (or "tibble").
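
For comparison, base R can also summarise several dependent variables in one call by combining them with cbind() on the left side of the formula. This is only a sketch of an alternative, not a replacement for the dplyr solution, and it uses a hypothetical toy data frame (toy, mns, and pc.change are made-up names):

```r
# Hypothetical toy data (not the course data):
toy <- data.frame(brand = c("A", "A", "B", "B"),
                  price = c(100, 200, 50, 150),
                  street.price = c(80, 160, 60, 140))

# cbind() on the left-hand side aggregates both variables at once:
mns <- aggregate(cbind(price, street.price) ~ brand, data = toy, FUN = mean)

# Percentage change from mean price to mean street price:
mns$pc.change <- round((mns$street.price - mns$price) / mns$price * 100, 1)
mns   # A: 150, 120, -20;  B: 100, 100, 0
```

Unlike summarise(), this approach applies the same function to every dependent variable, so mixing statistics (e.g., a mean and a count) requires additional steps.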

3e. Solve Exercise 3c. again, but now excluding all items for kids and computing not only average prices, but also average street prices, their standard deviations, and the number of items in each category. (Hint: Aggregating cases over multiple categories, filtering data, and computing multiple dependent statistics clearly calls for the dplyr package.)

clothes.df %>%
  filter(group != "kids") %>%
  group_by(type, group) %>%
  summarise(n = n(),
            price.mn = mean(price),
            street.mn = mean(street.price),
            price.sd = sd(price),
            street.sd = sd(street.price)
            )
#> Source: local data frame [9 x 7]
#> Groups: type [?]
#> 
#>     type  group     n  price.mn street.mn price.sd street.sd
#>   <fctr> <fctr> <int>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1  dress  women   112 108.91830  90.75598 28.77324  33.09061
#> 2  pants    men    37  82.98595  72.19331 31.43043  37.57264
#> 3  pants  women    58 101.74466  83.14464 26.58351  38.64289
#> 4  shirt    men    61  81.90082  74.77431 30.16340  37.17407
#> 5  shirt  women    97  93.33361  76.98979 29.07833  36.27525
#> 6  shoes    men    40 102.61500  76.59618 25.37502  25.64484
#> 7  shoes  women    70 112.71986  83.68536 26.43946  26.87633
#> 8   suit    men    25 129.73680 122.06558 28.64541  38.25193
#> 9   suit  women    16 126.40250 110.89238 25.30128  35.99659

# Same results on mean price as before (plus others):
# Men's items are cheaper than women's, except for suits.

Checkpoint 1

At this point, you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…

B. At Home

JDM Study: Why do we overestimate others’ willingness to pay?

Abstract of Matthews et al. (2016)

In this WPA, we will analyze data from the following study:

Matthews, W. J., Gheorghiu, A. I., & Callan, M. J. (2016). Why do we overestimate others’ willingness to pay? Judgment and Decision Making, 11(1), 21–39.

The purpose of this research was to test if our beliefs about other people’s affluence (or wealth) affect how much we think they will be willing to pay for items.

You can find the full paper in the journal Judgment and Decision Making (in html or pdf format).

In this WPA, we will analyze some data from their first study. In Study 1, participants indicated the proportion of other people taking part in the survey who have more money than they do, and then whether other people would be willing to pay more than them for each of 10 products.

Products and proportions

The following table shows the 10 products and the proportion p of participants who indicated that others would be willing to pay more for the product than themselves (cf. Table 1 in Matthews et al., 2016, p. 23).

Table 1: Proportion of participants who indicated that the “typical participant” would pay more than they would for each product in Study 1.

Product #	Product	Reported p(other > self)
1	A freshly-squeezed glass of apple juice	.695
2	A Parker ballpoint pen	.863
3	A pair of Bose noise-cancelling headphones	.705
4	A voucher giving dinner for two at Applebee’s	.853
5	A 16 oz jar of Planters dry-roasted peanuts	.774
6	A one-month movie pass	.800
7	An Ikea desk lamp	.863
8	A Casio digital watch	.900
9	A large, ripe pineapple	.674
10	A handmade wooden chess set	.732

Variable description

Here are descriptions of the data variables (taken from the authors’ dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt):

id: participant id code.
gender: participants’ gender, 1 = male, 2 = female.
age: participants’ age.
income: participants’ annual household income on categorical scale with 8 categorical options: Less than $15,000; $15,001–$25,000; $25,001–$35,000; $35,001–$50,000; $50,001–$75,000; $75,001–$100,000; $100,001–$150,000; greater than $150,000.
p1-p10: whether the “typical” survey respondent would pay more (coded as 1) or less (coded as 0) than oneself, for each of the 10 products.
task: whether the participant had to judge the proportion of other people who “have more money than you do” (coded as 1) or the proportion who “have less money than you do” (coded as 0).
havemore: participant’s response when task = 1.
haveless: participant’s response when task = 0.
pcmore: participant’s estimate of the proportion of people who have more than they do (calculated as 100-haveless when task = 0).

Managing your workspace

4. Navigate to a dedicated R project (in case you have not already done so) and start there with a clean slate (without any R objects from previous tasks):

4a. Open your R project from last week (which you called RCourse or something similar). There should be at least one subdirectory called data in this working directory.

# ok.

4b. Move or save your current R script (entitled LastFirst_WPA##_yymmdd.R) into the main folder of your project directory.

# ok.

4c. Delete any R objects still stored in your current session and set your working directory to the path of your project directory. (Hint: Check your file browser for your current path and note that path descriptions vary for different operating systems.)

rm(list = ls())             # clean all R objects
setwd("~/Desktop/RCourse/") # set to your working directory
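
Before reading or writing files, it can be useful to check where R is currently working and whether the data subdirectory exists. The following sketch assumes a hypothetical project folder inside R's temporary directory (proj.dir and data.dir are made-up names); in your own project you would use your actual project path instead:

```r
getwd()   # show the current working directory

# Hypothetical project folder in a temporary location (for illustration only):
proj.dir <- file.path(tempdir(), "RCourse")
data.dir <- file.path(proj.dir, "data")   # file.path() builds OS-independent paths

# Create the data subdirectory if it does not exist yet:
if (!dir.exists(data.dir)) {
  dir.create(data.dir, recursive = TRUE)
}

dir.exists(data.dir)   # TRUE
list.files(proj.dir)   # should list the data subdirectory
```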

Getting and saving data

5. Let’s get the original data set of study 1 and store it as a text file in our data directory:

5a. The original data are available at http://journal.sjdm.org/15/15909/data1.csv. Load this data set into a new R object called matthews.df. (Hint: Use the read.table() function and use the URL as the file name, provided you have a working internet connection.)

# (1) Get data from http://journal.sjdm.org/15/15909/data1.csv online:
matthews.df <- read.table(file = "http://journal.sjdm.org/15/15909/data1.csv", 
                          sep = ",", header = TRUE)

# (2) from http://rpository.com/down/data/data1.csv online:
# matthews.df2 <- read.table(file = "http://rpository.com/down/data/data1.csv", sep = ",", header = TRUE)
# all.equal(matthews.df, matthews.df2)

# (3) from local file data/data1.csv:
# matthews.df3 <- read.table(file = "data/data1.csv", sep = ",", header = TRUE)
# all.equal(matthews.df, matthews.df3)

5b. Save the data set as a tab-delimited text file called matthews_study1.txt in the data directory of your working directory. (Hint: Use the write.table() function.)

write.table(x = matthews.df, 
            file = "data/matthews_study1.txt", 
            sep = "\t")
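
To verify that a saved file can be read back correctly, a quick round trip helps. The sketch below uses a hypothetical toy data frame and a temporary file (toy, toy.file, and back are made-up names). It sets row.names = FALSE, which is an optional choice that keeps R's row numbers out of the file; with the default settings used above, write.table() also stores the row names:

```r
# Hypothetical toy data (not the study data):
toy <- data.frame(id = c("a", "b"), age = c(26.5, 32.5))

toy.file <- tempfile(fileext = ".txt")   # temporary file for illustration

# Write as tab-delimited text, without row names:
write.table(x = toy, file = toy.file, sep = "\t", row.names = FALSE)

# Read the file back and compare with the original:
back <- read.table(file = toy.file, sep = "\t", header = TRUE)
all.equal(toy, back)   # TRUE
```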

Exploring data

6a. Explore the first few rows and the contents of matthews.df using the head(), View(), str(), and summary() commands.

head(matthews.df)
#>                  id gender age income p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 task
#> 1 R_3PtNn51LmSFdLNM      2  26      7  1  1  1  1  1  1  1  1  1   1    0
#> 2 R_2AXrrg62pgFgtMV      2  32      4  1  1  1  1  1  1  1  1  1   1    0
#> 3 R_cwEOX3HgnMeVQHL      1  25      2  0  1  1  1  1  1  1  1  0   0    0
#> 4 R_d59iPwL4W6BH8qx      1  33      5  1  1  1  1  1  1  1  1  1   1    0
#>   havemore haveless pcmore
#> 1       NA       50     50
#> 2       NA       25     75
#> 3       NA       10     90
#> 4       NA       50     50
#>  [ reached getOption("max.print") -- omitted 2 rows ]
# View(matthews.df)
# str(matthews.df)
# summary(matthews.df)

6b. What are the variable/column names of the data frame?

names(matthews.df)
#>  [1] "id"       "gender"   "age"      "income"   "p1"       "p2"      
#>  [7] "p3"       "p4"       "p5"       "p6"       "p7"       "p8"      
#> [13] "p9"       "p10"      "task"     "havemore" "haveless" "pcmore"

6c. What was the participants’ mean age?

mean(matthews.df$age)
#> [1] 31.71579

6d. Currently, participants’ gender is coded as either 1 or 2. Create a new character column called gender.c that recodes these values as male and female, respectively. (Hint: Create a new column gender.c and use logical indexing to define its value based on the value of the gender column.)

# Create a new column called gender.c that recodes gender as a character string:
matthews.df$gender.c <- NA # initialize all values to NA.
matthews.df$gender.c[matthews.df$gender == 1] <- "male"
matthews.df$gender.c[matthews.df$gender == 2] <- "female"
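
Two alternative ways of recoding numeric codes as labels are factor() with explicit levels and labels, and ifelse(). A minimal sketch on a hypothetical toy vector (gender, gender.c, and gender.c2 here are stand-alone toy objects, not columns of matthews.df):

```r
# Hypothetical toy codes (1 = male, 2 = female), including a missing value:
gender <- c(1, 2, 2, 1, NA)

# (a) factor() maps each level to its label; as.character() yields strings:
gender.c <- as.character(factor(gender, levels = c(1, 2),
                                labels = c("male", "female")))
gender.c    # "male" "female" "female" "male" NA

# (b) ifelse() handles two categories in one line.
#     Caution: any value other than 1 (e.g., a miscoded 3) would become "female".
gender.c2 <- ifelse(gender == 1, "male", "female")

identical(gender.c, gender.c2)   # TRUE
```

Initializing the new column with NA and filling it by logical indexing (as above) has the advantage that unexpected codes remain NA rather than being silently assigned to a category.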

6e. What percentage of participants were male?

# 1) 
mean(matthews.df$gender == 1) 
#> [1] 0.6263158
# 2) with our new variable:
mean(matthews.df$gender.c == "male")
#> [1] 0.6263158
# 3) with the table() function:
table(matthews.df$gender.c)["male"]/sum(table(matthews.df$gender.c))
#>      male 
#> 0.6263158
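
The mean of a logical vector is the proportion of TRUE values, which is why mean() works here. The function prop.table() offers a related shortcut that turns a table of counts into proportions. A minimal sketch on a hypothetical toy vector (gender.toy and p.male are made-up names):

```r
# Hypothetical toy data:
gender.toy <- c("male", "female", "male", "male")

gender.toy == "male"           # TRUE FALSE TRUE TRUE
mean(gender.toy == "male")     # 0.75 (TRUE counts as 1, FALSE as 0)

prop.table(table(gender.toy))  # female: 0.25, male: 0.75

# Selecting by name is more robust than selecting by position:
p.male <- prop.table(table(gender.toy))["male"]
p.male * 100                   # 75 (as a percentage)
```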

Computing row and column statistics

7. Let’s compute some means for columns or rows of data frames:

7a. Create a new dataframe called product.df that only contains the 10 columns from p1 to p10 from matthews.df by running the following code.

# Create product.df containing only columns p1, p2, ... p10: 
product.df <- matthews.df[ , paste("p", 1:10, sep = "")]

7b. The colMeans() function takes a dataframe as an argument, and returns a vector showing means across all rows for each column of data. Using colMeans(), calculate the percentage of participants who indicated that the ‘typical’ participant would be willing to pay more than them for each item. Do your values match with those reported in Table 1?

colMeans(product.df)
#>        p1        p2        p3        p4        p5        p6        p7 
#> 0.6947368 0.8631579 0.7052632 0.8526316 0.7736842 0.8000000 0.8631579 
#>        p8        p9       p10 
#> 0.9000000 0.6736842 0.7315789

# Yes, the numbers match.

7c. The rowMeans() function is like colMeans(), but calculates means across columns for every row of data. Using rowMeans(), calculate for each participant the percentage of the 10 items that the participant believed other people would spend more on. Save this data as a new vector called p.all.

p.all <- rowMeans(product.df)

7d. Add the p.all vector as a new column called p.all to the matthews.df dataframe.

matthews.df$p.all <- p.all

7e. What was the average value of p.all across all 190 participants? This value is the answer to the question: “How often does the average participant think that someone else would pay more for an item than themselves?”

mean(matthews.df$p.all) # mean of all participant means
#> [1] 0.7857895

7f. How does the value just computed (i.e., a mean across the means of 190 participants) compare to the mean value of the 10 product means? (Hint: The 10 product means were computed in 7b. above.)

mean(colMeans(product.df)) # mean of the 10 product means
#> [1] 0.7857895

# They are equal.

# Note: Checking for equality:
mean(matthews.df$p.all) == mean(colMeans(product.df)) # would return FALSE.
#> [1] FALSE

# To test for equality of numeric objects:
all.equal(mean(matthews.df$p.all), mean(colMeans(product.df))) # returns TRUE.
#> [1] TRUE
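
Two points are worth illustrating here. First, when every row has the same number of columns and no values are missing, the mean of the row means and the mean of the column means both equal the grand mean of all values. Second, results that are mathematically equal can still differ by tiny rounding errors in floating-point arithmetic, so == is not a reliable test for numeric equality. A minimal sketch with a hypothetical toy matrix m (made-up values):

```r
# Hypothetical toy matrix: 4 participants (rows) by 3 products (columns):
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              1, 1, 1), nrow = 4, byrow = TRUE)

mean(rowMeans(m))   # mean of the 4 row means
mean(colMeans(m))   # mean of the 3 column means
mean(m)             # grand mean of all 12 values (9/12 = 0.75)

# Compare with a tolerance rather than with ==:
isTRUE(all.equal(mean(rowMeans(m)), mean(colMeans(m))))   # TRUE

# A classic example of why == can fail for decimal numbers:
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```

Wrapping all.equal() in isTRUE() is useful inside if () statements, because all.equal() returns a description of the difference (rather than FALSE) when its arguments are not equal.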

Merging and subsetting data

8. Let’s add a new table containing fictional demographic information about each participant. The data are stored in a text file at http://Rpository.com/down/data/matthews_demographics.txt.

8a. Load this data into an R object called demo.df. (Hint: Use the read.table() function, but first check the file for the appropriate column delimiter and the absence or presence of a variable header.)

demo.df <- read.table(file = "http://Rpository.com/down/data/matthews_demographics.txt", 
                      sep = "\t",    # data file is tab-delimited
                      header = TRUE) # data file contains a header row
head(demo.df)
#>                  id height  race
#> 1 R_rcnylCRo9oKoClP    154 asian
#> 2 R_2SGkwNjBl2Ymiw5    170 white
#> 3 R_3Ok7Aza3vfRMKJX    171 black
#> 4 R_Xp767d7ZRcqMqTD    175 asian
#> 5 R_d5vg3yc1rC2g1JX    167 black
#> 6 R_cZ0mJRzPtTqDJ5f    167 asian

8b. Get the average height and the frequency of participants’ race from this new data frame.

mean(demo.df$height)
#> [1] 171.3842
table(demo.df$race)
#> 
#>    asian    black hispanic    white 
#>       41       41       15       93

8c. Combine the original data in matthews.df with the new demographic data. (Hint: Use the merge() function with an appropriate argument specifying a common column.)

matthews.df <- merge(x = matthews.df,
                     y = demo.df,
                     by = "id")
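
By default, merge() keeps only rows whose key appears in both data frames, so unmatched participants would be dropped silently. A quick check of the keys before merging can reveal such cases. The sketch below uses two hypothetical toy data frames (a and b, with made-up ids and values):

```r
# Hypothetical toy data frames (not the study data):
a <- data.frame(id = c("x1", "x2", "x3"), age = c(26, 32, 25),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("x3", "x1", "x4"), height = c(171, 154, 180),
                stringsAsFactors = FALSE)

setdiff(a$id, b$id)    # "x2": in a, but not in b
setdiff(b$id, a$id)    # "x4": in b, but not in a
anyDuplicated(a$id)    # 0 means that no id occurs twice

nrow(merge(a, b, by = "id"))                # 2 (only ids present in both)
nrow(merge(a, b, by = "id", all.x = TRUE))  # 3 (all rows of a; missing height is NA)
```

Comparing nrow() before and after merging is a simple way to confirm that no cases were lost.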

8d. Using either indexing or subset(), calculate the mean age of all males and of all females:

# 1) logical indexing: 
mean(matthews.df$age[matthews.df$gender.c == "male"])
#> [1] 29.76471
mean(matthews.df$age[matthews.df$gender.c == "female"])
#> [1] 34.98592

# 2) with subset():
male.df <- subset(matthews.df, subset = gender.c == "male")
mean(male.df$age)
#> [1] 29.76471
mean(subset(matthews.df, subset = gender.c == "female")$age)
#> [1] 34.98592

Checkpoint 2

At this point you are doing very well. But rather than separately computing each summary statistic, let’s try to aggregate over one or two categorical variables…

Aggregating data

9a. Calculate the mean age of male and female participants by aggregating age over gender.c (using the aggregate() function). Do you get the same results as before?

# 3) with aggregate():
aggregate(age ~ gender.c, FUN = mean, data = matthews.df)
#>   gender.c      age
#> 1   female 34.98592
#> 2     male 29.76471

# Yes, the results are the same.

9b. Using aggregate(), calculate the mean p.all value for male and female participants. Which gender is more likely to think that others would pay more for products than themselves?

aggregate(p.all ~ gender.c, FUN = mean, data = matthews.df)
#>   gender.c     p.all
#> 1   female 0.8014085
#> 2     male 0.7764706

# Females are more likely to think that others will pay more than themselves 
# relative to males.

9c. Using aggregate(), calculate the mean p.all value of participants for each level of income. Is there a consistent relationship between p.all and income?

aggregate(p.all ~ income, FUN = mean, data = matthews.df)
#>   income     p.all
#> 1      1 0.9037037
#> 2      2 0.8044444
#> 3      3 0.7370370
#> 4      4 0.7862069
#> 5      5 0.7500000
#> 6      6 0.6958333
#> 7      7 0.8142857
#> 8      8 0.8666667

# The values tend to decrease from income = 1 to income = 6 (except for income = 3), 
# then they go up again (for income = 7 and 8)!

9d. Now repeat the previous analysis, but only for females. (Hint: Use the subset argument within the aggregate function.)

aggregate(p.all ~ income, FUN = mean, data = matthews.df,
          subset = gender.c == "female")
#>   income     p.all
#> 1      1 0.8875000
#> 2      2 0.8294118
#> 3      3 0.8857143
#> 4      4 0.8625000
#> 5      5 0.7500000
#> 6      6 0.6785714
#> 7      7 0.7600000
#> 8      8 0.9000000

# For females, the group of income = 2 violates the overall trend.

9e. What was the mean age for participants for each combination of gender and income? (Hint: Use the aggregate() function with 2 independent variables.)

aggregate(age ~ income + gender.c, FUN = mean, data = matthews.df)
#>    income gender.c      age
#> 1       1   female 31.12500
#> 2       2   female 36.35294
#> 3       3   female 33.14286
#> 4       4   female 36.75000
#> 5       5   female 35.00000
#> 6       6   female 34.00000
#> 7       7   female 37.60000
#> 8       8   female 38.50000
#> 9       1     male 28.73684
#> 10      2     male 30.14286
#> 11      3     male 29.45000
#> 12      4     male 28.52381
#> 13      5     male 31.00000
#> 14      6     male 29.60000
#> 15      7     male 43.50000
#> 16      8     male 23.00000

9f. The variable pcmore reflects participants’ answer to the question: “What percentage of people taking part in this survey do you think earn more than you do?”. Using aggregate(), calculate the median value of this variable for each level of income. What does the result tell you?

aggregate(formula = pcmore ~ income, FUN = median, data = matthews.df)
#>   income pcmore
#> 1      1     80
#> 2      2     75
#> 3      3     50
#> 4      4     60
#> 5      5     50
#> 6      6     45
#> 7      7     50
#> 8      8     50

# The higher people's income, the lower the percentage of others 
# they believe to earn more than themselves (which makes sense).
# However, even people in the top 4 income levels think that 
# about 50% of the others earn more than they do.

Checkpoint 3

At this point you are doing great, well done! But if you liked using the aggregate() function, you will love the dplyr package in the following exercises.

C. Challenges

Using `dplyr`

10. The remaining exercises focus on the dplyr package, which allows combining multiple independent and dependent variables in one command.

10a. Load the dplyr package (if not already loaded).

library(dplyr)

10b. For each level of gender, calculate the summary statistics in the following table:

Variable	Description
n	Number of participants
age.mean	Mean age
age.sd	Standard deviation of age
income.mean	Mean income
pcmore.mean	Mean value of pcmore
p.all.mean	Mean value of p.all

Save the computed summary statistics to an object called gender.df. (Hint: Use dplyr with appropriate group_by and summarise arguments.)

gender.df <- matthews.df %>%
  group_by(gender.c) %>%
  summarise(
    n = n(),
    age.mean = mean(age),
    age.sd = sd(age),
    income.mean = mean(income),
    pcmore.mean = mean(pcmore),
    p.all.mean = mean(p.all)
  )

gender.df
#> # A tibble: 2 × 7
#>   gender.c     n age.mean    age.sd income.mean pcmore.mean p.all.mean
#>      <chr> <int>    <dbl>     <dbl>       <dbl>       <dbl>      <dbl>
#> 1   female    71 34.98592 10.430029    3.943662    58.80282  0.8014085
#> 2     male   119 29.76471  7.648757    3.285714    62.25210  0.7764706
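
The same pattern can be tried on a small made-up data frame, where the results are easy to verify by hand. This is only a sketch: toy.df and its values are assumptions for illustration, and the dplyr package must be installed.

```r
library(dplyr)

# Hypothetical toy data (not the Matthews data)
toy.df <- data.frame(
  gender.c = c("female", "male", "female", "male", "male"),
  age      = c(30, 20, 40, 25, 30),
  income   = c(2, 1, 4, 3, 2)
)

# Several summary statistics per group in a single command
toy.sum <- toy.df %>%
  group_by(gender.c) %>%
  summarise(
    n           = n(),          # number of rows per group
    age.mean    = mean(age),    # mean age per group
    income.mean = mean(income)  # mean income per group
  )

toy.sum
```

With aggregate(), obtaining several different statistics for several variables would typically take more than one call, whereas summarise() collects them in one table.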

10c. For each level of income, calculate the summary statistics in the following table – but only for participants older than 21 – and save them to a new object called income.df:

Variable	Description
n	Number of participants
age.min	Minimum age
age.mean	Mean age
male.pc	Percentage of males
female.pc	Percentage of females
pcmore.mean	Mean value of pcmore
p.all.mean	Mean value of p.all

(Hint: Use dplyr with appropriate filter, group_by and summarise arguments.)

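income.df <- matthews.df %>%
  filter(age > 21) %>%
  group_by(income) %>%
  summarise(
    n = n(),
    age.min = min(age),            # requested above (not in the output shown below)
    age.mean = mean(age),
    male.pc = mean(gender == 1),   # proportion of males (x 100 = percentage)
    female.pc = mean(gender == 2), # proportion of females (x 100 = percentage)
    pcmore.mean = mean(pcmore),
    p.all.mean = mean(p.all)
  )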

income.df
#> # A tibble: 8 × 7
#>   income     n age.mean   male.pc female.pc pcmore.mean p.all.mean
#>    <int> <int>    <dbl>     <dbl>     <dbl>       <dbl>      <dbl>
#> 1      1    26 29.76923 0.6923077 0.3076923    74.88462  0.9038462
#> 2      2    43 33.06977 0.6279070 0.3720930    70.09302  0.8069767
#> 3      3    25 31.20000 0.7600000 0.2400000    54.60000  0.7400000
#> 4      4    27 31.62963 0.7037037 0.2962963    61.25926  0.7814815
#> 5      5    26 33.30769 0.6153846 0.3846154    53.65385  0.7500000
#> 6      6    23 32.78261 0.3913043 0.6086957    46.00000  0.6826087
#> 7      7     7 39.28571 0.2857143 0.7142857    41.42857  0.8142857
#> 8      8     3 33.33333 0.3333333 0.6666667    33.33333  0.8666667
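
Note that the mean of a logical test yields a proportion between 0 and 1, not a percentage. A minimal sketch with made-up values (the vector below is hypothetical):

```r
# Hypothetical gender labels of 4 participants
gender.c <- c("male", "female", "male", "male")

is.male   <- gender.c == "male"  # logical vector: TRUE FALSE TRUE TRUE
male.prop <- mean(is.male)       # proportion (TRUE counts as 1, FALSE as 0)
male.pc   <- 100 * male.prop     # percentage

male.prop  # 0.75
male.pc    # 75
```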

10d. Calculate several summary statistics (you choose which ones) aggregated at each level of race and gender. Save the results to an object called racegender.df.

racegender.df <- matthews.df %>%
  group_by(race, gender.c) %>%
  summarise(
    n = n(),  # N
    age.min = min(age),         # youngest person
    height.mean = mean(height), # mean height
    income.mean = mean(income), # mean income
    pcmore.mean = mean(pcmore), # mean pcmore
    p.all.mean = mean(p.all)    # mean p.all
  )

racegender.df
#> Source: local data frame [8 x 8]
#> Groups: race [?]
#> 
#>       race gender.c     n age.min height.mean income.mean pcmore.mean
#>     <fctr>    <chr> <int>   <int>       <dbl>       <dbl>       <dbl>
#> 1    asian   female    20      22    169.2000    3.700000    57.10000
#> 2    asian     male    21      20    169.8571    3.142857    58.33333
#> 3    black   female    17      23    174.6471    5.294118    53.52941
#> 4    black     male    24      18    174.1250    3.000000    77.54167
#> 5 hispanic   female     4      24    163.5000    2.000000    76.25000
#> 6 hispanic     male    11      22    172.3636    3.181818    58.09091
#> 7    white   female    30      20    171.7000    3.600000    60.60000
#> 8    white     male    63      19    170.8413    3.460317    58.46032
#> # ... with 1 more variables: p.all.mean <dbl>
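
The "Groups: race" line in the output indicates that the result is still grouped by the first grouping variable. If later steps should treat the summary as an ordinary table, the grouping can be dropped with ungroup(). A sketch on made-up data (toy.df and its values are hypothetical; dplyr must be installed):

```r
library(dplyr)

# Hypothetical toy data (not the Matthews data)
toy.df <- data.frame(
  race     = c("asian", "asian", "white", "white", "white"),
  gender.c = c("female", "male", "female", "male", "male"),
  age      = c(22, 20, 30, 19, 25)
)

toy.rg <- toy.df %>%
  group_by(race, gender.c) %>%
  summarise(
    n       = n(),       # number of rows per combination
    age.min = min(age)   # youngest person per combination
  ) %>%
  ungroup()              # drop the remaining grouping by race

toy.rg
```

Depending on the dplyr version, summarise() may print a message about the remaining grouping; this is informational and not an error.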

10e. Using dplyr, calculate several summary statistics (you choose which ones) aggregated at each level of 3 independent variables of your choice. Save the results to an object called XYZ.df, where XYZ contains the names of the 3 variables over which you aggregated.

genderTaskIncome.df <- matthews.df %>%
  group_by(gender.c, task, income) %>%
  summarise(
    n = n(),  # N
    pcmore.mean = mean(pcmore),       # mean pcmore
    pc.asian = mean(race == "asian"), # proportion of Asians
    p.all.mean = mean(p.all)          # mean p.all
  )

genderTaskIncome.df
#> Source: local data frame [29 x 7]
#> Groups: gender.c, task [?]
#> 
#>    gender.c  task income     n pcmore.mean  pc.asian p.all.mean
#>       <chr> <int>  <int> <int>       <dbl>     <dbl>      <dbl>
#> 1    female     0      1     3    73.33333 0.0000000  0.8666667
#> 2    female     0      2    11    67.27273 0.3636364  0.8363636
#> 3    female     0      3     3    56.66667 0.3333333  0.8666667
#> 4    female     0      4     5    71.00000 0.4000000  0.8600000
#> 5    female     0      5     4    63.75000 0.5000000  0.7000000
#> 6    female     0      6     8    50.62500 0.2500000  0.6500000
#> 7    female     0      7     2    50.00000 0.0000000  0.9000000
#> 8    female     0      8     2    50.00000 0.0000000  0.9000000
#> 9    female     1      1     5    55.00000 0.0000000  0.9000000
#> 10   female     1      2     6    66.66667 0.5000000  0.8166667
#> # ... with 19 more rows

10f. Save matthews.df, gender.df, income.df, racegender.df, and your XYZ.df objects to a file called matthews_dfs.RData in the data subdirectory of your working directory. (Hint: Use the save() function with an appropriate file argument.)

save(matthews.df, gender.df, income.df, racegender.df, genderTaskIncome.df, 
     file = "data/matthews_dfs.RData")
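
Note that save() does not create missing folders, so the data subdirectory must exist before saving. The following sketch shows a save-and-load round trip; the objects a.df and b.df and the file name toy_dfs.RData are hypothetical, and a temporary directory is used here instead of the working directory.

```r
# Hypothetical objects standing in for the data frames above
a.df <- data.frame(x = 1:3)
b.df <- data.frame(y = c("a", "b"))

# A temporary location is assumed here; in the exercise this would be "data"
out.dir <- file.path(tempdir(), "data")
if (!dir.exists(out.dir)) dir.create(out.dir)  # create the folder if needed

out.file <- file.path(out.dir, "toy_dfs.RData")
save(a.df, b.df, file = out.file)

# Remove the objects and restore them from the file
rm(a.df, b.df)
restored <- load(out.file)  # load() returns the names of the restored objects
restored
```

After load(), the objects are available again under their original names.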

That’s it – hope you enjoyed working on this assignment!

[WPA04_answers.Rmd updated on 2017-11-23 14:38:55 by hn.]

WPA04: Advanced data frame manipulation (Answers)

Hansjörg Neth, SPDS, uni.kn

2017 Nov 23
