
WPA04 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.

  2. Also enter the current assignment (e.g., WPA04), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.

Here is an example of how your file JamesJoy_WPA04_171120.R could look:

# Assignment: WPA 04
# Name: Joy, James
# Date: 2017 Nov 20
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# A. In Class

# Combining vectors to matrices and data frames:

# Exercise 1:
a <- letters[1:3] # define some vector

# ...
  3. Complete as many exercises as you can by Wednesday (23:59).

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias.


A. In Class

Here are some warm-up exercises that review the basic concepts of the current chapter:

Checking and changing data

1. This exercise practices previous commands and skills (mostly indexing and subsetting of data frames) that are highly relevant for the current and future chapters. It uses a fictitious data set clothes.csv, which describes 600 items of clothing by the following variables:

  • idnr: a unique number identifying each item
  • group: the group for whom the item is designed (kids, men, or women)
  • type: the type of clothing (dress, pants, shirt, shoes, or suit)
  • brand: the brand of the item (6 popular labels)
  • color: the color of the item (8 different colors)
  • price: the item’s recommended retail price

The data is available at http://Rpository.com/down/data/WPA04_clothes.csv.

1a. Load the data into a data frame called clothes.df and inspect it by using the head(), View(), str(), or summary() functions. (Hint: Use the read.table() function and note that the data file is comma-delimited and contains a header row.)

1b. How many items of clothing exist of each color? How many pairs of shoes exist of each color? How many women’s dresses exist of each color? (Hint: Use the table() function on appropriate subsets of the data.)

1c. How many pairs of Adidas shoes are in this data set? What’s their average price?

1d. What are the cheapest and the most expensive items of clothing in the data set?

1e. H&M is having a sale on men’s items. Deduct 20% from the prices of all corresponding items. (Hint: Determine the appropriate subset of prices \(x\) to change. Deducting 20% from \(x\) is identical to multiplying \(x\) by a factor of .80.)
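
The hint in 1e can be sketched on a small made-up data frame. Note that toy.df and its values are hypothetical and not taken from clothes.csv. The sketch selects a subset of prices by logical indexing and changes only those values:

```r
# Toy data (hypothetical values, not taken from clothes.csv)
toy.df <- data.frame(idnr  = 1:4,
                     group = c("men", "women", "men", "kids"),
                     brand = c("H&M", "H&M", "Adidas", "H&M"),
                     price = c(50, 40, 80, 20),
                     stringsAsFactors = FALSE)

# Logical vector that is TRUE only for men's items by H&M
on.sale <- toy.df$group == "men" & toy.df$brand == "H&M"

# Deducting 20% is the same as multiplying by .80
toy.df$price[on.sale] <- toy.df$price[on.sale] * .80

toy.df$price  # only the first price has changed
```

Storing the logical vector in its own object (here on.sale) makes it easy to check with sum(on.sale) how many items are affected before changing any data.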

Merging data

2. A group of fashion activists has collected the true street prices of all items and saved them in a file that is available at http://Rpository.com/down/data/WPA04_streetprices.tab.

2a. Load the online data on street prices into a new data frame called street.df. (Hint: This data file is tab-delimited and contains a header row.)

2b. Are the actual street prices typically cheaper or more expensive than the recommended retail prices (in the original data of clothes.df)? By how many percent do the two types of prices differ on average? (Hint: Compare the corresponding means.)

2c. Combine the price data in street.df with the original data stored in clothes.df. (Hint: You could first use order() to sort street.df by increasing idnr values and then use cbind() to add the street.price variable to clothes.df. However, combining two data frames that share a common column is simpler by using the merge() function.)
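
The two approaches mentioned in the hint of 2c can be compared on toy data. This is only a sketch: items.df and prices.df are hypothetical stand-ins with made-up values, and the first approach assumes that both data frames contain exactly the same idnr values and that items.df is already sorted by idnr.

```r
# Toy data (hypothetical values); note that prices.df is not sorted by idnr
items.df  <- data.frame(idnr = 1:3, price = c(50, 40, 80))
prices.df <- data.frame(idnr = c(3, 1, 2), street.price = c(70, 45, 30))

# Option 1: sort prices.df by idnr, then bind the new column
# (only safe if both data frames contain the same idnr values in the same order)
prices.sorted <- prices.df[order(prices.df$idnr), ]
combined.1 <- cbind(items.df, street.price = prices.sorted$street.price)

# Option 2: let merge() match the rows by the shared column
combined.2 <- merge(items.df, prices.df, by = "idnr")
```

Both options yield the same street.price column here, but merge() does not depend on the row order of either data frame.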

2d. Recompute the mean original and street prices by using the colMeans() function on the appropriate columns/variables of clothes.df. (Hint: Either use head() to determine the appropriate column numbers or use a test on names() to obtain their column numbers. The means should match those computed in 2b.)

Aggregating data

3. This exercise practices different ways of aggregating data (i.e., computing summary statistics over groups of cases/rows that are defined by levels of categorical variables).

3a. Do the average recommended retail prices of clothing items vary according to the group for which they are designed? (Hint: Use the aggregate() function to aggregate the price over the categorical variable group.)

3b. Do the different brands differ in their policies of changing the average recommended retail prices to the actual street prices? (Hint: Use the aggregate() function twice to aggregate over the categorical variable brand.)

3c. Do the average recommended retail prices of different kinds (or type) of clothing items vary according to the group for which they are designed? (Hint: Use the aggregate() function to aggregate over two categorical variables).
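
As a minimal sketch of the aggregate() calls hinted at in 3a and 3c, the following uses the formula interface on a made-up data frame (toy.df and its values are hypothetical, only the variable names mirror the clothes data):

```r
# Toy data (hypothetical values)
toy.df <- data.frame(group = c("kids", "kids", "men", "men", "women", "women"),
                     type  = c("shirt", "shoes", "shirt", "shirt", "shoes", "shoes"),
                     price = c(10, 20, 30, 50, 60, 80))

# One grouping variable: mean price per group
by.group <- aggregate(price ~ group, data = toy.df, FUN = mean)

# Two grouping variables: mean price per combination of group and type
by.group.type <- aggregate(price ~ group + type, data = toy.df, FUN = mean)
```

The result contains one row per level (or per combination of levels) that actually occurs in the data.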

3d. Solve Exercise 3b. again, but now by using one command that allows computing summary statistics for multiple dependent variables. (Hint: Load and use the dplyr package.)

library("dplyr")

# Template for using dplyr:
df %>%                   # dataframe to use
  filter(var > n) %>%    # filter condition
  group_by(iv1, iv2) %>% # grouping variable(s)
  summarise(
    n = n(),             # number of cases per group
    a = mean(dv1),       # calculate mean of dv1 in df
    b = sd(dv2),         # calculate sd of dv2 in df
    c = max(dv3))        # calculate max of dv3 in df.
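
To make the template concrete, here is one possible instance on made-up data. This is a sketch only: toy.df, its values, and the filter threshold are hypothetical, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Toy data (hypothetical values)
toy.df <- data.frame(brand        = c("A", "A", "B", "B", "B"),
                     price        = c(10, 30, 20, 40, 100),
                     street.price = c( 8, 24, 20, 30, 100))

toy.sum <- toy.df %>%
  filter(price < 100) %>%                    # keep only items cheaper than 100
  group_by(brand) %>%                        # one group per brand
  summarise(n           = n(),               # number of items per brand
            price.mean  = mean(price),       # mean of the 1st dependent variable
            street.mean = mean(street.price),# mean of the 2nd dependent variable
            price.sd    = sd(price))         # sd of the 1st dependent variable
```

The result has one row per brand and one column per summary statistic, so several dependent variables are summarised in a single command.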

3e. Solve Exercise 3c. again, but now excluding all items for kids and computing not only average prices, but also average street prices, their standard deviations, and the number of items in each category. (Hint: Aggregating cases over multiple categories, filtering data, and computing multiple dependent statistics clearly calls for the dplyr package.)

Checkpoint 1

At this point, you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…


B. At Home

JDM Study: Why do we overestimate others’ willingness to pay?

Abstract of Matthews et al. (2016)

In this WPA, we will analyze data from the following study:

  • Matthews, W. J., Gheorghiu, A. I., & Callan, M. J. (2016). Why do we overestimate others’ willingness to pay? Judgment and Decision Making, 11(1), 21–39.

The purpose of this research was to test if our beliefs about other people’s affluence (or wealth) affect how much we think they will be willing to pay for items.

You can find the full paper in the journal Judgment and Decision Making (in html or pdf format).

In this WPA, we will analyze some data from their first study. In Study 1, participants indicated the proportion of other people taking part in the survey who have more money (or, depending on the task, less money) than themselves, and then whether other people would be willing to pay more than them for each of 10 products.

Products and proportions

The following table shows the 10 products and the proportion p of participants who indicated that others would be willing to pay more for the product than themselves (cf. Table 1 in Matthews et al., 2016, p. 23).

Table 1: Proportion of participants who indicated that the “typical participant” would pay more than they would for each product in Study 1.

Product #   Product                                          Reported p(other > self)
1           A freshly-squeezed glass of apple juice          .695
2           A Parker ballpoint pen                           .863
3           A pair of Bose noise-cancelling headphones       .705
4           A voucher giving dinner for two at Applebee’s    .853
5           A 16 oz jar of Planters dry-roasted peanuts      .774
6           A one-month movie pass                           .800
7           An Ikea desk lamp                                .863
8           A Casio digital watch                            .900
9           A large, ripe pineapple                          .674
10          A handmade wooden chess set                      .732

Variable description

Here are descriptions of the data variables (taken from the authors’ dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt):

  • id: participant id code.
  • gender: participants’ gender, 1 = male, 2 = female.
  • age: participants’ age.
  • income: participants’ annual household income on a categorical scale with 8 options: Less than $15,000; $15,001–$25,000; $25,001–$35,000; $35,001–$50,000; $50,001–$75,000; $75,001–$100,000; $100,001–$150,000; greater than $150,000.
  • p1-p10: whether the “typical” survey respondent would pay more (coded as 1) or less (coded as 0) than oneself, for each of the 10 products.
  • task: whether the participant had to judge the proportion of other people who “have more money than you do” (coded as 1) or the proportion who “have less money than you do” (coded as 0).
  • havemore: participant’s response when task = 1.
  • haveless: participant’s response when task = 0.
  • pcmore: participant’s estimate of the proportion of people who have more than they do (calculated as 100-haveless when task = 0).
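
The definition of pcmore can be illustrated with a few made-up values. This sketch assumes that havemore and haveless are missing (NA) whenever they do not apply to a participant's task. The notes do not state this, so check it in the actual data.

```r
# Toy data (hypothetical values; NA coding is an assumption)
toy.df <- data.frame(task     = c(1, 0, 1, 0),
                     havemore = c(60, NA, 35, NA),
                     haveless = c(NA, 30, NA, 80))

# task = 1: use havemore directly
# task = 0: convert haveless into its complement (100 - haveless)
toy.df$pcmore <- ifelse(toy.df$task == 1,
                        toy.df$havemore,
                        100 - toy.df$haveless)

toy.df$pcmore
```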

Managing your workspace

4. Navigate to a dedicated R project (in case you have not already done so) and start there with a clean slate (without any R objects from previous tasks):

4a. Open your R project from last week (which you called RCourse or something similar). There should be at least one subdirectory called data in this working directory.

4b. Move or save your current R script (entitled LastFirst_WPA##_yymmdd.R) into the main folder of your project directory.

4c. Delete any R objects still stored in your current session and set your working directory to the path of your project directory. (Hint: Check your file browser for your current path and note that path descriptions vary for different operating systems.)

rm(list = ls())             # clean all R objects
setwd("~/Desktop/RCourse/") # set to your working directory

Getting and saving data

5. Let’s get the original data set of study 1 and store it as a text file in our data directory:

5a. The original data are available at http://journal.sjdm.org/15/15909/data1.csv. Load this data set into a new R object called matthews.df. (Hint: Use the read.table() function and use the URL as the file name, provided you have a working internet connection.)

5b. Save the data set as a tab-delimited text file called matthews_study1.txt in the data directory of your working directory. (Hint: Use the write.table() function.)
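
The interplay of read.table() and write.table() can be tried out without an internet connection. The following sketch writes a made-up data frame to temporary files (toy.df and the file names are hypothetical; in the exercise, the input would be the URL and the output a file in your data directory):

```r
# Toy data (hypothetical values)
toy.df <- data.frame(id = 1:3, age = c(21, 34, 28))

# Write a comma-delimited file with a header row to a temporary location
csv.file <- tempfile(fileext = ".csv")
write.table(toy.df, file = csv.file, sep = ",", row.names = FALSE)

# Read it back: sep and header must match the format of the file
in.df <- read.table(file = csv.file, sep = ",", header = TRUE)

# Save the data as a tab-delimited text file
# (here in a temporary folder rather than in data/)
txt.file <- file.path(tempdir(), "toy_study.txt")
write.table(in.df, file = txt.file, sep = "\t", row.names = FALSE)
```

Setting row.names = FALSE keeps R from writing its row numbers as an additional, unnamed column.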

Exploring data

6a. Explore the first few rows and the contents of matthews.df using the head(), View(), str(), and summary() commands.

6b. What are the variable/column names of the data frame?

6c. What was the participants’ mean age?

6d. Currently, participants’ gender is coded as either 1 or 2. Create a new character column called gender.c that recodes these values as male and female, respectively. (Hint: Create a new column gender.c and use logical indexing to define its value based on the value of the gender column.)

6e. What percent of participants were male?
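
Recoding by logical indexing, as suggested in the hint of 6d, can be sketched on made-up data (toy.df and its values are hypothetical):

```r
# Toy data (hypothetical values): 1 = male, 2 = female
toy.df <- data.frame(id = 1:5, gender = c(1, 2, 2, 1, 2))

# Create the new column, then fill it using logical indexing
toy.df$gender.c <- NA
toy.df$gender.c[toy.df$gender == 1] <- "male"
toy.df$gender.c[toy.df$gender == 2] <- "female"

# The mean of a logical vector is the proportion of TRUE values
pc.male <- mean(toy.df$gender.c == "male") * 100
```

Starting with NA has the advantage that any value that is neither 1 nor 2 remains visibly missing in the new column.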

Computing row and column statistics

7. Let’s compute some means for columns or rows of data frames:

7a. Create a new dataframe called product.df that only contains the 10 columns from p1 to p10 from matthews.df by running the following code.
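
A minimal sketch of one way to select the columns p1 to p10 by name is shown here on a toy stand-in for matthews.df. The object toy.matthews and its random 0/1 values are made up. Only the column names p1 to p10 are taken from the variable description above.

```r
# Toy stand-in for matthews.df with made-up 0/1 responses
set.seed(1)
toy.matthews <- data.frame(id = 1:4,
                           matrix(rbinom(40, size = 1, prob = .7),
                                  nrow = 4,
                                  dimnames = list(NULL, paste0("p", 1:10))))

# paste0("p", 1:10) creates the names "p1", "p2", ..., "p10"
product.df <- toy.matthews[, paste0("p", 1:10)]
```

With the real data, toy.matthews would be replaced by matthews.df.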

7b. The colMeans() function takes a dataframe as an argument and returns a vector showing the means across all rows for each column of data. Using colMeans(), calculate the percentage of participants who indicated that the ‘typical’ participant would be willing to pay more than them for each item. Do your values match those reported in Table 1?

7c. The rowMeans() function is like colMeans(), but calculates means across columns for every row of data. Using rowMeans(), calculate for each participant the percentage of the 10 items that the participant believed other people would spend more on. Save this data as a new vector called p.all.
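
The difference between colMeans() and rowMeans() can be seen on a small made-up data frame of 0/1 responses (toy.df is hypothetical, with 4 participants and 3 products):

```r
# Toy data (hypothetical values): rows are participants, columns are products
toy.df <- data.frame(p1 = c(1, 0, 1, 1),
                     p2 = c(1, 1, 0, 1),
                     p3 = c(0, 0, 1, 1))

# One mean per column (i.e., per product), computed across all rows
col.pc <- colMeans(toy.df) * 100

# One mean per row (i.e., per participant), computed across all columns
row.pc <- rowMeans(toy.df) * 100
```

Because the responses are coded as 0 and 1, each mean is a proportion, and multiplying by 100 turns it into a percentage.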

7d. Add the p.all vector as a new column called p.all to the matthews.df dataframe.

7e. What was the average value of p.all across all 190 participants? This value is the answer to the question: “How often does the average participant think that someone else would pay more for an item than themselves?”

7f. How does the value just computed (i.e., a mean across the means of 190 participants) compare to the mean value of the 10 product means? (Hint: The 10 product means were computed in 7b. above.)

Merging and subsetting data

8. Let’s add a new table containing fictional demographic information about each participant. The data are stored in a text file at http://Rpository.com/down/data/matthews_demographics.txt.

8a. Load this data into an R object called demo.df. (Hint: Use the read.table() function, but first check the file for the appropriate column delimiter and the absence or presence of a variable header.)

8b. Get the average height and the frequency of participants’ race from this new data frame.

8c. Combine the original data in matthews.df with the new demographic data. (Hint: Use the merge() function with an appropriate argument specifying a common column.)

8d. Using either indexing or subset(), calculate the mean age of all males and of all females.

Checkpoint 2

At this point you are doing very well. But rather than separately computing each summary statistic, let’s try to aggregate over one or two categorical variables…

Aggregating data

9a. Calculate the mean age of male and female participants by aggregating age over gender.c (using the aggregate() function). Do you get the same results as before?

9b. Using aggregate(), calculate the mean p.all value for male and female participants. Which gender is more likely to think that others would pay more for products than themselves?

9c. Using aggregate(), calculate the mean p.all value of participants for each level of income. Is there a consistent relationship between p.all and income?

9d. Now repeat the previous analysis, but only for females. (Hint: Use the subset argument within the aggregate function.)
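
The subset argument mentioned in the hint of 9d can be sketched as follows (toy.df and its values are made up, only the variable names mirror those used in this exercise):

```r
# Toy data (hypothetical values)
toy.df <- data.frame(gender.c = c("male", "female", "female", "male", "female"),
                     income   = c(1, 1, 2, 2, 2),
                     p.all    = c(.5, .8, .6, .9, .7))

# The subset argument restricts the rows before aggregating
female.agg <- aggregate(p.all ~ income,
                        data   = toy.df,
                        subset = gender.c == "female",
                        FUN    = mean)
```

The condition in subset is evaluated within the data frame, so gender.c can be used without the toy.df$ prefix.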

9e. What was the mean age for participants for each combination of gender and income? (Hint: Use the aggregate() function with 2 independent variables.)

9f. The variable pcmore reflects participants’ answer to the question: “What percentage of people taking part in this survey do you think earn more than you do?”. Using aggregate(), calculate the median value of this variable for each level of income. What does the result tell you?

Checkpoint 3

At this point you are doing great, well done! But if you liked using the aggregate() function, you will love the dplyr package in the following exercises.

C. Challenges

Using dplyr

10. The remaining exercises focus on the dplyr package, which allows combining multiple independent and dependent variables in one command.

10a. Load the dplyr package (if not already loaded).

library(dplyr)

10b. For each level of gender, calculate the summary statistics in the following table:

Variable Description
n Number of participants
age.mean Mean age
age.sd Standard deviation of age
income.mean Mean income
pcmore.mean Mean value of pcmore
p.all.mean Mean value of p.all

Save the computed summary statistics to an object called gender.df. (Hint: Use dplyr with appropriate group_by and summarise arguments.)

10c. For each level of income, calculate the summary statistics in the following table – but only for participants older than 21 – and save them to a new object called income.df:

Variable Description
n Number of participants
age.min Minimum age
age.mean Mean age
male.pc Percentage of males
female.pc Percentage of females
pcmore.mean Mean value of pcmore
p.all.mean Mean value of p.all

(Hint: Use dplyr with appropriate filter, group_by and summarise arguments.)

10d. Calculate several summary statistics (you choose which ones) aggregated at each level of race and gender. Save the results to an object called racegender.df.

10e. Using dplyr, calculate several summary statistics (you choose which ones) aggregated at each level of 3 independent variables of your choice. Save the results to an object called XYZ.df, where XYZ contains the names of the 3 variables over which you aggregated.

10f. Save matthews.df, gender.df, income.df, racegender.df, and your XYZ.df objects to a file called matthews_dfs.RData in the data subdirectory of your working directory. (Hint: Use the save() function with an appropriate file argument.)
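
Saving several objects into one file and restoring them can be sketched as follows. The objects and the file name are hypothetical, and a temporary folder is used here in place of the data subdirectory.

```r
# Two toy objects (hypothetical)
a.df <- data.frame(x = 1:3)
b.df <- data.frame(y = c("a", "b"))

# Save several objects into one .RData file
rdata.file <- file.path(tempdir(), "toy_dfs.RData")
save(a.df, b.df, file = rdata.file)

# Remove the objects and restore them from the file
rm(a.df, b.df)
load(rdata.file)
```

Note that load() restores the objects under the names they had when they were saved.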

That’s it – now it’s time to submit your assignment!

Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.


[WPA04.Rmd updated on 2017-11-20 11:27:02 by hn.]