Introduction

This file contains essential commands from Chapter 3: Data visualisation of the textbook r4ds and corresponding examples and exercises. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.

All ds4psy essentials so far:

Nr. Topic
0. Syllabus
1. Basic R concepts and commands
2. Visualizing data
3. Transforming data
4. Exploring data (EDA)
+. Datasets

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Visualizing data | ds4psy
## 2018 11 12
## ----------------------------

## Preparations: ----------

library(tidyverse)

## 1. Topic: ----------

# etc.

## End of file (eof). ----------  

Visualizing data

In the following, we assume that you have read and worked through Chapter 3: Data visualization. Based on this background, we examine some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction: it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.

Essential commands and examples

General structure of ggplot calls

A generic template for creating a graph with ggplot is:

# Generic ggplot template: 
ggplot(data = <DATA>) +                              # 1. specify data set to use
  <GEOM_fun>(mapping = aes(<MAPPING>),               # 2. specify geom + mapping 
             <arg_1 = val_1, ..., arg_n = val_n>) +  # - optional arguments to geoms
  <FACET_fun> +                                      # - optional facets
  <LOOK_GOOD_fun>                                    # - optional themes, colors, labels, etc.
  
# Minimal ggplot template:
ggplot(<DATA>) +              # 1. specify data set to use
  <GEOM_fun>(aes(<MAPPING>))  # 2. specify geom + mapping 

The generic template includes the following parts:

  • <DATA> is a data frame or tibble that contains the data that is to be plotted.

  • <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)

  • A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
    1. in the aesthetic mapping (when varying visual features according to data properties), or
    2. by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
  • An optional <FACET_fun> splits a complex plot into multiple subplots.

  • A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding plot themes, titles and text labels, color scales, and setting coordinate systems).
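
To make the template concrete, the following sketch fills in each slot using the mpg data that ships with ggplot2. The particular choices (geom_point, faceting by drv, theme_bw, and the label texts) are illustrative assumptions, not part of the template itself:

```r
library(ggplot2)

# One possible instance of the generic template (all choices are illustrative):
p <- ggplot(data = mpg) +                          # 1. <DATA>
  geom_point(mapping = aes(x = displ, y = hwy),    # 2. <GEOM_fun> + <MAPPING>
             color = "steelblue", alpha = 1/2) +   # - constant arguments to the geom
  facet_wrap(~drv) +                               # - <FACET_fun>
  labs(x = "Engine displacement (in liters)",      # - <LOOK_GOOD_fun>: labels ...
       y = "Miles per gallon (on highway)") +
  theme_bw()                                       #   ... and a theme
p  # print the plot
```

Leaving out any of the optional parts still yields a valid plot, as only the data, a geom, and its mapping are required.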

Scatterplots

The first type of graph we encounter is a scatterplot, which maps 2 (typically continuous) variables to the x- and y-dimension of a plot and thus allows judging the relationship between the variables. In ggplot2, we can create scatterplots by using geom_point. For instance, when exploring the mpg dataset, we can ask: How does a car’s fuel economy (in miles per gallon on highways, hwy) relate to its engine displacement (displ)?

## Data:
# ggplot2::mpg

# Minimal scatterplot:
ggplot(data = mpg) +                             # 1. specify data set to use
  geom_point(mapping = aes(x = displ, y = hwy))  # 2. specify geom + mapping 


# Shorter version of the same plot:
ggplot(mpg) +                          # 1. specify data set to use
  geom_point(aes(x = displ, y = hwy))  # 2. specify geom + mapping


# When specifying the aes(...) settings directly after the data, 
# they apply to ALL geoms: 
ggplot(mpg, aes(x = displ, y = hwy)) +  # 1. specify data and aes-mapping for ALL geoms          
  geom_point()                          # 2. specify geom

To get different types of plots of the same data, we select different geoms:

# Different geoms:
ggplot(mpg) +                           # 1. specify data set to use
  geom_jitter(aes(x = displ, y = hwy))  # 2. specify geom + mapping


ggplot(mpg) +                           # 1. specify data set to use
  geom_smooth(aes(x = displ, y = hwy))  # 2. specify geom + mapping


ggplot(mpg) +                           # 1. specify data set to use
  geom_bin2d(aes(x = displ, y = hwy))   # 2. specify geom + mapping



# However, geoms require specific variables and aesthetic mappings!

# For instance, 
# ggplot(mpg) +                           # 1. specify data set to use
#   geom_bar(aes(x = displ, y = hwy))     # 2. specify geom + mapping
# would yield an error!

# By contrast, the following works (although it plots something else):
ggplot(mpg) +                             # 1. specify data set to use
  geom_bar(aes(x = displ, fill = class))  # 2. specify geom + mapping

Aesthetic properties (like color, shape, size, transparency, etc.) can be used to structure plots (e.g., by grouping objects into categories) or as arguments to geoms:

  • Color:
# Color: ------ 
# Using color to group (as an aesthetic):
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))


# Setting color (as an argument):
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "steelblue")

  • Shape:
# Shape:  ------ 
# Using shape to group (as an aesthetic):
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class))


# Setting shape (as an argument):
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), shape = 2)

  • Size:
# Size:  ------ 
# Using size to group (as an aesthetic):
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = class))


# Setting size (as an argument):
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), size = 3)

  • Transparency (alpha):
# Transparency:  ------ 
# Using alpha to group (as an aesthetic): 
ggplot(data = mpg) +                            
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))


# Setting alpha (as an argument): 
ggplot(data = mpg) +                            
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 1/3)

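Note that not all aesthetics are equally suited to serve all functions. For instance, mapping an unordered categorical variable (like class) to size or alpha prompts a warning from ggplot2, and the default shape palette provides only 6 shapes, so that one of the 7 levels of class is not plotted.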
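
As a rough illustration of this point (a sketch, not an exhaustive rule), we can count the levels of class, supply enough shape codes by hand, and reserve size for a numeric variable. Using the shape codes 0 to 6 and the variable cty for size are assumptions made for the sake of the example:

```r
library(ggplot2)

# How many different classes of cars does mpg contain?
n_classes <- length(unique(mpg$class))
n_classes

# One workaround for shapes: supply as many shape codes as there are levels:
p_shape <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
p_shape

# Size is easier to read when it encodes a numeric variable (here: cty):
p_size <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = cty), alpha = 1/3)
p_size
```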

Creating nicer plots

A typical problem that renders many plots difficult to decipher is overplotting (i.e., too much information at the same location). One solution to this issue is provided by faceting, which provides an alternative way of grouping plots into several sub-plots:

# Faceting by class (1 subplot per class): 
ggplot(data = mpg) +                            
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class)
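
Facets can also be arranged by 2 variables at once. As a sketch (the choice of drv and cyl as faceting variables is an assumption for illustration), facet_grid lays out the subplots in rows and columns:

```r
library(ggplot2)

# Rows by drive train (drv), columns by number of cylinders (cyl):
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 1/2) +
  facet_grid(drv ~ cyl)
p_grid
```

Empty panels in such a grid indicate combinations of values that do not occur in the data.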

An alternative solution to overplotting consists in the careful selection of geoms and aesthetics. For instance, by combining geoms, colors, and transparency, and adding a ggplot theme (see ?theme_bw), we can create informative and attractive plots:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(color = "firebrick", linetype = 1) +
  geom_point(shape = 21, color = "black", fill = "steelblue4", alpha = 1/4, size = 3, stroke = 1) + 
  theme_bw()

Adding titles and labels completes a plot:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(color = "firebrick", linetype = 1) +
  geom_point(shape = 21, color = "black", fill = "steelblue4", alpha = 1/4, size = 3, stroke = 1) + 
  labs(title = "Fuel usage by engine displacement", 
       x = "Engine displacement (in liters)", y = "Miles per gallon (on highway)", 
       caption = "Data from ggplot2::mpg") +
  theme_bw()

Other plot types

Here are some examples that illustrate the use of different geoms and aesthetic features for different types of plots.

Histograms

A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:

library(ggplot2)

# Data: ------ 
# ?ggplot2::mpg
# mpg

# Histogram: ------

# (A) A minimal histogram: ------ 
ggplot(mpg, aes(x = cty)) +     # set mappings for ALL geoms
  geom_histogram(binwidth = 2)  # set binwidth parameter    


# (B) Adding aesthetics, labels and themes: ------ 

# Enhanced version of the same plot:
ggplot(mpg, aes(x = cty)) +    # set mappings for ALL geoms
  geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
  labs(title = "Distribution of fuel economy", 
       x = "Miles per gallon (in city)",
       caption = "Data from ggplot2::mpg") +
  theme_light()

Bar plots

Another common type of plot shows the values across different levels of some variable as the height of bars. As this plot type can use either categorical or continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:

Counts of cases

By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):

## Data: 
# ggplot2::mpg

# (a) Count number of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class))


# (a) is the same as (b): 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..))


# (b) is the same as (c):
ggplot(mpg) + 
  geom_bar(aes(x = class), stat = "count")


# (c) is the same as (d):
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..), stat = "count")


# (e) prettier version:
ggplot(mpg) + 
  geom_bar(aes(x = class, fill = class), 
           # stat = "count", 
           color = "black") + 
  labs(title = "Counts of cars by class",
       x = "Class of car", y = "Frequency") + 
  scale_fill_brewer(name = "Class:", palette = "Blues") + 
  theme_bw()
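
To check what geom_bar computes, we can compare its counts with a plain frequency table. The sketch below also uses after_stat(count), which newer versions of ggplot2 (3.3.0 or later, an assumption about the installed version) offer as a replacement for the ..count.. notation:

```r
library(ggplot2)

# Same plot as (b), written with after_stat():
p_count <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count)))
p_count

# The bar heights computed by stat = "count" ...
bar_counts <- layer_data(p_count)$count

# ... can be compared to the frequencies of a simple table of class:
tab_counts <- as.vector(table(mpg$class))
all(bar_counts == tab_counts)
```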

Proportions of cases

An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:

## Data: 
# ggplot2::mpg

# (1) Proportion of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..prop.., group = 1))


# is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count../sum(..count..)))
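
The group = 1 in the first version matters: without it, proportions are computed within each bar, so that every bar has a height of 1. A small sketch that inspects the computed values (written with after_stat(), assuming ggplot2 3.3.0 or later; the object names are made up for illustration):

```r
library(ggplot2)

# With group = 1, proportions are relative to all cases:
p_all <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))

# Without it, each class forms its own group:
p_each <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(prop)))

prop_all  <- layer_data(p_all)$prop   # proportions across all cases
prop_each <- layer_data(p_each)$prop  # proportions within each class
sum(prop_all)
prop_each
```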

Bar plots of existing values

A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").

For instance, let’s plot a bar chart that shows the election data from the following tibble de (and don’t worry if you don’t understand the commands used to generate the tibble at this point):

library(tidyverse)

## (a) Create a tibble of data: 
de_org <- tibble(
    party = c("CDU/CSU", "SPD", "Others"),
    share_2013 = c((.341 + .074), .257, (1 - (.341 + .074) - .257)), 
    share_2017 = c((.268 + .062), .205, (1 - (.268 + .062) - .205))
  )
de_org$party <- factor(de_org$party, levels = c("CDU/CSU", "SPD", "Others"))  # optional
# de_org

## Check that columns add to 1 (i.e., 100%):
# sum(de_org$share_2013)  # => 1 (qed)
# sum(de_org$share_2017)  # => 1 (qed)

## (b) Converting de_org into a tidy data table de:
de <- de_org %>%
  gather(share_2013:share_2017, key = "election", value = "share") %>%
  separate(col = "election", into = c("dummy", "year")) %>%
  select(year, party, share)

knitr::kable(de)
year  party    share
2013  CDU/CSU  0.415
2013  SPD      0.257
2013  Others   0.328
2017  CDU/CSU  0.330
2017  SPD      0.205
2017  Others   0.465

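
In newer versions of tidyr (1.0.0 or later), gather is superseded by pivot_longer. As a hedged sketch, the reshaping step could then be written as follows. The shares are re-entered from the table above, and names_prefix strips the "share_" part of the column names, so that no dummy column is needed. Note that pivot_longer returns the rows in a different order than gather:

```r
library(tibble)
library(tidyr)

# Wide table (values as in the table above):
de_wide <- tibble(
  party = c("CDU/CSU", "SPD", "Others"),
  share_2013 = c(.415, .257, .328),
  share_2017 = c(.330, .205, .465)
)

# Long (tidy) table with 1 row per party and election year:
de_long <- pivot_longer(de_wide,
                        cols = starts_with("share_"),
                        names_to = "year",
                        names_prefix = "share_",
                        values_to = "share")
de_long
```
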
  1. A version with 2 x 3 separate bars (using position = "dodge"):
## Data: ----- 
# de  # => 6 x 3 tibble

## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)

## (1) Bar chart with  side-by-side bars (dodge): ----- 

## (a) minimal version: 
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (A) 3 bars per election (position = "dodge"):  
  geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1


## (b) Version with text labels and customized colors: 
bp_1 + 
  ## prettier plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .015), 
            position = position_dodge(width = 1), 
            fontface = 2, color = "black") + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_bw()

  2. A version with 2 bars with 3 segments each (using position = "stack"):
## Data: ----- 
# de  # => 6 x 3 tibble

## (2) Bar chart with stacked bars: -----  

## (a) minimal version: 
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (B) 1 bar per election (position = "stack"):
  geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2


## (b) Version with text labels and customized colors: 
bp_2 +   
  ## prettier plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%")), 
            position = position_stack(vjust = .5),
            color = rep(c("black", "white", "white"), 2), 
            fontface = 2) + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_classic()
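
A shortcut for bar charts of existing values is geom_col, which is equivalent to geom_bar(stat = "identity"). A minimal sketch (the data frame de_demo re-enters the shares from the table above and is only used for illustration):

```r
library(ggplot2)

de_demo <- data.frame(
  year  = rep(c("2013", "2017"), each = 3),
  party = factor(rep(c("CDU/CSU", "SPD", "Others"), 2),
                 levels = c("CDU/CSU", "SPD", "Others")),
  share = c(.415, .257, .328, .330, .205, .465)
)

# geom_bar with stat = "identity" ...
p_bar <- ggplot(de_demo, aes(x = year, y = share, fill = party)) +
  geom_bar(stat = "identity", position = "dodge")

# ... and geom_col draw the same bars:
p_col <- ggplot(de_demo, aes(x = year, y = share, fill = party)) +
  geom_col(position = "dodge")
p_col
```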

Bar plots with error bars

It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, or a confidence interval) to any bar plot. There is an entire range of geoms that draw error bars:

## Create data to plot: ----- 
n_cat <- 6
set.seed(101)  # for reproducible randomness

data <- tibble(
  name = LETTERS[1:n_cat],
  value = sample(seq(25, 50), n_cat),
  sd = rnorm(n = n_cat, mean = 0, sd = 8))
# data

## Error bars: -----

## x-aesthetic only:

# (a) errorbar: 
ggplot(data) +
  geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd), 
                width = 0.4, color = "orange", alpha = 1, size = 1.0) +
  labs(title = "Bar plot with error bars") + 
  theme_bw()


# (b) linerange: 
ggplot(data) +
  geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
  geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd), 
                 color = "firebrick", alpha = 1, size = 2.5) + 
  labs(title = "Bar plot with line range") + 
  theme_light()


## Additional y-aesthetic: 

# (c) crossbar:
ggplot(data) +
  geom_bar(aes(x = name, y = value), stat = "identity", fill = "forestgreen") +
  geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                width = 0.3, color  = "sienna1", alpha = 1, size = 1.0) +
  labs(title = "Bar plot with crossbars") + 
  theme_classic()


# (d) pointrange: 
ggplot(data) +
  geom_bar(aes(x = name, y = value), stat = "identity", fill = "skyblue") +
  geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                  color = "gold", alpha = 1.0, size = 1.2) +
  labs(title = "Bar plot with point ranges") + 
  theme_dark()
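
The examples above use randomly generated values for the error bars. With real data, the values to plot are usually computed first. The following sketch borrows group_by and summarise from dplyr and uses hypothetical names (like hwy_stats) to show the mean highway mileage per class of car, with error bars of plus or minus 1 standard error:

```r
library(dplyr)
library(ggplot2)

# Mean and standard error of hwy for each class of car:
hwy_stats <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy),
            se_hwy = sd(hwy) / sqrt(n()))

# Bars show means, error bars show mean +/- 1 standard error:
p_se <- ggplot(hwy_stats, aes(x = class)) +
  geom_bar(aes(y = mean_hwy), stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = mean_hwy - se_hwy, ymax = mean_hwy + se_hwy),
                width = 0.4) +
  labs(title = "Mean highway mileage by class",
       x = "Class of car", y = "Miles per gallon (on highway)") +
  theme_bw()
p_se
```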

Line graphs

A line graph typically depicts the development of some variable over time (or over the levels of some other factor). To tell ggplot which observations belong to the same line, we specify the group aesthetic. For instance, the following plot shows the growth of orange trees by their age (using the data from datasets::Orange):

otrees <- as_tibble(datasets::Orange)
otrees
#> # A tibble: 35 x 3
#>    Tree    age circumference
#>  * <ord> <dbl>         <dbl>
#>  1 1       118            30
#>  2 1       484            58
#>  3 1       664            87
#>  4 1      1004           115
#>  5 1      1231           120
#>  6 1      1372           142
#>  7 1      1582           145
#>  8 2       118            33
#>  9 2       484            69
#> 10 2       664           111
#> # ... with 25 more rows

# basic version: 
ggplot(otrees) +
  geom_line(aes(x = age, y = circumference, group = Tree)) +
  labs(title = "Growth of orange trees") + 
  theme_bw()


# prettier version:
ggplot(otrees, aes(x = age, y = circumference, group = Tree, color = Tree)) +
  geom_line(size = 1.5) +
  geom_point(aes(shape = Tree), size = 3) +
  labs(title = "Growth of orange trees over time", 
       x = "Age (days elapsed)", y = "Circumference (in mm)") + 
  theme_bw()
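
To see why the group aesthetic is needed, we can compare the number of lines that ggplot draws with and without it. This sketch only inspects the computed layer data; the object names are made up for illustration:

```r
library(ggplot2)

otrees_df <- as.data.frame(datasets::Orange)  # 35 measurements of 5 trees

# With group = Tree, each tree gets its own line:
p_grouped <- ggplot(otrees_df) +
  geom_line(aes(x = age, y = circumference, group = Tree))

# Without it, all points are joined into 1 zigzag line:
p_ungrouped <- ggplot(otrees_df) +
  geom_line(aes(x = age, y = circumference))

n_lines_grouped   <- length(unique(layer_data(p_grouped)$group))
n_lines_ungrouped <- length(unique(layer_data(p_ungrouped)$group))
c(n_lines_grouped, n_lines_ungrouped)
```

Mapping Tree to color (or another aesthetic of a discrete variable) also implies a grouping, which is why the prettier version would work without an explicit group.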

Other plots

There are many more types of plots, some of which we will introduce later (in Chapter 7: Exploratory data analysis (EDA) and in Exploring data (EDA)). In addition, see https://ggplot2.tidyverse.org/reference/ and the references provided below for additional plots and examples.

Details on ggplot

Note some details that characterize and distinguish ggplot2 commands:

  • ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.

  • The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.

  • Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).

  • When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).

  • In ggplot, a sequence of commands is combined by +, rather than %>% (the forward pipe operator provided by magrittr). The + has to be at the end of the current line, rather than at the beginning of the next line.
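
As a brief sketch of several of these points, a plot can be stored in an object and extended with +, and the order in which geoms are added determines which layer is drawn on top (the object names are chosen for illustration):

```r
library(ggplot2)

# Common mappings go into the initial ggplot call:
p_base <- ggplot(mpg, aes(x = displ, y = hwy))

# Points first, smooth line second (so the line is drawn on top):
p_line_on_top <- p_base +
  geom_point(alpha = 1/3) +
  geom_smooth(color = "firebrick")

# Smooth line first, points second (so the points are drawn on top):
p_points_on_top <- p_base +
  geom_smooth(color = "firebrick") +
  geom_point(alpha = 1/3)

p_line_on_top
p_points_on_top
```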

The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots). Tuning plots can be a lot of fun, but keep in mind your current goals and the plot’s intended audience.

Exercises (WPA02)

Exercise 1

A scatterplot shows a data point (observation) as a function of 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data.

  • Use the ggplot2::mpg data to create a scatterplot that shows a car’s fuel economy on the highway (on the y-axis) as a function of its fuel economy in the city (on the x-axis). How would you describe this relationship?
## Data:
# ggplot2::mpg

# Scatterplot: ------ 
# A minimal scatterplot + reference line:
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(color = "red3")  # adds x = y (45-degree) line


# => looks like a linear relationship between cty and hwy.
  • Does your plot suffer from overplotting? If so, create at least 2 different versions that solve this problem.

Yes, it seems that multiple points appear at the same position, which is a common issue with scatterplots and a sign of overplotting.

Dealing with overplotting:

There are several ways of dealing with this issue:

  1. jitter adds randomness to positions;
  2. alpha uses transparency to show overlaps and the frequency of objects at positions;
  3. geom_count maps the number of observations at each position (i.e., their frequency) to object size;
  4. facet_wrap allows disentangling plots by levels of (other) variables.

Some possible solutions in our present case include:

## Dealing with overplotting: ----- 

# Adding randomness to point positions:  
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "jitter") +
  geom_abline(color = "red3")


## Note: Setting position = "jitter" 
## is the same as (except for randomness):
# ggplot(mpg) +
#  geom_jitter(aes(x = cty, y = hwy)) +
#  geom_abline(color = "red3")

# Using transparency (via setting alpha to < 1): 
my_plot <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "identity", 
             pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
  geom_abline(linetype = 2, color = "red3") # + 
  # geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
my_plot  # plots the plot


# Faceting (by another variable):
ggplot(mpg) +
  facet_wrap(~class) + 
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline(color = "red3")
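
Mapping counts to size is the one option from the list above that is not yet shown. A minimal sketch with geom_count (the alpha value and the object name are arbitrary choices):

```r
library(ggplot2)

# Point size shows how many cars share the same position:
p_sizes <- ggplot(mpg) +
  geom_count(aes(x = cty, y = hwy), alpha = 1/2) +
  geom_abline(color = "red3")
p_sizes

# The computed variable n holds the number of cases per position:
sum(layer_data(p_sizes, 1)$n)  # adds up to the number of rows of mpg
```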

  • Add informative titles, labels, and a theme to the plot.
# Adding labels and themes to plots: ----- 
my_plot +       # use the plot defined above
  labs(title = "Fuel economy on highway vs. city",
                x = "City (miles per gallon)",
                y = "Highway (miles per gallon)",
                caption = "Data from ggplot2::mpg") +
  # coord_fixed() +
  theme_bw()

  • Group the points in your scatterplot by the class of vehicles (in at least 2 different ways).
# Grouping by (the categorical variable) class: ------  

# Grouping by color:
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = class), 
             position = "jitter", alpha = 1/2, size = 4) +
  geom_abline(linetype = 2) +
  theme_bw()


# Grouping by facets: 
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), 
             position = "jitter", alpha = 1/2, size = 2) +
  geom_abline(linetype = 2) +
  facet_wrap(~class) +
  theme_bw()

Exercise 2

The following plot repeats the histogram code from above (to plot the distribution of fuel economy in city environments), but adds a frequency polygon as a 2nd geom (see ?geom_freqpoly).

# Plot from above with an additional geom:
ggplot(mpg, aes(x = cty)) +    # set mappings for ALL geoms
  geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
  geom_freqpoly(color = "steelblue", size = 2) +
  labs(title = "Distribution of fuel economy", 
       x = "Miles per gallon (in city)",
       caption = "Data from ggplot2::mpg") +
  theme_light()

  • Why is the (blue) line of the polygon lower than the (yellow) bars of the histogram?

  • Change 1 value in the code so that both (lines and bars) have the same heights.

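# Explanation: ----- 
# The histogram uses a binwidth of 2, whereas geom_freqpoly uses its default of 30 bins 
# (which corresponds to a binwidth of about 1 here). As wider bins contain more cases, 
# the histogram shows roughly twice the count values (on the y-axis).

# Solution: ----- 
# Setting binwidth = 1 puts both on (approximately) the same scale: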
ggplot(mpg, aes(x = cty)) +    # set mappings for ALL geoms
  geom_histogram(aes(x = cty), binwidth = 1, fill = "gold", color = "black") +
  geom_freqpoly(color = "steelblue", size = 2) +
  labs(title = "Distribution of fuel economy", 
       x = "Miles per gallon (in city)",
       caption = "Data from ggplot2::mpg") +
  theme_light()


# Alternatively, we can add the same binwidth = 2 argument to geom_freqpoly:
ggplot(mpg, aes(x = cty)) +    # set mappings for ALL geoms
  geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
  geom_freqpoly(color = "steelblue", binwidth = 2, size = 2) +
  labs(title = "Distribution of fuel economy", 
       x = "Miles per gallon (in city)",
       caption = "Data from ggplot2::mpg") +
  theme_light()

  • Why can’t we simply replace geom_freqpoly by geom_line or geom_smooth to get a similar line?
# Try replacing geom_freqpoly by geom_line and geom_smooth: 

ggplot(mpg, aes(x = cty)) +    # set mappings for ALL geoms
  geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
  geom_freqpoly(color = "steelblue", binwidth = 2, size = 2) +
  geom_line() +      # Error: geom_line requires the following missing aesthetics: y
  # geom_smooth() +  # Error: stat_smooth requires the following missing aesthetics: y
  labs(title = "Distribution of fuel economy", 
       x = "Miles per gallon (in city)",
       caption = "Data from ggplot2::mpg") +
  theme_light()

# Note: Both geoms require a y-aesthetic (here: the count of cars per x-value).

Answer: Whereas geom_histogram and geom_freqpoly count the frequency of elements in each bin, both geom_line and geom_smooth require a y-aesthetic.
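
The explanation for the different heights of bars and lines can be checked by inspecting the bins that each geom computes. In this sketch (object names are illustrative), the distance between neighboring x-values in the layer data reveals the bin width that each geom uses:

```r
library(ggplot2)

# Frequency polygon with default settings (ggplot2 reports using 30 bins):
p_poly <- ggplot(mpg, aes(x = cty)) +
  geom_freqpoly()

# Histogram with the binwidth used above:
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2)

# Width of bins = distance between neighboring bin centers:
width_poly <- diff(layer_data(p_poly)$x)[1]
width_hist <- diff(layer_data(p_hist)$x)[1]
c(width_poly, width_hist)  # the polygon's bins are narrower than 2
```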

Exercise 3

Creating bar plots with the ggplot2::mpg data.

  • Plot the number or frequency of cases by cyl as a bar plot (in at least 2 different ways).
# Count the number of cases by cylinders: ------ 
ggplot(mpg) + 
  geom_bar(aes(x = cyl))


ggplot(mpg) + 
  geom_bar(aes(x = cyl, y = ..count..))


ggplot(mpg) + 
  geom_bar(aes(x = cyl), stat = "count")

  • Plot the proportion of cases in the mpg data by cyl (in at least 2 different ways).
# Proportion of cases: ----- 
ggplot(mpg) + 
  geom_bar(aes(x = cyl, y = ..prop.., group = 1))


# is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = cyl, y = ..count../sum(..count..)))

  • Create a prettier version by adding different colors, appropriate labels, and a suitable theme to your plot.
# Prettier version: ----- 
ggplot(mpg) + 
  geom_bar(aes(x = cyl, fill = as.factor(cyl)), 
           stat = "count", color = "black") + 
  labs(title = "Counts of cars by number of cylinders",
       x = "Cylinders", y = "Frequency") + 
  scale_fill_brewer(name = "Cylinders:", palette = "Spectral") + 
  # coord_flip() + 
  theme_bw()

Exercise 4

The ChickWeight data (contained in R datasets) contains the results of an experiment that measures the effects of Diet on the early growth of chicks.

  • Save the ChickWeight data as a tibble and inspect its dimensions and variables.
# ?datasets::ChickWeight

# (a) Save data as tibble and inspect:
cw <- as_tibble(ChickWeight)
cw  # 578 observations (rows) x 4 variables (columns)
#> # A tibble: 578 x 4
#>    weight  Time Chick Diet 
#>  *  <dbl> <dbl> <ord> <fct>
#>  1     42     0 1     1    
#>  2     51     2 1     1    
#>  3     59     4 1     1    
#>  4     64     6 1     1    
#>  5     76     8 1     1    
#>  6     93    10 1     1    
#>  7    106    12 1     1    
#>  8    125    14 1     1    
#>  9    149    16 1     1    
#> 10    171    18 1     1    
#> # ... with 568 more rows
  • Create a line plot showing the weight development of each individual chick (on the y-axis) over Time (on the x-axis) for each Diet (in 4 different facets).
# Basic version:
ggplot(cw) +
  geom_line(aes(x = Time, y = weight, group = Chick, color = Chick)) +
  facet_wrap(~Diet)


# Fancy version: 
# Scatter and/or line plot showing the weight development of each chick (on the y-axis)
# over Time (on the x-axis) for each Diet (as different facets):
ggplot(cw, aes(x = Time, y = weight, group = Diet)) +
  facet_wrap(~Diet) +
  geom_point(alpha = 1/2) +
  geom_line(aes(group = Chick)) +
  geom_smooth(aes(color = Diet)) +
  labs(title = "Chick weight by time for different diets", 
       x = "Time (number of days)", y = "Weight (in gm)",
       caption = "Data from datasets::ChickWeight.") +
  theme_bw()

  • The following bar chart shows the number of chicks per Diet over Time.
    We see that the initial Diet groups contain different numbers of chicks and that some chicks drop out over Time:
# (c) Bar plot showing the number (count) of chicks per diet over time: 
ggplot(cw, aes(x = Time, fill = Diet)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of chicks per diet over time", x = "Time (number of days)", y = "Number", 
       caption = "Data from datasets::ChickWeight.") +
  theme_bw()

Try re-creating this plot (with geom_bar and dodged bar positions).

Exercise 5

Use the p_info data from Exercise 6 of WPA01 to create some plots that describe the sample of participants:

library(readr)

# Read data (from online source):
p_info <- read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")

# dim(p_info)      # 295 rows, 6 columns
# p_info           # prints a summary of the table/tibble
# glimpse(p_info)  # shows the first values for 6 variables (columns)

# Turn some categorical values into factors:
p_info$sex <- as.factor(p_info$sex)
p_info$intervention <- as.factor(p_info$intervention)

p_info  # Note that intervention and sex are now listed as <fct>.
#> # A tibble: 295 x 6
#>       id intervention sex     age  educ income
#>    <int> <fct>        <fct> <int> <int>  <int>
#>  1     1 4            2        35     5      3
#>  2     2 1            1        59     1      1
#>  3     3 4            1        51     4      3
#>  4     4 3            1        50     5      2
#>  5     5 2            2        58     5      2
#>  6     6 1            1        31     5      1
#>  7     7 3            1        44     5      2
#>  8     8 2            1        57     4      2
#>  9     9 1            1        36     4      3
#> 10    10 2            1        45     4      3
#> # ... with 285 more rows
  • A histogram that shows the distribution of participant age in 3 ways:
    • overall,
    • separately for each sex, and
    • separately for each intervention.
# (a) Histogram showing the overall distribution of age: 
ggplot(p_info) +
  geom_histogram(mapping = aes(age), binwidth = 4, fill = "gold", col = "black") +
  labs(title = "Distribution of age values") +
  theme_bw()


# Note: Same distribution as frequency polygon:
ggplot(p_info) +
  geom_freqpoly(mapping = aes(x = age), binwidth = 4, color = "forestgreen") + 
  labs(title = "Distribution of age values",
       x = "Age", y = "Count") +
  theme_bw()


# (b) ... by sex:
ggplot(p_info) +
  geom_histogram(mapping = aes(age, fill = sex), binwidth = 4, col = "black") +
  labs(title = "Distribution of age values (by sex)") +
  scale_fill_brewer(name = "Gender:", palette = "Set1") + 
  theme_bw()


# OR: 
ggplot(p_info) +
  geom_histogram(mapping = aes(age, fill = sex), binwidth = 4, col = "black") +
  facet_grid(~sex) + 
  labs(title = "Distribution of age values (by sex)") +
  scale_fill_brewer(name = "Gender:", palette = "Set1") + 
  theme_bw()


# (c) ... by intervention:
ggplot(p_info) +
  geom_histogram(mapping = aes(age, fill = intervention), binwidth = 4, col = "black") +
  facet_grid(~intervention) + 
  labs(title = "Distribution of age values (by intervention)") +
  scale_fill_brewer(name = "Intervention group:", palette = "Spectral") + 
  theme_bw()

  • A bar plot that
    • shows how many participants took part in each intervention; or
    • shows how many participants of each sex took part in each intervention.
# Number of participants per intervention:
ggplot(p_info) +
  geom_bar(mapping = aes(x = intervention), fill = "gold") + 
  labs(title = "Number of participants per intervention", 
       x = "Intervention", y = "Count") +
  theme_bw()


# ... & by sex:
ggplot(p_info) +
  geom_bar(aes(x = intervention, fill = sex), position = "dodge") + 
  labs(title = "Number of participants per intervention (and sex)", 
       x = "Intervention", y = "Count") +
  scale_fill_brewer(name = "Gender:", palette = "Set1") + 
  theme_bw()

Note that it would be desirable to explain which sex is encoded by the values of 1 and 2. For this, we need to inspect the codebook. We will later learn how to recode the levels of a factor (like sex) with descriptive labels.

More on data visualization

See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) (to be covered in 2 weeks) and complete their exercises.

The following links provide additional information:

Books or scripts on data visualization:

[Last update on 2018-11-27 22:55:20 by hn.]