Introduction
This file contains essential commands from Chapter 3: Data visualisation of the textbook r4ds and corresponding examples and exercises. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials so far:
| Nr. | Topic |
|---|---|
| 0. | Syllabus |
| 1. | Basic R concepts and commands |
| 2. | Visualizing data |
| 3. | Transforming data |
| 4. | Exploring data (EDA) |
| +. | Datasets |
Course coordinates
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Winter 2018/2019: Mondays, 13:30–15:00, C511.
- Links to current course syllabus | ZeUS | Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Visualizing data | ds4psy
## 2018 11 12
## ----------------------------
## Preparations: ----------
library(tidyverse)
## 1. Topic: ----------
# etc.
## End of file (eof). ----------
Visualizing data
In the following, we assume that you have read and worked through Chapter 3: Data visualization. Based on this background, we examine some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.
Essential commands and examples
General structure of ggplot calls
A generic template for creating a graph with ggplot is:
# Generic ggplot template:
ggplot(data = <DATA>) + # 1. specify data set to use
<GEOM_fun>(mapping = aes(<MAPPING>), # 2. specify geom + mapping
<arg_1 = val_1, ..., arg_n = val_n>) + # - optional arguments to geoms
<FACET_fun> + # - optional facets
<LOOK_GOOD_fun> # - optional themes, colors, labels, etc.
# Minimal ggplot template:
ggplot(<DATA>) + # 1. specify data set to use
<GEOM_fun>(aes(<MAPPING>)) # 2. specify geom + mapping
The generic template includes the following parts:
- <DATA> is a data frame or tibble that contains the data that is to be plotted.
- <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)
- A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
  - in the aesthetic mapping (when varying visual features according to data properties), or
  - by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
- An optional <FACET_fun> splits a complex plot into multiple subplots.
- A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding plot themes, titles and text labels, color scales, and setting coordinate systems).
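As a hedged illustration, the generic template could be filled in as follows for the mpg data that ships with ggplot2. The specific choices (the facet variable drv, the colors, and the labels) and the object name p_template are assumptions made for this sketch, not part of the template itself:

```r
library(ggplot2)

# One possible instance of the generic template (illustrative choices only):
p_template <- ggplot(data = mpg) +                  # 1. <DATA>
  geom_point(mapping = aes(x = displ, y = hwy),     # 2. <GEOM_fun> + <MAPPING>
             color = "steelblue", alpha = 1/2) +    #    constant arguments to the geom
  facet_wrap(~drv) +                                #    <FACET_fun>: one panel per drive type
  labs(title = "Highway mileage by engine displacement",
       x = "Engine displacement (in liters)",
       y = "Miles per gallon (on highway)") +       #    <LOOK_GOOD_fun>: labels
  theme_bw()                                        #    <LOOK_GOOD_fun>: theme

p_template  # print the plot
```

Each optional part can be removed (together with its preceding +) without breaking the plot, which makes it easy to build up a graph step by step.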
Scatterplots
The first type of graph we encounter is a scatterplot, which maps 2 (typically continuous) variables to the x- and y-dimension of a plot and thus allows judging the relationship between the variables. In ggplot2, we can create scatterplots by using geom_point. For instance, when exploring the mpg dataset, we can ask: How does a car’s mileage per gallon (on highways, hwy) relate to its engine displacement (displ)?
## Data:
# ggplot2::mpg
# Minimal scatterplot:
ggplot(data = mpg) + # 1. specify data set to use
geom_point(mapping = aes(x = displ, y = hwy)) # 2. specify geom + mapping
# Shorter version of the same plot:
ggplot(mpg) + # 1. specify data set to use
geom_point(aes(x = displ, y = hwy)) # 2. specify geom + mapping
# When specifying the aes(...) settings directly after the data,
# they apply to ALL geoms:
ggplot(mpg, aes(x = displ, y = hwy)) + # 1. specify data and aes-mapping for ALL geoms
geom_point() # 2. specify geom
To get different types of plots of the same data, we select different geoms:
# Different geoms:
ggplot(mpg) + # 1. specify data set to use
geom_jitter(aes(x = displ, y = hwy)) # 2. specify geom + mapping
ggplot(mpg) + # 1. specify data set to use
geom_smooth(aes(x = displ, y = hwy)) # 2. specify geom + mapping
ggplot(mpg) + # 1. specify data set to use
geom_bin2d(aes(x = displ, y = hwy)) # 2. specify geom + mapping
# However, geoms require specific variables and aesthetic mappings!
# For instance,
# ggplot(mpg) + # 1. specify data set to use
# geom_bar(aes(x = displ, y = hwy)) # 2. specify geom + mapping
# would yield an error!
# But the following works (but plots something else):
ggplot(mpg) + # 1. specify data set to use
geom_bar(aes(x = displ, fill = class)) # 2. specify geom + mapping
Aesthetic properties (like color, shape, size, transparency, etc.) can be used to structure plots (e.g., by grouping objects into categories) or as arguments to geoms:
- Color:
# Color: ------
# Using color to group (as an aesthetic):
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, color = class))
# Setting color (as an argument):
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy), color = "steelblue")
- Shape:
# Shape: ------
# Using shape to group (as an aesthetic):
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, shape = class))
# Setting shape (as an argument):
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy), shape = 2)
- Size:
# Size: ------
# Using size to group (as an aesthetic):
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, size = class))
# Setting size (as an argument):
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy), size = 3)
- Transparency (alpha):
# Transparency: ------
# Using alpha to group (as an aesthetic):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Setting alpha (as an argument):
ggplot(data = mpg) +
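geom_point(mapping = aes(x = displ, y = hwy), alpha = 1/3)
Note that not all aesthetics are equally suited to serve all functions (e.g., mapping size or alpha to an unordered categorical variable like class triggers a warning, as these aesthetics suggest an order).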
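A common pitfall follows from the distinction between using an aesthetic and setting an argument: placing a constant inside aes() maps it as if it were data, rather than setting the visual property. A small sketch (the object names p_mapped and p_set are only used for this illustration):

```r
library(ggplot2)

# Constant INSIDE aes(): "blue" is treated as a data value with a single level,
# so ggplot2 assigns a default color to it and adds a legend.
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

# Constant OUTSIDE aes(): the points are actually drawn in blue.
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

p_mapped
p_set
```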
Creating nicer plots
A typical problem that renders many plots difficult to decipher is overplotting (i.e., too much information at the same location). One solution to this issue is provided by faceting, which provides an alternative way of grouping plots into several sub-plots:
# Faceting (by class):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class)
An alternative solution to overplotting consists in the careful selection of geoms and aesthetics. For instance, by combining geoms, colors, and transparency, and adding a ggplot theme (see ?theme_bw), we can create informative and attractive plots:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(color = "firebrick", linetype = 1) +
geom_point(shape = 21, color = "black", fill = "steelblue4", alpha = 1/4, size = 3, stroke = 1) +
theme_bw()
Adding titles and labels completes a plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(color = "firebrick", linetype = 1) +
geom_point(shape = 21, color = "black", fill = "steelblue4", alpha = 1/4, size = 3, stroke = 1) +
labs(title = "Fuel usage by engine displacement",
x = "Engine displacement (in liters)", y = "Miles per gallon (on highway)",
caption = "Data from ggplot2::mpg") +
theme_bw()
Other plot types
Here are some examples that illustrate the use of different geoms and aesthetic features for different types of plots.
Histograms
A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:
library(ggplot2)
# Data: ------
# ?ggplot2::mpg
# mpg
# Histogram: ------
# A minimal histogram:
ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(binwidth = 2) # set binwidth parameter
# (B) Adding aesthetics, labels and themes: ------
# Enhanced version of the same plot:
ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
labs(title = "Distribution of fuel economy",
x = "Miles per gallon (in city)",
caption = "Data from ggplot2::mpg") +
theme_light()
Bar plots
Another common type of plot shows the values across different levels of some variable as the height of bars. As this plot type can use both categorical and continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:
Counts of cases
By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):
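Note that newer versions of ggplot2 (3.3.0 and later, as far as we are aware) provide after_stat() as a replacement for the ..count.. notation, and the older notation may trigger a deprecation warning in recent releases. Assuming such a version is installed, the count mapping can be sketched as follows (the object name p_count is chosen for illustration only):

```r
library(ggplot2)

# Bar chart of counts by class, using after_stat() instead of ..count..:
p_count <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count)))

p_count
```

The examples in this section use the ..count.. notation, which matches the ggplot2 version available when the course was taught.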
## Data:
# ggplot2::mpg
# (a) Count number of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class))
# (a) is the same as (b):
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..))
# (b) is the same as (c):
ggplot(mpg) +
geom_bar(aes(x = class), stat = "count")
# (c) is the same as (d):
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..), stat = "count")
# (e) prettier version:
ggplot(mpg) +
geom_bar(aes(x = class, fill = class),
# stat = "count",
color = "black") +
labs(title = "Counts of cars by class",
x = "Class of car", y = "Frequency") +
scale_fill_brewer(name = "Class:", palette = "Blues") +
theme_bw()
Proportions of cases
An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:
## Data:
# ggplot2::mpg
# (1) Proportion of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..prop.., group = 1))
# is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count../sum(..count..)))
Bar plots of existing values
A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").
For instance, let’s plot a bar chart that shows the election data from the following tibble de (and don’t worry if you don’t understand the commands used to generate the tibble at this point):
library(tidyverse)
## (a) Create a tibble of data:
de_org <- tibble(
party = c("CDU/CSU", "SPD", "Others"),
share_2013 = c((.341 + .074), .257, (1 - (.341 + .074) - .257)),
share_2017 = c((.268 + .062), .205, (1 - (.268 + .062) - .205))
)
de_org$party <- factor(de_org$party, levels = c("CDU/CSU", "SPD", "Others")) # optional
# de_org
## Check that columns add to 1 (i.e., 100%):
# sum(de_org$share_2013) # => 1 (qed)
# sum(de_org$share_2017) # => 1 (qed)
## (b) Converting de_org into a tidy data table (de):
de <- de_org %>%
gather(share_2013:share_2017, key = "election", value = "share") %>%
separate(col = "election", into = c("dummy", "year")) %>%
select(year, party, share)
knitr::kable(de)
| year | party | share |
|---|---|---|
| 2013 | CDU/CSU | 0.415 |
| 2013 | SPD | 0.257 |
| 2013 | Others | 0.328 |
| 2017 | CDU/CSU | 0.330 |
| 2017 | SPD | 0.205 |
| 2017 | Others | 0.465 |
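As an aside, ggplot2 also provides geom_col(), which uses the values in the data directly and thus behaves like geom_bar(stat = "identity"). A minimal sketch, using a small stand-alone tibble (de_2017, created here only for illustration, with the 2017 shares from the table above):

```r
library(ggplot2)
library(tibble)

# Stand-alone copy of the 2017 shares shown in the table above:
de_2017 <- tibble(
  party = factor(c("CDU/CSU", "SPD", "Others"),
                 levels = c("CDU/CSU", "SPD", "Others")),
  share = c(.330, .205, .465)
)

# geom_col() plots the share values as bar heights (no counting involved):
p_col <- ggplot(de_2017, aes(x = party, y = share)) +
  geom_col(fill = "steelblue", color = "black")

p_col
```

The following examples keep geom_bar(stat = "identity") to make the overriding of the default stat explicit.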
- A version with 2 x 3 separate bars (using position = "dodge"):
## Data: -----
# de # => 6 x 3 tibble
## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)
## (1) Bar chart with side-by-side bars (dodge): -----
## (a) minimal version:
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (A) 3 bars per election (position = "dodge"):
geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1
## (b) Version with text labels and customized colors:
bp_1 +
## prettier plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .015),
position = position_dodge(width = .9), # match the default dodge width of the bars
fontface = 2, color = "black") +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_bw()
- A version with 2 bars with 3 segments (using position = "stack"):
## Data: -----
# de # => 6 x 3 tibble
## (2) Bar chart with stacked bars: -----
## (a) minimal version:
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (B) 1 bar per election (position = "stack"):
geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2
## (b) Version with text labels and customized colors:
bp_2 +
## prettier plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%")),
position = position_stack(vjust = .5),
color = rep(c("black", "white", "white"), 2),
fontface = 2) +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_classic()
Bar plots with error bars
It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, or confidence interval) to any bar plot. There is an entire range of geoms that draw error bars:
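The example in this section uses simulated values. With real data, such measures are typically computed from the raw observations first. The following sketch assumes the mpg data and uses dplyr's group_by() and summarise() to obtain the mean and standard error of hwy per class (the names hwy_stats, mn, and se are chosen for illustration only; the dplyr commands will be covered later):

```r
library(ggplot2)
library(dplyr)

# Mean and standard error (sd / sqrt(n)) of highway mileage per class:
hwy_stats <- mpg %>%
  group_by(class) %>%
  summarise(mn = mean(hwy),
            se = sd(hwy) / sqrt(n()))

# Bars show the means, error bars show +/- 1 standard error:
p_se <- ggplot(hwy_stats, aes(x = class, y = mn)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_errorbar(aes(ymin = mn - se, ymax = mn + se), width = 0.4) +
  labs(title = "Mean highway mileage by class (with standard errors)",
       x = "Class of car", y = "Miles per gallon (on highway)") +
  theme_bw()

p_se
```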
## Create data to plot: -----
n_cat <- 6
set.seed(101) # for reproducible randomness
data <- tibble(
name = LETTERS[1:n_cat],
value = sample(seq(25, 50), n_cat),
sd = abs(rnorm(n = n_cat, mean = 0, sd = 8))) # abs() ensures non-negative sd values
# data
## Error bars: -----
## x-aesthetic only:
# (a) errorbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd),
width = 0.4, color = "orange", alpha = 1, size = 1.0) +
labs(title = "Bar plot with error bars") +
theme_bw()
# (b) linerange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd),
color = "firebrick", alpha = 1, size = 2.5) +
labs(title = "Bar plot with line range") +
theme_light()
## Additional y-aesthetic:
# (c) crossbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "forestgreen") +
geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
width = 0.3, color = "sienna1", alpha = 1, size = 1.0) +
labs(title = "Bar plot with crossbars") +
theme_classic()
# (d) pointrange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "skyblue") +
geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
color = "gold", alpha = 1.0, size = 1.2) +
labs(title = "Bar plot with point ranges") +
theme_dark()
Line graphs
A line graph typically depicts developments of some item over time (or some other factor). To know which variable is to be plotted repeatedly, we need to specify the group property. For instance, the following plot shows the growth of orange trees by their age (using the data from datasets::Orange):
otrees <- as_tibble(datasets::Orange)
otrees
#> # A tibble: 35 x 3
#> Tree age circumference
#> * <ord> <dbl> <dbl>
#> 1 1 118 30
#> 2 1 484 58
#> 3 1 664 87
#> 4 1 1004 115
#> 5 1 1231 120
#> 6 1 1372 142
#> 7 1 1582 145
#> 8 2 118 33
#> 9 2 484 69
#> 10 2 664 111
#> # ... with 25 more rows
# basic version:
ggplot(otrees) +
geom_line(aes(x = age, y = circumference, group = Tree)) +
labs(title = "Growth of orange trees") +
theme_bw()
# prettier version:
ggplot(otrees, aes(x = age, y = circumference, group = Tree, color = Tree)) +
geom_line(size = 1.5) +
geom_point(aes(shape = Tree), size = 3) +
labs(title = "Growth of orange trees over time",
x = "Age (days elapsed)", y = "Circumference (in mm)") +
theme_bw()
Other plots
There are many additional types of plots, some of which we will introduce later (in Chapter 7: Exploratory data analysis (EDA) and Exploring data (EDA)). In addition, see https://ggplot2.tidyverse.org/reference/ and the references provided below for additional plots and examples.
Details on ggplot
Note some details that characterize and distinguish ggplot2 commands:
- ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.
- The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.
- Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).
- When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).
- In ggplot, a sequence of commands is combined by +, rather than %>% (the forward pipe operator provided by magrittr). The + has to be at the end of the current line, rather than at the beginning of the next line.
The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots). Tuning plots can be a lot of fun, but keep in mind your current goals and the plot’s intended audience.
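Two of the details listed above can be illustrated with a short sketch (the object names are chosen for illustration only): shared mappings are specified once in ggplot(), and the layer that is listed last is drawn on top.

```r
library(ggplot2)

# Shared mapping in ggplot(); each line ends with + to continue the plot.
# Points are drawn first, the smooth line is drawn on top of them:
p_line_on_top <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "grey40") +
  geom_smooth(color = "firebrick")

# Reversing the order draws the points on top of the smooth line:
p_points_on_top <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(color = "firebrick") +
  geom_point(color = "grey40")

p_line_on_top
p_points_on_top
```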
Exercises (WPA02)
Exercise 1
A scatterplot shows a data point (observation) as a function of 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data.
- Use the ggplot2::mpg data to create a scatterplot that shows a car’s fuel economy on the highway (on the y-axis) as a function of its fuel economy in the city (on the x-axis). How would you describe this relationship?
## Data:
# ggplot2::mpg
# Scatterplot: ------
# A minimal scatterplot + reference line:
ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline(color = "red3") # adds x = y (45-degree) line
# => looks like a linear relationship between cty and hwy.
- Does your plot suffer from overplotting? If so, create at least 2 different versions that solve this problem.
Yes, it seems that multiple points appear at the same position, which is a common issue with scatterplots and a sign of overplotting.
Dealing with overplotting:
There are several ways of dealing with this issue:
- jitter adds randomness to positions;
- alpha uses transparency to show overlaps and the frequency of objects at positions;
- geom_count allows mapping count values (e.g., frequency) to object size;
- facet_wrap allows disentangling plots by levels of (other) variables.
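The size-based option can be sketched with geom_count(), which counts the observations at each location and maps this count to point size (the color and transparency settings and the object name p_cnt are arbitrary choices for this illustration):

```r
library(ggplot2)

# Larger points indicate more cars with the same (cty, hwy) combination:
p_cnt <- ggplot(mpg) +
  geom_count(aes(x = cty, y = hwy), color = "steelblue", alpha = 1/2) +
  geom_abline(color = "red3")

p_cnt
```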
Some possible solutions in our present case include:
## Dealing with overplotting: -----
# Adding randomness to point positions:
ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "jitter") +
geom_abline(color = "red3")
## Note: Setting position = "jitter"
## is the same as (except for randomness):
# ggplot(mpg) +
# geom_jitter(aes(x = cty, y = hwy)) +
# geom_abline(color = "red3")
# Using transparency (via setting alpha to < 1):
my_plot <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "identity",
pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
geom_abline(linetype = 2, color = "red3") # +
# geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
my_plot # plots the plot
# Faceting (by another variable):
ggplot(mpg) +
facet_wrap(~class) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline(color = "red3")
- Add informative titles, labels, and a theme to the plot.
# Adding labels and themes to plots: -----
my_plot + # use the plot defined above
labs(title = "Fuel economy on highway vs. city",
x = "City (miles per gallon)",
y = "Highway (miles per gallon)",
caption = "Data from ggplot2::mpg") +
# coord_fixed() +
theme_bw()
- Group the points in your scatterplot by the class of vehicles (in at least 2 different ways).
# Grouping by (the categorical variable) class: ------
# Grouping by color:
ggplot(mpg) +
geom_point(aes(x = cty, y = hwy, color = class),
position = "jitter", alpha = 1/2, size = 4) +
geom_abline(linetype = 2) +
theme_bw()
# Grouping by facets:
ggplot(mpg) +
geom_point(aes(x = cty, y = hwy),
position = "jitter", alpha = 1/2, size = 2) +
geom_abline(linetype = 2) +
facet_wrap(~class) +
theme_bw()
Exercise 2
The following plot repeats the histogram code from above (to plot the distribution of fuel economy in city environments), but adds a frequency polygon as a 2nd geom (see ?geom_freqpoly).
# Plot from above with an additional geom:
ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
geom_freqpoly(color = "steelblue", size = 2) +
labs(title = "Distribution of fuel economy",
x = "Miles per gallon (in city)",
caption = "Data from ggplot2::mpg") +
theme_light()
- Why is the (blue) line of the polygon lower than the (yellow) bars of the histogram?
- Change 1 value in the code so that both (lines and bars) have the same heights.
# Explanation: -----
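# The histogram uses a binwidth of 2, whereas geom_freqpoly uses its default (bins = 30),
# which yields narrower bins here. Wider bins contain more cases, so the bars are higher.
# Solution: -----
# Setting binwidth = 1 puts both on (approximately) the same scale: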
ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(aes(x = cty), binwidth = 1, fill = "gold", color = "black") +
geom_freqpoly(color = "steelblue", size = 2) +
labs(title = "Distribution of fuel economy",
x = "Miles per gallon (in city)",
caption = "Data from ggplot2::mpg") +
theme_light()
# Alternatively, we can add the same binwidth = 2 argument to geom_freqpoly:
ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
geom_freqpoly(color = "steelblue", binwidth = 2, size = 2) +
labs(title = "Distribution of fuel economy",
x = "Miles per gallon (in city)",
caption = "Data from ggplot2::mpg") +
theme_light()
- Why can’t we simply replace geom_freqpoly by geom_line or geom_smooth to get a similar line?
# Try replacing geom_freqpoly by geom_line and geom_smooth:
ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(aes(x = cty), binwidth = 2, fill = "gold", color = "black") +
geom_freqpoly(color = "steelblue", binwidth = 2, size = 2) +
geom_line() + # Error: geom_line requires the following missing aesthetics: y
# geom_smooth() + # Error: stat_smooth requires the following missing aesthetics: y
labs(title = "Distribution of fuel economy",
x = "Miles per gallon (in city)",
caption = "Data from ggplot2::mpg") +
theme_light()
# Note: Both geoms require a y-aesthetic (here: the count of cars per x-value).
Answer: Whereas geom_histogram and geom_freqpoly count the frequency of elements in each bin, both geom_line and geom_smooth require a y-aesthetic.
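If we do want to use geom_line here, one possible workaround (a sketch that is not part of the original exercise) is to let it use the same binning statistic as the histogram by setting stat = "bin", which supplies the missing y-values as counts per bin:

```r
library(ggplot2)

# geom_line with stat = "bin" computes counts per bin (similar to geom_freqpoly):
p_bin_line <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2, fill = "gold", color = "black") +
  geom_line(stat = "bin", binwidth = 2, color = "steelblue") +
  theme_light()

p_bin_line
```

As both layers use binwidth = 2, the line connects the tops of the histogram bars.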
Exercise 3
Creating bar plots with the ggplot2::mpg data.
- Plot the number or frequency of cases by cyl as a bar plot (in at least 2 different ways).
# Count the number of cases by cylinders: ------
ggplot(mpg) +
geom_bar(aes(x = cyl))
ggplot(mpg) +
geom_bar(aes(x = cyl, y = ..count..))
ggplot(mpg) +
geom_bar(aes(x = cyl), stat = "count")
- Plot the proportion of cases in the mpg data by cyl (in at least 2 different ways).
# Proportion of cases: -----
ggplot(mpg) +
geom_bar(aes(x = cyl, y = ..prop.., group = 1))
# is the same as:
ggplot(mpg) +
geom_bar(aes(x = cyl, y = ..count../sum(..count..)))
- Create a prettier version by adding different colors, appropriate labels, and a suitable theme to your plot.
# Prettier version: -----
ggplot(mpg) +
geom_bar(aes(x = cyl, fill = as.factor(cyl)),
stat = "count", color = "black") +
labs(title = "Counts of cars by number of cylinders",
x = "Cylinders", y = "Frequency") +
scale_fill_brewer(name = "Cylinders:", palette = "Spectral") +
# coord_flip() +
theme_bw()
Exercise 4
The ChickWeight data (contained in R datasets) contains the results of an experiment that measures the effects of Diet on the early growth of chicks.
- Save the ChickWeight data as a tibble and inspect its dimensions and variables.
# ?datasets::ChickWeight
# (a) Save data as tibble and inspect:
cw <- as_tibble(ChickWeight)
cw # 578 observations (rows) x 4 variables (columns)
#> # A tibble: 578 x 4
#> weight Time Chick Diet
#> * <dbl> <dbl> <ord> <fct>
#> 1 42 0 1 1
#> 2 51 2 1 1
#> 3 59 4 1 1
#> 4 64 6 1 1
#> 5 76 8 1 1
#> 6 93 10 1 1
#> 7 106 12 1 1
#> 8 125 14 1 1
#> 9 149 16 1 1
#> 10 171 18 1 1
#> # ... with 568 more rows
- Create a line plot showing the weight development of each individual chick (on the y-axis) over Time (on the x-axis) for each Diet (in 4 different facets).
# Basic version:
ggplot(cw) +
geom_line(aes(x = Time, y = weight, group = Chick, color = Chick)) +
facet_wrap(~Diet)
# Fancy version:
# Scatter and/or line plot showing the weight development of each chick (on the y-axis)
# over Time (on the x-axis) for each Diet (as different facets):
ggplot(cw, aes(x = Time, y = weight, group = Diet)) +
facet_wrap(~Diet) +
geom_point(alpha = 1/2) +
geom_line(aes(group = Chick)) +
geom_smooth(aes(color = Diet)) +
labs(title = "Chick weight by time for different diets",
x = "Time (number of days)", y = "Weight (in gm)",
caption = "Data from datasets::ChickWeight.") +
theme_bw()
- The following bar chart shows the number of chicks per Diet over Time. We see that the initial Diet groups contain different numbers of chicks and that some chicks drop out over Time:
# (c) Bar plot showing the number (count) of chicks per diet over time:
ggplot(cw, aes(x = Time, fill = Diet)) +
geom_bar(position = "dodge") +
labs(title = "Number of chicks per diet over time", x = "Time (number of days)", y = "Number",
caption = "Data from datasets::ChickWeight.") +
theme_bw()
Try re-creating this plot (with geom_bar and dodged bar positions).
Exercise 5
Use the p_info data from Exercise 6 of WPA01 to create some plots that describe the sample of participants:
library(readr)
# Read data (from online source):
p_info <- read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
# dim(p_info) # 295 rows, 6 columns
# p_info # prints a summary of the table/tibble
# glimpse(p_info) # shows the first values for 6 variables (columns)
# Turn some categorical values into factors:
p_info$sex <- as.factor(p_info$sex)
p_info$intervention <- as.factor(p_info$intervention)
p_info # Note that intervention and sex are now listed as <fct>.
#> # A tibble: 295 x 6
#> id intervention sex age educ income
#> <int> <fct> <fct> <int> <int> <int>
#> 1 1 4 2 35 5 3
#> 2 2 1 1 59 1 1
#> 3 3 4 1 51 4 3
#> 4 4 3 1 50 5 2
#> 5 5 2 2 58 5 2
#> 6 6 1 1 31 5 1
#> 7 7 3 1 44 5 2
#> 8 8 2 1 57 4 2
#> 9 9 1 1 36 4 3
#> 10 10 2 1 45 4 3
#> # ... with 285 more rows
- A histogram that shows the distribution of participant age in 3 ways:
  - overall,
  - separately for each sex, and
  - separately for each intervention.
# (a) Histogram showing the overall distribution of age:
ggplot(p_info) +
geom_histogram(mapping = aes(age), binwidth = 4, fill = "gold", col = "black") +
labs(title = "Distribution of age values") +
theme_bw()
# Note: Same distribution as frequency polygon:
ggplot(p_info) +
geom_freqpoly(mapping = aes(x = age), binwidth = 4, color = "forestgreen")+
labs(title = "Distribution of age values",
x="Age", y = "Count")+
theme_bw()
# (b) ... by sex:
ggplot(p_info) +
geom_histogram(mapping = aes(age, fill = sex), binwidth = 4, col = "black") +
labs(title = "Distribution of age values (by sex)") +
scale_fill_brewer(name = "Gender:", palette = "Set1") +
theme_bw()
# OR:
ggplot(p_info) +
geom_histogram(mapping = aes(age, fill = sex), binwidth = 4, col = "black") +
facet_grid(~sex) +
labs(title = "Distribution of age values (by sex)") +
scale_fill_brewer(name = "Gender:", palette = "Set1") +
theme_bw()
# (c) ... by intervention:
ggplot(p_info) +
geom_histogram(mapping = aes(age, fill = intervention), binwidth = 4, col = "black") +
facet_grid(~intervention) +
labs(title = "Distribution of age values (by intervention)") +
scale_fill_brewer(name = "Intervention group:", palette = "Spectral") +
theme_bw()
- A bar plot that
  - shows how many participants took part in each intervention; or
  - shows how many participants of each sex took part in each intervention.
# Number of participants per intervention:
ggplot(p_info)+
geom_bar(mapping = aes(x = intervention), fill = "gold")+
labs(title = "Number of participants per intervention",
x = "Intervention", y = "Count") +
theme_bw()
# ... & by sex:
ggplot(p_info) +
geom_bar(aes(x = intervention, fill = sex), position = "dodge") +
labs(title = "Number of participants per intervention (and sex)",
x = "Intervention", y = "Count") +
scale_fill_brewer(name = "Gender:", palette = "Set1") +
theme_bw()
Note that it would be desirable to explain what gender is encoded by the values of 1 and 2. For this, we need to inspect the codebook and will later learn to encode the variable sex as a categorical variable (as a factor).
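Once the codebook has been consulted, descriptive labels can be attached to the factor levels. The following sketch uses a small made-up vector and purely hypothetical labels ("group A" and "group B"), since the actual meaning of the values 1 and 2 is not stated here:

```r
# Toy vector standing in for p_info$sex (values made up for illustration):
sex_codes <- c(2, 1, 1, 1, 2)

# Hypothetical labels: replace them with the ones given in the codebook.
sex_f <- factor(sex_codes, levels = c(1, 2), labels = c("group A", "group B"))

table(sex_f)  # counts per labeled level
```

With labeled levels, plot legends show the labels rather than the numeric codes.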
More on data visualization
See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) (to be covered in 2 weeks) and complete their exercises.
The following links provide additional information:
- study the vignette("ggplot") and the documentation for ggplot and various geoms (e.g., geom_);
- study https://ggplot2.tidyverse.org/reference/ and its examples;
- see the cheat sheet on data visualization.
Books or scripts on data visualization:
- Data Visualization. A practical introduction, by Kieran Healy
- Fundamentals of Data Visualization, by Claus O. Wilke
Conclusion
All ds4psy essentials so far:
| Nr. | Topic |
|---|---|
| 0. | Syllabus |
| 1. | Basic R concepts and commands |
| 2. | Visualizing data |
| 3. | Transforming data |
| 4. | Exploring data (EDA) |
| +. | Datasets |
[Last update on 2018-11-27 22:55:20 by hn.]