Basic Statistics and Projects in R

Data visualization with the tidyverse

Christian Althaus, Judith Bouman, Martin Wohlfender

Fundamentals

Claus Wilke, a professor of integrative biology at The University of Texas at Austin, wrote the book Fundamentals of Data Visualization as a guide to making visualizations that

  • accurately reflect the data,
  • tell a story,
  • and look professional.

Note that the entire book was written in R Markdown using RStudio!

Ugly, bad, and wrong figures

  • ugly - A figure that has aesthetic problems but otherwise is clear and informative.
  • bad - A figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving.
  • wrong - A figure that has problems related to mathematics; it is objectively incorrect.

Aesthetics

All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics.
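
As a minimal sketch of this idea, the following ggplot2 code maps three columns of the built-in mtcars data set to the aesthetics x position, y position and colour. The data set and the choice of variables are assumptions made for illustration only.

```r
library(ggplot2)

# map data values to aesthetics:
# displacement -> x position, power -> y position, cylinders -> colour
plot_aesthetics <- ggplot(data = mtcars,
                          mapping = aes(x = disp, y = hp, colour = factor(cyl))) +
  geom_point()
```

Other commonly used aesthetics are fill, shape, size and linetype.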

Coordinate systems

Coordinate systems don’t have to be Cartesian.
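
A short sketch of a non-Cartesian coordinate system, assuming ggplot2 and the built-in mtcars data set (both chosen here for illustration): the same bar chart is drawn once in Cartesian and once in polar coordinates.

```r
library(ggplot2)

# number of cars per cylinder count, drawn in Cartesian coordinates
plot_cartesian <- ggplot(data = mtcars, mapping = aes(x = factor(cyl))) +
  geom_bar()

# the same bars drawn in a polar coordinate system
plot_polar <- plot_cartesian + coord_polar()
```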

Color scales

There are three fundamental use cases for color in data visualizations:

  1. We can use color to distinguish groups of data from each other.
  2. We can use color to represent data values.
  3. We can use color to highlight.

The types of colors we use and the way in which we use them are quite different for these three cases.

Color as a tool to distinguish

Color to represent data values

Color as a tool to highlight

ColorBrewer

Cynthia Brewer, a cartographer at Pennsylvania State University, designed the widely used ColorBrewer color schemes.

You can use the interactive web tool ColorBrewer 2.0 to choose an appropriate color scheme for your needs.

To use these color schemes in R, install the package RColorBrewer.
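
A minimal sketch of how the package might be used. The scheme names "Set2" and "Blues" are examples picked for illustration; any other ColorBrewer scheme works the same way.

```r
# install.packages("RColorBrewer")  # run once if the package is not installed
library(RColorBrewer)

# qualitative scheme with 3 colours, e.g. to distinguish groups
cols_groups <- brewer.pal(n = 3, name = "Set2")

# sequential scheme with 5 colours, e.g. to represent data values
cols_values <- brewer.pal(n = 5, name = "Blues")

# display.brewer.all() draws an overview of all available schemes
```

The returned vectors contain hex colour codes and can be passed to the values argument of scale_fill_manual() or scale_colour_manual() in ggplot2.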

Colorblind safe figures

If you do not have a color vision deficiency, it is very hard to imagine what it looks like to be colorblind.

The Color Blindness Simulator can close this gap for you. Just play around with it and check whether your figures are colorblind safe.
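
One possible starting point is a palette that was designed with color vision deficiencies in mind. The sketch below uses the Okabe-Ito palette that ships with base R (version 4.0.0 or later). This palette is not mentioned in the slides and is an assumption made here for illustration; figures should still be checked with a simulator.

```r
# first 8 colours of the Okabe-Ito palette from base R (R >= 4.0.0)
cols_okabe_ito <- palette.colors(n = 8, palette = "Okabe-Ito")

# draw the colours as a simple bar chart to inspect them
barplot(rep(1, 8), col = cols_okabe_ito, names.arg = names(cols_okabe_ito),
        las = 2, yaxt = "n")
```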

unibeCols

The University of Bern has a set of corporate design colors that are defined in the manual “Gestaltungselemente”.

Thanks to Alan, you can easily install this color scheme with the unibeCols package: https://github.com/CTU-Bern/unibeCols

Visualizing (many) distributions

Visualizing geospatial data

Visualizing uncertainty

Whenever you visualize uncertainty with error bars, you must specify what quantity and/or confidence level the error bars represent.
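
A sketch of how this could look in ggplot2, assuming the built-in mtcars data set and a 95% confidence interval based on the normal approximation (mean ± 1.96 standard errors). Both choices are illustrative assumptions; the point is that the caption states what the error bars represent.

```r
library(dplyr)
library(ggplot2)

# mean fuel efficiency per cylinder count with an approximate 95% confidence
# interval (normal approximation: mean +/- 1.96 * standard error)
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), se_mpg = sd(mpg) / sqrt(n())) %>%
  mutate(lower = mean_mpg - 1.96 * se_mpg,
         upper = mean_mpg + 1.96 * se_mpg)

plot_uncertainty <- ggplot(data = mpg_summary,
                           mapping = aes(x = factor(cyl), y = mean_mpg)) +
  geom_point() +
  geom_errorbar(mapping = aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(x = "number of cylinders", y = "fuel efficiency (mpg)",
       caption = "Points: group means. Error bars: 95% confidence intervals.")
```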

The principle of proportional ink

When a shaded region is used to represent a numerical value, the area of that shaded region should be directly proportional to the corresponding value. - Bergstrom & West
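
The following sketch illustrates the principle with a bar chart. The data frame df_ink contains hypothetical values that were made up for this example.

```r
library(ggplot2)

# hypothetical values, for illustration only
df_ink <- data.frame(group = c("A", "B", "C"), value = c(95, 100, 105))

# bars start at zero, so the bar area is proportional to the value
plot_ink_ok <- ggplot(data = df_ink, mapping = aes(x = group, y = value)) +
  geom_col()

# zooming in on the y-axis cuts off the lower part of the bars, so the
# visible area is no longer proportional to the value: avoid this
plot_ink_bad <- plot_ink_ok + coord_cartesian(ylim = c(90, 110))
```

In the second plot, bar C appears three times as tall as bar A, although its value is only about 10% larger.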

Handling overlapping data points

Don’t go 3D

Even though the 3D visualizations are shown from four different perspectives, it is difficult to envision how exactly the points are distributed in space.

Instead, map one of the variables (in this case fuel efficiency) onto another aesthetic (size of the dots).
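
A sketch of this approach with the built-in mtcars data set. Which two variables go on the axes is an assumption (displacement and power are used here); fuel efficiency is mapped onto the size of the dots.

```r
library(ggplot2)

# displacement and power on the axes, fuel efficiency as point size
plot_mtcars_size <- ggplot(data = mtcars,
                           mapping = aes(x = disp, y = hp, size = mpg)) +
  geom_point(alpha = 0.7) +
  xlab("displacement (cu. in.)") +
  ylab("power (hp)") +
  labs(size = "fuel efficiency (mpg)")
```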

Commonly used image file formats

Acronym  Name                              Type    Application
pdf      Portable Document Format          vector  general purpose
eps      Encapsulated PostScript           vector  general purpose, outdated; use pdf
svg      Scalable Vector Graphics          vector  online use
png      Portable Network Graphics         bitmap  optimized for line drawings
jpeg     Joint Photographic Experts Group  bitmap  optimized for photographic images
tiff     Tagged Image File Format          bitmap  print production, accurate color reproduction
raw      Raw Image File                    bitmap  digital photography, needs post-processing
gif      Graphics Interchange Format       bitmap  outdated for static figures, OK for animations
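
In ggplot2, the file format can be chosen through the file extension passed to ggsave(). The sketch below saves the same plot once as a vector graphic and once as a bitmap. The file names, the figure size and the use of a temporary directory are assumptions made for this example.

```r
library(ggplot2)

plot_example <- ggplot(data = mtcars, mapping = aes(x = disp, y = hp)) +
  geom_point()

# vector format: can be scaled without loss of quality
file_pdf <- file.path(tempdir(), "plot_example.pdf")
ggsave(filename = file_pdf, plot = plot_example, width = 6, height = 4)

# bitmap format: the resolution is fixed by the dpi argument
file_png <- file.path(tempdir(), "plot_example.png")
ggsave(filename = file_png, plot = plot_example, width = 6, height = 4, dpi = 300)
```

Width and height are given in inches by default.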

Base plot vs. ggplot

plot(mtcars$disp, mtcars$hp,
     xlab = "displacement (cu. in.)",
     ylab = "power (hp)",
     main = "Scatter plot in base plot")

library(ggplot2)

ggplot(mtcars, aes(x = disp, y = hp)) +
    geom_point() +
    xlab("displacement (cu. in.)") +
    ylab("power (hp)") +
    ggtitle("Scatter plot in ggplot")

Data exploration and visualization with ggplot2

Artwork by @allison_horst

Program for the rest of the afternoon

  • General idea of using ggplot2
  • Basic graphs: geom_point, geom_line and geom_col (60 min)
  • Fancify basic graphs: colors, legend, axes, theme and patchwork (60 min)
  • Other types of geom: histogram, density, violin, boxplot (50 min)

Data visualization with ggplot2

ggplot2 is based on the grammar of graphics, a conceptual approach to building graphs from layers.

Pass a dataframe, map variables to aesthetics (e.g. x, y, colour), and tell ggplot2 which geometry to use (e.g. point, line).
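
The layered approach can be sketched step by step as follows. The built-in mtcars data set and the regression line are assumptions chosen for illustration.

```r
library(ggplot2)

# 1. pass a data frame and map variables to aesthetics
plot_layers <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# 2. add a geometry layer: points
plot_layers <- plot_layers + geom_point()

# 3. add a second layer on top: a linear regression line
plot_layers <- plot_layers + geom_smooth(method = "lm", formula = y ~ x)
```

Each call to a geom_ function adds one layer; layers are drawn in the order in which they are added.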

2023 - R for the Rest of Us

Cheatsheet

Example data COVID-19

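# load dplyr for the pipe (%>%) and mutate()
library(dplyr)

# read COVID-19 case data and format column datum as date
covid <- read.csv("data/raw/COVID19Cases_geoRegion.csv")
covid <- covid %>% mutate(datum = as.Date(datum))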

head(covid)
  geoRegion      datum entries sumTotal timeframe_14d timeframe_all
1        CH 2020-02-24       1        1         FALSE          TRUE
2        CH 2020-02-25       1        2         FALSE          TRUE
3        CH 2020-02-26      10       12         FALSE          TRUE
4        CH 2020-02-27      10       22         FALSE          TRUE
5        CH 2020-02-28      10       32         FALSE          TRUE
6        CH 2020-02-29      13       45         FALSE          TRUE
  offset_last7d sumTotal_last7d offset_last14d sumTotal_last14d offset_last28d
1       4385008               0        4383801                0        4376250
2       4385008               0        4383801                0        4376250
3       4385008               0        4383801                0        4376250
4       4385008               0        4383801                0        4376250
5       4385008               0        4383801                0        4376250
6       4385008               0        4383801                0        4376250
  sumTotal_last28d sum7d sum14d mean7d mean14d entries_diff_last_age     pop
1                0    NA     NA     NA      NA                     7 8738791
2                0    NA     NA     NA      NA                     7 8738791
3                0    NA     NA     NA      NA                     7 8738791
4                0    NA     NA   8.14      NA                     7 8738791
5                0    NA     NA  12.29      NA                     7 8738791
6                0    NA     NA  16.86      NA                     7 8738791
  inz_entries inzsumTotal inzmean7d inzmean14d inzsumTotal_last7d
1        0.01        0.01        NA         NA                 NA
2        0.01        0.02        NA         NA                 NA
3        0.11        0.14        NA         NA                 NA
4        0.11        0.25      0.09         NA                 NA
5        0.11        0.37      0.14         NA                 NA
6        0.15        0.51      0.19         NA                 NA
  inzsumTotal_last14d inzsumTotal_last28d inzsum7d inzsum14d sumdelta7d
1                  NA                  NA       NA        NA         NA
2                  NA                  NA       NA        NA         NA
3                  NA                  NA       NA        NA         NA
4                  NA                  NA       NA        NA         NA
5                  NA                  NA       NA        NA         NA
6                  NA                  NA       NA        NA         NA
  inzdelta7d         type type_variant             version datum_unit
1         NA COVID19Cases           NA 2023-01-24_06-03-16        day
2         NA COVID19Cases           NA 2023-01-24_06-03-16        day
3         NA COVID19Cases           NA 2023-01-24_06-03-16        day
4         NA COVID19Cases           NA 2023-01-24_06-03-16        day
5         NA COVID19Cases           NA 2023-01-24_06-03-16        day
6         NA COVID19Cases           NA 2023-01-24_06-03-16        day
  entries_letzter_stand entries_neu_gemeldet entries_diff_last
1                     1                    0               914
2                     1                    0               914
3                    10                    0               914
4                    10                    0               914
5                    10                    0               914
6                    13                    0               914
dim(covid)
[1] 30247    36

Data from the COVID-19 BAG dashboard: https://www.covid19.admin.ch/

dataframe setup: covid_cantons_2020

# filter data frame covid: 
# only keep confirmed cases in the cantons of Zurich, Bern and Vaud 
# in the first half of the year 2020
covid_cantons_2020 <- covid %>% filter(datum <= as.Date("2020-06-30") 
                    & (geoRegion == "ZH" | geoRegion == "BE" | geoRegion == "VD"))

# write data frame covid_cantons_2020 to a csv file
write.csv(x = covid_cantons_2020, file = "data/processed/covid_cantons_2020_06.csv")

# show the first rows of covid_cantons_2020
head(covid_cantons_2020)
  geoRegion      datum entries sumTotal timeframe_14d timeframe_all
1        BE 2020-02-24       0        0         FALSE          TRUE
2        BE 2020-02-25       0        0         FALSE          TRUE
3        BE 2020-02-26       0        0         FALSE          TRUE
4        BE 2020-02-27       1        1         FALSE          TRUE
5        BE 2020-02-28       0        1         FALSE          TRUE
6        BE 2020-02-29       1        2         FALSE          TRUE
  offset_last7d sumTotal_last7d offset_last14d sumTotal_last14d offset_last28d
1        507985               0         507871                0         507046
2        507985               0         507871                0         507046
3        507985               0         507871                0         507046
4        507985               0         507871                0         507046
5        507985               0         507871                0         507046
6        507985               0         507871                0         507046
  sumTotal_last28d sum7d sum14d mean7d mean14d entries_diff_last_age     pop
1                0    NA     NA     NA      NA                     7 1047473
2                0    NA     NA     NA      NA                     7 1047473
3                0    NA     NA     NA      NA                     7 1047473
4                0    NA     NA   0.29      NA                     7 1047473
5                0    NA     NA   0.86      NA                     7 1047473
6                0    NA     NA   1.29      NA                     7 1047473
  inz_entries inzsumTotal inzmean7d inzmean14d inzsumTotal_last7d
1         0.0        0.00        NA         NA                 NA
2         0.0        0.00        NA         NA                 NA
3         0.0        0.00        NA         NA                 NA
4         0.1        0.10      0.03         NA                 NA
5         0.0        0.10      0.08         NA                 NA
6         0.1        0.19      0.12         NA                 NA
  inzsumTotal_last14d inzsumTotal_last28d inzsum7d inzsum14d sumdelta7d
1                  NA                  NA       NA        NA         NA
2                  NA                  NA       NA        NA         NA
3                  NA                  NA       NA        NA         NA
4                  NA                  NA       NA        NA         NA
5                  NA                  NA       NA        NA         NA
6                  NA                  NA       NA        NA         NA
  inzdelta7d         type type_variant             version datum_unit
1         NA COVID19Cases           NA 2023-01-24_06-03-16        day
2         NA COVID19Cases           NA 2023-01-24_06-03-16        day
3         NA COVID19Cases           NA 2023-01-24_06-03-16        day
4         NA COVID19Cases           NA 2023-01-24_06-03-16        day
5         NA COVID19Cases           NA 2023-01-24_06-03-16        day
6         NA COVID19Cases           NA 2023-01-24_06-03-16        day
  entries_letzter_stand entries_neu_gemeldet entries_diff_last
1                     0                    0                75
2                     0                    0                75
3                     0                    0                75
4                     1                    0                75
5                     0                    0                75
6                     1                    0                75

Goal 1 (exercise 4)

geom_point: basic plot

library(ggplot2)

plot_covid_point_v0 <- ggplot(data = covid_cantons_2020, 
                              mapping = aes(x = datum, y = entries)) + 
  geom_point()

Note: ggplot2 does not combine layers with the %>% or |> pipes; it uses + instead.
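
The two operators can still appear in the same statement: the pipe passes a data frame into ggplot(), and from there on layers are added with +. A minimal sketch, assuming dplyr and the built-in mtcars data set (chosen for illustration):

```r
library(dplyr)
library(ggplot2)

# the pipe hands the filtered data frame to ggplot();
# from ggplot() onwards, layers are added with +
plot_pipe <- mtcars %>%
  filter(cyl != 6) %>%
  ggplot(mapping = aes(x = disp, y = hp)) +
  geom_point()
```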

geom_line: basic plot

plot_covid_line_v0 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion))

geom_col: basic plot

plot_covid_col_v0 <- ggplot(data = covid_cantons_2020, 
                            mapping = aes(x = datum, y = entries)) + 
  geom_col(position = "stack")

Exercise 4A: basic plot

  1. Read Ebola data and sort it by date.
  2. Determine what variables you need to include in your dataframe to make the type of plot shown below.
  3. Create a dataframe with the required variables and all data for 3 countries before 31 March 2015.

Exercise 4A: solution

# load library
library(dplyr)

# read Ebola data
data_ebola <- read.csv("data/raw/ebola.csv")

# format column Date of data_ebola as date
data_ebola$Date <- as.Date(data_ebola$Date)

# sort data_ebola by date
data_ebola <- arrange(data_ebola, Date)

head(data_ebola)
    X      Country       Date Cum_conf_cases Cum_susp_cases Cum_conf_death
1 641       Guinea 2014-08-29            482             25            287
2 642      Liberia 2014-08-29            322            382            225
3 643 Sierra Leone 2014-08-29            935             54            380
4 644      Nigeria 2014-08-29             15              3              6
5 636       Guinea 2014-09-05            604             56            362
6 637      Liberia 2014-09-05            614            369            431
dim(data_ebola)
[1] 2484    6

Exercise 4A: solution

# filter data_ebola: cumulative number of confirmed cases in Guinea, 
# Liberia and Sierra Leone before 31 March 2015 
data_ebola_cum_cases <- data_ebola %>% 
  select(date = Date, country = Country, cum_conf_cases = Cum_conf_cases) %>% 
  filter(date <= as.Date("2015-03-31") & 
        (country == "Guinea" | country ==  "Liberia" | country == "Sierra Leone"))

Exercise 4B: basic plot

Create basic point, line and column plots of the cumulative number of confirmed cases versus time.

Exercise 4B: solution

# create point plot
plot_ebola_point_v0 <- ggplot(data = data_ebola_cum_cases, 
                              mapping = aes(x = date, y = cum_conf_cases)) + 
  geom_point()
  
# create line plot
plot_ebola_line_v0 <- ggplot(data = data_ebola_cum_cases, 
                             mapping = aes(x = date, y = cum_conf_cases)) + 
  geom_line(aes(group = country))

# create column plot
plot_ebola_col_v0 <- ggplot(data = data_ebola_cum_cases, 
                            mapping = aes(x = date, y = cum_conf_cases)) + 
  geom_col(position = "stack")

geom_point: colour and fill

plot_covid_point_v1 <- ggplot(data = covid_cantons_2020, 
                              mapping = aes(x = datum, y = entries)) + 
  geom_point(alpha = 0.7, colour = "black", fill = "blue", 
             shape = 21, size = 1.5, stroke = 1.5)

geom_line: colour and fill

plot_covid_line_v1 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion), 
            alpha = 0.7, colour = "blue", linetype = "solid", linewidth = 1.5)

geom_col: colour and fill

plot_covid_col_v1 <- ggplot(data = covid_cantons_2020, 
                            mapping = aes(x = datum, y = entries)) + 
  geom_col(position = "stack", alpha = 0.7, fill = "blue", 
           linetype = "solid", linewidth = 0.5, width = 0.7)

Exercise 4C: colour and fill

Change global aesthetics of the 3 plots you created in Exercise 4B.

  1. Point plot: Try different values for alpha, colour, fill, shape, size and stroke.
  2. Line plot: Try different values for alpha, colour, linetype and linewidth.
  3. Column plot: Try different values for alpha, colour, fill, linetype, linewidth, position and width.

Exercise 4C: solution

# create point plot
plot_ebola_point_v1 <- ggplot(data = data_ebola_cum_cases, 
                              mapping = aes(x = date, y = cum_conf_cases)) + 
  geom_point(alpha = 0.7, colour = "blue", fill = "green", 
             shape = 22, size = 1.5, stroke = 1.5) 

# create line plot
plot_ebola_line_v1 <- ggplot(data = data_ebola_cum_cases, 
                             mapping = aes(x = date, y = cum_conf_cases)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, colour = "blue", linetype = "dashed", linewidth = 1.5)

# create column plot
plot_ebola_col_v1 <- ggplot(data = data_ebola_cum_cases, 
                            mapping = aes(x = date, y = cum_conf_cases)) + 
  geom_col(alpha = 0.7, colour = "blue", fill = "green", 
           linetype = "solid", linewidth = 0.1, position = "stack", width = 0.7)

geom_point: color per canton

plot_covid_point_v2 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5)

Global vs. local aesthetics

# global aesthetics: colour, fill and group are inherited by both layers
ggplot(data = covid_cantons_2020, 
       mapping = aes(x = datum, y = entries, colour = geoRegion, 
                     fill = geoRegion, group = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  geom_line()

Global vs. local aesthetics

# global aesthetics: position and group only;
# colours are set as fixed values in each layer
ggplot(data = covid_cantons_2020, 
       mapping = aes(x = datum, y = entries, group = geoRegion)) + 
  geom_point(alpha = 0.7, colour = "black", fill = "black", shape = 21, 
             size = 1.5, stroke = 1.5) +
  geom_line(colour = "red")
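
The difference between global and local aesthetic mappings can also be sketched with a built-in data set. Orange (growth of orange trees) is an assumption chosen for illustration, because it contains one time series per tree.

```r
library(ggplot2)

# global mapping: colour is defined in ggplot() and inherited by all layers
plot_global <- ggplot(data = Orange,
                      mapping = aes(x = age, y = circumference, colour = Tree)) +
  geom_point() +
  geom_line()

# local mapping: colour applies to the points only; the lines need an
# explicit group and are drawn in one fixed colour
plot_local <- ggplot(data = Orange, mapping = aes(x = age, y = circumference)) +
  geom_point(mapping = aes(colour = Tree)) +
  geom_line(mapping = aes(group = Tree), colour = "grey50")
```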

More examples on local vs. global aesthetics

geom_line: color per canton

plot_covid_line_v2 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5)

geom_col: color per canton

plot_covid_col_v2 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7, 
           linetype = "solid", linewidth = 0.5, width = 0.7)

Exercise 4D: color per country

Change aesthetic mappings of the 3 plots you created in Exercise 4C.

  1. Point plot: Set fill colour of points by country.
  2. Line plot: Set colour of lines by country.
  3. Column plot: Set fill colour of columns by country.

Exercise 4D: solution

# create point plot
plot_ebola_point_v2 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_point(alpha = 0.7, shape = 22, size = 1.5, stroke = 1.5) 

# create line plot
plot_ebola_line_v2 <- ggplot(data = data_ebola_cum_cases, 
               mapping = aes(x = date, y = cum_conf_cases, colour = country)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, linetype = "dashed", linewidth = 1.5)

# create column plot
plot_ebola_col_v2 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_col(alpha = 0.7, linetype = "solid", 
           linewidth = 0.1, position = "stack", width = 0.7)

geom_point: labels

plot_covid_point_v3 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_line: labels

plot_covid_line_v3 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_col: labels

plot_covid_col_v3 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

Exercise 4E: labels

Change the title and the labels of the axes of the 3 plots you created in Exercise 4D.

  1. Set the title to “Confirmed Ebola cases”.
  2. Set the label of x-axes to “Time”.
  3. Set the label of y-axes to “Cum. # of confirmed cases”.

Exercise 4E: solution

# create point plot
plot_ebola_point_v3 <- ggplot(data = data_ebola_cum_cases, 
mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_point(alpha = 0.7, shape = 22, size = 1.5, stroke = 1.5) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

# create line plot
plot_ebola_line_v3 <- ggplot(data = data_ebola_cum_cases, 
               mapping = aes(x = date, y = cum_conf_cases, colour = country)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, linetype = "dashed", linewidth = 1.5) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

# create column plot
plot_ebola_col_v3 <- ggplot(data = data_ebola_cum_cases, 
mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_col(alpha = 0.7, linetype = "solid", 
           linewidth = 0.1, position = "stack", width = 0.7) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

geom_point: change standard colors

library(unibeCols)

plot_covid_point_v4 <- ggplot(data = covid_cantons_2020, 
    mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
    scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_line: change standard colors

plot_covid_line_v4 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_col: change standard colors

plot_covid_col_v4 <- ggplot(data = covid_cantons_2020, 
                            mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

Exercise 4F

Change the colour or fill scale of the three plots you created in Exercise 4E.

  1. Point plot: Change fill scale manually.
  2. Line plot: Change colour scale manually.
  3. Column plot: Change fill scale manually.

Exercise 4F: solution

# create point plot
plot_ebola_point_v4 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_point(alpha = 0.7, shape = 22, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

# create line plot
plot_ebola_line_v4 <- ggplot(data = data_ebola_cum_cases, 
               mapping = aes(x = date, y = cum_conf_cases, colour = country)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, linetype = "dashed", linewidth = 1.5) +
  scale_colour_manual(name = "Country",
                      breaks = c("Guinea", "Liberia", "Sierra Leone"),
                      values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                      labels = c("GIN", "LBR", "SLE")) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

# create column plot
plot_ebola_col_v4 <- ggplot(data = data_ebola_cum_cases, 
mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_col(alpha = 0.7, linetype = "solid", 
           linewidth = 0.1, position = "stack", width = 0.7) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

geom_point: scales

plot_covid_point_v5 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", 
                                  "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_line: scales

plot_covid_line_v5 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_col: scales

plot_covid_col_v5 <- ggplot(data = covid_cantons_2020, 
      mapping = aes(x = datum, y = entries, fill = geoRegion, group=geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 600, by = 100),
                     limits = c(0, 600)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

Exercise 4G: scales

Change the scale of the axes of the three plots you created in Exercise 4F.

  1. x-axis: Change the breaks to 29 August, 1 October, 1 December, 1 February, and 1 April.
  2. y-axis of the point and line plots: Change the breaks to 0, 2500, 5000, 7500 and 10000.
  3. y-axis of the column plot: Change the breaks to 0, 2500, 5000, 7500, 10000, 12500 and 15000.

Exercise 4G: solution

# create point plot
plot_ebola_point_v5 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_point(alpha = 0.7, 
             shape = 22, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 2500),
                     limits = c(0, 10000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

# create line plot
plot_ebola_line_v5 <- ggplot(data = data_ebola_cum_cases, 
                             mapping = aes(x = date, y = cum_conf_cases, colour = country)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, linetype = "dashed", linewidth = 1.5) +
  scale_colour_manual(name = "Country",
                      breaks = c("Guinea", "Liberia", "Sierra Leone"),
                      values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                      labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 2500),
                     limits = c(0, 10000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")

# create column plot
plot_ebola_col_v5 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_col(alpha = 0.7, linetype = "solid", 
           linewidth = 0.1, position = "stack", width = 0.7) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 15000, by = 2500),
                     limits = c(0, 15000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases")
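
One detail of the limits argument used above is worth keeping in mind. With the default settings, a continuous position scale turns values outside its limits into missing values, so the affected points, lines, or bars are dropped from the plot. coord_cartesian() only zooms the view and keeps all data. The following minimal sketch illustrates the difference; the data frame toy and the plot names are made up for illustration and are not part of the course data.

```r
library(ggplot2)

# Hypothetical toy data, not part of the course data sets
toy <- data.frame(x = 1:10, y = 1:10)

# Limits set on the scale: values outside 0-5 are censored (set to NA)
p_scale <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(breaks = seq(from = 0, to = 5, by = 1),
                     limits = c(0, 5))

# Limits set on the coordinate system: the view is zoomed, the data are kept
p_coord <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(breaks = seq(from = 0, to = 5, by = 1)) +
  coord_cartesian(ylim = c(0, 5))

# Count the y values that became missing while each plot was built
n_censored_scale <- sum(is.na(layer_data(p_scale)$y))
n_censored_coord <- sum(is.na(layer_data(p_coord)$y))
```

If bars or points disappear after setting limits, widening the limits or switching to coord_cartesian() is usually the first thing to try.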

Themes

Graphic from https://www.geeksforgeeks.org/themes-and-background-colors-in-ggplot2-in-r/

geom_point: themes

plot_covid_point_v6 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")
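
The same theme_bw() + theme(legend.position = "bottom") line is repeated in every plot of this section. As a minimal sketch, the combination can be stored once and added to any plot. The names theme_course, toy, plot_toy_point, and plot_toy_line are hypothetical and only serve this illustration.

```r
library(ggplot2)

# Hypothetical helper: a complete theme plus one tweak, stored once
theme_course <- theme_bw() + theme(legend.position = "bottom")

# Hypothetical toy data, standing in for the course data sets
toy <- data.frame(day = rep(1:3, times = 2),
                  cases = c(3, 8, 15, 2, 6, 12),
                  region = rep(c("A", "B"), each = 3))

# The stored theme is added like any other plot component
plot_toy_point <- ggplot(data = toy,
                         mapping = aes(x = day, y = cases, colour = region)) +
  geom_point(size = 2) +
  theme_course

plot_toy_line <- ggplot(data = toy,
                        mapping = aes(x = day, y = cases, colour = region)) +
  geom_line(mapping = aes(group = region)) +
  theme_course
```

Changing the legend position or the base theme then requires an edit in one place only.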

geom_line: themes

plot_covid_line_v6 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

geom_col: themes

plot_covid_col_v6 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) +
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 600, by = 100),
                     limits = c(0, 600)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

Exercise 4H: themes

Change the theme of the three plots you created in Exercise 4G to theme_bw().

Exercise 4H: solution

# create point plot
plot_ebola_point_v6 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) + 
  geom_point(alpha = 0.7, shape = 22, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 2500),
                     limits = c(0, 10000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

# create line plot
plot_ebola_line_v6 <- ggplot(data = data_ebola_cum_cases, 
                             mapping = aes(x = date, y = cum_conf_cases, colour = country)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, linetype = "dashed", linewidth = 1.5) +
  scale_colour_manual(name = "Country",
                      breaks = c("Guinea", "Liberia", "Sierra Leone"),
                      values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                      labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 2500),
                     limits = c(0, 10000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

# create column plot
plot_ebola_col_v6 <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) +
  geom_col(alpha = 0.7, linetype = "solid", 
           linewidth = 0.1, position = "stack", width = 0.7) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 15000, by = 2500),
                     limits = c(0, 15000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

geom_point: facet

plot_covid_point_facet <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) +
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  theme(panel.spacing = unit(2, "lines")) +
  facet_grid(cols = vars(geoRegion))
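
In a faceted plot the strip above each panel already names the group, so the colour legend repeats information. The following minimal sketch, using a made-up data frame (toy and the plot names are hypothetical), hides the legend and also shows facet_wrap() with free y-axes as an alternative when the groups differ strongly in magnitude.

```r
library(ggplot2)

# Hypothetical toy data with three groups of very different size
toy <- data.frame(day = rep(1:5, times = 3),
                  cases = c(1, 3, 6, 4, 2,
                            10, 30, 60, 40, 20,
                            100, 300, 600, 400, 200),
                  region = rep(c("A", "B", "C"), each = 5))

# One panel per region in a single row, shared y-axis, legend hidden
p_grid <- ggplot(data = toy,
                 mapping = aes(x = day, y = cases, colour = region)) +
  geom_line(linewidth = 1) +
  theme_bw() + theme(legend.position = "none") +
  facet_grid(cols = vars(region))

# Alternative: each panel gets its own y-axis range
p_wrap <- ggplot(data = toy,
                 mapping = aes(x = day, y = cases, colour = region)) +
  geom_line(linewidth = 1) +
  theme_bw() + theme(legend.position = "none") +
  facet_wrap(facets = vars(region), scales = "free_y")
```

A shared y-axis makes magnitudes comparable across panels, whereas free axes make the shape of each curve easier to see. Which one is appropriate depends on the message of the figure.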

geom_line: facet

plot_covid_line_facet <- ggplot(data = covid_cantons_2020, 
                                mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  theme(panel.spacing = unit(2, "lines")) +
  facet_grid(cols = vars(geoRegion))

geom_col: facet

plot_covid_col_facet <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) +
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", 
                                  "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 600, by = 100),
                     limits = c(0, 600)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  theme(panel.spacing = unit(2, "lines")) +
  facet_grid(cols = vars(geoRegion))

Exercise 4I: facet

Create facet grids by country from the three plots you created in Exercise 4H.

Exercise 4I: solution

# create point plot
plot_ebola_point_facet <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, colour = country, fill = country)) +
  geom_point(alpha = 0.7,  
             shape = 22, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", 
                                  "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 2500),
                     limits = c(0, 10000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  facet_grid(cols = vars(country))

# create line plot
plot_ebola_line_facet <- ggplot(data = data_ebola_cum_cases, 
                   mapping = aes(x = date, y = cum_conf_cases, colour = country)) + 
  geom_line(mapping = aes(group = country), 
            alpha = 0.7, linetype = "dashed", linewidth = 1.5) +
  scale_colour_manual(name = "Country",
                      breaks = c("Guinea", "Liberia", "Sierra Leone"),
                      values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                      labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 2500),
                     limits = c(0, 10000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  facet_grid(cols = vars(country))

# create column plot
plot_ebola_col_facet <- ggplot(data = data_ebola_cum_cases, 
  mapping = aes(x = date, y = cum_conf_cases, fill = country, colour = country)) +
  geom_col(alpha = 0.7, linetype = "solid", 
           linewidth = 0.1, position = "stack", width = 0.7) +
  scale_fill_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_colour_manual(name = "Country",
                    breaks = c("Guinea", "Liberia", "Sierra Leone"),
                    values = c(unibeRedS()[1], unibeOceanS()[1], unibeMustardS()[1]),
                    labels = c("GIN", "LBR", "SLE")) +
  scale_x_date(breaks = as.Date(c("2014-08-29", "2014-10-01", "2014-12-01", 
                                  "2015-02-01", "2015-04-01")),
               labels = c("29 August", "1 October", "1 December", "1 February", "1 April"),
               limits = as.Date(c("2014-08-28", "2015-04-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 15000, by = 2500),
                     limits = c(0, 15000)) +
  ggtitle(label = "Confirmed Ebola cases") +
  xlab(label = "Time") +
  ylab(label = "Cum. # of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  facet_grid(cols = vars(country))

Patchwork

Artwork by @allison_horst

geom_point: grid

Install cowplot once, if it is not already installed:

install.packages("cowplot")

Then load the package and arrange the six versions of the point plot in a grid with two rows:

library(cowplot)

plot_covid_point_grid <- plot_grid(plotlist = list(plot_covid_point_v1, plot_covid_point_v2, plot_covid_point_v3, 
                                                   plot_covid_point_v4, plot_covid_point_v5, plot_covid_point_v6),
                                   labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)

geom_line: grid

plot_covid_line_grid <- plot_grid(plotlist = list(plot_covid_line_v1, plot_covid_line_v2, plot_covid_line_v3, 
                                                  plot_covid_line_v4, plot_covid_line_v5, plot_covid_line_v6),
                                  labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)

geom_col: grid

plot_covid_col_grid <- plot_grid(plotlist = list(plot_covid_col_v1, plot_covid_col_v2, plot_covid_col_v3, 
                                                 plot_covid_col_v4, plot_covid_col_v5, plot_covid_col_v6),
                                 labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)

Exercise 4J: grid

Arrange six of the plots you created in the previous exercises into a grid.

Exercise 4J: solution

plot_ebola_line_grid <- plot_grid(plotlist = list(plot_ebola_line_v1, plot_ebola_line_v2, plot_ebola_line_v3, 
                                                  plot_ebola_line_v4, plot_ebola_line_v5, plot_ebola_line_v6),
                                  labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)
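
The grids above are built with plot_grid() from cowplot. A similar layout can also be composed with the patchwork package. The following is a minimal sketch, assuming patchwork is installed (install.packages("patchwork")); the toy data and plot names are hypothetical stand-ins for the course plots.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data and plots, standing in for the course plots
toy <- data.frame(x = 1:10, y = (1:10)^2)

p_point <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_point()
p_line  <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_line()
p_col   <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_col()

# Arrange the plots in one row and tag them V1, V2, V3
plot_toy_grid <- wrap_plots(list(p_point, p_line, p_col), nrow = 1) +
  plot_annotation(tag_levels = "1", tag_prefix = "V")
```

Printing plot_toy_grid draws the three panels side by side. With patchwork, plots can also be combined directly with operators, for example p_point + p_line for side-by-side panels or p_point / p_line for stacked panels.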

Types of geom

Example data: insurance

insurance <- read.csv("data/raw/insurance_with_date.csv")
insurance <- insurance %>% mutate(children = as.factor(children))

head(insurance)
  X age    sex    bmi children smoker    region   charges       date
1 1  59   male 31.790        2     no southeast 13086.341 2001-01-15
2 2  24 female 22.600        0     no southwest  2574.268 2001-01-17
3 3  28 female 25.935        1     no northwest  4411.400 2001-01-22
4 4  22   male 25.175        0     no northwest  2321.417 2001-01-29
5 5  60 female 36.005        0     no northeast 13434.551 2001-02-06
6 6  38 female 28.000        3     no southwest  7262.940 2001-02-17
dim(insurance)
[1] 1338    9

Data adapted from “Machine Learning with R” by Brett Lantz.

Density plot / histogram

Exercise 5A: Can you reproduce these graphs using the insurance.csv dataset?

Density plot / histogram – solution 1

ggplot( insurance , aes(x = bmi, colour = sex, fill = sex ) ) + 
  geom_density( alpha = 0.4 ) +
  theme(text = element_text(size=20), legend.position = "bottom") +
  xlab( expression(paste( "BMI (kg/", m^2,")")) ) + 
  scale_colour_manual(name = "" , values=c("female"=unibePastelS()[1],
                               "male"=unibeIceS()[1]), labels = c("Female", "Male")) +
  scale_fill_manual(name = "", values=c("female"=unibePastelS()[1],
                               "male"=unibeIceS()[1]), labels = c("Female", "Male")) 

Density plot / histogram – solution 2

ggplot( insurance ) + 
  geom_histogram( aes(x = charges, y = after_stat(density), colour = sex, fill = sex ),
                  alpha = 0.4, bins = 100 ) +
  geom_density( aes(x = charges, colour = sex), linewidth = 1.5 ) +
  theme(text = element_text(size=20), legend.position = "top") +
  xlab( "Charges in Dollar" ) + 
  scale_colour_manual(name = "" , values=c("female"=unibePastelS()[1],
                               "male"=unibeIceS()[1]), labels = c("Female", "Male")) +
  scale_fill_manual(name = "", values=c("female"=unibePastelS()[1],
                               "male"=unibeIceS()[1]), labels = c("Female", "Male")) +
  geom_vline(aes(xintercept = median(charges)), color = unibeRedS()[1], linewidth = 1)
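
Solution 2 maps y = after_stat(density) in the histogram layer. By default a histogram shows counts, whereas a density curve integrates to one, so the two would be on different y scales. The following minimal sketch uses the built-in faithful data set as a stand-in for the insurance data and checks that the rescaled bars have a total area of one; the object names are hypothetical.

```r
library(ggplot2)

# Built-in data set used as a stand-in for the insurance data
p_overlay <- ggplot(data = faithful, mapping = aes(x = eruptions)) +
  # Bars rescaled from counts to densities
  geom_histogram(mapping = aes(y = after_stat(density)),
                 bins = 30, alpha = 0.4) +
  # The density curve is now on the same y scale as the bars
  geom_density(linewidth = 1)

# Total area of the bars: sum of height times width over all bins
hist_data <- layer_data(p_overlay, 1)
bar_area <- sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
```

Without after_stat(density), the bar heights would be counts, and the density curve would appear as a nearly flat line at the bottom of the plot.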

Quantiles

Exercise 5B: Can you reproduce this graph using the insurance.csv dataset?

Quantiles – solution

ggplot( insurance , aes(x = age, y = bmi, color =smoker) ) + 
  geom_point(  ) +
  geom_quantile(  ) +
  theme(text = element_text(size=20), legend.position = "top") +
  xlab( "Age (years)" ) + ylab( expression(paste( "BMI (kg/", m^2,")")) ) + 
  scale_colour_manual(name = "" , values=c("no"=unibeRedS()[1],
                               "yes"=unibeIceS()[1]), labels = c("No", "Yes")) +
  scale_fill_manual(name = "" , values=c("no"=unibeRedS()[1],
                               "yes"=unibeIceS()[1]), labels = c("No", "Yes")) 

Violin plot / boxplot

Exercise 5C: Can you reproduce these graphs using the insurance.csv dataset?

Violin plot / boxplot – solution 1

ggplot( insurance , aes(x = smoker, y = charges ) ) + 
  ylab( "Charges ($)" ) +
  geom_violin(  )

Violin plot / boxplot – solution 2

ggplot( insurance , aes(x = smoker, y = charges ) ) + 
  geom_boxplot(  ) + 
  ylab( "Charges ($)" ) + 
  coord_flip()
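
Solution 2 uses coord_flip() to obtain horizontal boxes. In ggplot2 version 3.3.0 and later, a similar horizontal boxplot can be produced by mapping the categorical variable to y directly. The following minimal sketch compares both approaches on made-up data; toy and the plot names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data, standing in for the insurance data
toy <- data.frame(smoker = rep(c("no", "yes"), each = 20),
                  charges = c(seq(from = 1000, to = 20000, length.out = 20),
                              seq(from = 15000, to = 50000, length.out = 20)))

# Approach used in solution 2: vertical boxplot, then flip the coordinates
p_flip <- ggplot(data = toy, mapping = aes(x = smoker, y = charges)) +
  geom_boxplot() +
  coord_flip()

# Alternative: map the categorical variable to y, no coord_flip() needed
p_direct <- ggplot(data = toy, mapping = aes(x = charges, y = smoker)) +
  geom_boxplot()

# Both approaches compute the same medians
medians_flip <- layer_data(p_flip)$middle
medians_direct <- layer_data(p_direct)$xmiddle
```

With the direct mapping, axis labels are set with xlab() and ylab() as they appear on screen, which avoids the swapped roles that coord_flip() introduces.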

Cheatsheet

Practice makes perfect

Community driven projects for practicing

Images by @tanyashapiro, @gkaramanis, @cscherer