Basic Statistics and Projects in R

Introduction to R, the tidyverse, and data wrangling

Christian Althaus, Alan Haynes

General introduction

Course objectives

  • Organize research projects in R following the principles of open science and reproducible research
  • Perform descriptive analysis of data sets and understand the fundamentals of modern data visualization
  • Apply basic inferential statistics to data sets from human and animal health

Follow-up courses

We offer the following follow-up courses within the theme Core Methods of the Public Health Sciences Course Program:

Date Course Objectives
19-23 June 2023 Introduction to Epidemiology and Study Design Concepts and measures to quantify the frequency of health outcomes and their associations with exposures, epidemiological study designs and their potential sources of bias, concepts of causality
26 June 2023 Diagnostic Test Evaluation Concepts of diagnostic test evaluation with examples from human and animal health, interpretation of diagnostic test results, reporting of diagnostic test accuracy studies
23-27 October 2023 Applied Regression Modeling in R Different types of regression models, model selection, models for continuous, binary and categorical outcomes, and events

Lecturers and course assistants

  • Alan Haynes, CTU Bern
  • Christian Althaus, Institute of Social and Preventive Medicine
  • Judith Bouman, Institute of Social and Preventive Medicine
  • Martin Wohlfender, Institute of Social and Preventive Medicine
  • Ben Spycher, Institute of Social and Preventive Medicine
  • Beatriz Vidondo, Veterinary Public Health Institute (VPHI)
  • Guy Schnidrig, Veterinary Public Health Institute (VPHI)

Timetable

Day Time Topic Lecturer(s)
Monday 09:00-12:00 Projects in R: Introduction to R, the tidyverse, and data wrangling Christian Althaus, Alan Haynes
Monday 13:00-17:00 Projects in R: Data visualization with the tidyverse Christian Althaus, Judith Bouman, Martin Wohlfender
Tuesday 09:00-12:30 Projects in R: Reproducibility and GitHub Christian Althaus, Alan Haynes
Thursday 09:00-12:30 Basic Statistics: Inference about the mean Ben Spycher
Thursday 13:30-17:00 Basic Statistics: Non-normal and dependent/paired data Beatriz Vidondo
Friday 09:00-12:30 Basic Statistics: Inference about proportions and rates Ben Spycher
Friday 13:30-17:00 Basic Statistics: Continue R project with a guided data analysis Ben Spycher, Beatriz Vidondo

Today

Time Duration Topic Content
09:00-09:10 10 min General introduction Lecturers and course program
09:10-09:40 30 min Introduction to R and RStudio Hands-on, objects, functions, etc.
09:40-10:00 20 min R projects Files, folders, names, and templates
10:00-10:15 15 min Base R and tidyverse Concepts
10:15-10:30 15 min Data Read, CSV, Excel, REDCap
10:30-10:50 20 min Coffee break
10:50-11:00 10 min Data types
11:00-12:00 60 min Data wrangling
12:00-13:00 60 min Lunch break
13:00-13:30 30 min Data visualization Fundamentals
13:30-14:30 60 min ggplot2 General ideas and basic graphs
14:30-14:50 20 min Coffee break
14:50-15:50 60 min ggplot2 Fancify basic graphs
15:50-16:10 20 min Coffee break
16:10-17:00 50 min Panels Other types of geoms

Course structure and material

Installing R, RStudio, and Git

Mandatory steps (should be completed by now)

  1. Download and install R.
  2. Download and install RStudio.
  3. Install Git using the following instructions.
  4. If you don’t already have one, don’t forget to create a GitHub account.

Optional steps (can be done today)

  1. Make sure RStudio knows about Git by following the corresponding section here.
  2. Install the usethis package for R using the following command: install.packages("usethis")
  3. Set up Git using the following command: usethis::use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")
  4. Generate a personal access token (PAT) and store your PAT as described here.

Any other questions?

Introduction to R and RStudio

What is R?

  • R is a programming language for statistical computing and graphics that first appeared in 1993.
  • R is an open-source implementation of the S language, which was developed at Bell Laboratories in the 1970s.
  • The aim of the S language, as expressed by John Chambers, is “to turn ideas into software, quickly and faithfully”.
  • The statisticians Ross Ihaka and Robert Gentleman developed R at the University of Auckland, New Zealand.
  • In 1995, the statistician Martin Mächler from ETH Zurich convinced Ihaka and Gentleman to make R free and open source under the GNU General Public License.
  • R is both open source and open development.

Why R?

  • Free and open source (popular in academic research)
  • High-level programming language designed for statistical computing (popular in computational biology, bioinformatics, epidemiology and public health sciences)
  • Reproducibility
  • Powerful and flexible - especially for data wrangling and visualization
  • Extensive add-on software (packages)
  • Strong community

(Why not R?)

  • Little centralized support, relies on online community and package developers
  • Can be cumbersome to update
  • Slower than more traditional programming languages (C/C++, Python)

R vs. RStudio

R is a programming language that runs computations. RStudio is an integrated development environment (IDE) that provides an interface to R and adds many convenient features and tools, e.g.:

  • autocomplete, syntax, and spell checking functionality,
  • easier working with file paths (via projects and the here package),
  • integration with version control systems (e.g., GitHub, SVN).

Note that RStudio (the company) recently changed its name to Posit. RStudio (the IDE) remains unchanged.

RStudio

RStudio cheatsheet

Use the cheatsheet to find your way in RStudio: https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf

Working with R in RStudio

  1. Editor (top left): Press the Run button or select code and press Ctrl + Enter (⌘ + Enter on macOS).
    • Analysis script
    • Reproducibility
  2. Console (bottom left): Simply press Enter.
    • R as a calculator
    • Trying out things before adding to the editor

The result of your command(s) will appear in the tab Console if the commands are intended to print something, and/or in the tab Plots if the commands generate a plot.

R as a calculator

In the Console, R can simply be used as a calculator:

2 + 3
[1] 5
2 * 4
[1] 8
2^5
[1] 32
6 / 2 * (1 + 2)
[1] 9

Commenting in scripts

It is recommended to comment your scripts using #:

# This script illustrates R as a calculator

6 / 2 * (1 + 2) # Comments can also be placed to the right of code.

Objects in R

Objects can be a single piece of data (e.g., 3.14 or "Bern"), or they can consist of structured data.

Object classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

Class Description Example(s)
numeric Any real number 1, 3.14, 8.8e6
character Individual characters or strings, quoted "a", "Hello, World!"
factor Categorical/qualitative variables Ordered values of economic status
logical Boolean variables TRUE and FALSE
Date/POSIXct Calendar dates and times "2023-06-05"

Other object classes are array, data.frame, list, and tibble (similar to data.frame).
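You can inspect the class of any object with the class() function, e.g.:

```r
# class() returns the class of an object
class(3.14)        # "numeric"
class("Bern")      # "character"
class(TRUE)        # "logical"
class(factor("a")) # "factor"
class(Sys.Date())  # "Date"
```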

Assigning values to objects

R uses <- to assign values to an object name (you might also see = used, but this is not best practice).

Object names are case-sensitive, i.e., X and x are different.

x <- 2
x
[1] 2
x * 4
[1] 8
x + 2
[1] 4

Combine values

The function c() combines/concatenates single R objects into a vector (or list) of R objects:

x <- c(1, 4, 6, 8)
x
[1] 1 4 6 8
sum(x) # sum() is another function that returns the sum of vector elements.
[1] 19

You can apply functions to entire vectors of numbers very easily.

x + 5
[1]  6  9 11 13

Writing your own functions

Every function in R has three basic parts: a name, a body of code, and a set of arguments. To make your own function, you need to replicate these parts and store them in an R object, which you can do with the function function().
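As a minimal sketch (the function name and arguments are made up for illustration), a function that computes the body mass index could look like this:

```r
# name: bmi; arguments: weight (kg) and height (m); body: the expression in braces
bmi <- function(weight, height) {
  weight / height^2 # the value of the last expression is returned
}

bmi(70, 1.80) # approximately 21.6
```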

The R ecosystem

The built-in functionality of base R can be expanded with packages that others have developed and published.

The Comprehensive R Archive Network (CRAN) has been the main source of R packages. Nowadays, GitHub also contains many packages and has arguably become the primary location for package development. Some packages, e.g., tidyverse, are so-called meta-packages - they load a collection of other packages.

Install new packages as follows:

install.packages("packagename")

Packages must be loaded each R session to give access to their functionality:

library(packagename)

Getting help with R

The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages.

Cheat sheets exist for many packages and topics: https://rstudio.github.io/cheatsheets

Help for R is abundant. 99.9% of your questions will have been asked before, so Google is your friend, or ask on Twitter/Mastodon using the #rstats tag.

Ask ChatGPT

The Rollercoaster

Further resources

There are numerous online tutorials and books on R, RStudio with specific applications to epidemiology, public health, and data science:

R projects

Why use projects in R(Studio)?

The use of projects (.Rproj files) is fundamental to organized coding and project management. There are four main reasons why you should use projects essentially 100% of the time while using RStudio:

  • It can take less than 30 seconds to set up.
  • It keeps all relevant files in the same place.
  • It sets the working directory, so you can use relative paths.
  • It allows for version control (see tomorrow).
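With a project open, paths can be built relative to the project root instead of being hard-coded. A small sketch (the file name is hypothetical; file.path() is base R, here() comes from the here package mentioned above):

```r
# base R: build a path relative to the working directory
file.path("data", "raw", "original_data.csv")
# [1] "data/raw/original_data.csv"

# the here package resolves the same path relative to the project root,
# regardless of the current working directory:
# library(here)
# here("data", "raw", "original_data.csv")
```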

Folder structures

Using consistent folder structures across projects helps you work more efficiently. An example of the folder structure for a project looks like this:

  • R
    • 00_main.R
    • 01_cleaning.R
    • 02_analysis.R
    • 03_plotting.R
  • data
    • processed
      • cleaned_data.csv
      • processed_data.rds
    • raw
      • original_data.csv
      • spreadsheet_data.xlsx
  • output
    • figures
      • 01_figure.png
      • 02_figure.pdf
    • tables
      • 01_table.csv
      • 02_table.rds
  • products
    • manuscript
      • manuscript.docx
      • manuscript.html
      • manuscript.pdf
      • manuscript.qmd
    • report
      • report.html
      • report.qmd
    • slides
      • slides.html
      • slides.qmd
  • .gitignore
  • README.md
  • project-template.Rproj

Naming files

Bad examples

  • myabstract.docx
  • Long file name using “spaces” & punctuation!.xlsx
  • figures 1.png
  • SDEFI7_jknsfol.txt

Better

  • 2023-02-15_abstract-conference-X.docx
  • still-long-but-no-punctuation-or-spaces.xlsx
  • fig01_scatter-mpg-vs-vol.png
  • more-meaningful-name.txt

Naming R files

Analysis scripts can get looooooooong… Don’t be afraid to break them up into smaller chunks.

Use sequential numbers and descriptive names, e.g.:

  • 01_cleaning.R cleans your data,
  • 02_analysis.R performs your analysis,
  • 03_plotting.R plots the results of your analysis.

Sequential numbers allow you to sort the files according to the sequence in which you run them.

Descriptive names inform you of what is actually in there.

You can use a main or master file (00_main.R) to run all other files and create a reproducible analysis.
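A sketch of such a main file, using the script names from above:

```r
# 00_main.R: run the whole analysis in the correct order
source("R/01_cleaning.R")
source("R/02_analysis.R")
source("R/03_plotting.R")
```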

Also see How to name files from Jennifer Bryan and The tidyverse style guide.

Add README files

Adding README.md (Markdown) or README.txt (plain text) files to your project folder and subfolders can be useful to describe your project and/or the content of folders, and provide instructions.

Tomorrow, you will learn more about Markdown (and Quarto).

Exercise 1: Create a project

We have set up a template project for you, including a directory structure: https://github.com/ISPMBern/project-template

  1. Download the template (click Code and then Download ZIP).
  2. Unzip the file to a suitable directory on your computer.
  3. Rename the folder to something more suitable.
  4. Rename the .Rproj file to something more meaningful to you (the same as the folder?).
  5. Open the project in RStudio (double-click the .Rproj icon).

You will use this project for the rest of the course…

Required packages for the next exercises

To work on the next exercises, you have to install the following packages:

  • usethis - Workflow package
  • gitcreds - Queries Git credentials from R
  • here - Easy file referencing
  • tidyverse - A set of packages
  • medicaldata - Medical data sets
  • cowplot - Features to create publication-quality figures

Simply type install.packages("packagename"). RStudio will also offer to install a missing package when you open a script that loads it.

Tidyverse and data wrangling

Base R

Base R is a collection of ca. 25 packages that have been developed since R first appeared (ca. 30 years ago)

This age often shows in the syntax - inconsistent argument names and/or ordering of arguments, and it is sometimes possible to tell which features were afterthoughts

Syntax also varies widely across add-on packages

Enter the tidyverse…

What is the tidyverse?

A group of R packages designed in a consistent manner along the principles of tidy data

Primarily contains packages for data import (readr), manipulation (“wrangling”; dplyr, forcats, stringr) and visualization (ggplot2)

Load the whole thing via library(tidyverse) or individual packages as usual (e.g. library(ggplot2))

Tidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst


Making small(!!) datasets by hand

Use case: lookup tables.

data.frame(code = c(0, 1),
           label = c("male", "female"))
  code  label
1    0   male
2    1 female

Fine for very small datasets, but unwieldy with many variables and/or many observations

tibble::tribble(
  ~code, ~label,
  0,     "male",
  1,     "female"
)
# A tibble: 2 × 2
   code label 
  <dbl> <chr> 
1     0 male  
2     1 female

Much nicer… immediately clear which label belongs to which code

Practically no functional difference between a tibble and a data.frame, just a slightly different print method

Getting data into R

R has a wide range of tools for importing data.

Base R

  • read.csv
  • read.csv2
  • read.delim

tidyverse (readr)

  • read_csv
  • read_csv2
  • read_delim

Others

  • readxl::read_xlsx
  • REDCapR::redcap_read
  • haven::read_spss

And even more

  • secuTrialR::read_secuTrial
  • odbc::dbConnect
  • httr2

Data sometimes already exists in R…

Many published datasets already exist in R, either in the basic installation or via a package

data(mtcars)
data(iris) # very popular in examples

From packages, e.g. medicaldata

install.packages("medicaldata")
library(medicaldata)
strep_tb
# A tibble: 107 × 13
   patient_id arm     dose_strep_g dose_PAS_g gender baseline_condition
   <chr>      <fct>          <dbl>      <dbl> <fct>  <fct>             
 1 0001       Control            0          0 M      1_Good            
 2 0002       Control            0          0 F      1_Good            
 3 0003       Control            0          0 F      1_Good            
 4 0004       Control            0          0 M      1_Good            
 5 0005       Control            0          0 F      1_Good            
 6 0006       Control            0          0 M      1_Good            
 7 0007       Control            0          0 F      1_Good            
 8 0008       Control            0          0 M      1_Good            
 9 0009       Control            0          0 F      2_Fair            
10 0010       Control            0          0 M      2_Fair            
# ℹ 97 more rows
# ℹ 7 more variables: baseline_temp <fct>, baseline_esr <fct>,
#   baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
#   rad_num <dbl>, improved <lgl>

Getting data into R in practice

We place the dataset in the appropriate folder (data/raw) and read it in with the appropriate function (e.g. read_csv)

library(readr)
library(here)
data <- read_csv(here("data", "raw", "MyData.csv"))

Depending on the file, you might need read_csv2, which is configured for e.g. German environments where CSVs are actually semicolon (;) separated because the comma is used as the decimal separator in numbers.

In base R

data <- read.csv(here("data", "raw", "MyData.csv"))

Virtually identical… readr is slightly faster and automatically converts some variable types

Getting data into R in practice

More common(?): Excel files…

library(readxl) # informal tidyverse member
data <- read_xlsx(here("data", "raw", "MyData.xlsx"))

Once you’ve loaded a dataset, it’s good practice to inspect the data to see that it’s loaded correctly

str(strep_tb)
tibble [107 × 13] (S3: tbl_df/tbl/data.frame)
 $ patient_id         : chr [1:107] "0001" "0002" "0003" "0004" ...
 $ arm                : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ dose_strep_g       : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
 $ dose_PAS_g         : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
 $ gender             : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
 $ baseline_condition : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ baseline_temp      : Factor w/ 4 levels "1_98-98.9F","2_99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
 $ baseline_esr       : Factor w/ 4 levels "1_0-10","2_11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
 $ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
 $ strep_resistance   : Factor w/ 3 levels "1_sens_0-8","2_mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ radiologic_6m      : Factor w/ 6 levels "6_Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...
 $ rad_num            : num [1:107] 6 5 5 5 5 6 5 5 5 5 ...
 $ improved           : logi [1:107] TRUE TRUE TRUE TRUE TRUE TRUE ...

XLSX vs CSV vs database (e.g. REDCap)

Watch Darren Dahly open an Excel file

Excel is easy to use, accessible, …

…but it tries to be clever, e.g. auto-formatting dates and gene names (30% of genetics papers contain mangled gene names in their tables)

CSV files can be created by Excel and other software (a common export format from databases) and have fewer issues with formatting

Both lack data validation protocols that would give more control over the entered data

Databases provide data validation, many are Human Research Act compliant (MS Access is not), many have ways to export data directly from the database to R, or simple ways to import exports

What do you think this is?

[1] "\xfc" "\xe4" "\xe9" "\xe0"

Dealing with special characters

[1] "\xfc" "\xe4" "\xe9" "\xe0"
[1] "ü" "ä" "é" "à"

Common issue when working in Switzerland - ä, ö, ü, é, è, à, etc

File encoding influences exactly how (special) characters are represented in a file

R needs that information to make sense of the data

# base
read.csv("path/to/file.csv", fileEncoding = "UTF-8")
# tidyverse (readr)
read_csv("path/to/file.csv",
         locale = locale(encoding = "UTF-8"))

Free text, check the encoding!

Pro-tip: use Notepad++ to discover the encoding used, and possibly to convert to a different encoding (saving it to a different file…)

Use English as much as possible - saves time dealing with special characters AND no need to translate tables for publications

Your turn!

Read file insurance_with_date.csv into R and explore it a little.

  • How many observations and variables does it have?
  • What types of variables does it include?
  • What difference is there when importing via the tidyverse (readr's read_csv) vs base R (read.csv)?

You have 5 minutes… go!

Solution

library(readr)
dat <- read_csv("data/raw/insurance_with_date.csv")
str(dat)
spc_tbl_ [1,338 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ X       : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
 $ age     : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
 $ sex     : chr [1:1338] "male" "female" "female" "male" ...
 $ bmi     : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
 $ children: num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
 $ smoker  : chr [1:1338] "no" "no" "no" "no" ...
 $ region  : chr [1:1338] "southeast" "southwest" "northwest" "northwest" ...
 $ charges : num [1:1338] 13086 2574 4411 2321 13435 ...
 $ date    : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
 - attr(*, "spec")=
  .. cols(
  ..   X = col_double(),
  ..   age = col_double(),
  ..   sex = col_character(),
  ..   bmi = col_double(),
  ..   children = col_double(),
  ..   smoker = col_character(),
  ..   region = col_character(),
  ..   charges = col_double(),
  ..   date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
dat2 <- read.csv("data/raw/insurance_with_date.csv")
str(dat2)
'data.frame':   1338 obs. of  9 variables:
 $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int  59 24 28 22 60 38 51 44 47 29 ...
 $ sex     : chr  "male" "female" "female" "male" ...
 $ bmi     : num  31.8 22.6 25.9 25.2 36 ...
 $ children: int  2 0 1 0 0 3 0 0 1 2 ...
 $ smoker  : chr  "no" "no" "no" "no" ...
 $ region  : chr  "southeast" "southwest" "northwest" "northwest" ...
 $ charges : num  13086 2574 4411 2321 13435 ...
 $ date    : chr  "2001-01-15" "2001-01-17" "2001-01-22" "2001-01-29" ...

Data types

Text/character/string - in R, character

sex <- c("male", "female", "female", "male")

cannot (always) be used in models; generally needs converting to another format

Encoded categorical information - factors

from a known list of options…

fct <- factor(sex, levels = c("male", "female", "non-binary"))
fct
[1] male   female female male  
Levels: male female non-binary

from encoded data…

factor(c(0, 1), levels = 0:2, labels = c("male", "female", "non-binary"))
[1] male   female
Levels: male female non-binary

factors are more suitable for models than text

Numbers - numeric

num <- 1:4
num
[1] 1 2 3 4
c(1,1,2,3)
[1] 1 1 2 3
seq(1, 10, 2)
[1] 1 3 5 7 9
seq(1, 10, length.out = 4)
[1]  1  4  7 10
num * num
[1]  1  4  9 16
num * 2
[1] 2 4 6 8
num %*% num # matrix multiplication (scalar product)
     [,1]
[1,]   30

Binary/Boolean/Yes/No variables - logical

TRUE
[1] TRUE
(lgl <- c(TRUE, FALSE, FALSE, TRUE))
[1]  TRUE FALSE FALSE  TRUE
num > 2
[1] FALSE FALSE  TRUE  TRUE

Indicates whether a condition is true or false, yes or no, e.g. death

Keeping information together - dataframes

For when all elements have the same length…

df <- data.frame(sex = sex,
                 fct = fct,
                 num = num,
                 lgl = lgl)
tibble::tibble(sex = sex,
               fct = fct,
               num = num,
               lgl = lgl)
# A tibble: 4 × 4
  sex    fct      num lgl  
  <chr>  <fct>  <int> <lgl>
1 male   male       1 TRUE 
2 female female     2 FALSE
3 female female     3 FALSE
4 male   male       4 TRUE 

Keeping information together - lists

For when the objects have different lengths…

lst <- list(letter = "a",
     numbers = rnorm(10),
     data = df)
lst
$letter
[1] "a"

$numbers
 [1]  1.6767047 -1.8662187  1.0433395 -0.6191508  0.4764069  2.3812988
 [7] -0.7050137  1.1650151  0.6387094 -0.6682568

$data
     sex    fct num   lgl
1   male   male   1  TRUE
2 female female   2 FALSE
3 female female   3 FALSE
4   male   male   4  TRUE

Getting elements out again is the same for both

df$sex
[1] "male"   "female" "female" "male"  
lst$letter
[1] "a"

Piping: %>%, |>

When googling, you will encounter pipes. They enable chaining operations together.

Two main varieties:

%>% is from the magrittr package, introduced ca 2014

|> was added to base R in 2021 (v4.1.0)

data |> 
  mutate(new_var = rnorm(10)) |>
  rename(random = new_var) |> 
  etc()

Especially useful in data wrangling…

Without pipes

Essentially the same code as the last slide, just in base R

Nesting calls…

etc(
  rename(
    mutate(data,
           new_var = rnorm(10)),
    random = new_var
  )
)

…or saving intermediate objects…

tmp <- mutate(data, new_var = rnorm(10))
tmp <- rename(tmp, random = new_var)
etc(tmp)

Data wrangling with dplyr and the tidyverse

Most tidyverse functions are named after verbs…

  • readr::read_csv reads a Comma-Separated-Value file
  • dplyr::filter keeps observations matching some criteria
  • dplyr::mutate modifies the data (change existing, add new variables)
  • dplyr::rename renames variables
  • stringr::str_detect detects whether a string contains a particular piece of text (regular expression - regex)

library(dplyr)

Selecting variables

The tidyverse offers a range of methods to select variables and these methods are used in many of the functions that we will discuss.

strep_tb |> select(patient_id, arm, gender)
strep_tb |> select(patient_id:gender, last_col())
strep_tb |> select(1:2, 13)
vars <- c("patient_id", "arm", "gender")
strep_tb |> select(all_of(vars))

Base R examples

strep_tb[, c("patient_id", "arm", "gender")]
strep_tb[, c(1, 2, 13)]
vars <- c("patient_id", "arm", "gender")
strep_tb[, vars]

Selecting variables

Also by variable class or aspects of the variable name

strep_tb |> select(where(is.factor))
strep_tb |> select(contains("i")) # contains a literal string
strep_tb |> select(matches("ll")) # matches a regex
strep_tb |> select(starts_with("b"))

More tricky with base R

strep_tb[, sapply(strep_tb, is.factor)]
strep_tb[, grepl("i", names(strep_tb))]
strep_tb[, grepl("^b", names(strep_tb))]

Selecting observations

filtering

strep_tb |> filter(arm == "Control")
strep_tb |> filter(dose_strep_g > 0)

strep_tb |> filter(improved)

With base R

strep_tb[strep_tb$arm == "Control", ]
strep_tb[strep_tb$dose_strep_g > 0, ]
strep_tb[strep_tb$improved, ]
subset(strep_tb, improved)

Selecting observations

slicing

# specific rows
strep_tb |> slice(1, 2, 3, 5, 8, 13)

# first 10
strep_tb |> slice_head(n = 10)
strep_tb |> slice_head(n = -10) # all rows except the last 10

# first 10% of observations
strep_tb |> slice_head(prop = 0.1)

# last 10
strep_tb |> slice_tail(n = 10)

Not so different with base R…

strep_tb[c(1, 2, 3, 5, 8, 13), ] # specific rows
strep_tb[1:10, ] # first 10
head(strep_tb, n = 10) # or strep_tb |> head(n = 10)
strep_tb[(nrow(strep_tb) - 9):nrow(strep_tb), ] # last 10
tail(strep_tb, n = 10) # or strep_tb |> tail(n = 10)

Pay attention to ordering in the dataframe
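dplyr::arrange sorts a dataframe explicitly, so that "first" and "last" rows are well defined before slicing. A small made-up example:

```r
library(dplyr)

df <- data.frame(id = c(3, 1, 2), value = c(30, 10, 20))

# sort by id, then take the last row: the row with the largest id
df |>
  arrange(id) |>
  slice_tail(n = 1)
```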

Modifying data

mutate is your friend

strep_tb |> 
  mutate(dose_strep_g = dose_strep_g + 2)

Very simple in base too…

strep_tb$dose_strep_g <- strep_tb$dose_strep_g + 2

Not restricted to a single change

strep_tb |> 
  mutate(dose_strep_g = dose_strep_g + 2, 
         control = arm == "Control")

In base R, something similar can be done with the rarely used within function

within(strep_tb, {
  dose_strep_g <- dose_strep_g + 2
  control <- arm == "Control"
})
# A tibble: 107 × 14
   patient_id arm     dose_strep_g dose_PAS_g gender baseline_condition
   <chr>      <fct>          <dbl>      <dbl> <fct>  <fct>             
 1 0001       Control            2          0 M      1_Good            
 2 0002       Control            2          0 F      1_Good            
 3 0003       Control            2          0 F      1_Good            
 4 0004       Control            2          0 M      1_Good            
 5 0005       Control            2          0 F      1_Good            
 6 0006       Control            2          0 M      1_Good            
 7 0007       Control            2          0 F      1_Good            
 8 0008       Control            2          0 M      1_Good            
 9 0009       Control            2          0 F      2_Fair            
10 0010       Control            2          0 M      2_Fair            
# ℹ 97 more rows
# ℹ 8 more variables: baseline_temp <fct>, baseline_esr <fct>,
#   baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
#   rad_num <dbl>, improved <lgl>, control <lgl>

Same change to many variables

add two to all dose_* variables

strep_tb |> 
  mutate(across(starts_with("dose"), ~ .x + 2))

creating new variables

strep_tb |> 
  mutate(across(starts_with("dose"), ~ .x + 2, .names = "{.col}_plus2"))

more generic, using variable class

strep_tb |> 
  mutate(across(where(is.numeric),  ~ .x + 2, .names = "{.col}_plus2"))

Not useful here, but handy for e.g. factors (examples later)

One possible base method

for(var in c("dose_strep_g", "dose_PAS_g")){
  strep_tb[, paste0(var, "_plus2")] <- strep_tb[, var] + 2
}

Conditional modifications

Sometimes you need to do something under one circumstance, something else under another

all individuals' measurements were short, so the same correction applies to everyone

strep_tb |> 
  mutate(dose_strep_g_corr = dose_strep_g + 2)

create text for male/female

strep_tb |> 
  mutate(txt = if_else(gender == "M", 
                       # when TRUE
                       "Male",
                       # when FALSE
                       "Female"))

We want some specific text in some cases

strep_tb |> 
  mutate(txt = case_when(
      # Male, Streptomycin
      gender == "M" & arm == "Streptomycin" ~ "M Strepto",
      # Female, Streptomycin
      gender == "F" & arm == "Streptomycin" ~ "F Strepto",
      # all others are OK
      TRUE ~ "Control")
    )
# initialize the variable
strep_tb$txt <- "Control"

# replace the values for males in the streptomycin group
strep_tb$txt[
  with(strep_tb, gender == "M" & arm == "Streptomycin") # which cases
  ] <- "M Strepto"


# replace the values for females in the streptomycin group, using a temporary variable
to_change <- with(strep_tb, gender == "F" & arm == "Streptomycin")

strep_tb$txt[to_change] <- "F Strepto"

# strep_tb$x[to_change] <- strep_tb$x[to_change] * 2

rm(to_change) # clean up

hmmm… tidyverse syntax is much nicer!?

Working with strings: stringr

The stringr package contains functions specifically for working with strings.

Most functions start with str_.

library(stringr)
txt <- "A siLLy exAmple "

Changing case

str_to_lower(txt)
[1] "a silly example "
str_to_sentence(txt)
[1] "A silly example "
str_to_title(txt)
[1] "A Silly Example "
str_to_upper(txt)
[1] "A SILLY EXAMPLE "

Lengths

str_length(txt)
[1] 16
str_count(txt, "\\w+")
[1] 3

Working with strings: stringr

Remove white space

str_squish(txt)
[1] "A siLLy exAmple"

Substrings

str_sub(txt, 3)
[1] "siLLy exAmple "
str_sub(txt, 3, -4)
[1] "siLLy exAmp"
word(txt, 2)
[1] "siLLy"

Replacements

str_replace(txt, "e", "XX")
[1] "A siLLy XXxAmple "
str_replace_all(txt, "e", "XX")
[1] "A siLLy XXxAmplXX "

Working with strings: stringr

Detect a substring

str_detect(txt, "x")
[1] TRUE
str_detect(txt, "z")
[1] FALSE

Splitting

str_split(txt, " ")
[[1]]
[1] "A"       "siLLy"   "exAmple" ""       

Using stringr within mutate

strep_tb |> 
  mutate(txt = as.character(baseline_condition),
         upper = str_to_upper(txt),
         good = str_detect(txt, "Good"),
         no_number = str_replace(txt, "^[[:digit:]]_", "")) |> 
  select(txt:no_number) |> unique()
# A tibble: 3 × 4
  txt    upper  good  no_number
  <chr>  <chr>  <lgl> <chr>    
1 1_Good 1_GOOD TRUE  Good     
2 2_Fair 2_FAIR FALSE Fair     
3 3_Poor 3_POOR FALSE Poor     

Working with factors: forcats

The forcats package contains functions specifically for working with factors.

Functions (almost) all begin with fct_.

fac <- strep_tb$baseline_condition[c(1, 15, 29)]
fac
[1] 1_Good 2_Fair 3_Poor
Levels: 1_Good 2_Fair 3_Poor

Reverse the levels (particularly useful when making plots)

library(forcats)
fct_rev(fac)
[1] 1_Good 2_Fair 3_Poor
Levels: 3_Poor 2_Fair 1_Good
# factor(fac, 
#        levels = rev(levels(fac)))

Changes the order in which levels are shown, e.g. in tables

Working with factors: forcats

Change level names

fct_recode(fac,
           # new = "old"
           Good = "1_Good",
           Fair = "2_Fair",
           Poor = "3_Poor")
[1] Good Fair Poor
Levels: Good Fair Poor
# factor(as.character(fac),
#        levels = levels(fac),
#        labels = c("Good", "Fair", "Poor"))

This is also possible by regular expression (regex)

fct_relabel(fac,
            str_replace, # function to use
            # additional arguments to the function
            pattern = "^[123]_", # regex for "starts with 1, 2 or 3 and is followed by _"
            replacement = "" # replace with nothing
            ) 
[1] Good Fair Poor
Levels: Good Fair Poor
# factor(as.character(fac),
#        levels = levels(fac),
#        labels = gsub("^[123]_", "", levels(fac)))

Working with factors: forcats

Within mutate

strep_tb |> 
  mutate(baseline_condition_new = fct_relabel(baseline_condition,
                                              str_replace, 
                                              pattern = "^[[:digit:]]_", # "any leading digit followed by underscore"
                                              replacement = "")) |> 
  select(baseline_condition, baseline_condition_new) |> 
  str()
tibble [107 × 2] (S3: tbl_df/tbl/data.frame)
 $ baseline_condition    : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ baseline_condition_new: Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...

Do it to all factors

strep_tb |> 
  mutate(across(where(is.factor), 
                ~ fct_relabel(.x,
                              str_replace, 
                              pattern = "^[[:digit:]]_", 
                              replacement = ""))) |> 
  select(where(is.factor)) |> 
  str()
tibble [107 × 8] (S3: tbl_df/tbl/data.frame)
 $ arm                : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ gender             : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
 $ baseline_condition : Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ baseline_temp      : Factor w/ 4 levels "98-98.9F","99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
 $ baseline_esr       : Factor w/ 4 levels "0-10","11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
 $ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
 $ strep_resistance   : Factor w/ 3 levels "sens_0-8","mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ radiologic_6m      : Factor w/ 6 levels "Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...

Working with dates: lubridate

lubridate provides a comprehensive set of functions for working with dates and date-times.

Dates come in many formats; lubridate handles them easily

library(lubridate)
ymd("2023-01-20")
[1] "2023-01-20"
ymd_hm("2023-01-20 10:15")
[1] "2023-01-20 10:15:00 UTC"
dmy("20 January 2023")
[1] "2023-01-20"
mdy("January 20 2023")
[1] "2023-01-20"
mdy("Januar 20 2023")
[1] NA
mdy("1 20 2023")
[1] "2023-01-20"
dmy("the 1st of May 2023 was a Monday")
[1] "2023-05-01"

Working with dates: base

In base R, it’s not so easy… (see ?strptime for details)

as.Date("2023-01-20")
[1] "2023-01-20"
as.Date("20 Jan 2023", format = "%d %b %Y")
[1] "2023-01-20"
as.Date("01 20 2023", format = "%m %d %Y")
[1] "2023-01-20"
as.POSIXct("2023-01-20 10:15")
[1] "2023-01-20 10:15:00 CET"

Very specific to system settings (language)…

Unless your dates are already in a standard format, e.g. “2023-01-20 10:15”, stick with lubridate

Working with dates: lubridate

We can do maths with date(-time)s…

date1 <- ymd("2023-01-20")
date2 <- ymd("2023-02-20")
diff <- date2 - date1
diff
Time difference of 31 days
str(diff)
 'difftime' num 31
 - attr(*, "units")= chr "days"

It’s normally worth converting it to a number…

diff |> as.numeric()
[1] 31

Add a certain number of months

date1 + months(2)
[1] "2023-03-20"

Working with dates: lubridate

Extracting components of the date(-time)

datetime <- ymd_hm("2023-01-23 15:30")
year(datetime)
[1] 2023
month(datetime)
[1] 1
hour(datetime)
[1] 15

With base R:

format(datetime, "%Y") # year
format(datetime, "%m") # month
format(datetime, "%H") # hour
format(datetime, "%M") # minute
format(datetime, "%x") # date, d.m.y format

Check the cheat sheet for many more of lubridate’s capabilities

Working with dates: internals

ymd("2023-01-20") |> str()
 Date[1:1], format: "2023-01-20"
ymd_hm("2023-01-20 10:15") |> str()
 POSIXct[1:1], format: "2023-01-20 10:15:00"

Stored internally as numbers! This allows the maths operations to work

ymd("2023-01-20") |> as.numeric()
[1] 19377
ymd_hm("2023-01-20 10:15") |> as.numeric()
[1] 1674209700

What do those numbers mean?

They’re days (for dates) and seconds (for date-times) since an origin…

Working with dates: the origin

When is that origin (timepoint 0)?

ymd("2023-01-20") - as.numeric(ymd("2023-01-20"))
[1] "1970-01-01"
ymd_hm("2023-01-20 10:15") - as.numeric(ymd_hm("2023-01-20 10:15"))
[1] "1970-01-01 UTC"

Days since 1st January 1970

Seconds since 1st January 1970
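Going the other way, base R can rebuild a date(-time) from those numbers if you supply the origin. A minimal sketch using the same values as above:

```r
# reconstruct a Date from days since the origin
as.Date(19377, origin = "1970-01-01")
#> [1] "2023-01-20"

# reconstruct a date-time from seconds since the origin
as.POSIXct(1674209700, origin = "1970-01-01", tz = "UTC")
#> [1] "2023-01-20 10:15:00 UTC"
```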

Your turn!

Using the insurance data you loaded earlier…

  • make factors out of the sex, and region
  • make a logical indicator for “has more than 2 children” and “smokes”
  • add 6 months to the date variable

Solution

reformatted <- dat |> 
  mutate(
    across(c(sex, region), factor),
    # sex = factor(sex),
    # region = factor(region),
    gt2_children = children > 2,
    smokes = smoker == "yes",
    date_6m = date + months(6)
    # date_6m = date + 30.4 * 6
   )
str(reformatted)
tibble [1,338 × 12] (S3: tbl_df/tbl/data.frame)
 $ X           : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
 $ age         : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
 $ sex         : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 2 2 ...
 $ bmi         : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
 $ children    : num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
 $ smoker      : chr [1:1338] "no" "no" "no" "no" ...
 $ region      : Factor w/ 4 levels "northeast","northwest",..: 3 4 2 2 1 4 4 2 4 2 ...
 $ charges     : num [1:1338] 13086 2574 4411 2321 13435 ...
 $ date        : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
 $ gt2_children: logi [1:1338] FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ smokes      : logi [1:1338] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ date_6m     : Date[1:1338], format: "2001-07-15" "2001-07-17" ...

Pivoting datasets

Sometimes it’s necessary to pivot data, e.g. all observations from an individual are on a single row, but for our analysis we need them in a single variable. pivot_longer is the tool for the task.

pivot_longer(data, cols, 
             names_to = "year", 
             values_to = "cases")   

Pivoting datasets

The opposite, observations on rows to observations in columns, is pivot_wider

pivot_wider(data, 
            names_from = "type", 
            values_from = "count")   

In base R, both these scenarios are handled by reshape, whose syntax and documentation are confusing… which is why tidyr provides two separate functions
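A runnable pivot_wider sketch to match (again with made-up counts, one row per country-type combination):

```r
library(tidyr)

# long: one row per country-type combination (made-up counts)
long <- tibble::tibble(
  country = c("CH", "CH", "DE", "DE"),
  type    = c("confirmed", "suspected", "confirmed", "suspected"),
  count   = c(8, 2, 18, 7)
)

# wide: one row per country, one column per type
long |>
  pivot_wider(names_from = "type",
              values_from = "count")
#> # A tibble: 2 × 3
#>   country confirmed suspected
#>   <chr>       <dbl>     <dbl>
#> 1 CH              8         2
#> 2 DE             18         7
```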

Merging datasets (joins)

(a <- data.frame(id = 1:3, 
                 v1 = letters[1:3]))
  id v1
1  1  a
2  2  b
3  3  c
(b <- data.frame(id = 2:4, 
                 v2 = LETTERS[1:3]))
  id v2
1  2  A
2  3  B
3  4  C

Left join (observations in the first dataframe)

a |> left_join(b)
  id v1   v2
1  1  a <NA>
2  2  b    A
3  3  c    B

Right join (observations in the second dataframe)

a |> right_join(b)
  id   v1 v2
1  2    b  A
2  3    c  B
3  4 <NA>  C

Inner join (observations in both)

a |> inner_join(b)
  id v1 v2
1  2  b  A
2  3  c  B

Merging datasets (joins)

Full join (all observations)

a |> full_join(b)
  id   v1   v2
1  1    a <NA>
2  2    b    A
3  3    c    B
4  4 <NA>    C

Full join with differing variable names to join on

a |> full_join(b, 
               by = join_by(id_a == id_b) # new syntax!
               # by = c("id_a" = "id_b") # older syntax
               )
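Since a and b above both call their key id, here is a runnable sketch where the key names really do differ (id_a and id_b are renamed copies, purely for illustration):

```r
library(dplyr)

a2 <- data.frame(id_a = 1:3, v1 = letters[1:3])
b2 <- data.frame(id_b = 2:4, v2 = LETTERS[1:3])

# the key keeps the name from the first (left-hand) dataframe
a2 |> full_join(b2, by = join_by(id_a == id_b))
#>   id_a   v1   v2
#> 1    1    a <NA>
#> 2    2    b    A
#> 3    3    c    B
#> 4    4 <NA>    C
```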

In base R, use merge and combinations of all.x and all.y to specify the different join types, and by.x and by.y to specify the variables to join on

merge(a, b, by = "id", all.x = TRUE,  all.y = FALSE)  # left
merge(a, b, by = "id", all.x = FALSE, all.y = TRUE)   # right
merge(a, b, by = "id", all.x = FALSE, all.y = FALSE)  # inner
merge(a, b, by = "id", all.x = TRUE,  all.y = TRUE)   # full

Summarizing data

At some point, you will have to create summary data. dplyr can help with that too. summarize is the appropriate function.

strep_tb |> 
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            )
# A tibble: 1 × 5
      n   min median  mean   max
  <int> <dbl>  <dbl> <dbl> <dbl>
1   107     1      5  3.93     6

Remember across? It comes in useful here too

strep_tb |> 
  summarize(n = n(),
            across(c(rad_num, dose_strep_g), 
                   list(min = ~ min(.x, na.rm = TRUE), 
                        mean = mean, 
                        median = median, 
                        max = max), 
                   .names = "{.col}_{.fn}"),
            )
# A tibble: 1 × 9
      n rad_num_min rad_num_mean rad_num_median rad_num_max dose_strep_g_min
  <int>       <dbl>        <dbl>          <dbl>       <dbl>            <dbl>
1   107           1         3.93              5           6                0
# ℹ 3 more variables: dose_strep_g_mean <dbl>, dose_strep_g_median <dbl>,
#   dose_strep_g_max <dbl>

Summarizing data

What about grouped summaries? Two options… group_by()

strep_tb |> 
  group_by(arm) |> 
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            )

… or .by (new syntax)

strep_tb |> 
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            .by = arm
            )
# A tibble: 2 × 6
  arm              n   min median  mean   max
  <fct>        <int> <dbl>  <dbl> <dbl> <dbl>
1 Control         52     1      3  3.13     6
2 Streptomycin    55     1      6  4.67     6

Labelling variables

Variable names tend to be short (less typing, less chance of a typo), but they’re not very useful for tables… We can add labels to variables, which various packages know how to use

library(labelled)
strep_tb_lab <- strep_tb |> 
  set_variable_labels(arm = "Treatment",
                      dose_strep_g = "Dose of Streptomycin", 
                      rad_num = "Radiologic response", 
                      baseline_temp = "Temp. at baseline", 
                      improved = "Improvement in radiologic response")

Retrieve the labels again

var_label(strep_tb_lab$rad_num)
[1] "Radiologic response"

Publication type tables

Options:

  • create something yourself, based on the summaries above (e.g. reshape, format, etc)
  • use a package to do it for you…

Publication type tables

gtsummary is my package of choice

library(gtsummary)
strep_tb_lab |> 
  select(arm, dose_strep_g, baseline_temp,
         rad_num, improved) |> 
  tbl_summary(by = arm, 
              type = c(
                rad_num ~ "continuous"
              )) |> 
  add_overall()

Results are very customisable (see the help files). The package also provides support for model output.

Characteristic                       Overall, N = 107¹   Streptomycin, N = 55¹   Control, N = 52¹
Dose of Streptomycin
    0                                52 (49%)            0 (0%)                  52 (100%)
    2                                55 (51%)            55 (100%)               0 (0%)
Temp. at baseline
    1_98-98.9F                       7 (6.5%)            3 (5.5%)                4 (7.7%)
    2_99-99.9F                       25 (23%)            13 (24%)                12 (23%)
    3_100-100.9F                     32 (30%)            15 (27%)                17 (33%)
    4_100F+                          43 (40%)            24 (44%)                19 (37%)
Radiologic response                  5.00 (2.00, 6.00)   6.00 (3.00, 6.00)       3.00 (1.00, 5.00)
Improvement in radiologic response   55 (51%)            38 (69%)                17 (33%)
¹ n (%); Median (IQR)