Basic Statistics and Projects in R

Introduction to R, the tidyverse, and data wrangling

Christian Althaus, Alan Haynes

General introduction

Course objectives

  • Organize research projects in R following the principles of open science and reproducible research
  • Perform descriptive analysis of data sets and understand the fundamentals of modern data visualization
  • Apply basic inferential statistics to data sets from human and animal health

Follow-up courses

We offer the following follow-up courses within the theme Core Methods of the Public Health Sciences Course Program:

Date Course Objectives
19-23 June 2023 Introduction to Epidemiology and Study Design Concepts and measures to quantify the frequency of health outcomes and their associations with exposures, epidemiological study designs and their potential sources of bias, concepts of causality
26 June 2023 Diagnostic Test Evaluation Concepts of diagnostic test evaluation with examples from human and animal health, interpretation of diagnostic test results, reporting of diagnostic test accuracy studies
23-27 October 2023 Applied Regression Modeling in R Different types of regression models, model selection, models for continuous, binary and categorical outcomes, and events

Lecturers and course assistants

  • Alan Haynes, CTU Bern
  • Christian Althaus, Institute of Social and Preventive Medicine
  • Judith Bouman, Institute of Social and Preventive Medicine
  • Martin Wohlfender, Institute of Social and Preventive Medicine
  • Ben Spycher, Institute of Social and Preventive Medicine
  • Beatriz Vidondo, Veterinary Public Health Institute (VPHI)
  • Guy Schnidrig, Veterinary Public Health Institute (VPHI)

Timetable

Day Time Topic Lecturer(s)
Monday 09:00-12:00 Projects in R: Introduction to R, the tidyverse, and data wrangling Christian Althaus, Alan Haynes
Monday 13:00-17:00 Projects in R: Data visualization with the tidyverse Christian Althaus, Judith Bouman, Martin Wohlfender
Tuesday 09:00-12:30 Projects in R: Reproducibility and GitHub Christian Althaus, Alan Haynes
Thursday 09:00-12:30 Basic Statistics: Inference about the mean Ben Spycher
Thursday 13:30-17:00 Basic Statistics: Non-normal and dependent/paired data Beatriz Vidondo
Friday 09:00-12:30 Basic Statistics: Inference about proportions and rates Ben Spycher
Friday 13:30-17:00 Basic Statistics: Continue R project with a guided data analysis Ben Spycher, Beatriz Vidondo

Today

Time Duration Topic Content
09:00-09:10 10 min General introduction Lecturers and course program
09:10-09:40 30 min Introduction to R and RStudio Hands-on, objects, functions, etc.
09:40-10:00 20 min R projects Files, folders, names, and templates
10:00-10:15 15 min Base R and tidyverse Concepts
10:15-10:30 15 min Data Read, CSV, Excel, REDCap
10:30-10:50 20 min Coffee break
10:50-11:00 10 min Data types
11:00-12:00 60 min Data wrangling
12:00-13:00 60 min Lunch break
13:00-13:30 30 min Data visualization Fundamentals
13:30-14:30 60 min ggplot2 General ideas and basic graphs
14:30-14:50 20 min Coffee break
14:50-15:50 60 min ggplot2 Fancify basic graphs
15:50-16:10 20 min Coffee break
16:10-17:00 50 min Panels Other types of geoms

Course structure and material

Installing R, RStudio, and Git

Mandatory steps (should be completed by now)

  1. Download and install R.
  2. Download and install RStudio.
  3. Install Git using the following instructions.
  4. If you don’t already have one, don’t forget to create a GitHub account.

Optional steps (can be done today)

  1. Make sure RStudio knows about Git by following the corresponding section here.
  2. Install the usethis package for R using the following command: install.packages("usethis")
  3. Set up Git using the following command: usethis::use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")
  4. Generate a personal access token (PAT) and store your PAT as described here.

Any other questions?

Introduction to R and RStudio

What is R?

  • R is a programming language for statistical computing and graphics that first appeared in 1993.
  • R is an open-source implementation of the S language, which was developed at Bell Laboratories in the 1970s.
  • The aim of the S language, as expressed by John Chambers, is “to turn ideas into software, quickly and faithfully”.
  • The statisticians Ross Ihaka and Robert Gentleman developed R at the University of Auckland, New Zealand.
  • In 1995, the statistician Martin Mächler from ETH Zurich convinced Ihaka and Gentleman to make R free and open source under the GNU General Public License.
  • R is both open source and open development.

Why R?

  • Free and open source (popular in academic research)
  • High-level programming language designed for statistical computing (popular in computational biology, bioinformatics, epidemiology and public health sciences)
  • Reproducibility
  • Powerful and flexible - especially for data wrangling and visualization
  • Extensive add-on software (packages)
  • Strong community

(Why not R?)

  • Little centralized support, relies on online community and package developers
  • Can be cumbersome to update
  • Slower than more traditional programming languages (C/C++, Python)

R vs. RStudio

R is a programming language that runs computations. RStudio is an integrated development environment (IDE) that provides an interface to R and adds many convenient features and tools, e.g.:

  • autocomplete, syntax, and spell checking functionality,
  • easier working with file paths (via projects and the here package),
  • integration with version control systems (e.g., GitHub, SVN).

Note that RStudio (the company) recently changed its name to Posit. RStudio (the IDE) remains unchanged.

RStudio

RStudio cheatsheet

Use the cheatsheet to find your way in RStudio: https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf

Working with R in RStudio

  1. Editor (top left): Press the Run button or select code and press Ctrl + Enter (⌘ + Enter on macOS).
    • Analysis script
    • Reproducibility
  2. Console (bottom left): Simply press Enter.
    • R as a calculator
    • Trying out things before adding to the editor

The result of your command(s) will appear in the tab Console if the commands are intended to print something, and/or in the tab Plots if the commands generate a plot.

R as a calculator

In the Console, R can simply be used as a calculator:

2 + 3
[1] 5
2 * 4
[1] 8
2^5
[1] 32
6 / 2 * (1 + 2)
[1] 9

Commenting in scripts

It is recommended to comment your scripts using #:

# This script illustrates R as a calculator

6 / 2 * (1 + 2) # Comments can also be placed to the right of code.

Objects in R

Objects can be a single piece of data (e.g., 3.14 or "Bern"), or they can consist of structured data.

Object classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

Class Description Example(s)
numeric Any real number 1, 3.14, 8.8e6
character Individual characters or strings, quoted "a", "Hello, World!"
factor Categorical/qualitative variables Ordered values of economic status
logical Boolean variables TRUE and FALSE
Date/POSIXct Calendar dates and times "2023-06-05"

Other object classes are array, data.frame, list, and tibble (similar to data.frame).
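You can inspect the class of any object with the class() function, e.g.:

```r
# class() returns the class of an object
class(3.14)        # "numeric"
class("Bern")      # "character"
class(TRUE)        # "logical"
class(factor("a")) # "factor"
class(Sys.Date())  # "Date"
```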

Assigning values to objects

R uses <- to assign values to an object name (you might also see = used, but this is not best practice).

Object names are case-sensitive, i.e., X and x are different.

x <- 2
x
[1] 2
x * 4
[1] 8
x + 2
[1] 4

Combine values

The function c() combines/concatenates single R objects into a vector (or list) of R objects:

x <- c(1, 4, 6, 8)
x
[1] 1 4 6 8
sum(x) # sum() is another function that returns the sum of vector elements.
[1] 19

You can apply functions to entire vectors of numbers very easily.

x + 5
[1]  6  9 11 13

Writing your own functions

Every function in R has three basic parts: a name, a body of code, and a set of arguments. To make your own function, you need to replicate these parts and store them in an R object, which you can do with the function function().
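As a minimal sketch (the function name and arguments are made up for illustration), a function that computes the body mass index could look like this:

```r
# name: bmi; arguments: weight (kg) and height (m); body: the expression in braces
bmi <- function(weight, height) {
  weight / height^2 # the value of the last expression is returned
}

bmi(70, 1.80) # approximately 21.6
```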

The R ecosystem

The built-in functionality of base R can be expanded with packages that others have developed and published.

The Comprehensive R Archive Network (CRAN) has been the main source of R packages. Nowadays, GitHub also contains many packages and has arguably become the primary location for package development. Some packages, e.g., tidyverse, are so-called meta-packages - they load a collection of other packages.

Install new packages as follows:

install.packages("packagename")

Packages must be loaded each R session to give access to their functionality:

library(packagename)

Getting help with R

The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages.

Cheat sheets exist for many packages and topics: https://rstudio.github.io/cheatsheets

Help for R is abundant. 99.9% of your questions will have been asked before, so Google is your friend, or ask on Twitter/Mastodon using the #rstats tag.

Ask ChatGPT

The Rollercoaster

Further resources

There are numerous online tutorials and books on R, RStudio with specific applications to epidemiology, public health, and data science:

R projects

Why use projects in R(Studio)?

The use of projects (.Rproj files) is fundamental to organized coding and project management. There are four main reasons why you should use projects essentially 100% of the time while using RStudio:

  • It can take less than 30 seconds to set up.
  • It keeps all relevant files in the same place.
  • It sets the working directory, so you can use relative paths.
  • It allows for version control (see tomorrow).
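With a project open, paths can be built relative to the project root instead of being hard-coded. A small sketch (the file name is hypothetical; file.path() is base R, here() comes from the here package mentioned above):

```r
# base R: build a path relative to the working directory
file.path("data", "raw", "original_data.csv")
# [1] "data/raw/original_data.csv"

# the here package resolves the same path relative to the project root,
# regardless of the current working directory:
# library(here)
# here("data", "raw", "original_data.csv")
```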

Folder structures

Using consistent folder structures across projects helps you work more efficiently. An example of the folder structure for a project looks like this:

  • R
    • 00_main.R
    • 01_cleaning.R
    • 02_analysis.R
    • 03_plotting.R
  • data
    • processed
      • cleaned_data.csv
      • processed_data.rds
    • raw
      • original_data.csv
      • spreadsheet_data.xlsx
  • output
    • figures
      • 01_figure.png
      • 02_figure.pdf
    • tables
      • 01_table.csv
      • 02_table.rds
  • products
    • manuscript
      • manuscript.docx
      • manuscript.html
      • manuscript.pdf
      • manuscript.qmd
    • report
      • report.html
      • report.qmd
    • slides
      • slides.html
      • slides.qmd
  • .gitignore
  • README.md
  • project-template.Rproj

Naming files

Bad examples

  • myabstract.docx
  • Long file name using “spaces” & punctuation!.xlsx
  • figures 1.png
  • SDEFI7_jknsfol.txt

Better

  • 2023-02-15_abstract-conference-X.docx
  • still-long-but-no-punctuation-or-spaces.xlsx
  • fig01_scatter-mpg-vs-vol.png
  • more-meaningful-name.txt

Naming R files

Analysis scripts can get looooooooong… Don’t be afraid to break them up into smaller chunks.

Use sequential numbers and descriptive names, e.g.:

  • 01_cleaning.R cleans your data,
  • 02_analysis.R performs your analysis,
  • 03_plotting.R plots the results of your analysis.

Sequential numbers allow you to sort the files according to the sequence in which you run them.

Descriptive names inform you of what is actually in there.

You can use a main or master file (00_main.R) to run all other files and create a reproducible analysis.
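A sketch of such a main file, using the script names from above:

```r
# 00_main.R: run the whole analysis in the correct order
source("R/01_cleaning.R")
source("R/02_analysis.R")
source("R/03_plotting.R")
```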

Also see How to name files from Jennifer Bryan and The tidyverse style guide.

Add README files

Adding README.md (Markdown) or README.txt (plain text) files to your project folder and subfolders can be useful to describe your project and/or the content of folders, and provide instructions.

Tomorrow, you will learn more about Markdown (and Quarto).

Exercise 1: Create a project

We have set up a template project for you, including a directory structure: https://github.com/ISPMBern/project-template

  1. Download the template (click Code and then Download ZIP).
  2. Unzip the file to a suitable directory on your computer.
  3. Rename the folder to something more suitable.
  4. Rename the .Rproj file to something more meaningful to you (the same as the folder?).
  5. Open the project in RStudio (double-click the .Rproj icon).

You will use this project for the rest of the course…

Required packages for the next exercises

To work on the next exercises, you have to install the following packages:

  • usethis - Workflow package
  • gitcreds - Queries Git credentials from R
  • here - Easy file referencing
  • tidyverse - A set of packages
  • medicaldata - Medical data sets
  • cowplot - Features to create publication-quality figures

Simply type install.packages("packagename"). RStudio will also offer to install a missing package when you open a script that loads it.

Tidyverse and data wrangling

Base R

Base R is a collection of ca. 25 packages that have been developed since R first appeared (ca. 30 years ago)

This age often shows in the syntax - inconsistent argument names and/or ordering of arguments, and it is sometimes possible to tell which features were afterthoughts

Syntax also varies widely across add-on packages

Enter the tidyverse…

What is the tidyverse?

A group of R packages designed in a consistent manner along the principles of tidy data

Primarily contains packages for data import (readr), manipulation (“wrangling”; dplyr, forcats, stringr) and visualization (ggplot2)

Load the whole thing via library(tidyverse) or individual packages as usual (e.g. library(ggplot2))

Tidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst


Making small(!!) datasets by hand

Use case: lookup tables.

data.frame(code = c(0, 1),
           label = c("male", "female"))
  code  label
1    0   male
2    1 female

Fine for very small datasets, but unwieldy with many variables and/or many observations

tibble::tribble(
  ~code, ~label,
  0,     "male",
  1,     "female"
)
# A tibble: 2 × 2
   code label 
  <dbl> <chr> 
1     0 male  
2     1 female

Much nicer… immediately clear which label belongs to which code

Practically no functional difference between a tibble and a data.frame, just a slightly different print method

Getting data into R

R has a wide range of tools for importing data.

Base R

  • read.csv
  • read.csv2
  • read.delim

tidyverse (readr)

  • read_csv
  • read_csv2
  • read_delim

Others

  • readxl::read_xlsx
  • REDCapR::redcap_read
  • haven::read_spss

And even more

  • secuTrialR::read_secuTrial
  • odbc::dbConnect
  • httr2

Data sometimes already exists in R…

Many published datasets already exist in R, either in the basic installation or via a package

data(mtcars)
data(iris) # very popular in examples

From packages, e.g. medicaldata

install.packages("medicaldata")
library(medicaldata)
strep_tb
# A tibble: 107 × 13
   patient_id arm     dose_strep_g dose_PAS_g gender baseline_condition
   <chr>      <fct>          <dbl>      <dbl> <fct>  <fct>             
 1 0001       Control            0          0 M      1_Good            
 2 0002       Control            0          0 F      1_Good            
 3 0003       Control            0          0 F      1_Good            
 4 0004       Control            0          0 M      1_Good            
 5 0005       Control            0          0 F      1_Good            
 6 0006       Control            0          0 M      1_Good            
 7 0007       Control            0          0 F      1_Good            
 8 0008       Control            0          0 M      1_Good            
 9 0009       Control            0          0 F      2_Fair            
10 0010       Control            0          0 M      2_Fair            
# ℹ 97 more rows
# ℹ 7 more variables: baseline_temp <fct>, baseline_esr <fct>,
#   baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
#   rad_num <dbl>, improved <lgl>

Getting data into R in practice

We place the dataset in the appropriate folder (data/raw) and read it in with the appropriate function (e.g. read_csv)

library(readr)
library(here)
data <- read_csv(here("data", "raw", "MyData.csv"))

Depending on the file, you might need read_csv2, which is configured for e.g. German environments where CSVs are actually semicolon (;) separated because the comma is used as the decimal separator in numbers.

In base R

data <- read.csv(here("data", "raw", "MyData.csv"))

Virtually identical… readr is slightly faster and automatically converts some variable types

Getting data into R in practice

More common(?): Excel files…

library(readxl) # informal tidyverse member
data <- read_xlsx(here("data", "raw", "MyData.xlsx"))

Once you’ve loaded a dataset, it’s good practice to inspect the data to see that it’s loaded correctly

str(strep_tb)
tibble [107 × 13] (S3: tbl_df/tbl/data.frame)
 $ patient_id         : chr [1:107] "0001" "0002" "0003" "0004" ...
 $ arm                : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ dose_strep_g       : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
 $ dose_PAS_g         : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
 $ gender             : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
 $ baseline_condition : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ baseline_temp      : Factor w/ 4 levels "1_98-98.9F","2_99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
 $ baseline_esr       : Factor w/ 4 levels "1_0-10","2_11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
 $ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
 $ strep_resistance   : Factor w/ 3 levels "1_sens_0-8","2_mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ radiologic_6m      : Factor w/ 6 levels "6_Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...
 $ rad_num            : num [1:107] 6 5 5 5 5 6 5 5 5 5 ...
 $ improved           : logi [1:107] TRUE TRUE TRUE TRUE TRUE TRUE ...

XLSX vs CSV vs database (e.g. REDCap)

Watch Darren Dahly open an Excel file

Excel is easy to use, accessible, …

…but it tries to be clever, e.g. auto-formatting dates and gene names (30% of genetics papers contain mangled gene names in their tables)

CSV files can be created by Excel and other software (a common export format from databases) and have fewer issues with formatting

Both lack data validation protocols that would give more control over the entered data

Databases provide data validation, many are Human Research Act compliant (MS Access is not), many have ways to export data directly from the database to R, or simple ways to import exports

What do you think this is?

[1] "\xfc" "\xe4" "\xe9" "\xe0"

Dealing with special characters

[1] "\xfc" "\xe4" "\xe9" "\xe0"
[1] "ü" "ä" "é" "à"

Common issue when working in Switzerland - ä, ö, ü, é, è, à, etc

File encoding influences exactly how (special) characters are represented in a file

R needs that information to make sense of the data

# base
read.csv("path/to/file.csv", fileEncoding = "UTF-8")
# tidyverse (readr)
read_csv("path/to/file.csv",
         locale = locale(encoding = "UTF-8"))

Free text, check the encoding!

Pro-tip: use Notepad++ to discover the encoding used, and possibly to convert to a different encoding (saving it to a different file…)

Use English as much as possible - saves time dealing with special characters AND no need to translate tables for publications

Your turn!

Read file insurance_with_date.csv into R and explore it a little.

  • How many observations and variables does it have?
  • What types of variables does it include?
  • What difference is there when importing via the tidyverse (readr's read_csv) vs base R (read.csv)?

You have 5 minutes… go!

Solution

library(readr)
dat <- read_csv("data/raw/insurance_with_date.csv")
str(dat)
spc_tbl_ [1,338 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ X       : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
 $ age     : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
 $ sex     : chr [1:1338] "male" "female" "female" "male" ...
 $ bmi     : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
 $ children: num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
 $ smoker  : chr [1:1338] "no" "no" "no" "no" ...
 $ region  : chr [1:1338] "southeast" "southwest" "northwest" "northwest" ...
 $ charges : num [1:1338] 13086 2574 4411 2321 13435 ...
 $ date    : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
 - attr(*, "spec")=
  .. cols(
  ..   X = col_double(),
  ..   age = col_double(),
  ..   sex = col_character(),
  ..   bmi = col_double(),
  ..   children = col_double(),
  ..   smoker = col_character(),
  ..   region = col_character(),
  ..   charges = col_double(),
  ..   date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
dat2 <- read.csv("data/raw/insurance_with_date.csv")
str(dat2)
'data.frame':   1338 obs. of  9 variables:
 $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int  59 24 28 22 60 38 51 44 47 29 ...
 $ sex     : chr  "male" "female" "female" "male" ...
 $ bmi     : num  31.8 22.6 25.9 25.2 36 ...
 $ children: int  2 0 1 0 0 3 0 0 1 2 ...
 $ smoker  : chr  "no" "no" "no" "no" ...
 $ region  : chr  "southeast" "southwest" "northwest" "northwest" ...
 $ charges : num  13086 2574 4411 2321 13435 ...
 $ date    : chr  "2001-01-15" "2001-01-17" "2001-01-22" "2001-01-29" ...

Data types

Text/character/string - in R, character

sex <- c("male", "female", "female", "male")

cannot (always) be used in models; generally needs converting to another format

Encoded categorical information - factors

from a known list of options…

fct <- factor(sex, levels = c("male", "female", "non-binary"))
fct
[1] male   female female male  
Levels: male female non-binary

from encoded data…

factor(c(0, 1), levels = 0:2, labels = c("male", "female", "non-binary"))
[1] male   female
Levels: male female non-binary

factors are more suitable for models than text

Numbers - numeric

num <- 1:4
num
[1] 1 2 3 4
c(1,1,2,3)
[1] 1 1 2 3
seq(1, 10, 2)
[1] 1 3 5 7 9
seq(1, 10, length.out = 4)
[1]  1  4  7 10
num * num
[1]  1  4  9 16
num * 2
[1] 2 4 6 8
num %*% num # matrix multiplication (scalar product)
     [,1]
[1,]   30

Binary/Boolean/Yes/No variables - logical

TRUE
[1] TRUE
(lgl <- c(TRUE, FALSE, FALSE, TRUE))
[1]  TRUE FALSE FALSE  TRUE
num > 2
[1] FALSE FALSE  TRUE  TRUE

Indicates whether a condition is true or false, yes or no, e.g. death

Keeping information together - dataframes

For when all elements have the same length…

df <- data.frame(sex = sex,
                 fct = fct,
                 num = num,
                 lgl = lgl)
tibble::tibble(sex = sex,
               fct = fct,
               num = num,
               lgl = lgl)
# A tibble: 4 × 4
  sex    fct      num lgl  
  <chr>  <fct>  <int> <lgl>
1 male   male       1 TRUE 
2 female female     2 FALSE
3 female female     3 FALSE
4 male   male       4 TRUE 

Keeping information together - lists

For when the objects have different lengths…

lst <- list(letter = "a",
     numbers = rnorm(10),
     data = df)
lst
$letter
[1] "a"

$numbers
 [1]  1.6767047 -1.8662187  1.0433395 -0.6191508  0.4764069  2.3812988
 [7] -0.7050137  1.1650151  0.6387094 -0.6682568

$data
     sex    fct num   lgl
1   male   male   1  TRUE
2 female female   2 FALSE
3 female female   3 FALSE
4   male   male   4  TRUE

Getting elements out again is the same for both

df$sex
[1] "male"   "female" "female" "male"  
lst$letter
[1] "a"

Piping: %>%, |>

When googling, you will encounter pipes. They enable chaining operations together.

Two main varieties:

%>% is from the magrittr package, introduced ca 2014

|> was added to base R in 2021 (v4.1.0)

data |> 
  mutate(new_var = rnorm(10)) |>
  rename(random = new_var) |> 
  etc()

Especially useful in data wrangling…

Without pipes

Essentially the same code as the last slide, just in base R

Nesting calls…

etc(
  rename(
    mutate(data,
           new_var = rnorm(10)),
    random = new_var
  )
)

…or saving intermediate objects…

tmp <- mutate(data, new_var = rnorm(10))
tmp <- rename(tmp, random = new_var)
etc(tmp)

Data wrangling with dplyr and the tidyverse

Most tidyverse functions are named after verbs…

  • readr::read_csv reads a Comma-Separated-Value file
  • dplyr::filter keeps observations matching some criteria
  • dplyr::mutate modifies the data (change existing, add new variables)
  • dplyr::rename renames variables
  • stringr::str_detect detects whether a string contains a particular piece of text (regular expression - regex)

library(dplyr)

Selecting variables

The tidyverse offers a range of methods to select variables and these methods are used in many of the functions that we will discuss.

strep_tb |> select(patient_id, arm, gender)
strep_tb |> select(patient_id:gender, last_col())
strep_tb |> select(1:2, 13)
vars <- c("patient_id", "arm", "gender")
strep_tb |> select(all_of(vars))

Base R examples

strep_tb[, c("patient_id", "arm", "gender")]
strep_tb[, c(1, 2, 13)]
vars <- c("patient_id", "arm", "gender")
strep_tb[, vars]

Selecting variables

Also by variable class or aspects of the variable name

strep_tb |> select(where(is.factor))
strep_tb |> select(contains("i")) # contains a literal string
strep_tb |> select(matches("ll")) # matches a regex
strep_tb |> select(starts_with("b"))

More tricky with base R

strep_tb[, sapply(strep_tb, is.factor)]
strep_tb[, grepl("i", names(strep_tb))]
strep_tb[, grepl("^b", names(strep_tb))]

Selecting observations

filtering

strep_tb |> filter(arm == "Control")
strep_tb |> filter(dose_strep_g > 0)

strep_tb |> filter(improved)

With base R

strep_tb[strep_tb$arm == "Control", ]
strep_tb[strep_tb$dose_strep_g > 0, ]
strep_tb[strep_tb$improved, ]
subset(strep_tb, improved)

Selecting observations

slicing

# specific rows
strep_tb |> slice(1, 2, 3, 5, 8, 13)

# first 10
strep_tb |> slice_head(n = 10)
strep_tb |> slice_head(n = -10) # all rows except the last 10

# first 10% of observations
strep_tb |> slice_head(prop = 0.1)

# last 10
strep_tb |> slice_tail(n = 10)

Not so different with base R…

strep_tb[c(1, 2, 3, 5, 8, 13), ] # specific rows
strep_tb[1:10, ] # first 10
head(strep_tb, n = 10) # or strep_tb |> head(n = 10)
strep_tb[(nrow(strep_tb) - 9):nrow(strep_tb), ] # last 10
tail(strep_tb, n = 10) # or strep_tb |> tail(n = 10)

Pay attention to ordering in the dataframe
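dplyr::arrange sorts a dataframe explicitly, so that "first" and "last" rows are well defined before slicing. A small made-up example:

```r
library(dplyr)

df <- data.frame(id = c(3, 1, 2), value = c(30, 10, 20))

# sort by id, then take the last row: the row with the largest id
df |>
  arrange(id) |>
  slice_tail(n = 1)
```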

Modifying data

mutate is your friend

strep_tb |> 
  mutate(dose_strep_g = dose_strep_g + 2)

Very simple in base too…

strep_tb$dose_strep_g <- strep_tb$dose_strep_g + 2

Not restricted to a single change

strep_tb |> 
  mutate(dose_strep_g = dose_strep_g + 2, 
         control = arm == "Control")

In base R, something similar can be done with the rarely used within function

within(strep_tb, {
  dose_strep_g <- dose_strep_g + 2
  control <- arm == "Control"
})
# A tibble: 107 × 14
   patient_id arm     dose_strep_g dose_PAS_g gender baseline_condition
   <chr>      <fct>          <dbl>      <dbl> <fct>  <fct>             
 1 0001       Control            2          0 M      1_Good            
 2 0002       Control            2          0 F      1_Good            
 3 0003       Control            2          0 F      1_Good            
 4 0004       Control            2          0 M      1_Good            
 5 0005       Control            2          0 F      1_Good            
 6 0006       Control            2          0 M      1_Good            
 7 0007       Control            2          0 F      1_Good            
 8 0008       Control            2          0 M      1_Good            
 9 0009       Control            2          0 F      2_Fair            
10 0010       Control            2          0 M      2_Fair            
# ℹ 97 more rows
# ℹ 8 more variables: baseline_temp <fct>, baseline_esr <fct>,
#   baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
#   rad_num <dbl>, improved <lgl>, control <lgl>

Same change to many variables

add two to all dose_* variables

strep_tb |> 
  mutate(across(starts_with("dose"), ~ .x + 2))

creating new variables

strep_tb |> 
  mutate(across(starts_with("dose"), ~ .x + 2, .names = "{.col}_plus2"))

more generic, using variable class

strep_tb |> 
  mutate(across(where(is.numeric),  ~ .x + 2, .names = "{.col}_plus2"))

Not useful here, but handy for e.g. factors (examples later)

One possible base method

for(var in c("dose_strep_g", "dose_PAS_g")){
  strep_tb[, paste0(var, "_plus2")] <- strep_tb[, var] + 2
}

Conditional modifications

Sometimes you need to do something under one circumstance, something else under another

all individuals' measurements were short, so the same correction applies to everyone

strep_tb |> 
  mutate(dose_strep_g_corr = dose_strep_g + 2)

create text for male/female

strep_tb |> 
  mutate(txt = if_else(gender == "M", 
                       # when TRUE
                       "Male",
                       # when FALSE
                       "Female"))

We want some specific text in some cases

strep_tb |> 
  mutate(txt = case_when(
      # Male, Streptomycin
      gender == "M" & arm == "Streptomycin" ~ "M Strepto",
      # Female, Streptomycin
      gender == "F" & arm == "Streptomycin" ~ "F Strepto",
      # all others are OK
      TRUE ~ "Control")
    )
# initialize the variable
strep_tb$txt <- "Control"

# replace the values for males in the streptomycin group
strep_tb$txt[
  with(strep_tb, gender == "M" & arm == "Streptomycin") # which cases
  ] <- "M Strepto"


# replace the values for females in the streptomycin group, using a temporary variable
to_change <- with(strep_tb, gender == "F" & arm == "Streptomycin")

strep_tb$txt[to_change] <- "F Strepto"

# strep_tb$x[to_change] <- strep_tb$x[to_change] * 2

rm(to_change) # clean up

hmmm… tidyverse syntax is much nicer!?

Working with strings: stringr

The stringr package contains functions specifically for working with strings.

Most functions start with str_.

library(stringr)
txt <- "A siLLy exAmple "

Changing case

str_to_lower(txt)
[1] "a silly example "
str_to_sentence(txt)
[1] "A silly example "
str_to_title(txt)
[1] "A Silly Example "
str_to_upper(txt)
[1] "A SILLY EXAMPLE "

Lengths

str_length(txt)
[1] 16
str_count(txt, "\\w+")
[1] 3

Working with strings: stringr

Remove white space

str_squish(txt)
[1] "A siLLy exAmple"

Substrings

str_sub(txt, 3)
[1] "siLLy exAmple "
str_sub(txt, 3, -4)
[1] "siLLy exAmp"
word(txt, 2)
[1] "siLLy"

Replacements

str_replace(txt, "e", "XX")
[1] "A siLLy XXxAmple "
str_replace_all(txt, "e", "XX")
[1] "A siLLy XXxAmplXX "

Working with strings: stringr

Detect a substring

str_detect(txt, "x")
[1] TRUE
str_detect(txt, "z")
[1] FALSE

Splitting

str_split(txt, " ")
[[1]]
[1] "A"       "siLLy"   "exAmple" ""       

Using stringr within mutate

strep_tb |> 
  mutate(txt = as.character(baseline_condition),
         upper = str_to_upper(txt),
         good = str_detect(txt, "Good"),
         no_number = str_replace(txt, "^[[:digit:]]_", "")) |> 
  select(txt:no_number) |> unique()
# A tibble: 3 × 4
  txt    upper  good  no_number
  <chr>  <chr>  <lgl> <chr>    
1 1_Good 1_GOOD TRUE  Good     
2 2_Fair 2_FAIR FALSE Fair     
3 3_Poor 3_POOR FALSE Poor     

Working with factors: forcats

The forcats package contains functions specifically for working with factors.

Functions (almost) all begin with fct_.

fac <- strep_tb$baseline_condition[c(1, 15, 29)]
fac
[1] 1_Good 2_Fair 3_Poor
Levels: 1_Good 2_Fair 3_Poor

Reverse the levels (particularly useful when making plots)

library(forcats)
fct_rev(fac)
[1] 1_Good 2_Fair 3_Poor
Levels: 3_Poor 2_Fair 1_Good
# factor(fac, 
#        levels = rev(levels(fac)))

Changes the order in which levels are shown, e.g. in tables

Working with factors: forcats

Change level names

fct_recode(fac,
           # new = "old"
           Good = "1_Good",
           Fair = "2_Fair",
           Poor = "3_Poor")
[1] Good Fair Poor
Levels: Good Fair Poor
# factor(as.character(fac),
#        levels = levels(fac),
#        labels = c("Good", "Fair", "Poor"))

This is also possible by regular expression (regex)

fct_relabel(fac,
            str_replace, # function to use
            # additional arguments to the function
            pattern = "^[123]_", # regex for "starts with 1, 2 or 3 and is followed by _"
            replacement = "" # replace with nothing
            ) 
[1] Good Fair Poor
Levels: Good Fair Poor
# factor(as.character(fac),
#        levels = levels(fac),
#        labels = gsub("^[123]_", "", levels(fac)))

Working with factors: forcats

Within mutate

strep_tb |> 
  mutate(baseline_condition_new = fct_relabel(baseline_condition,
                                              str_replace, 
                                              pattern = "^[[:digit:]]_", # "any leading digit followed by underscore"
                                              replacement = "")) |> 
  select(baseline_condition, baseline_condition_new) |> 
  str()
tibble [107 × 2] (S3: tbl_df/tbl/data.frame)
 $ baseline_condition    : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ baseline_condition_new: Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...

Do it to all factors

strep_tb |> 
  mutate(across(where(is.factor), 
                ~ fct_relabel(.x,
                              str_replace, 
                              pattern = "^[[:digit:]]_", 
                              replacement = ""))) |> 
  select(where(is.factor)) |> 
  str()
tibble [107 × 8] (S3: tbl_df/tbl/data.frame)
 $ arm                : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ gender             : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
 $ baseline_condition : Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ baseline_temp      : Factor w/ 4 levels "98-98.9F","99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
 $ baseline_esr       : Factor w/ 4 levels "0-10","11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
 $ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
 $ strep_resistance   : Factor w/ 3 levels "sens_0-8","mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ radiologic_6m      : Factor w/ 6 levels "Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...

Working with dates: lubridate

lubridate provides a comprehensive set of functions for working with dates and date-times.

Dates come in many formats; lubridate handles them easily

library(lubridate)
ymd("2023-01-20")
[1] "2023-01-20"
ymd_hm("2023-01-20 10:15")
[1] "2023-01-20 10:15:00 UTC"
dmy("20 January 2023")
[1] "2023-01-20"
mdy("January 20 2023")
[1] "2023-01-20"
mdy("Januar 20 2023")
[1] NA
mdy("1 20 2023")
[1] "2023-01-20"
dmy("the 1st of May 2023 was a Monday")
[1] "2023-05-01"

Working with dates: base

In base R, it’s not so easy… (see ?strptime for details)

as.Date("2023-01-20")
[1] "2023-01-20"
as.Date("20 Jan 2023", format = "%d %b %Y")
[1] "2023-01-20"
as.Date("01 20 2023", format = "%m %d %Y")
[1] "2023-01-20"
as.POSIXct("2023-01-20 10:15")
[1] "2023-01-20 10:15:00 CET"

Very specific to system settings (language)…

Unless your dates are already in a standard format, e.g. “2023-01-20 10:15”, stick with lubridate

Working with dates: lubridate

We can do maths with date(-time)s…

date1 <- ymd("2023-01-20")
date2 <- ymd("2023-02-20")
diff <- date2 - date1
diff
Time difference of 31 days
str(diff)
 'difftime' num 31
 - attr(*, "units")= chr "days"

It’s normally worth converting it to a number…

diff |> as.numeric()
[1] 31

Add a certain number of months

date1 + months(2)
[1] "2023-03-20"

Working with dates: lubridate

Extracting components of the date(-time)

datetime <- ymd_hm("2023-01-23 15:30")
year(datetime)
[1] 2023
month(datetime)
[1] 1
hour(datetime)
[1] 15

With base R:

format(datetime, "%Y") # year
format(datetime, "%m") # month
format(datetime, "%H") # hour
format(datetime, "%M") # minute
format(datetime, "%x") # date, d.m.y format

Check the cheat sheet for many more of lubridate’s capabilities

Working with dates: internals

ymd("2023-01-20") |> str()
 Date[1:1], format: "2023-01-20"
ymd_hm("2023-01-20 10:15") |> str()
 POSIXct[1:1], format: "2023-01-20 10:15:00"

Stored internally as numbers! This allows the maths operations to work

ymd("2023-01-20") |> as.numeric()
[1] 19377
ymd_hm("2023-01-20 10:15") |> as.numeric()
[1] 1674209700

What do those numbers mean?

They’re days (for dates) and seconds (for date-times) since an origin…

Working with dates: the origin

When is that origin (timepoint 0)?

ymd("2023-01-20") - as.numeric(ymd("2023-01-20"))
[1] "1970-01-01"
ymd_hm("2023-01-20 10:15") - as.numeric(ymd_hm("2023-01-20 10:15"))
[1] "1970-01-01 UTC"

Days since 1st January 1970

Seconds since 1st January 1970
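Going the other way, base R can rebuild a date(-time) from those numbers if you supply the origin. A minimal sketch using the same values as above:

```r
# reconstruct a Date from days since the origin
as.Date(19377, origin = "1970-01-01")
#> [1] "2023-01-20"

# reconstruct a date-time from seconds since the origin
as.POSIXct(1674209700, origin = "1970-01-01", tz = "UTC")
#> [1] "2023-01-20 10:15:00 UTC"
```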

Your turn!

Using the insurance data you loaded earlier…

  • make factors out of the sex, and region
  • make a logical indicator for “has more than 2 children” and “smokes”
  • add 6 months to the date variable

Solution

reformatted <- dat |> 
  mutate(
    across(c(sex, region), factor),
    # sex = factor(sex),
    # region = factor(region),
    gt2_children = children > 2,
    smokes = smoker == "yes",
    date_6m = date + months(6)
    # date_6m = date + 30.4 * 6
   )
str(reformatted)
tibble [1,338 × 12] (S3: tbl_df/tbl/data.frame)
 $ X           : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
 $ age         : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
 $ sex         : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 2 2 ...
 $ bmi         : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
 $ children    : num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
 $ smoker      : chr [1:1338] "no" "no" "no" "no" ...
 $ region      : Factor w/ 4 levels "northeast","northwest",..: 3 4 2 2 1 4 4 2 4 2 ...
 $ charges     : num [1:1338] 13086 2574 4411 2321 13435 ...
 $ date        : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
 $ gt2_children: logi [1:1338] FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ smokes      : logi [1:1338] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ date_6m     : Date[1:1338], format: "2001-07-15" "2001-07-17" ...

Pivoting datasets

Sometimes it’s necessary to pivot data, e.g. all observations from an individual are on a single row, but for our analysis we need them in a single variable. pivot_longer is the tool for the task.

pivot_longer(data, cols, 
             names_to = "year", 
             values_to = "cases")   

Pivoting datasets

The opposite, observations on rows to observations in columns, is pivot_wider

pivot_wider(data, 
            names_from = "type", 
            values_from = "count")   

In base R, both these scenarios are handled by reshape, whose syntax and documentation are confusing… which is why tidyr provides two separate functions
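A runnable pivot_wider sketch to match (again with made-up counts, one row per country-type combination):

```r
library(tidyr)

# long: one row per country-type combination (made-up counts)
long <- tibble::tibble(
  country = c("CH", "CH", "DE", "DE"),
  type    = c("confirmed", "suspected", "confirmed", "suspected"),
  count   = c(8, 2, 18, 7)
)

# wide: one row per country, one column per type
long |>
  pivot_wider(names_from = "type",
              values_from = "count")
#> # A tibble: 2 × 3
#>   country confirmed suspected
#>   <chr>       <dbl>     <dbl>
#> 1 CH              8         2
#> 2 DE             18         7
```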

Merging datasets (joins)

(a <- data.frame(id = 1:3, 
                 v1 = letters[1:3]))
  id v1
1  1  a
2  2  b
3  3  c
(b <- data.frame(id = 2:4, 
                 v2 = LETTERS[1:3]))
  id v2
1  2  A
2  3  B
3  4  C

Left join (observations in the first dataframe)

a |> left_join(b)
  id v1   v2
1  1  a <NA>
2  2  b    A
3  3  c    B

Right join (observations in the second dataframe)

a |> right_join(b)
  id   v1 v2
1  2    b  A
2  3    c  B
3  4 <NA>  C

Inner join (observations in both)

a |> inner_join(b)
  id v1 v2
1  2  b  A
2  3  c  B

Merging datasets (joins)

Full join (all observations)

a |> full_join(b)
  id   v1   v2
1  1    a <NA>
2  2    b    A
3  3    c    B
4  4 <NA>    C

Full join with differing variable names to join on

a |> full_join(b, 
               by = join_by(id_a == id_b) # new syntax!
               # by = c("id_a" = "id_b") # older syntax
               )
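Since a and b above both call their key id, here is a runnable sketch where the key names really do differ (id_a and id_b are renamed copies, purely for illustration):

```r
library(dplyr)

a2 <- data.frame(id_a = 1:3, v1 = letters[1:3])
b2 <- data.frame(id_b = 2:4, v2 = LETTERS[1:3])

# the key keeps the name from the first (left-hand) dataframe
a2 |> full_join(b2, by = join_by(id_a == id_b))
#>   id_a   v1   v2
#> 1    1    a <NA>
#> 2    2    b    A
#> 3    3    c    B
#> 4    4 <NA>    C
```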

In base R, use merge and combinations of all.x and all.y to specify the different join types, and by.x and by.y to specify the variables to join on

merge(a, b, by = "id", all.x = TRUE,  all.y = FALSE)  # left
merge(a, b, by = "id", all.x = FALSE, all.y = TRUE)   # right
merge(a, b, by = "id", all.x = FALSE, all.y = FALSE)  # inner
merge(a, b, by = "id", all.x = TRUE,  all.y = TRUE)   # full

Summarizing data

At some point, you will have to create summary data. dplyr can help with that too. summarize is the appropriate function.

strep_tb |> 
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            )
# A tibble: 1 × 5
      n   min median  mean   max
  <int> <dbl>  <dbl> <dbl> <dbl>
1   107     1      5  3.93     6

Remember across? It comes in useful here too

strep_tb |> 
  summarize(n = n(),
            across(c(rad_num, dose_strep_g), 
                   list(min = ~ min(.x, na.rm = TRUE), 
                        mean = mean, 
                        median = median, 
                        max = max), 
                   .names = "{.col}_{.fn}"),
            )
# A tibble: 1 × 9
      n rad_num_min rad_num_mean rad_num_median rad_num_max dose_strep_g_min
  <int>       <dbl>        <dbl>          <dbl>       <dbl>            <dbl>
1   107           1         3.93              5           6                0
# ℹ 3 more variables: dose_strep_g_mean <dbl>, dose_strep_g_median <dbl>,
#   dose_strep_g_max <dbl>

Summarizing data

What about grouped summaries? Two options… group_by()

strep_tb |> 
  group_by(arm) |> 
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            )

… or .by (new syntax)

strep_tb |> 
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            .by = arm
            )
# A tibble: 2 × 6
  arm              n   min median  mean   max
  <fct>        <int> <dbl>  <dbl> <dbl> <dbl>
1 Control         52     1      3  3.13     6
2 Streptomycin    55     1      6  4.67     6

Labelling variables

Variable names tend to be short (less typing, less chance of a typo), but they’re not very useful for tables… We can add labels to variables, which various packages know how to use

library(labelled)
strep_tb_lab <- strep_tb |> 
  set_variable_labels(arm = "Treatment",
                      dose_strep_g = "Dose of Streptomycin", 
                      rad_num = "Radiologic response", 
                      baseline_temp = "Temp. at baseline", 
                      improved = "Improvement in radiologic response")

Retrieve the labels again

var_label(strep_tb_lab$rad_num)
[1] "Radiologic response"

Publication type tables

Options:

  • create something yourself, based on the summaries above (e.g. reshape, format, etc)
  • use a package to do it for you…

Publication type tables

gtsummary is my package of choice

library(gtsummary)
strep_tb_lab |> 
  select(arm, dose_strep_g, baseline_temp,
         rad_num, improved) |> 
  tbl_summary(by = arm, 
              type = c(
                rad_num ~ "continuous"
              )) |> 
  add_overall()

Results are very customisable (see the help files). The package also provides support for model output.

Characteristic                       Overall, N = 107¹   Streptomycin, N = 55¹   Control, N = 52¹
Dose of Streptomycin
    0                                52 (49%)            0 (0%)                  52 (100%)
    2                                55 (51%)            55 (100%)               0 (0%)
Temp. at baseline
    1_98-98.9F                       7 (6.5%)            3 (5.5%)                4 (7.7%)
    2_99-99.9F                       25 (23%)            13 (24%)                12 (23%)
    3_100-100.9F                     32 (30%)            15 (27%)                17 (33%)
    4_100F+                          43 (40%)            24 (44%)                19 (37%)
Radiologic response                  5.00 (2.00, 6.00)   6.00 (3.00, 6.00)       3.00 (1.00, 5.00)
Improvement in radiologic response   55 (51%)            38 (69%)                17 (33%)
¹ n (%); Median (IQR)