Introduction to R, the tidyverse, and data wrangling
We offer the following follow-up courses within the theme Core Methods of the Public Health Sciences Course Program:
Date | Course | Objectives |
---|---|---|
19-23 June 2023 | Introduction to Epidemiology and Study Design | Concepts and measures to quantify the frequency of health outcomes and their associations with exposures, epidemiological study designs and their potential sources of bias, concepts of causality |
26 June 2023 | Diagnostic Test Evaluation | Concepts of diagnostic test evaluation with examples from human and animal health, interpretation of diagnostic test results, reporting of diagnostic test accuracy studies |
23-27 October 2023 | Applied Regression Modeling in R | Different types of regression models, model selection, models for continuous, binary and categorical outcomes, and events |
Day | Time | Topic | Lecturer(s) |
---|---|---|---|
Monday | 09:00-12:00 | Projects in R: Introduction to R, the tidyverse, and data wrangling | Christian Althaus, Alan Haynes |
Monday | 13:00-17:00 | Projects in R: Data visualization with the tidyverse | Christian Althaus, Judith Bouman, Martin Wohlfender |
Tuesday | 09:00-12:30 | Projects in R: Reproducibility and GitHub | Christian Althaus, Alan Haynes |
Thursday | 09:00-12:30 | Basic Statistics: Inference about the mean | Ben Spycher |
Thursday | 13:30-17:00 | Basic Statistics: Non-normal and dependent/paired data | Beatriz Vidondo |
Friday | 09:00-12:30 | Basic Statistics: Inference about proportions and rates | Ben Spycher |
Friday | 13:30-17:00 | Basic Statistics: Continue R project with a guided data analysis | Ben Spycher, Beatriz Vidondo |
Time | Duration | Topic | Content |
---|---|---|---|
09:00-09:10 | 10 min | General introduction | Lecturers and course program |
09:10-09:40 | 30 min | Introduction to R and RStudio | Hands-on, objects, functions, etc. |
09:40-10:00 | 20 min | R projects | Files, folders, names, and templates |
10:00-10:15 | 15 min | Base R and tidyverse | Concepts |
10:15-10:30 | 15 min | Data | Read, CSV, Excel, REDCap |
10:30-10:50 | 20 min | Coffee break | |
10:50-11:00 | 10 min | Data types | |
11:00-12:00 | 60 min | Data wrangling | |
12:00-13:00 | 60 min | Lunch break | |
13:00-13:30 | 30 min | Data visualization | Fundamentals |
13:30-14:30 | 60 min | ggplot2 | General ideas and basic graphs |
14:30-14:50 | 20 min | Coffee break | |
14:50-15:50 | 60 min | ggplot2 | Fancify basic graphs |
15:50-16:10 | 20 min | Coffee break | |
16:10-17:00 | 50 min | Panels | Other types of geoms |
.R and .qmd files
Mandatory steps (should be completed by now)
Optional steps (can be done today)
install.packages("usethis")
usethis::use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")
(Why not R?)
R is a programming language that runs computations. RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools, e.g.:
file referencing (e.g., with the here package).
Note that RStudio (the company) recently changed its name to Posit. RStudio (the IDE) remains unchanged.
Use the cheatsheet to find your way in RStudio: https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf
Click the Run button, or select code and press CTRL + Enter (⌘ + Enter).
Alternatively, type commands directly in the Console and press Enter.
The result of your command(s) will appear in the Console tab if the commands print something, and/or in the Plots tab if they generate a plot.
In the Console, R can simply be used as a calculator:
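For example (the exact expressions are illustrative):

```r
2 + 3        # addition
2 * 4        # multiplication
2^5          # powers
(6 + 3) / 3  # parentheses work as expected
```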
It is recommended to comment your scripts using #:
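A minimal illustration:

```r
# This is a comment: R ignores everything after the #
2 + 3  # comments can also follow code on the same line
```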
Objects can be a single piece of data (e.g., 3.14 or "Bern"), or they can consist of structured data.
All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:
Class | Description | Example(s) |
---|---|---|
numeric | Any real number | 1, 3.14, 8.8e6 |
character | Individual characters or strings, quoted | "a", "Hello, World!" |
factor | Categorical/qualitative variables | Ordered values of economic status |
logical | Boolean variables | TRUE and FALSE |
Date/POSIXct | Calendar dates and times | "2023-06-05" |
Other object classes are array, data.frame, list, and tibble (similar to data.frame).
R uses <- to assign values to an object name (you might also see = used, but this is not best practice).
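For example (object names are illustrative):

```r
x <- 5          # assign the value 5 to the object x
city <- "Bern"  # assign a character string
x               # typing the name prints the object's value
```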
Object names are case-sensitive, i.e., X and x are different.
The function c() combines/concatenates single R objects into a vector (or list) of R objects:
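A sketch consistent with the output shown below:

```r
x <- c(1, 4, 6, 8)  # combine four numbers into a vector
x
sum(x)              # functions work on the whole vector
```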
[1] 1 4 6 8
[1] 19
You can apply functions to entire vectors of numbers very easily.
Every function in R has three basic parts: a name, a body of code, and a set of arguments. To make your own function, you need to replicate these parts and store them in an R object, which you can do with the function function().
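A minimal sketch of a user-defined function (the name and contents are illustrative):

```r
# name: celsius_to_fahrenheit; argument: temp_c; body: the code between the braces
celsius_to_fahrenheit <- function(temp_c) {
  temp_c * 9 / 5 + 32
}
celsius_to_fahrenheit(37)  # 98.6
```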
The built-in functionality of base R can be expanded with packages that others have developed and published.
The Comprehensive R Archive Network (CRAN) has been the main source of R packages. Nowadays, GitHub also contains many packages and has arguably become the primary location for package development. Some packages, e.g., tidyverse, are so-called meta-packages - they load a collection of other packages.
Install new packages as follows:
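For example (the package name is illustrative):

```r
install.packages("tidyverse")  # only needed once per R installation
```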
Packages must be loaded each R session to give access to their functionality:
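For example:

```r
library(tidyverse)  # needed in every new R session
```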
The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages.
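For example:

```r
help(mean)  # open the documentation for mean()
?mean       # shorthand for the same thing
```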
Cheat sheets exist for many packages and topics: https://rstudio.github.io/cheatsheets
Help for R is abundant. 99.9% of your questions will have been asked before, so Google is your friend, or ask on Twitter/Mastodon using the #rstats tag.
There are numerous online tutorials and books on R, RStudio with specific applications to epidemiology, public health, and data science:
The use of projects (.Rproj files) is fundamental to organized coding and project management. There are four main reasons why you should use projects essentially 100% of the time while using RStudio:
Using consistent folder structures across projects helps you work more efficiently. An example of the folder structure for a project looks like this:
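A possible layout (the folder names are illustrative; the course template linked further below uses a similar structure):

```
my_project/
├── my_project.Rproj
├── README.md
├── 01_original_data/
├── 02_code/
│   ├── 01_cleaning.R
│   ├── 02_analysis.R
│   └── 03_plotting.R
└── 03_output/
```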
Bad examples
Better
Analysis scripts can get looooooooong… Don’t be afraid to break them up into smaller chunks.
Use sequential numbers and descriptive names, e.g.:
01_cleaning.R cleans your data,
02_analysis.R performs your analysis,
03_plotting.R plots the results of your analysis.
Sequential numbers allow you to sort the files according to the sequence in which you run them.
Descriptive names inform you of what is actually in there.
You can use a main or master file (00_main.R) to run all other files and create a reproducible analysis.
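A minimal sketch of what such a file could contain:

```r
# 00_main.R: run the whole analysis in order
source("01_cleaning.R")
source("02_analysis.R")
source("03_plotting.R")
```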
Also see How to name files from Jennifer Bryan and The tidyverse style guide.
Adding README.md (Markdown) or README.txt (plain text) files to your project folder and subfolders can be useful to describe your project and/or the content of folders, and provide instructions.
Tomorrow, you will learn more about Markdown (and Quarto).
We have set up a template project for you, including a directory structure: https://github.com/ISPMBern/project-template
Rename the .Rproj file to something more meaningful to you (the same as the folder?).
You will use this project for the rest of the course…
To work on the next exercises, you have to install the following packages:
usethis - workflow package
gitcreds - queries Git credentials from R
here - easy file referencing
tidyverse - a set of packages
medicaldata - medical data sets
cowplot - features to create publication-quality figures
Simply type install.packages("packagename"), but RStudio will also ask you if you try to load a package that you haven't installed yet.
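To install them all in one go:

```r
install.packages(c("usethis", "gitcreds", "here", "tidyverse",
                   "medicaldata", "cowplot"))
```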
Base R is a collection of ca. 25 packages that have been developed since R's conception (ca. 25 years ago).
This age is often evident in the syntax: inconsistent option names and/or ordering of options, and it is sometimes possible to tell which features were afterthoughts.
Syntax varies widely across add-on packages.
Enter the tidyverse…
A group of R packages designed in a consistent manner along the principles of tidy data.
Primarily contains packages for data import (readr), manipulation (“wrangling”; dplyr, forcats, stringr) and visualization (ggplot2).
Load the whole thing via library(tidyverse) or individual packages as usual (e.g. library(ggplot2)).
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
Use case: look-up tables.
Fine for very small datasets, but unwieldy with many variables and/or many observations.
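A sketch of entering a small look-up table by hand with tribble() (the values are illustrative):

```r
library(tibble)
region_lookup <- tribble(
  ~code, ~region,
  "NE",  "northeast",
  "NW",  "northwest",
  "SE",  "southeast",
  "SW",  "southwest"
)
```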
R has a wide range of tools for importing data.
Base R: read.csv, read.csv2, read.delim
tidyverse (readr): read_csv, read_csv2, read_delim
Others: readxl::read_xlsx, REDCapR::redcap_read, haven::read_spss
And even more: secuTrialR::read_secuTrial, odbc::dbConnect, httr2
Many published datasets already exist in R, either in the basic installation or via a package.
From packages, e.g. medicaldata:
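For example, the streptomycin trial data printed below:

```r
library(medicaldata)
strep_tb <- medicaldata::strep_tb  # streptomycin for tuberculosis trial
strep_tb
```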
# A tibble: 107 × 13
patient_id arm dose_strep_g dose_PAS_g gender baseline_condition
<chr> <fct> <dbl> <dbl> <fct> <fct>
1 0001 Control 0 0 M 1_Good
2 0002 Control 0 0 F 1_Good
3 0003 Control 0 0 F 1_Good
4 0004 Control 0 0 M 1_Good
5 0005 Control 0 0 F 1_Good
6 0006 Control 0 0 M 1_Good
7 0007 Control 0 0 F 1_Good
8 0008 Control 0 0 M 1_Good
9 0009 Control 0 0 F 2_Fair
10 0010 Control 0 0 M 2_Fair
# ℹ 97 more rows
# ℹ 7 more variables: baseline_temp <fct>, baseline_esr <fct>,
# baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
# rad_num <dbl>, improved <lgl>
We place the dataset in the appropriate folder (01_original_data) and read it in with the appropriate function (e.g. read_csv).
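A sketch, assuming the file has been saved as a CSV in 01_original_data (the path and file name depend on your project):

```r
library(readr)
library(here)
strep_tb <- read_csv(here("01_original_data", "strep_tb.csv"))
```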
Depending on the file, you might need read_csv2, which is configured for e.g. German environments where CSVs are actually semicolon (;) separated, because the comma is used in numbers…
In base R
Virtually identical… readr is slightly faster and automatically converts some variable types.
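The base R counterpart (same illustrative path as above):

```r
strep_tb_base <- read.csv(here::here("01_original_data", "strep_tb.csv"))
```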
More common(?): Excel files…
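Reading an Excel sheet with readxl (the file and sheet names are illustrative):

```r
library(readxl)
dat <- read_xlsx(here::here("01_original_data", "data.xlsx"), sheet = 1)
```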
Once you’ve loaded a dataset, it’s good practice to inspect the data to see that it’s loaded correctly
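For example (consistent with the output shown below):

```r
str(strep_tb)  # structure: class and first values of every variable
# tidyverse alternative: dplyr::glimpse(strep_tb)
```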
tibble [107 × 13] (S3: tbl_df/tbl/data.frame)
$ patient_id : chr [1:107] "0001" "0002" "0003" "0004" ...
$ arm : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
$ dose_strep_g : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
$ dose_PAS_g : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
$ gender : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
$ baseline_condition : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
$ baseline_temp : Factor w/ 4 levels "1_98-98.9F","2_99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
$ baseline_esr : Factor w/ 4 levels "1_0-10","2_11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
$ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
$ strep_resistance : Factor w/ 3 levels "1_sens_0-8","2_mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
$ radiologic_6m : Factor w/ 6 levels "6_Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...
$ rad_num : num [1:107] 6 5 5 5 5 6 5 5 5 5 ...
$ improved : logi [1:107] TRUE TRUE TRUE TRUE TRUE TRUE ...
Watch Darren Dahly open an Excel file
Excel is easy to use, accessible, …
…but it tries to be clever (e.g. formatting dates, gene names (30% of genetics papers contain mangled names in tables))
CSV can be created by Excel and other software (it is a common export format from databases) and has fewer issues with formatting.
Both lack data validation protocols (which would give more control over entered data).
Databases provide data validation, many are Human Research Act compliant (MS Access is not), many have ways to export data directly from the database to R, or simple ways to import exports
[1] "\xfc" "\xe4" "\xe9" "\xe0"
[1] "\xfc" "\xe4" "\xe9" "\xe0"
[1] "ü" "ä" "é" "à"
Common issue when working in Switzerland - ä, ö, ü, é, è, à, etc
File encoding influences exactly how (special) characters are represented in a file
R needs that information to make sense of the data
If you have free text, check the encoding!
Pro-tip: use Notepad++ to discover the encoding used, and possibly to convert to a different encoding (saving it to a different file…)
Use English as much as possible - saves time dealing with special characters AND no need to translate tables for publications
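One way to see and fix the problem from within R (a sketch using base R's iconv; the bytes shown are the latin1 encodings of ü, ä, é, à):

```r
x <- c("\xfc", "\xe4", "\xe9", "\xe0")   # latin1 bytes, unreadable as-is
iconv(x, from = "latin1", to = "UTF-8")  # "ü" "ä" "é" "à"
# readr can also be told the encoding when reading a file:
# read_csv("file.csv", locale = locale(encoding = "latin1"))
```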
Read the file insurance_with_date.csv into R and explore it a little.
Do you notice any differences between the tidyverse (readr's read_csv) and base R (read.csv)?
You have 5 minutes… go!
spc_tbl_ [1,338 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
$ age : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
$ sex : chr [1:1338] "male" "female" "female" "male" ...
$ bmi : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
$ children: num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
$ smoker : chr [1:1338] "no" "no" "no" "no" ...
$ region : chr [1:1338] "southeast" "southwest" "northwest" "northwest" ...
$ charges : num [1:1338] 13086 2574 4411 2321 13435 ...
$ date : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
- attr(*, "spec")=
.. cols(
.. X = col_double(),
.. age = col_double(),
.. sex = col_character(),
.. bmi = col_double(),
.. children = col_double(),
.. smoker = col_character(),
.. region = col_character(),
.. charges = col_double(),
.. date = col_date(format = "")
.. )
- attr(*, "problems")=<externalptr>
'data.frame': 1338 obs. of 9 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 59 24 28 22 60 38 51 44 47 29 ...
$ sex : chr "male" "female" "female" "male" ...
$ bmi : num 31.8 22.6 25.9 25.2 36 ...
$ children: int 2 0 1 0 0 3 0 0 1 2 ...
$ smoker : chr "no" "no" "no" "no" ...
$ region : chr "southeast" "southwest" "northwest" "northwest" ...
$ charges : num 13086 2574 4411 2321 13435 ...
$ date : chr "2001-01-15" "2001-01-17" "2001-01-22" "2001-01-29" ...
character
cannot (always) be used in models; generally needs converting to another format
factors
from a known list of options…
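A sketch consistent with the output below (the levels are illustrative):

```r
sex <- factor(c("male", "female", "female", "male"),
              levels = c("male", "female", "non-binary"))
sex
```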
[1] male female female male
Levels: male female non-binary
from encoded data…
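For example, assuming a 1/2/3 coding (the coding is illustrative):

```r
sex_coded <- factor(c(1, 2), levels = 1:3,
                    labels = c("male", "female", "non-binary"))
sex_coded
```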
[1] male female
Levels: male female non-binary
factors are more suitable for models than text
numeric
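A few ways of creating numeric vectors, consistent with the output below:

```r
1:4                # integer sequence
c(1, 1, 2, 3)      # values combined by hand
seq(1, 9, by = 2)  # sequence with a step size
seq(1, 10, by = 3)
```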
[1] 1 2 3 4
[1] 1 1 2 3
[1] 1 3 5 7 9
[1] 1 4 7 10
logical
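A sketch consistent with the output below:

```r
TRUE                         # a single logical value
c(TRUE, FALSE, FALSE, TRUE)  # a logical vector
1:4 > 2                      # comparisons return logicals
```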
[1] TRUE
[1] TRUE FALSE FALSE TRUE
[1] FALSE FALSE TRUE TRUE
indicates whether a condition is true or false (yes or no), e.g. death
dataframes
For when all elements are the same length…
lists
For when the objects have different lengths…
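A sketch of a list holding objects of different lengths and classes (the values are illustrative; compare the printed list below):

```r
my_list <- list(
  letter  = "a",
  numbers = rnorm(10),
  data    = data.frame(sex = c("male", "female", "female", "male"),
                       fct = factor(c("male", "female", "female", "male")),
                       num = 1:4,
                       lgl = c(TRUE, FALSE, FALSE, TRUE))
)
my_list
```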
$letter
[1] "a"
$numbers
[1] 1.6767047 -1.8662187 1.0433395 -0.6191508 0.4764069 2.3812988
[7] -0.7050137 1.1650151 0.6387094 -0.6682568
$data
sex fct num lgl
1 male male 1 TRUE
2 female female 2 FALSE
3 female female 3 FALSE
4 male male 4 TRUE
Getting elements out again is the same for both
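For example, using the list sketched above:

```r
my_list$numbers    # extract an element by name with $
my_list[["data"]]  # or with double square brackets
```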
When googling, you will encounter pipes. They enable chaining operations together.
Two main varieties:
%>% is from the magrittr package, introduced ca. 2014
|> was added to base R in 2021 (v4.1.0)
Especially useful in data wrangling…
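A sketch using the strep_tb data, assuming the tidyverse (dplyr) is loaded; the summary chosen is illustrative:

```r
strep_tb |>
  filter(arm == "Control") |>          # keep the control arm
  summarize(mean_rad = mean(rad_num))  # then summarize it
```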
Essentially the same code as the last slide, just in base R
Nesting calls…
…or saving intermediate objects…
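The same result without a pipe (a sketch in base R):

```r
# nesting calls…
mean(subset(strep_tb, arm == "Control")$rad_num)

# …or saving intermediate objects
controls <- subset(strep_tb, arm == "Control")
mean(controls$rad_num)
```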
dplyr and the tidyverse
Most of the tidyverse uses verbs as function names…
readr::read_csv reads a comma-separated-value file
dplyr::filter keeps observations matching some criteria
dplyr::mutate modifies the data (change existing, add new variables)
dplyr::rename renames variables
stringr::str_detect detects whether a string contains a particular piece of text (regular expression - regex)
The tidyverse offers a range of methods to select variables and these methods are used in many of the functions that we will discuss (see the sketch below).
Base R examples
Also by variable class or aspects of the variable name
More tricky with base R
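A sketch of both approaches, using variable names from strep_tb and assuming dplyr is loaded:

```r
# tidyverse: by name, by part of the name, or by class
strep_tb |> select(patient_id, arm, gender)
strep_tb |> select(starts_with("dose"))
strep_tb |> select(where(is.factor))

# base R: by name is easy, by pattern or class takes more work
strep_tb[, c("patient_id", "arm", "gender")]
strep_tb[, grepl("^dose", names(strep_tb))]
strep_tb[, sapply(strep_tb, is.factor)]
```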
filtering
With base R
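For example (the conditions are illustrative):

```r
# dplyr
strep_tb |> filter(arm == "Control", gender == "F")

# base R
strep_tb[strep_tb$arm == "Control" & strep_tb$gender == "F", ]
subset(strep_tb, arm == "Control" & gender == "F")
```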
slicing
Not so different with base R…
Pay attention to ordering in the dataframe
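For example (the row positions are illustrative):

```r
# dplyr: rows by position
strep_tb |> slice(1:5)

# base R
strep_tb[1:5, ]
```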
mutate is your friend
Very simple in base too…
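For example, a single new variable in base R:

```r
strep_tb$control <- strep_tb$arm == "Control"  # base R equivalent of mutate(control = ...)
```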
Not restricted to a single change
In base R, something similar can be done with the rarely used within function.
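A sketch consistent with the output below (dose_strep_g increased by 2, plus a new logical variable control):

```r
strep_tb |>
  mutate(dose_strep_g = dose_strep_g + 2,  # change an existing variable
         control = arm == "Control")       # add a new variable
```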
# A tibble: 107 × 14
patient_id arm dose_strep_g dose_PAS_g gender baseline_condition
<chr> <fct> <dbl> <dbl> <fct> <fct>
1 0001 Control 2 0 M 1_Good
2 0002 Control 2 0 F 1_Good
3 0003 Control 2 0 F 1_Good
4 0004 Control 2 0 M 1_Good
5 0005 Control 2 0 F 1_Good
6 0006 Control 2 0 M 1_Good
7 0007 Control 2 0 F 1_Good
8 0008 Control 2 0 M 1_Good
9 0009 Control 2 0 F 2_Fair
10 0010 Control 2 0 M 2_Fair
# ℹ 97 more rows
# ℹ 8 more variables: baseline_temp <fct>, baseline_esr <fct>,
# baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
# rad_num <dbl>, improved <lgl>, control <lgl>
add two to all dose_* variables
creating new variables
more generic, using variable class
Not useful here, but handy for e.g. factors (examples later)
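A sketch of across() covering the points above (assuming dplyr is loaded):

```r
# add two to all dose_* variables
strep_tb |> mutate(across(starts_with("dose"), ~ .x + 2))

# the same, selecting variables by class instead of by name
strep_tb |> mutate(across(where(is.numeric), ~ .x + 2))

# creating new variables instead of overwriting (.names controls the new names)
strep_tb |> mutate(across(starts_with("dose"), ~ .x + 2, .names = "{.col}_plus2"))
```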
Sometimes you need to do something under one circumstance, something else under another
create text for male/female
We want some specific text in some cases
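A sketch with dplyr's if_else() and case_when() (the labels are illustrative; compare the base R version below):

```r
strep_tb |>
  mutate(sex_txt = if_else(gender == "M", "Male", "Female"),
         txt = case_when(
           gender == "M" & arm == "Streptomycin" ~ "M Strepto",
           gender == "F" & arm == "Streptomycin" ~ "F Strepto",
           TRUE ~ "Control"  # everything else
         ))
```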
# initialize the variable
strep_tb$txt <- "Control"
# replace the values for males in the streptomycin group
strep_tb$txt[
with(strep_tb, gender == "M" & arm == "Streptomycin") # which cases
] <- "M Strepto"
# replace the values for females in the streptomycin group, using a temporary variable
to_change <- with(strep_tb, gender == "F" & arm == "Streptomycin")
strep_tb$txt[to_change] <- "F Strepto"
# strep_tb$x[to_change] <- strep_tb$x[to_change] * 2
rm(to_change) # clean up
hmmm… tidyverse syntax is much nicer!?
stringr
The stringr package contains functions specifically for working with strings. Most functions start with str_.
Changing case
Remove white space
Substrings
Detect a substring
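A sketch of these four tasks (the input string is illustrative):

```r
library(stringr)
x <- "  Hello, World!  "
str_to_upper(x)         # changing case
str_to_lower(x)
str_trim(x)             # remove leading/trailing white space
str_sub("Hello", 1, 4)  # substrings: "Hell"
str_detect(x, "World")  # detect a substring: TRUE
```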
Using stringr within mutate
strep_tb |>
mutate(txt = as.character(baseline_condition),
upper = str_to_upper(txt),
good = str_detect(txt, "Good"),
no_number = str_replace(txt, "^[[:digit:]]_", "")) |>
select(txt:no_number) |> unique()
# A tibble: 3 × 4
txt upper good no_number
<chr> <chr> <lgl> <chr>
1 1_Good 1_GOOD TRUE Good
2 2_Fair 2_FAIR FALSE Fair
3 3_Poor 3_POOR FALSE Poor
forcats
The forcats package contains functions specifically for working with factors. Functions (almost) all begin with fct_.
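For example, the baseline condition in strep_tb (consistent with the output below):

```r
cond <- unique(strep_tb$baseline_condition)
cond
```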
[1] 1_Good 2_Fair 3_Poor
Levels: 1_Good 2_Fair 3_Poor
forcats
Change level names
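One way to do it with fct_recode(), using the cond factor sketched above (new name = old name):

```r
fct_recode(cond, Good = "1_Good", Fair = "2_Fair", Poor = "3_Poor")
```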
[1] Good Fair Poor
Levels: Good Fair Poor
This is also possible by regular expression (regex)
forcats
Within mutate
strep_tb |>
mutate(baseline_condition_new = fct_relabel(baseline_condition,
str_replace,
pattern = "^[[:digit:]]_", # "any leading digit followed by underscore"
replacement = "")) |>
select(baseline_condition, baseline_condition_new) |>
str()
tibble [107 × 2] (S3: tbl_df/tbl/data.frame)
$ baseline_condition : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
$ baseline_condition_new: Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
Do it to all factors
strep_tb |>
mutate(across(where(is.factor),
~ fct_relabel(.x,
str_replace,
pattern = "^[[:digit:]]_",
replacement = ""))) |>
select(where(is.factor)) |>
str()
tibble [107 × 8] (S3: tbl_df/tbl/data.frame)
$ arm : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
$ gender : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
$ baseline_condition : Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
$ baseline_temp : Factor w/ 4 levels "98-98.9F","99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
$ baseline_esr : Factor w/ 4 levels "0-10","11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
$ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
$ strep_resistance : Factor w/ 3 levels "sens_0-8","mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
$ radiologic_6m : Factor w/ 6 levels "Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...
lubridate
lubridate provides a comprehensive set of functions for working with dates and date-times.
Dates come in many formats; lubridate handles them easily.
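A sketch of parsing various formats (the inputs are illustrative; the outputs below come from a similar set of calls):

```r
library(lubridate)
ymd("2023-01-20")           # year-month-day
ymd_hm("2023-01-20 10:15")  # date-time (printed in UTC)
dmy("20.01.2023")           # day-month-year
dmy("20 January 2023")
mdy("January 20, 2023")
```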
[1] "2023-01-20"
[1] "2023-01-20 10:15:00 UTC"
[1] "2023-01-20"
[1] "2023-01-20"
[1] NA
[1] "2023-01-20"
[1] "2023-01-20"
base
In base R, it's not so easy… (see ?strptime for details)
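A sketch of base R equivalents using format strings:

```r
as.Date("2023-01-20")                            # ISO format works directly
as.Date("20.01.2023", format = "%d.%m.%Y")
as.Date("20 January 2023", format = "%d %B %Y")  # month names depend on the system locale
as.POSIXct("2023-01-20 10:15", tz = "CET")
```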
[1] "2023-01-20"
[1] "2023-01-20"
[1] "2023-01-20"
[1] "2023-01-20 10:15:00 CET"
Very specific to system settings (language)…
Unless you have e.g. “2023-01-20 10:15”, stick with lubridate…
lubridate
We can do maths with date(-time)s…
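For example (the dates are illustrative; compare the output below):

```r
ymd("2023-02-20") - ymd("2023-01-20")
```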
Time difference of 31 days
It’s normally worth converting it to a number…
Add a certain number of months
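Continuing the sketch:

```r
as.numeric(ymd("2023-02-20") - ymd("2023-01-20"))  # 31, as a plain number
ymd("2023-01-20") + months(3)                      # "2023-04-20"
```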
lubridate
Extracting components of the date(-time)
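A sketch consistent with the output below (the date is illustrative):

```r
d <- ymd("2023-01-15")
year(d)   # 2023
month(d)  # 1
day(d)    # 15
```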
[1] 2023
[1] 1
[1] 15
With base R:
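Using format() and conversion (a sketch, reusing d from above):

```r
as.numeric(format(d, "%Y"))  # year
as.numeric(format(d, "%m"))  # month
as.numeric(format(d, "%d"))  # day
```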
Check the cheat sheet for lots more of lubridate
s capabilities
Date[1:1], format: "2023-01-20"
POSIXct[1:1], format: "2023-01-20 10:15:00"
Stored internally as numbers! This allows the maths operations to work
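For example (consistent with the numbers printed below):

```r
as.numeric(ymd("2023-01-20"))           # 19377 days
as.numeric(ymd_hm("2023-01-20 10:15"))  # 1674209700 seconds
```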
[1] 19377
[1] 1674209700
What do those numbers mean?
They’re days and seconds since an origin…
When is that origin (timepoint 0)?
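One way to check, using lubridate's as_date() and as_datetime() (a sketch):

```r
as_date(0)      # "1970-01-01"
as_datetime(0)  # "1970-01-01 UTC"
```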
[1] "1970-01-01"
[1] "1970-01-01 UTC"
Days since 1st January 1970
Seconds since 1st January 1970
Using the insurance data you loaded earlier…
convert sex and region to factors
create new logical variables (e.g. more than two children, smoker yes/no)
add 6 months to the date variable
tibble [1,338 × 12] (S3: tbl_df/tbl/data.frame)
$ X : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
$ age : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 2 2 ...
$ bmi : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
$ children : num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
$ smoker : chr [1:1338] "no" "no" "no" "no" ...
$ region : Factor w/ 4 levels "northeast","northwest",..: 3 4 2 2 1 4 4 2 4 2 ...
$ charges : num [1:1338] 13086 2574 4411 2321 13435 ...
$ date : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
$ gt2_children: logi [1:1338] FALSE FALSE FALSE FALSE FALSE TRUE ...
$ smokes : logi [1:1338] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ date_6m : Date[1:1338], format: "2001-07-15" "2001-07-17" ...
Sometimes it’s necessary to pivot data. E.g. all observations from an individual are on a single row; for our analysis we need them to be in a single variable. pivot_longer is the tool for the task.
The opposite, observations on rows to observations in columns, is pivot_wider.
In base R, both these scenarios are handled by reshape; its syntax and documentation are confusing… which is why there are two separate functions in tidyr.
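A sketch with made-up blood-pressure measurements:

```r
library(tidyr)
wide <- tibble::tibble(id = 1:2, bp_t1 = c(120, 130), bp_t2 = c(118, 128))

long <- wide |>
  pivot_longer(starts_with("bp"), names_to = "time", values_to = "bp")

long |>
  pivot_wider(names_from = time, values_from = bp)  # back to one row per id
```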
In base R, use merge and combinations of all.x and all.y to specify the different join types, and by.x and by.y to specify the variables to join on.
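A sketch comparing dplyr joins with base merge (the data are illustrative):

```r
library(dplyr)
patients <- tibble(id = 1:3, arm = c("Control", "Strepto", "Control"))
outcomes <- tibble(id = c(1, 3), improved = c(TRUE, FALSE))

left_join(patients, outcomes, by = "id")   # keep all patients
inner_join(patients, outcomes, by = "id")  # keep only matching rows

merge(patients, outcomes, by = "id", all.x = TRUE)  # base R left join
```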
At some point, you will have to create summary data. dplyr can help with that too; summarize is the appropriate function.
strep_tb |>
summarize(n = n(),
min = min(rad_num),
median = median(rad_num),
mean = mean(rad_num),
            max = max(rad_num))
# A tibble: 1 × 5
n min median mean max
<int> <dbl> <dbl> <dbl> <dbl>
1   107     1      5  3.93     6
Remember across? It comes in useful here too.
strep_tb |>
summarize(n = n(),
across(c(rad_num, dose_strep_g),
list(min = ~ min(.x, na.rm = TRUE),
mean = mean,
median = median,
max = max),
.names = "{.col}_{.fn}"),
)
# A tibble: 1 × 9
n rad_num_min rad_num_mean rad_num_median rad_num_max dose_strep_g_min
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 107 1 3.93 5 6 0
# ℹ 3 more variables: dose_strep_g_mean <dbl>, dose_strep_g_median <dbl>,
# dose_strep_g_max <dbl>
What about grouped summaries? Two options… group_by()…
…or .by (new syntax)
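A minimal sketch of both forms (the output below includes additional summary columns):

```r
# group_by() …
strep_tb |>
  group_by(arm) |>
  summarize(n = n(), mean = mean(rad_num))

# … or the .by argument (dplyr >= 1.1.0)
strep_tb |>
  summarize(n = n(), mean = mean(rad_num), .by = arm)
```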
# A tibble: 2 × 6
arm n min median mean max
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Control 52 1 3 3.13 1
2 Streptomycin 55 1 6 4.67 1
Variable names tend to be short - less typing, less chance of making a typo. But they’re not very useful for tables… We can add labels to variables, which various packages know how to use
Retrieve the labels again
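One way to do both, using the "label" attribute that packages such as gtsummary understand (a sketch; the labelled package's var_label() offers the same functionality):

```r
attr(strep_tb$rad_num, "label") <- "Radiologic response"  # set a label
attr(strep_tb$rad_num, "label")                           # retrieve it again
```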
There are several options; gtsummary is my package of choice.
Results are very customisable (see the help files). The package also provides support for model output.
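A sketch of the kind of call that produces a table like the one below (the selected variables are illustrative):

```r
library(gtsummary)
strep_tb |>
  select(arm, dose_strep_g, baseline_temp, rad_num, improved) |>
  tbl_summary(by = arm) |>
  add_overall()
```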
Characteristic | Overall, N = 107¹ | Streptomycin, N = 55¹ | Control, N = 52¹ |
---|---|---|---|
Dose of Streptomycin | |||
0 | 52 (49%) | 0 (0%) | 52 (100%) |
2 | 55 (51%) | 55 (100%) | 0 (0%) |
Temp. at baseline | |||
1_98-98.9F | 7 (6.5%) | 3 (5.5%) | 4 (7.7%) |
2_99-99.9F | 25 (23%) | 13 (24%) | 12 (23%) |
3_100-100.9F | 32 (30%) | 15 (27%) | 17 (33%) |
4_100F+ | 43 (40%) | 24 (44%) | 19 (37%) |
Radiologic response | 5.00 (2.00, 6.00) | 6.00 (3.00, 6.00) | 3.00 (1.00, 5.00) |
Improvement in radiologic response | 55 (51%) | 38 (69%) | 17 (33%) |
¹ n (%); Median (IQR) |
Public Health Sciences Course Program - Basic Statistics and Projects in R. Slides available on GitHub.