Introduction to R, the tidyverse, and data wrangling
We offer the following follow-up courses within the theme Core Methods of the Public Health Sciences Course Program:
| Date | Course | Objectives |
|---|---|---|
| 19-23 June 2023 | Introduction to Epidemiology and Study Design | Concepts and measures to quantify the frequency of health outcomes and their associations with exposures, epidemiological study designs and their potential sources of bias, concepts of causality |
| 26 June 2023 | Diagnostic Test Evaluation | Concepts of diagnostic test evaluation with examples from human and animal health, interpretation of diagnostic test results, reporting of diagnostic test accuracy studies |
| 23-27 October 2023 | Applied Regression Modeling in R | Different types of regression models, model selection, models for continuous, binary and categorical outcomes, and events |
| Day | Time | Topic | Lecturer(s) |
|---|---|---|---|
| Monday | 09:00-12:00 | Projects in R: Introduction to R, the tidyverse, and data wrangling | Christian Althaus, Alan Haynes |
| Monday | 13:00-17:00 | Projects in R: Data visualization with the tidyverse | Christian Althaus, Judith Bouman, Martin Wohlfender |
| Tuesday | 09:00-12:30 | Projects in R: Reproducibility and GitHub | Christian Althaus, Alan Haynes |
| Thursday | 09:00-12:30 | Basic Statistics: Inference about the mean | Ben Spycher |
| Thursday | 13:30-17:00 | Basic Statistics: Non-normal and dependent/paired data | Beatriz Vidondo |
| Friday | 09:00-12:30 | Basic Statistics: Inference about proportions and rates | Ben Spycher |
| Friday | 13:30-17:00 | Basic Statistics: Continue R project with a guided data analysis | Ben Spycher, Beatriz Vidondo |
| Time | Duration | Topic | Content |
|---|---|---|---|
| 09:00-09:10 | 10 min | General introduction | Lecturers and course program |
| 09:10-09:40 | 30 min | Introduction to R and RStudio | Hands-on, objects, functions, etc. |
| 09:40-10:00 | 20 min | R projects | Files, folders, names, and templates |
| 10:00-10:15 | 15 min | Base R and tidyverse | Concepts |
| 10:15-10:30 | 15 min | Data | Read, CSV, Excel, REDCap |
| 10:30-10:50 | 20 min | Coffee break | |
| 10:50-11:00 | 10 min | Data types | |
| 11:00-12:00 | 60 min | Data wrangling | |
| 12:00-13:00 | 60 min | Lunch break | |
| 13:00-13:30 | 30 min | Data visualization | Fundamentals |
| 13:30-14:30 | 60 min | ggplot2 | General ideas and basic graphs |
| 14:30-14:50 | 20 min | Coffee break | |
| 14:50-15:50 | 60 min | ggplot2 | Fancify basic graphs |
| 15:50-16:10 | 20 min | Coffee break | |
| 16:10-17:00 | 50 min | Panels | Other types of geoms |
.R and .qmd files

Mandatory steps (should be completed by now)
Optional steps (can be done today)
install.packages("usethis")usethis::use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")(Why not R?)
R is a programming language that runs computations. RStudio is an integrated development environment (IDE) that provides an interface around R and adds many convenient features and tools, e.g., project-relative file paths (see the here package).

Note that RStudio (the company) recently changed its name to Posit. RStudio (the IDE) remains unchanged.
Use the cheatsheet to find your way in RStudio: https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf
To run code from a script, use the Run button or select the code and press CTRL + Enter (⌘ + Enter). In the Console, run a command by pressing Enter.
The result of your command(s) will appear in the tab Console if the commands are intended to print something, and/or in the tab Plots if the commands generate a plot.
In the Console, R can simply be used as a calculator:
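For example (a minimal illustration; the specific numbers are placeholders):

```r
2 + 3        # addition
7 * 6        # multiplication
2^5          # exponentiation
(4 + 5) / 3  # parentheses work as usual
```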
It is recommended to comment your scripts using #:
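A small illustrative sketch of a commented script (the variables are invented for this example):

```r
# Compute the body mass index (BMI) from weight (kg) and height (m)
weight <- 70
height <- 1.75
weight / height^2
```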
Objects can be a single piece of data (e.g., 3.14 or "Bern"), or they can consist of structured data.
All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:
| Class | Description | Example(s) |
|---|---|---|
| numeric | Any real number | 1, 3.14, 8.8e6 |
| character | Individual characters or strings, quoted | "a", "Hello, World!" |
| factor | Categorical/qualitative variables | Ordered values of economic status |
| logical | Boolean variables | TRUE and FALSE |
| Date/POSIXct | Calendar dates and times | "2023-06-05" |
Other object classes are array, data.frame, list, and tibble (similar to data.frame).
R uses the operator <- to assign values to an object name (you might also see = used, but this is not best practice).
Object names are case-sensitive, i.e., X and x are different.
The function c() combines/concatenates single R objects into a vector (or list) of R objects:
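The code that produced the output below was not preserved in these notes; a minimal reconstruction consistent with it could be:

```r
x <- c(1, 4, 6, 8)  # combine four numbers into a vector
x                   # print the vector
sum(x)              # apply a function to the whole vector
```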
[1] 1 4 6 8
[1] 19
You can apply functions to entire vectors of numbers very easily.
Every function in R has three basic parts: a name, a body of code, and a set of arguments. To make your own function, you need to replicate these parts and store them in an R object, which you can do with the function function().
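A minimal sketch of a user-defined function (the name and body are illustrative, not from the original slides):

```r
# name: celsius_to_fahrenheit; argument: temp_c; body: the conversion formula
celsius_to_fahrenheit <- function(temp_c) {
  temp_c * 9 / 5 + 32
}
celsius_to_fahrenheit(37)  # returns 98.6
```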
The built-in functionality of base R can be expanded with packages that others have developed and published.
The Comprehensive R Archive Network (CRAN) has been the main source of R packages. Nowadays, GitHub also contains many packages and has arguably become the primary location for package development. Some packages, e.g., tidyverse, are so-called meta-packages - they load a collection of other packages.
Install new packages as follows:
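For example (the package name is just an example):

```r
install.packages("tidyverse")
```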
Packages must be loaded each R session to give access to their functionality:
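For example:

```r
library(tidyverse)
```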
The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages.
Cheat sheets exist for many packages and topics: https://rstudio.github.io/cheatsheets
Help for R is abundant. 99.9% of your questions will have been asked before, so Google is your friend, or ask on Twitter/Mastodon using the #rstats tag.
There are numerous online tutorials and books on R, RStudio with specific applications to epidemiology, public health, and data science:
The use of projects (.Rproj files) is fundamental to organized coding and project management. There are four main reasons why you should use projects essentially 100% of the time while using RStudio:
Using consistent folder structures across projects helps you work more efficiently. An example of the folder structure for a project looks like this:
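A sketch of such a structure (apart from 01_original_data, which is used later in these notes, the folder names are illustrative assumptions):

```
my_project/
├── my_project.Rproj
├── README.md
├── 01_original_data/
├── 02_processed_data/
├── 03_scripts/
└── 04_output/
```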
Bad examples
Better
Analysis scripts can get looooooooong… Don’t be afraid to break them up into smaller chunks.
Use sequential numbers and descriptive names, e.g.:
- 01_cleaning.R cleans your data,
- 02_analysis.R performs your analysis,
- 03_plotting.R plots the results of your analysis.

Sequential numbers allow you to sort the files according to the sequence in which you run them.
Descriptive names inform you of what is actually in there.
You can use a main or master file (00_main.R) to run all other files and create a reproducible analysis.
Also see How to name files from Jennifer Bryan and The tidyverse style guide.
Adding README.md (Markdown) or README.txt (plain text) files to your project folder and subfolders can be useful to describe your project and/or the content of folders, and provide instructions.
Tomorrow, you will learn more about Markdown (and Quarto).
We have set up a template project for you, including a directory structure: https://github.com/ISPMBern/project-template
Rename the .Rproj file to something more meaningful to you (the same as the folder?). You will use this project for the rest of the course…
To work on the next exercises, you have to install the following packages:
- usethis - Workflow package
- gitcreds - Queries Git credentials from R
- here - Easy file referencing
- tidyverse - A set of packages
- medicaldata - Medical data sets
- cowplot - Features to create publication-quality figures

Simply type install.packages("packagename"); RStudio will also offer to install a package if you try to load one that you haven’t installed yet.
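One way to install them all at once (equivalent to installing each package individually):

```r
install.packages(c("usethis", "gitcreds", "here", "tidyverse", "medicaldata", "cowplot"))
```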
Collection of ca 25 packages that have been developed since R’s conception (ca 25 years ago)
This age is often evident in the syntax - inconsistent option names and/or ordering of options, sometimes possible to tell which features were afterthoughts
Syntax varies widely across add-on packages
Enter the tidyverse…
A group of R packages designed in a consistent manner along the principles of tidy data
Primarily contains packages for data import (readr), manipulation (“wrangling”; dplyr, forcats, stringr) and visualization (ggplot2)
Load the whole thing via library(tidyverse) or individual packages as usual (e.g. library(ggplot2))
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
Use case: look-up tables.

Fine for very small datasets, but unwieldy with many variables and/or many observations.
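A sketch of a hand-entered look-up table using tribble() from the tidyverse (the labels are invented for illustration; the regions match the insurance data used later):

```r
library(tibble)
region_lookup <- tribble(
  ~region,     ~label,
  "northeast", "North East",
  "northwest", "North West",
  "southeast", "South East",
  "southwest", "South West"
)
```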
R has a wide range of tools for importing data.

Base R

- read.csv
- read.csv2
- read.delim

tidyverse (readr)

- read_csv
- read_csv2
- read_delim

Others

- readxl::read_xlsx
- REDCapR::redcap_read
- haven::read_spss

And even more

- secuTrialR::read_secuTrial
- odbc::dbConnect
- httr2

Many published datasets already exist in R, either in the basic installation or via a package
From packages, e.g. medicaldata
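Assuming the medicaldata package is installed, the Streptomycin trial data used below can be accessed like this:

```r
library(medicaldata)
strep_tb  # the Streptomycin for TB trial dataset shipped with the package
```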
# A tibble: 107 × 13
patient_id arm dose_strep_g dose_PAS_g gender baseline_condition
<chr> <fct> <dbl> <dbl> <fct> <fct>
1 0001 Control 0 0 M 1_Good
2 0002 Control 0 0 F 1_Good
3 0003 Control 0 0 F 1_Good
4 0004 Control 0 0 M 1_Good
5 0005 Control 0 0 F 1_Good
6 0006 Control 0 0 M 1_Good
7 0007 Control 0 0 F 1_Good
8 0008 Control 0 0 M 1_Good
9 0009 Control 0 0 F 2_Fair
10 0010 Control 0 0 M 2_Fair
# ℹ 97 more rows
# ℹ 7 more variables: baseline_temp <fct>, baseline_esr <fct>,
# baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
# rad_num <dbl>, improved <lgl>
We place the dataset in the appropriate folder (01_original_data) and read it in with the appropriate function (e.g. read_csv)
Depending on the file, you might need read_csv2, which is configured for e.g. German environments where CSVs are actually semicolon (;) separated, because the comma is used in numbers…
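A sketch of reading a CSV from the project's data folder (the file name is an assumption):

```r
library(readr)
library(here)

# read the comma-separated file relative to the project root
strep_tb <- read_csv(here("01_original_data", "strep_tb.csv"))
# for semicolon-separated files (e.g. German locales), use read_csv2()
```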
In base R
Virtually identical… readr is slightly faster and automatically converts some variable types
More common(?): Excel files…
Once you’ve loaded a dataset, it’s good practice to inspect the data to see that it’s loaded correctly
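For example, str() gives a compact overview of the structure (producing the output below):

```r
str(strep_tb)     # classes and first values of each variable
# head(strep_tb) and View(strep_tb) are also useful
```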
tibble [107 × 13] (S3: tbl_df/tbl/data.frame)
$ patient_id : chr [1:107] "0001" "0002" "0003" "0004" ...
$ arm : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
$ dose_strep_g : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
$ dose_PAS_g : num [1:107] 0 0 0 0 0 0 0 0 0 0 ...
$ gender : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
$ baseline_condition : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
$ baseline_temp : Factor w/ 4 levels "1_98-98.9F","2_99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
$ baseline_esr : Factor w/ 4 levels "1_0-10","2_11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
$ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
$ strep_resistance : Factor w/ 3 levels "1_sens_0-8","2_mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
$ radiologic_6m : Factor w/ 6 levels "6_Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...
$ rad_num : num [1:107] 6 5 5 5 5 6 5 5 5 5 ...
$ improved : logi [1:107] TRUE TRUE TRUE TRUE TRUE TRUE ...
Watch Darren Dahly open an Excel file
Excel is easy to use, accessible, …
…but it tries to be clever (e.g. formatting dates, gene names (30% of genetics papers contain mangled names in tables))
CSV can be created by Excel and other software (it is a common export format from databases) and has fewer issues with formatting

Both lack data validation protocols (which give more control over entered data)
Databases provide data validation, many are Human Research Act compliant (MS Access is not), many have ways to export data directly from the database to R, or simple ways to import exports
[1] "\xfc" "\xe4" "\xe9" "\xe0"
[1] "\xfc" "\xe4" "\xe9" "\xe0"
[1] "ü" "ä" "é" "à"
Common issue when working in Switzerland - ä, ö, ü, é, è, à, etc
File encoding influences exactly how (special) characters are represented in a file
R needs that information to make sense of the data
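A sketch of handling encodings (the file name is hypothetical; iconv() mirrors the conversion shown in the output above):

```r
# declare the encoding when reading (file name is a placeholder)
dat <- readr::read_csv("01_original_data/free_text.csv",
                       locale = readr::locale(encoding = "latin1"))

# or convert an already-imported character vector to UTF-8
x <- c("\xfc", "\xe4", "\xe9", "\xe0")   # latin1 bytes, as printed above
iconv(x, from = "latin1", to = "UTF-8")  # "ü" "ä" "é" "à"
```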
Free text, check the encoding!
Pro-tip: use Notepad++ to discover the encoding used, and possibly to convert to a different encoding (saving it to a different file…)
Use English as much as possible - saves time dealing with special characters AND no need to translate tables for publications
Read file insurance_with_date.csv into R and explore it a little.
Do you notice any differences between readr's read_csv() and base R's read.csv()? You have 5 minutes… go!
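A possible starting point (assuming the file sits in the project's data folder), which produces the two str() outputs below:

```r
library(readr)
library(here)

insurance_readr <- read_csv(here("01_original_data", "insurance_with_date.csv"))
insurance_base  <- read.csv(here("01_original_data", "insurance_with_date.csv"))

str(insurance_readr)
str(insurance_base)
```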
spc_tbl_ [1,338 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
$ age : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
$ sex : chr [1:1338] "male" "female" "female" "male" ...
$ bmi : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
$ children: num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
$ smoker : chr [1:1338] "no" "no" "no" "no" ...
$ region : chr [1:1338] "southeast" "southwest" "northwest" "northwest" ...
$ charges : num [1:1338] 13086 2574 4411 2321 13435 ...
$ date : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
- attr(*, "spec")=
.. cols(
.. X = col_double(),
.. age = col_double(),
.. sex = col_character(),
.. bmi = col_double(),
.. children = col_double(),
.. smoker = col_character(),
.. region = col_character(),
.. charges = col_double(),
.. date = col_date(format = "")
.. )
- attr(*, "problems")=<externalptr>
'data.frame': 1338 obs. of 9 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 59 24 28 22 60 38 51 44 47 29 ...
$ sex : chr "male" "female" "female" "male" ...
$ bmi : num 31.8 22.6 25.9 25.2 36 ...
$ children: int 2 0 1 0 0 3 0 0 1 2 ...
$ smoker : chr "no" "no" "no" "no" ...
$ region : chr "southeast" "southwest" "northwest" "northwest" ...
$ charges : num 13086 2574 4411 2321 13435 ...
$ date : chr "2001-01-15" "2001-01-17" "2001-01-22" "2001-01-29" ...
character

Cannot (always) be used in models; generally needs converting to another format
factors

From a known list of options…
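A reconstruction consistent with the output shown below (the original code was not preserved):

```r
sex <- factor(c("male", "female", "female", "male"),
              levels = c("male", "female", "non-binary"))
sex
```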
[1] male female female male
Levels: male female non-binary
from encoded data…
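For example, labelling numeric codes while converting to a factor (values are illustrative, consistent with the output below):

```r
sex_coded <- c(1, 2)
factor(sex_coded, levels = 1:3, labels = c("male", "female", "non-binary"))
```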
[1] male female
Levels: male female non-binary
factors are more suitable for models than text
numeric

[1] 1 2 3 4
[1] 1 1 2 3
[1] 1 3 5 7 9
[1] 1 4 7 10
logical

[1] TRUE
[1] TRUE FALSE FALSE TRUE
[1] FALSE FALSE TRUE TRUE
is a condition true or false, yes or no, e.g. death
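Logical vectors typically arise from comparisons; an illustrative sketch mirroring the style of output above:

```r
x <- c(1, 2, 3, 4)
x > 2          # FALSE FALSE  TRUE  TRUE
sex == "male"  # TRUE FALSE FALSE  TRUE (using the factor created earlier)
```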
dataframes

For when all elements are the same length…
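A small illustration (consistent with the $data element of the list shown below):

```r
df <- data.frame(sex = c("male", "female", "female", "male"),
                 fct = sex,  # the factor created earlier
                 num = 1:4,
                 lgl = c(TRUE, FALSE, FALSE, TRUE))
df
```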
lists

For when the objects have different lengths…
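A sketch of a list holding objects of different lengths (consistent with the output below; the random numbers will of course differ):

```r
lst <- list(letter  = "a",
            numbers = rnorm(10),
            data    = df)
lst
```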
$letter
[1] "a"
$numbers
[1] 1.6767047 -1.8662187 1.0433395 -0.6191508 0.4764069 2.3812988
[7] -0.7050137 1.1650151 0.6387094 -0.6682568
$data
sex fct num lgl
1 male male 1 TRUE
2 female female 2 FALSE
3 female female 3 FALSE
4 male male 4 TRUE
Getting elements out again is the same for both
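For example (using the illustrative objects sketched above):

```r
df$sex         # column of a data frame
df[["sex"]]    # same, with double brackets
lst$numbers    # element of a list
lst[["data"]]  # same, with double brackets
```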
When googling, you will encounter pipes. They enable chaining operations together.
Two main varieties:
%>% is from the magrittr package, introduced ca 2014
|> was added to base R in 2021 (v4.1.0)
Especially useful in data wrangling…
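A hedged sketch of a small wrangling pipeline (the particular steps and the name improved_pct are illustrative):

```r
strep_tb |>
  filter(arm == "Streptomycin") |>          # keep one trial arm
  mutate(improved_pct = mean(improved)) |>  # add a derived variable
  select(patient_id, gender, improved, improved_pct)
```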
Essentially the same code as the last slide, just in base R
Nesting calls…
…or saving intermediate objects…
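A sketch of the base-R equivalents (same illustrative steps as above):

```r
# nesting calls...
subset(transform(subset(strep_tb, arm == "Streptomycin"),
                 improved_pct = mean(improved)),
       select = c(patient_id, gender, improved, improved_pct))

# ...or saving intermediate objects
tmp <- subset(strep_tb, arm == "Streptomycin")
tmp <- transform(tmp, improved_pct = mean(improved))
tmp[, c("patient_id", "gender", "improved", "improved_pct")]
```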
dplyr and the tidyverse

Most of the tidyverse uses verbs as their function names…
- readr::read_csv reads a comma-separated-value file
- dplyr::filter keeps observations matching some criteria
- dplyr::mutate modifies the data (change existing, add new variables)
- dplyr::rename renames variables
- stringr::str_detect detects whether a string contains a particular piece of text (regular expression - regex)

The tidyverse offers a range of methods to select variables, and these methods are used in many of the functions that we will discuss.
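Illustrative ways of selecting variables with the tidyverse (using strep_tb from above):

```r
strep_tb |> select(patient_id, arm, gender)  # by name
strep_tb |> select(patient_id:dose_PAS_g)    # a range of adjacent columns
strep_tb |> select(-rad_num)                 # everything except one column
```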
Base R examples
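For instance (illustrative selections):

```r
strep_tb[, c("patient_id", "arm", "gender")]
strep_tb[["gender"]]
strep_tb$gender
```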
Also by variable class or aspects of the variable name
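For example, with tidyselect helpers:

```r
strep_tb |> select(where(is.factor))     # by class
strep_tb |> select(starts_with("dose"))  # by name pattern
strep_tb |> select(contains("baseline")) # by name pattern
```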
More tricky with base R
filtering
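For example, keeping observations matching a condition (the conditions are illustrative):

```r
strep_tb |> filter(gender == "M")
strep_tb |> filter(arm == "Control", improved)
```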
With base R
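For example:

```r
strep_tb[strep_tb$gender == "M", ]
subset(strep_tb, arm == "Control" & improved)
```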
slice-ing
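For example, selecting rows by position:

```r
strep_tb |> slice(1:5)         # first five rows
strep_tb |> slice_tail(n = 3)  # last three rows
```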
Not so different with base R…
Pay attention to ordering in the dataframe
mutate is your friend
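For example, adding a derived variable (the name dose_total_g is illustrative):

```r
strep_tb |> mutate(dose_total_g = dose_strep_g + dose_PAS_g)
```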
Very simple in base too…
Not restricted to a single change
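A reconstruction consistent with the tibble printed below (dose_strep_g increased by 2 and a new control indicator added):

```r
strep_tb |>
  mutate(dose_strep_g = dose_strep_g + 2,
         control = arm == "Control")
```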
In base R, something similar can be done with the rarely used within function
# A tibble: 107 × 14
patient_id arm dose_strep_g dose_PAS_g gender baseline_condition
<chr> <fct> <dbl> <dbl> <fct> <fct>
1 0001 Control 2 0 M 1_Good
2 0002 Control 2 0 F 1_Good
3 0003 Control 2 0 F 1_Good
4 0004 Control 2 0 M 1_Good
5 0005 Control 2 0 F 1_Good
6 0006 Control 2 0 M 1_Good
7 0007 Control 2 0 F 1_Good
8 0008 Control 2 0 M 1_Good
9 0009 Control 2 0 F 2_Fair
10 0010 Control 2 0 M 2_Fair
# ℹ 97 more rows
# ℹ 8 more variables: baseline_temp <fct>, baseline_esr <fct>,
# baseline_cavitation <fct>, strep_resistance <fct>, radiologic_6m <fct>,
# rad_num <dbl>, improved <lgl>, control <lgl>
add two to all dose_* variables
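For example, using across() with a name pattern:

```r
strep_tb |>
  mutate(across(starts_with("dose"), ~ .x + 2))
```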
creating new variables
more generic, using variable class
Not useful here, but handy for e.g. factors (examples later)
Sometimes you need to do something under one circumstance, something else under another
create text for male/female
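For example, with if_else() (the new variable name is illustrative):

```r
strep_tb |>
  mutate(sex_txt = if_else(gender == "M", "male", "female"))
```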
We want some specific text in some cases
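A sketch with case_when(), mirroring the base-R code below:

```r
strep_tb |>
  mutate(txt = case_when(
    gender == "M" & arm == "Streptomycin" ~ "M Strepto",
    gender == "F" & arm == "Streptomycin" ~ "F Strepto",
    .default = "Control"  # dplyr >= 1.1.0; otherwise use TRUE ~ "Control"
  ))
```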
# initialize the variable
strep_tb$txt <- "Control"
# replace the values for males in the streptomycin group
strep_tb$txt[
with(strep_tb, gender == "M" & arm == "Streptomycin") # which cases
] <- "M Strepto"
# replace the values for females in the streptomycin group, using a temporary variable
to_change <- with(strep_tb, gender == "F" & arm == "Streptomycin")
strep_tb$txt[to_change] <- "F Strepto"
# strep_tb$x[to_change] <- strep_tb$x[to_change] * 2
rm(to_change) # clean up

hmmm… tidyverse syntax is much nicer!?
stringr

The stringr package contains functions specifically for working with strings.
Most functions start with str_.
Changing case
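For example:

```r
str_to_upper("Streptomycin")   # "STREPTOMYCIN"
str_to_lower("Streptomycin")   # "streptomycin"
str_to_title("hello, world!")  # "Hello, World!"
```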
stringr

Remove white space
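For example:

```r
str_trim("  some text  ")           # "some text"
str_squish("too   many   spaces")   # "too many spaces"
```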
Substrings
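For example:

```r
str_sub("Streptomycin", 1, 6)  # "Strept"
```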
stringr

Detect a substring
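For example (using the baseline condition labels from strep_tb):

```r
str_detect(c("1_Good", "2_Fair", "3_Poor"), "Good")  # TRUE FALSE FALSE
```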
Using stringr within mutate
strep_tb |>
mutate(txt = as.character(baseline_condition),
upper = str_to_upper(txt),
good = str_detect(txt, "Good"),
no_number = str_replace(txt, "^[[:digit:]]_", "")) |>
select(txt:no_number) |> unique()

# A tibble: 3 × 4
txt upper good no_number
<chr> <chr> <lgl> <chr>
1 1_Good 1_GOOD TRUE Good
2 2_Fair 2_FAIR FALSE Fair
3 3_Poor 3_POOR FALSE Poor
forcats

The forcats package contains functions specifically for working with factors.
Functions (almost) all begin with fct_.
[1] 1_Good 2_Fair 3_Poor
Levels: 1_Good 2_Fair 3_Poor
forcats

Change level names
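A reconstruction consistent with the outputs above and below (the original code was not preserved), using fct_recode():

```r
x <- fct_unique(strep_tb$baseline_condition)  # the three levels, as shown above
fct_recode(x, Good = "1_Good", Fair = "2_Fair", Poor = "3_Poor")
```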
[1] Good Fair Poor
Levels: Good Fair Poor
This is also possible by regular expression (regex)
forcats

Within mutate
strep_tb |>
mutate(baseline_condition_new = fct_relabel(baseline_condition,
str_replace,
pattern = "^[[:digit:]]_", # "any leading digit followed by underscore"
replacement = "")) |>
select(baseline_condition, baseline_condition_new) |>
str()

tibble [107 × 2] (S3: tbl_df/tbl/data.frame)
$ baseline_condition : Factor w/ 3 levels "1_Good","2_Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
$ baseline_condition_new: Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
Do it to all factors
strep_tb |>
mutate(across(where(is.factor),
~ fct_relabel(.x,
str_replace,
pattern = "^[[:digit:]]_",
replacement = ""))) |>
select(where(is.factor)) |>
str()

tibble [107 × 8] (S3: tbl_df/tbl/data.frame)
$ arm : Factor w/ 2 levels "Streptomycin",..: 2 2 2 2 2 2 2 2 2 2 ...
$ gender : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 1 2 1 2 ...
$ baseline_condition : Factor w/ 3 levels "Good","Fair",..: 1 1 1 1 1 1 1 1 2 2 ...
$ baseline_temp : Factor w/ 4 levels "98-98.9F","99-99.9F",..: 1 3 1 1 2 3 2 2 2 4 ...
$ baseline_esr : Factor w/ 4 levels "0-10","11-20",..: 2 2 3 3 3 3 3 3 3 3 ...
$ baseline_cavitation: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 2 2 2 2 ...
$ strep_resistance : Factor w/ 3 levels "sens_0-8","mod_8-99",..: 1 1 1 1 1 1 1 1 1 1 ...
$ radiologic_6m : Factor w/ 6 levels "Considerable_improvement",..: 1 2 2 2 2 1 2 2 2 2 ...
lubridate

lubridate provides a comprehensive set of functions for working with dates and date-times
Dates come in many formats, lubridate handles them easily
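Illustrative parsing calls (the original examples were not preserved; these are consistent with the kind of output shown below):

```r
library(lubridate)
ymd("2023-01-20")           # "2023-01-20"
ymd_hm("2023-01-20 10:15")  # "2023-01-20 10:15:00 UTC"
dmy("20.01.2023")           # "2023-01-20"
mdy("January 20, 2023")     # "2023-01-20"
```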
[1] "2023-01-20"
[1] "2023-01-20 10:15:00 UTC"
[1] "2023-01-20"
[1] "2023-01-20"
[1] NA
[1] "2023-01-20"
[1] "2023-01-20"
base

In base R, it’s not so easy… (see ?strptime for details)
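Illustrative base-R equivalents (the exact calls on the original slides are unknown):

```r
as.Date("2023-01-20")                       # ISO format works directly
as.Date("20.01.2023", format = "%d.%m.%Y")  # other formats need a format string
as.POSIXct("2023-01-20 10:15:00")           # time zone is taken from the system
```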
[1] "2023-01-20"
[1] "2023-01-20"
[1] "2023-01-20"
[1] "2023-01-20 10:15:00 CET"
Very specific to system settings (language)…
Unless you have e.g. “2023-01-20 10:15”, stick with lubridate…
lubridate

We can do maths with date(-time)s…
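For example (consistent with the "Time difference of 31 days" output below):

```r
ymd("2023-02-20") - ymd("2023-01-20")  # Time difference of 31 days
```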
Time difference of 31 days
It’s normally worth converting it to a number…
Add a certain number of months
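For example:

```r
ymd("2023-01-20") + months(3)      # "2023-04-20"
# %m+% avoids invalid dates such as 31 April:
ymd("2023-01-31") %m+% months(3)   # "2023-04-30"
```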
lubridate

Extracting components of the date(-time)
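For example (the exact date object is an assumption, chosen to match the output below):

```r
d <- ymd("2023-01-15")
year(d)   # 2023
month(d)  # 1
day(d)    # 15
```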
[1] 2023
[1] 1
[1] 15
With base R:
Check the cheat sheet for lots more of lubridate's capabilities
Date[1:1], format: "2023-01-20"
POSIXct[1:1], format: "2023-01-20 10:15:00"
Stored internally as numbers! This allows the maths operations to work
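For example (matching the numbers below):

```r
as.numeric(ymd("2023-01-20"))           # 19377: days since the origin
as.numeric(ymd_hm("2023-01-20 10:15"))  # 1674209700: seconds since the origin
```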
[1] 19377
[1] 1674209700
What do those numbers mean?
They’re days and seconds since an origin…
When is that origin (timepoint 0)?
[1] "1970-01-01"
[1] "1970-01-01 UTC"
Days since 1st January 1970
Seconds since 1st January 1970
Using the insurance data you loaded earlier…
- sex, and region
- date variable

tibble [1,338 × 12] (S3: tbl_df/tbl/data.frame)
$ X : num [1:1338] 1 2 3 4 5 6 7 8 9 10 ...
$ age : num [1:1338] 59 24 28 22 60 38 51 44 47 29 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 2 2 ...
$ bmi : num [1:1338] 31.8 22.6 25.9 25.2 36 ...
$ children : num [1:1338] 2 0 1 0 0 3 0 0 1 2 ...
$ smoker : chr [1:1338] "no" "no" "no" "no" ...
$ region : Factor w/ 4 levels "northeast","northwest",..: 3 4 2 2 1 4 4 2 4 2 ...
$ charges : num [1:1338] 13086 2574 4411 2321 13435 ...
$ date : Date[1:1338], format: "2001-01-15" "2001-01-17" ...
$ gt2_children: logi [1:1338] FALSE FALSE FALSE FALSE FALSE TRUE ...
$ smokes : logi [1:1338] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ date_6m : Date[1:1338], format: "2001-07-15" "2001-07-17" ...
Sometimes it’s necessary to pivot data. For example, all observations from an individual may sit on a single row, but for our analysis we need them in a single variable (one row per observation). pivot_longer is the tool for the task.
The opposite, observations on rows to observations in columns, is pivot_wider
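A toy illustration of both directions (the data frame is invented for this sketch):

```r
library(tidyr)
library(tibble)

# wide: one row per individual, one column per visit
wide <- tibble(id         = c(1, 2),
               bmi_visit1 = c(24.1, 30.2),
               bmi_visit2 = c(23.8, 29.5))

# wide -> long: all BMI values end up in a single variable
long <- wide |>
  pivot_longer(starts_with("bmi"), names_to = "visit", values_to = "bmi")

# long -> wide again
long |>
  pivot_wider(names_from = visit, values_from = bmi)
```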
In base R, both these scenarios are handled by reshape, whose syntax and documentation are confusing… which is why they are two separate functions in tidyr
In base R, use merge and combinations of all.x and all.y to specify the different join types, and by.x and by.y to specify the variables to join on
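For comparison, a sketch of a join in both dialects (the object name insurance and the reuse of the look-up table from earlier are assumptions):

```r
# tidyverse
insurance |> left_join(region_lookup, by = "region")

# base R: all.x = TRUE keeps all rows of the left table (a "left join")
merge(insurance, region_lookup, by = "region", all.x = TRUE)
```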
At some point, you will have to create summary data. dplyr can help with that too. summarize is the appropriate function.
strep_tb |>
summarize(n = n(),
min = min(rad_num),
median = median(rad_num),
mean = mean(rad_num),
max = max(rad_num),
)

# A tibble: 1 × 5
n min median mean max
<int> <dbl> <dbl> <dbl> <dbl>
1 107 1 5 3.93 6
Remember across? It comes in useful here too
strep_tb |>
summarize(n = n(),
across(c(rad_num, dose_strep_g),
list(min = ~ min(.x, na.rm = TRUE),
mean = mean,
median = median,
max = max),
.names = "{.col}_{.fn}"),
)

# A tibble: 1 × 9
n rad_num_min rad_num_mean rad_num_median rad_num_max dose_strep_g_min
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 107 1 3.93 5 6 0
# ℹ 3 more variables: dose_strep_g_mean <dbl>, dose_strep_g_median <dbl>,
# dose_strep_g_max <dbl>
What about grouped summaries? Two options… group_by()…
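A sketch with group_by() (the exact call on the original slide is not preserved; the summary mirrors the one above):

```r
strep_tb |>
  group_by(arm) |>
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num))
```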
… or .by (new syntax)
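A sketch consistent with the output below (assuming dplyr >= 1.1.0 for the .by argument):

```r
strep_tb |>
  summarize(n = n(),
            min = min(rad_num),
            median = median(rad_num),
            mean = mean(rad_num),
            max = max(rad_num),
            .by = arm)
```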
# A tibble: 2 × 6
arm n min median mean max
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Control 52 1 3 3.13 6
2 Streptomycin 55 1 6 4.67 6
Variable names tend to be short - less typing, less chance of making a typo. But they’re not very useful for tables… We can add labels to variables, which various packages know how to use
Retrieve the labels again
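A sketch using the label attribute convention (which gtsummary picks up; the labelled package offers helpers for this):

```r
# set a label
attr(strep_tb$rad_num, "label") <- "Radiologic response"

# retrieve the label again
attr(strep_tb$rad_num, "label")
# or, with the labelled package:
# labelled::var_label(strep_tb$rad_num)
```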
Options:
gtsummary is my package of choice
Results are very customisable (see the help files). The package also provides support for model output.
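A hedged sketch of a call that could produce a table like the one below (the variable selection and label options are assumptions):

```r
library(gtsummary)

strep_tb |>
  select(arm, dose_strep_g, baseline_temp, rad_num, improved) |>
  tbl_summary(by = arm,
              label = list(dose_strep_g ~ "Dose of Streptomycin",
                           baseline_temp ~ "Temp. at baseline",
                           rad_num ~ "Radiologic response",
                           improved ~ "Improvement in radiologic response")) |>
  add_overall()
```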
| Characteristic | Overall, N = 107¹ | Streptomycin, N = 55¹ | Control, N = 52¹ |
|---|---|---|---|
| Dose of Streptomycin | |||
| 0 | 52 (49%) | 0 (0%) | 52 (100%) |
| 2 | 55 (51%) | 55 (100%) | 0 (0%) |
| Temp. at baseline | |||
| 1_98-98.9F | 7 (6.5%) | 3 (5.5%) | 4 (7.7%) |
| 2_99-99.9F | 25 (23%) | 13 (24%) | 12 (23%) |
| 3_100-100.9F | 32 (30%) | 15 (27%) | 17 (33%) |
| 4_100F+ | 43 (40%) | 24 (44%) | 19 (37%) |
| Radiologic response | 5.00 (2.00, 6.00) | 6.00 (3.00, 6.00) | 3.00 (1.00, 5.00) |
| Improvement in radiologic response | 55 (51%) | 38 (69%) | 17 (33%) |
| ¹ n (%); Median (IQR) |
Public Health Sciences Course Program - Basic Statistics and Projects in R. Slides available on GitHub.