Reproducibility and GitHub
- `dplyr`
- Pipes: `%>%`, `|>`
- `ggplot2`: data are mapped to aesthetic properties of geometric objects
- `patchwork`, `cowplot`

Don’t forget about the online resources!
Time | Duration | Topic | Content |
---|---|---|---|
09:00-10:30 | 90 min | Reproducible documents | Markdown, Quarto |
10:30-11:00 | 30 min | Coffee break | Coffee, sun, fresh air |
11:00-12:00 | 60 min | Version control and collaboration | Git/GitHub |
12:00-12:30 | 30 min | Websites | GitHub Pages |
Open Science is the conduct of science in such a way that others can collaborate and contribute, where research data, laboratory notes and other research processes are freely available, with licence terms that allow re-use, redistribution and reproduction of the research. (FOSTER)
Also see the Open Science resources at the University of Bern.
Wanna implement these practices in your research group? Read Ten simple rules for implementing open and reproducible research practices after attending a training course.
In between repeatability and reproducibility, there is also ‘runnability’ (same researcher, new machine).
- RStudio project file (`.Rproj`).
- Numbered scripts (e.g., `02_analysis.R`).
- A main script (`00_main.R`) from which you can `source` other scripts.
- Separate `raw` and `processed` data. Never, ever change your raw data!
- A README file (`README.md`) to provide additional information.

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.
John Gruber (long-time Mac aficionados may know him from his blog Daring Fireball) created the Markdown language in 2004, with Aaron Swartz (creator of `atx`) acting as a beta tester. Gruber had the goal of enabling people “to write using an easy-to-read and easy-to-write plain text format, optionally convert it to structurally valid XHTML (or HTML).”
Nowadays, there are many different flavors of Markdown, e.g., GitHub Flavored Markdown. On GitHub, Markdown is the primary choice for communicating important information about your project/repository.
Markdown Syntax | Output |
---|---|
`# Header 1` | Header 1 |
`## Header 2` | Header 2 |
`### Header 3` | Header 3 |
`#### Header 4` | Header 4 |
`##### Header 5` | Header 5 |
`###### Header 6` | Header 6 |
Markdown Syntax | Output |
---|---|
`*italics* and **bold**` | *italics* and **bold** |
`superscript^2^ / subscript~2~` | superscript² / subscript₂ |
`` `verbatim code` `` | `verbatim code` |
You can use different types of (hyper)links.

You can embed named hyperlinks, direct URLs like https://quarto.org/, and links to other places in the document. The syntax is similar for embedding an inline image.
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
| 12 | 12 | 12 | 12 |
| 123 | 123 | 123 | 123 |
| 1 | 1 | 1 | 1 |
Right | Left | Default | Center |
---|---|---|---|
12 | 12 | 12 | 12 |
123 | 123 | 123 | 123 |
1 | 1 | 1 | 1 |
> Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. - Donald Knuth on Literate Programming^[[Knuth (1984, Comput J)](https://doi.org/10.1093/comjnl/27.2.97)]
R Markdown has been around for roughly a decade and was fundamentally built for R. That’s why RStudio (now Posit) developed Quarto, a next-generation, R Markdown-like open-source scientific and technical publishing system built on Pandoc.
Quarto is as friendly to Python, Julia, Observable JavaScript, and Jupyter notebooks as it is to R. It’s not a language-specific library, but an external software application.
If you start creating reproducible documents, just use Quarto.
---
title: "ggplot2 demo"
author: "Norah Jones"
date: "5/22/2021"
format:
  html:
    fig-width: 8
    fig-height: 4
    code-fold: true
---
## Air Quality
@fig-airquality further explores the impact of temperature
on ozone level.
```{r}
#| label: fig-airquality
#| fig-cap: Temperature and ozone level.
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) +
  geom_point() +
  geom_smooth(method = "loess")
```
These are just some examples. There are a lot more options.
Option | Description |
---|---|
`fig-height: 4` | Plots generated from this chunk will have a height of 4 inches. |
`fig-width: 6` | Plots generated from this chunk will have a width of 6 inches. |
`dpi: 150` | Plots generated will have a resolution (dots per inch) of 150. |
`echo: false` | Code will not be echoed (i.e., not shown). |
`eval: false` | Nothing will be evaluated, but the code will still be printed. |
`cache: true` | Results will be cached, and the chunk will not be run in subsequent renders unless the code is changed. |
`message: false` | No messages will be printed. |
`warning: false` | No warnings will be printed. |
`include: false` | No outputs/echo/messages/etc. will be returned. |
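As a minimal sketch of how several of these options combine in a single chunk: the label `fig-ozone-hist`, the caption, and the variable names are made up for illustration, and in a `.qmd` file the code would sit inside a `{r}` chunk.

```r
#| label: fig-ozone-hist
#| fig-cap: Distribution of ozone levels.
#| fig-width: 6
#| fig-height: 4
#| echo: false
#| warning: false

# Hypothetical example chunk: the `#|` lines above are chunk options.
# Outside of Quarto, R treats them as ordinary comments.
ozone <- airquality$Ozone        # built-in data set; contains missing values
n_missing <- sum(is.na(ozone))   # number of missing ozone measurements

# hist() silently drops missing values
hist(ozone, main = NULL, xlab = "Ozone (ppb)")
```

With `echo: false`, only the figure and its caption would appear in the rendered report, not the code.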
Let’s create a reproducible HTML report!
Now, it’s your turn!
`.qmd` file.

Sometimes, your analysis might depend on specific versions of R (rarely) or packages (more frequently). Regular updates of packages may break existing code. To solve this potential problem, you can use tools to ensure that you (or your collaborators, including you in 5 years’ time) can reproduce your environment and rerun your analysis:

- `renv` produces a lockfile with all information required to replicate the environment.
- `groundhog` installs the version of a package that existed at a particular point in time. It is slightly more lightweight than `renv`.

Finally, you can add `sessionInfo()` to your reports, which lists the versions of R and packages used.
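A minimal sketch of what recording the environment could look like at the end of a report. The variable names are illustrative, and the `renv` calls are shown only as comments because they assume that `renv` is installed and would modify your project.

```r
# Record the computational environment at the end of a report.
info <- sessionInfo()

# R version string, e.g. for a footnote in the report
r_version <- R.version.string

# Version of a single package; "stats" ships with R, so this always works
stats_version <- as.character(packageVersion("stats"))

print(info)

# With renv (assumed to be installed), a typical workflow is:
# renv::init()      # set up a project-specific library and lockfile
# renv::snapshot()  # record the package versions currently in use
# renv::restore()   # reinstall the recorded versions on another machine
```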
file.R
file1.R
file1-final.R
file1-final2.R
file1-final2-2023-02.R
file1-final2-2023-02_final.R
...
Version control systems allow you to retain a single file, while also preserving the history of the file.
Git and SVN are (probably) the best known. We will be working with Git and GitHub.
Git is a version control system.
It’s not always trivial to use, but Happy Git and GitHub for the useR is a super resource especially for R people.
Illustrations from the Openscapes blog GitHub for supporting, contributing, and failing safely by Allison Horst and Julia Lowndes
GitHub is a website that integrates Git and serves as an online version of your repository. It can also host websites, check R packages, etc.
Artwork by @allison_horst
`git` is a command line tool: you type commands into the command prompt.
RStudio offers a way to work with Git (or SVN) repositories.
It can do the most important things, but not everything (terminal use resolves this though).
GitHub also offers a nice desktop app (offering easier authentication).
The following slides assume a GitHub-first approach. This would be my recommended approach.
1. Install Git (e.g., from https://gitforwindows.org/ or by installing Xcode on macOS) if you don’t yet have a Git installation on your system.
2. In R, install the `usethis` and `gitcreds` packages (`install.packages(c("usethis", "gitcreds"))`).
3. Type `usethis::create_github_token()` into the R console.
4. An internet browser window should open with a GitHub page. Accept the default settings and click “Generate token”. Leave the window open.
5. Back in R, type `gitcreds::gitcreds_set()` into the console.
6. Copy and paste the token from the GitHub window into the R console.

You should now be able to work with your GitHub account from within R.
Open files
Modify as necessary
Save changes
Create new files
…
Click the green button to move your committed content to GitHub.
It’s worth going to your repository on GitHub to check that it’s there, at least the first time.
You have your files, but there isn’t a related project on GitHub (yet)…
You will not have a Git pane in RStudio…
Initialize Git with `usethis::use_git()` (install the `usethis` R package if you don’t already have it). RStudio will/should restart and the Git pane should now be visible.
Commit your changes as before (Git pane, select the files to commit, write your message and click commit).
Connect the project to GitHub via `usethis::use_github()`.
This will create a GitHub repository in your account with the same name as the folder and push your commits to date to GitHub.
Make a Git repository out of the folder you’ve been using the last 2 days.

- Initialize Git (`usethis::use_git()`)
- Connect the project to GitHub (`usethis::use_github()`)

If you have time, try the GitHub-first approach. Make a new repository on GitHub based on https://github.com/ISPMBern/project-template (click “use this template” and “create a new repository”), then clone it to your computer, make a change, and get it back to GitHub.
Each collaborator
It is good practice to pull from the main repository occasionally (especially when it’s a very active repository).
RStudio doesn’t have a built-in way to do that though, nor does GitHub Desktop, so you have to use the command line…
Configure a “remote”… in the command line (e.g. the terminal tab in RStudio).
git remote add ispm https://github.com/ISPMBern/projects-in-R.git
git pull ispm main
If others are working on related things (e.g. they need the output from your parts), it might be useful to set a remote to their fork so that you can pull their changes and check that your work is still compatible.
Conflicts occur when two commits (modifications) change the same line(s) to different things.
Git does not know what to do about it, so it needs your help.
When you try to pull, Git will issue an error highlighting the conflict.
Go to the offending file and find the conflict. It’ll look something like this:
If you have questions, please
<<<<<<< HEAD
open an issue
=======
ask your question in IRC.
>>>>>>> branch-a
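Edit the file so that it is correct and delete the conflict markers. After a pull that merges, the top part (`HEAD`) is your local version and the bottom part is the incoming version from the remote; keep either one or some combination of the two.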
Commit the file(s) to complete the merge.
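Before committing, it can help to check that no conflict markers are left in the file. A minimal sketch in R, using the example above as input: `find_conflict_markers()` is a hypothetical helper written for this illustration, not part of Git or of any package.

```r
# Hypothetical helper: return the line numbers that start with a
# Git conflict marker (<<<<<<<, ======= or >>>>>>>).
find_conflict_markers <- function(lines) {
  grep("^(<{7}|={7}|>{7})", lines)
}

# The conflicted file from the example, one element per line
conflicted <- c(
  "If you have questions, please",
  "<<<<<<< HEAD",
  "open an issue",
  "=======",
  "ask your question in IRC.",
  ">>>>>>> branch-a"
)

markers <- find_conflict_markers(conflicted)

# Resolve by keeping one side (here: the part between HEAD and =======)
resolved <- conflicted[c(1, 3)]
remaining <- find_conflict_markers(resolved)
```

For a real file, the lines could be read with `readLines()` before calling the helper.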
GitHub Pages are public webpages hosted and published through GitHub. You can use GitHub Pages to share results from your research project, host a blog, or even share your résumé.
With the Quickstart from GitHub Pages, you can create a user site at yourusername.github.io. Let’s have a look!
Today

Choose a data set (`medicaldata` or tidytuesday, or something else). The analysis does not need to be anything sophisticated. Just find something interesting in the data and tell a story about it. It does need to be reproducible though (i.e., derived from `.qmd`)!
Final assessment
For the final assessment by the end of the course, you will extend your report with additional statistical analyses. Send the link of the final HTML report (don’t forget to make your repository public!) to christian.althaus@unibe.ch and ben.spycher@unibe.ch (Subject: Assessment BSPR course) by 16 June 2023.
Optional: Instead of sending us the link to the HTML file in your GitHub repository (e.g., https://github.com/calthaus/BSPR-exercises/blob/main/products/reports/report.html), it is better to send us the link to the HTML file on your GitHub Pages site (e.g., https://calthaus.github.io/BSPR-exercises/products/reports/report.html).
Public Health Sciences Course Program - Basic Statistics and Projects in R. Slides available on GitHub.