Reproducible workflows in R

King’s Open Research Summer School

Dr Ewan Carr

Department of Biostatistics & Health Informatics
King’s College London

July 25, 2025

🖼️

  • My aim is to surface key tools and habits; you’ll need to explore further on your own.

  • Reproducibility is necessary for open science, but not sufficient. Broader change in motivations, incentives, and culture is essential.

  • Reset academic publishing models.
  • Reward high quality team science regardless of null findings.
  • Stop chasing small, noisy effects in tiny samples.

“Research can be open and reproducible and still completely and obviously wrong.”

Richard McElreath, Which Kind of Science Reform

This talk

  1. Code and data hygiene
  2. Version control with Git
  3. Managing your environment
  4. Workflow automation
  5. Wrap-up

🫣

Not doing horrible things with your data or code

🧼 Data hygiene

  1. Keep raw and processed data separate.
  2. Never edit raw data; automate instead.
    • Avoid final_final_v3.csv
  3. Set raw data as read-only.
  4. Version your data.
    • e.g., data/raw/YYYY-MM-DD
  5. Back up regularly; test your backups.
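
A minimal sketch of points 1 to 4 in base R, assuming a Unix-like system (file permissions behave differently on Windows). The project directory, file name, and data are made up for illustration; only the data/raw/YYYY-MM-DD pattern comes from the list above.

```r
# Hypothetical project in a temporary directory, so nothing real is touched.
project <- tempfile("hygiene-demo-")

# Version raw data by delivery date: data/raw/YYYY-MM-DD.
raw_dir <- file.path(project, "data", "raw", format(Sys.Date(), "%Y-%m-%d"))
dir.create(raw_dir, recursive = TRUE)

# Stand-in for a raw file received from a collaborator (made-up contents).
raw_file <- file.path(raw_dir, "messy-data.csv")
write.csv(data.frame(id = 1:3, score = c(10, NA, 7)), raw_file, row.names = FALSE)

# Mark the raw file read-only (r--r--r--) so it cannot be edited by accident.
Sys.chmod(raw_file, mode = "0444", use_umask = FALSE)

# All cleaning happens in code, and the result goes to a separate folder.
clean_dir <- file.path(project, "data", "clean")
dir.create(clean_dir, recursive = TRUE)
raw <- read.csv(raw_file)
clean <- raw[!is.na(raw$score), ]
write.csv(clean, file.path(clean_dir, "clean-data.csv"), row.names = FALSE)
```

Because the raw file is never modified, re-running the cleaning code always starts from the same input.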

πŸ—‚οΈ Consistent project layout

β”œβ”€β”€ data
β”‚   β”œβ”€β”€ raw
β”‚   └── clean
β”œβ”€β”€ 1-cleaning
β”œβ”€β”€ 2-analysis
β”‚   β”œβ”€β”€ a-descriptives
β”‚   β”œβ”€β”€ b-models
β”‚   └── c-processing
β”œβ”€β”€ 3-figures
β”œβ”€β”€ 4-writing
└── README.txt
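
One way to scaffold this layout is with a few lines of base R. This is only a sketch: it builds the folders in a temporary directory (an assumption, so it is safe to run anywhere), and the README text is a placeholder.

```r
# Scaffold the layout in a temporary directory (swap in your own project path).
project <- tempfile("my-project-")

folders <- c(
  file.path("data", "raw"),
  file.path("data", "clean"),
  "1-cleaning",
  file.path("2-analysis", "a-descriptives"),
  file.path("2-analysis", "b-models"),
  file.path("2-analysis", "c-processing"),
  "3-figures",
  "4-writing"
)

for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE)
}

# A placeholder README; describe the project and folder structure here.
writeLines("Project title", file.path(project, "README.txt"))

# List what was created, relative to the project root.
list.dirs(project, full.names = FALSE, recursive = TRUE)
```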

Use relative file paths

setwd() is banned

🧨 Absolute file paths are brittle.


🎒 Relative file paths are portable.

Use here for relative paths

Set a project root (i.e., a top-level directory) from:

  • An RStudio project
  • A Git repository
  • Manually (with a .here file)

Then, construct relative paths with here():

library(here)
library(readr)  # provides read_csv()

data <- read_csv(
  here("data", "raw", "messy-data.csv")
)

This works well with RStudio projects.

🧹 Write clean, organised code

  • Write modular code; break tasks into functions or scripts (e.g., separate scripts for cleaning, analysis, and visualisation).

  • Follow a style guide and use a code formatter.

  • Document with inline comments and README.md files.

  • Use consistent file and variable naming conventions (e.g., snake_case).
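
As a rough illustration of these habits, here is a small cleaning step written as a documented function with snake_case names. The function, column names, and data are hypothetical, not taken from any real project.

```r
#' Drop rows with missing scores and standardise the remaining scores.
#'
#' @param survey_data A data frame with a numeric `score` column.
#' @return A data frame of complete rows with an added `score_z` column.
clean_survey_data <- function(survey_data) {
  stopifnot(is.data.frame(survey_data), "score" %in% names(survey_data))

  complete_rows <- survey_data[!is.na(survey_data$score), , drop = FALSE]
  complete_rows$score_z <- as.numeric(scale(complete_rows$score))
  complete_rows
}

# Made-up example data.
survey_data <- data.frame(id = 1:4, score = c(10, NA, 14, 12))
clean_data <- clean_survey_data(survey_data)
```

Keeping each step in a function like this makes it easier to test, reuse, and call from a separate analysis script.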

The Air code formatter


Version control

Use Git.

How does it work?

Initialise the repository:

git init
git add README.md cleaning.R
git commit -m "Initial commit"


Then on GitHub, create a new repository and connect the remote repository to your local one:

git remote add origin \
  https://github.com/your-username/your-repo.git

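Push your changes to GitHub:

git branch -M main        # Name the current branch 'main'
git push -u origin main   # -u sets the upstream, so a bare 'git push' works later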


Then, repeat:

git add .                              # Add recent changes
git commit -m "Describe your changes"  # ‘Commit’ them to the local repository
git push                               # ‘Push’ them to GitHub

Getting started

  1. Create an account on GitHub.com

  2. Download GitHub Desktop

  3. Add some files. Press buttons, break things.

KEEP BACKUPS

Include a README.md file

A good README.md helps others (and your future self) understand your project.

  • What and who is the project for?
  • Describe the folder structure and key files.
  • Step-by-step instructions, including software dependencies and steps to reproduce.

What to learn next?

  • 📄 Using .gitignore effectively
  • ✍️ Writing good commit messages
  • 🌿 Branching and pull requests
  • 🤝 Collaborating in real-world projects
  • ⚔️ Handling merge conflicts
  • 🔄 Continuous integration (e.g., GitHub Actions)
  • 🧪 Automated testing

📦

Environment control

Your entire environment should be easily reproducible

We need a way of capturing the state of your computing environment, such that you or others can recreate it later.

  • Software (e.g., R, Python, command line tools)
  • R packages
  • Operating system

How?

  1. Report session information in your scripts:

    sessionInfo()
    sessioninfo::session_info()
  2. Set a seed:

    set.seed(42)
  3. Use a dated CRAN repository:

    options(repos = c(
      CRAN = "https://packagemanager.posit.co/cran/2025-07-25"
    ))
  4. Use pak or renv to control package versions.
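
Steps 1 and 2 can be sketched in base R as follows. The output file location is an assumption (a temporary file here; in a project it might sit next to your results).

```r
# Record the environment alongside the results.
session_file <- tempfile("session-info-", fileext = ".txt")
writeLines(capture.output(sessionInfo()), session_file)

# With the same seed, the same "random" draws come back every time.
set.seed(42)
first_run <- rnorm(5)

set.seed(42)
second_run <- rnorm(5)

identical(first_run, second_run)
```

A seed makes results repeatable only under the same R version and random number generator settings (for example, the default behaviour of sample() changed in R 3.6.0), which is one reason the session information and package versions matter too.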

Use a dated CRAN repository

renv

renv is an R package to manage and reproduce the exact package versions used in a project.

Setting up renv

First, install renv (once):

install.packages("renv")

Then, from the top-level directory, initialise the project:

renv::init()

Install the required packages:

renv::install()             # All required packages
renv::install("tidyverse")  # A specific package

Save the current state:

renv::snapshot()


Then share renv.lock with collaborators (via Git).

Restoring from a renv lockfile

When re-initialising a project (e.g., on a new computer, or as a collaborator):

renv::restore()
  • Compares the lockfile to the current project library.
  • Installs any missing or mismatched packages.

R packages are just one part of
your computing environment.

Containers

Containers package your entire computing environment so it runs consistently everywhere.

This typically involves Docker or Singularity.

[Figure labels: Your computer, Docker, A container]

Where do containers come from?

  1. Build them yourself.
  2. Download a container made by someone else.

Docker example

🤖

Workflow automation

Level 1: source()

Create a script that runs your other scripts:

run.R
source("01-cleaning.R")
source("02-analysis.R")
source("03-plots.R")
  • A good start that supports basic automation.
  • Simplistic; runs everything unconditionally, no awareness of dependencies, scales poorly.
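
A slightly more defensive run.R might look like the sketch below. It still runs everything unconditionally, but it stops at the first failing step and reports which one. The stand-in scripts and the run_pipeline() helper are hypothetical, created here only so the example runs on its own.

```r
# Stand-in scripts in a temporary directory; in a real project these exist already.
project <- tempfile("run-demo-")
dir.create(project)
scripts <- c("01-cleaning.R", "02-analysis.R", "03-plots.R")
for (script in scripts) {
  writeLines(sprintf('message("Running %s")', script), file.path(project, script))
}

# Run each step in order, each in a fresh environment, stopping at the first error.
run_pipeline <- function(script_paths) {
  for (path in script_paths) {
    started <- Sys.time()
    tryCatch(
      source(path, local = new.env(parent = globalenv())),
      error = function(e) {
        stop("Step failed: ", basename(path), "\n", conditionMessage(e), call. = FALSE)
      }
    )
    elapsed <- round(as.numeric(difftime(Sys.time(), started, units = "secs")), 2)
    message(basename(path), " finished in ", elapsed, "s")
  }
  invisible(script_paths)
}

completed <- run_pipeline(file.path(project, scripts))
```

Sourcing each script into its own environment keeps objects from one step from leaking silently into the next, so steps have to communicate through files.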

Level 2: Make

Make is a tool that runs only the parts of your code that need updating, based on what’s changed.

Makefile
# Target clean.csv depends on raw.csv and cleaning.R.
# If either changes, the recipe is re-run.
clean.csv: raw.csv cleaning.R
	Rscript cleaning.R

# Target plot.png depends on clean.csv and analysis.R.
plot.png: clean.csv analysis.R
	Rscript analysis.R

Recipe lines must be indented with a tab, not spaces.

Once you’ve defined your Makefile, you can then run:

make plot.png

to re-run all necessary preceding steps.
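
The rule Make applies can be illustrated in a few lines of base R: a target needs rebuilding if it is missing or older than any of its dependencies. This is a simplified sketch of the idea, not how Make is implemented; is_outdated() and the files are made up for illustration.

```r
# TRUE if the target is missing or older than any of its dependencies.
is_outdated <- function(target, dependencies) {
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(dependencies) > file.mtime(target))
}

# Made-up files in a temporary directory, with explicit timestamps.
project <- tempfile("make-demo-")
dir.create(project)
raw <- file.path(project, "raw.csv")
script <- file.path(project, "cleaning.R")
clean <- file.path(project, "clean.csv")
file.create(raw, script, clean)

now <- Sys.time()
Sys.setFileTime(raw, now - 60)     # dependency: a minute old
Sys.setFileTime(script, now - 60)  # dependency: a minute old
Sys.setFileTime(clean, now)        # target: just built

up_to_date <- !is_outdated(clean, c(raw, script))

# Editing cleaning.R after the build makes clean.csv stale again.
Sys.setFileTime(script, now + 60)
needs_rebuild <- is_outdated(clean, c(raw, script))
```

Make performs this check for every target in the chain, which is why asking for plot.png also rebuilds clean.csv when needed.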


Level 3: targets

  • targets is an R package for building reproducible workflows by tracking and running R functions instead of files.

  • Like Make, but it works at the level of R objects and code, not just scripts and outputs.

Use Quarto to automate reporting

Wrapping up



Phew, that was a lot.

  • 🌱 Start small, build incrementally.
  • 🤝 Talk to colleagues about how they organise their code; establish shared practices.
  • 🌀 Use version control early; it saves time (and headaches) later.
  • 🧠 Tools help, but culture and communication are key.

If you read one thing…

Comprehensibility of research

This is not transparency nor openness. What I mean is research has sufficient documentation and justification to reduce error and empower others to make up their own minds about its value. Research should be intelligible. Access is not sufficient. Research can be replicable without being reasonable or correct. Materials and data can be open without being intelligible, and they can be partly closed while still being comprehensible.

Thank you for listening.

Slides and practical materials