Answers to your questions

One script to run the analysis

Start with a very simple workflow:

  • Split the work into numbered stages
  • Keep related scripts together
  • Use one file to run the whole analysis

my-project/
├── data/
├── output/
├── 01-cleaning/
│   ├── 01-import-survey.R
│   ├── 02-clean-survey.R
│   └── 03-build-analysis-data.R
├── 02-analysis/
│   ├── 01-descriptives.R
│   ├── 02-subgroups.R
│   └── 03-sensitivity-checks.R
├── 03-plots/
│   ├── 01-figures.R
│   └── 02-tables.R
└── run-analysis.R

run-analysis.R

source("01-cleaning/01-import-survey.R")
source("01-cleaning/02-clean-survey.R")
source("01-cleaning/03-build-analysis-data.R")
source("02-analysis/01-descriptives.R")
source("02-analysis/02-subgroups.R")
source("02-analysis/03-sensitivity-checks.R")
source("03-plots/01-figures.R")
source("03-plots/02-tables.R")

Number the folders and scripts if the order matters.
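
Because the numbering already encodes the run order, the list of source() calls could also be generated rather than typed out. The following is only a sketch: run_stages() is a hypothetical helper written for this slide, not a function from any package, and it assumes every .R file in a stage folder should be run.

```r
# Sketch: source every .R file inside the given stage folders, in name order.
# run_stages() is a hypothetical helper, not part of any package.
run_stages <- function(stage_dirs, root = ".") {
  scripts <- list.files(
    file.path(root, stage_dirs),
    pattern = "\\.R$",
    full.names = TRUE
  )
  # With numbered folders and files, name order is the run order
  scripts <- sort(scripts)

  for (script in scripts) {
    message("Running ", script)
    source(script)
  }

  # Return the scripts that were run, without printing them
  invisible(scripts)
}

# Usage, from the project root:
# run_stages(c("01-cleaning", "02-analysis", "03-plots"))
```

Listing the scripts explicitly, as in run-analysis.R, is easier to read and lets you comment out a single step; the loop only pays off once there are many scripts.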

If the workflow gets larger, look at targets or Makefiles.

Using here() in run-analysis.R

library(here)

source(here("01-cleaning", "01-import-survey.R"))
source(here("01-cleaning", "02-clean-survey.R"))
source(here("01-cleaning", "03-build-analysis-data.R"))
source(here("02-analysis", "01-descriptives.R"))
source(here("02-analysis", "02-subgroups.R"))
source(here("02-analysis", "03-sensitivity-checks.R"))
source(here("03-plots", "01-figures.R"))
source(here("03-plots", "02-tables.R"))

here() helps you build paths from the project root, so the same script is less likely to break when you move between computers or open the project from a different working directory.
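
The same idea applies to data and output files inside the individual scripts. A minimal sketch, assuming the data/ and output/ folders from the project tree; the file names survey.csv and descriptives.csv are made up for illustration.

```r
library(here)

# Folder names come from the project tree; the file names are hypothetical.
raw_path <- here("data", "survey.csv")
out_path <- here("output", "descriptives.csv")

# Both are ordinary character paths, so they work with any read/write function:
# survey <- read.csv(raw_path)
# write.csv(results, out_path, row.names = FALSE)
```

Calling here() with no arguments prints the folder it treats as the project root, which is a quick check when a path does not resolve as expected.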

Load packages inside the script that uses them. For example, 02-analysis/01-descriptives.R starts with the library() calls needed for that analysis step.

tidylog: Stata-style feedback

tidylog prints short messages as you work through a pipeline.

library(tidyverse)
library(tidylog, warn.conflicts = FALSE)

patients |>
  filter(age >= 40) |>
  mutate(age_band = if_else(age < 65, "40-64", "65+")) |>
  left_join(clinics, by = "patient_id")

filter: removed 2 rows
mutate: new variable 'age_band'
left_join: added one column ('clinic')
           matched 3 of 4 rows

  • Handy when learning or debugging
  • Especially useful for joins and filters
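
To try the pipeline yourself, you need some data. The sketch below uses two small made-up tables; the values in patients and clinics are invented for illustration, and the exact wording of the messages depends on your tidylog version.

```r
library(dplyr)
library(tidylog, warn.conflicts = FALSE)

# Made-up example data: six patients, three of whom have a clinic record
patients <- tibble(
  patient_id = 1:6,
  age = c(34, 45, 52, 70, 38, 81)
)
clinics <- tibble(
  patient_id = c(2L, 3L, 4L),
  clinic = c("North", "South", "North")
)

# Each verb reports what it changed as the pipeline runs
result <- patients |>
  filter(age >= 40) |>
  mutate(age_band = if_else(age < 65, "40-64", "65+")) |>
  left_join(clinics, by = "patient_id")
```

The join message is the one to watch: an unmatched row here means a patient without a clinic, which would otherwise only show up later as a missing value.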

Regular expressions: when are they useful?

Use regex when you care about a pattern rather than one exact value.

  • Find IDs such as PT-0042
  • Check that a date looks like 2026-03-23
  • Pull a code out of messy text
  • Filter rows containing a pattern

Think of regex as a compact language for describing text patterns.
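
Two of the uses above, sketched with stringr on made-up strings (the vectors dates and note_text are invented for illustration). Note that the date pattern only checks the shape of the text: it would also accept an impossible date such as 2026-99-99.

```r
library(stringr)

# Made-up example strings
dates <- c("2026-03-23", "23/03/2026", "2026-3-5")

# ^ and $ anchor the pattern to the whole string:
# four digits, a hyphen, two digits, a hyphen, two digits
looks_like_date <- str_detect(dates, "^\\d{4}-\\d{2}-\\d{2}$")

note_text <- c("Seen PT-0042 today", "no ID recorded", "PT-0107 follow-up")

# Keep only the strings that contain an ID of the form PT-dddd
with_id <- str_subset(note_text, "PT-\\d{4}")
```

Only the first date matches, because the other two do not have the year-month-day shape with two-digit month and day.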

Regular expressions in stringr

notes |>
  mutate(
    has_id = str_detect(text, "PT-\\d{4}"),
    patient_id = str_extract(text, "PT-\\d{4}"),
    date = str_extract(text, "\\d{4}-\\d{2}-\\d{2}")
  )

  • str_detect() answers: “does this pattern appear?”
  • str_extract() pulls out the matching text

For more practice, see the regex practical or try patterns at regex101.com.

Stacked bar charts: the data shape

For a stacked bar chart, your data should usually be in long format.

plot_data <- tibble(
  condition = rep(c("Asthma", "COPD", "Diabetes"), each = 3),
  age_band = rep(c("18-39", "40-64", "65+"), times = 3),
  n = c(22, 36, 18, 10, 28, 31, 12, 24, 26)
)

  • condition defines each bar
  • age_band defines the stacked pieces
  • n gives the height of each piece

Stacked bar charts: what the data look like

condition  age_band   n
Asthma     18-39     22
Asthma     40-64     36
Asthma     65+       18
COPD       18-39     10
COPD       40-64     28
COPD       65+       31
Diabetes   18-39     12
Diabetes   40-64     24
Diabetes   65+       26
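
Counts often arrive in wide format instead, with one column per age band. A minimal sketch of reshaping them with tidyr's pivot_longer(); the wide_counts table is a hypothetical starting point holding the same numbers as above.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per condition, one column per age band
wide_counts <- tribble(
  ~condition, ~`18-39`, ~`40-64`, ~`65+`,
  "Asthma",         22,       36,     18,
  "COPD",           10,       28,     31,
  "Diabetes",       12,       24,     26
)

# Turn the age-band columns into two columns: the band name and its count
plot_data <- wide_counts |>
  pivot_longer(
    cols = -condition,
    names_to = "age_band",
    values_to = "n"
  )
```

The result has one row per condition and age band, which is the shape ggplot2 needs to map age_band to fill.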

Stacked bar charts: plotting

ggplot(plot_data,
       aes(x = condition,
           y = n,
           fill = age_band)) +
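  geom_col()

For proportions rather than counts, replace the last line with:

  geom_col(position = "fill")

Each bar is then rescaled to a total height of 1, so the pieces show each age band's share within a condition.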
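
If you also want the proportions as numbers, for a table or for labels, you can compute them yourself. This is a sketch of the equivalent calculation using the same plot_data; the column name prop is an arbitrary choice.

```r
library(dplyr)

plot_data <- tibble(
  condition = rep(c("Asthma", "COPD", "Diabetes"), each = 3),
  age_band = rep(c("18-39", "40-64", "65+"), times = 3),
  n = c(22, 36, 18, 10, 28, 31, 12, 24, 26)
)

# Share of each age band within its condition (each condition sums to 1)
props <- plot_data |>
  group_by(condition) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```

Plotting prop on the y axis with a plain geom_col() should give the same picture as position = "fill" on the counts.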

“Double” filtering in the same row

survey |>
  filter(previous_answer == "Yes",
         current_answer == "Yes")

  • Multiple conditions in filter() are combined with AND
  • This is the usual answer when you care about one row at a time
  • See the dplyr 1.2.0 blog post for newer filtering functionality
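
A small sketch comparing the comma form with the explicit & and | operators. The survey table here is made up, with one row for each combination of answers.

```r
library(dplyr)

# Made-up data: one row per combination of answers
survey <- tibble(
  id = 1:4,
  previous_answer = c("Yes", "Yes", "No", "No"),
  current_answer = c("Yes", "No", "Yes", "No")
)

# Comma-separated conditions: both must be TRUE
both_yes <- survey |>
  filter(previous_answer == "Yes", current_answer == "Yes")

# The same filter written with an explicit AND
same_with_and <- survey |>
  filter(previous_answer == "Yes" & current_answer == "Yes")

# OR needs the | operator inside a single condition
either_yes <- survey |>
  filter(previous_answer == "Yes" | current_answer == "Yes")
```

Only id 1 passes the AND filter, while ids 1 to 3 pass the OR filter.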

Filtering from the previous time point with lag()

responses |>
  arrange(participant, wave) |>
  group_by(participant) |>
  filter(lag(pain) == "Yes",
         pain == "Yes")

Sort first, then group by participant, so that lag() refers to the same participant's previous observation rather than to the last row of a different participant.
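
To see what lag() returns before filtering on it, it can help to store it in a column first. A sketch with made-up data for two participants over three waves; the column name prev_pain is an arbitrary choice.

```r
library(dplyr)

# Made-up data, deliberately not sorted by wave
responses <- tibble(
  participant = c(1, 1, 1, 2, 2, 2),
  wave = c(2, 1, 3, 1, 3, 2),
  pain = c("Yes", "No", "Yes", "Yes", "Yes", "No")
)

with_previous <- responses |>
  arrange(participant, wave) |>
  group_by(participant) |>
  mutate(prev_pain = lag(pain)) |>  # NA at each participant's first wave
  ungroup()

# Rows where pain was reported at this wave and at the previous one
persistent <- with_previous |>
  filter(prev_pain == "Yes", pain == "Yes")
```

Each participant's first wave has no previous value, so prev_pain is NA there and filter() drops that row. Note also that lag() looks at the previous row, not the previous wave number: if a participant skipped a wave, the comparison is with their last recorded wave.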