Answers to your questions

One script to run the analysis

Start with a very simple workflow:

Split the work into numbered stages
Keep related scripts together
Use one file to run the whole analysis

my-project/
├── data/
├── output/
├── 01-cleaning/
│   ├── 01-import-survey.R
│   ├── 02-clean-survey.R
│   └── 03-build-analysis-data.R
├── 02-analysis/
│   ├── 01-descriptives.R
│   ├── 02-subgroups.R
│   └── 03-sensitivity-checks.R
├── 03-plots/
│   ├── 01-figures.R
│   └── 02-tables.R
└── run-analysis.R

`run-analysis.R`

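# Run from the project root so the relative paths resolve.
source("01-cleaning/01-import-survey.R")
source("01-cleaning/02-clean-survey.R")
source("01-cleaning/03-build-analysis-data.R")
source("02-analysis/01-descriptives.R")
source("02-analysis/02-subgroups.R")
source("02-analysis/03-sensitivity-checks.R")
source("03-plots/01-figures.R")
source("03-plots/02-tables.R")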

Number the folders and scripts if the order matters.
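Because the numbering encodes the run order, the order can also be read from the file names instead of being typed out by hand. The following is a minimal sketch, not part of the original workflow: `run_numbered_scripts()` is a hypothetical helper name, and it assumes every stage folder and script starts with a two-digit prefix.

```r
# Hypothetical helper: source every numbered .R script inside every
# numbered stage folder, in name order.
run_numbered_scripts <- function(root = ".") {
  # Stage folders look like "01-cleaning", "02-analysis", ...
  stages <- list.files(root, pattern = "^[0-9]{2}-", full.names = TRUE)
  stages <- sort(stages[dir.exists(stages)])

  scripts <- character(0)
  for (stage in stages) {
    # Scripts look like "01-import-survey.R", "02-clean-survey.R", ...
    found <- list.files(stage, pattern = "^[0-9]{2}-.*\\.R$", full.names = TRUE)
    scripts <- c(scripts, sort(found))
  }

  for (script in scripts) {
    message("Running ", script)
    source(script)
  }

  # Return the scripts that were run, in the order they were run.
  invisible(scripts)
}
```

Calling `run_numbered_scripts()` from the project root would then run the stages in order. Listing each `source()` call explicitly, as above, is still easier to read and to comment out line by line while you are developing.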

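If the workflow gets larger, look at the targets package or a Makefile: both track dependencies between steps and rerun only the steps whose inputs have changed.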

Using `here()` in `run-analysis.R`

library(here)

source(here("01-cleaning", "01-import-survey.R"))
source(here("01-cleaning", "02-clean-survey.R"))
source(here("01-cleaning", "03-build-analysis-data.R"))
source(here("02-analysis", "01-descriptives.R"))
source(here("02-analysis", "02-subgroups.R"))
source(here("02-analysis", "03-sensitivity-checks.R"))
source(here("03-plots", "01-figures.R"))
source(here("03-plots", "02-tables.R"))

here() helps you build paths from the project root, so the same script is less likely to break when you move between computers or open the project from a different working directory.

Load packages inside the script that uses them. For example, 02-analysis/01-descriptives.R starts with the library() calls needed for that analysis step.

`tidylog`: Stata-style feedback

tidylog wraps dplyr and tidyr verbs so that each step of a pipeline prints a short message describing what it changed.

library(tidyverse)
library(tidylog, warn.conflicts = FALSE)

patients |>
  filter(age >= 40) |>
  mutate(age_band = if_else(age < 65, "40-64", "65+")) |>
  left_join(clinics, by = "patient_id")

filter: removed 2 rows
mutate: new variable 'age_band'
left_join: added one column ('clinic')
           matched 3 of 4 rows

Handy when learning or debugging
Especially useful for joins and filters
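To try this yourself, here is a minimal sketch with made-up `patients` and `clinics` tables. The values are invented and sized so that the filter drops two rows and the join matches three of the four remaining rows. The exact wording of the messages depends on your tidylog version.

```r
library(dplyr)
library(tidylog, warn.conflicts = FALSE)

# Made-up data: six patients, two of them younger than 40.
patients <- tibble(
  patient_id = 1:6,
  age = c(25, 38, 45, 52, 67, 71)
)

# Only three of the four patients aged 40+ have a clinic recorded.
clinics <- tibble(
  patient_id = c(3L, 4L, 5L),
  clinic = c("North", "South", "North")
)

result <- patients |>
  filter(age >= 40) |>                                   # drops patients 1 and 2
  mutate(age_band = if_else(age < 65, "40-64", "65+")) |>
  left_join(clinics, by = "patient_id")                  # patient 6 gets clinic = NA
```

The join message is the one worth reading closely: an unmatched row here means a patient with a missing `clinic`, which is easy to overlook without the feedback.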

Regular expressions: when are they useful?

Use regex when you care about a pattern rather than one exact value.

Find IDs such as PT-0042
Check that a date looks like 2026-03-23
Pull a code out of messy text
Filter rows containing a pattern

Think of regex as a compact language for describing text patterns.
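As a small illustration of the date check, the sketch below tests made-up strings against the YYYY-MM-DD shape with `str_detect()` from stringr. The anchors `^` and `$` are what turn "contains a date" into "is a date and nothing else".

```r
library(stringr)

# Made-up values to check against the YYYY-MM-DD shape.
dates <- c("2026-03-23", "23/03/2026", "2026-3-23", "seen on 2026-03-23")

# ^ and $ anchor the pattern to the start and end of the string,
# so the whole value has to match, not just part of it.
looks_like_date <- str_detect(dates, "^\\d{4}-\\d{2}-\\d{2}$")
looks_like_date
# TRUE FALSE FALSE FALSE

# Without anchors, the pattern is also found inside the longer string.
contains_date <- str_detect(dates, "\\d{4}-\\d{2}-\\d{2}")
contains_date
# TRUE FALSE FALSE TRUE
```

This only checks the shape of the text. A value such as "2026-99-99" would still pass, so parse the column as a date afterwards if you need real validation.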

Regular expressions in `stringr`

notes |>
  mutate(
    has_id = str_detect(text, "PT-\\d{4}"),
    patient_id = str_extract(text, "PT-\\d{4}"),
    date = str_extract(text, "\\d{4}-\\d{2}-\\d{2}")
  )

str_detect() answers: “does this pattern appear?”
str_extract() pulls out the matching text
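To see what the two functions return, here is the same pipeline on a made-up `notes` table. The three text values are invented for illustration.

```r
library(dplyr)
library(stringr)

# Made-up free-text notes: one with an ID and a date, one with neither,
# one with an ID only.
notes <- tibble(
  text = c(
    "Seen PT-0042 on 2026-03-23",
    "No identifier recorded",
    "Follow-up booked for PT-0107"
  )
)

parsed <- notes |>
  mutate(
    has_id = str_detect(text, "PT-\\d{4}"),
    patient_id = str_extract(text, "PT-\\d{4}"),
    date = str_extract(text, "\\d{4}-\\d{2}-\\d{2}")
  )

parsed$has_id      # TRUE FALSE TRUE
parsed$patient_id  # "PT-0042" NA "PT-0107"
parsed$date        # "2026-03-23" NA NA
```

Note that str_extract() returns NA when nothing matches, so missing IDs and dates stay visible as missing values rather than empty strings.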

For more practice, see the regex practical or try patterns at regex101.com.

Stacked bar charts: the data shape

For a stacked bar chart, your data should usually be in long format: one row for each bar segment.

plot_data <- tibble(
  condition = rep(c("Asthma", "COPD", "Diabetes"), each = 3),
  age_band = rep(c("18-39", "40-64", "65+"), times = 3),
  n = c(22, 36, 18, 10, 28, 31, 12, 24, 26)
)

condition defines each bar
age_band defines the stacked pieces
n gives the height of each piece
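If your counts start out in wide format, with one column per age band, `pivot_longer()` from tidyr can reshape them. The sketch below assumes a hypothetical `wide_counts` table holding the same numbers.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per condition, one column per age band.
wide_counts <- tibble(
  condition = c("Asthma", "COPD", "Diabetes"),
  `18-39` = c(22, 10, 12),
  `40-64` = c(36, 28, 24),
  `65+` = c(18, 31, 26)
)

# Move the age-band column names into age_band and their values into n.
plot_data <- wide_counts |>
  pivot_longer(
    cols = -condition,
    names_to = "age_band",
    values_to = "n"
  )
```

The result has nine rows, one per combination of condition and age band, which is the shape ggplot2 expects for the `fill` aesthetic.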

Stacked bar charts: what the data look like

condition	age_band	n
Asthma	18-39	22
Asthma	40-64	36
Asthma	65+	18
COPD	18-39	10
COPD	40-64	28
COPD	65+	31
Diabetes	18-39	12
Diabetes	40-64	24
Diabetes	65+	26

Stacked bar charts: plotting

ggplot(plot_data,
       aes(x = condition,
           y = n,
           fill = age_band)) +
  geom_col()

geom_col(position = "fill")

Use position = "fill" in place of the default geom_col() when you want proportions rather than counts: every bar is rescaled to a total height of 1.
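To make the rescaling concrete, the sketch below computes the proportions by hand and plots them. It is one possible approach, useful when you also need the proportions for labels or a table. The column name `prop` and the axis label are choices made for this illustration.

```r
library(dplyr)
library(ggplot2)

plot_data <- tibble(
  condition = rep(c("Asthma", "COPD", "Diabetes"), each = 3),
  age_band = rep(c("18-39", "40-64", "65+"), times = 3),
  n = c(22, 36, 18, 10, 28, 31, 12, 24, 26)
)

# Within each condition, divide each count by the condition total.
plot_props <- plot_data |>
  group_by(condition) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Plot the precomputed proportions and show the axis as percentages.
p <- ggplot(plot_props,
            aes(x = condition,
                y = prop,
                fill = age_band)) +
  geom_col() +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(y = "Proportion")
```

Printing `p` draws bars that all reach 100%, the same picture you get from counts with `position = "fill"`.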

“Double” filtering in the same row

survey |>
  filter(previous_answer == "Yes",
         current_answer == "Yes")

Multiple conditions in filter() are combined with AND
This is the usual answer when you care about one row at a time
For newer filtering functionality, see the recent dplyr blog post on dplyr 1.2.0
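A small made-up `survey` table shows how the comma, `&`, and `|` behave inside filter(). The `id` column is added here only to make the results easy to read.

```r
library(dplyr)

# Made-up data covering all four combinations of answers.
survey <- tibble(
  id = 1:4,
  previous_answer = c("Yes", "Yes", "No", "No"),
  current_answer = c("Yes", "No", "Yes", "No")
)

# A comma and & both mean AND: the row must pass every condition.
both_comma <- survey |>
  filter(previous_answer == "Yes", current_answer == "Yes")
both_amp <- survey |>
  filter(previous_answer == "Yes" & current_answer == "Yes")

# | means OR: the row must pass at least one condition.
either <- survey |>
  filter(previous_answer == "Yes" | current_answer == "Yes")

both_comma$id  # 1
both_amp$id    # 1
either$id      # 1 2 3
```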

Filtering from the previous time point with `lag()`

responses |>
  arrange(participant, wave) |>
  group_by(participant) |>
  filter(lag(pain) == "Yes",
         pain == "Yes")

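Sort first, then group by participant, then use lag() to refer to the previous observation in time. Grouping matters: without it, the first row for one participant would be compared with the last row of the participant before.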
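The sketch below runs the same idea on a made-up `responses` table, storing the lagged value in a helper column (`previous_pain`, a name chosen for this illustration) so you can inspect it before filtering.

```r
library(dplyr)

# Made-up data, deliberately out of order to show why arrange() comes first.
responses <- tibble(
  participant = c("B", "A", "A", "B", "A", "B"),
  wave = c(2, 1, 3, 1, 2, 3),
  pain = c("Yes", "No", "Yes", "Yes", "Yes", "No")
)

# Add the previous wave's answer within each participant.
with_previous <- responses |>
  arrange(participant, wave) |>
  group_by(participant) |>
  mutate(previous_pain = lag(pain)) |>
  ungroup()

# Keep rows where pain was reported at this wave and at the one before.
persistent <- with_previous |>
  filter(previous_pain == "Yes", pain == "Yes")

persistent$participant  # "A" "B"
persistent$wave         # 3 2
```

Each participant's first wave has no previous observation, so lag() returns NA there and filter() drops the row. Also make sure dplyr is loaded: base R has its own `stats::lag()`, which does something different.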