Data wrangling

This practical uses synthetic Hospital Episode Statistics (HES) data on hip fractures to practise core data manipulation skills: importing data, validating keys, joining data at different levels, creating derived variables, and producing summaries. You will work with four linked tables that reflect common NHS-style data structure: patient-level, spell-level, episode-level, and outcomes.

All data in this practical is simulated for teaching purposes.

Datasets

We will use four linked datasets:

hes_patients

One row per patient.

Contains patient-level information that does not usually change across admissions, such as:

  • patient_id
  • Sex
  • Year of birth
  • Ethnicity
  • Area deprivation

hes_admissions

One row per hospital spell.

A spell is one continuous hospital stay, from admission to discharge. Contains spell-level information such as:

  • spell_id
  • patient_id
  • Admission date
  • Discharge date
  • Admitting hospital

hes_episodes

One or more rows per spell.

An episode is one phase of care within a hospital stay, usually under one consultant or specialty.

Contains episode-level information such as:

  • episode_id
  • spell_id
  • Episode start and end date
  • Procedure codes
  • Diagnosis codes

hes_outcomes

One row per spell with derived outcomes.

Contains summary measures calculated for each hospital stay, such as:

  • spell_id
  • Length of stay
  • 30-day mortality
  • Readmission indicator
  • Discharge destination

In HES data, a spell is the full hospital stay from admission to discharge.

An episode is one part of that stay under a particular consultant or clinical team.

For example, a patient is admitted with a hip fracture through the emergency department. They are first under trauma and orthopaedics for surgery, and later transferred to a geriatric medicine team for ongoing recovery. That could appear as one spell made up of two episodes.

So, one spell can contain multiple episodes.

Join keys

The four tables are linked, but they are not all stored at the same level. Before joining them, it helps to be clear about which identifier belongs to which unit.

  • patients is at the patient level, with one row per patient_id
  • admissions is at the spell level, with one row per spell_id
  • outcomes is also at the spell level, with one row per spell_id
  • episodes is at the episode level, with potentially multiple rows for each spell_id

Import

  1. Download hes_patients.csv, hes_admissions.csv, hes_episodes.csv, and hes_outcomes.csv, then import all four datasets using read_csv().
library(tidyverse)

patients <- read_csv("../data/hes_patients.csv")
admissions <- read_csv("../data/hes_admissions.csv")
episodes <- read_csv("../data/hes_episodes.csv")
outcomes <- read_csv("../data/hes_outcomes.csv")
library(tidyverse)
library(here)

patients <- read_csv(here("data", "hes_patients.csv"))
admissions <- read_csv(here("data", "hes_admissions.csv"))
episodes <- read_csv(here("data", "hes_episodes.csv"))
outcomes <- read_csv(here("data", "hes_outcomes.csv"))

Data cleaning

In the admissions table, the discharge_destination column contains some values such as “UNK” and “Unknown” that are being used to represent missing data.

  1. Use recode to replace “UNK” and “Unknown” with proper NA values in discharge_destination.
admissions <-
  admissions |>
  mutate(
    discharge_destination = recode(
      discharge_destination,
      "Unknown" = NA_character_,
      "UNK" = NA_character_
    )
  )

# Confirm recoding worked
admissions |>
  count(discharge_destination)
# A tibble: 5 × 2
  discharge_destination     n
  <chr>                 <int>
1 Care Home               955
2 Died                     70
3 Home                   3477
4 Rehabilitation         2008
5 <NA>                    201

Validate keys and relationships

  1. Next, check the linking keys used to connect tables:
    • Every admissions$patient_id should match a patients$patient_id
    • Every outcomes$spell_id should match an admissions$spell_id
    • Every episodes$spell_id should match an admissions$spell_id

Also check for unmatched records (sometimes called orphan records): rows in one table whose linking key does not appear in the related table. For example, if an admission has a patient_id that does not exist in patients, that admission cannot be linked to a patient record.

flowchart TB

    classDef patients fill:#eef6ff,stroke:#7aa6d8,stroke-width:1.5px,color:#1f2d3d
    classDef admissions fill:#f3f8ee,stroke:#8fb27e,stroke-width:1.5px,color:#1f2d3d
    classDef episodes fill:#fff4e8,stroke:#d9a15b,stroke-width:1.5px,color:#1f2d3d
    classDef outcomes fill:#f7efff,stroke:#a88acb,stroke-width:1.5px,color:#1f2d3d

    P["patients<br/><code>patient_id</code>"]
    A["admissions<br/><code>spell_id</code><br><code>patient_id</code>"]
    E["episodes<br/><code>episode_id</code><br><code>spell_id</code>"]
    O["outcomes<br/><code>spell_id</code>"]

    P <-->|"one-to-many"| A
    A <-->|"one-to-many"| E
    A <-->|"one-to-one"| O

    class P patients
    class A admissions
    class E episodes
    class O outcomes

key_checks <- tibble(
  table = c("patients", "admissions", "episodes", "outcomes"),
  rows = c(nrow(patients), nrow(admissions), nrow(episodes), nrow(outcomes)),
  unique_key = c(
    n_distinct(patients$patient_id),
    n_distinct(admissions$spell_id),
    n_distinct(episodes$episode_id),
    n_distinct(outcomes$spell_id)
  )
)

key_checks
# A tibble: 4 × 3
  table       rows unique_key
  <chr>      <int>      <int>
1 patients    5000       5000
2 admissions  6711       6711
3 episodes   10429      10429
4 outcomes    6708       6708
orphan_checks <- tibble(
  check = c(
    "admissions without patient",
    "episodes without spell",
    "outcomes without spell"
  ),
  n_orphans = c(
    sum(!admissions$patient_id %in% patients$patient_id),
    sum(!episodes$spell_id %in% admissions$spell_id),
    sum(!outcomes$spell_id %in% admissions$spell_id)
  )
)

orphan_checks
# A tibble: 3 × 2
  check                      n_orphans
  <chr>                          <int>
1 admissions without patient        18
2 episodes without spell            22
3 outcomes without spell            15

Build a spell-level analysis table

We now want to merge the tables to create a single data file, a frame containing information about each spell. This will combine information from admissions, outcomes, and patients.

  1. Join in two steps: first join admissions to outcomes by spell_id, then join patients to the result by patient_id. You can use a pipe to combine these join operations and store the final spell-level table as spell_tbl.
spell_tbl <- admissions |>
  inner_join(outcomes, by = join_by(spell_id)) |>
  inner_join(patients, by = join_by(patient_id))
  1. Create episode_summary from the episodes table. This table is currently at episode level, so group by spell_id and summarise to produce a new spell-level data frame with one row per spell. It should contain:

    • n_episodes: count of episodes in the spell
    • total_episode_days: total days covered by episodes in the spell
    • n_consultant_specialties: number of distinct consultant specialties in the spell
episode_summary <- episodes |>
  mutate(episode_days = as.numeric(episode_end - episode_start)) |>
  group_by(spell_id) |>
  summarise(
    n_episodes = n(),
    total_episode_days = sum(episode_days, na.rm = TRUE),
    n_consultant_specialties = n_distinct(consultant_specialty, na.rm = TRUE),
    .groups = "drop"
  )
  1. Join episode_summary to spell_tbl by spell_id to create your final analysis table dat.
dat <- spell_tbl |>
  left_join(episode_summary, by = "spell_id")

Derive analysis fields

  1. Create these derived columns in dat:
  • age_at_admission from date_of_birth and admission_date
  • age_band (<75, 75-84, 85+)
  • surgery_delay_group (<=2 days, >2 days)
  • admission_year
dat <- dat |>
  mutate(
    admission_date = ymd(admission_date),
    date_of_birth = ymd(date_of_birth),
    age_at_admission = as.integer(time_length(
      interval(date_of_birth, admission_date),
      "years"
    )),
    age_band = case_when(
      between(age_at_admission, 0, 74) ~ "<75",
      between(age_at_admission, 75, 84) ~ "75-84",
      age_at_admission >= 85 ~ "85+",
      .default = NA_character_
    ),
    surgery_delay_group = case_when(
      time_to_surgery_days <= 2 ~ "<=2 days",
      time_to_surgery_days > 2 ~ ">2 days",
      .default = NA_character_
    ),
    admission_year = year(admission_date)
  )

Make comparisons

  1. Compare outcome rates by surgery delay group (<=2 days vs >2 days) for:
  • readmission_30d
  • death_30d
  • long_stay_flag
dat |>
  group_by(surgery_delay_group) |>
  summarise(
    n = n(),
    readmission_30d_rate = mean(readmission_30d),
    death_30d_rate = mean(death_30d),
    long_stay_rate = mean(long_stay_flag),
    .groups = "drop"
  )
# A tibble: 2 × 5
  surgery_delay_group     n readmission_30d_rate death_30d_rate long_stay_rate
  <chr>               <int>                <dbl>          <dbl>          <dbl>
1 <=2 days             1830               0.0874         0.0175         0.0464
2 >2 days              4863               0.115          0.0319         0.320 
  1. Summarise outcomes by age_band and sex.
dat |>
  group_by(age_band, sex) |>
  summarise(
    n = n(),
    death_30d_rate = mean(death_30d),
    readmission_30d_rate = mean(readmission_30d),
    .groups = "drop"
  ) |>
  arrange(age_band, sex)
# A tibble: 6 × 5
  age_band sex        n death_30d_rate readmission_30d_rate
  <chr>    <chr>  <int>          <dbl>                <dbl>
1 75-84    Female  2072         0.0256               0.105 
2 75-84    Male     782         0.0243               0.104 
3 85+      Female   772         0.0466               0.123 
4 85+      Male     341         0.0440               0.0968
5 <75      Female  1977         0.0248               0.104 
6 <75      Male     749         0.0200               0.117 
  1. Rank providers by mean LOS and readmission rate. Restrict to providers with at least 500 spells.
dat |>
  group_by(provider_code) |>
  summarise(
    n = n(),
    mean_los = mean(los_days),
    readmission_30d_rate = mean(readmission_30d),
    .groups = "drop"
  ) |>
  filter(n >= 500) |>
  arrange(desc(mean_los))
# A tibble: 7 × 4
  provider_code     n mean_los readmission_30d_rate
  <chr>         <int>    <dbl>                <dbl>
1 R17             800     13.4               0.106 
2 R16             880     12.8               0.122 
3 R15             955     12.4               0.117 
4 R14            1058     12.1               0.105 
5 R13             974     11.8               0.107 
6 R12            1044     11.2               0.103 
7 R11             982     10.6               0.0947

Communicate

  1. Create a simple summary table of provider-level results (top 10 by mean LOS) using group_by and summarise.

    Create using tinytable:

library(tinytable)

provider_summary <- dat |>
  group_by(provider_code) |>
  summarise(
    n = n(),
    mean_los = mean(los_days),
    readmission_30d_rate = mean(readmission_30d),
    .groups = "drop"
  ) |>
  filter(n >= 500) |>
  arrange(desc(mean_los)) |>
  slice_head(n = 10)

# Stage 1: basic tinytable with raw variable names
tt(provider_summary)
provider_code n mean_los readmission_30d_rate
R17 800 13.36500 0.10625000
R16 880 12.76648 0.12159091
R15 955 12.40838 0.11727749
R14 1058 12.11257 0.10491493
R13 974 11.81294 0.10677618
R12 1044 11.15584 0.10344828
R11 982 10.64868 0.09470468
# Stage 2: add tinytable formatting and grouped headers
provider_table <- provider_summary |>
  mutate(`Readmission within 30 days (%)` = 100 * readmission_30d_rate) |>
  select(
    `ID` = provider_code,
    `Number of spells` = n,
    `Mean length of stay (days)` = mean_los,
    `Readmission within 30 days (%)`
  )

tt(provider_table, caption = "Top 10 providers by mean length of stay") |>
  theme_striped() |>
  group_tt(j = list("Provider" = 1:2, "Outcomes" = 3:4)) |>
  style_tt("groupj", bold = TRUE) |>
  style_tt(j = 1, bold = TRUE) |>
  style_tt(align = "cccc") |>
  format_tt(j = 3:4, digits = 1)
Top 10 providers by mean length of stay
Provider Outcomes
ID Number of spells Mean length of stay (days) Readmission within 30 days (%)
R17 800 13 11
R16 880 13 12
R15 955 12 12
R14 1058 12 10
R13 974 12 11
R12 1044 11 10
R11 982 11 9