Data wrangling

This practical uses synthetic Hospital Episode Statistics (HES) data on hip fractures to practise core data manipulation skills: importing data, validating keys, joining data at different levels, creating derived variables, and producing summaries. You will work with four linked tables that reflect common NHS-style data structure: patient-level, spell-level, episode-level, and outcomes.

All data in this practical is simulated for teaching purposes.

Datasets

We will use four linked datasets:

`hes_patients`

One row per patient.

Contains patient-level information that does not usually change across admissions, such as:

patient_id
Sex
Year of birth
Ethnicity
Area deprivation

`hes_admissions`

One row per hospital spell.

A spell is one continuous hospital stay, from admission to discharge. Contains spell-level information such as:

spell_id
patient_id
Admission date
Discharge date
Admitting hospital

`hes_episodes`

One or more rows per spell.

An episode is one phase of care within a hospital stay, usually under one consultant or specialty.

Contains episode-level information such as:

episode_id
spell_id
Episode start and end date
Procedure codes
Diagnosis codes

`hes_outcomes`

One row per spell with derived outcomes.

Contains summary measures calculated for each hospital stay, such as:

spell_id
Length of stay
30-day mortality
Readmission indicator
Discharge destination

What is a spell and what is an episode?

In HES data, a spell is the full hospital stay from admission to discharge.

An episode is one part of that stay under a particular consultant or clinical team.

For example, a patient is admitted with a hip fracture through the emergency department. They are first under trauma and orthopaedics for surgery, and later transferred to a geriatric medicine team for ongoing recovery. That could appear as one spell made up of two episodes.

So, one spell can contain multiple episodes.

Join keys

The four tables are linked, but they are not all stored at the same level. Before joining them, it helps to be clear about which identifier belongs to which unit.

patients is at the patient level, with one row per patient_id
admissions is at the spell level, with one row per spell_id
outcomes is also at the spell level, with one row per spell_id
episodes is at the episode level, with potentially multiple rows for each spell_id

Import

Download hes_patients.csv, hes_admissions.csv, hes_episodes.csv, and hes_outcomes.csv, then import all four datasets using read_csv().

library(tidyverse)

patients <- read_csv("../data/hes_patients.csv")
admissions <- read_csv("../data/hes_admissions.csv")
episodes <- read_csv("../data/hes_episodes.csv")
outcomes <- read_csv("../data/hes_outcomes.csv")

Tip: Portable paths with here

library(tidyverse)
library(here)

patients <- read_csv(here("data", "hes_patients.csv"))
admissions <- read_csv(here("data", "hes_admissions.csv"))
episodes <- read_csv(here("data", "hes_episodes.csv"))
outcomes <- read_csv(here("data", "hes_outcomes.csv"))

Data cleaning

In the admissions table, the discharge_destination column contains some values such as “UNK” and “Unknown” that are being used to represent missing data.

Use recode to replace “UNK” and “Unknown” with proper NA values in discharge_destination.

Solution

admissions <-
  admissions |>
  mutate(
    discharge_destination = recode(
      discharge_destination,
      "Unknown" = NA_character_,
      "UNK" = NA_character_
    )
  )

# Confirm recoding worked
admissions |>
  count(discharge_destination)

# A tibble: 5 × 2
  discharge_destination     n
  <chr>                 <int>
1 Care Home               955
2 Died                     70
3 Home                   3477
4 Rehabilitation         2008
5 <NA>                    201

Validate keys and relationships

Next, check the linking keys used to connect tables:
- Every admissions$patient_id should match a patients$patient_id
- Every outcomes$spell_id should match an admissions$spell_id
- Every episodes$spell_id should match an admissions$spell_id

Also check for unmatched records (sometimes called orphan records): rows in one table whose linking key does not appear in the related table. For example, if an admission has a patient_id that does not exist in patients, that admission cannot be linked to a patient record.

flowchart TB

    classDef patients fill:#eef6ff,stroke:#7aa6d8,stroke-width:1.5px,color:#1f2d3d
    classDef admissions fill:#f3f8ee,stroke:#8fb27e,stroke-width:1.5px,color:#1f2d3d
    classDef episodes fill:#fff4e8,stroke:#d9a15b,stroke-width:1.5px,color:#1f2d3d
    classDef outcomes fill:#f7efff,stroke:#a88acb,stroke-width:1.5px,color:#1f2d3d

    P["patients<br/><code>patient_id</code>"]
    A["admissions<br/><code>spell_id</code><br><code>patient_id</code>"]
    E["episodes<br/><code>episode_id</code><br><code>spell_id</code>"]
    O["outcomes<br/><code>spell_id</code>"]

    P <-->|"one-to-many"| A
    A <-->|"one-to-many"| E
    A <-->|"one-to-one"| O

    class P patients
    class A admissions
    class E episodes
    class O outcomes

Solution

key_checks <- tibble(
  table = c("patients", "admissions", "episodes", "outcomes"),
  rows = c(nrow(patients), nrow(admissions), nrow(episodes), nrow(outcomes)),
  unique_key = c(
    n_distinct(patients$patient_id),
    n_distinct(admissions$spell_id),
    n_distinct(episodes$episode_id),
    n_distinct(outcomes$spell_id)
  )
)

key_checks

# A tibble: 4 × 3
  table       rows unique_key
  <chr>      <int>      <int>
1 patients    5000       5000
2 admissions  6711       6711
3 episodes   10429      10429
4 outcomes    6708       6708

orphan_checks <- tibble(
  check = c(
    "admissions without patient",
    "episodes without spell",
    "outcomes without spell"
  ),
  n_orphans = c(
    sum(!admissions$patient_id %in% patients$patient_id),
    sum(!episodes$spell_id %in% admissions$spell_id),
    sum(!outcomes$spell_id %in% admissions$spell_id)
  )
)

orphan_checks

# A tibble: 3 × 2
  check                      n_orphans
  <chr>                          <int>
1 admissions without patient        18
2 episodes without spell            22
3 outcomes without spell            15

Build a spell-level analysis table

We now want to merge the tables to create a single data file, a frame containing information about each spell. This will combine information from admissions, outcomes, and patients.

Join in two steps: first join admissions to outcomes by spell_id, then join patients to the result by patient_id. You can use a pipe to combine these join operations and store the final spell-level table as spell_tbl.

Solution

spell_tbl <- admissions |>
  inner_join(outcomes, by = join_by(spell_id)) |>
  inner_join(patients, by = join_by(patient_id))

Create episode_summary from the episodes table. This table is currently at episode level, so group by spell_id and summarise to produce a new spell-level data frame with one row per spell. It should contain:
- n_episodes: count of episodes in the spell
- total_episode_days: total days covered by episodes in the spell
- n_consultant_specialties: number of distinct consultant specialties in the spell

Solution

episode_summary <- episodes |>
  mutate(episode_days = as.numeric(episode_end - episode_start)) |>
  group_by(spell_id) |>
  summarise(
    n_episodes = n(),
    total_episode_days = sum(episode_days, na.rm = TRUE),
    n_consultant_specialties = n_distinct(consultant_specialty, na.rm = TRUE),
    .groups = "drop"
  )

Join episode_summary to spell_tbl by spell_id to create your final analysis table dat.

Solution

dat <- spell_tbl |>
  left_join(episode_summary, by = "spell_id")

Derive analysis fields

Create these derived columns in dat:

age_at_admission from date_of_birth and admission_date
age_band (<75, 75-84, 85+)
surgery_delay_group (<=2 days, >2 days)
admission_year

Solution

dat <- dat |>
  mutate(
    admission_date = ymd(admission_date),
    date_of_birth = ymd(date_of_birth),
    age_at_admission = as.integer(time_length(
      interval(date_of_birth, admission_date),
      "years"
    )),
    age_band = case_when(
      between(age_at_admission, 0, 74) ~ "<75",
      between(age_at_admission, 75, 84) ~ "75-84",
      age_at_admission >= 85 ~ "85+",
      .default = NA_character_
    ),
    surgery_delay_group = case_when(
      time_to_surgery_days <= 2 ~ "<=2 days",
      time_to_surgery_days > 2 ~ ">2 days",
      .default = NA_character_
    ),
    admission_year = year(admission_date)
  )

Make comparisons

Compare outcome rates by surgery delay group (<=2 days vs >2 days) for:

readmission_30d
death_30d
long_stay_flag

Solution

dat |>
  group_by(surgery_delay_group) |>
  summarise(
    n = n(),
    readmission_30d_rate = mean(readmission_30d),
    death_30d_rate = mean(death_30d),
    long_stay_rate = mean(long_stay_flag),
    .groups = "drop"
  )

# A tibble: 2 × 5
  surgery_delay_group     n readmission_30d_rate death_30d_rate long_stay_rate
  <chr>               <int>                <dbl>          <dbl>          <dbl>
1 <=2 days             1830               0.0874         0.0175         0.0464
2 >2 days              4863               0.115          0.0319         0.320

Summarise outcomes by age_band and sex.

Solution

dat |>
  group_by(age_band, sex) |>
  summarise(
    n = n(),
    death_30d_rate = mean(death_30d),
    readmission_30d_rate = mean(readmission_30d),
    .groups = "drop"
  ) |>
  arrange(age_band, sex)

# A tibble: 6 × 5
  age_band sex        n death_30d_rate readmission_30d_rate
  <chr>    <chr>  <int>          <dbl>                <dbl>
1 75-84    Female  2072         0.0256               0.105 
2 75-84    Male     782         0.0243               0.104 
3 85+      Female   772         0.0466               0.123 
4 85+      Male     341         0.0440               0.0968
5 <75      Female  1977         0.0248               0.104 
6 <75      Male     749         0.0200               0.117

Rank providers by mean LOS and readmission rate. Restrict to providers with at least 500 spells.

Solution

dat |>
  group_by(provider_code) |>
  summarise(
    n = n(),
    mean_los = mean(los_days),
    readmission_30d_rate = mean(readmission_30d),
    .groups = "drop"
  ) |>
  filter(n >= 500) |>
  arrange(desc(mean_los))

# A tibble: 7 × 4
  provider_code     n mean_los readmission_30d_rate
  <chr>         <int>    <dbl>                <dbl>
1 R17             800     13.4               0.106 
2 R16             880     12.8               0.122 
3 R15             955     12.4               0.117 
4 R14            1058     12.1               0.105 
5 R13             974     11.8               0.107 
6 R12            1044     11.2               0.103 
7 R11             982     10.6               0.0947

Communicate

Create a simple summary table of provider-level results (top 10 by mean LOS) using group_by and summarise.

Create using tinytable:

Solution

library(tinytable)

provider_summary <- dat |>
  group_by(provider_code) |>
  summarise(
    n = n(),
    mean_los = mean(los_days),
    readmission_30d_rate = mean(readmission_30d),
    .groups = "drop"
  ) |>
  filter(n >= 500) |>
  arrange(desc(mean_los)) |>
  slice_head(n = 10)

# Stage 1: basic tinytable with raw variable names
tt(provider_summary)

provider_code	n	mean_los	readmission_30d_rate
R17	800	13.36500	0.10625000
R16	880	12.76648	0.12159091
R15	955	12.40838	0.11727749
R14	1058	12.11257	0.10491493
R13	974	11.81294	0.10677618
R12	1044	11.15584	0.10344828
R11	982	10.64868	0.09470468

# Stage 2: add tinytable formatting and grouped headers
provider_table <- provider_summary |>
  mutate(`Readmission within 30 days (%)` = 100 * readmission_30d_rate) |>
  select(
    `ID` = provider_code,
    `Number of spells` = n,
    `Mean length of stay (days)` = mean_los,
    `Readmission within 30 days (%)`
  )

tt(provider_table, caption = "Top 10 providers by mean length of stay") |>
  theme_striped() |>
  group_tt(j = list("Provider" = 1:2, "Outcomes" = 3:4)) |>
  style_tt("groupj", bold = TRUE) |>
  style_tt(j = 1, bold = TRUE) |>
  style_tt(align = "cccc") |>
  format_tt(j = 3:4, digits = 1)

Top 10 providers by mean length of stay
Provider		Outcomes
ID	Number of spells	Mean length of stay (days)	Readmission within 30 days (%)
R17	800	13	11
R16	880	13	12
R15	955	12	12
R14	1058	12	10
R13	974	12	11
R12	1044	11	10
R11	982	11	9