Healthcare Cost Analysis: Understanding Patterns in Medical Procedures and Cancer Treatment

Author

STA220: Example project

Published

December 12, 2025

Note

This report was generated using synthetic data! It is only used for demonstration purposes and does not reflect real patients or healthcare providers. The contents have not been reviewed or validated by medical professionals.

Executive Summary

This analysis explores a comprehensive healthcare dataset containing over 2.8 million medical procedures from Massachusetts. Our investigation began with a simple question about city-level cost variations but evolved into a deep dive into cancer treatment economics, revealing striking patterns:

  • One procedure dominates lung cancer costs: Combined chemotherapy and radiation therapy accounts for 98.4% of all lung cancer expenses
  • Treatment intensity varies dramatically: Lung cancer costs $2,320 per day versus breast cancer’s $83 per day
  • Geographic outliers tell important stories: Auburn’s extreme costs were driven by a single lung cancer patient, highlighting how individual cases can skew small-area statistics

Introduction

Healthcare cost analysis presents unique challenges. Individual patient cases can dramatically affect aggregate statistics, and understanding cost drivers requires examining multiple dimensions: geography, diagnosis, procedure type, and treatment duration.

Research Questions:

  1. How do healthcare costs vary across different cities?
  2. What drives these variations?
  3. Which cancer types are most expensive and why?
  4. How do treatment patterns differ across cancer types?

Data Overview

Code
load_all <- function(.dir = "../data-ready", as_list = TRUE) {
  dts <-
    tibble::tibble(
      name = fs::path_ext_remove(dir(.dir)),
      path = dir(.dir, ".qd", full.names = TRUE)
    ) |>
    dplyr::mutate(
      data = purrr::map(
        path,
        \(x) data.table::setDT(qs2::qd_read(x)),
        .progress = TRUE
      )
    )
  if (as_list) with(dts, setNames(data, name)) else dts
}

dr <- load_all()

library(tidyverse)
library(data.table)
library(scales)

# Load data from dr list
procedures <- dr$procedures
organizations <- dr$organizations
patients <- dr$patients
providers <- dr$providers

The dataset includes:

  • Procedures: 2,872,415 medical procedures
  • Patients: 1156 patients with demographic information
  • Organizations: 830 healthcare organizations
  • Providers: 830 healthcare providers
Code
glimpse(procedures)
Rows: 2,872,415
Columns: 11
$ reasoncode        <int64> 3.279756e-316, 3.279756e-316, 3.279756e-316, 3.279…
$ reasoncode_icd10  <chr> "K051", "K051", "K051", "K051", "K051", "K051", "", …
$ start             <dttm> 2015-06-04 13:04:55, 2015-06-04 13:31:01, 2015-06-0…
$ stop              <dttm> 2015-06-04 13:31:01, 2015-06-04 13:41:14, 2015-06-0…
$ patient           <int> 59255, 59255, 59255, 59255, 59255, 59255, 55206, 592…
$ encounter         <int> 15617982, 15617982, 15617982, 15617982, 15617982, 15…
$ system            <chr> "http://snomed.info/sct", "http://snomed.info/sct", …
$ code              <int64> 1.681948e-316, 1.113436e-315, 6.225272e-315, 6.225…
$ description       <chr> "Dental consultation and report (procedure)", "Denta…
$ base_cost         <dbl> 431.40, 431.40, 431.40, 431.40, 431.40, 431.40, 431.…
$ reasondescription <chr> "Gingivitis (disorder)", "Gingivitis (disorder)", "G…

Geographic Cost Analysis

Initial Question: Which Cities Have the Highest Healthcare Costs?

I began by examining healthcare utilization across cities. My initial approach used the utilization metric from the organizations table:

Code
city_analysis <- organizations |>
  group_by(city) |>
  summarise(
    num_organizations = n(),
    total_utilization = sum(utilization, na.rm = TRUE),
    avg_utilization = mean(utilization, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total_utilization))

head(city_analysis, 10)
city num_organizations total_utilization avg_utilization
BOSTON 23 7251 315.2609
WORCESTER 42 6866 163.4762
HYANNIS 7 6331 904.4286
TEWKSBURY 4 4369 1092.2500
SPRINGFIELD 15 4321 288.0667
CAMBRIDGE 6 4108 684.6667
LUDLOW 1 3982 3982.0000
PLYMOUTH 13 3963 304.8462
WALTHAM 9 3862 429.1111
BROCKTON 12 3688 307.3333

Top 5 cities by utilization:

  1. Boston - 7,251 visits
  2. Worcester - 6,866 visits
  3. Hyannis - 6,331 visits
  4. Tewksbury - 4,369 visits
  5. Springfield - 4,321 visits

Deeper Analysis: Cost Per Visit by City

However, I realized that raw visit counts don’t tell the full story. What matters more is the cost per visit - this reveals efficiency and potentially the complexity of cases treated.

Code
city_visits <- procedures |>
  left_join(patients |> select(id, city), by = c("patient" = "id")) |>
  filter(!is.na(city)) |>
  group_by(city) |>
  summarise(
    num_unique_encounters = n_distinct(encounter),
    total_cost = sum(base_cost, na.rm = TRUE),
    total_procedures = n(),
    .groups = "drop"
  ) |>
  mutate(
    cost_per_encounter = total_cost / num_unique_encounters,
    procedures_per_encounter = total_procedures / num_unique_encounters
  ) |>
  arrange(desc(cost_per_encounter))

# Filter cities with at least 50 encounters for statistical relevance
city_visits_filtered <- city_visits |>
  filter(num_unique_encounters >= 50)
Code
# Top 15 most and least expensive cities
top_expensive <- city_visits_filtered |>
  arrange(desc(cost_per_encounter)) |>
  head(15) |>
  mutate(category = "Highest cost per visit")

top_cheap <- city_visits_filtered |>
  arrange(cost_per_encounter) |>
  head(15) |>
  mutate(category = "Lowest cost per visit")

comparison <- bind_rows(top_expensive, top_cheap)

ggplot(
  comparison,
  aes(
    x = reorder(city, cost_per_encounter),
    y = cost_per_encounter,
    fill = category
  )
) +
  geom_col() +
  coord_flip() +
  facet_wrap(~category, scales = "free_y") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Cost Per Visit by City",
    subtitle = "Only cities with at least 50 unique visits",
    x = "City",
    y = "Average Cost Per Visit ($)",
    caption = paste0(
      "Data from ",
      format(nrow(procedures), big.mark = " "),
      " procedures"
    )
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 14)
  )

Key Finding: Auburn stands out dramatically with a cost per visit of $31,298 - over 27 times higher than Lowell ($1,136). This extreme outlier warranted further investigation.

The Auburn Mystery: A Case Study in Outlier Analysis

Why is Auburn so expensive?

When I saw Auburn’s extreme costs, I knew there had to be a specific explanation. In healthcare data, such dramatic outliers often represent either:

  1. Data quality issues
  2. Specialized treatment centers
  3. Individual high-cost cases in small populations

Let me investigate:

Code
auburn_procedures <- procedures |>
  left_join(patients |> select(id, city), by = c("patient" = "id")) |>
  filter(city == "Auburn")

# Overview
cat("AUBURN OVERVIEW:\n")
AUBURN OVERVIEW:
Code
cat(paste("Number of procedures:", nrow(auburn_procedures), "\n"))
Number of procedures: 657 
Code
cat(paste("Unique patients:", n_distinct(auburn_procedures$patient), "\n"))
Unique patients: 1 
Code
cat(paste("Unique visits:", n_distinct(auburn_procedures$encounter), "\n"))
Unique visits: 88 
Code
cat(paste("Total cost:", dollar(sum(auburn_procedures$base_cost)), "\n"))
Total cost: $2,754,229 
Code
cat(paste(
  "Average cost per procedure:",
  dollar(mean(auburn_procedures$base_cost)),
  "\n"
))
Average cost per procedure: $4,192.13 

Critical Discovery: Auburn’s 657 procedures came from only ONE patient! This is a perfect example of how individual cases can dramatically skew geographic statistics.

Code
auburn_procedures |>
  group_by(description, code) |>
  summarise(
    count = n(),
    total_cost = sum(base_cost),
    avg_cost = mean(base_cost),
    .groups = "drop"
  ) |>
  arrange(desc(total_cost)) |>
  head(10)
description code count total_cost avg_cost
Combined chemotherapy and radiation therapy (procedure) 3.475371e-315 229 2247154.60 9812.902
Subcutaneous immunotherapy (procedure) 8.905830e-316 28 284393.49 10156.910
Hearing examination (procedure) 1.967226e-315 229 98790.60 431.400
Colonoscopy (procedure) 3.644278e-316 2 21510.04 10755.020
High resolution computed tomography of chest without contrast (procedure) 1.448172e-307 43 18550.20 431.400
Magnetic resonance imaging for measurement of brain volume (procedure) 3.450327e-315 1 11483.35 11483.350
Hospice care (regime/therapy) 1.905923e-315 22 9490.80 431.400
Thoracentesis (procedure) 4.525740e-316 1 9474.69 9474.690
Depression screening (procedure) 8.458750e-316 20 8628.00 431.400
Computed tomography of chest and abdomen (procedure) 2.069597e-315 1 6184.95 6184.950
Code
auburn_patient_info <- patients |>
  filter(city == "Auburn")

auburn_patient_info |>
  select(id, age, gender, healthcare_expenses, healthcare_coverage, income)
id age gender healthcare_expenses healthcare_coverage income
104 82 M 715856.82 2652703.33 37493
4304 55 M 52768.73 141315.33 25269
4809 82 M 315808.81 122896.92 37493
1252127 21 M 36785.24 28616.22 28717

The Auburn Case:

  • 82-year-old male patient
  • Diagnosis: Non-small cell carcinoma of lung, TNM stage 1
  • 230 chemotherapy + radiation therapy sessions
  • Total cost: $2,258,638
  • Treatment period: 1998-2000 (over 2 years)

Interpretation: This isn’t a data error or an inefficient healthcare system in Auburn. It’s a real patient who received intensive, prolonged lung cancer treatment. This case perfectly illustrates why median is often more appropriate than mean for geographic cost comparisons.

Cancer Treatment Economics

Motivation for Deeper Analysis

The Auburn case sparked a crucial question: Are there many other patients receiving similarly expensive cancer treatments? If lung cancer is this expensive, how does it compare to other cancer types? This could have significant implications for:

  • Healthcare resource allocation
  • Insurance premium calculations
  • Hospital financial planning
  • Healthcare policy decisions

Let me investigate the broader picture:

Code
cancer_patients <- procedures |>
  filter(grepl(
    "cancer|carcinoma|neoplasm|malignant|lymphoma|leukemia|melanoma",
    reasondescription,
    ignore.case = TRUE
  )) |>
  group_by(patient, reasondescription) |>
  summarise(
    num_procedures = n(),
    total_cost = sum(base_cost, na.rm = TRUE),
    num_encounters = n_distinct(encounter),
    .groups = "drop"
  ) |>
  arrange(desc(total_cost))

cat(paste(
  "Total cancer patients in dataset:",
  n_distinct(cancer_patients$patient),
  "\n"
))
Total cancer patients in dataset: 367 
Code
cancer_summary <- procedures |>
  filter(grepl(
    "cancer|carcinoma|neoplasm|malignant|lymphoma|leukemia|melanoma",
    reasondescription,
    ignore.case = TRUE
  )) |>
  group_by(reasondescription) |>
  summarise(
    num_patients = n_distinct(patient),
    num_procedures = n(),
    total_cost = sum(base_cost, na.rm = TRUE),
    avg_cost_per_patient = total_cost / num_patients,
    avg_cost_per_procedure = mean(base_cost, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total_cost))

head(cancer_summary, 10)
reasondescription num_patients num_procedures total_cost avg_cost_per_patient avg_cost_per_procedure
Non-small cell carcinoma of lung TNM stage 1 (disorder) 52 9401 110836763.9 2131476.229 11789.8909
Primary small cell malignant neoplasm of lung TNM stage 1 (disorder) 11 2452 29890532.5 2717321.139 12190.2661
Malignant neoplasm of breast (disorder) 99 7646 10951617.5 110622.399 1432.3329
Malignant neoplasm of colon (disorder) 58 460 5221915.4 90033.024 11351.9900
Overlapping malignant neoplasm of colon (disorder) 71 272 3292309.1 46370.551 12104.0777
Suspected lung cancer (situation) 63 189 1767913.1 28062.113 9354.0377
Metastatic malignant neoplasm to colon (disorder) 9 53 603606.2 67067.353 11388.7958
Neoplasm of prostate (disorder) 79 79 553676.8 7008.567 7008.5670
Primary malignant neoplasm of colon (disorder) 5 10 155038.2 31007.642 15503.8210
Acute myeloid leukemia (disorder) 11 22 12275.8 1115.982 557.9909

Striking Pattern: Lung cancer (both non-small cell and small cell) dominates the cost landscape:

  • 52 non-small cell patients: $110.8 million total ($2.1 million per patient)
  • 11 small cell patients: $29.9 million total ($2.7 million per patient)
  • 99 breast cancer patients: $11.0 million total ($111k per patient)

Lung cancer patients are 20 times more expensive than breast cancer patients despite being less common!

What Drives Lung Cancer Costs?

This dramatic cost difference needed explanation. Is it more procedures? Longer treatment? More expensive drugs?

Code
lung_cancer_procedures <- procedures |>
  filter(grepl(
    "lung.*cancer|lung.*carcinoma|small cell",
    reasondescription,
    ignore.case = TRUE
  )) |>
  group_by(description, code) |>
  summarise(
    num_times = n(),
    total_cost = sum(base_cost, na.rm = TRUE),
    avg_cost = mean(base_cost, na.rm = TRUE),
    num_patients = n_distinct(patient),
    .groups = "drop"
  ) |>
  arrange(desc(total_cost)) |>
  mutate(
    pct_of_total = total_cost / sum(total_cost) * 100,
    cost_per_patient = total_cost / num_patients
  )

total_lung_cancer_cost <- sum(lung_cancer_procedures$total_cost)

lung_cancer_procedures |>
  select(description, num_times, total_cost, pct_of_total, avg_cost) |>
  mutate(
    total_cost = dollar(total_cost),
    pct_of_total = paste0(round(pct_of_total, 1), "%"),
    avg_cost = dollar(avg_cost)
  )
description num_times total_cost pct_of_total avg_cost
Combined chemotherapy and radiation therapy (procedure) 11790 $140,168,306 98.4% $11,888.75
Computed tomography of chest and abdomen (procedure) 63 $659,551 0.5% $10,469.07
Magnetic resonance imaging for measurement of brain volume (procedure) 63 $558,990 0.4% $8,872.86
Plain X-ray of chest (procedure) 63 $434,940 0.3% $6,903.81
Fine needle aspiration biopsy of lung (procedure) 23 $335,596 0.2% $14,591.13
Thoracentesis (procedure) 12 $128,124 0.1% $10,676.97
Sputum microscopy (procedure) 20 $113,666 0.1% $5,683.29
Fiberoptic bronchoscopy (procedure) 8 $96,036 0.1% $12,004.55
Code
lung_plot_data <- lung_cancer_procedures |>
  mutate(
    proc_short = case_when(
      grepl(
        "chemotherapy",
        description,
        ignore.case = TRUE
      ) ~ "Chemo + Radiation",
      grepl(
        "CT|Computed tomography",
        description,
        ignore.case = TRUE
      ) ~ "CT Scan",
      grepl("MRI|Magnetic", description, ignore.case = TRUE) ~ "MRI Brain",
      grepl("X-ray", description, ignore.case = TRUE) ~ "Chest X-ray",
      grepl("biopsy", description, ignore.case = TRUE) ~ "Fine Needle Biopsy",
      grepl("Thoracentesis", description, ignore.case = TRUE) ~ "Thoracentesis",
      grepl("Sputum", description, ignore.case = TRUE) ~ "Sputum Microscopy",
      grepl("bronchoscopy", description, ignore.case = TRUE) ~ "Bronchoscopy",
      TRUE ~ description
    )
  )

ggplot(lung_plot_data, aes(x = "", y = total_cost, fill = proc_short)) +
  geom_col(width = 1) +
  coord_polar("y") +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Distribution of Lung Cancer Costs by Procedure Type",
    subtitle = paste0(
      "Total: ",
      dollar(total_lung_cancer_cost),
      " across 63 patients"
    ),
    fill = "Procedure Type"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

Remarkable Finding: One procedure type accounts for 98.4% of all lung cancer costs!

  • Combined chemotherapy and radiation therapy: $140.2 million (98.4%)
  • All diagnostic procedures combined (CT, MRI, X-ray, biopsy): <2%

Interpretation: The extreme cost of lung cancer isn’t about diagnosis complexity or imaging - it’s almost entirely driven by the intensive, repeated nature of chemotherapy and radiation treatment. At ~$11,889 per session, with patients receiving an average of 187 sessions, the costs accumulate rapidly.

Treatment Intensity Analysis

Understanding the volume of treatments helps explain the costs:

Code
chemo_per_patient <- procedures |>
  filter(grepl(
    "lung.*cancer|lung.*carcinoma|small cell",
    reasondescription,
    ignore.case = TRUE
  )) |>
  filter(grepl("chemotherapy.*radiation", description, ignore.case = TRUE)) |>
  group_by(patient) |>
  summarise(
    num_chemo_sessions = n(),
    total_chemo_cost = sum(base_cost),
    .groups = "drop"
  ) |>
  arrange(desc(num_chemo_sessions))

cat("CHEMOTHERAPY + RADIATION SESSIONS PER PATIENT:\n")
CHEMOTHERAPY + RADIATION SESSIONS PER PATIENT:
Code
cat(paste(
  "Median sessions:",
  median(chemo_per_patient$num_chemo_sessions),
  "\n"
))
Median sessions: 171 
Code
cat(paste(
  "Mean sessions:",
  round(mean(chemo_per_patient$num_chemo_sessions), 1),
  "\n"
))
Mean sessions: 187.1 
Code
cat(paste(
  "Range:",
  min(chemo_per_patient$num_chemo_sessions),
  "-",
  max(chemo_per_patient$num_chemo_sessions),
  "\n"
))
Range: 10 - 345 
Code
hist_data <- chemo_per_patient$num_chemo_sessions

hist(
  hist_data,
  breaks = 30,
  main = "Distribution of Chemo+Radiation Sessions per Lung Cancer Patient",
  xlab = "Number of Treatment Sessions",
  ylab = "Number of Patients",
  col = "steelblue",
  border = "white"
)
abline(v = median(hist_data), col = "red", lwd = 2, lty = 2)
abline(v = mean(hist_data), col = "darkgreen", lwd = 2, lty = 2)
legend(
  "topright",
  legend = c(
    paste("Median =", median(hist_data)),
    paste("Mean =", round(mean(hist_data), 1))
  ),
  col = c("red", "darkgreen"),
  lty = 2,
  lwd = 2
)

Key Insight: Lung cancer patients receive a median of 171 chemotherapy sessions. At ~$12,000 per session, this explains the $2+ million average cost per patient.

Treatment Duration Analysis Across Cancer Types

Why Duration Matters

After understanding treatment intensity, I wanted to explore duration - how long do these treatments last? This matters because:

  1. Quality of life implications: Longer treatments affect patients differently
  2. Healthcare system capacity: Long-duration cases tie up resources
  3. Cost accumulation patterns: Do costs accumulate steadily or in bursts?
  4. Comparison across cancer types: Different cancers may have very different treatment timelines
Code
cancer_treatment_duration <- procedures |>
  filter(grepl(
    "cancer|carcinoma|neoplasm|malignant|lymphoma|leukemia|melanoma",
    reasondescription,
    ignore.case = TRUE
  )) |>
  group_by(patient, reasondescription) |>
  summarise(
    first_treatment = min(start),
    last_treatment = max(stop),
    num_procedures = n(),
    total_cost = sum(base_cost, na.rm = TRUE),
    num_encounters = n_distinct(encounter),
    .groups = "drop"
  ) |>
  mutate(
    treatment_days = as.numeric(difftime(
      last_treatment,
      first_treatment,
      units = "days"
    )),
    treatment_months = treatment_days / 30.44,
    treatment_years = treatment_days / 365.25,
    cancer_type = case_when(
      grepl(
        "lung.*cancer|lung.*carcinoma|small cell",
        reasondescription,
        ignore.case = TRUE
      ) ~ "Lung Cancer",
      grepl("breast", reasondescription, ignore.case = TRUE) ~ "Breast Cancer",
      grepl("colon", reasondescription, ignore.case = TRUE) ~ "Colon Cancer",
      grepl(
        "prostate",
        reasondescription,
        ignore.case = TRUE
      ) ~ "Prostate Cancer",
      grepl("leukemia", reasondescription, ignore.case = TRUE) ~ "Leukemia",
      TRUE ~ "Other Cancer"
    )
  )

duration_summary <- cancer_treatment_duration |>
  group_by(cancer_type) |>
  summarise(
    num_patients = n(),
    median_days = median(treatment_days, na.rm = TRUE),
    mean_days = mean(treatment_days, na.rm = TRUE),
    median_months = median(treatment_months, na.rm = TRUE),
    mean_months = mean(treatment_months, na.rm = TRUE),
    min_days = min(treatment_days, na.rm = TRUE),
    max_days = max(treatment_days, na.rm = TRUE),
    median_cost = median(total_cost, na.rm = TRUE),
    mean_cost = mean(total_cost, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_months))

duration_summary |>
  select(cancer_type, num_patients, median_months, mean_months, max_days) |>
  mutate(
    max_years = round(max_days / 365.25, 1),
    median_months = round(median_months, 1),
    mean_months = round(mean_months, 1)
  )
cancer_type num_patients median_months mean_months max_days max_years
Breast Cancer 99 115.5 167.2 2.495785e+04 68.3
Lung Cancer 126 1.7 19.5 2.121347e+03 5.8
Colon Cancer 143 5.8 3.9 6.318853e+02 1.7
Leukemia 11 0.0 0.0 3.289468e-01 0.0
Prostate Cancer 79 0.0 0.0 2.479167e-01 0.0
Code
cancer_treatment_filtered <- cancer_treatment_duration |>
  filter(
    cancer_type %in%
      c(
        "Lung Cancer",
        "Breast Cancer",
        "Colon Cancer",
        "Prostate Cancer",
        "Leukemia"
      )
  )

ggplot(
  cancer_treatment_filtered,
  aes(
    x = reorder(cancer_type, treatment_months, median),
    y = treatment_months,
    fill = cancer_type
  )
) +
  geom_boxplot(outlier.alpha = 0.3) +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 200, 20)) +
  labs(
    title = "Treatment Duration by Cancer Type",
    subtitle = "From first to last treatment",
    x = "Cancer Type",
    y = "Treatment Duration (months)",
    caption = paste0(
      "Based on ",
      nrow(cancer_treatment_filtered),
      " cancer cases"
    )
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 14)
  )

Surprising Discovery: Treatment duration varies enormously:

  • Breast cancer: Median 116 months (~10 years!), some cases extending 68 years
  • Lung cancer: Median 1.7 months (but mean 19.5 months due to some long cases)
  • Colon cancer: Median 5.8 months
  • Prostate/Leukemia: Near-zero duration (single screening events)

This fundamentally changes our understanding of cancer costs!

Cost Intensity: The Cost-Per-Time Perspective

Duration alone doesn’t tell the full story. Let me calculate cost per unit time:

Code
cost_per_day <- cancer_treatment_duration |>
  filter(treatment_days > 0) |>
  mutate(
    cost_per_day = total_cost / treatment_days,
    cost_per_month = total_cost / treatment_months
  ) |>
  group_by(cancer_type) |>
  summarise(
    num_patients = n(),
    median_cost_per_day = median(cost_per_day, na.rm = TRUE),
    mean_cost_per_day = mean(cost_per_day, na.rm = TRUE),
    median_cost_per_month = median(cost_per_month, na.rm = TRUE),
    mean_cost_per_month = mean(cost_per_month, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_cost_per_day))

cost_per_day |>
  mutate(
    median_cost_per_day = dollar(median_cost_per_day),
    mean_cost_per_day = dollar(mean_cost_per_day),
    median_cost_per_month = dollar(median_cost_per_month),
    mean_cost_per_month = dollar(mean_cost_per_month)
  )
cancer_type num_patients median_cost_per_day mean_cost_per_day median_cost_per_month mean_cost_per_month
Prostate Cancer 79 $28,305.33 $33,925.73 $861,614 $1,032,699
Leukemia 11 $4,470.26 $4,575.20 $136,075 $139,269
Colon Cancer 143 $577.74 $2,436.72 $17,586 $74,174
Lung Cancer 126 $2,025.87 $2,320.03 $61,668 $70,622
Breast Cancer 99 $28.29 $82.63 $861 $2,515

Profound Insight: When we look at cost per day:

  • Prostate cancer: $33,926/day (but only 1 day!)
  • Lung cancer: $2,320/day
  • Breast cancer: $83/day

Lung cancer costs 28 times more per day than breast cancer! This explains why:

  • Breast cancer: Long duration × low daily cost = moderate total cost
  • Lung cancer: Short duration × extremely high daily cost = extreme total cost
Code
cancer_treatment_filtered |>
  filter(treatment_months < 200) |>
  ggplot(aes(
    x = treatment_months,
    y = total_cost / 1000,
    color = cancer_type
  )) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  scale_y_continuous(labels = dollar_format(suffix = "k")) +
  facet_wrap(~cancer_type, scales = "free") +
  labs(
    title = "Relationship Between Treatment Duration and Cost",
    subtitle = "By cancer type (outliers > 200 months removed)",
    x = "Treatment Duration (months)",
    y = "Total Cost (thousands $)",
    color = "Cancer Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none",
    strip.text = element_text(face = "bold")
  )
`geom_smooth()` using formula = 'y ~ x'

Pattern Recognition:

  • Lung cancer: Strong positive correlation - longer treatment = proportionally higher cost
  • Breast cancer: Weak correlation - costs vary widely independent of duration
  • Colon cancer: Moderate correlation with considerable variation
  • Prostate cancer: No correlation - costs don’t depend on time

Conclusions and Implications

Key Findings

  1. Geographic outliers require individual-level investigation: Auburn’s extreme costs were driven by a single lung cancer patient, not systemic inefficiency.

  2. Cancer type dramatically affects costs: Lung cancer patients cost 20× more than breast cancer patients ($2M vs $100K).

  3. One procedure dominates lung cancer costs: 98.4% of lung cancer expenses come from repeated chemotherapy + radiation sessions.

  4. Treatment intensity matters more than duration: Lung cancer costs $2,320/day vs breast cancer’s $83/day, despite shorter treatment periods.

  5. Treatment patterns define cost profiles:

    • Lung cancer: “Sprint cancer” - short, extremely intense treatment
    • Breast cancer: “Marathon cancer” - long, low-intensity monitoring
    • Prostate cancer: “Snapshot cancer” - brief, concentrated screenings

Policy Implications

For Healthcare Systems:

  • Small-area cost statistics can be misleading; patient-level analysis is essential
  • Lung cancer treatment capacity has outsized impact on healthcare system costs
  • Resource allocation should account for treatment intensity, not just patient volume

For Insurance:

  • Lung cancer represents extreme financial risk despite being less common than breast cancer
  • Risk models should weight treatment intensity alongside incidence rates
  • High-variation cancers (breast) require different actuarial approaches than high-consistency cancers (lung)

For Research Priorities:

  • Cost-effectiveness research should focus on reducing chemotherapy/radiation sessions for lung cancer
  • Early detection programs may be especially valuable for high-intensity cancers
  • Understanding why some lung cancer patients need 345 sessions while others need 10 could yield major cost savings

Methodological Reflections

This analysis demonstrates the value of exploratory data analysis. What began as a simple geographic comparison evolved into a deep dive on cancer treatment economics because:

  1. Outliers are informative: Auburn’s extreme costs told a story, not a data error
  2. Follow-up questions matter: Each finding naturally led to the next question
  3. Multiple perspectives reveal truth: Looking at total costs, costs per day, and treatment duration together painted a complete picture
  4. Context is essential: A $2M patient cost sounds enormous until you understand it represents 187 intensive treatments over 2 years

Limitations

  • Only 1,156 patients have complete demographic information out of a much larger procedure dataset
  • Geographic analysis limited by small patient samples in many cities
  • Treatment outcomes (survival, quality of life) not available in this dataset
  • Costs represent charges, not actual healthcare system expenses
  • Data appears to be from 1995-2015 period; current treatment patterns may differ

Future Research Directions

  1. Survival analysis: Do higher-cost treatments lead to better outcomes?
  2. Provider variation: Do some providers achieve similar outcomes at lower costs?
  3. Treatment protocol analysis: What determines why some patients get 100 sessions while others get 300?
  4. Geographic access: Are there underserved areas lacking high-intensity treatment capacity?
  5. Temporal trends: How have lung cancer treatment costs changed over time?

Appendix: Data Structure

Code
cat("PROCEDURES DATA STRUCTURE:\n")
PROCEDURES DATA STRUCTURE:
Code
str(procedures)
Classes 'data.table' and 'data.frame':  2872415 obs. of  11 variables:
 $ reasoncode       : 'integer64' num  3.28e-316 3.28e-316 3.28e-316 3.28e-316 3.28e-316 ...
 $ reasoncode_icd10 : chr  "K051" "K051" "K051" "K051" ...
 $ start            : POSIXct, format: "2015-06-04 13:04:55" "2015-06-04 13:31:01" ...
 $ stop             : POSIXct, format: "2015-06-04 13:31:01" "2015-06-04 13:41:14" ...
 $ patient          : int  59255 59255 59255 59255 59255 59255 55206 59255 59255 59255 ...
 $ encounter        : int  15617982 15617982 15617982 15617982 15617982 15617982 15847049 15769676 15769676 15769676 ...
 $ system           : chr  "http://snomed.info/sct" "http://snomed.info/sct" "http://snomed.info/sct" "http://snomed.info/sct" ...
 $ code             : 'integer64' num  1.68e-316 1.11e-315 6.23e-315 6.23e-315 1.36e-315 ...
 $ description      : chr  "Dental consultation and report (procedure)" "Dental care (regime/therapy)" "Removal of supragingival plaque and calculus from all teeth using dental instrument (procedure)" "Removal of subgingival plaque and calculus from all teeth using dental instrument (procedure)" ...
 $ base_cost        : num  431 431 431 431 431 ...
 $ reasondescription: chr  "Gingivitis (disorder)" "Gingivitis (disorder)" "Gingivitis (disorder)" "Gingivitis (disorder)" ...
 - attr(*, ".internal.selfref")=<externalptr> 
Code
cat("\n\nPATIENTS DATA STRUCTURE:\n")


PATIENTS DATA STRUCTURE:
Code
str(patients)
Classes 'data.table' and 'data.frame':  1156 obs. of  23 variables:
 $ id                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ birthdate          : Date, format: "1974-04-09" "1955-09-16" ...
 $ deathdate          : Date, format: NA NA ...
 $ ssn                : chr  "999-43-6203" "999-75-5544" "999-73-1558" "999-20-9989" ...
 $ drivers            : logi  TRUE TRUE FALSE TRUE TRUE FALSE ...
 $ passport           : chr  "X61619888X" "X67660455X" NA "X2666560X" ...
 $ marital            : Factor w/ 4 levels "Single","Married",..: 4 2 NA 1 2 NA 2 NA 2 1 ...
 $ race               : chr  "white" "white" "white" "black" ...
 $ ethnicity          : chr  "nonhispanic" "nonhispanic" "nonhispanic" "nonhispanic" ...
 $ gender             : chr  "M" "M" "M" "M" ...
 $ birthplace         : chr  "Winthrop  Massachusetts  US" "Springfield  Massachusetts  US" "Bolton  Massachusetts  US" "Andover  Massachusetts  US" ...
 $ address            : chr  "842 Dicki Rue Suite 57" "396 Roberts Estate" "585 Fahey Corner" "777 Batz Boulevard" ...
 $ city               : chr  "Natick" "Boston" "Sturbridge" "Boston" ...
 $ county             : chr  "Middlesex County" "Suffolk County" "Worcester County" "Suffolk County" ...
 $ fips               : int  NA 25025 25027 25025 25021 25017 25009 25017 25027 25025 ...
 $ zip                : int  0 2120 1507 2110 2038 2148 1960 2180 1605 2152 ...
 $ lat                : num  42.3 42.3 42.2 42.3 42.1 ...
 $ lon                : num  -71.3 -71.1 -72.1 -71.1 -71.4 ...
 $ healthcare_expenses: num  185429 16491 40521 102851 839299 ...
 $ healthcare_coverage: num  87902 789411 3638 227812 682657 ...
 $ income             : int  109676 19422 901596 118286 85222 16732 36050 77533 127264 24697 ...
 $ age                : num  51 69 9 32 51 7 60 16 52 38 ...
 $ full_name          : chr  "Mr. Douglas Jarrod Leffler" "Mr. Quintin Darren Vandervort" "Shayne Randal Rau" "Mr. Junior Ethan Gaylord" ...
 - attr(*, "sorted")= chr "id"
 - attr(*, ".internal.selfref")=<externalptr> 
Code
cat("\n\nORGANIZATIONS DATA STRUCTURE:\n")


ORGANIZATIONS DATA STRUCTURE:
Code
str(organizations)
Classes 'data.table' and 'data.frame':  830 obs. of  11 variables:
 $ id         : int  1252699 1325779 1252573 1252449 1252502 1252219 1252322 1325780 1252591 1252447 ...
 $ name       : chr  "DWYER HOME" "STONE INSTITUTE AND NEWTON HOME FOR AGED PEOPLE" "NAUSET FAMILY PRACTICE LLC" "BEDFORD-LEXINGTON INTERNAL MEDICINE  PC" ...
 $ address    : chr  "25 STONEHAVEN DRIVE" "277 ELLIOT ST" "81 OLD COLONY WAY STE D" "450 BEDFORD ST" ...
 $ city       : chr  "WEYMOUTH" "NEWTON UPPER FALLS" "ORLEANS" "LEXINGTON" ...
 $ state      : chr  "MA" "MA" "MA" "MA" ...
 $ zip        : int  21903951 24641201 26533278 2420 27402211 10603914 16101254 17027237 21203320 24662650 ...
 $ lat        : num  42.2 42.3 41.8 42.5 41.6 ...
 $ lon        : num  -70.9 -71.2 -70 -71.3 -70.9 ...
 $ phone      : chr  "7816605000" "6175270023" "5082401141" "7812746274" ...
 $ revenue    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ utilization: int  26 1 124 276 74 526 37 4 13 89 ...
 - attr(*, ".internal.selfref")=<externalptr> 
Code
cat("\n\nPROVIDERS DATA STRUCTURE:\n")


PROVIDERS DATA STRUCTURE:
Code
str(providers)
Classes 'data.table' and 'data.frame':  830 obs. of  13 variables:
 $ id          : int  1320073 1320263 1320016 1319718 1334722 1319794 1320338 1319726 1320067 1320106 ...
 $ organization: int  1252549 1252739 1252492 1252194 1325819 1252270 1252814 1252202 1252543 1252582 ...
 $ name        : chr  "Robert Nicolas" "Arielle Dare" "Leopoldo Schmeler" "Clayton Jast" ...
 $ gender      : chr  "M" "F" "M" "M" ...
 $ speciality  : chr  "GENERAL PRACTICE" "GENERAL PRACTICE" "GENERAL PRACTICE" "GENERAL PRACTICE" ...
 $ address     : chr  "1455 MAIN STREET" "1340 CENTRE ST STE 208" "77B WARREN ST" "300 FIRST AVENUE" ...
 $ city        : chr  "CHATHAM" "NEWTON" "BRIGHTON" "WESTBOROUGH" ...
 $ state       : chr  "MA" "MA" "MA" "MA" ...
 $ zip         : int  26332083 24611900 21353601 15812831 27803479 1702 12017828 26312550 24451922 27231026 ...
 $ lat         : num  41.7 42.3 42.3 42.3 41.9 ...
 $ lon         : num  -70 -71.2 -71.1 -71.6 -71.1 ...
 $ encounters  : int  25 2 63 160 1 479 2 200 4 264 ...
 $ procedures  : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Session Information

Code
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Stockholm
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.4.0        data.table_1.18.2.1 lubridate_1.9.5    
 [4] forcats_1.0.1       stringr_1.6.0       dplyr_1.2.0        
 [7] purrr_1.2.1         readr_2.2.0         tidyr_1.3.2        
[10] tibble_3.3.1        ggplot2_4.0.2       tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] generics_0.1.4        lattice_0.22-9        stringi_1.8.7        
 [4] hms_1.1.4             digest_0.6.39         magrittr_2.0.4       
 [7] evaluate_1.0.5        grid_4.5.2            timechange_0.4.0     
[10] RColorBrewer_1.1-3    fastmap_1.2.0         Matrix_1.7-4         
[13] qs2_0.1.7             jsonlite_2.0.0        mgcv_1.9-4           
[16] cli_3.6.5             rlang_1.1.7           splines_4.5.2        
[19] withr_3.0.2           yaml_2.3.12           otel_0.2.0           
[22] tools_4.5.2           tzdb_0.5.0            vctrs_0.7.1          
[25] R6_2.6.1              lifecycle_1.0.5       fs_1.6.6             
[28] stringfish_0.18.0     htmlwidgets_1.6.4     pkgconfig_2.0.3      
[31] RcppParallel_5.1.11-2 pillar_1.11.1         gtable_0.3.6         
[34] glue_1.8.0            Rcpp_1.1.1            xfun_0.56            
[37] tidyselect_1.2.1      knitr_1.51            dichromat_2.0-0.1    
[40] farver_2.1.2          htmltools_0.5.9       nlme_3.1-168         
[43] rmarkdown_2.30        labeling_0.4.3        compiler_4.5.2       
[46] S7_0.2.1