Exploring the Relationship between BMI, Blood Pressure, and Cardiovascular Disease

An Exploratory Analysis of Synthetic Healthcare Data

Author

STA220: Example project

Published

March 10, 2026

Note

This report was generated using synthetic data! It is only used for demonstration purposes and does not reflect real patients or healthcare providers. The contents have not been reviewed or validated by medical professionals.

Introduction

Cardiovascular disease (CVD) remains one of the leading causes of mortality worldwide. Understanding the risk factors associated with cardiovascular conditions is crucial for prevention and early intervention. Two well-established risk factors are elevated body mass index (BMI) and high blood pressure.

Research Questions

This exploratory analysis investigates the following questions:

  1. Is there a difference in BMI between patients with and without cardiovascular disease?
  2. Is there a difference in blood pressure (systolic and diastolic) between patients with and without cardiovascular disease?
  3. Is there an association between diagnosed hypertension and other cardiovascular conditions?
  4. Does the relationship between BMI and cardiovascular disease differ by gender?

Data and Methods

Dataset Description

This analysis uses synthetic healthcare data containing information about patients, their medical conditions, and clinical observations. The dataset includes:

  • Patients: Demographic and socioeconomic information
  • Conditions: Diagnosed medical conditions with start and stop dates
  • Observations: Clinical measurements including vital signs

Variable Definitions

  • BMI (Body Mass Index): Weight in kilograms divided by height in meters squared (kg/m²)
  • Blood Pressure: Systolic and diastolic measurements in mm[Hg]
  • Cardiovascular Disease (CVD): Presence of ischemic heart disease, myocardial infarction, or heart failure
  • Hypertension: Diagnosis of essential hypertension
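
As a quick illustration of the BMI definition above, the following sketch computes BMI from a weight and a height. The helper name `compute_bmi` and the example values are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper: BMI = weight (kg) / height (m)^2
compute_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Illustrative values only (not from the dataset)
example_bmi <- compute_bmi(weight_kg = 80, height_m = 1.80)
round(example_bmi, 1)
```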

Statistical Methods

We employ the following statistical approaches:

  • Descriptive statistics: Mean, median, standard deviation, and visual summaries
  • Normality testing: Shapiro-Wilk test to assess distribution assumptions
  • Two-sample comparisons: Welch's independent t-tests (R's default, suited to approximately normal data with unequal variances) and Mann-Whitney U tests (for non-normal data); both are reported for each comparison
  • Association testing: Chi-square test for categorical variables
  • Significance level: α = 0.05
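
The decision rule in the list above (check normality, then pick a two-sample test) could be sketched roughly as follows. This is a minimal illustration on simulated data, not the report's own pipeline; the group names, sample sizes, and seed are assumptions.

```r
set.seed(220)  # arbitrary seed so the simulated example is reproducible

# Simulated (hypothetical) measurements for two groups
group_a <- rnorm(200, mean = 27, sd = 4)
group_b <- rnorm(200, mean = 29, sd = 4)

alpha <- 0.05

# Shapiro-Wilk test on each group
normal_a <- shapiro.test(group_a)$p.value > alpha
normal_b <- shapiro.test(group_b)$p.value > alpha

# Choose the two-sample test based on the normality check
if (normal_a && normal_b) {
  result <- t.test(group_a, group_b)       # Welch's t-test by default
} else {
  result <- wilcox.test(group_a, group_b)  # Mann-Whitney U / Wilcoxon rank-sum
}

result$method
result$p.value
```

In the report itself both tests are run side by side, which avoids making the choice of test depend on a preliminary normality test.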

Data Preparation

Show code
library(tidyverse)
library(knitr)
library(scales)

# Set theme for plots
theme_set(theme_minimal(base_size = 12))
Show code
# Load datasets
patients <- read_csv("../data-fixed/patients.csv")
conditions <- read_csv("../data-fixed/conditions.csv")
observations <- read_csv("../data-fixed/observations.csv")

Extract Latest Clinical Measurements

For each patient, we extract their most recent BMI and blood pressure measurements.

Show code
# Extract latest BMI per patient
latest_bmi <- observations |>
  filter(code == "39156-5") |> # BMI [Ratio]
  mutate(bmi = as.numeric(value)) |>
  filter(!is.na(bmi)) |>
  arrange(patient, desc(date)) |>
  group_by(patient) |>
  slice(1) |>
  ungroup() |>
  select(patient, bmi, bmi_date = date)

# Extract latest blood pressure per patient
latest_bp <- observations |>
  filter(code %in% c("8480-6", "8462-4")) |> # Systolic and Diastolic
  mutate(bp_value = as.numeric(value)) |>
  filter(!is.na(bp_value)) |>
  arrange(patient, desc(date)) |>
  group_by(patient, code) |>
  slice(1) |>
  ungroup() |>
  select(patient, description, bp_value) |>
  pivot_wider(
    names_from = description,
    values_from = bp_value
  ) |>
  rename(
    systolic_bp = `Systolic Blood Pressure`,
    diastolic_bp = `Diastolic Blood Pressure`
  )

cat("Patients with BMI measurements:", nrow(latest_bmi), "\n")
Patients with BMI measurements: 6644 
Show code
cat("Patients with BP measurements:", nrow(latest_bp), "\n")
Patients with BP measurements: 6851 

Identify Cardiovascular Conditions

We identify patients with cardiovascular disease and hypertension.

Show code
# Identify patients with CVD (excluding hypertension for now)
cvd_conditions <- c(
  "Ischemic heart disease (disorder)",
  "Myocardial infarction (disorder)",
  "Acute non-ST segment elevation myocardial infarction (disorder)",
  "Acute ST segment elevation myocardial infarction (disorder)",
  "Chronic congestive heart failure (disorder)",
  "Heart failure (disorder)",
  "History of myocardial infarction (situation)"
)

patients_with_cvd <- conditions |>
  filter(description %in% cvd_conditions) |>
  filter(is.na(stop) | stop >= Sys.Date()) |> # Keep active conditions only (no stop date, or stop date in the future)
  distinct(patient) |>
  mutate(has_cvd = TRUE)

# Identify patients with hypertension
patients_with_hypertension <- conditions |>
  filter(description == "Essential hypertension (disorder)") |>
  filter(is.na(stop) | stop >= Sys.Date()) |>
  distinct(patient) |>
  mutate(has_hypertension = TRUE)

cat("Patients with CVD:", nrow(patients_with_cvd), "\n")
Patients with CVD: 1403 
Show code
cat("Patients with hypertension:", nrow(patients_with_hypertension), "\n")
Patients with hypertension: 1492 

Create Analysis Dataset

We combine all data into a single analysis dataset.

Show code
# Create master analysis dataset
analysis_data <- patients |>
  select(patient = id, birthdate, gender, race, ethnicity, income) |>
  left_join(latest_bmi, by = "patient") |>
  left_join(latest_bp, by = "patient") |>
  left_join(patients_with_cvd, by = "patient") |>
  left_join(patients_with_hypertension, by = "patient") |>
  mutate(
    has_cvd = replace_na(has_cvd, FALSE),
    has_hypertension = replace_na(has_hypertension, FALSE),
    age = as.numeric(difftime(Sys.Date(), birthdate, units = "days")) / 365.25
  ) |>
  filter(!is.na(bmi) & !is.na(systolic_bp) & !is.na(diastolic_bp))

cat("Final analysis dataset size:", nrow(analysis_data), "patients\n")
Final analysis dataset size: 6644 patients
Show code
cat("Patients with CVD in analysis:", sum(analysis_data$has_cvd), "\n")
Patients with CVD in analysis: 1403 
Show code
cat(
  "Patients with hypertension in analysis:",
  sum(analysis_data$has_hypertension),
  "\n"
)
Patients with hypertension in analysis: 1492 
Show code
# Check for missing values
missing_summary <- analysis_data |>
  summarise(across(c(bmi, systolic_bp, diastolic_bp, age), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing")

kable(missing_summary, caption = "Missing values in key variables")
Missing values in key variables
Variable Missing
bmi 0
systolic_bp 0
diastolic_bp 0
age 0

Exploratory Data Analysis

Sample Characteristics

Show code
# Overall descriptive statistics
desc_stats <- analysis_data |>
  summarise(
    N = n(),
    `Mean Age` = mean(age, na.rm = TRUE),
    `SD Age` = sd(age, na.rm = TRUE),
    `% Female` = mean(gender == "F") * 100,
    `Mean BMI` = mean(bmi, na.rm = TRUE),
    `SD BMI` = sd(bmi, na.rm = TRUE),
    `Mean Systolic BP` = mean(systolic_bp, na.rm = TRUE),
    `SD Systolic BP` = sd(systolic_bp, na.rm = TRUE),
    `Mean Diastolic BP` = mean(diastolic_bp, na.rm = TRUE),
    `SD Diastolic BP` = sd(diastolic_bp, na.rm = TRUE),
    `% with CVD` = mean(has_cvd) * 100,
    `% with Hypertension` = mean(has_hypertension) * 100
  ) |>
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value")

kable(desc_stats, digits = 2, caption = "Overall sample characteristics")
Overall sample characteristics
Statistic Value
N 6644.00
Mean Age 45.04
SD Age 25.38
% Female 50.56
Mean BMI 27.76
SD BMI 34.48
Mean Systolic BP 129.04
SD Systolic BP 19.30
Mean Diastolic BP 73.50
SD Diastolic BP 13.10
% with CVD 21.12
% with Hypertension 22.46

Distribution of Risk Factors

BMI Distribution

Show code
p1 <- ggplot(analysis_data, aes(x = bmi)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(
    xintercept = mean(analysis_data$bmi),
    color = "red",
    linetype = "dashed",
    linewidth = 1
  ) +
  labs(title = "Distribution of BMI", x = "BMI (kg/m²)", y = "Count") +
  annotate(
    "text",
    x = mean(analysis_data$bmi) + 5,
    y = Inf,
    vjust = 2,
    label = paste("Mean =", round(mean(analysis_data$bmi), 1))
  )

p2 <- ggplot(analysis_data, aes(x = bmi)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(title = "BMI Boxplot", x = "BMI (kg/m²)") +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

library(patchwork) # provides + and / for combining ggplot objects
p1 + p2 + plot_layout(widths = c(3, 1))

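The BMI distribution has a mean of 27.8 kg/m² and is right-skewed. The standard deviation (34.5 kg/m²) is larger than the mean, which points to a small number of extreme BMI values that should be inspected before relying on the mean and SD.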
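
One way to look for extreme BMI values of the kind suggested by the summary statistics is to flag entries outside a plausible range and compare outlier-sensitive summaries with robust ones. The sketch below uses a toy vector; the values and the plausibility bounds (10 and 80 kg/m²) are assumptions chosen only for illustration.

```r
# Toy BMI vector with two deliberately extreme entries (hypothetical values)
toy_bmi <- c(22.4, 27.9, 31.2, 25.0, 28.6, 1450, 24.3, 980)

# Assumed plausibility bounds, chosen only for illustration
lower <- 10
upper <- 80

implausible <- toy_bmi < lower | toy_bmi > upper

# Outlier-sensitive summaries versus robust summaries
c(mean = mean(toy_bmi), sd = sd(toy_bmi))
c(median = median(toy_bmi), iqr = IQR(toy_bmi))

# Number of flagged values
sum(implausible)
```

In this toy example the median stays near the typical values while the mean and SD are dominated by the two extreme entries.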

Blood Pressure Distribution

Show code
p3 <- ggplot(analysis_data, aes(x = systolic_bp)) +
  geom_histogram(bins = 50, fill = "coral", alpha = 0.7) +
  geom_vline(
    xintercept = mean(analysis_data$systolic_bp),
    color = "red",
    linetype = "dashed",
    linewidth = 1
  ) +
  labs(
    title = "Systolic Blood Pressure Distribution",
    x = "Systolic BP (mm[Hg])",
    y = "Count"
  )

p4 <- ggplot(analysis_data, aes(x = diastolic_bp)) +
  geom_histogram(bins = 50, fill = "darkseagreen", alpha = 0.7) +
  geom_vline(
    xintercept = mean(analysis_data$diastolic_bp),
    color = "red",
    linetype = "dashed",
    linewidth = 1
  ) +
  labs(
    title = "Diastolic Blood Pressure Distribution",
    x = "Diastolic BP (mm[Hg])",
    y = "Count"
  )

p3 / p4

Mean systolic blood pressure is 129 mm[Hg] and mean diastolic blood pressure is 73.5 mm[Hg].

Cardiovascular Disease Prevalence

Show code
prevalence <- analysis_data |>
  summarise(
    `CVD (excl. Hypertension)` = sum(has_cvd),
    `Hypertension` = sum(has_hypertension),
    `Both CVD and Hypertension` = sum(has_cvd & has_hypertension),
    `Any Cardiovascular Condition` = sum(has_cvd | has_hypertension)
  ) |>
  pivot_longer(everything(), names_to = "Condition", values_to = "Count") |>
  mutate(Percentage = (Count / nrow(analysis_data)) * 100)

kable(
  prevalence,
  digits = 2,
  caption = "Prevalence of cardiovascular conditions"
)
Prevalence of cardiovascular conditions
Condition Count Percentage
CVD (excl. Hypertension) 1403 21.12
Hypertension 1492 22.46
Both CVD and Hypertension 648 9.75
Any Cardiovascular Condition 2247 33.82

Bivariate Relationships

BMI by CVD Status

Show code
ggplot(analysis_data, aes(x = has_cvd, y = bmi, fill = has_cvd)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_manual(
    values = c("steelblue", "coral"),
    labels = c("No CVD", "CVD")
  ) +
  labs(
    title = "BMI Distribution by Cardiovascular Disease Status",
    x = "Cardiovascular Disease",
    y = "BMI (kg/m²)",
    fill = "CVD Status"
  ) +
  scale_x_discrete(labels = c("No CVD", "CVD")) +
  theme(legend.position = "none")

Show code
bmi_summary <- analysis_data |>
  group_by(has_cvd) |>
  summarise(
    N = n(),
    Mean = mean(bmi),
    SD = sd(bmi),
    Median = median(bmi),
    IQR = IQR(bmi)
  ) |>
  mutate(has_cvd = ifelse(has_cvd, "CVD", "No CVD"))

kable(bmi_summary, digits = 2, caption = "BMI summary statistics by CVD status")
BMI summary statistics by CVD status
has_cvd N Mean SD Median IQR
No CVD 5241 27.38 38.39 27.7 7.1
CVD 1403 29.20 11.01 28.0 2.4

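Patients with CVD have a slightly higher median BMI (28.0 kg/m²) than those without CVD (27.7 kg/m²), and a higher mean (29.20 vs 27.38 kg/m²). In the no-CVD group the SD (38.39) is much larger than the IQR (7.1), which suggests that a few extreme values inflate the mean and SD.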

Blood Pressure by CVD Status

Show code
p5 <- ggplot(analysis_data, aes(x = has_cvd, y = systolic_bp, fill = has_cvd)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("steelblue", "coral")) +
  labs(
    title = "Systolic BP by CVD Status",
    x = "CVD Status",
    y = "Systolic BP (mm[Hg])"
  ) +
  scale_x_discrete(labels = c("No CVD", "CVD")) +
  theme(legend.position = "none")

p6 <- ggplot(
  analysis_data,
  aes(x = has_cvd, y = diastolic_bp, fill = has_cvd)
) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("steelblue", "coral")) +
  labs(
    title = "Diastolic BP by CVD Status",
    x = "CVD Status",
    y = "Diastolic BP (mm[Hg])"
  ) +
  scale_x_discrete(labels = c("No CVD", "CVD")) +
  theme(legend.position = "none")

p5 + p6

Correlation between Risk Factors

Show code
ggplot(analysis_data, aes(x = bmi, y = systolic_bp, color = has_cvd)) +
  geom_point(alpha = 0.3, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_color_manual(
    values = c("steelblue", "coral"),
    labels = c("No CVD", "CVD")
  ) +
  labs(
    title = "Relationship between BMI and Systolic Blood Pressure",
    x = "BMI (kg/m²)",
    y = "Systolic BP (mm[Hg])",
    color = "CVD Status"
  ) +
  theme(legend.position = "top")

Hypothesis Testing

H1: BMI Difference by CVD Status

Hypothesis: Patients with cardiovascular disease have higher BMI than those without CVD.

Show code
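# Test normality assumption. shapiro.test() accepts at most 5000 values,
# so larger groups are randomly subsampled (no seed is set, so results can vary slightly)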
shapiro_cvd <- shapiro.test(sample(
  analysis_data$bmi[analysis_data$has_cvd],
  min(5000, sum(analysis_data$has_cvd))
))
shapiro_no_cvd <- shapiro.test(sample(
  analysis_data$bmi[!analysis_data$has_cvd],
  min(5000, sum(!analysis_data$has_cvd))
))

cat("Shapiro-Wilk test for BMI (CVD group):\n")
Shapiro-Wilk test for BMI (CVD group):
Show code
cat(
  "  W =",
  round(shapiro_cvd$statistic, 4),
  ", p-value =",
  format.pval(shapiro_cvd$p.value, digits = 3),
  "\n\n"
)
  W = 0.1247 , p-value = <2e-16 
Show code
cat("Shapiro-Wilk test for BMI (No CVD group):\n")
Shapiro-Wilk test for BMI (No CVD group):
Show code
cat(
  "  W =",
  round(shapiro_no_cvd$statistic, 4),
  ", p-value =",
  format.pval(shapiro_no_cvd$p.value, digits = 3),
  "\n"
)
  W = 0.0677 , p-value = <2e-16 

Both groups depart strongly from normality: the W statistics (0.12 and 0.07) are far below 1, which indicates more than the usual sensitivity of the Shapiro-Wilk test at large sample sizes. We therefore perform both parametric and non-parametric tests.
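
To see how a handful of extreme values can affect the Shapiro-Wilk W statistic, the sketch below compares simulated normal data with the same data after a few extreme entries are injected. The simulated values, the number of injected entries, and the seed are assumptions; this is not the report's data.

```r
set.seed(220)  # arbitrary seed so the simulated example is reproducible

# Simulated BMI-like values (hypothetical, not the report's data)
clean <- rnorm(2000, mean = 27.5, sd = 5)

# Same values with five extreme entries injected
contaminated <- clean
contaminated[1:5] <- 1500

w_clean <- shapiro.test(clean)$statistic
w_contaminated <- shapiro.test(contaminated)$statistic

round(c(clean = unname(w_clean), contaminated = unname(w_contaminated)), 3)
```

A W close to 1 is expected for the clean sample, while the contaminated sample yields a much lower W, a pattern similar in kind to the very low values reported above.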

Show code
# Parametric test (t-test)
t_test_bmi <- t.test(bmi ~ has_cvd, data = analysis_data)

# Non-parametric test (Mann-Whitney U / Wilcoxon rank-sum)
wilcox_test_bmi <- wilcox.test(bmi ~ has_cvd, data = analysis_data)

cat("Independent t-test:\n")
Independent t-test:
Show code
cat(
  "  t =",
  round(t_test_bmi$statistic, 3),
  ", df =",
  round(t_test_bmi$parameter, 1),
  ", p-value =",
  format.pval(t_test_bmi$p.value, digits = 3),
  "\n"
)
  t = -2.993 , df = 6619.3 , p-value = 0.00277 
Show code
cat("  Mean difference:", round(diff(t_test_bmi$estimate), 2), "kg/m²\n")
  Mean difference: 1.81 kg/m²
Show code
cat(
  "  95% CI:",
  round(t_test_bmi$conf.int[1], 2),
  "to",
  round(t_test_bmi$conf.int[2], 2),
  "\n\n"
)
  95% CI: -3 to -0.63 
Show code
cat("Mann-Whitney U test:\n")
Mann-Whitney U test:
Show code
cat(
  "  W =",
  wilcox_test_bmi$statistic,
  ", p-value =",
  format.pval(wilcox_test_bmi$p.value, digits = 3),
  "\n"
)
  W = 2784480 , p-value = <2e-16 

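Conclusion: There is a statistically significant difference in BMI between patients with and without CVD (Welch t-test p = 0.003; Mann-Whitney U p < 0.001). Patients with CVD have a mean BMI that is 1.81 kg/m² higher than those without CVD (95% CI: 0.63 to 3.00). The printed confidence interval is negative because t.test() reports the no-CVD mean minus the CVD mean.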
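
The reported Welch t statistic can be cross-checked from the BMI summary table alone, and a standardized effect size helps put the mean difference in context. The sketch below uses the rounded values from that table, so small discrepancies from the printed output are expected; Cohen's d with a pooled SD is an assumed choice of effect size, not one used in the report.

```r
# Summary statistics copied from the "BMI summary statistics by CVD status" table
n_no  <- 5241; mean_no  <- 27.38; sd_no  <- 38.39
n_cvd <- 1403; mean_cvd <- 29.20; sd_cvd <- 11.01

# Welch's t statistic (no CVD minus CVD, matching the sign of the t.test() output)
se_diff <- sqrt(sd_no^2 / n_no + sd_cvd^2 / n_cvd)
t_welch <- (mean_no - mean_cvd) / se_diff

# Welch-Satterthwaite degrees of freedom
df_welch <- se_diff^4 /
  ((sd_no^2 / n_no)^2 / (n_no - 1) + (sd_cvd^2 / n_cvd)^2 / (n_cvd - 1))

# Cohen's d with a pooled SD, as a rough standardized effect size
sd_pooled <- sqrt(((n_no - 1) * sd_no^2 + (n_cvd - 1) * sd_cvd^2) /
                    (n_no + n_cvd - 2))
cohens_d <- (mean_cvd - mean_no) / sd_pooled

round(c(t = t_welch, df = df_welch, d = cohens_d), 3)
```

With the rounded table values this gives a t statistic close to -3.0 on about 6619 degrees of freedom, in line with the reported output, and a Cohen's d of roughly 0.05, which is a small standardized difference.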

H2: Systolic Blood Pressure Difference by CVD Status

Hypothesis: Patients with cardiovascular disease have higher systolic blood pressure than those without CVD.

Show code
# Parametric test
t_test_sys <- t.test(systolic_bp ~ has_cvd, data = analysis_data)

# Non-parametric test
wilcox_test_sys <- wilcox.test(systolic_bp ~ has_cvd, data = analysis_data)

cat("Independent t-test:\n")
Independent t-test:
Show code
cat(
  "  t =",
  round(t_test_sys$statistic, 3),
  ", df =",
  round(t_test_sys$parameter, 1),
  ", p-value =",
  format.pval(t_test_sys$p.value, digits = 3),
  "\n"
)
  t = -28.088 , df = 1945.3 , p-value = <2e-16 
Show code
cat("  Mean difference:", round(diff(t_test_sys$estimate), 2), "mm[Hg]\n")
  Mean difference: 16.96 mm[Hg]
Show code
cat(
  "  95% CI:",
  round(t_test_sys$conf.int[1], 2),
  "to",
  round(t_test_sys$conf.int[2], 2),
  "\n\n"
)
  95% CI: -18.14 to -15.77 
Show code
cat("Mann-Whitney U test:\n")
Mann-Whitney U test:
Show code
cat(
  "  W =",
  wilcox_test_sys$statistic,
  ", p-value =",
  format.pval(wilcox_test_sys$p.value, digits = 3),
  "\n"
)
  W = 1820488 , p-value = <2e-16 

Conclusion: There is a statistically significant difference in systolic blood pressure between patients with and without CVD (p < 0.001). Patients with CVD have a mean systolic BP that is 16.96 mm[Hg] higher.

H3: Diastolic Blood Pressure Difference by CVD Status

Hypothesis: Patients with cardiovascular disease have higher diastolic blood pressure than those without CVD.

Show code
# Parametric test
t_test_dia <- t.test(diastolic_bp ~ has_cvd, data = analysis_data)

# Non-parametric test
wilcox_test_dia <- wilcox.test(diastolic_bp ~ has_cvd, data = analysis_data)

cat("Independent t-test:\n")
Independent t-test:
Show code
cat(
  "  t =",
  round(t_test_dia$statistic, 3),
  ", df =",
  round(t_test_dia$parameter, 1),
  ", p-value =",
  format.pval(t_test_dia$p.value, digits = 3),
  "\n"
)
  t = 1.005 , df = 1931.4 , p-value = 0.315 
Show code
cat("  Mean difference:", round(diff(t_test_dia$estimate), 2), "mm[Hg]\n")
  Mean difference: -0.44 mm[Hg]
Show code
cat(
  "  95% CI:",
  round(t_test_dia$conf.int[1], 2),
  "to",
  round(t_test_dia$conf.int[2], 2),
  "\n\n"
)
  95% CI: -0.42 to 1.31 
Show code
cat("Mann-Whitney U test:\n")
Mann-Whitney U test:
Show code
cat(
  "  W =",
  wilcox_test_dia$statistic,
  ", p-value =",
  format.pval(wilcox_test_dia$p.value, digits = 3),
  "\n"
)
  W = 3444464 , p-value = 0.000272 

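Conclusion: The evidence for a difference in diastolic blood pressure is mixed. The Welch t-test finds no significant difference in means (p = 0.315; 95% CI for the no-CVD mean minus the CVD mean: -0.42 to 1.31 mm[Hg]), and the mean diastolic BP of patients with CVD is 0.44 mm[Hg] lower, not higher. The Mann-Whitney U test does detect a difference between the two distributions (p < 0.001). Because the two tests disagree, the hypothesis of higher diastolic BP in patients with CVD is not clearly supported.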

H4: Association between Hypertension and Other CVD

Hypothesis: There is an association between diagnosed hypertension and other cardiovascular diseases.

Show code
# Create contingency table
cont_table <- table(analysis_data$has_hypertension, analysis_data$has_cvd)
rownames(cont_table) <- c("No Hypertension", "Hypertension")
colnames(cont_table) <- c("No CVD", "CVD")

kable(cont_table, caption = "Contingency table: Hypertension vs CVD")
Contingency table: Hypertension vs CVD
No CVD CVD
No Hypertension 4397 755
Hypertension 844 648
Show code
# Add row and column totals
cont_table_with_totals <- addmargins(cont_table)
kable(cont_table_with_totals, caption = "Contingency table with totals")
Contingency table with totals
No CVD CVD Sum
No Hypertension 4397 755 5152
Hypertension 844 648 1492
Sum 5241 1403 6644
Show code
# Chi-square test
chi_test <- chisq.test(cont_table)

cat("\nChi-square test of independence:\n")

Chi-square test of independence:
Show code
cat(
  "  X² =",
  round(chi_test$statistic, 2),
  ", df =",
  chi_test$parameter,
  ", p-value =",
  format.pval(chi_test$p.value, digits = 3),
  "\n"
)
  X² = 573.45 , df = 1 , p-value = <2e-16 
Show code
# Calculate proportions
prop_cvd_with_htn <- cont_table[2, 2] / sum(cont_table[2, ])
prop_cvd_without_htn <- cont_table[1, 2] / sum(cont_table[1, ])

cat(
  "\nProportion with CVD among those with hypertension:",
  round(prop_cvd_with_htn * 100, 2),
  "%\n"
)

Proportion with CVD among those with hypertension: 43.43 %
Show code
cat(
  "Proportion with CVD among those without hypertension:",
  round(prop_cvd_without_htn * 100, 2),
  "%\n"
)
Proportion with CVD among those without hypertension: 14.65 %
Show code
# Calculate odds ratio
odds_ratio <- (cont_table[2, 2] * cont_table[1, 1]) /
  (cont_table[2, 1] * cont_table[1, 2])
cat("\nOdds ratio:", round(odds_ratio, 2), "\n")

Odds ratio: 4.47 

Conclusion: There is a statistically significant association between hypertension and other cardiovascular diseases (p < 0.001). Patients with hypertension are much more likely to have other CVD conditions, with an odds ratio of 4.47.
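
An odds ratio is easier to interpret with an interval estimate. The sketch below recomputes the odds ratio from the contingency table counts and adds a Wald 95% confidence interval on the log-odds scale; the Wald interval is an assumed choice of method and is not part of the original analysis.

```r
# Counts copied from the contingency table above
cont <- matrix(
  c(4397, 755,
    844,  648),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("No Hypertension", "Hypertension"),
    c("No CVD", "CVD")
  )
)

# Odds ratio, using the same cross-product as in the report
or <- (cont[2, 2] * cont[1, 1]) / (cont[2, 1] * cont[1, 2])

# Wald 95% CI on the log-odds scale (assumed interval method)
se_log_or <- sqrt(sum(1 / cont))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(odds_ratio = or, lower = ci[1], upper = ci[2]), 2)
```

This reproduces the odds ratio of 4.47 and gives an interval of roughly 3.9 to 5.1.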

H5: Gender Differences in BMI-CVD Relationship

Hypothesis: The relationship between BMI and CVD differs by gender.

Show code
# Visualize by gender
ggplot(analysis_data, aes(x = has_cvd, y = bmi, fill = has_cvd)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(
    ~gender,
    labeller = labeller(gender = c(F = "Female", M = "Male"))
  ) +
  scale_fill_manual(values = c("steelblue", "coral")) +
  labs(
    title = "BMI by CVD Status, Stratified by Gender",
    x = "CVD Status",
    y = "BMI (kg/m²)"
  ) +
  scale_x_discrete(labels = c("No CVD", "CVD")) +
  theme(legend.position = "none")

Show code
# Test for females
female_data <- analysis_data |> filter(gender == "F")
t_test_female <- t.test(bmi ~ has_cvd, data = female_data)

# Test for males
male_data <- analysis_data |> filter(gender == "M")
t_test_male <- t.test(bmi ~ has_cvd, data = male_data)

cat("Females:\n")
Females:
Show code
cat("  Mean BMI difference:", round(diff(t_test_female$estimate), 2), "kg/m²\n")
  Mean BMI difference: 2.93 kg/m²
Show code
cat(
  "  t =",
  round(t_test_female$statistic, 3),
  ", p-value =",
  format.pval(t_test_female$p.value, digits = 3),
  "\n\n"
)
  t = -4.051 , p-value = 5.6e-05 
Show code
cat("Males:\n")
Males:
Show code
cat("  Mean BMI difference:", round(diff(t_test_male$estimate), 2), "kg/m²\n")
  Mean BMI difference: 0.83 kg/m²
Show code
cat(
  "  t =",
  round(t_test_male$statistic, 3),
  ", p-value =",
  format.pval(t_test_male$p.value, digits = 3),
  "\n"
)
  t = -0.746 , p-value = 0.456 
Show code
# Summary table
gender_summary <- analysis_data |>
  group_by(gender, has_cvd) |>
  summarise(
    N = n(),
    Mean_BMI = mean(bmi),
    SD_BMI = sd(bmi),
    .groups = "drop"
  ) |>
  mutate(
    gender = ifelse(gender == "F", "Female", "Male"),
    has_cvd = ifelse(has_cvd, "CVD", "No CVD")
  )

kable(
  gender_summary,
  digits = 2,
  caption = "BMI summary by gender and CVD status"
)
BMI summary by gender and CVD status
gender has_cvd N Mean_BMI SD_BMI
Female No CVD 2799 26.87 15.83
Female CVD 560 29.81 15.61
Male No CVD 2442 27.96 53.63
Male CVD 843 28.79 6.29

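Conclusion: The BMI difference between patients with and without CVD is statistically significant in females (p = 5.6e-05) but not in males (p = 0.456). The mean BMI difference is 2.93 kg/m² in females and 0.83 kg/m² in males. Note that separate within-gender tests do not by themselves show that the two differences differ from each other; that would require a test of the gender-by-CVD interaction.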
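
Separate tests within each gender do not directly test whether the two BMI differences differ from each other. One rough way to examine this from the summary table alone is an approximate z-test on the difference between the two differences. The sketch below uses the rounded table values and a normal approximation, both of which are simplifying assumptions; the helper `diff_and_se` is hypothetical.

```r
# Summary statistics copied from the "BMI summary by gender and CVD status" table
f_no  <- list(n = 2799, mean = 26.87, sd = 15.83)
f_cvd <- list(n = 560,  mean = 29.81, sd = 15.61)
m_no  <- list(n = 2442, mean = 27.96, sd = 53.63)
m_cvd <- list(n = 843,  mean = 28.79, sd = 6.29)

# Hypothetical helper: CVD-minus-no-CVD difference and its standard error
diff_and_se <- function(cvd, no) {
  c(diff = cvd$mean - no$mean,
    se   = sqrt(cvd$sd^2 / cvd$n + no$sd^2 / no$n))
}

female <- diff_and_se(f_cvd, f_no)
male   <- diff_and_se(m_cvd, m_no)

# Approximate z-test for the difference between the two differences
dd <- female[["diff"]] - male[["diff"]]
se <- sqrt(female[["se"]]^2 + male[["se"]]^2)
z  <- dd / se
p  <- 2 * pnorm(-abs(z))

round(c(difference = dd, se = se, z = z, p = p), 3)
```

With these rounded inputs the z statistic is about 1.6 (p of roughly 0.11), so the gap between the female and male BMI differences is not itself significant at α = 0.05. The very large SD in the male no-CVD group makes this comparison fragile.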

Discussion and Conclusions

Key Findings

This exploratory analysis of synthetic healthcare data revealed several important findings regarding the relationship between BMI, blood pressure, and cardiovascular disease:

  1. BMI and CVD: Patients with cardiovascular disease have significantly higher BMI compared to those without CVD (mean difference: 1.81 kg/m², t-test p = 0.003). Within gender, the difference is significant for females (p < 0.001) but not for males (p = 0.456).

  2. Blood Pressure and CVD: Systolic blood pressure is significantly elevated in patients with CVD, whereas the evidence for diastolic blood pressure is mixed:

    • Systolic BP difference: 16.96 mm[Hg] higher with CVD (p < 0.001)
    • Diastolic BP difference: 0.44 mm[Hg] lower with CVD (t-test p = 0.315; Mann-Whitney U p < 0.001)
  3. Hypertension and CVD: There is a strong association between hypertension and other cardiovascular diseases (χ² = 573.45, p < 0.001). The odds ratio of 4.47 indicates that patients with hypertension have substantially higher odds of having other CVD conditions.

  4. Gender Patterns: The BMI-CVD difference is larger and statistically significant in female patients (2.93 kg/m², p < 0.001) but small and not significant in male patients (0.83 kg/m², p = 0.456).

Clinical Implications

These findings are broadly consistent with the well-established understanding that:

  • Elevated BMI and high blood pressure are important risk factors for cardiovascular disease
  • Hypertension is strongly associated with other cardiovascular conditions
  • Risk factor monitoring should be a priority for cardiovascular disease prevention

Limitations

Several limitations should be considered when interpreting these results:

  1. Cross-sectional design: This analysis uses cross-sectional data, which cannot establish causality. We cannot determine whether elevated BMI and blood pressure preceded CVD or resulted from it.

  2. Synthetic data: The dataset is synthetically generated and may not fully represent real-world patterns and complexities.

  3. Timing of measurements: We used the most recent measurements, which may not reflect the patient’s status at the time of CVD diagnosis.

  4. Confounding variables: Other important risk factors (e.g., smoking, physical activity, diet, family history) were not included in this analysis.

  5. Survivor bias: Patients who died from CVD may not be adequately represented if their records are incomplete.

Future Directions

Future analyses could explore:

  • Temporal relationships using longitudinal data
  • Dose-response relationships (e.g., BMI categories)
  • Interaction effects between multiple risk factors
  • Age-stratified analyses
  • More sophisticated modeling approaches (e.g., survival analysis, time-to-event analysis)
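
As a starting point for the dose-response idea listed above, BMI could be grouped into categories with `cut()`. The sketch below uses the standard WHO adult cut-offs (18.5, 25, and 30 kg/m²); the example values and category labels are illustrative assumptions.

```r
# Illustrative BMI values (hypothetical)
bmi_values <- c(17.2, 22.0, 27.5, 34.1)

# WHO adult BMI categories; intervals are closed on the left, e.g. [25, 30)
bmi_category <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right = FALSE
)

table(bmi_category)
```

CVD prevalence could then be tabulated per category to look for a trend across increasing BMI.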

Conclusion

This exploratory analysis shows associations between BMI, systolic blood pressure, hypertension, and cardiovascular disease in this synthetic healthcare dataset. Not every hypothesis was supported: the diastolic blood pressure comparison gave conflicting results across tests, and the BMI difference was significant only among female patients. While these findings are based on synthetic data and cannot establish causality, the significant associations align with established medical knowledge about cardiovascular disease risk factors and support monitoring these modifiable risk factors in clinical practice.


Report generated on: 2026-03-10