Health data

Published

March 16, 2026

This website includes documentation for lectures and exercises concerning the health data part of the course STA220 Health data and questionnaires. General information about the course is found on the corresponding Canvas page.

Plan

Note
  • See Canvas/TimeEdit for schedule.
  • Note that this plan only considers the health data part of the course!
  • See details below regarding the literature.
  • Some literature is only used partially (as described in the handouts)
  • The literature is associated with each lecture/session but is not required reading before the session (unless otherwise stated).
Warning
  • The plan, as well as the content of each session, is preliminary and may change before the respective session is scheduled.
  • Modifications to lecture slides (and thus indirectly to the lecture handouts) may be made even after the session has taken place. Such changes are only intended to clarify points discussed during the session or to correct mistakes.
Weekday Time Shared¹ Compulsory Activity Session title subtitle literature
w4
Mon 10:15-12 X Lecture EL1 Intro Overview of this part of the course (Nguyen 2022, ch. 1), (Ludvigsson et al. 2009), (Laugesen et al. 2021)
Wed 08:15-12 X Lecture EL2 European legislation GDPR, EHDS and DA (Vukovic et al. 2022), (Nguyen 2022, ch. 4)
w5
Mon 10:15-12 X Lecture EL3 Swedish legislation Laws and regulation (“Public Access and Secrecy | Swedish National Data Service” 2025), (Görman 2024)
w9
Mon 10:15-12 Lecture EL4 Tooling Positron and version control (“The Unix Shell: Summary and Setup” 2026), (Rodrigues 2023, ch. 4)
13:15-16 Computer training ECS1 IDE and version control Positron, git and GitHub
Wed 10:15-12 Lecture EL5 Reproducibility {targets} and safe environments (Nguyen 2022, ch. 2), (Baker 2016), (Oliveira Andrade 2025), (Kavianpour et al. 2022), (Peng and Hicks 2021)
13:15-16 Compulsory Seminar ES1 Ethics and Legality
w10
Wed 10:15-12 Lecture EL6 Data formats data.table, SQL, DuckDB, Parquet, SAS, JSON, API (Nguyen 2022, ch. 2), (Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 21), (Fenk, Furu, and Bakken, n.d.), (Data Analysis Using Data.table, n.d.)
13:15-16 Compulsory Computer training ECS2 Pipeline and reproducibility targets and understanding the data
w11
Mon 10:15-12 Lecture EL7 Medical coding ICD, ATC, KVÅ, regex (Nguyen 2022, ch. 3), (Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 15), (Bindel and Seifert 2025), (Alharbi, Isouard, and Tolchard 2021), (Nelson et al. 2024)
13:15-16 Compulsory Computer training ECS3 Data formats and regex
Wed 11:15-13 Lecture EL8 Health care registers From cradle to grave (Hiyoshi 2026), (Ludvigsson et al. 2016)
14:15-17 Computer training ECS4 Data project
w12
Mon 10:15-12 Lecture EL9 Documentation Quarto (Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 28-29)
13:15-16 Computer training ECS5 Presentation Quarto
Wed 10:15-12 Lecture EL10 Biggish data
13:15-16 Computer training ECS6 Project Quarto
w13
Mon 08:15-12 X Lecture EL11 Recap L1, L4-L9
13:15-16 Compulsory Seminar ES2 Project presentations
¹ Shared session for the whole course (AGE and EB).

Software

To complete the exercises and assignments for this course, you need to install the following software (all free and available for all major operating systems):

Some additional recommended software (also free but not mandatory for the course):

Positron extensions

Positron is an integrated development environment (IDE) for R. It makes it easy to write and execute R code, manage projects, and visualize data. It is made by Posit (formerly RStudio) PBC and is free for individual use.

Positron is based on Code OSS, the open-source version of VS Code from Microsoft. As such, it supports a wide range of extensions and customization options.

We will use the GitHub Pull Requests extension. Search for it in the Extensions pane in Positron and install it.

Instructions for Git in Positron

Literature

Course books are available to GU students through “O’Reilly Learning for Higher Education”. Instructions for access

Primary course book. Not every chapter of the book will be discussed during the course, and some programming examples in the book are written in languages that we will not use, as our programming focus is mainly on R.

Recommended books.

Additional mandatory reading

PDF versions of scientific articles are found in Canvas.

DISA exam

  • The literature listed in the table above (literature column) is used for examination.
  • Some references, however, are only partially used (see handouts and lecture slides).
  • Some of the literature is examined implicitly as part of the project part of the course.

The DISA exam will examine:

  • The course book Nguyen (2022) (selected parts of chapters 1–4)

  • All PDFs found in the Canvas module

    • Most tables and detailed methods sections can be skipped
  • Lecture handouts, excluding:

    • EL3 (examined only as part of ES1)
    • EL8, for which Hiyoshi (2026) (one of the PDFs) is a sufficient reference
    • EL10, which is not formally examined but may be useful for the project component of the course.

The written examination will assess your understanding of the main themes covered in the course. The purpose of the exam is not to test memorization of details, but your ability to explain concepts and reflect on methodological issues in health data research.

Structure of the exam

Section A – Conceptual understanding

In this section you will answer a number of short questions about key concepts from the course. Your answers should be brief (usually 1–3 sentences). The goal is to demonstrate that you understand the terminology and the purpose of the concepts discussed in the course.

Section B – Applied understanding

In this section you will be asked to explain or interpret practical situations related to health data analysis. Your answers should be somewhat longer (typically 4–6 sentences) and demonstrate that you understand how the concepts from the course apply in real research settings. The goal here is to show that you can connect course concepts to practical research situations.

Section C – Reflection and synthesis

The final question asks you to reflect on the broader research workflow in modern health data analysis. This part of the exam focuses on your ability to combine ideas from several parts of the course. A good answer should show that you can reason about how these elements interact in real research projects.

Podcast episode

This “podcast episode” has been generated by NotebookLM based on the lecture handouts and course literature. It might provide a helpful summary of the course.

References

Alharbi, Musaed Ali, Godfrey Isouard, and Barry Tolchard. 2021. “Historical Development of the Statistical Classification of Causes of Death and Diseases.” Edited by Rahman Shiri. Cogent Medicine 8 (1): 1893422. https://doi.org/10.1080/2331205X.2021.1893422.
Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Bindel, Lilly Josephine, and Roland Seifert. 2025. “Problems Associated with the ATC System of Drug Classification.” Naunyn-Schmiedeberg’s Archives of Pharmacology, December. https://doi.org/10.1007/s00210-025-04833-1.
Data Analysis Using Data.table. n.d. 2026. https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html.
Fenk, Simone Rahel, Kari Furu, and Inger Johanne Bakken. n.d. “Improve Data Management in Register-Based Research: Transition from CSV to Parquet.” https://doi.org/10.1101/2025.10.15.25337992.
Görman, Ulf. 2024. “Guide to the Ethical Review of Research on Humans.” Uppsala. https://etikprovningsmyndigheten.se/wp-content/uploads/2024/05/Guide-to-the-ethical-review_webb.pdf.
Hiyoshi, Ayako. 2026. “Overview of Swedish Register Data for Health Research.” Annals of Clinical Epidemiology advpub. https://doi.org/10.37737/ace.27005.
Kavianpour, Sanaz, James Sutherland, Esma Mansouri-Benssassi, Natalie Coull, and Emily Jefferson. 2022. “Next-Generation Capabilities in Trusted Research Environments: Interview Study.” Journal of Medical Internet Research 24 (9): e33720. https://doi.org/10.2196/33720.
Laugesen, Kristina, Jonas F. Ludvigsson, Morten Schmidt, Mika Gissler, Unnur Anna Valdimarsdottir, Astrid Lunde, and Henrik Toft Sørensen. 2021. “Nordic Health Registry-Based Research: A Review of Health Care Systems and Key Registries.” Clinical Epidemiology 13 (July): 533–54. https://doi.org/10.2147/CLEP.S314959.
Ludvigsson, Jonas F., Catarina Almqvist, Anna Karin Edstedt Bonamy, Rickard Ljung, Karl Michaëlsson, Martin Neovius, Olof Stephansson, and Weimin Ye. 2016. “Registers of the Swedish Total Population and Their Use in Medical Research.” European Journal of Epidemiology, 1–12. https://doi.org/10.1007/s10654-016-0117-y.
Ludvigsson, Jonas F., Petra Otterblad-Olausson, Birgitta U. Pettersson, and Anders Ekbom. 2009. “The Swedish Personal Identity Number: Possibilities and Pitfalls in Healthcare and Medical Research.” European Journal of Epidemiology 24 (11): 659–67. https://doi.org/10.1007/s10654-009-9350-y.
Nelson, Stuart J., Ying Yin, Eduardo A. Trujillo Rivera, Yijun Shao, Phillip Ma, Mark S. Tuttle, Jennifer Garvin, and Qing Zeng-Treitler. 2024. “Are ICD Codes Reliable for Observational Studies? Assessing Coding Consistency for Data Quality.” Digital Health 10 (September): 20552076241297056. https://doi.org/10.1177/20552076241297056.
Nguyen, Andrew. 2022. Hands-On Healthcare Data: Taming the Complexity of Real-World Data. First edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.
Oliveira Andrade, Rodrigo de. 2025. “Huge Reproducibility Project Fails to Validate Dozens of Biomedical Studies.” Nature 641 (8062): 293–94. https://doi.org/10.1038/d41586-025-01266-x.
Peng, Roger D., and Stephanie C. Hicks. 2021. “Reproducible Research: A Retrospective.” Annual Review of Public Health 42: 79–93. https://doi.org/10.1146/annurev-publhealth-012420-105110.
“Public Access and Secrecy | Swedish National Data Service.” 2025. https://snd.se/en/research-data-support/introduction-legal-aspects-research/public-access-and-secrecy.
Rodrigues, Bruno. 2023. “Building Reproducible Analytical Pipelines with R.” https://raps-with-r.dev/.
“The Unix Shell: Summary and Setup.” 2026. https://swcarpentry.github.io/shell-novice/.
Vukovic, Jakov, Damir Ivankovic, Claudia Habl, and Jelena Dimnjakovic. 2022. “Enablers and Barriers to the Secondary Use of Health Data in Europe: General Data Protection Regulation Perspective.” Archives of Public Health 80 (1): 115. https://doi.org/10.1186/s13690-022-00866-7.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. n.d. “R for Data Science (2e).” https://r4ds.hadley.nz/.