Handouts

Updated handouts

This document is generated automatically and contains all lecture slides.
When modifications are made to any of the lecture slides, this should be reflected here automatically.
If you are viewing the HTML version of the handouts, a static PDF might also be downloaded by clicking Other formats > Typst in the right side menu of the page.
If you are reading the PDF version of the document, the HTML version is availble here

Those handouts where last modified: 2026-03-20.

Separate handouts for each lecture

session	PDF	Word
EL1	PDF	DOCX
EL2	PDF	DOCX
EL3	PDF	DOCX
EL4	PDF	DOCX
EL5	PDF	DOCX
EL6	PDF	DOCX
EL7	PDF	DOCX
EL8	PDF	DOCX
EL9	PDF	DOCX
EL10	PDF	DOCX
EL11_raw	PDF	DOCX
EL11	PDF	DOCX

EL1: Intro

Lecture Slides

Associated litterature (References at the end)

(Nguyen 2022, ch. 1)
(Ludvigsson et al. 2009)
(Laugesen et al. 2021)

Important aspects

Data (where does is come from, what does it contain)
Ethics and legal (how to handle sensitive data, what laws and regulations apply)
Project management (how to plan and execute a data project, version control, reproducibility, R specific packages for efficient data handling)

Data – what is it?

EU Data Act | Article 2, Definitions:

For the purposes of this Regulation, the following definitions apply:

‘data’ means any digital representation of acts, facts or information and any compilation of such acts, facts or information, including in the form of sound, visual or audio-visual recording;

‘metadata’ means a structured description of the contents or the use of data facilitating the discovery or use of that data;

‘personal data’ means personal data as defined in Article 4, point (1), of Regulation (EU) 2016/679;

‘non-personal data’ means data other than personal data;

Course structure

Lectures on different data sources/registers
🧑‍💻 Exercises on data management and analysis
- R with some additional tools (Git, GitHub, targets, data.table)
A data project with written report and presentation
Final exam 😱
Instruction web page in addition to Canvas
Litterature: accessible through GU library (O’Reilly Learning for Higher Education) or otherwise shared (no need to purchase books)

Different types of data

🩻 Images
- Statistical image analysis
🧪 Lab samples
📝 Unstructured medical records
- Natural Language Processing
📺 Sensor data
- Time series (“big data”)
💻 EHRs (electronic health records)
- Structured but hierarchical rather than tabular
💻 Structured medical records
- tabular data

Usages

👩🏼‍🔬 Research
📒 Quality control/improvement
🔖 Administration/reporting
📰 News coverage
🧑‍💻 Building prediction models and tools

Register data

Three types of health care registers:

Administrative registers
Health care registers
Quality registers

💶 Administrative data

(As found in all types of registers)

Billing codes
- Direct (what something actually cost)
- Estimated (DRG codes for different types of procedures)
Claims data
- Primary for reimbursement (insurance company or other payer)
- Secondarily for Health economy/epidemiology
How to contact patients, health care providers etc
Dates and times for visits, procedures etc

🏥 Hospital background data

hospital characteristics
staffing
resources
geographical area
level of specialization
private, public

🩺 Clinical data

health care registers
- Mandatory (by law)
- eg: National patient register, cancer register (diagnoses)
quality registers
- Optional for health care providers
- (Mandatory within organizations joining)
- conditions (diabetes, cancer, etc)
- procedures (total hip arthroplasty)
- Diagnoses, treatments, health status, questionnaires (PROM/PREM)

🤓 Individual background data

socioeconomic data
education
income
occupation
family relations
migration status
Mortality data
- date of death
- cause of death

🗺️ Aggregated data

“Micro” vs. “macro” data.

population data
neighborhood characteristics
pollution
crime rates

Inclusion/exclusion criteria

👍 Defines the target study/register population
👎 Define exceptions to the general rules

🦿Simple example

“Every Swedish resident who had total hip arthroplasty performed in Sweden”

Include: all ages, all hospitals, all reasons for the prosthesis, all types of prosthesis
Exclude: Swedish residents with surgery performed in other countries. Non-Swedish residents with the procedure performed in Sweden.

🥴 Complicated example

The National Quality Register for Ovarian Cancer

Inclusion
1. Epithelial borderline tumours of the ovary
  - Topography code according to ICD-O/2: C56.9.
  - Morphology code according to ICD-O/2 ≥80103 and <85900.
  - Borderline tumours with 5th digit 3 in the morphology code according to ICD-O/2 and benign behaviour flag = 3.
2. Epithelial ovarian cancer:
  - Topography code according to ICD-O/2: C56.9.
  - Morphology code according to ICD-O/2 ≥80103 and <85900.
  - Malignant tumours with 5th digit 3 in the morphology code according to ICD-O/2 and benign behaviour flag blank.
3. Non-epithelial ovarian cancer:
  - Topography code according to ICD-O/2: C56.9.
  - Morphology code according to ICD-O/2 ≥85903 and <95900, with the exception of mesotheliomas with ICD-O/2 codes in the interval ≥90500 and <90600.
  - Malignant tumours with digit 3 as the fifth digit in the morphology code according to ICD-O/2.
  - Exception for granulosa cell tumours, where all cases with morphology codes according to ICD-O/2 in the interval ≥86200 and ≤86223 are included.
4. Malignant tumours of the fallopian tube:
  - Topography code according to ICD-O/2: C57.0.
  - Morphology code according to ICD-O/2 ≥80003 and <95900, with the exception of mesotheliomas with ICD-O/2 codes in the interval ≥90500 and <90600.
  - Malignant tumours with digit 3 as the fifth digit in the morphology code according to ICD-O/2.
Exclusion
- Epithelial ovarian cancer and borderline tumours of the ovary
  Cases with behavior codes 0, 1, 2, 6, or 9 as the fifth digit in the ICD-O/2 morphology code are excluded.
  Morphology codes according to ICD-O/2 <80103 and ≥85900 are excluded.
- Non-epithelial ovarian cancer
  Cases with digits 0, 1, 2, 6, or 9 as the fifth digit in the ICD-O/2 morphology code are excluded, with the exception of granulosa cell tumours, for which cases with ICD-O/2 morphology codes in the interval ≥86200 and ≤86223 are included even when the final digit is 0, 1, 2, or 3.
  Morphology codes according to ICD-O/2 <85903, as well as codes in the intervals ≥90500 and <90600 (mesotheliomas) and ≥95900, are excluded.
- Tumours of the fallopian tube
  Cases with behaviour codes 0, 1, 2, 6, or 9 as the fifth digit in the ICD-O/2 morphology code are excluded.
  Morphology codes according to ICD-O/2 in the intervals ≥90500 and <90600 (mesotheliomas) and ≥95900 are excluded.
- For all diagnoses, cases are excluded if the diagnosis is based solely on:
  - clinical examination (basis of diagnosis 1),
  - imaging procedures including radiography, scintigraphy, ultrasound, MRI, CT (or equivalent examinations) (basis of diagnosis 2),
  - autopsy with or without histopathological examination (basis of diagnosis 4 or 7),
  - surgery without histopathological examination (basis of diagnosis 6), or
  - other laboratory investigations (basis of diagnosis 8).
  - cases with age <18 years are excluded.

Coverage and completeness

🏥 Institutional coverage: proportion of all eligible units/clinics that are connected to the registry
- e.g., 90% of hospitals performing the procedure are connected
- Should be known by the “register holder”
🤒 Case coverage: proportion of patients who should have been reported from connected units that are actually included
- e.g., 85% of eligible patients registered
- The aim is to use 100 % but this is not always possible
Data completeness: proportion of required data fields that are filled in for the registered patients
- 🚬 e.g., 95% of patients have smoking status recorded
- 🩸 e.g., 80% of patients have blood pressure data available

What is recorded?

👩🏻‍⚖️ Some registers are mandated by law and regulations
Quality registers often have a steering committee and register holder
Reseasrh initiated databases according to specific protocols

Data linking

Unique personal identifier
- Not in every country!
- Social security number similar purpose but not as widely used
study specific id number
HSA (“Hälso- och sjukvårdens adressregister” for staff and organizations)

Unique personal identifier

(Swedish: personnummer, reading: (Ludvigsson et al. 2009))

121212-1212 Tolvan Tolvansson

10 (or 12) digits
date of birth-4 digits
assigned at birth or immigration
used in all health care contacts
used for all administrative data
sometimes reused after death
sometimes changed (uncommon)
sometimes inclusion criteria for register
similar in the Nordic countries
- Denmark: CPR number
- Norway: Fødselsnummer
- Finland: Henkilötunnus
- Iceland: Kennitala

Combining data

Similar registries in different areas/regions/countries
- Different individuals but similar data
Same definitions and variables?
Same inclusion criteria?
Don’t get fooled by similar names!
Differences and similarities within the Nordic countries (Laugesen et al. 2021)

Working with health care data

A lot to do before the statistical analysis!

Legalities
- Do I have the right to access this data?
- What am I allowed to do?
- What am I not allowed to do?
Data management
- large datasets
- multiple datasets
- different formats
- missing data
- data cleaning
- data transformation
- data wrangling
- data munging
- data governance
- data engineering
Planning
- What is the purpose?
- How can I achieve my goals?
- What if I change my plans later?
- Can I redo my analysis?
- How do I present/communicate my results?

R as a tool but …

Large files often comes exported from SAS (initially “Statistical Analysis System”)
Comma-Separated Values (csv) or text files
Application Programming Interface (API) calls
Structured Query Language (SQL) databases
Hierarchical data structures (eXtensible Markup Language, XML; JavaScript Object Notation, JSON, …)

Our use of R

{data.table} to handle large data sets efficiently
{targets} to streamline a reproducible pipeline
Git for version control
GitHub for collaboration
Quarto for reporting

EL2: European legislation

Lecture Slides

Associated litterature (References at the end)

(Vukovic et al. 2022)
(Nguyen 2022, ch. 4)

Legal part

Today: European legislation (mainly GDPR!)
- Lecture handouts main source for examination (seminar ES1 and possibly DISA exam).
- Article (Vukovic et al. 2022). Focus on the introduction, the section “GDPR‐related enablers and barriers to cross‐country health data exchange in Europe” (in the results section incl. figures and tables), discussion and conclusions.
Next lecture: Swedish legislation (including associated reading). Examined as part of seminar ES1 (not the DISA exam).
Consequence: (Nguyen 2022, ch. 4). Read the beginning. Skip “Medical Information Mart for Intensive Care”. Read the “Synthea” section. The “Synthea” section does not have a legal focus but the legal parts explains why we use this data. Read for your own understanding (not examined).

European legislation

GDPR (our focus)
- Defines the legal conditions for processing personal data
- Focuses on protection, safeguards, and accountability
European Health Data Space (EHDS)
- Establishes a European framework for access to health data for research, statistics, and policy.
- Focuses on data access, governance, and interoperability
- Increases opportunities for cross-national health statistics
EU Data Act
- Regulates who may access data and under what conditions, across sectors.
- Indirectly relevant for health statistics through device-generated and digital service data.

Important

Legal and governance frameworks enable access to data, but statistical expertise remains essential for ensuring data quality, valid inference, and meaningful interpretation.

EU law vs Swedish law

EU legislation tends to be more detailed in the legal text itself
This is because EU law must be:
- applied uniformly across many different legal systems
- interpreted without relying on national preparatory works
Interpretation of EU law relies mainly on:
- the wording of the legislation
- recitals (non-binding explanations before the articles describing the purpose and context).
Swedish legislation is often:
- shorter and less detailed in the statutory text
- supplemented by extensive preparatory works (förarbeten)
In Sweden, preparatory works are a central interpretative source for courts and authorities

➡️ The difference reflects different legislative techniques, not necessarily a difference in regulatory ambition.

Source

The GDPR is available in all official EU languages via EUR-Lex. Take a quick look to get a very brief overview. However, it is recommended reading only if you suffer from insomnia — it is not required for fulfilling the course requirements!

European Union (EU)

GDPR applies directly and uniformly as law
No national implementation required
Member States may:
- introduce supplementary legislation
- allow legal exceptions, e.g. for:
  - research
  - public interest
  - health data

European Economic Area (EEA)

Countries: Norway, Iceland and Liechtenstein

GDPR applies via the EEA Agreement
Implemented into national law
In practice:
- very similar application as within the EU
- same core principles, rights, and obligations

United Kingdom (UK)

EU GDPR no longer applies directly after Brexit
Replaced by:
- UK GDPR
- Data Protection Act 2018

Switzerland

Not part of EU or EEA
GDPR does not apply as law
Instead: Federal Act on Data Protection (FADP)
- Revised to align closely with GDPR

International laws

Note that other countries have different laws and regulations
In USA, for example, HIPAA regulates the use and disclosure of protected health information (PHI)
- Different states have different laws as well
When collaborating internationally, compliance with all relevant laws is required

Definitions

GDPR article 4:

Personal data: means any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
Processing: means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;
Pseudonymisation: means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;
Controller: means the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data […]
Processor: means a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller;
Consent of the data subject: means any freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her;
Personal data breach: means a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorized disclosure of, or access to, personal data transmitted, stored or otherwise processed;
Data concerning health: means personal data related to the physical or mental health of a natural person, including the provision of health care services, which reveal information about his or her health status;

Legal grounds for processing personal data:

GDPR article 6 (1):

Processing shall be lawful only if and to the extent that at least one of the following applies:

the data subject has given consent to the processing of his or her personal data for one or more specific purposes; ~~processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract;~~
~~processing is necessary for compliance with a legal obligation to which the controller is subject;~~
~~processing is necessary in order to protect the vital interests of the data subject or of another natural person;~~
processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller;
processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.
~~processing is necessary for compliance with a legal obligation to which the controller is subject under Union or Member State law requiring the processing of personal data for a specific purpose.~~

Legal ground (d) is the most relevant if you work with secondary data in the public sector (research and reporting etc). (a) is relevant to collect primary data for research etc. (e) is a delicate one …

Processing of special categories of personal data

GDPR Article 9 (1):

Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation shall be prohibited. 😱

But …

Paragraph 1 shall not apply if one of the following applies:

1. the data subject has given explicit consent to the processing of those personal data for one or more specified purposes […]
1. processing is necessary for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health or ensuring high standards of quality and safety of health care and of medicinal products or medical devices […]
1. processing is necessary for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes (🥳) in accordance with Article 89(1) based on Union or Member State law which shall be proportionate to the aim pursued, respect the essence of the right to data protection and provide for suitable and specific measures to safeguard the fundamental rights and the interests of the data subject.

Safeguards

GDPR article 89:

Safeguards and derogations relating to processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes

Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards, in accordance with this Regulation, for the rights and freedoms of the data subject. Those safeguards shall ensure that technical and organisational measures are in place in particular in order to ensure respect for the principle of data minimisation. Those measures may include pseudonymisation provided that those purposes can be fulfilled in that manner. Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner.
Where personal data are processed for scientific or historical research purposes or statistical purposes, Union or Member State law may provide for derogations […] so far as such rights are likely to render impossible or seriously impair the achievement of the specific purposes, and such derogations are necessary for the fulfilment of those purposes.

Technical Safeguards

Examples of technical safeguards include:

Pseudonymisation
Encryption of personal data
Access controls and authentication
Logging and monitoring of access
Secure storage and transmission
Use of secure software environments often enforced by organizational standards

Organisational Safeguards

Examples of organizational safeguards include:

Defined roles and responsibilities
Internal policies and procedures
Staff training and confidentiality obligations
Data protection by design and by default
Incident and breach response procedures
Documentation and accountability measures
Working for a health care organization might require an agreement of secrecy

Data Minimisation and Purpose Limitation

Only data that are necessary should be processed
Data are processed only for specified purposes
Access is limited to authorized personnel
Retention periods are defined and respected

Pseudonymisation and Anonymisation

Pseudonymisation reduces risks while allowing reuse of the data
- Identifiers are kept separately and protected
Anonymisation removes data from GDPR scope (if irreversible)

Important

Pseudonymisation $\neq$ Anonymisation
Removing the Swedish personal identification number (PIN) is not a guarantee for pseudonymisation

Data Controller

The entity that determines the purposes and means of the processing of personal data
Bears the primary legal responsibility
Responsible for:
- Lawful basis
- Compliance with GDPR principles
- Transparency and information to data subjects
- Appropriate technical and organizational measures
Typical examples:
- Public authorities
- Universities
- Regions and municipalities

Controller–Processor Relationship

A formal Data Processing Agreement (DPA) is required
The agreement must specify:
- Subject matter and duration
- Nature and purpose of processing
- Types of personal data
- Categories of data subjects (patients, students, citizens, …)
- Security measures
The controller remains responsible even when processing is outsourced

Example: Sahlgrenska

If a researcher work at the Sahlgrenska university hospital, VGR might be the data controller (personuppgiftsansvarig; PUA)
If he/she asks for statistical consulting from the Sahlgrenska Academy, GU might be the data processor (personuppgiftsbiträde; PUB)

European Health Data Space (EHDS)

What is it?

EHDS is an EU-wide legal and technical framework for the use and sharing of health data
It aims to:
- improve access to health data across borders
- support healthcare, research, statistics, and policy-making
Focuses on data access and governance

Two Main Pillars

Primary use of health data
- Use of data for individual patient care
- Cross-border access to electronic health records
Secondary use of health data for
- statistics
- scientific research
- public health
- policy evaluation and innovation

Implementation Timeline

2025: EHDS regulation enters into force
2025–2027: Development of implementing and technical acts
From ~2029 onward:
- national infrastructures become operational
- cross-border access for secondary use starts to function in practice

Why EHDS Matters for Statisticians

EHDS explicitly recognizes statistics as a legitimate purpose
It facilitates access to:
- large-scale health datasets
- cross-national data sources
It increases demand for:
- data quality assessment
- metadata interpretation
- harmonization and comparability analyses

EHDS Does Not Do This

EHDS does not:
- define statistical methods
- ensure data quality automatically
- guarantee comparability across countries
Legal and technical access ≠ valid statistical inference

EU Data Act

What Is It?

The Data Act is an EU regulation on access to and sharing of data
Focuses mainly on:
- data generated by connected products and digital services (IoT)
- business-to-business (B2B) and business-to-government (B2G) data sharing
It is not a data protection regulation

➡️ The Data Act is about who may access data and under what conditions.

How the Data Act Relates to Health Data

The Data Act does not primarily target health registers
However, it may affect:
- data generated by medical devices
- digital health services
- health-related IoT data

➡️ Health data may fall under the Data Act depending on how it is generated.

EL3: Swedish legislation

Lecture Slides

Associated litterature (References at the end)

(“Public Access and Secrecy | Swedish National Data Service” 2025)
(Görman 2024)

Literature

Examination

The Swedish legal system will only be examined during the seminar ES1 (not in the DISA exam). You are allowed to use the sources during the seminar (no need to memorize individual laws and paragraphs by names and numbers). Nevertheless, read the literature before the seminar!

Lecture handouts
Web link (“Public Access and Secrecy | Swedish National Data Service” 2025)
Report (Görman 2024). Most important: p. 26-32, 42-45, 49 and 75-90.
Swedish reading students might enjoy lagen.nu for source material (however, this is not required for the course).
- A translation using Google translate might be used for non Swedish reading students. However, please note that legal texts are formally valid only in their original language.

Swedish legal system

Backgrund to Swedish law

Fundamental laws (“grundlagar”) decided by the parliament but stable over time
- Asort of codified constitution distributed in four parts
- The Freedom of the Press Act (Tryckfrihetsförordningen) of relevance
Ordinary laws (parliament)
- New ones all the time (previously twice a year)
- New laws to update existing laws
Published in Svensk författningsamling, SFS
- Digitally since 2018-04-01 (previously printed)
- The “big blue book” is only a smaller collection of important laws
ordinances (“förordning”) from the government to implement laws
regulations (“föreskrifter”) from authorities to implement laws and ordinances

Two main branches of law

Civil and criminal law:
- state what you can not do (everything else is “legal”)
- Handled by ordinary courts
- Ex: Brottsbalken kap 20 om tjänstefel m.m. (The Swedish Penal Code (Brottsbalken), Chapter 20 — Offences Relating to Public Office.)
Public law: relationship between individuals and the state etc
- state what the public authorities must and can do (everything else is “illegal”)
- Handled by administrative courts
- What we mostly care of here

The principle of freedom vs. the principle of legality

Private actors are generally free unless restricted by law, while public authorities require explicit legal authority to act.

Fundamental law

The Freedom of the Press Act (TF)

(🇸🇪: Tryckfrihetsförordningen)

World’s oldest freedom of the press law (since 1766)
Chapter 2: Public access to official documents
Applies to public authorities and institutions
- Including health care registers and medical records held by public authorities

In order to promote a free exchange of opinion, comprehensive and pluralistic information, and free artistic creation, everyone shall have the right of access to official documents. [TF 2.1]

But there are exceptions (TF 2.2): - e.g., if disclosure would violate privacy or national security - if so, the government has the right to provide ordinary laws that restrict access (which they do!)

Confidentiality

The Public Access to Information and Secrecy Act (OSL)

(🇸🇪: Offentlighets- och sekretesslagen)

Law that regulates public access to official documents and confidentiality
Applies to public authorities and institutions
Defines what information is considered confidential and under what circumstances

OSL Chap 21

Confidentiality for private individuals’ personal circumstances no mater the context

E.g., health data, economic circumstances, family relations

OFS 21.1: Secrecy applies to information concerning an individual’s health or sexual life, such as information about illnesses, substance abuse, sexual orientation, gender reassignment, sexual offenses, or other similar information, if it can be assumed that disclosure of the information would cause significant harm to the individual or to someone closely related to them.

OSL Chap 24

Secrecy for the protection of individuals in research and statistics.

A few special research databases etc
Some regulations for research ethics boards

OSL Chapter 25

Secrecy for the protection of individuals in activities relating to health and medical care etc.

OFS 25.1: Within the health and medical care services, secrecy applies to information concerning an individual’s state of health or other personal circumstances, unless it is clear that the information may be disclosed without causing harm to the individual or to someone closely related to them. The same applies to other medical activities, such as forensic medical and forensic psychiatric examinations, insemination, in vitro fertilization, abortion, sterilization, circumcision, and measures to prevent communicable diseases.

Exceptions exists,
- for example to submit medical patient data to quality registers
- to share data between public organizations for research purposes or statistics (OFS 25.11 p. 5).

OSL Chapter 10

Provisions on disclosure overriding secrecy and provisions on exemptions from secrecy

OFS 10.28: Secrecy does not prevent information from being disclosed to another authority where a duty to provide information follows from an act or an ordinance.

This would apply to data sharing for research purposes when there is a legal basis for that

Health care data

The Patient Data Act (PDL)

(🇸🇪: Patientdatalagen)

regulates the processing of personal data within health and medical care in Sweden.
Applies to healthcare providers (public and private).
Main objectives:
- Protect patient privacy
- Ensure safe and effective healthcare
- Enable secondary use of health data under strict conditions

Chapter 7 PDL

National and regional quality registers

Opt-out for patients (every one is included by default until they opt out)

PDL 7.4: Personal data in national and regional quality registers may be processed for the purpose of systematically and continuously developing and ensuring the quality of healthcare.

PDL 7.5: Personal data processed for the purposes set out in Section 4 may also be processed for the purposes of

the production of statistics,
estimating numbers for the planning of clinical research,
research within health and medical care,
disclosure to a party that will use the data for purposes referred to in Sections 1 and 3 or in Section 4, and
…

The Health Data Registers Act

(🇸🇪: Lag om hälsodataregister [SFS 1998:543])

This law regulates health data registers outside the health and medical care system. A new law is being proposed to replace this one.

§ 1: A central administrative authority within the health care sector may carry out automated processing of personal data in health data registers. The central administrative authority that carries out the processing of personal data is the controller.

§ 3: Personal data in a health data register may be processed for for the following purposes:

the production of statistics,
follow-up, evaluation and quality assurance of health and medical care, and
research and epidemiological studies

Specific registers

Register (Swedish)	Register (English)	Governing act / ordinance
Folkbokföringen	Population Register	Population Registration Act (1991:481); Population Registration Ordinance (1991:749)
Totalbefolkningsregistret (RTB)	Total Population Register	Official Statistics Act (2001:99); Official Statistics Ordinance (2001:100)
Nationella patientregistret	National Patient Register	Health Data Act (1998:543); Ordinance on the National Patient Register (2001:707)
Cancerregistret	Swedish Cancer Register	Health Data Act (1998:543); Cancer Register Ordinance (2001:709)
Dödsorsaksregistret	Cause of Death Register	Health Data Act (1998:543); Cause of Death Register Ordinance (2001:709)
Läkemedelsregistret	Prescribed Drug Register	Act on the Prescribed Drug Register (2005:258); Ordinance (2005:363)
Medicinska födelseregistret	Medical Birth Register	Health Data Act (1998:543); Medical Birth Register Ordinance (2001:708)
Tandhälsoregistret	Dental Health Register	Health Data Act (1998:543); Dental Health Register Ordinance (2008:194)

Other legislation

The Archives Data Act (ADL)

(🇸🇪 Arkivdatalagen)

Regulates the management of public records and archives
Applies to public authorities and institutions
Differnt authorities then have different rules for how long data must be kept
- For example research data is often required to be kept for at least 10 (or 25) years

Statistics and research?

A statistical purpose refers to the production of aggregated information describing groups or populations (e.g. summary tables or prevalence estimates), and does not include analyses or decisions concerning identifiable individuals (e.g. individual predictions or case assessments).
- Does not require particular statistical methods etc
Research refers to systematic activities aimed at generating new, generalisable knowledge, and excludes activities focused on individual decisions, control, or routine administration.

The Ethical Review Act (EPL)

(🇸🇪: Etikprövningslagen)

Regulates ethical review of research involving humans (including their data!)
- Had received some criticism and might be revised
Applies to research conducted in Sweden
Requires ethical review and approval by the Swedish Ethical Review Authority
Aims to protect the rights, safety, and well-being of research participants
Based on the Declaration of Helsinki and other international ethical guidelines
One application for each new research project
- Ammendments for changes in already approved projects
Application fees applies

Access to data

Public, non-sensitive individual information

Certain individual data are public by default (e.g. declared income, address).
Such information may be accessed upon request from authorities like the Swedish Tax Agency, unless specific secrecy provisions apply.
Might still not be used for research without ethical review

Aggregated data (including health data)

Aggregated information that cannot be linked to identifiable individuals may often be disclosed.
Aggregated health statistics produced through “automated processes” might be disclosed upon request.

Individual-level health data for statistical purposes

Access to identifiable health data is possible within authorities conducting statistical activities.
This typically requires that the data are used solely for statistical purposes,

Individual-level data via register-holding authorities

Identifiable data may be accessed by staff or contractors working on behalf of the authority responsible for the register.

Individual-level data for research

Access to identifiable personal or health data for research purposes generally requires:
- approval under the Ethical Review Act,
- a lawful basis under GDPR,
- and a disclosure decision under OSL by the data-holding authority.
Data are typically provided under strict conditions (e.g. pseudonymisation, secure environments).

EL4: Tooling

Lecture Slides

Associated litterature (References at the end)

(“The Unix Shell: Summary and Setup” 2026)
(Rodrigues 2023, ch. 4)

Reading and practicing

(“The Unix Shell: Summary and Setup” 2026) introduces the Unix shell and the file system, which are essential parts for using git. Read and practice according to the instructions. (You may use the Terminal tab in RStudio or similarly in Posotron). Only the first sections are required:
(Rodrigues 2023, ch. 4) introduces the basics of Git for R users and applies to all common operating systems. Read and practice by following the instructions.

Recommended references:

A bit old but still relevant: Happy Git with R
Staples (2023) illustrates the constant evolution in technology. As a professional, you should not expect that what you learn in this course will stay relevant for ever. This article is a good illustration of how the linguistic use of R has changed in less than a decade. It is also a practical example of how you can use GitHub in a slightly different way and it touches briefly on {data.table}, which we will introduce later. It also illustrates that a lot of R code is not written for statistical analysis (the last and final step) but for data management. The article also mentions some statistical techniques which you will meet in later courses (ignore the details for now).

Motivation

It is widely acknowledged that the most fundamental developments in statistics in the past 60 years are driven by information technology (IT). We should not underestimate the importance of pen and paper as a form of IT but it is since people start using computers to do statistical analysis that we really changed the role statistics plays in our research as well as normal life.

Although: “Let’s not kid ourselves: the most widely used piece of software for statistics is Excel.”” /Brian Ripley (2002)

Short overview

Panta rei!

We teach you the present
But your work is in the future
We might use history to predict the future?
At least we should learn that things changes constantly!

Terminal

Don’t rely too heavily on your computers graphical user interface (GUI)
You know how to use the R console!
To use a terminal/Unix shell/Bash… is similar
To navigate the file system is basic practice
A Unix Terminal is included if you are using Mac (and Linux)
Emulators are common for Windows (one is included in Git, use that one after installation)

Early computer languages

ALGOL / ALGOL 60
- Algorithmic Language
- Influential in academic/scientific computing
- Introduced structured programming concepts that shaped later languages
PL/I
- Used in some government and industrial contexts
- Combined scientific and business computing features (Fortran + Cobol)

General-Purpose Programming Languages

Early statistical computing relied heavily on:

FORTRAN
- Dominant language for numerical and statistical computation
- Statistical methods implemented as libraries and subroutines
- Still used as numerical back-end (e.g., BLAS/LAPACK) in modern statistical software
C
- Emerged in the 1970s as a systems programming language
- Widely used for implementing statistical software infrastructure
- Core language of many modern statistical environments (e.g., R, parts of SAS, Stata)
- Often used to interface with high-performance numerical libraries

📌 These languages required substantial programming expertise.

FORTAN and C

FORTAN is still used in R for subroutines such as least squares. Even in modern packages!

Same for C, such as for lm, which is popular for high performance computing (fast execution time).

Early Statistical Packages

Several dedicated statistical systems emerged:

SPSS (1968)
- Originally batch-oriented
- Widely used in social sciences
- Originaly “Statistical Package for the Social Sciences”
BMDP (1960s)
- Bio-Medical Data Package
- Developed at UCLA
- Common in medical statistics
GENSTAT (1968)
- Focused on agricultural statistics
MINITAB (1972)
- Designed for teaching and education
- Still popular in quality control

SAS: A Transitional System

SAS (early 1970s)
Developed for agricultural and biostatistical analysis
Script-based, but largely batch-oriented
Became a standard in:
- Government agencies
- Large organizations
Known for strong data management capabilities
Still widely used in pharmaceutical industry

📌 SAS predates S and influenced later statistical workflows.

SAS data

Still common to receive data in SAS-format (data_file.sas7bdat) from register holders!
Sometimes necessary to use SAS to export the data (genAI might be a good aid in those cases)

Limitations of Pre-S Systems

Common limitations included:

Batch processing rather than interactivity
- But we still sometimes need batch processing for large scale projects!
- Rscript script.R
Separation of data management and analysis
Limited graphics capabilities
High barriers to exploratory data analysis

S

S takes form at Bell Laboratories (interactive statistical computing).
John Chambers leads the effort.
1976: first working version of S runs on GCOS
1979: S2 is ported to UNIX; UNIX becomes the primary platform
1980: S is first distributed outside Bell Labs
1981: source versions are made available
1984: key S books published (often called the “Brown Book” era)

The New S Language

1988: “New S” is released (major language redesign)
1988: S-PLUS is first released as a commercial implementation of S
- Last seen in Tibco Spotfire 2012
1991: Statistical Models in S (“White Book”) popularizes formula notation (the ~ operator), data frames, and modeling workflows

R

1993: first versions of R is published (Auckland; Ross Ihaka & Robert Gentleman)
1995: R becomes open source (GPL)
1997: the R Core group forms; CRAN is founded (Kurt Hornik)
2000: R 1.0.0 is released 2000-02-29

RStudio brings an IDE to the R community

2009: RStudio (the company) is founded
2011: RStudio IDE is introduced as an open-source IDE for R (desktop + server)

Microsoft (Revolution Analytics)

Jan 2015: Microsoft announces it will acquire Revolution Analytics
Microsoft promotes enterprise R offerings (e.g., Microsoft R Open / R Server)
2016: SQL Server 2016 introduces R Services (in-database R)
2017: Microsoft expands the stack under “Machine Learning Server” branding
June 2021: Microsoft announces retirement of Microsoft Machine Learning Server
2023: Microsoft in still a relevant player
- platinum member of The R consorsium
- Owner of GitHub
- VS Code

RStudio becomes Posit

July 27, 2022: RStudio rebrands as Posit
July 28, 2022: Quarto is announced as a next-generation scientific and technical publishing system (multi-language, multi-engine)
A public benefit company (not only relevant for its shareholders)
Also a platinum member of The R consortium # Modern IDE

Positron

New generation IDE for data science (and of course statistics!)
From Posit PBC
Free for individual use
Based on Code OSS (open source version of VS Code from Microsoft)
For both R and Python (Julia?)

Quick tour

Positron assistant (mentioned in the video)

This feature will most likely be disabled in any secure working environment. Such environments often have strict rules about data privacy and security, which may conflict with the assistant’s functionality. Health data in SENSITIVE and SECURE environments must not be shared with external services, including AI assistants, to comply with data protection regulations and institutional policies.

It is recommended to not rely on such tools during the course (even if all our data is synthetic). If you start to rely on such tools, you might get difficulties the day you work with real data (might lead to prosecution for “brott mot tystnadsplikten” which is not only public, but actually civil law (“Brottsbalken”) with prison sentence as a possibility). Society put an extreme emphasis on protecting health data, and rightfully so!

We will use Positron in this course
This is for you to learn and see an alternative IDE …
… and to get used to the fact that you should never stop learning!
RStudio, however, is still a great tool and it is still developed and maintained.
You may use both in the future (maybe Positron on your computer but RStudio in a secure server, which tend to be more slowly updated)

Some differences

No package installer (yet). You need to use pak::pkg_install() or install.packages() etc.
No inline rendering of results in Quarto documents (yet)
You can use multiple active R sessions at once
Great integrated tools from VS Code and extensions, such as GitHub integration

Version control

Before Version Control Systems

1950s–1970s: Early software development relied on:

Manual file naming:
- analysis_final.f
- analysis_final_v2.f
Physical media:
- Punch cards
- Magnetic tape
Centralised mainframes

📌 No automated tracking of changes.

Floppy discs, Mail, and Shared Directories

1970s–1980s: Common practices included:

Copying files to:
- Floppy disks
- Magnetic tapes (actually still used for backups)
Sending media by postal mail
Sharing files via:
- Network drives
- FTP servers

📌 Version control was social, not technical.

Early Version Control Systems

1980s: First-generation tools focused on single files:

SCCS (Source Code Control System, 1972)
RCS (Revision Control System, 1982)

Characteristics:

Versioning per file
Linear history
Central storage

Centralised Version Control

1990s: Project-level systems emerge:

CVS (Concurrent Versions System)
Subversion (SVN)

Key features:

Central repository
Multiple users
Check-in / check-out model

📌 Still required constant access to the central server.

Limitations of Centralised Systems

Common problems:

Single point of failure
Poor support for branching and merging
Difficult offline work
Slow operations on large repositories

These limitations became critical for large projects.

Git

Git was created by Linus Torvalds in 2005
Original motivation:
- Support Linux kernel development
- Replace proprietary tools

Design principles:

Distributed architecture
Fast local operations
Strong support for branching and merging

Terminal

Windows: the Git installer comes with a terminal
Linux/Unix (incl. Mac): use your terminal/Warp
Terminal pane in RStudio/Positron

GUI

The basics

Four short videos to watch

Cheat-sheet

Some interactive learning tools:

Some basics

You initiate a folder as a git project
Git will track all changes made in that folder
- New files
- Modified files (especially text, such as programming scripts/functions etc)
- Deleted files

Branching

Distributed Version Control with Git

Key ideas in Git:

Every clone is a full repository (“folder”)
Local commits without network access
Cheap and fast branches
Cryptographic integrity (hash-based)

📌 Collaboration becomes more flexible and robust.

Hosting Platforms

Platforms built around Git:

GitHub (2008)
Bitbucket (2008)
GitLab (2011)
Gitea - open source alternative to GitHub (get the same functionallity locally or on a server)

They add:

Pull requests / merge requests (collaboration tools)
Issue tracking (discussions and bug reports etc)
Code review
CI/CD integration (advanced)

Issue tracking

Issue tracking is not part of Git but is implemented in most (if not all) hosting platforms.
Used for bug reports and discussions between developers and users
Issues can be closed when fixed/adressed but are still found in the history
Each issue gets a number (in order) and those can be referenced in commit messages etc (ex: Fix #37, which will automatically close the issue)

Version Control Today

Modern usage includes:

Code
Documentation
Data analysis (scripts, notebooks)
Configuration and infrastructure

Git is integrated into:

IDEs (RStudio, VS Code, Positron)
CI/CD pipelines
Cloud platforms

Version Control Beyond Code

Today, version control supports:

Reproducible research
Collaborative writing
Data science workflows
Teaching and learning

📌 Version control is now a core professional skill.

WARNING!

Not everything should be shared!
Scripts and documentation yes!
But Health data is sensitive!
- Do not share it!
- Avoid unintentional sharing!
- Private repositories are still shared with hosting provider!
Avoid explicit file paths and sensitive info in scripts!
- Can give information about data location and internal structure!
No hard-coded passwords or API keys!

Git basics

(After installing the Git software)

Collect all files related to a project in a folder
Initialize a git repository in that folder
Make changes to files
Stage changes for commit
Commit changes with a message
Possibly push commits to remote repository

cd path/to/your/project
git init
git status
## make changes to files
git add filename1 filename2
git commit -m "Descriptive message about changes"
git remote add origin

Video tutorials

The video below is a good start to understand the basic concepts of Git and GitHub (and there are others to be found on YouTube).

Inspirational video

Watch this video even though some parts might be overwhelming. It gives a good overview of the current state (2025), even though many things will be too advanced for this course (it is not specifically aimed for statisticians or R users).

.gitignore

The .gitignore file is very important in settings with health data! Pay close attention to this section of the video!

Git in Positron

Remember that Positron is build on Code OSS (which shares a lot of features with VS Code).
Branching and merging is possible but we will not cover that here.

Short official introduction from Microsoft:

More detailed introductions. Watch both! The first one is based on a Windows version of VS code and the second on Mac but the concepts are the same:

Overwhelmed?

This video includes some parts which might be overwhelming if you are new to Git and GitHub. Don’t worry! You don’t need to understand everything right away. Just try to follow along with the basic concepts and steps. You will get more comfortable with practice.

Statistics projects

Not a single R script

Real projects are more complex than a single R script!
Multiple scripts
Multiple data files
Documentation
Reports
Version control
Reproducible workflows

Project structure

Common file structures

Help you organize your thoughts
Help others to collaborate
simplifies paths used in your code

Example structure

/.../my_project/
  ├── README.md        - project documentation
  ├── TODO             - what should be done next?
  ├── .git             - handled by git (hidden folder)
  ├── .gitignore       - used by git but your responsibility!
  ├── data/            - your data files (not under version control!)
      ├── cancer.csv 
      └── patients.qs 
  ├── R/               - your saved R functions
      ├── function1.R 
      └── function2.R
  ├── reports/
  └── _targets.R       - targets pipeline script

`README.md`

Document the purpose of the project
What is it about?
What is the aim?
Who to contact for questions?
In what circumstances was it created?

Markdown format (simple text with some possible formatting)

`data` folder

Store your data files here as they are when you get them
Avoid any modifications to the raw files!
It is very easy to forget what you do if it can not be traced by code
Do NOT include this folder in version control!
Git is not good at handling large files
Sensitive data should not be shared!
Add data/* to your .gitignore file
In realistic projects, data might come in varying formats
- csv, txt, xlsx, sas7bdat, sav, dta, etc
- some files might be very big (gigabytes not uncommon)

`R` folder

Store your R functions here
Document their purpose inline!
Helps you to reuse code
Easier to read main scripts
Easier to test and debug code
Easier to share code between projects

`reports` folder

Store your reports here
Quarto documents
Documemnt your analysis
Include figures and tables
Share with external collaborators

Computer practice

In ECS1 we will use Positron and git/GitHub in action!

Also see the “Reading and practicing” section above for a more in-depth introduction (homework).

What you should know

Reflect on the use of different software and how the rapid development in this field interplay with other important aspects of our field
Be able to describe the principles of basic git commands (init, add, stage, commit, push, pull) and what they are used for (may be theoretical questions in the written exam)
Use Git and GitHub in practice (but you can choose to do it either by commands or the GUI), this will be assessed in computer exercises and a later project.
Similarily, you need to organize your projects according to best practice (but we will be the focus of EL5).

EL5: Reproducibility

Lecture Slides

Associated litterature (References at the end)

(Nguyen 2022, ch. 2)
(Baker 2016)
(Oliveira Andrade 2025)
(Kavianpour et al. 2022)
(Peng and Hicks 2021)

Reading

Baker (2016) and Oliveira Andrade (2025) motivates why reproducible analysis is important!
Peng and Hicks (2021) introduce and elaborate more on the subject.
Nguyen (2022) only read section “Basic Introduction to Docker and Containers” in chapter 2 (you do not need to install Docker for this course, just be aware of the concept)
Kavianpour et al. (2022) introduces Trusted Research Environment as well as some perceived challenges with those (we will not use such environments in the course but this is likely the reality you need to cope with later on in your career, so you should get familiar with the concept).

Consider

After reading papers like Baker (2016) and Oliveira Andrade (2025) , one might be tempted to conclude that nothing is trustworthy — not even our own analyses. If every result comes with assumptions, uncertainty, and potential bias, what does trust really mean? Statisticians live in a world of probabilities, but most people prefer certainty.
Reproducibility might sound like a good thing but how much does the technical details matter if we are not allowed to freely share the underlying health care data anyway?

For reference:

Targets overview
Target manual
You read briefly about Docker in the mandatory reading. If this seems interesting, you might read: An Introduction to Rocker: Docker Containers for R (not mandatory).

Practice

A workshop with 7 parts was given at the R Medicine conference 2025. Practice according to part 1-3 (part 4-7 are slightly outside the scope of the course and only recommended if you have additional time and interest).

Clone this GitHub repo. Exercies are found in the code folder.

Follow the recording from the workshop while you are practicing: The power of {targets} package for reproducible data science

Notes on the recording

The full video is close to three hours long, but this includes breaks as well as the more advanced topics that are not mandatory (only the first 92 minutes, which includes several 10 min breaks is mandatory). Nevertheless, it is recommended to watch the full video (just watching part 4-7 for inspirational purposes without doing the exercises and without any expectation to fully grasp all the details).
RStudio is used in the recording. Please try to use Positron; however, feel free to revert to RStudio this time if following the instructions in one application while working in another feels overwhelming.
You may recognize some of the material from Part 3, as presented during the lecture

This practice is homework (no dedicated computer session is targeting targets but you should use at least some components of it in a later project, so you should practice the basics now)!

Motivational quotes

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. / Buckheit & Donoho

[It is important to have] a clear understanding of how data analysis was conducted when critical life and death decisions must be made. / Peng et al. 2021

Reproducible analysis

The definition of reproducible research generally consists of the following elements.

A published data analysis is reproducible if the analytic data sets and the computer code used to create the data analysis are made available to others for independent study and analysis

This is rather vague …

Why is this a problem?

Many, even simple, results depend on a precise record of procedures and processing and the order in which they occur.
Many statistical algorithms have many tuning parameters that can dramatically change the results if they are not inputted exactly the same way as was done by the original investigators.
If any of these myriad details are not properly recorded in the code, then there is a significant possibility that the results produced by others will differ from the original investigators’ results.

Recap folder structure

/.../my_project/
  ├── README.md        - project documentation
  ├── TODO             - what should be done next?
  ├── .git             - handled by git (hidden folder)
  ├── .gitignore       - used by git but your responsibility!
  ├── data/            - your data files (not under version control!)
      ├── cancer.csv 
      └── patients.qs 
  ├── R/               - your saved R functions
      ├── function1.R 
      └── function2.R
  ├── reports/
  └── _targets.R       - targets pipeline script

Pipeline

A pipeline is a computational workflow that does statistics, analytics, or data science.
A pipeline contains tasks to prepare datasets, run models, and summarize results for a business deliverable or research paper.
Pipeline tools coordinate the pieces of computationally demanding analysis projects.

Pipe operator

The simplest example of pipelines might be when you combine R code by the pipe operator (|> or %>%), which in turn is inspired by the pipe operator in Unix (|).

Long running processes

Imaigine you have nice R-scripts to take care of every step in your project
You have either one very very … very … long file or a bunch of files which you execute in order
It takes two weeks to run all scripts (not unrealistic for a big project!)
You report the result to your collaborators
They ask you to update a tiny detail somewhere in the middle of your workflow
You need to rerun everything (while screaming in frustration)!
This will repeat again and again until you finally
1. Lose your mind and quit your job
2. Adapt a more reproducible workflow

Steps-wise procedure

The default setting in R is to save the workspace to disk when you end the session
This is usually not recommended and therefore often disabled in for example RStudio
- (If you re-start your session at a later point you might have no idea how the restored objects relates to each other.)
You may still want to execute things in a more step-wise fashion
1. Convert big data files (.sas7bdat from SAS etc) to something more R-friendly
2. Perform some data management (select relevant columns and rows), cleaning, add calculated variables etc
3. Perform some exploratory data analysis (EDA) to get a better understanding of the data
4. Perform some statistical analysis
5. Report and visualize the results

Caching

If you perform each step in different script files, you might start each file with some input and end it with some output
The output from the previous script is used as input in the next script
save() and load() are standard but can be very slow for large objects
saveRDS() and loadRDS() use serialization and are nowadays much more efficient
qs2::qs_save() and qs2::qs_read() (or qs2::qd_save() and qs2::qd_read()) from the qs2 package are currently “state-of-the-art” (but things tend to change quickly)

Benchmarking

Single-threaded

Algorithm	Compression	Save Time (s)	Read Time (s)
qs2	7.96	13.4	50.4
qdata	8.45	10.5	34.8
base::serialize	1.1	8.87	51.4
saveRDS	8.68	107	63.7
fst	2.59	5.09	46.3
parquet	8.29	20.3	38.4
qs (legacy)	7.97	9.13	48.1

Multi-threaded (8 threads)

Algorithm	Compression	Save Time (s)	Read Time (s)
qs2	7.96	3.79	48.1
qdata	8.45	1.98	33.1
fst	2.59	5.05	46.6
parquet	8.29	20.2	37.0
qs (legacy)	7.97	3.21	52.0

qs2, qdata and qs with compress_level = 3
parquet via the arrow package using zstd compression_level = 3
base::serialize with ascii = FALSE and xdr = FALSE

Datasets used

1000 genomes non-coding VCF 1000 genomes non-coding variants (2743 MB)
B-cell data B-cell mouse data, Greiff 2017 (1057 MB)
IP location IPV4 range data with location information (198 MB)
Netflix movie ratings Netflix ML prediction dataset (571 MB)

These datasets are openly licensed and represent a combination of numeric and text data across multiple domains. See inst/analysis/datasets.R on Github.

Who cares?

Imagine you work with data for the whole Swedish population (> 10 M people)
You have medical prescription data for everyone and lets say on average 20 prescriptions (since 2015) per person => 10 * 20 = 200 M rows of data
You are working in a secure environment were data is stored on a network server (no SSD drive)
The same environment is shared by 20 other researchers and you are all competing for the same bandwidth
It will get increasingly frustrating just to load and save the big files
You might need to wait for half an hour before you can start working 😩

Automated process

Instead of relying on individual R scripts, use a reproducible pipeline.
This is common practice in software development
- For example GNU Make is a build automation tool that automatically determines which parts of a program need to be rebuilt and executes the necessary commands based on rules defined in a Makefile.
Iteration during project development (which is equally relevant for a health data project), will be much faster and reliable.
Caching as described above is automated and you only need to read/write files to disk/network storage when actually needed.

Alternative solutions

GNU Make with R
rmake creates and maintains a build process for complex analytic tasks. The package allows easy generation of a Makefile for the (GNU) ‘make’ tool
drake was the first more established R-oriented alternative
targets is the currently most developed and used alternative
rixpress might be a rising star (at least if you are combining R and Python; polyglot)?

We will use targets in this course!

targets

The {targets} package is a Make-like pipeline tool for statistics and data science in R.
The package skips costly runtime for tasks that are already up to date, orchestrates the necessary computation with implicit parallel computing, and abstracts files as R objects.
If all the current output matches the current upstream code and data, then the whole pipeline is up to date, and the results are more trustworthy than otherwise.
The tasks themselves are called “targets”, and they run the functions and return R objects.
The targets package orchestrates the targets and stores the output objects to make your pipeline efficient, painless (well … hopefully relatively so …), and reproducible.

Example

Data: survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey.

Question: Is there any association between Gender and BMI?

## pak::pkg_install("NHANES")
str(NHANES::NHANES)

R script

A normal R script might look like:

## pak::pkg_install("NHANES")
library(tidyverse)

df <- as_tibble(NHANES::NHANES)
tbl <- count(df, Gender)
tst <- t.test(BMI ~ Gender, df)
gg <- ggplot(df, aes(BMI, color = Gender)) + geom_density()

You must run the script in order (but might not always do that during the initial “trial and error” process … if so, confusion will later arise)!
objects df, tbl and tst only lives in the active session
- No problem in this example
- but working with big and complex data might introduce long-running processes => time consuming!

Results

#| echo: false
gg
tst

Figure and R output

You should see a figure and some R output on this slide. I just notice it doesn’t show on my phone when viewing the slides, so if you don’t see it, you might try another browser etc (it seems to render correct in the handouts however).

Pipeline

Define steps in your analysis as targets
Define dependencies between targets
Automatically track changes and rerun only necessary parts

Why?

You do not want to rerun everything all the time!
You want to keep track of what you have done
You want to be able to reproduce your results later
You want to share your workflow with others

target

Reformulate from ordinary object assignment:

df <- as_tibble(NHANES::NHANES)

to a target

library(targets)
tar_target(df, as_tibble(NHANES::NHANES))

Put all targets in a list

#| results: "hide"
list(
  tar_target(df, as_tibble(NHANES::NHANES)),
  tar_target(tbl, count(df, Species)),
  tar_target(tst, t.test(BMI ~ Gender, df)),
  tar_target(gg, {
    ggplot(df, aes(BMI, color = Gender)) + geom_density()
  })
)

`_targets.R` file

Put that list into a file _targets.R together with some additional code:

#| results: "hide"
library(targets)
tar_source() # Source all scripts with functions stored in R folder
tar_option_set(
  # Specify all needed packages here
  # NOTE! They will not be available in the interactive R-session
  # so you might load them above as well to avoid confusion
  packages = c("tidyverse"),
  # Format used for caching (qs which is actually qs2 is best for big data)
  format = "qs",
  # Additional settings ...
  seed = 123
)

list(
  tar_target(df, as_tibble(NHANES::NHANES)),
  tar_target(tbl, count(df, Species)),
  tar_target(tst, t.test(BMI ~ Gender, df)),
  tar_target(gg, {
    ggplot(df, aes(BMI, color = Gender)) + geom_density()
  })
)

Benefits

execute the whole project by tar_make()
intermediate steps are cached and cached objects retrieved by tar_load() or tar_read()
dependency among individual steps can be visualized: tar_visnetwork()

Dependency graph

Live show

#| eval: false
cd
git clone https://github.com/STA220/targets.git
cd targets
positron .

Or perform those steps from within Positron.

R-versions

Imagine you finish your research project and you submit a manuscript to a journal
The journal comes back one year later (they didn’t have time until know) and one reviewer ask you to double check some steps of the analysis
You make some modifications and rerun your pipelines.
But it doesn’t work!
Half a year ago you updated your R installation and a week ago all of your installed packages.
- One package was retired from CRAN (could happen even though there was nothing wrong with the package when you used it … maybe the maintainer just didn’t have time or energy to keep maintaining it)
- One package has deprecated one of the functions you previously used
- On package has redesigned the output format from a specific function you used
😱

Pros and cons

It is always a good idea to be able to reproduce exactly what you did!
But if you never update you might miss essential bug fixes!
If your results change, how can you know which version was the most accurate?

Best practice (I guess) is to rerun the analysis both with the frozen versions and with the most up-to-date versions of the packages … because you have all the time in the world and nothing more important to do … right … 😂

Different operating systems

R should behave similar on different operating systems.
Exceptions nevertheless exist!
- parallel::mclapply() behaves differently on Windows compared to Unix-like systems because Windows does not support forked processes.
- sort(c("ä", "a", "z")) might differ due to locale settings
- Earlier versions of R relied on system-specific native encoding (notably on Windows), whereas modern versions (R ≥ 4.2) use UTF-8 as the default across platforms.

Combining all?

you could very well use both targets + renv + git + GitHub etc
At some point you might start to wonder whether you are a medical statistician or a software developer … and the more complex/esoteric tools you use, the more difficult it might be to collaborate with others
Try to find some middle ground … but always be open to new ideas!

TRE

Historically: centralized data on mainframe computer
More recently: working with data locally (desktop/laptop)
- Full volume encryption (BitLocker on Windows, FileVault on Mac etc)
- Possible to work off-line (trains, flights etc)
- Backups! Backups!!! BACKUPS!!!
- Usually only password protected and no logging of of activities
- Sometimes admin user (sometimes not)
Nowadays: (Different names, same concept) Secure Research Envoronemnt (SRE), Trusted Research Envoronment (TRE), Secure Data Environment (SDE), Data Clean Room, Data Safe Havens, …

In practice

Connect through VPN
2-factor authentication (BankID common in Sweden)
Often by Remote Desktop (to access a Windows virtual desktop)
Sometimes a containerized Linux setup for R/RStudio etc accessed through a web browser (from within the virtual Windows machine)
Limited internet access from within the environment (maybe possible to install CRAN-packages tied to a certain historical snapshot but not the latest released/developed versions).
Data import and exports goes through administrator
Pros: Secure, centralized backups, possibility to scale/share large computing resources (CPU, RAM, GPU etc), less dependent on individual computer (laptop), might aid collaboration
Cons: needs internet connection, restrictions on what software (packages) to use, difficult to export results and/or import script files etc, mentally exhausting to use for example a Mac Keyboard when working in Windows, can be expensive (pay per clock-cycle of CPU usage etc instead of simply purchasing a computer once).

EL6: Data formats

Lecture Slides

Associated litterature (References at the end)

(Nguyen 2022, ch. 2)
(Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 21)
(Fenk, Furu, and Bakken, n.d.)
(Data Analysis Using Data.table, n.d.)

Reading

(Nguyen 2022, ch. 2) on Databases: it is important to understand some basic concepts. You might, however, ignore sections about “(Hyper)graph databases”. Try to understand the SQL code even though you might not need to write such code on your own.
Wickham, Çetinkaya-Rundel, and Grolemund (n.d.) chap 21 describes how to connect to a database from R.
Fenk, Furu, and Bakken (n.d.) describe why the current common practice with register data delivery in SAS format should be replaced by parquet files. Until this has happened, we often need to use SAS for data extraction.
Data Analysis Using Data.table (n.d.) introduces the data.table syntax, its general form, how to subset rows, select and compute on columns, and perform aggregations by group.

Recommended references:

Arel-Bundock (2025) provides a good comparison between base R, dplyr and data.table.

Optional practice

We won’t focus on SQL in this course but it might still be good to know some of the basics. Two useful resources for your own practicing (if you like):

Imaginaed scenario

You have requested register data from Statistics Sweden (SCB) and The National Board of Health and Welfare (NBHW; SoS)
Aftar a couple of months (it previously took a year), you receive some huge files with file ending .sas7bdat
What do you do now?

.sas7bdat

Many govermental agencies are using SAS
Their standard format when delivering big data sets
SAS is great for handling big data sets (no need to read everything into memory)
But you don’t know how to use SAS 😩

First try

It is often possible to read medium-sized SAS files to R by haven::read_sas()
Based on the ReadStat C library
Based on reverse-engineering of binary files!
You need to put some trust that those wizards where good “hackers” (your analyzis and future patients lifes may depend on it) :-)
But similar trust always applies when you use open source software, which depend on other peoples software, which depends on …
But, yes, it is legal according to EU legislation :-)

Directive 2009/24/EC art 6 The authorisation of the rightholder shall not be required where reproduction of the code and translation of its form within the meaning of points (a) and (b) of Article 4(1) are indispensable to obtain the information necessary to achieve the interoperability of an independently created computer program with other programs […]

But let’s say this did not work for your data! 😩

Large SAS files

Best practice for current SAS version (9.4) is to export only nessecary data to csv
- SAS Viya (another solution not likely replacing SAS 9.4) can export parquet files
You need a SAS license (expensive) as well as some SAS syntax for this
- But you might ask GenAI to generate such code for you
The exported csv file can later be read into R

SAS in the course?

We are not using SAS in the course (and it is no longer available for students to download … it’s to expensive even for a large university …). Hence, this is just as a future advice if you will work in an organisation with a SAS license (it is still available for empoyers within GU for a monthly cost).

Databases

a database is a collection of “tables” (tabular data frames)
Differences between data frames and database tables:
- Database tables are stored on disk and can be arbitrarily large.
- Data frames are stored in memory
- Database tables almost always have indexes; makes it possible to quickly find rows of interest
  - Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.
- No fixed row order in a table

Row vs column oriented

Most classical databases are optimized for rapidly collecting data, not analyzing existing data.
These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R.
More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.

DBMS

Databases are run by database management systems (DBMS’s for short), which come in three basic forms

Client-server: run on a powerful central server, which you connect to from your computer (the client). For example PostgreSQL, MariaDB, SQL Server, and Oracle.
Cloud: like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.
In-process: like SQLite or duckdb, run entirely on your computer.

Relevance

Client-server: Those are sometimes used in our field but if so, you will hopefully get some initial help from the IT department who will provide acess details etc. Such access is then integrated into R and Positron (as well as RStudio). If thi is relevant, you can most likely use tools such as dbplyr to query the data without to much use of SQL itself.
Cloud: Commercial cloud products are less likely to be used for sensitive health data within our field.
In-process: They’re great for working with large datasets locally where you (the statistician) is the primary user. SQLite was a good tool in the past (it might still be but …) nowadays DuckDB is the most prominent implementation!

SQL

Different database systems may use different query languages.
The Structured Query Language (SQL) is the most widely used for relational databases.
Defined by international standards (ANSI/ISO).

What is SQL used for?

SQL is used to:

Retrieve data (SELECT)
Filter data (WHERE)
Aggregate data (GROUP BY)
Combine tables (JOIN)
Modify data (INSERT, UPDATE, DELETE)

SQL Dialects

SQL is a standard, but implementations differ slightly.
These variations are called dialects.
- T-SQL (Microsoft SQL Server)
- PL/SQL (Oracle)
- PostgreSQL SQL dialect
- MySQL SQL dialect

Most core functionality is similar across systems.

A Simple Example

SELECT name, age
FROM students
WHERE age >= 18
ORDER BY age;

This query:

Selects two columns
Filters rows
Sorts the result

Relevance?

I don’t think SQL is a necessary skill to master for most statisticians
You should nevertheless be aware of its existence and understand simple code examples as above
If an employer requires SQL skills, you can probably learn enough if/when needed
Nevertheless, the more languages/tools etc you know, the more competitive you become (but don’t sacrifice statistical knowledge for technical.

DuckDB

Free and open source (backed by foundation)
Column based
Very easy to set up
Very efficient!
SQL is an option as API language but not a requirement
Integration by DBMS and “hidden” SQL under the hood: dbplyr
A more “native” implementation by duckplyr
- Better (but perhaps not as widely used yet and therefore a bit less mature/stable)
Can be used both with disk data and data in RAM

Example

## pak::pak("tidyverse/duckplyr")
library(duckplyr)
conflicted::conflicts_prefer(dplyr::filter)
## Extra tools to connect to internet data
db_exec("INSTALL httpfs")
db_exec("LOAD httpfs")

## Use some online data which is to big to keep in memory
year <- 2022:2024 # therefore, select just 3 years
base_url <- "https://blobs.duckdb.org/flight-data-partitioned/"
files <- paste0("Year=", year, "/data_0.parquet")
urls <- paste0(base_url, files)


## Connect to the data (without downloading)
flights <- read_parquet_duckdb(urls)
## nrow(flights) # It's not in memory!
count(flights, Year) # processing outside memory

Complex queries can be executed on the remote data. Note how only the relevant columns are fetched and the 2024 data isn’t even touched, as it’s not needed for the result.

#| cache: true
out <-
  flights |>
  mutate(InFlightDelay = ArrDelay - DepDelay) |>
  summarize(
    .by = c(Year, Month),
    MeanInFlightDelay = mean(InFlightDelay, na.rm = TRUE),
    MedianInFlightDelay = median(InFlightDelay, na.rm = TRUE),
  ) |>
  filter(Year < 2024)

out |>
  explain()

DuckDB and Data Files

DuckDB can query files directly:

Parquet
CSV
JSON

No data import (to your computers Random Access Memory [RAM]) is required!

Example

Let’s say you used SAS to export your data to a csv file
DuckDB can query the relevant data from that file directly and collect only the data you actually need!
(If you knew you only needed i smaller data set you might have performed similar work already in SAS but you often query the data multiple times for different purposes so it still good to have a readable version of everything)

## This is not writing from SAS but to illustrate the point :-)
csv_file <- tempfile(fileext = ".csv")
write.csv(NHANES::NHANES, csv_file)

read_csv_duckdb(csv_file) |>
  filter(Age >= 18) |>
  summarise(BMI_mean = mean(as.numeric(BMI), na.rm = TRUE), .by = Gender)

Parquet

From Apache Arrow
Open format
flat files with possible hierarchical structure by transparent structure of folders and filenames
very fast to read and write

Example

So we might work with CSV + DuckDB but it is even more efficient if we could convert the CSV to a parquet file
The solution below reads the CSV data and then exports it to a new parquet file
This means that all data must still fit into memory
- If it doesn’t, do it chunk-wise (piece by piece, 1 M rows at the time or similar)

#| cache: true
## install.packages("arrow")
library(arrow)
parquet_file <- tempfile(fileext = ".parquet")
read_csv_arrow(csv_file) |>
  write_parquet(parquet_file)

## Now
read_parquet_duckdb(parquet_file)

tidy data

duckplyr let’s you use the familiar dplyr/tidyverse syntax
If certain steps in your pipeline are not compatible with DuckDB, data might be collected into memory and the pipeline contonous nevertheless
After filteringen/aggregation etc you can always choose to collect() the data and use your ordinary workflow from there

Data files

.qs2: A fast, compressed R-specific binary format used for caching by the {targets} package.
Data must be fully read from disk into memory (RAM), typically via targets::tar_read().
.sas7bdat: A proprietary binary format used by SAS and commonly encountered in register data deliveries.
It can be sometimes be imported into R (haven::read_sas(), which also allow selection of specific columns at import).
The dataset is read into memory.
.csv (Comma-Separated Values): A plain-text format following a loosely defined standard.
It is human-readable but inefficient for large datasets.
Tools such as DuckDB (via {duckplyr}) can read only the required columns and rows and perform filtering and aggregation during import.
.parquet (Apache Parquet): A columnar, compressed, and standardized binary format designed for efficient analytics.
It supports selective column access and integrates well with modern analytical engines such as DuckDB and Arrow.

Archiving

Swedish law say that we must archive the data (within the public sector).
General practice is to store every “official document” forever … which is a long time :-)
Usualy conceptualized as 500 years.
Retention (dokumenthanteringsplan/gallringsplan) usually limit the time needed to store source material for statistcal reporting/research etc
- Varies between organisations but usually 10 years (25 for some research, which is also regulated by the EU)
Can you read a 25 year old data file?
- Yes, if storeed as an inefficient CSV-file …
- Maybe not if using another less human-readable format …
The work of a professional archivist might contain responsibilities for data conversion from time to time (CD-ROMS or floppy disks etc might otherwise not be accasible)
It seems, however, less common that organizational standardd etc imply that statisticians are limited to certain formats (it is more often up to you and your collegues)

Faster in memory?

DuckDB and {duckplyr} makes it possible to work with data which does not fit into memory
But what if data would fit (which is more common)?
- You can use tibbles and dplyr etc as usual …
  - … but it might be slooooooooooooow … for big enough data
  - (Although dplyr also gets some speed improvements from time to time)
{data.table} is much more efficient!

data.table

An R package (not a data storage format or a database system)
Comparable to working with data.frame in base R or tibble in the tidyverse
Sometimes described as a Domain Specific Language (DSL) within R
Introduced in 2006 (mature, stable, and widely used — older than the tidyverse)
Provides faster alternatives to several base R functions: fifelse(), fcase(), forder(), fread(), fcoalesce(),
as.IDate(), %chin%
Core components implemented in C for performance
Uses concepts similar to databases:
- Integer-based keys and indices
- Efficient joins and grouping
Reference semantics
Very flexible and compact syntax
Strongly appreciated by many statisticians —
unknown to some — and slightly intimidating to others 🙂

Copy-on-modify

Assume df is a large data.frame/tibble with many rows and columns, one of them called old.
Let f() be some function.
What happens when you add a new column using
df$new <- f(df$old) or df <- mutate(df, new = f(old))?
Even though you reuse the name df, R typically creates a modified copy of the object.
In R, objects use copy-on-modify semantics:
- When an object is changed, a new copy is created (if needed).
- The name df is just a reference to an object in memory.
- After reassignment, df refers to the new object.
The previous object may remain in memory temporarily until garbage collection runs.
For very large objects, this can result in substantial temporary memory use

Reference semantics

In data.table, reference semantics are used instead.
You can modify the existing object directly: df[, new := f(old)]
Or delete columns in place: df[, old := NULL]
No full copy is made
There is still only one object in memory (and it is still called df).

Example

#| echo: true
library(data.table)
dt <- NHANES::NHANES
setDT(dt)
setkey(dt, ID) # Use the ID column as index
## How many adults by gender do we have
dt[Age >= 18, .N, Gender]
## Mean BMI by gender
dt[Age >= 18, mean(BMI, na.rm = TRUE), Gender]
## Convert all factors to characters
dt[, names(.SD) := lapply(.SD, as.character), .SDcols = is.factor]

joins

x[i, on, nomatch]
| |  |   |
| |  |   \__ If NULL only returns rows linked in x and i tables
| |  \____ a character vector or list defining match logic
| \_____ primary data.table, list or data.frame
\____ secondary data.table

Join type	data.table	SQL	dplyr
Right join	`DT1[DT2]`	RIGHT JOIN	`right_join(DT2, DT1)`
Left join	`DT2[DT1]`	LEFT JOIN	`left_join(DT1, DT2)`
Inner join	`DT1[DT2, nomatch = 0]`	INNER JOIN	`inner_join(DT1, DT2)`
Full join	`merge(DT1, DT2, all = TRUE)`	FULL OUTER JOIN	`full_join(DT1, DT2)`
Anti join	`DT1[!DT2]`	NOT EXISTS	`anti_join(DT1, DT2)`
Semi join	`DT1[DT2, nomatch = 0][, unique(.SD)]`	EXISTS	`semi_join(DT1, DT2)`
Rolling join	`DT1[DT2, on = "date", roll = TRUE]`	—	`join_by()` + `rolling`
Non-equi join	`DT1[DT2, on = .(x >= lower, x <= upper)]`	—	`join_by(x >= lower, x <= upper)`
Update join	`DT1[DT2, value := i.value]`	UPDATE … JOIN	`DT1 %>% left_join(DT2) %>% mutate(...)`

External API

We previously assumed you get some data delivery from a register holder
Or mayby you work within the organization where data is collected, and therefore have access to the original data base.
Some data can also be openly accessed by an Programming Application Interface (API)
The PX-WEB API is used by a large number of statistical authorities (and others) world-widea to provide access to aggregated data
The {pxweb} R package simplifies the process

Available resources

#| echo: false
pxweb::pxweb_api_catalogue() |>
  purrr::map_chr("description") |>
  tibble::enframe() |>
  knitr::kable()

Example

The first time you need the data, use pxweb_interactive() from your console.
It will guide you through all the necessary steps.
It will then provide you the necessary R code to replicate the query

library(pxweb)
library(data.table)
library(ggplot2)

url <- "https://api.scb.se/OV0104/v1/doris/sv/ssd/BE/BE0101/BE0101A/BefolkningR1860N"

## PXWEB query
pxweb_query_list <-
  list(
    "Alder" = "*",
    "Kon" = c("1", "2"),
    "ContentsCode" = c("0000053A"),
    "Tid" = as.character(1864:2024)
  )

px_data <-
  pxweb_get(
    url = url,
    query = pxweb_query_list
  )

## A data.table with population numbers
bef <-
  px_data |>
  as.data.frame() |>
  setDT()

## Some data cleaning
setnames(
  bef,
  c("ålder", "kön", "år", "Antal"),
  c("age", "sex", "year", "N")
)
bef[, `:=`(
  age = as.numeric(gsub(" år", "", age)),
  sex = factor(sex, c("män", "kvinnor"), c("males", "females")),
  year = as.integer(year)
)]
bef <- bef[!is.na(age)] # Remove totals

## Aggregate for age groups
bef[,
  age_group := cut(
    age,
    c(-Inf, 17, 66, Inf),
    c("children", "adults", "elderly")
  )
]
bef_ag <- bef[, .(N = sum(N)), .(age_group, sex, year)]

## Visualize
gg <- bef_ag |>
  ggplot(aes(year, N, color = age_group, linetype = sex)) +
  geom_line() +
  theme(legend.position = "bottom") +
  scale_y_continuous(
    labels = scales::label_number(scale = 1e-6, suffix = "M")
  )

toi #| echo: false gg

## Hierachical data

- We like tabular data! - If we get wide data we can transform it to long data (and vice versa) - But we might sometimes encounter more hierarchical data structures - JSON (JavaScript Object Notation) most popular - XML is another format (used for example by the IRS/Skatteverket sometimes)

txgnjs { "system": "ICD-10", "chapter": { "code": "Chapter I", "title": "Certain infectious and parasitic diseases", "range": "A00–B99", "blocks": [ { "code": "A00–A09", "title": "Intestinal infectious diseases", "categories": [ { "code": "A00", "title": "Cholera", "includes": ["Vibrio cholerae infection"], "excludes": ["Carrier state"], "subcategories": [ { "code": "A00.0", "title": "Cholera due to Vibrio cholerae 01, biovar cholerae" }, { "code": "A00.1", "title": "Cholera due to Vibrio cholerae 01, biovar eltor" } ] } ] } ] } }

## Example

toi #| cache: true ## db_exec("INSTALL json") db_exec("LOAD json") url <- "https://raw.githubusercontent.com/LuChang-CS/icd_hierarchical_structure/main/ICD-10-CM/diagnosis_codes.json" (icd <- duckplyr::read_json_duckdb(url))

Flattening

So you now have a duckplyr data frame. One of its columns is nested and, with additional data frames as children) If we unnest the children for chapter 1, we see that those children also have children … and so it continous. To get a simple translation between individual codes and their description, you might need to define some recursive function to unnest until you have found the last children in line.

#| cache: true
(chap1 <- icd |> slice(1) |> collect())
chap1 |>
  select(children) |>
  tidyr::unnest(children)

EL7: Medical coding

Lecture Slides

Associated litterature (References at the end)

(Nguyen 2022, ch. 3)
(Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 15)
(Bindel and Seifert 2025)
(Alharbi, Isouard, and Tolchard 2021)
(Nelson et al. 2024)

Reading

(Nguyen 2022, ch. 3) on standardized vocabularies. You may skip the sections “CPT”, “LOINC”, “RxNorm” and “Using the Unified Medical Language System” (not examined within the course). Even if you skip those section, remember to read the conclusions in the end of the chapter!
Alharbi, Isouard, and Tolchard (2021) provides an historical expose of the development of medical coding, with focus on the International Classificatin of Diseases (ICD).
Nelson et al. (2024) argue (based on a statistical analysis) that we should not put to much trust in the coded data (you may skip the methods section).
Bindel and Seifert (2025) introduces the Anatomical Therapeutic Chemical (ATC) and some associated problems. Focus on the introduction and conclusion sections (results and discussion may be skipped).
(Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 15): Using {stringr} for regular expressions in R including exercises

Recommended references:

History of the Statistical Classification of Diseases and Causes of Death

Recommended practice:

Regular expressins: But please not that this is general practice and deviations may exist between those exercises and R.

Overview

Standardized vocabularies, controlled vocabularies, terminologies and ontologies …
This is a field of its own (health informatics)
Let’s just call it “medical coding” for now.

Relevance

Imaganine you are diagnosed with “cancer” (hope not …)
Your doctor writes that you have “kräfta” in your medical records
- “kräfta” (Swedish) = cancer (latin), although the astrological sign “cancer” is a “crab” (latin does not distinguish the two)
She might as well write:
- The patient was diagnosed with a malignant neoplasm of the colon.
- Histology confirms invasive adenocarcinoma.
- Evidence of metastatic disease to the liver.
Natural languages (English/Swedish/Latin etc) are not well suited for statistical analysis
- Natural language processing (NLP) is nice but outside the scope of the course
Statisticians need clear definitions of diagnoses, procedures, medications etc.
Therefore, such information is encoded in a standardized way

Granularity/reliability

Cancer might be coded by an ICD-10 code (International Classificatin of Diseases v. 10) as “C” (or possibly “D”)
Cancer, however, is a very general term. Is it lung cancer, brain cancer, skin cancer etc (those are very different)
The more we learn about a diseases, the more granularity we expect from the coding
The coding systems therefore tend to be quite complex, evolve over time and often have regional differences
Even though the intention of the coding system might be granular and precise, the data quality often relies on different coding practices in different hospitals etc.
The codes might also be misused for re-imbursment practices
- There was a regional scandal in Western Sweden not so many years ago (you may read about int in Swedish)
In practice, the medical doctor might dictate a diagnosis, which then needs to be translated to a code by administrative staff

Example

The Swedish Hip Arthroplasty Register identified that one hospital appeared to have an unusually high number of patients recorded with severe respiratory problems
At first glance, this raised a clinical question: could hip problems somehow lead to serious breathing problems?
However, hip surgery is often performed under general anesthesia. During general anesthesia, patients are intubated and mechanically ventilated, which involves procedures related to the respiratory system.
It was eventually discovered that a procedural code related to anesthesia and airway management had been incorrectly registered as a severe respiratory diagnosis.
The apparent “complication” was therefore not a real clinical problem, but a coding error.

Lesson: Register data reflect coding practices. Without understanding how variables are defined and recorded, one may draw incorrect conclusions.

ICD – International Classification of Diseases

Maintained by the World Health Organization (WHO)
Global standard for coding diseases and causes of death
Used for:
- Clinical documentation
- Mortality statistics
- Epidemiological research
- Health system planning and monitoring

Historical Background

First version: 1893 (International List of Causes of Death)
WHO assumed responsibility in 1948 (ICD-6)
Major revisions approximately every 10–20 years
Each revision reflects:
- Advances in medical knowledge
- Changes in disease concepts
- Administrative and reporting needs

ICD has evolved from a mortality list to a comprehensive disease classification.

Major ICD Versions

ICD-7 (used in many countries in the 1950s–1970s)
- Still used in the Swedish cancer register for backward compability
ICD-8 (used in many countries in the 1960s–1980s)
ICD-9 (widely used until the early 2000s)
- Also still used in the Swedish cancer register
ICD-10 (introduced in the 1990s; still dominant in many countries)
- From 1997 in Sweden. What we currently most care about
ICD-11 (adopted in 2019; gradually being implemented)

Different countries adopted versions at different times, creating challenges for international comparisons.

National Modifications

Several countries use national adaptations:

ICD-10-CM (USA; Clinical Modification)
ICD-10-CA (Canada)
ICD-10-SE (Sweden)
- A fifth position (ignoring the dot) sometimes used for more granularity
- ICD-10: S72.0 Fracture of neck of femur
- ICD-10-SE: S72.00 Fracture of neck of femur, closed; S72.01 Fracture of neck of femur, open; S72.10 Pertrochanteric fracture, closed; S72.11 Pertrochanteric fracture, open, …

Feature	WHO ICD-10	ICD-10-SE (Sweden)	ICD-10-CM (USA)
Maintained by	WHO	Swedish National Board of Health and Welfare (Socialstyrelsen)	U.S. National Center for Health Statistics (NCHS)
Primary purpose	Global disease classification	National clinical and statistical reporting	Clinical documentation and reimbursement
Level of detail	Moderate	More detailed than WHO ICD-10	Much more detailed than WHO ICD-10
Additional digits	Typically 3–4 characters	Often includes 5th character extensions	Up to 7 characters
Laterality (right/left)	Usually not specified	Limited	Frequently specified
Encounter type (initial, follow-up, sequela)	Not included	Not included	Explicitly coded
Administrative focus	Epidemiology and mortality statistics	Clinical and national register reporting	Strongly tied to billing and reimbursement
International comparability	High (reference standard)	High within Nordic context, requires mapping internationally	Requires crosswalk to WHO ICD-10 for comparison

Structure of ICD-10

Typical format:

One letter (chapter)
Two digits (category)
Optional dot
additional digit(s) (subcategory)

Example:

I21 – Acute myocardial infarction (AMI)
I21.0 – AMI of anterior wall
I21.9 – AMI, unspecified

Hierarchy

Chapter → Block → Category (3-digit) → Subcategory (4-digit+)

Researchers must decide:

Analyse at 3-digit level?
Or at more detailed subcategory level?

There is a trade-off between specificity and statistical power.

What Does an ICD Code Represent?

An ICD code reflects:

Clinical documentation
Coding rules
Administrative structure
Local practice

It does not necessarily reflect:

Biological mechanism
Diagnostic certainty
Uniform clinical interpretation

Changes Over Time

Between ICD versions:

Codes may be split (1-to-many) into more detailed categories (common)
Codes may be merged (many-to-one, although uncommon)
Codes may move between chapters (affects the aggregated chapter counts/incidence/prevalence)
Definitions may change

Example: A condition classified under one chapter in ICD-9 may appear elsewhere in ICD-10.

Implication: Observed changes in incidence may reflect coding changes rather than true epidemiology.

Crosswalks Between Versions

When analyzing long time series:

Mapping tables (“crosswalks”) are often used
Mapping may be:
- One-to-one
- One-to-many
- Many-to-one

Crosswalks are rarely exact. Information loss or ambiguity is common.

Aggregation to broader diagnostic groups is often necessary.

Crosswalk Patterns (WHO ICD-9 → WHO ICD-10)

When mapping between ICD versions, different structural relationships may occur.

Mapping type	ICD-9 (WHO)	ICD-10 (WHO)	Interpretation
One → Many (Split)	250 – Diabetes mellitus	E10–E14	One broad ICD-9 category split into multiple etiological types
Many → One (Merge)	038 – Septicaemia; 790.7 – Bacteraemia	A41 – Other sepsis	Separate ICD-9 concepts consolidated into broader ICD-10 category
Many ↔︎ Many (Reorganisation)	296 – Affective psychoses; 300 – Neurotic disorders	F30–F39 (Mood disorders); F40–F48 (Neurotic disorders)	Structural reorganisation and conceptual reclassification across chapters
Chapter relocation	011 – Pulmonary tuberculosis	A15 – Respiratory tuberculosis	Infectious diseases reorganised under new chapter structure

Crosswalk before or after

The Swedish caner register has an internal “crosswalk” applied uniformly to the register itself
- Updated regularly, 2026 version with 360 pages (in Swedish)
Other registers typically only records the current version in use
If so, you might perform the crosswalk yourself after receiving the data
- Applies if you want to look at longer time trends etc or combine data from different periods

ICD-O

ICD-O (International Classification of Diseases for Oncology):

Used mainly in cancer registries
Combines:
- Topography (tumour site)
- Morphology (histology and behaviour)
Current commonly used version: - ICD-O-3
Earlier versions: ICD-O, ICD-O-2
ICD-O is more detailed than ICD-10 for cancer incidence studies.

Relationship Between ICD-10 and ICD-O

ICD-10 commonly used for mortality and hospital discharge diagnoses
ICD-O primarily used for cancer registry incidence data

A cancer case may have:

An ICD-10 code in hospital data
An ICD-O morphology and topography code in a cancer registry

Researchers must understand which system underlies their dataset.

Variation in Coding

Differences between hospitals or regions may arise due to:

Coding training
- A primary health care unit may encounter all possible diagnosis (wide but shallow knowledge) while a very specialized unit might have routines for a very narrow but detailed coding
Local guidelines
- Regions are independent in Sweden
Administrative incentives
Reimbursement systems
- public and private health care providers may have different incentives
Electronic health record design
- National registers often relies on combining multiple different sources
Somatic vs psychiatric care
- In psychiatric care, diagnoses may sometimes be recorded with less specificity, potentially due to concerns about stigma or the sensitive nature of certain conditions

Registers capture both clinical events and coding behaviour.

Validity of ICD Codes

Important research question:

Does the code correspond to the true disease?

Validation studies compare ICD codes with a reference standard
(e.g., chart review, clinical registry, laboratory confirmation).

Key Measures of Validity

Sensitivity
Among patients who truly have the disease,
how many receive the correct ICD code?
→ Measures undercoding (missed cases).
Specificity
Among patients who do not have the disease,
how many are correctly not assigned the code?
→ Measures overcoding (false positives).
Positive Predictive Value (PPV)
Among patients assigned the ICD code,
how many truly have the disease?
→ Measures how reliable the code is for identifying true cases.

Conceptual 2×2 Table

	True disease	No disease
ICD code present	True positive	False positive
ICD code absent	False negative	True negative

Sensitivity = True positives / (True positives + False negatives)
Specificity = True negatives / (True negatives + False positives)
PPV = True positives / (True positives + False positives)

Why It Matters

Low sensitivity → underestimated incidence
Low PPV → inflated case counts
Variation in validity may depend on:

Diagnosis (e.g., myocardial infarction vs mild depression)
Care setting (inpatient vs primary care)
Time period (coding changes)
ICD version and national modification

Not all ICD codes are equally reliable for research.

Practical Implications for Statisticians

Before analysis, always clarify:

Which ICD version?
Which national modification?
Which coding level (3-digit vs 4-digit)?
Has coding practice changed over time?
Are crosswalks required?
Is there validation evidence for the diagnosis?

ICD is a classification system. It is not identical to clinical truth. Transparent documentation of code selection is essential for reproducible research.

ATC for drugs

Anatomical Therapeutic Chemical (ATC) classification
categorizing therapeutic drugs,
structured into 14 main groups and 5 levels, with a disease-oriented focus
introduced in the 1960s
In 1980, the World Health Organization (WHO) recommended the ATC system as the “state of the art”
follow a hierarchical structure: 1 letter, 2 digits, 2 letters, 2 digits
- Example: C09AA05
Several problems exist (Bindel and Seifert 2025) but it is nevertheless widely used

Sweden

We have ATC codes in the national prescription register
When a doctor prescribes a medication, the ATC code will follow automatically
Less room for interpretation when coding
New medications are introduced over time
The Swedish Medical Products Agency (MPA; Läkemedelsverket) make such decisions
Daily updates to the National Substance Register

Procedure codes

We use ICD for diagnoses and medical condition
But how are patients with such diagnosis treated?
What actions (in addition to the prescription of medicines) do we have?
USA has a special version of ICD for this: ICD-10-PCS
- PCS = Procedure Coding System

NOMESCO

In Sweden, medical procedures are coded using the NOMESCO Classification of Surgical Procedures (NCSP).

Developed by the Nordic Medico-Statistical Committee (NOMESCO)
Used in Sweden, Denmark, Finland, Norway, and Iceland
Primarily for on surgical procedures

KVÅ

In Sweden implemented through KVÅ (Klassifikation av vårdåtgärder)
Maintained nationally by Socialstyrelsen
Includes the NOMESCO-NCSP codes for surgery
Also includes additional codes for non-surgical treatments and activities
- Administration of chemotherapy (cytostatic treatment)
- Radiotherapy sessions
- Dialysis treatment (hemodialysis, peritoneal dialysis)
- Blood transfusion
- Vaccination
- Advanced wound care (non-surgical)
- Multidisciplinary team conference (MDT conference)
- Smoking cessation counselling
- Nutritional counselling
- Physiotherapy interventions
- Occupational therapy interventions
- Psychotherapeutic treatment sessions
- Structured patient education programmes
- Palliative care planning

Structure

Alphanumeric codes (typically 5 characters)
First letter indicates anatomical or procedural group
Subsequent characters specify procedure type and detail

Example:

NFB49 – Primary total hip replacement
JKA20 – Appendectomy

Purpose

Record surgical and certain non-surgical interventions
Used in:
- National Patient Register
- Quality registers
- Reimbursement and administrative reporting
- Health services research

Important Distinction

ICD-10-SE → Diagnosis codes
NOMESCO/KVÅ → Procedure codes

A patient record may therefore contain:

An ICD diagnosis (e.g., hip fracture)
A NOMESCO procedure code (e.g., hip replacement surgery)

Implications for Research

Diagnosis and procedure must not be confused
Trends in procedures may reflect:
- Clinical practice changes
- Technology changes
- Policy and reimbursement incentives
International comparisons require awareness that other countries (e.g., USA) use different coding systems

NOMESCO codes capture what was done, not what disease the patient had.

DRG

DRG (Diagnosis-Related Groups) is a classification system used to group hospital cases into categories expected to require similar levels of resources.

In Sweden:

Based on the NordDRG system
Used for:
- Hospital reimbursement
- Resource allocation
- Health care management
- Productivity and efficiency analyses

How Is a DRG Determined?

A DRG is assigned based on a combination of:

Primary diagnosis (ICD-10-SE)
Secondary diagnoses
Procedure codes (NOMESCO/KVÅ)
Age
Sex
Discharge status
Presence of complications or comorbidities

DRG codes are therefore derived classifications, not primary clinical codes.

Implications for Research

DRG reflects resource use, not disease incidence.
Changes in reimbursement rules may influence coding behavior.
Regional comparisons must consider administrative incentives.
DRG is suitable for health services and economic analyses, but less appropriate for etiological research.

SNOMED CT – What Is It?

SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) is a large clinical terminology system.

Maintained by SNOMED International
Contains hundreds of thousands of clinical concepts
Designed for structured documentation in electronic health records

Unlike ICD or ATC, SNOMED CT is primarily a terminology, not a statistical classification.

Terminology vs Classification

System	Type	Purpose
ICD	Classification	Epidemiology and health statistics
ATC	Classification	Drug classification
KVÅ / NOMESCO	Classification	Medical procedures
SNOMED CT	Terminology	Detailed clinical documentation

Classification systems simplify reality for statistics and reporting, while terminologies allow very detailed clinical descriptions.

Why SNOMED CT Is Not Widely Used in Registers

Despite its strengths, SNOMED CT is rarely used directly in:

national health registers
epidemiological statistics

Main reasons:

Too detailed for statistical aggregation
Harder to ensure consistent coding
Statistical reporting systems are built around ICD

Regular expressions

Regular Expressions (Regex)

A way to describe patterns in text
Used to:
- Identify diagnosis codes (ICD-10)
- Identify drug codes (ATC)
- Clean register data
- Validate variables
In R:
- Base R: grepl(), sub(), gsub()
- Tidyverse: stringr::str_detect(), str_extract(), str_replace()
- More efficient for big data: stringfish

Relevance

Typical use cases:

Select all ICD-10 codes starting with "I21" (acute myocardial infarction)
Identify all ATC codes beginning with "C09" (antihypertensives)
Check for malformed codes (quality control)
Extract codes embedded in free text (e.g., notes, text fields)

Basic Building Blocks

Symbol	Meaning
`^`	Start of string
`$`	End of string
`.`	Any character
`*`	0 or more repetitions
`+`	1 or more repetitions
`?`	0 or 1 repetition
`{m,n}`	Between m and n repetitions
`[ABC]`	Any of A, B, or C
`[0-9]`	Any digit
`\\d`	Any digit (PCRE)

Different versions

There are different implementations of regular expressions! The implementation in base R is described by ?base::regex in R. Perl-like Regular Expressions (PCRE) is a commonly used alternative requireing (perl = TRUE as argument) for the base functions.

Example: ICD-10 Structure

ICD-10 codes typically follow:

One letter
Two digits
Optional dot and additional digit

Regex pattern: ^[A-Z][0-9]{2}(\\.[0-9])?$

where \\ is ussed to remove the special meaning of . as described above. Hence, in this case \\. is interpreted as a literal . as to be found in the character string. In R:

grepl("^[A-Z][0-9]{2}(\\.[0-9])?$", icd) # base
stringr::str_detect(icd, "^[A-Z][0-9]{2}(\\.[0-9])?$")
## Faster and using 10 CPU cores in paralell (only relevant for "big enough data"):
stringfish::sf_grepl(icd, "^[A-Z][0-9]{2}(\\.[0-9])?$", nthreads = 10L)

Example: Select a Diagnosis Group

All acute myocardial infarction codes: ^I21
In R: stringr::str_detect(icd, "^I21")

This selects:

I21
I21.0
I21.9

But not:

I20
I22

ATC Code Structure

Example: C09AA05
Regex pattern:^[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}$
In R: stringr::str_detect(atc, "^[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}$")
This will find any ATC code.
You might receive data with a variable supposed to contain only ATC codes
It might as well contain other information such as ??, don't know, XXXXXXX etc
You might replace such character strings by <NA>

Implementations

base R and {stringr} both use the same underlying regex engine (PCRE)
- but {stringr} is more “user friendly”.
Stringfish seems technically sperior but is less maintained (more of a hobby project).

Common Mistakes

Forgetting ^ when matching prefixes
- This is problematic even in the stringfish::sf_starts() implementation! See bug report.
Forgetting to escape .
Not validating full string with $
Overmatching (e.g., I2 instead of ^I21)

Standardised Groupings

To account for overall disease burden, researchers often use established grouping systems, such as:

Charlson Comorbidity Index (ICD)
Elixhauser Comorbidity Index (ICD)
Similar groupings of ATC-codes
Combinations of those

These indices:

Aggregate multiple ICD codes into clinically meaningful comorbidity categories
Are commonly used for:
- Risk adjustment
- Prognostic modelling
- Confounding control in observational studies

`{decoder}`

The R package {decoder} provides descriptions for many commonly used coding systems.
In register data, you often only have the raw codes (e.g., ICD, ATC), without textual labels.
{decoder} allows you to translate codes into meaningful descriptions (in Swedish or English), making interpretation easier and more transparent.

Up-to-date?

I am the maintainer of {decoder} and {coder} but I have not had the time or energy to update them for a cuople of years. There are some reported issues.

`{coder}`

The R package {coder} can be used to aggregate individual diagnosis codes into broader clinical categories.
Common applications include:
- Charlson Comorbidity Index
- Elixhauser Comorbidity Index
- Other diagnosis-based groupings

This allows:

Standardised comorbidity adjustment
(Sort of/relatively …) transparent and reproducible case definitions
Consistent grouping across studies (hopefully)

EL8: Health care registers

Lecture Slides

Associated litterature (References at the end)

(Hiyoshi 2026)
(Ludvigsson et al. 2016)

Reading

Hiyoshi (2026) gives a brief and recent introduction and to all registers covered in this lecture.
Ludvigsson et al. (2016) provides an excellent overview of the Swedish population registers and their use in medical research.
- The Personal Data Act (PUL) has been replaced by GDPR.
- The former Regional Ethical Review Boards were replaced in 2019 by a single national authority, the Swedish Ethical Review Authority (Etikprövningsmyndigheten)
- The legislation governing legal gender recognition has been revised, with a new law adopted in 2024 and entering into force in 2025.

Recommended references:

Brooke et al. (2017) introduces the Swedish cause of death register
Ludvigsson et al. (2019) describes The longitudinal integrated database for health insurance and labour market studies (LISA) and its use in medical research
Everhov et al. (2025) have investigated the completeness of the national patient register (focus on the intoduction, discussion and conslussions)
Emilsson et al. (2015) gives an overview of the Swedish Quality Registers. Some things have changed since then, but most of it is still relevant.

Types of registers

National health registers

mandated by law
Sometimes with regional data collection and yearly updates
Governed by state authority

Quality registers

Volontary for health care providers but commonly adopted
opt-out for individuals
A typically Swedish concept
more than 100 in Sweden
But similar data collections in other countries by other names
Governed by regional authorities
Collaboration within The Swedish Association of Local Authorities and Regions (SALAR/SKR)

Authority	National registers / systems	Notes
National Board of Health and Welfare (Socialstyrelsen)	National Patient Register, Medical Birth Register, Cause of Death Register, Cancer Register, Prescribed Drug Register, Dental Health Register	Main authority responsible for Swedish health data registers
Public Health Agency of Sweden (Folkhälsomyndigheten)	National Vaccination Register, SmiNet (notifiable communicable diseases)	Registers related to infectious disease surveillance
Swedish eHealth Agency (E-hälsomyndigheten)	National prescription database (dispensed prescription data)	Data on dispensed medicines from pharmacies
Statistics Sweden (SCB)	LISA database, Population Register, Education Register, Income and Tax Register	Socioeconomic and demographic registers often linked to health data
Swedish regions / national quality register organisations	National Quality Registers (e.g. SWEDEHEART, Swedish Hip Arthroplasty Register, National Diabetes Register)	Clinical quality registers maintained by healthcare organisations

Swedish Medical Birth Register

Established in 1973
Covers pregnancies resulting in delivery in Sweden
Includes live births and stillbirths from gestational week 22+0
Contains information reported by maternal care, delivery care, and neonatal care

Main variables include:

mother’s previous pregnancies, smoking status, delivery clinic
gestational age, pain relief, and mode of delivery
diagnoses and procedures for mother and child
child’s sex, weight, length, head circumference, and condition at birth

National Register of Congenital Anomalies

Surveillance register for congenital malformations (ICD-10) and chromosomal abnormalities
For live births and stillbirths (≥22 weeks gestation)
Also includes anonymised data on terminated pregnancies due to fetal anomalies

Background

monitoring of congenital anomalies started in 1964 following the Thalidomide (Neurosedyn) scandal
integrated with the Medical Birth Register in 1973
reporting expanded in 1999 to include terminated pregnancies due to fetal anomalies

Population registration

Maintained by the Tax Agency
- administrative register used for legal residence and taxation
Personal identity number assigned at birth (at the hospital)
Legal sex registered as part of the personal identity number
Child’s given name decided by parents and reported to the Tax Agency within 3 months
Registration of municipality of residence (and thereby county)
Registration district (earlier parish, now district)
Registered residential address
Migration events (immigration and emigration)
Changes of address within Sweden
Marital status and family relations
Date of death

The Total population Register (RTP)

maintained by Statistics Sweden (SCB)
based on data from the Population registration
structured for statistical analysis and research
used as the sampling frame for surveys and register-based studies

Childhood

During childhood and school years, health information is collected through:

Child health services (BVC)
School health services
Vaccination records
Examinations by school nurses or physicians
dentists etc

In contrast to many other stages of life in Sweden, these data are not collected in a national register.

Vaccination Register

National register maintained by the Public Health Agency of Sweden (Folkhälsomyndigheten)
Established in 2013
Primarily covers vaccinations given within national vaccination programmes

Includes

vaccine administered
date of vaccination
dose number
healthcare provider

Purpose

monitor vaccination coverage
detect changes in uptake
support surveillance of vaccine-preventable diseases

Limitations

vaccinations given outside national programmes (e.g. travel vaccines or by occupational health services) may not be fully captured
reporting comes from many providers (regions, schools, private clinics)
- therefore coverage may vary for some vaccines

Military conscription

The conscription register contains information from military conscription examination

physical measurements (e.g. height, weight)
cognitive ability tests
psychological assessments
health status and diagnoses
physical fitness

Important characteristics:

covers most Swedish men born roughly 1951–1990
Inactive 2010-2016
Both men and women since 2017 (but only approximately 25 %)
examinations typically performed at age 18–19
data collected by the Swedish Armed Forces

Implications for research:

provides rich health and ability data in late adolescence for historic cohort
widely used in epidemiological and social science research
mainly includes men, which limits generalisability

Swedish Dental Health Register

National register maintained by the National Board of Health and Welfare (Socialstyrelsen)
Established in 2008
Includes adults (≥20 years)

Main variables include:

dental diagnoses
dental procedures (treatment codes)
dental status (e.g. number of remaining teeth)
treatment dates
treatment costs and reimbursement

Important characteristics:

based on reports submitted within the national dental care subsidy system
includes both public and private dental care providers

Screening registers

breast cancer (mammography)
cervical cancer
colorectal cancer

These programmes aim to detect disease at an early stage in otherwise healthy individuals.

Data from screening programmes are often stored in regional systems and coordinated nationally through:

national screening programmes
national quality registers
regional cancer centres (RCC)

Typical variables include:

invitation to screening
participation
screening results
follow-up examinations
detected diagnoses

Screening data are important for studies of:

participation in preventive care
early disease detection
effectiveness of screening programmes

National Patient Register (NPR)

National register maintained by the National Board of Health and Welfare (Socialstyrelsen)
Covers specialised health care in Sweden
Data are collected by regions and healthcare providers and then reported to the national register

Coverage

inpatient care since the 1960s
nationwide coverage since 1987
specialised outpatient care since 2001

Each record represents a healthcare contact or episode of care.

Examples of contacts included:

hospital admissions
outpatient specialist visits
psychiatric specialist care
day surgery or day care procedures

Main variables include:

diagnoses (ICD codes)
procedures (KVÅ codes – Swedish classification of healthcare interventions)
case-mix classification (DRG codes)
dates of admission and discharge
hospital or clinic
patient demographics

Historical note:

earlier records used ICD-8 and ICD-9
today diagnoses are classified using ICD-10-SE

Important limitations:

primary care is not included
coverage before 1987 is incomplete
coding practices may vary between regions and over time

Additional considerations:

reporting is nationally standardized but based on regional data collection
historically, some psychiatric care has been less consistently reported
many outpatient contacts require a physician encounter, but coding practices for contacts involving other professionals (e.g. nurses or psychologists) may vary

Despite these limitations, the NPR is widely used for:

epidemiological research
disease surveillance
health services research

Primary care data

Primary care accounts for a large share of health care contacts in Sweden, but there is still no comprehensive national primary care register comparable to NPR.

most primary care data are stored in regional electronic health record systems
reporting practices differ between regions
national coverage for research and statistics is therefore limited

Recent developments

there have been long-standing discussions about establishing a national primary care register
the National Board of Health and Welfare has begun collecting some aggregated primary care statistics
several pilot initiatives and data collections are ongoing

Prescribed Drug Register (PDR)

National register maintained by the National Board of Health and Welfare (Socialstyrelsen)
Covers dispensed prescription drugs from Swedish pharmacies
Established in 2005
Nationwide coverage

Profession	Prescribing rights
Physicians	Broad prescribing rights (most medications)
Dentists	Medications related to dental care
Midwives	Selected medications (e.g., contraceptives)
Nurses	Limited list of medications after additional training
Optometrists	Certain diagnostic eye medications

Main variables include:

drug classification (ATC code)
date of dispensing
amount dispensed
dosage information
prescribing clinic or prescriber category

The register contains drugs dispensed at pharmacies, not all prescriptions written by physicians.

PDR does not include

drugs administered in hospitals
over-the-counter drugs
- since 2009, many non-prescription drugs can also be sold outside pharmacies (e.g. supermarkets and petrol stations)
most herbal medicines or dietary supplements

Dispension does not guarantee that the patient actually used the medication

used for

pharmacoepidemiology
studies of drug safety and effectiveness
studies of treatment patterns

Cancer Register

National register maintained by the National Board of Health and Welfare (Socialstyrelsen)
Established in 1958
Covers all newly diagnosed malignant (and some benign) tumors in Sweden
Based on “primary tumour”

Reporting is mandatory for clinicians (“A-form”) and pathology laboratories (“B-form”).

Includes

date of diagnosis
tumor site and morphology
stage and diagnostic basis
reporting clinic and region

Diagnoses are classified using ICD and ICD-O codes.

Cancer reporting in Sweden is organised through six healthcare regions:

Northern
Uppsala–Örebro
Stockholm–Gotland
South-East
Western (VGR + [Northern] Halland)
Southern

Within each region, a Regional Cancer Centre (RCC) coordinates:

cancer data reporting (combining A- and B-forms)
quality improvement
clinical guidelines
cancer care monitoring

Strengths:

one of the oldest nationwide cancer registers in the world
high completeness due to mandatory reporting
enables long-term studies of cancer incidence and survival

Limitations:

limited information on treatment and outcomes
additional clinical details often require linkage to quality registers or NPR

Other disease-specific registers

The Swedish Cancer Register is one of the few nationwide disease-specific health data registers.

Some other diagnoses are monitored through national systems, often related to infectious disease surveillance.

Examples:

HIV (InfCare HIV register)
notifiable infectious diseases reported through SmiNet
tuberculosis surveillance systems
hepatitis registers

Characteristics

often linked to infectious disease control
reporting may be mandatory under the Communicable Diseases Act
maintained by the Public Health Agency of Sweden

Cause of Death Register

National register maintained by the National Board of Health and Welfare (Socialstyrelsen)
Established in 1952
Covers all deaths among persons registered in Sweden
Information is based on death certificates completed by physicians.
Diagnoses are classified using ICD codes. Currently ICD-10 (WHO version, not ICD-10-SE)

Each death certificate contains:

underlying cause of death
(the disease or injury that initiated the chain of events leading to death)
contributing causes of death
(other conditions that contributed to the death)

Example structure:

Disease → complication → death
Diabetes → kidney failure → death

This structure allows analyses of both underlying and contributing causes of death.

Strengths

nationwide coverage since the 1950s
very long time series for mortality research
standardized (international) classification using ICD

Limitations

cause of death depends on clinical judgement and available information
autopsy rates have decreased over time
misclassification can occur, especially at older ages (multimorbidity)

Note

The register is used when the cause of death is of importance. The date of eath is otherwise also found in TPR and can often be accesed directly from the Tax Agency.

LISA

Not a health care register but often used for background data.

Longitudinal Integration Database for Health Insurance and Labour Market Studies
Maintained by Statistics Sweden (SCB)
based on administrative registers from several authorities
Established in 1990, with data available annually from that year

The database contains socioeconomic information for all individuals aged 16 and older registered in Sweden.

Purpose

provide background variables for research and statistics
enable studies of social determinants of health

Main types of variables include

education level
income and taxation
employment status
occupation and workplace
social insurance benefits
family situation

used for

adjustment for socioeconomic factors
studies of inequalities in health
labour market and health research

National Quality Registers

National quality registers collect detailed clinical information on specific diseases or treatments.

They are designed to:

monitor and improve quality of care
support clinical quality improvement

But they can also be used for research

Typical characteristics:

focus on specific diagnoses, treatments, or procedures
contain more detailed clinical data than national health data registers
participation is generally voluntary for healthcare providers

There are currently around 100 national quality registers in Sweden.

Organisation

Quality registers are typically initiated by clinical communities.

Key actors include:

steering groups for each register
regional cancer centres and register centers
The system is nationally coordinated by the Swedish Association of Local Authorities and Regions (SALAR / SKR)
Registers are hosted by regions (who acts as data controllers)
funded through national and regional support
Quality registers therefore complement national health data registers by providing more detailed clinical information.

PROM/PREM

In addition to clinical data, many registers also collect:

PROM – Patient Reported Outcome Measures
PREM – Patient Reported Experience Measures

These variables capture:

patients’ own assessment of health status
quality of care from the patient perspective

If collected longitudinally, comparisons before/after treatment becomes possible.

The Swedish register ecosystem

Sweden has a unique infrastructure of population-based registers that can be linked using the personal identity number.

Register type	Examples	Maintained by
Population registers	Population Register / RTB	Statistics Sweden (SCB)
Socioeconomic registers	LISA database	Statistics Sweden (SCB)
Health data registers	NPR, MBR, PDR, Cancer Register, Cause of Death Register, Dental Health Register	National Board of Health and Welfare
Infectious disease surveillance	National Vaccination Register, SmiNet	Public Health Agency of Sweden
Social insurance registers	Sickness absence, disability benefits	Swedish Social Insurance Agency
Social services registers	Social assistance, child welfare	National Board of Health and Welfare
Quality registers	~100 disease- or treatment-specific registers	Regions (SALAR)

Key characteristics:

Nationwide coverage for many registers
Longitudinal data spanning several decades
Possibility of individual-level linkage across registers by the individual personal number

Limitations:

some sectors lack national registers (e.g. primary care, school health services)
coverage and data quality may vary across registers, health care providers (private vs public, secondary vs “tertary” care and over time)

International perspective

Many countries maintain health-related registers, but the structure and coverage vary considerably.

Common challenges internationally:

fragmented healthcare systems
lack of unique personal identifiers
multiple data holders (insurance systems, hospitals, regions)
limited possibilities for linking data across sectors

As a result, nationwide longitudinal register studies are often more difficult to conduct than in the Nordic countries.

Countries with similar register infrastructures

The Nordic countries have relatively similar systems based on:

nationwide administrative registers
universal healthcare systems
personal identification numbers

These countries are therefore often used in comparative register-based research.

Examples from other countries

Health data systems in other countries are often organised differently:

United Kingdom: NHS administrative datasets
Netherlands: population registers linked with health insurance data
United States: insurance claims databases and cohort studies
Canada: provincial administrative health data

These systems can provide valuable data, but often lack:

nationwide coverage
consistent linkage across registers
very long follow-up periods

The Nordic register systems are therefore widely used in international epidemiological research.

Swedish Twin Registry

Maintained by Karolinska Institutet
Established in 1961
One of the largest twin registers in the world

Information on

twins born in Sweden since the late 1800s
zygosity (monozygotic / dizygotic)
health outcomes
lifestyle and environmental factors

The register currently includes more than 190,000 twins.

research use

Main purpose to study the role of genetic vs environmental factors in health and disease

Typical study designs:

twin concordance studies
co-twin control studies
longitudinal follow-up

EL9: Documentation

Lecture Slides

Associated litterature (References at the end)

(Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 28-29)

Reading

(Wickham, Çetinkaya-Rundel, and Grolemund, n.d., ch. 28-29) introduces Quarto with good examples

For reference

quarto.org

Levels of documentation

Documentation can exist at several levels in a project.

Code level
- scripts (comments, e.g. # clean data and merge cohorts)
- functions (Roxygen2 documentation)
- notes and reminders (e.g. # TODO:)
Project level
- README.md
- (GitHub repos may also have a wiki or a polished website)
Output level
- figures and tables
- reports (PDF / HTML)
- presentations (slides)

Code level

Documenting R scripts

Use Air for better formatting of the R code itself (will format the code according to best practice for readability and collaboration)
Section labels by Shift + Ctrl/Cmd + R in Positron (and RStudio)
Easiest way to aid navigation to R scripts

## My section heading -----------------------------------------------------

Individual scripts

As soon as you distribute an individual R script from a bigger project (by e-mail or as a printed copy) some context would get lost.
The commentr package is designed to maintain such context
- remotes::install_bitbucket("cancercentrum/commentr")

#| echo: true
commentr::header_comment(
  "My nice script",
  "Bla bla bla ...",
  "Erik Bülow",
  "xxx@yyy.zzz",
  "John Doe"
)

## TODO: Change this to something better:
1 + 2

Documenting R functions

What does this function do?

#| echo: true
hello <- function(who = "world") {
  sprintf("Hello %s!", who)
}

#| echo: true
#' Say hello
#'
#' Create a simple greeting. If no name is provided, the function
#' returns a greeting to the world.
#'
#' @param who A character vector giving the name(s) of the person or
#'   object to greet. If missing, the greeting defaults to `"world"`.
#'
#' @return A character vector of the same length as `who` containing
#'   greeting messages.
#'
#' @examples
#' hello()
#' hello("you")
#' hello(c("Alice", "Bob"))
hello <- function(who = "world") {
  sprintf("Hello %s!", who)
}

There is a standardized syntax to document R functions
You should be able to click a small light bulb (💡) to insert a sceleton when writing your function
The same syntax is used within modern R packages
- but it is useful even within your own projects
roxygen2 is a package used for package documentation but it also has a great manual for documenteing functions in general

Project level

Documenting a project

You have a README.md file within your git repository
It can contain plain text without any formatting
It will be displayed up-front in GitHub
You can also include formatting!
It will be nicely rendered on GitHub or elsewhere

Basic formatting

What you want	Markdown syntax	Result
Heading	`# Heading`	Heading
Italic	`text`	text
Bold	`text`	text
Inline code	`code`	`code`
Link	`[text](https://example.com)`	text
Bullet list	`- item`	• item
Numbered list	`1. item`	1. item

Comment or heading

# is used for comments in R files and within R blocks but for headings in .md and .qmd files!

Visual mode

Instead of editing the source code, you may also use the “visual editor” in Positron:

Warning

Use both?

Personally, I do most editing in source mode (better for the R chunks), but I find it much easier to include references/citations and to paste in external figures in using the visual mode.

Reproducible Workflows

Modern biostatistics projects rarely involve only statistics.

data cleaning and preprocessing
statistical modelling
visualisation
interpretation of results
reporting and documentation

A good workflow should make it possible to combine all of these steps in one place.

Output level

What?

Quarto is a publishing system for scientific and technical documents.

It allows you to combine:

text
R code
statistical results
tables and figures

in a single workflow.

This makes it easier to create reproducible statistical reports.

Why?

In many projects people work with:

R scripts for analysis 🥳
Word documents for reporting 😱
PowerPoint slides for presentations 😱

This separation often leads to:

copy–paste errors 🥴
outdated figures 👎
results that cannot easily be reproduced 🤯

Quarto helps avoid these problems!

Reproducible analysis

In a Quarto document you can:

write the explanation of your analysis
include the R code used
generate figures and tables automatically
update everything by re-running the document

If the data or model changes, the entire report can be regenerated automatically.
It is widely used in data science, epidemiology and biostatistics.
Integrated in Positron and developed by Posit PBC

Used in three ways

For communicating to decision-makers, who want to focus on the conclusions, not the code behind the analysis.
For collaborating with collegues (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).
As an environment in which to do data science and statistics, as a modern-day lab notebook where you can capture not only what you did, but also what you were thinking.

A simple report

When you chose New > New File ... > Quarto document, the file will contain:

---
title: "Untitled"
format: html
---

This is a YAML header (Yet Another Markup Language)
Different output formats are available:
- html: can also be published online (Quartopub or GitHub pages)
- docx: Word document for collaborating with non-statisticians etc
- typst: PDF rendering (ever heard of $\LaTeX$? … This is the modern alternative!)
- revealjs: slide decks (you are looking at one!)
Can render to multiple formats at once!

This is the yaml header for this presentation:

---
title: "EL9: Documentation"
subtitle: "Quarto"
reading: "[@wickham, ch. 28-29]"
format:
  revealjs:
    theme: blood
    transition: fade
    slide-number: true
    css: styles.css
    smaller: true
    scrollable: true
    incremental: false
  typst:
    include-before-body:
      - text: |
          // Gör alla länkar tydliga i PDF: blå + understrukna
          #show link: set text(fill: rgb("1a5fb4"))
          #show link: underline
  docx: default
bibliography: references.bib
---

YAML structure

YAML uses indentation to represent structure.

Think of it like a tree.

format:
  html:
    code-fold: true

This corresponds to the structure:

format
└── html
    └── code-fold: true

If indentation is wrong?

format:
html:
  code-fold: true

Now Quarto interprets this as:

format
html
└── code-fold: true

This is not the intended structure, and the document may fail to render.

💡 Rule of thumb

use two spaces per level
never use tabs
when something fails, check indentation first

Three parts

Quarto documents have three parts:

An (optional) YAML header surrounded by ---s.
Chunks of R code surrounded by ```{r} and ```.

```{r}          ← chunk header
#| echo: false  ← options controlling the output
summary(x)      ← R code
```             ← end of chunk

Keybindings: Cmd + option + I (Mac) or Ctrl + Alt + I (Windows)

Text with markdown format

Order

Part 1 (YAML header) always comes first but 2 (code block) and 3 (Markdown text) are usually intermingled throughout the rest of the document.

Inline code

Sometimes we want to insert small pieces of R output directly in the text.

This is called inline code.

Inline code will always just show the result of the call (not the code itself in the rendered document):

Todays date is `r Sys.Date()` and it is `r format(Sys.time(), format = "%H")`

Todys date is r Sys.Date() and it is r format(Sys.time(), format = "%I %p")

To higlight with bold and italic:

Todays date is **`r Sys.Date()`** and it is *`r format(Sys.time(), format = "%I %p")`*

Todays date is r Sys.Date() and it is r format(Sys.time(), format = "%I %p")

To publish

Decision makers, stake holders, friends or the public might not care about R. Disable R code (echoing the code itself), warnings and messages:

---
title: "Untitled"
format: html
execute:
  echo: false
  warning: false
  message: false
---

For collaborators

Might still care less about verbose R messages and warnings (they trust that you have already considered those)
May still need the underlaying R code for deeper understaing
Use code-fold

---
title: "Untitled"
format:
  html:
    code-fold: true
execute:
  warning: false
  message: false
---

This can also be set individually for each code chunk:

```{r}
#| echo: false
#| warning: false
#| message: true
message("will be muted!")
warning("will be ignored")
plot(something_beautiful, realy = TRUE)
```

Figures

Figures are usualy displayed without problem (and you can adjust their size, add captions etc):

```{r}
#| fig-cap: This is origo!
plot(0, 0)
```

#| fig-cap: This is origo!
plot(0, 0)

Tables

Just printing the R output might not look very nice:

#| echo: true
head(iris)

Better with kable

The easiest way to make tibbles/data.frames/data.tables look nicer is with df-print: kable in the yaml header.

---
title: "Untitled"
format: html
df-print: kable
---

Or manually for individual outputs:

```{r}
#| tbl-cap: Some iris data
knitr::kable(head(iris))
```

#| tbl-cap: Some iris data
knitr::kable(head(iris))

Best with gt

Very flexible
Publication ready
Package from Posit
flextable may be preferred instead if rendering Office documents (Word, PowerPoint)

gtsummary

gtsummary package build on top of {gt} to easily summarise datasets, regression models etc

```{r}
gtsummary::tbl_summary(iris, Species)
```

gtsummary::tbl_summary(iris, Species)

Presentations

Quarto can export to PowerPoint
And to the older Beamer format
But I recommend revealjs
- format: revealjs
- can also be modified with speaker notes, laser pointer etc
Info and examples

For example (as used for these slides):

format:
  revealjs:
    theme: blood
    transition: fade
    slide-number: true
    css: styles.css
    smaller: true
    scrollable: true
    incremental: false

Integrate with `{targets}`

tarchetypes::tar_quarto()

in `_targets.R`

library(targets)
library(tarchetypes) # Let you integrate a quarto report/presentation
list(
  tar_target(data, data.frame(x = seq_len(26), y = letters))
  tar_quarto(report, "report.qmd") # Make presentation
)

In `report.qmd`

---
title: My report
format: html
---

## Here is my beautiful data!

```{r}
gt::gt(tar_read(data))
```

Try it

Copy this code to the terminal (not the console!) and try targets::tar_make() in the console!

#| eval: false
#| echo: true
cd
mkdir test_targets_quarto
cat > test_targets_quarto/_targets.R <<'EOF'
library(targets)
library(tarchetypes) # Lets you integrate a quarto report/presentation

list(
  tar_target(data, iris),
  tar_target(model, lm(Sepal.Length ~ ., data)),
  tar_quarto(report, "report.qmd") # Make presentation
)
EOF

cat > test_targets_quarto/report.qmd <<'EOF'
---
title: My project
author: My name
date: today
format:
  typst: default
  docx: default
  html: default
  revealjs:
    output-file: slides.html
---

## Here are some flowers!

gtsummary::tbl_summary(targets::tar_read(data))

## And here is a model

gtsummary::tbl_regression(targets::tar_read(model))
EOF

positron test_targets_quarto
## rm -r test_targets_quarto

Manuscripts

Quarto Manuscript
framework for writing and publishing scholarly articles
Something for your later thesis work?
Handles references
Export to Word documents (.docx) etc
- which is still the most common format for submission to medical/epidemiological journals
- Still used for collaborative work with non-statisticians etc

Use references

When writing scientific reports we need to:

acknowledge previous research
support claims with evidence
allow readers to find the original sources

This is done through citations and a reference list.

Example:

Don’t blame me, it was according to Wickham, Çetinkaya-Rundel, and Grolemund (n.d.).

At the end of the document a bibliography is automatically generated.

How Quarto handles references

Quarto uses a bibliography file, usually in .bib format.

Example in the document header:

bibliography: references.bib

Then citations can be written directly in the text:

Several studies have examined this relationship @smith2020.

Quarto will render something like:

Several studies have examined this relationship (Smith 2020).

The full reference will appear automatically in the bibliography.

Online portfolio

Looking for a job after you fininsh the master program?
An online portfolia might increase your visability!
Perhaps add your course project to such portfolio?
Demo

EL10: Biggish data

Lecture Slides

Associated litterature (References at the end)

Examination

This lecture will not be examined. You are encouraged to experiment with the concepts in your project work if you find them useful, but this is not required.

Big data

Biggish data

Our focus:

Only tabular data
Original data might fit into physical RAM but …
might be multiplied due to temporary necessity
physical RAM might be constrained due to other processes/settings
things might just take too much time … and life is short …

Reading large datasets

When working with large datasets, where the data is stored matters.

Network storage (e.g. shared drives)

slower data transfer (limited bandwidth)
higher latency (delay before data starts loading)
multiple users may compete for resources
repeated reads can be very inefficient

Local storage

much faster read/write speeds
low latency
more stable and predictable performance
better suited for iterative data analysis

Secure environment

In secure environments, both types of storage may look local, but they behave differently.

“Local” disk (e.g. SSD on the server/VM)
- attached directly to the machine you are working on
- high bandwidth, low latency
- behaves like true local storage
- typically much faster for data analysis
Mapped network folder
- accessed over the network (even if it looks like a normal folder)
- lower bandwidth and higher latency
- shared with other users
- slower, especially for repeated reads

💡 Key difference:
Not where the data is stored, but how it is accessed (direct disk vs network).

💡 Practical advice:
Use “local” disk (SSD on the environment) for active analysis, and network storage for long-term storage.

Practical implication

For large datasets:

avoid repeatedly reading data directly from network drives
copy data locally when possible
- fs::file_copy(path, new_path)
- (“libuv provides a wide variety of cross-platform sync and async file system operations.”)
- obviously only if “local drive” is also in the (same) secure environment!

💡 Key message:
Data access can easily become the main bottleneck — not your code.

HDD (Hard Disk Drive)

mechanical (spinning disks)
slower, especially for random access
typical speeds:
- ~50–150 MB/s (sequential read)
larger capacity at lower cost
often used in older systems or for backups

SSD (Solid State Drive)

no moving parts
much faster and more reliable
typical speeds:
- ~500 MB/s (SATA SSD)
- ~2000–7000 MB/s (NVMe SSD)
standard in most modern laptops

Typical sizes

laptops: 256 GB – 1 TB SSD
desktops / servers: 1–4 TB SSD + optional HDD storage
network drives: often very large but slower

RAM and data analysis in R

When working in R, available RAM (memory) is often the main limiting factor.

R can typically use most of the available RAM on your machine
some memory is needed by:
- the operating system
- other applications
a safe rule of thumb is to assume ~50–75% of total RAM is available for R

How large datasets can you work with?

datasets must fit in memory (unless using special tools)
but you also need memory for:
- intermediate objects
- copies created during transformations
- model objects
many R operations create temporary copies of data
memory usage can double or triple during processing
running out of RAM leads to:
- slow performance
- crashes

Example

Tip

Rule of thumb: you can usually work comfortably with data that is at most ~1/3 to 1/2 of your available RAM

If you have:

16 GB RAM → usable ≈ 8–12 GB
→ practical dataset size ≈ 3–6 GB
Modern R is 64-bit → can use large amounts of memory
- (32-bit R was limited to ~4 GB — mostly obsolete today)

Practical advice

check memory: ps::ps_system_memory()
avoid unnecessary copies of large objects
- Use {data.table} for reference semantics
- or Parquet files to read only the necessary data
consider pipelines (targets) or chunked processing

Numeric vs integer in R

R’s default numeric type is double precision (numeric).

x <- 1
typeof(1) # "double"
typeof(1L) # weird syntax to get "integer"

Why this matters

integers use less memory (4 bytes vs 8 bytes)
can be important for large datasets
100 million values:
- numeric ≈ 800 MB
- integer ≈ 400 MB

In practice

R often converts to numeric automatically
some tools (e.g. data.table) use integers efficiently
- compare as.IDate() vs as.Date()
useful when working with:
- IDs
- categories
- counts

💡 Key message:
Choosing the right data type can significantly reduce memory usage.

ALTREP (Alternative Representation)

R can represent some objects in a compact or lazy way.

introduced in R 3.5.0
avoids allocating full memory immediately

#| echo: true
x <- 1:1e9
lobstr::obj_size(x) # actual memory used
format(object.size(x), units = "GB") # reserved memory
## as.numeric(x) # will use the reserved memory

What happens?

data is generated on demand
not all values are stored in memory

💡 Key message:
Not all objects in R are fully materialised — some are computed when needed.

Other memory optimisations in R

R uses several mechanisms to reduce memory usage (not only ALTREP).

Shared strings (string interning)
- identical strings may be stored only once
- reduces memory when many values are repeated

But memory still fills up…

Even with smart representations:

many operations still create real objects
temporary objects accumulate

👉 R still needs to free memory

Garbage collection

When you work in R, memory is constantly used for intermediate objects.

many operations create temporary results
these objects are no longer needed after a step is completed
but R does not always remove them immediately

What is the problem?

memory can fill up with objects that are no longer used
there are no active references (“pointers”) to these objects
but they still occupy memory

Solution: garbage collection

R periodically identifies objects that are no longer reachable
these are removed, and memory is freed
GC is triggered when R needs more memory
may cause short pauses during execution

Practical advice

rm(large_object) # Remove objects you no longer need (still takes up memory)
gc() # manual garbage collection (free the memory)
dt[, new := f(old)] # reference semantics by `{data.table}` avoids "hidden" objects

💡 Key message:
You rarely need to manage memory explicitly — but inefficient code can still use too much of it.

Unnecassary computations for large objects

Sometimes R (or the computer hardware) is not the problem, but the IDE might be.
Common for RStudio (RStudio and R both run in the same process)
- Memory problems might make RSTudio crash
Positron separates the R process from the IDE
- Better, but still not perfect

Still open?

CPU

The Central Processing Unit (CPU) determines how fast computations are performed.

CPUs work in clock cycles (e.g. 3 GHz ≈ 3 billion cycles per second)
CPU matters most when:
- running models (e.g. regression, mixed models)
- performing simulations
- using loops or inefficient code
CPU matters less when:
- reading data (disk/network bottleneck)
- working with very large objects (RAM bottleneck)

Note

Writing efficient code (vectorisation) often matters more than CPU speed. Applies also to R packages on CRAN!

How R uses the CPU

R is often single-threaded
many operations use only one core
some libraries can use multiple cores (parallel computing)

How many cores do you have?

#| echo: true
benchmarkme::get_cpu()

Vectorization in R

R is designed to work efficiently with vectors, not loops.
In R, what looks like a scalar is actually a vector of length 1
vectorized operations apply to whole objects at once
loops process one element at a time (often slower)

## slow
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}

## fast
result <- x + y

Why is vectorization faster?

Vectorized operations in R are usually implemented in compiled C code.

R code (e.g. for loops) is interpreted → each step has overhead
vectorized operations (e.g. x + y) are:
- implemented in C
- run as tight, optimized loops

Parallel computing

Parallel computing means using multiple CPU cores at the same time.

most R code runs on a single core
your computer may have many cores (e.g. 4–16)
a shared server such as TRE might have several hundreds!

Why use parallel computing?

to speed up computationally intensive tasks
especially when tasks can be done independently

💡 Key message:
Parallel computing allows you to do multiple computations at once — but only if the problem can be split.

Paralellization

Some packages and functions may use multithreading by default
- But this is unusual
Some functions might have arguments such as nthreads, ncores etc
- You may want to use them
To handle multicore/thread computations in general can be (very) difficult (I’ve heard)

Split-apply-combine

Parallel computing works best when:

tasks are independent
tasks are computationally heavy
tasks can be repeated many times

Examples

simulations
bootstrapping
cross-validation
applying a function many times

💡 If tasks depend on each other, parallelisation will not help.

When parallel computing does NOT help

Parallel computing is often not the solution.

reading data → bottleneck is disk/network
large objects → bottleneck is RAM
inefficient code → vectorization is better

Example in R

lapply(1:100, function(i) slow_function(i)) # Sequential
parallel::mclapply(1:100, slow_function, mc.cores = 4) # Parallel

What happens?

each core processes part of the work
results are combined at the end

💡 More cores ≠ always faster

Limitations of parallel computing

overhead (starting processes takes time)
copying data between processes
limited by number of cores
not all functions are parallel-friendly

`{mirai}`

Suposed to be up to 1,000 times “faster” compared to earlier methods
- (according to some metric …)
should not block the main process while executing (haven’t tried)

Designed for simplicity, a ‘mirai’ evaluates an R expression asynchronously in a parallel process, locally or distributed over the network, with the result automatically available upon completion.

#| eval: false
#| echo: true

library(mirai)
library(data.table)
daemons(11)

## data.table with 1 billion rows:
dt <- data.table(x = seq_len(1e9), y = rnorm(1e9))
## row sums (row wise )
mirai_map(dt, sum)[.flat]

`{purrr}`

Introduced in_paralell() in 2025
Based on mirai but without the specialised syntax
Needs to be explicit on which objects/functions to export to each worker/daemon

#| echo: true
library(purrr)
library(mirai)

## Set up parallel processing (6 background processes)
daemons(6)

## Sequential version
mtcars |> map_dbl(\(x) mean(x))
#>    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
#>  20.09   6.19 230.72 146.69   3.60   3.22  17.85   0.44   0.41   3.69   2.81

## Parallel version - just wrap your function with in_parallel()
mtcars |> map_dbl(in_parallel(\(x) mean(x)))
#>    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
#>  20.09   6.19 230.72 146.69   3.60   3.22  17.85   0.44   0.41   3.69   2.81

## Don't forget to clean up when done
daemons(0)

`{targets}`

The targets package does implement parallel processing efficiently (based on {mirai} via {crew})
covered in the R medecine workshop (see EL5)

Note

Data must be copied for each paralell worker
If all data is needed for each worker, it will be multiplied in memory
- This might not be possible for big data
- It also takes time
Hence, there is no guarantee that the over all time will decrease
Often a tradeof between no of workers/threads/deamons/cores and efficiency etc
“Standard” computations in {data.table} is optimized for single core
Message handling and progress bar etc is complicated
- So is random number generation

GPU

A Graphics Processing Unit (GPU) is designed for massively parallel computations.

many simple cores (thousands)
optimized for performing the same operation many times
originally developed for graphics

How is it different from a CPU?

CPU: few powerful cores → general-purpose tasks
GPU: many simple cores → parallel tasks

When are GPUs useful?

machine learning / deep learning
large matrix operations
simulations that can be parallelised

In R

most R code does not use the GPU
standard packages are CPU-based
GPU requires specialized tools (e.g. torch, tensorflow, gpuR)

💡 Key message:
GPUs can be extremely powerful — but only for specific types of problems.

In most R workflows, CPU, RAM, and data access matter much more.

Profiling

First version of R script might be inneficient
Optimazation is difficult
Might be suboptimal if made ad hoc
{profvis} visualise time and memory usage for each step
improve the most important step first
- For example change {base} and {tidyverse} code to {data.table}
- set keys
- Efficient handling of dates and strings
itterate
Integrated in RStudio (works in Positron to)

Modify packages

Functions from packages might be inefficient
Profile those as well
Clone package from GitHub if available or download source code from CRAN
Improve
Just load the modified function in global environment
- might need to modify internal calls to :::-accesed functions
Or rebuild/install the package if more convenient

Notify maintainer

If you find ways to improve a package, the maintainer might be eager to hear! Suggest pull request or open GitHub issue!

C-code changes

If the package use C-code it is probably already very fast
If not, and you can fix it, the package must be re-compiled
OS dependent
Might be tricky if using restricted environments (TRE)
Can still be done by rhub (might need to change maintainer-e-mail temporarily)

Complete Reading list

Articles might be obtained through the GU library (UB) or as PDF files uploaded to Canvas .

References

Alharbi, Musaed Ali, Godfrey Isouard, and Barry Tolchard. 2021. “Historical Development of the Statistical Classification of Causes of Death and Diseases.” Edited by Rahman Shiri. Cogent Medicine 8 (1): 1893422. https://doi.org/10.1080/2331205X.2021.1893422.

Arel-Bundock, Vincent. 2025. “Data.table Vs. Base Vs. Dplyr Vincent Arel-Bundock.” https://arelbundock.com/posts/dt_tb_df/index.html.

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.

Bindel, Lilly Josephine, and Roland Seifert. 2025. “Problems Associated with the ATC System of Drug Classification.” Naunyn-Schmiedeberg’s Archives of Pharmacology, December. https://doi.org/10.1007/s00210-025-04833-1.

Brooke, Hannah Louise, Mats Talbäck, Jesper Hörnblad, Lars Age Johansson, Jonas Filip Ludvigsson, Henrik Druid, Maria Feychting, and Rickard Ljung. 2017. “The Swedish Cause of Death Register.” European Journal of Epidemiology 32 (9): 765–73. https://doi.org/10.1007/s10654-017-0316-1.

Data Analysis Using Data.table. n.d. 2026. https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html.

Emilsson, L., B. Lindahl, M. Köster, M. Lambe, and J. F. Ludvigsson. 2015. “Review of 103 Swedish Healthcare Quality Registries.” Journal of Internal Medicine 277 (1): 94–136. https://doi.org/10.1111/joim.12303.

Everhov, Åsa H., Thomas Frisell, Mehdi Osooli, Hannah L. Brooke, Hanne K. Carlsen, Karin Modig, Karl Mårild, et al. 2025. “Diagnostic Accuracy in the Swedish National Patient Register: A Review Including Diagnoses in the Outpatient Register.” European Journal of Epidemiology 40 (3): 359–69. https://doi.org/10.1007/s10654-025-01221-0.

Fenk, Simone Rahel, Kari Furu, and Inger Johanne Bakken. n.d. “Improve Data Management in Register-Based Research: Transition from CSV to Parquet.” https://doi.org/10.1101/2025.10.15.25337992.

Görman, Ulf. 2024. “Guide to the Ethical Review of Research on Humans.” Uppsala. https://etikprovningsmyndigheten.se/wp-content/uploads/2024/05/Guide-to-the-ethical-review_webb.pdf.

Hiyoshi, Ayako. 2026. “Overview of Swedish Register Data for Health Research.” Annals of Clinical Epidemiology advpub. https://doi.org/10.37737/ace.27005.

Kavianpour, Sanaz, James Sutherland, Esma Mansouri-Benssassi, Natalie Coull, and Emily Jefferson. 2022. “Next-Generation Capabilities in Trusted Research Environments: Interview Study.” Journal of Medical Internet Research 24 (9): e33720. https://doi.org/10.2196/33720.

Laugesen, Kristina, Jonas F Ludvigsson, Morten Schmidt, Mika Gissler, Unnur Anna Valdimarsdottir, Astrid Lunde, and Henrik Toft Sørensen. 2021. “Nordic Health Registry-Based Research: A Review of Health Care Systems and Key Registries.” Clinical Epidemiology Volume 13 (July): 533–54. https://doi.org/10.2147/CLEP.S314959.

Ludvigsson, Jonas F., Catarina Almqvist, Anna Karin Edstedt Bonamy, Rickard Ljung, Karl Michaëlsson, Martin Neovius, Olof Stephansson, and Weimin Ye. 2016. “Registers of the Swedish Total Population and Their Use in Medical Research.” European Journal of Epidemiology, 1–12. https://doi.org/10.1007/s10654-016-0117-y.

Ludvigsson, Jonas F., Petra Otterblad-Olausson, Birgitta U. Pettersson, and Anders Ekbom. 2009. “The Swedish Personal Identity Number: Possibilities and Pitfalls in Healthcare and Medical Research.” European Journal of Epidemiology 24 (11): 659–67. https://doi.org/10.1007/s10654-009-9350-y.

Ludvigsson, Jonas F., Pia Svedberg, Ola Olén, Gustaf Bruze, and Martin Neovius. 2019. “The Longitudinal Integrated Database for Health Insurance and Labour Market Studies (LISA) and Its Use in Medical Research.” European Journal of Epidemiology 34 (4): 423–37. https://doi.org/10.1007/s10654-019-00511-8.

Nelson, Stuart J., Ying Yin, Eduardo A. Trujillo Rivera, Yijun Shao, Phillip Ma, Mark S. Tuttle, Jennifer Garvin, and Qing Zeng-Treitler. 2024. “Are ICD Codes Reliable for Observational Studies? Assessing Coding Consistency for Data Quality.” DIGITAL HEALTH 10 (September): 20552076241297056. https://doi.org/10.1177/20552076241297056.

Nguyen, Andrew. 2022. Hands-on healthcare data: taming the complexity of real-world data. First edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.

Oliveira Andrade, Rodrigo de. 2025. “Huge Reproducibility Project Fails to Validate Dozens of Biomedical Studies.” Nature 641 (8062): 293–94. https://doi.org/10.1038/d41586-025-01266-x.

Peng, Roger D., and Stephanie C. Hicks. 2021. “Reproducible Research: A Retrospective.” Annual Review of Public Health 42 (Volume 42, 2021): 79–93. https://doi.org/10.1146/annurev-publhealth-012420-105110.

“Public Access and Secrecy | Swedish National Data Service.” 2025. https://snd.se/en/research-data-support/introduction-legal-aspects-research/public-access-and-secrecy.

Rodrigues, Bruno. 2023. “Building Reproducible Analytical Pipelines with R.” https://raps-with-r.dev/.

Staples, Timothy L. 2023. “Expansion and Evolution of the R Programming Language.” Royal Society Open Science 10 (4): 221550. https://doi.org/10.1098/rsos.221550.

“The Unix Shell: Summary and Setup.” 2026. https://swcarpentry.github.io/shell-novice/.

Vukovic, Jakov, Damir Ivankovic, Claudia Habl, and Jelena Dimnjakovic. 2022. “Enablers and Barriers to the Secondary Use of Health Data in Europe: General Data Protection Regulation Perspective.” Archives of Public Health 80 (1): 115. https://doi.org/10.1186/s13690-022-00866-7.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. n.d. “R for Data Science (2e).” https://r4ds.hadley.nz/.

Separate handouts for each lecture

EL1: Intro

Important aspects

Data – what is it?

Course structure

Different types of data

Usages

Register data

💶 Administrative data

🏥 Hospital background data

🩺 Clinical data

🤓 Individual background data

🗺️ Aggregated data

Inclusion/exclusion criteria

🦿Simple example

🥴 Complicated example

Coverage and completeness

What is recorded?

Data linking

Unique personal identifier

Combining data

Working with health care data

R as a tool but …

Our use of R

EL2: European legislation

Legal part

European legislation

EU law vs Swedish law

GDPR

European Union (EU)

European Economic Area (EEA)

United Kingdom (UK)

Switzerland

International laws

Definitions

Legal grounds for processing personal data:

Processing of special categories of personal data

But …

Safeguards

Technical Safeguards

Organisational Safeguards

Data Minimisation and Purpose Limitation

Pseudonymisation and Anonymisation

Data Controller

Data Processor (PUB) under GDPR

Controller–Processor Relationship

Example: Sahlgrenska

European Health Data Space (EHDS)

What is it?

Two Main Pillars

Implementation Timeline

How EHDS Relates to GDPR

Why EHDS Matters for Statisticians

EHDS Does Not Do This

EU Data Act

What Is It?

How the Data Act Relates to Health Data

EL3: Swedish legislation

Literature

Swedish legal system

Backgrund to Swedish law

Two main branches of law

Fundamental law

The Freedom of the Press Act (TF)

Confidentiality

The Public Access to Information and Secrecy Act (OSL)

OSL Chap 21

OSL Chap 24

OSL Chapter 25

OSL Chapter 10

Health care data

The Patient Data Act (PDL)

Chapter 7 PDL

The Health Data Registers Act

Specific registers

Other legislation

The Archives Data Act (ADL)

GDPR and Swedish law

Statistics and research?

The Ethical Review Act (EPL)

`README.md`

`data` folder

`R` folder

`reports` folder