(Nguyen 2022, ch. 3) on standardized vocabularies. You may skip the sections “CPT”, “LOINC”, “RxNorm” and “Using the Unified Medical Language System” (not examined within the course). Even if you skip those sections, remember to read the conclusions at the end of the chapter!
Alharbi, Isouard, and Tolchard (2021) provide a historical exposé of the development of medical coding, with a focus on the International Classification of Diseases (ICD).
Nelson et al. (2024) argue (based on a statistical analysis) that we should not put too much trust in coded data (you may skip the methods section).
Bindel and Seifert (2025) introduce the Anatomical Therapeutic Chemical (ATC) classification and some associated problems. Focus on the introduction and conclusion sections (results and discussion may be skipped).
Regular expressions: But please note that this is general practice, and deviations may exist between those exercises and R.
Overview
Standardized vocabularies, controlled vocabularies, terminologies and ontologies …
This is a field of its own (health informatics)
Let’s just call it “medical coding” for now.
Relevance
Imagine you are diagnosed with “cancer” (hope not …)
Your doctor writes that you have “kräfta” in your medical records
“kräfta” (Swedish) = cancer (Latin), although the astrological sign “cancer” is a “crab” (Latin does not distinguish the two)
She might as well write:
The patient was diagnosed with a malignant neoplasm of the colon.
Histology confirms invasive adenocarcinoma.
Evidence of metastatic disease to the liver.
Natural languages (English/Swedish/Latin etc) are not well suited for statistical analysis
Natural language processing (NLP) is nice but outside the scope of the course
Statisticians need clear definitions of diagnoses, procedures, medications etc.
Therefore, such information is encoded in a standardized way
Granularity/reliability
Cancer might be coded by an ICD-10 code (International Classification of Diseases v. 10) starting with “C” (or possibly “D”)
Cancer, however, is a very general term. Is it lung cancer, brain cancer, skin cancer etc.? (Those are very different.)
The more we learn about a disease, the more granularity we expect from the coding
The coding systems therefore tend to be quite complex, evolve over time and often have regional differences
Even though the intention of the coding system might be granular and precise, the data quality often relies on different coding practices in different hospitals etc.
The codes might also be misused for reimbursement purposes
In practice, the medical doctor might dictate a diagnosis, which then needs to be translated to a code by administrative staff
Example
The Swedish Hip Arthroplasty Register identified that one hospital appeared to have an unusually high number of patients recorded with severe respiratory problems
At first glance, this raised a clinical question: could hip problems somehow lead to serious breathing problems?
However, hip surgery is often performed under general anesthesia. During general anesthesia, patients are intubated and mechanically ventilated, which involves procedures related to the respiratory system.
It was eventually discovered that a procedural code related to anesthesia and airway management had been incorrectly registered as a severe respiratory diagnosis.
The apparent “complication” was therefore not a real clinical problem, but a coding error.
Lesson: Register data reflect coding practices. Without understanding how variables are defined and recorded, one may draw incorrect conclusions.
ICD – International Classification of Diseases
Maintained by the World Health Organization (WHO)
Global standard for coding diseases and causes of death
Used for:
Clinical documentation
Mortality statistics
Epidemiological research
Health system planning and monitoring
Historical Background
First version: 1893 (International List of Causes of Death)
WHO assumed responsibility in 1948 (ICD-6)
Major revisions approximately every 10–20 years
Each revision reflects:
Advances in medical knowledge
Changes in disease concepts
Administrative and reporting needs
ICD has evolved from a mortality list to a comprehensive disease classification.
Major ICD Versions
ICD-7 (used in many countries in the 1950s–1970s)
Still used in the Swedish cancer register for backward compatibility
ICD-8 (used in many countries in the 1960s–1980s)
ICD-9 (widely used until the early 2000s)
Also still used in the Swedish cancer register
ICD-10 (introduced in the 1990s; still dominant in many countries)
Used in Sweden since 1997; the version we currently care most about
ICD-11 (adopted in 2019; gradually being implemented)
Different countries adopted versions at different times, creating challenges for international comparisons.
National Modifications
Several countries use national adaptations:
ICD-10-CM (USA; Clinical Modification)
ICD-10-CA (Canada)
ICD-10-SE (Sweden)
A fifth position (ignoring the dot) is sometimes used for more granularity
ICD-10: S72.0 Fracture of neck of femur
ICD-10-SE: S72.00 Fracture of neck of femur, closed; S72.01 Fracture of neck of femur, open; S72.10 Pertrochanteric fracture, closed; S72.11 Pertrochanteric fracture, open, …
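When results should be comparable with WHO ICD-10, the extra national position can be dropped again. A minimal sketch in base R, assuming the codes are stored with a dot as in the examples above (the vector name icd_se is made up for illustration). It only removes the fifth position; whether the truncated code is a valid WHO code still has to be checked.

```r
# Hypothetical ICD-10-SE codes, written with a dot as in the examples above
icd_se <- c("S72.00", "S72.01", "S72.10", "S72.11", "S72.0", "S72")

# Keep at most the letter, two digits, the dot and one more digit,
# i.e. the first five characters of the string
icd_who <- substr(icd_se, 1, 5)

# Three-character category (here S72, fracture of femur)
icd_cat <- substr(icd_se, 1, 3)

table(icd_who)
```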
Feature | WHO ICD-10 | ICD-10-SE (Sweden) | ICD-10-CM (USA)
--- | --- | --- | ---
Maintained by | WHO | Swedish National Board of Health and Welfare (Socialstyrelsen) | U.S. National Center for Health Statistics (NCHS)
Primary purpose | Global disease classification | National clinical and statistical reporting | Clinical documentation and reimbursement
Level of detail | Moderate | More detailed than WHO ICD-10 | Much more detailed than WHO ICD-10
Additional digits | Typically 3–4 characters | Often includes 5th character extensions | Up to 7 characters
Laterality (right/left) | Usually not specified | Limited | Frequently specified
Encounter type (initial, follow-up, sequela) | Not included | Not included | Explicitly coded
Administrative focus | Epidemiology and mortality statistics | Clinical and national register reporting | Strongly tied to billing and reimbursement
International comparability | High (reference standard) | High within Nordic context, requires mapping internationally |
Other registers typically record only the version in use at the time of registration
If so, you might have to perform the crosswalk yourself after receiving the data
This applies if you want to look at longer time trends or combine data from different periods
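A crosswalk is in essence a lookup table from one version to another. The sketch below shows the mechanics only: the two-row table is a toy example at the three-character level, and the data frame and column names are made up. Real crosswalks are large, often many-to-many, and should come from an official source.

```r
# Toy crosswalk table (illustrative only, not an official mapping)
crosswalk <- data.frame(
  icd9  = c("410", "820"),
  icd10 = c("I21", "S72"),
  stringsAsFactors = FALSE
)

# Hypothetical register extract with codes from two periods
dat <- data.frame(
  id      = 1:4,
  version = c("ICD-9", "ICD-9", "ICD-10", "ICD-10"),
  code    = c("410", "820", "I21", "S72"),
  stringsAsFactors = FALSE
)

# Look up the ICD-10 equivalent for ICD-9 rows; keep ICD-10 rows as they are
hit <- match(dat$code, crosswalk$icd9)
dat$code_icd10 <- ifelse(dat$version == "ICD-9", crosswalk$icd10[hit], dat$code)

dat
```

ICD-9 codes without an entry in the crosswalk end up as NA in this sketch, which makes unmapped codes easy to spot.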
ICD-O
ICD-O (International Classification of Diseases for Oncology):
Used mainly in cancer registries
Combines:
Topography (tumour site)
Morphology (histology and behaviour)
Current commonly used version: ICD-O-3
Earlier versions: ICD-O, ICD-O-2
ICD-O is more detailed than ICD-10 for cancer incidence studies.
Relationship Between ICD-10 and ICD-O
ICD-10 commonly used for mortality and hospital discharge diagnoses
ICD-O primarily used for cancer registry incidence data
A cancer case may have:
An ICD-10 code in hospital data
An ICD-O morphology and topography code in a cancer registry
Researchers must understand which system underlies their dataset.
Variation in Coding
Differences between hospitals or regions may arise due to:
Coding training
A primary health care unit may encounter all possible diagnoses (wide but shallow knowledge), while a very specialized unit might have routines for a very narrow but detailed coding
Local guidelines
Regions are independent in Sweden
Administrative incentives
Reimbursement systems
Public and private health care providers may have different incentives
Electronic health record design
National registers often rely on combining multiple different sources
Somatic vs psychiatric care
In psychiatric care, diagnoses may sometimes be recorded with less specificity, potentially due to concerns about stigma or the sensitive nature of certain conditions
Registers capture both clinical events and coding behaviour.
Validity of ICD Codes
Important research question:
Does the code correspond to the true disease?
Validation studies compare ICD codes with a reference standard
(e.g., chart review, clinical registry, laboratory confirmation).
Key Measures of Validity
Sensitivity
Among patients who truly have the disease,
how many receive the correct ICD code?
→ Measures undercoding (missed cases).
Specificity
Among patients who do not have the disease,
how many are correctly not assigned the code?
→ Measures overcoding (false positives).
Positive Predictive Value (PPV)
Among patients assigned the ICD code,
how many truly have the disease?
→ Measures how reliable the code is for identifying true cases.
Low sensitivity → underestimated incidence
Low PPV → inflated case counts
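The three measures follow directly from a 2×2 table of ICD code against reference standard. A small sketch with made-up counts (the numbers do not come from any real validation study):

```r
# Made-up validation counts: ICD code (yes/no) against chart review (yes/no)
tp <- 80   # code present, disease confirmed
fp <- 20   # code present, disease not confirmed
fn <- 40   # code absent,  disease confirmed
tn <- 860  # code absent,  disease not confirmed

sensitivity <- tp / (tp + fn)  # share of true cases that received the code
specificity <- tn / (tn + fp)  # share of non-cases without the code
ppv         <- tp / (tp + fp)  # share of coded patients who are true cases

round(c(sensitivity = sensitivity, specificity = specificity, ppv = ppv), 3)
```

With these counts a third of the true cases are missed (sensitivity 2/3), while four out of five coded patients are true cases (PPV 0.8).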
Variation in validity may depend on:
Diagnosis (e.g., myocardial infarction vs mild depression)
Care setting (inpatient vs primary care)
Time period (coding changes)
ICD version and national modification
Not all ICD codes are equally reliable for research.
Practical Implications for Statisticians
Before analysis, always clarify:
Which ICD version?
Which national modification?
Which coding level (3-digit vs 4-digit)?
Has coding practice changed over time?
Are crosswalks required?
Is there validation evidence for the diagnosis?
ICD is a classification system. It is not identical to clinical truth. Transparent documentation of code selection is essential for reproducible research.
ATC for drugs
Anatomical Therapeutic Chemical (ATC) classification
categorizing therapeutic drugs,
structured into 14 main groups and 5 levels, with a disease-oriented focus
introduced in the 1960s
In 1980, the World Health Organization (WHO) recommended the ATC system as the “state of the art”
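Each of the 5 levels corresponds to a prefix of the full seven-character code (1, 3, 4, 5 and 7 characters). A minimal sketch in base R; the object names are made up for illustration:

```r
atc <- c("C09AA05", "N02BE01")

# Number of characters that make up each of the five ATC levels
level_width <- c(level1 = 1, level2 = 3, level3 = 4, level4 = 5, level5 = 7)

# One row per code, one column per level
atc_levels <- sapply(level_width, function(n) substr(atc, 1, n))
rownames(atc_levels) <- atc
atc_levels
```

Grouping drugs at a coarser level is then a matter of picking the corresponding column.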
Multidisciplinary team conference (MDT conference)
Smoking cessation counselling
Nutritional counselling
Physiotherapy interventions
Occupational therapy interventions
Psychotherapeutic treatment sessions
Structured patient education programmes
Palliative care planning
Structure
Alphanumeric codes (typically 5 characters)
First letter indicates anatomical or procedural group
Subsequent characters specify procedure type and detail
Example:
NFB49 – Primary total hip replacement
JKA20 – Appendectomy
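The structure can be used directly in R. The sketch below extracts the first letter and runs a loose format check. The pattern assumes three capital letters followed by two digits, which is the shape of the two examples above; other valid shapes may exist, so treat it as an assumption to adapt.

```r
kva <- c("NFB49", "JKA20")

# First letter: anatomical or procedural group
kva_group <- substr(kva, 1, 1)
kva_group

# Loose format check, assuming three capital letters followed by two digits
kva_ok <- grepl("^[A-Z]{3}[0-9]{2}$", kva)
kva_ok
```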
Purpose
Record surgical and certain non-surgical interventions
Used in:
National Patient Register
Quality registers
Reimbursement and administrative reporting
Health services research
Important Distinction
ICD-10-SE → Diagnosis codes
NOMESCO/KVÅ → Procedure codes
A patient record may therefore contain:
An ICD diagnosis (e.g., hip fracture)
A NOMESCO procedure code (e.g., hip replacement surgery)
Implications for Research
Diagnosis and procedure must not be confused
Trends in procedures may reflect:
Clinical practice changes
Technology changes
Policy and reimbursement incentives
International comparisons require awareness that other countries (e.g., USA) use different coding systems
NOMESCO codes capture what was done, not what disease the patient had.
DRG
DRG (Diagnosis-Related Groups) is a classification system used to group hospital cases into categories expected to require similar levels of resources.
In Sweden:
Based on the NordDRG system
Used for:
Hospital reimbursement
Resource allocation
Health care management
Productivity and efficiency analyses
How Is a DRG Determined?
A DRG is assigned based on a combination of:
Primary diagnosis (ICD-10-SE)
Secondary diagnoses
Procedure codes (NOMESCO/KVÅ)
Age
Sex
Discharge status
Presence of complications or comorbidities
DRG codes are therefore derived classifications, not primary clinical codes.
Implications for Research
DRG reflects resource use, not disease incidence.
Changes in reimbursement rules may influence coding behavior.
Regional comparisons must consider administrative incentives.
DRG is suitable for health services and economic analyses, but less appropriate for etiological research.
SNOMED CT – What Is It?
SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) is a large clinical terminology system.
Maintained by SNOMED International
Contains hundreds of thousands of clinical concepts
Designed for structured documentation in electronic health records
Unlike ICD or ATC, SNOMED CT is primarily a terminology, not a statistical classification.
Terminology vs Classification
System | Type | Purpose
--- | --- | ---
ICD | Classification | Epidemiology and health statistics
ATC | Classification | Drug classification
KVÅ / NOMESCO | Classification | Medical procedures
SNOMED CT | Terminology | Detailed clinical documentation
Classification systems simplify reality for statistics and reporting, while terminologies allow very detailed clinical descriptions.
Why SNOMED CT Is Not Widely Used in Registers
Despite its strengths, SNOMED CT is rarely used directly in:
national health registers
epidemiological statistics
Main reasons:
Too detailed for statistical aggregation
Harder to ensure consistent coding
Statistical reporting systems are built around ICD
Select all ICD-10 codes starting with "I21" (acute myocardial infarction)
Identify all ATC codes beginning with "C09" (agents acting on the renin-angiotensin system)
Check for malformed codes (quality control)
Extract codes embedded in free text (e.g., notes, text fields)
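The last task, extracting codes from free text, could look as follows in base R. This is a sketch with made-up notes. The pattern only describes the shape of an ICD-10 code, so any other string of the same shape would be extracted as well.

```r
notes <- c(
  "Admitted with chest pain, discharged with I21.4.",
  "Known I10, now I21.0 and I50.9",
  "No diagnosis recorded"
)

# One letter, two digits, optionally a dot and one digit.
# \\b marks a word boundary so that a match cannot start or end mid-word.
pattern <- "\\b[A-Z][0-9]{2}(\\.[0-9])?\\b"

# gregexpr() locates all matches, regmatches() extracts them (one list
# element per note; notes without a match give an empty character vector)
found <- regmatches(notes, gregexpr(pattern, notes, perl = TRUE))
found
```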
Basic Building Blocks
Symbol | Meaning
--- | ---
^ | Start of string
$ | End of string
. | Any character
* | 0 or more repetitions
+ | 1 or more repetitions
? | 0 or 1 repetition
{m,n} | Between m and n repetitions
[ABC] | Any of A, B, or C
[0-9] | Any digit
\\d | Any digit (PCRE)
Different versions
There are different implementations of regular expressions! The implementation in base R is described by ?base::regex in R. Perl-compatible regular expressions (PCRE) are a commonly used alternative; in the base functions they require the argument perl = TRUE.
Example: ICD-10 Structure
ICD-10 codes typically follow:
One letter
Two digits
Optional dot and additional digit
Regex pattern: ^[A-Z][0-9]{2}(\\.[0-9])?$
where \\ is used to remove the special meaning of . as described above. Hence, \\. is interpreted as a literal . in the character string. In R:
grepl("^[A-Z][0-9]{2}(\\.[0-9])?$", icd) # base
stringr::str_detect(icd, "^[A-Z][0-9]{2}(\\.[0-9])?$")
# Faster and using 10 CPU cores in parallel (only relevant for "big enough data"):
stringfish::sf_grepl(icd, "^[A-Z][0-9]{2}(\\.[0-9])?$", nthreads = 10L)
Example: Select a Diagnosis Group
All acute myocardial infarction codes: ^I21
In R: stringr::str_detect(icd, "^I21")
This selects:
I21
I21.0
I21.9
But not:
I20
I22
ATC Code Structure
Example: C09AA05
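Regex pattern: ^[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}$
In R: stringr::str_detect(atc, "^[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}$")
This matches any string with the format of a complete (seven-character) ATC code. It checks the format only, not whether the code actually exists.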
You might receive data with a variable supposed to contain only ATC codes
It might as well contain other information such as ??, don't know, XXXXXXX etc
You might replace such character strings by <NA>
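Replacing such strings could be sketched as follows, reusing the ATC pattern from above (the example vector is made up). Note that shorter codes for higher ATC levels, such as C09, would also be set to NA by this strict pattern.

```r
atc <- c("C09AA05", "??", "don't know", "XXXXXXX", "N02BE01", NA)

pattern <- "^[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}$"

# grepl() returns FALSE for NA input, so missing values stay missing
atc_clean <- ifelse(grepl(pattern, atc), atc, NA_character_)
atc_clean
```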
Implementations
Base R uses the TRE engine by default and PCRE with perl = TRUE, whereas {stringr} is built on {stringi} and the ICU engine
{stringr} is, however, more “user friendly”.
{stringfish} seems technically superior but is less maintained (more of a hobby project).
Common Mistakes
Forgetting ^ when matching prefixes
This is problematic even in the stringfish::sf_starts() implementation! See bug report.
Forgetting to escape .
Not validating full string with $
Overmatching (e.g., I2 instead of ^I21)
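The mistakes are easiest to see side by side. A small sketch with a made-up vector that contains two deliberately odd strings ("XI21.9" and "I21X0"):

```r
icd <- c("I21", "I21.0", "I20", "I22", "XI21.9", "I21X0")

# Overmatching: "I2" hits every string in this vector
hit_i2 <- grepl("I2", icd)

# No anchor: "I21" is also found in the middle of "XI21.9"
hit_unanchored <- grepl("I21", icd)

# Anchored prefix: "XI21.9" is excluded, but "I21X0" still passes
# because the rest of the string is not validated
hit_prefix <- grepl("^I21", icd)

# Unescaped dot matches any character, so "I21X0" looks like "I21.0"
hit_dot_any <- grepl("^I21.0$", icd)

# Escaped dot and both anchors: only the exact code "I21.0" matches
hit_exact <- grepl("^I21\\.0$", icd)

data.frame(icd, hit_i2, hit_unanchored, hit_prefix, hit_dot_any, hit_exact)
```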
Standardised Groupings
To account for overall disease burden, researchers often use established grouping systems, such as:
Charlson Comorbidity Index (ICD)
Elixhauser Comorbidity Index (ICD)
Similar groupings of ATC-codes
Combinations of those
These indices:
Aggregate multiple ICD codes into clinically meaningful comorbidity categories
Are commonly used for:
Risk adjustment
Prognostic modelling
Confounding control in observational studies
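The principle behind such groupings can be sketched with regular expressions in base R. The category definitions below are simplified and purely illustrative; they are not the validated Charlson or Elixhauser code lists, and the data are made up.

```r
# Hypothetical patient-level diagnosis data (long format)
dx <- data.frame(
  id  = c(1, 1, 2, 3, 3),
  icd = c("I21.0", "E11.9", "I50.9", "J44.1", "I21.9"),
  stringsAsFactors = FALSE
)

# Simplified, illustrative category definitions (NOT the validated
# Charlson or Elixhauser code lists)
categories <- c(
  myocardial_infarction = "^I2[12]",
  heart_failure         = "^I50",
  diabetes              = "^E1[0-4]",
  copd                  = "^J4[34]"
)

# One logical column per category: does the patient have any matching code?
flags <- sapply(categories, function(p) tapply(grepl(p, dx$icd), dx$id, any))
flags

# Number of categories per patient (unweighted, for illustration only)
rowSums(flags)
```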
{decoder}
The R package {decoder} provides descriptions for many commonly used coding systems.
In register data, you often only have the raw codes (e.g., ICD, ATC), without textual labels.
{decoder} allows you to translate codes into meaningful descriptions (in Swedish or English), making interpretation easier and more transparent.
Up-to-date?
I am the maintainer of {decoder} and {coder}, but I have not had the time or energy to update them for a couple of years. There are some reported issues.
{coder}
The R package {coder} can be used to aggregate individual diagnosis codes into broader clinical categories.
Common applications include:
Charlson Comorbidity Index
Elixhauser Comorbidity Index
Other diagnosis-based groupings
This allows:
Standardised comorbidity adjustment
(Sort of/relatively …) transparent and reproducible case definitions
Consistent grouping across studies (hopefully)
Alharbi, Musaed Ali, Godfrey Isouard, and Barry Tolchard. 2021. “Historical Development of the Statistical Classification of Causes of Death and Diseases.” Edited by Rahman Shiri. Cogent Medicine 8 (1): 1893422. https://doi.org/10.1080/2331205X.2021.1893422.
Bindel, Lilly Josephine, and Roland Seifert. 2025. “Problems Associated with the ATC System of Drug Classification.” Naunyn-Schmiedeberg’s Archives of Pharmacology, December. https://doi.org/10.1007/s00210-025-04833-1.
Nelson, Stuart J., Ying Yin, Eduardo A. Trujillo Rivera, Yijun Shao, Phillip Ma, Mark S. Tuttle, Jennifer Garvin, and Qing Zeng-Treitler. 2024. “Are ICD Codes Reliable for Observational Studies? Assessing Coding Consistency for Data Quality.” DIGITAL HEALTH 10 (September): 20552076241297056. https://doi.org/10.1177/20552076241297056.
Nguyen, Andrew. 2022. Hands-on healthcare data: taming the complexity of real-world data. First edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. n.d. “R for Data Science (2e).” https://r4ds.hadley.nz/.