EL10: Biggish data

Examination

This lecture will not be examined. You are encouraged to experiment with the concepts in your project work if you find them useful, but this is not required.

Big data

Biggish data

Our focus:

  • Only tabular data
  • Original data might fit into physical RAM but …
  • might be multiplied by temporary copies during processing
  • physical RAM might be constrained due to other processes/settings
  • things might just take too much time … and life is short …

Reading large datasets

When working with large datasets, where the data is stored matters.

Network storage (e.g. shared drives)

  • slower data transfer (limited bandwidth)
  • higher latency (delay before data starts loading)
  • multiple users may compete for resources
  • repeated reads can be very inefficient

Local storage

  • much faster read/write speeds
  • low latency
  • more stable and predictable performance
  • better suited for iterative data analysis

Secure environment

In secure environments, both types of storage may look local, but they behave differently.

  • “Local” disk (e.g. SSD on the server/VM)
    • attached directly to the machine you are working on
    • high bandwidth, low latency
    • behaves like true local storage
    • typically much faster for data analysis
  • Mapped network folder
    • accessed over the network (even if it looks like a normal folder)
    • lower bandwidth and higher latency
    • shared with other users
    • slower, especially for repeated reads

💡 Key difference:
Not where the data is stored, but how it is accessed (direct disk vs network).

💡 Practical advice:
Use “local” disk (SSD on the environment) for active analysis, and network storage for long-term storage.

Practical implication

For large datasets:

  • avoid repeatedly reading data directly from network drives
  • copy data locally when possible
    • fs::file_copy(path, new_path)
    • (“libuv provides a wide variety of cross-platform sync and async file system operations.”)
    • obviously only if “local drive” is also in the (same) secure environment!
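A minimal sketch of this pattern, using {fs} (the package the quote above refers to). The paths are hypothetical placeholders, adjust them to your environment:

```r
# Copy a file from a network share to local disk before analysis.
# The paths below are hypothetical examples.
library(fs)

network_path <- "N:/project/data.parquet" # mapped network folder
local_path   <- "C:/temp/data.parquet"    # local SSD

if (!file_exists(local_path)) {
  file_copy(network_path, local_path)
}
# ...then read repeatedly from the fast local copy
```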

💡 Key message:
Data access can easily become the main bottleneck — not your code.

HDD (Hard Disk Drive)

  • mechanical (spinning disks)
  • slower, especially for random access
  • typical speeds:
    • ~50–150 MB/s (sequential read)
  • larger capacity at lower cost
  • often used in older systems or for backups

SSD (Solid State Drive)

  • no moving parts
  • much faster and more reliable
  • typical speeds:
    • ~500 MB/s (SATA SSD)
    • ~2000–7000 MB/s (NVMe SSD)
  • standard in most modern laptops

Typical sizes

  • laptops: 256 GB – 1 TB SSD
  • desktops / servers: 1–4 TB SSD + optional HDD storage
  • network drives: often very large but slower

RAM and data analysis in R

When working in R, available RAM (memory) is often the main limiting factor.

  • R can typically use most of the available RAM on your machine
  • some memory is needed by:
    • the operating system
    • other applications
  • a safe rule of thumb is to assume ~50–75% of total RAM is available for R

How large a dataset can you work with?

  • datasets must fit in memory (unless using special tools)
  • but you also need memory for:
    • intermediate objects
    • copies created during transformations
    • model objects
  • many R operations create temporary copies of data
  • memory usage can double or triple during processing
  • running out of RAM leads to:
    • slow performance
    • crashes
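Base R's tracemem() makes these hidden copies visible. A small sketch of R's copy-on-modify behaviour:

```r
x <- runif(1e6)
tracemem(x)   # start tracing copies of x

y <- x        # no copy yet: x and y share the same memory
y[1] <- 0     # modifying y triggers a full copy (tracemem prints a message)

untracemem(x)
```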

Example

Tip

Rule of thumb: you can usually work comfortably with data that is at most ~1/3 to 1/2 of your available RAM

If you have:

  • 16 GB RAM → usable ≈ 8–12 GB
    → practical dataset size ≈ 3–6 GB
  • Modern R is 64-bit → can use large amounts of memory
    • (32-bit R was limited to ~4 GB — mostly obsolete today)

Practical advice

  • check memory: ps::ps_system_memory()
  • avoid unnecessary copies of large objects
    • Use {data.table} for reference semantics
    • or Parquet files to read only the necessary data
  • consider pipelines (targets) or chunked processing
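A quick sketch of the memory check mentioned above (field names as documented in the {ps} package):

```r
# Check how much RAM is available before loading a large dataset
mem <- ps::ps_system_memory()
mem$total / 1024^3 # total RAM in GB
mem$avail / 1024^3 # currently available RAM in GB
```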

Numeric vs integer in R

R’s default numeric type is double precision (numeric).

x <- 1
typeof(x) # "double"
typeof(1L) # the L suffix gives an "integer"

Why this matters

  • integers use less memory (4 bytes vs 8 bytes)
  • can be important for large datasets
  • 100 million values:
    • numeric ≈ 800 MB
    • integer ≈ 400 MB
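The per-value sizes are easy to verify directly with base R's object.size():

```r
n <- 1e6
object.size(numeric(n)) # ~8 MB: 8 bytes per double
object.size(integer(n)) # ~4 MB: 4 bytes per integer
```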

In practice

  • R often converts to numeric automatically
  • some tools (e.g. data.table) use integers efficiently
    • compare as.IDate() vs as.Date()
  • useful when working with:
    • IDs
    • categories
    • counts

💡 Key message:
Choosing the right data type can significantly reduce memory usage.

ALTREP (Alternative Representation)

R can represent some objects in a compact or lazy way.

  • introduced in R 3.5.0
  • avoids allocating full memory immediately
x <- 1:1e9
lobstr::obj_size(x) # actual memory used
680 B
format(object.size(x), units = "GB") # reserved memory
[1] "3.7 Gb"
# as.numeric(x) # would materialise the full vector in memory

What happens?

  • data is generated on demand
  • not all values are stored in memory

💡 Key message:
Not all objects in R are fully materialised — some are computed when needed.

Other memory optimisations in R

R uses several mechanisms to reduce memory usage (not only ALTREP).

  • Shared strings (string interning)
    • identical strings may be stored only once
    • reduces memory when many values are repeated
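A small sketch showing the effect: lobstr::obj_size() accounts for sharing, so a vector of repeated strings is far smaller than its nominal size.

```r
# 100,000 copies of the same string: the string itself is stored once,
# the vector only holds references to it
x <- rep("a fairly long repeated string value", 1e5)
lobstr::obj_size(x) # far less than 100,000 * the string's size
```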

But memory still fills up…

Even with smart representations:

  • many operations still create real objects
  • temporary objects accumulate

👉 R still needs to free memory

Garbage collection

When you work in R, memory is constantly used for intermediate objects.

  • many operations create temporary results
  • these objects are no longer needed after a step is completed
  • but R does not always remove them immediately

What is the problem?

  • memory can fill up with objects that are no longer used
  • there are no active references (“pointers”) to these objects
  • but they still occupy memory

Solution: garbage collection

  • R periodically identifies objects that are no longer reachable
  • these are removed, and memory is freed
  • GC is triggered when R needs more memory
  • may cause short pauses during execution

Practical advice

rm(large_object) # remove the binding (the memory is not freed yet)
gc() # manual garbage collection (frees the memory)
dt[, new := f(old)] # reference semantics in {data.table} avoid "hidden" copies

💡 Key message:
You rarely need to manage memory explicitly — but inefficient code can still use too much of it.

Unnecessary computations for large objects

  • Sometimes R (or the computer hardware) is not the problem, but the IDE might be.
  • Common for RStudio (RStudio and R both run in the same process)
    • Memory problems might make RStudio crash
  • Positron separates the R process from the IDE
    • Better, but still not perfect

Still open?

CPU

The Central Processing Unit (CPU) determines how fast computations are performed.

  • CPUs work in clock cycles (e.g. 3 GHz ≈ 3 billion cycles per second)
  • CPU matters most when:
    • running models (e.g. regression, mixed models)
    • performing simulations
    • using loops or inefficient code
  • CPU matters less when:
    • reading data (disk/network bottleneck)
    • working with very large objects (RAM bottleneck)

Note

Writing efficient code (vectorisation) often matters more than CPU speed. This also applies to R packages on CRAN!

How R uses the CPU

  • R is often single-threaded
  • many operations use only one core
  • some libraries can use multiple cores (parallel computing)

How many cores do you have?

benchmarkme::get_cpu()
$vendor_id
character(0)

$model_name
[1] "Apple M2 Max"

$no_of_cores
[1] 12
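If you do not want an extra dependency, base R can answer the same question (a sketch using the {parallel} package, which ships with R):

```r
parallel::detectCores()                 # logical cores
parallel::detectCores(logical = FALSE)  # physical cores
```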

Vectorization in R

  • R is designed to work efficiently with vectors, not loops.
  • In R, what looks like a scalar is actually a vector of length 1
  • vectorized operations apply to whole objects at once
  • loops process one element at a time (often slower)
x <- rnorm(1e6)
y <- rnorm(1e6)

# slow: explicit loop over elements
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}

# fast: vectorized
result <- x + y

Why is vectorization faster?

Vectorized operations in R are usually implemented in compiled C code.

  • R code (e.g. for loops) is interpreted → each step has overhead

  • vectorized operations (e.g. x + y) are:

    • implemented in C
    • run as tight, optimized loops
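You can see the difference yourself with a quick timing sketch (exact numbers depend on your machine, so none are claimed here):

```r
x <- rnorm(1e7)
y <- rnorm(1e7)

# interpreted loop: overhead on every iteration
system.time({
  result <- numeric(length(x))
  for (i in seq_along(x)) result[i] <- x[i] + y[i]
})

# vectorized: one call into compiled C code
system.time(result <- x + y)
```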

Parallel computing

Parallel computing means using multiple CPU cores at the same time.

  • most R code runs on a single core
  • your computer may have many cores (e.g. 4–16)
  • a shared server such as TRE might have several hundred!

Why use parallel computing?

  • to speed up computationally intensive tasks
  • especially when tasks can be done independently

💡 Key message:
Parallel computing allows you to do multiple computations at once — but only if the problem can be split.

Parallelisation

  • Some packages and functions may use multithreading by default
    • But this is unusual
  • Some functions have arguments such as nthreads, ncores etc.
    • You may want to use them
  • Handling multicore/multithreaded computation in general can be (very) difficult (I’ve heard)

Split-apply-combine

Parallel computing works best when:

  • tasks are independent
  • tasks are computationally heavy
  • tasks can be repeated many times

Examples

  • simulations
  • bootstrapping
  • cross-validation
  • applying a function many times

💡 If tasks depend on each other, parallelisation will not help.
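Bootstrapping is a textbook fit: every replicate is independent, so the work splits cleanly across cores. A sketch using {parallel} (note that mclapply() relies on forking and does not run in parallel on Windows; use parLapply() there):

```r
library(parallel)

x <- rnorm(1000)

# 1000 independent bootstrap replicates of the mean, on 4 cores
boot_means <- mclapply(1:1000, function(i) {
  mean(sample(x, replace = TRUE))
}, mc.cores = 4)

quantile(unlist(boot_means), c(0.025, 0.975)) # bootstrap interval
```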

When parallel computing does NOT help

Parallel computing is often not the solution.

  • reading data → bottleneck is disk/network
  • large objects → bottleneck is RAM
  • inefficient code → vectorization is better

Example in R

lapply(1:100, function(i) slow_function(i)) # Sequential
parallel::mclapply(1:100, slow_function, mc.cores = 4) # Parallel

What happens?

  • each core processes part of the work
  • results are combined at the end

💡 More cores ≠ always faster

Limitations of parallel computing

  • overhead (starting processes takes time)
  • copying data between processes
  • limited by number of cores
  • not all functions are parallel-friendly

{mirai}

  • Supposed to be up to 1,000 times “faster” than earlier approaches
    • (according to some metric …)
  • should not block the main process while executing (haven’t tried)

Designed for simplicity, a ‘mirai’ evaluates an R expression asynchronously in a parallel process, locally or distributed over the network, with the result automatically available upon completion.

library(mirai)
library(data.table)
daemons(11)

# data.table with 1 billion rows:
dt <- data.table(x = seq_len(1e9), y = rnorm(1e9))
# column sums (a data.table is a list of columns)
mirai_map(dt, sum)[.flat]

{purrr}

  • Introduced in_parallel() in 2025
  • Based on {mirai} but without the specialised syntax
  • You need to be explicit about which objects/functions to export to each worker/daemon
library(purrr)
library(mirai)

# Set up parallel processing (6 background processes)
daemons(6)

# Sequential version
mtcars |> map_dbl(\(x) mean(x))
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

# Parallel version - just wrap your function with in_parallel()
mtcars |> map_dbl(in_parallel(\(x) mean(x)))
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

# Don't forget to clean up when done
daemons(0)

{targets}

  • The targets package does implement parallel processing efficiently (based on {mirai} via {crew})
  • covered in the R medicine workshop (see EL5)

Note

  • Data must be copied to each parallel worker
  • If every worker needs all the data, it is multiplied in memory
    • This might not be possible for big data
    • It also takes time
  • Hence, there is no guarantee that the overall time will decrease
  • Often a trade-off between the number of workers/threads/daemons/cores and efficiency etc.
  • “Standard” computations in {data.table} are optimised for a single core
  • Message handling, progress bars etc. are complicated
    • So is random number generation

GPU

A Graphics Processing Unit (GPU) is designed for massively parallel computations.

  • many simple cores (thousands)
  • optimized for performing the same operation many times
  • originally developed for graphics

How is it different from a CPU?

  • CPU: few powerful cores → general-purpose tasks
  • GPU: many simple cores → parallel tasks

When are GPUs useful?

  • machine learning / deep learning
  • large matrix operations
  • simulations that can be parallelised

In R

  • most R code does not use the GPU
  • standard packages are CPU-based
  • GPU requires specialized tools (e.g. torch, tensorflow, gpuR)

💡 Key message:
GPUs can be extremely powerful — but only for specific types of problems.

In most R workflows, CPU, RAM, and data access matter much more.

Profiling

  • The first version of an R script might be inefficient
  • Optimisation is difficult
  • It might be suboptimal if done ad hoc
  • {profvis} visualises time and memory usage for each step
  • improve the most important step first
    • For example, change {base} and {tidyverse} code to {data.table}
    • set keys
    • efficient handling of dates and strings
  • iterate
  • Integrated in RStudio (works in Positron too)
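A minimal sketch of a {profvis} run. Wrap the code you want to profile; the resulting flame graph shows time and memory per call, so you can target the slowest step first. (The pipeline below is an illustrative example, not from the lecture.)

```r
library(profvis)

profvis({
  # an intentionally naive pipeline to profile
  df <- data.frame(g = sample(letters, 1e6, replace = TRUE),
                   x = runif(1e6))
  agg <- aggregate(x ~ g, data = df, FUN = mean) # candidate for {data.table}
})
```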

Modify packages

  • Functions from packages might be inefficient
  • Profile those as well
  • Clone package from GitHub if available or download source code from CRAN
  • Improve
  • Just load the modified function into the global environment
    • might need to modify internal calls to :::-accessed functions
  • Or rebuild/install the package if more convenient

Notify maintainer

If you find ways to improve a package, the maintainer might be eager to hear! Suggest pull request or open GitHub issue!

C-code changes

  • If the package uses C code, it is probably already very fast
  • If not, and you can fix it, the package must be re-compiled
  • This is OS dependent
  • Might be tricky in restricted environments (TRE)
  • Can still be done via rhub (you might need to change the maintainer e-mail temporarily)