This lecture will not be examined. You are encouraged to experiment with the concepts in your project work if you find them useful, but this is not required.
Big data
Biggish data
Our focus:
Only tabular data
Original data might fit into physical RAM but …
might be multiplied by temporary copies created during processing
physical RAM might be constrained due to other processes/settings
things might just take too much time … and life is short …
Reading large datasets
When working with large datasets, where the data is stored matters.
Network storage (e.g. shared drives)
slower data transfer (limited bandwidth)
higher latency (delay before data starts loading)
multiple users may compete for resources
repeated reads can be very inefficient
Local storage
much faster read/write speeds
low latency
more stable and predictable performance
better suited for iterative data analysis
Secure environment
In secure environments, both types of storage may look local, but they behave differently.
“Local” disk (e.g. SSD on the server/VM)
attached directly to the machine you are working on
high bandwidth, low latency
behaves like true local storage
typically much faster for data analysis
Mapped network folder
accessed over the network (even if it looks like a normal folder)
lower bandwidth and higher latency
shared with other users
slower, especially for repeated reads
💡 Key difference:
Not where the data is stored, but how it is accessed (direct disk vs network).
💡 Practical advice:
Use “local” disk (SSD on the environment) for active analysis, and network storage for long-term storage.
Practical implication
For large datasets:
avoid repeatedly reading data directly from network drives
copy data locally when possible
fs::file_copy(path, new_path)
(“libuv provides a wide variety of cross-platform sync and async file system operations.”)
obviously only if “local drive” is also in the (same) secure environment!
💡 Key message:
Data access can easily become the main bottleneck — not your code.
Modern R is 64-bit → can use large amounts of memory
(32-bit R was limited to ~4 GB — mostly obsolete today)
Practical advice
check memory: ps::ps_system_memory()
avoid unnecessary copies of large objects
Use {data.table} for reference semantics
or Parquet files to read only the necessary data
consider pipelines (targets) or chunked processing
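The "avoid unnecessary copies" point can be made visible with base R's tracemem(), which reports whenever an object is duplicated (a minimal sketch, base R only):

```r
x <- rnorm(1e6)   # ~8 MB vector
tracemem(x)       # report whenever x gets duplicated

y <- x            # no copy yet: x and y share the same memory
y[1] <- 0         # copy-on-modify: tracemem() now reports a duplication

untracemem(x)
```

tracemem() is available in the standard CRAN builds of R; the key point is that assignment alone is cheap, and the full copy only happens on modification.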
Numeric vs integer in R
R’s default numeric type is double precision (numeric).
x <- 1
typeof(1)  # "double"
typeof(1L) # weird syntax to get "integer"
Why this matters
integers use less memory (4 bytes vs 8 bytes)
can be important for large datasets
100 million values:
numeric ≈ 800 MB
integer ≈ 400 MB
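These numbers are easy to verify with base R's object.size() on a smaller vector (1 million values; a minimal sketch):

```r
n <- 1e6
x_num <- rnorm(n)                             # doubles: 8 bytes per value
x_int <- sample.int(100L, n, replace = TRUE)  # integers: 4 bytes per value

object.size(x_num)  # roughly 8 MB
object.size(x_int)  # roughly 4 MB, half the footprint
```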
In practice
R often converts to numeric automatically
some tools (e.g. data.table) use integers efficiently
compare as.IDate() vs as.Date()
useful when working with:
IDs
categories
counts
💡 Key message:
Choosing the right data type can significantly reduce memory usage.
ALTREP (Alternative Representation)
R can represent some objects in a compact or lazy way.
introduced in R 3.5.0
avoids allocating full memory immediately
x <- 1:1e9
lobstr::obj_size(x) # actual memory used
680 B
format(object.size(x), units = "GB") # reserved memory
[1] "3.7 Gb"
# as.numeric(x) # will use the reserved memory
What happens?
data is generated on demand
not all values are stored in memory
💡 Key message:
Not all objects in R are fully materialised — some are computed when needed.
Other memory optimisations in R
R uses several mechanisms to reduce memory usage (not only ALTREP).
Shared strings (string interning)
identical strings may be stored only once
reduces memory when many values are repeated
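A minimal base-R sketch of the effect: a vector repeating one string stores the character data only once, while a vector of distinct strings cannot share anything:

```r
x <- rep("category_A", 1e5)       # 1e5 pointers to one shared string
y <- paste0("id_", seq_len(1e5))  # 1e5 distinct strings, stored separately

object.size(x)  # small: essentially just the pointers
object.size(y)  # several times larger
```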
But memory still fills up…
Even with smart representations:
many operations still create real objects
temporary objects accumulate
👉 R still needs to free memory
Garbage collection
When you work in R, memory is constantly used for intermediate objects.
many operations create temporary results
these objects are no longer needed after a step is completed
but R does not always remove them immediately
What is the problem?
memory can fill up with objects that are no longer used
there are no active references (“pointers”) to these objects
but they still occupy memory
Solution: garbage collection
R periodically identifies objects that are no longer reachable
these are removed, and memory is freed
GC is triggered when R needs more memory
may cause short pauses during execution
Practical advice
rm(large_object) # remove objects you no longer need (still takes up memory)
gc() # manual garbage collection (free the memory)
dt[, new := f(old)] # reference semantics by `{data.table}` avoids "hidden" objects
💡 Key message:
You rarely need to manage memory explicitly — but inefficient code can still use too much of it.
Unnecessary computations for large objects
Sometimes R (or the computer hardware) is not the problem, but the IDE might be.
Common for RStudio (RStudio and R both run in the same process)
{mirai}
Supposed to be up to 1,000 times “faster” compared to earlier methods
(according to some metric …)
should not block the main process while executing (haven’t tried)
Designed for simplicity, a ‘mirai’ evaluates an R expression asynchronously in a parallel process, locally or distributed over the network, with the result automatically available upon completion.
library(mirai)
library(data.table)
daemons(11)
# data.table with 1 billion rows:
dt <- data.table(x = seq_len(1e9), y = rnorm(1e9))
# one sum per column (a data.table is a list of columns)
mirai_map(dt, sum)[.flat]
{purrr}
Introduced in_parallel() in 2025
Based on mirai but without the specialised syntax
Needs to be explicit on which objects/functions to export to each worker/daemon
library(purrr)
library(mirai)

# Set up parallel processing (6 background processes)
daemons(6)

# Sequential version
mtcars |> map_dbl(\(x) mean(x))
#>        mpg        cyl       disp         hp       drat         wt       qsec
#>  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
#>         vs         am       gear       carb
#>   0.437500   0.406250   3.687500   2.812500

# Parallel version - just wrap your function with in_parallel()
mtcars |> map_dbl(in_parallel(\(x) mean(x)))
#>        mpg        cyl       disp         hp       drat         wt       qsec
#>  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750
#>         vs         am       gear       carb
#>   0.437500   0.406250   3.687500   2.812500

# Don't forget to clean up when done
daemons(0)
{targets}
The targets package does implement parallel processing efficiently (based on {mirai} via {crew})
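A minimal sketch of what this looks like in a _targets.R configuration file, assuming the {targets} and {crew} packages are installed; read_data() and fit_model() are hypothetical placeholder functions:

```r
# _targets.R
library(targets)

tar_option_set(
  # run targets on 4 local worker processes via {crew}
  controller = crew::crew_controller_local(workers = 4)
)

list(
  tar_target(raw, read_data()),      # read_data() is a placeholder
  tar_target(model, fit_model(raw))  # independent targets run in parallel
)
```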