The default simpr workflow is easy to understand but computationally inefficient. Although simpr prioritizes ease of use over computational speed, two things together can make simpr more efficient: changing the order of evaluation and using parallel processing.
Consider a standard simpr workflow:
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>%
  define(n = c(100, 200)) %>%
  generate(10) %>%
  fit(lm = ~ lm(b ~ a)) %>%
  tidy_fits()
An issue with this workflow is that it involves shuttling a lot of data around: the generate() step adds a list-column of simulated data to a tibble, which is then sent to fit(), which adds a list-column containing large model objects, and then all of these are sent to tidy_fits() for extracting essential model information.
Instead, you can simply place the call to generate()
later in the chain:
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>%
  define(n = c(100, 200)) %>%
  fit(lm = ~ lm(b ~ a)) %>%
  tidy_fits() %>%
  generate(10)
This means that the data is generated, fit, and tidied all at once when you call generate(). As a result, these steps can occur on a single parallel worker without pushing lots of data around.
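To get a rough sense of how much data the default ordering moves around, you can compare the size of a fitted model object with the size of its coefficient table. The following is a base-R sketch only: it does not use simpr, and `coef(summary(fit))` is an assumed stand-in for the kind of compact output that tidy_fits() produces.

```r
# Base-R sketch: how large is a model object compared with its summary table?
# Assumption: coef(summary(fit)) stands in for the tidied output.
set.seed(1)

n <- 200
a <- rnorm(n)
sim_data <- data.frame(a = a, b = a + rnorm(n))

fit <- lm(b ~ a, data = sim_data)
coef_table <- coef(summary(fit))  # estimates, std. errors, t and p values

size_data  <- object.size(sim_data)
size_fit   <- object.size(fit)
size_table <- object.size(coef_table)

# Print the three sizes side by side
print(size_data,  units = "Kb")
print(size_fit,   units = "Kb")
print(size_table, units = "Kb")
```

The model object carries the residuals, fitted values, and model frame along with it, so it is much larger than the small table that is ultimately kept.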
Behind the scenes, before generate() is called, simpr simply stores successive commands in the specification object. When generate() is called, these commands are executed in order. Data munging with dplyr or tidyr, whether after per_sim() or on the final tidied data, is also supported in this workflow. Below, the data is specified, and the commands for reshaping, fitting, tidying, and selecting columns from the tidied output are all written before generate() and are executed only when it is called:
specify(control = ~ rnorm(n, mean = 0),
        intervention_1 = ~ rnorm(n, mean = 0.2),
        intervention_2 = ~ rnorm(n, mean = 2)) %>%
  define(n = c(6, 12)) %>%
  per_sim() %>%
  pivot_longer(cols = everything(),
               names_to = "group",
               values_to = "response") %>%
  fit(lm = ~ lm(response ~ group)) %>%
  tidy_fits() %>%
  select(.sim_id, n, term, estimate) %>%
  generate(2)
#> # A tibble: 12 × 4
#>    .sim_id     n term                estimate
#>      <int> <dbl> <chr>                  <dbl>
#>  1       1     6 (Intercept)         -0.217
#>  2       1     6 groupintervention_1 -0.00835
#>  3       1     6 groupintervention_2  2.82
#>  4       2    12 (Intercept)         -0.168
#>  5       2    12 groupintervention_1  0.599
#>  6       2    12 groupintervention_2  2.14
#>  7       3     6 (Intercept)         -0.220
#>  8       3     6 groupintervention_1  0.839
#>  9       3     6 groupintervention_2  2.12
#> 10       4    12 (Intercept)          0.401
#> 11       4    12 groupintervention_1  0.251
#> 12       4    12 groupintervention_2  1.60
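The idea of storing commands first and running them later can be sketched in a few lines of base R. This is an illustration of deferred execution in general, not simpr's actual implementation; `new_recipe()`, `add_step()`, and `run_recipe()` are hypothetical helpers invented for this example.

```r
# Hypothetical helpers illustrating deferred execution.
# These are NOT simpr internals; they only mimic the "store now, run later" idea.

# An empty recipe: just a list of steps that have not been run yet
new_recipe <- function() {
  list(steps = list())
}

# Record a step (a function of one argument) without executing it
add_step <- function(recipe, f) {
  recipe$steps <- c(recipe$steps, list(f))
  recipe
}

# Execute every recorded step, in order, once per repetition.
# Each step receives the result of the previous step.
run_recipe <- function(recipe, reps) {
  lapply(seq_len(reps), function(i) {
    value <- NULL
    for (step in recipe$steps) {
      value <- step(value)
    }
    value
  })
}

recipe <- new_recipe()

# Step 1: simulate data (ignores its input)
recipe <- add_step(recipe, function(ignored) {
  a <- rnorm(100)
  data.frame(a = a, b = a + rnorm(100))
})
# Step 2: fit a model to the simulated data
recipe <- add_step(recipe, function(d) lm(b ~ a, data = d))
# Step 3: keep only the coefficients
recipe <- add_step(recipe, function(m) coef(m))

# Nothing has been simulated or fit yet; this call does all the work
results <- run_recipe(recipe, reps = 3)
```

Because each repetition runs all of its steps in a single function call, only the small final result has to be returned, which is what makes this pattern a good fit for parallel workers.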
Changing the evaluation order makes little difference on its own, but combined with parallel processing it can produce a speedup. simpr uses the furrr package, part of the futureverse suite of packages designed around the future package. These packages are designed to make parallel processing transparent and easy to use.
To use parallel processing with simpr, simply load the future package and declare your "plan" for code execution with future::plan(). The three most relevant plans are:
- sequential, the default R behavior using just one core.
- multisession, a parallel processing approach.
- multicore, another parallel processing approach that can be faster than multisession, but which doesn't work with RStudio and only works on Linux/Mac.
For both multisession and multicore you should tell R how many cores you want to use. You can check how many your computer has available with availableCores().
The optimized version of the opening example, rewritten to both change execution order and use parallel processing:
# Leave one core free for the rest of the system
plan(multisession, workers = availableCores() - 1)
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>%
  define(n = c(100, 200)) %>%
  fit(lm = ~ lm(b ~ a)) %>%
  tidy_fits() %>%
  generate(10)
Results will vary. The parallel version of this small simulation actually takes longer, because there are fixed costs in setting up the workers. The speed advantage will become more apparent for larger simulations with slow data-generation or model-fitting steps.
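One way to check whether parallel processing pays off for a particular simulation is to time the same pipeline under both plans. The sketch below assumes simpr, future, and magrittr are installed; `run_sim()` is a hypothetical wrapper written for this example, the choice of two workers is arbitrary, and magrittr is loaded explicitly only to guarantee that `%>%` is available.

```r
library(simpr)
library(future)
library(magrittr)  # for %>%, in case it is not already attached

# Hypothetical wrapper around the reordered pipeline from this article
run_sim <- function(reps) {
  specify(a = ~ rnorm(n),
          b = ~ a + rnorm(n)) %>%
    define(n = c(100, 200)) %>%
    fit(lm = ~ lm(b ~ a)) %>%
    tidy_fits() %>%
    generate(reps)
}

# Baseline: everything runs in the current R session
plan(sequential)
time_sequential <- system.time(run_sim(10))

# Parallel: two background R sessions (an arbitrary choice for this sketch)
plan(multisession, workers = 2)
time_parallel <- system.time(run_sim(10))

# Return to the default plan so later code is unaffected
plan(sequential)

# Elapsed wall-clock time, in seconds, for each plan
rbind(sequential = time_sequential,
      multisession = time_parallel)[, "elapsed"]
```

Increasing the number of repetitions passed to `run_sim()`, or making the fitting step slower, is a simple way to see where the parallel plan starts to win.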