---
title: "simpr"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{simpr}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup, message = FALSE}
library(simpr)
set.seed(2001)
```
This vignette provides an overview of each step in the `simpr` workflow:
1. `specify()` your data-generating process
2. `define()` parameters that you want to systematically vary across your simulation design (e.g. *n*, effect size)
3. `generate()` the simulation data
4. `fit()` models to your data (e.g. `lm()`)
5. `tidy_fits()` for further processing, such as computing power or Type I Error rates
## `specify()` your data-generating process
`specify()` takes arbitrary R expressions that can be used for generating data. **Each argument should have a name and be prefixed with `~`, the tilde operator.** Order matters: later arguments can refer to earlier arguments, but not the other way around.
Good---`b` specification refers to `a` specification, and comes after `a`:
```{r correct_order}
specify(a = ~ runif(6),
b = ~ a + rnorm(6)) %>%
generate(1)
```
Error---`b` specification refers to `a` specification, but comes before `a`, so `generate()` doesn't know what `a` is:
```{r incorrect_order, warning = TRUE, purl = FALSE}
specify(b = ~ a + rnorm(6),
a = ~ runif(6)) %>%
generate(1)
```
All arguments must imply the same number of rows. Arguments that imply 1 row are recycled.
OK---both `a` and `b` imply 6 rows:
```{r correct_number}
specify(a = ~ runif(6),
b = ~ rnorm(6)) %>%
generate(1)
```
OK---`a` implies 1 row and `b` implies 6 rows, so `a` is recycled:
```{r recycle_number}
specify(a = ~ runif(1),
b = ~ rnorm(6)) %>%
generate(1)
```
Error---`a` implies 2 rows and `b` implies 6 rows:
```{r incorrect_number, warning = TRUE, purl = FALSE}
specify(a = ~ runif(2),
b = ~ rnorm(6)) %>%
generate(1)
```
Using `x` as an argument to `specify()` is not recommended, because for technical reasons `x` is always placed as the first argument. This means that if `x` refers to prior variables it will return an error:
```{r x_error, warning = TRUE, purl = FALSE}
specify(y = ~ runif(6),
x = ~ y + runif(6)) %>%
generate(1)
```
The same specification works fine when `x` is renamed:
```{r x_error_fixed}
specify(y = ~ runif(6),
a = ~ y + runif(6)) %>%
generate(1)
```
### Advanced: expressions that generate multiple columns
`specify()` accepts expressions that generate multiple columns simultaneously in a `matrix`, `data.frame`, or `tibble`. By default, the column names in the output append a number to the variable name.
Here's an example using `MASS::mvrnorm()`, which returns draws from the multivariate normal distribution as a matrix. `MASS::mvrnorm()` determines the number of columns for the output data from the length of `mu` and the dimensions of the variance-covariance matrix `Sigma`.
```{r multicolumn_default}
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
```
The argument was named `a` in `specify()`, so `generate()` creates three variables named `a_1`, `a_2`, and `a_3`.
You can change the separator between the argument name and the number via `.sep`. Here, we change it from the default `"_"` to `"."`:
```{r multicolumn_sep}
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
.sep = ".") %>%
generate(1)
```
Alternatively, you can give a two-sided formula to set names. The argument name is ignored, and the left hand side must use `c` or `cbind`.
```{r multicolumn_two_sided}
specify(y = c(a, b, c) ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
```
If your expression already produces column names, those are used by default. The argument name is again ignored:
```{r multicolumn_.use_names}
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))) %>%
magrittr::set_colnames(c("d", "e", "f"))) %>%
generate(1)
```
This is useful for dealing with functions from other packages that already provide informative names (e.g., `lavaan::simulateData()`). You can turn this behavior off with `.use_names = FALSE`.
Whatever method you use, you can still refer to these generated variables in subsequent arguments to `specify()`:
```{r multicolumn_refer}
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
b = ~ a_1 - a_2) %>%
generate(1)
```
## `define()` parameters that you want to systematically vary
`define()` creates metaparameters (also called simulation factors): values that you want to systematically vary. An obvious choice is sample size.
Instead of writing a number for the `n` argument of `rnorm`, we write a placeholder value `samp_size` (this can be any valid R name), and we write a corresponding argument in `define()` that contains the possible values that `samp_size` can take on:
```{r define_samp_size}
specify(a = ~ rnorm(samp_size)) %>%
define(samp_size = c(10, 20)) %>%
generate(1)
```
Each argument to `define()` is a vector or list with the desired values to be used in the expressions in `specify()`. `specify()` expressions can refer to the names of define arguments, and `generate()` will substitute the possible values that argument when generating data.
When `define()` has multiple arguments, each possible combination is generated:
```{r define_samp_size_mu}
specify(a = ~ rnorm(samp_size, mu)) %>%
define(samp_size = c(10, 20),
mu = c(0, 10)) %>%
generate(1)
```
(If not all combinations are desired, see options for filtering at the `generate()` step, below.)
`define()` can also take lists for any type of value used in `specify`. For instance, the argument `s` is defined as a list with two variables representing two different possible correlation matrices, and we use that placeholder value `s` in the `specify()` statement:
```{r define_matrix}
specify(a = ~ MASS::mvrnorm(6, rep(0, 2), Sigma = s)) %>%
define(s = list(independent = diag(rep(1, 2)),
dependent = matrix(c(1, 0.5, 0.5, 1), nrow = 2))) %>%
generate(1)
```
In the output, `simpr` creates a column `s_index` using the names of the list elements of `s` to make the results easier to view and filter.
`define()` also supports lists of functions. Here, the `specify` command has a placeholder function `distribution`, where `distribution` is defined to be either `rnorm()` or `rlnorm()` (the lognormal distribution):
```{r define_function}
specify(y = ~ distribution(6)) %>%
define(distribution = list(normal = rnorm,
lognormal = rlnorm)) %>%
generate(1)
```
## `generate()` the simulation data
The main argument to `generate()` is `.reps`, the number of repetitions for each combination of metaparameters in `define()`.
```{r generate_n_mu_2}
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2)
```
Since there are 4 possible combinations of `n` and `mu`, there are a total of 4 * 2 = 8 simulations generated, 2 for each possible combination.
If some combination of variables is not desired, add filtering criteria to `generate()` using the same syntax as `dplyr::filter()`. Here we arbitrarily filter to all combinations of `n` and `mu` where `n` is greater than `mu`, but any valid filtering criteria can be applied.
```{r generate_filter}
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2, n > mu)
```
To preserve reproducibility, a given simulation in the filtered version of `generate` is still the same as if all possible combinations were generated. This can be tracked using the `.sim_id` that `generate()` includes in the output data, which uniquely identifies the simulation run given the same inputs to `specify`, `define`, and `.reps`. Above, note that `.sim_id` skips 3 and 7 See `vignette("Reproducing simulations")` for more information on using `generate()` to filter.
`generate()` by default continues with the next iteration if an error is produced, and returns a column `.sim_error` with the text of the error. Alternative error handling mechanisms are available, see `vignette("Managing simulation errors")`.
## `fit()` models to your data
`fit()` uses similar syntax to `generate()`: you can write arbitrary R expressions to fit models. **Again, each argument should have a name and be prefixed with `~`, the tilde operator.**
Below, we fit both a t-test and a linear model.
```{r fit_initial}
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
fit(t_test = ~ t.test(a, b),
lm = ~ lm(b ~ a))
```
`fit()` adds columns to the overall tibble to contain each type of fit. Printing the object displays a preview of the first fit object in each column.
Although the function is named `fit`, any arbitrary R expression can be used. Below, one fit column will include the mean of `a`, while the second will include vectors that are five larger than `a`. The result is always a list-column, so any type of return object is allowed:
```{r fit_describe}
specify(a = ~ rnorm(6)) %>%
generate(1) %>%
fit(mean = ~ mean(a),
why_not = ~ a + 5)
```
`fit()` is computed for each individual simulated dataset, so usually you do not need to refer to the dataset itself. If a reference to the dataset is needed, use `.`.
The below code is equivalent to the previous example, but explicitly referencing the dataset using `.` and `.$`:
```{r fit_explicit}
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
## .$ and data = . not actually required here
fit(t_test = ~ t.test(.$a, .$b),
lm = ~ lm(b ~ a, data = .))
```
### Advanced: pre-fit data munging with `per_sim()
Sometimes data manipulation is required between `generate()` and `fit()`. After `generate()`, run `per_sim()` and then chain any `dplyr` or `tidyr` verbs that work on `data.frame`s or `tibble`s. These verbs will be applied to every individual simulated dataset.
A common use-case is needing to reshape wide to long. Consider an intervention study with a control group, an intervention group that does slightly better than the control, and a second intervention group that does much better. This is easiest to specify in wide format, with separate variables for each group and differing means by group:
```{r fit_reshape_1}
wide_gen = specify(control = ~ rnorm(6, mean = 0),
intervention_1 = ~ rnorm(6, mean = 0.2),
intervention_2 = ~ rnorm(6, mean = 2)) %>%
generate(2)
wide_gen
```
But to run an ANOVA, we need the outcome in one column and the group name in another column. We first run `per_sim()` to indicate we want to compute on individual simulated datasets, and then we can use `tidyr::pivot_longer()` to reshape each dataset into a format ready for analysis:
```{r fit_reshape_success}
long_gen = wide_gen %>%
per_sim() %>%
pivot_longer(cols = everything(),
names_to = "group",
values_to = "response")
long_gen
```
Each simulation is now reshaped and ready for fitting:
```{r long_fit}
long_fit = long_gen %>%
fit(aov = ~ aov(response ~ group),
lm = ~ lm(response ~ group))
long_fit
```
## `tidy_fits()` for further processing
The output of `fit()` is not yet amenable to plotting or analysis. `tidy_fits()` applies `broom::tidy()` to each fit object and binds them together in a single `tibble`:
```{r tidy_fits_simple}
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits()
```
All the same metaparameter information appears in the left-hand column, and all the model information from `broom::tidy()` is provided. This is a convenient format for filtering, plotting, and calculating diagnostics.
If more than one kind of fit is present, `tidy_fits()` simply brings them together in the same `tibble` using `bind_rows`; this means there may be many `NA` values where one type of model has no values. In the example below, `t.test` returns the column `estimate1`, but `lm` does not, so for the `lm` rows there are `NA` values.
```{r tidy_fits_complex}
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a),
t_test = ~ t.test(a, b)) %>%
tidy_fits()
```
Any option taken by the tidier can be passed through `tidy_fits()`. Below, we specify the `conf.level` and `conf.int` options for `broom::tidy.lm()`:
```{r tidy_fits_custom}
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits(conf.level = 0.99, conf.int = TRUE)
```
`glance_fits()` analogously provides the one-row summary provided by `broom::glance()` for each simulation:
```{r glance_fits_simple}
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
glance_fits()
```
`apply_fits` can take any arbitrary expression (preceded by `~`) or function and apply it to each fit object. The special value `.` indicates the current fit.
Below, the maximum Cook's Distance for each simulated model fit is computed using `cooks.distance()`.
```{r apply_fits}
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
apply_fits(~ max(cooks.distance(.)))
```