---
title: "simpr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{simpr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message = FALSE}
library(simpr)
set.seed(2001)
```

This vignette provides an overview of each step in the `simpr` workflow:


1. `specify()` your data-generating process
  
2. `define()` parameters that you want to systematically vary across your simulation design (e.g. *n*, effect size)
  
3. `generate()` the simulation data
  
4. `fit()` models to your data (e.g. `lm()`)
  
5. `tidy_fits()` for further processing, such as computing power or Type I Error rates
  
## `specify()` your data-generating process

`specify()` takes arbitrary R expressions that can be used for generating data.  **Each argument should have a name and be prefixed with `~`, the tilde operator.**  Order matters: later arguments can refer to earlier arguments, but not the other way around.

Good---`b` specification refers to `a` specification, and comes after `a`:
```{r correct_order}
specify(a = ~ runif(6), 
        b = ~ a + rnorm(6)) %>% 
  generate(1)

```

Error---`b` specification refers to `a` specification, but comes before `a`, so `generate()` doesn't know what `a` is:
```{r incorrect_order, warning = TRUE, purl = FALSE}
specify(b = ~ a + rnorm(6),
        a = ~ runif(6)) %>% 
  generate(1)
```

All arguments must imply the same number of rows.  Arguments that imply 1 row are recycled.

OK---both `a` and `b` imply 6 rows:
```{r correct_number}
specify(a = ~ runif(6), 
        b = ~ rnorm(6)) %>% 
  generate(1)
```

OK---`a` implies 1 row and `b` implies 6 rows, so `a` is recycled:
```{r recycle_number}
specify(a = ~ runif(1), 
        b = ~ rnorm(6)) %>% 
  generate(1)
```

Error---`a` implies 2 rows and `b` implies 6 rows:
```{r incorrect_number, warning = TRUE, purl = FALSE}
specify(a = ~ runif(2), 
        b = ~ rnorm(6)) %>% 
  generate(1)
```

Using `x` as an argument to `specify()` is not recommended, because for technical reasons `x` is always placed as the first argument.  This means that if `x` refers to prior variables it will return an error:

```{r x_error, warning = TRUE, purl = FALSE}
specify(y = ~ runif(6),
        x = ~ y + runif(6)) %>% 
  generate(1)

```

The same specification works fine when `x` is renamed:

```{r x_error_fixed}
specify(y = ~ runif(6),
        a = ~ y + runif(6)) %>% 
  generate(1)
```

### Advanced: expressions that generate multiple columns
`specify()` accepts expressions that generate multiple columns simultaneously in a `matrix`, `data.frame`, or `tibble`.  By default, the column names in the output append a number to the variable name.

Here's an example using `MASS::mvrnorm()`, which returns draws from the multivariate normal distribution as a matrix. `MASS::mvrnorm()` determines the number of columns for the output data from the length of `mu` and the dimensions of the variance-covariance matrix `Sigma`.

```{r multicolumn_default}
specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3)))) %>% 
  generate(1)

```

The argument was named `a` in `specify()`, so `generate()` creates three variables named `a_1`, `a_2`, and `a_3`.

You can change the separator between the argument name and the number via `.sep`. Here, we change it from the default `"_"` to `"."`:
```{r multicolumn_sep}
specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3))),
        .sep = ".") %>% 
  generate(1)

```

Alternatively, you can give a two-sided formula to set names. The argument name is ignored, and the left hand side must use `c` or `cbind`.

```{r multicolumn_two_sided}
specify(y = c(a, b, c) ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3)))) %>% 
  generate(1)

```

If your expression already produces column names, those are used by default. The argument name is again ignored:

```{r multicolumn_.use_names}
specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3))) %>% 
          magrittr::set_colnames(c("d", "e", "f"))) %>% 
  generate(1)
```

This is useful for dealing with functions from other packages that already provide informative names (e.g., `lavaan::simulateData()`). You can turn this behavior off with `.use_names = FALSE`.

Whatever method you use, you can still refer to these generated variables in subsequent arguments to `specify()`:

```{r multicolumn_refer}
specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3))),
        b = ~ a_1 - a_2) %>% 
  generate(1)
```

## `define()` parameters that you want to systematically vary
`define()` creates metaparameters (also called simulation factors): values that you want to systematically vary.  An obvious choice is sample size.

Instead of writing a number for the `n` argument of `rnorm`, we write a placeholder value `samp_size` (this can be any valid R name), and we write a corresponding argument in `define()` that contains the possible values that `samp_size` can take on:

```{r define_samp_size}
specify(a = ~ rnorm(samp_size)) %>% 
  define(samp_size = c(10, 20)) %>% 
  generate(1)
```

Each argument to `define()` is a vector or list with the desired values to be used in the expressions in `specify()`.  `specify()` expressions can refer to the names of define arguments, and `generate()` will substitute the possible values that argument when generating data.

When `define()` has multiple arguments, each possible combination is generated:

```{r define_samp_size_mu}
specify(a = ~ rnorm(samp_size, mu)) %>% 
  define(samp_size = c(10, 20),
         mu = c(0, 10)) %>% 
  generate(1)
```

(If not all combinations are desired, see options for filtering at the `generate()` step, below.)

`define()` can also take lists for any type of value used in `specify`. For instance, the argument `s` is defined as a list with two variables representing two different possible correlation matrices, and we use that placeholder value `s` in the `specify()` statement:

```{r define_matrix}

specify(a = ~ MASS::mvrnorm(6, rep(0, 2), Sigma = s)) %>% 
  define(s = list(independent = diag(rep(1, 2)),
                  dependent = matrix(c(1, 0.5, 0.5, 1), nrow = 2))) %>% 
  generate(1)

```

In the output, `simpr` creates a column `s_index` using the names of the list elements of `s` to make the results easier to view and filter.

`define()` also supports lists of functions.  Here, the `specify` command has a placeholder function `distribution`, where `distribution` is defined to be either `rnorm()` or `rlnorm()` (the lognormal distribution):

```{r define_function}
specify(y = ~ distribution(6)) %>%
  define(distribution = list(normal = rnorm,
                             lognormal = rlnorm)) %>%
  generate(1)
```

## `generate()` the simulation data
The main argument to `generate()` is `.reps`,  the number of repetitions for each combination of metaparameters in `define()`.

```{r generate_n_mu_2}
specify(a = ~ rnorm(n, mu)) %>% 
  define(n = c(6, 12),
         mu = c(0, 10)) %>% 
  generate(2)
```

Since there are 4 possible combinations of `n` and `mu`, there are a total of 4 * 2 = 8 simulations generated, 2 for each possible combination.

If some combination of variables is not desired, add filtering criteria to `generate()` using the same syntax as `dplyr::filter()`. Here we arbitrarily filter to all combinations of `n` and `mu` where `n` is greater than `mu`, but any valid filtering criteria can be applied.

```{r generate_filter}
specify(a = ~ rnorm(n, mu)) %>% 
  define(n = c(6, 12),
         mu = c(0, 10)) %>% 
  generate(2, n > mu)
```

To preserve reproducibility, a given simulation in the filtered version of `generate` is still the same as if all possible combinations were generated.  This can be tracked using the `.sim_id` that `generate()` includes in the output data, which uniquely identifies the simulation run given the same inputs to `specify`, `define`, and `.reps`. Above, note that `.sim_id` skips 3 and 7  See `vignette("Reproducing simulations")` for more information on using `generate()` to filter.

`generate()` by default continues with the next iteration if an error is produced, and returns a column `.sim_error` with the text of the error.  Alternative error handling mechanisms are available, see `vignette("Managing simulation errors")`.

## `fit()` models to your data

`fit()` uses similar syntax to `generate()`: you can write arbitrary R expressions to fit models. **Again, each argument should have a name and be prefixed with `~`, the tilde operator.**

Below, we fit both a t-test and a linear model.

```{r fit_initial}
specify(a = ~ rnorm(6),
        b = ~ a + rnorm(6)) %>% 
  generate(1) %>% 
  fit(t_test = ~ t.test(a, b),
      lm = ~ lm(b ~ a))
```

`fit()` adds columns to the overall tibble to contain each type of fit. Printing the object displays a preview of the first fit object in each column.

Although the function is named `fit`, any arbitrary R expression can be used. Below, one fit column will include the mean of `a`, while the second will include vectors that are five larger than `a`.  The result is always a list-column, so any type of return object is allowed:

```{r fit_describe}
specify(a = ~ rnorm(6)) %>% 
  generate(1) %>% 
  fit(mean = ~ mean(a),
      why_not = ~ a + 5)
```

`fit()` is computed for each individual simulated dataset, so usually you do not need to refer to the dataset itself.  If a reference to the dataset is needed, use `.`.

The below code is equivalent to the previous example, but explicitly referencing the dataset using `.` and `.$`:

```{r fit_explicit}
specify(a = ~ rnorm(6),
        b = ~ a + rnorm(6)) %>% 
  generate(1) %>% 
  ## .$ and data = . not actually required here
  fit(t_test = ~ t.test(.$a, .$b),
      lm = ~ lm(b ~ a, data = .))
```

### Advanced: pre-fit data munging with `per_sim()

Sometimes data manipulation is required between `generate()` and `fit()`.  After `generate()`, run `per_sim()` and then chain any `dplyr` or `tidyr` verbs that work on `data.frame`s or `tibble`s.  These verbs will be applied to every individual simulated dataset.

A common use-case is needing to reshape wide to long.  Consider an intervention study with a control group, an intervention group that does slightly better than the control, and a second intervention group that does much better.  This is easiest to specify in wide format, with separate variables for each group and differing means by group:

```{r fit_reshape_1}
wide_gen = specify(control = ~ rnorm(6, mean = 0),
        intervention_1 = ~ rnorm(6, mean = 0.2),
        intervention_2 = ~ rnorm(6, mean = 2)) %>% 
  generate(2) 

wide_gen
```

But to run an ANOVA, we need the outcome in one column and the group name in another column. We first run `per_sim()` to indicate we want to compute on individual simulated datasets, and then we can use `tidyr::pivot_longer()` to reshape each dataset into a format ready for analysis:

```{r fit_reshape_success}
long_gen = wide_gen %>%  
  per_sim() %>% 
  pivot_longer(cols = everything(),
               names_to = "group", 
               values_to = "response")

long_gen
```

Each simulation is now reshaped and ready for fitting:

```{r long_fit}
long_fit = long_gen %>% 
  fit(aov = ~ aov(response ~ group),
      lm = ~ lm(response ~ group))

long_fit
```


## `tidy_fits()` for further processing
The output of `fit()` is not yet amenable to plotting or analysis.  `tidy_fits()` applies `broom::tidy()` to each fit object and binds them together in a single `tibble`:

```{r tidy_fits_simple}
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  tidy_fits()

```

All the same metaparameter information appears in the left-hand column, and all the model information from `broom::tidy()` is provided. This is a convenient format for filtering, plotting, and calculating diagnostics.

If more than one kind of fit is present, `tidy_fits()` simply brings them together in the same `tibble` using `bind_rows`; this means there may be many `NA` values where one type of model has no values.  In the example below, `t.test` returns the column `estimate1`, but `lm` does not, so for the `lm` rows there are `NA` values.

```{r tidy_fits_complex}
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a),
      t_test = ~ t.test(a, b)) %>% 
  tidy_fits()

```

Any option taken by the tidier can be passed through `tidy_fits()`.  Below, we specify the `conf.level` and `conf.int` options for `broom::tidy.lm()`:

```{r tidy_fits_custom}
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  tidy_fits(conf.level = 0.99, conf.int = TRUE)

```

`glance_fits()` analogously provides the one-row summary provided by `broom::glance()` for each simulation:

```{r glance_fits_simple}
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  glance_fits()

```

`apply_fits` can take any arbitrary expression (preceded by `~`) or function and apply it to each fit object. The special value `.` indicates the current fit.

Below, the maximum Cook's Distance for each simulated model fit is computed using `cooks.distance()`.

```{r apply_fits}
specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  apply_fits(~ max(cooks.distance(.)))
```