---
title: "Reproducing simulations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reproducing simulations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message = FALSE}
library(simpr)
```

`simpr` is designed with reproducibility in mind. If you set the same seed, you get the same results.

```{r run_1}
set.seed(500)
run_1 = specify(a = ~ runif(6)) %>%
  generate(3)
run_1
```

```{r run_2}
set.seed(500)
run_2 = specify(a = ~ runif(6)) %>%
  generate(3)
run_2
```

```{r run_compare}
identical(run_1, run_2)
```

What's more, `generate()` can take filtering criteria, so that you can re-generate specific repetitions or conditions *without having to recreate the entire simulation*. This requires that the seed, specification, definition, and number of reps are identical to those of the simulation you are trying to reproduce.

```{r filter_after_generating}
set.seed(500)
filter_after_generating = specify(a = ~ runif(6)) %>%
  generate(3) %>%
  filter(.sim_id == 2)
filter_after_generating
```

```{r filter_while_generating}
## Much faster, same result!
set.seed(500)
filter_while_generating = specify(a = ~ runif(6)) %>%
  generate(3, .sim_id == 2)
filter_while_generating
```

```{r filter_test}
identical(filter_after_generating, filter_while_generating)
```

Although only one repetition was generated above, its data are identical to the data produced for that repetition in the full simulation.

A common use case is regenerating the data for cases in which an error occurred. Here's an example of a simulation that generated errors in only one condition. We generate some data and fit a logistic regression, but notice that we get some errors.

```{r fit_tidy, warning = TRUE, purl = FALSE}
set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
                   b = ~ a + rnorm(10)) %>%
  define(max = c(0, 1, 10)) %>%
  generate(3) %>%
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>%
  tidy_fits()
fit_tidy
```

One option for regenerating is to filter directly to the problematic `max == 10` condition to examine the generated data.

```{r filter_max_10}
set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
                        b = ~ a + rnorm(10)) %>%
  define(max = c(0, 1, 10)) %>%
  generate(3, max == 10)
filter_max_10
```

Looking at the raw generated data, we can see that our outcome variable is often larger than 1, which makes no sense for a logistic regression: the binomial family expects an outcome between 0 and 1.

More generally, we could filter down to only the values of `.sim_id` that generated errors and examine those:

```{r fit_error_data}
## Identify the simulations whose fits produced an error
fit_errors = filter(fit_tidy, !is.na(.fit_error))

## The specification must match the original exactly (including 0:max)
## for the regenerated data to be the same
set.seed(500)
fit_error_data = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
                         b = ~ a + rnorm(10)) %>%
  define(max = c(0, 1, 10)) %>%
  generate(3, .sim_id %in% fit_errors$.sim_id)
fit_error_data
```

This approach is useful when we don't know which conditions are producing the errors. Sometimes simulation errors stem from numerical issues caused by unlucky draws from the data-generating mechanism, and are not systematic.
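
To build intuition for how a single repetition can be regenerated without running the others, here is a minimal base R sketch. It assumes a scheme in which each repetition gets its own seed derived from one master seed; this is an illustration of the general idea, not necessarily how `simpr` implements it. The helper names `make_rep_seeds()` and `generate_rep()` are hypothetical and are not part of `simpr`.

```r
# Hypothetical helpers for illustration only; not simpr's internal code.

# Derive one integer seed per repetition from a single master seed.
make_rep_seeds <- function(master_seed, n_reps) {
  set.seed(master_seed)
  sample.int(.Machine$integer.max, n_reps)
}

# Generate the data for one repetition using only that repetition's seed.
generate_rep <- function(rep_seed, n = 6) {
  set.seed(rep_seed)
  runif(n)
}

seeds <- make_rep_seeds(master_seed = 500, n_reps = 3)

# Full simulation: every repetition is generated.
full_run <- lapply(seeds, generate_rep)

# Targeted regeneration: only repetition 2 is generated.
rep_2_only <- generate_rep(seeds[2])

# The regenerated repetition matches the one from the full run.
identical(full_run[[2]], rep_2_only)
```

Under this scheme, the per-repetition seeds depend on both the master seed and the number of repetitions, which may help explain why the seed, specification, and number of reps all need to match the original simulation.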