---
title: "Reproducing simulations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reproducing simulations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message = FALSE}
library(simpr)
```

`simpr` is designed with reproducibility in mind.  If you set the same seed, you get the same results.

```{r run_1}
set.seed(500)
run_1 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_1
```

```{r run_2}
set.seed(500)
run_2 = specify(a = ~ runif(6)) %>% 
  generate(3)

run_2
```

```{r run_compare}
identical(run_1, run_2)
```

What's more, `generate()` accepts filtering criteria, so you can regenerate specific repetitions or conditions *without having to recreate the entire simulation*. This requires that the seed, specification, definition, and number of reps are all identical to those of the simulation you are trying to reproduce.

```{r filter_after_generating}
set.seed(500)
filter_after_generating = specify(a = ~ runif(6)) %>% 
  generate(3) %>% 
  filter(.sim_id == 2)

filter_after_generating
```

```{r filter_while_generating}
## Much faster, same result!
set.seed(500)
filter_while_generating = specify(a = ~ runif(6)) %>% 
  generate(3, .sim_id == 2)

filter_while_generating
```

```{r filter_test}
identical(filter_after_generating, filter_while_generating)
```

Although only one repetition was generated above, its data are identical to what the full simulation produced for that repetition.
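To build intuition for how a single repetition can be regenerated on its own, here is a minimal base R sketch. It assumes each repetition gets its own seed derived from one master seed; this is an illustration of the general idea, not a description of `simpr`'s internals, and `draw_rep()` and `rep_seeds` are hypothetical names.

```{r seed_sketch}
## Hypothetical helper, not part of simpr: one repetition drawn from its own seed
draw_rep = function(seed) {
  set.seed(seed)
  runif(6)
}

## Derive one seed per repetition from a single master seed
set.seed(500)
rep_seeds = sample.int(1e6, size = 3)

## Full simulation: every repetition
all_reps = lapply(rep_seeds, draw_rep)

## Regenerate repetition 2 alone, without running repetitions 1 and 3
only_rep_2 = draw_rep(rep_seeds[2])

identical(all_reps[[2]], only_rep_2)
```

Because each repetition depends only on its own seed, the comparison returns `TRUE` no matter which other repetitions are run.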

A common use case is regenerating the data behind an error. Here's an example of a simulation that generates errors in only one condition. We generate some data and fit a logistic regression, and some of the fits produce errors.

```{r fit_tidy, warning = TRUE, purl = FALSE}
set.seed(500)
fit_tidy = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3) %>% 
  fit(lm = ~ glm(a ~ b, family = "binomial")) %>% 
  tidy_fits()

fit_tidy
```

One option for regenerating is to filter directly to the problematic `max == 10` condition to examine the generated data.

```{r filter_max_10}
set.seed(500)
filter_max_10 = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, max == 10)

filter_max_10
```

Looking at the raw generated data, we can see our outcome variable is often larger than 1, which makes no sense for a logistic regression.
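The failure can be reproduced outside the simulation with a small base R sketch. The values of `a` below are made up to resemble draws from `0:10`; they are not the values generated above.

```{r binomial_range}
## Hypothetical outcome values resembling draws when max == 10
a = c(0, 1, 3, 7, 10, 2, 0, 5, 1, 9)
set.seed(500)
b = a + rnorm(10)

## A binomial glm() needs outcomes between 0 and 1, so this fit fails;
## tryCatch() captures the error message instead of stopping
fit_message = tryCatch(
  {
    glm(a ~ b, family = "binomial")
    "no error"
  },
  error = function(e) conditionMessage(e)
)

fit_message
```

The captured message comes from the binomial family's check on the outcome, which is consistent with the errors appearing only in the condition where `a` can exceed 1.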

More generally, we can filter down to only the values of `.sim_id` that generated errors and examine those:

```{r fit_error_data}
fit_errors = filter(fit_tidy, !is.na(.fit_error))

set.seed(500)
fit_error_data = specify(a = ~ sample(0:max, size = 10, replace = TRUE),
        b = ~ a + rnorm(10))  %>% 
  define(max = c(0, 1, 10)) %>%
  generate(3, .sim_id %in% fit_errors$.sim_id)

fit_error_data
```

This approach is useful when we don't know which conditions are producing the errors. Sometimes simulation errors come from numerical issues caused by unlucky draws from the data-generating mechanism and are not systematic.
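As a rough illustration of non-systematic errors, the base R sketch below records which repetitions fail and then regenerates only those draws. Everything in it is hypothetical (`analyze_rep()`, `seeds_to_try`, and the failure rule are invented for this example), and it does not use `simpr`.

```{r unlucky_draws}
## Hypothetical analysis step, not part of simpr: it fails only when the
## binary outcome happens to have no variation
analyze_rep = function(seed) {
  set.seed(seed)
  y = rbinom(5, size = 1, prob = 0.5)
  if (length(unique(y)) < 2) stop("outcome has no variation")
  mean(y)
}

## Run every repetition, recording the error message (NA when it succeeds)
seeds_to_try = 1:200
rep_errors = vapply(seeds_to_try, function(seed) {
  tryCatch(
    {
      analyze_rep(seed)
      NA_character_
    },
    error = function(e) conditionMessage(e)
  )
}, character(1))

## Regenerate only the draws behind the failing repetitions
failed_seeds = seeds_to_try[!is.na(rep_errors)]
failed_draws = lapply(failed_seeds, function(seed) {
  set.seed(seed)
  rbinom(5, size = 1, prob = 0.5)
})

failed_draws
```

Every regenerated draw is all zeros or all ones, which pins the failures on the draws themselves and not on any particular condition.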