Define a Simulation Scenario

Constructs a purely declarative ssdsims_scenario object: the root of the targets-based pipeline (see TARGETS-DESIGN.md section 1). The object stores only declarative fields - a scalar seed, the replicate count nsim, the sample sizes nrow, the dataset names, and the fit and hc argument grids. It performs no random-number generation and no task expansion, and it has no dependency on targets.

Usage

ssd_define_scenario(
  data,
  nsim,
  seed,
  ...,
  nrow = 6L,
  replace = TRUE,
  rescale = FALSE,
  computable = FALSE,
  at_boundary_ok = TRUE,
  min_pmix = ssd_pmix(ssd_min_pmix = ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)),
  range_shape2 = list(c(0.05, 20)),
  nrow_max = 1000L,
  dists = ssd_distset(BCANZ = ssdtools::ssd_dists_bcanz()),
  est_method = "multi",
  proportion = 0.05,
  ci = FALSE,
  nboot = 1000,
  ci_method = "weighted_samples",
  parametric = TRUE,
  samples = FALSE,
  partition_by = NULL,
  bundle = NULL
)

Arguments

data: An ssd_scenario_data() collection: a validated, named collection of Conc tibbles assembled from data frames and/or ssd_gen() generator datasets.
nsim: A count of the number of replicate datasets to generate.
seed: A scalar whole number; the scenario's RNG root. Required - changing it fully re-roots the scenario's random-number draws.
...: Unused; must be empty.
nrow: A whole-number vector of sample sizes (the fit-step truncation axis), each between 5 (the fit floor) and nrow_max (the shared draw size, the universal ceiling). A value within nrow_max but above a dataset's row count is still valid: its replace = TRUE cell draws with replacement, while its replace = FALSE cell (which cannot exceed the dataset size) is silently discarded for that dataset.
replace: A logical vector (a cross-join axis of one or two values) specifying whether the shared sample draw resamples with replacement. Defaults to TRUE (the standard resampling model, drawing nrow_max rows, so nrow is not capped by the dataset size); FALSE draws a permutation, capping the effective draw - and so each nrow - at the dataset size.
rescale: A flag or a string specifying how to rescale the concentration values. FALSE or "no" leaves the values unchanged; TRUE or "geomean" divides them by the geometric mean of the minimum and maximum positive finite values; "odds" logistically transforms them.
computable: A flag specifying whether to only return fits with numerically computable standard errors.
at_boundary_ok: A flag specifying whether a model with one or more parameters at the boundary should be considered to have converged (default = TRUE).
min_pmix: An ssd_pmix() collection of one or more single-argument min_pmix functions, referenced by name. This is the only accepted form; a bare function, a plain list, or a character vector of names is rejected (no string-to-function resolution). The collection's name (not the function value) is what the task path hashes; the function is materialised on the scenario (keyed by name) for execution and isolated via scenario_min_pmix(). Defaults to ssd_pmix(ssd_min_pmix = ssdtools::ssd_min_pmix).
range_shape1: A list of numeric vectors of length two of the lower and upper bounds for the shape1 parameter.
range_shape2: A list of numeric vectors of length two of the lower and upper bounds for the shape2 parameter.
nrow_max: A whole number (default 1000L): the fixed size of the shared sample draw that every nrow value sub-truncates. A sample-level scenario setting, not a cross-join axis. The effective per-dataset draw is min(nrow_max, nrow(data)) when replace = FALSE (the high default draws the full permutation) and nrow_max rows when replace = TRUE; each nrow must not exceed the effective draw size.
dists: An ssd_distset() collection of one or more named distribution sets (pools model-averaged together to form one SSD). The fit step fits the union of every set's members once; the hc step subsets that union fit per set and re-averages, so the set name is an hc cross-join axis ("distset") while individual distributions never fan out. A bare character vector or plain list aborts, pointing to ssd_distset().
est_method: A string specifying how to compute the estimate: directly from the model-averaged cumulative distribution function (est_method = 'multi'), as the arithmetic mean of the estimates from the individual cumulative distribution functions weighted by the AICc-derived weights (est_method = 'arithmetic'), or as the corresponding weighted geometric mean (est_method = 'geometric').
proportion: A numeric vector of proportion values to estimate hazard concentrations for.
ci: A flag specifying whether to estimate confidence intervals (by bootstrapping).
nboot: A count of the number of bootstrap samples to use to estimate the confidence limits. A value of 10,000 is recommended for official guidelines.
ci_method: A string specifying which method to use for estimating the standard error and confidence limits from the bootstrap samples. The default and recommended value is ci_method = "weighted_samples", which takes bootstrap samples from each distribution in proportion to its AICc-based weight and calculates the confidence limits (and SE) from this single set. ci_method = "multi_fixed" and ci_method = "multi_free" generate the bootstrap samples using the model-averaged cumulative distribution function but differ in whether the model weights are fixed at the values for the original dataset or re-estimated for each bootstrap sample dataset. ci_method = "MACL" (formerly ci_method = "weighted_arithmetic"), which is included only for historical reasons, takes the weighted arithmetic mean of the confidence limits; ci_method = "GMACL", which takes the weighted geometric mean of the confidence limits, was added for completeness but is also not recommended. Finally, ci_method = "arithmetic_samples" and ci_method = "geometric_samples" take the weighted arithmetic or geometric mean of the values for each bootstrap iteration across all the distributions and then calculate the confidence limits (and SE) from that single set of samples.
parametric: A flag specifying whether to perform parametric bootstrapping as opposed to non-parametrically resampling the original data with replacement.
samples: A logical scalar (default FALSE): retain the bootstrap draws in the hc result's samples list-column (passed to ssdtools::ssd_hc()). This is output retention only - it does not change the estimates or the per-task RNG, so it is not a grid or task axis (a single TRUE is a superset of FALSE). Changing it re-runs the hc step (the discarded draws must be re-bootstrapped) but yields byte-identical estimates; retained samples can be large (nboot draws per dist per task), so it is off by default.
partition_by: An optional, possibly-partial named list keyed by step (sample/fit/hc) of character vectors naming the Hive path axes for that step (one shard per path cell; the inner complement rides as Parquet columns). Each entry must be unique, non-missing, and a subset of that step's axis vocabulary: sample = dataset, sim, replace; fit adds nrow, rescale, computable, at_boundary_ok, min_pmix, range_shape1, range_shape2; hc adds nboot, ci_method, parametric, distset (ci and est_method are hc scenario settings, not axes; dists is the fit-level scenario setting feeding the union, and distset - the set name - is the hc axis over the union's post-fit subsets). "nrow" is rejected only for sample (the shared draw carries no nrow axis; the fit step truncates it inline), and is a valid path axis for fit/hc. Steps partition independently - there is no cross-step constraint; a step may be finer or coarser than its neighbour on a shared axis (the m:n parent-shard relationship is resolved at the read layer). Steps left unnamed take their documented defaults (sample = c("dataset", "sim", "replace"), fit = c("dataset", "sim", "nrow", "rescale"), hc = c("dataset", "sim"); these supersede TARGETS-DESIGN.md section 5's pre-fold table). The split is orthogonal to the per-task RNG primer, so changing it shifts file paths only, never results.
bundle: An optional, possibly-partial named list keyed by step, the per-step complement of partition_by: it names the inner axes to keep together within a shard, and the stored path axes become setdiff(task_axes(step), bundle[[step]]). partition_by and bundle are complementary per-step entry points - at most one may name a given step (a step in both is an error), but they may be mixed across steps and either may be partial. Use partition_by when you want few path axes, bundle when you want fine sharding and only a few inner axes. Both normalise into the single stored partition_by path list.
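
The complement relationship between partition_by and bundle can be sketched in base R. This is only an illustration, not the package's implementation: the axis vectors are copied from the partition_by description, and the helpers path_from_bundle() and bundle_from_path() are hypothetical names standing in for the internal normalisation.

```r
# Axis vocabularies per step, as listed under partition_by.
sample_axes <- c("dataset", "sim", "replace")
fit_axes <- c(sample_axes, "nrow", "rescale", "computable", "at_boundary_ok",
              "min_pmix", "range_shape1", "range_shape2")
hc_axes <- c(fit_axes, "nboot", "ci_method", "parametric", "distset")

# Hypothetical helpers: each entry point is the set complement of the other.
path_from_bundle <- function(step_axes, bundle) setdiff(step_axes, bundle)
bundle_from_path <- function(step_axes, partition_by) setdiff(step_axes, partition_by)

# Documented default path axes for the fit and hc steps.
fit_path <- c("dataset", "sim", "nrow", "rescale")
hc_path <- c("dataset", "sim")

# Inner axes that travel as columns inside each shard.
fit_bundle <- bundle_from_path(fit_axes, fit_path)
hc_bundle <- bundle_from_path(hc_axes, hc_path)
print(fit_bundle)

# Naming the bundle instead recovers the same path axes.
round_trip_ok <- identical(path_from_bundle(fit_axes, fit_bundle), fit_path)
print(round_trip_ok)
```

Under these assumptions, naming either side for a step fully determines the other, which is consistent with at most one of the two arguments being allowed to name a given step.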

Value

An S3 object of class ssdsims_scenario.

Details

Input data arrives as an ssd_scenario_data() collection (already validated: a numeric Conc column is required) and is retained on the scenario (as $data) so a local run (ssd_run_scenario_baseline()) can sample it directly. The dataset names ($datasets) are what the targets/cluster path hashes; the validated tibbles ride on the scenario and are isolated by name via scenario_dataset(), so the hash need not carry the data frames.

Dataset input

Dataset input is accepted only as an ssd_scenario_data() collection, which owns validation and naming. Assemble it first, then pass it in:

data <- ssd_scenario_data(boron = ccme_boron, cadmium = ccme_cadmium)
scenario <- ssd_define_scenario(data, ...)

Generator inputs (a fitdists/tmbfit object, a generator function, or a function-name string) are materialised - once, reproducibly - by ssd_gen() and composed into the same collection; the constructor itself performs no random-number generation.

`nrow_max`

nrow_max is the sample-level scenario setting: the fixed size of the shared sample draw that every nrow value sub-truncates (head(sample, nrow), TARGETS-DESIGN.md section 5). The effective per-dataset draw is min(nrow_max, nrow(data)) for replace = FALSE (the high default thus draws the full permutation) and nrow_max rows for replace = TRUE. Because the draw size is fixed - not derived from max(nrow) - adding nrow values (within the effective draw size) never changes the draw, so cached sample shards stay valid. Each nrow is validated at construction against the effective draw size. It is not ci-gated (the draw happens regardless of ci) and, like dists and est_method, it is absent from task_axes("sample"), so it never multiplies tasks or enters the per-task RNG primer.
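
A minimal base R sketch of the shared-draw idea, assuming made-up concentration values and a hypothetical helper effective_draw_size(); it illustrates the rule described above and is not the package's sampling code.

```r
# Hypothetical helper: size of the shared draw for one dataset.
effective_draw_size <- function(n_data, nrow_max = 1000L, replace = TRUE) {
  if (replace) nrow_max else min(nrow_max, n_data)
}

conc <- c(1.0, 1.3, 1.8, 2.4, 3.1, 4.0, 5.2, 6.7)  # made-up concentrations

# One shared draw of fixed size; it does not depend on the nrow values.
set.seed(42)
draw_size <- effective_draw_size(length(conc), nrow_max = 20L, replace = TRUE)
shared_draw <- sample(conc, size = draw_size, replace = TRUE)

# Each nrow value truncates the same draw with head().
nrow_values <- c(5L, 10L)
samples <- lapply(nrow_values, function(n) head(shared_draw, n))
names(samples) <- nrow_values

# The smaller sample is a prefix of the larger one.
identical(samples[["5"]], head(samples[["10"]], 5))  # TRUE

# Without replacement the draw is capped at the dataset size.
effective_draw_size(length(conc), nrow_max = 1000L, replace = FALSE)  # 8
```

Because the draw is made once at a fixed size, adding a further nrow value (say 7) only adds another head() call; the draw itself, and any shard cached from it, is unchanged.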

`ci`

ci is a scalar flag (not a cross-join axis): the point estimate est is invariant to ci - it is computed analytically from the fit, independent of the bootstrap and RNG - so a single ci = TRUE run is a strict superset of ci = FALSE (same est, plus the se/lcl/ucl columns). The choice is scenario-wide either/or: ci = FALSE for cheap, bootstrap-free point estimates, or ci = TRUE for estimates plus confidence intervals. When ci = FALSE, the bootstrap-only scenario options nboot, ci_method, and parametric are meaningless; passing any of them in that case is an error, so set ci = TRUE to enable bootstrap, or omit the options.
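
The ci gating rule can be illustrated with a small base R check. The function check_ci_options() is hypothetical and written only for this sketch; the package's own validation and error message may differ.

```r
# Hypothetical validator: bootstrap-only options require ci = TRUE.
check_ci_options <- function(ci = FALSE, nboot, ci_method, parametric) {
  # missing() distinguishes "explicitly passed" from "left at its default".
  supplied <- c(
    nboot = !missing(nboot),
    ci_method = !missing(ci_method),
    parametric = !missing(parametric)
  )
  if (!ci && any(supplied)) {
    stop(
      "`", paste(names(supplied)[supplied], collapse = "`, `"),
      "` only apply when `ci = TRUE`; set `ci = TRUE` or omit them.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Accepted: no bootstrap options, or bootstrap options together with ci = TRUE.
check_ci_options(ci = FALSE)
check_ci_options(ci = TRUE, nboot = 100)

# Rejected: a bootstrap option passed while ci = FALSE.
rejected <- tryCatch(
  check_ci_options(ci = FALSE, nboot = 100),
  error = function(e) conditionMessage(e)
)
print(rejected)
```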

`dists` and `est_method`

dists is an ssd_distset() collection: one or more distribution sets (pools of distributions model-averaged together to form one SSD), each named. The fit step fits the union of every set's members once - the single model-averaged superset every pool is a subset of - so scenario$fit$dists is that union and individual distributions still never fan out (an axis value is always a whole pool). The named sets ride on the hc grid (scenario$hc$distsets); the hc step subsets that one union fit down to each set's members (subset(fit, set, strict = FALSE)) and re-averages, so "distset" is an hc cross-join axis (task_axes("hc")) keyed by the set name - several pools reuse one fit rather than re-fitting. A bare character vector or plain list aborts loudly, pointing to ssd_distset().
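
The fit-once, subset-per-set pattern can be sketched in base R. The AICc values and the second set ("small") are invented for illustration, and re-averaging is assumed here to mean recomputing Akaike weights over the subset's members; the package delegates the real work to the fit's subset method.

```r
# Named distribution sets; "small" is a hypothetical second pool.
distsets <- list(
  BCANZ = c("gamma", "lgumbel", "llogis", "lnorm", "lnorm_lnorm", "weibull"),
  small = c("lnorm", "llogis")
)

# The fit step sees only the union of all members, fitted once.
fit_dists <- unique(unlist(distsets, use.names = FALSE))

# Made-up AICc values standing in for the single union fit.
aicc <- c(gamma = 238.1, lgumbel = 244.2, llogis = 241.0,
          lnorm = 239.0, lnorm_lnorm = 246.9, weibull = 237.6)
stopifnot(setequal(names(aicc), fit_dists))

# Akaike weights from AICc differences, normalised to sum to one.
akaike_weights <- function(aicc) {
  delta <- aicc - min(aicc)
  w <- exp(-delta / 2)
  w / sum(w)
}

# The hc step subsets the one fit per set and re-averages within it.
weights_by_set <- lapply(distsets, function(set) akaike_weights(aicc[set]))
print(weights_by_set)
```

In this sketch both sets reuse the same six fitted distributions; only the weights are recomputed, which is why adding a set adds an hc axis value but no extra fitting.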

est_method is a scenario setting, not a cross-join axis - it is absent from task_axes("hc"), so it never multiplies tasks or enters the per-task RNG primer. It is an hc-level setting: every requested method is summarised from each hc task's single bootstrap sample set rather than re-bootstrapping per method (the CI is est_method-invariant and the point est is analytical), so a vector est_method yields one row per method within a task without fanning out into separate tasks.
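
A rough base R sketch of how a vector est_method can yield one row per method within a single task. The per-distribution estimates and weights are invented, summarise_est() is a hypothetical helper, and "multi" is left out because it requires solving the model-averaged cumulative distribution function rather than averaging per-distribution estimates.

```r
# Made-up per-distribution HC5 estimates and model weights for one task.
est <- c(lnorm = 1.2, llogis = 1.5, gamma = 0.9)
weight <- c(lnorm = 0.5, llogis = 0.3, gamma = 0.2)

# Hypothetical helper: summarise the same inputs once per requested method.
summarise_est <- function(est, weight, est_method) {
  vapply(est_method, function(method) {
    switch(method,
      arithmetic = sum(weight * est),            # weighted arithmetic mean
      geometric = exp(sum(weight * log(est))),   # weighted geometric mean
      stop("unsupported est_method in this sketch: ", method)
    )
  }, numeric(1))
}

res <- summarise_est(est, weight, c("arithmetic", "geometric"))

# One row per method, all belonging to the same task.
task_rows <- data.frame(est_method = names(res), est = unname(res))
print(task_rows)
```

The methods share their inputs, so requesting more of them lengthens the result table without creating new tasks.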

Examples

data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(data, nsim = 100L, seed = 42L, nrow = c(5L, 10L))
scenario
#> <ssdsims_scenario>
#>   seed:     42
#>   nsim:     100
#>   datasets: ccme_boron
#>   nrow:     5, 10
#>   replace:  TRUE
#>   nrow_max: 1000 (setting)
#>   fit grid:
#>     rescale: FALSE
#>     computable: FALSE
#>     at_boundary_ok: TRUE
#>     min_pmix: ssd_min_pmix
#>     range_shape1: {0.05, 20}
#>     range_shape2: {0.05, 20}
#>     dists: gamma, lgumbel, llogis, lnorm, lnorm_lnorm, weibull (setting)
#>   hc grid:
#>     est_method: multi (setting)
#>     proportion: 0.05 (setting)
#>     ci: FALSE (setting)
#>     nboot: 1000
#>     ci_method: weighted_samples
#>     parametric: TRUE
#>     samples: FALSE (setting)
#>   distsets:
#>     BCANZ: gamma, lgumbel, llogis, lnorm, lnorm_lnorm, weibull
#>   partition_by:
#>     sample: dataset, sim, replace
#>     fit: dataset, sim, nrow, rescale
#>     hc: dataset, sim
#>   bundle:
#>     sample: 
#>     fit: replace, computable, at_boundary_ok, min_pmix, range_shape1, range_shape2
#>     hc: replace, nrow, rescale, computable, at_boundary_ok, min_pmix, range_shape1, range_shape2, nboot, ci_method, parametric, distset