Constructs a purely declarative ssdsims_scenario object: the root of
the targets-based pipeline (see TARGETS-DESIGN.md section 1). The object stores
only declarative fields - a scalar seed, the replicate count nsim, the
sample sizes nrow, the dataset names, and the fit and hc argument
grids. It performs no random-number generation and no task expansion,
and has no dependency on targets.
Usage
ssd_define_scenario(
  data,
  nsim,
  seed,
  ...,
  nrow = 6L,
  replace = TRUE,
  rescale = FALSE,
  computable = FALSE,
  at_boundary_ok = TRUE,
  min_pmix = ssd_pmix(ssd_min_pmix = ssdtools::ssd_min_pmix),
  range_shape1 = list(c(0.05, 20)),
  range_shape2 = list(c(0.05, 20)),
  nrow_max = 1000L,
  dists = ssd_distset(BCANZ = ssdtools::ssd_dists_bcanz()),
  est_method = "multi",
  proportion = 0.05,
  ci = FALSE,
  nboot = 1000,
  ci_method = "weighted_samples",
  parametric = TRUE,
  samples = FALSE,
  partition_by = NULL,
  bundle = NULL
)
Arguments
- data
An ssd_scenario_data() collection: a validated, named collection of Conc tibbles assembled from data frames and/or ssd_gen() generator datasets.
- nsim
A count of the number of data sets to generate.
- seed
A scalar whole number; the scenario's RNG root. Required - changing it fully re-roots the scenario's random-number draws.
- ...
Unused; must be empty.
- nrow
A whole-number vector of sample sizes (the fit-step truncation axis), each between 5 (the fit floor) and nrow_max (the shared draw size, the universal ceiling). A value within nrow_max but above a dataset's row count is still valid: its replace = TRUE cell draws with replacement, while its replace = FALSE cell (which cannot exceed the dataset size) is silently discarded for that dataset.
- replace
A logical vector (a cross-join axis of one or two values) specifying whether the shared sample draw resamples with replacement. Defaults to TRUE (the standard resampling model, drawing nrow_max rows, so nrow is not capped by the dataset size); FALSE draws a permutation, capping the effective draw - and so each nrow - at the dataset size.
- rescale
Either a flag or a string. As a flag: leave the values unchanged (FALSE) or rescale concentration values by dividing by the geometric mean of the minimum and maximum positive finite values (TRUE). As a string: leave the values unchanged ("no"), rescale by that same geometric mean ("geomean"), or logistically transform ("odds").
- computable
A flag specifying whether to only return fits with numerically computable standard errors.
- at_boundary_ok
A flag specifying whether a model with one or more parameters at the boundary should be considered to have converged (default = TRUE).
- min_pmix
An ssd_pmix() collection of one or more single-argument min_pmix functions, referenced by name. This is the only accepted form; a bare function, a plain list, or a character vector of names is rejected (no string-to-function resolution). The collection's name (not the function value) is what the task path hashes; the function is materialised on the scenario (keyed by name) for execution and isolated via scenario_min_pmix(). Defaults to ssd_pmix(ssd_min_pmix = ssdtools::ssd_min_pmix).
- range_shape1
A list of numeric vectors of length two of the lower and upper bounds for the shape1 parameter.
- range_shape2
A list of numeric vectors of length two of the lower and upper bounds for the shape2 parameter.
- nrow_max
A whole number (default 1000L): the fixed size of the shared sample draw that every nrow value sub-truncates. A sample-level scenario setting, not a cross-join axis. The effective per-dataset draw is min(nrow_max, nrow(data)) when replace = FALSE (the high default draws the full permutation) and nrow_max rows when replace = TRUE; each nrow must not exceed the effective draw size.
- dists
An ssd_distset() collection of one or more named distribution sets (pools model-averaged together to form one SSD). The fit step fits the union of every set's members once; the hc step subsets that union fit per set and re-averages, so the set name is an hc cross-join axis ("distset") while individual distributions never fan out. A bare character vector or plain list aborts, pointing to ssd_distset().
- est_method
A string specifying whether to estimate directly from the model-averaged cumulative distribution function (est_method = 'multi'), to take the arithmetic mean of the estimates from the individual cumulative distribution functions weighted by the AICc-derived weights (est_method = 'arithmetic'), or to use the geometric mean instead (est_method = 'geometric').
- proportion
A numeric vector of proportion values to estimate hazard concentrations for.
- ci
A flag specifying whether to estimate confidence intervals (by bootstrapping).
- nboot
A count of the number of bootstrap samples to use to estimate the confidence limits. A value of 10,000 is recommended for official guidelines.
- ci_method
A string specifying which method to use for estimating the standard error and confidence limits from the bootstrap samples. The default and recommended value is still ci_method = "weighted_samples", which takes bootstrap samples from each distribution proportional to its AICc-based weights and calculates the confidence limits (and SE) from this single set. ci_method = "multi_fixed" and ci_method = "multi_free" generate the bootstrap samples using the model-averaged cumulative distribution function but differ in whether the model weights are fixed at the values for the original dataset or re-estimated for each bootstrap sample dataset. The value ci_method = "MACL" (was ci_method = "weighted_arithmetic"), which is only included for historical reasons, takes the weighted arithmetic mean of the confidence limits, while ci_method = "GMACL", which takes the weighted geometric mean of the confidence limits, was added for completeness but is also not recommended. Finally, ci_method = "arithmetic_samples" and ci_method = "geometric_samples" take the weighted arithmetic or geometric mean of the values for each bootstrap iteration across all the distributions and then calculate the confidence limits (and SE) from the single set of samples.
- parametric
A flag specifying whether to perform parametric bootstrapping as opposed to non-parametrically resampling the original data with replacement.
- samples
A logical scalar (default FALSE): retain the bootstrap draws in the hc result's samples list-column (passed to ssdtools::ssd_hc()). This is output retention only - it does not change the estimates or the per-task RNG, so it is not a grid or task axis (a single TRUE is a superset of FALSE). Changing it re-runs the hc step (the discarded draws must be re-bootstrapped) but yields byte-identical estimates; retained samples can be large (nboot draws per dist per task), so it is off by default.
- partition_by
An optional, possibly-partial named list keyed by step (sample / fit / hc) of character vectors naming the Hive path axes for that step (one shard per path cell; the inner complement rides as Parquet columns). Each entry must be unique, non-missing, and a subset of that step's axis vocabulary: sample = dataset, sim, replace; fit adds nrow, rescale, computable, at_boundary_ok, min_pmix, range_shape1, range_shape2; hc adds nboot, ci_method, parametric, distset (ci and est_method are hc scenario settings, not axes; dists is the fit-level scenario setting feeding the union, and distset - the set name - is the hc axis over the union's post-fit subsets). "nrow" is rejected only for sample (the shared draw carries no nrow axis; the fit step truncates it inline), and is a valid path axis for fit / hc. Steps partition independently - there is no cross-step constraint; a step may be finer or coarser than its neighbour on a shared axis (the m:n parent-shard relationship is resolved at the read layer). Steps left unnamed take their documented defaults (sample = c("dataset", "sim", "replace"), fit = c("dataset", "sim", "nrow", "rescale"), hc = c("dataset", "sim"); these supersede TARGETS-DESIGN.md section 5's pre-fold table). The split is orthogonal to the per-task RNG primer, so changing it shifts file paths only, never results.
- bundle
An optional, possibly-partial named list keyed by step, the per-step complement of partition_by: it names the inner axes to keep together within a shard, and the stored path axes become setdiff(task_axes(step), bundle[[step]]). partition_by and bundle are complementary per-step entry points - at most one may name a given step (a step in both is an error), but they may be mixed across steps and either may be partial. Use partition_by when you want few path axes, bundle when you want fine sharding and only a few inner axes. Both normalise into the single stored partition_by path list.
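The complement relationship between bundle and partition_by can be sketched in base R. This is an illustration only: fit_axes is a plain character vector copied from the axis vocabulary listed under partition_by, standing in for task_axes("fit"), and the snippet does not call the package.

```r
# Axis vocabulary for the fit step, copied from the partition_by description.
# A plain character vector for illustration; this is an assumption standing in
# for the package's task_axes("fit").
fit_axes <- c(
  "dataset", "sim", "replace",  # inherited from the sample step
  "nrow", "rescale", "computable", "at_boundary_ok",
  "min_pmix", "range_shape1", "range_shape2"
)

# Illustrative bundle entry: the inner axes to keep together within a shard.
bundle_fit <- c(
  "replace", "computable", "at_boundary_ok",
  "min_pmix", "range_shape1", "range_shape2"
)

# The stored path axes are the complement of the bundle within the vocabulary.
path_axes_fit <- setdiff(fit_axes, bundle_fit)
print(path_axes_fit)
```

With this bundle the path axes come out as dataset, sim, nrow, rescale, which is the documented default partition_by for the fit step.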
Details
Input data arrives as an ssd_scenario_data() collection (already
validated: a numeric Conc column is required) and is retained on the
scenario (as $data) so a local run (ssd_run_scenario_baseline()) can
sample it directly. The dataset names ($datasets) are what the
targets/cluster path hashes; the validated tibbles ride on the scenario and
are isolated by name via scenario_dataset(), so the hash need not carry
the data frames.
Dataset input
Dataset input is accepted only as an ssd_scenario_data() collection,
which owns validation and naming. Assemble it first, then pass it in:
data <- ssd_scenario_data(boron = ccme_boron, cadmium = ccme_cadmium)
scenario <- ssd_define_scenario(data, ...)
Generator inputs (a fitdists/tmbfit object, a generator function, or a
function-name string) are materialised - once, reproducibly - by ssd_gen()
and composed into the same collection; the constructor itself performs no
random-number generation.
nrow_max
nrow_max is the sample-level scenario setting: the fixed size of
the shared sample draw that every nrow value sub-truncates
(head(sample, nrow), TARGETS-DESIGN.md section 5). The effective
per-dataset draw is min(nrow_max, nrow(data)) for replace = FALSE (the
high default thus draws the full permutation) and nrow_max rows for
replace = TRUE. Because the draw size is fixed - not derived from
max(nrow) - adding nrow values (within the effective draw size) never
changes the draw, so cached sample shards stay valid. Each nrow is
validated at construction against the effective draw size. It is not
ci-gated (the draw happens regardless of ci) and, like dists and
est_method, it is absent from task_axes("sample"), so it never
multiplies tasks or enters the per-task RNG primer.
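A minimal base R sketch of the fixed-size draw and its prefix truncation, assuming a made-up concentration vector. The helper effective_size() and the use of set.seed() are illustrative assumptions; the package derives its own per-task RNG state and is not called here.

```r
set.seed(42)  # illustrative seed only; not how the package seeds its tasks

conc <- c(1.0, 1.8, 2.4, 3.1, 4.7, 6.3, 9.9, 12.5)  # made-up concentrations
nrow_max <- 1000L

# Hypothetical helper: nrow_max rows with replacement, capped at the data
# size for a permutation (replace = FALSE).
effective_size <- function(n_data, nrow_max, replace) {
  if (replace) nrow_max else min(nrow_max, n_data)
}

size_with <- effective_size(length(conc), nrow_max, replace = TRUE)
size_without <- effective_size(length(conc), nrow_max, replace = FALSE)

# One shared draw per replace setting.
draw <- sample(conc, size = size_with, replace = TRUE)
perm <- sample(conc, size = size_without, replace = FALSE)

# Each nrow value takes a prefix of the shared draw ...
fit_5 <- head(draw, 5L)
fit_10 <- head(draw, 10L)

# ... so the smaller sample is nested in the larger one, and adding another
# nrow value later reuses the same draw.
nested <- identical(fit_5, head(fit_10, 5L))
```

Because every nrow value reads a prefix of one draw, requesting an extra sample size changes which rows are read, not which rows were drawn.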
ci
ci is a scalar flag (not a cross-join axis): the point estimate est is
invariant to ci - it is computed analytically from the fit, independent of
the bootstrap and RNG - so a single ci = TRUE run is a strict superset of
ci = FALSE (same est, plus the se/lcl/ucl columns). The choice is
scenario-wide either/or: ci = FALSE for cheap, bootstrap-free point
estimates, or ci = TRUE for estimates plus confidence intervals. When
ci = FALSE, the bootstrap-only scenario options nboot, ci_method, and
parametric are meaningless; passing any of them in that case is an error,
so set ci = TRUE to enable bootstrap, or omit the options.
dists and est_method
dists is an ssd_distset() collection: one or more distribution sets
(pools of distributions model-averaged together to form one SSD), each named.
The fit step fits the union of every set's members once - the single
model-averaged superset every pool is a subset of - so scenario$fit$dists is
that union and individual distributions still never fan out (an axis value
is always a whole pool). The named sets ride on the hc grid
(scenario$hc$distsets); the hc step subsets that one union fit down to each
set's members (subset(fit, set, strict = FALSE)) and re-averages, so
"distset" is an hc cross-join axis (task_axes("hc")) keyed by the set
name - several pools reuse one fit rather than re-fitting. A bare character
vector or plain list aborts loudly, pointing to ssd_distset().
est_method is a scenario setting, not a cross-join axis - it is absent
from task_axes("hc"), so it never multiplies tasks or enters the per-task RNG
primer. It is an hc-level setting: every requested method is summarised from
each hc task's single bootstrap sample set rather than re-bootstrapping per
method (the CI is est_method-invariant and the point est is analytical), so a
vector est_method yields one row per method within a task without fanning out
into separate tasks.
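The three est_method values described under Arguments can be compared with a small base R sketch. The weights and distribution parameters below are made up rather than fitted, and the root-finding step is only one way to estimate directly from a model-averaged CDF; it is not taken from the ssdtools implementation.

```r
# Made-up AICc weights and parameters for two distributions; not fitted values.
weights <- c(lnorm = 0.7, gamma = 0.3)
proportion <- 0.05

# Per-distribution estimates: the 5% quantile of each distribution.
hc_each <- c(
  lnorm = qlnorm(proportion, meanlog = 1, sdlog = 1),
  gamma = qgamma(proportion, shape = 2, rate = 0.5)
)

# "arithmetic": weighted arithmetic mean of the individual estimates.
hc_arithmetic <- sum(weights * hc_each)

# "geometric": weighted geometric mean of the individual estimates.
hc_geometric <- exp(sum(weights * log(hc_each)))

# "multi": solve for the concentration at which the model-averaged
# (weighted mixture) CDF equals the requested proportion.
mixture_cdf <- function(x) {
  weights[["lnorm"]] * plnorm(x, meanlog = 1, sdlog = 1) +
    weights[["gamma"]] * pgamma(x, shape = 2, rate = 0.5)
}
hc_multi <- uniroot(
  function(x) mixture_cdf(x) - proportion,
  interval = c(1e-6, 100), tol = 1e-10
)$root
```

All three summaries are computed from the same per-distribution inputs, which is why several methods can be reported from one task without re-fitting.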
Examples
data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(data, nsim = 100L, seed = 42L, nrow = c(5L, 10L))
scenario
#> <ssdsims_scenario>
#> seed: 42
#> nsim: 100
#> datasets: ccme_boron
#> nrow: 5, 10
#> replace: TRUE
#> nrow_max: 1000 (setting)
#> fit grid:
#> rescale: FALSE
#> computable: FALSE
#> at_boundary_ok: TRUE
#> min_pmix: ssd_min_pmix
#> range_shape1: {0.05, 20}
#> range_shape2: {0.05, 20}
#> dists: gamma, lgumbel, llogis, lnorm, lnorm_lnorm, weibull (setting)
#> hc grid:
#> est_method: multi (setting)
#> proportion: 0.05 (setting)
#> ci: FALSE (setting)
#> nboot: 1000
#> ci_method: weighted_samples
#> parametric: TRUE
#> samples: FALSE (setting)
#> distsets:
#> BCANZ: gamma, lgumbel, llogis, lnorm, lnorm_lnorm, weibull
#> partition_by:
#> sample: dataset, sim, replace
#> fit: dataset, sim, nrow, rescale
#> hc: dataset, sim
#> bundle:
#> sample:
#> fit: replace, computable, at_boundary_ok, min_pmix, range_shape1, range_shape2
#> hc: replace, nrow, rescale, computable, at_boundary_ok, min_pmix, range_shape1, range_shape2, nboot, ci_method, parametric, distset