A target factory: returns the list of target objects that run a
scenario as a static-branching, Hive-sharded pipeline (TARGETS-DESIGN.md
section 6), so a whole _targets.R reduces to building a scenario and calling
this function (see the snippet under Details).
Arguments
- scenario
An ssdsims_scenario from ssd_define_scenario().
- ...
Unused; must be empty. Its presence forces root, upload, and cue to be passed by name (rlang::check_dots_empty() aborts on a positional or misspelled argument), since root and upload are both path-shaped and easy to transpose.
- root
The base results directory (default "results"). The shards and summary are written under the seed-/layout-keyed scenario_results_dir(scenario, root), so a single-scenario run and a design-of-one address shards identically (a cache-free upgrade to ssd_design_targets()).
- upload
An optional upload destination (the remote-destination sibling of root) from ssd_upload_azure() or ssd_upload_dryrun(), or NULL (default) for no upload targets. See the "Uploading shards to cloud storage (upload)" section.
- cue
An optional targets::tar_cue() applied to every shard target (e.g. targets::tar_cue(depend = FALSE) to pin trusted shards against code changes). NULL (default) uses targets' standard cue.
Details
library(targets)
library(tarchetypes)
library(ssdsims)
data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(data, nsim = 2L, seed = 42L)
ssd_scenario_targets(scenario)

The shard and summary targets carry error = "null" so a shard whose body
fails entirely goes NULL (its error readable via tar_meta()) without
aborting the run, and ssd_summarise() unions whatever landed
(TARGETS-DESIGN.md section 6.2). The shipped _targets.R templates pair this
with a pipeline-wide keep-going default (tar_option_set(error = "continue"), the make -k analogue) so an errored target skips only its
dependents while the rest of the shards still build; fail-fast pre-flight
checks (upload/cluster connectivity) belong in a separate script the user
runs before tar_make(), not in this DAG.
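As a hedged illustration of reading the recorded errors after a partially-failed run, the sketch below assumes it is run interactively from a project whose _targets.R already calls ssd_scenario_targets(). It uses only targets' own tar_make() and tar_meta(); nothing in it is specific to ssdsims.

```r
library(targets)

# Build the pipeline. With error = "null" on the shard targets and a
# keep-going default, a failed shard does not abort the run.
tar_make()

# Keep only the metadata rows that recorded an error message, so each
# failed shard shows up by name alongside its error text.
failed <- tar_meta(fields = error, complete_only = TRUE)
print(failed)
```

An empty result means every target built without error.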
For each step the factory uses tarchetypes::tar_map() to create one named, format = "file",
error = "null" target per partition_by path cell (the names are the
step's path axes), and writes every shard and the summary under the
per-layout scenario_results_dir() root (so a changed partition_by/bundle
never mixes shard granularities). Each step's command depends only on the
minimal scenario slice its runner consumes (scenario_step_slice())
rather than the bare scenario global, so editing a field a step does not
read leaves the other steps' shards cached. The sample slice is built
per shard, carrying only the dataset(s) that shard reads, so appending a
dataset mints a new shard and leaves every existing shard cached.
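The slicing idea can be sketched in base R. This is not the package's scenario_step_slice(): the scenario fields, the step-to-field mapping, and the helper name step_slice_sketch() are all hypothetical, chosen only to show why editing a field one step does not read leaves another step's input unchanged.

```r
# Hypothetical scenario, reduced to a plain list for illustration.
scenario <- list(
  seed = 42L,
  nsim = 2L,
  dists = c("lnorm", "gamma"),
  proportion = 0.05
)

# Assumed mapping from each step to the only fields its runner reads.
step_fields <- list(
  sample = c("seed", "nsim"),
  fit = "dists",
  hc = "proportion"
)

# Hypothetical stand-in for scenario_step_slice(): keep only those fields.
step_slice_sketch <- function(scenario, step) {
  scenario[step_fields[[step]]]
}

# Edit a field that only the hc step reads.
edited <- utils::modifyList(scenario, list(proportion = 0.1))

# The fit slice is unchanged, so fit shards depending on it stay cached.
fit_unchanged <- identical(
  step_slice_sketch(scenario, "fit"),
  step_slice_sketch(edited, "fit")
)

# The hc slice differs, so only hc shards see a changed input.
hc_changed <- !identical(
  step_slice_sketch(scenario, "hc"),
  step_slice_sketch(edited, "hc")
)
```

Depending on the bare scenario object instead would change the input of every step whenever any field is edited.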
Invalidation model
The shard targets use content-hash invalidation over their format = "file" Parquet outputs (TARGETS-DESIGN.md section 8), observable as
cache-by-existence: a shard is up to date iff its Parquet exists and the
inputs its body depends on - its task rows, the step's minimal scenario slice
(scenario_step_slice()), and the parent shard target(s) it reads - are
unchanged. A missing Parquet rebuilds; a recomputed shard whose bytes are
byte-identical leaves its dependents skipped.
Instead of a coarse sample -> fit -> hc tar_combine() barrier (which marks
the whole downstream step out of date when any one parent shard changes),
each child shard target names only the specific parent shard target(s) its
tasks read (the Option-3 per-child upstream edges of section 6), computed at
sourcing time as unique(path_key(tasks, partition_by[[parent]])) - the same
projection the runner uses to read them. So rewriting one parent
shard re-runs only the child shards that read it. summary reads the whole
hc directory, so it names every hc shard (it re-runs when any hc shard's
bytes change, and unions the survivors of a partially-failed run).
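The per-child edge computation can be illustrated with a small base-R sketch. path_key() is internal to the package, so path_key_sketch() below is a hypothetical stand-in that builds Hive-style keys; the task columns and the partition_by layout are likewise assumptions made for the example.

```r
# Hypothetical stand-in for the internal path_key(): one Hive-style
# "axis=value/..." key per task row, built from the named path axes.
path_key_sketch <- function(tasks, axes) {
  parts <- lapply(axes, function(axis) paste0(axis, "=", tasks[[axis]]))
  do.call(paste, c(parts, sep = "/"))
}

# Illustrative task table for one child (fit) shard; the column names are
# assumptions, not the package's real task schema.
fit_tasks <- data.frame(
  dataset = c("boron", "boron", "boron"),
  sim = c(1L, 1L, 2L),
  nrow = c(6L, 10L, 6L),
  stringsAsFactors = FALSE
)

# Assumed parent layout: sample shards are partitioned by dataset and sim.
partition_by <- list(sample = c("dataset", "sim"))

# Project the child's tasks onto the parent's path axes and deduplicate:
# these are the only parent shards this child needs to name.
parent_shards <- unique(path_key_sketch(fit_tasks, partition_by[["sample"]]))
print(parent_shards)
```

Three fit tasks collapse to two distinct parent keys here, so this child would depend on two sample shards rather than on the whole sample step.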
Pinning trusted shards (cue)
Pass cue = targets::tar_cue(depend = FALSE) to pin the shard targets
against upstream dependency/code changes (an edited per-task primitive, a
bumped ssdtools), so trusted shards are not rebuilt by a code edit
(TARGETS-DESIGN.md section 8.3). The carve-outs still hold: a shard rebuilds
if its format = "file" Parquet is missing, if its task-table grouping
changes (the grouping is part of the command, so path-axis and inner-axis
growth still apply under the pin), or if it previously errored. Force a
refresh of chosen shards with targets::tar_invalidate() (or by deleting
their Parquet), overriding the pin (section 8.4). The default (NULL) is
targets' standard cue.
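A minimal _targets.R sketch of the pinned configuration, reusing the scenario from the Details snippet:

```r
# _targets.R (sketch)
library(targets)
library(tarchetypes)
library(ssdsims)

data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(data, nsim = 2L, seed = 42L)

# Pin every shard target against code and upstream-dependency changes.
# cue must be passed by name because of the empty `...` guard.
ssd_scenario_targets(scenario, cue = tar_cue(depend = FALSE))
```

To override the pin for chosen shards, something like targets::tar_invalidate(starts_with("fit")) can be run from the console. The "fit" prefix is an assumption about how the shard targets are named, so check targets::tar_manifest() for the real names first.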
Volatile fit/hc file hashes (cost-analysis timings)
The fit/hc shards carry per-task .start/.end/.host timing columns
(the cost-analysis instrumentation), so a fit/hc shard's file hash is
no longer deterministic across recomputes: a forced recompute that yields
identical results still writes different bytes (a fresh wall-clock), so its
dependent hc/summary targets re-run and any paired upload_<step>
re-ships. This is scoped to fit/hc; sample shards carry no timing columns
and stay byte-deterministic. Routine caching is unaffected (a cache hit is not
a recompute, so a cached shard's bytes are unchanged); the cost lands only on a
forced refresh (tar_invalidate(), a deleted Parquet) or a code-edit
recompute - and the section 8.3 cue = tar_cue(depend = FALSE) pin covers the latter.
Per-task results remain byte-identical to the baseline oracle (the
shard-runner contract narrows to the result columns, timing excluded).
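The effect of the timing columns on the file hash can be reproduced with a base-R sketch. It makes several assumptions: CSV files and tools::md5sum() stand in for the Parquet shards and for the hash targets actually uses, fake_shard() is a hypothetical shard body, and the clock is passed in as an argument so the example is deterministic.

```r
# Hypothetical shard body: deterministic results plus timing columns.
fake_shard <- function(started) {
  data.frame(
    task = 1:2,
    est = c(1.5, 2.5),        # stand-in for deterministic per-task results
    .start = started,
    .end = started + 2,       # two seconds later
    .host = "worker-1",       # illustrative host name
    stringsAsFactors = FALSE
  )
}

# Write a shard to a temporary file and return the hash of its bytes.
shard_hash <- function(shard) {
  path <- tempfile(fileext = ".csv")
  write.csv(shard, path, row.names = FALSE)
  unname(tools::md5sum(path))
}

# Two recomputes of the same tasks at different wall-clock times.
first <- fake_shard(as.POSIXct("2024-01-01 09:00:00", tz = "UTC"))
second <- fake_shard(as.POSIXct("2024-01-01 09:05:00", tz = "UTC"))

# The file bytes differ because the timing columns differ.
same_bytes <- identical(shard_hash(first), shard_hash(second))

# Restricted to the result columns, the shards are identical.
timing_cols <- c(".start", ".end", ".host")
result_cols <- setdiff(names(first), timing_cols)
same_results <- identical(first[result_cols], second[result_cols])
```

Here same_bytes is FALSE while same_results is TRUE, which mirrors the contract described above: comparisons against the oracle look at the result columns only.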
The head(sample, nrow) truncation stays folded into the fit step (no
materialised data shard): a fit shard is keyed by fit_id, which includes
nrow, so extending nrow mints new fit shards and caches the rest. The
shared draw is sized by the scenario's fixed nrow_max setting (carried on
the sample slice), not max(nrow), so extending nrow within the
effective draw size leaves the sample shards cached too; changing
nrow_max invalidates the sample slice and rebuilds the draw, propagating
through the per-child edges - no stale short draw can arise.
To parallelise the shards, set a controller (e.g. a mirai-backed
crew::crew_controller_local()) with targets::tar_option_set() in
_targets.R before calling this - the target set is unchanged.
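A minimal _targets.R sketch of that setup, assuming the crew package is installed; the worker count is illustrative.

```r
# _targets.R (sketch)
library(targets)
library(tarchetypes)
library(ssdsims)

# Run shard targets on two local worker processes. This must be set
# before the target list is returned.
tar_option_set(
  controller = crew::crew_controller_local(workers = 2L)
)

data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(data, nsim = 2L, seed = 42L)
ssd_scenario_targets(scenario)
```

Only the execution backend changes; the factory call is the same as in the sequential case.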
Uploading shards to cloud storage (upload)
upload is the remote-destination sibling of root (default NULL).
With upload = NULL the pipeline contains no upload targets -
the clean default DAG for a non-uploader. With a non-NULL upload object the
factory pairs each step shard with an upload_<step> target in the same
tar_map (format = "file", error = "null"), so an unchanged shard is
never re-uploaded (content-hash skip) and a per-shard upload failure isolates
to its own branch; it also pairs the summary fan-in with a single
upload_summary target (same format = "file", error = "null" contract)
that ships the combined summary Parquet - and, when the scenario sets
samples = TRUE, the full summary-samples.parquet alongside it - with the
same content-hash skip, so the summary re-ships only when its bytes change.
Pass ssd_upload_dryrun() for no-op upload targets that
reach no network (exercising the DAG shape offline / in CI) or
ssd_upload_azure() to ship to Azure. The factory performs no network
I/O and never runs the ssd_test_upload() probe: it only assembles the
target list, so sourcing _targets.R (which targets does on every
tar_make(), tar_manifest(), tar_visnetwork(), and on each worker) stays
side-effect free. Run ssd_test_upload(upload) yourself as a one-line
preflight before tar_make() to confirm credentials and connectivity up
front; a missing credential still fails loud per-shard at upload time as a
backstop. The per-task results are byte-identical across all three upload
modes; only the presence and behaviour of the upload targets differ.
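The preflight can live in a short script kept outside the pipeline. The sketch below is an assumption-laden outline: the file name is hypothetical, the dry-run destination is only a placeholder for whatever upload object _targets.R uses, and it assumes ssd_test_upload() signals an error when the probe fails.

```r
# preflight.R (hypothetical file name), run before tar_make()
library(ssdsims)

# Must match the upload object that _targets.R passes to
# ssd_scenario_targets(); the dry-run destination is a placeholder here.
upload <- ssd_upload_dryrun()

# Probe the destination up front. Assumed to stop with an error if
# credentials or connectivity are missing, so tar_make() is never reached.
ssd_test_upload(upload)

# Only build once the probe has passed.
targets::tar_make()
```

Keeping the probe in its own script leaves sourcing _targets.R free of network side effects.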
See also
scenario_results_dir(), ssd_run_scenario_shards() (the
single-core, targets-free equivalent), ssd_upload_azure().
Examples
if (FALSE) { # \dontrun{
# _targets.R
library(targets)
library(tarchetypes)
library(ssdsims)
data <- ssd_scenario_data(ssddata::ccme_boron)
scenario <- ssd_define_scenario(data, nsim = 2L, seed = 42L)
ssd_scenario_targets(scenario)
# Pair each shard with a (no-op) upload target, exercised offline:
ssd_scenario_targets(scenario, upload = ssd_upload_dryrun())
} # }