Using reprotrail in practice (Introduction)
reprotrail records the practical trail behind a data-processing output: the
command that ran, Git state for the workflow repository and active editable
runtime dependencies, selected input path state, Pixi runtime metadata, product
sidecars, dependency epochs (see below), and enough information to set up a
reproduction workspace.
It is intentionally workflow-agnostic, but there are Python-based convenience options. Use it around a shell command, inside Snakemake, or through Python APIs. Your workflow still owns domain logic, input selection, resources, and output naming.
General idea
- Run product-producing commands through reprotrail run.
- Let reprotrail write a provenance sidecar and, when possible, product package files such as README.md, checksum, license metadata, and RO-Crate metadata.
- Archive or track provenance records instead of, or along with, data checksums.
- Revisit provenance records when in doubt.
- Use reprotrail reproduce if data is corrupted or goes missing.
What users need to do
- Add [tool.reprotrail] to pyproject.toml.
- Add reprotrail.products.toml if products need license, README, attribution, or software license metadata.
- Wrap each durable-output command with reprotrail run.
- Pass --provenance-json and, when relevant, --product-output.
- Inspect the generated .prov.json, checksum, README, and license metadata.
- Add reprotrail epoch check or reprotrail epoch audit where runtime drift matters.
- Use reprotrail reproduce when a product needs to be recreated or audited in a clean workspace.
Runtime software state
Trusted runtime software state comes from the command’s active environment. The
project Git checkout is recorded as project_repo. Active external Pixi
editable/path dependencies are recorded as software_repos. Installed Python
distributions named in package_summary are recorded in the environment summary
and dependency snapshot as runtime_packages, including sanitized
direct_url.json source metadata when available.
The repos key in [tool.reprotrail] is diagnostic-only. It can list sibling checkouts
that are useful to inspect, but those repos are written under configured_repos
only when they are not active runtime sources. They do not satisfy runtime
provenance, do not affect dependency epochs, and do not block execution if they
are dirty.
Dirty project repos and active editable/path dependency repos block execution by
default; pass --allow-dirty only when that trusted runtime state is intentional
and should be part of the record.
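When inspecting a record, it can help to locate the runtime-state sections named above. The following is a minimal sketch that searches a parsed JSON record for the keys project_repo, software_repos, configured_repos, and runtime_packages at any nesting depth, because the exact layout of the .prov.json file is not described here. The helper find_sections and the sample record are hypothetical and are not part of reprotrail.

```python
import json
from pathlib import Path

# Section names taken from the text above. Where they sit inside the record
# is not assumed; the search below walks the whole structure.
WANTED = {"project_repo", "software_repos", "configured_repos", "runtime_packages"}


def find_sections(node, wanted=WANTED, path=""):
    """Yield (location, value) for every wanted key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if key in wanted:
                yield here, value
            yield from find_sections(value, wanted, here)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_sections(value, wanted, f"{path}/{index}")


def load_record(path):
    """Load a provenance sidecar from disk as plain Python data."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical in-memory record, used only to demonstrate the search.
sample = {"project_repo": {}, "environment": {"runtime_packages": []}}
locations = [location for location, _ in find_sections(sample)]
print(locations)
```

For a real record, pass the result of load_record("results/product.prov.json") to find_sections instead of the sample dictionary.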
[tool.reprotrail]
repos = [".", "../shared-utils"]
product_root_markers = ["products", "prepared", "adjusted"]
package_summary = ["my-project", "shared-utils", "xarray"]
pixi_environment = "dev"
pixi_lockfile = "pixi.lock"
Dependency epochs
Dependency epochs are a lightweight contract for a run root. They record the
accepted runtime snapshot: Pixi lockfile hash, Pixi environment, selected package
versions and source metadata, platform identity, and editable dependency Git
state. Commit changes in Git-sourced packages are included even when the package version
string stays the same. If the runtime changes, reprotrail epoch check can stop
the workflow until the change is accepted with a reason.
reprotrail epoch check --run-root results/run
reprotrail epoch check \
    --run-root results/run \
    --acceptance-reason "validated smoke metrics"
reprotrail epoch audit \
    --run-root results/run \
    --output results/run/qc/dependency_epochs.json
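The idea behind an epoch can be illustrated without reprotrail: store a snapshot of the accepted runtime, then compare it with the current one. The sketch below is a conceptual illustration only. The use of SHA-256, the snapshot field names, and the package versions and commit identifiers are assumptions made for the example, not reprotrail's actual format.

```python
import hashlib


def lockfile_digest(lockfile_bytes):
    """Hash the lockfile content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(lockfile_bytes).hexdigest()


def changed_fields(accepted, current):
    """Return the sorted names of snapshot fields that differ."""
    keys = accepted.keys() | current.keys()
    return sorted(k for k in keys if accepted.get(k) != current.get(k))


# Hypothetical accepted snapshot; all values are made up for illustration.
accepted = {
    "lockfile_sha256": lockfile_digest(b"lock contents v1"),
    "pixi_environment": "dev",
    "packages": {"shared-utils": {"version": "0.1.0", "commit": "aaa111"}},
}

# Same version string, different Git commit: this still counts as drift.
current = dict(
    accepted,
    packages={"shared-utils": {"version": "0.1.0", "commit": "bbb222"}},
)

drift = changed_fields(accepted, current)
print(drift)
```

Comparing whole package entries rather than version strings alone is what lets a commit change surface even though the version is unchanged.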
Common use cases
Capture one product-producing command:
reprotrail run \
    --log results/run.log \
    --provenance-json results/product.prov.json \
    --product-output results/product.zarr \
    -- python -m my_project.step --output results/product.zarr
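When many steps need the same wrapping, the command line above can be assembled in Python and handed to subprocess. This is a minimal sketch that uses only the flags shown in the command above; reprotrail_run_argv is a hypothetical helper, and treating --log and --product-output as optional is an assumption.

```python
import shlex


def reprotrail_run_argv(command, provenance_json, log=None, product_output=None):
    """Build the argument list for wrapping `command` in `reprotrail run`."""
    argv = ["reprotrail", "run"]
    if log is not None:
        argv += ["--log", log]
    argv += ["--provenance-json", provenance_json]
    if product_output is not None:
        argv += ["--product-output", product_output]
    # Everything after "--" is the wrapped command itself.
    return argv + ["--", *command]


argv = reprotrail_run_argv(
    ["python", "-m", "my_project.step", "--output", "results/product.zarr"],
    provenance_json="results/product.prov.json",
    log="results/run.log",
    product_output="results/product.zarr",
)
print(shlex.join(argv))
# To execute the command: subprocess.run(argv, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when output paths contain spaces.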
Describe product metadata:
[[products]]
output = "results/**/*.zarr"
license = "CC-BY-4.0"
[[products.inputs]]
path = "data/source.zarr"
name = "Observed source data"
producer = "BOKU-Met"
license = "CC-BY-4.0"
Create a reproduction workspace and run:
reprotrail reproduce \
    --provenance results/product/product.prov.json \
    --workspace /tmp/product-reproduction \
    --env dev \
    --execute
Current limitations
- reprotrail records and checks provenance; it does not make a non-deterministic command deterministic.
- Pixi is the first-class runtime environment path today.
- Product licenses are never guessed; configure them in reprotrail.products.toml.
- Dirty repos and editable/path dependencies require explicit allowance, and reproduction may need --repo-source.
- Input provenance records path/backend state, not guaranteed long-term access to private or moved data.