mars

HPC Contracts

This page defines what an HPC-ready mars-earth release is allowed to claim. It converts the archived HPC roadmap and packaging feasibility notes into testable contracts before implementation or external submission work starts. Parallel subagent execution guidance lives in HPC Parallel Execution Guide.

The current project is not yet an HPC runtime. The contracts below are gates: a release, registry recipe, or foundation packet must not claim an HPC tier until the corresponding contract is implemented, tested, and documented.

Contract Levels

Level	Name	Claim allowed	Required evidence
H0	HPC-packaging ready	Source-installable in HPC-style environments	Spack, EasyBuild, and conda-forge recipes or submission artifacts; clean Linux smoke tests; no accelerator claims
H1	CPU throughput runtime	Faster, deterministic batch replay on shared-memory CPU systems	Rust CPU-parallel prediction/design-matrix paths, benchmark baselines, regression thresholds, thread controls, conformance parity
H2	Stable runtime boundary	Host-language and packaging boundary is stable enough for HPC consumers	Narrow C ABI or equivalent FFI, Arrow-compatible batch interchange where practical, ABI/version tests, documented error and memory ownership
H3	Accelerator-ready runtime	GPU/accelerator execution is available for replay workloads	Optional accelerator backend, CPU fallback parity, device capability detection, numerical tolerance contract, no mandatory accelerator dependency
H4	Distributed execution	Multi-node or distributed replay is supported	Explicit partitioning semantics, deterministic aggregation, distributed smoke tests, failure-mode documentation

Cross-Cutting Guarantees

The public Python API remains compatible with import pymars as earth and earth.Earth(...).
Host-language packages keep ecosystem-native names while documenting the shared mars-earth brand.
CPU fallback remains available for every H3 accelerator or H4 distributed feature, with H3/H4 surfaces remaining opt-in.
ModelSpec replay semantics are identical across Python, Rust, R, Julia, C#, Go, and TypeScript unless a binding explicitly documents an unsupported capability.
Parallel and accelerator (H0/H1) paths must be opt-in or resource-bounded by default; importing the package must not spawn worker pools or initialize devices.
H3 accelerator replay and H4 command-backed multi-node replay are opt-in runtime surfaces; H0/H1 packaging text must still avoid implying mandatory H3 accelerator or hidden cluster behavior.
Numerical differences must be bounded by a documented tolerance and validated against shared fixtures.
Every external packaging or foundation submission must state its implemented HPC contract level and must not imply higher levels.

Implementation Dependency Graph

The HPC tracks should be implemented in this order unless a later explicit design review changes the dependency graph:

H0 packaging readiness can run in parallel with H1 benchmark baseline work, but upstream submissions must wait until package names, source URLs, checksums, and smoke tests are real rather than placeholders.
H1 CPU throughput runtime is the first compute contract. It establishes benchmark baselines, serial/parallel parity, resource controls, and memory visibility that later contracts depend on.
H2 stable runtime boundary depends on H1 semantics being stable enough to expose through an ABI or batch interchange layer.
H3 accelerator portability depends on H1 benchmark data and should not choose a backend until CPU kernel shape and packaging constraints are known.
H4 distributed execution depends on H1 partitioning semantics. It should depend on H3 only when distributed accelerator claims are being made.
HPSF and E4S packets require H0 plus credible H1/H2 evidence, or must be explicitly framed as pre-submission feedback for packaging readiness only.

Parallel Work Ownership

When using parallel subagents, assign disjoint write scopes:

Lane	Primary files	Must not edit
Contract governance	`docs/hpc_contracts.md`, claim-check docs/scripts	Runtime kernels and upstream recipe contents
H0 Spack	`packaging/spack/**`, Spack submission notes	EasyBuild, conda-forge, Rust kernels
H0 EasyBuild	`packaging/easybuild/**`, EasyBuild submission notes	Spack, conda-forge, Rust kernels
H0 conda-forge	`packaging/conda-forge/**`, recipe drafts	Spack, EasyBuild, Rust kernels
H1 CPU runtime	`rust-runtime/src/**`, Rust/Python benchmark and parity tests	H2 ABI files, H3 accelerator files, packaging recipes
H2 ABI/Arrow	FFI boundary files, ABI docs/tests, Arrow/batch adapters	H1 kernel internals unless required through an agreed interface
H3 accelerator	optional backend files and accelerator docs/tests	H1 serial semantics, H0 recipes, H4 distributed adapter
H4 distributed	distributed adapter files and cluster recipe docs/tests	H1 kernel internals, H3 backend internals
HPSF/E4S packets	community packet docs and evidence summaries	Runtime implementation and packaging recipe internals

Cross-lane changes must be proposed in the owning track first. Workers should not revert or rewrite another lane’s files.

Claim-Check Gate

Before any HPC implementation or submission phase is marked complete, run a docs claim check that flags unsupported use of these terms unless the nearby text names the implemented contract level and limitation:

The script in scripts/check_hpc_claims.py is the canonical term/level gate.
The reviewer checklist lives in docs/hpc_claim_review_checklist.md and captures explicit H0-H4 review requirements.

The claim check should start as a lightweight repository script and may later be promoted to CI.

H0: HPC-Packaging Ready

H0 means the project can be packaged by HPC-oriented package managers without

claiming accelerator or distributed execution (H0/H1 does not claim these). This level is explicitly packaging-only at this time.

Required deliverables:

Upstream-ready Spack recipe or submission PR.
Upstream-ready EasyBuild easyconfig or submission PR.
conda-forge staged-recipes PR or documented decision to defer.
Container or clean Linux install smoke tests for Python plus the Rust runtime.
Dependency policy documenting Rust, Python, NumPy, SciPy, scikit-learn, and optional binding dependencies.

Non-goals:

GPU, TPU, MPI, or distributed execution are not claimed at H0.
A stable ABI claim beyond normal source packaging.

H1: CPU Throughput Runtime

H1 means replay workloads can use shared-memory CPU parallelism while preserving single-thread deterministic behavior.

Required deliverables:

Rust batch prediction and design-matrix kernels with measured benchmark baselines.
Configurable thread controls, including a deterministic single-thread mode.
Regression thresholds in CI or a documented benchmark review gate.
Shared fixture parity across bindings for serial and parallel execution.
Memory allocation visibility for large batch replay workloads.

Initial success thresholds:

Single-thread mode must be behaviorally identical to the existing serial path and must not regress representative benchmark medians by more than 5% unless a documented tradeoff is approved.
Parallel mode must show a measurable speedup on large replay batches on a multi-core host, or the track must record why the current kernel shape is not parallelism-limited.
Thread controls must support at least deterministic single-thread execution and bounded multi-thread execution.
- Memory use for large batches must be measured and documented before any H1 release claim.

Non-goals:

H1 non-goal: GPU execution.
H1 non-goal: Distributed execution.
Changing estimator training semantics unless separately specified.

H2: Stable Runtime Boundary

H2 means HPC consumers have a stable, narrow runtime boundary that can be used without depending on Python internals.

Required deliverables:

Versioned ABI or FFI contract for loading ModelSpec artifacts, validating batches, computing design matrices, and predicting outputs.
Explicit memory ownership, error-code, and version-negotiation rules.
Arrow-compatible or Arrow-adjacent batch interchange decision with tests.
Host-language conformance tests that exercise the stable boundary.
Documentation that distinguishes stable ABI from internal Rust APIs.

Non-goals:

Exposing training internals through the ABI unless a later contract extends the boundary.

H3: Accelerator-Ready Runtime

H3 means replay workloads can run through an optional backend. The shared backend registry, CPU-fallback selection layer, and optional H3 array-module replay adapter are implemented. H3 CUDA, ROCm, Metal, TPU, FPGA, and ASIC factories remain optional module-backed adapters and do not add mandatory vendor dependencies.

Required deliverables:

Device discovery and capability checks that fail safely to CPU.
H3-required optional dependency model; users without accelerator runtimes can still install and use CPU packages.
Fixture parity against CPU replay within documented tolerances. Current tolerance evidence uses numpy.allclose on the shared ModelSpec fixture.
H3 benchmarks that compare CPU and optional accelerator-entrypoint paths on representative batches. Current benchmark evidence uses the array-test validation backend and does not claim vendor speedup.
Clear unsupported-feature behavior for basis terms or data layouts that the accelerator backend cannot handle (H3 scope).

Non-goals:

Mandatory CUDA, ROCm, Metal, TPU, or vendor-specific dependencies (H3 scope).
Distributed multi-node execution (H4 scope).

H4: Distributed Execution

H4 means replay can be partitioned across workers or nodes with documented semantics. The current H4 implementation includes a CPU-cluster replay path and an explicit H4 command-backed multi-node adapter. The command adapter is unavailable unless a worker command such as python -m pymars.cluster_worker is configured, so imports and local predictions do not start cluster workers.

Required deliverables:

Partitioning contract for row batches and output ordering.
Deterministic aggregation and failure behavior.
H4 distributed smoke tests and at least one documented cluster-oriented execution recipe.
Resource controls for worker count, chunk size, and memory limits.
Bounded retry controls for command-backed worker failures.
Explicit statement of whether training is supported or replay-only.

Non-goals:

Implicit cluster management.
Hidden network activity during import or simple local prediction.

External Submission Contract

External submissions must map to the contract level they actually satisfy:

Spack, EasyBuild, and conda-forge submissions require H0.
HPSF and E4S readiness packets require H0 plus credible H1/H2 evidence, or must explicitly present the project as packaging-ready but not HPC-compute complete.
Any accelerator-focused packet requires H3.
Any distributed-computing claim requires H4.

Claim Review Command

Run the governance gate locally before opening or updating upstream submissions:

uv run python3 scripts/check_hpc_claims.py --strict

For a non-strict pass that skips optional reference files, run:

./scripts/check_hpc_claims.sh

Current State

As of 2026-05-11:

H0 packaging feasibility artifacts and downstream tracks are prepared (Spack, EasyBuild, and conda-forge lanes are drafted); upstream PRs are still pending in external portals.
H1 parallel replay is implemented and measured with baseline runs in docs/hpc_cpu_parallel_runtime_benchmarks.md.
H2 stable runtime boundary is implemented in the repository as an additive ABI versioned boundary with explicit row-major batch interchange, but release claims still require the documented host validation evidence.
H3 optional array-module replay and H4 command-backed replay are implemented as opt-in surfaces; vendor H3 accelerator speedups and implicit cluster management remain intentionally unclaimed.
The package family has language-registry publication progress, and draft HPSF/E4S readiness packets are prepared at docs/hpsf_e4s_readiness_packets_20260511.md but still blocked on external submission review and maintainer clearance.

This site is open source. Improve this page.