Rust Accelerators and Feasibility Boundaries

This note records the workloads that are plausibly worth accelerating in the Rust core, and the cases where GPU, TPU, or custom-circuit work is not worth the complexity.

The rule is simple: accelerate only the workloads that are compute-bound, shape-stable, and already covered by a deterministic contract. Anything that changes the contract shape or depends on irregular control flow should stay on the scalar path.

What is plausibly GPU-bound

The Rust core has a few workload families that can plausibly benefit from GPU execution if the implementation is reworked around large batches and dense array operations:

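scalar summary kernels over many repeated probabilistic sensitivity analysis (PSA) samples
large matrix reductions for workloads in the style of expected value of perfect information (EVPI)
frontier-style sweeps where the input surface is already dense and regular
repeated dominance or cost-effectiveness acceptability frontier (CEAF) scans over wide strategy sets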
batched memory/throughput profiling workloads that are dominated by the same arithmetic on every sample

These are the cases where the workload is mostly arithmetic, the inputs are already numeric arrays, and the output can remain a stable summary envelope. That makes them the only sensible candidates for accelerator feasibility work in the short term.
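
As an illustration of the dense, regular shape these candidates share, here is a minimal scalar sketch of an EVPI-style reduction. It assumes a row-major net-benefit matrix with one row per PSA sample and one column per strategy. The function name `evpi` and the flat-slice layout are hypothetical and are not taken from the Rust core. It uses the standard definition: the mean of the per-sample maxima minus the maximum of the per-strategy means.

```rust
/// EVPI from a dense, row-major net-benefit matrix (hypothetical layout:
/// one row per PSA sample, one column per strategy).
/// Returns `None` when the shape is empty or inconsistent with the slice.
pub fn evpi(net_benefit: &[f64], n_samples: usize, n_strategies: usize) -> Option<f64> {
    if n_samples == 0 || n_strategies == 0 {
        return None;
    }
    if n_samples.checked_mul(n_strategies) != Some(net_benefit.len()) {
        return None;
    }

    let mut column_sums = vec![0.0_f64; n_strategies];
    let mut sum_of_row_max = 0.0_f64;

    // Fixed, sequential iteration order: the same input always produces
    // the same floating-point result.
    for row in net_benefit.chunks_exact(n_strategies) {
        let mut row_max = f64::NEG_INFINITY;
        for (sum, &value) in column_sums.iter_mut().zip(row) {
            *sum += value;
            row_max = row_max.max(value);
        }
        sum_of_row_max += row_max;
    }

    let n = n_samples as f64;
    // Best strategy under current information: the largest column mean.
    let best_mean = column_sums
        .iter()
        .fold(f64::NEG_INFINITY, |acc, &s| acc.max(s / n));

    Some(sum_of_row_max / n - best_mean)
}
```

Every sample goes through the same arithmetic with no data-dependent branching, which is what makes this shape a plausible batching candidate. A device reduction would typically sum in a different order, and floating-point addition is not associative, so bit-for-bit agreement with the scalar result would have to be checked rather than assumed.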

Why TPU or custom-circuit support is only conditional

TPU and custom-circuit support are only useful when the workload is large and regular enough to justify the extra deployment cost and the more rigid execution model.

That means they are conditional on all of the following:

the workload is large enough to amortize device transfer and compilation overhead
the kernel can be expressed as dense tensor math without branching over strategy-specific or sample-specific control flow
the result contract can stay identical to the scalar CPU contract
the implementation can tolerate a more rigid execution model than the CPU baseline
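
The amortization condition can be made concrete with a rough break-even estimate. The sketch below is an assumed cost model, not something measured or defined by the Rust core: the `OffloadCost` struct, its fields, and `break_even_runs` are hypothetical names, and it treats compilation as a one-off cost and transfer as a per-run cost.

```rust
/// Hypothetical cost model for one workload. All times are in seconds
/// and would come from profiling, not from this sketch.
pub struct OffloadCost {
    /// Scalar CPU time per run.
    pub cpu_time: f64,
    /// Kernel time per run on the device.
    pub device_time: f64,
    /// Host-to-device and device-to-host transfer time per run.
    pub transfer_time: f64,
    /// One-off compilation or setup cost.
    pub compile_time: f64,
}

impl OffloadCost {
    /// Smallest number of runs at which the device path is strictly
    /// faster in total than the CPU path, or `None` if it never is.
    pub fn break_even_runs(&self) -> Option<u64> {
        let saving_per_run = self.cpu_time - (self.device_time + self.transfer_time);
        // Written this way so that a NaN saving is also rejected.
        if !(saving_per_run > 0.0) {
            return None;
        }
        // Device wins when n * saving_per_run > compile_time.
        let n = (self.compile_time / saving_per_run).floor() + 1.0;
        Some(n.max(1.0) as u64)
    }
}
```

Under this model, a workload whose per-run transfer cost already exceeds the per-run saving never breaks even, no matter how many runs it is batched over.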

In practice, that makes TPU or custom-circuit support a follow-on feasibility question, not a default direction. The current Rust core is still a contract-first library, so the accelerator path must prove it can preserve the same result envelopes and deterministic tests.
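
One way to express that proof obligation is a parity check that compares an accelerated result against the scalar one. The sketch below is illustrative only: `SummaryEnvelope` is a hypothetical stand-in for whatever envelope type the core actually exposes, and bit-exact comparison is an assumption about how strict the contract is. A tolerance-based comparison would be a different, weaker contract.

```rust
/// Hypothetical stand-in for a result envelope: labeled summary values.
#[derive(Debug, Clone, PartialEq)]
pub struct SummaryEnvelope {
    pub labels: Vec<String>,
    pub values: Vec<f64>,
}

/// Describes the first difference between the scalar and accelerated
/// envelopes, or returns `None` when they match exactly.
pub fn first_mismatch(
    scalar: &SummaryEnvelope,
    accelerated: &SummaryEnvelope,
) -> Option<String> {
    if scalar.labels != accelerated.labels {
        return Some("labels differ".to_string());
    }
    if scalar.values.len() != accelerated.values.len() {
        return Some(format!(
            "value count differs: {} vs {}",
            scalar.values.len(),
            accelerated.values.len()
        ));
    }
    // Compare raw bit patterns so that 0.0 vs -0.0 is reported and
    // identical NaN payloads are treated as equal.
    scalar
        .values
        .iter()
        .zip(&accelerated.values)
        .position(|(a, b)| a.to_bits() != b.to_bits())
        .map(|i| {
            format!(
                "value {} differs: {:?} vs {:?}",
                i, scalar.values[i], accelerated.values[i]
            )
        })
}
```

A check like this could run in the existing deterministic tests, with the scalar path as the reference.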

Non-goals

The following are explicitly not worth accelerator effort unless a later profiling track proves otherwise:

small scalar workloads with only a handful of strategies or samples
code paths dominated by validation, serialization, or reporting assembly
irregular branching logic such as dynamic study-design selection
methods whose main cost is regression fitting or host-side orchestration
anything that requires a new public result shape just to use the device
host-specific optimization tricks that cannot be reproduced on Linux CI

Those cases are either too small to benefit from offloading, or they are shaped by control flow rather than throughput. They should remain on the CPU until the profiling evidence says otherwise.

Practical guidance

When evaluating a Rust accelerator idea, ask these questions in order:

Is the workload already covered by a deterministic scalar contract?
Is the input shape large and regular enough to batch efficiently?
Can the same result envelope be produced without changing the contract?
Does the profiling artifact show a real gain in throughput or latency?
Does the memory profile still make sense after device transfer overhead?

If the answer to any of these questions is no, the work should stay in the scalar core or be pushed into a separate experimental track.
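
The ordered questions can be sketched as a small gate that reports the first failing check. All names here (`Evaluation`, `Blocker`, `first_blocker`) are hypothetical and exist only to illustrate the ordering. The booleans would be filled in by a reviewer from the profiling artifacts, not computed automatically.

```rust
/// Reasons to keep a workload on the scalar path, in question order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Blocker {
    NoScalarContract,
    ShapeNotBatchable,
    EnvelopeWouldChange,
    NoMeasuredGain,
    MemoryProfileRegresses,
}

/// Hypothetical record of the answers to the five questions.
pub struct Evaluation {
    pub has_scalar_contract: bool,
    pub shape_is_batchable: bool,
    pub envelope_unchanged: bool,
    pub measured_gain: bool,
    pub memory_profile_ok: bool,
}

impl Evaluation {
    /// Returns the first question answered "no", or `None` when every
    /// answer is "yes" and the idea can proceed to feasibility work.
    pub fn first_blocker(&self) -> Option<Blocker> {
        let checks = [
            (self.has_scalar_contract, Blocker::NoScalarContract),
            (self.shape_is_batchable, Blocker::ShapeNotBatchable),
            (self.envelope_unchanged, Blocker::EnvelopeWouldChange),
            (self.measured_gain, Blocker::NoMeasuredGain),
            (self.memory_profile_ok, Blocker::MemoryProfileRegresses),
        ];
        checks
            .iter()
            .find(|(ok, _)| !*ok)
            .map(|&(_, blocker)| blocker)
    }
}
```

Reporting only the first blocker mirrors the "in order" framing: there is little point profiling a workload that has no deterministic scalar contract to compare against.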