Rust Accelerators and Feasibility Boundaries
============================================

This note records the workloads that are plausibly worth accelerating in the
Rust core, and the cases where GPU, TPU, or custom-circuit work is not worth
the complexity.

The rule is simple: accelerate only the workloads that are compute-bound,
shape-stable, and already covered by a deterministic contract. Anything that
changes the contract shape or depends on irregular control flow should stay
on the scalar path.

What is plausibly GPU-bound
---------------------------

The Rust core has a few workload families that can plausibly benefit from GPU
execution if the implementation is reworked around large batches and dense
array operations:

* scalar summary kernels over many repeated PSA samples
* large matrix reductions for EVPI-style workloads
* frontier-style sweeps where the input surface is already dense and regular
* repeated dominance or CEAF scans over wide strategy sets
* batched memory/throughput profiling workloads that are dominated by the
  same arithmetic on every sample

These are the cases where the workload is mostly arithmetic, the inputs are
already numeric arrays, and the output can remain a stable summary envelope.
That makes them the only sensible candidates for accelerator feasibility
work in the short term.
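As an illustration of the "dense, regular, stable summary" shape, here is a minimal scalar sketch of an EVPI-style reduction. The `NetBenefit` struct, its row-major layout, and the `evpi` function are assumptions made for this example, not the Rust core's actual types. It uses the standard definition: the mean of the per-sample maximum net benefit minus the maximum of the per-strategy mean.

```rust
/// Net-benefit samples in a flat row-major buffer:
/// `n_samples` rows by `n_strategies` columns.
/// (Hypothetical layout, chosen for illustration only.)
pub struct NetBenefit<'a> {
    pub values: &'a [f64],
    pub n_samples: usize,
    pub n_strategies: usize,
}

/// EVPI = mean over samples of the per-sample maximum
///        minus the maximum over strategies of the per-strategy mean.
/// Returns `None` for empty or inconsistently shaped input.
pub fn evpi(nb: &NetBenefit) -> Option<f64> {
    if nb.n_samples == 0 || nb.n_strategies == 0 {
        return None;
    }
    if nb.values.len() != nb.n_samples * nb.n_strategies {
        return None;
    }

    let mut sum_of_row_max = 0.0;
    let mut column_sums = vec![0.0; nb.n_strategies];

    // One pass over the dense buffer: every sample runs the same arithmetic,
    // with no strategy-specific or sample-specific branching.
    for row in nb.values.chunks_exact(nb.n_strategies) {
        let mut row_max = f64::NEG_INFINITY;
        for (sum, &v) in column_sums.iter_mut().zip(row) {
            *sum += v;
            row_max = row_max.max(v);
        }
        sum_of_row_max += row_max;
    }

    let n = nb.n_samples as f64;
    let best_mean = column_sums
        .iter()
        .fold(f64::NEG_INFINITY, |acc, &s| acc.max(s / n));

    Some(sum_of_row_max / n - best_mean)
}
```

The inner loop is a row-wise maximum plus a column-wise sum, which is the kind of reduction that maps onto a batched device kernel. The output stays a single number regardless of which path computes it.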

Why TPU or custom-circuit support is only conditional
------------------------------------------------------

TPU and custom-circuit support are useful only when the workload is large and
regular enough to justify the cost of restructuring its control flow and of
deploying to the device.

That means they are conditional on all of the following:

* the workload is large enough to amortize device transfer and compilation
  overhead
* the kernel can be expressed as dense tensor math without branching over
  strategy-specific or sample-specific control flow
* the result contract can stay identical to the scalar CPU contract
* the implementation can tolerate a more rigid execution model than the CPU
  baseline
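The amortization condition can be made concrete with a small break-even calculation. This is a sketch under stated assumptions: the `OffloadEstimate` struct and its field names are hypothetical, and the timings are placeholders that would have to come from a real profiling run.

```rust
/// Per-run and one-time costs in seconds.
/// (Hypothetical inputs; real values must come from profiling.)
pub struct OffloadEstimate {
    pub cpu_seconds_per_run: f64,
    pub device_seconds_per_run: f64,
    pub transfer_seconds_per_run: f64,
    pub one_time_compile_seconds: f64,
}

impl OffloadEstimate {
    /// Smallest number of runs after which the total device cost
    /// (compile once, then transfer + compute per run) is strictly below
    /// the total CPU cost. Returns `None` when the per-run device cost,
    /// including transfer, is not lower than the CPU cost.
    pub fn break_even_runs(&self) -> Option<u64> {
        let saving_per_run = self.cpu_seconds_per_run
            - (self.device_seconds_per_run + self.transfer_seconds_per_run);

        // `!(x > 0.0)` also rejects NaN.
        if !(saving_per_run > 0.0) || !self.one_time_compile_seconds.is_finite() {
            return None;
        }

        // Need n * saving > compile, so n = floor(compile / saving) + 1.
        let runs = (self.one_time_compile_seconds / saving_per_run).floor() + 1.0;
        Some(runs.max(1.0) as u64)
    }
}
```

If the break-even count is larger than the number of runs the workload will realistically see, the first condition in the list fails and the remaining conditions do not need to be checked.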

In practice, that makes TPU or custom-circuit support a follow-on feasibility
question, not a default direction. The current Rust core is still a
contract-first library, so any accelerator path must prove that it produces
the same result envelopes and passes the same deterministic tests.
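One way to express that requirement is a parity check between the scalar path and a candidate batched path. The sketch below is illustrative only: `SummaryEnvelope`, `SummaryBackend`, `ScalarBackend`, `ChunkedBackend`, and `envelopes_agree` are hypothetical names, and `ChunkedBackend` is a CPU stand-in that merely reorders the summation the way a batched reduction might. The tolerance is also an assumption. If the contract requires bitwise-identical floats, the accelerated path would have to fix its reduction order instead.

```rust
/// Hypothetical result envelope; the real contract type is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct SummaryEnvelope {
    pub n_samples: usize,
    pub mean: f64,
    pub min: f64,
    pub max: f64,
}

/// Both paths implement the same trait, so the result shape cannot drift.
pub trait SummaryBackend {
    fn summarize(&self, samples: &[f64]) -> Option<SummaryEnvelope>;
}

/// Reference path: one sequential pass.
pub struct ScalarBackend;

impl SummaryBackend for ScalarBackend {
    fn summarize(&self, samples: &[f64]) -> Option<SummaryEnvelope> {
        if samples.is_empty() {
            return None;
        }
        let (mut sum, mut min, mut max) = (0.0, f64::INFINITY, f64::NEG_INFINITY);
        for &x in samples {
            sum += x;
            min = min.min(x);
            max = max.max(x);
        }
        let mean = sum / samples.len() as f64;
        Some(SummaryEnvelope { n_samples: samples.len(), mean, min, max })
    }
}

/// Stand-in for a device path: partial sums per fixed-size chunk.
pub struct ChunkedBackend {
    pub chunk: usize,
}

impl SummaryBackend for ChunkedBackend {
    fn summarize(&self, samples: &[f64]) -> Option<SummaryEnvelope> {
        if samples.is_empty() || self.chunk == 0 {
            return None;
        }
        let (mut sum, mut min, mut max) = (0.0, f64::INFINITY, f64::NEG_INFINITY);
        for block in samples.chunks(self.chunk) {
            let partial: f64 = block.iter().sum();
            sum += partial;
            for &x in block {
                min = min.min(x);
                max = max.max(x);
            }
        }
        let mean = sum / samples.len() as f64;
        Some(SummaryEnvelope { n_samples: samples.len(), mean, min, max })
    }
}

/// Same sample count, and every float field within `tol`.
pub fn envelopes_agree(a: &SummaryEnvelope, b: &SummaryEnvelope, tol: f64) -> bool {
    a.n_samples == b.n_samples
        && (a.mean - b.mean).abs() <= tol
        && (a.min - b.min).abs() <= tol
        && (a.max - b.max).abs() <= tol
}
```

A deterministic test would run both backends on the same fixed input and assert `envelopes_agree`, so a regression in the accelerated path fails CI before it can change a published result.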

Non-goals
---------

The following are explicitly not worth accelerator effort unless a later
profiling track proves otherwise:

* small scalar workloads with only a handful of strategies or samples
* code paths dominated by validation, serialization, or reporting assembly
* irregular branching logic such as dynamic study-design selection
* methods whose main cost is regression fitting or host-side orchestration
* anything that requires a new public result shape just to use the device
* host-specific optimization tricks that cannot be reproduced on Linux CI

Those cases are either too small to benefit from offloading, or they are
shaped by control flow rather than throughput. They should remain on the CPU
until the profiling evidence says otherwise.

Practical guidance
------------------

When evaluating a Rust accelerator idea, ask these questions in order:

1. Is the workload already covered by a deterministic scalar contract?
2. Is the input shape large and regular enough to batch efficiently?
3. Can the same result envelope be produced without changing the contract?
4. Does the profiling artifact show a measurable gain in throughput or latency?
5. Does the memory profile still make sense after device transfer overhead?

If the answer to any of those is no, the work should stay in the scalar core
or be pushed into a separate experimental track.
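The ordered checklist can be encoded so that a review records which question failed first. This is a minimal sketch; the `Question`, `Answers`, and `Verdict` types and the `evaluate` function are hypothetical names introduced for illustration, not part of the Rust core.

```rust
/// The five questions, in the order they should be asked.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Question {
    DeterministicScalarContract,
    LargeRegularInput,
    SameResultEnvelope,
    ProfiledGain,
    MemoryProfileHolds,
}

/// Reviewer-supplied answers (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct Answers {
    pub deterministic_scalar_contract: bool,
    pub large_regular_input: bool,
    pub same_result_envelope: bool,
    pub profiled_gain: bool,
    pub memory_profile_holds: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Every question was answered yes.
    Proceed,
    /// Keep the work in the scalar core or move it to an experimental track.
    StayScalarOrExperimental { failed: Question },
}

/// Walks the questions in order and stops at the first "no".
pub fn evaluate(a: &Answers) -> Verdict {
    let ordered = [
        (Question::DeterministicScalarContract, a.deterministic_scalar_contract),
        (Question::LargeRegularInput, a.large_regular_input),
        (Question::SameResultEnvelope, a.same_result_envelope),
        (Question::ProfiledGain, a.profiled_gain),
        (Question::MemoryProfileHolds, a.memory_profile_holds),
    ];
    for (question, ok) in ordered {
        if !ok {
            return Verdict::StayScalarOrExperimental { failed: question };
        }
    }
    Verdict::Proceed
}
```

Stopping at the first failure mirrors the ordering of the questions: there is little point in profiling a workload that has no deterministic scalar contract to compare against.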