This note inventories the upstream py-earth feature surface for the parity
audit. It stays intentionally narrow: only the upstream py-earth README and
documentation are used here, so the matrix can serve as a source-backed
baseline for later gap comparison.
py-earth| Area | Upstream py-earth capability | Evidence | Audit note |
|---|---|---|---|
| Core model family | py-earth is a Python/Cython implementation of Friedman’s mars algorithm in the style of scikit-learn. The docs describe it as a flexible regression method that searches for interactions and non-linear relationships using a multivariate truncated power spline basis. |
README: lines 281-281. Docs: lines 19-20, 22-34, 81-87. | This is the baseline behavior to preserve: scikit-learn-style estimator surface with mars basis expansion and interaction search. |
| Basis-term support | The documented basis terms are constant terms, linear functions of input variables, and hinge functions of input variables. The model terms are described as products of constant, linear, and hinge functions. | Docs: lines 27-34, 172-173. | This is the upstream basis vocabulary used throughout the docs and examples. |
| Training / pruning / selection | Training is explicitly two-stage: a forward pass searches for terms that locally minimize squared error, then a pruning pass chooses a subset with locally minimal gcv. The forward pass and pruning pass each expose record objects. |
Docs: lines 33-34, 85-86, 178-179, 364-369. | The audit should treat forward-pass and pruning behavior as parity-critical because they define the learned model, not just its output format. |
| Training controls | The public Earth API exposes max_terms, max_degree, allow_missing, penalty, endspan_alpha, endspan, minspan_alpha, minspan, thresh, zero_tol, min_search_points, check_every, allow_linear, use_fast, fast_K, fast_h, smooth, enable_pruning, feature_importance_type, and verbose. |
Docs: lines 81-87, 91-163. | These parameters define the documented option surface and should be compared directly against the current repo’s defaults and validation behavior. |
| missingness handling | missingness is a first-class documented feature via allow_missing=True. The docs also describe a missing argument accepted by fit/predict/score-style methods, with missingness inferred from a pandas DataFrame when allow_missing is enabled. |
Docs: lines 100-101, 300-313, 326-337, 447-448. | missingness support is explicitly documented and should be audited as a supported user-facing behavior, not an edge-case implementation detail. |
| Categorical handling | The docs cite Friedman’s technical report on mixed ordinal and categorical variables, and the README lists “better support for categorical predictors” as a requested future improvement. The documented public API does not expose a categorical-specific parameter. | README: lines 287-294, 331-341. Docs: line 81, line 73. | The safest reading is that categorical behavior is acknowledged in the lineage, but the documented API surface does not present a dedicated categorical mode. Treat this as a likely parity gap unless later source inspection proves otherwise. |
| Diagnostics and summaries | The documented diagnostics include summary, summary_feature_importances, trace, pruning_trace, score, score_samples, predict_deriv, and transform. Feature-importance criteria include gcv, rss, and nb_subsets. |
Docs: lines 160-163, 172-180, 423-453. | Diagnostics are part of the documented public surface and should be audited as part of user-visible parity, not as internal-only helpers. |
| Formula / interface ergonomics | The documented public API is estimator-centric rather than formula-centric. The docs emphasize scikit-learn compatibility, array-like inputs, pandas support, and patsy-style compatibility where relevant, but do not present a dedicated formula interface as the primary surface. | Docs: lines 19-20, 87-87. | The upstream ergonomics are centered on the estimator API, which should remain the baseline for parity comparisons. A formula-based interface would be an extension, not a documented expectation. |
| Package semantics | py-earth is described as scikit-learn-compatible, accepts numpy/pandas/patsy inputs, supports pickling, and is documented with Sphinx. The README also describes a source install flow via setup.py, and the GitHub repository is archived/read-only. |
Docs: lines 19-20, 87-87. README: lines 150-152, 297-301. Docs landing page: lines 31-33. | The package semantics to compare against are old-style source distribution + docs + pickle support, not a modern multi-registry release flow. |
py-earth as a regression method with
a scikit-learn estimator interface.Later parity-audit phases should use this matrix as the baseline for: