catstat 0.2.0__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (125)
  1. {catstat-0.2.0 → catstat-0.3.0}/CHANGELOG.md +32 -0
  2. {catstat-0.2.0 → catstat-0.3.0}/PKG-INFO +1 -1
  3. catstat-0.3.0/benchmarks/results/2026-06-26-T4-gpu-parity.jsonl +12 -0
  4. catstat-0.3.0/benchmarks/results/2026-06-27-T4-gpu-parity.jsonl +16 -0
  5. catstat-0.3.0/benchmarks/results/2026-06-27-transform-gather.jsonl +154 -0
  6. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/results/ledger.jsonl +7 -0
  7. {catstat-0.2.0 → catstat-0.3.0}/docs/experiment_log.md +117 -0
  8. {catstat-0.2.0 → catstat-0.3.0}/docs/known_issues.md +4 -3
  9. catstat-0.3.0/docs/notes/2026-06-27-cuml-vs-sklearn-te-levers.md +79 -0
  10. {catstat-0.2.0 → catstat-0.3.0}/docs/roadmap.md +45 -13
  11. catstat-0.3.0/docs/verdicts/2026-06-26-gpu-crossover-postPRB-verdict.md +66 -0
  12. catstat-0.3.0/docs/verdicts/2026-06-26-gpu-parity-report.md +22 -0
  13. catstat-0.3.0/docs/verdicts/2026-06-26-pr-b-complement-subtraction-mean-verdict.md +61 -0
  14. catstat-0.3.0/docs/verdicts/2026-06-27-gpu-parity-report.md +26 -0
  15. catstat-0.3.0/docs/verdicts/2026-06-27-integer-joint-codes-verdict.md +69 -0
  16. catstat-0.3.0/docs/verdicts/2026-06-27-pr-c-additive-var-std-verdict.md +67 -0
  17. catstat-0.3.0/docs/verdicts/2026-06-27-transform-gather-verdict.md +71 -0
  18. {catstat-0.2.0 → catstat-0.3.0}/pyproject.toml +1 -1
  19. {catstat-0.2.0 → catstat-0.3.0}/scripts/colab_gpu_parity.py +22 -1
  20. {catstat-0.2.0 → catstat-0.3.0}/scripts/colab_gpu_parity.sh +3 -1
  21. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/__init__.py +1 -1
  22. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_base.py +237 -38
  23. catstat-0.3.0/src/catstat/_cross_fit.py +352 -0
  24. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/backends/_gpu.py +13 -5
  25. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/target_encoder.py +8 -0
  26. catstat-0.3.0/tests/test_additive_fast_path.py +98 -0
  27. {catstat-0.2.0 → catstat-0.3.0}/tests/test_cpu_gpu_parity.py +65 -0
  28. {catstat-0.2.0 → catstat-0.3.0}/tests/test_cross_fit_no_leakage.py +34 -1
  29. catstat-0.3.0/tests/test_interactions.py +103 -0
  30. catstat-0.3.0/tests/test_joint_codes.py +86 -0
  31. catstat-0.3.0/tests/test_multi_feature.py +118 -0
  32. {catstat-0.2.0 → catstat-0.3.0}/tests/test_numeric_encoding.py +3 -1
  33. catstat-0.3.0/tests/test_transform_gather.py +84 -0
  34. catstat-0.2.0/benchmarks/results/2026-06-26-T4-gpu-parity.jsonl +0 -10
  35. catstat-0.2.0/docs/verdicts/2026-06-26-gpu-parity-report.md +0 -20
  36. catstat-0.2.0/src/catstat/_cross_fit.py +0 -83
  37. catstat-0.2.0/tests/test_multi_feature.py +0 -51
  38. {catstat-0.2.0 → catstat-0.3.0}/.claude/skills/benchmark-harness/SKILL.md +0 -0
  39. {catstat-0.2.0 → catstat-0.3.0}/.claude/skills/leakage-audit/SKILL.md +0 -0
  40. {catstat-0.2.0 → catstat-0.3.0}/.claude/skills/release-prep/SKILL.md +0 -0
  41. {catstat-0.2.0 → catstat-0.3.0}/.claude/skills/sklearn-compat/SKILL.md +0 -0
  42. {catstat-0.2.0 → catstat-0.3.0}/.github/ISSUE_TEMPLATE/bug_report.md +0 -0
  43. {catstat-0.2.0 → catstat-0.3.0}/.github/ISSUE_TEMPLATE/config.yml +0 -0
  44. {catstat-0.2.0 → catstat-0.3.0}/.github/ISSUE_TEMPLATE/feature_request.md +0 -0
  45. {catstat-0.2.0 → catstat-0.3.0}/.github/PULL_REQUEST_TEMPLATE.md +0 -0
  46. {catstat-0.2.0 → catstat-0.3.0}/.github/workflows/ci.yml +0 -0
  47. {catstat-0.2.0 → catstat-0.3.0}/.github/workflows/docs.yml +0 -0
  48. {catstat-0.2.0 → catstat-0.3.0}/.github/workflows/release.yml +0 -0
  49. {catstat-0.2.0 → catstat-0.3.0}/.gitignore +0 -0
  50. {catstat-0.2.0 → catstat-0.3.0}/CLAUDE.md +0 -0
  51. {catstat-0.2.0 → catstat-0.3.0}/CONTRIBUTING.md +0 -0
  52. {catstat-0.2.0 → catstat-0.3.0}/LICENSE +0 -0
  53. {catstat-0.2.0 → catstat-0.3.0}/README.md +0 -0
  54. {catstat-0.2.0 → catstat-0.3.0}/SECURITY.md +0 -0
  55. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/README.md +0 -0
  56. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/__init__.py +0 -0
  57. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/compare_results.py +0 -0
  58. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/datasets.py +0 -0
  59. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/eval_numeric.py +0 -0
  60. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/ledger.py +0 -0
  61. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/results/2026-06-26-numeric-te-eval.json +0 -0
  62. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/results/baseline-cpu.json +0 -0
  63. {catstat-0.2.0 → catstat-0.3.0}/benchmarks/run_benchmarks.py +0 -0
  64. {catstat-0.2.0 → catstat-0.3.0}/docs/next-session-prompt.md +0 -0
  65. {catstat-0.2.0 → catstat-0.3.0}/docs/notes/2026-06-26-numeric-te-prior-art.md +0 -0
  66. {catstat-0.2.0 → catstat-0.3.0}/docs/proposals/claude-md-proposal.md +0 -0
  67. {catstat-0.2.0 → catstat-0.3.0}/docs/proposals/evaluation-harness-design.md +0 -0
  68. {catstat-0.2.0 → catstat-0.3.0}/docs/proposals/self-improvement-loop-design.md +0 -0
  69. {catstat-0.2.0 → catstat-0.3.0}/docs/proposals/skills-proposal.md +0 -0
  70. {catstat-0.2.0 → catstat-0.3.0}/docs/proposals/target-encoder-library-design.md +0 -0
  71. {catstat-0.2.0 → catstat-0.3.0}/docs/publishing_checklist.md +0 -0
  72. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/.gitkeep +0 -0
  73. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-api-docs-verdict.md +0 -0
  74. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-check-estimator-subset-verdict.md +0 -0
  75. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-ci-pytest-pythonpath-verdict.md +0 -0
  76. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-gpu-crossover-verdict.md +0 -0
  77. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-gpu-parity-verdict.md +0 -0
  78. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-m0-bootstrap-verdict.md +0 -0
  79. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-numeric-te-verdict.md +0 -0
  80. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-pandas3-string-dtype-verdict.md +0 -0
  81. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-phase2-stats-gpu-verdict.md +0 -0
  82. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-phase3a-skew-custom-verdict.md +0 -0
  83. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-phase3b-loo-ordered-verdict.md +0 -0
  84. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-project-hygiene-verdict.md +0 -0
  85. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-readme-polish-verdict.md +0 -0
  86. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-release-0.1.0-verdict.md +0 -0
  87. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-release-automation-verdict.md +0 -0
  88. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/2026-06-26-sklearn-tags-verdict.md +0 -0
  89. {catstat-0.2.0 → catstat-0.3.0}/docs/verdicts/TEMPLATE-verdict.md +0 -0
  90. {catstat-0.2.0 → catstat-0.3.0}/examples/binary_classification_basic.py +0 -0
  91. {catstat-0.2.0 → catstat-0.3.0}/examples/count_frequency_basic.py +0 -0
  92. {catstat-0.2.0 → catstat-0.3.0}/examples/multiclass_classification_basic.py +0 -0
  93. {catstat-0.2.0 → catstat-0.3.0}/examples/numeric_target_encoding.py +0 -0
  94. {catstat-0.2.0 → catstat-0.3.0}/examples/regression_basic.py +0 -0
  95. {catstat-0.2.0 → catstat-0.3.0}/scripts/build_docs.sh +0 -0
  96. {catstat-0.2.0 → catstat-0.3.0}/scripts/check.sh +0 -0
  97. {catstat-0.2.0 → catstat-0.3.0}/scripts/summarize_benchmark_results.py +0 -0
  98. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_aggregations.py +0 -0
  99. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_feature_names.py +0 -0
  100. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_numeric.py +0 -0
  101. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_smoothing.py +0 -0
  102. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_stats.py +0 -0
  103. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/_validation.py +0 -0
  104. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/backends/__init__.py +0 -0
  105. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/backends/_cpu.py +0 -0
  106. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/backends/_dispatch.py +0 -0
  107. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/count_encoder.py +0 -0
  108. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/frequency_encoder.py +0 -0
  109. {catstat-0.2.0 → catstat-0.3.0}/src/catstat/py.typed +0 -0
  110. {catstat-0.2.0 → catstat-0.3.0}/tests/conftest.py +0 -0
  111. {catstat-0.2.0 → catstat-0.3.0}/tests/test_backend.py +0 -0
  112. {catstat-0.2.0 → catstat-0.3.0}/tests/test_check_estimator.py +0 -0
  113. {catstat-0.2.0 → catstat-0.3.0}/tests/test_count_frequency.py +0 -0
  114. {catstat-0.2.0 → catstat-0.3.0}/tests/test_determinism.py +0 -0
  115. {catstat-0.2.0 → catstat-0.3.0}/tests/test_feature_names.py +0 -0
  116. {catstat-0.2.0 → catstat-0.3.0}/tests/test_io_types.py +0 -0
  117. {catstat-0.2.0 → catstat-0.3.0}/tests/test_phase3.py +0 -0
  118. {catstat-0.2.0 → catstat-0.3.0}/tests/test_polars.py +0 -0
  119. {catstat-0.2.0 → catstat-0.3.0}/tests/test_scheme.py +0 -0
  120. {catstat-0.2.0 → catstat-0.3.0}/tests/test_sklearn_compat.py +0 -0
  121. {catstat-0.2.0 → catstat-0.3.0}/tests/test_stats.py +0 -0
  122. {catstat-0.2.0 → catstat-0.3.0}/tests/test_target_encoder_binary.py +0 -0
  123. {catstat-0.2.0 → catstat-0.3.0}/tests/test_target_encoder_multiclass.py +0 -0
  124. {catstat-0.2.0 → catstat-0.3.0}/tests/test_target_encoder_regression.py +0 -0
  125. {catstat-0.2.0 → catstat-0.3.0}/tests/test_unknown_missing.py +0 -0
@@ -3,6 +3,38 @@
3
3
  All notable changes to `catstat` are documented here. Format follows
4
4
  [Keep a Changelog](https://keepachangelog.com/); versioning is [SemVer](https://semver.org/).
5
5
 
6
+ ## [0.3.0] — 2026-06-27
7
+
8
+ ### Added
9
+ - **Explicit interaction groups** via a new `interactions` parameter on `TargetEncoder`:
10
+ `interactions=[["a", "b"], ...]` adds **one joint target-encoded column per group** (additive to
11
+ the per-column `cols` encodings), generalizing `multi_feature_mode="combination"` (which encodes a
12
+ single joint column only). Out-of-fold cross-fitting, feature naming (`a+b__te_*`), and the
13
+ unknown/missing fallback all reuse the existing encoding-unit machinery.
14
+ - **GPU `backend="gpu"` now supports `combination` / `interactions`** (joint units). They key on
15
+ int64 mixed-radix joint codes (built on the host, so identical on both backends) which the cuDF
16
+ group-by consumes directly; CPU/GPU `allclose` validated on a Colab T4 (mean/var, missing
17
+ component, interactions). Previously these were forced to the CPU.
18
+
19
+ ### Performance
20
+ All performance work below is **output-identical** (allclose; leakage-audited) — no behavior or API
21
+ change. The committed benchmark baseline is unchanged (it predates this arc; see the verdicts).
22
+ - **Single-pass out-of-fold encoding** via complement subtraction for `mean` and `var`/`std`,
23
+ replacing the per-fold group-by loop (a shared per-`(fold, key)` moment builder; a hybrid gate
24
+ keeps `median`/`min`/`max`/`skew`/custom on the per-fold path). ~2.2–3.4× (mean), ~2.7–2.8×
25
+ (var/std).
26
+ - **Factorize-once integer-code transform gather**: `transform` now hashes each unit's keys once
27
+ (`index.get_indexer`) and gathers each `(stat, class)` column from a contiguous `float64` array,
28
+ replacing the per-column `pd.Series.map`. transform ~2.3–3.4× (multi-stat / high-cardinality).
29
+ - **Integer mixed-radix joint codes** for `combination` / `interactions` units, replacing the
30
+ per-row Python tuple build: combination transform ~3.7–4.4× / fit_transform ~1.5–2.4× at 1M rows.
31
+ (Closes KI-019.)
32
+
33
+ ### Changed
34
+ - **GPU crossover re-measured** on a Colab T4 after the single-pass kernel: the host-orchestrated GPU
35
+ path reaches only ~parity at ≥5M rows (≈0.9× at 1M, ≈1.2× at 5M), so `backend="auto"` continues to
36
+ resolve to **CPU**; explicit `backend="gpu"` stays available and parity-validated. (KI-020.)
37
+
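The "integer mixed-radix joint codes" named in the 0.3.0 entry can be illustrated with a minimal sketch. This is not the code shipped in `src/catstat/_base.py`: the helper name `joint_codes` is hypothetical, and it factorizes each column in place, whereas the changelog describes per-component maps that are learned once and reused. It only shows the idea of folding per-column integer codes into one int64 key, with -1 as the sentinel for a missing component and an overflow guard.

```python
import numpy as np
import pandas as pd


def joint_codes(df, cols):
    """Hypothetical helper: one int64 mixed-radix code per row, -1 if any component is missing."""
    codes = np.zeros(len(df), dtype=np.int64)
    missing = np.zeros(len(df), dtype=bool)
    total = 1  # Python int, so the running product itself cannot overflow
    for col in cols:
        c, uniques = pd.factorize(df[col])  # missing values receive code -1
        n = max(len(uniques), 1)
        total *= n
        if total > np.iinfo(np.int64).max:
            # The changelog describes falling back to a tuple build here;
            # this sketch simply refuses.
            raise OverflowError("joint cardinality exceeds int64")
        missing |= c < 0
        # Horner-style accumulation: ((c0 * n1 + c1) * n2 + c2) ...
        codes = codes * n + np.where(c < 0, 0, c)
    codes[missing] = -1
    return codes


demo = pd.DataFrame({"a": ["x", "x", "y", "y", None],
                     "b": ["p", "q", "p", "q", "p"]})
demo_codes = joint_codes(demo, ["a", "b"])
```

Because the code is a relabeling of the same row grouping, a group-by on `demo_codes` yields the same groups as a group-by on the `(a, b)` tuples, without building a Python tuple per row.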
6
38
  ## [0.2.0] — 2026-06-26
7
39
 
8
40
  ### Added
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: catstat
3
- Version: 0.2.0
3
+ Version: 0.3.0
4
4
  Summary: Unified CPU/GPU statistical categorical encoding: leakage-safe target encoding generalized to arbitrary statistics, with one sklearn-compatible API.
5
5
  Project-URL: Homepage, https://github.com/Matapanino/catstat
6
6
  Project-URL: Repository, https://github.com/Matapanino/catstat
@@ -0,0 +1,12 @@
1
+ {"kind": "parity", "case": "regression_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 3.3306690738754696e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.156, "gpu_ft_s": 0.2071, "status": "ok"}
2
+ {"kind": "parity", "case": "regression_var", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.3322676295501878e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 1.5543122344752192e-15, "fit_transform_allclose": true, "cpu_ft_s": 0.4408, "gpu_ft_s": 0.5033, "status": "ok"}
3
+ {"kind": "parity", "case": "binary_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1786, "gpu_ft_s": 0.2275, "status": "ok"}
4
+ {"kind": "parity", "case": "multiclass_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.4423, "gpu_ft_s": 0.6541, "status": "ok"}
5
+ {"kind": "parity", "case": "regression_mean_missing", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 2.220446049250313e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1691, "gpu_ft_s": 0.2211, "status": "ok"}
6
+ {"kind": "parity", "case": "numeric_auto", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1275702593849246e-17, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.6253, "gpu_ft_s": 0.6774, "status": "ok"}
7
+ {"kind": "parity", "case": "numeric_bin", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 9.540979117872439e-18, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.6356, "gpu_ft_s": 0.6577, "status": "ok"}
8
+ {"kind": "crossover", "n": 10000, "cardinality": 250, "cpu_ft_s": 0.01, "gpu_ft_s": 0.049, "speedup": 0.2}
9
+ {"kind": "crossover", "n": 100000, "cardinality": 2500, "cpu_ft_s": 0.0668, "gpu_ft_s": 0.1083, "speedup": 0.62}
10
+ {"kind": "crossover", "n": 1000000, "cardinality": 25000, "cpu_ft_s": 0.8721, "gpu_ft_s": 1.3093, "speedup": 0.67}
11
+ {"kind": "crossover", "n": 5000000, "cardinality": 125000, "cpu_ft_s": 5.4259, "gpu_ft_s": 4.8961, "speedup": 1.11}
12
+ {"kind": "crossover", "n": 10000000, "cardinality": 250000, "cpu_ft_s": 12.4614, "gpu_ft_s": 11.7917, "speedup": 1.06}
@@ -0,0 +1,16 @@
1
+ {"kind": "parity", "case": "regression_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 3.3306690738754696e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1915, "gpu_ft_s": 0.1998, "status": "ok"}
2
+ {"kind": "parity", "case": "regression_var", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.3322676295501878e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1527, "gpu_ft_s": 0.1719, "status": "ok"}
3
+ {"kind": "parity", "case": "binary_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.164, "gpu_ft_s": 0.2105, "status": "ok"}
4
+ {"kind": "parity", "case": "multiclass_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.309, "gpu_ft_s": 0.513, "status": "ok"}
5
+ {"kind": "parity", "case": "regression_mean_missing", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 2.220446049250313e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1615, "gpu_ft_s": 0.2015, "status": "ok"}
6
+ {"kind": "parity", "case": "numeric_auto", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.0408340855860843e-17, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.8187, "gpu_ft_s": 0.6377, "status": "ok"}
7
+ {"kind": "parity", "case": "numeric_bin", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 8.673617379884035e-18, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.6086, "gpu_ft_s": 0.655, "status": "ok"}
8
+ {"kind": "parity", "case": "combination_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1102230246251565e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.3679, "gpu_ft_s": 0.5603, "status": "ok"}
9
+ {"kind": "parity", "case": "combination_var", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 3.774758283725532e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.3331, "gpu_ft_s": 0.3376, "status": "ok"}
10
+ {"kind": "parity", "case": "combination_mean_missing", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1102230246251565e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.368, "gpu_ft_s": 0.3872, "status": "ok"}
11
+ {"kind": "parity", "case": "interactions_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1102230246251565e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.577, "gpu_ft_s": 0.8054, "status": "ok"}
12
+ {"kind": "crossover", "n": 10000, "cardinality": 250, "cpu_ft_s": 0.0148, "gpu_ft_s": 0.0567, "speedup": 0.26}
13
+ {"kind": "crossover", "n": 100000, "cardinality": 2500, "cpu_ft_s": 0.1045, "gpu_ft_s": 0.173, "speedup": 0.6}
14
+ {"kind": "crossover", "n": 1000000, "cardinality": 25000, "cpu_ft_s": 0.8383, "gpu_ft_s": 0.9039, "speedup": 0.93}
15
+ {"kind": "crossover", "n": 5000000, "cardinality": 125000, "cpu_ft_s": 5.7528, "gpu_ft_s": 4.7329, "speedup": 1.22}
16
+ {"kind": "crossover", "n": 10000000, "cardinality": 250000, "cpu_ft_s": 11.2374, "gpu_ft_s": 10.5415, "speedup": 1.07}
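The `speedup` field in the crossover records above appears to be `cpu_ft_s / gpu_ft_s` rounded to two decimals; that reading is an inference from the numbers, not something the file states. The short check below recomputes it from the five crossover rows of this file and lists the sizes at which the GPU path is faster.

```python
import json

# The five crossover records from the 2026-06-27 file, copied verbatim.
lines = [
    '{"kind": "crossover", "n": 10000, "cardinality": 250, "cpu_ft_s": 0.0148, "gpu_ft_s": 0.0567, "speedup": 0.26}',
    '{"kind": "crossover", "n": 100000, "cardinality": 2500, "cpu_ft_s": 0.1045, "gpu_ft_s": 0.173, "speedup": 0.6}',
    '{"kind": "crossover", "n": 1000000, "cardinality": 25000, "cpu_ft_s": 0.8383, "gpu_ft_s": 0.9039, "speedup": 0.93}',
    '{"kind": "crossover", "n": 5000000, "cardinality": 125000, "cpu_ft_s": 5.7528, "gpu_ft_s": 4.7329, "speedup": 1.22}',
    '{"kind": "crossover", "n": 10000000, "cardinality": 250000, "cpu_ft_s": 11.2374, "gpu_ft_s": 10.5415, "speedup": 1.07}',
]
rows = [json.loads(line) for line in lines]

# Assumed definition: CPU fit_transform time divided by GPU fit_transform time.
recomputed = [round(r["cpu_ft_s"] / r["gpu_ft_s"], 2) for r in rows]
recorded = [r["speedup"] for r in rows]

# Row counts at which the GPU path beats the CPU path.
gpu_faster_at = [r["n"] for r in rows if r["speedup"] > 1.0]
```

Under that reading, the GPU path only pulls ahead from 5M rows upward, and by a small margin, which matches the changelog's statement that `backend="auto"` keeps resolving to CPU.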
@@ -0,0 +1,154 @@
1
+ {
2
+ "cases": {
3
+ "binary": {
4
+ "cardinality": 50,
5
+ "case": "binary",
6
+ "fit_s": {
7
+ "median": 0.013814000005368143,
8
+ "spread": 0.0011099584990006406
9
+ },
10
+ "fit_transform_s": {
11
+ "median": 0.06258904200512916,
12
+ "spread": 0.01324962500075344
13
+ },
14
+ "n": 100000,
15
+ "n_out_cols": 1,
16
+ "pos_rate": 0.3,
17
+ "quality": {},
18
+ "transform_s": {
19
+ "median": 0.007551957998657599,
20
+ "spread": 0.0006878125022922177
21
+ }
22
+ },
23
+ "combination": {
24
+ "case": "multi_column",
25
+ "fit_s": {
26
+ "median": 0.15021979200537317,
27
+ "spread": 0.03921062499648542
28
+ },
29
+ "fit_transform_s": {
30
+ "median": 0.40024404100404354,
31
+ "spread": 0.046770541499427054
32
+ },
33
+ "n": 100000,
34
+ "n_cols": 4,
35
+ "n_out_cols": 1,
36
+ "quality": {},
37
+ "transform_s": {
38
+ "median": 0.13422654200257966,
39
+ "spread": 0.02950145799695747
40
+ }
41
+ },
42
+ "count": {
43
+ "cardinality": 5000,
44
+ "case": "high_cardinality",
45
+ "fit_s": {
46
+ "median": 0.009142125003563706,
47
+ "spread": 0.0004794999986188486
48
+ },
49
+ "fit_transform_s": {
50
+ "median": 0.018366208001680207,
51
+ "spread": 0.0004029789997730404
52
+ },
53
+ "n": 100000,
54
+ "n_out_cols": 1,
55
+ "quality": {},
56
+ "transform_s": {
57
+ "median": 0.008392792005906813,
58
+ "spread": 0.0006406665015674662
59
+ }
60
+ },
61
+ "high_cardinality": {
62
+ "cardinality": 5000,
63
+ "case": "high_cardinality",
64
+ "fit_s": {
65
+ "median": 0.02094879200012656,
66
+ "spread": 0.004071979503351031
67
+ },
68
+ "fit_transform_s": {
69
+ "median": 0.04681791699840687,
70
+ "spread": 0.009238896000169916
71
+ },
72
+ "n": 100000,
73
+ "n_out_cols": 1,
74
+ "quality": {},
75
+ "transform_s": {
76
+ "median": 0.010617084000841714,
77
+ "spread": 0.003431875000387663
78
+ }
79
+ },
80
+ "multiclass": {
81
+ "cardinality": 50,
82
+ "case": "multiclass",
83
+ "classes": 5,
84
+ "fit_s": {
85
+ "median": 0.048849416998564266,
86
+ "spread": 0.008056833499722416
87
+ },
88
+ "fit_transform_s": {
89
+ "median": 0.12387624999973923,
90
+ "spread": 0.0140198955014057
91
+ },
92
+ "n": 100000,
93
+ "n_out_cols": 5,
94
+ "quality": {},
95
+ "transform_s": {
96
+ "median": 0.009531249997962732,
97
+ "spread": 0.00039762500091455877
98
+ }
99
+ },
100
+ "regression": {
101
+ "cardinality": 50,
102
+ "case": "regression",
103
+ "fit_s": {
104
+ "median": 0.011481834000733215,
105
+ "spread": 0.00198229200032074
106
+ },
107
+ "fit_transform_s": {
108
+ "median": 0.05065037500025937,
109
+ "spread": 0.009240791499905754
110
+ },
111
+ "n": 100000,
112
+ "n_out_cols": 1,
113
+ "quality": {
114
+ "oof_rmse": 0.5006930887474031
115
+ },
116
+ "transform_s": {
117
+ "median": 0.011224832996958867,
118
+ "spread": 0.0011437710018071812
119
+ }
120
+ },
121
+ "regression_std": {
122
+ "cardinality": 50,
123
+ "case": "regression",
124
+ "fit_s": {
125
+ "median": 0.018198292003944516,
126
+ "spread": 0.005130541496328078
127
+ },
128
+ "fit_transform_s": {
129
+ "median": 0.045569374997285195,
130
+ "spread": 0.0030776045023230836
131
+ },
132
+ "n": 100000,
133
+ "n_out_cols": 1,
134
+ "quality": {},
135
+ "transform_s": {
136
+ "median": 0.008765417005633935,
137
+ "spread": 0.0018198750003648456
138
+ }
139
+ }
140
+ },
141
+ "meta": {
142
+ "backend": "cpu",
143
+ "git_sha": "689a268",
144
+ "ts": "2026-06-26T16:36:02+00:00",
145
+ "versions": {
146
+ "catstat": "0.2.0",
147
+ "numpy": "1.23.5",
148
+ "pandas": "1.5.2",
149
+ "python": "3.11.1",
150
+ "sklearn": "1.2.0"
151
+ }
152
+ },
153
+ "size": "medium"
154
+ }
@@ -22,3 +22,10 @@
22
22
  {"ts": "2026-06-26T03:26:17+00:00", "git_sha": "6a75054", "backend": "cpu", "versions": {"catstat": "0.0.1", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "regression_std", "fit_s": {"median": 0.002077291952446103, "spread": 0.00018660450587049127}, "transform_s": {"median": 0.0008032920304685831, "spread": 4.956248449161649e-05}, "fit_transform_s": {"median": 0.018527208012528718, "spread": 0.0026257289573550224}, "n_out_cols": 1, "quality": {}, "case": "regression", "n": 10000, "cardinality": 50}
23
23
  {"ts": "2026-06-26T03:26:17+00:00", "git_sha": "6a75054", "backend": "cpu", "versions": {"catstat": "0.0.1", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "combination", "fit_s": {"median": 0.01387550006620586, "spread": 0.002983853977639228}, "transform_s": {"median": 0.009892250061966479, "spread": 0.0016778334975242615}, "fit_transform_s": {"median": 0.09488679200876504, "spread": 0.005979500012472272}, "n_out_cols": 1, "quality": {}, "case": "multi_column", "n": 10000, "n_cols": 4}
24
24
  {"ts": "2026-06-26T03:26:17+00:00", "git_sha": "6a75054", "backend": "cpu", "versions": {"catstat": "0.0.1", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "count", "fit_s": {"median": 0.0011707909870892763, "spread": 0.0014458755031228065}, "transform_s": {"median": 0.00082016596570611, "spread": 5.008344305679202e-05}, "fit_transform_s": {"median": 0.0021022919099777937, "spread": 0.00046245806152001023}, "n_out_cols": 1, "quality": {}, "case": "high_cardinality", "n": 10000, "cardinality": 500}
25
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "regression", "fit_s": {"median": 0.011481834000733215, "spread": 0.00198229200032074}, "transform_s": {"median": 0.011224832996958867, "spread": 0.0011437710018071812}, "fit_transform_s": {"median": 0.05065037500025937, "spread": 0.009240791499905754}, "n_out_cols": 1, "quality": {"oof_rmse": 0.5006930887474031}, "case": "regression", "n": 100000, "cardinality": 50}
26
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "binary", "fit_s": {"median": 0.013814000005368143, "spread": 0.0011099584990006406}, "transform_s": {"median": 0.007551957998657599, "spread": 0.0006878125022922177}, "fit_transform_s": {"median": 0.06258904200512916, "spread": 0.01324962500075344}, "n_out_cols": 1, "quality": {}, "case": "binary", "n": 100000, "cardinality": 50, "pos_rate": 0.3}
27
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "multiclass", "fit_s": {"median": 0.048849416998564266, "spread": 0.008056833499722416}, "transform_s": {"median": 0.009531249997962732, "spread": 0.00039762500091455877}, "fit_transform_s": {"median": 0.12387624999973923, "spread": 0.0140198955014057}, "n_out_cols": 5, "quality": {}, "case": "multiclass", "n": 100000, "cardinality": 50, "classes": 5}
28
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "high_cardinality", "fit_s": {"median": 0.02094879200012656, "spread": 0.004071979503351031}, "transform_s": {"median": 0.010617084000841714, "spread": 0.003431875000387663}, "fit_transform_s": {"median": 0.04681791699840687, "spread": 0.009238896000169916}, "n_out_cols": 1, "quality": {}, "case": "high_cardinality", "n": 100000, "cardinality": 5000}
29
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "regression_std", "fit_s": {"median": 0.018198292003944516, "spread": 0.005130541496328078}, "transform_s": {"median": 0.008765417005633935, "spread": 0.0018198750003648456}, "fit_transform_s": {"median": 0.045569374997285195, "spread": 0.0030776045023230836}, "n_out_cols": 1, "quality": {}, "case": "regression", "n": 100000, "cardinality": 50}
30
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "combination", "fit_s": {"median": 0.15021979200537317, "spread": 0.03921062499648542}, "transform_s": {"median": 0.13422654200257966, "spread": 0.02950145799695747}, "fit_transform_s": {"median": 0.40024404100404354, "spread": 0.046770541499427054}, "n_out_cols": 1, "quality": {}, "case": "multi_column", "n": 100000, "n_cols": 4}
31
+ {"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "count", "fit_s": {"median": 0.009142125003563706, "spread": 0.0004794999986188486}, "transform_s": {"median": 0.008392792005906813, "spread": 0.0006406665015674662}, "fit_transform_s": {"median": 0.018366208001680207, "spread": 0.0004029789997730404}, "n_out_cols": 1, "quality": {}, "case": "high_cardinality", "n": 100000, "cardinality": 5000}
@@ -197,3 +197,120 @@ session retries a dead end. Newest at the top. Each entry links its verdict when
197
197
  - Verdict: docs/verdicts/2026-06-26-numeric-te-verdict.md
198
198
 
199
199
  <!-- Append new experiments below this line. Never edit or delete prior entries. -->
200
+
201
+ ## 2026-06-27 — extend the single-pass OOF kernel to var/std + hybrid gate (PR-C)
202
+ - Hypothesis: the complement-subtraction kernel already accumulates per-(fold,key) sum-of-squares, so
203
+ var/std are a cheap finalize from the same complement moments — sample var `(ss−s²/cc)/(cc−1)`,
204
+ std `√var` (ddof=1) — with a per-fold complement-global fallback when complement count
205
+ `< max(min_samples,1)` or `< 2` (singleton variance undefined). A hybrid gate runs additive stats
206
+ fast and leaves median/min/max/skew/custom on the per-fold loop.
207
+ - Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; in-process **interleaved** before/after
208
+ (before = pre-PR gate `{"mean"}`: mean fast, var/std slow), 7 reps, n=200k & 1M, 2 cols, cv=5;
209
+ `tests/test_additive_fast_path.py` 48-config equivalence matrix {var,std}×{min_samples 1/2/5}×
210
+ {missing,unknown}×{single,combination} + a hybrid mixed-stat case; independent pure-pandas leakage
211
+ reconstruction; `/leakage-audit`.
212
+ - Result: KEEP — 2.67–2.82× on var-only and mean+var+std, 1.47–1.49× mixed (median stays slow);
213
+ output allclose to the per-fold path (≤3.4e-13 var / 7.1e-15 std; allclose-not-bitwise, invariant
214
+ #2), noise-trap OOF corr −0.004 (signal +0.445), asymmetry 20.5; 167 passed, 8 skipped; ruff clean.
215
+ No default changed; `_smoothing`/`_aggregations` untouched. Within the fast path a unit's
216
+ mean/var/std share one factorize + one composite bincount.
217
+ - Verdict: docs/verdicts/2026-06-27-pr-c-additive-var-std-verdict.md. Research note (next lever):
218
+ docs/notes/2026-06-27-cuml-vs-sklearn-te-levers.md.
219
+
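The complement-subtraction idea in the PR-C entry can be sketched in plain NumPy. This is an illustration under simplifying assumptions, not the contents of `src/catstat/_cross_fit.py`: the function name `oof_mean_var` is hypothetical, keys are assumed to be non-negative integer codes, smoothing is omitted, and the fallback uses the raw complement-global moments of each fold.

```python
import numpy as np


def oof_mean_var(keys, y, folds, n_folds, min_samples=1):
    """Hypothetical sketch: out-of-fold mean and sample variance in a single pass."""
    n_keys = int(keys.max()) + 1
    cell = folds * n_keys + keys  # composite (fold, key) index
    size = n_folds * n_keys
    shape = (n_folds, n_keys)
    cnt = np.bincount(cell, minlength=size).reshape(shape)
    s = np.bincount(cell, weights=y, minlength=size).reshape(shape)
    ss = np.bincount(cell, weights=y * y, minlength=size).reshape(shape)

    # Complement moments: per-key totals minus the held-out fold's share.
    cc = cnt.sum(axis=0) - cnt
    cs = s.sum(axis=0) - s
    css = ss.sum(axis=0) - ss

    # Per-fold complement-global moments, used as the fallback.
    gc, gs, gss = cc.sum(axis=1), cs.sum(axis=1), css.sum(axis=1)
    g_mean = gs / gc
    g_var = (gss - gs**2 / gc) / (gc - 1)

    with np.errstate(divide="ignore", invalid="ignore"):
        mean = cs / cc
        var = (css - cs**2 / cc) / (cc - 1)  # sample variance, ddof=1

    floor = max(min_samples, 1)
    mean = np.where(cc >= floor, mean, g_mean[:, None])
    var = np.where(cc >= max(floor, 2), var, g_var[:, None])
    return mean[folds, keys], var[folds, keys]


rng = np.random.default_rng(0)
n = 200
keys = rng.integers(0, 5, size=n)
folds = np.arange(n) % 4
y = rng.normal(size=n)
oof_mean, oof_var = oof_mean_var(keys, y, folds, n_folds=4)
```

Each row's encoding is computed only from rows in the other folds, so the result should agree with a per-fold group-by loop up to floating-point error. The sum-of-squares form is one reason to expect `allclose` rather than bitwise equality.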
220
+ ## 2026-06-27 — integer-code gather on the transform path (KI-031)
221
+ - Hypothesis: `_transform_array` re-hashed each unit's keys once per (stat,class) column via
222
+ `pd.Series.map`; since a unit's stats share one category *set* (only order differs), factorizing
223
+ the keys once (`index.get_indexer`) and gathering each column from a contiguous float64 array
224
+ aligned to a canonical index cuts transform to one hash per unit + a fancy index per column, with
225
+ bit-identical outputs (unknown code −1 → NaN reproduces `.map`; values bake the §11 global so there
226
+ is no other NaN). Speeds up `transform`, the `fit_transform` refit, and the per-fold slow OOF path.
227
+ - Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; in-process **interleaved** old(`.map`)/new(gather)
228
+ on the same fitted tables, 7 reps, n=1M, single- & multi-col, `stats={mean,var,std,median}`;
229
+ `tests/test_transform_gather.py` (mixed-order multi-stat alignment, combination unknown/known joint
230
+ key, tiny-n baked-global vs unseen-under-`handle_unknown`); independent noise-trap leakage audit;
231
+ `/leakage-audit` + `/sklearn-compat`.
232
+ - Result: KEEP — transform ×2.28 (4-stat), ×3.36 (4-stat high-card 50k), ×2.48 (combination), ×1.00
233
+ single-stat (no-unknown fast path = a single fancy index); outputs allclose(equal_nan); 170 passed,
234
+ 8 skipped; ruff clean. Leakage PASS (OOF corr −0.013 mean / −0.012 median; leaky +0.65; asymmetric).
235
+ sklearn PASS incl. pickle round-trip of the new `_UnitEncoding`. `categories_` / `global_stats_` /
236
+ `target_mean_` unchanged (canonical = first column's index). No default changed; committed baseline
237
+ NOT updated (it predates the perf arc).
238
+ - Verdict: docs/verdicts/2026-06-27-transform-gather-verdict.md. Next lever: integer **joint** codes
239
+ (`c_a*n_b+c_b`) → vectorize combination key-build (KI-019) + unblock GPU `combination` (KI-018).
240
+
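The factorize-once gather described in the entry above can be sketched as follows. This is a minimal illustration, not catstat's `_transform_array`: the fitted table, the category index, and the variable names are all hypothetical, and only public pandas/numpy calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical fitted table for one unit: two stat columns sharing one category index.
index = pd.Index(["a", "b", "c"])
table = {
    "mean": np.array([0.10, 0.50, 0.90]),
    "std": np.array([0.01, 0.05, 0.09]),
}
keys = pd.Series(["b", "zzz", "a", "c", "b"])  # "zzz" was never seen at fit time

# Old style: every stat column hashes the keys again through Series.map.
old = {
    name: keys.map(pd.Series(vals, index=index)).to_numpy()
    for name, vals in table.items()
}

# New style: hash the keys once, then gather each column by integer code.
codes = index.get_indexer(keys)        # unknown keys come back as -1
unknown = codes < 0
safe_codes = np.where(unknown, 0, codes)

new = {}
for name, vals in table.items():
    col = vals[safe_codes]             # fancy index returns a fresh float64 array
    col[unknown] = np.nan              # reproduce .map's NaN for unseen keys
    new[name] = col
```

With N stat columns per unit, the old path performs N hash lookups over the keys while the gather performs one, which is consistent with the win growing with the number of stats and staying neutral for a single stat.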
241
+ ## 2026-06-27 — integer mixed-radix joint codes for `combination` (lever #2A, KI-019)
242
+ - Hypothesis: a `combination` unit built its joint key as a Python object-array of **tuples** then
243
+ grouped/looked-up on tuple hashing (the last per-row Python loop, KI-019; also why GPU is host-only,
244
+ KI-018). Replacing the tuple with a vectorized mixed-radix **int64 joint code**
245
+ (`((c0*n1+c1)*n2+c2)…`), learned once from full X (value-stable per-component maps reused at
246
+ fit/fold/transform) and fed to the PR #7 gather, should be faster with no output change — the code
247
+ is a pure relabeling of the same row grouping. Unknown component → −1 sentinel (existing fallback);
248
+ `prod(n_c) > int64.max` → declines int path, falls back to tuple build.
249
+ - Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; in-process **interleaved** per rep across three
250
+ `_unit_keys` impls — genexpr (original loop), zip (PR #2), intcode (new) — `make_multi_column`
251
+ (4 cols card-20, cv=5), n=200k (7 reps) & 1M (5 reps); `tests/test_joint_codes.py` (stable/distinct
252
+ codes, decode roundtrip, −1 sentinel, overflow fallback), new combination tests in
253
+ `test_multi_feature.py` (joint-unseen, missing value/return_nan, categories_ tuples, determinism),
254
+ combination OOF reconstruction in `test_cross_fit_no_leakage.py`; `/leakage-audit` + `/sklearn-compat`.
255
+ - Result: KEEP — byte-identical output (max|Δ|=0.00e+00 at 200k & 1M across all three impls); at 1M
256
+ combination transform ×4.35 vs the loop / ×2.93 vs PR #2's zip, fit_transform ×2.42 / ×1.67 (win
257
+ grows with N); 180 passed, 8 skipped; ruff clean. Leakage PASS (OOF reconstruction max|Δ|=4.4e-16;
258
+ noise-trap OOF corr 0.06 vs leaky 0.84 for smooth=0 and "auto"; asymmetry 0.022>0). sklearn PASS
259
+ incl. pickle round-trip of `_unit_keyplans`; `categories_` decoded back to value tuples (unchanged
260
+ representation), feature names / §11 fallback / defaults unchanged. Committed baseline NOT updated.
261
+ - Verdict: docs/verdicts/2026-06-27-integer-joint-codes-verdict.md. Closes KI-019, supersedes PR #2.
262
+ Next lever: **#2B GPU `combination`** (KI-018) — drop `host_only` combination clause + joint codes
263
+ in `_gpu.py`, **mandatory** Colab CPU/GPU parity.
264
+
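A minimal sketch of the mixed-radix joint code from the entry above, assuming each component already has a fitted category index. The function name `joint_codes` and its signature are hypothetical; the overflow check and the −1 sentinel mirror the behaviour the log describes, not the package's actual code.

```python
import numpy as np
import pandas as pd

def joint_codes(columns, categories):
    """Combine per-component codes into one int64 code per row.

    columns: list of 1-D arrays (one per component column).
    categories: list of pd.Index holding each component's fitted categories.
    Returns None when the code space would overflow int64 (caller falls back).
    """
    sizes = [len(cats) for cats in categories]
    total = 1
    for n in sizes:                      # Python ints, so this cannot overflow
        total *= n
    if total > np.iinfo(np.int64).max:
        return None

    n_rows = len(columns[0])
    joint = np.zeros(n_rows, dtype=np.int64)
    unknown = np.zeros(n_rows, dtype=bool)
    for col, cats, n in zip(columns, categories, sizes):
        c = cats.get_indexer(col)        # -1 for an unseen component value
        unknown |= c < 0
        joint = joint * n + np.where(c < 0, 0, c)   # ((c0*n1 + c1)*n2 + c2)...
    joint[unknown] = -1                  # any unseen component makes the row unknown
    return joint

cats_a = pd.Index(["x", "y"])
cats_b = pd.Index([10, 20, 30])
col_a = np.array(["x", "y", "y", "q"], dtype=object)
col_b = np.array([30, 10, 20, 10])
codes = joint_codes([col_a, col_b], [cats_a, cats_b])
```

Because distinct component tuples map to distinct integers, grouping on `codes` yields the same row partition as grouping on tuples, which is why the output is expected to be unchanged.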
265
+ ## 2026-06-27 — explicit interaction groups (`interactions=[[...]]`)
266
+ - Hypothesis: the engine already treats a "unit" as an arbitrary column group (tuple keys), so an
267
+ explicit `interactions: list[list[str]]` param that appends one joint unit per group is mostly a
268
+ `_units`-construction + param change; OOF / naming / unknown-missing / parity reuse the existing
269
+ unit machinery. Generalizes `multi_feature_mode="combination"` (joint-only) by adding joint columns
270
+ on top of the independent `cols`.
271
+ - Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; `tests/test_interactions.py` (naming, equality
272
+ with the combination encoder, multi-stat, dedup, validation errors, clone/get_params); sklearn-compat
273
+ spot-checks (clone/set_params roundtrip, Pipeline, ColumnTransformer, set_output, feature-name
274
+ width); `scripts/check.sh` green.
275
+ - Result: KEEP — `a+b__te_*` columns added additively; the interaction column == the combination
276
+ encoder's column (allclose); duplicates deduped; invalid groups raise; clone/get_params preserve the
277
+ param. sklearn-compat PASS. Branch off main (independent of the perf PRs). Joint keys stay
278
+ GPU-host-only (KI-018).
279
+ - Verdict: n/a (feature; no default changed).
280
+
281
+ ## 2026-06-27 — GPU `combination` unblocked (lever #2B, KI-018) — CODE; Colab parity PENDING
282
+ - Hypothesis: now that combination/interaction units key on **int64 mixed-radix joint codes** (lever
283
+ #2A, host-built in `_unit_keys`), they no longer need tuple keys on the device — cuDF can group an
284
+ int64 column directly. So dropping the `len(cols) > 1` clause from `host_only` (`host_only = not
285
+ all_gpu`) should let combination run on the GPU backend, with parity intact because the joint codes
286
+ are byte-identical on both backends (built on host) and only the device group-by differs — the same
287
+ situation already validated for single-column. A missing component is folded into an ordinary int
288
+ code on the host, so no MISSING sentinel reaches the device (`_gpu._to_nullable` returns early for
289
+ non-object key arrays).
290
+ - Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0 (CPU-only box, **no local GPU**). CPU green gate
291
+ (`scripts/check.sh`); verified `backend='gpu'`+combination still **raises** on a no-GPU box (no
292
+ silent fallback) and `auto`/`cpu` combination unchanged. Added combination (mean/var) +
293
+ missing-component + interactions parity cases to `tests/test_cpu_gpu_parity.py` (gpu-marked, skipped
294
+ locally) and `scripts/colab_gpu_parity.py`.
295
+ - Result: CODE COMPLETE, **NOT YET VALIDATED** — the device path changed, so CPU/GPU `allclose` on a
296
+ real GPU is the mandatory gate and **I cannot run it** (no local GPU). Maintainer must run
297
+ `bash scripts/colab_gpu_parity.sh` (T4); combination/missing/interactions must show
298
+ `transform_allclose` + `fit_transform_allclose` true and `backend_gpu == "gpu"`.
299
+ - Verdict: **VALIDATED on Colab T4 (2026-06-27)** — combination mean/var, missing-component, and
300
+ interactions all `transform`+`fit_transform` allclose (max|Δ| ≤ 3.8e-15, fit_transform 0.0) with
301
+ `backend_=gpu`; pre-existing single-column/numeric cases still pass. **KI-018 RESOLVED.**
302
+ `docs/verdicts/2026-06-27-gpu-parity-report.md`, `benchmarks/results/2026-06-27-T4-gpu-parity.jsonl`.
303
+ Crossover re-confirms `auto` stays off (GPU ~parity only at ≥5M: 0.93×@1M, 1.22×@5M, 1.07×@10M;
304
+ KI-020 unchanged). KEEP → merge `feat/perf-gpu-combination`.
305
+
306
+ ## 2026-06-27 — 0.3.0 release prep + GitHub Pages enabled (ops, not an experiment)
307
+ - 0.2.0 is already on PyPI; `main` gained `interactions=` (new public param), the single-pass OOF
308
+ perf arc, integer joint codes, and GPU combination since — so the next release is **0.3.0** (minor:
309
+ backwards-compatible feature add). Bumped pyproject + `__init__` (in sync), wrote CHANGELOG
310
+ `[0.3.0]`; `python -m build` + `twine check` PASSED (sdist+wheel, `py.typed` present); clean-venv
311
+ install imports `0.3.0` and runs the new `interactions` path + CountEncoder. Merged via PR #10.
312
+ **Tag `v0.3.0` + publish is the maintainer's step** (Trusted Publishing fires on the tag).
313
+ - **GitHub Pages enabled** (source = GitHub Actions, via `gh api`); the Docs workflow's deploy had
314
+ been failing only because Pages was off — re-ran it green, site live at
315
+ https://matapanino.github.io/catstat/. Action versions still warn on Node 20 deprecation (future
316
+ bump of `actions/checkout@v4` etc.).
@@ -16,10 +16,11 @@ exact). KI-010 (auto-smoothing parity) remains open.
16
16
  | KI-003 | — | ~~`multi_feature_mode="combination"` not implemented~~ | **Resolved 2026-06-26** (joint group-by). |
17
17
  | KI-004 | S3 | Ordered (CatBoost) / leave-one-out modes absent | P3 options. |
18
18
  | KI-005 | S3 | `set_output("polars")` not supported | pandas/numpy/`set_output("pandas")` work; polars in P3. |
19
- | KI-018 | S3 | GPU `combination` (tuple keys) forced to CPU | missing-as-value now works on GPU (validated 2026-06-26); only combination remains host-only. |
20
- | KI-019 | S3 | combination joint-key build is a Python loop | O(n) host loop; vectorize for large N. |
21
- | KI-020 | S2 | GPU not faster than CPU up to 1M rows (T4) | host↔device round-trip per OOF fold dominates. `auto` disabled; perf needs on-device keys/folds. `docs/verdicts/2026-06-26-gpu-crossover-verdict.md`. |
19
+ | KI-018 | | ~~GPU `combination` forced to CPU~~ | **Resolved 2026-06-27**: combination/interaction now run on GPU — `host_only = not all_gpu` + host-built **int64 joint codes** (KI-019) flow straight to the device group-by (`_gpu._to_nullable` skips the MISSING remap for non-object keys; a missing component is already folded into an integer code). **CPU/GPU `allclose` validated on Colab T4 (2026-06-27)**: combination mean/var, missing-component, and interactions all `transform`+`fit_transform` allclose (max\|Δ\| ≤ 3.8e-15, fit_transform 0.0), `backend_=gpu`. `docs/verdicts/2026-06-27-gpu-parity-report.md`. `backend='gpu'` still raises without RAPIDS (no silent fallback); `auto` stays CPU (KI-020 crossover unchanged — ~parity only at ≥5M). |
20
+ | KI-019 | | ~~combination joint-key build is a Python loop~~ | **Resolved 2026-06-27**: replaced by vectorized mixed-radix **int64 joint codes** (`((c0*n1+c1)*n2+c2)…`), learned once from full X and reused at fit/fold/transform; byte-identical (max\|Δ\|=0 at 200k–1M), combination transform ×3.7–4.4 / fit_transform ×1.5–2.4 vs the loop. **Supersedes PR #2** (which only built tuples faster). `docs/verdicts/2026-06-27-integer-joint-codes-verdict.md`. |
21
+ | KI-020 | S2 | GPU reaches ~parity only at ≥5M rows (T4); `auto` stays off | Post-complement-subtraction (host): per-fold round-trip removed → crossover **0.67×@1M → 1.11×@5M, 1.06×@10M** (marginal + noisy: 1M was 0.67 vs 0.98 across runs). GPU scales sublinearly but the win is within noise; `auto` stays disabled, explicit gpu validated (allclose, mean ft now exact). `docs/verdicts/2026-06-26-gpu-crossover-postPRB-verdict.md`. |
22
22
  | KI-030 | S3 | Numeric TE (0.2.0): `Count`/`Frequency` don't bin; numpy-object & bool route to categorical | `numeric=` is `TargetEncoder`-only. Numeric auto-detection needs real numeric dtypes, so numpy-array input (all-object after `prepare_X`) and bool columns are treated as categorical/direct, not binned. Edges are computed once from full-train X (leakage-safe, ⊥ y). **GPU:** numeric keys are emitted as **strings** — the first Colab T4 run hit `MixedTypeError` (cuDF rejects object-dtype *integer* arrays) with int bin-ids/values; fixed by stringifying keys (matches the validated string-categorical path). CPU/GPU allclose **validated on T4 (2026-06-26)** for `numeric_auto`/`numeric_bin` (max\|Δ\| ~1e-17). |
23
+ | KI-031 | S3 | Transform `map`→**gather done**; non-additive stats still re-fit per fold | **2026-06-27**: `_transform_array` now factorizes each unit's keys once (`index.get_indexer`) and **gathers** each column from a contiguous float64 array (`_UnitEncoding`), replacing per-column `pd.Series.map`: transform ×2.3–3.4 (multi-stat / high-card), single-stat neutral, outputs allclose, leakage + sklearn-compat PASS (`docs/verdicts/2026-06-27-transform-gather-verdict.md`). **Still open:** median/min/max/skew/custom re-fit per fold in the hybrid OOF slow path (now faster via the gather, but not on the single-pass kernel). **Follow-up:** ✅ integer **joint** codes (`c_a*n_b+c_b`) done: combination key-build vectorized (KI-019, 2026-06-27); GPU `combination` (KI-018) is also resolved (2026-06-27). See `docs/notes/2026-06-27-cuml-vs-sklearn-te-levers.md`. |
23
24
 
24
25
  ## Open risks to track (carry into implementation)
25
26
  | id | sev | risk | mitigation |
@@ -0,0 +1,79 @@
1
+ # Why is cuML's TargetEncoder faster than sklearn's — and which CPU levers is catstat still missing?
2
+
3
+ - Date: 2026-06-27
4
+ - Scope: a research scout (no code) for the perf arc. catstat **already** has the single-pass
5
+ composite-groupby + OOF-by-complement-subtraction algorithm on CPU (PR-B mean, PR-C var/std). The
6
+ question: what *else* makes cuML fast that ports to a pandas/numpy CPU path?
7
+ - Sources read: cuML `python/cuml/cuml/preprocessing/TargetEncoder.py` (`_groupby_agg`,
8
+ `_fit_transform`, `_make_fold_column`); the RAPIDS "Target Encoding with cuML" blog; sklearn
9
+ `preprocessing/_target_encoder.py` + the Cython `_target_encoder_fast.pyx`
10
+ (`_fit_encoding_all_targets`), `utils/_encode.py`, `preprocessing/_encoders.py`.
11
+
12
+ ## Headline finding
13
+
14
+ **cuML offers nothing to port algorithmically.** Its `_groupby_agg` is exactly the
15
+ complement-subtraction trick catstat already has: `groupby([fold]+x_cols).agg` then
16
+ `groupby(x_cols).agg` on the small per-fold table, then subtract. It represents categories as **raw
17
+ object/string keys** (cuDF GPU hash join) — *no* integer codes. The RAPIDS ~100× is GPU parallelism
18
+ over cuDF groupby/hash-join + the ~4× from doing all folds in parallel (which complement-subtraction
19
+ already buys). So cuML ≠ a source of CPU levers.
20
+
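The complement-subtraction pattern named above (group by fold and key, aggregate again by key, subtract) can be sketched for an out-of-fold mean as below. This is an illustrative pandas version under simplifying assumptions: one key column, no smoothing, and an empty complement left as NaN rather than replaced by a global fallback. The function name is hypothetical and is not catstat's `kfold_mean_oof_fast`.

```python
import numpy as np
import pandas as pd

def oof_mean_by_complement(keys, y, fold):
    """Out-of-fold mean per row: mean of y over same-key rows in the *other* folds."""
    df = pd.DataFrame({"k": keys, "y": y, "fold": fold})

    # Pass 1: per-(fold, key) sums and counts over all rows.
    per_fold = df.groupby(["fold", "k"])["y"].agg(["sum", "count"]).reset_index()

    # Pass 2: per-key totals, computed on the small per-fold table.
    total = per_fold.groupby("k")[["sum", "count"]].sum()
    per_fold = per_fold.join(total, on="k", rsuffix="_all")

    # Complement = total minus this fold's own contribution.
    c_sum = per_fold["sum_all"] - per_fold["sum"]
    c_cnt = per_fold["count_all"] - per_fold["count"]
    per_fold["enc"] = c_sum / c_cnt      # 0/0 gives NaN when the key lives in one fold only

    # Apply back to rows; a left merge keeps the original row order.
    merged = df.merge(per_fold[["fold", "k", "enc"]], on=["fold", "k"], how="left")
    return merged["enc"].to_numpy()

keys = np.array(["a", "a", "a", "b", "b"], dtype=object)
y = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
fold = np.array([0, 1, 1, 0, 1])
enc = oof_mean_by_complement(keys, y, fold)
```

The per-fold work happens on the aggregated table, whose size depends on the number of (fold, key) pairs rather than the number of rows, so no per-fold refit over the full data is needed.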
21
+ **The CPU levers come from scikit-learn**, whose `TargetEncoder` is the opposite design: it
22
+ cross-fits **per fold** (n_folds passes — algorithmically *worse* than catstat) but makes each pass
23
+ extremely cheap via an **integer-code representation**:
24
+ - `OrdinalEncoder` integer-codes every column **once** at fit → a C-contiguous `int` matrix
25
+ `X_ordinal`. Hashing is paid once, upfront.
26
+ - Cython `_fit_encoding_fast` does `sums[code] += y` / `counts[code] += 1` — an **O(1) array
27
+ scatter-add**, no hash lookup on the hot path; smoothing is folded into the accumulator init
28
+ (`sums[c] = smooth*y_mean`, `counts[c] = smooth`) so the m-estimate falls out of the final
29
+ division for free; buffers reused across features; `nogil`.
30
+ - Transform is a **pure numpy gather**: `encoding[X_ordinal[rows, col]]` — no pandas, no `.map()`,
31
+ no merge — into a pre-allocated `X_out`.
32
+
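The scatter-add with smoothing folded into the accumulator init can be illustrated in plain numpy. This sketch only mirrors the idea described in the bullets above; it is not sklearn's Cython kernel, and the function name and arguments are hypothetical.

```python
import numpy as np

def smoothed_means(codes, y, n_categories, smooth):
    """m-estimate per category: (sum_y + smooth * y_mean) / (count + smooth)."""
    y_mean = y.mean()
    # Seeding the accumulators with the prior makes the final division
    # produce the smoothed estimate with no separate smoothing step.
    sums = np.full(n_categories, smooth * y_mean, dtype=np.float64)
    counts = np.full(n_categories, smooth, dtype=np.float64)
    np.add.at(sums, codes, y)        # sums[code] += y, handles repeated codes
    np.add.at(counts, codes, 1.0)    # counts[code] += 1
    return sums / counts

codes = np.array([0, 0, 1])
y = np.array([1.0, 3.0, 10.0])
enc = smoothed_means(codes, y, n_categories=2, smooth=2.0)

# Transform is then a pure gather over integer codes.
encoded_rows = enc[codes]
```

With `smooth=0` a category that never occurs would divide 0 by 0, so a real implementation needs a fallback for that case; the sketch assumes `smooth > 0`.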
33
+ catstat's fast kernel **already** uses `pd.factorize` + `np.bincount` internally (so the *fit*
34
+ accumulation is sklearn-equivalent). The unadopted half is the **transform path** and the **fitted
35
+ representation**: `_transform_array` still maps via `pd.Series.map` on object keys (profiled at
36
+ **52% `get_indexer`** of a multi-stat transform), and the slow per-fold loop (median/min/max/skew/
37
+ custom) re-factorizes per stat.
38
+
39
+ ## Ranked CPU-applicable levers (for a path that already has complement-subtraction)
40
+
41
+ | # | lever | where it fires | expected payoff | risk | sklearn? | cuML? |
42
+ |---|-------|----------------|-----------------|------|----------|-------|
43
+ | 1 ✅ | **factorize-once + numpy GATHER** (store encodings as `float64[code]`; transform = `enc[codes]`) — **shipped 2026-06-27** | transform (and fit lookups) | **measured ×2.3–3.4** (multi-stat / high-card; one `get_indexer` per *unit*, not per column; single-stat neutral) | Low–Med | yes | no |
44
+ | 2 | **bincount over integer (joint) codes** for the remaining slow-path stats; `joint = code_a*n_b+code_b` | fit accumulation; **joint keys** | 5–30× at large N; integer joint codes also unblock **GPU combination (KI-018)** | Med | yes (scatter-add) | no |
45
+ | 3 | **pre-allocated output, no merge at apply-back** | transform/apply | 5–15× | Low | yes | no |
46
+ | 4 | **dtype discipline** — int32 codes, C-contiguous, float64 throughout | both, large N | 5–20% | Very low | yes | partial |
47
+ | 5 | smoothing folded into accumulator init | fit smoothing | <5% (catstat's vectorized smoothing is already ~free) | trivial | yes | no |
48
+ | 6 | column-level parallelism (joblib over independent units) | both | up to n_cores | Med | no | inherent |
49
+
50
+ ## Recommendation for catstat
51
+
52
+ The single highest-leverage intervention is **#1 + #2 together**: a factorize-once, integer-code
53
+ **gather** path. Concretely —
54
+ - **`_transform_array`**: at fit, store each unit's encoding as a `float64` array indexed by the
55
+ unit's integer codes (keep the object→code mapping built from `categories_`); at transform,
56
+ `pd.factorize`/`searchsorted` the keys once and gather (`enc[codes]`, unknown = code −1 → fallback),
57
+ replacing `pd.Series.map`. This speeds **every** `transform`/inference call, not just `fit_transform`.
58
+ - **joint codes**: `code_a * n_b + code_b` (int64) replaces object tuple keys for combination/
59
+ `interactions` — removing the per-row tuple build (KI-019's residue) and giving cuDF an integer
60
+ column to group on, which **unblocks GPU `combination`** (KI-018).
61
+
62
+ What to measure (in-process before/after, the attributable method): the **transform** step alone at
63
+ n ≥ 1M, single- and multi-column, vary the cardinality; expect 10–50× on that step. Watch the small-N
64
+ break-even (pandas has low overhead; bincount/gather wins grow with N). Tracked as **KI-031**; this
65
+ is the next CPU arc after PR-C, ahead of the GPU on-device port (KI-020).
66
+
67
+ ## Status — lever #1 shipped (2026-06-27)
68
+
69
+ Lever #1 landed on `feat/perf-integer-code-gather` (stacked on `feat/perf-additive-var-std`):
70
+ `_transform_array` factorizes each unit's keys once (`index.get_indexer`) and gathers each column from
71
+ a `float64` array aligned to a per-unit canonical index (`_UnitEncoding`), replacing per-column
72
+ `pd.Series.map`. Measured (in-process interleaved, n=1M, 7 reps): transform **×2.28** (4-stat),
73
+ **×3.36** (4-stat high-card 50k), **×2.48** (combination), **×1.00** single-stat (no-unknown fast path
74
+ = a single fancy index). The 10–50× estimate above assumed `get_indexer` could be eliminated entirely;
75
+ in practice the gather still pays **one** `get_indexer` per unit to locate *arbitrary* transform keys,
76
+ so the real win is "a unit's N stats share one hash" (≈2–3× at 4 stats) plus dropping the pandas
77
+ `Series.map` overhead. Outputs allclose; leakage + sklearn-compat PASS;
78
+ `docs/verdicts/2026-06-27-transform-gather-verdict.md`. **Lever #2 (integer joint codes → vectorized
79
+ combination key-build, KI-019 + GPU combination KI-018) is the next, separate PR.**
@@ -42,9 +42,14 @@ verdict-backed), pending the maintainer's `v0.2.0` tag. Publishing is tag-driven
42
42
  missing-as-value** (cuDF nulls), transform + fit_transform. Two verdicts (parity + crossover).
43
43
  - ✅ **Crossover measured**: GPU is *slower* than CPU up to 1M rows (speedup 0.28–0.86) →
44
44
  `backend="auto"` GPU **disabled** (`_AUTO_GPU_ENABLED=False`); explicit `backend="gpu"` stays. KI-020.
45
- - `combination` on GPU (tuple keys, host-only) + vectorize joint-key build (KI-018/019).
46
- - **GPU perf**: keep keys/folds on-device to remove the per-fold host↔device round-trips that
47
- dominate; then re-run the crossover and re-enable `auto` if it wins.
45
+ - combination joint-key build **vectorized** to int64 mixed-radix codes (KI-019, 2026-06-27:
46
+ byte-identical, transform ×3.7–4.4 / fit_transform ×1.5–2.4 at 1M; supersedes PR #2). GPU
47
+ `combination`/`interactions` now run on GPU too (host-built int64 codes → device group-by;
48
+ `host_only = not all_gpu`); **CPU/GPU allclose validated on Colab T4** (KI-018 resolved).
49
+ - **GPU perf** (re-measured 2026-06-26, T4): host complement-subtraction (PR-B) removed the per-fold
50
+ round-trip → crossover ~parity at ≥5M (0.67×@1M, 1.11×@5M, 1.06×@10M; marginal + noisy). `auto`
51
+ **stays off** (data doesn't justify it); a device-resident path is a niche lever.
52
+ `docs/verdicts/2026-06-26-gpu-crossover-postPRB-verdict.md`.
48
53
 
49
54
  ## Phase 3 — advanced — in progress (2026-06-26)
50
55
  - ✅ **Phase 3a**: `skew` (built-in) + **custom-callable aggregations** (`stats=[("q90", fn)]` or
@@ -77,8 +82,9 @@ verdict-backed), pending the maintainer's `v0.2.0` tag. Publishing is tag-driven
77
82
  - ✅ **Project hygiene (0.1.1)**: `CONTRIBUTING.md`, `SECURITY.md`, GitHub issue + PR templates.
78
83
  - ✅ **0.1.1 PUBLISHED (2026-06-26)**: `v0.1.1` tagged → release workflow built + published to PyPI
79
84
  via Trusted Publishing; `pip install catstat==0.1.1` verified in a clean venv; GitHub release created.
80
- - **Maintainer-only:** enable GitHub Pages (Settings → Pages → GitHub Actions) so the Docs
81
- workflow deploys the API site. (KI-020 GPU perf is the optional larger follow-up.)
85
+ - **GitHub Pages enabled (2026-06-27)**: source = GitHub Actions; the Docs workflow now deploys
86
+ the API site to https://matapanino.github.io/catstat/ (the deploy step had been failing only
87
+ because Pages was off). (KI-020 GPU perf is the optional larger follow-up.)
82
88
 
83
89
  ## 0.2.0 — numeric-column target encoding — done ✅ (2026-06-26)
84
90
  - ✅ **Opt-in numeric TE** on `TargetEncoder` (`numeric="ignore"|"auto"|"direct"|"bin"` +
@@ -103,16 +109,42 @@ verdict-backed), pending the maintainer's `v0.2.0` tag. Publishing is tag-driven
103
109
  **M0 bootstrap (2026-06-26)**.
104
110
  - ✅ **Phase 2 (CPU + GPU validated)** 2026-06-26: var/std/median/min/max, combination mode,
105
111
  GPU backend `backends/_gpu.py` **validated CPU/GPU-allclose on a Colab T4**, CI, Colab loop, `git`.
112
+ - **Perf arc (2026-06-26→27, profiling-driven).** ✅ CPU OOF is single-pass via complement
113
+ subtraction: `kfold_mean_oof_fast` replaced the per-fold group-by for pure-mean (2.2–3.4×;
114
+ `docs/verdicts/2026-06-26-pr-b-complement-subtraction-mean-verdict.md`). ✅ **var/std** now ride the
115
+ same kernel (shared complement moments; ddof=1 + complement-global fallback) with a **hybrid** gate
116
+ that keeps median/min/max/skew/custom on the slow loop — 2.7–2.8× on var & mean+var+std, ~1.5× mixed
117
+ (`docs/verdicts/2026-06-27-pr-c-additive-var-std-verdict.md`). The kernel also ports on-device to
118
+ remove per-fold host↔device round-trips (KI-020). ✅ **Transform gather** (2026-06-27): factorize-once
119
+ `index.get_indexer` + numpy gather replaced per-column `pd.Series.map` — transform ×2.3–3.4
120
+ (multi-stat / high-card), single-stat neutral (`docs/verdicts/2026-06-27-transform-gather-verdict.md`,
121
+ KI-031). ✅ **Integer joint codes** (2026-06-27): combination key-build replaced by mixed-radix
122
+ int64 codes — byte-identical, transform ×3.7–4.4 / fit_transform ×1.5–2.4 at 1M, closes KI-019
123
+ (supersedes PR #2). ✅ **GPU `combination`/`interactions`** unblocked (`host_only` drop + int64
124
+ codes to the device group-by); **CPU/GPU allclose validated on Colab T4** (KI-018 resolved).
125
+ - ✅ **Interactions (2026-06-27)**: `interactions=[[...]]` → one joint TE column per group (additive
126
+ to `cols`; generalizes `combination`). `_units` plumbing + one param; OOF / naming / parity reuse
127
+ the unit machinery. `test_interactions.py`; sklearn-compat PASS. Branch `feat/interactions`.
106
128
  - **Phase 2 — remaining.** GPU *performance* (on-device keys/folds; KI-020) and `combination` on
107
129
  GPU (KI-018) — both optional, gated behind a fresh crossover verdict before re-enabling `auto`.
108
130
  - **Phase 3.** quantile/skew/custom + ordered/LOO + `set_output("polars")` + PyPI release.
109
131
 
110
132
  ## "Next" pointer (update each session)
111
- > **Next task:** **0.2.0 opt-in numeric-column target encoding is implemented & green** (branch
112
- > `feat/numeric-target-encoding`): direct / quantile-binned / auto-routed numeric TE, leakage-audited
113
- > (edges ⊥ y; binned OOF exact), sklearn-compat, empirically validated (CV 0.034 → 0.91), defaults
114
- > set by verdict, version bumped to **0.2.0** + CHANGELOG. **Remaining:** (1) maintainer tags
115
- > `v0.2.0` to publish via Trusted Publishing; (2) maintainer runs `scripts/colab_gpu_parity.sh` to
116
- > confirm CPU/GPU allclose on the new binned/direct cases (host-side numpy expected); (3) optional:
117
- > numeric binning for `Count`/`Frequency` (KI-030). Still maintainer-only from 0.1.1: enable GitHub
118
- > Pages. Optional larger follow-up: KI-020 GPU on-device perf (needs a fresh Colab crossover verdict).
133
+ > **Next task:** **Integer joint codes done (2026-06-27, lever #2A)**: combination key-build replaced
134
+ > by vectorized mixed-radix int64 codes (learned once from full X, reused at fit/fold/transform);
135
+ > byte-identical output, combination transform ×3.7–4.4 / fit_transform ×1.5–2.4 at 1M, closes KI-019
136
+ > and supersedes PR #2; leakage + sklearn-compat PASS. The perf stack (#3 mean, #5 var/std, #7 gather,
137
+ > joint codes) and **interactions** (`interactions: list[list[str]]`) are all now merged to main.
138
+ > **Lever #2B GPU `combination` DONE (2026-06-27, `feat/perf-gpu-combination`)**: `host_only = not
139
+ > all_gpu` (combination/interaction no longer forced to CPU); host-built int64 joint codes flow to the
140
+ > device group-by (`_gpu._to_nullable` non-object guard). **CPU/GPU allclose validated on Colab T4**:
141
+ > combination mean/var, missing-component, interactions all pass (max\|Δ\| ≤ 3.8e-15, ft 0.0,
142
+ > `backend_=gpu`); KI-018 resolved. **Perf arc complete** — CPU levers exhausted (cuML had nothing to
143
+ > port; all came from sklearn's integer-code path). **Next:** **PR-D** GPU on-device kernel + a fresh
144
+ > crossover before re-enabling `auto` — but the 2026-06-27 crossover re-confirms GPU only reaches
145
+ > ~parity at ≥5M (0.93×@1M, 1.22×@5M, 1.07×@10M), so `auto` **stays off** and PR-D is a niche lever
146
+ > (KI-020). **Maintainer carryover:** `0.2.0` is already on PyPI and **`0.3.0` is prepared** (version
147
+ > bumped + CHANGELOG + build/twine/smoke verified, merged to main) — the maintainer tags to publish:
148
+ > `git tag -a v0.3.0 -m "catstat 0.3.0" && git push origin v0.3.0` (Trusted Publishing fires on the
149
+ > tag), then a GitHub release from the `[0.3.0]` notes. ✅ GitHub Pages enabled. **Next feature:**
150
+ > `Count`/`Frequency` numeric binning (KI-030).