catstat (PyPI): 0.2.0.tar.gz → 0.3.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (125)

{catstat-0.2.0 → catstat-0.3.0}/CHANGELOG.md RENAMED

@@ -3,6 +3,38 @@
 All notable changes to `catstat` are documented here. Format follows
 [Keep a Changelog](https://keepachangelog.com/); versioning is [SemVer](https://semver.org/).
+## [0.3.0] — 2026-06-27
+### Added
+- **Explicit interaction groups** via a new `interactions` parameter on `TargetEncoder`:
+  `interactions=[["a", "b"], ...]` adds **one joint target-encoded column per group** (additive to
+  the per-column `cols` encodings), generalizing `multi_feature_mode="combination"` (which encodes a
+  single joint column only). Out-of-fold cross-fitting, feature naming (`a+b__te_*`), and the
+  unknown/missing fallback all reuse the existing encoding-unit machinery.
+- **GPU `backend="gpu"` now supports `combination` / `interactions`** (joint units). They key on
+  int64 mixed-radix joint codes (built on the host, so identical on both backends) which the cuDF
+  group-by consumes directly; CPU/GPU `allclose` validated on a Colab T4 (mean/var, missing
+  component, interactions). Previously these were forced to the CPU.
+### Performance
+All performance work below is **output-identical** (allclose; leakage-audited) — no behavior or API
+change. The committed benchmark baseline is unchanged (it predates this arc; see the verdicts).
+- **Single-pass out-of-fold encoding** via complement subtraction for `mean` and `var`/`std`,
+  replacing the per-fold group-by loop (a shared per-`(fold, key)` moment builder; a hybrid gate
+  keeps `median`/`min`/`max`/`skew`/custom on the per-fold path). ~2.2–3.4× (mean), ~2.7–2.8×
+  (var/std).
+- **Factorize-once integer-code transform gather**: `transform` now hashes each unit's keys once
+  (`index.get_indexer`) and gathers each `(stat, class)` column from a contiguous `float64` array,
+  replacing the per-column `pd.Series.map`. transform ~2.3–3.4× (multi-stat / high-cardinality).
+- **Integer mixed-radix joint codes** for `combination` / `interactions` units, replacing the
+  per-row Python tuple build: combination transform ~3.7–4.4× / fit_transform ~1.5–2.4× at 1M rows.
+  (Closes KI-019.)
+### Changed
+- **GPU crossover re-measured** on a Colab T4 after the single-pass kernel: the host-orchestrated GPU
+  path reaches only ~parity at ≥5M rows (≈0.9× at 1M, ≈1.2× at 5M), so `backend="auto"` continues to
+  resolve to **CPU**; explicit `backend="gpu"` stays available and parity-validated. (KI-020.)
 ## [0.2.0] — 2026-06-26
 ### Added
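
The "single-pass out-of-fold encoding via complement subtraction" listed under Performance in the 0.3.0 entry above can be illustrated with a small NumPy sketch. This is not catstat's code: the function name `oof_mean_var` and its arguments are hypothetical, smoothing is omitted, and where the changelog's companion notes describe a complement-global fallback for tiny groups, this sketch simply returns NaN. It assumes keys and folds are already integer-coded.

```python
import numpy as np


def oof_mean_var(codes, y, folds, n_keys, n_folds):
    """Out-of-fold per-key mean and sample variance (ddof=1) in one pass.

    Hypothetical helper for illustration only; not part of catstat.
    """
    # One composite bincount per moment: count, sum, sum of squares per (fold, key).
    flat = folds * n_keys + codes
    size = n_folds * n_keys
    cnt = np.bincount(flat, minlength=size).reshape(n_folds, n_keys).astype(float)
    s = np.bincount(flat, weights=y, minlength=size).reshape(n_folds, n_keys)
    ss = np.bincount(flat, weights=y * y, minlength=size).reshape(n_folds, n_keys)

    # Complement of fold f = totals over all folds minus fold f's own moments.
    cc = cnt.sum(axis=0) - cnt
    cs = s.sum(axis=0) - s
    css = ss.sum(axis=0) - ss

    with np.errstate(divide="ignore", invalid="ignore"):
        mean = np.where(cc >= 1, cs / cc, np.nan)
        # Sample variance from complement moments: (ss - s^2/cc) / (cc - 1).
        var = np.where(cc >= 2, (css - cs**2 / cc) / (cc - 1), np.nan)

    # Each row reads the statistic computed without its own fold.
    return mean[folds, codes], var[folds, codes]


rng = np.random.default_rng(0)
codes = rng.integers(0, 5, 200)   # integer-coded category per row
folds = rng.integers(0, 4, 200)   # fold assignment per row
y = rng.normal(size=200)
oof_mean, oof_var = oof_mean_var(codes, y, folds, n_keys=5, n_folds=4)
```

Because every row's statistic excludes the row's own fold, the target of a row never contributes to its own encoding, and no per-fold group-by loop is needed for additive moments.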

{catstat-0.2.0 → catstat-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: catstat
-Version: 0.2.0
+Version: 0.3.0
 Summary: Unified CPU/GPU statistical categorical encoding: leakage-safe target encoding generalized to arbitrary statistics, with one sklearn-compatible API.
 Project-URL: Homepage, https://github.com/Matapanino/catstat
 Project-URL: Repository, https://github.com/Matapanino/catstat

catstat-0.3.0/benchmarks/results/2026-06-26-T4-gpu-parity.jsonl ADDED

@@ -0,0 +1,12 @@
+{"kind": "parity", "case": "regression_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 3.3306690738754696e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.156, "gpu_ft_s": 0.2071, "status": "ok"}
+{"kind": "parity", "case": "regression_var", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.3322676295501878e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 1.5543122344752192e-15, "fit_transform_allclose": true, "cpu_ft_s": 0.4408, "gpu_ft_s": 0.5033, "status": "ok"}
+{"kind": "parity", "case": "binary_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1786, "gpu_ft_s": 0.2275, "status": "ok"}
+{"kind": "parity", "case": "multiclass_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.4423, "gpu_ft_s": 0.6541, "status": "ok"}
+{"kind": "parity", "case": "regression_mean_missing", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 2.220446049250313e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1691, "gpu_ft_s": 0.2211, "status": "ok"}
+{"kind": "parity", "case": "numeric_auto", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1275702593849246e-17, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.6253, "gpu_ft_s": 0.6774, "status": "ok"}
+{"kind": "parity", "case": "numeric_bin", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 9.540979117872439e-18, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.6356, "gpu_ft_s": 0.6577, "status": "ok"}
+{"kind": "crossover", "n": 10000, "cardinality": 250, "cpu_ft_s": 0.01, "gpu_ft_s": 0.049, "speedup": 0.2}
+{"kind": "crossover", "n": 100000, "cardinality": 2500, "cpu_ft_s": 0.0668, "gpu_ft_s": 0.1083, "speedup": 0.62}
+{"kind": "crossover", "n": 1000000, "cardinality": 25000, "cpu_ft_s": 0.8721, "gpu_ft_s": 1.3093, "speedup": 0.67}
+{"kind": "crossover", "n": 5000000, "cardinality": 125000, "cpu_ft_s": 5.4259, "gpu_ft_s": 4.8961, "speedup": 1.11}
+{"kind": "crossover", "n": 10000000, "cardinality": 250000, "cpu_ft_s": 12.4614, "gpu_ft_s": 11.7917, "speedup": 1.06}

catstat-0.3.0/benchmarks/results/2026-06-27-T4-gpu-parity.jsonl ADDED

@@ -0,0 +1,16 @@
+{"kind": "parity", "case": "regression_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 3.3306690738754696e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1915, "gpu_ft_s": 0.1998, "status": "ok"}
+{"kind": "parity", "case": "regression_var", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.3322676295501878e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1527, "gpu_ft_s": 0.1719, "status": "ok"}
+{"kind": "parity", "case": "binary_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.164, "gpu_ft_s": 0.2105, "status": "ok"}
+{"kind": "parity", "case": "multiclass_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 0.0, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.309, "gpu_ft_s": 0.513, "status": "ok"}
+{"kind": "parity", "case": "regression_mean_missing", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 2.220446049250313e-16, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.1615, "gpu_ft_s": 0.2015, "status": "ok"}
+{"kind": "parity", "case": "numeric_auto", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.0408340855860843e-17, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.8187, "gpu_ft_s": 0.6377, "status": "ok"}
+{"kind": "parity", "case": "numeric_bin", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 8.673617379884035e-18, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.6086, "gpu_ft_s": 0.655, "status": "ok"}
+{"kind": "parity", "case": "combination_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1102230246251565e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.3679, "gpu_ft_s": 0.5603, "status": "ok"}
+{"kind": "parity", "case": "combination_var", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 3.774758283725532e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.3331, "gpu_ft_s": 0.3376, "status": "ok"}
+{"kind": "parity", "case": "combination_mean_missing", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1102230246251565e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.368, "gpu_ft_s": 0.3872, "status": "ok"}
+{"kind": "parity", "case": "interactions_mean", "backend_cpu": "cpu", "backend_gpu": "gpu", "transform_max_abs_diff": 1.1102230246251565e-15, "transform_allclose": true, "fit_transform_max_abs_diff": 0.0, "fit_transform_allclose": true, "cpu_ft_s": 0.577, "gpu_ft_s": 0.8054, "status": "ok"}
+{"kind": "crossover", "n": 10000, "cardinality": 250, "cpu_ft_s": 0.0148, "gpu_ft_s": 0.0567, "speedup": 0.26}
+{"kind": "crossover", "n": 100000, "cardinality": 2500, "cpu_ft_s": 0.1045, "gpu_ft_s": 0.173, "speedup": 0.6}
+{"kind": "crossover", "n": 1000000, "cardinality": 25000, "cpu_ft_s": 0.8383, "gpu_ft_s": 0.9039, "speedup": 0.93}
+{"kind": "crossover", "n": 5000000, "cardinality": 125000, "cpu_ft_s": 5.7528, "gpu_ft_s": 4.7329, "speedup": 1.22}
+{"kind": "crossover", "n": 10000000, "cardinality": 250000, "cpu_ft_s": 11.2374, "gpu_ft_s": 10.5415, "speedup": 1.07}

catstat-0.3.0/benchmarks/results/2026-06-27-transform-gather.jsonl ADDED

@@ -0,0 +1,154 @@
+{
+  "cases": {
+    "binary": {
+      "cardinality": 50,
+      "case": "binary",
+      "fit_s": {
+        "median": 0.013814000005368143,
+        "spread": 0.0011099584990006406
+      },
+      "fit_transform_s": {
+        "median": 0.06258904200512916,
+        "spread": 0.01324962500075344
+      },
+      "n": 100000,
+      "n_out_cols": 1,
+      "pos_rate": 0.3,
+      "quality": {},
+      "transform_s": {
+        "median": 0.007551957998657599,
+        "spread": 0.0006878125022922177
+      }
+    },
+    "combination": {
+      "case": "multi_column",
+      "fit_s": {
+        "median": 0.15021979200537317,
+        "spread": 0.03921062499648542
+      },
+      "fit_transform_s": {
+        "median": 0.40024404100404354,
+        "spread": 0.046770541499427054
+      },
+      "n": 100000,
+      "n_cols": 4,
+      "n_out_cols": 1,
+      "quality": {},
+      "transform_s": {
+        "median": 0.13422654200257966,
+        "spread": 0.02950145799695747
+      }
+    },
+    "count": {
+      "cardinality": 5000,
+      "case": "high_cardinality",
+      "fit_s": {
+        "median": 0.009142125003563706,
+        "spread": 0.0004794999986188486
+      },
+      "fit_transform_s": {
+        "median": 0.018366208001680207,
+        "spread": 0.0004029789997730404
+      },
+      "n": 100000,
+      "n_out_cols": 1,
+      "quality": {},
+      "transform_s": {
+        "median": 0.008392792005906813,
+        "spread": 0.0006406665015674662
+      }
+    },
+    "high_cardinality": {
+      "cardinality": 5000,
+      "case": "high_cardinality",
+      "fit_s": {
+        "median": 0.02094879200012656,
+        "spread": 0.004071979503351031
+      },
+      "fit_transform_s": {
+        "median": 0.04681791699840687,
+        "spread": 0.009238896000169916
+      },
+      "n": 100000,
+      "n_out_cols": 1,
+      "quality": {},
+      "transform_s": {
+        "median": 0.010617084000841714,
+        "spread": 0.003431875000387663
+      }
+    },
+    "multiclass": {
+      "cardinality": 50,
+      "case": "multiclass",
+      "classes": 5,
+      "fit_s": {
+        "median": 0.048849416998564266,
+        "spread": 0.008056833499722416
+      },
+      "fit_transform_s": {
+        "median": 0.12387624999973923,
+        "spread": 0.0140198955014057
+      },
+      "n": 100000,
+      "n_out_cols": 5,
+      "quality": {},
+      "transform_s": {
+        "median": 0.009531249997962732,
+        "spread": 0.00039762500091455877
+      }
+    },
+    "regression": {
+      "cardinality": 50,
+      "case": "regression",
+      "fit_s": {
+        "median": 0.011481834000733215,
+        "spread": 0.00198229200032074
+      },
+      "fit_transform_s": {
+        "median": 0.05065037500025937,
+        "spread": 0.009240791499905754
+      },
+      "n": 100000,
+      "n_out_cols": 1,
+      "quality": {
+        "oof_rmse": 0.5006930887474031
+      },
+      "transform_s": {
+        "median": 0.011224832996958867,
+        "spread": 0.0011437710018071812
+      }
+    },
+    "regression_std": {
+      "cardinality": 50,
+      "case": "regression",
+      "fit_s": {
+        "median": 0.018198292003944516,
+        "spread": 0.005130541496328078
+      },
+      "fit_transform_s": {
+        "median": 0.045569374997285195,
+        "spread": 0.0030776045023230836
+      },
+      "n": 100000,
+      "n_out_cols": 1,
+      "quality": {},
+      "transform_s": {
+        "median": 0.008765417005633935,
+        "spread": 0.0018198750003648456
+      }
+    }
+  },
+  "meta": {
+    "backend": "cpu",
+    "git_sha": "689a268",
+    "ts": "2026-06-26T16:36:02+00:00",
+    "versions": {
+      "catstat": "0.2.0",
+      "numpy": "1.23.5",
+      "pandas": "1.5.2",
+      "python": "3.11.1",
+      "sklearn": "1.2.0"
+    }
+  },
+  "size": "medium"
+}
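
The benchmark file above relates to the "factorize-once integer-code transform gather" named in the 0.3.0 changelog. A minimal sketch of that idea follows, assuming one encoding unit whose statistics share the same category set in different orders. The function `transform_gather` and the two lookup tables are hypothetical illustrations, not catstat internals; only `pandas.Index.get_indexer` and NumPy fancy indexing are taken from the changelog's description.

```python
import numpy as np
import pandas as pd

# Hypothetical fitted tables for one unit: same categories, different order.
mean_tbl = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})
std_tbl = pd.Series({"c": 0.3, "a": 0.1, "b": 0.2})


def transform_gather(keys, tables):
    """Encode keys with one hash lookup per unit instead of one per column."""
    canonical = tables[0].index            # canonical order = first table's index
    codes = canonical.get_indexer(keys)    # hashed once; unknown keys map to -1
    unknown = codes == -1
    out = np.empty((len(keys), len(tables)), dtype=np.float64)
    for j, tbl in enumerate(tables):
        # Align every table to the canonical order as a contiguous float64 array.
        values = tbl.reindex(canonical).to_numpy(dtype=np.float64)
        col = values[codes]                # fancy-index gather (returns a copy)
        col[unknown] = np.nan              # reproduce .map's NaN for unknown keys
        out[:, j] = col
    return out


keys = pd.Series(["b", "a", "zzz", "c", "b"])
encoded = transform_gather(keys, [mean_tbl, std_tbl])
```

The result should match mapping `keys` through each table separately with `pd.Series.map`, while hashing the keys only once regardless of how many statistic columns are produced.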

{catstat-0.2.0 → catstat-0.3.0}/benchmarks/results/ledger.jsonl RENAMED

@@ -22,3 +22,10 @@
 {"ts": "2026-06-26T03:26:17+00:00", "git_sha": "6a75054", "backend": "cpu", "versions": {"catstat": "0.0.1", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "regression_std", "fit_s": {"median": 0.002077291952446103, "spread": 0.00018660450587049127}, "transform_s": {"median": 0.0008032920304685831, "spread": 4.956248449161649e-05}, "fit_transform_s": {"median": 0.018527208012528718, "spread": 0.0026257289573550224}, "n_out_cols": 1, "quality": {}, "case": "regression", "n": 10000, "cardinality": 50}
 {"ts": "2026-06-26T03:26:17+00:00", "git_sha": "6a75054", "backend": "cpu", "versions": {"catstat": "0.0.1", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "combination", "fit_s": {"median": 0.01387550006620586, "spread": 0.002983853977639228}, "transform_s": {"median": 0.009892250061966479, "spread": 0.0016778334975242615}, "fit_transform_s": {"median": 0.09488679200876504, "spread": 0.005979500012472272}, "n_out_cols": 1, "quality": {}, "case": "multi_column", "n": 10000, "n_cols": 4}
 {"ts": "2026-06-26T03:26:17+00:00", "git_sha": "6a75054", "backend": "cpu", "versions": {"catstat": "0.0.1", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "count", "fit_s": {"median": 0.0011707909870892763, "spread": 0.0014458755031228065}, "transform_s": {"median": 0.00082016596570611, "spread": 5.008344305679202e-05}, "fit_transform_s": {"median": 0.0021022919099777937, "spread": 0.00046245806152001023}, "n_out_cols": 1, "quality": {}, "case": "high_cardinality", "n": 10000, "cardinality": 500}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "regression", "fit_s": {"median": 0.011481834000733215, "spread": 0.00198229200032074}, "transform_s": {"median": 0.011224832996958867, "spread": 0.0011437710018071812}, "fit_transform_s": {"median": 0.05065037500025937, "spread": 0.009240791499905754}, "n_out_cols": 1, "quality": {"oof_rmse": 0.5006930887474031}, "case": "regression", "n": 100000, "cardinality": 50}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "binary", "fit_s": {"median": 0.013814000005368143, "spread": 0.0011099584990006406}, "transform_s": {"median": 0.007551957998657599, "spread": 0.0006878125022922177}, "fit_transform_s": {"median": 0.06258904200512916, "spread": 0.01324962500075344}, "n_out_cols": 1, "quality": {}, "case": "binary", "n": 100000, "cardinality": 50, "pos_rate": 0.3}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "multiclass", "fit_s": {"median": 0.048849416998564266, "spread": 0.008056833499722416}, "transform_s": {"median": 0.009531249997962732, "spread": 0.00039762500091455877}, "fit_transform_s": {"median": 0.12387624999973923, "spread": 0.0140198955014057}, "n_out_cols": 5, "quality": {}, "case": "multiclass", "n": 100000, "cardinality": 50, "classes": 5}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "high_cardinality", "fit_s": {"median": 0.02094879200012656, "spread": 0.004071979503351031}, "transform_s": {"median": 0.010617084000841714, "spread": 0.003431875000387663}, "fit_transform_s": {"median": 0.04681791699840687, "spread": 0.009238896000169916}, "n_out_cols": 1, "quality": {}, "case": "high_cardinality", "n": 100000, "cardinality": 5000}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "regression_std", "fit_s": {"median": 0.018198292003944516, "spread": 0.005130541496328078}, "transform_s": {"median": 0.008765417005633935, "spread": 0.0018198750003648456}, "fit_transform_s": {"median": 0.045569374997285195, "spread": 0.0030776045023230836}, "n_out_cols": 1, "quality": {}, "case": "regression", "n": 100000, "cardinality": 50}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "combination", "fit_s": {"median": 0.15021979200537317, "spread": 0.03921062499648542}, "transform_s": {"median": 0.13422654200257966, "spread": 0.02950145799695747}, "fit_transform_s": {"median": 0.40024404100404354, "spread": 0.046770541499427054}, "n_out_cols": 1, "quality": {}, "case": "multi_column", "n": 100000, "n_cols": 4}
+{"ts": "2026-06-26T16:36:02+00:00", "git_sha": "689a268", "backend": "cpu", "versions": {"catstat": "0.2.0", "numpy": "1.23.5", "pandas": "1.5.2", "sklearn": "1.2.0", "python": "3.11.1"}, "case_name": "count", "fit_s": {"median": 0.009142125003563706, "spread": 0.0004794999986188486}, "transform_s": {"median": 0.008392792005906813, "spread": 0.0006406665015674662}, "fit_transform_s": {"median": 0.018366208001680207, "spread": 0.0004029789997730404}, "n_out_cols": 1, "quality": {}, "case": "high_cardinality", "n": 100000, "cardinality": 5000}

{catstat-0.2.0 → catstat-0.3.0}/docs/experiment_log.md RENAMED

@@ -197,3 +197,120 @@ session retries a dead end. Newest at the top. Each entry links its verdict when
 - Verdict: docs/verdicts/2026-06-26-numeric-te-verdict.md
 <!-- Append new experiments below this line. Never edit or delete prior entries. -->
+## 2026-06-27 — extend the single-pass OOF kernel to var/std + hybrid gate (PR-C)
+- Hypothesis: the complement-subtraction kernel already accumulates per-(fold,key) sum-of-squares, so
+  var/std are a cheap finalize from the same complement moments — sample var `(ss−s²/cc)/(cc−1)`,
+  std `√var` (ddof=1) — with a per-fold complement-global fallback when complement count
+  `< max(min_samples,1)` or `< 2` (singleton variance undefined). A hybrid gate runs additive stats
+  fast and leaves median/min/max/skew/custom on the per-fold loop.
+- Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; in-process **interleaved** before/after
+  (before = pre-PR gate `{"mean"}`: mean fast, var/std slow), 7 reps, n=200k & 1M, 2 cols, cv=5;
+  `tests/test_additive_fast_path.py` 48-config equivalence matrix {var,std}×{min_samples 1/2/5}×
+  {missing,unknown}×{single,combination} + a hybrid mixed-stat case; independent pure-pandas leakage
+  reconstruction; `/leakage-audit`.
+- Result: KEEP — 2.67–2.82× on var-only and mean+var+std, 1.47–1.49× mixed (median stays slow);
+  output allclose to the per-fold path (≤3.4e-13 var / 7.1e-15 std; allclose-not-bitwise, invariant
+  #2), noise-trap OOF corr −0.004 (signal +0.445), asymmetry 20.5; 167 passed, 8 skipped; ruff clean.
+  No default changed; `_smoothing`/`_aggregations` untouched. Within the fast path a unit's
+  mean/var/std share one factorize + one composite bincount.
+- Verdict: docs/verdicts/2026-06-27-pr-c-additive-var-std-verdict.md. Research note (next lever):
+  docs/notes/2026-06-27-cuml-vs-sklearn-te-levers.md.
+## 2026-06-27 — integer-code gather on the transform path (KI-031)
+- Hypothesis: `_transform_array` re-hashed each unit's keys once per (stat,class) column via
+  `pd.Series.map`; since a unit's stats share one category *set* (only order differs), factorizing
+  the keys once (`index.get_indexer`) and gathering each column from a contiguous float64 array
+  aligned to a canonical index cuts transform to one hash per unit + a fancy index per column, with
+  bit-identical outputs (unknown code −1 → NaN reproduces `.map`; values bake the §11 global so there
+  is no other NaN). Speeds up `transform`, the `fit_transform` refit, and the per-fold slow OOF path.
+- Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; in-process **interleaved** old(`.map`)/new(gather)
+  on the same fitted tables, 7 reps, n=1M, single- & multi-col, `stats={mean,var,std,median}`;
+  `tests/test_transform_gather.py` (mixed-order multi-stat alignment, combination unknown/known joint
+  key, tiny-n baked-global vs unseen-under-`handle_unknown`); independent noise-trap leakage audit;
+  `/leakage-audit` + `/sklearn-compat`.
+- Result: KEEP — transform ×2.28 (4-stat), ×3.36 (4-stat high-card 50k), ×2.48 (combination), ×1.00
+  single-stat (no-unknown fast path = a single fancy index); outputs allclose(equal_nan); 170 passed,
+  8 skipped; ruff clean. Leakage PASS (OOF corr −0.013 mean / −0.012 median; leaky +0.65; asymmetric).
+  sklearn PASS incl. pickle round-trip of the new `_UnitEncoding`. `categories_` / `global_stats_` /
+  `target_mean_` unchanged (canonical = first column's index). No default changed; committed baseline
+  NOT updated (it predates the perf arc).
+- Verdict: docs/verdicts/2026-06-27-transform-gather-verdict.md. Next lever: integer **joint** codes
+  (`c_a*n_b+c_b`) → vectorize combination key-build (KI-019) + unblock GPU `combination` (KI-018).
+## 2026-06-27 — integer mixed-radix joint codes for `combination` (lever #2A, KI-019)
+- Hypothesis: a `combination` unit built its joint key as a Python object-array of **tuples** then
+  grouped/looked-up on tuple hashing (the last per-row Python loop, KI-019; also why GPU is host-only,
+  KI-018). Replacing the tuple with a vectorized mixed-radix **int64 joint code**
+  (`((c0*n1+c1)*n2+c2)…`), learned once from full X (value-stable per-component maps reused at
+  fit/fold/transform) and fed to the PR #7 gather, should be faster with no output change — the code
+  is a pure relabeling of the same row grouping. Unknown component → −1 sentinel (existing fallback);
+  `prod(n_c) > int64.max` → declines int path, falls back to tuple build.
+- Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; in-process **interleaved** per rep across three
+  `_unit_keys` impls — genexpr (original loop), zip (PR #2), intcode (new) — `make_multi_column`
+  (4 cols card-20, cv=5), n=200k (7 reps) & 1M (5 reps); `tests/test_joint_codes.py` (stable/distinct
+  codes, decode roundtrip, −1 sentinel, overflow fallback), new combination tests in
+  `test_multi_feature.py` (joint-unseen, missing value/return_nan, categories_ tuples, determinism),
+  combination OOF reconstruction in `test_cross_fit_no_leakage.py`; `/leakage-audit` + `/sklearn-compat`.
+- Result: KEEP — byte-identical output (max|Δ|=0.00e+00 at 200k & 1M across all three impls); at 1M
+  combination transform ×4.35 vs the loop / ×2.93 vs PR #2's zip, fit_transform ×2.42 / ×1.67 (win
+  grows with N); 180 passed, 8 skipped; ruff clean. Leakage PASS (OOF reconstruction max|Δ|=4.4e-16;
+  noise-trap OOF corr 0.06 vs leaky 0.84 for smooth=0 and "auto"; asymmetry 0.022>0). sklearn PASS
+  incl. pickle round-trip of `_unit_keyplans`; `categories_` decoded back to value tuples (unchanged
+  representation), feature names / §11 fallback / defaults unchanged. Committed baseline NOT updated.
+- Verdict: docs/verdicts/2026-06-27-integer-joint-codes-verdict.md. Closes KI-019, supersedes PR #2.
+  Next lever: **#2B GPU `combination`** (KI-018) — drop `host_only` combination clause + joint codes
+  in `_gpu.py`, **mandatory** Colab CPU/GPU parity.
+## 2026-06-27 — explicit interaction groups (`interactions=[[...]]`)
+- Hypothesis: the engine already treats a "unit" as an arbitrary column group (tuple keys), so an
+  explicit `interactions: list[list[str]]` param that appends one joint unit per group is mostly a
+  `_units`-construction + param change; OOF / naming / unknown-missing / parity reuse the existing
+  unit machinery. Generalizes `multi_feature_mode="combination"` (joint-only) by adding joint columns
+  on top of the independent `cols`.
+- Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0; `tests/test_interactions.py` (naming, equality
+  with the combination encoder, multi-stat, dedup, validation errors, clone/get_params); sklearn-compat
+  spot-checks (clone/set_params roundtrip, Pipeline, ColumnTransformer, set_output, feature-name
+  width); `scripts/check.sh` green.
+- Result: KEEP — `a+b__te_*` columns added additively; the interaction column == the combination
+  encoder's column (allclose); duplicates deduped; invalid groups raise; clone/get_params preserve the
+  param. sklearn-compat PASS. Branch off main (independent of the perf PRs). Joint keys stay
+  GPU-host-only (KI-018).
+- Verdict: n/a (feature; no default changed).
+## 2026-06-27 — GPU `combination` unblocked (lever #2B, KI-018) — CODE; Colab parity PENDING
+- Hypothesis: now that combination/interaction units key on **int64 mixed-radix joint codes** (lever
+  #2A, host-built in `_unit_keys`), they no longer need tuple keys on the device — cuDF can group an
+  int64 column directly. So dropping the `len(cols) > 1` clause from `host_only` (`host_only = not
+  all_gpu`) should let combination run on the GPU backend, with parity intact because the joint codes
+  are byte-identical on both backends (built on host) and only the device group-by differs — the same
+  situation already validated for single-column. A missing component is folded into an ordinary int
+  code on the host, so no MISSING sentinel reaches the device (`_gpu._to_nullable` returns early for
+  non-object key arrays).
+- Setup: pandas 1.5.2 / numpy 1.23.5 / sklearn 1.2.0 (CPU-only box, **no local GPU**). CPU green gate
+  (`scripts/check.sh`); verified `backend='gpu'`+combination still **raises** on a no-GPU box (no
+  silent fallback) and `auto`/`cpu` combination unchanged. Added combination (mean/var) +
+  missing-component + interactions parity cases to `tests/test_cpu_gpu_parity.py` (gpu-marked, skipped
+  locally) and `scripts/colab_gpu_parity.py`.
+- Result: CODE COMPLETE, **NOT YET VALIDATED** — the device path changed, so CPU/GPU `allclose` on a
+  real GPU is the mandatory gate and **I cannot run it** (no local GPU). Maintainer must run
+  `bash scripts/colab_gpu_parity.sh` (T4); combination/missing/interactions must show
+  `transform_allclose` + `fit_transform_allclose` true and `backend_gpu == "gpu"`.
+- Verdict: **VALIDATED on Colab T4 (2026-06-27)** — combination mean/var, missing-component, and
+  interactions all `transform`+`fit_transform` allclose (max|Δ| ≤ 3.8e-15, fit_transform 0.0) with
+  `backend_=gpu`; pre-existing single-column/numeric cases still pass. **KI-018 RESOLVED.**
+  `docs/verdicts/2026-06-27-gpu-parity-report.md`, `benchmarks/results/2026-06-27-T4-gpu-parity.jsonl`.
+  Crossover re-confirms `auto` stays off (GPU ~parity only at ≥5M: 0.93×@1M, 1.22×@5M, 1.07×@10M;
+  KI-020 unchanged). KEEP → merge `feat/perf-gpu-combination`.
+## 2026-06-27 — 0.3.0 release prep + GitHub Pages enabled (ops, not an experiment)
+- 0.2.0 is already on PyPI; `main` gained `interactions=` (new public param), the single-pass OOF
+  perf arc, integer joint codes, and GPU combination since — so the next release is **0.3.0** (minor:
+  backwards-compatible feature add). Bumped pyproject + `__init__` (in sync), wrote CHANGELOG
+  `[0.3.0]`; `python -m build` + `twine check` PASSED (sdist+wheel, `py.typed` present); clean-venv
+  install imports `0.3.0` and runs the new `interactions` path + CountEncoder. Merged via PR #10.
+  **Tag `v0.3.0` + publish is the maintainer's step** (Trusted Publishing fires on the tag).
+- **GitHub Pages enabled** (source = GitHub Actions, via `gh api`); the Docs workflow's deploy had
+  been failing only because Pages was off — re-ran it green, site live at
+  https://matapanino.github.io/catstat/. Action versions still warn on Node 20 deprecation (future
+  bump of `actions/checkout@v4` etc.).
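
The mixed-radix int64 joint codes described in the experiment log above (`((c0*n1+c1)*n2+c2)…`, with a −1 sentinel for unknown components and an overflow guard) can be sketched in NumPy as follows. All function names here are hypothetical, and the sketch raises on overflow where the log says the library falls back to building tuples; it also assumes every component has at least one category and does not model missing values.

```python
import math

import numpy as np

INT64_MAX = np.iinfo(np.int64).max


def fit_categories(columns):
    """Learn each component's sorted category array once, from the full data."""
    return [np.unique(col) for col in columns]


def component_codes(col, cats):
    """Vectorized value -> integer code; values not present in cats get -1."""
    pos = np.clip(np.searchsorted(cats, col), 0, len(cats) - 1)
    return np.where(cats[pos] == col, pos, -1).astype(np.int64)


def joint_codes(columns, categories):
    """Mixed-radix joint code per row; -1 if any component is unknown."""
    radices = [len(c) for c in categories]
    # Python ints are exact, so this check itself cannot overflow.
    if math.prod(radices) > INT64_MAX:
        raise OverflowError("joint code space exceeds int64")
    n_rows = len(columns[0])
    joint = np.zeros(n_rows, dtype=np.int64)
    unknown = np.zeros(n_rows, dtype=bool)
    for col, cats, n in zip(columns, categories, radices):
        c = component_codes(np.asarray(col), cats)
        unknown |= c == -1
        joint = joint * n + np.where(c == -1, 0, c)
    joint[unknown] = -1
    return joint


a = np.array(["x", "y", "x", "z"])
b = np.array([1, 2, 1, 2])
cats = fit_categories([a, b])
train_codes = joint_codes([a, b], cats)
# Reusing the fitted categories keeps codes stable at transform time.
new_codes = joint_codes([np.array(["x", "q"]), np.array([1, 1])], cats)
```

Since the joint code is only a relabeling of the same row grouping, grouping on it yields the same groups as grouping on value tuples, which is why the log can report unchanged output.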

{catstat-0.2.0 → catstat-0.3.0}/docs/known_issues.md RENAMED

@@ -16,10 +16,11 @@ exact). KI-010 (auto-smoothing parity) remains open.
 | KI-003 | — | ~~`multi_feature_mode="combination"` not implemented~~ | **Resolved 2026-06-26** (joint group-by). |
 | KI-004 | S3 | Ordered (CatBoost) / leave-one-out modes absent | P3 options. |
 | KI-005 | S3 | `set_output("polars")` not supported | pandas/numpy/`set_output("pandas")` work; polars in P3. |
-| KI-018 | S3 | GPU `combination` (tuple keys) forced to CPU | missing-as-value now works on GPU (validated 2026-06-26); only combination remains host-only. |
-| KI-019 | S3 | combination joint-key build is a Python loop | O(n) host loop; vectorize for large N. |
-| KI-020 | S2 | GPU not faster than CPU up to 1M rows (T4) | host↔device round-trip per OOF fold dominates. `auto` disabled; perf needs on-device keys/folds. `docs/verdicts/2026-06-26-gpu-crossover-verdict.md`. |
+| KI-018 | — | ~~GPU `combination` forced to CPU~~ | **Resolved 2026-06-27**: combination/interaction now run on GPU — `host_only = not all_gpu` + host-built **int64 joint codes** (KI-019) flow straight to the device group-by (`_gpu._to_nullable` skips the MISSING remap for non-object keys; a missing component is already folded into an integer code). **CPU/GPU `allclose` validated on Colab T4 (2026-06-27)**: combination mean/var, missing-component, and interactions all `transform`+`fit_transform` allclose (max\|Δ\| ≤ 3.8e-15, fit_transform 0.0), `backend_=gpu`. `docs/verdicts/2026-06-27-gpu-parity-report.md`. `backend='gpu'` still raises without RAPIDS (no silent fallback); `auto` stays CPU (KI-020 crossover unchanged — ~parity only at ≥5M). |
+| KI-019 | — | ~~combination joint-key build is a Python loop~~ | **Resolved 2026-06-27**: replaced by vectorized mixed-radix **int64 joint codes** (`((c0*n1+c1)*n2+c2)…`), learned once from full X and reused at fit/fold/transform; byte-identical (max\|Δ\|=0 at 200k–1M), combination transform ×3.7–4.4 / fit_transform ×1.5–2.4 vs the loop. **Supersedes PR #2** (which only built tuples faster). `docs/verdicts/2026-06-27-integer-joint-codes-verdict.md`. |
+| KI-020 | S2 | GPU reaches ~parity only at ≥5M rows (T4); `auto` stays off | Post-complement-subtraction (host): per-fold round-trip removed → crossover **0.67×@1M → 1.11×@5M, 1.06×@10M** (marginal + noisy: 1M was 0.67 vs 0.98 across runs). GPU scales sublinearly but the win is within noise; `auto` stays disabled, explicit gpu validated (allclose, mean ft now exact). `docs/verdicts/2026-06-26-gpu-crossover-postPRB-verdict.md`. |
 | KI-030 | S3 | Numeric TE (0.2.0): `Count`/`Frequency` don't bin; numpy-object & bool route to categorical | `numeric=` is `TargetEncoder`-only. Numeric auto-detection needs real numeric dtypes, so numpy-array input (all-object after `prepare_X`) and bool columns are treated as categorical/direct, not binned. Edges are computed once from full-train X (leakage-safe, ⊥ y). **GPU:** numeric keys are emitted as **strings** — the first Colab T4 run hit `MixedTypeError` (cuDF rejects object-dtype *integer* arrays) with int bin-ids/values; fixed by stringifying keys (matches the validated string-categorical path). CPU/GPU allclose **validated on T4 (2026-06-26)** for `numeric_auto`/`numeric_bin` (max\|Δ\| ~1e-17). |
+| KI-031 | S3 | Transform `map`→**gather done**; non-additive stats still re-fit per fold | **2026-06-27**: `_transform_array` now factorizes each unit's keys once (`index.get_indexer`) and **gathers** each column from a contiguous float64 array (`_UnitEncoding`), replacing per-column `pd.Series.map`: transform ×2.3–3.4 (multi-stat / high-card), single-stat neutral, outputs allclose, leakage + sklearn-compat PASS (`docs/verdicts/2026-06-27-transform-gather-verdict.md`). **Still open:** median/min/max/skew/custom re-fit per fold in the hybrid OOF slow path (now faster via the gather, but not on the single-pass kernel). **Follow-up:** ✅ integer **joint** codes (`c_a*n_b+c_b`) done: combination key-build vectorized (KI-019, 2026-06-27); ✅ GPU `combination` (KI-018) resolved as well (2026-06-27). See `docs/notes/2026-06-27-cuml-vs-sklearn-te-levers.md`. |
 ## Open risks to track (carry into implementation)
 | id | sev | risk | mitigation |

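The factorize-once gather described under KI-031 can be illustrated with a minimal sketch. This is not catstat's `_transform_array`: `fit_unit` and `transform_unit` are hypothetical names, and the unknown-key fallback (global mean, zero count) is an assumption made only for this example.

```python
import numpy as np
import pandas as pd


def fit_unit(keys, y):
    """Hypothetical helper: one canonical index plus a float64 stat table."""
    df = pd.DataFrame({"key": keys, "y": np.asarray(y, dtype=np.float64)})
    agg = df.groupby("key")["y"].agg(["mean", "count"])
    index = agg.index                         # unique keys, one row per category
    table = agg.to_numpy(dtype=np.float64)    # shape (n_categories, n_stats)
    fallback = np.array([df["y"].mean(), 0.0])  # assumed fallback for unknown keys
    return index, table, fallback


def transform_unit(index, table, fallback, keys):
    """One key lookup for the whole unit, then a plain numpy gather."""
    codes = index.get_indexer(pd.Index(keys))   # -1 marks keys unseen at fit
    unknown = codes < 0
    out = table[np.where(unknown, 0, codes)]    # fancy indexing returns a copy
    out[unknown] = fallback                     # overwrite rows for unknown keys
    return out


index, table, fallback = fit_unit(["a", "b", "a", "c"], [1.0, 0.0, 1.0, 1.0])
encoded = transform_unit(index, table, fallback, ["a", "z", "b"])
```

All statistics of a unit share the single `get_indexer` call, which is why the measured gain grows with the number of stats and is neutral for a single stat.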
catstat-0.3.0/docs/notes/2026-06-27-cuml-vs-sklearn-te-levers.md ADDED

@@ -0,0 +1,79 @@
+# Why is cuML's TargetEncoder faster than sklearn's — and which CPU levers is catstat still missing?
+- Date: 2026-06-27
+- Scope: a research scout (no code) for the perf arc. catstat **already** has the single-pass
+  composite-groupby + OOF-by-complement-subtraction algorithm on CPU (PR-B mean, PR-C var/std). The
+  question: what *else* makes cuML fast that ports to a pandas/numpy CPU path?
+- Sources read: cuML `python/cuml/cuml/preprocessing/TargetEncoder.py` (`_groupby_agg`,
+  `_fit_transform`, `_make_fold_column`); the RAPIDS "Target Encoding with cuML" blog; sklearn
+  `preprocessing/_target_encoder.py` + the Cython `_target_encoder_fast.pyx`
+  (`_fit_encoding_all_targets`), `utils/_encode.py`, `preprocessing/_encoders.py`.
+## Headline finding
+**cuML offers nothing to port algorithmically.** Its `_groupby_agg` is exactly the
+complement-subtraction trick catstat already has: `groupby([fold]+x_cols).agg` then
+`groupby(x_cols).agg` on the small per-fold table, then subtract. It represents categories as **raw
+object/string keys** (cuDF GPU hash join) — *no* integer codes. The RAPIDS ~100× is GPU parallelism
+over cuDF groupby/hash-join + the ~4× from doing all folds in parallel (which complement-subtraction
+already buys). So cuML ≠ a source of CPU levers.
+**The CPU levers come from scikit-learn**, whose `TargetEncoder` is the opposite design: it
+cross-fits **per fold** (n_folds passes — algorithmically *worse* than catstat) but makes each pass
+extremely cheap via an **integer-code representation**:
+- `OrdinalEncoder` integer-codes every column **once** at fit → a C-contiguous `int` matrix
+  `X_ordinal`. Hashing is paid once, upfront.
+- Cython `_fit_encoding_fast` does `sums[code] += y` / `counts[code] += 1` — an **O(1) array
+  scatter-add**, no hash lookup on the hot path; smoothing is folded into the accumulator init
+  (`sums[c] = smooth*y_mean`, `counts[c] = smooth`) so the m-estimate falls out of the final
+  division for free; buffers reused across features; `nogil`.
+- Transform is a **pure numpy gather**: `encoding[X_ordinal[rows, col]]` — no pandas, no `.map()`,
+  no merge — into a pre-allocated `X_out`.
+catstat's fast kernel **already** uses `pd.factorize` + `np.bincount` internally (so the *fit*
+accumulation is sklearn-equivalent). The unadopted half is the **transform path** and the **fitted
+representation**: `_transform_array` still maps via `pd.Series.map` on object keys (profiled at
+**52% `get_indexer`** of a multi-stat transform), and the slow per-fold loop (median/min/max/skew/
+custom) re-factorizes per stat.
+## Ranked CPU-applicable levers (for a path that already has complement-subtraction)
+| # | lever | where it fires | expected payoff | risk | sklearn? | cuML? |
+|---|-------|----------------|-----------------|------|----------|-------|
+| 1 ✅ | **factorize-once + numpy GATHER** (store encodings as `float64[code]`; transform = `enc[codes]`) — **shipped 2026-06-27** | transform (and fit lookups) | **measured ×2.3–3.4** (multi-stat / high-card; one `get_indexer` per *unit*, not per column; single-stat neutral) | Low–Med | yes | no |
+| 2 | **bincount over integer (joint) codes** for the remaining slow-path stats; `joint = code_a*n_b+code_b` | fit accumulation; **joint keys** | 5–30× at large N; integer joint codes also unblock **GPU combination (KI-018)** | Med | yes (scatter-add) | no |
+| 3 | **pre-allocated output, no merge at apply-back** | transform/apply | 5–15× | Low | yes | no |
+| 4 | **dtype discipline** — int32 codes, C-contiguous, float64 throughout | both, large N | 5–20% | Very low | yes | partial |
+| 5 | smoothing folded into accumulator init | fit smoothing | <5% (catstat's vectorized smoothing is already ~free) | trivial | yes | no |
+| 6 | column-level parallelism (joblib over independent units) | both | up to n_cores | Med | no | inherent |
+## Recommendation for catstat
+The single highest-leverage intervention is **#1 + #2 together**: a factorize-once, integer-code
+**gather** path. Concretely —
+- **`_transform_array`**: at fit, store each unit's encoding as a `float64` array indexed by the
+  unit's integer codes (keep the object→code mapping built from `categories_`); at transform,
+  `pd.factorize`/`searchsorted` the keys once and gather (`enc[codes]`, unknown = code −1 → fallback),
+  replacing `pd.Series.map`. This speeds **every** `transform`/inference call, not just `fit_transform`.
+- **joint codes**: `code_a * n_b + code_b` (int64) replaces object tuple keys for combination/
+  `interactions` — removing the per-row tuple build (KI-019's residue) and giving cuDF an integer
+  column to group on, which **unblocks GPU `combination`** (KI-018).
+What to measure (in-process before/after, the attributable method): the **transform** step alone at
+n ≥ 1M, single- and multi-column, vary the cardinality; expect 10–50× on that step. Watch the small-N
+break-even (pandas has low overhead; bincount/gather wins grow with N). Tracked as **KI-031**; this
+is the next CPU arc after PR-C, ahead of the GPU on-device port (KI-020).
+## Status — lever #1 shipped (2026-06-27)
+Lever #1 landed on `feat/perf-integer-code-gather` (stacked on `feat/perf-additive-var-std`):
+`_transform_array` factorizes each unit's keys once (`index.get_indexer`) and gathers each column from
+a `float64` array aligned to a per-unit canonical index (`_UnitEncoding`), replacing per-column
+`pd.Series.map`. Measured (in-process interleaved, n=1M, 7 reps): transform **×2.28** (4-stat),
+**×3.36** (4-stat high-card 50k), **×2.48** (combination), **×1.00** single-stat (no-unknown fast path
+= a single fancy index). The 10–50× estimate above assumed `get_indexer` could be eliminated entirely;
+in practice the gather still pays **one** `get_indexer` per unit to locate *arbitrary* transform keys,
+so the real win is "a unit's N stats share one hash" (≈2–3× at 4 stats) plus dropping the pandas
+`Series.map` overhead. Outputs allclose; leakage + sklearn-compat PASS;
+`docs/verdicts/2026-06-27-transform-gather-verdict.md`. **Lever #2 (integer joint codes → vectorized
+combination key-build, KI-019 + GPU combination KI-018) is the next, separate PR.**
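
The complement-subtraction idea from the headline finding can be sketched in pandas for the `mean` statistic. The function name `oof_mean_complement` and the fallback rule (the mean of `y` outside the fold when a key has no rows outside it) are assumptions for illustration; this is not code from catstat or cuML, and smoothing is omitted.

```python
import numpy as np
import pandas as pd


def oof_mean_complement(keys, y, fold):
    """Out-of-fold mean per row without looping over folds (illustrative)."""
    df = pd.DataFrame({"key": keys, "fold": fold,
                       "y": np.asarray(y, dtype=np.float64)})

    # Pass 1: sums and counts per (fold, key) over the full data.
    per_fold = (df.groupby(["fold", "key"], sort=False)["y"]
                  .agg(["sum", "count"]).reset_index())

    # Pass 2: per-key totals, computed on the small per-fold table.
    totals = per_fold.groupby("key", sort=False)[["sum", "count"]].transform("sum")

    # Complement = total minus this fold's own contribution.
    comp_sum = totals["sum"] - per_fold["sum"]
    comp_cnt = totals["count"] - per_fold["count"]

    # Assumed fallback: mean of y over all rows outside the fold.
    fold_stats = df.groupby("fold")["y"].agg(["sum", "count"])
    fold_global = (df["y"].sum() - fold_stats["sum"]) / (len(df) - fold_stats["count"])
    fallback = per_fold["fold"].map(fold_global)

    per_fold["oof"] = (comp_sum / comp_cnt).where(comp_cnt > 0, fallback)

    # A left merge keeps the original row order of df.
    merged = df.merge(per_fold[["fold", "key", "oof"]],
                      on=["fold", "key"], how="left")
    return merged["oof"].to_numpy()


oof = oof_mean_complement(
    keys=["a", "a", "b", "a", "b", "c"],
    y=[1, 0, 1, 1, 0, 1],
    fold=[0, 1, 0, 1, 1, 0],
)
```

Each row is encoded only from rows in other folds, so the result should match a per-fold loop while touching the full data once.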

{catstat-0.2.0 → catstat-0.3.0}/docs/roadmap.md RENAMED

@@ -42,9 +42,14 @@ verdict-backed), pending the maintainer's `v0.2.0` tag. Publishing is tag-driven
   missing-as-value** (cuDF nulls), transform + fit_transform. Two verdicts (parity + crossover).
 - ✅ **Crossover measured**: GPU is *slower* than CPU up to 1M rows (speedup 0.28–0.86) →
   `backend="auto"` GPU **disabled** (`_AUTO_GPU_ENABLED=False`); explicit `backend="gpu"` stays. KI-020.
-- ⏳ `combination` on GPU (tuple keys, host-only) + vectorize joint-key build (KI-018/019).
-- ⏳ **GPU perf**: keep keys/folds on-device to remove the per-fold host↔device round-trips that
-  dominate; then re-run the crossover and re-enable `auto` if it wins.
+- ✅ combination joint-key build **vectorized** to int64 mixed-radix codes (KI-019, 2026-06-27:
+  byte-identical, transform ×3.7–4.4 / fit_transform ×1.5–2.4 at 1M; supersedes PR #2). ✅ GPU
+  `combination`/`interactions` now run on GPU too (host-built int64 codes → device group-by;
+  `host_only = not all_gpu`); **CPU/GPU allclose validated on Colab T4** (KI-018 resolved).
+- **GPU perf** (re-measured 2026-06-26, T4): host complement-subtraction (PR-B) removed the per-fold
+  round-trip → crossover ~parity at ≥5M (0.67×@1M, 1.11×@5M, 1.06×@10M; marginal + noisy). `auto`
+  **stays off** (data doesn't justify it); a device-resident path is a niche lever.
+  `docs/verdicts/2026-06-26-gpu-crossover-postPRB-verdict.md`.
 ## Phase 3 — advanced — in progress (2026-06-26)
 - ✅ **Phase 3a**: `skew` (built-in) + **custom-callable aggregations** (`stats=[("q90", fn)]` or
@@ -77,8 +82,9 @@ verdict-backed), pending the maintainer's `v0.2.0` tag. Publishing is tag-driven
 - ✅ **Project hygiene (0.1.1)**: `CONTRIBUTING.md`, `SECURITY.md`, GitHub issue + PR templates.
 - ✅ **0.1.1 PUBLISHED (2026-06-26)**: `v0.1.1` tagged → release workflow built + published to PyPI
   via Trusted Publishing; `pip install catstat==0.1.1` verified in a clean venv; GitHub release created.
-- ⏳ **Maintainer-only:** enable GitHub Pages (Settings → Pages → GitHub Actions) so the Docs
-  workflow deploys the API site. (KI-020 GPU perf is the optional larger follow-up.)
+- ✅ **GitHub Pages enabled (2026-06-27)**: source = GitHub Actions; the Docs workflow now deploys
+  the API site to https://matapanino.github.io/catstat/ (the deploy step had been failing only
+  because Pages was off). (KI-020 GPU perf is the optional larger follow-up.)
 ## 0.2.0 — numeric-column target encoding — done ✅ (2026-06-26)
 - ✅ **Opt-in numeric TE** on `TargetEncoder` (`numeric="ignore"|"auto"|"direct"|"bin"` +
@@ -103,16 +109,42 @@ verdict-backed), pending the maintainer's `v0.2.0` tag. Publishing is tag-driven
   **M0 bootstrap (2026-06-26)**.
 - ✅ **Phase 2 (CPU + GPU validated)** 2026-06-26: var/std/median/min/max, combination mode,
   GPU backend `backends/_gpu.py` **validated CPU/GPU-allclose on a Colab T4**, CI, Colab loop, `git`.
+- **Perf arc (2026-06-26→27, profiling-driven).** ✅ CPU OOF is single-pass via complement
+  subtraction: `kfold_mean_oof_fast` replaced the per-fold group-by for pure-mean (2.2–3.4×;
+  `docs/verdicts/2026-06-26-pr-b-complement-subtraction-mean-verdict.md`). ✅ **var/std** now ride the
+  same kernel (shared complement moments; ddof=1 + complement-global fallback) with a **hybrid** gate
+  that keeps median/min/max/skew/custom on the slow loop — 2.7–2.8× on var & mean+var+std, ~1.5× mixed
+  (`docs/verdicts/2026-06-27-pr-c-additive-var-std-verdict.md`). The kernel also ports on-device to
+  remove per-fold host↔device round-trips (KI-020). ✅ **Transform gather** (2026-06-27): factorize-once
+  `index.get_indexer` + numpy gather replaced per-column `pd.Series.map` — transform ×2.3–3.4
+  (multi-stat / high-card), single-stat neutral (`docs/verdicts/2026-06-27-transform-gather-verdict.md`,
+  KI-031). ✅ **Integer joint codes** (2026-06-27): combination key-build replaced by mixed-radix
+  int64 codes — byte-identical, transform ×3.7–4.4 / fit_transform ×1.5–2.4 at 1M, closes KI-019
+  (supersedes PR #2). ✅ **GPU `combination`/`interactions`** unblocked (`host_only` drop + int64
+  codes to the device group-by); **CPU/GPU allclose validated on Colab T4** (KI-018 resolved).
+- ✅ **Interactions (2026-06-27)**: `interactions=[[...]]` → one joint TE column per group (additive
+  to `cols`; generalizes `combination`). `_units` plumbing + one param; OOF / naming / parity reuse
+  the unit machinery. `test_interactions.py`; sklearn-compat PASS. Branch `feat/interactions`.
 - **Phase 2: remaining.** GPU *performance* (on-device keys/folds; KI-020) is the one optional item
   left, gated behind a fresh crossover verdict before re-enabling `auto`; `combination` on GPU (KI-018) was resolved 2026-06-27.
 - **Phase 3.** quantile/skew/custom + ordered/LOO + `set_output("polars")` + PyPI release.
 ## "Next" pointer (update each session)
-> **Next task:** **0.2.0 — opt-in numeric-column target encoding is implemented & green** (branch
-> `feat/numeric-target-encoding`): direct / quantile-binned / auto-routed numeric TE, leakage-audited
-> (edges ⊥ y; binned OOF exact), sklearn-compat, empirically validated (CV R² 0.034 → 0.91), defaults
-> set by verdict, version bumped to **0.2.0** + CHANGELOG. **Remaining:** (1) maintainer tags
-> `v0.2.0` to publish via Trusted Publishing; (2) maintainer runs `scripts/colab_gpu_parity.sh` to
-> confirm CPU/GPU allclose on the new binned/direct cases (host-side numpy → expected); (3) optional:
-> numeric binning for `Count`/`Frequency` (KI-030). Still maintainer-only from 0.1.1: enable GitHub
-> Pages. Optional larger follow-up: KI-020 GPU on-device perf (needs a fresh Colab crossover verdict).
+> **Next task:** **Integer joint codes done (2026-06-27, lever #2A)** — combination key-build replaced
+> by vectorized mixed-radix int64 codes (learned once from full X, reused at fit/fold/transform);
+> byte-identical output, combination transform ×3.7–4.4 / fit_transform ×1.5–2.4 at 1M, closes KI-019
+> and supersedes PR #2; leakage + sklearn-compat PASS. The perf stack (#3 mean, #5 var/std, #7 gather,
+> joint codes) and **interactions** (`interactions: list[list[str]]`) are all now merged to main.
+> **Lever #2B — GPU `combination` DONE (2026-06-27, `feat/perf-gpu-combination`)**: `host_only = not
+> all_gpu` (combination/interaction no longer forced to CPU); host-built int64 joint codes flow to the
+> device group-by (`_gpu._to_nullable` non-object guard). **CPU/GPU allclose validated on Colab T4** —
+> combination mean/var, missing-component, interactions all pass (max\|Δ\| ≤ 3.8e-15, ft 0.0,
+> `backend_=gpu`); KI-018 resolved. **Perf arc complete** — CPU levers exhausted (cuML had nothing to
+> port; all came from sklearn's integer-code path). **Next:** **PR-D** GPU on-device kernel + a fresh
+> crossover before re-enabling `auto` — but the 2026-06-27 crossover re-confirms GPU only reaches
+> ~parity at ≥5M (0.93×@1M, 1.22×@5M, 1.07×@10M), so `auto` **stays off** and PR-D is a niche lever
+> (KI-020). **Maintainer carryover:** `0.2.0` is already on PyPI and **`0.3.0` is prepared** (version
+> bumped + CHANGELOG + build/twine/smoke verified, merged to main) — the maintainer tags to publish:
+> `git tag -a v0.3.0 -m "catstat 0.3.0" && git push origin v0.3.0` (Trusted Publishing fires on the
+> tag), then a GitHub release from the `[0.3.0]` notes. ✅ GitHub Pages enabled. **Next feature:**
+> `Count`/`Frequency` numeric binning (KI-030).
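
The mixed-radix joint codes mentioned above (`((c0*n1+c1)*n2+c2)…`) can be sketched as follows. The helper names `learn_categories` and `joint_codes` are hypothetical, and reserving one extra slot per column for missing or unseen values is an assumption of this example, not a description of catstat's internals.

```python
import numpy as np
import pandas as pd


def learn_categories(X, cols):
    """Learn each column's categories once, from the full training frame."""
    return {c: pd.Index(pd.unique(X[c].dropna())) for c in cols}


def joint_codes(X, cols, categories):
    """Combine per-column integer codes into one int64 mixed-radix code."""
    joint = np.zeros(len(X), dtype=np.int64)
    for c in cols:
        cats = categories[c]
        radix = len(cats) + 1                    # assumed extra slot: missing/unseen
        codes = cats.get_indexer(X[c])           # -1 for missing or unseen values
        codes = np.where(codes < 0, radix - 1, codes).astype(np.int64)
        joint = joint * radix + codes            # ((c0*n1 + c1)*n2 + c2)...
    return joint


train = pd.DataFrame({"a": ["x", "y", "x", None],
                      "b": ["p", "p", "q", "q"]})
cats = learn_categories(train, ["a", "b"])
train_codes = joint_codes(train, ["a", "b"], cats)

# The same learned categories are reused at transform time.
new = pd.DataFrame({"a": ["y"], "b": ["q"]})
new_codes = joint_codes(new, ["a", "b"], cats)
```

Two rows get the same joint code exactly when they agree on every component, so a group-by on the integer column is equivalent to a group-by on the tuple. The product of the radices must stay below 2**63 for the int64 code to remain collision-free.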
