PyPI - pystatistics - Versions diffs - 2.0.1__tar.gz → 2.2.0__tar.gz - Mend

pystatistics 2.0.1tar.gz → 2.2.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (544)

{pystatistics-2.0.1 → pystatistics-2.2.0}/CHANGELOG.md RENAMED

@@ -1,5 +1,282 @@
 # Changelog
+## 2.2.0
+- Fixed a `torch._C._LinAlgError` crash in `chi_square_mcar_batched_torch`
+  (`pystatistics/mvnmle/backends/_em_batched.py`) on GPU FP32. The batched
+  MoM fast path selected `cholesky_solve` when the SVD-based condition
+  number was below threshold, but Cholesky requires positive-definiteness,
+  which is strictly stronger than good conditioning. On GPU FP32, real
+  tabular data (e.g. `lacuna_tabular_110` applied to UCI/OpenML datasets)
+  can produce `sigma_oo` with good cond number but tiny negative eigenvalues
+  from roundoff, making Cholesky fail. Fix: wrap the fast path in
+  `try/except torch._C._LinAlgError`; on failure, fall back to
+  `torch.linalg.pinv` for the batch (honouring the `regularize` flag —
+  `regularize=False` still raises). Surfaced by dogfooding via Project
+  Lacuna on 3,080 (dataset × generator) pairs across the 110-generator
+  tabular registry; previously `mom_mcar_test` crashed on the first batch
+  containing breast_cancer / wine / credit_card_default.
+- Fixed an exception-type leak in `little_mcar_test`
+  (`pystatistics/mvnmle/mcar_test.py:~250`). The ML-estimation try/except
+  wrapped *every* exception — including `PyStatisticsError` subclasses
+  like `NumericalError` — as a bare `RuntimeError`, breaking the
+  documented `except PyStatisticsError:` pattern downstream and losing
+  the original exception chain. Fix: explicitly re-raise
+  `PyStatisticsError`, and use `raise ... from e` for anything else so
+  the chain is preserved. Surfaced by Project Lacuna's cache builder,
+  which catches `PyStatisticsError` to fall back to a sentinel entry
+  when Little's test is numerically unfit for a particular
+  (dataset, generator) pair; MLE failures were leaking past the catch
+  and killing the whole build.
+- Added ridge-fallback to the batched Cholesky sites inside the EM
+  E-step / log-likelihood
+  (`pystatistics/mvnmle/backends/_em_batched.py`):
+  `e_step_full_batched_np` (line ~361), `_e_step_full_torch` (~680), and
+  `_loglik_full_batched_torch` (~797). These compute per-pattern
+  Cholesky of sigma_oo sub-blocks; real tabular data can produce
+  individual sub-blocks that are numerically indefinite even when the
+  global sigma is PD (integer-encoded categoricals with heavy
+  collinearity in the intersection of a given missingness pattern's
+  observed variables). Fix: wrap each site in `try/except LinAlgError`
+  with a `ridge·I` retry (ridge = 1e-10 at pattern scale; statistically
+  invisible). Also removed a dead Cholesky call in `e_step_batched_np`
+  whose result was never used — it was only a crash liability.
+  `np.linalg.solve` at that same site now has a pinv fallback for
+  singular sub-blocks.
+- Added `regularize: bool = True` to `mlest`, `_solve_em`, and
+  `EMBackend.solve`, mirroring the existing convention on
+  `mom_mcar_test` / `little_mcar_test`. When True (new default),
+  `EMBackend._ensure_pd` applies a small diagonal ridge
+  (`max(0, 1e-10 - min_eig) + 1e-12`) to the M-step sigma whenever its
+  smallest eigenvalue falls below the PD threshold, rather than raising
+  `NumericalError` outright. The ridge is vanishingly small relative to
+  any real data scale — the typical case the old path rejected had
+  min_eig ≈ 1e-13 from pure FP64 roundoff — and a UserWarning makes the
+  event visible in logs. Call sites that need strict bit-for-bit
+  behaviour pass `regularize=False`. Motivated by Project Lacuna
+  dogfooding: applying real missingness generators to real UCI datasets
+  (credit_card_default × MNAR-NonLinSocial produced min_eig ≈ -0.66) was
+  hard-raising and killing the full cache build; the ridge fallback
+  keeps the test numerically well-defined with negligible statistical
+  impact, and the build proceeds.
+## 2.1.0
+- **`mom_mcar_test`: new method-of-moments MCAR test**
+  (``pystatistics/mvnmle/mcar_test.py``). A separate function — not
+  a new mode on ``little_mcar_test`` — because the method-of-moments
+  variant **is not Little's test**. Little (1988) specifically calls
+  for MLE plug-in estimators; swapping in pairwise-deletion sample
+  moments gives a statistic of the same shape with different
+  asymptotic properties, and calling it Little's test would be a
+  polite but concrete lie. The separate function preserves the
+  ``little_mcar_test`` contract exactly (matches R ``mvnmle`` bit-
+  for-bit as before) while giving users a documented fast alternative.
+  End-to-end timings at 15 % MCAR:
+  | dataset        | shape     | little_mcar_test | mom_mcar_test |
+  |----------------|-----------|------------------|---------------|
+  | iris           | 150 × 4   | 2.9 ms           | 0.31 ms       |
+  | wine           | 178 × 13  | 60.9 ms          | 2.17 ms       |
+  | breast_cancer  | 569 × 30  | 1491 ms          | 28.7 ms       |
+  For repeated-diagnostic workflows (e.g. an MCAR sweep over 3410
+  datasets), this is **1.6 minutes vs ~50 minutes** end-to-end. The
+  statistical trade-off is asymptotic efficiency: MoM is consistent
+  under the MCAR null but not asymptotically efficient, and the
+  finite-sample distribution deviates more from chi-square than
+  Little's does. The docstring spells out when to use which:
+  diagnostic screens → MoM; regulated submissions or anywhere the
+  exact asymptotic distribution matters → Little's.
+  Implementation details:
+    - ``_pairwise_deletion_moments``: O(n v^2) pairwise mean and
+      covariance via a single matmul. No per-column loop.
+    - ``chi_square_mcar_batched_np`` / ``_torch``: fully batched
+      chi-square assembly (batched SVD for conditioning,
+      well-conditioned patterns through batched solve, ill-conditioned
+      patterns through batched pinv as separate groups — no
+      per-pattern Python loop).
+    - ``backend`` parameter with same size-heuristic + visible-warning
+      discipline as the EM path. GPU is supported but does not
+      out-perform CPU on any tested shape — MoM's compute is small
+      enough that transfer + launch overhead loses to CPU numpy.
+      Auto-dispatch warns when this is the case.
+    - Honesty: ``MCARTestResult`` gained a ``method`` field so
+      downstream code knows which test produced a given result.
+      ``little_mcar_test`` reports ``"Little (MLE plug-in)"``;
+      ``mom_mcar_test`` reports ``"Method-of-moments
+      (pairwise-deletion plug-in)"``.
+  New tests (``tests/mvnmle/test_mom_mcar.py``, 10 tests):
+  name-honesty, MLE-vs-MoM agreement on MCAR data, correct rejection
+  on non-MCAR data, all-missing-row handling, speed guard of
+  ≥ 10× over MLE on breast_cancer.
+- **Fully-batched device-resident EM on GPU** (``_em_batched.py``
+  / ``_run_em_loop_gpu``). Pre-2.1.0 the "GPU EM" path set up a
+  torch device in the constructor but none of the per-iteration
+  work actually ran on-device — the numpy E-step ran for every
+  backend, which is why pre-2.1.0 benchmarks showed identical CPU
+  and GPU timings. This release implements the real thing: one
+  batched Cholesky + one batched solve over patterns for the
+  regression betas, one batched gather + bmm over all N
+  observations for the filled data, two dense gemms for the
+  sufficient-statistic accumulators, all on-device. SQUAREM also
+  runs fully on-device.
+  EM-only timings (without the MCAR-assembly wrapper):
+  | dataset        | shape     | CPU EM   | GPU EM   | speedup |
+  |----------------|-----------|----------|----------|---------|
+  | wine           | 178 × 13  | 38 ms    | 24 ms    | 1.6×    |
+  | breast_cancer  | 569 × 30  | 2142 ms  | 147 ms   | 14.6×   |
+  Small-data cases (apple, iris, missvals) lose on GPU because
+  transfer + launch overhead exceeds the per-iteration work.
+  Empirically calibrated heuristic: GPU is worth it when
+  ``n_obs * n_vars > 1500``.
+- **Size-heuristic dispatch with Rule-1 visibility** for both EM and
+  MoM backends. When ``backend='auto'`` makes a non-obvious choice
+  (e.g. picking CPU despite GPU availability because the data are
+  too small), a ``UserWarning`` explains the decision and tells
+  users how to override. When ``backend='gpu'`` is explicitly
+  requested on small data, the request is honored (user knows best)
+  but a warning notes that CPU would likely be faster. No silent
+  fallbacks anywhere. New tests pin these behaviours.
+- **Monotone-missingness closed-form MLE** (Anderson 1957; new
+  ``pystatistics.mvnmle._monotone``). When the missingness pattern
+  is monotone — when variables can be ordered such that each
+  observation's missing entries form a contiguous suffix — the MVN
+  MLE has a closed form via a chain of OLS regressions, with no
+  iteration. Common on longitudinal data with attrition, panel
+  surveys with dropout, and most sequentially-administered
+  instruments. New public helpers:
+    - ``pystatistics.mvnmle.is_monotone(data) -> bool``
+    - ``pystatistics.mvnmle.monotone_permutation(data) -> ndarray | None``
+    - ``pystatistics.mvnmle.mlest_monotone_closed_form(data) -> (mu, sigma, n)``
+    - ``mlest(data, algorithm='monotone')`` routes through the
+      closed-form; raises ``ValidationError`` if the data are not
+      monotone (Rule 1: no silent dispatch). Users who want
+      "use the closed form when applicable, fall back otherwise"
+      should call ``is_monotone`` first and branch explicitly.
+  The closed-form is the exact MLE (no tolerance-bounded
+  approximation), matches R ``mvnmle`` reference output on both
+  ``apple`` and ``missvals`` to machine precision, and is
+  dramatically faster than iterative algorithms at larger v
+  (a 1500 × 20 monotone dataset completes in ~2 ms vs EM's
+  ~40 ms). For non-monotone random MCAR data (the common case
+  in MCAR diagnostic use), detection is cheap (~O(v² n)) and
+  correctly returns False so iterative algorithms run.
+  New tests (``tests/mvnmle/test_monotone.py``, 12 tests):
+  detection true-positive / true-negative on several canonical
+  shapes; closed-form vs EM agreement; permutation invariance;
+  non-monotone data raises; performance guard at v=20.
+- **EM MLE: substantial real-data speedup via batched per-pattern
+  linear algebra + SQUAREM acceleration** (Project Lacuna-driven).
+  End-to-end ``little_mcar_test`` wall-clock at 15 % MCAR, seed 0:
+  | dataset        | shape     | 2.0.1    | 2.1.0    | speedup |
+  |----------------|-----------|----------|----------|---------|
+  | apple          | 18 × 2    |  1.9 ms  |  2.0 ms  |  flat   |
+  | missvals       | 13 × 5    | 19.9 ms  |  9.5 ms  |  2.1×   |
+  | iris           | 150 × 4   |  2.8 ms  |  2.8 ms  |  flat   |
+  | wine           | 178 × 13  | 79.4 ms  | 41.5 ms  |  1.9×   |
+  | breast_cancer  | 569 × 30  | 3278 ms  | 2089 ms  |  1.6×   |
+  For workloads that run MCAR repeatedly over many datasets
+  (e.g. a 3410-entry MCAR sweep), this is roughly a 1-hour reduction
+  per full pass at Lacuna's current scale.
+  Three changes stack:
+  1. **Batched per-pattern conditional parameters** (new
+     ``pystatistics.mvnmle.backends._em_batched``). The E-step used
+     to loop in Python over missingness patterns, issuing a scalar
+     Cholesky + triangular solve per pattern. It now stacks all P
+     pattern-sigma submatrices into a single
+     ``(P, v_max, v_max)`` tensor (identity-padded in the unused
+     slots so the Cholesky stays well-defined) and runs one batched
+     Cholesky + one batched solve for the whole iteration. The
+     accumulator loop over patterns remains in Python because
+     ``n_k`` varies and full observation-level padding hurt more
+     than it helped on the representative shapes we benchmarked.
+  2. **SQUAREM acceleration** (Varadhan & Roland 2008; new
+     ``pystatistics.mvnmle.backends._squarem``). EM's linear
+     convergence is sped up by a Steffensen-style extrapolation of
+     three consecutive EM iterates, safeguarded by a monotonicity
+     check on the observed-data log-likelihood. Typical effect on
+     well-behaved EM problems: 2–4× reduction in underlying
+     EM-step-equivalents. Preserves the MLE — the convergence
+     point is unchanged, only the path is shorter. On by default
+     via a new ``accelerate=True`` kwarg on ``EMBackend.solve``;
+     pass ``accelerate=False`` for the plain-EM reference path.
+  3. **Fully batched observed-data log-likelihood**
+     (``compute_loglik_batched_np``). The SQUAREM monotonicity
+     safeguard calls the log-likelihood often, so that path
+     needed to be cheap. The implementation now does one batched
+     Cholesky over all patterns for log-determinants and one
+     batched solve across all N observations for the quadratic-
+     form contribution — no per-pattern Python loop.
+- **Benchmark harness** (``benchmarks/mvnmle_bench.py``). Runs the
+  five reference shapes (apple, missvals, iris, wine,
+  breast_cancer) across the (algorithm, backend) matrix and
+  prints wall-clock / iteration counts per case. Use
+  ``--quick`` to skip the BFGS cases that don't converge on
+  high-$v$ data; ``--tag`` labels a run for diff against prior
+  baselines.
+- **Documented why direct-BFGS is not always the right default.**
+  Internal notes and the 2.0.0 / 2.0.1 release narrative already
+  covered why ``algorithm='em'`` became the ``little_mcar_test``
+  default; this release adds the story of why batching helps EM
+  significantly but does *not* rescue direct-BFGS on realistic
+  high-$v$ data (layer-3 Hessian conditioning is parameterization-
+  invariant; batching only addresses layer-1 launch overhead).
+  See ``GPU_BACKEND_CONVENTION.md`` Section 0 for the "when to
+  add a GPU backend and when not" rule that drove the 2.0.1
+  cleanup; this release extends that logic with
+  "accelerating the algorithm by reducing iteration count
+  (SQUAREM) is cheaper than accelerating each iteration."
+- **Finding: the "GPU EM" backend was never actually running on
+  GPU.** The ``device='cuda'`` / ``'mps'`` constructor flag set
+  up ``self._torch`` but none of ``_e_step`` / ``_m_step`` /
+  ``_compute_loglik`` used it — the numpy path ran for every
+  backend. That's why pre-2.1.0 benchmarks showed identical
+  CPU and GPU EM timings. We attempted to implement a real
+  device-resident EM loop and found it was slower than CPU for
+  all the shapes we care about (per-pattern kernel-launch
+  overhead dominates the small per-pattern matrix work). The
+  honest answer for now is that GPU EM stays CPU-equivalent
+  by design; a future release may revisit with fully N-parallel
+  observation-level batching if a workload appears where the
+  GPU can actually win. This behaviour is unchanged from prior
+  releases — we're just documenting what was already true.
+- **SQUAREM test coverage** (new ``tests/mvnmle/test_squarem.py``).
+  Four tests pinning the invariants: same MLE as plain EM on
+  apple; substantially fewer EM-equivalent steps on missvals;
+  monotonicity of log-likelihood preserved across iteration
+  caps; same MLE as plain EM on a realistic shape (sklearn
+  wine with 15 % MCAR).
 ## 2.0.1
 - **GPU Backend Convention: codified when NOT to add a GPU backend**
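
Note on the 2.2.0 `little_mcar_test` entry in the hunk above: the fix it describes (re-raise library errors unchanged, chain everything else with `raise ... from e`) is a general Python pattern. The following is a minimal sketch of that pattern, not the library's actual code. `PyStatisticsError` and `NumericalError` are redefined locally as stand-ins so the block runs on its own, and `estimate`, `run_test`, and `cached_entry` are hypothetical names introduced only for illustration.

```python
class PyStatisticsError(Exception):
    """Local stand-in for the library's base exception (assumption)."""


class NumericalError(PyStatisticsError):
    """Local stand-in for a numerical-failure subclass (assumption)."""


def estimate(fail_with=None):
    """Hypothetical estimator: raises the given exception, else succeeds."""
    if fail_with is not None:
        raise fail_with
    return "ok"


def run_test(fail_with=None):
    """Hypothetical wrapper showing the exception handling described above."""
    try:
        return estimate(fail_with)
    except PyStatisticsError:
        # Library errors pass through with their original type intact,
        # so a downstream `except PyStatisticsError:` still matches.
        raise
    except Exception as e:
        # Anything else is wrapped, but `from e` keeps the original
        # exception reachable via __cause__ and in the traceback.
        raise RuntimeError(f"ML estimation failed: {e}") from e


def cached_entry(fail_with=None):
    """Hypothetical caller that falls back to a sentinel on library errors."""
    try:
        return run_test(fail_with)
    except PyStatisticsError:
        return "sentinel"
```

With the earlier behaviour (every exception rewrapped as a bare `RuntimeError`), the `except PyStatisticsError:` branch in a caller like `cached_entry` would never fire, which is the failure mode the changelog entry reports.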

{pystatistics-2.0.1 → pystatistics-2.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pystatistics
-Version: 2.0.1
+Version: 2.2.0
 Summary: GPU-accelerated statistical computing for Python
 Project-URL: Homepage, https://sgcx.org/technology/pystatistics/
 Project-URL: Documentation, https://sgcx.org/docs/pystatistics/
@@ -51,6 +51,176 @@ GPU-accelerated statistical computing for Python.
 ## What's New
+### 2.2.0 — Real-data robustness from Project Lacuna dogfooding
+Continuation of the 2.1.0 dogfooding track. Running `little_mcar_test`
+and `mom_mcar_test` on 3,080 (dataset × generator) pairs drawn from 28
+real UCI / OpenML / sklearn tabular datasets under
+`lacuna_tabular_110` missingness generators surfaced four classes of
+numerical failure that synthetic unit tests did not exhibit. All fixed
+in this release; no API breaks.
+**Batched MoM GPU Cholesky crash.** `chi_square_mcar_batched_torch`'s
+fast path selected `cholesky_solve` whenever the SVD-based condition
+number check passed, on the assumption that good conditioning implies
+positive-definiteness. On GPU FP32, roundoff can produce covariances
+that pass the cond-number check but have tiny negative eigenvalues —
+Cholesky fails, the call raises `torch._C._LinAlgError`. The fast path
+is now wrapped with `try/except` and falls back to pseudo-inverse when
+the ``regularize`` flag allows. Surfaced on `credit_card_default` ×
+`MNAR-NonLinSocial` during the Lacuna cache build.
+**Exception-type preservation in `little_mcar_test`.** The
+ML-estimation `try/except` at the top of `little_mcar_test` wrapped
+*every* exception as a bare `RuntimeError` — including
+`PyStatisticsError` subclasses. This broke the documented
+`except PyStatisticsError:` catch pattern downstream: users falling
+back to a sentinel on MLE failure saw their handler bypassed and the
+full build crash. Fix: explicitly re-raise `PyStatisticsError`, and
+use `raise ... from e` for anything else so the chain is preserved.
+**`regularize=True` default on the EM path (opt-out).** `mlest`,
+`_solve_em`, and `EMBackend.solve` gain `regularize: bool = True`,
+mirroring the existing convention on `mom_mcar_test` and
+`little_mcar_test`. When True, `EMBackend._ensure_pd` applies a small
+diagonal ridge — `max(0, 1e-10 − min_eig) + 1e-12` — to the M-step
+sigma whenever its smallest eigenvalue falls below the PD threshold,
+rather than raising `NumericalError`. The ridge is well below any
+statistical precision on real data — the typical case the old path
+rejected had `min_eig ≈ 1e-13` from pure FP64 roundoff on a matrix
+that's theoretically PSD. Dogfooding surfaced cases where `min_eig`
+hit `−0.66` on realistic MNAR mechanisms; the ridge fallback keeps EM
+progressing. Callers needing strict bit-for-bit behaviour pass
+`regularize=False` to restore the old raise.
+**Three additional Cholesky ridge-fallbacks** in `_em_batched.py`:
+`e_step_full_batched_np`, `_e_step_full_torch`, and
+`_loglik_full_batched_torch` all compute per-pattern Cholesky of
+`sigma_oo` sub-blocks. Real tabular data can produce individual
+sub-blocks that are numerically indefinite even when the global sigma
+is PD (integer-encoded categoricals with heavy collinearity in the
+intersection of a given missingness pattern's observed variables).
+Each site now wraps Cholesky in `try/except LinAlgError` with a
+`ridge·I` retry (ridge = 1e-10 at pattern scale; statistically
+invisible). Also removed a dead Cholesky call in `e_step_batched_np`
+whose factor was never used downstream — pure crash liability — and
+added a `pinv` fallback to the `np.linalg.solve` at the same site for
+singular sub-blocks.
+**Impact.** The Project Lacuna cache build on 3,080 (dataset ×
+generator) pairs went from crashing on the first batch containing
+`breast_cancer` or `credit_card_default` (pre-2.2.0) to completing in
+a single pass at 0.9% MoM sentinel rate and 16.4% MLE sentinel rate
+(the MLE sentinels are legitimate EM non-convergence on 1000-pattern
+datasets — not crashes). Synthetic unit tests: 125/125 mvnmle pass.
+**No API breaks.** New defaults (`regularize=True`) are strictly more
+permissive than the old raises — any caller that was crashing before
+will now proceed with a small `UserWarning`. Callers needing strict
+behaviour pass `regularize=False`.
+### 2.1.0 — Real-data EM speedup + monotone closed-form MLE
+Dogfooding via Project Lacuna surfaced that ``little_mcar_test`` on
+realistic tabular data (sklearn's iris / wine / breast_cancer with
+random MCAR injection) was bottlenecked by EM: the E-step was a
+Python loop over missingness patterns, and each SQUAREM-style
+safeguard pass re-ran a per-pattern log-likelihood. This release
+batches both and adds Varadhan & Roland's SQUAREM acceleration.
+End-to-end ``little_mcar_test`` wall-clock at 15 % MCAR, seed 0:
+| dataset        | shape     | 2.0.1    | 2.1.0    | speedup |
+|----------------|-----------|----------|----------|---------|
+| missvals       | 13 × 5    | 19.9 ms  |  9.5 ms  |  2.1×   |
+| wine           | 178 × 13  | 79.4 ms  | 41.5 ms  |  1.9×   |
+| breast_cancer  | 569 × 30  | 3278 ms  | 2089 ms  |  1.6×   |
+For repeated-diagnostic workflows (e.g. an MCAR sweep over several
+thousand datasets), this turns a 3-hour run into a 2-hour run.
+Three stacked improvements, all preserving bit-equivalence on the R
+mvnmle reference cases (apple, missvals):
+- **Batched per-pattern conditional parameters.** The E-step's
+  per-pattern Cholesky + triangular solve now runs as a single
+  batched kernel pair across all missingness patterns. The
+  unused padding slots are identity-filled so the Cholesky stays
+  well-defined.
+- **SQUAREM acceleration on top of EM.** Three EM steps + one
+  Steffensen-style extrapolation, safeguarded by a monotonicity
+  check on the observed-data log-likelihood. Typical effect:
+  2–4× fewer EM-step equivalents to convergence. Convergence
+  point is the same MLE — only the path is shorter. On by
+  default; ``EMBackend.solve(..., accelerate=False)`` recovers
+  the plain-EM reference.
+- **Fully batched log-likelihood.** The SQUAREM monotonicity
+  check calls ``loglik`` often, so it was batched too — one
+  Cholesky over all patterns, one solve across all N
+  observations, no per-pattern Python loop.
+**`mom_mcar_test`: fast method-of-moments MCAR test.** A new *separate
+function* (not a mode on ``little_mcar_test``, because the MoM variant
+is not Little's test) that uses pairwise-deletion sample moments
+instead of MLE plug-in. The test is consistent under MCAR but not
+asymptotically efficient, trading a small amount of statistical
+efficiency for dramatic speed. At 15 % MCAR on sklearn demos:
+| dataset        | shape     | little_mcar_test | mom_mcar_test |
+|----------------|-----------|------------------|---------------|
+| iris           | 150 × 4   | 2.9 ms           | 0.31 ms       |
+| wine           | 178 × 13  | 60.9 ms          | 2.17 ms       |
+| breast_cancer  | 569 × 30  | 1491 ms          | 28.7 ms       |
+For a 3410-dataset MCAR sweep: **~50 minutes → ~1.6 minutes**. Use
+``little_mcar_test`` when you need Little 1988's asymptotic
+distribution exactly (regulated submissions, citing R reference);
+use ``mom_mcar_test`` for high-throughput diagnostic screens. The
+``MCARTestResult.method`` field records which test produced a given
+result so downstream code can disambiguate without tracking the
+calling function.
+**Fully-batched device-resident EM on GPU.** Pre-2.1.0 the
+``device='cuda'`` EM path set up a torch device but never used it —
+numpy ran for every backend. This release implements a real
+device-resident loop with fully batched E-step / M-step / log-
+likelihood, SQUAREM acceleration on top, all on device. On breast-
+cancer-scale (569 × 30) EM drops from 2142 ms CPU to 147 ms GPU
+(14.6×). Small data remains CPU-faster; an empirical size heuristic
+(``n * v >= 1500``) with visible dispatch warnings keeps this
+correct in user-facing behaviour.
+**Monotone-missingness closed-form MLE** (Anderson 1957). Longitudinal
+cohorts with attrition, panel surveys with dropout, and most
+sequentially-administered instruments produce *monotone* missingness
+— the variables can be ordered such that each observation's missing
+entries form a contiguous suffix. When the pattern is monotone, the
+MVN MLE has a closed form via a chain of OLS regressions, with no
+iteration. New helpers: ``mvnmle.is_monotone(data)``,
+``mvnmle.monotone_permutation(data)``, and
+``mlest(data, algorithm='monotone')``. The closed-form matches R
+``mvnmle`` bit-for-bit on canonical datasets and is orders of
+magnitude faster than EM on larger-v longitudinal data. Per Rule 1
+the algorithm raises on non-monotone input rather than silently
+falling back — call ``is_monotone`` first if you want conditional
+dispatch.
+Also in this release:
+- **Benchmark harness** under ``benchmarks/mvnmle_bench.py`` for
+  tracking wall-clock and iteration counts across the reference
+  shapes; use the ``--tag`` flag to label a baseline for diff
+  against future changes.
+- **Documented finding**: the ``device='cuda'`` EM path was never
+  actually running on the GPU prior to this release — it stored
+  a torch device but never used it. We tried to wire up a real
+  device-resident loop and found GPU is slower than CPU for all
+  shapes we tested (per-pattern launch overhead still dominates
+  the tiny per-pattern matrix work). GPU EM therefore remains
+  CPU-equivalent by design; a future release will revisit if a
+  workload appears where full observation-level batching makes
+  GPU actually win.
 ### 2.0.1 — GPU-backend exposure gaps and a convention rule
 Two public functions had GPU-capable inner calls but no `backend=`
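
Note on the `regularize=True` entry in the hunk above: the quoted ridge expression, `max(0, 1e-10 - min_eig) + 1e-12`, can be worked through numerically. The sketch below only evaluates that expression for the two `min_eig` values the text mentions. `pd_ridge` and its parameter names `floor` and `margin` are hypothetical, and the sketch does not reproduce the threshold check that decides when `EMBackend._ensure_pd` applies the ridge at all.

```python
def pd_ridge(min_eig, floor=1e-10, margin=1e-12):
    """Evaluate max(0, floor - min_eig) + margin (formula quoted from the text)."""
    return max(0.0, floor - min_eig) + margin


# The two smallest-eigenvalue cases quoted in the release notes.
for min_eig in (1e-13, -0.66):
    ridge = pd_ridge(min_eig)
    # Adding ridge * I to a symmetric matrix shifts every eigenvalue by ridge,
    # so the smallest eigenvalue after the shift is min_eig + ridge.
    shifted = min_eig + ridge
    print(f"min_eig={min_eig:.3g}  ridge={ridge:.6g}  shifted_min_eig={shifted:.3g}")
```

Whenever `min_eig` is below `1e-10`, the expression lifts the smallest eigenvalue to about `1e-10 + 1e-12`, so the ridge equals the size of the deficit plus a small margin: roughly `1e-10` for the roundoff case and roughly `0.66` for the `min_eig ≈ -0.66` case.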

{pystatistics-2.0.1 → pystatistics-2.2.0}/README.md RENAMED

@@ -4,6 +4,176 @@ GPU-accelerated statistical computing for Python.
 ## What's New
+### 2.2.0 — Real-data robustness from Project Lacuna dogfooding
+Continuation of the 2.1.0 dogfooding track. Running `little_mcar_test`
+and `mom_mcar_test` on 3,080 (dataset × generator) pairs drawn from 28
+real UCI / OpenML / sklearn tabular datasets under
+`lacuna_tabular_110` missingness generators surfaced four classes of
+numerical failure that synthetic unit tests did not exhibit. All fixed
+in this release; no API breaks.
+**Batched MoM GPU Cholesky crash.** `chi_square_mcar_batched_torch`'s
+fast path selected `cholesky_solve` whenever the SVD-based condition
+number check passed, on the assumption that good conditioning implies
+positive-definiteness. On GPU FP32, roundoff can produce covariances
+that pass the cond-number check but have tiny negative eigenvalues —
+Cholesky fails, the call raises `torch._C._LinAlgError`. The fast path
+is now wrapped with `try/except` and falls back to pseudo-inverse when
+the ``regularize`` flag allows. Surfaced on `credit_card_default` ×
+`MNAR-NonLinSocial` during the Lacuna cache build.
+**Exception-type preservation in `little_mcar_test`.** The
+ML-estimation `try/except` at the top of `little_mcar_test` wrapped
+*every* exception as a bare `RuntimeError` — including
+`PyStatisticsError` subclasses. This broke the documented
+`except PyStatisticsError:` catch pattern downstream: users falling
+back to a sentinel on MLE failure saw their handler bypassed and the
+full build crash. Fix: explicitly re-raise `PyStatisticsError`, and
+use `raise ... from e` for anything else so the chain is preserved.
+**`regularize=True` default on the EM path (opt-out).** `mlest`,
+`_solve_em`, and `EMBackend.solve` gain `regularize: bool = True`,
+mirroring the existing convention on `mom_mcar_test` and
+`little_mcar_test`. When True, `EMBackend._ensure_pd` applies a small
+diagonal ridge — `max(0, 1e-10 − min_eig) + 1e-12` — to the M-step
+sigma whenever its smallest eigenvalue falls below the PD threshold,
+rather than raising `NumericalError`. The ridge is well below any
+statistical precision on real data — the typical case the old path
+rejected had `min_eig ≈ 1e-13` from pure FP64 roundoff on a matrix
+that's theoretically PSD. Dogfooding surfaced cases where `min_eig`
+hit `−0.66` on realistic MNAR mechanisms; the ridge fallback keeps EM
+progressing. Callers needing strict bit-for-bit behaviour pass
+`regularize=False` to restore the old raise.
+**Three additional Cholesky ridge-fallbacks** in `_em_batched.py`:
+`e_step_full_batched_np`, `_e_step_full_torch`, and
+`_loglik_full_batched_torch` all compute per-pattern Cholesky of
+`sigma_oo` sub-blocks. Real tabular data can produce individual
+sub-blocks that are numerically indefinite even when the global sigma
+is PD (integer-encoded categoricals with heavy collinearity in the
+intersection of a given missingness pattern's observed variables).
+Each site now wraps Cholesky in `try/except LinAlgError` with a
+`ridge·I` retry (ridge = 1e-10 at pattern scale; statistically
+invisible). Also removed a dead Cholesky call in `e_step_batched_np`
+whose factor was never used downstream — pure crash liability — and
+added a `pinv` fallback to the `np.linalg.solve` at the same site for
+singular sub-blocks.
+**Impact.** The Project Lacuna cache build on 3,080 (dataset ×
+generator) pairs went from crashing on the first batch containing
+`breast_cancer` or `credit_card_default` (pre-2.2.0) to completing in
+a single pass at 0.9% MoM sentinel rate and 16.4% MLE sentinel rate
+(the MLE sentinels are legitimate EM non-convergence on 1000-pattern
+datasets — not crashes). Synthetic unit tests: 125/125 mvnmle pass.
+**No API breaks.** New defaults (`regularize=True`) are strictly more
+permissive than the old raises — any caller that was crashing before
+will now proceed with a small `UserWarning`. Callers needing strict
+behaviour pass `regularize=False`.
+### 2.1.0 — Real-data EM speedup + monotone closed-form MLE
+Dogfooding via Project Lacuna surfaced that ``little_mcar_test`` on
+realistic tabular data (sklearn's iris / wine / breast_cancer with
+random MCAR injection) was bottlenecked by EM: the E-step was a
+Python loop over missingness patterns, and each SQUAREM-style
+safeguard pass re-ran a per-pattern log-likelihood. This release
+batches both and adds Varadhan & Roland's SQUAREM acceleration.
+End-to-end ``little_mcar_test`` wall-clock at 15 % MCAR, seed 0:
+| dataset        | shape     | 2.0.1    | 2.1.0    | speedup |
+|----------------|-----------|----------|----------|---------|
+| missvals       | 13 × 5    | 19.9 ms  |  9.5 ms  |  2.1×   |
+| wine           | 178 × 13  | 79.4 ms  | 41.5 ms  |  1.9×   |
+| breast_cancer  | 569 × 30  | 3278 ms  | 2089 ms  |  1.6×   |
+For repeated-diagnostic workflows (e.g. an MCAR sweep over several
+thousand datasets), this turns a 3-hour run into a 2-hour run.
+Three stacked improvements, all preserving bit-equivalence on the R
+mvnmle reference cases (apple, missvals):
+- **Batched per-pattern conditional parameters.** The E-step's
+  per-pattern Cholesky + triangular solve now runs as a single
+  batched kernel pair across all missingness patterns. The
+  unused padding slots are identity-filled so the Cholesky stays
+  well-defined.
+- **SQUAREM acceleration on top of EM.** Three EM steps + one
+  Steffensen-style extrapolation, safeguarded by a monotonicity
+  check on the observed-data log-likelihood. Typical effect:
+  2–4× fewer EM-step equivalents to convergence. Convergence
+  point is the same MLE — only the path is shorter. On by
+  default; ``EMBackend.solve(..., accelerate=False)`` recovers
+  the plain-EM reference.
+- **Fully batched log-likelihood.** The SQUAREM monotonicity
+  check calls ``loglik`` often, so it was batched too — one
+  Cholesky over all patterns, one solve across all N
+  observations, no per-pattern Python loop.
+**`mom_mcar_test`: fast method-of-moments MCAR test.** A new *separate
+function* (not a mode on ``little_mcar_test``, because the MoM variant
+is not Little's test) that uses pairwise-deletion sample moments
+instead of MLE plug-in. The test is consistent under MCAR but not
+asymptotically efficient, trading a small amount of statistical
+efficiency for dramatic speed. At 15 % MCAR on sklearn demos:
+| dataset        | shape     | little_mcar_test | mom_mcar_test |
+|----------------|-----------|------------------|---------------|
+| iris           | 150 × 4   | 2.9 ms           | 0.31 ms       |
+| wine           | 178 × 13  | 60.9 ms          | 2.17 ms       |
+| breast_cancer  | 569 × 30  | 1491 ms          | 28.7 ms       |
+For a 3410-dataset MCAR sweep: **~50 minutes → ~1.6 minutes**. Use
+``little_mcar_test`` when you need Little 1988's asymptotic
+distribution exactly (regulated submissions, citing R reference);
+use ``mom_mcar_test`` for high-throughput diagnostic screens. The
+``MCARTestResult.method`` field records which test produced a given
+result so downstream code can disambiguate without tracking the
+calling function.
+**Fully-batched device-resident EM on GPU.** Pre-2.1.0 the
+``device='cuda'`` EM path set up a torch device but never used it —
+numpy ran for every backend. This release implements a real
+device-resident loop with fully batched E-step / M-step / log-
+likelihood, SQUAREM acceleration on top, all on device. On breast-
+cancer-scale (569 × 30) EM drops from 2142 ms CPU to 147 ms GPU
+(14.6×). Small data remains CPU-faster; an empirical size heuristic
+(``n * v >= 1500``) with visible dispatch warnings keeps this
+correct in user-facing behaviour.
+**Monotone-missingness closed-form MLE** (Anderson 1957). Longitudinal
+cohorts with attrition, panel surveys with dropout, and most
+sequentially-administered instruments produce *monotone* missingness
+— the variables can be ordered such that each observation's missing
+entries form a contiguous suffix. When the pattern is monotone, the
+MVN MLE has a closed form via a chain of OLS regressions, with no
+iteration. New helpers: ``mvnmle.is_monotone(data)``,
+``mvnmle.monotone_permutation(data)``, and
+``mlest(data, algorithm='monotone')``. The closed-form matches R
+``mvnmle`` bit-for-bit on canonical datasets and is orders of
+magnitude faster than EM on larger-v longitudinal data. Per Rule 1
+the algorithm raises on non-monotone input rather than silently
+falling back — call ``is_monotone`` first if you want conditional
+dispatch.
+Also in this release:
+- **Benchmark harness** under ``benchmarks/mvnmle_bench.py`` for
+  tracking wall-clock and iteration counts across the reference
+  shapes; use the ``--tag`` flag to label a baseline for diff
+  against future changes.
+- **Documented finding**: the ``device='cuda'`` EM path was never
+  actually running on the GPU prior to this release — it stored
+  a torch device but never used it. We tried to wire up a real
+  device-resident loop and found GPU is slower than CPU for all
+  shapes we tested (per-pattern launch overhead still dominates
+  the tiny per-pattern matrix work). GPU EM therefore remains
+  CPU-equivalent by design; a future release will revisit if a
+  workload appears where full observation-level batching makes
+  GPU actually win.
 ### 2.0.1 — GPU-backend exposure gaps and a convention rule
 Two public functions had GPU-capable inner calls but no `backend=`
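
Note on the monotone-missingness entry in the hunk above: the definition it gives (variables can be ordered so that each observation's missing entries form a contiguous suffix) can be checked with a short routine. This is an illustrative sketch, not the implementation behind `is_monotone` or `monotone_permutation`. The function name `monotone_order`, the list-of-rows input, and the use of `None` for missing values are all assumptions made for the example.

```python
def monotone_order(rows):
    """Return a column order under which every row's missing entries form
    a contiguous suffix, or None if no such order exists.

    Missing values are represented by None (an assumption of this sketch).
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    # Count missing entries per column.
    missing_count = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    # In a monotone pattern the per-column sets of missing rows are nested,
    # so sorting columns by missing count yields a valid order if one exists.
    order = sorted(range(n_cols), key=lambda j: missing_count[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one: not a suffix.
                return None
    return order


dropout = [
    [1.0, None, 3.0],   # second variable missing
    [1.0, None, None],  # second and third variables missing
    [1.0, 2.0, 3.0],    # fully observed
]
scattered = [
    [None, 2.0],
    [1.0, None],
]
print(monotone_order(dropout))
print(monotone_order(scattered))
```

A caller wanting the "closed form when applicable, iterative otherwise" behaviour would branch on a check like this before choosing an algorithm, which matches the explicit-dispatch advice in the entry.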
