betula-cluster 0.1.0__tar.gz

Files changed (98)
  1. betula_cluster-0.1.0/.cargo/config.toml +18 -0
  2. betula_cluster-0.1.0/.github/workflows/ci.yml +109 -0
  3. betula_cluster-0.1.0/.github/workflows/release.yml +77 -0
  4. betula_cluster-0.1.0/.gitignore +19 -0
  5. betula_cluster-0.1.0/CHANGELOG.md +112 -0
  6. betula_cluster-0.1.0/Cargo.lock +383 -0
  7. betula_cluster-0.1.0/Cargo.toml +40 -0
  8. betula_cluster-0.1.0/DESIGN.md +279 -0
  9. betula_cluster-0.1.0/LICENSE-MIT +21 -0
  10. betula_cluster-0.1.0/PKG-INFO +186 -0
  11. betula_cluster-0.1.0/README.md +157 -0
  12. betula_cluster-0.1.0/bench/RESULTS.md +187 -0
  13. betula_cluster-0.1.0/bench/_worker.py +219 -0
  14. betula_cluster-0.1.0/bench/benchmark.py +256 -0
  15. betula_cluster-0.1.0/bench/build_vs_betulars.py +85 -0
  16. betula_cluster-0.1.0/bench/comprehensive.py +472 -0
  17. betula_cluster-0.1.0/bench/cosine_spike.py +148 -0
  18. betula_cluster-0.1.0/bench/plots/memory_streaming.png +0 -0
  19. betula_cluster-0.1.0/bench/plots/quality_ari.png +0 -0
  20. betula_cluster-0.1.0/bench/plots/quality_real_ari.png +0 -0
  21. betula_cluster-0.1.0/bench/plots/scaling_time.png +0 -0
  22. betula_cluster-0.1.0/bench/results_memory.csv +11 -0
  23. betula_cluster-0.1.0/bench/results_quality.csv +67 -0
  24. betula_cluster-0.1.0/bench/results_real.csv +34 -0
  25. betula_cluster-0.1.0/bench/results_real_normalize.csv +10 -0
  26. betula_cluster-0.1.0/bench/results_real_scale.csv +3 -0
  27. betula_cluster-0.1.0/bench/results_scaling.csv +45 -0
  28. betula_cluster-0.1.0/docs/FEATURES.md +104 -0
  29. betula_cluster-0.1.0/docs/MATH.md +118 -0
  30. betula_cluster-0.1.0/docs/USAGE.md +213 -0
  31. betula_cluster-0.1.0/examples/01_quickstart.ipynb +335 -0
  32. betula_cluster-0.1.0/examples/01_quickstart.py +122 -0
  33. betula_cluster-0.1.0/examples/02_embeddings_and_inspection.ipynb +355 -0
  34. betula_cluster-0.1.0/examples/02_embeddings_and_inspection.py +139 -0
  35. betula_cluster-0.1.0/examples/03_streaming_and_persistence.ipynb +212 -0
  36. betula_cluster-0.1.0/examples/03_streaming_and_persistence.py +88 -0
  37. betula_cluster-0.1.0/examples/04_method_comparison.ipynb +300 -0
  38. betula_cluster-0.1.0/examples/04_method_comparison.py +159 -0
  39. betula_cluster-0.1.0/examples/05_topology_mapper.ipynb +464 -0
  40. betula_cluster-0.1.0/examples/05_topology_mapper.py +188 -0
  41. betula_cluster-0.1.0/examples/06_streaming_density.ipynb +440 -0
  42. betula_cluster-0.1.0/examples/06_streaming_density.py +165 -0
  43. betula_cluster-0.1.0/examples/07_mixed_data_kprototypes.ipynb +502 -0
  44. betula_cluster-0.1.0/examples/07_mixed_data_kprototypes.py +154 -0
  45. betula_cluster-0.1.0/examples/08_quantile_sketches.ipynb +496 -0
  46. betula_cluster-0.1.0/examples/08_quantile_sketches.py +132 -0
  47. betula_cluster-0.1.0/examples/09_semisupervised_constraints.ipynb +288 -0
  48. betula_cluster-0.1.0/examples/09_semisupervised_constraints.py +130 -0
  49. betula_cluster-0.1.0/examples/10_sparse_highdim.ipynb +368 -0
  50. betula_cluster-0.1.0/examples/10_sparse_highdim.py +126 -0
  51. betula_cluster-0.1.0/examples/11_soft_assignment_coreset_diagnostics.ipynb +512 -0
  52. betula_cluster-0.1.0/examples/11_soft_assignment_coreset_diagnostics.py +121 -0
  53. betula_cluster-0.1.0/examples/12_drift_robust_memory.ipynb +513 -0
  54. betula_cluster-0.1.0/examples/12_drift_robust_memory.py +144 -0
  55. betula_cluster-0.1.0/examples/README.md +49 -0
  56. betula_cluster-0.1.0/examples/usecases/usecase_01_embedding_dedup.ipynb +384 -0
  57. betula_cluster-0.1.0/examples/usecases/usecase_01_embedding_dedup.py +153 -0
  58. betula_cluster-0.1.0/examples/usecases/usecase_02_log_anomaly_detection.ipynb +436 -0
  59. betula_cluster-0.1.0/examples/usecases/usecase_02_log_anomaly_detection.py +158 -0
  60. betula_cluster-0.1.0/examples/usecases/usecase_03_customer_segmentation.ipynb +576 -0
  61. betula_cluster-0.1.0/examples/usecases/usecase_03_customer_segmentation.py +182 -0
  62. betula_cluster-0.1.0/examples/usecases/usecase_04_rag_corpus_curation.ipynb +567 -0
  63. betula_cluster-0.1.0/examples/usecases/usecase_04_rag_corpus_curation.py +189 -0
  64. betula_cluster-0.1.0/examples/usecases/usecase_05_real_data_clustering.ipynb +430 -0
  65. betula_cluster-0.1.0/examples/usecases/usecase_05_real_data_clustering.py +145 -0
  66. betula_cluster-0.1.0/pyproject.toml +64 -0
  67. betula_cluster-0.1.0/python/betula_cluster/__init__.py +1145 -0
  68. betula_cluster-0.1.0/python/betula_cluster/__init__.pyi +308 -0
  69. betula_cluster-0.1.0/python/betula_cluster/py.typed +0 -0
  70. betula_cluster-0.1.0/research/RESULTS-estep.md +31 -0
  71. betula_cluster-0.1.0/research/gmm_cf_estep.py +232 -0
  72. betula_cluster-0.1.0/src/bin/betula.rs +364 -0
  73. betula_cluster-0.1.0/src/clustering/gmm.rs +697 -0
  74. betula_cluster-0.1.0/src/clustering/hdbscan.rs +350 -0
  75. betula_cluster-0.1.0/src/clustering/kmeans.rs +638 -0
  76. betula_cluster-0.1.0/src/clustering/kprototypes.rs +414 -0
  77. betula_cluster-0.1.0/src/clustering/mod.rs +107 -0
  78. betula_cluster-0.1.0/src/clustering/rng.rs +31 -0
  79. betula_cluster-0.1.0/src/clustering/ward.rs +293 -0
  80. betula_cluster-0.1.0/src/distance.rs +344 -0
  81. betula_cluster-0.1.0/src/feature.rs +964 -0
  82. betula_cluster-0.1.0/src/kernels.rs +53 -0
  83. betula_cluster-0.1.0/src/lib.rs +23 -0
  84. betula_cluster-0.1.0/src/linalg.rs +247 -0
  85. betula_cluster-0.1.0/src/model.rs +227 -0
  86. betula_cluster-0.1.0/src/python.rs +2741 -0
  87. betula_cluster-0.1.0/src/sketch/ddsketch.rs +265 -0
  88. betula_cluster-0.1.0/src/sketch/kll.rs +307 -0
  89. betula_cluster-0.1.0/src/sketch/mod.rs +16 -0
  90. betula_cluster-0.1.0/src/sparse.rs +240 -0
  91. betula_cluster-0.1.0/src/stats.rs +230 -0
  92. betula_cluster-0.1.0/src/stream.rs +813 -0
  93. betula_cluster-0.1.0/src/topology.rs +615 -0
  94. betula_cluster-0.1.0/src/tree.rs +1022 -0
  95. betula_cluster-0.1.0/src/types.rs +27 -0
  96. betula_cluster-0.1.0/tests/integration_api.rs +114 -0
  97. betula_cluster-0.1.0/tests/test_python.py +1408 -0
  98. betula_cluster-0.1.0/uv.lock +155 -0
@@ -0,0 +1,18 @@
+ # Build tuning for betula-cluster.
+ #
+ # `inline-threshold` is portable — purely an inlining *hint*, no CPU-feature dependency — so the
+ # tiny hot-path distance kernels (`sq_euclidean` / `manhattan` / `dot`) inline across crate
+ # boundaries into the CF-tree insert loop. Worth a couple of percent on the build, safe everywhere,
+ # and applied to the published wheels.
+ [build]
+ rustflags = ["-C", "llvm-args=--inline-threshold=1000"]
+
+ # For a build pinned to *this machine's* CPU, add `target-cpu=native` for a further ~8 % from
+ # AVX2 / AVX-512 auto-vectorization of those same kernels (this is what closes the gap to — and at
+ # d≈10 matches — the reference `betulars`, whose wheels ship with it):
+ #
+ # RUSTFLAGS="-C target-cpu=native -C llvm-args=--inline-threshold=1000" maturin build --release
+ #
+ # It is deliberately NOT active here: a `target-cpu=native` wheel raises SIGILL on any CPU older
+ # than the build host, so it must never reach PyPI. The published wheels stay portable
+ # (baseline x86-64-v1); pin to the host only for a private/local build.
@@ -0,0 +1,109 @@
+ name: CI
+
+ on:
+   push:
+     branches: [main, master]
+   pull_request:
+
+ permissions:
+   contents: read
+
+ concurrency:
+   group: ci-${{ github.ref }}
+   cancel-in-progress: true
+
+ env:
+   CARGO_TERM_COLOR: always
+
+ jobs:
+   rust:
+     name: rust (fmt · clippy · test)
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - uses: dtolnay/rust-toolchain@stable
+         with:
+           components: rustfmt, clippy, llvm-tools-preview
+       - uses: Swatinem/rust-cache@v2
+       - uses: taiki-e/install-action@v2
+         with:
+           tool: cargo-audit,cargo-llvm-cov
+       - name: fmt
+         run: cargo fmt --all --check
+       - name: clippy (default = parallel)
+         run: cargo clippy --all-targets -- -D warnings
+       - name: clippy (no-default-features = serial)
+         run: cargo clippy --no-default-features --all-targets -- -D warnings
+       - name: clippy (persistence)
+         run: cargo clippy --no-default-features --features persistence --all-targets -- -D warnings
+       - name: clippy (cli)
+         run: cargo clippy --features cli --bin betula -- -D warnings
+       - name: test (default)
+         run: cargo test
+       - name: test (serial)
+         run: cargo test --no-default-features
+       - name: test (parallel + persistence)
+         run: cargo test --features persistence
+       - name: test (cli)
+         run: cargo test --features cli --bin betula
+       - name: cargo audit (security / unmaintained advisories)
+         run: cargo audit
+       - name: coverage (llvm-cov, floor 95% lines)
+         run: cargo llvm-cov --summary-only --fail-under-lines 95
+
+   python-build:
+     name: python (build · clippy)
+     runs-on: ubuntu-latest
+     env:
+       PYO3_USE_ABI3_FORWARD_COMPATIBILITY: "1"
+     steps:
+       - uses: actions/checkout@v4
+       - uses: dtolnay/rust-toolchain@stable
+         with:
+           components: clippy
+       - uses: Swatinem/rust-cache@v2
+       - uses: astral-sh/setup-uv@v5
+       - name: ruff check
+         run: uvx ruff check python/ tests/ bench/
+       - name: ruff format --check
+         run: uvx ruff format --check python/ tests/ bench/
+       - name: ty check
+         run: uv run --with ty --with numpy ty check python/betula_cluster
+       - name: clippy (python bindings)
+         run: cargo clippy --features python -- -D warnings
+       - name: build abi3 wheel
+         run: uv run --with maturin maturin build --release --out dist
+       - uses: actions/upload-artifact@v4
+         with:
+           name: wheel
+           path: dist/*.whl
+
+   python-test:
+     name: python (pytest · py${{ matrix.python-version }})
+     needs: python-build
+     runs-on: ubuntu-latest
+     strategy:
+       fail-fast: false
+       matrix:
+         python-version: ["3.11", "3.12", "3.13", "3.14"]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v5
+       - uses: actions/download-artifact@v4
+         with:
+           name: wheel
+           path: dist
+       # Install-only: the single abi3 wheel must import and pass on every supported interpreter.
+       - name: pytest (+ wrapper coverage, floor 100%)
+         run: |
+           wheel=$(ls dist/*.whl | head -1)
+           uv run --python ${{ matrix.python-version }} \
+             --with numpy --with "scikit-learn>=1.3" --with scipy --with networkx \
+             --with pytest --with pytest-cov --with "$wheel" \
+             pytest tests/test_python.py -q \
+             --cov=betula_cluster --cov-report=term-missing --cov-fail-under=100
+       - name: stubtest
+         run: |
+           wheel=$(ls dist/*.whl | head -1)
+           uv run --python ${{ matrix.python-version }} --with mypy --with numpy --with "$wheel" \
+             python -m mypy.stubtest betula_cluster
@@ -0,0 +1,77 @@
+ name: Release
+
+ # Build redistributable wheels for every platform and (on a version tag) publish to PyPI.
+ # Publishing uses PyPI Trusted Publishing (OIDC) — configure a trusted publisher for this repo at
+ # https://pypi.org/manage/project/betula-cluster/settings/publishing/ before pushing a `v*` tag.
+
+ on:
+   push:
+     tags: ["v*"]
+   workflow_dispatch:
+
+ permissions:
+   contents: read
+
+ jobs:
+   wheels:
+     name: wheels ${{ matrix.platform.runner }} ${{ matrix.platform.target }}
+     runs-on: ${{ matrix.platform.runner }}
+     strategy:
+       fail-fast: false
+       matrix:
+         platform:
+           - { runner: ubuntu-latest, target: x86_64 }
+           - { runner: ubuntu-latest, target: aarch64 }
+           # macOS x86_64 is cross-built on the arm64 macos-14 runner: dedicated Intel (macos-13)
+           # runners are scarce/deprecated and queue for tens of minutes. abi3 needs no interpreter at
+           # build time, so cross-compiling x86_64-apple-darwin here is sound.
+           - { runner: macos-14, target: x86_64 }
+           - { runner: macos-14, target: aarch64 }
+           - { runner: windows-latest, target: x64 }
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-python@v5
+         with:
+           python-version: "3.x"
+       - name: Build wheels
+         uses: PyO3/maturin-action@v1
+         with:
+           target: ${{ matrix.platform.target }}
+           args: --release --out dist
+           manylinux: auto
+           sccache: "true"
+       - uses: actions/upload-artifact@v4
+         with:
+           name: wheels-${{ matrix.platform.runner }}-${{ matrix.platform.target }}
+           path: dist
+
+   sdist:
+     name: sdist
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+       - name: Build sdist
+         uses: PyO3/maturin-action@v1
+         with:
+           command: sdist
+           args: --out dist
+       - uses: actions/upload-artifact@v4
+         with:
+           name: wheels-sdist
+           path: dist
+
+   publish:
+     name: publish to PyPI
+     runs-on: ubuntu-latest
+     needs: [wheels, sdist]
+     if: startsWith(github.ref, 'refs/tags/')
+     environment: pypi
+     permissions:
+       id-token: write # OIDC token for PyPI Trusted Publishing
+     steps:
+       - uses: actions/download-artifact@v4
+         with:
+           pattern: wheels-*
+           merge-multiple: true
+           path: dist
+       - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,19 @@
+ /target
+ **/*.rs.bk
+ Cargo.lock
+ # python build / artifacts
+ *.so
+ *.pyd
+ __pycache__/
+ .venv/
+ *.npy
+ /dist/
+ *.whl
+ .coverage
+ .coverage.*
+ .mypy_cache/
+ .ruff_cache/
+ .pytest_cache/
+ # editor
+ .idea/
+ .vscode/
@@ -0,0 +1,112 @@
+ # Changelog
+
+ All notable changes to this project are documented here. The format follows
+ [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and the project adheres to
+ [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [0.1.0] — 2026-06-28
+
+ First public release.
+
+ ### Added
+ - Numerically stable BETULA clustering features `(n, μ, S)` (Welford/Chan updates) with four
+   covariance models: spherical, diagonal, full (PSD via Cholesky), and a Frequent-Directions sketch
+   (`O(ℓ·d)` per leaf) for very high-dimensional data.
+ - Memory-bounded CF-tree (Phase 1) with auto-rebuild under a `max_leaves` cap; optional parallel
+   shard+merge build (`n_jobs`); EWMA `decay` for streaming concept drift.
+ - Global clustering heads: Hamerly-accelerated exact k-means, diagonal & full-covariance GMM-EM
+   (expected-log E-step + NIW/MAP), Ward-HAC (nearest-neighbour chain), and HDBSCAN-on-CF; automatic
+   cluster count at `n_clusters=0` (BIC / X-means / dendrogram cut).
+ - χ² / Mahalanobis mass-invariant absorption gate (`absorb="chi2"`).
+ - `normalize=True` for cosine/direction clustering of embeddings (L2-normalized rows on the unit
+   sphere; squared-Euclidean is monotone in cosine). Doubles as the **high-dimensional fix**: at d≫100
+   raw Euclidean distances concentrate and the CF-tree collapses, but direction stays discriminative —
+   on MNIST-784 it lifts ARI 0.04 → 0.44, beating scikit-learn (benchmarked in
+   `bench/results_real_normalize.csv`). Off by default (magnitude is signal on tabular data).
+ - Inline auto-vectorized distance kernels (the compiler vectorizes the tight reductions per call
+   site; `target-cpu=native` opts into AVX2 / AVX-512 — see `.cargo/config.toml`); rayon-parallel
+   labeling.
+ - Python bindings: abi3 wheel (CPython 3.11+), zero-copy NumPy, `float32`/`float64` (no upcast), GIL
+   released during compute; one-shot `fit_predict` and a scikit-learn-style streaming `Betula`
+   estimator (`partial_fit` / `fit` / `predict` / `fit_predict`).
+ - Full scikit-learn parameter protocol (`get_params` / `set_params`) — works with `clone`,
+   `Pipeline`, and `GridSearchCV`. PEP 561 typed (`py.typed` + stubs).
+ - Dataset-structure inspection: `microcluster_centers_`/`_weights_`/`_radii_`,
+   `cluster_centers_`/`_radii_`/`_sizes_`, `outlier_scores`, `find_outliers`, `find_near_duplicates`,
+   `near_duplicate_pairs` (scored cosine pairs, exact within each leaf-block — the scalable
+   counterpart to an O(N²) all-pairs scan), `sample_representatives`, `assign_microclusters`,
+   `summary`, and `n_rebuilds_` / `threshold_` diagnostics.
+ - **Mapper topological skeleton** (`topology::mapper` → `Betula.mapper()` → `MapperGraph`): a lens
+   (`density` / `radius` / `l2norm` / `coordinate` / `eccentricity`) over the microclusters, an
+   overlapping cover, per-bin single-linkage at a data-adaptive (median-NN) scale, and a nerve graph with branch
+   points and bridges (Tarjan); optional `to_networkx()`. Exploration of structure / RAG leakage /
+   dedup, not a partition. `mapper_stability()` sweeps the resolution and reports the topology's
+   persistence across scale (β₀ components, β₁ loops, branch points, bridges per resolution).
+ - **Soft assignment & confidence**: `predict_proba` (true posterior for the GMM heads via the
+   per-leaf responsibility matrix `microcluster_proba_`; a documented centroid-distance softmax
+   *heuristic* for k-means / Ward / HDBSCAN) and `assignment_confidence`.
+ - **Coreset / diagnostics**: `export_coreset()` → `Coreset` (leaves as weighted points — a streaming
+   coreset), `diagnostics()` (compression ratio, radius percentiles, cluster mass spread),
+   `representatives(method=medoid|boundary|outlier|diverse)`, and `cluster_profile()` (JSON-able
+   geometry for LLM cluster naming).
+ - **`memory_budget_mb`**: size `max_leaves` from a target tree-resident memory (MiB) at fit time
+   instead of tuning it by hand; the resolved value is exposed as `effective_max_leaves_`.
+ - **Drift monitoring & curation**: `snapshot()` + `Betula.compare_snapshots(before, after)`
+   (nearest-centroid match → centroid shifts / mass ratios) and `active_learning_batch(strategy=
+   "uncertain"|"outlier")` (rows to review/label).
+ - **`DenStream`** streaming density clusterer (Cao et al., SDM 2006) over fading spherical
+   micro-clusters built on the stable CFs (decay is centroid/radius-invariant); `partial_fit` /
+   `cluster` / `fit` / `fit_predict` / `predict` (`-1` = noise) + microcluster getters, sklearn-style.
+ - **`DbStream`** streaming DBSTREAM clusterer (Hahsler & Bolaños, 2016): fading micro-clusters
+   connected by **shared density** (faded overlap mass) rather than distance, so it recovers
+   arbitrarily-shaped clusters and keeps close-but-disconnected dense regions apart. Fixed-radius
+   multi-assignment online; offline connects a pair when their overlap mass is `≥ alpha·min_weight`.
+   Same fading-CF core and sklearn-style API as `DenStream`; `core::stream::DbStream` in Rust.
+ - **Streaming quantile sketches** (`betula-sketch`, in `src/sketch/`): `KllSketch` (Karnin–Lang–
+   Liberty, rank-error) and `DdSketch` (Masson et al., relative-error) — `update` / `update_many` /
+   `merge` / `quantile` / `quantiles`, mergeable, bounded memory.
+ - **Sparse input**: `fit` / `fit_predict` / `partial_fit` / `predict` accept a `scipy.sparse` matrix
+   (CSR-routed, rows expanded one at a time — the dense `N × d` matrix is never materialized). f64;
+   this dense-tree path keeps the cancellation-free guarantee, compute `O(N·d)`.
+ - **`O(nnz)` sparse-native** (`fit_predict_sparse`): one-shot clustering of a `scipy.sparse` matrix
+   that touches only the non-zeros. Rows summarize into spherical micro-clusters keeping
+   `(n, ΣX, ‖ΣX‖², S)` (so the mean, cached `‖μ‖²`, and centroid distance are `O(nnz)`) via a flat
+   leader pass bounded by `max_leaves`, then a parametric head (`kmeans` default — robust for
+   high-`d` sparse) labels each row. Uses the *expanded* squared-distance form, so unlike the dense
+   path it is not cancellation-free (accurate for sparse rows far from the dense centroid);
+   `core::sparse::{summarize_sparse, nearest_sparse}` is the Rust API.
+ - **Robust insertion** (`huber_k`): optional Huber/winsorized point updates on the streaming
+   estimator — each point is clamped to within `huber_k` per-dimension standard deviations of its
+   target microcluster before the Welford fold-in, bounding any single point's pull on the centroid
+   (`O(k·σ/n)`) so stream outliers cannot stretch a centroid or inflate a radius. Off by default;
+   zero-variance dimensions pass through and a 5-point warm-up gates the clip. The result is still a
+   valid `(n, μ, S)` triple, so every downstream head is unchanged.
+ - **Constrained clustering** (`must_link` / `cannot_link`): semi-supervised COP-KMeans (Wagstaff et
+   al., 2001) over the leaf microclusters — `fit(X, must_link=..., cannot_link=...)` /
+   `fit_predict(...)` take `(m, 2)` row-index pairs. Must-link is transitively closed; cannot-link is
+   enforced per assignment. Constraints are honoured at the microcluster granularity, so a cannot-link
+   inside one leaf (or contradictory / over-constrained inputs) raises `ValueError` rather than being
+   silently dropped. `method="kmeans"`, dense input; `core::clustering::cop_kmeans` exposes the Rust
+   API with a typed `ConstraintError`.
+ - **Mixed numeric + categorical clustering** (`KPrototypes`): k-prototypes (Huang, 1997) for mixed
+   data. A *mixed CF* (`MixedCf`) pairs the stable numeric `(n, μ, S)` with a per-attribute category
+   histogram (mode = categorical centroid); distance is `‖Δnumeric‖² + γ·(categorical mismatch)`, with
+   `γ` defaulting to Huang's heuristic. Rows are leader-summarized into bounded mixed micro-clusters,
+   then clustered. Standalone scikit-learn-style estimator (`categorical` column indices,
+   `fit`/`fit_predict`/`predict`, `cluster_centroids_`/`cluster_modes_`); `core::clustering::{MixedCf,
+   kprototypes, summarize_mixed}` is the Rust API.
+ - **Command-line interface** (`betula`, behind the `cli` feature): a dependency-free binary that
+   clusters a delimited numeric file or stdin and writes one label per row; flags mirror the library
+   (`--clusters` / `--method` / `--feature` / `--threshold` / … ; `--clusters 0` auto-selects `k`).
+ - `save` / `load` + pickle (`joblib`-compatible) persistence (serde + CBOR via ciborium,
+   schema-versioned).
+ - NaN/Inf input validation at the boundary.
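Editor's sketch for the first "Added" entry above. The `(n, μ, S)` features are described as using Welford/Chan updates; the following is a minimal pure-Python illustration of those two standard updates for the spherical case, where `S` is the total squared deviation from the mean. The `CF` class and the `insert`, `merge`, and `build` helpers are hypothetical names chosen for this example and are not the package's API.

```python
from dataclasses import dataclass


@dataclass
class CF:
    """Illustrative clustering feature (n, mu, S); not the package's own type.

    S is the squared deviation from the mean, summed over all dimensions.
    """
    n: float
    mu: list
    s: float


def insert(cf, x):
    """Welford update: fold a single point x into cf."""
    n = cf.n + 1.0
    delta = [xi - mi for xi, mi in zip(x, cf.mu)]        # x - old mean
    mu = [mi + di / n for mi, di in zip(cf.mu, delta)]   # new mean
    # (x - old mean) . (x - new mean): no raw sum of squares is ever formed
    s = cf.s + sum(di * (xi - mi) for di, xi, mi in zip(delta, x, mu))
    return CF(n, mu, s)


def merge(a, b):
    """Chan et al. pairwise update: combine two features exactly."""
    n = a.n + b.n
    delta = [bm - am for am, bm in zip(a.mu, b.mu)]
    mu = [am + di * b.n / n for am, di in zip(a.mu, delta)]
    s = a.s + b.s + sum(di * di for di in delta) * a.n * b.n / n
    return CF(n, mu, s)


def build(points):
    """Stream points one at a time into an initially empty feature."""
    cf = CF(0.0, [0.0] * len(points[0]), 0.0)
    for p in points:
        cf = insert(cf, p)
    return cf


points = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [7.0, 2.0]]
merged = merge(build(points[:2]), build(points[2:]))   # two halves, then merge
streamed = build(points)                                # all four in one pass
```

Both routes give the same triple, which is the property that lets a tree absorb points one by one and still merge whole nodes or shards later.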
+
+ ### Fixed
+ - `estimate_threshold` now measures the mean nearest-sibling distance **within each leaf node**
+   (ELKI/BETULA-standard, `O(M·capacity)`) instead of a global all-pairs scan; the rebuild threshold
+   rises monotonically (no multiplicative bump that compounded across rebuilds and collapsed the tree
+   far below `max_leaves`), and rebuilds reinsert in reverse-DFS leaf order. The CF-tree build is now
+   byte-for-byte the reference (`betulars`) tree shape and at speed parity with matched build flags.
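Editor's sketch for the "Fixed" entry above. It illustrates what "mean nearest-sibling distance within each leaf node" could look like, assuming each leaf node is simply a list of centroid vectors. The function name and the data layout are hypothetical, and whether the real `estimate_threshold` uses plain or squared distances is not stated in this changelog; plain Euclidean distance is assumed here.

```python
import math


def mean_nearest_sibling_distance(leaf_nodes):
    """Average, over all entries, of the distance to the closest entry in the SAME node.

    `leaf_nodes` is a hypothetical layout: a list of nodes, each a list of
    centroid vectors. Nodes with fewer than two entries have no sibling and
    are skipped. Each entry is compared only against its own node, so the
    cost is bounded by (number of entries) x (node capacity).
    """
    total, count = 0.0, 0
    for node in leaf_nodes:
        if len(node) < 2:
            continue
        for i, centroid in enumerate(node):
            nearest = min(
                math.dist(centroid, other)
                for j, other in enumerate(node)
                if j != i
            )
            total += nearest
            count += 1
    return total / count if count else 0.0


leaf_nodes = [
    [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]],   # nearest-sibling distances: 1, sqrt(18), 1
    [[10.0, 10.0], [10.0, 12.0]],           # nearest-sibling distances: 2, 2
    [[5.0, 5.0]],                           # single entry, skipped
]
threshold_estimate = mean_nearest_sibling_distance(leaf_nodes)
```

A global all-pairs scan would instead compare every entry with every other entry across all nodes, which is quadratic in the number of entries.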
+
+ [0.1.0]: https://github.com/ilgrad/betula-cluster/releases/tag/v0.1.0
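Editor's sketch for the `normalize=True` entry in the changelog above, which states that squared-Euclidean distance is monotone in cosine once rows are L2-normalized. The reason is the identity ‖â − b̂‖² = 2 − 2·cos(a, b) for unit vectors â and b̂. The helper names below are illustrative only and use the standard library rather than the package.

```python
import math


def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of the raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two vectors with very different magnitudes but similar direction.
a, b = [3.0, 4.0], [40.0, 30.0]
cos_ab = cosine(a, b)
d2_unit = sq_euclidean(l2_normalize(a), l2_normalize(b))
d2_raw = sq_euclidean(a, b)
```

On the unit sphere the distance depends only on the angle (`d2_unit` equals `2 - 2 * cos_ab`), while the raw distance `d2_raw` is dominated by the difference in magnitude.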