betula-cluster 0.1.0__tar.gz
- betula_cluster-0.1.0/.cargo/config.toml +18 -0
- betula_cluster-0.1.0/.github/workflows/ci.yml +109 -0
- betula_cluster-0.1.0/.github/workflows/release.yml +77 -0
- betula_cluster-0.1.0/.gitignore +19 -0
- betula_cluster-0.1.0/CHANGELOG.md +112 -0
- betula_cluster-0.1.0/Cargo.lock +383 -0
- betula_cluster-0.1.0/Cargo.toml +40 -0
- betula_cluster-0.1.0/DESIGN.md +279 -0
- betula_cluster-0.1.0/LICENSE-MIT +21 -0
- betula_cluster-0.1.0/PKG-INFO +186 -0
- betula_cluster-0.1.0/README.md +157 -0
- betula_cluster-0.1.0/bench/RESULTS.md +187 -0
- betula_cluster-0.1.0/bench/_worker.py +219 -0
- betula_cluster-0.1.0/bench/benchmark.py +256 -0
- betula_cluster-0.1.0/bench/build_vs_betulars.py +85 -0
- betula_cluster-0.1.0/bench/comprehensive.py +472 -0
- betula_cluster-0.1.0/bench/cosine_spike.py +148 -0
- betula_cluster-0.1.0/bench/plots/memory_streaming.png +0 -0
- betula_cluster-0.1.0/bench/plots/quality_ari.png +0 -0
- betula_cluster-0.1.0/bench/plots/quality_real_ari.png +0 -0
- betula_cluster-0.1.0/bench/plots/scaling_time.png +0 -0
- betula_cluster-0.1.0/bench/results_memory.csv +11 -0
- betula_cluster-0.1.0/bench/results_quality.csv +67 -0
- betula_cluster-0.1.0/bench/results_real.csv +34 -0
- betula_cluster-0.1.0/bench/results_real_normalize.csv +10 -0
- betula_cluster-0.1.0/bench/results_real_scale.csv +3 -0
- betula_cluster-0.1.0/bench/results_scaling.csv +45 -0
- betula_cluster-0.1.0/docs/FEATURES.md +104 -0
- betula_cluster-0.1.0/docs/MATH.md +118 -0
- betula_cluster-0.1.0/docs/USAGE.md +213 -0
- betula_cluster-0.1.0/examples/01_quickstart.ipynb +335 -0
- betula_cluster-0.1.0/examples/01_quickstart.py +122 -0
- betula_cluster-0.1.0/examples/02_embeddings_and_inspection.ipynb +355 -0
- betula_cluster-0.1.0/examples/02_embeddings_and_inspection.py +139 -0
- betula_cluster-0.1.0/examples/03_streaming_and_persistence.ipynb +212 -0
- betula_cluster-0.1.0/examples/03_streaming_and_persistence.py +88 -0
- betula_cluster-0.1.0/examples/04_method_comparison.ipynb +300 -0
- betula_cluster-0.1.0/examples/04_method_comparison.py +159 -0
- betula_cluster-0.1.0/examples/05_topology_mapper.ipynb +464 -0
- betula_cluster-0.1.0/examples/05_topology_mapper.py +188 -0
- betula_cluster-0.1.0/examples/06_streaming_density.ipynb +440 -0
- betula_cluster-0.1.0/examples/06_streaming_density.py +165 -0
- betula_cluster-0.1.0/examples/07_mixed_data_kprototypes.ipynb +502 -0
- betula_cluster-0.1.0/examples/07_mixed_data_kprototypes.py +154 -0
- betula_cluster-0.1.0/examples/08_quantile_sketches.ipynb +496 -0
- betula_cluster-0.1.0/examples/08_quantile_sketches.py +132 -0
- betula_cluster-0.1.0/examples/09_semisupervised_constraints.ipynb +288 -0
- betula_cluster-0.1.0/examples/09_semisupervised_constraints.py +130 -0
- betula_cluster-0.1.0/examples/10_sparse_highdim.ipynb +368 -0
- betula_cluster-0.1.0/examples/10_sparse_highdim.py +126 -0
- betula_cluster-0.1.0/examples/11_soft_assignment_coreset_diagnostics.ipynb +512 -0
- betula_cluster-0.1.0/examples/11_soft_assignment_coreset_diagnostics.py +121 -0
- betula_cluster-0.1.0/examples/12_drift_robust_memory.ipynb +513 -0
- betula_cluster-0.1.0/examples/12_drift_robust_memory.py +144 -0
- betula_cluster-0.1.0/examples/README.md +49 -0
- betula_cluster-0.1.0/examples/usecases/usecase_01_embedding_dedup.ipynb +384 -0
- betula_cluster-0.1.0/examples/usecases/usecase_01_embedding_dedup.py +153 -0
- betula_cluster-0.1.0/examples/usecases/usecase_02_log_anomaly_detection.ipynb +436 -0
- betula_cluster-0.1.0/examples/usecases/usecase_02_log_anomaly_detection.py +158 -0
- betula_cluster-0.1.0/examples/usecases/usecase_03_customer_segmentation.ipynb +576 -0
- betula_cluster-0.1.0/examples/usecases/usecase_03_customer_segmentation.py +182 -0
- betula_cluster-0.1.0/examples/usecases/usecase_04_rag_corpus_curation.ipynb +567 -0
- betula_cluster-0.1.0/examples/usecases/usecase_04_rag_corpus_curation.py +189 -0
- betula_cluster-0.1.0/examples/usecases/usecase_05_real_data_clustering.ipynb +430 -0
- betula_cluster-0.1.0/examples/usecases/usecase_05_real_data_clustering.py +145 -0
- betula_cluster-0.1.0/pyproject.toml +64 -0
- betula_cluster-0.1.0/python/betula_cluster/__init__.py +1145 -0
- betula_cluster-0.1.0/python/betula_cluster/__init__.pyi +308 -0
- betula_cluster-0.1.0/python/betula_cluster/py.typed +0 -0
- betula_cluster-0.1.0/research/RESULTS-estep.md +31 -0
- betula_cluster-0.1.0/research/gmm_cf_estep.py +232 -0
- betula_cluster-0.1.0/src/bin/betula.rs +364 -0
- betula_cluster-0.1.0/src/clustering/gmm.rs +697 -0
- betula_cluster-0.1.0/src/clustering/hdbscan.rs +350 -0
- betula_cluster-0.1.0/src/clustering/kmeans.rs +638 -0
- betula_cluster-0.1.0/src/clustering/kprototypes.rs +414 -0
- betula_cluster-0.1.0/src/clustering/mod.rs +107 -0
- betula_cluster-0.1.0/src/clustering/rng.rs +31 -0
- betula_cluster-0.1.0/src/clustering/ward.rs +293 -0
- betula_cluster-0.1.0/src/distance.rs +344 -0
- betula_cluster-0.1.0/src/feature.rs +964 -0
- betula_cluster-0.1.0/src/kernels.rs +53 -0
- betula_cluster-0.1.0/src/lib.rs +23 -0
- betula_cluster-0.1.0/src/linalg.rs +247 -0
- betula_cluster-0.1.0/src/model.rs +227 -0
- betula_cluster-0.1.0/src/python.rs +2741 -0
- betula_cluster-0.1.0/src/sketch/ddsketch.rs +265 -0
- betula_cluster-0.1.0/src/sketch/kll.rs +307 -0
- betula_cluster-0.1.0/src/sketch/mod.rs +16 -0
- betula_cluster-0.1.0/src/sparse.rs +240 -0
- betula_cluster-0.1.0/src/stats.rs +230 -0
- betula_cluster-0.1.0/src/stream.rs +813 -0
- betula_cluster-0.1.0/src/topology.rs +615 -0
- betula_cluster-0.1.0/src/tree.rs +1022 -0
- betula_cluster-0.1.0/src/types.rs +27 -0
- betula_cluster-0.1.0/tests/integration_api.rs +114 -0
- betula_cluster-0.1.0/tests/test_python.py +1408 -0
- betula_cluster-0.1.0/uv.lock +155 -0
@@ -0,0 +1,18 @@
+# Build tuning for betula-cluster.
+#
+# `inline-threshold` is portable — purely an inlining *hint*, no CPU-feature dependency — so the
+# tiny hot-path distance kernels (`sq_euclidean` / `manhattan` / `dot`) inline across crate
+# boundaries into the CF-tree insert loop. Worth a couple of percent on the build, safe everywhere,
+# and applied to the published wheels.
+[build]
+rustflags = ["-C", "llvm-args=--inline-threshold=1000"]
+
+# For a build pinned to *this machine's* CPU, add `target-cpu=native` for a further ~8 % from
+# AVX2 / AVX-512 auto-vectorization of those same kernels (this is what closes the gap to — and at
+# d≈10 matches — the reference `betulars`, whose wheels ship with it):
+#
+# RUSTFLAGS="-C target-cpu=native -C llvm-args=--inline-threshold=1000" maturin build --release
+#
+# It is deliberately NOT active here: a `target-cpu=native` wheel raises SIGILL on any CPU older
+# than the build host, so it must never reach PyPI. The published wheels stay portable
+# (baseline x86-64-v1); pin to the host only for a private/local build.
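For orientation, the three kernels named in the build-tuning comment (`sq_euclidean`, `manhattan`, `dot`) are single-pass reductions over two equal-length vectors. The Python below is only a reference sketch of what such kernels compute. The function bodies are assumptions, since the crate's Rust implementation is not shown in this hunk.

```python
# Reference sketch only: the real kernels are Rust functions in the crate.
# These bodies are assumptions that mirror the names in the comment above.

def sq_euclidean(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dot(a, b):
    """Dot product: sum of coordinate-wise products."""
    return sum(x * y for x, y in zip(a, b))
```

Each function is a tight loop with one accumulator, which is the shape of code that a compiler can inline and auto-vectorize at the call site.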
@@ -0,0 +1,109 @@
+name: CI
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+
+permissions:
+  contents: read
+
+concurrency:
+  group: ci-${{ github.ref }}
+  cancel-in-progress: true
+
+env:
+  CARGO_TERM_COLOR: always
+
+jobs:
+  rust:
+    name: rust (fmt · clippy · test)
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+        with:
+          components: rustfmt, clippy, llvm-tools-preview
+      - uses: Swatinem/rust-cache@v2
+      - uses: taiki-e/install-action@v2
+        with:
+          tool: cargo-audit,cargo-llvm-cov
+      - name: fmt
+        run: cargo fmt --all --check
+      - name: clippy (default = parallel)
+        run: cargo clippy --all-targets -- -D warnings
+      - name: clippy (no-default-features = serial)
+        run: cargo clippy --no-default-features --all-targets -- -D warnings
+      - name: clippy (persistence)
+        run: cargo clippy --no-default-features --features persistence --all-targets -- -D warnings
+      - name: clippy (cli)
+        run: cargo clippy --features cli --bin betula -- -D warnings
+      - name: test (default)
+        run: cargo test
+      - name: test (serial)
+        run: cargo test --no-default-features
+      - name: test (parallel + persistence)
+        run: cargo test --features persistence
+      - name: test (cli)
+        run: cargo test --features cli --bin betula
+      - name: cargo audit (security / unmaintained advisories)
+        run: cargo audit
+      - name: coverage (llvm-cov, floor 95% lines)
+        run: cargo llvm-cov --summary-only --fail-under-lines 95
+
+  python-build:
+    name: python (build · clippy)
+    runs-on: ubuntu-latest
+    env:
+      PYO3_USE_ABI3_FORWARD_COMPATIBILITY: "1"
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+        with:
+          components: clippy
+      - uses: Swatinem/rust-cache@v2
+      - uses: astral-sh/setup-uv@v5
+      - name: ruff check
+        run: uvx ruff check python/ tests/ bench/
+      - name: ruff format --check
+        run: uvx ruff format --check python/ tests/ bench/
+      - name: ty check
+        run: uv run --with ty --with numpy ty check python/betula_cluster
+      - name: clippy (python bindings)
+        run: cargo clippy --features python -- -D warnings
+      - name: build abi3 wheel
+        run: uv run --with maturin maturin build --release --out dist
+      - uses: actions/upload-artifact@v4
+        with:
+          name: wheel
+          path: dist/*.whl
+
+  python-test:
+    name: python (pytest · py${{ matrix.python-version }})
+    needs: python-build
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12", "3.13", "3.14"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v5
+      - uses: actions/download-artifact@v4
+        with:
+          name: wheel
+          path: dist
+      # Install-only: the single abi3 wheel must import and pass on every supported interpreter.
+      - name: pytest (+ wrapper coverage, floor 100%)
+        run: |
+          wheel=$(ls dist/*.whl | head -1)
+          uv run --python ${{ matrix.python-version }} \
+            --with numpy --with "scikit-learn>=1.3" --with scipy --with networkx \
+            --with pytest --with pytest-cov --with "$wheel" \
+            pytest tests/test_python.py -q \
+            --cov=betula_cluster --cov-report=term-missing --cov-fail-under=100
+      - name: stubtest
+        run: |
+          wheel=$(ls dist/*.whl | head -1)
+          uv run --python ${{ matrix.python-version }} --with mypy --with numpy --with "$wheel" \
+            python -m mypy.stubtest betula_cluster
@@ -0,0 +1,77 @@
+name: Release
+
+# Build redistributable wheels for every platform and (on a version tag) publish to PyPI.
+# Publishing uses PyPI Trusted Publishing (OIDC) — configure a trusted publisher for this repo at
+# https://pypi.org/manage/project/betula-cluster/settings/publishing/ before pushing a `v*` tag.
+
+on:
+  push:
+    tags: ["v*"]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+jobs:
+  wheels:
+    name: wheels ${{ matrix.platform.runner }} ${{ matrix.platform.target }}
+    runs-on: ${{ matrix.platform.runner }}
+    strategy:
+      fail-fast: false
+      matrix:
+        platform:
+          - { runner: ubuntu-latest, target: x86_64 }
+          - { runner: ubuntu-latest, target: aarch64 }
+          # macOS x86_64 is cross-built on the arm64 macos-14 runner: dedicated Intel (macos-13)
+          # runners are scarce/deprecated and queue for tens of minutes. abi3 needs no interpreter at
+          # build time, so cross-compiling x86_64-apple-darwin here is sound.
+          - { runner: macos-14, target: x86_64 }
+          - { runner: macos-14, target: aarch64 }
+          - { runner: windows-latest, target: x64 }
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+      - name: Build wheels
+        uses: PyO3/maturin-action@v1
+        with:
+          target: ${{ matrix.platform.target }}
+          args: --release --out dist
+          manylinux: auto
+          sccache: "true"
+      - uses: actions/upload-artifact@v4
+        with:
+          name: wheels-${{ matrix.platform.runner }}-${{ matrix.platform.target }}
+          path: dist
+
+  sdist:
+    name: sdist
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Build sdist
+        uses: PyO3/maturin-action@v1
+        with:
+          command: sdist
+          args: --out dist
+      - uses: actions/upload-artifact@v4
+        with:
+          name: wheels-sdist
+          path: dist
+
+  publish:
+    name: publish to PyPI
+    runs-on: ubuntu-latest
+    needs: [wheels, sdist]
+    if: startsWith(github.ref, 'refs/tags/')
+    environment: pypi
+    permissions:
+      id-token: write # OIDC token for PyPI Trusted Publishing
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          pattern: wheels-*
+          merge-multiple: true
+          path: dist
+      - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,112 @@
+# Changelog
+
+All notable changes to this project are documented here. The format follows
+[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and the project adheres to
+[Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.1.0] — 2026-06-28
+
+First public release.
+
+### Added
+- Numerically stable BETULA clustering features `(n, μ, S)` (Welford/Chan updates) with four
+  covariance models: spherical, diagonal, full (PSD via Cholesky), and a Frequent-Directions sketch
+  (`O(ℓ·d)` per leaf) for very high-dimensional data.
+- Memory-bounded CF-tree (Phase 1) with auto-rebuild under a `max_leaves` cap; optional parallel
+  shard+merge build (`n_jobs`); EWMA `decay` for streaming concept drift.
+- Global clustering heads: Hamerly-accelerated exact k-means, diagonal & full-covariance GMM-EM
+  (expected-log E-step + NIW/MAP), Ward-HAC (nearest-neighbour chain), and HDBSCAN-on-CF; automatic
+  cluster count at `n_clusters=0` (BIC / X-means / dendrogram cut).
+- χ² / Mahalanobis mass-invariant absorption gate (`absorb="chi2"`).
+- `normalize=True` for cosine/direction clustering of embeddings (L2-normalized rows on the unit
+  sphere; squared-Euclidean is monotone in cosine). Doubles as the **high-dimensional fix**: at d≫100
+  raw Euclidean distances concentrate and the CF-tree collapses, but direction stays discriminative —
+  on MNIST-784 it lifts ARI 0.04 → 0.44, beating scikit-learn (benchmarked in
+  `bench/results_real_normalize.csv`). Off by default (magnitude is signal on tabular data).
+- Inline auto-vectorized distance kernels (the compiler vectorizes the tight reductions per call
+  site; `target-cpu=native` opts into AVX2 / AVX-512 — see `.cargo/config.toml`); rayon-parallel
+  labeling.
+- Python bindings: abi3 wheel (CPython 3.11+), zero-copy NumPy, `float32`/`float64` (no upcast), GIL
+  released during compute; one-shot `fit_predict` and a scikit-learn-style streaming `Betula`
+  estimator (`partial_fit` / `fit` / `predict` / `fit_predict`).
+- Full scikit-learn parameter protocol (`get_params` / `set_params`) — works with `clone`,
+  `Pipeline`, and `GridSearchCV`. PEP 561 typed (`py.typed` + stubs).
+- Dataset-structure inspection: `microcluster_centers_`/`_weights_`/`_radii_`,
+  `cluster_centers_`/`_radii_`/`_sizes_`, `outlier_scores`, `find_outliers`, `find_near_duplicates`,
+  `near_duplicate_pairs` (scored cosine pairs, exact within each leaf-block — the scalable
+  counterpart to an O(N²) all-pairs scan), `sample_representatives`, `assign_microclusters`,
+  `summary`, and `n_rebuilds_` / `threshold_` diagnostics.
+- **Mapper topological skeleton** (`topology::mapper` → `Betula.mapper()` → `MapperGraph`): a lens
+  (`density` / `radius` / `l2norm` / `coordinate` / `eccentricity`) over the microclusters, an
+  overlapping cover, per-bin single-linkage at a data-adaptive (median-NN) scale, and a nerve graph with branch
+  points and bridges (Tarjan); optional `to_networkx()`. Exploration of structure / RAG leakage /
+  dedup, not a partition. `mapper_stability()` sweeps the resolution and reports the topology's
+  persistence across scale (β₀ components, β₁ loops, branch points, bridges per resolution).
+- **Soft assignment & confidence**: `predict_proba` (true posterior for the GMM heads via the
+  per-leaf responsibility matrix `microcluster_proba_`; a documented centroid-distance softmax
+  *heuristic* for k-means / Ward / HDBSCAN) and `assignment_confidence`.
+- **Coreset / diagnostics**: `export_coreset()` → `Coreset` (leaves as weighted points — a streaming
+  coreset), `diagnostics()` (compression ratio, radius percentiles, cluster mass spread),
+  `representatives(method=medoid|boundary|outlier|diverse)`, and `cluster_profile()` (JSON-able
+  geometry for LLM cluster naming).
+- **`memory_budget_mb`**: size `max_leaves` from a target tree-resident memory (MiB) at fit time
+  instead of tuning it by hand; the resolved value is exposed as `effective_max_leaves_`.
+- **Drift monitoring & curation**: `snapshot()` + `Betula.compare_snapshots(before, after)`
+  (nearest-centroid match → centroid shifts / mass ratios) and `active_learning_batch(strategy=
+  "uncertain"|"outlier")` (rows to review/label).
+- **`DenStream`** streaming density clusterer (Cao et al., SDM 2006) over fading spherical
+  micro-clusters built on the stable CFs (decay is centroid/radius-invariant); `partial_fit` /
+  `cluster` / `fit` / `fit_predict` / `predict` (`-1` = noise) + microcluster getters, sklearn-style.
+- **`DbStream`** streaming DBSTREAM clusterer (Hahsler & Bolaños, 2016): fading micro-clusters
+  connected by **shared density** (faded overlap mass) rather than distance, so it recovers
+  arbitrarily-shaped clusters and keeps close-but-disconnected dense regions apart. Fixed-radius
+  multi-assignment online; offline connects a pair when their overlap mass is `≥ alpha·min_weight`.
+  Same fading-CF core and sklearn-style API as `DenStream`; `core::stream::DbStream` in Rust.
+- **Streaming quantile sketches** (`betula-sketch`, in `src/sketch/`): `KllSketch` (Karnin–Lang–
+  Liberty, rank-error) and `DdSketch` (Masson et al., relative-error) — `update` / `update_many` /
+  `merge` / `quantile` / `quantiles`, mergeable, bounded memory.
+- **Sparse input**: `fit` / `fit_predict` / `partial_fit` / `predict` accept a `scipy.sparse` matrix
+  (CSR-routed, rows expanded one at a time — the dense `N × d` matrix is never materialized). f64;
+  this dense-tree path keeps the cancellation-free guarantee, compute `O(N·d)`.
+- **`O(nnz)` sparse-native** (`fit_predict_sparse`): one-shot clustering of a `scipy.sparse` matrix
+  that touches only the non-zeros. Rows summarize into spherical micro-clusters keeping
+  `(n, ΣX, ‖ΣX‖², S)` (so the mean, cached `‖μ‖²`, and centroid distance are `O(nnz)`) via a flat
+  leader pass bounded by `max_leaves`, then a parametric head (`kmeans` default — robust for
+  high-`d` sparse) labels each row. Uses the *expanded* squared-distance form, so unlike the dense
+  path it is not cancellation-free (accurate for sparse rows far from the dense centroid);
+  `core::sparse::{summarize_sparse, nearest_sparse}` is the Rust API.
+- **Robust insertion** (`huber_k`): optional Huber/winsorized point updates on the streaming
+  estimator — each point is clamped to within `huber_k` per-dimension standard deviations of its
+  target microcluster before the Welford fold-in, bounding any single point's pull on the centroid
+  (`O(k·σ/n)`) so stream outliers cannot stretch a centroid or inflate a radius. Off by default;
+  zero-variance dimensions pass through and a 5-point warm-up gates the clip. The result is still a
+  valid `(n, μ, S)` triple, so every downstream head is unchanged.
+- **Constrained clustering** (`must_link` / `cannot_link`): semi-supervised COP-KMeans (Wagstaff et
+  al., 2001) over the leaf microclusters — `fit(X, must_link=..., cannot_link=...)` /
+  `fit_predict(...)` take `(m, 2)` row-index pairs. Must-link is transitively closed; cannot-link is
+  enforced per assignment. Constraints are honoured at the microcluster granularity, so a cannot-link
+  inside one leaf (or contradictory / over-constrained inputs) raises `ValueError` rather than being
+  silently dropped. `method="kmeans"`, dense input; `core::clustering::cop_kmeans` exposes the Rust
+  API with a typed `ConstraintError`.
+- **Mixed numeric + categorical clustering** (`KPrototypes`): k-prototypes (Huang, 1997) for mixed
+  data. A *mixed CF* (`MixedCf`) pairs the stable numeric `(n, μ, S)` with a per-attribute category
+  histogram (mode = categorical centroid); distance is `‖Δnumeric‖² + γ·(categorical mismatch)`, with
+  `γ` defaulting to Huang's heuristic. Rows are leader-summarized into bounded mixed micro-clusters,
+  then clustered. Standalone scikit-learn-style estimator (`categorical` column indices,
+  `fit`/`fit_predict`/`predict`, `cluster_centroids_`/`cluster_modes_`); `core::clustering::{MixedCf,
+  kprototypes, summarize_mixed}` is the Rust API.
+- **Command-line interface** (`betula`, behind the `cli` feature): a dependency-free binary that
+  clusters a delimited numeric file or stdin and writes one label per row; flags mirror the library
+  (`--clusters` / `--method` / `--feature` / `--threshold` / … ; `--clusters 0` auto-selects `k`).
+- `save` / `load` + pickle (`joblib`-compatible) persistence (serde + CBOR via ciborium,
+  schema-versioned).
+- NaN/Inf input validation at the boundary.
+
+### Fixed
+- `estimate_threshold` now measures the mean nearest-sibling distance **within each leaf node**
+  (ELKI/BETULA-standard, `O(M·capacity)`) instead of a global all-pairs scan; the rebuild threshold
+  rises monotonically (no multiplicative bump that compounded across rebuilds and collapsed the tree
+  far below `max_leaves`), and rebuilds reinsert in reverse-DFS leaf order. The CF-tree build is now
+  byte-for-byte the reference (`betulars`) tree shape and at speed parity with matched build flags.
+
+[0.1.0]: https://github.com/ilgrad/betula-cluster/releases/tag/v0.1.0
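The `(n, μ, S)` features with Welford/Chan updates, listed first under "Added", can be illustrated with a small reference sketch. It assumes a diagonal model in which `S` holds the per-dimension sum of squared deviations from the mean. The class and method names (`ClusterFeature`, `insert`, `merge`) are hypothetical and are not taken from the crate's API.

```python
class ClusterFeature:
    """Hypothetical diagonal-model feature: count n, mean mu, and per-dimension
    sum of squared deviations s. Not the crate's actual type."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0.0
        self.mu = [0.0] * dim
        self.s = [0.0] * dim

    def insert(self, x):
        """Welford update: fold one point in without forming sums of squares."""
        self.n += 1.0
        for j, xj in enumerate(x):
            delta = xj - self.mu[j]          # deviation from the old mean
            self.mu[j] += delta / self.n     # new mean
            self.s[j] += delta * (xj - self.mu[j])  # old delta times new delta

    def merge(self, other):
        """Chan et al. pairwise update: combine two features in place."""
        n = self.n + other.n
        if n == 0.0:
            return
        for j in range(self.dim):
            delta = other.mu[j] - self.mu[j]
            self.mu[j] += delta * other.n / n
            # Within-feature deviations plus the between-means correction term.
            self.s[j] += other.s[j] + delta * delta * self.n * other.n / n
        self.n = n

    def variance(self):
        """Per-dimension population variance (zero for an empty feature)."""
        if self.n == 0.0:
            return [0.0] * self.dim
        return [sj / self.n for sj in self.s]
```

Merging two features built from disjoint parts of the data gives the same triple as inserting every point into one feature, which is the property a shard-and-merge build would rely on. Because the updates work on deviations from the running mean, they avoid the subtraction of two large sums that makes the naive sum-of-squares formula lose precision.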