PyPI - stride-align - Versions diffs - 0.2.0__tar.gz → 0.3.0__tar.gz - Mend

stride-align 0.2.0tar.gz → 0.3.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (193)

{stride_align-0.2.0 → stride_align-0.3.0}/.claude/settings.local.json RENAMED

@@ -32,7 +32,11 @@
       "Bash(~/.pyenv/bin/pyenv versions *)",
       "Bash(scp *)",
       "Bash(gh --version)",
-      "Bash(gh auth *)"
+      "Bash(gh auth *)",
+      "Bash(env)",
+      "Bash(.venv/bin/pip install *)",
+      "Bash(podman info *)",
+      "Bash(.venv/bin/cibuildwheel --help)"
     ]
   }
 }

{stride_align-0.2.0 → stride_align-0.3.0}/.gitignore RENAMED

@@ -208,4 +208,8 @@ __marimo__/
 *~
 demo/kjv.txt
-\#*\#
+\#*\#
+# Local wheelhouse staging (not part of the source tree).
+wheelhouse/
+wheelhouse_*/
+dist/

{stride_align-0.2.0 → stride_align-0.3.0}/BENCHMARK.md RENAMED

@@ -45,6 +45,8 @@ ratio = baseline_median_seconds / stride_align_median_seconds
 | Damerau-Lev (Graviton4, short tgts) | `linux_aarch64_neon`/`sve`/`sve2` | rapidfuzz OSA | 4 | **2.85x** | 2.83x | 2.27x | 3.89x |
 | Lev (Power8 VSX, mixed tgts) | `linux_powerpc64_vsx` | generic (no rapidfuzz wheel) | 8 | **2.40x** | 2.51x | 1.56x | 3.03x |
 | Damerau-Lev (Power8 VSX, mixed tgts) | `linux_powerpc64_vsx` | generic (no rapidfuzz wheel) | 7 | **1.99x** | 2.22x | 1.46x | 2.57x |
+| Jaro batch (cross-arch, N=1000) | `x86_avx512bwvl` / `*_neon` / `*_lasx` / `*_vsx` | rapidfuzz | 10 | **5.1x** | 3.7x | 1.54x | 263x |
+| cdist pruning (cross-arch, T=0.99) | `x86_avx512bwvl` / `*_neon` / `*_lasx` | own T=0 baseline | 24 | **426x** | 512x | 145x | 1,408x |
 ## Intel x86 - 2026-05-18
@@ -972,6 +974,252 @@ useful as a correctness/reference backend only.
 4. Build parasail from source for ppc64le and add a parasail column to the next sweep — every other family in this file has at least one parasail point of reference.
 5. Investigate why SWAR loses to generic on Power8 via an asm dump of the generic score loop.
+## Jaro + Jaro-Winkler (cross-arch) - 2026-05-23
+First cross-arch sweep of the new Jaro / Jaro-Winkler SIMD batch
+kernels. One target per 64-bit SIMD lane; the query's per-byte PEQ is
+gathered per-lane on each iteration, and the per-lane window mask is
+built via the new `shl_var_u64` / `shr_var_u64` Ops primitives. After
+the SIMD inner loop, a scalar finishing pass per lane computes
+match/transposition counts from the bitmaps.
+Same workload everywhere: random lowercase strings, one query of the
+listed length, 1000 targets of the same length. Median of 3 runs of
+50 iterations each. Baseline is `rapidfuzz.distance.Jaro.similarity`
+called in a Python list comprehension — the natural "fuzzy match one
+query against many targets" pattern.
+Output is bit-equivalent to rapidfuzz across all listed backends:
+verified on 500 random batches × ~25 targets each (~12,500 pairs per
+backend); 0 mismatches at machine precision.
+### Singular SIMD batch (one query, 1000 targets)
+| Host / backend | Query len | stride-align | rapidfuzz | Ratio |
+| --- | ---: | ---: | ---: | ---: |
+| Tiger Lake `x86_avx512bwvl` | 12 | 40.7 us | 181.5 us | **4.46x** |
+| Tiger Lake `x86_avx512bwvl` | 32 | 105.1 us | 289.2 us | **2.75x** |
+| Graviton4 `linux_aarch64_neon` | 12 | 43 us | 269 us | **6.26x** |
+| Graviton4 `linux_aarch64_neon` | 32 | 100 us | 353 us | **3.53x** |
+| Apple M-series `macos_arm64_neon` | 12 | 16 us | 151 us | **9.36x** |
+| Apple M-series `macos_arm64_neon` | 32 | 48 us | 183 us | **3.86x** |
+| Loongson `linux_loongarch64_lasx` | 12 | 87 us | 13,952 us | 161x |
+| Loongson `linux_loongarch64_lasx` | 32 | 187 us | 49,299 us | 263x |
+| Power8 `linux_powerpc64_vsx` | 12 | 194 us | 600 us | **3.09x** |
+| Power8 `linux_powerpc64_vsx` | 32 | 467 us | 719 us | **1.54x** |
+The Loongson ratios are dramatic because rapidfuzz has no LSX/LASX
+SIMD path on LoongArch64 — it falls through to a scalar C kernel,
+while our LSX/LASX bit-parallel batch fans out 2/4 targets per vector
+iteration.
+### One bug surfaced during deployment
+VSX (`*reinterpret_cast<Vec*>(ptr)` for `load_aligned`/`store_aligned`)
+ran into a strict-aliasing miscompile under GCC -O3: the scalar
+writes to per-iteration `LaneScratch` could be reordered past the
+same-block Vec read, silently dropping lane-1 match updates on every
+2-target group. The Levenshtein SIMD kernel uses the same primitives
+but a different scratch pattern, so it didn't trip. Fix: switch VSX
+to `vec_xl` / `vec_xst`, the proper VSX load/store intrinsics.
+Documented in commit `8ae4905`.
+### Multi-word query batch (q_len in (64, 256], m_len ≤ 64)
+Same workload shape, query length stretched into the multi-word path
+(W = 2 for q in (64, 128], W = 3 for (128, 192], W = 4 for (192, 256]).
+Targets stay short; b_matched fits in a single word.
+| Host / backend | q_len | m_len | N | stride-align | rapidfuzz | Ratio |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| Tiger Lake `x86_avx512bwvl` | 50 | 20 | 500 | 45 us | 227 us | **5.01x** |
+| Tiger Lake `x86_avx512bwvl` | 100 | 20 | 500 | 44 us | 255 us | **5.81x** |
+| Tiger Lake `x86_avx512bwvl` | 150 | 20 | 500 | 52 us | 299 us | **5.79x** |
+| Tiger Lake `x86_avx512bwvl` | 200 | 20 | 500 | 60 us | 328 us | **5.45x** |
+| Graviton4 `linux_aarch64_neon` | 100 | 20 | 500 | 53 us | 238 us | **4.52x** |
+| Apple M-series `macos_arm64_neon` | 100 | 20 | 500 | 25 us | 124 us | **4.96x** |
+| Loongson `linux_loongarch64_lasx` | 100 | 20 | 500 | 110 us | 39,461 us | 358x |
+| Power8 `linux_powerpc64_vsx` | 100 | 20 | 500 | 1,532 us | 581 us | 0.38x |
+Power8 is the one regression. The 2-block (W=2) inner loop doubles
+the gather count per j vs the single-word path, and Power8's VSX
+gather is emulated as scalar `vec_extract`/`vec_insert` (no native
+ppc gather instruction at this lane count). The per-iteration
+overhead exceeds rapidfuzz's tight scalar loop at this size.
+Workaround: if q_len ≤ 64 the single-word path stays 3x ahead;
+above 64 on Power8 specifically, prefer the per-target scalar
+dispatch (which the singular-API path already uses for q > 64).
+Future work: native pre-shuffle of the gather indices, or a Power-
+specific tuned gather using `vec_perm`.
+### Multi-word target batch (q_len ≤ 256, m_len in (64, 256])
+The second multi-word axis: target length crossing the 64-bit
+register boundary, in addition to (or independent of) the query
+length. b_matched becomes `std::array<Vec, W_target>`; the inner
+loop only updates block `j / 64` so the per-iteration work is the
+same as single-word target.
+| Host / backend | q | m | N | stride-align | rapidfuzz | Ratio |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| Tiger Lake `x86_avx512bwvl` | 100 | 100 | 500 | 196 us | 789 us | **4.02x** |
+| Tiger Lake `x86_avx512bwvl` | 200 | 150 | 500 | 440 us | 1305 us | **2.97x** |
+| Tiger Lake `x86_avx512bwvl` | 60 | 100 | 500 | 171 us | 631 us | **3.70x** |
+| Tiger Lake `x86_avx512bwvl` | 80 | 200 | 500 | 343 us | 1230 us | **3.58x** |
+| Graviton4 `linux_aarch64_neon` | 100 | 100 | 500 | 246 us | 643 us | **2.62x** |
+| Apple M-series `macos_arm64_neon` | 100 | 100 | 500 | 100 us | 296 us | **2.95x** |
+| Loongson `linux_loongarch64_lasx` | 100 | 100 | 500 | 509 us | 142,489 us | 279x |
+The dispatch picks `(W_query, W_target)` from the actual lengths in
+the batch, so short-target inputs still get `W_target = 1` (no wasted
+work). 16 instantiations max per backend.
+### Constraints (v0.3.0)
+* SIMD path covers query lengths up to 256 AND target lengths up to
+  256 (W = 1..4 blocks per side). Above 256 on either side it falls
+  through to per-target scalar dispatch (bit-parallel single-word
+  for ≤ 64 inputs and the scalar reference above).
+* Byte-compatible inputs (bytes / 1-byte unicode). Wider unicode
+  falls through to scalar via the prepared-token path.
+### Levenshtein audit (no changes needed)
+Lev's SIMD batch already handles query lengths up to 256 via the
+same W = 1..4 multi-word pattern. Target length on Lev's side is
+just an iteration count over the inner DP loop — no per-target
+register-width constraint — so multi-word target is a non-issue for
+Lev. Above q_len = 256, Lev's scalar dispatch picks up via Hyyrö's
+multi-word Myers (no upper bound on q_len). Future work to extend
+the SIMD batch beyond W = 4 is small but the use case (queries >
+256 chars in batches of 1000+) is rare.
+## cdist pruning + cutoff push-down (Intel x86) - 2026-05-24
+Three optimizations stacked on top of `cdist_above_threshold` and
+`cdist_top_k`:
+1. **Length-difference pruning.** Each pair `(q, t)` is gated by a
+   closed-form upper bound on the achievable normalized similarity
+   before any SIMD work runs. Bounds: `min/max` for Lev / OSA /
+   true-DL, `(2 + min/max)/3` for Jaro, `2*min/(q+t)` for Indel,
+   `1.0` if equal-length for Hamming.
+2. **Row-sort by query length, descending.** `cdist_top_k`
+   processes the longest queries first so close-length high-scoring
+   pairs surface early and the shared `global_min_bound` atomic
+   reaches a useful value before the short-query rows run.
+3. **Per-pair cutoff push-down into the SIMD kernel.** The Myers /
+   OSA / Hamming inner loops bail per lane when the score exceeds
+   the per-pair cutoff plus the remaining-chars allowance; bailed
+   lanes return the `cutoff + 1` sentinel. Lev/OSA use
+   `floor((1-T)*max(|q|, |t|) + 1e-9)`; Hamming uses
+   `floor((1-T)*|q|)`. Indel's bit-parallel Allison-Dix doesn't
+   track a running distance per column, and Jaro/JW have multi-term
+   scores without a clean per-column bail, so those two scorer
+   families benefit from length pruning only.
+All three are correctness-preserving — tests in
+`tests/test_cdist_length_pruning.py` pin the result set against the
+un-pruned full `cdist` matrix at multiple thresholds and the
+floating-point integer-boundary edges.
+### Setup
+Tiger Lake `x86_avx512bwvl`, N=400 queries × M=400 targets =
+160,000 pairs, random lowercase ASCII. Lengths 4–40 for Lev / OSA /
+Indel / Jaro / JW; lengths 100 (equal-length) for Hamming.
+`cpu_count=4`. Reproduce via `tools/bench_cdist_pruning.py --scorer
+<name>`.
+### `cdist_above_threshold` throughput (pairs/sec)
+| Scorer | T=0 | T=0.3 | T=0.5 | T=0.7 | T=0.85 | T=0.95 | T=0.99 |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| LEVENSHTEIN_NORMALIZED        | 0.49M | 12.7M |  31.3M |  53.7M | 116M | 226M | **290M** |
+| DAMERAU_LEVENSHTEIN_NORMALIZED| 0.50M | 13.0M |  35.5M |  70.2M | 147M | 190M | **297M** |
+| HAMMING_NORMALIZED (n=100)    | 0.46M | 22.5M | 122M   | 108M   | 110M | 114M | **136M** |
+| INDEL_NORMALIZED              | 0.47M | 11.3M |  24.7M |  39.0M |  66M | 143M | **286M** |
+| JARO                          | 0.53M |  0.5M |   2.2M |  12.6M |  24M |  47M | **129M** |
+| JARO_WINKLER                  | 0.51M |  0.5M |   1.8M |  11.9M |  14M |  41M | **141M** |
+### Speedup ratio vs `T=0` (same scorer, same workload)
+| Scorer | T=0.3 | T=0.5 | T=0.7 | T=0.85 | T=0.95 | T=0.99 |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| LEVENSHTEIN_NORMALIZED        | **26x** |  **64x** | **109x** | **236x** | **458x** | **587x** |
+| DAMERAU_LEVENSHTEIN_NORMALIZED| **26x** |  **72x** | **141x** | **298x** | **384x** | **598x** |
+| HAMMING_NORMALIZED            | **49x** | **264x** | **233x** | **238x** | **246x** | **293x** |
+| INDEL_NORMALIZED              | **24x** |  **53x** |  **83x** | **142x** | **306x** | **611x** |
+| JARO                          |  0.9x   |  4.1x    |  **24x** |  **46x** |  **90x** | **245x** |
+| JARO_WINKLER                  |  1.0x   |  3.5x    |  **23x** |  **27x** |  **80x** | **275x** |
+Jaro / Jaro-Winkler show no benefit until the threshold rises above
+the natural-distribution floor of the `(2 + min/max)/3` length
+bound. For length 4–40 random strings that happens around T ≈ 0.7;
+above that the bound rules out most pairs and the speedup compounds.
+### `cdist_top_k` throughput (pairs/sec)
+| Scorer | k=1 | k=10 | k=100 | k=1000 | k=10000 |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| LEVENSHTEIN_NORMALIZED        |  8.0M | 25.3M | 21.1M | 17.0M |  8.5M |
+| DAMERAU_LEVENSHTEIN_NORMALIZED| 20.0M | 23.4M | 19.4M | 18.6M | 12.2M |
+| HAMMING_NORMALIZED (n=100)    | 78.0M | 68.9M | 71.1M | 61.6M | 24.0M |
+| INDEL_NORMALIZED              | 29.1M | 24.5M | 21.5M | 17.4M |  7.5M |
+| JARO                          | 15.3M | 14.2M | 13.9M | 12.6M |  8.9M |
+| JARO_WINKLER                  | 15.1M | 12.2M | 13.6M | 13.4M | 10.0M |
+The row-sort matters most at small `k`: with the longest queries
+processed first, the global heap-min bound rises early and the
+per-pair cutoff push-down has a tight value to compare against for
+the bulk of the remaining rows. At very large `k` the heap rarely
+fills with strong matches so the bound stays close to the
+`(1.0 - safe margin)` floor and the kernel-level cutoff doesn't bite.
+### Cross-arch throughput at `T=0.99` (pairs/sec)
+Same script, same workload, four different SIMD backends.
+| Scorer | Tiger Lake `x86_avx512bwvl` | Graviton4 `linux_aarch64_neon` | Mac M-series `macos_arm64_neon` | Loongson `linux_loongarch64_lasx` |
+| --- | ---: | ---: | ---: | ---: |
+| LEVENSHTEIN_NORMALIZED        | 290M | 318M |   996M | 370M |
+| DAMERAU_LEVENSHTEIN_NORMALIZED| 297M | 318M | 1,014M | 426M |
+| HAMMING_NORMALIZED (n=100)    | 136M |  70M |   295M | 139M |
+| INDEL_NORMALIZED              | 286M | 272M |   995M | 402M |
+| JARO                          | 129M | 160M |   784M | 190M |
+| JARO_WINKLER                  | 141M | 105M |   543M | 136M |
+### Cross-arch speedup vs `T=0` (same scorer, same host)
+| Scorer | Tiger Lake | Graviton4 | Mac M-series | Loongson |
+| --- | ---: | ---: | ---: | ---: |
+| LEVENSHTEIN_NORMALIZED        | 587x | 699x |   537x | **1,199x** |
+| DAMERAU_LEVENSHTEIN_NORMALIZED| 598x | 714x |   478x | **1,353x** |
+| HAMMING_NORMALIZED            | 293x | 145x |   147x |   487x  |
+| INDEL_NORMALIZED              | 611x | 567x |   540x | **1,408x** |
+| JARO                          | 245x | 339x |   429x |   642x  |
+| JARO_WINKLER                  | 275x | 225x |   293x |   472x  |
+The speedup ratios are algorithmic — the bound math is independent
+of ISA, so the cross-host spread reflects only how much the
+un-pruned baseline costs vs the post-pruning hot path on each
+machine. Mac's M-series tops the absolute throughput because the
+inner SIMD loops are bit-parallel ops that the Apple core hands
+back at high IPC; Loongson posts the largest *ratio* because its
+un-pruned baseline (full Myers / OSA / Indel scan per pair at
+length 4–40) is slowest in absolute terms.
+Power8 numbers are deferred (host RAM too tight for a full `-O3`
+rebuild — see `docs/power8-gcc10-workarounds.md`).
+### Reading the numbers
+The relative speedups carry over across hosts; the absolute
+throughput numbers don't. The pre-pruning baseline (`T=0`) is
+`cdist_above_threshold` running every pair through full SIMD —
+equivalent to a full `cdist` plus the iterator overhead — so it's
+the right "no optimization" reference for the speedup ratios.
 ## Notes on comparing across families
 These numbers are intended for engineering direction, not publication-grade
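
Reviewer note on the BENCHMARK.md hunk above: the length-difference bounds and the Lev/OSA cutoff formula it quotes can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation. The function names and the string scorer keys are hypothetical, and returning 0.0 for unequal-length Hamming inputs is an assumption, since the text only states the equal-length case.

```python
import math


def similarity_upper_bound(scorer: str, q_len: int, t_len: int) -> float:
    """Best normalized similarity a pair with these lengths could reach.

    Scorer keys are hypothetical labels for this sketch only.
    """
    lo, hi = min(q_len, t_len), max(q_len, t_len)
    if hi == 0:
        return 1.0  # two empty strings are identical
    if scorer in ("lev", "osa", "true_dl"):
        return lo / hi  # at least hi - lo edits are unavoidable
    if scorer == "jaro":
        return (2.0 + lo / hi) / 3.0  # at most lo matches, no transpositions
    if scorer == "indel":
        return 2.0 * lo / (q_len + t_len)
    if scorer == "hamming":
        # Assumption: unequal lengths are treated as unreachable here.
        return 1.0 if q_len == t_len else 0.0
    raise ValueError(f"unknown scorer: {scorer}")


def can_skip(scorer: str, q_len: int, t_len: int, threshold: float) -> bool:
    """True when the length bound alone rules the pair out."""
    return similarity_upper_bound(scorer, q_len, t_len) < threshold


def max_distance_cutoff(threshold: float, q_len: int, t_len: int) -> int:
    """Lev/OSA per-pair cutoff: floor((1 - T) * max(|q|, |t|) + 1e-9)."""
    return math.floor((1.0 - threshold) * max(q_len, t_len) + 1e-9)
```

For lengths 4 and 8 at T=0.7, the Levenshtein bound is 0.5, so the pair is skipped, while the Jaro bound is about 0.83, so it is not. That matches the observation in the hunk that Jaro gains little from pruning until the threshold is high. The `1e-9` term guards integer boundaries: `(1 - 0.9) * 10` evaluates to slightly less than 1 in binary floating point, so a bare `floor` would give 0 instead of 1.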

stride_align-0.3.0/CHANGELOG.md ADDED

@@ -0,0 +1,89 @@
+# Changelog
+All notable changes to `stride-align` are recorded here. The format
+follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and
+this project adheres to [Semantic Versioning](https://semver.org/).
+## [0.3.0] - 2026-05-24
+### Added
+* **Indel distance** (`Scorer.INDEL` / `Scorer.INDEL_NORMALIZED`).
+  Levenshtein restricted to insertions and deletions; equivalent to
+  `|a| + |b| - 2 * LCS(a, b)`. Bit-parallel single-word kernel uses
+  the Allison-Dix (1986) recurrence; multi-word patterns fall back
+  to scalar DP. Public API: `indel_score`, `indel_normalized_score`,
+  `indel_scores`, `indel_normalized_scores`, `indel_top_k`,
+  `indel_best`, and the corresponding normalized variants. Wired
+  through every backend, `cdist`, `cdist_above_threshold`,
+  `cdist_top_k`, and the function-reference dispatch in `extract`.
+* **True (unrestricted) Damerau-Levenshtein**
+  (`Scorer.TRUE_DAMERAU_LEVENSHTEIN` /
+  `Scorer.TRUE_DAMERAU_LEVENSHTEIN_NORMALIZED`). The unrestricted
+  form where a single character may participate in multiple edits.
+  Diverges from OSA on overlapping transpositions
+  (e.g. `"ca"`→`"abc"`: OSA=3, true-DL=2). Scalar DP only; no
+  bit-parallel kernel yet (Hyyrö 2003 exists but is significantly
+  more complex than OSA's bit-parallel and rarely the bottleneck).
+  Existing `Scorer.DAMERAU_LEVENSHTEIN` continues to refer to OSA —
+  the API name is unchanged.
+* **Length-difference pruning** for `cdist_above_threshold` and
+  `cdist_top_k`. Each pair is gated by a closed-form upper bound on
+  the achievable normalized similarity before any SIMD work runs;
+  bounds are scorer-specific (`min/max` for Lev/OSA/true-DL,
+  `(2 + min/max)/3` for Jaro, `2*min/(q+t)` for Indel, `1.0` if
+  equal-length for Hamming).
+* **`cdist_top_k` row-sort by query length, descending.** Longest
+  queries processed first so close-length high-scoring pairs
+  surface early and the shared `global_min_bound` atomic reaches a
+  useful value before the short-query rows run.
+* **Per-pair cutoff push-down into the SIMD kernels.** Myers
+  (Levenshtein single-word + multi-word), OSA single-word, and the
+  Hamming inner loop all bail when the running distance plus
+  remaining-chars allowance proves the pair can't reach its cutoff;
+  bailed lanes return the per-pair `cutoff + 1` sentinel.
+* **`docs/adding-a-new-algorithm.md`**: grep-able checklist for the
+  touch points (`Scorer` enum, runtime helpers, cdist switches,
+  bindings, per-backend Implementation methods, tests) a new
+  scorer / alignment algorithm / SIMD backend has to hit.
+* **Python 3.9 support.** The three `match` blocks in the Python
+  layer became dict lookups; `from __future__ import annotations`
+  was already in place project-wide. `pyproject.toml`
+  `requires-python = ">=3.9"`, classifiers extended.
+### Changed
+* **Lowered the build-time C++ requirement from C++23 to C++20.**
+  The project doesn't actually use any C++23 library feature — the
+  `cxx_std_23` setting was aspirational. Lowering it lets gcc 10
+  toolchains build the project (POWER8 Ubuntu 20.04 ships gcc 9.4
+  and 10.5). Two stdlib gaps in gcc-10 libstdc++ are bridged with
+  feature-test-gated fallbacks (`std::bit_cast` →
+  `__builtin_bit_cast`, `std::make_unique_for_overwrite` → plain
+  `new T[n]`). See `docs/power8-gcc10-workarounds.md` for the full
+  list and the revert recipe once gcc 16 lands.
+### Fixed
+* **`cdist_above_threshold` iterator on macOS and LoongArch64.** The
+  end-of-stream signal previously used `throw nb::stop_iteration()`,
+  which relies on cross-DSO RTTI matching for nanobind's
+  `builtin_exception`. macOS's two-level namespace and at least one
+  LoongArch toolchain configuration defeat that lookup, and the
+  exception ended up routed through nanobind's generic
+  `std::exception` translator → bare `RuntimeError` instead of
+  Python's `StopIteration`. Replaced with the C-API path
+  (`PyErr_SetNone(PyExc_StopIteration)` plus a null `nb::object`
+  return), which bypasses C++ exception machinery entirely. Fixes
+  91 macOS test failures.
+## [0.2.0]
+(Existing behavior at this tag was not previously tracked in this
+file; future releases will list specific deltas.)
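
Reviewer note on the CHANGELOG.md hunk above: the `"ca"` → `"abc"` divergence between OSA and true Damerau-Levenshtein can be reproduced with two textbook scalar DPs. The sketch below is not taken from the package. The function names are hypothetical, and the unrestricted variant uses the standard Lowrance-Wagner formulation, which is assumed (not confirmed) to correspond to the scalar DP the changelog mentions.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: each substring may be edited at most once."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition only; nothing may be edited afterwards.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def true_dl_distance(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein (Lowrance-Wagner formulation)."""
    la, lb = len(a), len(b)
    inf = la + lb
    # The table carries one extra sentinel row and column filled with inf.
    d = [[0] * (lb + 2) for _ in range(la + 2)]
    d[0][0] = inf
    for i in range(la + 1):
        d[i + 1][0] = inf
        d[i + 1][1] = i
    for j in range(lb + 1):
        d[0][j + 1] = inf
        d[1][j + 1] = j
    last_row = {}  # last row (1-based) in which each character of a was seen
    for i in range(1, la + 1):
        last_match_col = 0
        for j in range(1, lb + 1):
            i1 = last_row.get(b[j - 1], 0)
            j1 = last_match_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_match_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,      # substitute or match
                d[i + 1][j] + 1,     # insert
                d[i][j + 1] + 1,     # delete
                # transpose across a gap, paying for the characters in between
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[a[i - 1]] = i
    return d[la + 1][lb + 1]
```

On `"ca"` and `"abc"` the two functions return 3 and 2 respectively, as the changelog states; on `"ab"` and `"ba"` both return 1.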

{stride_align-0.2.0 → stride_align-0.3.0}/CMakeLists.txt RENAMED

@@ -11,7 +11,7 @@ if(NOT SKBUILD)
   )
 endif()
-set(CMAKE_CXX_STANDARD 23)
+set(CMAKE_CXX_STANDARD 20)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(CMAKE_CXX_EXTENSIONS OFF)
 string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" STRIDE_ALIGN_SYSTEM_PROCESSOR)
@@ -134,7 +134,12 @@ function(apply_stride_align_optimization_flags target_name)
 endfunction()
 function(configure_stride_align_target target_name)
-  target_compile_features(${target_name} PRIVATE cxx_std_23)
+  # C++20 is sufficient — we use std::popcount, <bit>, consteval,
+  # if-constexpr-requires; nothing from the C++23 stdlib (no expected,
+  # no format/print, no flat_*, no generator, no mdspan). Keeping the
+  # required standard at 20 lets older toolchains (gcc 10, e.g. on
+  # POWER8 Ubuntu 20.04) build the project.
+  target_compile_features(${target_name} PRIVATE cxx_std_20)
   target_include_directories(${target_name} PRIVATE include src/cpp)
   if(CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")

{stride_align-0.2.0 → stride_align-0.3.0}/PKG-INFO RENAMED

@@ -1,12 +1,13 @@
 Metadata-Version: 2.4
 Name: stride-align
-Version: 0.2.0
-Summary: Smith-Waterman and Needleman-Wunsch alignments with a nanobind C++23 backend.
+Version: 0.3.0
+Summary: Smith-Waterman and Needleman-Wunsch alignments with a nanobind C++20 backend.
 Author: Adam
 License-Expression: Apache-2.0
 Classifier: Development Status :: 3 - Alpha
 Classifier: Intended Audience :: Developers
 Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
@@ -15,7 +16,7 @@ Classifier: Programming Language :: Python :: 3.14
 Classifier: Programming Language :: C++
 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
-Requires-Python: >=3.10
+Requires-Python: >=3.9
 Requires-Dist: numpy>=1.22
 Provides-Extra: dev
 Requires-Dist: build>=1.2; extra == "dev"
@@ -49,14 +50,14 @@ sudo apt install python3-numpy
 PY=$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')
 pip install \
-  https://github.com/adamdeprince/stride-align/releases/download/v0.2.0/stride_align-0.2.0-${PY}-${PY}-linux_loongarch64.whl
+  https://github.com/adamdeprince/stride-align/releases/download/v0.3.0/stride_align-0.3.0-${PY}-${PY}-linux_loongarch64.whl
 ```
-Prebuilt LoongArch64 wheels are available for Python 3.10, 3.11, 3.12,
-3.13, and 3.14. If you are on a different Python (or just want to
-build from source), `pip install stride-align` falls back to the
-source distribution on PyPI, which compiles the LSX/LASX kernels
-locally.
+Prebuilt LoongArch64 wheels are available for Python 3.9, 3.10,
+3.11, 3.12, 3.13, and 3.14. If you are on a different Python (or
+just want to build from source), `pip install stride-align` falls
+back to the source distribution on PyPI, which compiles the
+LSX/LASX kernels locally.
 First, just a disclaimer: I'm not using religious texts here to push
 an agenda - for this demo I need multiple largish public domain
@@ -326,10 +327,11 @@ Gapped Alignment Report". CIGAR is the compact alignment-operation notation
 used by SAM/BAM tooling. If you want the full formal version, see the
 [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf).
-### Levenshtein and Damerau-Levenshtein
+### Edit-distance scorers
-Beyond Smith-Waterman and Needleman-Wunsch, `stride-align` exposes two
-unit-cost edit-distance metrics with their own SIMD-batched code paths:
+Beyond Smith-Waterman and Needleman-Wunsch, `stride-align` exposes
+six unit-cost edit-distance and similarity metrics. Most have their
+own SIMD-batched code path; true Damerau-Levenshtein is scalar DP only:
 ```python
 import stride_align
@@ -338,7 +340,6 @@ import stride_align
 stride_align.levenshtein_score("kitten", "sitting")               # -> 3
 stride_align.levenshtein_normalized_score("kitten", "sitting")    # -> 0.571...
 stride_align.levenshtein_scores("kitten", ["kit", "sitting"])     # -> ndarray[int64]
-stride_align.levenshtein_normalized_scores("kitten", targets)     # -> ndarray[float64]
 # Optional `score_cutoff` (rapidfuzz convention): bail early per-target,
 # results that exceed the cutoff come back as `cutoff + 1`.
@@ -349,23 +350,78 @@ stride_align.levenshtein_scores(query, targets, score_cutoff=3)
 # OSA.distance and is what most callers asking for
 # "Damerau-Levenshtein" actually want.
 stride_align.damerau_levenshtein_score("ab", "ba")                # -> 1
-stride_align.damerau_levenshtein_scores(query, targets)           # -> ndarray[int64]
+# True Damerau-Levenshtein — the unrestricted form, where one
+# character may participate in more than one edit. Slower (no
+# bit-parallel kernel yet) but matches rapidfuzz.distance.DamerauLevenshtein
+# exactly. Diverges from OSA on overlapping transpositions, e.g.
+# "ca" -> "abc": OSA=3, true-DL=2.
+stride_align.true_damerau_levenshtein_score("ca", "abc")          # -> 2
+# Indel — Levenshtein restricted to insertions and deletions, no
+# substitutions. Equivalent to |a| + |b| - 2 * LCS(a, b). Bit-
+# parallel Allison-Dix (1986) inner loop.
+stride_align.indel_score("kitten", "sitting")                     # -> 5
+# Hamming — count of positions where two equal-length strings differ.
+# Cutoff variant bails the byte loop once mismatches exceed the cap.
+stride_align.hamming_score("100", "110")                          # -> 1
+# Jaro / Jaro-Winkler — similarities in [0, 1]; Winkler adds a
+# capped prefix bonus.
+stride_align.jaro_similarity("martha", "marhta")                  # -> 0.944...
+stride_align.jaro_winkler_similarity("martha", "marhta")          # -> 0.961...
 ```
-Both algorithms use a bit-parallel Myers-style inner loop. The batch
-variants pack one target per SIMD lane (`*_scores`) and currently
-specialize on every architecture's primary 64-bit-lane SIMD:
+The batch variants (`*_scores`, `*_similarities`) pack one target
+per SIMD lane on every supported backend:
 - x86: SSE4.1 / AVX2 / AVX-512 / AVX10-256 / AVX10-512
 - ARM: NEON (Linux + macOS), SVE / SVE2
 - LoongArch: LSX / LASX
 - PowerPC: VSX
-Patterns up to 64 chars run a single-word Myers; 65-256 chars use the
-multi-word kernel (W=2/3/4). Beyond 256, the implementation falls
-through to a scalar bit-parallel dispatch.
+For Lev / OSA, patterns up to 64 chars run a single-word Myers;
+65–256 chars use the multi-word kernel (W=2/3/4). Indel and OSA
+fall back to scalar bit-parallel for patterns >64 (multi-word
+generalization deferred); true-DL is scalar DP only.
+### `cdist`, `cdist_above_threshold`, `cdist_top_k`
+For all-pairs scoring across two lists of strings, `stride-align`
+ships three matrix-style entry points:
+```python
+import stride_align as sa
+qs = ["kitten", "sitting", "kit"]
+ts = ["kitten", "kit", "sitting", "biting"]
+# Full N×M similarity matrix — ndarray[float64] (similarity scorers)
+# or ndarray[int64] (distance scorers).
+sa.cdist(qs, ts, scorer=sa.Scorer.JARO)
+# Streaming filter — yields only pairs whose similarity exceeds the
+# threshold. Workers feed a bounded queue; the caller drains it.
+# Length pruning + per-pair cutoff push-down into the kernel skip
+# most of the work at high thresholds.
+for score, q, t in sa.cdist_above_threshold(
+    qs, ts, scorer=sa.Scorer.LEVENSHTEIN_NORMALIZED, threshold=0.7,
+):
+    ...
+# Top-k by score — returns at most k highest-scoring (or lowest, for
+# distance scorers) (score, query, target) tuples. Heaps are
+# per-thread; a shared atomic global-min bound lets the per-pair
+# cutoff push-down lift the prune threshold as work progresses.
+sa.cdist_top_k(qs, ts, scorer=sa.Scorer.JARO, k=10)
+```
+At high thresholds the pruning is dramatic — see the cross-arch
+table in [BENCHMARK.md](BENCHMARK.md) (the `cdist pruning` rows).
+Loongson LASX in particular flips the expected ranking against
+Tiger Lake AVX-512 at T=0.99; the comparison report lives at
+[docs/loongson-vs-tiger-lake-cdist-2026-05-24.md](docs/loongson-vs-tiger-lake-cdist-2026-05-24.md).
-See [BENCHMARK.md](BENCHMARK.md) for cross-architecture numbers.
+See [BENCHMARK.md](BENCHMARK.md) for full cross-architecture numbers.
 ## Optimizations and Benchmarks
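
Reviewer note on the PKG-INFO hunk above: the `martha` / `marhta` values quoted in the README example can be checked against a plain scalar reference. This is a minimal sketch of the textbook Jaro and Jaro-Winkler definitions, not the package's SIMD kernel. The function names are hypothetical, and the prefix cap of 4 with a scaling factor of 0.1 are the conventional Winkler parameters, assumed here rather than confirmed as the package's defaults.

```python
def jaro_reference(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half-transpositions.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler_reference(a: str, b: str, scaling: float = 0.1) -> float:
    """Jaro plus a bonus for a shared prefix of up to four characters."""
    base = jaro_reference(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)
```

For `martha` and `marhta` all six characters match with one transposition, giving 17/18 (about 0.944) for Jaro and, with the shared prefix `mar`, about 0.961 for Jaro-Winkler, in line with the comments in the README example.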
