stride-align 0.1.0tar.gz → 0.2.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (194)

{stride_align-0.1.0 → stride_align-0.2.0}/.claude/settings.local.json RENAMED

@@ -29,7 +29,10 @@
       "Bash(git pull *)",
       "WebFetch(domain:adamdeprince.com)",
       "Bash(awk *)",
-      "Bash(~/.pyenv/bin/pyenv versions *)"
+      "Bash(~/.pyenv/bin/pyenv versions *)",
+      "Bash(scp *)",
+      "Bash(gh --version)",
+      "Bash(gh auth *)"
     ]
   }
 }

stride_align-0.2.0/11 ADDED

File without changes

stride_align-0.2.0/12 ADDED

File without changes

stride_align-0.2.0/5 ADDED

File without changes

{stride_align-0.1.0 → stride_align-0.2.0}/BENCHMARK.md RENAMED

@@ -29,6 +29,22 @@ ratio = baseline_median_seconds / stride_align_median_seconds
 | Loongson LoongArch64 | `linux_loongarch64_lasx` | patched parasail (1:1 score) | 16 | **7.517x** | 6.502x | 4.315x | 22.365x |
 | Loongson LoongArch64 | `linux_loongarch64_lasx` | generic (native) | 80 | **4.909x** | 5.149x | 0.499x | 29.707x |
 | Power8 VSX (Linux) | `linux_powerpc64_vsx` | generic (no parasail) | 80 | **3.772x** | 4.128x | 0.915x | 16.797x |
+| Levenshtein (Intel x86) | `x86_avx512bwvl` | python-Levenshtein | 14 | **1.159x** | 1.151x | 1.039x | 1.353x |
+| Levenshtein (Intel x86) | `x86_avx512bwvl` | rapidfuzz | 14 | 1.075x | 1.070x | 0.898x | 1.364x |
+| Levenshtein (Intel x86) | `x86_avx512bwvl` | editdistance | 14 | 13.564x | 13.758x | 11.099x | 15.880x |
+| Lev (long, >64 chars) | `x86_avx512bwvl` | rapidfuzz | 5 | **2.35x** | 2.55x | 1.45x | 2.88x |
+| Lev (1-vs-1, q>=100) | `x86_avx512bwvl` | rapidfuzz | 2 | **1.36x** | 1.36x | 1.34x | 1.39x |
+| Lev (cutoff, q=50) | `x86_avx512bwvl` | rapidfuzz | 3 | **3.91x** | 2.41x | 2.41x | 6.03x |
+| Damerau-Lev (short tgts) | `x86_avx512bwvl` | rapidfuzz | 4 | **3.13x** | 3.03x | 2.38x | 4.22x |
+| Damerau-Lev (medium tgts) | `x86_avx512bwvl` | rapidfuzz | 3 | 0.98x | 0.87x | 0.85x | 1.25x |
+| Lev (Mac M4 NEON, short tgts) | `macos_arm64_neon` | python-Levenshtein | 4 | **6.61x** | 6.42x | 5.49x | 8.54x |
+| Damerau-Lev (Mac M4 NEON, short tgts) | `macos_arm64_neon` | rapidfuzz OSA | 4 | **5.49x** | 5.43x | 4.35x | 7.45x |
+| Lev (Loongson LASX, mixed tgts) | `linux_loongarch64_lasx` | generic (no rapidfuzz wheel) | 7 | **2.17x** | 2.18x | 1.54x | 3.34x |
+| Damerau-Lev (Loongson LASX, mixed tgts) | `linux_loongarch64_lasx` | generic (no rapidfuzz wheel) | 6 | **1.43x** | 1.43x | 1.14x | 1.97x |
+| Lev (Graviton4, short tgts) | `linux_aarch64_neon`/`sve`/`sve2` | python-Levenshtein | 4 | **3.18x** | 3.06x | 2.67x | 4.05x |
+| Damerau-Lev (Graviton4, short tgts) | `linux_aarch64_neon`/`sve`/`sve2` | rapidfuzz OSA | 4 | **2.85x** | 2.83x | 2.27x | 3.89x |
+| Lev (Power8 VSX, mixed tgts) | `linux_powerpc64_vsx` | generic (no rapidfuzz wheel) | 8 | **2.40x** | 2.51x | 1.56x | 3.03x |
+| Damerau-Lev (Power8 VSX, mixed tgts) | `linux_powerpc64_vsx` | generic (no rapidfuzz wheel) | 7 | **1.99x** | 2.22x | 1.46x | 2.57x |
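
The four summary columns in this table follow from the ratio named in the hunk header (`baseline_median_seconds / stride_align_median_seconds`). A minimal sketch of that aggregation, assuming the per-row median timings are already available as two parallel lists. The name `summarize_ratios` is illustrative and is not taken from `tools/benchmark_libs.py`:

```python
from statistics import geometric_mean, median

def summarize_ratios(baseline_seconds, stride_align_seconds):
    """Aggregate per-row speedups into the columns used by the summary table."""
    # One ratio per benchmark row: baseline time divided by stride_align time.
    ratios = [b / s for b, s in zip(baseline_seconds, stride_align_seconds)]
    return {
        "rows": len(ratios),
        "geomean": geometric_mean(ratios),
        "median": median(ratios),
        "worst": min(ratios),   # smallest speedup (below 1.0 means a loss)
        "best": max(ratios),    # largest speedup
    }

# Illustrative timings only; these are not values from the CSV artifacts.
summary = summarize_ratios([2.0, 8.0, 4.0], [1.0, 1.0, 1.0])
print(summary)
```

Under this definition a ratio above 1.0x means `stride_align` was faster on that row, which is why the "Worst" column is the minimum.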
 ## Intel x86 - 2026-05-18
@@ -157,6 +173,457 @@ at width 16 (`sw-cigar` and `sw-path-info`); AVX512BWVL's worst row is
 It loses every score-only row badly; a handful of linear NW path/CIGAR rows
 are competitive but not consistently.
+## Levenshtein (Intel x86) - 2026-05-19
+Raw artifact: [`benchmarks/intel-levenshtein-2026-05-19.csv`](benchmarks/intel-levenshtein-2026-05-19.csv).
+Build context: same host as Intel x86 above (11th Gen Core i7-1195G7,
+Python 3.13, `taskset -c 2`), running on the `x86_avx512bwvl` backend. The
+multi-target Myers kernel runs one target per SIMD lane (8x 64-bit lanes
+under AVX512) and reads bytes / 1-byte unicode strings zero-copy from
+CPython buffers. Patterns over 64 chars fall through to the scalar
+Hyyrö multi-word dispatch in `levenshtein_dispatch.hpp`.
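
For readers unfamiliar with the bit-parallel recurrence mentioned here, the following is a scalar, single-target sketch in Python of the Myers/Hyyrö formulation. It is not the SIMD kernel from the package: Python integers are unbounded, so this version has no 64-character limit and no lanes, and `myers_levenshtein` is an illustrative name rather than a `stride_align` function.

```python
def myers_levenshtein(pattern, text):
    """Bit-parallel Levenshtein distance (scalar sketch, one target)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # keep every state word to m bits
    last = 1 << (m - 1)          # bit that tracks the bottom DP row

    # Per-character match masks: bit i is set where pattern[i] == ch.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    vp, vn = mask, 0             # vertical +1 / -1 delta vectors
    score = m                    # distance against the empty text prefix
    for ch in text:
        x = peq.get(ch, 0) | vn
        d0 = ((((x & vp) + vp) ^ vp) | x) & mask
        hp = (vn | ~(d0 | vp)) & mask   # horizontal +1 deltas
        hn = d0 & vp                    # horizontal -1 deltas
        if hp & last:
            score += 1
        if hn & last:
            score -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return score

print(myers_levenshtein("kitten", "sitting"))  # 3
```

A multi-target kernel of the kind described above would presumably keep `vp`, `vn` and `score` in one SIMD lane per target; that layout is not shown here.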
+Command:
+```bash
+taskset -c 2 .venv/bin/python tools/benchmark_libs.py \
+  --input-file kjv_subset.txt --levenshtein \
+  --iterations 25 --warmups 3 < lev_queries.txt > intel-levenshtein-2026-05-19.csv
+```
+Corpus: first 1000 lines of `demo/kjv.txt`. Queries: 14 single words and
+short phrases (3-29 chars) covering the pattern lengths that hit the
+SIMD fast path.
+### Overall
+| Backend | Rows | Geomean | Median | Worst | Best |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| `stride_align` vs `python-Levenshtein` | 14 | **1.159x** | 1.151x | 1.039x | 1.353x |
+| `stride_align` vs `rapidfuzz`          | 14 | 1.075x | 1.070x | 0.898x | 1.364x |
+| `stride_align` vs `editdistance`       | 14 | 13.564x | 13.758x | 11.099x | 15.880x |
+Per-call wall time at 1000 targets (median across 14 queries):
+| Library | µs/call | ns/target |
+| --- | ---: | ---: |
+| `stride_align` (`x86_avx512bwvl`) | **496** | **496** |
+| `rapidfuzz`                       | 540  | 540  |
+| `python-Levenshtein`              | 567  | 567  |
+| `editdistance`                    | 6806 | 6806 |
+### Takeaways
+The multi-target Myers kernel keeps stride-align ahead of every popular
+Python Levenshtein library on this corpus. python-Levenshtein loses by
+1.16x geomean across all 14 queries; rapidfuzz loses by 1.07x with one
+sub-parity row (0.898x on a 26-char query). editdistance is roughly
+13.5x slower, reflecting its pure-C scalar DP loop with no batching.
+The "vs rapidfuzz" worst row is the only sub-parity result of the
+sweep. rapidfuzz also uses bit-parallel Myers in its hot path, so the
+remaining headroom is mostly per-call overhead — list traversal, the
+Python ABI, the ndarray allocation — rather than the inner loop. The
+SIMD multi-target kernel pulls ahead on shorter queries where the
+per-target setup dominates.
+## Levenshtein extended (Intel x86) - 2026-05-19
+Raw artifact: [`benchmarks/intel-levenshtein-v2-2026-05-19.csv`](benchmarks/intel-levenshtein-v2-2026-05-19.csv).
+Three follow-up workloads that exercise the multi-word SIMD batch
+kernel (patterns 65-256 chars, in 64-char blocks W = 2/3/4), the
+zero-copy singular dispatch (no `prepare_alignment` vector copy when
+both inputs are bytes or 1-byte unicode), and the `score_cutoff`
+parameter with per-lane done masks and all-lanes early-exit. Same
+build host and pinning as the section above.
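
The block counts quoted above (65-256 chars as W = 2/3/4 in 64-char blocks) suggest a dispatch on the number of 64-bit words needed to hold the pattern. A minimal sketch of that rule, assuming W is simply the ceiling of `q_len / 64` and that patterns needing more than four words leave the SIMD path. The function name and the returned labels are hypothetical, not identifiers from the package:

```python
def select_levenshtein_kernel(q_len, word_bits=64, max_simd_words=4):
    """Pick a kernel family from the pattern length (illustrative only)."""
    # Ceiling division; an empty pattern still occupies one word.
    words = max(1, -(-q_len // word_bits))
    if words == 1:
        return ("single-word", 1)
    if words <= max_simd_words:
        return ("multi-word", words)
    return ("scalar", words)

for q_len in (40, 65, 128, 180, 200, 300):
    print(q_len, select_levenshtein_kernel(q_len))
```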
+### Long patterns (1-vs-200, no cutoff)
+| `q_len` | stride_align | python-Lev | rapidfuzz | vs Lev | vs rf |
+| ---: | ---: | ---: | ---: | ---: | ---: |
+| 40  |  95 µs | 110 µs |  95 µs | 1.16x | 1.00x |
+| 65  | 118 µs | 187 µs | 171 µs | **1.59x** | **1.45x** |
+| 100 | 118 µs | 324 µs | 357 µs | **2.75x** | **3.03x** |
+| 128 | 118 µs | 320 µs | 300 µs | **2.71x** | **2.54x** |
+| 180 | 135 µs | 405 µs | 384 µs | **3.00x** | **2.85x** |
+| 200 | 166 µs | 456 µs | 440 µs | **2.75x** | **2.65x** |
+Each lane in the AVX-512 kernel runs Hyyrö's wide-add carry chain over
+W blocks in parallel across 8 targets. The wide add uses two chained
+64-bit adds + `gt_u64` overflow detection (AVX-512 native unsigned
+`cmpgt`, AVX2 sign-bit-XOR + signed `cmpgt`, SSE4.1 sub-and-sign-bit
+emulation). q_len = 40 is the single-word kernel, which hits parity
+with rapidfuzz.
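
The carry detection described above can be illustrated in scalar form. This is a sketch of the general technique (two 64-bit adds joined by an unsigned greater-than test, with the sign-bit-XOR trick standing in for a missing unsigned compare), not the package's `gt_u64` implementation; all function names below are illustrative.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63

def as_signed64(x):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return x - (1 << 64) if x & SIGN_BIT else x

def gt_u64(a, b):
    """Unsigned a > b using only a signed comparison (sign-bit XOR trick)."""
    return as_signed64(a ^ SIGN_BIT) > as_signed64(b ^ SIGN_BIT)

def wide_add_128(a_lo, a_hi, b_lo, b_hi):
    """Add two 128-bit values held as (lo, hi) pairs of 64-bit words."""
    lo = (a_lo + b_lo) & MASK64
    # The low add wrapped around exactly when the result is below an input.
    carry = 1 if gt_u64(a_lo, lo) else 0
    hi = (a_hi + b_hi + carry) & MASK64
    return lo, hi

lo, hi = wide_add_128(MASK64, 0, 1, 0)
print(hex(lo), hex(hi))  # 0x0 0x1
```

Flipping the sign bit of both operands maps unsigned order onto signed order, which is why a signed `cmpgt` suffices on instruction sets without an unsigned 64-bit compare.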
+### 1-vs-1 singular (zero-copy dispatch)
+| `q_len` | stride_align | python-Lev | rapidfuzz |
+| ---: | ---: | ---: | ---: |
+| 10  | 0.20 µs | 0.25 µs | **0.17 µs** |
+| 30  | 0.27 µs | 0.31 µs | **0.24 µs** |
+| 60  | 0.36 µs | 0.40 µs | **0.32 µs** |
+| 100 | **0.90 µs** | 1.30 µs | 1.21 µs |
+| 200 | **2.35 µs** | 3.24 µs | 3.14 µs |
+When both inputs are bytes or 1-byte unicode the singular path skips
+the prepare\_alignment vector copy and runs scalar Myers directly on
+the CPython buffer (`PyBytes_AsStringAndSize` / `PyUnicode_1BYTE_DATA`).
+We trail rapidfuzz by ~10% under 60 chars (Python ABI overhead, no
+algorithmic gap) and pull ~1.35x ahead from 100 chars onward, where
+the multi-word inner loop dominates.
+### score_cutoff (5000 targets, short query)
+stride_align vs rapidfuzz with matching cutoff:
+| `q_len` | cutoff | stride_align | rapidfuzz | ratio |
+| ---: | ---: | ---: | ---: | ---: |
+| 10 |  2 | **277 µs** |  353 µs | 1.27x |
+| 10 |  5 | **301 µs** |  472 µs | 1.57x |
+| 10 | 10 | **342 µs** |  556 µs | 1.62x |
+| 30 |  7 | **190 µs** |  486 µs | 2.56x |
+| 30 | 15 | **328 µs** |  682 µs | 2.08x |
+| 30 | 30 | **410 µs** |  866 µs | 2.11x |
+| 50 | 12 | **49 µs**  |  297 µs | **6.03x** |
+| 50 | 25 | **204 µs** |  491 µs | 2.41x |
+| 50 | 50 | **410 µs** | 1119 µs | 2.73x |
+Per-lane done masks freeze score updates once a lane crosses
+`cutoff + remaining_chars`, and the column loop breaks as soon as every
+batch lane is settled (target exhausted or bailed). The biggest win
+(`q_len=50`, `cutoff=12`, 6x) is where most targets exceed cutoff after
+a handful of columns and the whole batch can short-circuit.
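
The bail rule can be shown without any SIMD. The sketch below uses a plain column-by-column DP (not the bit-parallel kernel) and stops as soon as the bottom-row score exceeds `score_cutoff + remaining_chars`, since that score can drop by at most one per remaining target character. It returns `score_cutoff + 1` for over-cutoff results, following the convention stated in the README hunk of this diff; `levenshtein_with_cutoff` is an illustrative name.

```python
def levenshtein_with_cutoff(query, target, score_cutoff):
    """Levenshtein distance, or score_cutoff + 1 once the cutoff is unreachable."""
    m = len(query)
    col = list(range(m + 1))              # distances against the empty target prefix
    for j, tch in enumerate(target, start=1):
        prev_diag = col[0]
        col[0] = j
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == tch else 1
            cur = min(col[i] + 1, col[i - 1] + 1, prev_diag + cost)
            prev_diag = col[i]
            col[i] = cur
        remaining_chars = len(target) - j
        # Even if every remaining column lowered the score by one,
        # the final distance would still exceed the cutoff.
        if col[m] > score_cutoff + remaining_chars:
            return score_cutoff + 1
    return col[m] if col[m] <= score_cutoff else score_cutoff + 1

print(levenshtein_with_cutoff("kitten", "sitting", 3))  # 3
print(levenshtein_with_cutoff("kitten", "sitting", 2))  # 3, i.e. cutoff + 1
```

In a batched kernel the same test would be evaluated per lane, with the column loop ending only once every lane has either finished its target or bailed.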
+### Where rapidfuzz still wins
+Long patterns *with* tight cutoff (e.g. `q=100`, `cutoff=20` over
+50-250-char targets): rapidfuzz 110 µs vs stride_align 402 µs (0.27x).
+Our cutoff bail condition `score > cutoff + remaining_chars` only
+fires near the end of the column loop because `remaining_chars` shrinks
+slowly. rapidfuzz uses **banded Myers** here, restricting the DP to a
+2K+1 diagonal band so the work drops to O(K·n) instead of O(m·n).
+Banded SIMD Myers is a separate kernel and isn't implemented in
+stride-align yet — see "Future work" below.
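
To make the banding idea concrete, here is a scalar DP sketch that only evaluates cells within K of the main diagonal. It illustrates the band restriction itself, not rapidfuzz's banded Myers and not a planned `stride-align` kernel; `banded_levenshtein` is an illustrative name, and for clarity it allocates full rows even though only the 2K+1 in-band cells per row are computed.

```python
def banded_levenshtein(a, b, k):
    """Exact distance if it is <= k, otherwise k + 1."""
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return k + 1                      # length gap alone exceeds the cutoff
    inf = k + 1                           # cap for out-of-band or over-cutoff cells
    prev = [j if j <= k else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        lo, hi = max(1, i - k), min(n, i + k)
        cur = [inf] * (n + 1)             # a real kernel would store only the band
        if i <= k:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, inf)
        prev = cur
    return min(prev[n], inf)

print(banded_levenshtein("kitten", "sitting", 3))  # 3
print(banded_levenshtein("kitten", "sitting", 2))  # 3, i.e. k + 1
```

Any alignment with at most K edits stays within K of the diagonal, so cells outside the band can be treated as "over cutoff" without changing in-cutoff results.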
+### Future work
+- **Banded Myers** for tight-cutoff long-pattern workloads (the one
+  remaining rapidfuzz win). Restrict per-lane state to ±K diagonals
+  from the main; sliding window across columns.
+- **Pattern lengths > 256**: the multi-word SIMD kernel currently caps
+  at W=4. Extending to W=8 (pattern up to 512) is a recompile.
+## Damerau-Levenshtein / OSA (Intel x86) - 2026-05-19
+Raw artifact: [`benchmarks/intel-damerau-levenshtein-2026-05-19.csv`](benchmarks/intel-damerau-levenshtein-2026-05-19.csv).
+Build context: same host as the Levenshtein section (11th Gen Core
+i7-1195G7, Python 3.13, `taskset -c 2`). The algorithm is OSA-restricted
+(Optimal String Alignment) Damerau-Levenshtein: like Levenshtein but
+adjacent transpositions cost 1 instead of two substitutions, and each
+character can participate in at most one edit. Hyyrö's bit-parallel
+recurrence (the `TR = (((~D0_prev) & PM) << 1) & PM_old` formulation
+that rapidfuzz also uses), wrapped in the same multi-target SIMD batch
+architecture as our Levenshtein kernel — one target per SIMD lane
+(2/4/8 lanes for SSE4.1/AVX2/AVX-512).
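
As a reference for what "OSA-restricted" means, the following is the textbook quadratic DP for Optimal String Alignment distance. It is a plain sketch for checking small cases, not the bit-parallel recurrence quoted above, and `osa_distance` is an illustrative name.

```python
def osa_distance(a, b):
    """Optimal String Alignment distance via the classic O(m*n) table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(d[i - 1][j] + 1,         # deletion
                       d[i][j - 1] + 1,         # insertion
                       d[i - 1][j - 1] + cost)  # match or substitution
            # Adjacent transposition at unit cost.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)
            d[i][j] = best
    return d[m][n]

print(osa_distance("ab", "ba"))    # 1
print(osa_distance("ca", "abc"))   # 3
```

The second case shows the restriction: unrestricted Damerau-Levenshtein would score `"ca"` vs `"abc"` as 2 by transposing and then inserting between the swapped characters, which OSA disallows because each character may take part in only one edit.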
+### Short targets (1-vs-1000, 3-15 char corpus)
+This is the SIMD batch sweet spot: short alignments amortize the
+gather + state-shift cost across 8 lanes, and rapidfuzz's per-pair
+overhead dominates its loop.
+| `q_len` | stride_align | rapidfuzz | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 | **41 µs** |  99 µs | 2.38x |
+| 10 | **42 µs** | 108 µs | 2.59x |
+| 20 | **42 µs** | 176 µs | **4.22x** |
+| 30 | **47 µs** | 163 µs | 3.47x |
+### Medium targets (1-vs-200, 30-250 char corpus)
+| `q_len` | stride_align | rapidfuzz | ratio |
+| ---: | ---: | ---: | ---: |
+| 10 | 117 µs | **100 µs** | 0.85x |
+| 30 | 117 µs | **102 µs** | 0.87x |
+| 64 | **117 µs** | 147 µs | 1.25x |
+For medium-target workloads we trail rapidfuzz by ~15% under 60 chars
+(their inner loop is slightly tighter — fewer SIMD ops per column),
+then pull ahead at q_len=64 where their bit-parallel fallback path
+kicks in.
+### 1-vs-1 singular
+| `q_len` | stride_align | rapidfuzz | ratio |
+| ---: | ---: | ---: | ---: |
+| 10 | 0.18 µs | 0.15 µs | 0.85x |
+| 30 | 0.23 µs | 0.21 µs | 0.92x |
+| 60 | 0.35 µs | 0.32 µs | 0.90x |
+Per-call Python ABI dominates; we're within 15% of rapidfuzz on every
+length.
+### API
+```python
+import stride_align
+# Singular
+stride_align.damerau_levenshtein_score(query, target)            # int
+stride_align.damerau_levenshtein_normalized_score(query, target) # float in [0, 1]
+# Batch (returns numpy ndarray)
+stride_align.damerau_levenshtein_scores(query, targets)             # int64
+stride_align.damerau_levenshtein_normalized_scores(query, targets)  # float64
+```
+Backends specialized for the new SIMD batch kernel: `x86_sse41`,
+`x86_avx2`, `x86_avx512bwvl`, `x86_avx10_256`, `x86_avx10_512`, and
+(added 2026-05-19) `macos_arm64_neon` and `linux_aarch64_neon` via the
+shared `NeonOps` bundle. Other architectures (SVE / Loongson / Power)
+still fall through to the shared scalar bit-parallel dispatch and
+remain correct.
+## Levenshtein + Damerau-Levenshtein (Power8 VSX) - 2026-05-19
+Raw artifact: [`benchmarks/power8-lev-osa-2026-05-19.txt`](benchmarks/power8-lev-osa-2026-05-19.txt).
+Build context: Power8 KVM-virtualized core (4.157 GHz), Ubuntu 20.04,
+Python 3.13, AT15.0 GCC 11.4 at `/opt/at15.0/bin/g++` (the system GCC
+9.4 is too old for `cxx_std_23`), CMake 4.3.2. `VsxOps` is the new
+128-bit / 2-lane bundle (same shape as SSE / NEON / LSX) using
+`__vector unsigned long long` and Altivec intrinsics. Power8's ISA
+2.07 has native unsigned `vec_cmpgt` for 64-bit lanes so the kernel
+ports without emulation.
+No rapidfuzz / python-Levenshtein wheels exist for ppc64le on PyPI;
+comparison is against our generic backend (which runs tight bit-parallel
+Myers/OSA scalars).
+### Levenshtein 1-vs-1000 short (3-15 char corpus)
+| `q_len` | generic | VSX | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 | 103 µs | **44 µs** | 2.37x |
+| 10 | 109 µs | **44 µs** | 2.49x |
+| 20 | 111 µs | **44 µs** | 2.54x |
+| 30 | 125 µs | **44 µs** | **2.87x** |
+### Levenshtein 1-vs-200 medium (30-250 char corpus)
+| `q_len` | generic | VSX | ratio |
+| ---: | ---: | ---: | ---: |
+|  10 | 153 µs |  98 µs | 1.56x |
+|  64 | 200 µs |  98 µs | 2.04x |
+| 100 | 468 µs | 172 µs | 2.72x |
+| 200 | 675 µs | 223 µs | **3.03x** |
+Multi-word kernel takes over at q=100; the ratio grows because the
+W-block SIMD scales with q while generic scalar's overhead scales
+linearly with q too.
+### Damerau-Levenshtein 1-vs-1000 short
+| `q_len` | generic | VSX | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 | 109 µs | **48 µs** | 2.25x |
+| 10 | 108 µs | **48 µs** | 2.22x |
+| 20 | 112 µs | **48 µs** | 2.30x |
+| 30 | 124 µs | **48 µs** | **2.57x** |
+### Damerau-Levenshtein 1-vs-200 medium
+| `q_len` | generic | VSX | ratio |
+| ---: | ---: | ---: | ---: |
+| 10 | 159 µs | 109 µs | 1.46x |
+| 32 | 182 µs | 109 µs | 1.67x |
+| 64 | 204 µs | 109 µs | **1.87x** |
+Unlike LSX (which trails generic on the 2-lane Damerau medium
+workload), Power8 VSX wins consistently here. Power8's faster
+`vec_extract` for the per-lane gather setup and the native unsigned
+`vec_cmpgt` keep the 2-lane SIMD competitive even on short-pattern
+inner loops.
+## Levenshtein + Damerau-Levenshtein (Graviton4 NEON/SVE/SVE2) - 2026-05-19
+Raw artifact: [`benchmarks/graviton4-lev-osa-2026-05-19.csv`](benchmarks/graviton4-lev-osa-2026-05-19.csv).
+Build context: AWS Graviton4 (Neoverse V2, 1 vCPU c8g.medium), Ubuntu
+24.04, Python 3.14, GCC 13.x. The Graviton4 host has only 1.8 GiB RAM
+and 1 vCPU, so the build required `CMAKE_BUILD_PARALLEL_LEVEL=1` and a
+4 GiB swapfile to keep cc1plus from OOM-killing on the template-heavy
+TUs.
+All three ARM backends (`linux_aarch64_neon`, `linux_aarch64_sve`,
+`linux_aarch64_sve2`) share the same SIMD path: the SVE backends are
+built with `-msve-vector-bits=128`, so they hold the same 2 lanes of
+64-bit as NEON. Both wire through `NeonOps` rather than a separate
+`SveOps` bundle — the bit-parallel Lev/OSA kernel uses no
+SVE-specific feature.
+### Levenshtein 1-vs-1000 short (3-15 char corpus)
+| `q_len` | stride_align | python-Levenshtein | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 |  53 µs | 140 µs | 2.67x |
+| 10 |  53 µs | 148 µs | 2.80x |
+| 20 |  53 µs | 176 µs | 3.33x |
+| 30 |  53 µs | 213 µs | **4.05x** |
+### Levenshtein 1-vs-200 medium (30-250 char corpus)
+| `q_len` | stride_align | python-Levenshtein | ratio |
+| ---: | ---: | ---: | ---: |
+|  10 | 150 µs | **119 µs** | 0.80x |
+|  32 | 150 µs | **121 µs** | 0.81x |
+|  64 | 150 µs | **127 µs** | 0.85x |
+| 100 | **163 µs** | 256 µs | 1.57x |
+| 200 | **224 µs** | 434 µs | 1.94x |
+Single-word multi-target (q ≤ 64): we trail python-Levenshtein on
+medium-length targets because the per-target SIMD setup outpaces the
+2-lane parallelism gain. Multi-word kicks in at q=100; we then pull
+ahead 1.57-1.94x.
+### Damerau-Levenshtein 1-vs-1000 short
+| `q_len` | stride_align | rapidfuzz OSA | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 |  57 µs | 129 µs | 2.27x |
+| 10 |  57 µs | 142 µs | 2.50x |
+| 20 |  57 µs | 179 µs | 3.15x |
+| 30 |  57 µs | 221 µs | **3.89x** |
+### NEON vs SVE vs SVE2
+All three ARM backends produce identical results and identical
+performance (53 µs for q=10 short, etc.). The auto-detect picks SVE2
+on Graviton4 since it ranks first in the priority list, but routing
+through `NeonOps` means swapping backends is observationally a no-op.
+## Levenshtein + Damerau-Levenshtein (Loongson LASX/LSX) - 2026-05-19
+Raw artifact: [`benchmarks/loongson-lev-osa-2026-05-19.txt`](benchmarks/loongson-lev-osa-2026-05-19.txt).
+Build context: Loongson 3A6000 (LoongArch64), Kylin V10 SP1, Python
+3.13, GCC 15.2.0 (`/opt/loongson-gcc-15.2.0`), CMake 4.3.2. No
+rapidfuzz / python-Levenshtein wheels exist for LoongArch on PyPI, so
+the comparison is against our generic scalar backend (which already
+runs the bit-parallel Myers / OSA kernels in tight C++).
+`LsxOps` is 128-bit / 2 lanes (similar to SSE & NEON);
+`LasxOps` is 256-bit / 4 lanes (similar to AVX2).
+### Caveat on `vandn`
+Initial port had LSX/LASX returning negative scores on simple inputs.
+Root cause: `__lsx_vandn_v(a, b)` returns `~a & b` (Intel-style),
+contrary to what the LoongArch ISA reference's mnemonic name "VANDN"
+suggested. The fix was a single-line operand swap; correctness on the
+generic-reference test set is now 3200/3200 across q_lens 10/32/64/100.
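
The operand-order pitfall is easy to reproduce in scalar form. The sketch below models the two possible "and-not" conventions on 64-bit values; which one the original port assumed is not shown in this diff, and both function names are illustrative.

```python
MASK64 = (1 << 64) - 1

def andn_negate_first(a, b):
    """(~a) & b: the order the text reports for `__lsx_vandn_v(a, b)`."""
    return (~a & b) & MASK64

def andn_negate_second(a, b):
    """a & (~b): the other order a porter might assume from the mnemonic."""
    return (a & ~b) & MASK64

a, b = 0b1100, 0b1010
print(bin(andn_negate_first(a, b)))    # 0b10
print(bin(andn_negate_second(a, b)))   # 0b100

# Swapping the operands converts one convention into the other,
# which matches the single-line fix described above.
assert andn_negate_first(b, a) == andn_negate_second(a, b)
```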
+### Levenshtein 1-vs-1000 short (3-15 char corpus)
+| `q_len` | generic | LSX | LASX |
+| ---: | ---: | ---: | ---: |
+|  5 | 103 µs | 67 µs (1.54x) | **49 µs (2.10x)** |
+| 10 | 108 µs | 67 µs (1.61x) | **49 µs (2.18x)** |
+| 30 | 126 µs | 67 µs (1.88x) | **49 µs (2.56x)** |
+### Levenshtein 1-vs-200 medium (30-250 char corpus)
+| `q_len` | generic | LSX | LASX |
+| ---: | ---: | ---: | ---: |
+|  10 | 175 µs | 162 µs (1.08x) | **114 µs (1.54x)** |
+|  64 | 185 µs | 162 µs (1.14x) | **114 µs (1.63x)** |
+| 100 | 492 µs | 203 µs (2.43x) | **147 µs (3.34x)** |
+| 200 | 777 µs | 351 µs (2.22x) | **260 µs (2.99x)** |
+The multi-word kernel (q_len > 64, W=2/3) pulls ahead more
+dramatically than the single-word range because the SIMD batch
+amortizes the wide-add carry chain across 4 lanes (LASX) on the
+LoongArch's 3 GHz cores.
+### Damerau-Levenshtein 1-vs-1000 short
+| `q_len` | generic | LSX | LASX |
+| ---: | ---: | ---: | ---: |
+|  5 |  97 µs | 91 µs (1.06x) | **62 µs (1.57x)** |
+| 10 | 102 µs | 91 µs (1.12x) | **61 µs (1.66x)** |
+| 30 | 121 µs | 91 µs (1.32x) | **61 µs (1.97x)** |
+### Damerau-Levenshtein 1-vs-200 medium
+| `q_len` | generic | LSX | LASX |
+| ---: | ---: | ---: | ---: |
+| 10 | 176 µs | 249 µs (0.71x) | **155 µs (1.14x)** |
+| 32 | 182 µs | 249 µs (0.73x) | **155 µs (1.17x)** |
+| 64 | 189 µs | 249 µs (0.76x) | **155 µs (1.21x)** |
+LSX trails the generic backend on the OSA medium workload: 2 lanes of
+SIMD overhead (extra gather / mask / state-shift cost) outpaces the
+parallelism gain when the per-target scalar bit-parallel Myers loop is
+already tight. LASX keeps 4-lane parallelism worthwhile. Backend
+auto-detect picks LASX where available, so this only matters on
+machines that lack LASX.
+## Levenshtein + Damerau-Levenshtein (Mac M4 NEON) - 2026-05-19
+Raw artifact: [`benchmarks/macos-arm64-neon-lev-osa-2026-05-19.csv`](benchmarks/macos-arm64-neon-lev-osa-2026-05-19.csv).
+Build context: Apple M4 (T6041), macOS 15.x, Python 3.13 in the
+project virtualenv. Uses the new `macos_arm64_neon` SIMD batch kernel
+(2 lanes × 64-bit, NEON intrinsics in `levenshtein_simd_ops.hpp`). The
+Mac is 2-lane (NEON 128-bit), so per-call SIMD speedup is smaller than
+on AVX-512 (8 lanes); the win comes from skipping Python ABI per-pair
+overhead on the batch path.
+### Levenshtein, 1-vs-1000 short targets (3-15 char corpus)
+| `q_len` | stride_align | python-Levenshtein | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 | **17 µs** |  92 µs | 5.49x |
+| 10 | **17 µs** |  97 µs | 5.80x |
+| 20 | **17 µs** | 118 µs | 7.04x |
+| 30 | **17 µs** | 143 µs | **8.54x** |
+### Levenshtein, 1-vs-200 medium targets (30-250 char corpus)
+| `q_len` | stride_align | python-Levenshtein | ratio |
+| ---: | ---: | ---: | ---: |
+|  10 |  80 µs |  89 µs | 1.12x |
+|  32 |  80 µs |  94 µs | 1.18x |
+|  64 |  80 µs |  95 µs | 1.19x |
+| 100 |  94 µs | 152 µs | 1.61x |
+| 200 | **112 µs** | 260 µs | **2.33x** |
+The 100/200-char rows exercise the multi-word kernel (W=2/3); the
+ratio grows because python-Levenshtein's overhead scales with pattern
+length while our W-block SIMD scales with `q_len / (64 * lanes)`.
+### Damerau-Levenshtein, 1-vs-1000 short
+| `q_len` | stride_align | rapidfuzz OSA | ratio |
+| ---: | ---: | ---: | ---: |
+|  5 | **20 µs** |  87 µs | 4.35x |
+| 10 | **20 µs** |  95 µs | 4.81x |
+| 20 | **20 µs** | 120 µs | 6.06x |
+| 30 | **20 µs** | 148 µs | **7.45x** |
+### 1-vs-1 singular
+Parity territory — Python ABI dominates, no algorithmic gap.
+| `q_len` | Lev sa / Lev rf | OSA sa / OSA rf |
+| ---: | ---: | ---: |
+| 10 | 0.13 / 0.13 µs | 0.13 / 0.13 µs |
+| 30 | 0.21 / 0.21 µs | **0.17** / 0.21 µs |
+| 60 | **0.29** / 0.33 µs | 0.29 / 0.29 µs |
 ## ARM Graviton4 (Linux aarch64) - 2026-05-18
 Raw artifacts:

{stride_align-0.1.0 → stride_align-0.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: stride-align
-Version: 0.1.0
+Version: 0.2.0
 Summary: Smith-Waterman and Needleman-Wunsch alignments with a nanobind C++23 backend.
 Author: Adam
 License-Expression: Apache-2.0
@@ -28,9 +28,6 @@ Description-Content-Type: text/markdown
 # stride-align
-**Languages:** **[English](README.md)** · [简体中文](README.zh-CN.md) · [繁體中文](README.zh-TW.md) · [日本語](README.ja.md) · [Deutsch](README.de.md) · [한국어](README.ko.md) · [Français](README.fr.md) · [Español](README.es.md) · [Português do Brasil](README.pt-BR.md) · [Русский](README.ru.md) · [Tiếng Việt](README.vi.md) · [Bahasa Indonesia](README.id.md) · [हिन्दी](README.hi.md) · [العربية](README.ar.md) · [Türkçe](README.tr.md) · [Polski](README.pl.md)
 `stride-align` is a [blazing fast library](BENCHMARK.md) to tell you how "similar" two strings are.
 It does this by implementing the Smith-Waterman and Needleman-Wunsch
 algorithms. Instead of giving you a lecture, we're going to learn by
@@ -43,13 +40,24 @@ pip install stride-align
 ```
 On Loongson systems, install NumPy from your Linux distribution before
-installing `stride-align`:
+installing `stride-align`, and grab the LoongArch64 wheel from the
+GitHub release instead of PyPI (PyPI does not yet accept the
+`linux_loongarch64` or `manylinux_2_38_loongarch64` platform tags):
 ```bash
 sudo apt install python3-numpy
-pip install stride-align
+PY=$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')
+pip install \
+  https://github.com/adamdeprince/stride-align/releases/download/v0.2.0/stride_align-0.2.0-${PY}-${PY}-linux_loongarch64.whl
 ```
+Prebuilt LoongArch64 wheels are available for Python 3.10, 3.11, 3.12,
+3.13, and 3.14. If you are on a different Python (or just want to
+build from source), `pip install stride-align` falls back to the
+source distribution on PyPI, which compiles the LSX/LASX kernels
+locally.
 First, just a disclaimer: I'm not using religious texts here to push
 an agenda - for this demo I need multiple largish public domain
 documents that have the same meaning but are phrased differently. The
@@ -318,6 +326,47 @@ Gapped Alignment Report". CIGAR is the compact alignment-operation notation
 used by SAM/BAM tooling. If you want the full formal version, see the
 [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf).
+### Levenshtein and Damerau-Levenshtein
+Beyond Smith-Waterman and Needleman-Wunsch, `stride-align` exposes two
+unit-cost edit-distance metrics with their own SIMD-batched code paths:
+```python
+import stride_align
+# Levenshtein (Myers 1999 bit-parallel) — inserts, deletes, substitutes
+stride_align.levenshtein_score("kitten", "sitting")               # -> 3
+stride_align.levenshtein_normalized_score("kitten", "sitting")    # -> 0.571...
+stride_align.levenshtein_scores("kitten", ["kit", "sitting"])     # -> ndarray[int64]
+stride_align.levenshtein_normalized_scores("kitten", targets)     # -> ndarray[float64]
+# Optional `score_cutoff` (rapidfuzz convention): bail early per-target,
+# results that exceed the cutoff come back as `cutoff + 1`.
+stride_align.levenshtein_scores(query, targets, score_cutoff=3)
+# Damerau-Levenshtein (OSA-restricted, Hyyrö 2002) — adds adjacent
+# transposition at unit cost. This is what rapidfuzz exposes as
+# OSA.distance and is what most callers asking for
+# "Damerau-Levenshtein" actually want.
+stride_align.damerau_levenshtein_score("ab", "ba")                # -> 1
+stride_align.damerau_levenshtein_scores(query, targets)           # -> ndarray[int64]
+```
+Both algorithms use a bit-parallel Myers-style inner loop. The batch
+variants pack one target per SIMD lane (`*_scores`) and currently
+specialize on every architecture's primary 64-bit-lane SIMD:
+- x86: SSE4.1 / AVX2 / AVX-512 / AVX10-256 / AVX10-512
+- ARM: NEON (Linux + macOS), SVE / SVE2
+- LoongArch: LSX / LASX
+- PowerPC: VSX
+Patterns up to 64 chars run a single-word Myers; 65-256 chars use the
+multi-word kernel (W=2/3/4). Beyond 256, the implementation falls
+through to a scalar bit-parallel dispatch.
+See [BENCHMARK.md](BENCHMARK.md) for cross-architecture numbers.
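
The `0.571...` shown for `levenshtein_normalized_score("kitten", "sitting")` is consistent with a similarity of `1 - distance / max(len(query), len(target))`, i.e. `1 - 3/7`. That formula is inferred from this one example and is an assumption, not something the diff states; the sketch below only reproduces the arithmetic and does not call `stride_align`.

```python
def normalized_similarity(distance, len_a, len_b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for maximal distance."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 1.0                      # two empty strings are identical
    return 1.0 - distance / longest

value = normalized_similarity(3, len("kitten"), len("sitting"))
print(round(value, 3))  # 0.571
```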
 ## Optimizations and Benchmarks
 Careful attention has been, and continues to be, paid to `stride-align`'s

{stride_align-0.1.0 → stride_align-0.2.0}/README.md RENAMED

@@ -1,8 +1,5 @@
 # stride-align
-**Languages:** **[English](README.md)** · [简体中文](README.zh-CN.md) · [繁體中文](README.zh-TW.md) · [日本語](README.ja.md) · [Deutsch](README.de.md) · [한국어](README.ko.md) · [Français](README.fr.md) · [Español](README.es.md) · [Português do Brasil](README.pt-BR.md) · [Русский](README.ru.md) · [Tiếng Việt](README.vi.md) · [Bahasa Indonesia](README.id.md) · [हिन्दी](README.hi.md) · [العربية](README.ar.md) · [Türkçe](README.tr.md) · [Polski](README.pl.md)
 `stride-align` is a [blazing fast library](BENCHMARK.md) to tell you how "similar" two strings are.
 It does this by implementing the Smith-Waterman and Needleman-Wunsch
 algorithms. Instead of giving you a lecture, we're going to learn by
@@ -15,13 +12,24 @@ pip install stride-align
 ```
 On Loongson systems, install NumPy from your Linux distribution before
-installing `stride-align`:
+installing `stride-align`, and grab the LoongArch64 wheel from the
+GitHub release instead of PyPI (PyPI does not yet accept the
+`linux_loongarch64` or `manylinux_2_38_loongarch64` platform tags):
 ```bash
 sudo apt install python3-numpy
-pip install stride-align
+PY=$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')
+pip install \
+  https://github.com/adamdeprince/stride-align/releases/download/v0.2.0/stride_align-0.2.0-${PY}-${PY}-linux_loongarch64.whl
 ```
+Prebuilt LoongArch64 wheels are available for Python 3.10, 3.11, 3.12,
+3.13, and 3.14. If you are on a different Python (or just want to
+build from source), `pip install stride-align` falls back to the
+source distribution on PyPI, which compiles the LSX/LASX kernels
+locally.
 First, just a disclaimer: I'm not using religious texts here to push
 an agenda - for this demo I need multiple largish public domain
 documents that have the same meaning but are phrased differently. The
@@ -290,6 +298,47 @@ Gapped Alignment Report". CIGAR is the compact alignment-operation notation
 used by SAM/BAM tooling. If you want the full formal version, see the
 [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf).
+### Levenshtein and Damerau-Levenshtein
+Beyond Smith-Waterman and Needleman-Wunsch, `stride-align` exposes two
+unit-cost edit-distance metrics with their own SIMD-batched code paths:
+```python
+import stride_align
+# Levenshtein (Myers 1999 bit-parallel) — inserts, deletes, substitutes
+stride_align.levenshtein_score("kitten", "sitting")               # -> 3
+stride_align.levenshtein_normalized_score("kitten", "sitting")    # -> 0.571...
+stride_align.levenshtein_scores("kitten", ["kit", "sitting"])     # -> ndarray[int64]
+stride_align.levenshtein_normalized_scores("kitten", targets)     # -> ndarray[float64]
+# Optional `score_cutoff` (rapidfuzz convention): bail early per-target,
+# results that exceed the cutoff come back as `cutoff + 1`.
+stride_align.levenshtein_scores(query, targets, score_cutoff=3)
+# Damerau-Levenshtein (OSA-restricted, Hyyrö 2002) — adds adjacent
+# transposition at unit cost. This is what rapidfuzz exposes as
+# OSA.distance and is what most callers asking for
+# "Damerau-Levenshtein" actually want.
+stride_align.damerau_levenshtein_score("ab", "ba")                # -> 1
+stride_align.damerau_levenshtein_scores(query, targets)           # -> ndarray[int64]
+```
+Both algorithms use a bit-parallel Myers-style inner loop. The batch
+variants pack one target per SIMD lane (`*_scores`) and currently
+specialize on every architecture's primary 64-bit-lane SIMD:
+- x86: SSE4.1 / AVX2 / AVX-512 / AVX10-256 / AVX10-512
+- ARM: NEON (Linux + macOS), SVE / SVE2
+- LoongArch: LSX / LASX
+- PowerPC: VSX
+Patterns up to 64 chars run a single-word Myers; 65-256 chars use the
+multi-word kernel (W=2/3/4). Beyond 256, the implementation falls
+through to a scalar bit-parallel dispatch.
+See [BENCHMARK.md](BENCHMARK.md) for cross-architecture numbers.
 ## Optimizations and Benchmarks
 Careful attention has been, and continues to be, paid to `stride-align`'s
