stride-align 0.1.0__tar.gz → 0.2.0__tar.gz

Files changed (194)
  1. {stride_align-0.1.0 → stride_align-0.2.0}/.claude/settings.local.json +4 -1
  2. stride_align-0.2.0/11} +0 -0
  3. stride_align-0.2.0/12} +0 -0
  4. stride_align-0.2.0/5} +0 -0
  5. {stride_align-0.1.0 → stride_align-0.2.0}/BENCHMARK.md +467 -0
  6. {stride_align-0.1.0 → stride_align-0.2.0}/PKG-INFO +55 -6
  7. {stride_align-0.1.0 → stride_align-0.2.0}/README.md +54 -5
  8. stride_align-0.2.0/benchmarks/graviton4-lev-osa-2026-05-19.csv +17 -0
  9. stride_align-0.2.0/benchmarks/intel-damerau-levenshtein-2026-05-19.csv +11 -0
  10. stride_align-0.2.0/benchmarks/intel-levenshtein-2026-05-19.csv +15 -0
  11. stride_align-0.2.0/benchmarks/intel-levenshtein-v2-2026-05-19.csv +24 -0
  12. stride_align-0.2.0/benchmarks/loongson-lev-osa-2026-05-19.txt +20 -0
  13. stride_align-0.2.0/benchmarks/macos-arm64-neon-lev-osa-2026-05-19.csv +23 -0
  14. stride_align-0.2.0/benchmarks/power8-lev-osa-2026-05-19.txt +22 -0
  15. {stride_align-0.1.0 → stride_align-0.2.0}/html/index.html +13 -21
  16. stride_align-0.2.0/include/stride_align/levenshtein.hpp +486 -0
  17. {stride_align-0.1.0 → stride_align-0.2.0}/pyproject.toml +1 -1
  18. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/arm_neon128.hpp +37 -0
  19. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_aarch64_neon.hpp +28 -0
  20. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_aarch64_sve.hpp +41 -0
  21. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_aarch64_sve2.hpp +38 -0
  22. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_loongarch64_lasx.hpp +33 -0
  23. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_loongarch64_lsx.hpp +33 -0
  24. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_powerpc64_vsx.hpp +40 -0
  25. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_riscv64_rvv.hpp +2 -0
  26. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/macos_arm64_neon.hpp +28 -0
  27. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/swar.hpp +2 -0
  28. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx10_256.hpp +62 -0
  29. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx10_512.hpp +61 -0
  30. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx2.hpp +61 -0
  31. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx512bwvl.hpp +61 -0
  32. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_sse41.hpp +33 -0
  33. stride_align-0.2.0/src/cpp/byte_view.hpp +51 -0
  34. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/farrar_preprocess.hpp +10 -14
  35. stride_align-0.2.0/src/cpp/levenshtein_dispatch.hpp +333 -0
  36. stride_align-0.2.0/src/cpp/levenshtein_simd.hpp +570 -0
  37. stride_align-0.2.0/src/cpp/levenshtein_simd_ops.hpp +396 -0
  38. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/module_bindings.hpp +156 -0
  39. stride_align-0.2.0/src/cpp/osa_simd.hpp +245 -0
  40. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/preprocess.hpp +3 -3
  41. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/__init__.py +131 -0
  42. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/_pybackend.py +58 -0
  43. {stride_align-0.1.0 → stride_align-0.2.0}/tests/test_api.py +79 -0
  44. stride_align-0.2.0/tools/benchmark_libs.py +304 -0
  45. stride_align-0.2.0/tools/correctness_check.py +274 -0
  46. {stride_align-0.1.0 → stride_align-0.2.0}/tools/md_to_html.py +5 -16
  47. stride_align-0.2.0/wheelhouse/stride_align-0.1.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  48. stride_align-0.2.0/wheelhouse/stride_align-0.1.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  49. stride_align-0.2.0/wheelhouse/stride_align-0.1.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  50. stride_align-0.2.0/wheelhouse/stride_align-0.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  51. stride_align-0.2.0/wheelhouse/stride_align-0.1.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  52. stride_align-0.2.0/wheelhouse_v0.2.0/stride_align-0.2.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  53. stride_align-0.2.0/wheelhouse_v0.2.0/stride_align-0.2.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  54. stride_align-0.2.0/wheelhouse_v0.2.0/stride_align-0.2.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  55. stride_align-0.2.0/wheelhouse_v0.2.0/stride_align-0.2.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  56. stride_align-0.2.0/wheelhouse_v0.2.0/stride_align-0.2.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl +0 -0
  57. stride_align-0.1.0/README.ar.md +0 -284
  58. stride_align-0.1.0/README.de.md +0 -284
  59. stride_align-0.1.0/README.es.md +0 -284
  60. stride_align-0.1.0/README.fr.md +0 -284
  61. stride_align-0.1.0/README.hi.md +0 -284
  62. stride_align-0.1.0/README.id.md +0 -284
  63. stride_align-0.1.0/README.ja.md +0 -284
  64. stride_align-0.1.0/README.ko.md +0 -284
  65. stride_align-0.1.0/README.pl.md +0 -284
  66. stride_align-0.1.0/README.pt-BR.md +0 -284
  67. stride_align-0.1.0/README.ru.md +0 -284
  68. stride_align-0.1.0/README.tr.md +0 -284
  69. stride_align-0.1.0/README.vi.md +0 -284
  70. stride_align-0.1.0/README.zh-CN.md +0 -284
  71. stride_align-0.1.0/README.zh-TW.md +0 -284
  72. stride_align-0.1.0/html/index.ar.html +0 -260
  73. stride_align-0.1.0/html/index.de.html +0 -260
  74. stride_align-0.1.0/html/index.es.html +0 -260
  75. stride_align-0.1.0/html/index.fr.html +0 -260
  76. stride_align-0.1.0/html/index.hi.html +0 -260
  77. stride_align-0.1.0/html/index.id.html +0 -260
  78. stride_align-0.1.0/html/index.ja.html +0 -260
  79. stride_align-0.1.0/html/index.ko.html +0 -260
  80. stride_align-0.1.0/html/index.pl.html +0 -260
  81. stride_align-0.1.0/html/index.pt-BR.html +0 -260
  82. stride_align-0.1.0/html/index.ru.html +0 -260
  83. stride_align-0.1.0/html/index.tr.html +0 -260
  84. stride_align-0.1.0/html/index.vi.html +0 -260
  85. stride_align-0.1.0/html/index.zh-CN.html +0 -260
  86. stride_align-0.1.0/html/index.zh-TW.html +0 -260
  87. {stride_align-0.1.0 → stride_align-0.2.0}/.codex +0 -0
  88. {stride_align-0.1.0 → stride_align-0.2.0}/.gitattributes +0 -0
  89. {stride_align-0.1.0 → stride_align-0.2.0}/.gitignore +0 -0
  90. {stride_align-0.1.0 → stride_align-0.2.0}/CITATION.bib +0 -0
  91. {stride_align-0.1.0 → stride_align-0.2.0}/CMakeLists.txt +0 -0
  92. {stride_align-0.1.0 → stride_align-0.2.0}/LICENSE +0 -0
  93. {stride_align-0.1.0 → stride_align-0.2.0}/SME-EXPERIMENT.md +0 -0
  94. {stride_align-0.1.0 → stride_align-0.2.0}/TRANSLATION_REPORT.md +0 -0
  95. {stride_align-0.1.0 → stride_align-0.2.0}/TRANSLATION_REPORT.txt +0 -0
  96. {stride_align-0.1.0 → stride_align-0.2.0}/TRANSLATION_RULES.md +0 -0
  97. {stride_align-0.1.0 → stride_align-0.2.0}/benchmark.csv +0 -0
  98. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/graviton4-arm-simd-parasail-2026-05-16.csv +0 -0
  99. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/graviton4-arm-simd-parasail-2026-05-18.csv +0 -0
  100. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-2026-05-13.md +0 -0
  101. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-native-2026-05-18.csv +0 -0
  102. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-path-native-2026-05-13.csv +0 -0
  103. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-score-1to1-parasail-2026-05-13.csv +0 -0
  104. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-score-native-2026-05-13.csv +0 -0
  105. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-sw-farrar-exactfill-baseline-2026-05-14.csv +0 -0
  106. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/loongson-sw-farrar-exactfill-study-2026-05-14.csv +0 -0
  107. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-2026-05-13.md +0 -0
  108. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-2026-05-14.md +0 -0
  109. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-2026-05-18.csv +0 -0
  110. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-focused-2026-05-14.csv +0 -0
  111. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-linear-trace-onepass-parasail-study-2026-05-14.csv +0 -0
  112. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-microbench-2026-05-14.txt +0 -0
  113. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-microbench-nw-affine-primitives-2026-05-14.txt +0 -0
  114. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-nw-affine-fastpaths-2026-05-14.csv +0 -0
  115. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-nw-affine-primitives-2026-05-14.csv +0 -0
  116. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-path-parasail-2026-05-13.csv +0 -0
  117. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-score-native-2026-05-13.csv +0 -0
  118. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-score-parasail-2026-05-13.csv +0 -0
  119. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/macos-arm64-neon-sw-farrar-parasail-study-2026-05-14.csv +0 -0
  120. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/power8-vsx-2026-05-17.csv +0 -0
  121. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/power8-vsx-2026-05-17.md +0 -0
  122. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/power8-vsx-2026-05-18.csv +0 -0
  123. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/x86-sw-farrar-exactfill-study-2026-05-14.csv +0 -0
  124. {stride_align-0.1.0 → stride_align-0.2.0}/benchmarks/x86_microbench_baseline.json +0 -0
  125. {stride_align-0.1.0 → stride_align-0.2.0}/demo/demo2.py +0 -0
  126. {stride_align-0.1.0 → stride_align-0.2.0}/demo/demo3.py +0 -0
  127. {stride_align-0.1.0 → stride_align-0.2.0}/docs/x86_algorithmic_deltas.txt +0 -0
  128. {stride_align-0.1.0 → stride_align-0.2.0}/fast.png +0 -0
  129. {stride_align-0.1.0 → stride_align-0.2.0}/html/BENCHMARK.html +0 -0
  130. {stride_align-0.1.0 → stride_align-0.2.0}/html/apple-touch-icon.png +0 -0
  131. {stride_align-0.1.0 → stride_align-0.2.0}/html/fast.png +0 -0
  132. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-128.png +0 -0
  133. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-16.png +0 -0
  134. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-180.png +0 -0
  135. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-192.png +0 -0
  136. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-256.png +0 -0
  137. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-32.png +0 -0
  138. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-48.png +0 -0
  139. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-512.png +0 -0
  140. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon-64.png +0 -0
  141. {stride_align-0.1.0 → stride_align-0.2.0}/html/favicon.ico +0 -0
  142. {stride_align-0.1.0 → stride_align-0.2.0}/html/icon-192.png +0 -0
  143. {stride_align-0.1.0 → stride_align-0.2.0}/html/icon-512.png +0 -0
  144. {stride_align-0.1.0 → stride_align-0.2.0}/html/site.webmanifest +0 -0
  145. {stride_align-0.1.0 → stride_align-0.2.0}/html/style.css +0 -0
  146. {stride_align-0.1.0 → stride_align-0.2.0}/include/stride_align/alignment.hpp +0 -0
  147. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/affine.hpp +0 -0
  148. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/affine_fixed_kernel.hpp +0 -0
  149. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/affine_scalable_kernel.hpp +0 -0
  150. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/arm_neon_kernel.hpp +0 -0
  151. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/arm_sve_backend.hpp +0 -0
  152. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/arm_sve_kernel.hpp +0 -0
  153. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/farrar_fixed_kernel.hpp +0 -0
  154. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/farrar_scalable_kernel.hpp +0 -0
  155. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/generic.cpp +0 -0
  156. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/generic.hpp +0 -0
  157. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_aarch64_neon.cpp +0 -0
  158. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_aarch64_sve.cpp +0 -0
  159. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_aarch64_sve2.cpp +0 -0
  160. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_loongarch64_lasx.cpp +0 -0
  161. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_loongarch64_lsx.cpp +0 -0
  162. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_powerpc64_vsx.cpp +0 -0
  163. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/linux_riscv64_rvv.cpp +0 -0
  164. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/loongarch_fixed_kernel.hpp +0 -0
  165. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/macos_arm64_neon.cpp +0 -0
  166. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/powerpc_vsx_kernel.hpp +0 -0
  167. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/profile_traceback.hpp +0 -0
  168. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/riscv_rvv_kernel.hpp +0 -0
  169. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/score_fast_paths.hpp +0 -0
  170. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/swar.cpp +0 -0
  171. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx10_256.cpp +0 -0
  172. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx10_512.cpp +0 -0
  173. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx2.cpp +0 -0
  174. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_avx512.cpp +0 -0
  175. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_fixed_kernel.hpp +0 -0
  176. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/backends/x86_sse.cpp +0 -0
  177. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/cpu.cpp +0 -0
  178. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/cpu.hpp +0 -0
  179. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/cpu_module.cpp +0 -0
  180. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/promotion.hpp +0 -0
  181. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/arm_neon_microbench_backend.cpp +0 -0
  182. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/x86_microbench.cpp +0 -0
  183. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/x86_microbench_avx2.cpp +0 -0
  184. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/x86_microbench_avx512bwvl.cpp +0 -0
  185. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/x86_microbench_common.hpp +0 -0
  186. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/x86_microbench_kernels.hpp +0 -0
  187. {stride_align-0.1.0 → stride_align-0.2.0}/src/cpp/tools/x86_microbench_parasail.cpp +0 -0
  188. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/_cpu.py +0 -0
  189. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/_fallback.py +0 -0
  190. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/benchmark.py +0 -0
  191. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/file_compare.py +0 -0
  192. {stride_align-0.1.0 → stride_align-0.2.0}/src/stride_align/py.typed +0 -0
  193. {stride_align-0.1.0 → stride_align-0.2.0}/tools/pinned_benchmark_sweep.py +0 -0
  194. {stride_align-0.1.0 → stride_align-0.2.0}/tools/x86_microbench_regression.py +0 -0
@@ -29,7 +29,10 @@
29
29
  "Bash(git pull *)",
30
30
  "WebFetch(domain:adamdeprince.com)",
31
31
  "Bash(awk *)",
32
- "Bash(~/.pyenv/bin/pyenv versions *)"
32
+ "Bash(~/.pyenv/bin/pyenv versions *)",
33
+ "Bash(scp *)",
34
+ "Bash(gh --version)",
35
+ "Bash(gh auth *)"
33
36
  ]
34
37
  }
35
38
  }
stride_align-0.2.0/11} ADDED
File without changes
stride_align-0.2.0/12} ADDED
File without changes
stride_align-0.2.0/5} ADDED
File without changes
@@ -29,6 +29,22 @@ ratio = baseline_median_seconds / stride_align_median_seconds
29
29
  | Loongson LoongArch64 | `linux_loongarch64_lasx` | patched parasail (1:1 score) | 16 | **7.517x** | 6.502x | 4.315x | 22.365x |
30
30
  | Loongson LoongArch64 | `linux_loongarch64_lasx` | generic (native) | 80 | **4.909x** | 5.149x | 0.499x | 29.707x |
31
31
  | Power8 VSX (Linux) | `linux_powerpc64_vsx` | generic (no parasail) | 80 | **3.772x** | 4.128x | 0.915x | 16.797x |
32
+ | Levenshtein (Intel x86) | `x86_avx512bwvl` | python-Levenshtein | 14 | **1.159x** | 1.151x | 1.039x | 1.353x |
33
+ | Levenshtein (Intel x86) | `x86_avx512bwvl` | rapidfuzz | 14 | 1.075x | 1.070x | 0.898x | 1.364x |
34
+ | Levenshtein (Intel x86) | `x86_avx512bwvl` | editdistance | 14 | 13.564x | 13.758x | 11.099x | 15.880x |
35
+ | Lev (long, >64 chars) | `x86_avx512bwvl` | rapidfuzz | 5 | **2.35x** | 2.55x | 1.45x | 2.88x |
36
+ | Lev (1-vs-1, q>=100) | `x86_avx512bwvl` | rapidfuzz | 2 | **1.36x** | 1.36x | 1.34x | 1.39x |
37
+ | Lev (cutoff, q=50) | `x86_avx512bwvl` | rapidfuzz | 3 | **3.91x** | 2.41x | 2.41x | 6.03x |
38
+ | Damerau-Lev (short tgts) | `x86_avx512bwvl` | rapidfuzz | 4 | **3.13x** | 3.03x | 2.38x | 4.22x |
39
+ | Damerau-Lev (medium tgts) | `x86_avx512bwvl` | rapidfuzz | 3 | 0.98x | 0.87x | 0.85x | 1.25x |
40
+ | Lev (Mac M4 NEON, short tgts) | `macos_arm64_neon` | python-Levenshtein | 4 | **6.61x** | 6.42x | 5.49x | 8.54x |
41
+ | Damerau-Lev (Mac M4 NEON, short tgts) | `macos_arm64_neon` | rapidfuzz OSA | 4 | **5.49x** | 5.43x | 4.35x | 7.45x |
42
+ | Lev (Loongson LASX, mixed tgts) | `linux_loongarch64_lasx` | generic (no rapidfuzz wheel) | 7 | **2.17x** | 2.18x | 1.54x | 3.34x |
43
+ | Damerau-Lev (Loongson LASX, mixed tgts) | `linux_loongarch64_lasx` | generic (no rapidfuzz wheel) | 6 | **1.43x** | 1.43x | 1.14x | 1.97x |
44
+ | Lev (Graviton4, short tgts) | `linux_aarch64_neon`/`sve`/`sve2` | python-Levenshtein | 4 | **3.18x** | 3.06x | 2.67x | 4.05x |
45
+ | Damerau-Lev (Graviton4, short tgts) | `linux_aarch64_neon`/`sve`/`sve2` | rapidfuzz OSA | 4 | **2.85x** | 2.83x | 2.27x | 3.89x |
46
+ | Lev (Power8 VSX, mixed tgts) | `linux_powerpc64_vsx` | generic (no rapidfuzz wheel) | 8 | **2.40x** | 2.51x | 1.56x | 3.03x |
47
+ | Damerau-Lev (Power8 VSX, mixed tgts) | `linux_powerpc64_vsx` | generic (no rapidfuzz wheel) | 7 | **1.99x** | 2.22x | 1.46x | 2.57x |
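
The summary columns in these rows follow the definition in the hunk header, `ratio = baseline_median_seconds / stride_align_median_seconds`. Below is a minimal sketch of how one summary row could be derived from per-query medians. The helper name `summarize` is hypothetical, and the sample timings are copied from the Damerau-Levenshtein medium-target table further down in this file. This is not the logic of `tools/benchmark_libs.py`, which is not shown in this diff.

```python
from statistics import geometric_mean, median

def summarize(pairs):
    """Summarize speed ratios.

    pairs: iterable of (baseline_median_seconds, stride_align_median_seconds).
    A ratio above 1.0 means stride_align was faster than the baseline.
    """
    ratios = [baseline / ours for baseline, ours in pairs]
    return {
        "rows": len(ratios),
        "geomean": geometric_mean(ratios),
        "median": median(ratios),
        "worst": min(ratios),   # slowest relative result
        "best": max(ratios),    # fastest relative result
    }

# (rapidfuzz, stride_align) medians in seconds, illustrative sample only
sample = [(100e-6, 117e-6), (102e-6, 117e-6), (147e-6, 117e-6)]
summary = summarize(sample)
print({key: round(value, 3) for key, value in summary.items()})
```

The geometric mean is the usual choice for averaging ratios because a 2x win and a 0.5x loss cancel out, which an arithmetic mean would not reflect.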
32
48
 
33
49
  ## Intel x86 - 2026-05-18
34
50
 
@@ -157,6 +173,457 @@ at width 16 (`sw-cigar` and `sw-path-info`); AVX512BWVL's worst row is
157
173
  It loses every score-only row badly; a handful of linear NW path/CIGAR rows
158
174
  are competitive but not consistently.
159
175
 
176
+ ## Levenshtein (Intel x86) - 2026-05-19
177
+
178
+ Raw artifact: [`benchmarks/intel-levenshtein-2026-05-19.csv`](benchmarks/intel-levenshtein-2026-05-19.csv).
179
+
180
+ Build context: same host as Intel x86 above (11th Gen Core i7-1195G7,
181
+ Python 3.13, `taskset -c 2`), running on the `x86_avx512bwvl` backend. The
182
+ multi-target Myers kernel runs one target per SIMD lane (8x 64-bit lanes
183
+ under AVX512) and reads bytes / 1-byte unicode strings zero-copy from
184
+ CPython buffers. Patterns over 64 chars fall through to the scalar
185
+ Hyyrö multi-word dispatch in `levenshtein_dispatch.hpp`.
186
+
187
+ Command:
188
+
189
+ ```bash
190
+ taskset -c 2 .venv/bin/python tools/benchmark_libs.py \
191
+ --input-file kjv_subset.txt --levenshtein \
192
+ --iterations 25 --warmups 3 < lev_queries.txt > intel-levenshtein-2026-05-19.csv
193
+ ```
194
+
195
+ Corpus: first 1000 lines of `demo/kjv.txt`. Queries: 14 single words and
196
+ short phrases (3-29 chars) covering the pattern lengths that hit the
197
+ SIMD fast path.
198
+
199
+ ### Overall
200
+
201
+ | Backend | Rows | Geomean | Median | Worst | Best |
202
+ | --- | ---: | ---: | ---: | ---: | ---: |
203
+ | `stride_align` vs `python-Levenshtein` | 14 | **1.159x** | 1.151x | 1.039x | 1.353x |
204
+ | `stride_align` vs `rapidfuzz` | 14 | 1.075x | 1.070x | 0.898x | 1.364x |
205
+ | `stride_align` vs `editdistance` | 14 | 13.564x | 13.758x | 11.099x | 15.880x |
206
+
207
+ Per-call wall time at 1000 targets (median across 14 queries):
208
+
209
+ | Library | µs/call | ns/target |
210
+ | --- | ---: | ---: |
211
+ | `stride_align` (`x86_avx512bwvl`) | **496** | **496** |
212
+ | `rapidfuzz` | 540 | 540 |
213
+ | `python-Levenshtein` | 567 | 567 |
214
+ | `editdistance` | 6806 | 6806 |
215
+
216
+ ### Takeaways
217
+
218
+ The multi-target Myers kernel keeps stride-align ahead of every popular
219
+ Python Levenshtein library on this corpus. python-Levenshtein loses by
220
+ 1.16x geomean across all 14 queries; rapidfuzz loses by 1.07x with one
221
+ sub-parity row (0.898x on a 26-char query). editdistance is roughly
222
+ 13.5x slower, reflecting its pure-C scalar DP loop with no batching.
223
+
224
+ The "vs rapidfuzz" worst row is the only sub-parity result of the
225
+ sweep. rapidfuzz also uses bit-parallel Myers in its hot path, so the
226
+ remaining headroom is mostly per-call overhead — list traversal, the
227
+ Python ABI, the ndarray allocation — rather than the inner loop. The
228
+ SIMD multi-target kernel pulls ahead on shorter queries where the
229
+ per-target setup dominates.
230
+
231
+ ## Levenshtein extended (Intel x86) - 2026-05-19
232
+
233
+ Raw artifact: [`benchmarks/intel-levenshtein-v2-2026-05-19.csv`](benchmarks/intel-levenshtein-v2-2026-05-19.csv).
234
+
235
+ Three follow-up workloads that exercise the multi-word SIMD batch
236
+ kernel (patterns 65-256 chars, in 64-char blocks W = 2/3/4), the
237
+ zero-copy singular dispatch (no `prepare_alignment` vector copy when
238
+ both inputs are bytes or 1-byte unicode), and the `score_cutoff`
239
+ parameter with per-lane done masks and all-lanes early-exit. Same
240
+ build host and pinning as the section above.
241
+
242
+ ### Long patterns (1-vs-200, no cutoff)
243
+
244
+ | `q_len` | stride_align | python-Lev | rapidfuzz | vs Lev | vs rf |
245
+ | ---: | ---: | ---: | ---: | ---: | ---: |
246
+ | 40 | 95 µs | 110 µs | 95 µs | 1.16x | 1.00x |
247
+ | 65 | 118 µs | 187 µs | 171 µs | **1.59x** | **1.45x** |
248
+ | 100 | 118 µs | 324 µs | 357 µs | **2.75x** | **3.03x** |
249
+ | 128 | 118 µs | 320 µs | 300 µs | **2.71x** | **2.54x** |
250
+ | 180 | 135 µs | 405 µs | 384 µs | **3.00x** | **2.85x** |
251
+ | 200 | 166 µs | 456 µs | 440 µs | **2.75x** | **2.65x** |
252
+
253
+ Each lane in the AVX-512 kernel runs Hyyrö's wide-add carry chain over
254
+ W blocks in parallel across 8 targets. The wide add uses two chained
255
+ 64-bit adds + `gt_u64` overflow detection (AVX-512 native unsigned
256
+ `cmpgt`, AVX2 sign-bit-XOR + signed `cmpgt`, SSE4.1 sub-and-sign-bit
257
+ emulation). q_len = 40 is the single-word kernel, which hits parity
258
+ with rapidfuzz.
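
The wide add described above can be sketched in scalar form. The version below emulates 64-bit words with Python integers and detects overflow with an unsigned greater-than built from a sign-bit XOR plus a signed compare, which is the AVX2-style emulation the text mentions. The function names `gt_u64` and `wide_add` and the little-endian word layout are assumptions made for illustration; the real kernel in `levenshtein_simd_ops.hpp` is not shown in this diff and may be structured differently.

```python
MASK64 = (1 << 64) - 1
SIGN64 = 1 << 63

def _as_signed(x):
    """Reinterpret a 64-bit pattern as a two's-complement signed value."""
    return x - (1 << 64) if x & SIGN64 else x

def gt_u64(a, b):
    """Unsigned a > b using only a signed compare: flip both sign bits first."""
    return _as_signed(a ^ SIGN64) > _as_signed(b ^ SIGN64)

def wide_add(a_words, b_words):
    """Add two multi-word integers stored as little-endian lists of 64-bit words.

    Returns (sum_words, carry_out). Each block uses two chained 64-bit adds;
    an add wrapped around exactly when its result is smaller than an input.
    """
    out, carry = [], 0
    for a, b in zip(a_words, b_words):
        partial = (a + b) & MASK64               # first add: a + b
        wrapped_first = gt_u64(a, partial)
        total = (partial + carry) & MASK64       # second add: + carry-in
        wrapped_second = gt_u64(partial, total)
        out.append(total)
        carry = int(wrapped_first or wrapped_second)  # at most one can wrap
    return out, carry

print(wide_add([MASK64, 0], [1, 0]))
```

In a SIMD kernel the same comparison would produce a per-lane mask rather than a Python boolean, so every target in the batch carries independently.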
259
+
260
+ ### 1-vs-1 singular (zero-copy dispatch)
261
+
262
+ | `q_len` | stride_align | python-Lev | rapidfuzz |
263
+ | ---: | ---: | ---: | ---: |
264
+ | 10 | 0.20 µs | 0.25 µs | **0.17 µs** |
265
+ | 30 | 0.27 µs | 0.31 µs | **0.24 µs** |
266
+ | 60 | 0.36 µs | 0.40 µs | **0.32 µs** |
267
+ | 100 | **0.90 µs** | 1.30 µs | 1.21 µs |
268
+ | 200 | **2.35 µs** | 3.24 µs | 3.14 µs |
269
+
270
+ When both inputs are bytes or 1-byte unicode the singular path skips
271
+ the prepare\_alignment vector copy and runs scalar Myers directly on
272
+ the CPython buffer (`PyBytes_AsStringAndSize` / `PyUnicode_1BYTE_DATA`).
273
+ We trail rapidfuzz by ~10-15% at 60 chars and below (Python ABI overhead, no
274
+ algorithmic gap) and pull ~1.35x ahead from 100 chars onward, where
275
+ the multi-word inner loop dominates.
276
+
277
+ ### score_cutoff (5000 targets, short query)
278
+
279
+ stride_align vs rapidfuzz with matching cutoff:
280
+
281
+ | `q_len` | cutoff | stride_align | rapidfuzz | ratio |
282
+ | ---: | ---: | ---: | ---: | ---: |
283
+ | 10 | 2 | **277 µs** | 353 µs | 1.27x |
284
+ | 10 | 5 | **301 µs** | 472 µs | 1.57x |
285
+ | 10 | 10 | **342 µs** | 556 µs | 1.62x |
286
+ | 30 | 7 | **190 µs** | 486 µs | 2.56x |
287
+ | 30 | 15 | **328 µs** | 682 µs | 2.08x |
288
+ | 30 | 30 | **410 µs** | 866 µs | 2.11x |
289
+ | 50 | 12 | **49 µs** | 297 µs | **6.03x** |
290
+ | 50 | 25 | **204 µs** | 491 µs | 2.41x |
291
+ | 50 | 50 | **410 µs** | 1119 µs | 2.73x |
292
+
293
+ Per-lane done masks freeze score updates once a lane crosses
294
+ `cutoff + remaining_chars`, and the column loop breaks as soon as every
295
+ batch lane is settled (target exhausted or bailed). The biggest win
296
+ (`q_len=50`, `cutoff=12`, 6x) is where most targets exceed cutoff after
297
+ a handful of columns and the whole batch can short-circuit.
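
The bail condition can be illustrated with a scalar, single-target version of the bit-parallel Myers recurrence. This is a sketch under stated assumptions: the function name `levenshtein_myers` is hypothetical, returning `None` when the cutoff is exceeded is a convention chosen for this example only, and the real kernel applies the same test per SIMD lane with done masks rather than returning early.

```python
def levenshtein_myers(pattern, text, cutoff=None):
    """Bit-parallel Levenshtein distance (Myers/Hyyro, one target).

    Returns the distance, or None if it is certain to exceed `cutoff`.
    """
    m = len(pattern)
    if m == 0:
        dist = len(text)
        return dist if cutoff is None or dist <= cutoff else None
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    peq = {}                                  # per-character match bitmasks
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m                # vertical +1 / -1 deltas
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask         # horizontal +1 deltas
        mh = pv & xh                          # horizontal -1 deltas
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        # Each remaining column can lower the score by at most 1, so once
        # score > cutoff + remaining_chars the final result must exceed cutoff.
        remaining_chars = len(text) - (j + 1)
        if cutoff is not None and score > cutoff + remaining_chars:
            return None
    if cutoff is not None and score > cutoff:
        return None
    return score

print(levenshtein_myers("kitten", "sitting"))
print(levenshtein_myers("kitten", "sitting", cutoff=1))
```

Because `remaining_chars` only shrinks by one per column, the test fires early only when the running score is already far above the cutoff, which matches the behaviour described for long patterns with tight cutoffs.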
298
+
299
+ ### Where rapidfuzz still wins
300
+
301
+ Long patterns *with* tight cutoff (e.g. `q=100`, `cutoff=20` over
302
+ 50-250-char targets): rapidfuzz 110 µs vs stride_align 402 µs (0.27x).
303
+ Our cutoff bail condition `score > cutoff + remaining_chars` only
304
+ fires near the end of the column loop because `remaining_chars` shrinks
305
+ slowly. rapidfuzz uses **banded Myers** here, restricting the DP to a
306
+ 2K+1 diagonal band so the work drops to O(K·n) instead of O(m·n).
307
+ Banded SIMD Myers is a separate kernel and isn't implemented in
308
+ stride-align yet — see "Future work" below.
309
+
310
+ ### Future work
311
+
312
+ - **Banded Myers** for tight-cutoff long-pattern workloads (the one
313
+ remaining rapidfuzz win). Restrict per-lane state to ±K diagonals
314
+ from the main; sliding window across columns.
315
+ - **Pattern lengths > 256**: the multi-word SIMD kernel currently caps
316
+ at W=4. Extending to W=8 (pattern up to 512) is a recompile.
317
+
318
+ ## Damerau-Levenshtein / OSA (Intel x86) - 2026-05-19
319
+
320
+ Raw artifact: [`benchmarks/intel-damerau-levenshtein-2026-05-19.csv`](benchmarks/intel-damerau-levenshtein-2026-05-19.csv).
321
+
322
+ Build context: same host as the Levenshtein section (11th Gen Core
323
+ i7-1195G7, Python 3.13, `taskset -c 2`). The algorithm is OSA-restricted
324
+ (Optimal String Alignment) Damerau-Levenshtein: like Levenshtein but
325
+ adjacent transpositions cost 1 instead of two substitutions, and each
326
+ character can participate in at most one edit. It uses Hyyrö's bit-parallel
327
+ recurrence (the `TR = (((~D0_prev) & PM) << 1) & PM_old` formulation
328
+ that rapidfuzz also uses), wrapped in the same multi-target SIMD batch
329
+ architecture as our Levenshtein kernel — one target per SIMD lane
330
+ (2/4/8 lanes for SSE4.1/AVX2/AVX-512).
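
For readers unfamiliar with the `TR` term, here is a scalar, single-target sketch of the bit-parallel OSA recurrence quoted above. It is written from the published recurrence as it is commonly implemented, not from `osa_simd.hpp`, which this diff does not show; the function name `osa_distance` and the variable names are illustrative assumptions.

```python
def osa_distance(pattern, text):
    """Optimal String Alignment distance via a bit-parallel recurrence."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    pm = {}                                   # per-character match bitmasks
    for i, ch in enumerate(pattern):
        pm[ch] = pm.get(ch, 0) | (1 << i)
    vp, vn, d0, pm_old, score = mask, 0, 0, 0, m
    for ch in text:
        pm_j = pm.get(ch, 0)
        # Transposition term: uses the previous column's D0 and match mask.
        tr = (((~d0 & mask) & pm_j) << 1) & pm_old
        d0 = ((((pm_j & vp) + vp) ^ vp) | pm_j | vn) & mask
        d0 |= tr
        hp = (vn | ~(d0 | vp)) & mask
        hn = d0 & vp
        if hp & last:
            score += 1
        elif hn & last:
            score -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
        pm_old = pm_j
    return score

print(osa_distance("ab", "ba"))    # one adjacent transposition
print(osa_distance("ca", "abc"))   # OSA forbids editing a transposed pair again
```

Dropping the two `tr` lines turns this back into the plain Levenshtein recurrence, which is why the same batch architecture can host both kernels.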
331
+
332
+ ### Short targets (1-vs-1000, 3-15 char corpus)
333
+
334
+ This is the SIMD batch sweet spot: short alignments amortize the
335
+ gather + state-shift cost across 8 lanes, and rapidfuzz's per-pair
336
+ overhead dominates its loop.
337
+
338
+ | `q_len` | stride_align | rapidfuzz | ratio |
339
+ | ---: | ---: | ---: | ---: |
340
+ | 5 | **41 µs** | 99 µs | 2.38x |
341
+ | 10 | **42 µs** | 108 µs | 2.59x |
342
+ | 20 | **42 µs** | 176 µs | **4.22x** |
343
+ | 30 | **47 µs** | 163 µs | 3.47x |
344
+
345
+ ### Medium targets (1-vs-200, 30-250 char corpus)
346
+
347
+ | `q_len` | stride_align | rapidfuzz | ratio |
348
+ | ---: | ---: | ---: | ---: |
349
+ | 10 | 117 µs | **100 µs** | 0.85x |
350
+ | 30 | 117 µs | **102 µs** | 0.87x |
351
+ | 64 | **117 µs** | 147 µs | 1.25x |
352
+
353
+ For medium-target workloads we trail rapidfuzz by ~15% under 60 chars
354
+ (their inner loop is slightly tighter — fewer SIMD ops per column),
355
+ then pull ahead at q_len=64 where their bit-parallel fallback path
356
+ kicks in.
357
+
358
+ ### 1-vs-1 singular
359
+
360
+ | `q_len` | stride_align | rapidfuzz | ratio |
361
+ | ---: | ---: | ---: | ---: |
362
+ | 10 | 0.18 µs | 0.15 µs | 0.85x |
363
+ | 30 | 0.23 µs | 0.21 µs | 0.92x |
364
+ | 60 | 0.35 µs | 0.32 µs | 0.90x |
365
+
366
+ Per-call Python ABI dominates; we're within 15% of rapidfuzz on every
367
+ length.
368
+
369
+ ### API
370
+
371
+ ```python
372
+ import stride_align
373
+
374
+ # Singular
375
+ stride_align.damerau_levenshtein_score(query, target) # int
376
+ stride_align.damerau_levenshtein_normalized_score(query, target) # float in [0, 1]
377
+
378
+ # Batch (returns numpy ndarray)
379
+ stride_align.damerau_levenshtein_scores(query, targets) # int64
380
+ stride_align.damerau_levenshtein_normalized_scores(query, targets) # float64
381
+ ```
382
+
383
+ Backends specialized for the new SIMD batch kernel: `x86_sse41`,
384
+ `x86_avx2`, `x86_avx512bwvl`, `x86_avx10_256`, `x86_avx10_512`, and
385
+ (added 2026-05-19) `macos_arm64_neon` and `linux_aarch64_neon` via the
386
+ shared `NeonOps` bundle. Other architectures (SVE / Loongson / Power)
387
+ still fall through to the shared scalar bit-parallel dispatch and
388
+ remain correct.
389
+
390
+ ## Levenshtein + Damerau-Levenshtein (Power8 VSX) - 2026-05-19
391
+
392
+ Raw artifact: [`benchmarks/power8-lev-osa-2026-05-19.txt`](benchmarks/power8-lev-osa-2026-05-19.txt).
393
+
394
+ Build context: Power8 KVM-virtualized core (4.157 GHz), Ubuntu 20.04,
395
+ Python 3.13, AT15.0 GCC 11.4 at `/opt/at15.0/bin/g++` (the system GCC
396
+ 9.4 is too old for `cxx_std_23`), CMake 4.3.2. `VsxOps` is the new
397
+ 128-bit / 2-lane bundle (same shape as SSE / NEON / LSX) using
398
+ `__vector unsigned long long` and Altivec intrinsics. Power8's ISA
399
+ 2.07 has native unsigned `vec_cmpgt` for 64-bit lanes so the kernel
400
+ ports without emulation.
401
+
402
+ No rapidfuzz / python-Levenshtein wheels exist for ppc64le on PyPI;
403
+ comparison is against our generic backend (which runs tight scalar
403
+ bit-parallel Myers/OSA kernels).
405
+
406
+ ### Levenshtein 1-vs-1000 short (3-15 char corpus)
407
+
408
+ | `q_len` | generic | VSX | ratio |
409
+ | ---: | ---: | ---: | ---: |
410
+ | 5 | 103 µs | **44 µs** | 2.37x |
411
+ | 10 | 109 µs | **44 µs** | 2.49x |
412
+ | 20 | 111 µs | **44 µs** | 2.54x |
413
+ | 30 | 125 µs | **44 µs** | **2.87x** |
414
+
415
+ ### Levenshtein 1-vs-200 medium (30-250 char corpus)
416
+
417
+ | `q_len` | generic | VSX | ratio |
418
+ | ---: | ---: | ---: | ---: |
419
+ | 10 | 153 µs | 98 µs | 1.56x |
420
+ | 64 | 200 µs | 98 µs | 2.04x |
421
+ | 100 | 468 µs | 172 µs | 2.72x |
422
+ | 200 | 675 µs | 223 µs | **3.03x** |
423
+
424
+ Multi-word kernel takes over at q=100; the ratio grows because the
424
+ generic scalar's cost rises faster with q (468 to 675 µs) than the
425
+ W-block SIMD kernel's (172 to 223 µs).
427
+
428
+ ### Damerau-Levenshtein 1-vs-1000 short
429
+
430
+ | `q_len` | generic | VSX | ratio |
431
+ | ---: | ---: | ---: | ---: |
432
+ | 5 | 109 µs | **48 µs** | 2.25x |
433
+ | 10 | 108 µs | **48 µs** | 2.22x |
434
+ | 20 | 112 µs | **48 µs** | 2.30x |
435
+ | 30 | 124 µs | **48 µs** | **2.57x** |
436
+
437
+ ### Damerau-Levenshtein 1-vs-200 medium
438
+
439
+ | `q_len` | generic | VSX | ratio |
440
+ | ---: | ---: | ---: | ---: |
441
+ | 10 | 159 µs | 109 µs | 1.46x |
442
+ | 32 | 182 µs | 109 µs | 1.67x |
443
+ | 64 | 204 µs | 109 µs | **1.87x** |
444
+
445
+ Unlike LSX (which trails generic on the 2-lane Damerau medium
446
+ workload), Power8 VSX wins consistently here. Power8's faster
447
+ `vec_extract` for the per-lane gather setup and the native unsigned
448
+ `vec_cmpgt` keep the 2-lane SIMD competitive even on short-pattern
449
+ inner loops.
450
+
451
+ ## Levenshtein + Damerau-Levenshtein (Graviton4 NEON/SVE/SVE2) - 2026-05-19
452
+
453
+ Raw artifact: [`benchmarks/graviton4-lev-osa-2026-05-19.csv`](benchmarks/graviton4-lev-osa-2026-05-19.csv).
454
+
455
+ Build context: AWS Graviton4 (Neoverse V2, 1 vCPU c8g.medium), Ubuntu
456
+ 24.04, Python 3.14, GCC 13.x. The Graviton4 host has only 1.8 GiB RAM
457
+ and 1 vCPU, so the build required `CMAKE_BUILD_PARALLEL_LEVEL=1` and a
458
+ 4 GiB swapfile to keep cc1plus from being OOM-killed on the template-heavy
459
+ TUs.
460
+
461
+ All three ARM backends (`linux_aarch64_neon`, `linux_aarch64_sve`,
462
+ `linux_aarch64_sve2`) share the same SIMD path: the SVE backends are
463
+ built with `-msve-vector-bits=128`, so they hold the same two 64-bit
463
+ lanes as NEON. Both SVE backends wire through `NeonOps` rather than a
464
+ separate `SveOps` bundle, because the bit-parallel Lev/OSA kernel uses
465
+ no SVE-specific feature.
467
+
468
+ ### Levenshtein 1-vs-1000 short (3-15 char corpus)
469
+
470
+ | `q_len` | stride_align | python-Levenshtein | ratio |
471
+ | ---: | ---: | ---: | ---: |
472
+ | 5 | 53 µs | 140 µs | 2.67x |
473
+ | 10 | 53 µs | 148 µs | 2.80x |
474
+ | 20 | 53 µs | 176 µs | 3.33x |
475
+ | 30 | 53 µs | 213 µs | **4.05x** |
476
+
477
+ ### Levenshtein 1-vs-200 medium (30-250 char corpus)
478
+
479
+ | `q_len` | stride_align | python-Levenshtein | ratio |
480
+ | ---: | ---: | ---: | ---: |
481
+ | 10 | 150 µs | **119 µs** | 0.80x |
482
+ | 32 | 150 µs | **121 µs** | 0.81x |
483
+ | 64 | 150 µs | **127 µs** | 0.85x |
484
+ | 100 | **163 µs** | 256 µs | 1.57x |
485
+ | 200 | **224 µs** | 434 µs | 1.94x |
486
+
487
+ Single-word multi-target (q ≤ 64): we trail python-Levenshtein on
488
+ medium-length targets because the per-target SIMD setup outweighs the
489
+ 2-lane parallelism gain. Multi-word kicks in at q=100; we then pull
490
+ ahead 1.57-1.94x.
491
+
492
+ ### Damerau-Levenshtein 1-vs-1000 short
493
+
494
+ | `q_len` | stride_align | rapidfuzz OSA | ratio |
495
+ | ---: | ---: | ---: | ---: |
496
+ | 5 | 57 µs | 129 µs | 2.27x |
497
+ | 10 | 57 µs | 142 µs | 2.50x |
498
+ | 20 | 57 µs | 179 µs | 3.15x |
499
+ | 30 | 57 µs | 221 µs | **3.89x** |
500
+
501
+ ### NEON vs SVE vs SVE2
502
+
503
+ All three ARM backends produce identical results and identical
504
+ performance (53 µs for q=10 short, etc.). The auto-detect picks SVE2
505
+ on Graviton4 since it ranks first in the priority list, but routing
506
+ through `NeonOps` means swapping backends is observationally a no-op.
507
+
508
+ ## Levenshtein + Damerau-Levenshtein (Loongson LASX/LSX) - 2026-05-19
509
+
510
+ Raw artifact: [`benchmarks/loongson-lev-osa-2026-05-19.txt`](benchmarks/loongson-lev-osa-2026-05-19.txt).
511
+
512
+ Build context: Loongson 3A6000 (LoongArch64), Kylin V10 SP1, Python
513
+ 3.13, GCC 15.2.0 (`/opt/loongson-gcc-15.2.0`), CMake 4.3.2. No
514
+ rapidfuzz / python-Levenshtein wheels exist for LoongArch on PyPI, so
515
+ the comparison is against our generic scalar backend (which already
516
+ runs the bit-parallel Myers / OSA kernels in tight C++).
517
+
518
+ `LsxOps` is 128-bit / 2 lanes (similar to SSE & NEON);
519
+ `LasxOps` is 256-bit / 4 lanes (similar to AVX2).
520
+
521
+ ### Caveat on `vandn`
522
+
523
+ Initial port had LSX/LASX returning negative scores on simple inputs.
524
+ Root cause: `__lsx_vandn_v(a, b)` returns `~a & b` (Intel-style), not
525
+ `a & ~b` as the "VANDN" mnemonic in the LoongArch ISA reference seemed
526
+ to suggest. The fix was a single-line operand swap; correctness on the
527
+ generic-reference test set is now 3200/3200 across q_lens 10/32/64/100.
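The two operand conventions are easy to mix up. As a minimal sketch in plain Python on 64-bit masks (the helper names `andn_first_negated` and `andn_second_negated` are illustrative only and do not exist in stride-align or in any intrinsics header), the difference and the operand-swap fix look roughly like this:

```python
MASK64 = (1 << 64) - 1  # emulate a single 64-bit lane


def andn_first_negated(a: int, b: int) -> int:
    """~a & b: the behaviour the caveat above reports for __lsx_vandn_v(a, b)."""
    return (~a & b) & MASK64


def andn_second_negated(a: int, b: int) -> int:
    """a & ~b: the other common reading of "and-not"."""
    return (a & ~b) & MASK64


a, b = 0b1100, 0b1010
print(bin(andn_first_negated(a, b)))   # 0b10
print(bin(andn_second_negated(a, b)))  # 0b100

# Swapping the operands of one convention yields the other.
print(andn_first_negated(b, a) == andn_second_negated(a, b))  # True
```

A kernel written against one convention and compiled against the other silently clears the wrong bits, which is consistent with the negative scores described above.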
528
+
529
+ ### Levenshtein 1-vs-1000 short (3-15 char corpus)
530
+
531
+ | `q_len` | generic | LSX | LASX |
532
+ | ---: | ---: | ---: | ---: |
533
+ | 5 | 103 µs | 67 µs (1.54x) | **49 µs (2.10x)** |
534
+ | 10 | 108 µs | 67 µs (1.61x) | **49 µs (2.18x)** |
535
+ | 30 | 126 µs | 67 µs (1.88x) | **49 µs (2.56x)** |
536
+
537
+ ### Levenshtein 1-vs-200 medium (30-250 char corpus)
538
+
539
+ | `q_len` | generic | LSX | LASX |
540
+ | ---: | ---: | ---: | ---: |
541
+ | 10 | 175 µs | 162 µs (1.08x) | **114 µs (1.54x)** |
542
+ | 64 | 185 µs | 162 µs (1.14x) | **114 µs (1.63x)** |
543
+ | 100 | 492 µs | 203 µs (2.43x) | **147 µs (3.34x)** |
544
+ | 200 | 777 µs | 351 µs (2.22x) | **260 µs (2.99x)** |
545
+
546
+ The multi-word kernel (q_len > 64, W=2/3) pulls ahead more
547
+ dramatically than the single-word kernel because the SIMD batch
548
+ amortizes the wide-add carry chain across 4 lanes (LASX) on the
549
+ LoongArch's 3 GHz cores.
550
+
551
+ ### Damerau-Levenshtein 1-vs-1000 short
552
+
553
+ | `q_len` | generic | LSX | LASX |
554
+ | ---: | ---: | ---: | ---: |
555
+ | 5 | 97 µs | 91 µs (1.06x) | **62 µs (1.57x)** |
556
+ | 10 | 102 µs | 91 µs (1.12x) | **61 µs (1.66x)** |
557
+ | 30 | 121 µs | 91 µs (1.32x) | **61 µs (1.97x)** |
558
+
559
+ ### Damerau-Levenshtein 1-vs-200 medium
560
+
561
+ | `q_len` | generic | LSX | LASX |
562
+ | ---: | ---: | ---: | ---: |
563
+ | 10 | 176 µs | 249 µs (0.71x) | **155 µs (1.14x)** |
564
+ | 32 | 182 µs | 249 µs (0.73x) | **155 µs (1.17x)** |
565
+ | 64 | 189 µs | 249 µs (0.76x) | **155 µs (1.21x)** |
566
+
567
+ LSX trails the generic backend on the OSA medium workload: with only 2
568
+ lanes, the SIMD overhead (extra gather / mask / state-shift cost) outweighs the
569
+ parallelism gain when the per-target scalar bit-parallel Myers loop is
570
+ already tight. LASX keeps 4-lane parallelism worthwhile. Backend
571
+ auto-detect picks LASX where available, so this only matters on
572
+ machines that lack LASX.
573
+
574
+ ## Levenshtein + Damerau-Levenshtein (Mac M4 NEON) - 2026-05-19
575
+
576
+ Raw artifact: [`benchmarks/macos-arm64-neon-lev-osa-2026-05-19.csv`](benchmarks/macos-arm64-neon-lev-osa-2026-05-19.csv).
577
+
578
+ Build context: Apple M4 (T6041), macOS 15.x, Python 3.13 in the
579
+ project virtualenv. Uses the new `macos_arm64_neon` SIMD batch kernel
580
+ (2 lanes × 64-bit, NEON intrinsics in `levenshtein_simd_ops.hpp`). The
581
+ Mac is 2-lane (NEON 128-bit), so per-call SIMD speedup is smaller than
582
+ on AVX-512 (8 lanes); the win comes from skipping Python ABI per-pair
583
+ overhead on the batch path.
584
+
585
+ ### Levenshtein, 1-vs-1000 short targets (3-15 char corpus)
586
+
587
+ | `q_len` | stride_align | python-Levenshtein | ratio |
588
+ | ---: | ---: | ---: | ---: |
589
+ | 5 | **17 µs** | 92 µs | 5.49x |
590
+ | 10 | **17 µs** | 97 µs | 5.80x |
591
+ | 20 | **17 µs** | 118 µs | 7.04x |
592
+ | 30 | **17 µs** | 143 µs | **8.54x** |
593
+
594
+ ### Levenshtein, 1-vs-200 medium targets (30-250 char corpus)
595
+
596
+ | `q_len` | stride_align | python-Levenshtein | ratio |
597
+ | ---: | ---: | ---: | ---: |
598
+ | 10 | 80 µs | 89 µs | 1.12x |
599
+ | 32 | 80 µs | 94 µs | 1.18x |
600
+ | 64 | 80 µs | 95 µs | 1.19x |
601
+ | 100 | 94 µs | 152 µs | 1.61x |
602
+ | 200 | **112 µs** | 260 µs | **2.33x** |
603
+
604
+ The 100/200-char rows exercise the multi-word kernel (W=2/3); the
605
+ ratio grows because python-Levenshtein's overhead scales with pattern
606
+ length while our W-block SIMD scales with `q_len / (64 * lanes)`.
607
+
608
+ ### Damerau-Levenshtein, 1-vs-1000 short
609
+
610
+ | `q_len` | stride_align | rapidfuzz OSA | ratio |
611
+ | ---: | ---: | ---: | ---: |
612
+ | 5 | **20 µs** | 87 µs | 4.35x |
613
+ | 10 | **20 µs** | 95 µs | 4.81x |
614
+ | 20 | **20 µs** | 120 µs | 6.06x |
615
+ | 30 | **20 µs** | 148 µs | **7.45x** |
616
+
617
+ ### 1-vs-1 singular
618
+
619
+ Parity territory — Python ABI dominates, no algorithmic gap.
620
+
621
+ | `q_len` | Lev sa / Lev rf | OSA sa / OSA rf |
622
+ | ---: | ---: | ---: |
623
+ | 10 | 0.13 / 0.13 µs | 0.13 / 0.13 µs |
624
+ | 30 | 0.21 / 0.21 µs | **0.17** / 0.21 µs |
625
+ | 60 | **0.29** / 0.33 µs | 0.29 / 0.29 µs |
626
+
160
627
  ## ARM Graviton4 (Linux aarch64) - 2026-05-18
161
628
 
162
629
  Raw artifacts:
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: stride-align
3
- Version: 0.1.0
3
+ Version: 0.2.0
4
4
  Summary: Smith-Waterman and Needleman-Wunsch alignments with a nanobind C++23 backend.
5
5
  Author: Adam
6
6
  License-Expression: Apache-2.0
@@ -28,9 +28,6 @@ Description-Content-Type: text/markdown
28
28
 
29
29
  # stride-align
30
30
 
31
- **Languages:** **[English](README.md)** · [简体中文](README.zh-CN.md) · [繁體中文](README.zh-TW.md) · [日本語](README.ja.md) · [Deutsch](README.de.md) · [한국어](README.ko.md) · [Français](README.fr.md) · [Español](README.es.md) · [Português do Brasil](README.pt-BR.md) · [Русский](README.ru.md) · [Tiếng Việt](README.vi.md) · [Bahasa Indonesia](README.id.md) · [हिन्दी](README.hi.md) · [العربية](README.ar.md) · [Türkçe](README.tr.md) · [Polski](README.pl.md)
32
-
33
-
34
31
  `stride-align` is a [blazing fast library](BENCHMARK.md) to tell you how "similar" two strings are.
35
32
  It does this by implementing the Smith-Waterman and Needleman-Wunsch
36
33
  algorithms. Instead of giving you a lecture, we're going to learn by
@@ -43,13 +40,24 @@ pip install stride-align
43
40
  ```
44
41
 
45
42
  On Loongson systems, install NumPy from your Linux distribution before
46
- installing `stride-align`:
43
+ installing `stride-align`, and grab the LoongArch64 wheel from the
44
+ GitHub release instead of PyPI (PyPI does not yet accept the
45
+ `linux_loongarch64` or `manylinux_2_38_loongarch64` platform tags):
47
46
 
48
47
  ```bash
49
48
  sudo apt install python3-numpy
50
- pip install stride-align
49
+
50
+ PY=$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')
51
+ pip install \
52
+ https://github.com/adamdeprince/stride-align/releases/download/v0.2.0/stride_align-0.2.0-${PY}-${PY}-linux_loongarch64.whl
51
53
  ```
52
54
 
55
+ Prebuilt LoongArch64 wheels are available for Python 3.10, 3.11, 3.12,
56
+ 3.13, and 3.14. If you are on a different Python (or just want to
57
+ build from source), `pip install stride-align` falls back to the
58
+ source distribution on PyPI, which compiles the LSX/LASX kernels
59
+ locally.
60
+
53
61
  First, just a disclaimer: I'm not using religious texts here to push
54
62
  an agenda - for this demo I need multiple largish public domain
55
63
  documents that have the same meaning but are phrased differently. The
@@ -318,6 +326,47 @@ Gapped Alignment Report". CIGAR is the compact alignment-operation notation
318
326
  used by SAM/BAM tooling. If you want the full formal version, see the
319
327
  [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf).
320
328
 
329
+ ### Levenshtein and Damerau-Levenshtein
330
+
331
+ Beyond Smith-Waterman and Needleman-Wunsch, `stride-align` exposes two
332
+ unit-cost edit-distance metrics with their own SIMD-batched code paths:
333
+
334
+ ```python
335
+ import stride_align
336
+
337
+ # Levenshtein (Myers 1999 bit-parallel) — inserts, deletes, substitutes
338
+ stride_align.levenshtein_score("kitten", "sitting") # -> 3
339
+ stride_align.levenshtein_normalized_score("kitten", "sitting") # -> 0.571...
340
+ stride_align.levenshtein_scores("kitten", ["kit", "sitting"]) # -> ndarray[int64]
341
+ stride_align.levenshtein_normalized_scores("kitten", targets) # -> ndarray[float64]
342
+
343
+ # Optional `score_cutoff` (rapidfuzz convention): bail early per-target,
344
+ # results that exceed the cutoff come back as `cutoff + 1`.
345
+ stride_align.levenshtein_scores(query, targets, score_cutoff=3)
346
+
347
+ # Damerau-Levenshtein (OSA-restricted, Hyyrö 2002) — adds adjacent
348
+ # transposition at unit cost. This is what rapidfuzz exposes as
349
+ # OSA.distance and is what most callers asking for
350
+ # "Damerau-Levenshtein" actually want.
351
+ stride_align.damerau_levenshtein_score("ab", "ba") # -> 1
352
+ stride_align.damerau_levenshtein_scores(query, targets) # -> ndarray[int64]
353
+ ```
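To make the `score_cutoff` convention concrete, here is a pure-Python reference sketch. `reference_levenshtein` and `apply_cutoff` are illustrative names, not stride-align functions, and the sketch clamps after computing the full distance, whereas the comment above says the library bails out early per target.

```python
from typing import Optional


def reference_levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def apply_cutoff(distance: int, score_cutoff: Optional[int]) -> int:
    """Report cutoff + 1 for any distance that exceeds the cutoff."""
    if score_cutoff is not None and distance > score_cutoff:
        return score_cutoff + 1
    return distance


d = reference_levenshtein("kitten", "sitting")
print(d)                   # 3
print(apply_cutoff(d, 3))  # 3 (within the cutoff, reported as-is)
print(apply_cutoff(6, 3))  # 4 (exceeds the cutoff, reported as cutoff + 1)
```

Under this convention a caller only needs to test `result <= score_cutoff` to know whether the reported value is an exact distance.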
354
+
355
+ Both algorithms use a bit-parallel Myers-style inner loop. The batch
356
+ variants (`*_scores`) pack one target per SIMD lane and currently
357
+ specialize on each architecture's primary 64-bit-lane SIMD:
358
+
359
+ - x86: SSE4.1 / AVX2 / AVX-512 / AVX10-256 / AVX10-512
360
+ - ARM: NEON (Linux + macOS), SVE / SVE2
361
+ - LoongArch: LSX / LASX
362
+ - PowerPC: VSX
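For readers unfamiliar with the bit-parallel inner loop, the following is a minimal scalar sketch in pure Python, using the commonly published Hyyrö-style formulation of Myers' algorithm with one emulated 64-bit word. The function name `myers_single_word` is hypothetical and this is not the package's C++ kernel; the SIMD batch variants would, in effect, run one such state machine per lane.

```python
WORD = (1 << 64) - 1  # emulate one 64-bit machine word


def myers_single_word(pattern: str, text: str) -> int:
    """Levenshtein distance for patterns of at most 64 characters."""
    m = len(pattern)
    if m == 0:
        return len(text)
    if m > 64:
        raise ValueError("single-word sketch handles patterns up to 64 chars")

    # Per-character match masks: bit i is set where pattern[i] == ch.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    vp, vn = WORD, 0    # vertical +1 / -1 deltas of the current DP column
    dist = m            # distance at the bottom row of the column
    top = 1 << (m - 1)  # bit that tracks the last pattern row

    for ch in text:
        eq = peq.get(ch, 0)
        d0 = ((((eq & vp) + vp) & WORD) ^ vp) | eq | vn  # diagonal zero deltas
        hp = (vn | ~(d0 | vp)) & WORD                    # horizontal +1 deltas
        hn = d0 & vp                                     # horizontal -1 deltas
        if hp & top:
            dist += 1
        if hn & top:
            dist -= 1
        hp = ((hp << 1) | 1) & WORD
        hn = (hn << 1) & WORD
        vp = (hn | ~(d0 | hp)) & WORD
        vn = hp & d0
    return dist


print(myers_single_word("kitten", "sitting"))  # 3
```

Each text character costs a fixed handful of word operations regardless of pattern length, which is why the single-word timings in the benchmark tables stay flat across `q_len` values up to 64.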
363
+
364
+ Patterns up to 64 chars run a single-word Myers; 65-256 chars use the
365
+ multi-word kernel (W=2/3/4). Beyond 256, the implementation falls
366
+ through to a scalar bit-parallel dispatch.
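The length thresholds above amount to a three-way dispatch. A minimal sketch, assuming the pattern length alone picks the tier (the function name and tier labels are hypothetical, and how the real dispatcher derives the word count W is not shown here):

```python
def select_kernel_tier(pattern_len: int) -> str:
    """Map a pattern length to the kernel tier described in the text."""
    if pattern_len < 0:
        raise ValueError("pattern length must be non-negative")
    if pattern_len <= 64:
        return "single_word"      # one 64-bit word holds the whole pattern
    if pattern_len <= 256:
        return "multi_word"       # W-block kernel
    return "scalar_fallback"      # scalar bit-parallel dispatch


for n in (10, 64, 65, 256, 257):
    print(n, select_kernel_tier(n))
```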
367
+
368
+ See [BENCHMARK.md](BENCHMARK.md) for cross-architecture numbers.
369
+
321
370
  ## Optimizations and Benchmarks
322
371
 
323
372
  Careful attention has been, and continues to be, paid to `stride-align`'s
@@ -1,8 +1,5 @@
1
1
  # stride-align
2
2
 
3
- **Languages:** **[English](README.md)** · [简体中文](README.zh-CN.md) · [繁體中文](README.zh-TW.md) · [日本語](README.ja.md) · [Deutsch](README.de.md) · [한국어](README.ko.md) · [Français](README.fr.md) · [Español](README.es.md) · [Português do Brasil](README.pt-BR.md) · [Русский](README.ru.md) · [Tiếng Việt](README.vi.md) · [Bahasa Indonesia](README.id.md) · [हिन्दी](README.hi.md) · [العربية](README.ar.md) · [Türkçe](README.tr.md) · [Polski](README.pl.md)
4
-
5
-
6
3
  `stride-align` is a [blazing fast library](BENCHMARK.md) to tell you how "similar" two strings are.
7
4
  It does this by implementing the Smith-Waterman and Needleman-Wunsch
8
5
  algorithms. Instead of giving you a lecture, we're going to learn by
@@ -15,13 +12,24 @@ pip install stride-align
15
12
  ```
16
13
 
17
14
  On Loongson systems, install NumPy from your Linux distribution before
18
- installing `stride-align`:
15
+ installing `stride-align`, and grab the LoongArch64 wheel from the
16
+ GitHub release instead of PyPI (PyPI does not yet accept the
17
+ `linux_loongarch64` or `manylinux_2_38_loongarch64` platform tags):
19
18
 
20
19
  ```bash
21
20
  sudo apt install python3-numpy
22
- pip install stride-align
21
+
22
+ PY=$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')
23
+ pip install \
24
+ https://github.com/adamdeprince/stride-align/releases/download/v0.2.0/stride_align-0.2.0-${PY}-${PY}-linux_loongarch64.whl
23
25
  ```
24
26
 
27
+ Prebuilt LoongArch64 wheels are available for Python 3.10, 3.11, 3.12,
28
+ 3.13, and 3.14. If you are on a different Python (or just want to
29
+ build from source), `pip install stride-align` falls back to the
30
+ source distribution on PyPI, which compiles the LSX/LASX kernels
31
+ locally.
32
+
25
33
  First, just a disclaimer: I'm not using religious texts here to push
26
34
  an agenda - for this demo I need multiple largish public domain
27
35
  documents that have the same meaning but are phrased differently. The
@@ -290,6 +298,47 @@ Gapped Alignment Report". CIGAR is the compact alignment-operation notation
290
298
  used by SAM/BAM tooling. If you want the full formal version, see the
291
299
  [SAM specification](https://samtools.github.io/hts-specs/SAMv1.pdf).
292
300
 
301
+ ### Levenshtein and Damerau-Levenshtein
302
+
303
+ Beyond Smith-Waterman and Needleman-Wunsch, `stride-align` exposes two
304
+ unit-cost edit-distance metrics with their own SIMD-batched code paths:
305
+
306
+ ```python
307
+ import stride_align
308
+
309
+ # Levenshtein (Myers 1999 bit-parallel) — inserts, deletes, substitutes
310
+ stride_align.levenshtein_score("kitten", "sitting") # -> 3
311
+ stride_align.levenshtein_normalized_score("kitten", "sitting") # -> 0.571...
312
+ stride_align.levenshtein_scores("kitten", ["kit", "sitting"]) # -> ndarray[int64]
313
+ stride_align.levenshtein_normalized_scores("kitten", targets) # -> ndarray[float64]
314
+
315
+ # Optional `score_cutoff` (rapidfuzz convention): bail early per-target,
316
+ # results that exceed the cutoff come back as `cutoff + 1`.
317
+ stride_align.levenshtein_scores(query, targets, score_cutoff=3)
318
+
319
+ # Damerau-Levenshtein (OSA-restricted, Hyyrö 2002) — adds adjacent
320
+ # transposition at unit cost. This is what rapidfuzz exposes as
321
+ # OSA.distance and is what most callers asking for
322
+ # "Damerau-Levenshtein" actually want.
323
+ stride_align.damerau_levenshtein_score("ab", "ba") # -> 1
324
+ stride_align.damerau_levenshtein_scores(query, targets) # -> ndarray[int64]
325
+ ```
326
+
327
+ Both algorithms use a bit-parallel Myers-style inner loop. The batch
328
+ variants (`*_scores`) pack one target per SIMD lane and currently
329
+ specialize on each architecture's primary 64-bit-lane SIMD:
330
+
331
+ - x86: SSE4.1 / AVX2 / AVX-512 / AVX10-256 / AVX10-512
332
+ - ARM: NEON (Linux + macOS), SVE / SVE2
333
+ - LoongArch: LSX / LASX
334
+ - PowerPC: VSX
335
+
336
+ Patterns up to 64 chars run a single-word Myers; 65-256 chars use the
337
+ multi-word kernel (W=2/3/4). Beyond 256, the implementation falls
338
+ through to a scalar bit-parallel dispatch.
339
+
340
+ See [BENCHMARK.md](BENCHMARK.md) for cross-architecture numbers.
341
+
293
342
  ## Optimizations and Benchmarks
294
343
 
295
344
  Careful attention has been, and continues to be, paid to `stride-align`'s