token-fuzz-rs 0.1.1__cp314-cp314t-musllinux_1_2_armv7l.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,5 @@
+ from .token_fuzz_rs import *
+
+ __doc__ = token_fuzz_rs.__doc__
+ if hasattr(token_fuzz_rs, "__all__"):
+     __all__ = token_fuzz_rs.__all__
@@ -0,0 +1,5 @@
+
+ class TokenFuzzer:
+     def __init__(self, data: list[str], num_hashes: int = 128) -> None: ...
+     def match_closest(self, query: str) -> str: ...
+     def match_closest_batch(self, queries: list[str]) -> list[str]: ...
token_fuzz_rs/py.typed ADDED
File without changes
@@ -0,0 +1,197 @@
+ Metadata-Version: 2.4
+ Name: token_fuzz_rs
+ Version: 0.1.1
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Rust
+ Classifier: Programming Language :: Python :: Implementation :: CPython
+ Classifier: Programming Language :: Python :: Implementation :: PyPy
+ Classifier: Topic :: Software Development :: Libraries
+ Classifier: Topic :: Text Processing
+ License-File: LICENSE
+ Summary: Fast token-based fuzzy string matching for very large, static corpora (Rust-backed, Python-first).
+ Keywords: fuzzy,string matching,similarity,minhash,tokens,rust,pyo3
+ Author-email: Matthew Akram <mazfh85246@gmail.com>
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Homepage, https://github.com/matthewakram/token_fuzz_rs
+ Project-URL: Repository, https://github.com/matthewakram/token_fuzz_rs
+ Project-URL: Issues, https://github.com/matthewakram/token_fuzz_rs/issues
+ Project-URL: Documentation, https://github.com/matthewakram/token_fuzz_rs#readme
+
+ # token-fuzz-rs
+
+ **The fastest** token-based fuzzy string matching in Python for **very large, static corpora**.
+
+ `token-fuzz-rs` is designed for the case where:
+
+ - You have a **very large list of (possibly very long) strings**.
+ - That list is **static** (or rarely changes).
+ - You need to run **many queries** against that list.
+ - You want **token-based** matching (robust to extra/missing words, small typos, etc.).
+
+ In this scenario, `token-fuzz-rs` can be **significantly faster** (often by multiple orders of magnitude) than general-purpose Python fuzzy matching libraries for token-based search.
+
+ For **small to medium-sized sets or one-off matching**, you should strongly consider using [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) instead – it’s feature-rich, very well maintained, and easier to integrate in many typical workloads. However, for large, static corpora with many token-based queries, `token-fuzz-rs` is focused specifically on that performance niche.
+
+ The core is implemented in Rust for speed, but the library is intended to be used **purely from Python** via its PyPI package.
+
+ ---
+
+ ## Why Token-Based Matching?
+
+ Token-based matching treats strings as **bags of tokens** (e.g. words or byte n-grams) rather than as plain character sequences. This has several advantages:
+
+ - **Robust to word order**
+   `"New York City"` vs. `"City of New York"` can still match well.
+
+ - **Robust to extra or missing words**
+   `"hello world"` vs. `"hello wurld I love you"` can still yield a high similarity because the important tokens overlap.
+
+ - **More tolerant of local edits**
+   Small insertions/deletions don’t completely destroy similarity as they might with naive edit-distance-based approaches.
+
+ - **Good for partial overlaps**
+   Useful when strings share important keywords but differ in prefixes/suffixes.
+
+ `token-fuzz-rs` implements a MinHash-style token similarity over byte-based tokens, making it efficient and scalable for very large corpora.
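As a rough illustration of the bag-of-tokens idea, here is word-level Jaccard similarity in plain Python. This `jaccard` helper is hypothetical and purely illustrative; the library itself works on byte tokens and approximates similarity with MinHash rather than computing exact word-set overlap:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word tokens (illustration only)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Word order does not matter: 3 shared tokens out of 4 distinct ones.
print(jaccard("New York City", "City of New York"))      # -> 0.75

# Extra or misspelled words lower the score but do not zero it out.
print(jaccard("hello world", "hello wurld I love you"))  # -> 0.1666...
```

Character-level edit distance would penalize `"City of New York"` heavily for its reordered prefix; the token view does not.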
+
+ ---
+
+ ## Installation
+
+ Install from PyPI:
+
+ ```bash
+ pip install token-fuzz-rs
+ ```
+
+ Then import it in Python as:
+
+ ```python
+ from token_fuzz_rs import TokenFuzzer
+ ```
+
+ ---
+
+ ## Quick Start
+
+ ```python
+ from token_fuzz_rs import TokenFuzzer
+
+ # Your corpus of strings (can be very large)
+ data = [
+     "hello world",
+     "rust programming",
+     "fuzzy token matcher",
+ ]
+
+ # Build the index (one-off cost; optimized for many subsequent queries)
+ fuzzer = TokenFuzzer(data)
+
+ # Fuzzy queries
+ print(fuzzer.match_closest("hello wurld"))             # -> "hello world"
+ print(fuzzer.match_closest("hello wurld I love you"))  # -> "hello world"
+ print(fuzzer.match_closest("rust progmming"))          # -> "rust programming"
+ ```
+
+ ---
+
+ ## When to Use `token-fuzz-rs` vs. RapidFuzz
+
+ **Use `token-fuzz-rs` when:**
+
+ - Your corpus is **large** (thousands to millions of strings).
+ - The corpus is **static** or changes rarely.
+ - You need to run **many queries** against that corpus.
+ - You care about **token-based similarity** and want very high throughput.
+
+ In this context, `token-fuzz-rs`:
+
+ - Builds a compact MinHash-based index once.
+ - Answers subsequent queries very quickly.
+ - Can outperform general-purpose libraries by **multiple orders of magnitude** on large token-based workloads.
+
+ **Use RapidFuzz when:**
+
+ - Your corpus is **small or medium-sized**.
+ - You don’t want to pay an expensive one-off index build step.
+ - You need a rich set of similarity metrics and utilities.
+ - You prefer a pure-Python / standard C-extension workflow and broader feature set.
+
+ `token-fuzz-rs` is intentionally focused and minimal: one main type (`TokenFuzzer`) and one main operation (`match_closest`, plus its batched variant `match_closest_batch`).
+
+ ---
+
+ ## API Reference
+
+ ### Class: `TokenFuzzer`
+
+ #### Constructor
+
+ ```python
+ TokenFuzzer(data: list[str], num_hashes: int = 128) -> TokenFuzzer
+ ```
+
+ Builds an index over the provided list of strings.
+
+ - `data`: list of strings to match against (the “corpus”).
+ - `num_hashes`: number of hash functions used per MinHash signature (default `128`).
+ - Internally, computes MinHash-style signatures for each string (one-time cost).
+
+ #### Methods
+
+ ```python
+ match_closest(self, query: str) -> str
+ ```
+
+ Returns the closest-matching string from the original corpus.
+
+ - `query`: query string.
+ - Returns: a single string – the best match from the corpus.
+ - Raises: `ValueError` if the corpus was empty at construction time.
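The type stub bundled in this wheel also declares a batched variant, `match_closest_batch(queries: list[str]) -> list[str]`, which resolves many queries in one call. Its behavior can be sketched with a pure-Python stand-in; this `ToyTokenFuzzer` is a hypothetical illustration using word-level Jaccard scoring, not the Rust-backed MinHash implementation:

```python
class ToyTokenFuzzer:
    """Pure-Python stand-in mirroring the TokenFuzzer interface.

    Illustration only: the real TokenFuzzer compares MinHash signatures
    over byte tokens, not exact word sets.
    """

    def __init__(self, data: list[str]) -> None:
        self._data = data
        self._token_sets = [set(s.split()) for s in data]

    def _score(self, query_tokens: set, tokens: set) -> float:
        union = len(query_tokens | tokens)
        return len(query_tokens & tokens) / union if union else 0.0

    def match_closest(self, query: str) -> str:
        q = set(query.split())
        best = max(range(len(self._data)),
                   key=lambda i: self._score(q, self._token_sets[i]))
        return self._data[best]

    def match_closest_batch(self, queries: list[str]) -> list[str]:
        # The batched variant is semantically one lookup per query.
        return [self.match_closest(q) for q in queries]

fuzzer = ToyTokenFuzzer(["hello world", "rust programming", "fuzzy token matcher"])
print(fuzzer.match_closest_batch(["hello wurld", "rust progmming"]))
# -> ['hello world', 'rust programming']
```

Batching the real calls keeps the Python/Rust boundary crossings down to one per batch, which matters for high-throughput workloads.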
+
+ ---
+
+ ## How It Works (High Level)
+
+ - Strings are treated as byte sequences.
+ - For each position in the string, the library builds short byte tokens (up to 8 bytes).
+ - Each token is hashed with multiple independent hash functions based on SplitMix64.
+ - For each hash function, the minimum hash value seen over all tokens becomes one element of a **MinHash signature**.
+ - Similarity between two strings is approximated by the fraction of equal entries in their signatures.
+ - `match_closest`:
+   1. Computes the MinHash signature for the query string.
+   2. Compares it against all precomputed signatures in the corpus.
+   3. Returns the string with the highest similarity score.
+
+ This design:
+
+ - Exploits **token overlap** rather than pure character-level edit distance.
+ - Allows fast, approximate similarity search once signatures are precomputed.
+ - Scales well to large, static corpora with many queries.
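The pipeline above can be sketched in Python. The SplitMix64 constants below are the standard published ones, but the tokenization details, seeding scheme, and byte order are assumptions for illustration and need not match the Rust core exactly:

```python
MASK64 = (1 << 64) - 1

def splitmix64(x: int) -> int:
    """One step of the SplitMix64 mixer (standard constants)."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return (x ^ (x >> 31)) & MASK64

def byte_tokens(s: str, width: int = 8) -> list:
    """Short byte tokens starting at each position (up to `width` bytes)."""
    b = s.encode("utf-8")
    return [b[i:i + width] for i in range(len(b))]

def minhash_signature(s: str, num_hashes: int = 128) -> list:
    """One minimum per hash function over all tokens of `s`."""
    toks = [int.from_bytes(t, "little") for t in byte_tokens(s)]
    return [min(splitmix64(t ^ splitmix64(seed)) for t in toks)
            for seed in range(num_hashes)]

def signature_similarity(a: list, b: list) -> float:
    """Fraction of equal entries; approximates token-set Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

sig = minhash_signature
print(signature_similarity(sig("hello world"), sig("hello wurld")))       # high-ish
print(signature_similarity(sig("hello world"), sig("rust programming")))  # near 0
```

The key property: the probability that two signatures agree in one position equals the Jaccard similarity of the two token sets, so averaging over `num_hashes` positions yields an unbiased estimate whose accuracy grows with the signature length.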
+
+ ---
+
+ ## Notes & Limitations
+
+ - Similarity is **approximate** (MinHash-based), not exact edit distance.
+ - Matching is 1-to-N: each query returns only the **single best match**.
+ - The index is **immutable** after construction; to add or remove strings, build a new `TokenFuzzer`.
+ - The library is intended to be used from Python; the Rust code is an internal implementation detail.
+
+ ---
+
+ ## License
+
+ This project is licensed under the **MIT License**.
+ Feel free to use it, open small PRs, or file issues with feature requests.
+
@@ -0,0 +1,9 @@
+ token_fuzz_rs-0.1.1.dist-info/METADATA,sha256=jOTPn1BjpbM17J1c4KxM7meurQxiDw-Nera7qpWNyAY,7199
+ token_fuzz_rs-0.1.1.dist-info/WHEEL,sha256=50sOPaSYc3bjJpBvsS8-LiMZUpJkJjzNIwaqSNTvTFU,109
+ token_fuzz_rs-0.1.1.dist-info/licenses/LICENSE,sha256=ELI5NZabgRlP2kgrYjyDelyrdE7uuQaGNkmyrSGUa9k,1074
+ token_fuzz_rs.libs/libgcc_s-262c4f60.so.1,sha256=xPsZgCvL7EO-llmjqc5bm96baehLsO4avBqUhih0xZg,2810501
+ token_fuzz_rs/__init__.py,sha256=3xpBvLa8QmnODVoD6E99O-MQ9Wog2GkRekqxq6c_GuE,135
+ token_fuzz_rs/__init__.pyi,sha256=BTIALcdvIY5pYOZ2pjaIiNoNk_WaccuTGT91kpON0nY,219
+ token_fuzz_rs/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+ token_fuzz_rs/token_fuzz_rs.cpython-314t-arm-linux-musleabihf.so,sha256=mA2YLYaREzi6ESoUp1ZG3XL9c4wo3PJUXgd7w-RaMu8,801533
+ token_fuzz_rs-0.1.1.dist-info/RECORD,,
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: maturin (1.10.2)
+ Root-Is-Purelib: false
+ Tag: cp314-cp314t-musllinux_1_2_armv7l
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) [2025] [Matthew Akram]
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.