token-fuzz-rs 0.1.0__pp311-pypy311_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- token_fuzz_rs/__init__.py +5 -0
- token_fuzz_rs/__init__.pyi +5 -0
- token_fuzz_rs/py.typed +0 -0
- token_fuzz_rs/token_fuzz_rs.pypy311-pp73-s390x-linux-gnu.so +0 -0
- token_fuzz_rs-0.1.0.dist-info/METADATA +197 -0
- token_fuzz_rs-0.1.0.dist-info/RECORD +8 -0
- token_fuzz_rs-0.1.0.dist-info/WHEEL +5 -0
- token_fuzz_rs-0.1.0.dist-info/licenses/LICENSE +21 -0
token_fuzz_rs/py.typed
ADDED (file without changes)

token_fuzz_rs/token_fuzz_rs.pypy311-pp73-s390x-linux-gnu.so
ADDED (binary file)
token_fuzz_rs-0.1.0.dist-info/METADATA
ADDED
@@ -0,0 +1,197 @@
Metadata-Version: 2.4
Name: token_fuzz_rs
Version: 0.1.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing
License-File: LICENSE
Summary: Fast token-based fuzzy string matching for very large, static corpora (Rust-backed, Python-first).
Keywords: fuzzy,string matching,similarity,minhash,tokens,rust,pyo3
Author-email: Matthew Akram <mazfh85246@gmail.com>
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/matthewakram/token_fuzz_rs
Project-URL: Repository, https://github.com/matthewakram/token_fuzz_rs
Project-URL: Issues, https://github.com/matthewakram/token_fuzz_rs/issues
Project-URL: Documentation, https://github.com/matthewakram/token_fuzz_rs#readme

# token-fuzz-rs

**Very fast** token-based fuzzy string matching in Python for **very large, static corpora**.

`token-fuzz-rs` is designed for the case where:

- You have a **very large list of (possibly very long) strings**.
- That list is **static** (or rarely changes).
- You need to run **many queries** against that list.
- You want **token-based** matching (robust to extra/missing words, small typos, etc.).

In this scenario, `token-fuzz-rs` can be **significantly faster** (often by multiple orders of magnitude) than general-purpose Python fuzzy-matching libraries for token-based search.

For **small to medium-sized sets or one-off matching**, strongly consider [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) instead – it is feature-rich, very well maintained, and easier to integrate into many typical workloads. `token-fuzz-rs` focuses specifically on the remaining niche: large, static corpora queried many times with token-based matching.

The core is implemented in Rust for speed, but the library is intended to be used **purely from Python** via its PyPI package.

---

## Why Token-Based Matching?

Token-based matching treats strings as **bags of tokens** (e.g. words or byte n-grams) rather than as plain character sequences. This has several advantages:

- **Robust to word order**
  `"New York City"` vs. `"City of New York"` can still match well.

- **Robust to extra or missing words**
  `"hello world"` vs. `"hello wurld I love you"` still yields a high similarity because the important tokens overlap.

- **More tolerant of local edits**
  Small insertions/deletions don’t completely destroy similarity as they might with naive edit-distance-based approaches.

- **Good for partial overlaps**
  Useful when strings share important keywords but differ in prefixes/suffixes.

`token-fuzz-rs` implements a MinHash-style token similarity over byte-based tokens, making it efficient and scalable for very large corpora.
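
To see why the bag-of-tokens view behaves this way, here is a tiny self-contained illustration using exact Jaccard similarity over whole-word tokens. This is for intuition only, not the library’s implementation:

```python
# Illustration only: exact Jaccard similarity over whole-word tokens.
# token-fuzz-rs itself uses short byte tokens plus MinHash approximation,
# so near-misses like "wurld"/"world" also share many (byte) tokens.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

print(jaccard("New York City", "City of New York"))      # 0.75: word order is ignored
print(jaccard("hello world", "hello wurld I love you"))  # 0.17: only exact words count here
```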

---

## Installation

Install from PyPI:

```bash
pip install token-fuzz-rs
```

Then import it in Python as:

```python
from token_fuzz_rs import TokenFuzzer
```

---

## Quick Start

```python
from token_fuzz_rs import TokenFuzzer

# Your corpus of strings (can be very large)
data = [
    "hello world",
    "rust programming",
    "fuzzy token matcher",
]

# Build the index (one-off cost; optimized for many subsequent queries)
fuzzer = TokenFuzzer(data)

# Fuzzy queries
print(fuzzer.match_closest("hello wurld"))             # -> "hello world"
print(fuzzer.match_closest("hello wurld I love you"))  # -> "hello world"
print(fuzzer.match_closest("rust progmming"))          # -> "rust programming"
```

---

## When to Use `token-fuzz-rs` vs. RapidFuzz

**Use `token-fuzz-rs` when:**

- Your corpus is **large** (thousands to millions of strings).
- The corpus is **static** or changes rarely.
- You need to run **many queries** against that corpus.
- You care about **token-based similarity** and want very high throughput.

In this context, `token-fuzz-rs`:

- Builds a compact MinHash-based index once.
- Answers subsequent queries very quickly.
- Can outperform general-purpose libraries by **multiple orders of magnitude** on large token-based workloads (see the sketch below).
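
To confirm that the one-off build cost amortizes on your own data, a minimal timing harness like the following works. The corpus here is synthetic; only the documented `TokenFuzzer` and `match_closest` API is used:

```python
import time

from token_fuzz_rs import TokenFuzzer

# Synthetic corpus; substitute your own data.
corpus = [f"record number {i}" for i in range(100_000)]
queries = ["record numbr 42", "recod number 7", "record nmber 123"]

t0 = time.perf_counter()
fuzzer = TokenFuzzer(corpus)  # one-off index build
build_s = time.perf_counter() - t0

t0 = time.perf_counter()
for q in queries:
    fuzzer.match_closest(q)
per_query_s = (time.perf_counter() - t0) / len(queries)

print(f"build: {build_s:.3f}s, avg query: {per_query_s * 1e3:.3f}ms")
```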

**Use RapidFuzz when:**

- Your corpus is **small or medium-sized**.
- You can’t amortize an expensive one-off index-build step.
- You need a rich set of similarity metrics and utilities.
- You prefer a standard C-extension (or pure-Python fallback) workflow and a broader feature set.

`token-fuzz-rs` is intentionally focused and minimal: one main type (`TokenFuzzer`) and one main operation (`match_closest`).

---

## API Reference

### Class: `TokenFuzzer`

#### Constructor

```python
TokenFuzzer(strings: list[str]) -> TokenFuzzer
```

Builds an index over the provided list of strings.

- `strings`: list of strings to match against (the “corpus”).
- Internally, computes MinHash-style signatures for each string (one-time cost).

#### Methods

```python
match_closest(self, s: str) -> str
```

Returns the closest-matching string from the original corpus.

- `s`: query string.
- Returns: a single string – the best match from the corpus.
- Raises: `ValueError` if the corpus was empty at construction time.
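
A short usage sketch of the documented API; the empty-corpus guard follows the `Raises` note above:

```python
from token_fuzz_rs import TokenFuzzer

fuzzer = TokenFuzzer(["alpha beta", "gamma delta"])
print(fuzzer.match_closest("alpha betta"))  # expected: "alpha beta"

# An empty corpus makes match_closest raise ValueError:
try:
    TokenFuzzer([]).match_closest("anything")
except ValueError:
    print("corpus was empty")
```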

---

## How It Works (High Level)

- Strings are treated as byte sequences.
- For each position in the string, the library builds short byte tokens (up to 8 bytes).
- Each token is hashed with multiple independent hash functions based on SplitMix64.
- For each hash function, the minimum hash value seen over all tokens becomes one element of a **MinHash signature**.
- Similarity between two strings is approximated by the fraction of equal entries in their signatures.
- `match_closest`:
  1. Computes the MinHash signature for the query string.
  2. Compares it against all precomputed signatures in the corpus.
  3. Returns the string with the highest similarity score.

This design:

- Exploits **token overlap** rather than pure character-level edit distance.
- Allows fast, approximate similarity search once signatures are precomputed.
- Scales well to large, static corpora with many queries.
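
To make the scheme concrete, here is a minimal pure-Python sketch of the steps above. The 8-byte token width matches the description, but the number of hash functions and the exact SplitMix64 seeding are illustrative assumptions; the Rust internals may differ:

```python
MASK64 = (1 << 64) - 1

def splitmix64(x: int) -> int:
    # Standard SplitMix64 mixing steps.
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def signature(s: str, num_hashes: int = 64) -> list[int]:
    data = s.encode("utf-8")
    seeds = [splitmix64(k) for k in range(num_hashes)]
    sig = [MASK64] * num_hashes
    # Short byte tokens (up to 8 bytes) starting at every byte position.
    for i in range(len(data)):
        base = splitmix64(int.from_bytes(data[i:i + 8], "little"))
        for k, seed in enumerate(seeds):
            # The k-th hash function: re-mix the token hash with the k-th seed.
            h = splitmix64(base ^ seed)
            if h < sig[k]:
                sig[k] = h
    return sig

def similarity(a: list[int], b: list[int]) -> float:
    # Fraction of equal signature entries approximates token-set similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def match_closest(query: str, corpus_sigs: list[list[int]], corpus: list[str]) -> str:
    q = signature(query)
    best = max(range(len(corpus)), key=lambda i: similarity(q, corpus_sigs[i]))
    return corpus[best]

corpus = ["hello world", "rust programming"]
sigs = [signature(s) for s in corpus]  # precomputed once, like the index build
print(match_closest("hello wurld", sigs, corpus))  # expected: "hello world"
```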

---

## Notes & Limitations

- Similarity is **approximate** (MinHash-based), not exact edit distance.
- Each query returns only the **single best match**; there is no top-k interface.
- The index is **immutable** after construction; to add or remove strings, build a new `TokenFuzzer` (see below).
- The library is intended to be used from Python; the Rust code is an internal implementation detail.
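
Because the index is immutable, updating the corpus means rebuilding and paying the one-off cost again, for example:

```python
from token_fuzz_rs import TokenFuzzer

data = ["hello world", "rust programming"]
fuzzer = TokenFuzzer(data)

# The corpus changed: rebuild the index (the one-off cost recurs).
data.append("fuzzy token matcher")
fuzzer = TokenFuzzer(data)
```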

---

## License

This project is licensed under the **MIT License**.
Feel free to use it, open small PRs, or file issues with feature requests.

token_fuzz_rs-0.1.0.dist-info/RECORD
ADDED
@@ -0,0 +1,8 @@
token_fuzz_rs-0.1.0.dist-info/METADATA,sha256=XXSJKJwLoGjRW71Cc65pD1zr_x0MjIDwhHHnG8FfZ4g,7199
token_fuzz_rs-0.1.0.dist-info/WHEEL,sha256=q_ChmeTQtmZLQTfXc0K4PTDs4yNo_IG3wENqZNScc1s,159
token_fuzz_rs-0.1.0.dist-info/licenses/LICENSE,sha256=ELI5NZabgRlP2kgrYjyDelyrdE7uuQaGNkmyrSGUa9k,1074
token_fuzz_rs/__init__.py,sha256=3xpBvLa8QmnODVoD6E99O-MQ9Wog2GkRekqxq6c_GuE,135
token_fuzz_rs/__init__.pyi,sha256=BTIALcdvIY5pYOZ2pjaIiNoNk_WaccuTGT91kpON0nY,219
token_fuzz_rs/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
token_fuzz_rs/token_fuzz_rs.pypy311-pp73-s390x-linux-gnu.so,sha256=h3jHlJTzpJZnpph4XHyBSZoVtp1GeIoIe9tyc_bSrFY,896192
token_fuzz_rs-0.1.0.dist-info/RECORD,,

token_fuzz_rs-0.1.0.dist-info/licenses/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Matthew Akram

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.