@nlptools/distance 0.0.2 → 0.0.4

package/LICENSE CHANGED
@@ -1,21 +1,21 @@
1
- MIT License
2
-
3
- Copyright (c) 2023 Demo Macro
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a copy
6
- of this software and associated documentation files (the "Software"), to deal
7
- in the Software without restriction, including without limitation the rights
8
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
- copies of the Software, and to permit persons to whom the Software is
10
- furnished to do so, subject to the following conditions:
11
-
12
- The above copyright notice and this permission notice shall be included in all
13
- copies or substantial portions of the Software.
14
-
15
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
- SOFTWARE.
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Demo Macro
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md CHANGED
@@ -1,23 +1,19 @@
1
1
  # @nlptools/distance
2
2
 
3
3
  ![npm version](https://img.shields.io/npm/v/@nlptools/distance)
4
- ![npm downloads](https://img.shields.io/npm/dw/@nlptools/distance)
5
4
  ![npm license](https://img.shields.io/npm/l/@nlptools/distance)
6
- [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](https://www.contributor-covenant.org/version/2/1/code_of_conduct/)
7
5
 
8
- > Complete string distance and similarity algorithms package with WebAssembly and JavaScript implementations
9
-
10
- This package provides comprehensive text similarity and distance algorithms, combining the high-performance WebAssembly implementation from `@nlptools/distance-wasm` with additional JavaScript-based algorithms for maximum compatibility and performance.
6
+ > High-performance string distance and similarity algorithms, implemented in pure TypeScript
11
7
 
12
8
  ## Features
13
9
 
14
- - **Dual Implementation**: WebAssembly for performance + JavaScript for compatibility
15
- - 🧮 **Comprehensive Algorithms**: 30+ string similarity and distance algorithms
16
- - 🎯 **Multiple Categories**: Edit-based, sequence-based, token-based, and naive algorithms
17
- - 📝 **TypeScript First**: Full type safety with comprehensive API
18
- - 🔧 **Universal Interface**: Single compare function for all algorithms
19
- - 📊 **Normalized Results**: Consistent 0-1 similarity scores across algorithms
20
- - 🚀 **Auto-optimization**: Automatically chooses the fastest implementation available
10
+ - Pure TypeScript implementation, zero native dependencies
11
+ - Edit distance: Levenshtein, LCS (Myers O(ND) and DP)
12
+ - Token similarity: Jaccard, Cosine, Sorensen-Dice (character multiset and n-gram variants)
13
+ - Hash-based deduplication: SimHash, MinHash, LSH
14
+ - Fuzzy search: `FuzzySearch` class and `findBestMatch` with multi-algorithm support
15
+ - Diff: based on `@algorithm.ts/diff` (Myers and DP backends)
16
+ - All distance algorithms include normalized similarity variants (0-1 range)
21
17
 
22
18
  ## Installation
23
19
 
@@ -34,94 +30,294 @@ pnpm add @nlptools/distance
34
30
 
35
31
  ## Usage
36
32
 
37
- ### Basic Setup
33
+ ### Edit Distance
34
+
35
+ ```typescript
36
+ import { levenshtein, levenshteinNormalized } from "@nlptools/distance";
37
+
38
+ levenshtein("kitten", "sitting"); // 3
39
+ levenshteinNormalized("kitten", "sitting"); // 0.571
40
+ ```
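The normalized value in the example above is consistent with `1 - distance / max(len(a), len(b))`, i.e. `1 - 3/7 ≈ 0.571`. The sketch below shows that relationship with a plain two-row dynamic-programming distance. The function names are hypothetical, and the normalization formula is an assumption inferred from the example output, not taken from the package source.

```typescript
// Two-row dynamic-programming Levenshtein distance (illustrative only,
// not the package's implementation).
function levenshteinSketch(a: string, b: string): number {
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / max(len(a), len(b)); two empty strings score 1.
function levenshteinNormalizedSketch(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshteinSketch(a, b) / maxLen;
}

console.log(levenshteinSketch("kitten", "sitting")); // 3
console.log(levenshteinNormalizedSketch("kitten", "sitting").toFixed(3)); // "0.571"
```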
41
+
42
+ ### LCS (Longest Common Subsequence)
38
43
 
39
44
  ```typescript
40
- import * as distance from "@nlptools/distance";
45
+ import { lcsDistance, lcsNormalized, lcsLength, lcsPairs } from "@nlptools/distance";
41
46
 
42
- // All algorithms are available as named functions
43
- console.log(distance.levenshtein("kitten", "sitting")); // 3
44
- console.log(distance.jaro("hello", "hallo")); // 0.8666666666666667
45
- console.log(distance.cosine("abc", "bcd")); // 0.6666666666666666
47
+ lcsDistance("abcde", "ace"); // 2 (= 5 + 3 - 2 * 3)
48
+ lcsNormalized("abcde", "ace"); // 0.75
49
+ lcsLength("abcde", "ace"); // 3
50
+ lcsPairs("abcde", "ace"); // [[0,0], [2,1], [4,2]]
46
51
  ```
47
52
 
48
- ### Distance vs Similarity
53
+ Uses the Myers O(ND) algorithm by default. Switch to the DP backend with `algorithm: "dp"`.
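The four example values fit together as `distance = len(a) + len(b) - 2 * lcsLength` and, apparently, `normalized = 1 - distance / (len(a) + len(b))`. The sketch below reproduces them with a plain DP table rather than Myers. The function names are hypothetical, and the normalization is an assumption inferred from the `0.75` in the example.

```typescript
// table[i][j] = LCS length of the suffixes a[i..] and b[j..] (plain DP, not Myers).
function lcsTableSketch(a: string, b: string): number[][] {
  const t: number[][] = [];
  for (let i = 0; i <= a.length; i++) t.push(new Array<number>(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      t[i][j] = a[i] === b[j] ? t[i + 1][j + 1] + 1 : Math.max(t[i + 1][j], t[i][j + 1]);
    }
  }
  return t;
}

// Walk the table from the start to recover one set of matching index pairs.
function lcsPairsSketch(a: string, b: string): [number, number][] {
  const t = lcsTableSketch(a, b);
  const pairs: [number, number][] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      pairs.push([i, j]);
      i++;
      j++;
    } else if (t[i + 1][j] >= t[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return pairs;
}

function lcsDistanceSketch(a: string, b: string): number {
  return a.length + b.length - 2 * lcsTableSketch(a, b)[0][0];
}

// Assumed normalization: 1 - distance / (len(a) + len(b)).
function lcsNormalizedSketch(a: string, b: string): number {
  const total = a.length + b.length;
  return total === 0 ? 1 : 1 - lcsDistanceSketch(a, b) / total;
}

console.log(lcsDistanceSketch("abcde", "ace")); // 2
console.log(lcsNormalizedSketch("abcde", "ace")); // 0.75
console.log(JSON.stringify(lcsPairsSketch("abcde", "ace"))); // [[0,0],[2,1],[4,2]]
```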
49
54
 
50
- Most algorithms have both distance and normalized versions:
55
+ ### Token Similarity (Character Multiset)
56
+
57
+ Based on character frequency maps (Counter), matching the `textdistance` crate semantics:
51
58
 
52
59
  ```typescript
53
- // Distance algorithms (lower is more similar)
54
- const dist = distance.levenshtein("cat", "bat"); // 1
60
+ import { jaccard, cosine, sorensen } from "@nlptools/distance";
55
61
 
56
- // Similarity algorithms (higher is more similar, 0-1 range)
57
- const sim = distance.levenshtein_normalized("cat", "bat"); // 0.6666666666666666
62
+ jaccard("abc", "abd"); // 0.667
63
+ cosine("hello", "hallo"); // 0.8
64
+ sorensen("test", "text"); // 0.75
58
65
  ```
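To make "character multiset" concrete: each string becomes a map from character to count, the intersection size is the sum of per-character minimum counts, and the three coefficients are ratios built from that number. The sketch below uses the textbook multiset definitions. All names are hypothetical, and whether the package computes exactly these formulas is an assumption; the `cosine` and `sorensen` outputs shown in the README are consistent with them.

```typescript
// Character frequency map, similar in spirit to a Counter.
function charCounts(s: string): Map<string, number> {
  const m = new Map<string, number>();
  for (let i = 0; i < s.length; i++) {
    const ch = s[i];
    m.set(ch, (m.get(ch) || 0) + 1);
  }
  return m;
}

// Multiset intersection size: sum of the per-character minimum counts.
function multisetIntersection(a: string, b: string): number {
  const ca = charCounts(a);
  const cb = charCounts(b);
  let n = 0;
  ca.forEach((count, ch) => {
    n += Math.min(count, cb.get(ch) || 0);
  });
  return n;
}

// Jaccard: |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B| for multisets.
function jaccardSketch(a: string, b: string): number {
  const inter = multisetIntersection(a, b);
  const union = a.length + b.length - inter;
  return union === 0 ? 1 : inter / union;
}

// Cosine (set-style form): |A ∩ B| / sqrt(|A| * |B|).
function cosineSketch(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  return multisetIntersection(a, b) / Math.sqrt(a.length * b.length);
}

// Sorensen-Dice: 2 * |A ∩ B| / (|A| + |B|).
function sorensenSketch(a: string, b: string): number {
  const total = a.length + b.length;
  return total === 0 ? 1 : (2 * multisetIntersection(a, b)) / total;
}

console.log(cosineSketch("hello", "hallo")); // 0.8
console.log(sorensenSketch("test", "text")); // 0.75
```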
59
66
 
60
- ### Available Algorithms
67
+ ### N-Gram Variants
68
+
69
+ ```typescript
70
+ import { jaccardNgram, cosineNgram, sorensenNgram } from "@nlptools/distance";
61
71
 
62
- This package includes all algorithms from `@nlptools/distance-wasm` plus additional JavaScript implementations:
72
+ jaccardNgram("hello", "hallo"); // 0.333 (bigram-based)
73
+ cosineNgram("hello", "hallo"); // 0.5 (bigram-based)
74
+ ```
63
75
 
64
- #### Edit Distance Algorithms
76
+ ### SimHash (Document Fingerprinting)
65
77
 
66
- - `levenshtein` - Classic edit distance
67
- - `fastest_levenshtein` - High-performance Levenshtein distance (fastest-levenshtein)
68
- - `damerau_levenshtein` - Edit distance with transpositions
69
- - `myers_levenshtein` - Myers bit-parallel algorithm for edit distance
70
- - `jaro` - Jaro similarity
71
- - `jarowinkler` - Jaro-Winkler similarity
72
- - `hamming` - Hamming distance for equal-length strings
73
- - `sift4_simple` - SIFT4 algorithm
78
+ ```typescript
79
+ import { simhash, hammingDistance, SimHasher } from "@nlptools/distance";
80
+
81
+ // Function-based
82
+ const fp1 = simhash(["hello", "world"]);
83
+ const fp2 = simhash(["hello", "earth"]);
84
+ hammingDistance(fp1, fp2); // small = similar
85
+
86
+ // Class-based
87
+ const hasher = new SimHasher();
88
+ const a = hasher.hash(["hello", "world"]);
89
+ const b = hasher.hash(["hello", "earth"]);
90
+ hasher.isDuplicate(a, b); // true if hamming distance <= 3
91
+ ```
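For readers unfamiliar with SimHash, the idea behind the example above can be sketched as per-bit voting: every feature is hashed, each hash bit votes +1 or -1, and the sign of each tally becomes one fingerprint bit. The sketch below is a simplified 32-bit variant using FNV-1a on plain numbers; the package documents a 64-bit `bigint` fingerprint, so the width, the hash function, and all names here are assumptions for illustration.

```typescript
// 32-bit FNV-1a string hash (assumed hash function, chosen for brevity).
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Each feature votes +1 on bits set in its hash and -1 on bits that are clear;
// a fingerprint bit is 1 when its tally is positive.
function simhash32Sketch(features: string[]): number {
  const votes = new Array<number>(32).fill(0);
  for (const feature of features) {
    const h = fnv1a32(feature);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += ((h >>> bit) & 1) === 1 ? 1 : -1;
    }
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (votes[bit] > 0) fingerprint |= 1 << bit;
  }
  return fingerprint >>> 0;
}

// Hamming distance: number of set bits in the XOR of two fingerprints.
function hamming32Sketch(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let n = 0;
  while (x !== 0) {
    x &= x - 1; // clear the lowest set bit
    n++;
  }
  return n;
}

const fpA = simhash32Sketch(["hello", "world"]);
const fpB = simhash32Sketch(["hello", "earth"]);
console.log(hamming32Sketch(fpA, fpB)); // fewer differing bits suggests more shared features
```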
74
92
 
75
- #### Sequence-based Algorithms
93
+ ### MinHash (Jaccard Similarity Estimation)
76
94
 
77
- - `lcs_seq` - Longest common subsequence
78
- - `lcs_str` - Longest common substring
79
- - `ratcliff_obershelp` - Gestalt pattern matching
80
- - `smith_waterman` - Local sequence alignment
95
+ ```typescript
96
+ import { MinHash } from "@nlptools/distance";
81
97
 
82
- #### Token-based Algorithms
98
+ const mh1 = new MinHash({ numHashes: 128 });
99
+ mh1.update("hello");
100
+ mh1.update("world");
83
101
 
84
- - `jaccard` - Jaccard similarity
85
- - `cosine` - Cosine similarity
86
- - `sorensen` - Sørensen-Dice coefficient
87
- - `tversky` - Tversky index
88
- - `overlap` - Overlap coefficient
102
+ const mh2 = new MinHash({ numHashes: 128 });
103
+ mh2.update("hello");
104
+ mh2.update("earth");
89
105
 
90
- #### Bigram Algorithms
106
+ MinHash.estimate(mh1.digest(), mh2.digest()); // ~0.67
107
+ ```
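The `update()` / `digest()` / `estimate()` flow above can be sketched as follows: each of the `numHashes` slots keeps the minimum value of its own hash function over all tokens seen, and the Jaccard estimate is the fraction of slots where two signatures agree. The class below is a hypothetical illustration, not the package's `MinHash`. The multiply-add hash family and the seeding scheme are assumptions chosen for brevity and are not min-wise independent.

```typescript
// 32-bit FNV-1a, used here as the base token hash (an assumption).
function hashToken(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class MinHashSketch {
  private readonly a: number[] = [];
  private readonly b: number[] = [];
  private readonly mins: number[];

  constructor(numHashes: number = 128, seed: number = 1) {
    // Deterministic linear congruential generator for the per-slot coefficients.
    let state = seed >>> 0;
    const next = (): number => {
      state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
      return state;
    };
    for (let i = 0; i < numHashes; i++) {
      this.a.push(next() | 1); // odd multiplier, so x -> a*x + b permutes 32-bit values
      this.b.push(next());
    }
    this.mins = new Array<number>(numHashes).fill(0xffffffff);
  }

  // Fold one token into the signature: keep the per-slot minimum.
  update(token: string): void {
    const x = hashToken(token);
    for (let i = 0; i < this.mins.length; i++) {
      const h = (Math.imul(this.a[i], x) + this.b[i]) >>> 0;
      if (h < this.mins[i]) this.mins[i] = h;
    }
  }

  digest(): number[] {
    return this.mins.slice();
  }

  // Estimated Jaccard similarity: fraction of slots with equal minima.
  static estimate(sig1: number[], sig2: number[]): number {
    if (sig1.length !== sig2.length) throw new Error("signature lengths differ");
    if (sig1.length === 0) return 0;
    let equal = 0;
    for (let i = 0; i < sig1.length; i++) {
      if (sig1[i] === sig2[i]) equal++;
    }
    return equal / sig1.length;
  }
}
```

Both signatures must be built with the same `numHashes` and `seed`; otherwise the slots are not comparable.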
91
108
 
92
- - `jaccard_bigram` - Jaccard similarity on character bigrams
93
- - `cosine_bigram` - Cosine similarity on character bigrams
109
+ ### LSH (Approximate Nearest Neighbor Search)
94
110
 
95
- #### Naive Algorithms
111
+ ```typescript
112
+ import { MinHash } from "@nlptools/distance";
113
+ import { LSH } from "@nlptools/distance";
96
114
 
97
- - `prefix` - Prefix similarity
98
- - `suffix` - Suffix similarity
99
- - `length` - Length-based similarity
115
+ const lsh = new LSH({ numBands: 16, numHashes: 128 });
100
116
 
101
- ### Universal Compare Function
117
+ const mh = new MinHash({ numHashes: 128 });
118
+ mh.update("hello");
119
+ mh.update("world");
120
+ lsh.insert("doc1", mh.digest());
121
+
122
+ // Query for similar documents
123
+ const query = lsh.query(mh.digest(), 0.5);
124
+ // => [["doc1", 0.67]]
125
+ ```
126
+
127
+ ### Fuzzy Search
102
128
 
103
129
  ```typescript
104
- const result = distance.compare("hello", "hallo", "jaro");
105
- console.log(result); // 0.8666666666666667
130
+ import { FuzzySearch, findBestMatch } from "@nlptools/distance";
131
+
132
+ // String array search
133
+ const search = new FuzzySearch(["apple", "banana", "cherry"]);
134
+ search.search("aple");
135
+ // => [{ item: "apple", score: 0.8, index: 0 }]
136
+
137
+ // Object array with weighted keys
138
+ const books = [
139
+ { title: "Old Man's War", author: "John Scalzi" },
140
+ { title: "The Great Gatsby", author: "F. Scott Fitzgerald" },
141
+ ];
142
+ const bookSearch = new FuzzySearch(books, {
143
+ keys: [
144
+ { name: "title", weight: 0.7 },
145
+ { name: "author", weight: 0.3 },
146
+ ],
147
+ algorithm: "cosine",
148
+ threshold: 0.3,
149
+ });
150
+ bookSearch.search("old man");
151
+ // => [{ item: { title: "Old Man's War", ... }, score: 0.52, index: 0 }]
152
+
153
+ // One-shot best match
154
+ findBestMatch("kitten", ["sitting", "kit", "mitten"]);
155
+ // => { item: "kit", score: 0.5, index: 1 }
156
+
157
+ // With per-key details
158
+ const detailed = new FuzzySearch(books, {
159
+ keys: [{ name: "title" }, { name: "author" }],
160
+ includeMatchDetails: true,
161
+ });
162
+ detailed.search("gatsby");
163
+ // => [{ item: ..., score: 0.45, index: 1, matches: { title: 0.6, author: 0.1 } }]
164
+ ```
165
+
166
+ ### Diff
106
167
 
107
- // Use fastest-levenshtein for optimal performance
108
- console.log(distance.fastest_levenshtein("fast", "faster")); // 2
168
+ ```typescript
169
+ import { diff, DiffType } from "@nlptools/distance";
170
+
171
+ const result = diff("abc", "ac");
172
+ // => [
173
+ // { type: DiffType.COMMON, tokens: "a" },
174
+ // { type: DiffType.REMOVED, tokens: "b" },
175
+ // { type: DiffType.COMMON, tokens: "c" },
176
+ // ]
109
177
  ```
110
178
 
179
+ ## API Reference
180
+
181
+ ### Edit Distance
182
+
183
+ | Function | Description | Returns |
184
+ | --------------------------------- | ------------------------- | -------------------- |
185
+ | `levenshtein(a, b)` | Levenshtein edit distance | `number` |
186
+ | `levenshteinNormalized(a, b)` | Normalized similarity | `number` (0-1) |
187
+ | `lcsDistance(a, b, algorithm?)` | LCS distance | `number` |
188
+ | `lcsNormalized(a, b, algorithm?)` | Normalized LCS similarity | `number` (0-1) |
189
+ | `lcsLength(a, b, algorithm?)` | LCS length | `number` |
190
+ | `lcsPairs(a, b, algorithm?)` | LCS matching pairs | `[number, number][]` |
191
+
192
+ ### Token Similarity
193
+
194
+ | Function | Description | Returns |
195
+ | ------------------------- | ---------------------------------------------- | -------------- |
196
+ | `jaccard(a, b)` | Jaccard similarity (character multiset) | `number` (0-1) |
197
+ | `jaccardNgram(a, b, n?)` | Jaccard on character n-grams | `number` (0-1) |
198
+ | `cosine(a, b)` | Cosine similarity (character multiset) | `number` (0-1) |
199
+ | `cosineNgram(a, b, n?)` | Cosine on character n-grams | `number` (0-1) |
200
+ | `sorensen(a, b)` | Sorensen-Dice coefficient (character multiset) | `number` (0-1) |
201
+ | `sorensenNgram(a, b, n?)` | Sorensen-Dice on character n-grams | `number` (0-1) |
202
+
203
+ ### Hash-Based Deduplication
204
+
205
+ | Function / Class | Description |
206
+ | -------------------------------- | ------------------------------------------------------------------ |
207
+ | `simhash(features, options?)` | Generate 64-bit fingerprint as `bigint` |
208
+ | `hammingDistance(a, b)` | Hamming distance between two fingerprints |
209
+ | `hammingSimilarity(a, b, bits?)` | Normalized Hamming similarity |
210
+ | `SimHasher` | Class with `hash()`, `distance()`, `similarity()`, `isDuplicate()` |
211
+ | `MinHash` | Class with `update()`, `digest()`, `estimate()` |
212
+ | `MinHash.estimate(sig1, sig2)` | Static: estimate Jaccard from signatures |
213
+ | `LSH` | Class with `insert()`, `query()`, `remove()` |
214
+
215
+ ### Fuzzy Search
216
+
217
+ | Function / Class | Description |
218
+ | -------------------------------------------- | -------------------------------------------------- |
219
+ | `FuzzySearch<T>(collection, options?)` | Search engine with dynamic collection management |
220
+ | `findBestMatch(query, collection, options?)` | One-shot convenience: returns best match or `null` |
221
+
222
+ **FuzzySearch options:**
223
+
224
+ | Option | Type | Default | Description |
225
+ | --------------------- | ---------------------------------- | --------------- | ----------------------------- |
226
+ | `algorithm` | `BuiltinAlgorithm \| SimilarityFn` | `"levenshtein"` | Similarity algorithm to use |
227
+ | `keys` | `ISearchKey[]` | `[]` | Object fields to search on |
228
+ | `threshold` | `number` | `0` | Min similarity score (0-1) |
229
+ | `limit` | `number` | `Infinity` | Max results to return |
230
+ | `caseSensitive` | `boolean` | `false` | Case-insensitive by default |
231
+ | `includeMatchDetails` | `boolean` | `false` | Include per-key scores |
232
+ | `lsh` | `{ numHashes?, numBands? }` | — | Enable LSH for large datasets |
233
+
234
+ **Built-in algorithms:** `"levenshtein"`, `"lcs"`, `"jaccard"`, `"jaccardNgram"`, `"cosine"`, `"cosineNgram"`, `"sorensen"`, `"sorensenNgram"`
235
+
236
+ ### Diff
237
+
238
+ | Function | Description | Returns |
239
+ | ---------------------- | --------------------------- | ---------------- |
240
+ | `diff(a, b, options?)` | Sequence diff (Myers or DP) | `IDiffItem<T>[]` |
241
+
242
+ ### Types
243
+
244
+ | Type | Description |
245
+ | ----------------------- | -------------------------------------------- |
246
+ | `DiffType` | Enum: `ADDED`, `REMOVED`, `COMMON` |
247
+ | `IDiffItem<T>` | Diff result item with type and tokens |
248
+ | `IDiffOptions<T>` | Options for diff (equals, lcs algorithm) |
249
+ | `ISimHashOptions` | Options for SimHash (bits, hashFn) |
250
+ | `IMinHashOptions` | Options for MinHash (numHashes, seed) |
251
+ | `ILSHOptions` | Options for LSH (numBands, numHashes) |
252
+ | `IFuzzySearchOptions` | Options for FuzzySearch constructor |
253
+ | `IFindBestMatchOptions` | Options for findBestMatch function |
254
+ | `ISearchKey` | Searchable key config (name, weight, getter) |
255
+ | `ISearchResult<T>` | Search result with item, score, index |
256
+ | `SimilarityFn` | `(a: string, b: string) => number` in [0,1] |
257
+
111
258
  ## Performance
112
259
 
113
- The package automatically selects the fastest implementation available:
260
+ Benchmark: 1000 iterations per pair, same test data across all runtimes.
261
+ Unit: microseconds per operation (us/op).
262
+
263
+ ### Edit Distance
264
+
265
+ | Algorithm | Size | TS (V8 JIT) | WASM (via JS) | Rust (native) |
266
+ | --------------- | --------------- | ----------- | ------------- | ------------- |
267
+ | levenshtein | Short (<10) | 0.3 | 7.9 | 0.11 |
268
+ | levenshtein | Medium (10-100) | 1.3 | 116.2 | 0.98 |
269
+ | levenshtein | Long (>200) | 15.2 | 2,877.5 | 39.68 |
270
+ | levenshteinNorm | Short | 0.3 | 7.9 | 0.11 |
271
+ | lcs | Short (<10) | 1.6 | 16.5 | 0.41 |
272
+ | lcs | Medium (10-100) | 6.8 | 272.6 | 3.22 |
273
+ | lcs | Long (>200) | 217.8 | 6,574.1 | 122.63 |
274
+ | lcsNorm | Short | 1.7 | 16.2 | 0.48 |
275
+
276
+ ### Token Similarity (Character Multiset)
277
+
278
+ | Algorithm | Size | TS (V8 JIT) | WASM (via JS) | Rust (native) |
279
+ | --------- | --------------- | ----------- | ------------- | ------------- |
280
+ | jaccard | Short (<10) | 0.8 | 25.2 | 0.42 |
281
+ | jaccard | Medium (10-100) | 0.8 | 74.3 | 1.55 |
282
+ | jaccard | Long (>200) | 1.6 | 171.5 | 5.54 |
283
+ | cosine | Short (<10) | 0.8 | 19.3 | 0.32 |
284
+ | cosine | Medium (10-100) | 0.8 | 61.4 | 1.35 |
285
+ | cosine | Long (>200) | 1.5 | 158.5 | 4.77 |
286
+ | sorensen | Short (<10) | 0.7 | 19.3 | 0.33 |
287
+ | sorensen | Medium (10-100) | 0.7 | 61.0 | 1.33 |
288
+ | sorensen | Long (>200) | 1.5 | 160.0 | 4.46 |
289
+
290
+ ### Bigram Variants
291
+
292
+ | Algorithm | Size | TS (V8 JIT) | WASM (via JS) | Rust (native) |
293
+ | ------------- | --------------- | ----------- | ------------- | ------------- |
294
+ | jaccardBigram | Short (<10) | 1.1 | 27.4 | 0.45 |
295
+ | jaccardBigram | Medium (10-100) | 7.7 | 160.4 | 3.86 |
296
+ | cosineBigram | Short (<10) | 0.8 | 21.2 | 0.36 |
297
+ | cosineBigram | Medium (10-100) | 5.9 | 127.0 | 3.12 |
298
+
299
+ The TS implementations rely on V8 JIT optimization, an `Int32Array` fast path for ASCII input, and integer-encoded bigrams, which avoids the JS-WASM boundary overhead entirely.
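One way such a fast path could look is sketched below: for ASCII input, each bigram is packed into a single integer `(c1 << 7) | c2` and counted in a reusable `Int32Array`, with a `Map` fallback for other input. This is a guess at the technique the note describes, not the package's code; the function names, the table size, and the choice of multiset Jaccard are assumptions.

```typescript
// Reusable count table indexed by the packed bigram (128 * 128 ASCII pairs).
const bigramTable = new Int32Array(128 * 128);

function isAscii(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 127) return false;
  }
  return true;
}

// Jaccard similarity on character-bigram multisets.
function jaccardBigramSketch(a: string, b: string): number {
  const na = Math.max(a.length - 1, 0); // number of bigrams in a
  const nb = Math.max(b.length - 1, 0); // number of bigrams in b
  let inter = 0;
  if (isAscii(a) && isAscii(b)) {
    // Fast path: count a's bigrams, then consume matches while scanning b.
    for (let i = 0; i < na; i++) bigramTable[(a.charCodeAt(i) << 7) | a.charCodeAt(i + 1)]++;
    for (let i = 0; i < nb; i++) {
      const key = (b.charCodeAt(i) << 7) | b.charCodeAt(i + 1);
      if (bigramTable[key] > 0) {
        bigramTable[key]--;
        inter++;
      }
    }
    // Reset only the entries a touched so the table can be reused.
    for (let i = 0; i < na; i++) bigramTable[(a.charCodeAt(i) << 7) | a.charCodeAt(i + 1)] = 0;
  } else {
    // Fallback: same logic with a Map keyed by a 32-bit packed pair of UTF-16 code units.
    const counts = new Map<number, number>();
    for (let i = 0; i < na; i++) {
      const key = a.charCodeAt(i) * 65536 + a.charCodeAt(i + 1);
      counts.set(key, (counts.get(key) || 0) + 1);
    }
    for (let i = 0; i < nb; i++) {
      const key = b.charCodeAt(i) * 65536 + b.charCodeAt(i + 1);
      const c = counts.get(key) || 0;
      if (c > 0) {
        counts.set(key, c - 1);
        inter++;
      }
    }
  }
  const union = na + nb - inter;
  return union === 0 ? 1 : inter / union;
}

console.log(jaccardBigramSketch("hello", "hallo").toFixed(3)); // "0.333"
```

Integer keys avoid allocating a substring per bigram, and the typed array avoids hashing altogether on the ASCII path, which is plausibly where the short-string gains in the tables above come from.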
300
+
301
+ ### Fuzzy Search: NLPTools vs Fuse.js
302
+
303
+ Benchmark: 20 items in collection, 6 queries per iteration, 1000 iterations.
304
+ Unit: milliseconds per operation (ms/op). Algorithm: levenshtein (default).
114
305
 
115
- - **WebAssembly algorithms**: 10-100x faster than pure JavaScript
116
- - **Auto-detection**: Seamlessly switches between WASM and JS implementations
306
+ | Scenario | NLPTools | Fuse.js |
307
+ | ----------------------- | -------- | ------- |
308
+ | Setup (constructor) | 0.0002 | 0.0050 |
309
+ | Search (string array) | 0.0114 | 0.1077 |
310
+ | Search (object, 1 key) | 0.0176 | 0.3308 |
311
+ | Search (object, 2 keys) | 0.0289 | 0.6445 |
117
312
 
118
- ## References
313
+ Both libraries return identical top-1 results for all test queries. NLPTools scores are normalized similarity (0-1, higher is better); Fuse.js uses Bitap error scores (0 = perfect, lower is better).
119
314
 
120
- This package incorporates and builds upon the following excellent open source projects:
315
+ ## Dependencies
121
316
 
122
- - [textdistance.rs](https://github.com/life4/textdistance.rs) - Core Rust implementation via @nlptools/distance-wasm
123
- - [fastest-levenshtein](https://github.com/ka-weihe/fastest-levenshtein) - High-performance Levenshtein implementation
317
+ - `fastest-levenshtein`: high-performance JavaScript Levenshtein implementation
318
+ - `@algorithm.ts/lcs`: Myers and DP Longest Common Subsequence
319
+ - `@algorithm.ts/diff`: sequence diff built on LCS
124
320
 
125
321
  ## License
126
322
 
127
- - [MIT](../../LICENSE) &copy; [Demo Macro](https://imst.xyz/)
323
+ [MIT](../../LICENSE) &copy; [Demo Macro](https://www.demomacro.com/)