@nlptools/distance 0.0.2 → 0.0.3

package/LICENSE CHANGED
@@ -1,21 +1,21 @@
1
- MIT License
2
-
3
- Copyright (c) 2023 Demo Macro
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a copy
6
- of this software and associated documentation files (the "Software"), to deal
7
- in the Software without restriction, including without limitation the rights
8
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
- copies of the Software, and to permit persons to whom the Software is
10
- furnished to do so, subject to the following conditions:
11
-
12
- The above copyright notice and this permission notice shall be included in all
13
- copies or substantial portions of the Software.
14
-
15
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
- SOFTWARE.
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Demo Macro
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md CHANGED
@@ -1,23 +1,18 @@
1
1
  # @nlptools/distance
2
2
 
3
3
  ![npm version](https://img.shields.io/npm/v/@nlptools/distance)
4
- ![npm downloads](https://img.shields.io/npm/dw/@nlptools/distance)
5
4
  ![npm license](https://img.shields.io/npm/l/@nlptools/distance)
6
- [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](https://www.contributor-covenant.org/version/2/1/code_of_conduct/)
7
5
 
8
- > Complete string distance and similarity algorithms package with WebAssembly and JavaScript implementations
9
-
10
- This package provides comprehensive text similarity and distance algorithms, combining the high-performance WebAssembly implementation from `@nlptools/distance-wasm` with additional JavaScript-based algorithms for maximum compatibility and performance.
6
+ > High-performance string distance and similarity algorithms, implemented in pure TypeScript
11
7
 
12
8
  ## Features
13
9
 
14
- - **Dual Implementation**: WebAssembly for performance + JavaScript for compatibility
15
- - 🧮 **Comprehensive Algorithms**: 30+ string similarity and distance algorithms
16
- - 🎯 **Multiple Categories**: Edit-based, sequence-based, token-based, and naive algorithms
17
- - 📝 **TypeScript First**: Full type safety with comprehensive API
18
- - 🔧 **Universal Interface**: Single compare function for all algorithms
19
- - 📊 **Normalized Results**: Consistent 0-1 similarity scores across algorithms
20
- - 🚀 **Auto-optimization**: Automatically chooses the fastest implementation available
10
+ - Pure TypeScript implementation, zero native dependencies
11
+ - Edit distance: Levenshtein, LCS (Myers O(ND) and DP)
12
+ - Token similarity: Jaccard, Cosine, Sorensen-Dice (character multiset and n-gram variants)
13
+ - Hash-based deduplication: SimHash, MinHash, LSH
14
+ - Diff: based on `@algorithm.ts/diff` (Myers and DP backends)
15
+ - All distance algorithms include normalized similarity variants (0-1 range)
21
16
 
22
17
  ## Installation
23
18
 
@@ -34,94 +29,215 @@ pnpm add @nlptools/distance
34
29
 
35
30
  ## Usage
36
31
 
37
- ### Basic Setup
32
+ ### Edit Distance
38
33
 
39
34
  ```typescript
40
- import * as distance from "@nlptools/distance";
35
+ import { levenshtein, levenshteinNormalized } from "@nlptools/distance";
41
36
 
42
- // All algorithms are available as named functions
43
- console.log(distance.levenshtein("kitten", "sitting")); // 3
44
- console.log(distance.jaro("hello", "hallo")); // 0.8666666666666667
45
- console.log(distance.cosine("abc", "bcd")); // 0.6666666666666666
37
+ levenshtein("kitten", "sitting"); // 3
38
+ levenshteinNormalized("kitten", "sitting"); // 0.571
46
39
  ```
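To make the two calls above concrete, here is a minimal, self-contained sketch of the classic two-row dynamic-programming Levenshtein distance. It is not the package's implementation (the Dependencies section lists `fastest-levenshtein` for that). The names `levenshteinSketch` and `levenshteinNormalizedSketch` are hypothetical, and the normalization `1 - distance / max(|a|, |b|)` is an assumption chosen because it reproduces the 0.571 figure shown above.

```typescript
// Hypothetical helper: classic two-row DP Levenshtein distance.
function levenshteinSketch(a: string, b: string): number {
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // deletion
          curr[j - 1] + 1, // insertion
          prev[j - 1] + cost, // substitution or match
        ),
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / max length (1 for two empty strings).
function levenshteinNormalizedSketch(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshteinSketch(a, b) / longest;
}
```

With these definitions, `levenshteinSketch("kitten", "sitting")` evaluates to 3 and `levenshteinNormalizedSketch("kitten", "sitting")` to 4/7, which rounds to 0.571.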
47
40
 
48
- ### Distance vs Similarity
49
-
50
- Most algorithms have both distance and normalized versions:
41
+ ### LCS (Longest Common Subsequence)
51
42
 
52
43
  ```typescript
53
- // Distance algorithms (lower is more similar)
54
- const dist = distance.levenshtein("cat", "bat"); // 1
44
+ import { lcsDistance, lcsNormalized, lcsLength, lcsPairs } from "@nlptools/distance";
55
45
 
56
- // Similarity algorithms (higher is more similar, 0-1 range)
57
- const sim = distance.levenshtein_normalized("cat", "bat"); // 0.6666666666666666
46
+ lcsDistance("abcde", "ace"); // 2 (= 5 + 3 - 2 * 3)
47
+ lcsNormalized("abcde", "ace"); // 0.75
48
+ lcsLength("abcde", "ace"); // 3
49
+ lcsPairs("abcde", "ace"); // [[0,0], [2,1], [4,2]]
58
50
  ```
59
51
 
60
- ### Available Algorithms
52
+ Uses the Myers O(ND) algorithm by default. Switch to DP with `algorithm: "dp"`.
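The arithmetic in the `lcsDistance` comment (5 + 3 - 2 * 3) can be reproduced with a small DP sketch. This is only an illustration of the DP variant, not the package's code, and the Myers O(ND) variant is not shown. The function names are hypothetical, and the normalization `1 - distance / (|a| + |b|)` is an assumption picked because it yields the 0.75 shown in the example.

```typescript
// Hypothetical helper: LCS length via a full DP table.
function lcsLengthSketch(a: string, b: string): number {
  // table[i][j] = LCS length of the first i chars of a and the first j chars of b.
  const table: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    table.push(new Array<number>(b.length + 1).fill(0));
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      table[i][j] =
        a[i - 1] === b[j - 1]
          ? table[i - 1][j - 1] + 1
          : Math.max(table[i - 1][j], table[i][j - 1]);
    }
  }
  return table[a.length][b.length];
}

// Distance = characters that are not part of the common subsequence, on both sides.
function lcsDistanceSketch(a: string, b: string): number {
  return a.length + b.length - 2 * lcsLengthSketch(a, b);
}

// Assumed normalization: 1 - distance / (|a| + |b|), with 1 for two empty strings.
function lcsNormalizedSketch(a: string, b: string): number {
  const total = a.length + b.length;
  return total === 0 ? 1 : 1 - lcsDistanceSketch(a, b) / total;
}
```

For `"abcde"` and `"ace"` the sketch gives an LCS length of 3, a distance of 2, and a normalized similarity of 0.75.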
61
53
 
62
- This package includes all algorithms from `@nlptools/distance-wasm` plus additional JavaScript implementations:
54
+ ### Token Similarity (Character Multiset)
63
55
 
64
- #### Edit Distance Algorithms
56
+ Based on character frequency maps (Counter), matching the `textdistance` crate semantics:
65
57
 
66
- - `levenshtein` - Classic edit distance
67
- - `fastest_levenshtein` - High-performance Levenshtein distance (fastest-levenshtein)
68
- - `damerau_levenshtein` - Edit distance with transpositions
69
- - `myers_levenshtein` - Myers bit-parallel algorithm for edit distance
70
- - `jaro` - Jaro similarity
71
- - `jarowinkler` - Jaro-Winkler similarity
72
- - `hamming` - Hamming distance for equal-length strings
73
- - `sift4_simple` - SIFT4 algorithm
58
+ ```typescript
59
+ import { jaccard, cosine, sorensen } from "@nlptools/distance";
74
60
 
75
- #### Sequence-based Algorithms
61
+ jaccard("abc", "abd"); // 0.667
62
+ cosine("hello", "hallo"); // 0.8
63
+ sorensen("test", "text"); // 0.75
64
+ ```
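As a rough illustration of what "character multiset" means here, the sketch below counts characters on each side and works with the multiset intersection (the sum of the per-character minimum counts). The formulas are assumptions in the style of the `textdistance` family: Sorensen-Dice as `2 * |A ∩ B| / (|A| + |B|)` and cosine as `|A ∩ B| / sqrt(|A| * |B|)`. They reproduce the 0.75 and 0.8 values shown above, but the package may compute them differently. All function names are hypothetical.

```typescript
// Hypothetical helper: character -> occurrence count (UTF-16 code units).
function countChars(s: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < s.length; i++) {
    counts.set(s[i], (counts.get(s[i]) ?? 0) + 1);
  }
  return counts;
}

// Multiset intersection size: sum of min(countA, countB) per character.
function intersectionSize(a: string, b: string): number {
  const ca = countChars(a);
  const cb = countChars(b);
  let total = 0;
  ca.forEach((n, ch) => {
    total += Math.min(n, cb.get(ch) ?? 0);
  });
  return total;
}

// Assumed formula: 2 * |A ∩ B| / (|A| + |B|).
function sorensenSketch(a: string, b: string): number {
  const total = a.length + b.length;
  return total === 0 ? 1 : (2 * intersectionSize(a, b)) / total;
}

// Assumed formula: |A ∩ B| / sqrt(|A| * |B|).
function cosineSketch(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  return intersectionSize(a, b) / Math.sqrt(a.length * b.length);
}
```

Here `sorensenSketch("test", "text")` shares `t`, `t`, and `e`, giving 6/8 = 0.75, and `cosineSketch("hello", "hallo")` shares four characters out of five on each side, giving 0.8.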
76
65
 
77
- - `lcs_seq` - Longest common subsequence
78
- - `lcs_str` - Longest common substring
79
- - `ratcliff_obershelp` - Gestalt pattern matching
80
- - `smith_waterman` - Local sequence alignment
66
+ ### N-Gram Variants
81
67
 
82
- #### Token-based Algorithms
68
+ ```typescript
69
+ import { jaccardNgram, cosineNgram, sorensenNgram } from "@nlptools/distance";
83
70
 
84
- - `jaccard` - Jaccard similarity
85
- - `cosine` - Cosine similarity
86
- - `sorensen` - Sørensen-Dice coefficient
87
- - `tversky` - Tversky index
88
- - `overlap` - Overlap coefficient
71
+ jaccardNgram("hello", "hallo"); // 0.333 (bigram-based)
72
+ cosineNgram("hello", "hallo"); // 0.5 (bigram-based)
73
+ ```
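The bigram figures above can be checked by hand: "hello" yields `he, el, ll, lo` and "hallo" yields `ha, al, ll, lo`, so two of six distinct bigrams are shared. The sketch below does the same counting. It is a hedged illustration with hypothetical names, and the multiset definitions it uses are assumptions, not a statement about the package's internals.

```typescript
// Hypothetical helper: n-gram -> occurrence count over UTF-16 code units.
function ngramCounts(s: string, n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= s.length; i++) {
    const gram = s.slice(i, i + n);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Returns [intersection size, total grams in a, total grams in b].
function ngramOverlap(a: string, b: string, n: number): [number, number, number] {
  const ga = ngramCounts(a, n);
  const gb = ngramCounts(b, n);
  let inter = 0;
  let sizeA = 0;
  let sizeB = 0;
  ga.forEach((count, gram) => {
    sizeA += count;
    inter += Math.min(count, gb.get(gram) ?? 0);
  });
  gb.forEach((count) => {
    sizeB += count;
  });
  return [inter, sizeA, sizeB];
}

// Assumed formula: |A ∩ B| / |A ∪ B| on n-gram multisets.
function jaccardNgramSketch(a: string, b: string, n = 2): number {
  const [inter, sizeA, sizeB] = ngramOverlap(a, b, n);
  const union = sizeA + sizeB - inter;
  return union === 0 ? 1 : inter / union;
}

// Assumed formula: |A ∩ B| / sqrt(|A| * |B|) on n-gram multisets.
function cosineNgramSketch(a: string, b: string, n = 2): number {
  const [inter, sizeA, sizeB] = ngramOverlap(a, b, n);
  if (sizeA === 0 && sizeB === 0) return 1;
  if (sizeA === 0 || sizeB === 0) return 0;
  return inter / Math.sqrt(sizeA * sizeB);
}
```

For `"hello"` and `"hallo"` this gives 2/6 (about 0.333) for Jaccard and 2/4 = 0.5 for cosine, matching the comments in the example.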
89
74
 
90
- #### Bigram Algorithms
75
+ ### SimHash (Document Fingerprinting)
91
76
 
92
- - `jaccard_bigram` - Jaccard similarity on character bigrams
93
- - `cosine_bigram` - Cosine similarity on character bigrams
77
+ ```typescript
78
+ import { simhash, hammingDistance, SimHasher } from "@nlptools/distance";
79
+
80
+ // Function-based
81
+ const fp1 = simhash(["hello", "world"]);
82
+ const fp2 = simhash(["hello", "earth"]);
83
+ hammingDistance(fp1, fp2); // small = similar
84
+
85
+ // Class-based
86
+ const hasher = new SimHasher();
87
+ const a = hasher.hash(["hello", "world"]);
88
+ const b = hasher.hash(["hello", "earth"]);
89
+ hasher.isDuplicate(a, b); // true if hamming distance <= 3
90
+ ```
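For intuition about what a SimHash fingerprint is, the following sketch builds one by letting every feature's hash vote on each bit and then compares fingerprints by Hamming distance. It is deliberately simplified: it produces a 32-bit `number` rather than the 64-bit `bigint` the API reference describes, uses FNV-1a as an arbitrary hash choice, and weights all features equally. Every name in it is hypothetical.

```typescript
// FNV-1a 32-bit string hash (an arbitrary, non-cryptographic choice for this sketch).
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Each feature votes +1 or -1 on every bit; the sign of the tally sets the bit.
function simhash32Sketch(features: string[]): number {
  const votes = new Array<number>(32).fill(0);
  for (const feature of features) {
    const h = fnv1a32(feature);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += ((h >>> bit) & 1) === 1 ? 1 : -1;
    }
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (votes[bit] > 0) fingerprint |= 1 << bit;
  }
  return fingerprint >>> 0;
}

// Number of differing bits between two 32-bit fingerprints.
function hamming32Sketch(a: number, b: number): number {
  let x = a ^ b;
  let count = 0;
  while (x !== 0) {
    x &= x - 1; // clears the lowest set bit
    count++;
  }
  return count;
}
```

Feature lists that share most of their entries tend to agree on most bit votes, so their fingerprints end up a small Hamming distance apart; a threshold such as the `<= 3` mentioned above then decides what counts as a duplicate.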
94
91
 
95
- #### Naive Algorithms
92
+ ### MinHash (Jaccard Similarity Estimation)
96
93
 
97
- - `prefix` - Prefix similarity
98
- - `suffix` - Suffix similarity
99
- - `length` - Length-based similarity
94
+ ```typescript
95
+ import { MinHash } from "@nlptools/distance";
100
96
 
101
- ### Universal Compare Function
97
+ const mh1 = new MinHash({ numHashes: 128 });
98
+ mh1.update("hello");
99
+ mh1.update("world");
102
100
 
103
- ```typescript
104
- const result = distance.compare("hello", "hallo", "jaro");
105
- console.log(result); // 0.8666666666666667
101
+ const mh2 = new MinHash({ numHashes: 128 });
102
+ mh2.update("hello");
103
+ mh2.update("earth");
106
104
 
107
- // Use fastest-levenshtein for optimal performance
108
- console.log(distance.fastest_levenshtein("fast", "faster")); // 2
105
+ MinHash.estimate(mh1.digest(), mh2.digest()); // ~0.67
109
106
  ```
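The idea behind the estimate can be sketched without the package: for each of `numHashes` hash functions keep the minimum hash seen over all tokens, then count how many positions two signatures agree on. The code below is a minimal illustration under stated assumptions. The seeded hash (FNV-1a followed by the MurmurHash3 finalizer) is an arbitrary choice, and all names are hypothetical rather than the package's API.

```typescript
// Hypothetical seeded 32-bit hash: FNV-1a, a seed-dependent XOR, then the
// MurmurHash3 finalizer for extra mixing. Illustrative only.
function seededHash(s: string, seed: number): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= Math.imul(seed + 1, 0x9e3779b1);
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Signature position k holds the minimum of hash k over all tokens.
function minhashSignatureSketch(tokens: string[], numHashes: number): number[] {
  const signature = new Array<number>(numHashes).fill(0xffffffff);
  for (const token of tokens) {
    for (let k = 0; k < numHashes; k++) {
      const h = seededHash(token, k);
      if (h < signature[k]) signature[k] = h;
    }
  }
  return signature;
}

// Fraction of positions where two signatures agree estimates the Jaccard similarity.
function estimateJaccardSketch(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("signatures must be non-empty and of equal length");
  }
  let equal = 0;
  for (let k = 0; k < a.length; k++) {
    if (a[k] === b[k]) equal++;
  }
  return equal / a.length;
}
```

Because each position only records a minimum, the signature does not depend on token order, and more hash functions reduce the variance of the estimate at the cost of more work per token.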
110
107
 
111
- ## Performance
108
+ ### LSH (Approximate Nearest Neighbor Search)
109
+
110
+ ```typescript
111
+ import { MinHash } from "@nlptools/distance";
112
+ import { LSH } from "@nlptools/distance";
113
+
114
+ const lsh = new LSH({ numBands: 16, numHashes: 128 });
115
+
116
+ const mh = new MinHash({ numHashes: 128 });
117
+ mh.update("hello");
118
+ mh.update("world");
119
+ lsh.insert("doc1", mh.digest());
112
120
 
113
- The package automatically selects the fastest implementation available:
121
+ // Query for similar documents
122
+ const query = lsh.query(mh.digest(), 0.5);
123
+ // => [["doc1", 0.67]]
124
+ ```
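The banding step that LSH relies on can be sketched as follows: a signature of `numHashes` values is cut into `numBands` equal slices, each slice becomes a bucket key, and two documents become candidates when they share at least one bucket. With `numBands: 16` and `numHashes: 128` as above, each band would cover 8 values. The class below is a hypothetical illustration (`BandIndexSketch` is not part of the package) and it returns candidate ids only, without the similarity scores shown in the example.

```typescript
class BandIndexSketch {
  private readonly numBands: number;
  private readonly rows: number;
  private readonly buckets: Array<Map<string, Set<string>>>;

  constructor(numBands: number, numHashes: number) {
    if (numBands <= 0 || numHashes % numBands !== 0) {
      throw new Error("numHashes must be a positive multiple of numBands");
    }
    this.numBands = numBands;
    this.rows = numHashes / numBands;
    this.buckets = [];
    for (let band = 0; band < numBands; band++) {
      this.buckets.push(new Map<string, Set<string>>());
    }
  }

  // Bucket key for one band: the band's slice of the signature, joined as text.
  private bandKey(signature: number[], band: number): string {
    return signature.slice(band * this.rows, (band + 1) * this.rows).join(",");
  }

  insert(id: string, signature: number[]): void {
    for (let band = 0; band < this.numBands; band++) {
      const key = this.bandKey(signature, band);
      const bucket = this.buckets[band].get(key) ?? new Set<string>();
      bucket.add(id);
      this.buckets[band].set(key, bucket);
    }
  }

  // Ids that share at least one band bucket with the query signature.
  candidates(signature: number[]): string[] {
    const found = new Set<string>();
    for (let band = 0; band < this.numBands; band++) {
      const bucket = this.buckets[band].get(this.bandKey(signature, band));
      if (bucket !== undefined) bucket.forEach((id) => found.add(id));
    }
    return Array.from(found);
  }
}
```

A full query would typically re-score each candidate with the MinHash estimate and drop those below the requested threshold; more bands with fewer rows each make the index more permissive, fewer bands make it stricter.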
114
125
 
115
- - **WebAssembly algorithms**: 10-100x faster than pure JavaScript
116
- - **Auto-detection**: Seamlessly switches between WASM and JS implementations
126
+ ### Diff
117
127
 
118
- ## References
128
+ ```typescript
129
+ import { diff, DiffType } from "@nlptools/distance";
130
+
131
+ const result = diff("abc", "ac");
132
+ // => [
133
+ // { type: DiffType.COMMON, tokens: "a" },
134
+ // { type: DiffType.REMOVED, tokens: "b" },
135
+ // { type: DiffType.COMMON, tokens: "c" },
136
+ // ]
137
+ ```
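To show how a diff like the one above falls out of an LCS, the sketch below fills a DP table of LCS lengths over suffixes and then walks both strings from the front, emitting runs of common, removed, and added characters. It is a simplified stand-in, not `@algorithm.ts/diff`: the string literal kinds replace the `DiffType` enum, and `DiffRun` and `diffSketch` are hypothetical names.

```typescript
type DiffKind = "common" | "removed" | "added";

interface DiffRun {
  type: DiffKind;
  tokens: string;
}

function diffSketch(a: string, b: string): DiffRun[] {
  // lcs[i][j] = LCS length of the suffixes a[i..] and b[j..].
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    lcs.push(new Array<number>(b.length + 1).fill(0));
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const runs: DiffRun[] = [];
  // Append a character, extending the previous run when the kind is unchanged.
  const push = (type: DiffKind, ch: string): void => {
    const last = runs[runs.length - 1];
    if (last !== undefined && last.type === type) last.tokens += ch;
    else runs.push({ type, tokens: ch });
  };

  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      push("common", a[i]);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      push("removed", a[i]); // dropping a[i] keeps the longer common tail
      i++;
    } else {
      push("added", b[j]);
      j++;
    }
  }
  while (i < a.length) push("removed", a[i++]);
  while (j < b.length) push("added", b[j++]);
  return runs;
}
```

For `diffSketch("abc", "ac")` this produces a common run `"a"`, a removed run `"b"`, and a common run `"c"`, mirroring the shape of the result shown in the example.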
119
138
 
120
- This package incorporates and builds upon the following excellent open source projects:
139
+ ## API Reference
140
+
141
+ ### Edit Distance
142
+
143
+ | Function | Description | Returns |
144
+ | --------------------------------- | ------------------------- | -------------------- |
145
+ | `levenshtein(a, b)` | Levenshtein edit distance | `number` |
146
+ | `levenshteinNormalized(a, b)` | Normalized similarity | `number` (0-1) |
147
+ | `lcsDistance(a, b, algorithm?)` | LCS distance | `number` |
148
+ | `lcsNormalized(a, b, algorithm?)` | Normalized LCS similarity | `number` (0-1) |
149
+ | `lcsLength(a, b, algorithm?)` | LCS length | `number` |
150
+ | `lcsPairs(a, b, algorithm?)` | LCS matching pairs | `[number, number][]` |
151
+
152
+ ### Token Similarity
153
+
154
+ | Function | Description | Returns |
155
+ | ------------------------- | ---------------------------------------------- | -------------- |
156
+ | `jaccard(a, b)` | Jaccard similarity (character multiset) | `number` (0-1) |
157
+ | `jaccardNgram(a, b, n?)` | Jaccard on character n-grams | `number` (0-1) |
158
+ | `cosine(a, b)` | Cosine similarity (character multiset) | `number` (0-1) |
159
+ | `cosineNgram(a, b, n?)` | Cosine on character n-grams | `number` (0-1) |
160
+ | `sorensen(a, b)` | Sorensen-Dice coefficient (character multiset) | `number` (0-1) |
161
+ | `sorensenNgram(a, b, n?)` | Sorensen-Dice on character n-grams | `number` (0-1) |
162
+
163
+ ### Hash-Based Deduplication
164
+
165
+ | Function / Class | Description |
166
+ | -------------------------------- | ------------------------------------------------------------------ |
167
+ | `simhash(features, options?)` | Generate 64-bit fingerprint as `bigint` |
168
+ | `hammingDistance(a, b)` | Hamming distance between two fingerprints |
169
+ | `hammingSimilarity(a, b, bits?)` | Normalized Hamming similarity |
170
+ | `SimHasher` | Class with `hash()`, `distance()`, `similarity()`, `isDuplicate()` |
171
+ | `MinHash` | Class with `update()`, `digest()`, `estimate()` |
172
+ | `MinHash.estimate(sig1, sig2)` | Static: estimate Jaccard from signatures |
173
+ | `LSH` | Class with `insert()`, `query()`, `remove()` |
174
+
175
+ ### Diff
176
+
177
+ | Function | Description | Returns |
178
+ | ---------------------- | --------------------------- | ---------------- |
179
+ | `diff(a, b, options?)` | Sequence diff (Myers or DP) | `IDiffItem<T>[]` |
180
+
181
+ ### Types
182
+
183
+ | Type | Description |
184
+ | ----------------- | ---------------------------------------- |
185
+ | `DiffType` | Enum: `ADDED`, `REMOVED`, `COMMON` |
186
+ | `IDiffItem<T>` | Diff result item with type and tokens |
187
+ | `IDiffOptions<T>` | Options for diff (equals, lcs algorithm) |
188
+ | `ISimHashOptions` | Options for SimHash (bits, hashFn) |
189
+ | `IMinHashOptions` | Options for MinHash (numHashes, seed) |
190
+ | `ILSHOptions` | Options for LSH (numBands, numHashes) |
191
+
192
+ ## Performance
121
193
 
122
- - [textdistance.rs](https://github.com/life4/textdistance.rs) - Core Rust implementation via @nlptools/distance-wasm
123
- - [fastest-levenshtein](https://github.com/ka-weihe/fastest-levenshtein) - High-performance Levenshtein implementation
194
+ Benchmark: 1000 iterations per pair, same test data across all runtimes.
195
+ Unit: microseconds per operation (us/op).
196
+
197
+ ### Edit Distance
198
+
199
+ | Algorithm | Size | TS (V8 JIT) | WASM (via JS) | Rust (native) |
200
+ | --------------- | --------------- | ----------- | ------------- | ------------- |
201
+ | levenshtein | Short (<10) | 0.3 | 7.9 | 0.11 |
202
+ | levenshtein | Medium (10-100) | 1.3 | 116.2 | 0.98 |
203
+ | levenshtein | Long (>200) | 15.2 | 2,877.5 | 39.68 |
204
+ | levenshteinNorm | Short | 0.3 | 7.9 | 0.11 |
205
+ | lcs | Short (<10) | 1.6 | 16.5 | 0.41 |
206
+ | lcs | Medium (10-100) | 6.8 | 272.6 | 3.22 |
207
+ | lcs | Long (>200) | 217.8 | 6,574.1 | 122.63 |
208
+ | lcsNorm | Short | 1.7 | 16.2 | 0.48 |
209
+
210
+ ### Token Similarity (Character Multiset)
211
+
212
+ | Algorithm | Size | TS (V8 JIT) | WASM (via JS) | Rust (native) |
213
+ | --------- | --------------- | ----------- | ------------- | ------------- |
214
+ | jaccard | Short (<10) | 0.8 | 25.2 | 0.42 |
215
+ | jaccard | Medium (10-100) | 0.8 | 74.3 | 1.55 |
216
+ | jaccard | Long (>200) | 1.6 | 171.5 | 5.54 |
217
+ | cosine | Short (<10) | 0.8 | 19.3 | 0.32 |
218
+ | cosine | Medium (10-100) | 0.8 | 61.4 | 1.35 |
219
+ | cosine | Long (>200) | 1.5 | 158.5 | 4.77 |
220
+ | sorensen | Short (<10) | 0.7 | 19.3 | 0.33 |
221
+ | sorensen | Medium (10-100) | 0.7 | 61.0 | 1.33 |
222
+ | sorensen | Long (>200) | 1.5 | 160.0 | 4.46 |
223
+
224
+ ### Bigram Variants
225
+
226
+ | Algorithm | Size | TS (V8 JIT) | WASM (via JS) | Rust (native) |
227
+ | ------------- | --------------- | ----------- | ------------- | ------------- |
228
+ | jaccardBigram | Short (<10) | 1.1 | 27.4 | 0.45 |
229
+ | jaccardBigram | Medium (10-100) | 7.7 | 160.4 | 3.86 |
230
+ | cosineBigram | Short (<10) | 0.8 | 21.2 | 0.36 |
231
+ | cosineBigram | Medium (10-100) | 5.9 | 127.0 | 3.12 |
232
+
233
+ The TS implementations rely on V8 JIT optimization, an `Int32Array` fast path for ASCII input, and integer-encoded bigrams, which avoids JS-WASM boundary overhead entirely.
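One way to read "integer-encoded bigrams" with an `Int32Array` ASCII fast path is sketched below: two 7-bit character codes are packed into a single integer index, and counts live in a flat typed array instead of a string-keyed `Map`. This is a guess at the general technique, not the package's actual code. The names are hypothetical, and a real implementation would likely reuse buffers rather than allocate one per call.

```typescript
// Counts ASCII bigrams in a flat table indexed by (c1 << 7) | c2.
// Returns null for non-ASCII input so a caller could fall back to a Map-based path.
function bigramCountsAscii(s: string): Int32Array | null {
  const counts = new Int32Array(128 * 128);
  for (let i = 0; i + 1 < s.length; i++) {
    const c1 = s.charCodeAt(i);
    const c2 = s.charCodeAt(i + 1);
    if (c1 > 127 || c2 > 127) return null;
    counts[(c1 << 7) | c2]++;
  }
  return counts;
}

// Multiset Jaccard over the two tables: sum of minima / sum of maxima.
function jaccardBigramAsciiSketch(a: string, b: string): number | null {
  const ca = bigramCountsAscii(a);
  const cb = bigramCountsAscii(b);
  if (ca === null || cb === null) return null;
  let inter = 0;
  let union = 0;
  for (let k = 0; k < ca.length; k++) {
    inter += Math.min(ca[k], cb[k]);
    union += Math.max(ca[k], cb[k]);
  }
  return union === 0 ? 1 : inter / union;
}
```

Integer indices avoid allocating a substring per bigram and hashing it, which is the kind of saving that claim points at; the trade-off in this sketch is a fixed 16,384-entry table per string.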
234
+
235
+ ## Dependencies
236
+
237
+ - `fastest-levenshtein` — fastest JS Levenshtein implementation
238
+ - `@algorithm.ts/lcs` — Myers and DP Longest Common Subsequence
239
+ - `@algorithm.ts/diff` — Sequence diff built on LCS
124
240
 
125
241
  ## License
126
242
 
127
- - [MIT](../../LICENSE) &copy; [Demo Macro](https://imst.xyz/)
243
+ [MIT](../../LICENSE) &copy; [Demo Macro](https://www.demomacro.com/)