npm - @nlptools/distance - Versions diffs - 0.0.2 → 0.0.4 - Mend

@nlptools/distance 0.0.2 → 0.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/LICENSE CHANGED

@@ -1,21 +1,21 @@
-MIT License
-Copyright (c) 2023 Demo Macro
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+MIT License
+Copyright (c) 2023 Demo Macro
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md CHANGED

@@ -1,23 +1,19 @@
 # @nlptools/distance
 ![npm version](https://img.shields.io/npm/v/@nlptools/distance)
-![npm downloads](https://img.shields.io/npm/dw/@nlptools/distance)
 ![npm license](https://img.shields.io/npm/l/@nlptools/distance)
-[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](https://www.contributor-covenant.org/version/2/1/code_of_conduct/)
-> Complete string distance and similarity algorithms package with WebAssembly and JavaScript implementations
-This package provides comprehensive text similarity and distance algorithms, combining the high-performance WebAssembly implementation from `@nlptools/distance-wasm` with additional JavaScript-based algorithms for maximum compatibility and performance.
+> High-performance string distance and similarity algorithms, implemented in pure TypeScript
 ## Features
-- ⚡ **Dual Implementation**: WebAssembly for performance + JavaScript for compatibility
-- 🧮 **Comprehensive Algorithms**: 30+ string similarity and distance algorithms
-- 🎯 **Multiple Categories**: Edit-based, sequence-based, token-based, and naive algorithms
-- 📝 **TypeScript First**: Full type safety with comprehensive API
-- 🔧 **Universal Interface**: Single compare function for all algorithms
-- 📊 **Normalized Results**: Consistent 0-1 similarity scores across algorithms
-- 🚀 **Auto-optimization**: Automatically chooses the fastest implementation available
+- Pure TypeScript implementation, zero native dependencies
+- Edit distance: Levenshtein, LCS (Myers O(ND) and DP)
+- Token similarity: Jaccard, Cosine, Sorensen-Dice (character multiset and n-gram variants)
+- Hash-based deduplication: SimHash, MinHash, LSH
+- Fuzzy search: `FuzzySearch` class and `findBestMatch` with multi-algorithm support
+- Diff: based on `@algorithm.ts/diff` (Myers and DP backends)
+- All distance algorithms include normalized similarity variants (0-1 range)
 ## Installation
@@ -34,94 +30,294 @@ pnpm add @nlptools/distance
 ## Usage
-### Basic Setup
+### Edit Distance
+```typescript
+import { levenshtein, levenshteinNormalized } from "@nlptools/distance";
+levenshtein("kitten", "sitting"); // 3
+levenshteinNormalized("kitten", "sitting"); // 0.571
+```
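Note on the added example above (not part of the published README): the value `0.571` is consistent with normalizing as `1 - distance / max(len(a), len(b))`, since `1 - 3/7 ≈ 0.571`. The sketch below illustrates that reading with a plain two-row dynamic-programming table. The function names are hypothetical, and the normalization formula is an assumption inferred from the example, not taken from the package source.

```typescript
// Two-row dynamic-programming Levenshtein distance (illustrative, not the package source).
function levenshteinSketch(a: string, b: string): number {
  if (a === b) return 0;
  // prev[j] holds the distance between the processed prefix of `a` and b[0..j).
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / max(len(a), len(b)).
function levenshteinNormalizedSketch(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshteinSketch(a, b) / longest;
}

console.log(levenshteinSketch("kitten", "sitting")); // 3
console.log(levenshteinNormalizedSketch("kitten", "sitting").toFixed(3)); // "0.571"
```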
+### LCS (Longest Common Subsequence)
 ```typescript
-import * as distance from "@nlptools/distance";
+import { lcsDistance, lcsNormalized, lcsLength, lcsPairs } from "@nlptools/distance";
-// All algorithms are available as named functions
-console.log(distance.levenshtein("kitten", "sitting")); // 3
-console.log(distance.jaro("hello", "hallo")); // 0.8666666666666667
-console.log(distance.cosine("abc", "bcd")); // 0.6666666666666666
+lcsDistance("abcde", "ace"); // 2  (= 5 + 3 - 2 * 3)
+lcsNormalized("abcde", "ace"); // 0.75
+lcsLength("abcde", "ace"); // 3
+lcsPairs("abcde", "ace"); // [[0,0], [2,1], [4,2]]
 ```
-### Distance vs Similarity
+By default uses Myers O(ND) algorithm. Switch to DP with `algorithm: "dp"`.
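Note on the LCS values above (not part of the published README): the inline comment gives the distance as `len(a) + len(b) - 2 * lcsLength`, and `0.75` is consistent with `1 - distance / (len(a) + len(b))`. A minimal sketch of that arithmetic using a full DP table follows. The function names are hypothetical and the normalization is an assumption inferred from the example; the package itself defaults to Myers O(ND).

```typescript
// Classic O(m*n) dynamic-programming LCS length (illustrative only).
function lcsLengthSketch(a: string, b: string): number {
  // table[i][j] = LCS length of a[0..i) and b[0..j)
  const table: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    table.push(row);
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      table[i][j] = a[i - 1] === b[j - 1]
        ? table[i - 1][j - 1] + 1
        : Math.max(table[i - 1][j], table[i][j - 1]);
    }
  }
  return table[a.length][b.length];
}

// Distance as stated in the README comment: characters outside the LCS on both sides.
function lcsDistanceSketch(a: string, b: string): number {
  return a.length + b.length - 2 * lcsLengthSketch(a, b);
}

// Assumed normalization: 1 - distance / (len(a) + len(b)).
function lcsNormalizedSketch(a: string, b: string): number {
  const total = a.length + b.length;
  return total === 0 ? 1 : 1 - lcsDistanceSketch(a, b) / total;
}

console.log(lcsLengthSketch("abcde", "ace")); // 3
console.log(lcsDistanceSketch("abcde", "ace")); // 2
console.log(lcsNormalizedSketch("abcde", "ace")); // 0.75
```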
-Most algorithms have both distance and normalized versions:
+### Token Similarity (Character Multiset)
+Based on character frequency maps (Counter), matching the `textdistance` crate semantics:
 ```typescript
-// Distance algorithms (lower is more similar)
-const dist = distance.levenshtein("cat", "bat"); // 1
+import { jaccard, cosine, sorensen } from "@nlptools/distance";
-// Similarity algorithms (higher is more similar, 0-1 range)
-const sim = distance.levenshtein_normalized("cat", "bat"); // 0.6666666666666666
+jaccard("abc", "abd"); // 0.667
+cosine("hello", "hallo"); // 0.8
+sorensen("test", "text"); // 0.75
 ```
-### Available Algorithms
+### N-Gram Variants
+```typescript
+import { jaccardNgram, cosineNgram, sorensenNgram } from "@nlptools/distance";
-This package includes all algorithms from `@nlptools/distance-wasm` plus additional JavaScript implementations:
+jaccardNgram("hello", "hallo"); // 0.333  (bigram-based)
+cosineNgram("hello", "hallo"); // 0.5     (bigram-based)
+```
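Note on the multiset and n-gram values above (not part of the published README): the `cosine`, `sorensen`, `jaccardNgram`, and `cosineNgram` examples are consistent with count-based formulas over multisets of n-grams, where `n = 1` gives the character multiset. The sketch below is one reading that reproduces those four values. The helper names are hypothetical, and the count-based cosine (`shared / sqrt(|A| * |B|)`) is an assumption inferred from the examples rather than a vector-space cosine.

```typescript
// Count every n-gram of `text`; n = 1 yields the character multiset.
function ngramCounts(text: string, n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= text.length; i++) {
    const gram = text.slice(i, i + n);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Multiset sizes: |A|, |B| and |A ∩ B| (sum of per-gram minimum counts).
function multisetSizes(a: string, b: string, n: number): { sizeA: number; sizeB: number; shared: number } {
  const countsA = ngramCounts(a, n);
  const countsB = ngramCounts(b, n);
  let sizeA = 0;
  let sizeB = 0;
  let shared = 0;
  countsA.forEach((count, gram) => {
    sizeA += count;
    shared += Math.min(count, countsB.get(gram) ?? 0);
  });
  countsB.forEach((count) => { sizeB += count; });
  return { sizeA, sizeB, shared };
}

function jaccardSketch(a: string, b: string, n: number): number {
  const { sizeA, sizeB, shared } = multisetSizes(a, b, n);
  const union = sizeA + sizeB - shared; // multiset union = sum of per-gram maximum counts
  return union === 0 ? 1 : shared / union;
}

function cosineSketch(a: string, b: string, n: number): number {
  const { sizeA, sizeB, shared } = multisetSizes(a, b, n);
  return sizeA === 0 || sizeB === 0 ? 0 : shared / Math.sqrt(sizeA * sizeB);
}

function sorensenSketch(a: string, b: string, n: number): number {
  const { sizeA, sizeB, shared } = multisetSizes(a, b, n);
  return sizeA + sizeB === 0 ? 1 : (2 * shared) / (sizeA + sizeB);
}

console.log(cosineSketch("hello", "hallo", 1)); // 0.8
console.log(sorensenSketch("test", "text", 1)); // 0.75
console.log(jaccardSketch("hello", "hallo", 2).toFixed(3)); // "0.333"
console.log(cosineSketch("hello", "hallo", 2)); // 0.5
```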
-#### Edit Distance Algorithms
+### SimHash (Document Fingerprinting)
-- `levenshtein` - Classic edit distance
-- `fastest_levenshtein` - High-performance Levenshtein distance (fastest-levenshtein)
-- `damerau_levenshtein` - Edit distance with transpositions
-- `myers_levenshtein` - Myers bit-parallel algorithm for edit distance
-- `jaro` - Jaro similarity
-- `jarowinkler` - Jaro-Winkler similarity
-- `hamming` - Hamming distance for equal-length strings
-- `sift4_simple` - SIFT4 algorithm
+```typescript
+import { simhash, hammingDistance, SimHasher } from "@nlptools/distance";
+// Function-based
+const fp1 = simhash(["hello", "world"]);
+const fp2 = simhash(["hello", "earth"]);
+hammingDistance(fp1, fp2); // small = similar
+// Class-based
+const hasher = new SimHasher();
+const a = hasher.hash(["hello", "world"]);
+const b = hasher.hash(["hello", "earth"]);
+hasher.isDuplicate(a, b); // true if hamming distance <= 3
+```
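Note on SimHash (not part of the published README): the idea behind the fingerprint can be sketched as a per-bit vote across feature hashes, followed by a bit-count of the XOR to compare two fingerprints. The version below is deliberately reduced to 32 bits with an FNV-1a stand-in hash, whereas the README's API table lists a 64-bit `bigint` fingerprint. All names here are hypothetical, and the package's actual hash function is not shown in the diff.

```typescript
// 32-bit FNV-1a string hash (stand-in; the package's default hash is not shown in the diff).
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// SimHash: each feature votes +1 or -1 on every bit; positive totals set the bit.
function simhash32(features: string[]): number {
  const votes: number[] = [];
  for (let bit = 0; bit < 32; bit++) votes.push(0);
  for (const feature of features) {
    const h = fnv1a32(feature);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += (h >>> bit) & 1 ? 1 : -1;
    }
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (votes[bit] > 0) fingerprint |= 1 << bit;
  }
  return fingerprint >>> 0;
}

// Hamming distance: number of set bits in the XOR of two fingerprints.
function hamming32(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x !== 0) {
    x &= x - 1; // clear the lowest set bit
    count++;
  }
  return count;
}

const fpWorld = simhash32(["hello", "world"]);
const fpEarth = simhash32(["hello", "earth"]);
console.log(hamming32(fpWorld, fpWorld)); // 0
console.log(hamming32(fpWorld, fpEarth)); // smaller values mean more similar feature sets
```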
-#### Sequence-based Algorithms
+### MinHash (Jaccard Similarity Estimation)
-- `lcs_seq` - Longest common subsequence
-- `lcs_str` - Longest common substring
-- `ratcliff_obershelp` - Gestalt pattern matching
-- `smith_waterman` - Local sequence alignment
+```typescript
+import { MinHash } from "@nlptools/distance";
-#### Token-based Algorithms
+const mh1 = new MinHash({ numHashes: 128 });
+mh1.update("hello");
+mh1.update("world");
-- `jaccard` - Jaccard similarity
-- `cosine` - Cosine similarity
-- `sorensen` - Sørensen-Dice coefficient
-- `tversky` - Tversky index
-- `overlap` - Overlap coefficient
+const mh2 = new MinHash({ numHashes: 128 });
+mh2.update("hello");
+mh2.update("earth");
-#### Bigram Algorithms
+MinHash.estimate(mh1.digest(), mh2.digest()); // ~0.67
+```
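Note on MinHash (not part of the published README): the estimator works by keeping, for each of `numHashes` seeded hash functions, the minimum hash over all tokens, and then reporting the fraction of signature positions on which two signatures agree. A minimal sketch under stated assumptions follows. The seeded hash (FNV-1a plus a mixing step) and all names are hypothetical and do not reflect the package's internals.

```typescript
// Seeded 32-bit hash: FNV-1a over the token, then an avalanche step so seeds decorrelate.
function seededHash32(token: string, seed: number): number {
  let hash = (0x811c9dc5 ^ Math.imul(seed + 1, 0x9e3779b1)) >>> 0;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  hash ^= hash >>> 16;
  hash = Math.imul(hash, 0x85ebca6b);
  hash ^= hash >>> 13;
  hash = Math.imul(hash, 0xc2b2ae35);
  hash ^= hash >>> 16;
  return hash >>> 0;
}

// Signature: for each seeded hash function, the minimum hash over all tokens.
function minhashSignature(tokens: string[], numHashes: number): number[] {
  const signature: number[] = [];
  for (let k = 0; k < numHashes; k++) {
    let smallest = 0xffffffff;
    for (const token of tokens) {
      const h = seededHash32(token, k);
      if (h < smallest) smallest = h;
    }
    signature.push(smallest);
  }
  return signature;
}

// Jaccard estimate: fraction of positions where the two signatures agree.
function estimateJaccard(sigA: number[], sigB: number[]): number {
  if (sigA.length === 0 || sigA.length !== sigB.length) {
    throw new Error("signatures must be non-empty and of equal length");
  }
  let equal = 0;
  for (let k = 0; k < sigA.length; k++) {
    if (sigA[k] === sigB[k]) equal++;
  }
  return equal / sigA.length;
}

const sigWorld = minhashSignature(["hello", "world"], 128);
const sigEarth = minhashSignature(["hello", "earth"], 128);
console.log(estimateJaccard(sigWorld, sigWorld)); // 1
console.log(estimateJaccard(sigWorld, sigEarth)); // an estimate in [0, 1]
```

The estimate is probabilistic: more hash functions reduce its variance at the cost of longer signatures.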
-- `jaccard_bigram` - Jaccard similarity on character bigrams
-- `cosine_bigram` - Cosine similarity on character bigrams
+### LSH (Approximate Nearest Neighbor Search)
-#### Naive Algorithms
+```typescript
+import { MinHash } from "@nlptools/distance";
+import { LSH } from "@nlptools/distance";
-- `prefix` - Prefix similarity
-- `suffix` - Suffix similarity
-- `length` - Length-based similarity
+const lsh = new LSH({ numBands: 16, numHashes: 128 });
-### Universal Compare Function
+const mh = new MinHash({ numHashes: 128 });
+mh.update("hello");
+mh.update("world");
+lsh.insert("doc1", mh.digest());
+// Query for similar documents
+const query = lsh.query(mh.digest(), 0.5);
+// => [["doc1", 0.67]]
+```
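Note on LSH (not part of the published README): the `numBands` option suggests the usual banding scheme, in which a signature is cut into equal-sized bands and two documents become candidates when any band matches exactly. The sketch below shows only that candidate-generation step. The class and method names are hypothetical, and unlike the `query` call in the hunk above it returns plain ids, without scores or a threshold.

```typescript
// Minimal banded LSH index over fixed-length numeric signatures (illustrative only).
class BandedIndexSketch {
  private readonly numBands: number;
  private readonly rowsPerBand: number;
  private readonly buckets: Map<string, string[]>;

  constructor(numBands: number, signatureLength: number) {
    if (signatureLength % numBands !== 0) {
      throw new Error("signatureLength must be divisible by numBands");
    }
    this.numBands = numBands;
    this.rowsPerBand = signatureLength / numBands;
    this.buckets = new Map<string, string[]>();
  }

  // One bucket key per band: the band index plus that band's slice of the signature.
  private bandKeys(signature: number[]): string[] {
    const keys: string[] = [];
    for (let band = 0; band < this.numBands; band++) {
      const start = band * this.rowsPerBand;
      keys.push(band + ":" + signature.slice(start, start + this.rowsPerBand).join(","));
    }
    return keys;
  }

  insert(id: string, signature: number[]): void {
    for (const key of this.bandKeys(signature)) {
      const bucket = this.buckets.get(key);
      if (bucket) bucket.push(id);
      else this.buckets.set(key, [id]);
    }
  }

  // Ids sharing at least one band with the query signature.
  candidates(signature: number[]): string[] {
    const found: string[] = [];
    for (const key of this.bandKeys(signature)) {
      for (const id of this.buckets.get(key) ?? []) {
        if (found.indexOf(id) === -1) found.push(id);
      }
    }
    return found;
  }
}

const bandedIndex = new BandedIndexSketch(2, 4);
bandedIndex.insert("doc1", [1, 2, 3, 4]);
console.log(bandedIndex.candidates([1, 2, 9, 9])); // ["doc1"] (band 0 matches)
console.log(bandedIndex.candidates([9, 9, 9, 9])); // []
```

More bands with fewer rows each make candidates easier to find at the cost of more false positives.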
+### Fuzzy Search
 ```typescript
-const result = distance.compare("hello", "hallo", "jaro");
-console.log(result); // 0.8666666666666667
+import { FuzzySearch, findBestMatch } from "@nlptools/distance";
+// String array search
+const search = new FuzzySearch(["apple", "banana", "cherry"]);
+search.search("aple");
+// => [{ item: "apple", score: 0.8, index: 0 }]
+// Object array with weighted keys
+const books = [
+  { title: "Old Man's War", author: "John Scalzi" },
+  { title: "The Great Gatsby", author: "F. Scott Fitzgerald" },
+];
+const bookSearch = new FuzzySearch(books, {
+  keys: [
+    { name: "title", weight: 0.7 },
+    { name: "author", weight: 0.3 },
+  ],
+  algorithm: "cosine",
+  threshold: 0.3,
+});
+bookSearch.search("old man");
+// => [{ item: { title: "Old Man's War", ... }, score: 0.52, index: 0 }]
+// One-shot best match
+findBestMatch("kitten", ["sitting", "kit", "mitten"]);
+// => { item: "kit", score: 0.5, index: 1 }
+// With per-key details
+const detailed = new FuzzySearch(books, {
+  keys: [{ name: "title" }, { name: "author" }],
+  includeMatchDetails: true,
+});
+detailed.search("gatsby");
+// => [{ item: ..., score: 0.45, index: 1, matches: { title: 0.6, author: 0.1 } }]
+```
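Note on the string-array search above (not part of the published README): the score `0.8` for `"aple"` against `"apple"` matches normalized Levenshtein similarity, `1 - 1/5`. A minimal sketch of ranking a string collection that way follows. The names, the lower-casing step, and the explicit threshold argument are assumptions for illustration and are not the `FuzzySearch` implementation.

```typescript
interface RankedMatch {
  item: string;
  score: number;
  index: number;
}

// Two-row Levenshtein distance, repeated here so the block is self-contained.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed scoring: 1 - distance / max(len(a), len(b)).
function similarityScore(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

// Score every item, keep those at or above the threshold, best first.
function rankMatches(query: string, items: string[], threshold: number): RankedMatch[] {
  const normalizedQuery = query.toLowerCase();
  const results: RankedMatch[] = [];
  items.forEach((item, index) => {
    const score = similarityScore(normalizedQuery, item.toLowerCase());
    if (score >= threshold) results.push({ item, score, index });
  });
  return results.sort((x, y) => y.score - x.score);
}

console.log(rankMatches("aple", ["apple", "banana", "cherry"], 0.5));
// [ { item: "apple", score: 0.8, index: 0 } ]
```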
+### Diff
-// Use fastest-levenshtein for optimal performance
-console.log(distance.fastest_levenshtein("fast", "faster")); // 2
+```typescript
+import { diff, DiffType } from "@nlptools/distance";
+const result = diff("abc", "ac");
+// => [
+//   { type: DiffType.COMMON, tokens: "a" },
+//   { type: DiffType.REMOVED, tokens: "b" },
+//   { type: DiffType.COMMON, tokens: "c" },
+// ]
 ```
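Note on the diff example above (not part of the published README): a sequence diff can be derived from an LCS table by walking both inputs and emitting common, removed, and added runs. The sketch below uses a DP table rather than Myers, and string-literal kinds instead of the `DiffType` enum. All names are hypothetical, and this is not the `@algorithm.ts/diff` implementation.

```typescript
type ChunkKind = "common" | "removed" | "added";

interface DiffChunk {
  kind: ChunkKind;
  tokens: string;
}

function diffSketch(a: string, b: string): DiffChunk[] {
  // suffix[i][j] = LCS length of a[i..] and b[j..]
  const suffix: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    suffix.push(row);
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      suffix[i][j] = a[i] === b[j]
        ? suffix[i + 1][j + 1] + 1
        : Math.max(suffix[i + 1][j], suffix[i][j + 1]);
    }
  }

  const chunks: DiffChunk[] = [];
  // Append a token, merging it into the previous chunk when the kind is unchanged.
  const emit = (kind: ChunkKind, token: string): void => {
    const last = chunks[chunks.length - 1];
    if (last && last.kind === kind) last.tokens += token;
    else chunks.push({ kind, tokens: token });
  };

  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      emit("common", a[i]);
      i++;
      j++;
    } else if (suffix[i + 1][j] >= suffix[i][j + 1]) {
      emit("removed", a[i]);
      i++;
    } else {
      emit("added", b[j]);
      j++;
    }
  }
  while (i < a.length) emit("removed", a[i++]);
  while (j < b.length) emit("added", b[j++]);
  return chunks;
}

console.log(diffSketch("abc", "ac"));
// [ { kind: "common", tokens: "a" }, { kind: "removed", tokens: "b" }, { kind: "common", tokens: "c" } ]
```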
+## API Reference
+### Edit Distance
+| Function                          | Description               | Returns              |
+| --------------------------------- | ------------------------- | -------------------- |
+| `levenshtein(a, b)`               | Levenshtein edit distance | `number`             |
+| `levenshteinNormalized(a, b)`     | Normalized similarity     | `number` (0-1)       |
+| `lcsDistance(a, b, algorithm?)`   | LCS distance              | `number`             |
+| `lcsNormalized(a, b, algorithm?)` | Normalized LCS similarity | `number` (0-1)       |
+| `lcsLength(a, b, algorithm?)`     | LCS length                | `number`             |
+| `lcsPairs(a, b, algorithm?)`      | LCS matching pairs        | `[number, number][]` |
+### Token Similarity
+| Function                  | Description                                    | Returns        |
+| ------------------------- | ---------------------------------------------- | -------------- |
+| `jaccard(a, b)`           | Jaccard similarity (character multiset)        | `number` (0-1) |
+| `jaccardNgram(a, b, n?)`  | Jaccard on character n-grams                   | `number` (0-1) |
+| `cosine(a, b)`            | Cosine similarity (character multiset)         | `number` (0-1) |
+| `cosineNgram(a, b, n?)`   | Cosine on character n-grams                    | `number` (0-1) |
+| `sorensen(a, b)`          | Sorensen-Dice coefficient (character multiset) | `number` (0-1) |
+| `sorensenNgram(a, b, n?)` | Sorensen-Dice on character n-grams             | `number` (0-1) |
+### Hash-Based Deduplication
+| Function / Class                 | Description                                                        |
+| -------------------------------- | ------------------------------------------------------------------ |
+| `simhash(features, options?)`    | Generate 64-bit fingerprint as `bigint`                            |
+| `hammingDistance(a, b)`          | Hamming distance between two fingerprints                          |
+| `hammingSimilarity(a, b, bits?)` | Normalized Hamming similarity                                      |
+| `SimHasher`                      | Class with `hash()`, `distance()`, `similarity()`, `isDuplicate()` |
+| `MinHash`                        | Class with `update()`, `digest()`, `estimate()`                    |
+| `MinHash.estimate(sig1, sig2)`   | Static: estimate Jaccard from signatures                           |
+| `LSH`                            | Class with `insert()`, `query()`, `remove()`                       |
+### Fuzzy Search
+| Function / Class                             | Description                                        |
+| -------------------------------------------- | -------------------------------------------------- |
+| `FuzzySearch<T>(collection, options?)`       | Search engine with dynamic collection management   |
+| `findBestMatch(query, collection, options?)` | One-shot convenience: returns best match or `null` |
+**FuzzySearch options:**
+| Option                | Type                               | Default         | Description                   |
+| --------------------- | ---------------------------------- | --------------- | ----------------------------- |
+| `algorithm`           | `BuiltinAlgorithm \| SimilarityFn` | `"levenshtein"` | Similarity algorithm to use   |
+| `keys`                | `ISearchKey[]`                     | `[]`            | Object fields to search on    |
+| `threshold`           | `number`                           | `0`             | Min similarity score (0-1)    |
+| `limit`               | `number`                           | `Infinity`      | Max results to return         |
+| `caseSensitive`       | `boolean`                          | `false`         | Case-insensitive by default   |
+| `includeMatchDetails` | `boolean`                          | `false`         | Include per-key scores        |
+| `lsh`                 | `{ numHashes?, numBands? }`        | —               | Enable LSH for large datasets |
+**Built-in algorithms:** `"levenshtein"`, `"lcs"`, `"jaccard"`, `"jaccardNgram"`, `"cosine"`, `"cosineNgram"`, `"sorensen"`, `"sorensenNgram"`
+### Diff
+| Function               | Description                 | Returns          |
+| ---------------------- | --------------------------- | ---------------- |
+| `diff(a, b, options?)` | Sequence diff (Myers or DP) | `IDiffItem<T>[]` |
+### Types
+| Type                    | Description                                  |
+| ----------------------- | -------------------------------------------- |
+| `DiffType`              | Enum: `ADDED`, `REMOVED`, `COMMON`           |
+| `IDiffItem<T>`          | Diff result item with type and tokens        |
+| `IDiffOptions<T>`       | Options for diff (equals, lcs algorithm)     |
+| `ISimHashOptions`       | Options for SimHash (bits, hashFn)           |
+| `IMinHashOptions`       | Options for MinHash (numHashes, seed)        |
+| `ILSHOptions`           | Options for LSH (numBands, numHashes)        |
+| `IFuzzySearchOptions`   | Options for FuzzySearch constructor          |
+| `IFindBestMatchOptions` | Options for findBestMatch function           |
+| `ISearchKey`            | Searchable key config (name, weight, getter) |
+| `ISearchResult<T>`      | Search result with item, score, index        |
+| `SimilarityFn`          | `(a: string, b: string) => number` in [0,1]  |
 ## Performance
-The package automatically selects the fastest implementation available:
+Benchmark: 1000 iterations per pair, same test data across all runtimes.
+Unit: microseconds per operation (us/op).
+### Edit Distance
+| Algorithm       | Size            | TS (V8 JIT) | WASM (via JS) | Rust (native) |
+| --------------- | --------------- | ----------- | ------------- | ------------- |
+| levenshtein     | Short (<10)     | 0.3         | 7.9           | 0.11          |
+| levenshtein     | Medium (10-100) | 1.3         | 116.2         | 0.98          |
+| levenshtein     | Long (>200)     | 15.2        | 2,877.5       | 39.68         |
+| levenshteinNorm | Short           | 0.3         | 7.9           | 0.11          |
+| lcs             | Short (<10)     | 1.6         | 16.5          | 0.41          |
+| lcs             | Medium (10-100) | 6.8         | 272.6         | 3.22          |
+| lcs             | Long (>200)     | 217.8       | 6,574.1       | 122.63        |
+| lcsNorm         | Short           | 1.7         | 16.2          | 0.48          |
+### Token Similarity (Character Multiset)
+| Algorithm | Size            | TS (V8 JIT) | WASM (via JS) | Rust (native) |
+| --------- | --------------- | ----------- | ------------- | ------------- |
+| jaccard   | Short (<10)     | 0.8         | 25.2          | 0.42          |
+| jaccard   | Medium (10-100) | 0.8         | 74.3          | 1.55          |
+| jaccard   | Long (>200)     | 1.6         | 171.5         | 5.54          |
+| cosine    | Short (<10)     | 0.8         | 19.3          | 0.32          |
+| cosine    | Medium (10-100) | 0.8         | 61.4          | 1.35          |
+| cosine    | Long (>200)     | 1.5         | 158.5         | 4.77          |
+| sorensen  | Short (<10)     | 0.7         | 19.3          | 0.33          |
+| sorensen  | Medium (10-100) | 0.7         | 61.0          | 1.33          |
+| sorensen  | Long (>200)     | 1.5         | 160.0         | 4.46          |
+### Bigram Variants
+| Algorithm     | Size            | TS (V8 JIT) | WASM (via JS) | Rust (native) |
+| ------------- | --------------- | ----------- | ------------- | ------------- |
+| jaccardBigram | Short (<10)     | 1.1         | 27.4          | 0.45          |
+| jaccardBigram | Medium (10-100) | 7.7         | 160.4         | 3.86          |
+| cosineBigram  | Short (<10)     | 0.8         | 21.2          | 0.36          |
+| cosineBigram  | Medium (10-100) | 5.9         | 127.0         | 3.12          |
+TS implementations use V8 JIT optimization + `Int32Array` ASCII fast path + integer-encoded bigrams, avoiding JS-WASM boundary overhead entirely.
+### Fuzzy Search: NLPTools vs Fuse.js
+Benchmark: 20 items in collection, 6 queries per iteration, 1000 iterations.
+Unit: milliseconds per operation (ms/op). Algorithm: levenshtein (default).
-- **WebAssembly algorithms**: 10-100x faster than pure JavaScript
-- **Auto-detection**: Seamlessly switches between WASM and JS implementations
+| Scenario                | NLPTools | Fuse.js |
+| ----------------------- | -------- | ------- |
+| Setup (constructor)     | 0.0002   | 0.0050  |
+| Search (string array)   | 0.0114   | 0.1077  |
+| Search (object, 1 key)  | 0.0176   | 0.3308  |
+| Search (object, 2 keys) | 0.0289   | 0.6445  |
-## References
+Both libraries return identical top-1 results for all test queries. NLPTools scores are normalized similarity (0-1, higher is better); Fuse.js uses Bitap error scores (0 = perfect, lower is better).
-This package incorporates and builds upon the following excellent open source projects:
+## Dependencies
-- [textdistance.rs](https://github.com/life4/textdistance.rs) - Core Rust implementation via @nlptools/distance-wasm
-- [fastest-levenshtein](https://github.com/ka-weihe/fastest-levenshtein) - High-performance Levenshtein implementation
+- `fastest-levenshtein` — fastest JS Levenshtein implementation
+- `@algorithm.ts/lcs` — Myers and DP Longest Common Subsequence
+- `@algorithm.ts/diff` — Sequence diff built on LCS
 ## License
-- [MIT](../../LICENSE) &copy; [Demo Macro](https://imst.xyz/)
+[MIT](../../LICENSE) &copy; [Demo Macro](https://www.demomacro.com/)