npm - @nlptools/distance - Versions diffs - 0.0.3 → 0.0.5 - Mend

@nlptools/distance 0.0.3 → 0.0.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md CHANGED

@@ -11,6 +11,7 @@
 - Edit distance: Levenshtein, LCS (Myers O(ND) and DP)
 - Token similarity: Jaccard, Cosine, Sorensen-Dice (character multiset and n-gram variants)
 - Hash-based deduplication: SimHash, MinHash, LSH
+- Fuzzy search: `FuzzySearch` class and `findBestMatch` with multi-algorithm support
 - Diff: based on `@algorithm.ts/diff` (Myers and DP backends)
 - All distance algorithms include normalized similarity variants (0-1 range)
@@ -123,6 +124,45 @@ const query = lsh.query(mh.digest(), 0.5);
 // => [["doc1", 0.67]]
 ```
+### Fuzzy Search
+```typescript
+import { FuzzySearch, findBestMatch } from "@nlptools/distance";
+// String array search
+const search = new FuzzySearch(["apple", "banana", "cherry"]);
+search.search("aple");
+// => [{ item: "apple", score: 0.8, index: 0 }]
+// Object array with weighted keys
+const books = [
+  { title: "Old Man's War", author: "John Scalzi" },
+  { title: "The Great Gatsby", author: "F. Scott Fitzgerald" },
+];
+const bookSearch = new FuzzySearch(books, {
+  keys: [
+    { name: "title", weight: 0.7 },
+    { name: "author", weight: 0.3 },
+  ],
+  algorithm: "cosine",
+  threshold: 0.3,
+});
+bookSearch.search("old man");
+// => [{ item: { title: "Old Man's War", ... }, score: 0.52, index: 0 }]
+// One-shot best match
+findBestMatch("kitten", ["sitting", "kit", "mitten"]);
+// => { item: "kit", score: 0.5, index: 1 }
+// With per-key details
+const detailed = new FuzzySearch(books, {
+  keys: [{ name: "title" }, { name: "author" }],
+  includeMatchDetails: true,
+});
+detailed.search("gatsby");
+// => [{ item: ..., score: 0.45, index: 1, matches: { title: 0.6, author: 0.1 } }]
+```
 ### Diff
 ```typescript
@@ -172,6 +212,27 @@ const result = diff("abc", "ac");
 | `MinHash.estimate(sig1, sig2)`   | Static: estimate Jaccard from signatures                           |
 | `LSH`                            | Class with `insert()`, `query()`, `remove()`                       |
+### Fuzzy Search
+| Function / Class                             | Description                                        |
+| -------------------------------------------- | -------------------------------------------------- |
+| `FuzzySearch<T>(collection, options?)`       | Search engine with dynamic collection management   |
+| `findBestMatch(query, collection, options?)` | One-shot convenience: returns best match or `null` |
+**FuzzySearch options:**
+| Option                | Type                               | Default         | Description                   |
+| --------------------- | ---------------------------------- | --------------- | ----------------------------- |
+| `algorithm`           | `BuiltinAlgorithm \| SimilarityFn` | `"levenshtein"` | Similarity algorithm to use   |
+| `keys`                | `ISearchKey[]`                     | `[]`            | Object fields to search on    |
+| `threshold`           | `number`                           | `0`             | Min similarity score (0-1)    |
+| `limit`               | `number`                           | `Infinity`      | Max results to return         |
+| `caseSensitive`       | `boolean`                          | `false`         | Case-insensitive by default   |
+| `includeMatchDetails` | `boolean`                          | `false`         | Include per-key scores        |
+| `lsh`                 | `{ numHashes?, numBands? }`        | —               | Enable LSH for large datasets |
+**Built-in algorithms:** `"levenshtein"`, `"lcs"`, `"jaccard"`, `"jaccardNgram"`, `"cosine"`, `"cosineNgram"`, `"sorensen"`, `"sorensenNgram"`
 ### Diff
 | Function               | Description                 | Returns          |
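Note on the hunk above: the new options table accepts a `SimilarityFn` for `algorithm`, and the README documents that type as `(a: string, b: string) => number` returning a value in [0,1]. A minimal sketch of a function with that shape follows. The `SimilarityFn` alias is redeclared locally rather than imported, `levenshteinSimilarity` is a hypothetical name, and nothing here is confirmed to match how the package computes its built-in `"levenshtein"` score.

```typescript
// Local alias mirroring the signature documented in the README.
// Assumption: redeclared here for illustration, not imported from the package.
type SimilarityFn = (a: string, b: string) => number;

// Hypothetical custom metric: 1 - (edit distance / length of the longer string).
const levenshteinSimilarity: SimilarityFn = (a, b) => {
  if (a === b) return 1;
  const longer = Math.max(a.length, b.length);
  if (longer === 0) return 1;

  // Single-row dynamic programming over UTF-16 code units.
  const row: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // cell (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // cell (i-1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return 1 - row[b.length] / longer;
};
```

Under this definition `levenshteinSimilarity("aple", "apple")` is one edit over five characters, roughly 0.8. If the option works as the table describes, a function like this would presumably be passed as `{ algorithm: levenshteinSimilarity }`. That usage is inferred from the README, not verified against the package.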
@@ -180,57 +241,76 @@ const result = diff("abc", "ac");
 ### Types
-| Type              | Description                              |
-| ----------------- | ---------------------------------------- |
-| `DiffType`        | Enum: `ADDED`, `REMOVED`, `COMMON`       |
-| `IDiffItem<T>`    | Diff result item with type and tokens    |
-| `IDiffOptions<T>` | Options for diff (equals, lcs algorithm) |
-| `ISimHashOptions` | Options for SimHash (bits, hashFn)       |
-| `IMinHashOptions` | Options for MinHash (numHashes, seed)    |
-| `ILSHOptions`     | Options for LSH (numBands, numHashes)    |
+| Type                    | Description                                  |
+| ----------------------- | -------------------------------------------- |
+| `DiffType`              | Enum: `ADDED`, `REMOVED`, `COMMON`           |
+| `IDiffItem<T>`          | Diff result item with type and tokens        |
+| `IDiffOptions<T>`       | Options for diff (equals, lcs algorithm)     |
+| `ISimHashOptions`       | Options for SimHash (bits, hashFn)           |
+| `IMinHashOptions`       | Options for MinHash (numHashes, seed)        |
+| `ILSHOptions`           | Options for LSH (numBands, numHashes)        |
+| `IFuzzySearchOptions`   | Options for FuzzySearch constructor          |
+| `IFindBestMatchOptions` | Options for findBestMatch function           |
+| `ISearchKey`            | Searchable key config (name, weight, getter) |
+| `ISearchResult<T>`      | Search result with item, score, index        |
+| `SimilarityFn`          | `(a: string, b: string) => number` in [0,1]  |
 ## Performance
-Benchmark: 1000 iterations per pair, same test data across all runtimes.
+Benchmark: same test data across all runtimes. TS/WASM via `vitest bench` (V8 JIT), Rust via `cargo test --release`.
 Unit: microseconds per operation (us/op).
 ### Edit Distance
 | Algorithm       | Size            | TS (V8 JIT) | WASM (via JS) | Rust (native) |
 | --------------- | --------------- | ----------- | ------------- | ------------- |
-| levenshtein     | Short (<10)     | 0.3         | 7.9           | 0.11          |
-| levenshtein     | Medium (10-100) | 1.3         | 116.2         | 0.98          |
-| levenshtein     | Long (>200)     | 15.2        | 2,877.5       | 39.68         |
-| levenshteinNorm | Short           | 0.3         | 7.9           | 0.11          |
-| lcs             | Short (<10)     | 1.6         | 16.5          | 0.41          |
-| lcs             | Medium (10-100) | 6.8         | 272.6         | 3.22          |
-| lcs             | Long (>200)     | 217.8       | 6,574.1       | 122.63        |
-| lcsNorm         | Short           | 1.7         | 16.2          | 0.48          |
+| levenshtein     | Short (<10)     | 0.3         | 1.0           | 0.24          |
+| levenshtein     | Medium (10-100) | 1.3         | 4.8           | 2.00          |
+| levenshtein     | Long (>200)     | 13.9        | 102.3         | 61.77         |
+| levenshteinNorm | Short           | 0.3         | 1.0           | 0.19          |
+| lcs             | Short (<10)     | 1.7         | 1.9           | 0.69          |
+| lcs             | Medium (10-100) | 6.8         | 10.1          | 7.70          |
+| lcs             | Long (>200)     | 216.0       | 161.8         | 151.84        |
+| lcsNorm         | Short           | 1.7         | 1.9           | 0.42          |
 ### Token Similarity (Character Multiset)
 | Algorithm | Size            | TS (V8 JIT) | WASM (via JS) | Rust (native) |
 | --------- | --------------- | ----------- | ------------- | ------------- |
-| jaccard   | Short (<10)     | 0.8         | 25.2          | 0.42          |
-| jaccard   | Medium (10-100) | 0.8         | 74.3          | 1.55          |
-| jaccard   | Long (>200)     | 1.6         | 171.5         | 5.54          |
-| cosine    | Short (<10)     | 0.8         | 19.3          | 0.32          |
-| cosine    | Medium (10-100) | 0.8         | 61.4          | 1.35          |
-| cosine    | Long (>200)     | 1.5         | 158.5         | 4.77          |
-| sorensen  | Short (<10)     | 0.7         | 19.3          | 0.33          |
-| sorensen  | Medium (10-100) | 0.7         | 61.0          | 1.33          |
-| sorensen  | Long (>200)     | 1.5         | 160.0         | 4.46          |
+| jaccard   | Short (<10)     | 0.8         | 3.4           | 0.63          |
+| jaccard   | Medium (10-100) | 0.8         | 8.6           | 2.67          |
+| jaccard   | Long (>200)     | 1.5         | 18.9          | 7.25          |
+| cosine    | Short (<10)     | 1.0         | 2.6           | 0.43          |
+| cosine    | Medium (10-100) | 0.8         | 7.0           | 1.56          |
+| cosine    | Long (>200)     | 1.7         | 17.2          | 6.23          |
+| sorensen  | Short (<10)     | 0.7         | 2.6           | 0.56          |
+| sorensen  | Medium (10-100) | 0.7         | 7.0           | 2.27          |
+| sorensen  | Long (>200)     | 1.4         | 17.4          | 6.48          |
 ### Bigram Variants
 | Algorithm     | Size            | TS (V8 JIT) | WASM (via JS) | Rust (native) |
 | ------------- | --------------- | ----------- | ------------- | ------------- |
-| jaccardBigram | Short (<10)     | 1.1         | 27.4          | 0.45          |
-| jaccardBigram | Medium (10-100) | 7.7         | 160.4         | 3.86          |
-| cosineBigram  | Short (<10)     | 0.8         | 21.2          | 0.36          |
-| cosineBigram  | Medium (10-100) | 5.9         | 127.0         | 3.12          |
+| jaccardBigram | Short (<10)     | 1.1         | 3.5           | 0.67          |
+| jaccardBigram | Medium (10-100) | 7.5         | 18.1          | 4.80          |
+| cosineBigram  | Short (<10)     | 0.7         | 2.8           | 0.43          |
+| cosineBigram  | Medium (10-100) | 5.4         | 14.0          | 4.04          |
+TS implementations use `Int32Array` ASCII fast path + integer-encoded bigrams, avoiding JS-WASM boundary overhead. For compute-heavy algorithms on long strings (e.g. LCS), WASM via JS and Rust native can outperform TS due to native computation advantage outweighing the boundary cost.
+### Fuzzy Search: NLPTools vs Fuse.js
+Benchmark: 20 items in collection, 6 queries per iteration, 1000 iterations.
+Unit: milliseconds per operation (ms/op). Algorithm: levenshtein (default).
+| Scenario                | NLPTools | Fuse.js |
+| ----------------------- | -------- | ------- |
+| Setup (constructor)     | 0.0002   | 0.0050  |
+| Search (string array)   | 0.0114   | 0.1077  |
+| Search (object, 1 key)  | 0.0176   | 0.3308  |
+| Search (object, 2 keys) | 0.0289   | 0.6445  |
-TS implementations use V8 JIT optimization + `Int32Array` ASCII fast path + integer-encoded bigrams, avoiding JS-WASM boundary overhead entirely.
+Both libraries return identical top-1 results for all test queries. NLPTools scores are normalized similarity (0-1, higher is better); Fuse.js uses Bitap error scores (0 = perfect, lower is better).
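Note on the Fuse.js comparison above: the table gives absolute timings only, so the relative gap has to be worked out by hand. The sketch below derives the ratios from the published figures. The variable names are illustrative, and the numbers are copied from the table rather than re-measured.

```typescript
// Timings in ms/op, copied from the README's comparison table.
const timings = [
  { scenario: "Setup (constructor)", nlptools: 0.0002, fuse: 0.005 },
  { scenario: "Search (string array)", nlptools: 0.0114, fuse: 0.1077 },
  { scenario: "Search (object, 1 key)", nlptools: 0.0176, fuse: 0.3308 },
  { scenario: "Search (object, 2 keys)", nlptools: 0.0289, fuse: 0.6445 },
];

// Ratio > 1 means Fuse.js took that many times longer per operation.
const speedups = timings.map(({ scenario, nlptools, fuse }) => ({
  scenario,
  ratio: fuse / nlptools,
}));

for (const { scenario, ratio } of speedups) {
  console.log(`${scenario}: ${ratio.toFixed(1)}x`);
}
```

The ratios come out at roughly 25x, 9.4x, 18.8x and 22.3x. The setup figure carries a single significant digit, so its ratio is coarse, and all four apply to the 20-item collection described in the benchmark note. They say nothing about how either library scales to larger collections.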
 ## Dependencies