npm - simile-search - Versions diffs - 0.4.2 → 0.4.3 - Mend

simile-search 0.4.2 → 0.4.3


Files changed (4)

package/README.md CHANGED

@@ -1,45 +1,48 @@
 <div align="center">
-  <img src="assets/logo.jpeg" alt="Simile Logo" width="200">
+  <img src="assets/logo.svg" alt="Simile Logo" width="200">
+  # Simile
+  **Intelligent offline-first semantic search for modern applications**
+  [![npm version](https://img.shields.io/npm/v/simile-search)](https://www.npmjs.com/package/simile-search)
+  [![npm downloads](https://img.shields.io/npm/dm/simile-search)](https://www.npmjs.com/package/simile-search)
+  [![license](https://img.shields.io/npm/l/simile-search)](https://github.com/iaavas/simile/blob/main/LICENSE)
 </div>
-# Simile 🔍
+---
-![npm](https://img.shields.io/npm/v/simile-search)
-![npm](https://img.shields.io/npm/dm/simile-search)
-![license](https://img.shields.io/npm/l/simile-search)
+## Overview
-**Offline-first semantic + fuzzy search engine for catalogs, names, and products.**
+Simile is a high-performance search engine that combines semantic understanding, fuzzy matching, and keyword search to deliver highly relevant results—entirely offline. Built with Transformers.js, it requires no API calls, runs completely locally, and scales to handle large datasets efficiently.
-Simile combines the power of AI embeddings with fuzzy string matching and keyword search to deliver highly relevant search results—all running locally, no API calls required.
+Perfect for product catalogs, content libraries, user directories, and any application requiring intelligent search without external dependencies.
-## ✨ Features
+## Key Features
-- 🧠 **Semantic Search** - Understands meaning, not just keywords ("phone charger" finds "USB-C cable")
-- 🔤 **Fuzzy Matching** - Handles typos and partial matches gracefully
-- 🎯 **Keyword Boost** - Exact matches get priority
-- ⚡ **O(log n) Search** - Built-in HNSW index for lightning-fast search on large datasets (10k+ items)
-- 📉 **Quantization** - Reduce memory usage by up to 75% with `float16` and `int8` support
-- 🚀 **Vector Cache** - LRU caching to avoid redundant embedding of duplicate text
-- 🔄 **Non-blocking Updates** - Asynchronous background indexing keeps your app responsive
-- 💾 **Persistence** - Save/load embeddings to avoid re-computing
-- 🔧 **Configurable** - Tune scoring weights for your use case
-- 📦 **Zero API Calls** - Everything runs locally with Transformers.js
-- 🔗 **Nested Path Search** - Search `author.firstName` instead of flat strings
-- 📊 **Score Normalization** - Consistent scoring across different methods
-- ✂️ **Min Character Limit** - Control when search triggers
+- **🧠 Semantic Understanding** — Finds conceptually similar items, not just keyword matches ("phone charger" → "USB-C cable")
+- **🔤 Typo Tolerance** — Fuzzy matching handles misspellings and partial queries gracefully
+- **⚡ Lightning Fast** — O(log n) search with HNSW indexing for datasets of 10k+ items
+- **💾 Memory Efficient** — Quantization support (float16/int8) reduces memory usage by up to 75%
+- **🔄 Non-blocking Updates** — Asynchronous indexing keeps your application responsive
+- **📦 Zero Dependencies on APIs** — Runs entirely locally with Transformers.js
+- **🔗 Deep Object Search** — Query nested fields with dot notation (`author.firstName`)
+- **💾 Persistent Storage** — Save and load embeddings to avoid recomputation
+- **🎯 Highly Configurable** — Tune scoring weights, thresholds, and search behavior
-## 📦 Installation
+## Installation
 ```bash
 npm install simile-search
 ```
-## 🚀 Quick Start
+## Quick Start
 ```typescript
 import { Simile } from 'simile-search';
-// Create a search engine with your items
+// Initialize search engine
 const engine = await Simile.from([
   { id: '1', text: 'Bathroom floor cleaner', metadata: { category: 'Cleaning' } },
   { id: '2', text: 'Dishwashing liquid', metadata: { category: 'Kitchen' } },
@@ -47,144 +50,140 @@ const engine = await Simile.from([
   { id: '4', text: 'USB-C phone charger cable', metadata: { category: 'Electronics' } },
 ]);
-// Search!
+// Search with natural language
 const results = await engine.search('phone charger');
 console.log(results);
 // [
 //   { id: '3', text: 'iPhone Charger', score: 0.92, ... },
-//   { id: '4', text: 'USB-C phone charger cable', score: 0.87, ... },
-//   ...
+//   { id: '4', text: 'USB-C phone charger cable', score: 0.87, ... }
 // ]
 ```
-## 💾 Persistence (Save & Load)
+## Core Concepts
+### Persistence
-The first embedding run can be slow. Save your embeddings to load instantly next time:
+Avoid re-embedding on every startup by saving your index:
 ```typescript
 import { Simile } from 'simile-search';
 import * as fs from 'fs';
-// First run: embed and save (slow, but only once!)
+// Initial setup: embed and save
 const engine = await Simile.from(items);
-fs.writeFileSync('catalog.json', engine.toJSON());
+fs.writeFileSync('search-index.json', engine.toJSON());
-// Later: instant load from file (no re-embedding!)
-const json = fs.readFileSync('catalog.json', 'utf-8');
+// Subsequent loads: instant startup
+const json = fs.readFileSync('search-index.json', 'utf-8');
 const loadedEngine = Simile.loadFromJSON(json);
-// Works exactly the same
-const results = await loadedEngine.search('cleaner');
+// Functionally identical to the original
+const results = await loadedEngine.search('query');
 ```
-### Snapshot Format
+**Snapshot Format** for database storage:
 ```typescript
-// For database storage
 const snapshot = engine.save();
 // {
 //   version: '0.2.0',
 //   model: 'Xenova/all-MiniLM-L6-v2',
 //   items: [...],
-//   vectors: ['base64...', 'base64...'],
+//   vectors: ['base64...'],
 //   createdAt: '2024-12-28T...',
-//   textPaths: ['metadata.title', ...]  // if configured
+//   textPaths: [...]
 // }
-// Load from snapshot object
 const restored = Simile.load(snapshot);
 ```
-## 🔗 Nested Path Search
+### Nested Object Search
-Search complex objects by specifying paths to extract text from:
+Search complex data structures by specifying extraction paths:
 ```typescript
 const books = [
   {
     id: '1',
-    text: '',  // Can be empty when using textPaths
     metadata: {
       author: { firstName: 'John', lastName: 'Doe' },
       title: 'The Art of Programming',
       tags: ['coding', 'javascript'],
     },
   },
-  {
-    id: '2',
-    text: '',
-    metadata: {
-      author: { firstName: 'Jane', lastName: 'Smith' },
-      title: 'Machine Learning Basics',
-      tags: ['ai', 'python'],
-    },
-  },
 ];
-// Configure which paths to extract and search
 const engine = await Simile.from(books, {
   textPaths: [
     'metadata.author.firstName',
     'metadata.author.lastName',
     'metadata.title',
-    'metadata.tags',  // Arrays are joined with spaces
+    'metadata.tags',  // Arrays are automatically joined
   ],
 });
-// Now you can search by author name!
+// Search across all configured paths
 const results = await engine.search('John programming');
-// Finds "The Art of Programming" by John Doe
 ```
-### Supported Path Formats
+**Supported path formats:**
+- Nested objects: `metadata.author.firstName`
+- Array indexing: `items[0].name`
+- Array joining: `metadata.tags` (joins all elements)
+### Dynamic Catalog Management
+Update your search index without rebuilding:
 ```typescript
-// Dot notation for nested objects
-'metadata.author.firstName'  // → "John"
+// Add new items
+await engine.add([
+  { id: '5', text: 'Wireless headphones', metadata: { category: 'Electronics' } }
+]);
+// Update existing items (by ID)
+await engine.add([
+  { id: '1', text: 'Premium bathroom cleaner', metadata: { category: 'Cleaning' } }
+]);
-// Array index access
-'metadata.tags[0]'           // → "coding"
-'items[0].name'              // → nested array access
+// Remove items
+engine.remove(['2', '3']);
-// Arrays without index (joins all elements)
-'metadata.tags'              // → "coding javascript"
+// Retrieve items
+const item = engine.get('1');
+const allItems = engine.getAll();
+console.log(engine.size); // Current item count
 ```
-## 🔧 Configuration
+## Configuration
-### Custom Scoring Weights
+### Scoring Weights
-Tune how much each scoring method contributes:
+Customize how different matching strategies contribute to the final score:
 ```typescript
 const engine = await Simile.from(items, {
   weights: {
-    semantic: 0.7,  // AI embedding similarity (default: 0.7)
-    fuzzy: 0.15,    // Levenshtein distance (default: 0.15)
-    keyword: 0.15,  // Exact keyword matches (default: 0.15)
+    semantic: 0.7,  // AI embedding similarity (default)
+    fuzzy: 0.15,    // Levenshtein distance
+    keyword: 0.15,  // Exact keyword matching
   }
 });
-// Or adjust later
+// Adjust weights dynamically
 engine.setWeights({ semantic: 0.9, fuzzy: 0.05, keyword: 0.05 });
 ```
 ### Score Normalization
-By default, scores are normalized so that a "0.8" semantic score means the same as a "0.8" fuzzy score. This ensures fair comparison across different scoring methods.
+Simile normalizes scores across different matching methods for fair comparison:
 ```typescript
-// Enabled by default
 const engine = await Simile.from(items, {
-  normalizeScores: true,  // default
-});
-// Disable if you want raw scores
-const rawEngine = await Simile.from(items, {
-  normalizeScores: false,
+  normalizeScores: true,  // Enabled by default
 });
-// With explain: true, you can see both normalized and raw scores
+// View normalized and raw scores
 const results = await engine.search('cleaner', { explain: true });
 // {
 //   score: 1.0,
@@ -193,9 +192,9 @@ const results = await engine.search('cleaner', { explain: true });
 //     fuzzy: 1.0,       // normalized
 //     keyword: 1.0,     // normalized
 //     raw: {
-//       semantic: 0.62, // original score
-//       fuzzy: 0.32,    // original score
-//       keyword: 1.0,   // original score
+//       semantic: 0.62,
+//       fuzzy: 0.32,
+//       keyword: 1.0
 //     }
 //   }
 // }
@@ -203,60 +202,82 @@ const results = await engine.search('cleaner', { explain: true });
 ### Search Options
+Fine-tune search behavior per query:
 ```typescript
 const results = await engine.search('cleaner', {
-  topK: 10,           // Max results to return (default: 5)
-  threshold: 0.5,     // Minimum score (default: 0)
-  explain: true,      // Include score breakdown
-  filter: (meta) => meta.category === 'Cleaning',  // Filter by metadata
-  minLength: 3,       // Don't search until 3+ characters typed (default: 1)
+  topK: 10,                                      // Maximum results (default: 5)
+  threshold: 0.5,                                // Minimum score cutoff
+  explain: true,                                 // Include score breakdown
+  filter: (meta) => meta.category === 'Cleaning', // Metadata filtering
+  minLength: 3,                                  // Minimum query length (default: 1)
 });
 ```
-### Min Character Limit
-Prevent unnecessary searches on very short queries:
+**Minimum character limit** prevents unnecessary searches on partial input:
 ```typescript
-// Don't trigger search until user types at least 3 characters
-const results = await engine.search('cl', { minLength: 3 });
-// Returns [] because query length (2) < minLength (3)
-const results2 = await engine.search('cle', { minLength: 3 });
-// Returns results because query length (3) >= minLength (3)
+await engine.search('cl', { minLength: 3 }); // Returns [] (too short)
+await engine.search('cle', { minLength: 3 }); // Returns results
 ```
-This is useful for autocomplete/typeahead UIs where you don't want to search on every keystroke.
+## Performance Optimization
-## 📝 Dynamic Catalog Management
+Simile is designed to scale efficiently from hundreds to hundreds of thousands of items.
-Add, update, or remove items without rebuilding:
+### Quantization
+Reduce memory usage with lower-precision vector representations:
 ```typescript
-// Add new items
-await engine.add([
-  { id: '5', text: 'Wireless headphones', metadata: { category: 'Electronics' } }
-]);
+const engine = await Simile.from(items, {
+  quantization: 'float16', // 50% memory reduction, minimal accuracy loss
+  // OR
+  quantization: 'int8',    // 75% memory reduction, slight accuracy trade-off
+});
+```
-// Update existing item (same ID)
-await engine.add([
-  { id: '1', text: 'Premium bathroom cleaner', metadata: { category: 'Cleaning' } }
-]);
+### Approximate Nearest Neighbor (ANN) Search
-// Remove items
-engine.remove(['2', '3']);
+For large datasets, HNSW indexing provides logarithmic search time:
-// Get item by ID
-const item = engine.get('1');
+```typescript
+const engine = await Simile.from(items, {
+  useANN: true,          // Enable ANN indexing
+  annThreshold: 1000,    // Auto-enable when items > threshold (default: 1000)
+});
+```
-// Get all items
-const allItems = engine.getAll();
+### Vector Caching
+LRU cache eliminates redundant embeddings for duplicate texts:
+```typescript
+const engine = await Simile.from(items, {
+  cache: {
+    maxSize: 5000,      // Cache up to 5000 embeddings
+    enableStats: true,  // Track cache performance
+  }
+});
-// Get count
-console.log(engine.size); // 3
+// Monitor cache efficiency
+const stats = engine.getIndexInfo().cacheStats;
+console.log(`Hit rate: ${stats.hitRate}%`);
 ```
-## 🎯 Advanced: Direct Access to Utilities
+### Background Indexing
+Updates are processed asynchronously to maintain responsiveness:
+```typescript
+// Returns immediately, processes in background
+await engine.add(newItems);
+await engine.add(moreItems);
+```
+## Advanced Usage
+### Direct Utility Access
 For custom implementations:
@@ -268,18 +289,12 @@ import {
   fuzzyScore,
   keywordScore,
   hybridScore,
-  vectorToBase64,
-  base64ToVector,
   getByPath,
   extractText,
-  normalizeScore,
-  calculateScoreStats,
 } from 'simile-search';
-// Embed text directly
+// Generate embeddings
 const vector = await embed('hello world');
-// Batch embed for performance
 const vectors = await embedBatch(['text1', 'text2', 'text3']);
 // Calculate similarities
@@ -288,52 +303,36 @@ const fuzzy = fuzzyScore('cleaner', 'cleenr');
 const keyword = keywordScore('phone charger', 'USB phone charger cable');
 // Combine scores
-const score = hybridScore(0.8, 0.6, 0.5, { semantic: 0.7, fuzzy: 0.15, keyword: 0.15 });
+const finalScore = hybridScore(
+  0.8, 0.6, 0.5,
+  { semantic: 0.7, fuzzy: 0.15, keyword: 0.15 }
+);
-// Extract nested values
+// Extract nested data
 const firstName = getByPath(obj, 'author.firstName');
 const text = extractText(item, ['metadata.title', 'metadata.tags']);
 ```
-## 📊 API Reference
-### `Simile.from(items, config?)`
-Create a new engine from items. Embeds all items (async).
-### `Simile.load(snapshot, config?)`
-Load from a saved snapshot (instant, no embedding).
-### `Simile.loadFromJSON(json, config?)`
-Load from JSON string.
-### `engine.search(query, options?)`
-Search for similar items. **Results are always sorted by relevance (highest score first).**
-### `engine.save()`
-Export snapshot object for persistence.
-### `engine.toJSON()`
-Export as JSON string.
-### `engine.add(items)`
-Add or update items (async).
-### `engine.remove(ids)`
-Remove items by ID.
-### `engine.get(id)`
-Get single item by ID.
+## API Reference
-### `engine.getAll()`
-Get all items.
+### Class Methods
-### `engine.size`
-Number of items.
+| Method | Description |
+|--------|-------------|
+| `Simile.from(items, config?)` | Create engine from items (async, embeds all) |
+| `Simile.load(snapshot, config?)` | Load from snapshot object (instant) |
+| `Simile.loadFromJSON(json, config?)` | Load from JSON string |
+| `engine.search(query, options?)` | Search for similar items (sorted by relevance) |
+| `engine.save()` | Export snapshot object |
+| `engine.toJSON()` | Export as JSON string |
+| `engine.add(items)` | Add or update items (async) |
+| `engine.remove(ids)` | Remove items by ID |
+| `engine.get(id)` | Retrieve single item |
+| `engine.getAll()` | Retrieve all items |
+| `engine.setWeights(weights)` | Update scoring weights |
+| `engine.size` | Current item count |
-### `engine.setWeights(weights)`
-Update scoring weights.
-## 🧪 Types
+## TypeScript Types
 ```typescript
 interface SearchItem<T = any> {
@@ -359,99 +358,38 @@ interface SearchOptions {
   topK?: number;
   explain?: boolean;
   threshold?: number;
-  minLength?: number;  // Min query length to trigger search
+  minLength?: number;
   filter?: (metadata: any) => boolean;
 }
 interface SimileConfig {
   weights?: { semantic?: number; fuzzy?: number; keyword?: number };
   model?: string;
-  textPaths?: string[];       // Paths for nested object search
-  normalizeScores?: boolean;  // Enable score normalization (default: true)
+  textPaths?: string[];
+  normalizeScores?: boolean;
   cache?: boolean | CacheOptions;
   quantization?: 'float32' | 'float16' | 'int8';
   useANN?: boolean | HNSWConfig;
   annThreshold?: number;
 }
-interface CacheOptions {
-  maxSize?: number;
-  enableStats?: boolean;
-}
-interface HNSWConfig {
-  M?: number;
-  efConstruction?: number;
-  efSearch?: number;
-}
 ```
-## 🤖 Model
-Simile uses [Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) via Transformers.js by default. This model runs entirely in JavaScript—no Python or external APIs required.
-## 📄 License
+## Technical Details
-MIT © [Aavash Baral](https://github.com/iaavas)
-## ⚡ Performance Optimization
+**Embedding Model:** [Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) via Transformers.js
-Simile v0.4.0 introduces several features to handle large scale datasets (10k-100k+ items) efficiently.
+This model runs entirely in JavaScript with no Python runtime or external API dependencies.
-### 📉 Quantization
+## License
-Reduce memory footprint by representing vectors with lower precision.
-```typescript
-const engine = await Simile.from(items, {
-  quantization: 'float16', // 50% memory reduction, minimal accuracy loss
-  // OR
-  quantization: 'int8',    // 75% memory reduction, slight accuracy loss
-});
-```
-### ⚡ O(log n) Search (ANN)
-For datasets larger than 1,000 items, Simile automatically builds an HNSW (Hierarchical Navigable Small World) index for near-instant search.
-```typescript
-const engine = await Simile.from(items, {
-  useANN: true, // Force enable ANN
-  annThreshold: 500, // Enable ANN if items > 500 (default: 1000)
-});
-```
-### 🚀 Vector Caching
-Avoid redundant AI embedding calls for duplicate texts with built-in LRU caching.
-```typescript
-const engine = await Simile.from(items, {
-  cache: {
-    maxSize: 5000, // Cache up to 5000 unique embeddings
-    enableStats: true,
-  }
-});
-// Check cache performance
-const stats = engine.getIndexInfo().cacheStats;
-console.log(`Cache Hit Rate: ${stats.hitRate}%`);
-```
-### 🔄 Non-blocking Background Updates
+MIT © [Aavash Baral](https://github.com/iaavas)
-Adding items to a large index can be expensive. Simile uses an internal queue to process updates in the background without blocking search.
+## Contributing
-```typescript
-// These return immediately/nearly immediately and process in batches
-engine.add(newItems);
-engine.add(moreItems);
-```
+Contributions are welcome! Please feel free to submit a Pull Request.
 ---
-<p align="center">
-  Made with ❤️ by <a href="https://github.com/iaavas">Aavash Baral</a>
-</p>
+<div align="center">
+  <sub>Built with ❤️ by <a href="https://github.com/iaavas">Aavash Baral</a></sub>
+</div>
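
The README states that `float16` and `int8` quantization reduce memory usage by 50% and "up to 75%". Those figures follow from element width alone, as the arithmetic sketch below shows. It assumes vectors of 384 dimensions (the output width of all-MiniLM-L6-v2) and ignores any per-vector metadata such as an int8 scale factor. The names `bytesPerElement`, `vectorBytes`, and `savingsVsFloat32` are hypothetical and are not part of the simile-search API.

```typescript
// Illustration only: storage cost per embedding at each precision.
// Assumption: 384 dimensions, no extra per-vector overhead.
const DIMENSIONS = 384;

const bytesPerElement = { float32: 4, float16: 2, int8: 1 } as const;
type Quantization = keyof typeof bytesPerElement;

// Bytes needed to store one vector at the given precision.
function vectorBytes(q: Quantization, dims: number = DIMENSIONS): number {
  return bytesPerElement[q] * dims;
}

// Fraction of memory saved relative to float32 storage.
function savingsVsFloat32(q: Quantization): number {
  return 1 - vectorBytes(q) / vectorBytes("float32");
}

console.log(vectorBytes("float32"));      // 1536
console.log(savingsVsFloat32("float16")); // 0.5
console.log(savingsVsFloat32("int8"));    // 0.75
```

Under these assumptions, 100,000 items occupy roughly 154 MB of vector data at float32 and roughly 38 MB at int8.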

package/dist/utils.d.ts CHANGED

@@ -28,4 +28,4 @@ export declare function extractText(item: any, paths?: string[]): string;
  * Normalize a score to a 0-1 range using min-max normalization.
  * Handles edge cases where min equals max.
  */
-export declare function normalizeScore(value: number, min: number, max: number): number;
+export declare function normalizeScore(value: number, min: number, max: number, floorMax?: number): number;

package/dist/utils.js CHANGED

@@ -59,8 +59,9 @@ export function extractText(item, paths) {
  * Normalize a score to a 0-1 range using min-max normalization.
  * Handles edge cases where min equals max.
  */
-export function normalizeScore(value, min, max) {
-    if (max === min)
+export function normalizeScore(value, min, max, floorMax = 0) {
+    const effectiveMax = Math.max(max, floorMax);
+    if (effectiveMax <= min)
         return value > 0 ? 1 : 0;
-    return (value - min) / (max - min);
+    return Math.max(0, Math.min(1, (value - min) / (effectiveMax - min)));
 }
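
The hunk above changes `normalizeScore` in three ways: the result is clamped to the 0-1 range, an optional `floorMax` raises the upper bound used for scaling, and the degenerate-range check now also covers `max < min`. The sketch below restates both versions with type annotations so they can be compared side by side. `normalizeScoreOld` is a hypothetical name for the 0.4.2 body, and the sample inputs are made up for illustration. How the engine chooses `floorMax` at its call sites is not shown in this diff.

```typescript
// 0.4.2 behaviour, copied from the removed lines of the hunk.
function normalizeScoreOld(value: number, min: number, max: number): number {
  if (max === min) return value > 0 ? 1 : 0;
  return (value - min) / (max - min);
}

// 0.4.3 behaviour, copied from the added lines of the hunk.
function normalizeScore(
  value: number,
  min: number,
  max: number,
  floorMax: number = 0
): number {
  // The scaling range never ends below floorMax.
  const effectiveMax = Math.max(max, floorMax);
  // Empty or inverted range: fall back to a binary score.
  if (effectiveMax <= min) return value > 0 ? 1 : 0;
  // Clamp so out-of-range inputs cannot leave [0, 1].
  return Math.max(0, Math.min(1, (value - min) / (effectiveMax - min)));
}

// Out-of-range input: unclamped before, clamped now.
console.log(normalizeScoreOld(1.2, 0, 1)); // 1.2
console.log(normalizeScore(1.2, 0, 1));    // 1

// A weak best score (0.25) is no longer stretched to the top of the
// range when a floor of 0.5 is supplied.
console.log(normalizeScore(0.2, 0, 0.25));      // 0.8
console.log(normalizeScore(0.2, 0, 0.25, 0.5)); // 0.4
```

Because `floorMax` defaults to `0`, existing three-argument callers with non-negative scores keep the same scaling and only gain the clamp.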

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "simile-search",
-  "version": "0.4.2",
+  "version": "0.4.3",
   "description": "Offline-first semantic + fuzzy search engine for catalogs, names, and products",
   "type": "module",
   "main": "dist/index.js",
@@ -44,10 +44,10 @@
   },
   "repository": {
     "type": "git",
-    "url": "github.com/iaavas/simile-search"
+    "url": "https://github.com/iaavas/simile-search.git"
   },
   "bugs": {
-    "url": "github.com/iaavas/simile-search/issues"
+    "url": "https://github.com/iaavas/simile-search/issues"
   },
-  "homepage": "github.com/iaavas/simile-search"
+  "homepage": "https://github.com/iaavas/simile-search"
 }