@memberjunction/ai-vector-dupe 5.21.0 → 5.22.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

package/README.md +254 -230
package/dist/duplicateRecordDetector.d.ts +116 -18
package/dist/duplicateRecordDetector.d.ts.map +1 -1
package/dist/duplicateRecordDetector.js +465 -262
package/dist/duplicateRecordDetector.js.map +1 -1
package/dist/index.d.ts +2 -3
package/dist/index.d.ts.map +1 -1
package/dist/index.js +4 -3
package/dist/index.js.map +1 -1
package/dist/scoring/ReciprocalRankFusion.d.ts +45 -0
package/dist/scoring/ReciprocalRankFusion.d.ts.map +1 -0
package/dist/scoring/ReciprocalRankFusion.js +63 -0
package/dist/scoring/ReciprocalRankFusion.js.map +1 -0
package/package.json +10 -10
package/dist/config.d.ts +0 -13
package/dist/config.d.ts.map +0 -1
package/dist/config.js +0 -15
package/dist/config.js.map +0 -1
package/dist/generic/vectorSyncBase.d.ts +0 -20
package/dist/generic/vectorSyncBase.d.ts.map +0 -1
package/dist/generic/vectorSyncBase.js +0 -42
package/dist/generic/vectorSyncBase.js.map +0 -1
package/dist/models/entitySyncConfig.d.ts +0 -36
package/dist/models/entitySyncConfig.d.ts.map +0 -1
package/dist/models/entitySyncConfig.js +0 -2
package/dist/models/entitySyncConfig.js.map +0 -1

package/README.md CHANGED

@@ -1,42 +1,66 @@
 # @memberjunction/ai-vector-dupe
-AI-powered duplicate record detection for MemberJunction entities. This package uses vector embeddings and similarity search to find potential duplicate records, track detection runs, and optionally auto-merge high-confidence matches.
+<!-- Badges -->
+<!-- [![npm version](https://img.shields.io/npm/v/@memberjunction/ai-vector-dupe)](https://www.npmjs.com/package/@memberjunction/ai-vector-dupe) -->
+<!-- [![build](https://img.shields.io/github/actions/workflow/status/MemberJunction/MJ/ci.yml?branch=next)](https://github.com/MemberJunction/MJ/actions) -->
+**AI-powered duplicate record detection for MemberJunction entities** -- finds, scores, tracks, and optionally auto-merges duplicate records using vector similarity, hybrid search (RRF), and optional reranking.
+---
 ## Architecture
-```mermaid
-graph TD
-    subgraph DupePkg["@memberjunction/ai-vector-dupe"]
-        DRD["DuplicateRecordDetector"]
-        VSB["VectorSyncBase"]
-        ESC["EntitySyncConfig"]
-    end
-    subgraph Pipeline["Detection Pipeline"]
-        LIST["Load Records<br/>from List"] --> VECT["Vectorize Records<br/>via Templates"]
-        VECT --> EMBED["Generate<br/>Embeddings"]
-        EMBED --> QUERY["Query Vector DB<br/>for Matches"]
-        QUERY --> FILTER["Filter by<br/>Threshold"]
-        FILTER --> TRACK["Track Results<br/>in Duplicate Runs"]
-        TRACK --> MERGE["Auto-Merge<br/>Above Threshold"]
-    end
-    subgraph Dependencies["Key Dependencies"]
-        VB["ai-vectors<br/>(VectorBase)"]
-        SYNC["ai-vector-sync<br/>(EntityVectorSyncer)"]
-        VDBB["ai-vectordb<br/>(VectorDBBase)"]
-        AI["ai<br/>(BaseEmbeddings)"]
-    end
-    DRD -->|extends| VB
-    DRD --> SYNC
-    DRD --> VDBB
-    DRD --> AI
-    style DupePkg fill:#2d6a9f,stroke:#1a4971,color:#fff
-    style Pipeline fill:#2d8659,stroke:#1a5c3a,color:#fff
-    style Dependencies fill:#7c5295,stroke:#563a6b,color:#fff
 ```
+                         +--------------------------+
+                         |   DuplicateRecordDetector |
+                         |   (extends VectorBase)    |
+                         +-----+----------+---------+
+                               |          |
+              +----------------+          +----------------+
+              |                                            |
+    +---------v----------+                     +-----------v---------+
+    | GetDuplicateRecords|                     |  CheckSingleRecord  |
+    | (list-based batch) |                     |  (single record)    |
+    +--------+-----------+                     +-----------+---------+
+             |                                             |
+             +-------------------+-------------------------+
+                                 |
+                    +------------v------------+
+                    |    Detection Pipeline   |
+                    +-------------------------+
+                    | 1. Validate Entity Doc  |
+                    | 2. Vectorize records    |
+                    | 3. Embed via AI model   |
+                    | 4. Query vector DB      |
+                    |    (hybrid if supported)|
+                    | 5. Filter self-matches  |
+                    | 6. Apply thresholds     |
+                    | 7. Persist match results|
+                    | 8. Auto-merge (optional)|
+                    +-------------------------+
+                                 |
+              +------------------+------------------+
+              |                  |                   |
+    +---------v------+  +-------v--------+  +-------v--------+
+    | ai-vector-sync |  | ai-vectordb    |  | ai (Embeddings)|
+    | (vectorizer,   |  | (VectorDBBase, |  | (BaseEmbeddings|
+    |  templates)    |  |  hybrid query) |  |  GetAIAPIKey)  |
+    +----------------+  +----------------+  +----------------+
+```
+**Key dependencies:**
+| Package | Role |
+|---|---|
+| `@memberjunction/ai` | Embedding model abstraction and API key resolution |
+| `@memberjunction/ai-vectordb` | Vector database abstraction (query, hybrid search) |
+| `@memberjunction/ai-vectors` | `VectorBase` base class with metadata and RunView helpers |
+| `@memberjunction/ai-vector-sync` | `EntityVectorSyncer` for record vectorization, template parsing |
+| `@memberjunction/core` | Core types: `PotentialDuplicateRequest`, `DuplicateDetectionOptions`, etc. |
+| `@memberjunction/core-entities` | Generated entity classes for Duplicate Runs, Lists, Entity Documents |
+| `@memberjunction/global` | `MJGlobal` class factory, `UUIDsEqual` |
+---
 ## Installation
@@ -44,252 +68,249 @@ graph TD
 npm install @memberjunction/ai-vector-dupe
 ```
-## Overview
-The package provides the `DuplicateRecordDetector` class, which orchestrates a complete duplicate detection workflow:
-1. Loads records from a MemberJunction List
-2. Vectorizes them using a configured Entity Document template and embedding model
-3. Queries the vector database for similarity matches
-4. Filters results against configurable thresholds
-5. Creates Duplicate Run, Duplicate Run Detail, and Duplicate Run Detail Match records for tracking
-6. Optionally auto-merges records that exceed the absolute match threshold
-## Duplicate Detection Flow
-```mermaid
-sequenceDiagram
-    participant Caller
-    participant DRD as DuplicateRecordDetector
-    participant EVS as EntityVectorSyncer
-    participant Embed as Embedding Model
-    participant VDB as Vector Database
-    participant DB as MJ Database
-    Caller->>DRD: getDuplicateRecords(request, user)
-    DRD->>DB: Load Entity Document
-    DRD->>EVS: VectorizeEntity (ensure all records are indexed)
-    DRD->>DB: Load records from List
-    loop For each record
-        DRD->>Embed: Generate embedding from template
-        DRD->>VDB: queryIndex (topK=5)
-        VDB-->>DRD: Scored matches
-        DRD->>DRD: Filter by PotentialMatchThreshold
-        DRD->>DB: Create DuplicateRunDetailMatch records
-    end
-    DRD->>DRD: Check AbsoluteMatchThreshold
-    DRD->>DB: Auto-merge high-confidence duplicates
-    DRD-->>Caller: PotentialDuplicateResponse
-```
+---
-## Core Components
+## Quick Start
-### DuplicateRecordDetector
+### List-Based Batch Detection
+Detect duplicates across all records in an MJ List:
+```typescript
+import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
+import { PotentialDuplicateRequest } from '@memberjunction/core';
+const detector = new DuplicateRecordDetector();
+const request: PotentialDuplicateRequest = {
+    ListID: 'your-list-uuid',
+    EntityID: 'your-entity-uuid',
+    EntityDocumentID: 'your-entity-document-uuid',
+    Options: {
+        TopK: 10,
+        OnProgress: (progress) => {
+            console.log(`[${progress.Phase}] ${progress.ProcessedRecords}/${progress.TotalRecords} -- ${progress.MatchesFound} matches`);
+        },
+    },
+};
+const response = await detector.GetDuplicateRecords(request, contextUser);
+if (response.Status === 'Success') {
+    for (const result of response.PotentialDuplicateResult) {
+        console.log(`Record: ${result.RecordCompositeKey.ToString()}`);
+        for (const dupe of result.Duplicates) {
+            console.log(`  Match: ${dupe.ToString()} (${(dupe.ProbabilityScore * 100).toFixed(1)}%)`);
+        }
+    }
+}
+```
-The main class that extends `VectorBase` from `@memberjunction/ai-vectors`.
+### Single-Record Check
-**Key method:**
+Check one record for duplicates without creating a list -- ideal for server hooks (e.g., fire-and-forget after record save):
 ```typescript
-getDuplicateRecords(
-    params: PotentialDuplicateRequest,
-    contextUser?: UserInfo
-): Promise<PotentialDuplicateResponse>
+import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
+import { CompositeKey } from '@memberjunction/core';
+const detector = new DuplicateRecordDetector();
+const recordKey = new CompositeKey([{ FieldName: 'ID', Value: 'record-uuid' }]);
+const result = await detector.CheckSingleRecord(
+    'your-entity-document-uuid',
+    recordKey,
+    { TopK: 5 },
+    contextUser
+);
+for (const dupe of result.Duplicates) {
+    console.log(`Potential duplicate: ${dupe.ToString()} (score: ${dupe.ProbabilityScore})`);
+}
 ```
-**Parameters in `PotentialDuplicateRequest`:**
+---
-| Field | Type | Description |
-|---|---|---|
-| `ListID` | `string` | ID of the List containing records to check |
-| `EntityID` | `string` | ID of the entity type |
-| `EntityDocumentID` | `string` | ID of the Entity Document with vectorization template |
-| `Options.DuplicateRunID` | `string` (optional) | Resume an existing duplicate run |
+## DuplicateDetectionOptions Reference
+Options are passed via the `Options` property on `PotentialDuplicateRequest`, or directly to `CheckSingleRecord`.
+| Option | Type | Default | Description |
+|---|---|---|---|
+| `TopK` | `number` | `5` | Number of nearest neighbors to retrieve per record |
+| `DuplicateRunID` | `string` | -- | Resume an existing duplicate run (batch mode only) |
+| `KeywordSearchWeight` | `number` | `0.3` | Weight for keyword search in hybrid mode (0.0 = vector only, 1.0 = keyword only). Vector weight is `1.0 - KeywordSearchWeight`. |
+| `FusionMethod` | `string` | `'rrf'` | Fusion method for hybrid search. Currently supports `'rrf'` (Reciprocal Rank Fusion). |
+| `OnProgress` | `(progress: DuplicateDetectionProgress) => void` | -- | Callback for real-time progress reporting |
+### Thresholds (Configured on Entity Document)
-**Thresholds (configured on Entity Document):**
+Thresholds are not part of `DuplicateDetectionOptions` -- they are configured on the `EntityDocument` record itself:
 | Threshold | Purpose |
 |---|---|
-| `PotentialMatchThreshold` | Minimum similarity score to report as potential duplicate |
-| `AbsoluteMatchThreshold` | Minimum similarity score for automatic record merge |
+| `PotentialMatchThreshold` | Minimum similarity score to report a candidate as a potential duplicate |
+| `AbsoluteMatchThreshold` | Minimum similarity score to trigger automatic record merge |
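The two thresholds in the table above split a similarity score into three outcomes. A minimal sketch of that decision follows, assuming inclusive comparisons and that `AbsoluteMatchThreshold` is at least `PotentialMatchThreshold`; `classifyMatch`, `MatchDecision`, and `Thresholds` are hypothetical names used for illustration and are not exports of the package.

```typescript
// Hypothetical helper, not part of @memberjunction/ai-vector-dupe.
type MatchDecision = 'ignore' | 'potential' | 'auto-merge';

interface Thresholds {
    PotentialMatchThreshold: number; // minimum score to report a candidate
    AbsoluteMatchThreshold: number;  // minimum score to trigger an automatic merge
}

// Assumption: both comparisons are inclusive (>=), following the word "minimum" in the table.
function classifyMatch(score: number, t: Thresholds): MatchDecision {
    if (score >= t.AbsoluteMatchThreshold) return 'auto-merge';
    if (score >= t.PotentialMatchThreshold) return 'potential';
    return 'ignore';
}

const thresholds: Thresholds = { PotentialMatchThreshold: 0.8, AbsoluteMatchThreshold: 0.95 };
console.log(classifyMatch(0.97, thresholds)); // auto-merge
console.log(classifyMatch(0.85, thresholds)); // potential
console.log(classifyMatch(0.5, thresholds));  // ignore
```

Checking the higher threshold first keeps the two bands from overlapping, so a score that qualifies for merging is never reported as only a potential match.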
-### VectorSyncBase
+---
-A utility base class providing helper methods for vector synchronization operations:
+## Hybrid Search and Reciprocal Rank Fusion (RRF)
-- `parseStringTemplate(str, obj)` -- simple template variable substitution
-- `timer(ms)` -- async delay
-- `start()` / `end()` / `timeDiff()` -- execution timing
-- `saveJSONData(data, path)` -- JSON file output
+When the configured vector database supports hybrid search (`VectorDBBase.SupportsHybridSearch === true`), the detector automatically combines **vector similarity** and **keyword search** for higher-quality results.
-### EntitySyncConfig
+### How It Works
-Configuration type for entity synchronization scheduling:
+1. The record's template text is sent as both a vector embedding and a keyword query.
+2. The vector DB returns results from both retrieval methods.
+3. Results are fused using **Reciprocal Rank Fusion (RRF)**, a rank-based algorithm that is score-scale independent.
+### RRF Formula
-```typescript
-type EntitySyncConfig = {
-    EntityDocumentID: string;     // Entity Document to use
-    Interval: number;             // Sync interval in seconds
-    RunViewParams: RunViewParams; // View parameters for fetching
-    IncludeInSync: boolean;       // Whether to include in sync
-    LastRunDate: string;          // Last sync timestamp
-    VectorIndexID: number;        // Vector index ID
-    VectorID: number;             // Vector database ID
-};
 ```
+FusedScore(d) = SUM_i [ 1 / (k + rank_i(d)) ]
+```
+Where `rank_i(d)` is the 1-based rank of document `d` in list `i`, and `k` is a smoothing constant (default: 60).
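The formula can be sketched in a few lines. This is a standalone illustration of Reciprocal Rank Fusion as defined above, not the source of the package's `ComputeRRF`; `rrfSketch` and `Candidate` are hypothetical names, and the sketch assumes each input list is already sorted best-first.

```typescript
// Illustrative only: a standalone sketch of the RRF formula, not the package's implementation.
interface Candidate {
    ID: string;
    Score: number;
}

function rrfSketch(rankedLists: Candidate[][], k = 60): Candidate[] {
    const fused = new Map<string, number>();
    for (const list of rankedLists) {
        // Assumption: each list is sorted best-first, so index + 1 is the 1-based rank.
        list.forEach((candidate, index) => {
            const contribution = 1 / (k + index + 1);
            fused.set(candidate.ID, (fused.get(candidate.ID) ?? 0) + contribution);
        });
    }
    // Only ranks feed the fused score; the input Score values are never read.
    return Array.from(fused, ([ID, Score]) => ({ ID, Score }))
        .sort((a, b) => b.Score - a.Score);
}

const byVector: Candidate[] = [
    { ID: 'a', Score: 0.91 },
    { ID: 'b', Score: 0.88 },
    { ID: 'c', Score: 0.8 },
];
const byKeyword: Candidate[] = [
    { ID: 'b', Score: 14.2 },
    { ID: 'c', Score: 9.7 },
];
console.log(rrfSketch([byVector, byKeyword]).map((c) => c.ID)); // [ 'b', 'c', 'a' ]
```

Here `a` tops the vector list but is missing from the keyword list, so it collects a single term, 1/61. `b` and `c` appear in both lists and each collect two terms, which places them ahead of `a`.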
-## Usage
+### Using ComputeRRF Directly
-### Basic Duplicate Detection
+The `ComputeRRF` utility is exported for use in custom pipelines:
 ```typescript
-import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
-import { PotentialDuplicateRequest, UserInfo } from '@memberjunction/core';
+import { ComputeRRF, ScoredCandidate } from '@memberjunction/ai-vector-dupe';
+const vectorResults: ScoredCandidate[] = [
+    { ID: 'rec-1', Score: 0.95 },
+    { ID: 'rec-2', Score: 0.87 },
+    { ID: 'rec-3', Score: 0.82 },
+];
+const keywordResults: ScoredCandidate[] = [
+    { ID: 'rec-2', Score: 12.5 },  // Different scale -- RRF handles this
+    { ID: 'rec-4', Score: 10.1 },
+    { ID: 'rec-1', Score: 8.3 },
+];
+const fused = ComputeRRF([vectorResults, keywordResults], 60);
+// Results sorted by fused RRF score, score-scale independent
+```
-const detector = new DuplicateRecordDetector();
+### Tuning Hybrid Search
-const request: PotentialDuplicateRequest = {
-    ListID: 'list-uuid',
-    EntityID: 'entity-uuid',
-    EntityDocumentID: 'doc-uuid'
-};
+- **`KeywordSearchWeight = 0.0`**: Pure vector similarity (semantic matching).
+- **`KeywordSearchWeight = 0.3`** (default): Slight keyword boost. Good for entities with distinctive names or codes.
+- **`KeywordSearchWeight = 0.5`**: Equal weight. Useful when both semantic and lexical matches matter.
+- **`KeywordSearchWeight = 1.0`**: Pure keyword search (not recommended for duplicate detection).
-const response = await detector.getDuplicateRecords(request, currentUser);
+---
-if (response.Status === 'Success') {
-    for (const result of response.PotentialDuplicateResult) {
-        console.log(`Record: ${result.RecordCompositeKey.ToString()}`);
-        for (const dupe of result.Duplicates) {
-            console.log(`  Match: ${dupe.ToString()} (${(dupe.ProbabilityScore * 100).toFixed(1)}%)`);
-        }
-    }
-}
-```
+## Reranking
+When MJ's `BaseReranker` / `RerankerService` is configured, the detector can apply a second-stage reranking pass after initial retrieval. Reranking uses a cross-encoder model to re-score candidates with higher precision than embedding-based similarity alone.
+Reranking is especially effective when:
+- Initial retrieval returns many borderline candidates
+- Entity records have complex, multi-field structures
+- You need to maximize precision at the cost of slightly higher latency
+See the [Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md#reranking-integration) for configuration details.
+---
-### Resuming an Existing Run
+## Progress Reporting
+The `OnProgress` callback fires at each phase of the pipeline:
 ```typescript
 const request: PotentialDuplicateRequest = {
-    ListID: 'list-uuid',
-    EntityID: 'entity-uuid',
-    EntityDocumentID: 'doc-uuid',
+    // ...
     Options: {
-        DuplicateRunID: 'existing-run-uuid'
-    }
+        OnProgress: (progress) => {
+            const { Phase, TotalRecords, ProcessedRecords, MatchesFound, ElapsedMs } = progress;
+            const pct = TotalRecords > 0 ? ((ProcessedRecords / TotalRecords) * 100).toFixed(0) : '0';
+            console.log(`[${Phase}] ${pct}% -- ${MatchesFound} matches (${ElapsedMs}ms)`);
+        },
+    },
 };
-const response = await detector.getDuplicateRecords(request, currentUser);
 ```
-## Database Entities Used
-The package reads from and writes to these MemberJunction entities:
-```mermaid
-erDiagram
-    DUPLICATE_RUN {
-        string ID PK
-        string EntityID
-        string StartedByUserID
-        datetime StartedAt
-        datetime EndedAt
-        string ProcessingStatus
-        string ApprovalStatus
-        string SourceListID
-    }
+### Progress Phases
-    DUPLICATE_RUN_DETAIL {
-        string ID PK
-        string DuplicateRunID FK
-        string RecordID
-        string MatchStatus
-        string MergeStatus
-    }
+| Phase | Description |
+|---|---|
+| `Vectorizing` | Records are being vectorized via `EntityVectorSyncer` |
+| `Embedding` | Template texts are being embedded via the AI model |
+| `Querying` | Vector DB is being queried for each record |
+| `Matching` | Results are being persisted and match records created |
+| `Merging` | High-confidence matches are being auto-merged |
-    DUPLICATE_RUN_DETAIL_MATCH {
-        string ID PK
-        string DuplicateRunDetailID FK
-        string MatchRecordID
-        float MatchProbability
-        datetime MatchedAt
-        string Action
-        string ApprovalStatus
-        string MergeStatus
-    }
+### DuplicateDetectionProgress Shape
-    LIST {
-        string ID PK
-        string Name
-        string EntityID
-    }
+```typescript
+interface DuplicateDetectionProgress {
+    Phase: 'Vectorizing' | 'Embedding' | 'Querying' | 'Matching' | 'Merging';
+    TotalRecords: number;
+    ProcessedRecords: number;
+    MatchesFound: number;
+    CurrentRecordID?: string;
+    ElapsedMs: number;
+}
+```
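As an illustration of consuming this shape, the sketch below formats one progress event and adds a naive remaining-time estimate. `describeProgress` is a hypothetical helper, the interface is re-declared locally so the block is self-contained, and the estimate assumes `ElapsedMs` covers the time spent on the records counted in `ProcessedRecords`, which the README does not state.

```typescript
// Local copy of the documented shape so the sketch compiles on its own.
interface DuplicateDetectionProgress {
    Phase: 'Vectorizing' | 'Embedding' | 'Querying' | 'Matching' | 'Merging';
    TotalRecords: number;
    ProcessedRecords: number;
    MatchesFound: number;
    CurrentRecordID?: string;
    ElapsedMs: number;
}

// Hypothetical helper, not part of the package.
function describeProgress(p: DuplicateDetectionProgress): string {
    const fraction = p.TotalRecords > 0 ? p.ProcessedRecords / p.TotalRecords : 0;
    const pct = Math.round(fraction * 100);
    // Linear extrapolation; undefined until at least one record has been processed.
    const remainingMs = p.ProcessedRecords > 0
        ? Math.round((p.ElapsedMs / p.ProcessedRecords) * (p.TotalRecords - p.ProcessedRecords))
        : undefined;
    const eta = remainingMs === undefined ? 'n/a' : `${remainingMs}ms`;
    return `[${p.Phase}] ${pct}% (${p.ProcessedRecords}/${p.TotalRecords}), ` +
        `${p.MatchesFound} matches, est. remaining ${eta}`;
}

console.log(describeProgress({
    Phase: 'Querying',
    TotalRecords: 200,
    ProcessedRecords: 50,
    MatchesFound: 7,
    ElapsedMs: 1000,
}));
// [Querying] 25% (50/200), 7 matches, est. remaining 3000ms
```

A function of this shape could be passed as `OnProgress: (p) => console.log(describeProgress(p))`. The guards on `TotalRecords` and `ProcessedRecords` avoid division by zero on the first event of a run.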
-    LIST_DETAIL {
-        string ID PK
-        string ListID FK
-        string RecordID
-    }
+---
-    ENTITY_DOCUMENT {
-        string ID PK
-        string EntityID
-        string TemplateID
-        string AIModelID
-        string VectorDatabaseID
-        float PotentialMatchThreshold
-        float AbsoluteMatchThreshold
-    }
+## API Reference Summary
-    DUPLICATE_RUN ||--o{ DUPLICATE_RUN_DETAIL : contains
-    DUPLICATE_RUN_DETAIL ||--o{ DUPLICATE_RUN_DETAIL_MATCH : has
-    DUPLICATE_RUN }o--|| LIST : "source"
-    LIST ||--o{ LIST_DETAIL : contains
-```
+### DuplicateRecordDetector
-## Environment Variables
+| Method | Signature | Description |
+|---|---|---|
+| `GetDuplicateRecords` | `(params: PotentialDuplicateRequest, contextUser?: UserInfo) => Promise<PotentialDuplicateResponse>` | Run batch duplicate detection for all records in a list |
+| `CheckSingleRecord` | `(EntityDocumentID: string, RecordID: CompositeKey, Options?: DuplicateDetectionOptions, ContextUser?: UserInfo) => Promise<PotentialDuplicateResult>` | Check a single record for duplicates |
+| `ParseVectorMatches` | `(queryResponse: BaseResponse, sourceKey?: CompositeKey) => PotentialDuplicateResult` | Parse raw vector DB response into typed results |
-```env
-# AI Model API Keys
-OPENAI_API_KEY=your-openai-key
-MISTRAL_API_KEY=your-mistral-key
+### ComputeRRF
-# Vector Database
-PINECONE_API_KEY=your-pinecone-key
-PINECONE_HOST=your-pinecone-host
-PINECONE_DEFAULT_INDEX=your-index-name
+```typescript
+function ComputeRRF(rankedLists: ScoredCandidate[][], k?: number): ScoredCandidate[]
+```
+Compute Reciprocal Rank Fusion across multiple ranked result lists. Returns candidates sorted by descending fused score.
-# Database Connection
-DB_HOST=your-sql-server
-DB_PORT=1433
-DB_USERNAME=your-username
-DB_PASSWORD=your-password
-DB_DATABASE=your-database
+### ScoredCandidate
-# User Context
-CURRENT_USER_EMAIL=user@example.com
+```typescript
+interface ScoredCandidate {
+    ID: string;
+    Score: number;
+    Metadata?: Record<string, unknown>;
+}
 ```
-## Dependencies
+---
+## Database Entities
+The package reads from and writes to these MJ entities:
-| Package | Purpose |
+| Entity | Purpose |
 |---|---|
-| `@memberjunction/ai` | `BaseEmbeddings`, `GetAIAPIKey` |
-| `@memberjunction/ai-vectordb` | `VectorDBBase`, `BaseResponse` |
-| `@memberjunction/ai-vectors` | `VectorBase` base class |
-| `@memberjunction/ai-vectors-pinecone` | Pinecone implementation |
-| `@memberjunction/ai-vector-sync` | `EntityVectorSyncer`, `EntityDocumentTemplateParser` |
-| `@memberjunction/aiengine` | AI engine integration |
-| `@memberjunction/core` | Core MJ types and data access |
-| `@memberjunction/core-entities` | Entity type definitions |
-| `@memberjunction/global` | MJGlobal class factory |
-## Limitations
-- Duplicate detection operates within a single entity type
-- Requires pre-configured Entity Documents with templates
-- Currently supports Pinecone as the vector database provider
-- Records must be added to a List before detection can run
+| `MJ: Entity Documents` | Configuration: template, AI model, vector DB, thresholds |
+| `MJ: Lists` / `MJ: List Details` | Source records to check for duplicates |
+| `MJ: Duplicate Runs` | Tracks each detection run (status, timing) |
+| `MJ: Duplicate Run Details` | Per-record tracking within a run |
+| `MJ: Duplicate Run Detail Matches` | Individual match results with probability scores |
+---
+## Further Reading
+- **[Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md)** -- comprehensive developer guide covering end-to-end workflow, threshold tuning, hybrid search deep dive, performance optimization, and troubleshooting
+- **[MemberJunction AI Vectors](../Core/README.md)** -- base vector infrastructure
+- **[AI Vector Sync](../Sync/README.md)** -- entity vectorization and template parsing
+---
 ## Development
@@ -297,8 +318,11 @@ CURRENT_USER_EMAIL=user@example.com
 # Build
 npm run build
-# Development mode
-npm run start
+# Run tests
+npm run test
+# Watch mode
+npm run test:watch
 ```
 ## License