@memberjunction/ai-vector-dupe 5.21.0 → 5.23.0

package/README.md CHANGED
@@ -1,42 +1,66 @@
1
1
  # @memberjunction/ai-vector-dupe
2
2
 
3
- AI-powered duplicate record detection for MemberJunction entities. This package uses vector embeddings and similarity search to find potential duplicate records, track detection runs, and optionally auto-merge high-confidence matches.
3
+ <!-- Badges -->
4
+ <!-- [![npm version](https://img.shields.io/npm/v/@memberjunction/ai-vector-dupe)](https://www.npmjs.com/package/@memberjunction/ai-vector-dupe) -->
5
+ <!-- [![build](https://img.shields.io/github/actions/workflow/status/MemberJunction/MJ/ci.yml?branch=next)](https://github.com/MemberJunction/MJ/actions) -->
6
+
7
+ **AI-powered duplicate record detection for MemberJunction entities** -- finds, scores, tracks, and optionally auto-merges duplicate records using vector similarity, hybrid search (RRF), and optional reranking.
8
+
9
+ ---
4
10
 
5
11
  ## Architecture
6
12
 
7
- ```mermaid
8
- graph TD
9
- subgraph DupePkg["@memberjunction/ai-vector-dupe"]
10
- DRD["DuplicateRecordDetector"]
11
- VSB["VectorSyncBase"]
12
- ESC["EntitySyncConfig"]
13
- end
14
-
15
- subgraph Pipeline["Detection Pipeline"]
16
- LIST["Load Records<br/>from List"] --> VECT["Vectorize Records<br/>via Templates"]
17
- VECT --> EMBED["Generate<br/>Embeddings"]
18
- EMBED --> QUERY["Query Vector DB<br/>for Matches"]
19
- QUERY --> FILTER["Filter by<br/>Threshold"]
20
- FILTER --> TRACK["Track Results<br/>in Duplicate Runs"]
21
- TRACK --> MERGE["Auto-Merge<br/>Above Threshold"]
22
- end
23
-
24
- subgraph Dependencies["Key Dependencies"]
25
- VB["ai-vectors<br/>(VectorBase)"]
26
- SYNC["ai-vector-sync<br/>(EntityVectorSyncer)"]
27
- VDBB["ai-vectordb<br/>(VectorDBBase)"]
28
- AI["ai<br/>(BaseEmbeddings)"]
29
- end
30
-
31
- DRD -->|extends| VB
32
- DRD --> SYNC
33
- DRD --> VDBB
34
- DRD --> AI
35
-
36
- style DupePkg fill:#2d6a9f,stroke:#1a4971,color:#fff
37
- style Pipeline fill:#2d8659,stroke:#1a5c3a,color:#fff
38
- style Dependencies fill:#7c5295,stroke:#563a6b,color:#fff
39
13
  ```
14
+                      +--------------------------+
+                      | DuplicateRecordDetector  |
+                      |   (extends VectorBase)   |
+                      +-----+----------+---------+
+                            |          |
+           +----------------+          +----------------+
+           |                                            |
+ +---------v----------+                     +-----------v---------+
+ | GetDuplicateRecords|                     | CheckSingleRecord   |
+ | (list-based batch) |                     | (single record)     |
+ +--------+-----------+                     +-----------+---------+
+          |                                             |
+          +-------------------+-------------------------+
+                              |
+                 +------------v------------+
+                 |   Detection Pipeline    |
+                 +-------------------------+
+                 | 1. Validate Entity Doc  |
+                 | 2. Vectorize records    |
+                 | 3. Embed via AI model   |
+                 | 4. Query vector DB      |
+                 |    (hybrid if supported)|
+                 | 5. Filter self-matches  |
+                 | 6. Apply thresholds     |
+                 | 7. Persist match results|
+                 | 8. Auto-merge (optional)|
+                 +-------------------------+
+                              |
+           +------------------+------------------+
+           |                  |                  |
+ +---------v------+   +-------v--------+ +-------v--------+
+ | ai-vector-sync |   | ai-vectordb    | | ai (Embeddings)|
+ | (vectorizer,   |   | (VectorDBBase, | | (BaseEmbeddings|
+ |  templates)    |   |  hybrid query) | |  GetAIAPIKey)  |
+ +----------------+   +----------------+ +----------------+
49
+ ```
50
+
51
+ **Key dependencies:**
52
+
53
+ | Package | Role |
54
+ |---|---|
55
+ | `@memberjunction/ai` | Embedding model abstraction and API key resolution |
56
+ | `@memberjunction/ai-vectordb` | Vector database abstraction (query, hybrid search) |
57
+ | `@memberjunction/ai-vectors` | `VectorBase` base class with metadata and RunView helpers |
58
+ | `@memberjunction/ai-vector-sync` | `EntityVectorSyncer` for record vectorization, template parsing |
59
+ | `@memberjunction/core` | Core types: `PotentialDuplicateRequest`, `DuplicateDetectionOptions`, etc. |
60
+ | `@memberjunction/core-entities` | Generated entity classes for Duplicate Runs, Lists, Entity Documents |
61
+ | `@memberjunction/global` | `MJGlobal` class factory, `UUIDsEqual` |
62
+
63
+ ---
40
64
 
41
65
  ## Installation
42
66
 
@@ -44,252 +68,262 @@ graph TD
44
68
  npm install @memberjunction/ai-vector-dupe
45
69
  ```
46
70
 
47
- ## Overview
48
-
49
- The package provides the `DuplicateRecordDetector` class, which orchestrates a complete duplicate detection workflow:
50
-
51
- 1. Loads records from a MemberJunction List
52
- 2. Vectorizes them using a configured Entity Document template and embedding model
53
- 3. Queries the vector database for similarity matches
54
- 4. Filters results against configurable thresholds
55
- 5. Creates Duplicate Run, Duplicate Run Detail, and Duplicate Run Detail Match records for tracking
56
- 6. Optionally auto-merges records that exceed the absolute match threshold
57
-
58
- ## Duplicate Detection Flow
59
-
60
- ```mermaid
61
- sequenceDiagram
62
- participant Caller
63
- participant DRD as DuplicateRecordDetector
64
- participant EVS as EntityVectorSyncer
65
- participant Embed as Embedding Model
66
- participant VDB as Vector Database
67
- participant DB as MJ Database
68
-
69
- Caller->>DRD: getDuplicateRecords(request, user)
70
- DRD->>DB: Load Entity Document
71
- DRD->>EVS: VectorizeEntity (ensure all records are indexed)
72
- DRD->>DB: Load records from List
73
-
74
- loop For each record
75
- DRD->>Embed: Generate embedding from template
76
- DRD->>VDB: queryIndex (topK=5)
77
- VDB-->>DRD: Scored matches
78
- DRD->>DRD: Filter by PotentialMatchThreshold
79
- DRD->>DB: Create DuplicateRunDetailMatch records
80
- end
81
-
82
- DRD->>DRD: Check AbsoluteMatchThreshold
83
- DRD->>DB: Auto-merge high-confidence duplicates
84
- DRD-->>Caller: PotentialDuplicateResponse
85
- ```
71
+ ---
86
72
 
87
- ## Core Components
73
+ ## Quick Start
88
74
 
89
- ### DuplicateRecordDetector
75
+ ### List-Based Batch Detection
76
+
77
+ Detect duplicates across all records in an MJ List:
78
+
79
+ ```typescript
80
+ import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
+ import { PotentialDuplicateRequest } from '@memberjunction/core';
+
+ const detector = new DuplicateRecordDetector();
+
+ const request: PotentialDuplicateRequest = {
+   ListID: 'your-list-uuid',
+   EntityID: 'your-entity-uuid',
+   EntityDocumentID: 'your-entity-document-uuid',
+   Options: {
+     TopK: 10,
+     OnProgress: (progress) => {
+       console.log(`[${progress.Phase}] ${progress.ProcessedRecords}/${progress.TotalRecords} -- ${progress.MatchesFound} matches`);
+     },
+   },
+ };
+
+ const response = await detector.GetDuplicateRecords(request, contextUser);
+
+ if (response.Status === 'Success') {
+   for (const result of response.PotentialDuplicateResult) {
+     console.log(`Record: ${result.RecordCompositeKey.ToString()}`);
+     for (const dupe of result.Duplicates) {
+       console.log(`  Match: ${dupe.ToString()} (${(dupe.ProbabilityScore * 100).toFixed(1)}%)`);
+     }
+   }
+ }
107
+ ```
90
108
 
91
- The main class that extends `VectorBase` from `@memberjunction/ai-vectors`.
109
+ ### Single-Record Check
92
110
 
93
- **Key method:**
111
+ Check one record for duplicates without creating a list -- ideal for server hooks (e.g., fire-and-forget after record save):
94
112
 
95
113
  ```typescript
96
- getDuplicateRecords(
97
- params: PotentialDuplicateRequest,
98
- contextUser?: UserInfo
99
- ): Promise<PotentialDuplicateResponse>
114
+ import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
+ import { CompositeKey } from '@memberjunction/core';
+
+ const detector = new DuplicateRecordDetector();
+
+ const recordKey = new CompositeKey([{ FieldName: 'ID', Value: 'record-uuid' }]);
+
+ const result = await detector.CheckSingleRecord(
+   'your-entity-document-uuid',
+   recordKey,
+   { TopK: 5 },
+   contextUser
+ );
+
+ for (const dupe of result.Duplicates) {
+   console.log(`Potential duplicate: ${dupe.ToString()} (score: ${dupe.ProbabilityScore})`);
+ }
100
131
  ```
101
132
 
102
- **Parameters in `PotentialDuplicateRequest`:**
133
+ ---
103
134
 
104
- | Field | Type | Description |
105
- |---|---|---|
106
- | `ListID` | `string` | ID of the List containing records to check |
107
- | `EntityID` | `string` | ID of the entity type |
108
- | `EntityDocumentID` | `string` | ID of the Entity Document with vectorization template |
109
- | `Options.DuplicateRunID` | `string` (optional) | Resume an existing duplicate run |
135
+ ## DuplicateDetectionOptions Reference
136
+
137
+ Options are passed via the `Options` property on `PotentialDuplicateRequest`, or directly to `CheckSingleRecord`.
110
138
 
111
- **Thresholds (configured on Entity Document):**
139
+ | Option | Type | Default | Description |
140
+ |---|---|---|---|
141
+ | `TopK` | `number` | `5` | Number of nearest neighbors to retrieve per record |
142
+ | `DuplicateRunID` | `string` | -- | Resume an existing duplicate run (batch mode only) |
143
+ | `KeywordSearchWeight` | `number` | `0.3` | Weight for keyword search in hybrid mode (0.0 = vector only, 1.0 = keyword only). Vector weight is `1.0 - KeywordSearchWeight`. |
144
+ | `FusionMethod` | `string` | `'rrf'` | Fusion method for hybrid search. Currently supports `'rrf'` (Reciprocal Rank Fusion). |
145
+ | `PotentialMatchThreshold` | `number` | -- | Override the EntityDocument's PotentialMatchThreshold for this run |
146
+ | `AbsoluteMatchThreshold` | `number` | -- | Override the EntityDocument's AbsoluteMatchThreshold for this run |
147
+ | `OnProgress` | `(progress: DuplicateDetectionProgress) => void` | -- | Callback for real-time progress reporting |
148
+
149
+ ### Thresholds
150
+
151
+ Thresholds can be configured at two levels -- on the `EntityDocument` record (default) or overridden per-run via `DuplicateDetectionOptions`. When threshold overrides are provided in the options, they take precedence over the EntityDocument values.
112
152
 
113
153
  | Threshold | Purpose |
114
154
  |---|---|
115
- | `PotentialMatchThreshold` | Minimum similarity score to report as potential duplicate |
116
- | `AbsoluteMatchThreshold` | Minimum similarity score for automatic record merge |
155
+ | `PotentialMatchThreshold` | Minimum similarity score to report a candidate as a potential duplicate |
156
+ | `AbsoluteMatchThreshold` | Minimum similarity score to trigger automatic record merge |
117
157
 
118
- ### VectorSyncBase
158
+ A server hook normalizes `1.0` thresholds to sensible defaults (`0.70` for potential, `0.95` for absolute) to prevent degenerate behavior when thresholds are left at the maximum.
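The precedence described above can be sketched as follows. This is a minimal illustration, not the package's code: `resolveThreshold`, `classifyScore`, and the `ThresholdSource` shape are hypothetical names, and the sketch assumes that a per-run override is checked first and that a stored value of exactly `1.0` is swapped for the documented default only when no override is given.

```typescript
// Hypothetical shape for this sketch; field names mirror the options table above.
interface ThresholdSource {
  PotentialMatchThreshold?: number;
  AbsoluteMatchThreshold?: number;
}

// Per-run override wins; otherwise fall back to the Entity Document value.
// Assumption: a stored value of exactly 1.0 is replaced by the given default.
function resolveThreshold(
  override: number | undefined,
  documentValue: number,
  fallback: number
): number {
  if (override !== undefined) {
    return override;
  }
  return documentValue === 1.0 ? fallback : documentValue;
}

// 'merge' marks a candidate for auto-merge, 'potential' a reportable match.
type MatchOutcome = 'merge' | 'potential' | 'ignore';

function classifyScore(
  score: number,
  options: ThresholdSource,
  entityDocument: Required<ThresholdSource>
): MatchOutcome {
  const potential = resolveThreshold(
    options.PotentialMatchThreshold, entityDocument.PotentialMatchThreshold, 0.70);
  const absolute = resolveThreshold(
    options.AbsoluteMatchThreshold, entityDocument.AbsoluteMatchThreshold, 0.95);
  if (score >= absolute) return 'merge';
  if (score >= potential) return 'potential';
  return 'ignore';
}

// Entity Document left at the maximum, so the defaults (0.70 / 0.95) apply.
const documentDefaults = { PotentialMatchThreshold: 1.0, AbsoluteMatchThreshold: 1.0 };
const highScore = classifyScore(0.97, {}, documentDefaults);
const midScore = classifyScore(0.80, {}, documentDefaults);
// A per-run override of 0.9 takes precedence over the Entity Document value.
const overridden = classifyScore(0.80, { PotentialMatchThreshold: 0.9 }, documentDefaults);
console.log(highScore, midScore, overridden);
```

With both Entity Document thresholds left at `1.0`, a score of `0.80` is reported as a potential match under the defaults, but is ignored once the run overrides `PotentialMatchThreshold` to `0.9`.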
119
159
 
120
- A utility base class providing helper methods for vector synchronization operations:
160
+ ---
121
161
 
122
- - `parseStringTemplate(str, obj)` -- simple template variable substitution
123
- - `timer(ms)` -- async delay
124
- - `start()` / `end()` / `timeDiff()` -- execution timing
125
- - `saveJSONData(data, path)` -- JSON file output
162
+ ## Hybrid Search and Reciprocal Rank Fusion (RRF)
126
163
 
127
- ### EntitySyncConfig
164
+ When the configured vector database supports hybrid search (`VectorDBBase.SupportsHybridSearch === true`), the detector automatically combines **vector similarity** and **keyword search** for higher-quality results.
128
165
 
129
- Configuration type for entity synchronization scheduling:
166
+ ### How It Works
167
+
168
+ 1. The record's template text is sent as both a vector embedding and a keyword query.
169
+ 2. The vector DB returns results from both retrieval methods.
170
+ 3. Results are fused using **Reciprocal Rank Fusion (RRF)**, a rank-based algorithm that is score-scale independent.
171
+
172
+ ### RRF Formula
130
173
 
131
- ```typescript
132
- type EntitySyncConfig = {
133
- EntityDocumentID: string; // Entity Document to use
134
- Interval: number; // Sync interval in seconds
135
- RunViewParams: RunViewParams; // View parameters for fetching
136
- IncludeInSync: boolean; // Whether to include in sync
137
- LastRunDate: string; // Last sync timestamp
138
- VectorIndexID: number; // Vector index ID
139
- VectorID: number; // Vector database ID
140
- };
141
174
  ```
175
+ FusedScore(d) = SUM_i [ 1 / (k + rank_i(d)) ]
176
+ ```
177
+
178
+ Where `rank_i(d)` is the 1-based rank of document `d` in list `i`, and `k` is a smoothing constant (default: 60).
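As a rough illustration of the formula above, the following standalone sketch fuses ranked lists by position only. It is not the package's `ComputeRRF` implementation: the `Candidate` type and the `rrfSketch` function are hypothetical names, and the sketch assumes each input list is already sorted best-first and that a document missing from a list contributes nothing for that list.

```typescript
// Hypothetical candidate shape for this sketch (not the package's ScoredCandidate).
interface Candidate {
  ID: string;
  Score: number;
}

// Fuse ranked lists with Reciprocal Rank Fusion.
// Assumes each list is ordered best-first; the raw scores are ignored.
function rrfSketch(rankedLists: Candidate[][], k: number = 60): Candidate[] {
  const fused = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((candidate, index) => {
      const rank = index + 1; // 1-based rank, as in the formula
      const previous = fused.get(candidate.ID) ?? 0;
      fused.set(candidate.ID, previous + 1 / (k + rank));
    });
  }
  // Emit one entry per distinct ID, highest fused score first.
  return Array.from(fused, ([ID, Score]) => ({ ID, Score }))
    .sort((a, b) => b.Score - a.Score);
}

const vectorHits: Candidate[] = [
  { ID: 'rec-1', Score: 0.95 },
  { ID: 'rec-2', Score: 0.87 },
  { ID: 'rec-3', Score: 0.82 },
];
const keywordHits: Candidate[] = [
  { ID: 'rec-2', Score: 12.5 },
  { ID: 'rec-4', Score: 10.1 },
  { ID: 'rec-1', Score: 8.3 },
];

const fusedHits = rrfSketch([vectorHits, keywordHits]);
console.log(fusedHits.map((c) => c.ID));
```

Here `rec-2` comes out first: it sits at ranks 2 and 1, giving `1/62 + 1/61`, slightly more than `rec-1` at ranks 1 and 3 (`1/61 + 1/63`). The two single-list documents follow, `rec-4` (`1/62`) ahead of `rec-3` (`1/63`).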
142
179
 
143
- ## Usage
180
+ ### Using ComputeRRF Directly
144
181
 
145
- ### Basic Duplicate Detection
182
+ The `ComputeRRF` utility is exported for use in custom pipelines:
146
183
 
147
184
  ```typescript
148
- import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
149
- import { PotentialDuplicateRequest, UserInfo } from '@memberjunction/core';
185
+ import { ComputeRRF, ScoredCandidate } from '@memberjunction/ai-vector-dupe';
+
+ const vectorResults: ScoredCandidate[] = [
+   { ID: 'rec-1', Score: 0.95 },
+   { ID: 'rec-2', Score: 0.87 },
+   { ID: 'rec-3', Score: 0.82 },
+ ];
+
+ const keywordResults: ScoredCandidate[] = [
+   { ID: 'rec-2', Score: 12.5 }, // Different scale -- RRF handles this
+   { ID: 'rec-4', Score: 10.1 },
+   { ID: 'rec-1', Score: 8.3 },
+ ];
+
+ const fused = ComputeRRF([vectorResults, keywordResults], 60);
+ // Results sorted by fused RRF score, score-scale independent
201
+ ```
150
202
 
151
- const detector = new DuplicateRecordDetector();
203
+ ### Tuning Hybrid Search
152
204
 
153
- const request: PotentialDuplicateRequest = {
154
- ListID: 'list-uuid',
155
- EntityID: 'entity-uuid',
156
- EntityDocumentID: 'doc-uuid'
157
- };
205
+ - **`KeywordSearchWeight = 0.0`**: Pure vector similarity (semantic matching).
206
+ - **`KeywordSearchWeight = 0.3`** (default): Slight keyword boost. Good for entities with distinctive names or codes.
207
+ - **`KeywordSearchWeight = 0.5`**: Equal weight. Useful when both semantic and lexical matches matter.
208
+ - **`KeywordSearchWeight = 1.0`**: Pure keyword search (not recommended for duplicate detection).
158
209
 
159
- const response = await detector.getDuplicateRecords(request, currentUser);
210
+ ---
160
211
 
161
- if (response.Status === 'Success') {
162
- for (const result of response.PotentialDuplicateResult) {
163
- console.log(`Record: ${result.RecordCompositeKey.ToString()}`);
164
- for (const dupe of result.Duplicates) {
165
- console.log(` Match: ${dupe.ToString()} (${(dupe.ProbabilityScore * 100).toFixed(1)}%)`);
166
- }
167
- }
168
- }
169
- ```
212
+ ## Reranking
213
+
214
+ When MJ's `BaseReranker` / `RerankerService` is configured, the detector can apply a second-stage reranking pass after initial retrieval. Reranking uses a cross-encoder model to re-score candidates with higher precision than embedding-based similarity alone.
215
+
216
+ Reranking is especially effective when:
217
+ - Initial retrieval returns many borderline candidates
218
+ - Entity records have complex, multi-field structures
219
+ - You need to maximize precision at the cost of slightly higher latency
220
+
221
+ See the [Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md#reranking-integration) for configuration details.
222
+
223
+ ---
224
+
225
+ ## Progress Reporting
170
226
 
171
- ### Resuming an Existing Run
227
+ The `OnProgress` callback fires at each phase of the pipeline:
172
228
 
173
229
  ```typescript
174
230
  const request: PotentialDuplicateRequest = {
175
- ListID: 'list-uuid',
176
- EntityID: 'entity-uuid',
177
- EntityDocumentID: 'doc-uuid',
231
+ // ...
178
232
  Options: {
179
- DuplicateRunID: 'existing-run-uuid'
180
- }
233
+ OnProgress: (progress) => {
234
+ const { Phase, TotalRecords, ProcessedRecords, MatchesFound, ElapsedMs } = progress;
235
+ const pct = TotalRecords > 0 ? ((ProcessedRecords / TotalRecords) * 100).toFixed(0) : '0';
236
+ console.log(`[${Phase}] ${pct}% -- ${MatchesFound} matches (${ElapsedMs}ms)`);
237
+ },
238
+ },
181
239
  };
240
+ ```
241
+
242
+ ### Progress Phases
243
+
244
+ | Phase | Description |
245
+ |---|---|
246
+ | `Vectorizing` | Records are being vectorized via `EntityVectorSyncer` |
247
+ | `Embedding` | Template texts are being embedded via the AI model |
248
+ | `Querying` | Vector DB is being queried for each record |
249
+ | `Matching` | Results are being persisted and match records created |
250
+ | `Merging` | High-confidence matches are being auto-merged |
251
+
252
+ ### DuplicateDetectionProgress Shape
182
253
 
183
- const response = await detector.getDuplicateRecords(request, currentUser);
254
+ ```typescript
255
+ interface DuplicateDetectionProgress {
+   Phase: 'Vectorizing' | 'Embedding' | 'Querying' | 'Matching' | 'Merging';
+   TotalRecords: number;
+   ProcessedRecords: number;
+   MatchesFound: number;
+   CurrentRecordID?: string;
+   ElapsedMs: number;
+ }
184
263
  ```
185
264
 
186
- ## Database Entities Used
187
-
188
- The package reads from and writes to these MemberJunction entities:
189
-
190
- ```mermaid
191
- erDiagram
192
- DUPLICATE_RUN {
193
- string ID PK
194
- string EntityID
195
- string StartedByUserID
196
- datetime StartedAt
197
- datetime EndedAt
198
- string ProcessingStatus
199
- string ApprovalStatus
200
- string SourceListID
201
- }
265
+ ---
202
266
 
203
- DUPLICATE_RUN_DETAIL {
204
- string ID PK
205
- string DuplicateRunID FK
206
- string RecordID
207
- string MatchStatus
208
- string MergeStatus
209
- }
267
+ ## API Reference Summary
210
268
 
211
- DUPLICATE_RUN_DETAIL_MATCH {
212
- string ID PK
213
- string DuplicateRunDetailID FK
214
- string MatchRecordID
215
- float MatchProbability
216
- datetime MatchedAt
217
- string Action
218
- string ApprovalStatus
219
- string MergeStatus
220
- }
269
+ ### DuplicateRecordDetector
221
270
 
222
- LIST {
223
- string ID PK
224
- string Name
225
- string EntityID
226
- }
271
+ | Method | Signature | Description |
272
+ |---|---|---|
273
+ | `GetDuplicateRecords` | `(params: PotentialDuplicateRequest, contextUser?: UserInfo) => Promise<PotentialDuplicateResponse>` | Run batch duplicate detection for all records in a list |
274
+ | `CheckSingleRecord` | `(EntityDocumentID: string, RecordID: CompositeKey, Options?: DuplicateDetectionOptions, ContextUser?: UserInfo) => Promise<PotentialDuplicateResult>` | Check a single record for duplicates |
275
+ | `ParseVectorMatches` | `(queryResponse: BaseResponse, sourceKey?: CompositeKey) => PotentialDuplicateResult` | Parse raw vector DB response into typed results |
227
276
 
228
- LIST_DETAIL {
229
- string ID PK
230
- string ListID FK
231
- string RecordID
232
- }
277
+ ### ComputeRRF
233
278
 
234
- ENTITY_DOCUMENT {
235
- string ID PK
236
- string EntityID
237
- string TemplateID
238
- string AIModelID
239
- string VectorDatabaseID
240
- float PotentialMatchThreshold
241
- float AbsoluteMatchThreshold
242
- }
279
+ ```typescript
280
+ function ComputeRRF(rankedLists: ScoredCandidate[][], k?: number): ScoredCandidate[]
281
+ ```
243
282
 
244
- DUPLICATE_RUN ||--o{ DUPLICATE_RUN_DETAIL : contains
245
- DUPLICATE_RUN_DETAIL ||--o{ DUPLICATE_RUN_DETAIL_MATCH : has
246
- DUPLICATE_RUN }o--|| LIST : "source"
247
- LIST ||--o{ LIST_DETAIL : contains
283
+ Compute Reciprocal Rank Fusion across multiple ranked result lists. Returns candidates sorted by descending fused score.
284
+
285
+ ### ScoredCandidate
286
+
287
+ ```typescript
288
+ interface ScoredCandidate {
+   ID: string;
+   Score: number;
+   Metadata?: Record<string, unknown>;
+ }
248
293
  ```
249
294
 
250
- ## Environment Variables
295
+ ---
251
296
 
252
- ```env
253
- # AI Model API Keys
254
- OPENAI_API_KEY=your-openai-key
255
- MISTRAL_API_KEY=your-mistral-key
297
+ ## Inverse Match Deduplication
256
298
 
257
- # Vector Database
258
- PINECONE_API_KEY=your-pinecone-key
259
- PINECONE_HOST=your-pinecone-host
260
- PINECONE_DEFAULT_INDEX=your-index-name
299
+ The detector maintains a `_seenPairs` set across the entire run to suppress inverse duplicates. If record A is identified as a duplicate of record B (A->B), the reverse match (B->A) is automatically suppressed. Pair keys use canonical ordering (`smallerID::largerID`) for consistent deduplication regardless of query direction.
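The canonical pair key idea can be sketched like this. The helper names `canonicalPairKey` and `isNewPair` are hypothetical, and the sketch assumes IDs are compared as plain strings; the detector's actual comparison and normalization rules may differ.

```typescript
// Build an order-independent key of the form `smallerID::largerID`.
// Assumption: IDs are compared as plain strings.
function canonicalPairKey(idA: string, idB: string): string {
  return idA < idB ? `${idA}::${idB}` : `${idB}::${idA}`;
}

// Returns true the first time a pair is seen, false for its inverse or a repeat.
function isNewPair(seenPairs: Set<string>, idA: string, idB: string): boolean {
  const key = canonicalPairKey(idA, idB);
  if (seenPairs.has(key)) {
    return false;
  }
  seenPairs.add(key);
  return true;
}

const seenPairs = new Set<string>();
const firstDirection = isNewPair(seenPairs, 'rec-A', 'rec-B');   // A->B is recorded
const inverseDirection = isNewPair(seenPairs, 'rec-B', 'rec-A'); // B->A is suppressed
console.log(firstDirection, inverseDirection);
```

Because both directions map to the same key, a single `Set` lookup is enough to suppress the inverse match, whichever record happens to be queried first.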
261
300
 
262
- # Database Connection
263
- DB_HOST=your-sql-server
264
- DB_PORT=1433
265
- DB_USERNAME=your-username
266
- DB_PASSWORD=your-password
267
- DB_DATABASE=your-database
301
+ ## RecordID Format and Metadata
268
302
 
269
- # User Context
270
- CURRENT_USER_EMAIL=user@example.com
271
- ```
303
+ - **RecordID and MatchRecordID** are stored in MJ URL segment format (e.g., `ID|uuid`), making them compatible with `CompositeKey` for entities with composite primary keys.
304
+ - **RecordMetadata** is stored on both `DuplicateRunDetail` and `DuplicateRunDetailMatch` entities, capturing the vector database metadata snapshot at detection time. This preserves the context used for matching even if the source record changes later.
305
+
306
+ ## Database Entities
272
307
 
273
- ## Dependencies
308
+ The package reads from and writes to these MJ entities:
274
309
 
275
- | Package | Purpose |
310
+ | Entity | Purpose |
276
311
  |---|---|
277
- | `@memberjunction/ai` | `BaseEmbeddings`, `GetAIAPIKey` |
278
- | `@memberjunction/ai-vectordb` | `VectorDBBase`, `BaseResponse` |
279
- | `@memberjunction/ai-vectors` | `VectorBase` base class |
280
- | `@memberjunction/ai-vectors-pinecone` | Pinecone implementation |
281
- | `@memberjunction/ai-vector-sync` | `EntityVectorSyncer`, `EntityDocumentTemplateParser` |
282
- | `@memberjunction/aiengine` | AI engine integration |
283
- | `@memberjunction/core` | Core MJ types and data access |
284
- | `@memberjunction/core-entities` | Entity type definitions |
285
- | `@memberjunction/global` | MJGlobal class factory |
286
-
287
- ## Limitations
288
-
289
- - Duplicate detection operates within a single entity type
290
- - Requires pre-configured Entity Documents with templates
291
- - Currently supports Pinecone as the vector database provider
292
- - Records must be added to a List before detection can run
312
+ | `MJ: Entity Documents` | Configuration: template, AI model, vector DB, thresholds |
313
+ | `MJ: Lists` / `MJ: List Details` | Source records to check for duplicates |
314
+ | `MJ: Duplicate Runs` | Tracks each detection run (status, timing) |
315
+ | `MJ: Duplicate Run Details` | Per-record tracking within a run; includes `RecordMetadata` (vector DB metadata snapshot) |
316
+ | `MJ: Duplicate Run Detail Matches` | Individual match results with probability scores; includes `RecordMetadata` for the matched record |
317
+
318
+ ---
319
+
320
+ ## Further Reading
321
+
322
+ - **[Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md)** -- comprehensive developer guide covering end-to-end workflow, threshold tuning, hybrid search deep dive, performance optimization, and troubleshooting
323
+ - **[MemberJunction AI Vectors](../Core/README.md)** -- base vector infrastructure
324
+ - **[AI Vector Sync](../Sync/README.md)** -- entity vectorization and template parsing
325
+
326
+ ---
293
327
 
294
328
  ## Development
295
329
 
@@ -297,8 +331,11 @@ CURRENT_USER_EMAIL=user@example.com
297
331
  # Build
298
332
  npm run build
299
333
 
300
- # Development mode
301
- npm run start
334
+ # Run tests
335
+ npm run test
336
+
337
+ # Watch mode
338
+ npm run test:watch
302
339
  ```
303
340
 
304
341
  ## License