@memberjunction/ai-vector-dupe 5.21.0 → 5.23.0
- package/README.md +266 -229
- package/dist/duplicateRecordDetector.d.ts +180 -18
- package/dist/duplicateRecordDetector.d.ts.map +1 -1
- package/dist/duplicateRecordDetector.js +746 -267
- package/dist/duplicateRecordDetector.js.map +1 -1
- package/dist/index.d.ts +2 -3
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +4 -3
- package/dist/index.js.map +1 -1
- package/dist/scoring/ReciprocalRankFusion.d.ts +45 -0
- package/dist/scoring/ReciprocalRankFusion.d.ts.map +1 -0
- package/dist/scoring/ReciprocalRankFusion.js +63 -0
- package/dist/scoring/ReciprocalRankFusion.js.map +1 -0
- package/package.json +10 -10
- package/dist/config.d.ts +0 -13
- package/dist/config.d.ts.map +0 -1
- package/dist/config.js +0 -15
- package/dist/config.js.map +0 -1
- package/dist/generic/vectorSyncBase.d.ts +0 -20
- package/dist/generic/vectorSyncBase.d.ts.map +0 -1
- package/dist/generic/vectorSyncBase.js +0 -42
- package/dist/generic/vectorSyncBase.js.map +0 -1
- package/dist/models/entitySyncConfig.d.ts +0 -36
- package/dist/models/entitySyncConfig.d.ts.map +0 -1
- package/dist/models/entitySyncConfig.js +0 -2
- package/dist/models/entitySyncConfig.js.map +0 -1
package/README.md
CHANGED
|
@@ -1,42 +1,66 @@
|
|
|
1
1
|
# @memberjunction/ai-vector-dupe
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
<!-- Badges -->
|
|
4
|
+
<!-- [](https://www.npmjs.com/package/@memberjunction/ai-vector-dupe) -->
|
|
5
|
+
<!-- [](https://github.com/MemberJunction/MJ/actions) -->
|
|
6
|
+
|
|
7
|
+
**AI-powered duplicate record detection for MemberJunction entities** -- finds, scores, tracks, and optionally auto-merges duplicate records using vector similarity, hybrid search (RRF), and optional reranking.
|
|
8
|
+
|
|
9
|
+
---
|
|
4
10
|
|
|
5
11
|
## Architecture
|
|
6
12
|
|
|
7
|
-
```mermaid
|
|
8
|
-
graph TD
|
|
9
|
-
subgraph DupePkg["@memberjunction/ai-vector-dupe"]
|
|
10
|
-
DRD["DuplicateRecordDetector"]
|
|
11
|
-
VSB["VectorSyncBase"]
|
|
12
|
-
ESC["EntitySyncConfig"]
|
|
13
|
-
end
|
|
14
|
-
|
|
15
|
-
subgraph Pipeline["Detection Pipeline"]
|
|
16
|
-
LIST["Load Records<br/>from List"] --> VECT["Vectorize Records<br/>via Templates"]
|
|
17
|
-
VECT --> EMBED["Generate<br/>Embeddings"]
|
|
18
|
-
EMBED --> QUERY["Query Vector DB<br/>for Matches"]
|
|
19
|
-
QUERY --> FILTER["Filter by<br/>Threshold"]
|
|
20
|
-
FILTER --> TRACK["Track Results<br/>in Duplicate Runs"]
|
|
21
|
-
TRACK --> MERGE["Auto-Merge<br/>Above Threshold"]
|
|
22
|
-
end
|
|
23
|
-
|
|
24
|
-
subgraph Dependencies["Key Dependencies"]
|
|
25
|
-
VB["ai-vectors<br/>(VectorBase)"]
|
|
26
|
-
SYNC["ai-vector-sync<br/>(EntityVectorSyncer)"]
|
|
27
|
-
VDBB["ai-vectordb<br/>(VectorDBBase)"]
|
|
28
|
-
AI["ai<br/>(BaseEmbeddings)"]
|
|
29
|
-
end
|
|
30
|
-
|
|
31
|
-
DRD -->|extends| VB
|
|
32
|
-
DRD --> SYNC
|
|
33
|
-
DRD --> VDBB
|
|
34
|
-
DRD --> AI
|
|
35
|
-
|
|
36
|
-
style DupePkg fill:#2d6a9f,stroke:#1a4971,color:#fff
|
|
37
|
-
style Pipeline fill:#2d8659,stroke:#1a5c3a,color:#fff
|
|
38
|
-
style Dependencies fill:#7c5295,stroke:#563a6b,color:#fff
|
|
39
13
|
```
|
|
14
|
+
+--------------------------+
|
|
15
|
+
| DuplicateRecordDetector |
|
|
16
|
+
| (extends VectorBase) |
|
|
17
|
+
+-----+----------+---------+
|
|
18
|
+
| |
|
|
19
|
+
+----------------+ +----------------+
|
|
20
|
+
| |
|
|
21
|
+
+---------v----------+ +-----------v---------+
|
|
22
|
+
| GetDuplicateRecords| | CheckSingleRecord |
|
|
23
|
+
| (list-based batch) | | (single record) |
|
|
24
|
+
+--------+-----------+ +-----------+---------+
|
|
25
|
+
| |
|
|
26
|
+
+-------------------+-------------------------+
|
|
27
|
+
|
|
|
28
|
+
+------------v------------+
|
|
29
|
+
| Detection Pipeline |
|
|
30
|
+
+-------------------------+
|
|
31
|
+
| 1. Validate Entity Doc |
|
|
32
|
+
| 2. Vectorize records |
|
|
33
|
+
| 3. Embed via AI model |
|
|
34
|
+
| 4. Query vector DB |
|
|
35
|
+
| (hybrid if supported)|
|
|
36
|
+
| 5. Filter self-matches |
|
|
37
|
+
| 6. Apply thresholds |
|
|
38
|
+
| 7. Persist match results|
|
|
39
|
+
| 8. Auto-merge (optional)|
|
|
40
|
+
+-------------------------+
|
|
41
|
+
|
|
|
42
|
+
+------------------+------------------+
|
|
43
|
+
| | |
|
|
44
|
+
+---------v------+ +-------v--------+ +-------v--------+
|
|
45
|
+
| ai-vector-sync | | ai-vectordb | | ai (Embeddings)|
|
|
46
|
+
| (vectorizer, | | (VectorDBBase, | | (BaseEmbeddings|
|
|
47
|
+
| templates) | | hybrid query) | | GetAIAPIKey) |
|
|
48
|
+
+----------------+ +----------------+ +----------------+
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
**Key dependencies:**
|
|
52
|
+
|
|
53
|
+
| Package | Role |
|
|
54
|
+
|---|---|
|
|
55
|
+
| `@memberjunction/ai` | Embedding model abstraction and API key resolution |
|
|
56
|
+
| `@memberjunction/ai-vectordb` | Vector database abstraction (query, hybrid search) |
|
|
57
|
+
| `@memberjunction/ai-vectors` | `VectorBase` base class with metadata and RunView helpers |
|
|
58
|
+
| `@memberjunction/ai-vector-sync` | `EntityVectorSyncer` for record vectorization, template parsing |
|
|
59
|
+
| `@memberjunction/core` | Core types: `PotentialDuplicateRequest`, `DuplicateDetectionOptions`, etc. |
|
|
60
|
+
| `@memberjunction/core-entities` | Generated entity classes for Duplicate Runs, Lists, Entity Documents |
|
|
61
|
+
| `@memberjunction/global` | `MJGlobal` class factory, `UUIDsEqual` |
|
|
62
|
+
|
|
63
|
+
---
|
|
40
64
|
|
|
41
65
|
## Installation
|
|
42
66
|
|
|
@@ -44,252 +68,262 @@ graph TD
|
|
|
44
68
|
npm install @memberjunction/ai-vector-dupe
|
|
45
69
|
```
|
|
46
70
|
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
The package provides the `DuplicateRecordDetector` class, which orchestrates a complete duplicate detection workflow:
|
|
50
|
-
|
|
51
|
-
1. Loads records from a MemberJunction List
|
|
52
|
-
2. Vectorizes them using a configured Entity Document template and embedding model
|
|
53
|
-
3. Queries the vector database for similarity matches
|
|
54
|
-
4. Filters results against configurable thresholds
|
|
55
|
-
5. Creates Duplicate Run, Duplicate Run Detail, and Duplicate Run Detail Match records for tracking
|
|
56
|
-
6. Optionally auto-merges records that exceed the absolute match threshold
|
|
57
|
-
|
|
58
|
-
## Duplicate Detection Flow
|
|
59
|
-
|
|
60
|
-
```mermaid
|
|
61
|
-
sequenceDiagram
|
|
62
|
-
participant Caller
|
|
63
|
-
participant DRD as DuplicateRecordDetector
|
|
64
|
-
participant EVS as EntityVectorSyncer
|
|
65
|
-
participant Embed as Embedding Model
|
|
66
|
-
participant VDB as Vector Database
|
|
67
|
-
participant DB as MJ Database
|
|
68
|
-
|
|
69
|
-
Caller->>DRD: getDuplicateRecords(request, user)
|
|
70
|
-
DRD->>DB: Load Entity Document
|
|
71
|
-
DRD->>EVS: VectorizeEntity (ensure all records are indexed)
|
|
72
|
-
DRD->>DB: Load records from List
|
|
73
|
-
|
|
74
|
-
loop For each record
|
|
75
|
-
DRD->>Embed: Generate embedding from template
|
|
76
|
-
DRD->>VDB: queryIndex (topK=5)
|
|
77
|
-
VDB-->>DRD: Scored matches
|
|
78
|
-
DRD->>DRD: Filter by PotentialMatchThreshold
|
|
79
|
-
DRD->>DB: Create DuplicateRunDetailMatch records
|
|
80
|
-
end
|
|
81
|
-
|
|
82
|
-
DRD->>DRD: Check AbsoluteMatchThreshold
|
|
83
|
-
DRD->>DB: Auto-merge high-confidence duplicates
|
|
84
|
-
DRD-->>Caller: PotentialDuplicateResponse
|
|
85
|
-
```
|
|
71
|
+
---
|
|
86
72
|
|
|
87
|
-
##
|
|
73
|
+
## Quick Start
|
|
88
74
|
|
|
89
|
-
###
|
|
75
|
+
### List-Based Batch Detection
|
|
76
|
+
|
|
77
|
+
Detect duplicates across all records in an MJ List:
|
|
78
|
+
|
|
79
|
+
```typescript
|
|
80
|
+
import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
|
|
81
|
+
import { PotentialDuplicateRequest } from '@memberjunction/core';
|
|
82
|
+
|
|
83
|
+
const detector = new DuplicateRecordDetector();
|
|
84
|
+
|
|
85
|
+
const request: PotentialDuplicateRequest = {
|
|
86
|
+
ListID: 'your-list-uuid',
|
|
87
|
+
EntityID: 'your-entity-uuid',
|
|
88
|
+
EntityDocumentID: 'your-entity-document-uuid',
|
|
89
|
+
Options: {
|
|
90
|
+
TopK: 10,
|
|
91
|
+
OnProgress: (progress) => {
|
|
92
|
+
console.log(`[${progress.Phase}] ${progress.ProcessedRecords}/${progress.TotalRecords} -- ${progress.MatchesFound} matches`);
|
|
93
|
+
},
|
|
94
|
+
},
|
|
95
|
+
};
|
|
96
|
+
|
|
97
|
+
const response = await detector.GetDuplicateRecords(request, contextUser);
|
|
98
|
+
|
|
99
|
+
if (response.Status === 'Success') {
|
|
100
|
+
for (const result of response.PotentialDuplicateResult) {
|
|
101
|
+
console.log(`Record: ${result.RecordCompositeKey.ToString()}`);
|
|
102
|
+
for (const dupe of result.Duplicates) {
|
|
103
|
+
console.log(` Match: ${dupe.ToString()} (${(dupe.ProbabilityScore * 100).toFixed(1)}%)`);
|
|
104
|
+
}
|
|
105
|
+
}
|
|
106
|
+
}
|
|
107
|
+
```
|
|
90
108
|
|
|
91
|
-
|
|
109
|
+
### Single-Record Check
|
|
92
110
|
|
|
93
|
-
|
|
111
|
+
Check one record for duplicates without creating a list -- ideal for server hooks (e.g., fire-and-forget after record save):
|
|
94
112
|
|
|
95
113
|
```typescript
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
)
|
|
114
|
+
import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
|
|
115
|
+
import { CompositeKey } from '@memberjunction/core';
|
|
116
|
+
|
|
117
|
+
const detector = new DuplicateRecordDetector();
|
|
118
|
+
|
|
119
|
+
const recordKey = new CompositeKey([{ FieldName: 'ID', Value: 'record-uuid' }]);
|
|
120
|
+
|
|
121
|
+
const result = await detector.CheckSingleRecord(
|
|
122
|
+
'your-entity-document-uuid',
|
|
123
|
+
recordKey,
|
|
124
|
+
{ TopK: 5 },
|
|
125
|
+
contextUser
|
|
126
|
+
);
|
|
127
|
+
|
|
128
|
+
for (const dupe of result.Duplicates) {
|
|
129
|
+
console.log(`Potential duplicate: ${dupe.ToString()} (score: ${dupe.ProbabilityScore})`);
|
|
130
|
+
}
|
|
100
131
|
```
|
|
101
132
|
|
|
102
|
-
|
|
133
|
+
---
|
|
103
134
|
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
| `EntityID` | `string` | ID of the entity type |
|
|
108
|
-
| `EntityDocumentID` | `string` | ID of the Entity Document with vectorization template |
|
|
109
|
-
| `Options.DuplicateRunID` | `string` (optional) | Resume an existing duplicate run |
|
|
135
|
+
## DuplicateDetectionOptions Reference
|
|
136
|
+
|
|
137
|
+
Options are passed via the `Options` property on `PotentialDuplicateRequest`, or directly to `CheckSingleRecord`.
|
|
110
138
|
|
|
111
|
-
|
|
139
|
+
| Option | Type | Default | Description |
|
|
140
|
+
|---|---|---|---|
|
|
141
|
+
| `TopK` | `number` | `5` | Number of nearest neighbors to retrieve per record |
|
|
142
|
+
| `DuplicateRunID` | `string` | -- | Resume an existing duplicate run (batch mode only) |
|
|
143
|
+
| `KeywordSearchWeight` | `number` | `0.3` | Weight for keyword search in hybrid mode (0.0 = vector only, 1.0 = keyword only). Vector weight is `1.0 - KeywordSearchWeight`. |
|
|
144
|
+
| `FusionMethod` | `string` | `'rrf'` | Fusion method for hybrid search. Currently supports `'rrf'` (Reciprocal Rank Fusion). |
|
|
145
|
+
| `PotentialMatchThreshold` | `number` | -- | Override the EntityDocument's PotentialMatchThreshold for this run |
|
|
146
|
+
| `AbsoluteMatchThreshold` | `number` | -- | Override the EntityDocument's AbsoluteMatchThreshold for this run |
|
|
147
|
+
| `OnProgress` | `(progress: DuplicateDetectionProgress) => void` | -- | Callback for real-time progress reporting |
|
|
148
|
+
|
|
149
|
+
### Thresholds
|
|
150
|
+
|
|
151
|
+
Thresholds can be configured at two levels -- on the `EntityDocument` record (default) or overridden per-run via `DuplicateDetectionOptions`. When threshold overrides are provided in the options, they take precedence over the EntityDocument values.
|
|
112
152
|
|
|
113
153
|
| Threshold | Purpose |
|
|
114
154
|
|---|---|
|
|
115
|
-
| `PotentialMatchThreshold` | Minimum similarity score to report as potential duplicate |
|
|
116
|
-
| `AbsoluteMatchThreshold` | Minimum similarity score
|
|
155
|
+
| `PotentialMatchThreshold` | Minimum similarity score to report a candidate as a potential duplicate |
|
|
156
|
+
| `AbsoluteMatchThreshold` | Minimum similarity score to trigger automatic record merge |
|
|
117
157
|
|
|
118
|
-
|
|
158
|
+
A server hook normalizes thresholds left at `1.0` to sensible defaults (`0.70` for potential, `0.95` for absolute), since a minimum score of `1.0` would admit only perfect-score matches.
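The precedence and normalization rules above can be sketched as follows. This is a minimal illustration, not the package's code: `resolveThresholds`, `classifyScore`, and `ThresholdPair` are hypothetical names, and both the order of operations (overrides first, then normalization) and the inclusive `>=` comparison are assumptions.

```typescript
// Illustrative sketch only: these names are hypothetical and are not
// exports of @memberjunction/ai-vector-dupe.
interface ThresholdPair {
  PotentialMatchThreshold: number;
  AbsoluteMatchThreshold: number;
}

type MatchClass = 'none' | 'potential' | 'absolute';

// Defaults quoted in the README for thresholds left at 1.0.
const DEFAULT_POTENTIAL = 0.7;
const DEFAULT_ABSOLUTE = 0.95;

function resolveThresholds(
  entityDocument: ThresholdPair,
  overrides: Partial<ThresholdPair> = {}
): ThresholdPair {
  // Per-run overrides take precedence over the EntityDocument values.
  const potential = overrides.PotentialMatchThreshold ?? entityDocument.PotentialMatchThreshold;
  const absolute = overrides.AbsoluteMatchThreshold ?? entityDocument.AbsoluteMatchThreshold;
  // Assumption: a value of exactly 1.0 is swapped for the documented default.
  return {
    PotentialMatchThreshold: potential === 1.0 ? DEFAULT_POTENTIAL : potential,
    AbsoluteMatchThreshold: absolute === 1.0 ? DEFAULT_ABSOLUTE : absolute,
  };
}

function classifyScore(score: number, thresholds: ThresholdPair): MatchClass {
  // Assumption: "minimum similarity score" is treated as inclusive.
  if (score >= thresholds.AbsoluteMatchThreshold) return 'absolute';
  if (score >= thresholds.PotentialMatchThreshold) return 'potential';
  return 'none';
}
```

Under these assumptions, a candidate classified as `'absolute'` would be eligible for auto-merge, while `'potential'` candidates would only be reported.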
|
|
119
159
|
|
|
120
|
-
|
|
160
|
+
---
|
|
121
161
|
|
|
122
|
-
|
|
123
|
-
- `timer(ms)` -- async delay
|
|
124
|
-
- `start()` / `end()` / `timeDiff()` -- execution timing
|
|
125
|
-
- `saveJSONData(data, path)` -- JSON file output
|
|
162
|
+
## Hybrid Search and Reciprocal Rank Fusion (RRF)
|
|
126
163
|
|
|
127
|
-
|
|
164
|
+
When the configured vector database supports hybrid search (`VectorDBBase.SupportsHybridSearch === true`), the detector automatically combines **vector similarity** and **keyword search** for higher-quality results.
|
|
128
165
|
|
|
129
|
-
|
|
166
|
+
### How It Works
|
|
167
|
+
|
|
168
|
+
1. The record's template text is sent as both a vector embedding and a keyword query.
|
|
169
|
+
2. The vector DB returns results from both retrieval methods.
|
|
170
|
+
3. Results are fused using **Reciprocal Rank Fusion (RRF)**, a rank-based algorithm that is score-scale independent.
|
|
171
|
+
|
|
172
|
+
### RRF Formula
|
|
130
173
|
|
|
131
|
-
```typescript
|
|
132
|
-
type EntitySyncConfig = {
|
|
133
|
-
EntityDocumentID: string; // Entity Document to use
|
|
134
|
-
Interval: number; // Sync interval in seconds
|
|
135
|
-
RunViewParams: RunViewParams; // View parameters for fetching
|
|
136
|
-
IncludeInSync: boolean; // Whether to include in sync
|
|
137
|
-
LastRunDate: string; // Last sync timestamp
|
|
138
|
-
VectorIndexID: number; // Vector index ID
|
|
139
|
-
VectorID: number; // Vector database ID
|
|
140
|
-
};
|
|
141
174
|
```
|
|
175
|
+
FusedScore(d) = SUM_i [ 1 / (k + rank_i(d)) ]
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
Where `rank_i(d)` is the 1-based rank of document `d` in list `i`, and `k` is a smoothing constant (default: 60).
|
|
142
179
|
|
|
143
|
-
|
|
180
|
+
### Using ComputeRRF Directly
|
|
144
181
|
|
|
145
|
-
|
|
182
|
+
The `ComputeRRF` utility is exported for use in custom pipelines:
|
|
146
183
|
|
|
147
184
|
```typescript
|
|
148
|
-
import {
|
|
149
|
-
|
|
185
|
+
import { ComputeRRF, ScoredCandidate } from '@memberjunction/ai-vector-dupe';
|
|
186
|
+
|
|
187
|
+
const vectorResults: ScoredCandidate[] = [
|
|
188
|
+
{ ID: 'rec-1', Score: 0.95 },
|
|
189
|
+
{ ID: 'rec-2', Score: 0.87 },
|
|
190
|
+
{ ID: 'rec-3', Score: 0.82 },
|
|
191
|
+
];
|
|
192
|
+
|
|
193
|
+
const keywordResults: ScoredCandidate[] = [
|
|
194
|
+
{ ID: 'rec-2', Score: 12.5 }, // Different scale -- RRF handles this
|
|
195
|
+
{ ID: 'rec-4', Score: 10.1 },
|
|
196
|
+
{ ID: 'rec-1', Score: 8.3 },
|
|
197
|
+
];
|
|
198
|
+
|
|
199
|
+
const fused = ComputeRRF([vectorResults, keywordResults], 60);
|
|
200
|
+
// Results sorted by fused RRF score, score-scale independent
|
|
201
|
+
```
|
|
150
202
|
|
|
151
|
-
|
|
203
|
+
### Tuning Hybrid Search
|
|
152
204
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
};
|
|
205
|
+
- **`KeywordSearchWeight = 0.0`**: Pure vector similarity (semantic matching).
|
|
206
|
+
- **`KeywordSearchWeight = 0.3`** (default): Slight keyword boost. Good for entities with distinctive names or codes.
|
|
207
|
+
- **`KeywordSearchWeight = 0.5`**: Equal weight. Useful when both semantic and lexical matches matter.
|
|
208
|
+
- **`KeywordSearchWeight = 1.0`**: Pure keyword search (not recommended for duplicate detection).
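As a small sketch of the weight relationship described in the options table (vector weight is `1.0 - KeywordSearchWeight`), the helper below derives both weights from a single setting. `deriveHybridWeights` and `HybridWeights` are hypothetical names, and the range check is an assumption; this README does not say how the detector treats out-of-range values.

```typescript
// Hypothetical helper, not part of the package API.
interface HybridWeights {
  VectorWeight: number;
  KeywordWeight: number;
}

function deriveHybridWeights(keywordSearchWeight: number = 0.3): HybridWeights {
  // Assumption: values outside [0, 1] are rejected rather than clamped.
  if (isNaN(keywordSearchWeight) || keywordSearchWeight < 0 || keywordSearchWeight > 1) {
    throw new RangeError(`KeywordSearchWeight must be within [0, 1], got ${keywordSearchWeight}`);
  }
  return {
    VectorWeight: 1 - keywordSearchWeight, // 1.0 - KeywordSearchWeight
    KeywordWeight: keywordSearchWeight,
  };
}
```

With the default of `0.3`, this yields a vector weight of `0.7`; a setting of `0.0` leaves only the vector side active.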
|
|
158
209
|
|
|
159
|
-
|
|
210
|
+
---
|
|
160
211
|
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
212
|
+
## Reranking
|
|
213
|
+
|
|
214
|
+
When MJ's `BaseReranker` / `RerankerService` is configured, the detector can apply a second-stage reranking pass after initial retrieval. Reranking uses a cross-encoder model to re-score candidates with higher precision than embedding-based similarity alone.
|
|
215
|
+
|
|
216
|
+
Reranking is especially effective when:
|
|
217
|
+
- Initial retrieval returns many borderline candidates
|
|
218
|
+
- Entity records have complex, multi-field structures
|
|
219
|
+
- You need to maximize precision at the cost of slightly higher latency
|
|
220
|
+
|
|
221
|
+
See the [Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md#reranking-integration) for configuration details.
|
|
222
|
+
|
|
223
|
+
---
|
|
224
|
+
|
|
225
|
+
## Progress Reporting
|
|
170
226
|
|
|
171
|
-
|
|
227
|
+
The `OnProgress` callback fires at each phase of the pipeline:
|
|
172
228
|
|
|
173
229
|
```typescript
|
|
174
230
|
const request: PotentialDuplicateRequest = {
|
|
175
|
-
|
|
176
|
-
EntityID: 'entity-uuid',
|
|
177
|
-
EntityDocumentID: 'doc-uuid',
|
|
231
|
+
// ...
|
|
178
232
|
Options: {
|
|
179
|
-
|
|
180
|
-
|
|
233
|
+
OnProgress: (progress) => {
|
|
234
|
+
const { Phase, TotalRecords, ProcessedRecords, MatchesFound, ElapsedMs } = progress;
|
|
235
|
+
const pct = TotalRecords > 0 ? ((ProcessedRecords / TotalRecords) * 100).toFixed(0) : '0';
|
|
236
|
+
console.log(`[${Phase}] ${pct}% -- ${MatchesFound} matches (${ElapsedMs}ms)`);
|
|
237
|
+
},
|
|
238
|
+
},
|
|
181
239
|
};
|
|
240
|
+
```
|
|
241
|
+
|
|
242
|
+
### Progress Phases
|
|
243
|
+
|
|
244
|
+
| Phase | Description |
|
|
245
|
+
|---|---|
|
|
246
|
+
| `Vectorizing` | Records are being vectorized via `EntityVectorSyncer` |
|
|
247
|
+
| `Embedding` | Template texts are being embedded via the AI model |
|
|
248
|
+
| `Querying` | Vector DB is being queried for each record |
|
|
249
|
+
| `Matching` | Results are being persisted and match records created |
|
|
250
|
+
| `Merging` | High-confidence matches are being auto-merged |
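One way to consume these phases is to keep only the latest snapshot per phase and print a summary at the end. The sketch below assumes the `DuplicateDetectionProgress` shape documented in this README and mirrors it locally so it is self-contained; `createPhaseTracker`, `ProgressSnapshot`, and `DetectionPhase` are hypothetical names.

```typescript
// Local mirror of the DuplicateDetectionProgress shape from this README.
type DetectionPhase = 'Vectorizing' | 'Embedding' | 'Querying' | 'Matching' | 'Merging';

interface ProgressSnapshot {
  Phase: DetectionPhase;
  TotalRecords: number;
  ProcessedRecords: number;
  MatchesFound: number;
  CurrentRecordID?: string;
  ElapsedMs: number;
}

const PHASE_ORDER: DetectionPhase[] = ['Vectorizing', 'Embedding', 'Querying', 'Matching', 'Merging'];

// Hypothetical helper: remembers the most recent snapshot seen for each phase.
function createPhaseTracker() {
  const latest = new Map<DetectionPhase, ProgressSnapshot>();
  return {
    OnProgress(progress: ProgressSnapshot): void {
      latest.set(progress.Phase, progress);
    },
    // One line per phase that reported at least once, in pipeline order.
    Summary(): string[] {
      const lines: string[] = [];
      for (const phase of PHASE_ORDER) {
        const snap = latest.get(phase);
        if (snap) {
          lines.push(`${phase}: ${snap.ProcessedRecords}/${snap.TotalRecords}, ${snap.MatchesFound} matches`);
        }
      }
      return lines;
    },
  };
}
```

The tracker's `OnProgress` function closes over its own state, so it could be passed directly as `Options.OnProgress` and `Summary()` called once the run completes.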
|
|
251
|
+
|
|
252
|
+
### DuplicateDetectionProgress Shape
|
|
182
253
|
|
|
183
|
-
|
|
254
|
+
```typescript
|
|
255
|
+
interface DuplicateDetectionProgress {
|
|
256
|
+
Phase: 'Vectorizing' | 'Embedding' | 'Querying' | 'Matching' | 'Merging';
|
|
257
|
+
TotalRecords: number;
|
|
258
|
+
ProcessedRecords: number;
|
|
259
|
+
MatchesFound: number;
|
|
260
|
+
CurrentRecordID?: string;
|
|
261
|
+
ElapsedMs: number;
|
|
262
|
+
}
|
|
184
263
|
```
|
|
185
264
|
|
|
186
|
-
|
|
187
|
-
|
|
188
|
-
The package reads from and writes to these MemberJunction entities:
|
|
189
|
-
|
|
190
|
-
```mermaid
|
|
191
|
-
erDiagram
|
|
192
|
-
DUPLICATE_RUN {
|
|
193
|
-
string ID PK
|
|
194
|
-
string EntityID
|
|
195
|
-
string StartedByUserID
|
|
196
|
-
datetime StartedAt
|
|
197
|
-
datetime EndedAt
|
|
198
|
-
string ProcessingStatus
|
|
199
|
-
string ApprovalStatus
|
|
200
|
-
string SourceListID
|
|
201
|
-
}
|
|
265
|
+
---
|
|
202
266
|
|
|
203
|
-
|
|
204
|
-
string ID PK
|
|
205
|
-
string DuplicateRunID FK
|
|
206
|
-
string RecordID
|
|
207
|
-
string MatchStatus
|
|
208
|
-
string MergeStatus
|
|
209
|
-
}
|
|
267
|
+
## API Reference Summary
|
|
210
268
|
|
|
211
|
-
|
|
212
|
-
string ID PK
|
|
213
|
-
string DuplicateRunDetailID FK
|
|
214
|
-
string MatchRecordID
|
|
215
|
-
float MatchProbability
|
|
216
|
-
datetime MatchedAt
|
|
217
|
-
string Action
|
|
218
|
-
string ApprovalStatus
|
|
219
|
-
string MergeStatus
|
|
220
|
-
}
|
|
269
|
+
### DuplicateRecordDetector
|
|
221
270
|
|
|
222
|
-
|
|
223
|
-
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
271
|
+
| Method | Signature | Description |
|
|
272
|
+
|---|---|---|
|
|
273
|
+
| `GetDuplicateRecords` | `(params: PotentialDuplicateRequest, contextUser?: UserInfo) => Promise<PotentialDuplicateResponse>` | Run batch duplicate detection for all records in a list |
|
|
274
|
+
| `CheckSingleRecord` | `(EntityDocumentID: string, RecordID: CompositeKey, Options?: DuplicateDetectionOptions, ContextUser?: UserInfo) => Promise<PotentialDuplicateResult>` | Check a single record for duplicates |
|
|
275
|
+
| `ParseVectorMatches` | `(queryResponse: BaseResponse, sourceKey?: CompositeKey) => PotentialDuplicateResult` | Parse raw vector DB response into typed results |
|
|
227
276
|
|
|
228
|
-
|
|
229
|
-
string ID PK
|
|
230
|
-
string ListID FK
|
|
231
|
-
string RecordID
|
|
232
|
-
}
|
|
277
|
+
### ComputeRRF
|
|
233
278
|
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
string TemplateID
|
|
238
|
-
string AIModelID
|
|
239
|
-
string VectorDatabaseID
|
|
240
|
-
float PotentialMatchThreshold
|
|
241
|
-
float AbsoluteMatchThreshold
|
|
242
|
-
}
|
|
279
|
+
```typescript
|
|
280
|
+
function ComputeRRF(rankedLists: ScoredCandidate[][], k?: number): ScoredCandidate[]
|
|
281
|
+
```
|
|
243
282
|
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
283
|
+
Compute Reciprocal Rank Fusion across multiple ranked result lists. Returns candidates sorted by descending fused score.
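To make the formula concrete, here is a minimal sketch of rank-based fusion. It is not the package's `ComputeRRF` source: `fuseByReciprocalRank` and `Candidate` are hypothetical names, and it assumes each input list is already ordered best-first so that array position gives the rank.

```typescript
// Minimal RRF sketch; hypothetical names, not the package implementation.
interface Candidate {
  ID: string;
  Score: number;
}

function fuseByReciprocalRank(rankedLists: Candidate[][], k: number = 60): Candidate[] {
  const fusedScores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((candidate, index) => {
      const rank = index + 1; // 1-based rank within this list
      const previous = fusedScores.get(candidate.ID) ?? 0;
      fusedScores.set(candidate.ID, previous + 1 / (k + rank));
    });
  }
  const result: Candidate[] = [];
  fusedScores.forEach((score, id) => result.push({ ID: id, Score: score }));
  // Highest fused score first.
  return result.sort((a, b) => b.Score - a.Score);
}

// Same inputs as the ComputeRRF usage example in this README.
const vectorResults: Candidate[] = [
  { ID: 'rec-1', Score: 0.95 },
  { ID: 'rec-2', Score: 0.87 },
  { ID: 'rec-3', Score: 0.82 },
];
const keywordResults: Candidate[] = [
  { ID: 'rec-2', Score: 12.5 },
  { ID: 'rec-4', Score: 10.1 },
  { ID: 'rec-1', Score: 8.3 },
];

const fused = fuseByReciprocalRank([vectorResults, keywordResults]);
console.log(fused.map((c) => c.ID).join(', ')); // rec-2, rec-1, rec-4, rec-3
```

With `k = 60`, `rec-2` scores 1/62 + 1/61 and `rec-1` scores 1/61 + 1/63, so `rec-2` ranks first even though `rec-1` has the highest raw vector score. Only positions matter, which is why the differing score scales of the two lists do not need to be reconciled.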
|
|
284
|
+
|
|
285
|
+
### ScoredCandidate
|
|
286
|
+
|
|
287
|
+
```typescript
|
|
288
|
+
interface ScoredCandidate {
|
|
289
|
+
ID: string;
|
|
290
|
+
Score: number;
|
|
291
|
+
Metadata?: Record<string, unknown>;
|
|
292
|
+
}
|
|
248
293
|
```
|
|
249
294
|
|
|
250
|
-
|
|
295
|
+
---
|
|
251
296
|
|
|
252
|
-
|
|
253
|
-
# AI Model API Keys
|
|
254
|
-
OPENAI_API_KEY=your-openai-key
|
|
255
|
-
MISTRAL_API_KEY=your-mistral-key
|
|
297
|
+
## Inverse Match Deduplication
|
|
256
298
|
|
|
257
|
-
|
|
258
|
-
PINECONE_API_KEY=your-pinecone-key
|
|
259
|
-
PINECONE_HOST=your-pinecone-host
|
|
260
|
-
PINECONE_DEFAULT_INDEX=your-index-name
|
|
299
|
+
The detector maintains a `_seenPairs` set across the entire run to suppress inverse duplicates. If record A is identified as a duplicate of record B (A->B), the reverse match (B->A) is automatically suppressed. Pair keys use canonical ordering (`smallerID::largerID`) for consistent deduplication regardless of query direction.
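The canonical-ordering idea can be sketched in a few lines. This is an illustration only: `canonicalPairKey` and `PairTracker` are hypothetical names, the package's internal `_seenPairs` handling may differ, and "smaller" is assumed here to mean plain string comparison with no case normalization.

```typescript
// Hypothetical sketch of canonical pair keys for inverse-match suppression.
function canonicalPairKey(idA: string, idB: string): string {
  // Assumption: ordering is by ordinary string comparison.
  return idA < idB ? `${idA}::${idB}` : `${idB}::${idA}`;
}

class PairTracker {
  private seen = new Set<string>();

  // Returns true the first time a pair is recorded, false for repeats,
  // including the same pair presented in the opposite direction.
  public TryAdd(sourceID: string, matchID: string): boolean {
    const key = canonicalPairKey(sourceID, matchID);
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}
```

Because both A->B and B->A map to the same key, whichever direction is encountered first is kept and the other is dropped.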
|
|
261
300
|
|
|
262
|
-
|
|
263
|
-
DB_HOST=your-sql-server
|
|
264
|
-
DB_PORT=1433
|
|
265
|
-
DB_USERNAME=your-username
|
|
266
|
-
DB_PASSWORD=your-password
|
|
267
|
-
DB_DATABASE=your-database
|
|
301
|
+
## RecordID Format and Metadata
|
|
268
302
|
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
303
|
+
- **RecordID and MatchRecordID** are stored in MJ URL segment format (e.g., `ID|uuid`), making them compatible with `CompositeKey` for entities with composite primary keys.
|
|
304
|
+
- **RecordMetadata** is stored on both `DuplicateRunDetail` and `DuplicateRunDetailMatch` entities, capturing the vector database metadata snapshot at detection time. This preserves the context used for matching even if the source record changes later.
|
|
305
|
+
|
|
306
|
+
## Database Entities
|
|
272
307
|
|
|
273
|
-
|
|
308
|
+
The package reads from and writes to these MJ entities:
|
|
274
309
|
|
|
275
|
-
|
|
|
310
|
+
| Entity | Purpose |
|
|
276
311
|
|---|---|
|
|
277
|
-
|
|
|
278
|
-
|
|
|
279
|
-
|
|
|
280
|
-
|
|
|
281
|
-
|
|
|
282
|
-
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
|
|
288
|
-
|
|
289
|
-
-
|
|
290
|
-
|
|
291
|
-
|
|
292
|
-
- Records must be added to a List before detection can run
|
|
312
|
+
| `MJ: Entity Documents` | Configuration: template, AI model, vector DB, thresholds |
|
|
313
|
+
| `MJ: Lists` / `MJ: List Details` | Source records to check for duplicates |
|
|
314
|
+
| `MJ: Duplicate Runs` | Tracks each detection run (status, timing) |
|
|
315
|
+
| `MJ: Duplicate Run Details` | Per-record tracking within a run; includes `RecordMetadata` (vector DB metadata snapshot) |
|
|
316
|
+
| `MJ: Duplicate Run Detail Matches` | Individual match results with probability scores; includes `RecordMetadata` for the matched record |
|
|
317
|
+
|
|
318
|
+
---
|
|
319
|
+
|
|
320
|
+
## Further Reading
|
|
321
|
+
|
|
322
|
+
- **[Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md)** -- comprehensive developer guide covering end-to-end workflow, threshold tuning, hybrid search deep dive, performance optimization, and troubleshooting
|
|
323
|
+
- **[MemberJunction AI Vectors](../Core/README.md)** -- base vector infrastructure
|
|
324
|
+
- **[AI Vector Sync](../Sync/README.md)** -- entity vectorization and template parsing
|
|
325
|
+
|
|
326
|
+
---
|
|
293
327
|
|
|
294
328
|
## Development
|
|
295
329
|
|
|
@@ -297,8 +331,11 @@ CURRENT_USER_EMAIL=user@example.com
|
|
|
297
331
|
# Build
|
|
298
332
|
npm run build
|
|
299
333
|
|
|
300
|
-
#
|
|
301
|
-
npm run
|
|
334
|
+
# Run tests
|
|
335
|
+
npm run test
|
|
336
|
+
|
|
337
|
+
# Watch mode
|
|
338
|
+
npm run test:watch
|
|
302
339
|
```
|
|
303
340
|
|
|
304
341
|
## License
|