@memberjunction/ai-vector-dupe 5.21.0 → 5.22.0
- package/README.md +254 -230
- package/dist/duplicateRecordDetector.d.ts +116 -18
- package/dist/duplicateRecordDetector.d.ts.map +1 -1
- package/dist/duplicateRecordDetector.js +465 -262
- package/dist/duplicateRecordDetector.js.map +1 -1
- package/dist/index.d.ts +2 -3
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +4 -3
- package/dist/index.js.map +1 -1
- package/dist/scoring/ReciprocalRankFusion.d.ts +45 -0
- package/dist/scoring/ReciprocalRankFusion.d.ts.map +1 -0
- package/dist/scoring/ReciprocalRankFusion.js +63 -0
- package/dist/scoring/ReciprocalRankFusion.js.map +1 -0
- package/package.json +10 -10
- package/dist/config.d.ts +0 -13
- package/dist/config.d.ts.map +0 -1
- package/dist/config.js +0 -15
- package/dist/config.js.map +0 -1
- package/dist/generic/vectorSyncBase.d.ts +0 -20
- package/dist/generic/vectorSyncBase.d.ts.map +0 -1
- package/dist/generic/vectorSyncBase.js +0 -42
- package/dist/generic/vectorSyncBase.js.map +0 -1
- package/dist/models/entitySyncConfig.d.ts +0 -36
- package/dist/models/entitySyncConfig.d.ts.map +0 -1
- package/dist/models/entitySyncConfig.js +0 -2
- package/dist/models/entitySyncConfig.js.map +0 -1
package/README.md
CHANGED
|
@@ -1,42 +1,66 @@
|
|
|
1
1
|
# @memberjunction/ai-vector-dupe
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
<!-- Badges -->
|
|
4
|
+
<!-- [](https://www.npmjs.com/package/@memberjunction/ai-vector-dupe) -->
|
|
5
|
+
<!-- [](https://github.com/MemberJunction/MJ/actions) -->
|
|
6
|
+
|
|
7
|
+
**AI-powered duplicate record detection for MemberJunction entities** -- finds, scores, tracks, and optionally auto-merges duplicate records using vector similarity, hybrid search (RRF), and optional reranking.
|
|
8
|
+
|
|
9
|
+
---
|
|
4
10
|
|
|
5
11
|
## Architecture
|
|
6
12
|
|
|
7
|
-
```mermaid
|
|
8
|
-
graph TD
|
|
9
|
-
subgraph DupePkg["@memberjunction/ai-vector-dupe"]
|
|
10
|
-
DRD["DuplicateRecordDetector"]
|
|
11
|
-
VSB["VectorSyncBase"]
|
|
12
|
-
ESC["EntitySyncConfig"]
|
|
13
|
-
end
|
|
14
|
-
|
|
15
|
-
subgraph Pipeline["Detection Pipeline"]
|
|
16
|
-
LIST["Load Records<br/>from List"] --> VECT["Vectorize Records<br/>via Templates"]
|
|
17
|
-
VECT --> EMBED["Generate<br/>Embeddings"]
|
|
18
|
-
EMBED --> QUERY["Query Vector DB<br/>for Matches"]
|
|
19
|
-
QUERY --> FILTER["Filter by<br/>Threshold"]
|
|
20
|
-
FILTER --> TRACK["Track Results<br/>in Duplicate Runs"]
|
|
21
|
-
TRACK --> MERGE["Auto-Merge<br/>Above Threshold"]
|
|
22
|
-
end
|
|
23
|
-
|
|
24
|
-
subgraph Dependencies["Key Dependencies"]
|
|
25
|
-
VB["ai-vectors<br/>(VectorBase)"]
|
|
26
|
-
SYNC["ai-vector-sync<br/>(EntityVectorSyncer)"]
|
|
27
|
-
VDBB["ai-vectordb<br/>(VectorDBBase)"]
|
|
28
|
-
AI["ai<br/>(BaseEmbeddings)"]
|
|
29
|
-
end
|
|
30
|
-
|
|
31
|
-
DRD -->|extends| VB
|
|
32
|
-
DRD --> SYNC
|
|
33
|
-
DRD --> VDBB
|
|
34
|
-
DRD --> AI
|
|
35
|
-
|
|
36
|
-
style DupePkg fill:#2d6a9f,stroke:#1a4971,color:#fff
|
|
37
|
-
style Pipeline fill:#2d8659,stroke:#1a5c3a,color:#fff
|
|
38
|
-
style Dependencies fill:#7c5295,stroke:#563a6b,color:#fff
|
|
39
13
|
```
|
|
14
|
+
+--------------------------+
|
|
15
|
+
| DuplicateRecordDetector |
|
|
16
|
+
| (extends VectorBase) |
|
|
17
|
+
+-----+----------+---------+
|
|
18
|
+
| |
|
|
19
|
+
+----------------+ +----------------+
|
|
20
|
+
| |
|
|
21
|
+
+---------v----------+ +-----------v---------+
|
|
22
|
+
| GetDuplicateRecords| | CheckSingleRecord |
|
|
23
|
+
| (list-based batch) | | (single record) |
|
|
24
|
+
+--------+-----------+ +-----------+---------+
|
|
25
|
+
| |
|
|
26
|
+
+-------------------+-------------------------+
|
|
27
|
+
|
|
|
28
|
+
+------------v------------+
|
|
29
|
+
| Detection Pipeline |
|
|
30
|
+
+-------------------------+
|
|
31
|
+
| 1. Validate Entity Doc |
|
|
32
|
+
| 2. Vectorize records |
|
|
33
|
+
| 3. Embed via AI model |
|
|
34
|
+
| 4. Query vector DB |
|
|
35
|
+
| (hybrid if supported)|
|
|
36
|
+
| 5. Filter self-matches |
|
|
37
|
+
| 6. Apply thresholds |
|
|
38
|
+
| 7. Persist match results|
|
|
39
|
+
| 8. Auto-merge (optional)|
|
|
40
|
+
+-------------------------+
|
|
41
|
+
|
|
|
42
|
+
+------------------+------------------+
|
|
43
|
+
| | |
|
|
44
|
+
+---------v------+ +-------v--------+ +-------v--------+
|
|
45
|
+
| ai-vector-sync | | ai-vectordb | | ai (Embeddings)|
|
|
46
|
+
| (vectorizer, | | (VectorDBBase, | | (BaseEmbeddings|
|
|
47
|
+
| templates) | | hybrid query) | | GetAIAPIKey) |
|
|
48
|
+
+----------------+ +----------------+ +----------------+
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
**Key dependencies:**
|
|
52
|
+
|
|
53
|
+
| Package | Role |
|
|
54
|
+
|---|---|
|
|
55
|
+
| `@memberjunction/ai` | Embedding model abstraction and API key resolution |
|
|
56
|
+
| `@memberjunction/ai-vectordb` | Vector database abstraction (query, hybrid search) |
|
|
57
|
+
| `@memberjunction/ai-vectors` | `VectorBase` base class with metadata and RunView helpers |
|
|
58
|
+
| `@memberjunction/ai-vector-sync` | `EntityVectorSyncer` for record vectorization, template parsing |
|
|
59
|
+
| `@memberjunction/core` | Core types: `PotentialDuplicateRequest`, `DuplicateDetectionOptions`, etc. |
|
|
60
|
+
| `@memberjunction/core-entities` | Generated entity classes for Duplicate Runs, Lists, Entity Documents |
|
|
61
|
+
| `@memberjunction/global` | `MJGlobal` class factory, `UUIDsEqual` |
|
|
62
|
+
|
|
63
|
+
---
|
|
40
64
|
|
|
41
65
|
## Installation
|
|
42
66
|
|
|
@@ -44,252 +68,249 @@ graph TD
|
|
|
44
68
|
npm install @memberjunction/ai-vector-dupe
|
|
45
69
|
```
|
|
46
70
|
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
The package provides the `DuplicateRecordDetector` class, which orchestrates a complete duplicate detection workflow:
|
|
50
|
-
|
|
51
|
-
1. Loads records from a MemberJunction List
|
|
52
|
-
2. Vectorizes them using a configured Entity Document template and embedding model
|
|
53
|
-
3. Queries the vector database for similarity matches
|
|
54
|
-
4. Filters results against configurable thresholds
|
|
55
|
-
5. Creates Duplicate Run, Duplicate Run Detail, and Duplicate Run Detail Match records for tracking
|
|
56
|
-
6. Optionally auto-merges records that exceed the absolute match threshold
|
|
57
|
-
|
|
58
|
-
## Duplicate Detection Flow
|
|
59
|
-
|
|
60
|
-
```mermaid
|
|
61
|
-
sequenceDiagram
|
|
62
|
-
participant Caller
|
|
63
|
-
participant DRD as DuplicateRecordDetector
|
|
64
|
-
participant EVS as EntityVectorSyncer
|
|
65
|
-
participant Embed as Embedding Model
|
|
66
|
-
participant VDB as Vector Database
|
|
67
|
-
participant DB as MJ Database
|
|
68
|
-
|
|
69
|
-
Caller->>DRD: getDuplicateRecords(request, user)
|
|
70
|
-
DRD->>DB: Load Entity Document
|
|
71
|
-
DRD->>EVS: VectorizeEntity (ensure all records are indexed)
|
|
72
|
-
DRD->>DB: Load records from List
|
|
73
|
-
|
|
74
|
-
loop For each record
|
|
75
|
-
DRD->>Embed: Generate embedding from template
|
|
76
|
-
DRD->>VDB: queryIndex (topK=5)
|
|
77
|
-
VDB-->>DRD: Scored matches
|
|
78
|
-
DRD->>DRD: Filter by PotentialMatchThreshold
|
|
79
|
-
DRD->>DB: Create DuplicateRunDetailMatch records
|
|
80
|
-
end
|
|
81
|
-
|
|
82
|
-
DRD->>DRD: Check AbsoluteMatchThreshold
|
|
83
|
-
DRD->>DB: Auto-merge high-confidence duplicates
|
|
84
|
-
DRD-->>Caller: PotentialDuplicateResponse
|
|
85
|
-
```
|
|
71
|
+
---
|
|
86
72
|
|
|
87
|
-
##
|
|
73
|
+
## Quick Start
|
|
88
74
|
|
|
89
|
-
###
|
|
75
|
+
### List-Based Batch Detection
|
|
76
|
+
|
|
77
|
+
Detect duplicates across all records in an MJ List:
|
|
78
|
+
|
|
79
|
+
```typescript
|
|
80
|
+
import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
|
|
81
|
+
import { PotentialDuplicateRequest } from '@memberjunction/core';
|
|
82
|
+
|
|
83
|
+
const detector = new DuplicateRecordDetector();
|
|
84
|
+
|
|
85
|
+
const request: PotentialDuplicateRequest = {
|
|
86
|
+
ListID: 'your-list-uuid',
|
|
87
|
+
EntityID: 'your-entity-uuid',
|
|
88
|
+
EntityDocumentID: 'your-entity-document-uuid',
|
|
89
|
+
Options: {
|
|
90
|
+
TopK: 10,
|
|
91
|
+
OnProgress: (progress) => {
|
|
92
|
+
console.log(`[${progress.Phase}] ${progress.ProcessedRecords}/${progress.TotalRecords} -- ${progress.MatchesFound} matches`);
|
|
93
|
+
},
|
|
94
|
+
},
|
|
95
|
+
};
|
|
96
|
+
|
|
97
|
+
const response = await detector.GetDuplicateRecords(request, contextUser);
|
|
98
|
+
|
|
99
|
+
if (response.Status === 'Success') {
|
|
100
|
+
for (const result of response.PotentialDuplicateResult) {
|
|
101
|
+
console.log(`Record: ${result.RecordCompositeKey.ToString()}`);
|
|
102
|
+
for (const dupe of result.Duplicates) {
|
|
103
|
+
console.log(` Match: ${dupe.ToString()} (${(dupe.ProbabilityScore * 100).toFixed(1)}%)`);
|
|
104
|
+
}
|
|
105
|
+
}
|
|
106
|
+
}
|
|
107
|
+
```
|
|
90
108
|
|
|
91
|
-
|
|
109
|
+
### Single-Record Check
|
|
92
110
|
|
|
93
|
-
|
|
111
|
+
Check one record for duplicates without creating a list -- ideal for server hooks (e.g., fire-and-forget after record save):
|
|
94
112
|
|
|
95
113
|
```typescript
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
)
|
|
114
|
+
import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
|
|
115
|
+
import { CompositeKey } from '@memberjunction/core';
|
|
116
|
+
|
|
117
|
+
const detector = new DuplicateRecordDetector();
|
|
118
|
+
|
|
119
|
+
const recordKey = new CompositeKey([{ FieldName: 'ID', Value: 'record-uuid' }]);
|
|
120
|
+
|
|
121
|
+
const result = await detector.CheckSingleRecord(
|
|
122
|
+
'your-entity-document-uuid',
|
|
123
|
+
recordKey,
|
|
124
|
+
{ TopK: 5 },
|
|
125
|
+
contextUser
|
|
126
|
+
);
|
|
127
|
+
|
|
128
|
+
for (const dupe of result.Duplicates) {
|
|
129
|
+
console.log(`Potential duplicate: ${dupe.ToString()} (score: ${dupe.ProbabilityScore})`);
|
|
130
|
+
}
|
|
100
131
|
```
|
|
101
132
|
|
|
102
|
-
|
|
133
|
+
---
|
|
103
134
|
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
|
109
|
-
|
|
135
|
+
## DuplicateDetectionOptions Reference
|
|
136
|
+
|
|
137
|
+
Options are passed via the `Options` property on `PotentialDuplicateRequest`, or directly to `CheckSingleRecord`.
|
|
138
|
+
|
|
139
|
+
| Option | Type | Default | Description |
|
|
140
|
+
|---|---|---|---|
|
|
141
|
+
| `TopK` | `number` | `5` | Number of nearest neighbors to retrieve per record |
|
|
142
|
+
| `DuplicateRunID` | `string` | -- | Resume an existing duplicate run (batch mode only) |
|
|
143
|
+
| `KeywordSearchWeight` | `number` | `0.3` | Weight for keyword search in hybrid mode (0.0 = vector only, 1.0 = keyword only). Vector weight is `1.0 - KeywordSearchWeight`. |
|
|
144
|
+
| `FusionMethod` | `string` | `'rrf'` | Fusion method for hybrid search. Currently supports `'rrf'` (Reciprocal Rank Fusion). |
|
|
145
|
+
| `OnProgress` | `(progress: DuplicateDetectionProgress) => void` | -- | Callback for real-time progress reporting |
|
|
146
|
+
|
|
147
|
+
### Thresholds (Configured on Entity Document)
|
|
110
148
|
|
|
111
|
-
|
|
149
|
+
Thresholds are not part of `DuplicateDetectionOptions` -- they are configured on the `EntityDocument` record itself:
|
|
112
150
|
|
|
113
151
|
| Threshold | Purpose |
|
|
114
152
|
|---|---|
|
|
115
|
-
| `PotentialMatchThreshold` | Minimum similarity score to report as potential duplicate |
|
|
116
|
-
| `AbsoluteMatchThreshold` | Minimum similarity score
|
|
153
|
+
| `PotentialMatchThreshold` | Minimum similarity score to report a candidate as a potential duplicate |
|
|
154
|
+
| `AbsoluteMatchThreshold` | Minimum similarity score to trigger automatic record merge |
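A minimal sketch of how the two thresholds might partition a similarity score. The `classifyMatch` helper and its return labels are hypothetical and not part of the package. The `>=` comparison is an assumption drawn from the word "minimum" in the table above.

```typescript
// Hypothetical helper for illustration only; not exported by
// @memberjunction/ai-vector-dupe.
type MatchAction = 'ignore' | 'report' | 'auto-merge';

function classifyMatch(
  score: number,
  potentialMatchThreshold: number,
  absoluteMatchThreshold: number,
): MatchAction {
  // Assumed: a score equal to a threshold counts as meeting it.
  if (score >= absoluteMatchThreshold) return 'auto-merge';
  if (score >= potentialMatchThreshold) return 'report';
  return 'ignore';
}

console.log(classifyMatch(0.97, 0.8, 0.95)); // auto-merge
console.log(classifyMatch(0.85, 0.8, 0.95)); // report
console.log(classifyMatch(0.4, 0.8, 0.95));  // ignore
```

Keeping `AbsoluteMatchThreshold` well above `PotentialMatchThreshold` leaves a band of scores that are reported for review but never merged automatically.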
|
|
117
155
|
|
|
118
|
-
|
|
156
|
+
---
|
|
119
157
|
|
|
120
|
-
|
|
158
|
+
## Hybrid Search and Reciprocal Rank Fusion (RRF)
|
|
121
159
|
|
|
122
|
-
|
|
123
|
-
- `timer(ms)` -- async delay
|
|
124
|
-
- `start()` / `end()` / `timeDiff()` -- execution timing
|
|
125
|
-
- `saveJSONData(data, path)` -- JSON file output
|
|
160
|
+
When the configured vector database supports hybrid search (`VectorDBBase.SupportsHybridSearch === true`), the detector automatically combines **vector similarity** and **keyword search** for higher-quality results.
|
|
126
161
|
|
|
127
|
-
###
|
|
162
|
+
### How It Works
|
|
128
163
|
|
|
129
|
-
|
|
164
|
+
1. The record's template text is sent as both a vector embedding and a keyword query.
|
|
165
|
+
2. The vector DB returns results from both retrieval methods.
|
|
166
|
+
3. Results are fused using **Reciprocal Rank Fusion (RRF)**, a rank-based algorithm that is score-scale independent.
|
|
167
|
+
|
|
168
|
+
### RRF Formula
|
|
130
169
|
|
|
131
|
-
```typescript
|
|
132
|
-
type EntitySyncConfig = {
|
|
133
|
-
EntityDocumentID: string; // Entity Document to use
|
|
134
|
-
Interval: number; // Sync interval in seconds
|
|
135
|
-
RunViewParams: RunViewParams; // View parameters for fetching
|
|
136
|
-
IncludeInSync: boolean; // Whether to include in sync
|
|
137
|
-
LastRunDate: string; // Last sync timestamp
|
|
138
|
-
VectorIndexID: number; // Vector index ID
|
|
139
|
-
VectorID: number; // Vector database ID
|
|
140
|
-
};
|
|
141
170
|
```
|
|
171
|
+
FusedScore(d) = SUM_i [ 1 / (k + rank_i(d)) ]
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
Where `rank_i(d)` is the 1-based rank of document `d` in list `i`, and `k` is a smoothing constant (default: 60).
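The formula can be sketched in a few lines of TypeScript. This is an illustrative re-implementation, not the package's `ComputeRRF`, which may differ in details such as tie-breaking. The names `rrfSketch` and `Candidate` are hypothetical, and each input list is assumed to be sorted best-first already.

```typescript
// Illustrative sketch of the RRF formula above; names are hypothetical.
interface Candidate {
  ID: string;
  Score: number;
}

function rrfSketch(rankedLists: Candidate[][], k = 60): Candidate[] {
  const fused = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((candidate, index) => {
      const rank = index + 1; // 1-based rank within this list
      const previous = fused.get(candidate.ID) ?? 0;
      fused.set(candidate.ID, previous + 1 / (k + rank));
    });
  }
  // Highest fused score first. The original Score values are never read,
  // which is why differing score scales do not matter.
  return Array.from(fused.entries())
    .map(([ID, Score]) => ({ ID, Score }))
    .sort((a, b) => b.Score - a.Score);
}

const fusedSketch = rrfSketch([
  [{ ID: 'rec-1', Score: 0.95 }, { ID: 'rec-2', Score: 0.87 }, { ID: 'rec-3', Score: 0.82 }],
  [{ ID: 'rec-2', Score: 12.5 }, { ID: 'rec-4', Score: 10.1 }, { ID: 'rec-1', Score: 8.3 }],
]);
console.log(fusedSketch.map((c) => c.ID).join(', ')); // rec-2, rec-1, rec-4, rec-3
```

Here `rec-2` ranks first because it sits at ranks 2 and 1 (1/62 + 1/61), narrowly ahead of `rec-1` at ranks 1 and 3 (1/61 + 1/63).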
|
|
142
175
|
|
|
143
|
-
|
|
176
|
+
### Using ComputeRRF Directly
|
|
144
177
|
|
|
145
|
-
|
|
178
|
+
The `ComputeRRF` utility is exported for use in custom pipelines:
|
|
146
179
|
|
|
147
180
|
```typescript
|
|
148
|
-
import {
|
|
149
|
-
|
|
181
|
+
import { ComputeRRF, ScoredCandidate } from '@memberjunction/ai-vector-dupe';
|
|
182
|
+
|
|
183
|
+
const vectorResults: ScoredCandidate[] = [
|
|
184
|
+
{ ID: 'rec-1', Score: 0.95 },
|
|
185
|
+
{ ID: 'rec-2', Score: 0.87 },
|
|
186
|
+
{ ID: 'rec-3', Score: 0.82 },
|
|
187
|
+
];
|
|
188
|
+
|
|
189
|
+
const keywordResults: ScoredCandidate[] = [
|
|
190
|
+
{ ID: 'rec-2', Score: 12.5 }, // Different scale -- RRF handles this
|
|
191
|
+
{ ID: 'rec-4', Score: 10.1 },
|
|
192
|
+
{ ID: 'rec-1', Score: 8.3 },
|
|
193
|
+
];
|
|
194
|
+
|
|
195
|
+
const fused = ComputeRRF([vectorResults, keywordResults], 60);
|
|
196
|
+
// Results sorted by fused RRF score, score-scale independent
|
|
197
|
+
```
|
|
150
198
|
|
|
151
|
-
|
|
199
|
+
### Tuning Hybrid Search
|
|
152
200
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
};
|
|
201
|
+
- **`KeywordSearchWeight = 0.0`**: Pure vector similarity (semantic matching).
|
|
202
|
+
- **`KeywordSearchWeight = 0.3`** (default): Slight keyword boost. Good for entities with distinctive names or codes.
|
|
203
|
+
- **`KeywordSearchWeight = 0.5`**: Equal weight. Useful when both semantic and lexical matches matter.
|
|
204
|
+
- **`KeywordSearchWeight = 1.0`**: Pure keyword search (not recommended for duplicate detection).
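The README does not state how the detector combines `KeywordSearchWeight` with RRF. One plausible reading, shown here purely as an assumption, is that each list's reciprocal-rank contribution is scaled by its weight. The `weightedRrfSketch` function is hypothetical and not part of the package.

```typescript
// Assumption: weights scale each list's reciprocal-rank contribution.
// The package's actual weighting scheme may differ.
function weightedRrfSketch(
  vectorIDs: string[],   // IDs ordered best-first by vector similarity
  keywordIDs: string[],  // IDs ordered best-first by keyword relevance
  keywordSearchWeight = 0.3,
  k = 60,
): Map<string, number> {
  const vectorWeight = 1.0 - keywordSearchWeight;
  const scores = new Map<string, number>();
  const accumulate = (ids: string[], weight: number): void => {
    ids.forEach((id, index) => {
      const rank = index + 1; // 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  };
  accumulate(vectorIDs, vectorWeight);
  accumulate(keywordIDs, keywordSearchWeight);
  return scores;
}

// With a weight of 0.0, a keyword-only hit contributes nothing.
const vectorOnly = weightedRrfSketch(['rec-1'], ['rec-9'], 0.0);
console.log(vectorOnly.get('rec-9')); // 0
```

Under this reading, the two extremes in the list above fall out naturally: at `0.0` only the vector ranking affects the fused score, and at `1.0` only the keyword ranking does.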
|
|
158
205
|
|
|
159
|
-
|
|
206
|
+
---
|
|
160
207
|
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
208
|
+
## Reranking
|
|
209
|
+
|
|
210
|
+
When MJ's `BaseReranker` / `RerankerService` is configured, the detector can apply a second-stage reranking pass after initial retrieval. Reranking uses a cross-encoder model to re-score candidates with higher precision than embedding-based similarity alone.
|
|
211
|
+
|
|
212
|
+
Reranking is especially effective when:
|
|
213
|
+
- Initial retrieval returns many borderline candidates
|
|
214
|
+
- Entity records have complex, multi-field structures
|
|
215
|
+
- You need to maximize precision at the cost of slightly higher latency
|
|
216
|
+
|
|
217
|
+
See the [Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md#reranking-integration) for configuration details.
|
|
218
|
+
|
|
219
|
+
---
|
|
170
220
|
|
|
171
|
-
|
|
221
|
+
## Progress Reporting
|
|
222
|
+
|
|
223
|
+
The `OnProgress` callback fires at each phase of the pipeline:
|
|
172
224
|
|
|
173
225
|
```typescript
|
|
174
226
|
const request: PotentialDuplicateRequest = {
|
|
175
|
-
|
|
176
|
-
EntityID: 'entity-uuid',
|
|
177
|
-
EntityDocumentID: 'doc-uuid',
|
|
227
|
+
// ...
|
|
178
228
|
Options: {
|
|
179
|
-
|
|
180
|
-
|
|
229
|
+
OnProgress: (progress) => {
|
|
230
|
+
const { Phase, TotalRecords, ProcessedRecords, MatchesFound, ElapsedMs } = progress;
|
|
231
|
+
const pct = TotalRecords > 0 ? ((ProcessedRecords / TotalRecords) * 100).toFixed(0) : '0';
|
|
232
|
+
console.log(`[${Phase}] ${pct}% -- ${MatchesFound} matches (${ElapsedMs}ms)`);
|
|
233
|
+
},
|
|
234
|
+
},
|
|
181
235
|
};
|
|
182
|
-
|
|
183
|
-
const response = await detector.getDuplicateRecords(request, currentUser);
|
|
184
236
|
```
|
|
185
237
|
|
|
186
|
-
|
|
187
|
-
|
|
188
|
-
The package reads from and writes to these MemberJunction entities:
|
|
189
|
-
|
|
190
|
-
```mermaid
|
|
191
|
-
erDiagram
|
|
192
|
-
DUPLICATE_RUN {
|
|
193
|
-
string ID PK
|
|
194
|
-
string EntityID
|
|
195
|
-
string StartedByUserID
|
|
196
|
-
datetime StartedAt
|
|
197
|
-
datetime EndedAt
|
|
198
|
-
string ProcessingStatus
|
|
199
|
-
string ApprovalStatus
|
|
200
|
-
string SourceListID
|
|
201
|
-
}
|
|
238
|
+
### Progress Phases
|
|
202
239
|
|
|
203
|
-
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
|
|
207
|
-
|
|
208
|
-
|
|
209
|
-
|
|
240
|
+
| Phase | Description |
|
|
241
|
+
|---|---|
|
|
242
|
+
| `Vectorizing` | Records are being vectorized via `EntityVectorSyncer` |
|
|
243
|
+
| `Embedding` | Template texts are being embedded via the AI model |
|
|
244
|
+
| `Querying` | Vector DB is being queried for each record |
|
|
245
|
+
| `Matching` | Results are being persisted and match records created |
|
|
246
|
+
| `Merging` | High-confidence matches are being auto-merged |
|
|
210
247
|
|
|
211
|
-
|
|
212
|
-
string ID PK
|
|
213
|
-
string DuplicateRunDetailID FK
|
|
214
|
-
string MatchRecordID
|
|
215
|
-
float MatchProbability
|
|
216
|
-
datetime MatchedAt
|
|
217
|
-
string Action
|
|
218
|
-
string ApprovalStatus
|
|
219
|
-
string MergeStatus
|
|
220
|
-
}
|
|
248
|
+
### DuplicateDetectionProgress Shape
|
|
221
249
|
|
|
222
|
-
|
|
223
|
-
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
250
|
+
```typescript
|
|
251
|
+
interface DuplicateDetectionProgress {
|
|
252
|
+
Phase: 'Vectorizing' | 'Embedding' | 'Querying' | 'Matching' | 'Merging';
|
|
253
|
+
TotalRecords: number;
|
|
254
|
+
ProcessedRecords: number;
|
|
255
|
+
MatchesFound: number;
|
|
256
|
+
CurrentRecordID?: string;
|
|
257
|
+
ElapsedMs: number;
|
|
258
|
+
}
|
|
259
|
+
```
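For long runs, logging every callback can be noisy. The sketch below logs only the first update of each phase. `createPhaseLogger` and `ProgressLike` are hypothetical names; `ProgressLike` mirrors a subset of the fields shown above so the example stays self-contained.

```typescript
// Hypothetical helper; ProgressLike copies a subset of the fields of
// DuplicateDetectionProgress as documented above.
interface ProgressLike {
  Phase: string;
  TotalRecords: number;
  ProcessedRecords: number;
  MatchesFound: number;
  ElapsedMs: number;
}

function createPhaseLogger(
  sink: (line: string) => void,
): (progress: ProgressLike) => void {
  let lastPhase: string | undefined;
  return (progress) => {
    if (progress.Phase === lastPhase) return; // same phase: stay quiet
    lastPhase = progress.Phase;
    sink(`[${progress.Phase}] first update at ${progress.ElapsedMs}ms`);
  };
}

const lines: string[] = [];
const onProgress = createPhaseLogger((line) => lines.push(line));
onProgress({ Phase: 'Vectorizing', TotalRecords: 10, ProcessedRecords: 0, MatchesFound: 0, ElapsedMs: 5 });
onProgress({ Phase: 'Vectorizing', TotalRecords: 10, ProcessedRecords: 5, MatchesFound: 0, ElapsedMs: 40 });
onProgress({ Phase: 'Querying', TotalRecords: 10, ProcessedRecords: 0, MatchesFound: 0, ElapsedMs: 90 });
console.log(lines.length); // 2
```

The returned function could be passed as `OnProgress` in the request options, assuming the real progress object is structurally compatible with `ProgressLike`.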
|
|
227
260
|
|
|
228
|
-
|
|
229
|
-
string ID PK
|
|
230
|
-
string ListID FK
|
|
231
|
-
string RecordID
|
|
232
|
-
}
|
|
261
|
+
---
|
|
233
262
|
|
|
234
|
-
|
|
235
|
-
string ID PK
|
|
236
|
-
string EntityID
|
|
237
|
-
string TemplateID
|
|
238
|
-
string AIModelID
|
|
239
|
-
string VectorDatabaseID
|
|
240
|
-
float PotentialMatchThreshold
|
|
241
|
-
float AbsoluteMatchThreshold
|
|
242
|
-
}
|
|
263
|
+
## API Reference Summary
|
|
243
264
|
|
|
244
|
-
|
|
245
|
-
DUPLICATE_RUN_DETAIL ||--o{ DUPLICATE_RUN_DETAIL_MATCH : has
|
|
246
|
-
DUPLICATE_RUN }o--|| LIST : "source"
|
|
247
|
-
LIST ||--o{ LIST_DETAIL : contains
|
|
248
|
-
```
|
|
265
|
+
### DuplicateRecordDetector
|
|
249
266
|
|
|
250
|
-
|
|
267
|
+
| Method | Signature | Description |
|
|
268
|
+
|---|---|---|
|
|
269
|
+
| `GetDuplicateRecords` | `(params: PotentialDuplicateRequest, contextUser?: UserInfo) => Promise<PotentialDuplicateResponse>` | Run batch duplicate detection for all records in a list |
|
|
270
|
+
| `CheckSingleRecord` | `(EntityDocumentID: string, RecordID: CompositeKey, Options?: DuplicateDetectionOptions, ContextUser?: UserInfo) => Promise<PotentialDuplicateResult>` | Check a single record for duplicates |
|
|
271
|
+
| `ParseVectorMatches` | `(queryResponse: BaseResponse, sourceKey?: CompositeKey) => PotentialDuplicateResult` | Parse raw vector DB response into typed results |
|
|
251
272
|
|
|
252
|
-
|
|
253
|
-
# AI Model API Keys
|
|
254
|
-
OPENAI_API_KEY=your-openai-key
|
|
255
|
-
MISTRAL_API_KEY=your-mistral-key
|
|
273
|
+
### ComputeRRF
|
|
256
274
|
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
|
|
260
|
-
|
|
275
|
+
```typescript
|
|
276
|
+
function ComputeRRF(rankedLists: ScoredCandidate[][], k?: number): ScoredCandidate[]
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
Compute Reciprocal Rank Fusion across multiple ranked result lists. Returns candidates sorted by descending fused score.
|
|
261
280
|
|
|
262
|
-
|
|
263
|
-
DB_HOST=your-sql-server
|
|
264
|
-
DB_PORT=1433
|
|
265
|
-
DB_USERNAME=your-username
|
|
266
|
-
DB_PASSWORD=your-password
|
|
267
|
-
DB_DATABASE=your-database
|
|
281
|
+
### ScoredCandidate
|
|
268
282
|
|
|
269
|
-
|
|
270
|
-
|
|
283
|
+
```typescript
|
|
284
|
+
interface ScoredCandidate {
|
|
285
|
+
ID: string;
|
|
286
|
+
Score: number;
|
|
287
|
+
Metadata?: Record<string, unknown>;
|
|
288
|
+
}
|
|
271
289
|
```
|
|
272
290
|
|
|
273
|
-
|
|
291
|
+
---
|
|
292
|
+
|
|
293
|
+
## Database Entities
|
|
294
|
+
|
|
295
|
+
The package reads from and writes to these MJ entities:
|
|
274
296
|
|
|
275
|
-
|
|
|
297
|
+
| Entity | Purpose |
|
|
276
298
|
|---|---|
|
|
277
|
-
|
|
|
278
|
-
|
|
|
279
|
-
|
|
|
280
|
-
|
|
|
281
|
-
|
|
|
282
|
-
|
|
283
|
-
|
|
284
|
-
|
|
285
|
-
|
|
286
|
-
|
|
287
|
-
|
|
288
|
-
|
|
289
|
-
-
|
|
290
|
-
|
|
291
|
-
|
|
292
|
-
- Records must be added to a List before detection can run
|
|
299
|
+
| `MJ: Entity Documents` | Configuration: template, AI model, vector DB, thresholds |
|
|
300
|
+
| `MJ: Lists` / `MJ: List Details` | Source records to check for duplicates |
|
|
301
|
+
| `MJ: Duplicate Runs` | Tracks each detection run (status, timing) |
|
|
302
|
+
| `MJ: Duplicate Run Details` | Per-record tracking within a run |
|
|
303
|
+
| `MJ: Duplicate Run Detail Matches` | Individual match results with probability scores |
|
|
304
|
+
|
|
305
|
+
---
|
|
306
|
+
|
|
307
|
+
## Further Reading
|
|
308
|
+
|
|
309
|
+
- **[Duplicate Detection Guide](docs/DUPLICATE_DETECTION_GUIDE.md)** -- comprehensive developer guide covering end-to-end workflow, threshold tuning, hybrid search deep dive, performance optimization, and troubleshooting
|
|
310
|
+
- **[MemberJunction AI Vectors](../Core/README.md)** -- base vector infrastructure
|
|
311
|
+
- **[AI Vector Sync](../Sync/README.md)** -- entity vectorization and template parsing
|
|
312
|
+
|
|
313
|
+
---
|
|
293
314
|
|
|
294
315
|
## Development
|
|
295
316
|
|
|
@@ -297,8 +318,11 @@ CURRENT_USER_EMAIL=user@example.com
|
|
|
297
318
|
# Build
|
|
298
319
|
npm run build
|
|
299
320
|
|
|
300
|
-
#
|
|
301
|
-
npm run
|
|
321
|
+
# Run tests
|
|
322
|
+
npm run test
|
|
323
|
+
|
|
324
|
+
# Watch mode
|
|
325
|
+
npm run test:watch
|
|
302
326
|
```
|
|
303
327
|
|
|
304
328
|
## License
|