email-origin-chain 1.0.8 → 1.0.11
- package/README.md +35 -2
- package/dist/detectors/outlook-empty-header-detector.d.ts +1 -1
- package/dist/detectors/outlook-empty-header-detector.js +2 -1
- package/dist/detectors/outlook-reverse-fr-detector.d.ts +1 -1
- package/dist/detectors/outlook-reverse-fr-detector.js +2 -1
- package/dist/detectors/registry.js +6 -6
- package/dist/index.js +15 -3
- package/dist/inline-layer.js +1 -1
- package/dist/scoring.d.ts +16 -0
- package/dist/scoring.js +154 -0
- package/dist/types.d.ts +8 -0
- package/dist/utils.js +3 -0
- package/docs/architecture/README.md +4 -0
- package/docs/confidence_scoring.md +75 -0
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -17,6 +17,7 @@ Detailed documentation can be found in the [docs/architecture/](docs/architectur
|
|
|
17
17
|
- [Phase 2: Plugin Architecture](docs/architecture/phase2_plugin_foundation.md)
|
|
18
18
|
- [Phase 3: Full Compatibility (100%)](docs/architecture/phase3_fallbacks.md)
|
|
19
19
|
- [Deep Forward Fix Walkthrough](docs/walkthrough_deep_forward_fix.md)
|
|
20
|
+
- [Confidence Scoring System](docs/confidence_scoring.md)
|
|
20
21
|
- [Detector Usage & Priorities](docs/detectors_usage.md)
|
|
21
22
|
|
|
22
23
|
**✅ Test Coverage:** The library has been validated against **239 fixtures** from the `email-forward-parser-recursive` library with a **100% success rate** (239/239). This includes validating message bodies and ensuring non-message snippets are correctly identified. See [Test Coverage Report](docs/TEST_COVERAGE.md) for details.
|
|
@@ -77,6 +78,10 @@ The library returns a `ResultObject` with the following structure:
|
|
|
77
78
|
| `text` | `string \| null` | Cleaned body content of the deepest message. |
|
|
78
79
|
| `attachments` | `array` | Metadata for MIME attachments found at the deepest level. |
|
|
79
80
|
| `history` | `array` | **Conversation Chaining**: Full audit trail of the discussion (see below). |
|
|
81
|
+
| `confidence_score` | `number` | Reliability score (0-100) based on signal analysis. |
|
|
82
|
+
| `confidence_description` | `string` | Human-readable explanation of the score. |
|
|
83
|
+
| `confidence_signals` | `object` | Key-value breakdown of triggered bonuses and penalties. |
|
|
84
|
+
| `confidence_reasons` | `array` | Detailed list of triggered scoring rules. |
|
|
80
85
|
| `diagnostics` | `object` | Metadata about the parsing process. |
|
|
81
86
|
|
|
82
87
|
### Diagnostics Detail
|
|
@@ -116,6 +121,22 @@ Each history entry contains its own `from`, `to`, `cc`, `subject`, `date_iso`, `
|
|
|
116
121
|
- `content:silent_forward`: The user forwarded the message without adding any text.
|
|
117
122
|
- `date:unparseable`: A date string was found but could not be normalized to ISO.
|
|
118
123
|
|
|
124
|
+
## Confidence Scoring System
|
|
125
|
+
|
|
126
|
+
To ensure high-quality extraction from text-based forwards, the library uses a **Signal-Based Confidence Score**. It analyzes metrics such as email address density, sender count consistency, and quote nesting levels to detect garbage or incomplete chains.
|
|
127
|
+
|
|
128
|
+
### Scoring Logic:
|
|
129
|
+
- **Baseline**: 100% confidence for standard formatting (~2 emails per level).
|
|
130
|
+
- **Penalties**:
|
|
131
|
+
- **Sender Mismatch**: More senders found than levels detected (-75%).
|
|
132
|
+
- **Quote Mismatch**: Quote nesting deeper than detected levels (-75%).
|
|
133
|
+
- **Partial Chain**: Only 1 email detected per level (-50%).
|
|
134
|
+
- **Ghost Forward**: No emails found in text (-100%).
|
|
135
|
+
- **Bonuses**:
|
|
136
|
+
- **Validated Density**: High email density corroborated by context headers (+75%).
|
|
137
|
+
|
|
138
|
+
Check the [Confidence Scoring Documentation](docs/confidence_scoring.md) for full details.
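
The scoring logic above boils down to "start at 100, add every triggered signal, clamp to 0-100". The following is a minimal sketch of that aggregation step only, not the library's full implementation; `aggregateScore` is a hypothetical helper name, while the signal keys are the ones that appear in `dist/scoring.js`.

```javascript
// Sketch of the aggregation step: baseline 100, add each signal, clamp to 0..100.
// `aggregateScore` is a hypothetical name used for illustration only.
function aggregateScore(signals) {
  let score = 100;
  for (const impact of Object.values(signals)) {
    score += impact;
  }
  return Math.max(0, Math.min(100, score));
}

const examples = {
  // No signal triggered: the baseline is kept.
  standard: aggregateScore({}),
  // Roughly one email per level.
  partial: aggregateScore({ adjustment_partial: -50 }),
  // High density that is corroborated by headers cancels out.
  validatedHighDensity: aggregateScore({
    adjustment_high_density: -75,
    bonus_validated_density: 75,
  }),
  // Two independent penalties push the sum below zero, so it is clamped.
  senderAndQuoteMismatch: aggregateScore({
    penalty_sender_mismatch: -75,
    penalty_quote_mismatch: -75,
  }),
};

console.log(examples);
```

Because the penalties are additive, two -75 penalties are enough to reach the floor of 0, and a validated high-density chain returns to the baseline.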
|
|
139
|
+
|
|
119
140
|
### Typical Output Example
|
|
120
141
|
|
|
121
142
|
```json
|
|
@@ -148,7 +169,13 @@ Each history entry contains its own `from`, `to`, `cc`, `subject`, `date_iso`, `
|
|
|
148
169
|
"depth": 2,
|
|
149
170
|
"parsedOk": true,
|
|
150
171
|
"warnings": []
|
|
151
|
-
}
|
|
172
|
+
},
|
|
173
|
+
"confidence_score": 100,
|
|
174
|
+
"confidence_description": "High Confidence: Standard Density: Ratio 2.00 is optimal (~2 emails per level)",
|
|
175
|
+
"confidence_signals": {},
|
|
176
|
+
"confidence_reasons": [
|
|
177
|
+
"Standard Density: Ratio 2.00 is optimal (~2 emails per level)"
|
|
178
|
+
]
|
|
152
179
|
}
|
|
153
180
|
```
|
|
154
181
|
|
|
@@ -278,7 +305,13 @@ console.log(result.diagnostics.depth); // 4 (5 messages total)
|
|
|
278
305
|
"depth": 4,
|
|
279
306
|
"parsedOk": true,
|
|
280
307
|
"warnings": []
|
|
281
|
-
}
|
|
308
|
+
},
|
|
309
|
+
"confidence_score": 100,
|
|
310
|
+
"confidence_description": "High Confidence: Standard Density: Ratio 2.00 is optimal (~2 emails per level)",
|
|
311
|
+
"confidence_signals": {},
|
|
312
|
+
"confidence_reasons": [
|
|
313
|
+
"Standard Density: Ratio 2.00 is optimal (~2 emails per level)"
|
|
314
|
+
]
|
|
282
315
|
}
|
|
283
316
|
```
|
|
284
317
|
|
|
package/dist/detectors/outlook-empty-header-detector.d.ts
CHANGED
|
@@ -10,7 +10,7 @@ import { ForwardDetector, DetectionResult } from './types';
|
|
|
10
10
|
*/
|
|
11
11
|
export declare class OutlookEmptyHeaderDetector implements ForwardDetector {
|
|
12
12
|
readonly name = "outlook_empty_header";
|
|
13
|
-
readonly priority = 50;
|
|
13
|
+
readonly priority = -50;
|
|
14
14
|
private readonly HEADER_PATTERN;
|
|
15
15
|
detect(text: string): DetectionResult;
|
|
16
16
|
}
|
|
package/dist/detectors/outlook-empty-header-detector.js
CHANGED
|
@@ -14,7 +14,7 @@ const cleaner_1 = require("../utils/cleaner");
|
|
|
14
14
|
class OutlookEmptyHeaderDetector {
|
|
15
15
|
constructor() {
|
|
16
16
|
this.name = 'outlook_empty_header';
|
|
17
|
-
this.priority = 50; //
|
|
17
|
+
this.priority = -50; // Very specific - High Priority
|
|
18
18
|
// Regex to capture the header block:
|
|
19
19
|
// 1. Optional Separator (mostly underscores)
|
|
20
20
|
// 2. De: ... (From)
|
|
@@ -51,6 +51,7 @@ class OutlookEmptyHeaderDetector {
|
|
|
51
51
|
message: message || undefined,
|
|
52
52
|
email: {
|
|
53
53
|
from: fromLine,
|
|
54
|
+
to: toLine,
|
|
54
55
|
subject: subjectLine,
|
|
55
56
|
date: dateLine || undefined,
|
|
56
57
|
body: finalBody
|
|
package/dist/detectors/outlook-reverse-fr-detector.d.ts
CHANGED
|
@@ -4,7 +4,7 @@ import { ForwardDetector, DetectionResult } from './types';
|
|
|
4
4
|
*/
|
|
5
5
|
export declare class OutlookReverseFrDetector implements ForwardDetector {
|
|
6
6
|
readonly name = "outlook_reverse_fr";
|
|
7
|
-
readonly priority = -
|
|
7
|
+
readonly priority = -45;
|
|
8
8
|
private readonly ENVOYE_PATTERN;
|
|
9
9
|
private readonly DE_PATTERN;
|
|
10
10
|
private readonly A_PATTERN;
|
|
package/dist/detectors/outlook-reverse-fr-detector.js
CHANGED
|
@@ -8,7 +8,7 @@ const cleaner_1 = require("../utils/cleaner");
|
|
|
8
8
|
class OutlookReverseFrDetector {
|
|
9
9
|
constructor() {
|
|
10
10
|
this.name = 'outlook_reverse_fr';
|
|
11
|
-
this.priority = -
|
|
11
|
+
this.priority = -45; // Specific detector - High Priority
|
|
12
12
|
// Regex patterns for field detection
|
|
13
13
|
this.ENVOYE_PATTERN = /^[ \t]*Envoy(?:é|=E9|e)?\s*:\s*(.*?)\s*$/m;
|
|
14
14
|
this.DE_PATTERN = /^[ \t]*De\s*:/i;
|
|
@@ -76,6 +76,7 @@ class OutlookReverseFrDetector {
|
|
|
76
76
|
from: fromEmail.includes('@')
|
|
77
77
|
? { name: fromName !== fromEmail ? fromName : '', address: fromEmail }
|
|
78
78
|
: { name: fromName, address: fromName },
|
|
79
|
+
to: a ? extractValue(a.line) : undefined,
|
|
79
80
|
subject: objet ? extractValue(objet.line) : '',
|
|
80
81
|
date: extractValue(envoyeMatch[0]),
|
|
81
82
|
body: finalBody
|
|
package/dist/detectors/registry.js
CHANGED
|
@@ -15,12 +15,12 @@ class DetectorRegistry {
|
|
|
15
15
|
constructor(customDetectors = []) {
|
|
16
16
|
this.detectors = [];
|
|
17
17
|
// Register all detectors (priority determines order)
|
|
18
|
-
this.register(new
|
|
19
|
-
this.register(new
|
|
20
|
-
this.register(new
|
|
21
|
-
this.register(new
|
|
22
|
-
this.register(new
|
|
23
|
-
this.register(new
|
|
18
|
+
this.register(new outlook_empty_header_detector_1.OutlookEmptyHeaderDetector()); // priority: -50 (Very specific)
|
|
19
|
+
this.register(new outlook_reverse_fr_detector_1.OutlookReverseFrDetector()); // priority: -45 (Specific)
|
|
20
|
+
this.register(new new_outlook_detector_1.NewOutlookDetector()); // priority: -40 (Specific)
|
|
21
|
+
this.register(new outlook_fr_detector_1.OutlookFRDetector()); // priority: -30 (Fallback for FR)
|
|
22
|
+
this.register(new reply_detector_1.ReplyDetector()); // priority: -10 (Replies)
|
|
23
|
+
this.register(new crisp_detector_1.CrispDetector()); // priority: 100 (Universal fallback)
|
|
24
24
|
// Register custom detectors
|
|
25
25
|
customDetectors.forEach(detector => this.register(detector));
|
|
26
26
|
}
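
The registration comments suggest that a lower `priority` number means a detector is tried earlier (`-50` is labelled "Very specific", `100` "Universal fallback"). The sketch below illustrates that ordering under this assumption; the diff does not show how `register` actually sorts, and the objects are hypothetical stand-ins for detector instances (the `crisp` name in particular is a placeholder).

```javascript
// Hypothetical stand-ins for detector instances: only `name` and `priority` matter here.
const detectors = [
  { name: 'crisp', priority: 100 },                // placeholder name for the universal fallback
  { name: 'outlook_reverse_fr', priority: -45 },
  { name: 'outlook_empty_header', priority: -50 },
];

// Assumption: ascending sort, so the most specific (most negative) detector runs first.
const ordered = [...detectors].sort((a, b) => a.priority - b.priority);
const order = ordered.map(d => d.name);

console.log(order);
```

Under this reading, moving `OutlookEmptyHeaderDetector` from `50` to `-50` places it ahead of the other built-in detectors instead of just before the fallback.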
|
package/dist/index.js
CHANGED
|
@@ -18,6 +18,7 @@ exports.extractDeepestHybrid = extractDeepestHybrid;
|
|
|
18
18
|
const mime_layer_1 = require("./mime-layer");
|
|
19
19
|
const inline_layer_1 = require("./inline-layer");
|
|
20
20
|
const utils_1 = require("./utils");
|
|
21
|
+
const scoring_1 = require("./scoring");
|
|
21
22
|
/**
|
|
22
23
|
* Main entry point: Extract the deepest forwarded email using hybrid strategy
|
|
23
24
|
*/
|
|
@@ -53,17 +54,17 @@ async function extractDeepestHybrid(raw, options) {
|
|
|
53
54
|
const inlineResult = await (0, inline_layer_1.processInline)(mimeResult.rawBody, mimeResult.depth, mimeResult.history, opts.customDetectors);
|
|
54
55
|
// Step 3: Align results
|
|
55
56
|
let from = (0, utils_1.normalizeFrom)(inlineResult.from);
|
|
56
|
-
let to = inlineResult.to;
|
|
57
|
+
let to = (0, utils_1.normalizeFrom)(inlineResult.to);
|
|
57
58
|
let subject = inlineResult.subject;
|
|
58
59
|
let date_raw = inlineResult.date_raw;
|
|
59
60
|
let date_iso = inlineResult.date_iso;
|
|
60
61
|
let text = inlineResult.text;
|
|
61
62
|
if (inlineResult.diagnostics.method === 'fallback' && mimeResult.metadata) {
|
|
62
63
|
const m = mimeResult.metadata;
|
|
63
|
-
if (!from && m.from?.value?.[0]) {
|
|
64
|
+
if ((!from || !from.address) && m.from?.value?.[0]) {
|
|
64
65
|
from = (0, utils_1.normalizeFrom)({ name: m.from.value[0].name, address: m.from.value[0].address });
|
|
65
66
|
}
|
|
66
|
-
if (!to && m.to?.value?.[0]) {
|
|
67
|
+
if ((!to || !to.address) && m.to?.value?.[0]) {
|
|
67
68
|
to = (0, utils_1.normalizeFrom)({ name: m.to.value[0].name, address: m.to.value[0].address });
|
|
68
69
|
}
|
|
69
70
|
if (!subject && m.subject)
|
|
@@ -99,6 +100,8 @@ async function extractDeepestHybrid(raw, options) {
|
|
|
99
100
|
date_iso = date_iso || (0, utils_1.normalizeDateToISO)(date_raw);
|
|
100
101
|
// Destructure to exclude 'from' since we have our own normalized version
|
|
101
102
|
const { from: _unusedFrom, ...restInlineResult } = inlineResult;
|
|
103
|
+
// Calculate confidence score
|
|
104
|
+
const confidence = (0, scoring_1.calculateConfidence)(mimeResult.rawBody, mimeResult.depth + inlineResult.diagnostics.depth);
|
|
102
105
|
const result = {
|
|
103
106
|
...restInlineResult,
|
|
104
107
|
// Use our normalized/enriched values
|
|
@@ -109,6 +112,15 @@ async function extractDeepestHybrid(raw, options) {
|
|
|
109
112
|
date_iso,
|
|
110
113
|
text: (0, utils_1.cleanText)(text),
|
|
111
114
|
full_body: mimeResult.rawBody,
|
|
115
|
+
// Confidence
|
|
116
|
+
confidence_score: confidence.score,
|
|
117
|
+
confidence_description: confidence.description,
|
|
118
|
+
confidence_ratio: confidence.ratio,
|
|
119
|
+
confidence_email_count: confidence.email_count,
|
|
120
|
+
confidence_sender_count: confidence.sender_count,
|
|
121
|
+
confidence_quote_depth: confidence.quote_depth,
|
|
122
|
+
confidence_signals: confidence.signals,
|
|
123
|
+
confidence_reasons: confidence.reasons,
|
|
112
124
|
attachments: [...attachments, ...inlineResult.attachments],
|
|
113
125
|
diagnostics: {
|
|
114
126
|
...inlineResult.diagnostics,
|
package/dist/inline-layer.js
CHANGED
|
@@ -121,7 +121,7 @@ async function processInline(text, depth, baseHistory = [], customDetectors = []
|
|
|
121
121
|
attachments: [],
|
|
122
122
|
history: history.slice().reverse(),
|
|
123
123
|
diagnostics: {
|
|
124
|
-
method: (deepestEntry.flags.find(f => f.startsWith('method:')) || 'inline'),
|
|
124
|
+
method: (deepestEntry.flags.find(f => f.startsWith('method:'))?.replace('method:', '') || 'inline'),
|
|
125
125
|
depth: currentDepth - startingDepth,
|
|
126
126
|
parsedOk: true,
|
|
127
127
|
warnings: warnings
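
The `method` change above strips the `method:` prefix from the flag before it is stored in `diagnostics.method`. This matters because `dist/index.js` compares `diagnostics.method` against the bare string `'fallback'`. A small sketch of the difference, using a hypothetical flag list (the `method:fallback` flag value is an assumption made for illustration):

```javascript
// Hypothetical flags on the deepest history entry.
const flags = ['method:fallback', 'content:silent_forward'];

// 1.0.8 expression: the whole flag, prefix included, becomes the method.
const before = flags.find(f => f.startsWith('method:')) || 'inline';

// 1.0.11 expression: the prefix is stripped, leaving the bare method name.
const after = flags.find(f => f.startsWith('method:'))?.replace('method:', '') || 'inline';

// With no method flag at all, both versions fall back to 'inline'.
const noFlag = [].find(f => f.startsWith('method:'))?.replace('method:', '') || 'inline';

console.log({ before, after, noFlag });
console.log(before === 'fallback', after === 'fallback');
```

With the old expression, a check such as `method === 'fallback'` could never succeed for a prefixed flag, so the MIME metadata fallback in `extractDeepestHybrid` would be skipped.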
|
|
package/dist/scoring.d.ts
ADDED
|
@@ -0,0 +1,16 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Confidence Score Calculation Logic
|
|
3
|
+
* Evaluates the coherence between detected forward depth and email address density.
|
|
4
|
+
* Uses a Signal-Based architecture where various factors contribute to a health score.
|
|
5
|
+
*/
|
|
6
|
+
export interface ConfidenceResult {
|
|
7
|
+
score: number;
|
|
8
|
+
description: string;
|
|
9
|
+
ratio: number;
|
|
10
|
+
email_count: number;
|
|
11
|
+
sender_count: number;
|
|
12
|
+
quote_depth: number;
|
|
13
|
+
signals: Record<string, number>;
|
|
14
|
+
reasons: string[];
|
|
15
|
+
}
|
|
16
|
+
export declare function calculateConfidence(fullBody: string, depth: number): ConfidenceResult;
|
package/dist/scoring.js
ADDED
|
@@ -0,0 +1,154 @@
|
|
|
1
|
+
"use strict";
|
|
2
|
+
/**
|
|
3
|
+
* Confidence Score Calculation Logic
|
|
4
|
+
* Evaluates the coherence between detected forward depth and email address density.
|
|
5
|
+
* Uses a Signal-Based architecture where various factors contribute to a health score.
|
|
6
|
+
*/
|
|
7
|
+
Object.defineProperty(exports, "__esModule", { value: true });
|
|
8
|
+
exports.calculateConfidence = calculateConfidence;
|
|
9
|
+
function calculateConfidence(fullBody, depth) {
|
|
10
|
+
// 0. Base case: No depth detected implies no confidence metric applicable (N/A)
|
|
11
|
+
if (depth === 0) {
|
|
12
|
+
return {
|
|
13
|
+
score: 100,
|
|
14
|
+
description: "N/A (No depth detected)",
|
|
15
|
+
ratio: 0,
|
|
16
|
+
email_count: 0,
|
|
17
|
+
sender_count: 0,
|
|
18
|
+
quote_depth: 0,
|
|
19
|
+
signals: {},
|
|
20
|
+
reasons: ["No depth detected"]
|
|
21
|
+
};
|
|
22
|
+
}
|
|
23
|
+
// 1. Calculate Max Quote Depth (">" prefix)
|
|
24
|
+
const lines = fullBody.split('\n');
|
|
25
|
+
let maxQuoteDepth = 0;
|
|
26
|
+
for (const line of lines) {
|
|
27
|
+
const match = line.match(/^(\s*>)+/);
|
|
28
|
+
if (match) {
|
|
29
|
+
const qCount = (match[0].match(/>/g) || []).length;
|
|
30
|
+
if (qCount > maxQuoteDepth)
|
|
31
|
+
maxQuoteDepth = qCount;
|
|
32
|
+
}
|
|
33
|
+
}
|
|
34
|
+
// 2. Count emails strictly between angle brackets <...>
|
|
35
|
+
const emailRegex = /<[\s\r\n]*([^\s<>@]+@[^\s<>@]+)[\s\r\n]*>/g;
|
|
36
|
+
let match;
|
|
37
|
+
const emails = [];
|
|
38
|
+
while ((match = emailRegex.exec(fullBody)) !== null) {
|
|
39
|
+
emails.push({ addr: match[1], index: match.index, fullMatchLength: match[0].length });
|
|
40
|
+
}
|
|
41
|
+
const count = emails.length;
|
|
42
|
+
const ratio = count / depth;
|
|
43
|
+
// 3. Sender & Header context analysis
|
|
44
|
+
let explainedCount = 0;
|
|
45
|
+
let fromCount = 0;
|
|
46
|
+
const contextWindow = 150;
|
|
47
|
+
const fromKeywords = [
|
|
48
|
+
"From", "Od", "Fra", "Von", "De", "Lähettäjä", "Šalje", "Feladó", "Da", "Van", "Expeditorul",
|
|
49
|
+
"Отправитель", "Från", "Kimden", "Від кого", "Saatja", "De la", "Gönderen", "От", "Від",
|
|
50
|
+
"Mittente", "Nadawca", "送信元"
|
|
51
|
+
];
|
|
52
|
+
const otherKeywords = [
|
|
53
|
+
"To", "Komu", "Til", "An", "Para", "Vastaanottaja", "À", "Prima", "Címzett", "A", "Aan", "Do",
|
|
54
|
+
"Destinatarul", "Кому", "Pre", "Till", "Kime", "Pour", "Adresat", "送信先",
|
|
55
|
+
"Cc", "CC", "Kopie", "Kopio", "Másolat", "Kopi", "Dw", "Копия", "Kopia", "Bilgi", "Копія",
|
|
56
|
+
"Másolatot kap", "Kópia", "Copie à",
|
|
57
|
+
"Reply-To", "Odgovori na", "Odpověď na", "Svar til", "Antwoord aan", "Vastaus", "Répondre à",
|
|
58
|
+
"Antwort an", "Válaszcím", "Rispondi a", "Odpowiedź-do", "Responder A", "Responder a",
|
|
59
|
+
"Răspuns către", "Ответ-Кому", "Odpovedať-Pre", "Svara till", "Yanıt Adresi", "Кому відповісти"
|
|
60
|
+
];
|
|
61
|
+
const trailingSenderKeywords = [
|
|
62
|
+
"wrote", "escribió", "a écrit", "kirjoitti", "ezt írta", "ha scritto", "geschreven", "skrev",
|
|
63
|
+
"napisał", "escreveu", "написал", "napísal", "följande", "tarihinde şunu yazdı", "napsal"
|
|
64
|
+
];
|
|
65
|
+
const buildRegex = (words, strict = false) => {
|
|
66
|
+
const sorted = Array.from(new Set(words)).sort((a, b) => b.length - a.length);
|
|
67
|
+
const joined = sorted.map(k => k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|');
|
|
68
|
+
const prefix = `[\\*\\_\\>]*\\s*`;
|
|
69
|
+
const suffix = `\\s*[\\*\\_]*\\s*`;
|
|
70
|
+
if (strict) {
|
|
71
|
+
return new RegExp(`(?:${prefix}(?:${joined})${suffix})\\s*:\\s*(?:[^:\\n]*\\n\\s*)?[^:\\n]*$`, 'i');
|
|
72
|
+
}
|
|
73
|
+
return new RegExp(`(?:${prefix}(?:${joined})${suffix})\\s*:`, 'i');
|
|
74
|
+
};
|
|
75
|
+
const headerPattern = buildRegex([...fromKeywords, ...otherKeywords], false);
|
|
76
|
+
const fromPattern = buildRegex(fromKeywords, true);
|
|
77
|
+
const trailingPattern = new RegExp(`^\\s*[\\*\\_\\>]*\\s*(?:${trailingSenderKeywords.join('|')})\\s*:?`, 'i');
|
|
78
|
+
for (const email of emails) {
|
|
79
|
+
const start = Math.max(0, email.index - contextWindow);
|
|
80
|
+
const preText = fullBody.substring(start, email.index);
|
|
81
|
+
const postText = fullBody.substring(email.index + email.fullMatchLength);
|
|
82
|
+
const blocks = preText.split(/\n\s*\n/);
|
|
83
|
+
const currentBlock = blocks[blocks.length - 1];
|
|
84
|
+
if (headerPattern.test(currentBlock))
|
|
85
|
+
explainedCount++;
|
|
86
|
+
if (fromPattern.test(preText) || trailingPattern.test(postText))
|
|
87
|
+
fromCount++;
|
|
88
|
+
}
|
|
89
|
+
// ⚖️ SIGNAL-BASED SCORING ⚖️
|
|
90
|
+
const signals = {};
|
|
91
|
+
let finalScore = 100;
|
|
92
|
+
const reasons = [];
|
|
93
|
+
// --- 1. Ratio Signals (The base score) ---
|
|
94
|
+
if (count === 0) {
|
|
95
|
+
signals['penalty_ghost'] = -100;
|
|
96
|
+
reasons.push("Ghost Forward: 0 emails found in the body");
|
|
97
|
+
}
|
|
98
|
+
else if (ratio < 0.5) {
|
|
99
|
+
signals['penalty_inconsistent'] = -100;
|
|
100
|
+
reasons.push(`Inconsistent Density: Ratio ${ratio.toFixed(2)} is too low (expected >= 0.5)`);
|
|
101
|
+
}
|
|
102
|
+
else if (ratio >= 0.5 && ratio <= 1.5) {
|
|
103
|
+
signals['adjustment_partial'] = -50;
|
|
104
|
+
reasons.push(`Partial Chain: Ratio ${ratio.toFixed(2)} suggests ~1 email per detected level`);
|
|
105
|
+
}
|
|
106
|
+
else if (ratio > 2.4) {
|
|
107
|
+
signals['adjustment_high_density'] = -75;
|
|
108
|
+
reasons.push(`High Density: Ratio ${ratio.toFixed(2)} is high (many emails per level)`);
|
|
109
|
+
// Bonus for validated high density
|
|
110
|
+
const explainedRatio = explainedCount / count;
|
|
111
|
+
if (explainedRatio >= 0.6) {
|
|
112
|
+
signals['bonus_validated_density'] = 75;
|
|
113
|
+
reasons.push(`Validated Density: ${Math.round(explainedRatio * 100)}% of emails are preceded by headers`);
|
|
114
|
+
}
|
|
115
|
+
else {
|
|
116
|
+
reasons.push(`Unvalidated Density: Only ${Math.round(explainedRatio * 100)}% of emails have header context`);
|
|
117
|
+
}
|
|
118
|
+
}
|
|
119
|
+
else {
|
|
120
|
+
reasons.push(`Standard Density: Ratio ${ratio.toFixed(2)} is optimal (~2 emails per level)`);
|
|
121
|
+
}
|
|
122
|
+
// --- 2. Coherence Signals (Penalties) ---
|
|
123
|
+
if (fromCount > depth) {
|
|
124
|
+
signals['penalty_sender_mismatch'] = -75;
|
|
125
|
+
reasons.push(`Sender Mismatch: Found ${fromCount} senders but only ${depth} forward levels`);
|
|
126
|
+
}
|
|
127
|
+
if (maxQuoteDepth > depth) {
|
|
128
|
+
signals['penalty_quote_mismatch'] = -75;
|
|
129
|
+
reasons.push(`Quote Mismatch: Max quote nesting ${maxQuoteDepth} exceeds detected depth ${depth}`);
|
|
130
|
+
}
|
|
131
|
+
// --- Aggregate ---
|
|
132
|
+
for (const val of Object.values(signals)) {
|
|
133
|
+
finalScore += val;
|
|
134
|
+
}
|
|
135
|
+
finalScore = Math.max(0, Math.min(100, finalScore));
|
|
136
|
+
// Map description based on final score if not already descriptive
|
|
137
|
+
let description = reasons.join("; ");
|
|
138
|
+
if (finalScore === 100)
|
|
139
|
+
description = "High Confidence: " + description;
|
|
140
|
+
else if (finalScore >= 50)
|
|
141
|
+
description = "Medium Confidence: " + description;
|
|
142
|
+
else
|
|
143
|
+
description = "Low Confidence: " + description;
|
|
144
|
+
return {
|
|
145
|
+
score: finalScore,
|
|
146
|
+
description,
|
|
147
|
+
ratio,
|
|
148
|
+
email_count: count,
|
|
149
|
+
sender_count: fromCount,
|
|
150
|
+
quote_depth: maxQuoteDepth,
|
|
151
|
+
signals,
|
|
152
|
+
reasons
|
|
153
|
+
};
|
|
154
|
+
}
|
package/dist/types.d.ts
CHANGED
|
@@ -35,6 +35,14 @@ export interface ResultObject {
|
|
|
35
35
|
date_iso: string | null;
|
|
36
36
|
text: string | null;
|
|
37
37
|
full_body?: string;
|
|
38
|
+
confidence_score?: number;
|
|
39
|
+
confidence_description?: string;
|
|
40
|
+
confidence_ratio?: number;
|
|
41
|
+
confidence_email_count?: number;
|
|
42
|
+
confidence_sender_count?: number;
|
|
43
|
+
confidence_quote_depth?: number;
|
|
44
|
+
confidence_signals?: Record<string, number>;
|
|
45
|
+
confidence_reasons?: string[];
|
|
38
46
|
attachments: Attachment[];
|
|
39
47
|
history: HistoryEntry[];
|
|
40
48
|
diagnostics: Diagnostics;
|
package/dist/utils.js
CHANGED
|
@@ -221,6 +221,9 @@ function normalizeFrom(from) {
|
|
|
221
221
|
if (from.address) {
|
|
222
222
|
from.address = from.address.replace(/^[\*\_]+|[\*\_]+$/g, '').trim();
|
|
223
223
|
}
|
|
224
|
+
// FINAL VALIDATION: If at the end we have no address and no name, return null
|
|
225
|
+
if (!from.address && !from.name)
|
|
226
|
+
return null;
|
|
224
227
|
return from;
|
|
225
228
|
}
|
|
226
229
|
function normalizeParserResult(parsed, method, depth, warnings = []) {
|
|
package/docs/architecture/README.md
CHANGED
|
@@ -16,6 +16,10 @@ This directory contains the documentation for the refactor of the `email-deepest
|
|
|
16
16
|
* Implementation of `OutlookFRDetector`, `NewOutlookDetector`, and `ReplyDetector`.
|
|
17
17
|
* Achieved **100% compatibility** with 239/239 body fixtures.
|
|
18
18
|
|
|
19
|
+
4. **[Confidence Scoring System](../confidence_scoring.md)**
|
|
20
|
+
* Implementation of the signal-based reliability evaluation.
|
|
21
|
+
* Handles email density, sender count mismatches, and quote level analysis.
|
|
22
|
+
|
|
19
23
|
## Planning & Reports
|
|
20
24
|
|
|
21
25
|
* **[Overall Plugin Plan](plugin_plan.md)**: The technical blueprint for the refactor.
|
|
package/docs/confidence_scoring.md
ADDED
|
@@ -0,0 +1,75 @@
|
|
|
1
|
+
# Confidence Scoring System
|
|
2
|
+
|
|
3
|
+
The `email-origin-chain` library implements a specialized **Signal-Based Scoring System** to evaluate the reliability of detected email chains. This is particularly important for `inline` detection (text-based), where formatting can sometimes be ambiguous.
|
|
4
|
+
|
|
5
|
+
## ⚖️ Architecture
|
|
6
|
+
|
|
7
|
+
Instead of a single boolean check, the system evaluates several independent **Signals**. These signals can be positive (bonuses) or negative (penalties) and are aggregated into a final score from **0 to 100**.
|
|
8
|
+
|
|
9
|
+
```mermaid
|
|
10
|
+
graph TD
|
|
11
|
+
A[Message Body Analysis] --> B[Metrics Extraction]
|
|
12
|
+
B --> C[Ratio: Email / Depth]
|
|
13
|
+
B --> D[Sender Detection]
|
|
14
|
+
B --> E[Quote Nesting Levels]
|
|
15
|
+
|
|
16
|
+
C & D & E --> F[Signal Evaluators]
|
|
17
|
+
|
|
18
|
+
F --> S1[Ratio Adjustments]
|
|
19
|
+
F --> S2[Sender Count Penalty]
|
|
20
|
+
F --> S3[Quote Depth Penalty]
|
|
21
|
+
F --> S4[Context Header Bonus]
|
|
22
|
+
|
|
23
|
+
S1 & S2 & S3 & S4 --> G[Aggregate Score]
|
|
24
|
+
G --> H[Clamping 0-100]
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
## 📊 Available Signals
|
|
28
|
+
|
|
29
|
+
### 1. Ratio Signals (Base Score)
|
|
30
|
+
The ratio is calculated as `Detected Email Addresses / Detected Forward Depth`.
|
|
31
|
+
|
|
32
|
+
| Signal | Logic | Impact | Description |
|
|
33
|
+
| :--- | :--- | :--- | :--- |
|
|
34
|
+
| `penalty_ghost` | `email_count == 0` | -100 | The chain indicates a forward, but no actual email addresses were found. |
|
|
35
|
+
| `penalty_inconsistent` | `ratio < 0.5` | -100 | Extremely low density for the detected depth. |
|
|
36
|
+
| `adjustment_partial` | `0.5 <= ratio <= 1.5` | -50 | Likely a partial chain (e.g., only 1 email detected per level). |
|
|
37
|
+
| `adjustment_high_density` | `ratio > 2.4` | -75 | Very high density (many emails). Requires verification. |
|
|
38
|
+
| **Standard** | `1.5 < ratio <= 2.4` | +0 | **Optimal base score (100)**. Typical for From/To blocks. |
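
The ratio bands in the table can be sketched as a single classification function. This mirrors the branch order in `dist/scoring.js` but is only an illustration: `classifyRatio` is a hypothetical name, and the sketch assumes `depth > 0` (the library returns early when the depth is 0).

```javascript
// Sketch of the ratio bands; `classifyRatio` is a hypothetical helper.
// Assumes depth > 0, since a depth of 0 is handled before any ratio is computed.
function classifyRatio(emailCount, depth) {
  if (emailCount === 0) return { signal: 'penalty_ghost', impact: -100 };
  const ratio = emailCount / depth;
  if (ratio < 0.5) return { signal: 'penalty_inconsistent', impact: -100 };
  if (ratio <= 1.5) return { signal: 'adjustment_partial', impact: -50 };
  if (ratio > 2.4) return { signal: 'adjustment_high_density', impact: -75 };
  // 1.5 < ratio <= 2.4: standard density, no adjustment.
  return { signal: null, impact: 0 };
}

console.log(classifyRatio(0, 2)); // no addresses at all
console.log(classifyRatio(1, 3)); // ratio 0.33
console.log(classifyRatio(2, 2)); // ratio 1.00
console.log(classifyRatio(4, 2)); // ratio 2.00
console.log(classifyRatio(5, 1)); // ratio 5.00
```

Exactly one band applies to any input, which is why the ratio acts as the base score before the coherence penalties are added.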
|
|
39
|
+
|
|
40
|
+
### 2. Validation & Penalties
|
|
41
|
+
These signals refine the base score based on visual or logical evidence.
|
|
42
|
+
|
|
43
|
+
| Signal | Condition | Impact | Description |
|
|
44
|
+
| :--- | :--- | :--- | :--- |
|
|
45
|
+
| `bonus_validated_density` | High density + at least 60% of emails preceded by headers | +75 | Validates "High Density" chains if email addresses are preceded by headers like `To:` or `Cc:`. |
|
|
46
|
+
| `penalty_sender_mismatch` | `senders > depth` | -75 | Found more actual `From:` headers than recursion levels. Suggests a missed separator. |
|
|
47
|
+
| `penalty_quote_mismatch` | `quote_depth > depth` | -75 | Found nested `>` symbols deeper than the detected levels. Suggests hidden levels. |
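
To make the `quote_depth` input concrete, here is a short sketch of how the maximum `>` nesting can be measured, using the same leading-quote regular expression as `dist/scoring.js`. The function name `maxQuoteDepth` and the sample body are illustrative assumptions.

```javascript
// Sketch: find the deepest run of leading ">" markers across all lines.
// `maxQuoteDepth` is a hypothetical helper name; the regex matches the one in scoring.js.
function maxQuoteDepth(body) {
  let max = 0;
  for (const line of body.split('\n')) {
    const match = line.match(/^(\s*>)+/);
    if (match) {
      const count = (match[0].match(/>/g) || []).length;
      if (count > max) max = count;
    }
  }
  return max;
}

// Illustrative body with three quote levels and one ">" that is not a quote marker.
const sample = [
  'Hello',
  '> first level',
  '> > second level',
  '>>> third level',
  'a > b is not a quote',
].join('\n');

const depthFound = maxQuoteDepth(sample);
console.log(depthFound);
```

If the parser had detected only 2 levels for this body, the quote depth would exceed the detected depth and `penalty_quote_mismatch` would apply.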
|
|
48
|
+
|
|
49
|
+
## 🔍 Debugging & Transparency
|
|
50
|
+
|
|
51
|
+
The extraction result provides two fields for auditing the score:
|
|
52
|
+
|
|
53
|
+
1. **`confidence_signals`**: A raw key-value map of every triggered signal and its impact.
|
|
54
|
+
2. **`confidence_reasons`**: A list of human-readable strings explaining each triggered signal.
|
|
55
|
+
|
|
56
|
+
### Example Suspect Result
|
|
57
|
+
```json
|
|
58
|
+
{
|
|
59
|
+
"confidence_score": 0,
|
|
60
|
+
"confidence_description": "Low Confidence: High Density: Ratio 5.00 is high (many emails per level); Unvalidated Density: Only 0% of emails have header context; Sender Mismatch: Found 2 senders but only 1 forward levels",
|
|
61
|
+
"confidence_signals": {
|
|
62
|
+
"adjustment_high_density": -75,
|
|
63
|
+
"penalty_sender_mismatch": -75
|
|
64
|
+
},
|
|
65
|
+
"confidence_reasons": [
|
|
66
|
+
"High Density: Ratio 5.00 is high (many emails per level)",
|
|
67
|
+
"Unvalidated Density: Only 0% of emails have header context",
|
|
68
|
+
"Sender Mismatch: Found 2 senders but only 1 forward levels"
|
|
69
|
+
]
|
|
70
|
+
}
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
## 🛠 Usage for Developers
|
|
74
|
+
|
|
75
|
+
You should typically flag results with `confidence_score < 50` for manual review, as they likely indicate "Garbage" chains or highly fragmented formatting that fooled the parser.
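
A triage step along those lines could look like the sketch below. It only inspects the `confidence_score` field of result objects; `needsManualReview` is a hypothetical helper, the sample results are made up, and treating a missing score as reviewable is a conservative choice of this sketch (the field is optional in `types.d.ts`).

```javascript
// Sketch of a review filter; `needsManualReview` is a hypothetical helper.
function needsManualReview(result, threshold = 50) {
  const score = result.confidence_score;
  // The field is optional, so a missing score is treated as "review" here (an assumption).
  if (typeof score !== 'number') return true;
  return score < threshold;
}

// Made-up results standing in for parser output.
const results = [
  { subject: 'clean forward', confidence_score: 100 },
  { subject: 'partial chain', confidence_score: 50 },
  { subject: 'suspect chain', confidence_score: 25 },
];

const flagged = results.filter(r => needsManualReview(r)).map(r => r.subject);
console.log(flagged);
```

Note that a score of exactly 50 (a partial chain with no other penalty) passes the default threshold; raise the threshold if partial chains should also be reviewed.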
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "email-origin-chain",
|
|
3
|
-
"version": "1.0.
|
|
3
|
+
"version": "1.0.11",
|
|
4
4
|
"description": "Uncover the full audit trail of your email threads. Recursively reconstructs the entire conversation history with instant access to the original sender and true source message.",
|
|
5
5
|
"main": "dist/index.js",
|
|
6
6
|
"types": "dist/index.d.ts",
|