@sidub-inc/docuoria.cli 1.0.15

Files changed (65)
  1. package/dist/index.js +1056 -0
  2. package/package.json +56 -0
  3. package/payload/.claude-plugin/plugin.json +21 -0
  4. package/payload/MANIFEST.json +322 -0
  5. package/payload/SKILL.md +88 -0
  6. package/payload/assets/lib/Docuoria.dll +0 -0
  7. package/payload/assets/schemas/template-schema.json +413 -0
  8. package/payload/commands/classify.md +11 -0
  9. package/payload/commands/diagnose.md +11 -0
  10. package/payload/commands/extract.md +11 -0
  11. package/payload/commands/inspect.md +11 -0
  12. package/payload/commands/validate-template.md +11 -0
  13. package/payload/examples/01-extract-to-csv.md +49 -0
  14. package/payload/examples/02-classify-unknown-pdf.md +102 -0
  15. package/payload/examples/03-diagnose-failed-result.md +68 -0
  16. package/payload/references/classification.md +363 -0
  17. package/payload/references/decision-tree.md +43 -0
  18. package/payload/references/failure-tree.md +169 -0
  19. package/payload/references/pattern-authoring.md +40 -0
  20. package/payload/references/patterns.md +97 -0
  21. package/payload/references/privacy.md +36 -0
  22. package/payload/references/scripts.md +361 -0
  23. package/payload/references/template-reference.md +606 -0
  24. package/payload/references/workflow.md +163 -0
  25. package/payload/scripts/_common.csx +250 -0
  26. package/payload/scripts/classify.csx +53 -0
  27. package/payload/scripts/dry-run.csx +85 -0
  28. package/payload/scripts/evaluate-match.csx +72 -0
  29. package/payload/scripts/execute.csx +89 -0
  30. package/payload/scripts/inspect.csx +43 -0
  31. package/payload/scripts/list-templates.csx +34 -0
  32. package/payload/scripts/load-template.csx +54 -0
  33. package/payload/scripts/save-template.csx +53 -0
  34. package/payload/scripts/schema-info.csx +84 -0
  35. package/payload/scripts/test-groups.csx +44 -0
  36. package/payload/scripts/test-pattern.csx +61 -0
  37. package/payload/scripts/validate-template.csx +54 -0
  38. package/payload/skill/SKILL.md +88 -0
  39. package/payload/skill/assets/lib/Docuoria.dll +0 -0
  40. package/payload/skill/assets/schemas/template-schema.json +413 -0
  41. package/payload/skill/examples/01-extract-to-csv.md +49 -0
  42. package/payload/skill/examples/02-classify-unknown-pdf.md +102 -0
  43. package/payload/skill/examples/03-diagnose-failed-result.md +68 -0
  44. package/payload/skill/references/classification.md +363 -0
  45. package/payload/skill/references/decision-tree.md +43 -0
  46. package/payload/skill/references/failure-tree.md +169 -0
  47. package/payload/skill/references/pattern-authoring.md +40 -0
  48. package/payload/skill/references/patterns.md +97 -0
  49. package/payload/skill/references/privacy.md +36 -0
  50. package/payload/skill/references/scripts.md +361 -0
  51. package/payload/skill/references/template-reference.md +606 -0
  52. package/payload/skill/references/workflow.md +163 -0
  53. package/payload/skill/scripts/_common.csx +250 -0
  54. package/payload/skill/scripts/classify.csx +53 -0
  55. package/payload/skill/scripts/dry-run.csx +85 -0
  56. package/payload/skill/scripts/evaluate-match.csx +72 -0
  57. package/payload/skill/scripts/execute.csx +89 -0
  58. package/payload/skill/scripts/inspect.csx +43 -0
  59. package/payload/skill/scripts/list-templates.csx +34 -0
  60. package/payload/skill/scripts/load-template.csx +54 -0
  61. package/payload/skill/scripts/save-template.csx +53 -0
  62. package/payload/skill/scripts/schema-info.csx +84 -0
  63. package/payload/skill/scripts/test-groups.csx +44 -0
  64. package/payload/skill/scripts/test-pattern.csx +61 -0
  65. package/payload/skill/scripts/validate-template.csx +54 -0
@@ -0,0 +1,68 @@
1
+ # Example 3 — Diagnose a FailedResult
2
+
3
+ ## Scenario
4
+
5
+ `dotnet script scripts/execute.csx -- --pdf <pdf> --template <template.json> --format csv --output out.csv` exits non-zero and emits `{ "error": { "code": "failed", "message": "FieldPath ... could not be coerced to System.Decimal", "detail": "..." } }` on stderr. The underlying `FailedResult` has `Step = StepIdentifier.Transformation` — a field expected to be a `decimal` could not be coerced from the captured text.
6
+
7
+ ## Example: before (broken template)
8
+
9
+ The template maps a currency amount as `fieldType: 1` (Number), but the regex captures the dollar sign:
10
+
11
+ ```json
12
+ {
13
+ "$kind": "FieldMapping",
14
+ "fieldName": "totalAmount",
15
+ "fieldType": 1,
16
+ "source": {
17
+ "$kind": "TextPatternExtractionSource",
18
+ "mode": "Pattern",
19
+ "regexPattern": "Total:?\\s*(?<value>\\$[\\d,.]+)"
20
+ }
21
+ }
22
+ ```
23
+
24
+ This captures `"$1,234.56"`, which cannot be parsed as a decimal → `FailedResult`.
25
+
26
+ ## Example: after (fixed template)
27
+
28
+ Exclude the dollar sign from the capture group:
29
+
30
+ ```json
31
+ {
32
+ "$kind": "FieldMapping",
33
+ "fieldName": "totalAmount",
34
+ "fieldType": 1,
35
+ "source": {
36
+ "$kind": "TextPatternExtractionSource",
37
+ "mode": "Pattern",
38
+ "regexPattern": "Total:?\\s*\\$?(?<value>[\\d,]+\\.\\d{2})"
39
+ }
40
+ }
41
+ ```
42
+
43
+ Now captures `"1,234.56"` → coerces to `1234.56` successfully.
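
The before/after difference can be checked outside the engine. The sketch below is a Python approximation, not the engine's coercion code. Python's `re` spells named groups `(?P<value>...)` where the template pattern uses `(?<value>...)`, and stripping thousands separators before calling `Decimal` is an assumption about how the engine treats `1,234.56`. The sample line and the helper names `capture` and `coerce` are hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation

# Python ports of the two template patterns; only the named-group syntax differs.
BROKEN = re.compile(r"Total:?\s*(?P<value>\$[\d,.]+)")
FIXED = re.compile(r"Total:?\s*\$?(?P<value>[\d,]+\.\d{2})")


def capture(pattern, text):
    """Return the text of the `value` group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group("value") if match else None


def coerce(source_text):
    """Approximate decimal coercion: drop thousands separators, then parse.

    Assumption: the engine tolerates ',' as a grouping separator.
    """
    try:
        return Decimal(source_text.replace(",", ""))
    except InvalidOperation:
        return None


line = "Total: $1,234.56"  # hypothetical line from the flattened page text
broken_capture = capture(BROKEN, line)
fixed_capture = capture(FIXED, line)

print(broken_capture, coerce(broken_capture))
print(fixed_capture, coerce(fixed_capture))
```

Running both patterns against a line copied from the real flattened text is a quick way to see what `SourceText` will contain before the template is re-run.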
44
+
45
+ ## Steps
46
+
47
+ 1. **Map the stderr `error.code` to a branch first.** `execute.csx` emitted `{ "error": { "code": "failed", "message": "...", "detail": "Step: Transformation, FieldPath: ..." } }` on stderr with exit ≥ 1. Look up `failed` in [`../references/failure-tree.md` § Stderr error.code → Branch routing](../references/failure-tree.md#stderr-errorcode--branch-routing) → **Branch B**, then read the `StepIdentifier` in `detail` (`Transformation`) to land on the right remediation row.
48
+ 2. Read the structured fields on `FailedResult` (parse from stderr `detail` or re-run dry-run with diagnostics):
49
+ - `Step` — here `StepIdentifier.Transformation`.
50
+ - `FieldPath` — e.g. `"DataModel.Fields.Total"`.
51
+ - `SourceText` — the raw captured substring, capped at 256 characters with a trailing ellipsis when truncated.
52
+ - `TargetTypeName` — e.g. `"System.Decimal"` or the engine-friendly name.
53
+ - `InnerDetail` — short description of the inner exception.
54
+ 3. Confirm with diagnostics: `dotnet script scripts/dry-run.csx -- --pdf <pdf> --template <template.json>`. Dry-run defaults `Diagnostics = true`. Read the same fields off the returned `DryRunFailed` and inspect the per-mapping match trace.
55
+ 4. Inspect `SourceText`. If it contains a currency symbol, a thousands separator, or surrounding whitespace, the pattern is capturing too much for the target type — tighten the pattern (see [`../references/patterns.md`](../references/patterns.md) patterns 3 vs. 4, or pattern 6 for decimals) or change the target type to `string` with a downstream transform.
56
+ 5. Re-run `dotnet script scripts/dry-run.csx -- --pdf <pdf> --template <template.json>` until `DryRunSucceeded`.
57
+ 6. Re-run `dotnet script scripts/execute.csx -- --pdf <pdf> --template <template.json> --format csv --output output.csv` for the real CSV.
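
Steps 1 and 2 can be sketched as a small parser for the stderr envelope. This is a minimal illustration, assuming `detail` is a comma-separated list of `Key: value` pairs as in the `Step: Transformation, FieldPath: ...` snippet above; the real `detail` layout may differ, and a value that itself contains a comma would break this split. The sample payload, `parse_error`, and `BRANCH_BY_CODE` are hypothetical, and the mapping lists only three codes from the routing table.

```python
import json

# Branch routing for three codes from the failure tree; other codes are omitted.
BRANCH_BY_CODE = {"bad-format": "A", "rejected": "A", "failed": "B"}


def parse_error(stderr_text):
    """Parse the stderr error envelope and split `detail` into key/value pairs."""
    error = json.loads(stderr_text)["error"]
    fields = {}
    # Assumption: `detail` is a comma-separated list of "Key: value" pairs.
    for part in (error.get("detail") or "").split(","):
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return error["code"], BRANCH_BY_CODE.get(error["code"]), fields


# Hypothetical stderr payload shaped like the one in the Scenario.
sample = json.dumps({
    "error": {
        "code": "failed",
        "message": "FieldPath DataModel.Fields.Total could not be coerced to System.Decimal",
        "detail": "Step: Transformation, FieldPath: DataModel.Fields.Total",
    }
})

code, branch, fields = parse_error(sample)
print(code, branch, fields["Step"], fields["FieldPath"])
```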
58
+
59
+ ## Expected outcome
60
+
61
+ `ProcessingResult` becomes `SucceededResult`; the previously-failing field carries the coerced value.
62
+
63
+ ## See also
64
+
65
+ - [`../references/failure-tree.md` § Stderr error.code → Branch routing](../references/failure-tree.md#stderr-errorcode--branch-routing) — always start here when a script exits non-zero.
66
+ - [`../references/failure-tree.md`](../references/failure-tree.md) Branch B `Step = Transformation` — the canonical remediation matrix.
67
+ - [`../references/patterns.md`](../references/patterns.md) — illustrative patterns, especially 3–7 for numeric coercion.
68
+ - [`../references/pattern-authoring.md`](../references/pattern-authoring.md) — the iteration loop for shrinking an over-broad capture.
@@ -0,0 +1,363 @@
1
+ # Classification rule design
2
+
3
+ The `rootMatchRule` determines whether a template is eligible for a given PDF. A weak rule produces false positives — the template classifies documents it cannot extract from, causing silent failures (empty collections, wrong data). This guide teaches how to design discriminating rules that match ONLY the documents the template can actually handle.
4
+
5
+ ## How classification scoring works
6
+
7
+ Understanding the scoring model prevents accidental over-matching.
8
+
9
+ ### Rule-level confidence
10
+
11
+ Each match rule returns a **confidence** ∈ [0, 1] and compares it to its **threshold** ∈ [0, 1]:
12
+
13
+ | Rule type | Confidence formula |
14
+ |---|---|
15
+ | `TextPatternMatchRule` (regex) | Binary: 1.0 if match, 0.0 if not |
16
+ | `TextPatternMatchRule` (AnyToken) | **Proportional:** `matched_tokens / total_tokens` |
17
+ | `TextPatternMatchRule` (AllTokens) | Binary: 1.0 if ALL match, 0.0 if not |
18
+ | `TextAnchorMatchRule` | Binary: 1.0 if text found in region |
19
+ | `TableMatchRule` | Proportional: `satisfied_criteria / specified_criteria` |
20
+ | `PageGeometryMatchRule` | Proportional: `met_criteria / specified_criteria` |
21
+ | `MetadataMatchRule` | Proportional: `matched_fields / expected_fields` |
22
+ | `CompositeMatchRule` (And) | Weighted average: `Σ(confidence × weight) / Σ(weight)` |
23
+ | `CompositeMatchRule` (Or) | Max weighted: `max(confidence × weight) / max(weight)` |
24
+ | `CompositeMatchRule` (Not) | Inverted: `1 - child_confidence` |
25
+
26
+ A rule **matches** when `confidence >= threshold`. Default threshold is `0.5`.
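
The formulas in the table can be written out as plain functions. This is a sketch of the arithmetic only, not the engine's implementation; the function names and the sample `(confidence, weight)` pairs are illustrative assumptions.

```python
def any_token(matched, total):
    """AnyToken: proportional share of tokens found."""
    return matched / total


def composite_and(children):
    """And: weighted average over (confidence, weight) pairs."""
    return sum(c * w for c, w in children) / sum(w for _, w in children)


def composite_or(children):
    """Or: largest weighted confidence, divided by the largest weight."""
    return max(c * w for c, w in children) / max(w for _, w in children)


def composite_not(child_confidence):
    """Not: inverted child confidence."""
    return 1 - child_confidence


def matches(confidence, threshold=0.5):
    """A rule matches when confidence >= threshold (default 0.5)."""
    return confidence >= threshold


# Illustrative children: a full-confidence gate and a 1-of-4 AnyToken rule.
children = [(1.0, 1.0), (any_token(1, 4), 2.0)]
and_score = composite_and(children)
or_score = composite_or(children)

print(round(and_score, 3), matches(and_score))
print(round(or_score, 3), matches(or_score))
print(composite_not(1.0), matches(composite_not(1.0)))
```

Read literally, the `Or` formula scales a full-confidence child of weight 1.0 down to 0.5 when a sibling carries weight 2.0, so weights matter under `Or` as well as under `And`.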
27
+
28
+ ### Classification output
29
+
30
+ The engine evaluates each template and produces a single aggregated **confidence** score:
31
+
32
+ - **`confidence`** — `ruleConfidence × extractionProbeScore`. This is the signal the agent acts on. A template is a functional match only when the root rule passes AND at least one collection extraction pattern matches (probe > 0).
33
+
34
+ The `classify.csx` script evaluates every stored template against a PDF and returns them ranked by confidence (descending), giving the agent a gradient rather than a single binary winner.
35
+
36
+ ### Interpreting the gradient
37
+
38
+ | confidence | Meaning | Agent action |
39
+ |---|---|---|
40
+ | ≥ 0.8 | Strong match — rules pass and extraction patterns match | Extract directly |
41
+ | 0.4 – 0.8 | Partial match — some discriminating tokens present | Try extraction with this template. If results are incomplete, iterate on this template rather than authoring from scratch |
42
+ | 0.1 – 0.4 | Weak match — only baseline signals present | Use as structural reference (field layout, extraction sources) when authoring a new template |
43
+ | ~0 | No overlap | New document type — author from scratch |
44
+
45
+ **Key insight:** two templates from the same vendor may both score 0.3–0.5 for a new variant that neither fully covers. The ranked list surfaces this — pick the closest match as a starting point for refinement.
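
A rough sketch of how an agent might act on the gradient is shown below. The band edges follow the table, but how the boundaries are closed (0.8 and 0.4 belonging to the upper band, anything under 0.1 treated as "~0") is an assumption, and the template names, rule confidences, and probe scores are hypothetical.

```python
def aggregate(rule_confidence, extraction_probe_score):
    """Aggregated confidence: ruleConfidence x extractionProbeScore."""
    return rule_confidence * extraction_probe_score


def agent_action(confidence):
    """Map an aggregated confidence to the action suggested by the gradient table."""
    if confidence >= 0.8:
        return "extract"
    if confidence >= 0.4:
        return "try-then-iterate"
    if confidence >= 0.1:
        return "use-as-reference"
    return "author-from-scratch"


# Hypothetical (template, rule confidence, probe score) triples.
candidates = [
    ("usage-invoice", 0.9, 0.5),
    ("statement", 0.33, 0.0),
    ("subscription-invoice", 0.9, 1.0),
]

# Rank by aggregated confidence, highest first.
ranked = sorted(
    ((name, aggregate(rule, probe)) for name, rule, probe in candidates),
    key=lambda item: item[1],
    reverse=True,
)
for name, confidence in ranked:
    print(name, round(confidence, 2), agent_action(confidence))
```

A probe score of 0 zeroes the aggregate even when the root rule passes, which is why the third candidate falls into the lowest band.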
46
+
47
+ ### The proportional confidence trap
48
+
49
+ With `TextPatternMatchRule` in `AnyToken` mode (integer `0`):
50
+ - 4 tokens + threshold 0.5 → only **2 tokens** need to match
51
+ - 6 tokens + threshold 0.5 → only **3 tokens** need to match
52
+
53
+ Generic vendor tokens like `["Microsoft", "Invoice"]` will match many unrelated documents from the same vendor. This is the #1 cause of misclassification.
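
The trap is easy to quantify. The helper below is a small sketch (the name `min_tokens_to_match` is hypothetical) that finds how few tokens have to be present for an `AnyToken` rule to pass, using the `matched_tokens / total_tokens >= threshold` reading given above.

```python
def min_tokens_to_match(total_tokens, threshold=0.5):
    """Smallest number of matched tokens whose share reaches the threshold.

    Assumes total_tokens >= 1.
    """
    for matched in range(total_tokens + 1):
        if matched / total_tokens >= threshold:
            return matched
    return None  # a threshold above 1.0 can never be met


for total, threshold in [(4, 0.5), (6, 0.5), (5, 0.8)]:
    print(total, threshold, min_tokens_to_match(total, threshold))
```

If the count it returns can be satisfied by vendor-level tokens alone, the rule does not depend on any discriminating token.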
54
+
55
+ ## Principles
56
+
57
+ When designing a `rootMatchRule`, follow these three imperatives:
58
+
59
+ 1. **Match the document TYPE, not just the vendor.** Author one template per document type (invoice, credit note, statement, usage report) with type-specific markers — never let a single template cover multiple document types from the same vendor.
60
+ 2. **Give every template at least one token or feature that no other template shares.** If two templates can match the same PDF, treat classification as broken and tighten the discriminators before iterating on extraction quality.
61
+ 3. **Validate every template against negative examples.** A template that matches its target PDF is necessary but not sufficient — confirm with `evaluate-match.csx` that the template *rejects* same-vendor PDFs with different layouts before saving it.
62
+
63
+ ## Token selection strategy
64
+
65
+ ### Bad tokens (vendor-level — too broad)
66
+
67
+ ```json
68
+ {
69
+ "tokens": ["Microsoft", "Invoice", "Bill To", "Total"],
70
+ "mode": 0,
71
+ "threshold": 0.5
72
+ }
73
+ ```
74
+
75
+ Every Microsoft invoice, credit note, and statement contains these tokens, so this rule matches all of them.
76
+
77
+ ### Good tokens (document-type-level — discriminating)
78
+
79
+ ```json
80
+ {
81
+ "tokens": ["Microsoft", "Invoice", "Subscription", "Office 365", "License Qty"],
82
+ "mode": 0,
83
+ "threshold": 0.8
84
+ }
85
+ ```
86
+
87
+ The tokens `"Subscription"`, `"Office 365"`, `"License Qty"` are specific to subscription invoices. A consumption/usage invoice won't contain them.
88
+
89
+ ### Best approach — AllTokens with discriminators
90
+
91
+ ```json
92
+ {
93
+ "tokens": ["Microsoft", "Invoice", "Subscription"],
94
+ "mode": 1,
95
+ "threshold": 0.5
96
+ }
97
+ ```
98
+
99
+ `AllTokens` (mode `1`) requires **every** token to be present (binary 1.0 or 0.0). Add tokens that are unique to the document type. If even one discriminating token is absent, the rule fails entirely.
100
+
101
+ ### How to find discriminating tokens
102
+
103
+ 1. Run `inspect.csx` on the target PDF — note distinctive section headers, product names, column headers, and billing terminology.
104
+ 2. Run `inspect.csx` on other PDFs from the same vendor — identify tokens that appear ONLY in your target.
105
+ 3. Good discriminators: section headers (`"Usage Charges"`, `"Subscription Details"`), product identifiers, unique column headers, regulatory text specific to one document type.
106
+ 4. Bad discriminators: vendor name, generic labels (`"Amount"`, `"Date"`, `"Page"`), boilerplate legal text shared across document types.
107
+
108
+ ## Composite rule architecture
109
+
110
+ For maximum discrimination, compose multiple rule types with appropriate weights:
111
+
112
+ ```json
113
+ {
114
+ "$kind": "CompositeMatchRule",
115
+ "operator": 0,
116
+ "threshold": 0.85,
117
+ "children": [
118
+ {
119
+ "rule": {
120
+ "$kind": "TextPatternMatchRule",
121
+ "tokens": ["Microsoft", "Invoice"],
122
+ "mode": 1,
123
+ "threshold": 0.5
124
+ },
125
+ "weight": 1.0
126
+ },
127
+ {
128
+ "rule": {
129
+ "$kind": "TextPatternMatchRule",
130
+ "tokens": ["Subscription", "License Qty", "Seats"],
131
+ "mode": 0,
132
+ "threshold": 0.6
133
+ },
134
+ "weight": 2.0
135
+ },
136
+ {
137
+ "rule": {
138
+ "$kind": "PageGeometryMatchRule",
139
+ "expectedOrientation": 0,
140
+ "expectedPageCount": 2,
141
+ "threshold": 0.5
142
+ },
143
+ "weight": 0.5
144
+ }
145
+ ]
146
+ }
147
+ ```
148
+
149
+ **Architecture breakdown:**
150
+
151
+ | Child | Purpose | Weight | Rationale |
152
+ |---|---|---|---|
153
+ | TextPattern (AllTokens) | Vendor gate — broad necessary condition | 1.0 | Baseline: must be a Microsoft Invoice |
154
+ | TextPattern (AnyToken, high threshold) | Type discriminator — narrows to subscription invoices | **2.0** | Emphasized: these tokens distinguish this layout from usage invoices |
155
+ | PageGeometry | Structural hint | 0.5 | De-emphasized: helpful but not definitive alone |
156
+
157
+ **Weighting strategy:**
158
+ - Weight `1.0` — baseline signals (necessary but not sufficient)
159
+ - Weight `> 1.0` — discriminators (these separate your template from siblings)
160
+ - Weight `< 1.0` — supporting hints (adds confidence but shouldn't gate alone)
161
+
162
+ The composite `And` calculates `Σ(confidence × weight) / Σ(weight)`. With the example above, if the discriminator scores 0.0 while the other two children score 1.0, the weighted average is `(1.0×1.0 + 0.0×2.0 + 1.0×0.5) / 3.5 ≈ 0.43`, well below the composite threshold of 0.85.
163
+
164
+ ## Structural discriminators
165
+
166
+ Text tokens alone are often insufficient for same-vendor discrimination. Use structural rules:
167
+
168
+ ### TableMatchRule — for documents with distinctive table layout
169
+
170
+ ```json
171
+ {
172
+ "$kind": "TableMatchRule",
173
+ "minRows": 5,
174
+ "minColumns": 4,
175
+ "requiredHeaderTokens": ["Service Name", "Quantity", "Unit Price"],
176
+ "threshold": 0.75
177
+ }
178
+ ```
179
+
180
+ Use when: the target document has a table with specific headers that sibling documents do not share. Verify with `inspect.csx` — check `Tables[*].Headers` and `Tables[*].TotalRowCount`.
181
+
182
+ ### PageGeometryMatchRule — for documents with distinctive page structure
183
+
184
+ ```json
185
+ {
186
+ "$kind": "PageGeometryMatchRule",
187
+ "expectedPageCount": 4,
188
+ "expectedOrientation": 1,
189
+ "threshold": 0.5
190
+ }
191
+ ```
192
+
193
+ Use when: the target document has a consistent page count or orientation that distinguishes it. Landscape-only reports vs. portrait invoices are easy to discriminate.
194
+
195
+ ### TextAnchorMatchRule — for documents with distinctive spatial layout
196
+
197
+ ```json
198
+ {
199
+ "$kind": "TextAnchorMatchRule",
200
+ "expectedContent": "Azure Usage Detail",
201
+ "region": { "x": 50, "y": 30, "width": 300, "height": 40 },
202
+ "pageNumber": 1,
203
+ "threshold": 0.5
204
+ }
205
+ ```
206
+
207
+ Use when: a specific text appears in a known location on the page. This is stronger than plain token matching because it validates both content AND position. Obtain coordinates from `inspect.csx` → `TextBlocks[*].Bounds`.
208
+
209
+ ## Threshold strategy
210
+
211
+ | Scenario | Recommended approach |
212
+ |---|---|
213
+ | Single document type from a vendor | `AllTokens` mode, threshold `0.5` — binary pass/fail is sufficient |
214
+ | Multiple document types from same vendor | Composite with discriminator weight `≥ 2.0`, composite threshold `≥ 0.8` |
215
+ | Documents with variable content (optional sections) | `AnyToken` mode with enough tokens that threshold still requires the discriminating ones |
216
+ | Filename-gated workflows | Add `FileNameMatchRule` child with low weight (`0.3`) as a tiebreaker, never as sole gate |
217
+
218
+ **Key rule:** Raise the composite threshold when siblings are close. The composite threshold gates the entire root rule: if the weighted average can still reach the threshold when the discriminator does not match, the threshold is too low.
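
One way to apply the key rule is to compute the best score an `And` composite can reach when the discriminator contributes nothing. The sketch below assumes every other child scores 1.0 and the discriminator scores 0.0; `ceiling_without` and `threshold_is_safe` are hypothetical helper names, not part of the package.

```python
def ceiling_without(weights, discriminator_index):
    """Best weighted average when the discriminator scores 0.0
    and every other child scores 1.0."""
    others = sum(w for i, w in enumerate(weights) if i != discriminator_index)
    return others / sum(weights)


def threshold_is_safe(weights, discriminator_index, threshold):
    """True when the composite cannot pass without the discriminator."""
    return ceiling_without(weights, discriminator_index) < threshold


weighted = [1.0, 2.0, 0.5]  # gate, discriminator, geometry hint
equal = [1.0, 1.0]          # gate and discriminator weighted the same

print(round(ceiling_without(weighted, 1), 3), threshold_is_safe(weighted, 1, 0.85))
print(ceiling_without(equal, 1), threshold_is_safe(equal, 1, 0.5))
```

With equal weights and the default threshold of 0.5, the ceiling equals the threshold, so the composite would still match on the gate alone.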
219
+
220
+ ## Validation checklist
221
+
222
+ Before storing a template, validate classification quality:
223
+
224
+ 1. **Positive validation** — run `evaluate-match.csx` against the target PDF:
225
+ ```
226
+ dotnet script scripts/evaluate-match.csx -- --pdf <target.pdf> --template <template.json>
227
+ ```
228
+ Must return a high `confidence` (≥ 0.8 ideally).
229
+
230
+ 2. **Negative validation** — run against PDFs from the same vendor that should NOT match:
231
+ ```
232
+ dotnet script scripts/evaluate-match.csx -- --pdf <sibling.pdf> --template <template.json>
233
+ ```
234
+ Must return `confidence: 0` or near-zero. If confidence is moderate (0.4–0.7), the rules are not discriminating enough.
235
+
236
+ 3. **Ranked classification** — if multiple templates are stored, verify correct ranking:
237
+ ```
238
+ dotnet script scripts/classify.csx -- --pdf <target.pdf> --store-path <templates-dir>
239
+ ```
240
+ The correct template must appear at the top of the ranked list with the highest `confidence`. Check the gap between the top match and the next — a wide gap indicates strong discrimination, a narrow gap indicates ambiguity.
241
+
242
+ 4. **Boundary cases** — test with the most similar document from another vendor/type to ensure the discriminators are doing their job.
243
+
244
+ If negative validation fails, strengthen the discriminator child (add more type-specific tokens, increase its weight, or add a structural rule like `TableMatchRule`).
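
The ranked-classification check in step 3 could be automated along these lines. The sketch works on plain `(template_id, confidence)` pairs so that it does not assume anything about the JSON shape `classify.csx` prints; the `min_gap` tolerance of 0.3 and the template names are hypothetical.

```python
def check_ranking(ranked, expected_id, min_gap=0.3):
    """Check that expected_id ranks first and leads the runner-up by min_gap.

    `ranked` is a non-empty list of (template_id, confidence) pairs.
    `min_gap` is a hypothetical tolerance, not an engine setting.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    top_id, top_confidence = ordered[0]
    runner_up = ordered[1][1] if len(ordered) > 1 else 0.0
    gap = top_confidence - runner_up
    return {
        "correct_top": top_id == expected_id,
        "gap": round(gap, 3),
        "ambiguous": gap < min_gap,
    }


clear = check_ranking([("subscription", 0.92), ("usage", 0.05)], "subscription")
close = check_ranking([("subscription", 0.62), ("usage", 0.55)], "subscription")
print(clear)
print(close)
```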
245
+
246
+ ## Common mistakes
247
+
248
+ | Mistake | Consequence | Fix |
249
+ |---|---|---|
250
+ | Vendor tokens only | Every document from that vendor classifies at 1.0 | Add document-type-specific tokens |
251
+ | AnyToken with low threshold | Too few tokens needed to pass | Use AllTokens for critical gates, or raise threshold |
252
+ | No negative validation | Template matches sibling documents | Always test against same-vendor PDFs that should NOT match |
253
+ | Equal weights on all children | A failed discriminator may not pull the weighted average below the composite threshold | Weight discriminators at 2.0+ |
254
+ | Single TextPatternMatchRule as root | No layered defense against similar documents | Use CompositeMatchRule with multiple signal types |
255
+ | Testing only the target PDF | False confidence in rule quality | Test against 2-3 negative examples from same vendor |
256
+
257
+ ## Worked example: splitting Microsoft invoices
258
+
259
+ **Problem:** One template uses `["Microsoft", "Invoice", "Bill To"]` in `AllTokens` mode. Both a subscription invoice and an Azure usage invoice contain all three tokens. Both classify at 1.0.
260
+
261
+ **Solution — two templates with mutual exclusivity:**
262
+
263
+ Template A (subscription):
264
+ ```json
265
+ {
266
+ "rootMatchRule": {
267
+ "$kind": "CompositeMatchRule",
268
+ "operator": 0,
269
+ "threshold": 0.8,
270
+ "children": [
271
+ {
272
+ "rule": {
273
+ "$kind": "TextPatternMatchRule",
274
+ "tokens": ["Microsoft", "Invoice"],
275
+ "mode": 1,
276
+ "threshold": 0.5
277
+ },
278
+ "weight": 1.0
279
+ },
280
+ {
281
+ "rule": {
282
+ "$kind": "TextPatternMatchRule",
283
+ "tokens": ["Subscription", "License", "Seats", "Renewal"],
284
+ "mode": 0,
285
+ "threshold": 0.5
286
+ },
287
+ "weight": 2.0
288
+ }
289
+ ]
290
+ }
291
+ }
292
+ ```
293
+
294
+ Template B (Azure usage):
295
+ ```json
296
+ {
297
+ "rootMatchRule": {
298
+ "$kind": "CompositeMatchRule",
299
+ "operator": 0,
300
+ "threshold": 0.8,
301
+ "children": [
302
+ {
303
+ "rule": {
304
+ "$kind": "TextPatternMatchRule",
305
+ "tokens": ["Microsoft", "Invoice"],
306
+ "mode": 1,
307
+ "threshold": 0.5
308
+ },
309
+ "weight": 1.0
310
+ },
311
+ {
312
+ "rule": {
313
+ "$kind": "TextPatternMatchRule",
314
+ "tokens": ["Azure", "Usage", "Consumption", "Resource Group"],
315
+ "mode": 0,
316
+ "threshold": 0.5
317
+ },
318
+ "weight": 2.0
319
+ }
320
+ ]
321
+ }
322
+ }
323
+ ```
324
+
325
+ **Why this works:**
326
+ - Both require "Microsoft" + "Invoice" (baseline gate, weight 1.0)
327
+ - Template A requires subscription-specific tokens (weight 2.0) — these are absent in Azure usage invoices
328
+ - Template B requires Azure-specific tokens (weight 2.0) — these are absent in subscription invoices
329
+ - With weights [1.0, 2.0] and composite threshold 0.8: if the discriminator (weight 2.0) scores 0.0, the weighted average is `(1.0×1 + 0.0×2) / (1+2) = 0.33` — well below 0.8
330
+
331
+ **Validation:**
332
+ - Subscription PDF against Template A → passes (both children match)
333
+ - Subscription PDF against Template B → fails (Azure tokens absent, avg drops below 0.8)
334
+ - Usage PDF against Template A → fails (subscription tokens absent)
335
+ - Usage PDF against Template B → passes (both children match)
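
The four validation outcomes above can be reproduced with a small simulation. It is a sketch under stated assumptions: tokens are matched as case-sensitive substrings, the composite averages raw child confidences, and the two page texts are hypothetical stand-ins for real flattened PDF text.

```python
def token_share(tokens, text):
    """Share of tokens found in text (assumption: case-sensitive substring match)."""
    return sum(token in text for token in tokens) / len(tokens)


def root_confidence(gate_tokens, discriminator_tokens, text):
    """And composite with weights 1.0 (AllTokens gate) and 2.0 (AnyToken discriminator)."""
    gate = 1.0 if token_share(gate_tokens, text) == 1.0 else 0.0
    discriminator = token_share(discriminator_tokens, text)
    return (gate * 1.0 + discriminator * 2.0) / (1.0 + 2.0)


GATE = ["Microsoft", "Invoice"]
TEMPLATES = {
    "A": ["Subscription", "License", "Seats", "Renewal"],
    "B": ["Azure", "Usage", "Consumption", "Resource Group"],
}
# Hypothetical flattened text for the two PDFs.
PDFS = {
    "subscription": "Microsoft Invoice Subscription License Seats Renewal",
    "usage": "Microsoft Invoice Azure Usage Consumption Resource Group",
}

# Evaluate every PDF against every template at the composite threshold of 0.8.
results = {
    (pdf, name): root_confidence(GATE, tokens, text) >= 0.8
    for pdf, text in PDFS.items()
    for name, tokens in TEMPLATES.items()
}
for key, passed in results.items():
    print(key, "passes" if passed else "fails")
```

The simulation is only a plausibility check; `evaluate-match.csx` against real PDFs remains the authoritative test.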
336
+
337
+ ## Diagnosing classification with per-rule scores
338
+
339
+ When `evaluate-match.csx` returns the result, the `matchedRules` array includes a summary for every rule in the tree — not just the root. Each summary carries the individual rule's `confidence` score.
340
+
341
+ For a composite root with two children, the output looks like:
342
+
343
+ ```json
344
+ {
345
+ "confidence": 0.89,
346
+ "matchedRules": [
347
+ { "ruleType": "CompositeMatchRule", "matched": true, "confidence": 0.89, "detail": null },
348
+ { "ruleType": "TextPatternMatchRule", "matched": true, "confidence": 1.0, "detail": null },
349
+ { "ruleType": "TextPatternMatchRule", "matched": true, "confidence": 0.75, "detail": null }
350
+ ]
351
+ }
352
+ ```
353
+
354
+ ### How to read the diagnostic
355
+
356
+ 1. **Root `confidence` = 1.0 for both templates?** The individual children will reveal the issue — look for children where confidence is 1.0 on BOTH invoice types. Those rules are not discriminating.
357
+ 2. **Child has `confidence: 1.0`?** That rule matches fully — all tokens/criteria are present. It provides no differentiation signal.
358
+ 3. **Child has `confidence: 0.0`?** That rule found nothing — either the tokens are absent or the structural criteria (table, geometry) don't match. This is the child that creates differentiation.
359
+ 4. **Child has fractional confidence?** Some but not all criteria matched. This is the gradient at work — the rule is partially relevant.
360
+
361
+ ### Troubleshooting 1.0 across sibling documents
362
+
363
+ If two templates both return `confidence: 1.0` against the same PDF, every child rule scored 1.0. The fix is to add a discriminator child with type-specific tokens or structural criteria that would score < 1.0 (or 0.0) for the wrong document type. Use the per-child breakdown to verify the new discriminator actually creates separation.
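
The per-child comparison can be sketched as follows. The function takes two results shaped like the JSON shown earlier, one for the target PDF and one for a sibling PDF evaluated against the same template, and lists the rules that score 1.0 in both. It assumes `matchedRules` lists the same rules in the same order across runs; the sample values are hypothetical.

```python
import json


def non_discriminating(result_target, result_sibling):
    """Indexes of rules that score 1.0 in both results, so they add no separation."""
    pairs = zip(result_target["matchedRules"], result_sibling["matchedRules"])
    return [
        index
        for index, (rule_t, rule_s) in enumerate(pairs)
        if rule_t["confidence"] == 1.0 and rule_s["confidence"] == 1.0
    ]


# Hypothetical results for one template run against two PDFs.
target = json.loads("""
{ "confidence": 1.0, "matchedRules": [
  { "ruleType": "CompositeMatchRule", "matched": true, "confidence": 1.0, "detail": null },
  { "ruleType": "TextPatternMatchRule", "matched": true, "confidence": 1.0, "detail": null },
  { "ruleType": "TextPatternMatchRule", "matched": true, "confidence": 1.0, "detail": null }
] }
""")
sibling = json.loads("""
{ "confidence": 0.33, "matchedRules": [
  { "ruleType": "CompositeMatchRule", "matched": false, "confidence": 0.33, "detail": null },
  { "ruleType": "TextPatternMatchRule", "matched": true, "confidence": 1.0, "detail": null },
  { "ruleType": "TextPatternMatchRule", "matched": false, "confidence": 0.0, "detail": null }
] }
""")

shared = non_discriminating(target, sibling)
print(shared)
```

Here only the rule at index 1 (the vendor gate) is shared, so the rule at index 2 is the one doing the separating. If every index were listed, the template would have no discriminator at all.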
@@ -0,0 +1,43 @@
1
+ # Extraction-source decision tree
2
+
3
+ Pick the extraction source by the *shape* of the data on the page, not by the field name. The subtypes below are mutually exclusive per field; if more than one seems to fit, the lower bullet wins.
4
+
5
+ ## `$kind` reference table
6
+
7
+ Every extraction source uses the `$kind` JSON discriminator to identify the SDK type. You **must** use these exact values in template JSON — any other value causes silent deserialization failure.
8
+
9
+ | `$kind` value | Mode / variant | Use case |
10
+ |---|---|---|
11
+ | `TextPatternExtractionSource` | `mode: "Token"` | Literal token match (one value) |
12
+ | `TextPatternExtractionSource` | `mode: "Pattern"` | Regex match (one value, first match) |
13
+ | `TextPatternExtractionSource` | `mode: "AllMatches"` | Regex match (all matches → collection) |
14
+ | `TextAnchorExtractionSource` | — | Value next to a spatially-anchored label |
15
+ | `TableCellExtractionSource` | — | Single cell by row/column coordinate |
16
+ | `TableRowsExtractionSource` | — | All data rows from a real PDF table |
17
+ | `MetadataFieldExtractionSource` | — | PDF metadata (Title, Author, etc.) |
18
+ | `FallbackExtractionSource` | — | Composite: try primary, fall back to fallback |
19
+
20
+ ## The decision questions, in order
21
+
22
+ 1. **Is there exactly one value on the page** (e.g. a reference number, a single date)? → use `TextPatternExtractionSource` with `mode: "Pattern"`. Why: it returns the first regex match against the flattened haystack and projects a single named capture group. In JSON: `{ "$kind": "TextPatternExtractionSource", "mode": "Pattern", "regexPattern": "..." }`.
23
+ 2. **Is the value a *list of similar values* on one page** (e.g. repeating line items, multiple reference numbers)? → use `TextPatternExtractionSource` with `mode: "AllMatches"`. Why: it returns every regex match, preserving order, as collection rows. In JSON: `{ "$kind": "TextPatternExtractionSource", "mode": "AllMatches", "regexPattern": "..." }`.
24
+ 3. **Is the value inside a *visually tabular* block** (rows and columns, bordered or unbordered)? → run `inspect.csx` and check `Tables[*].TotalRowCount`. If any table has `TotalRowCount > 1` with correct `RowPreviews`, use `TableRowsExtractionSource` for whole rows or `TableCellExtractionSource` for a single cell coordinate. Why: the engine detects both bordered (lattice) and unbordered (stream/spacing-based) tables via Tabula. Even without visible grid lines, consistent text spacing is often enough for detection.
25
+ 4. **Does the value live *next to a label* whose position varies** (e.g. `Total: 123.45` floating between header and footer)? → use `TextAnchorExtractionSource`. Why: it anchors on the label text and reads the adjacent run, so layout drift between PDFs is absorbed by the anchor.
26
+ 5. **Does the value's location vary by layout *version*** (same vendor, two PDF templates over time)? → wrap the primary source in a `FallbackExtractionSource` with `primary` and `fallback` properties. Why: the engine tries `primary` first and falls back to `fallback` if no value is produced.
27
+ 6. **Is the value in the PDF metadata** (author, title, creation date)? → use `MetadataFieldExtractionSource`. Why: it reads from the PDF's metadata dictionary, not from page text.
28
+
29
+ ## Anti-patterns
30
+
31
+ - Do not use `TextPatternExtractionSource` (mode `"Pattern"`) for repeating data; it will silently take only the first match. Use mode `"AllMatches"` instead.
32
+ - Do not assume `TableRowsExtractionSource` is unusable just because the PDF "looks tabular but isn't bordered." The engine uses both Tabula lattice (bordered) AND stream (unbordered, spacing-based) table detection. **Always run `inspect.csx` and check `Tables[*].TotalRowCount`** — if any table has `TotalRowCount > 1` with meaningful `RowPreviews`, `TableRowsExtractionSource` will work even without visible borders.
33
+ - If `inspect.csx` shows tables with only 1 row (header-only) or no tables at all for a visually tabular section, fall back to `TextPatternExtractionSource` with `mode: "AllMatches"`. Before writing the regex, read the actual `FlattenedText` from `inspect.csx` — it may differ significantly from the visual PDF layout due to column-grouped block ordering.
34
+ - Do not chain more than three `FallbackExtractionSource` layers; that is a sign the document is fundamentally different and warrants a separate `Template`.
35
+ - Do not use `PatternExtractionSource` or `AllMatchesExtractionSource` — these are conceptual names that do **not** exist in the SDK. The actual type is always `TextPatternExtractionSource` with a `mode` property.
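
Because an unrecognised `$kind` is described as failing silently, a pre-flight check can be useful. The sketch below walks parsed template JSON and reports any `$kind` that ends in `ExtractionSource` but is not in the reference table. The walker and the sample template are hypothetical; `validate-template.csx` remains the authoritative check.

```python
import json

# The extraction-source $kind values from the reference table.
VALID_SOURCES = {
    "TextPatternExtractionSource",
    "TextAnchorExtractionSource",
    "TableCellExtractionSource",
    "TableRowsExtractionSource",
    "MetadataFieldExtractionSource",
    "FallbackExtractionSource",
}


def unknown_source_kinds(node, path="$"):
    """Yield (path, kind) for every unrecognised extraction-source $kind.

    Only values ending in "ExtractionSource" are checked, so mappings
    and match rules are ignored.
    """
    if isinstance(node, dict):
        kind = node.get("$kind")
        if (
            isinstance(kind, str)
            and kind.endswith("ExtractionSource")
            and kind not in VALID_SOURCES
        ):
            yield path, kind
        for key, value in node.items():
            yield from unknown_source_kinds(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from unknown_source_kinds(item, f"{path}[{index}]")


# Hypothetical mapping that uses one of the non-existent conceptual names.
template = json.loads("""
{
  "$kind": "FieldMapping",
  "fieldName": "totalAmount",
  "source": {
    "$kind": "FallbackExtractionSource",
    "primary": { "$kind": "PatternExtractionSource", "regexPattern": "Total" },
    "fallback": { "$kind": "TextPatternExtractionSource", "mode": "Pattern", "regexPattern": "Total" }
  }
}
""")

problems = list(unknown_source_kinds(template))
print(problems)
```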
36
+
37
+ ## Layout variants and template splitting
38
+
39
+ If `DryRunSucceeded` returns an empty collection for a field that should have data and `inspect.csx` reveals a different page structure than the PDF used during authoring, you are facing a layout variant. The canonical splitting procedure (choosing discriminators, tightening the original `rootMatchRule`, validating mutual exclusivity with `evaluate-match.csx`) lives in [`classification.md` § Worked example: splitting Microsoft invoices](classification.md#worked-example-splitting-microsoft-invoices). Follow that procedure, then return here to confirm the `ExtractionSource` choice survives the split.
40
+
41
+ ## After picking
42
+
43
+ Confirm the choice by running `dotnet script scripts/dry-run.csx -- --pdf <pdf> --template <template.json>` and inspecting the field's `ExtractionDiagnostics` trace. If the wrong source was picked, the trace shows which match path was attempted and why it produced no value. Then go to `workflow.md` Step 6.
@@ -0,0 +1,169 @@
1
+ # Failure-mode decision tree
2
+
3
+ Every PDF run produces one of three outcomes: `SucceededResult` (done), `RejectedResult` (the engine refused before completing — see `RejectionReason`), or `FailedResult` (the engine ran a step and it threw — see `StepIdentifier`). Classification can additionally produce "no template matched". Use the matching subsection below.
4
+
5
+ ## Stderr `error.code` → Branch routing
6
+
7
+ Every script emits errors as `{ "error": { "code": "<code>", "message": "...", "detail": "..." } }` on stderr with non-zero exit. Map the stderr `error.code` to a branch below before reading the prose narrative:
8
+
9
+ | `error.code` | Emitted by | Meaning | Go to |
10
+ | --- | --- | --- | --- |
11
+ | `pdf-not-found` | every script that takes `--pdf` | The path passed to `--pdf` does not resolve to a file | Fix the path (relative paths resolve from the cwd); re-run. No branch — this is an input error, not an SDK outcome. |
12
+ | `template-not-found` | `load-template.csx`, `evaluate-match.csx`, `dry-run.csx`, `execute.csx` | Template ID does not exist in the store, or the `--template` path does not resolve | Run `list-templates.csx` to confirm the ID, or correct the path. No branch. |
13
+ | `no-store` | `classify.csx`, `list-templates.csx`, `load-template.csx`, `save-template.csx` | No `--store-path` or `--store-url` was provided and the default `./templates` directory was not found | Pass `--store-path <dir>` to specify a local template store directory; see `references/scripts.md` § Common Store Parameters. No branch. |
14
+ | `bad-format` | `validate-template.csx`, `dry-run.csx`, `execute.csx`, `save-template.csx` | Template JSON is malformed or fails schema validation (parse error, missing required property, invalid enum value) | **Branch A** — `RejectionReason.MalformedTemplate`. Run `validate-template.csx` for the schema error list. |
15
+ | `rejected` | `dry-run.csx`, `execute.csx` | Engine returned `RejectedResult` — see `RejectionReason` in stderr `detail` | **Branch A** — match on the `RejectionReason` enum value. |
16
+ | `failed` | `dry-run.csx`, `execute.csx` | Engine returned `FailedResult` — see `StepIdentifier` in stderr `detail` | **Branch B** — match on the `StepIdentifier` enum value. |
17
+ | `empty-result` / silent `DryRunSucceeded` with empty collections | `dry-run.csx`, `execute.csx` | The engine succeeded but a `RepeatingFieldMapping` returned `[]` or a scalar returned `null` unexpectedly | **Branch D** — silent extraction failure (no stderr; detect by inspecting stdout). |
18
+ | Wrong template ranked first, or no template ranked | `classify.csx` | Classification produced unexpected ordering — diagnose via the ranked confidence gradient | **Branch C** — see also [`classification.md` § Interpreting the gradient](classification.md#interpreting-the-gradient). |
19
+ | `already-exists` | `save-template.csx` | Template with the same ID already exists in the store | Pass `--overwrite` if intentional, or pick a different ID. No branch. |
20
+ | `unhandled` | any script | Unexpected exception inside the script (not an SDK outcome) — the `detail` field contains the stack | File a bug; this is an SDK or script defect, not a template/PDF problem. No branch. |
21
+
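The routing table above can be applied mechanically. The following is a minimal Python sketch, not part of the package, that parses one stderr envelope and returns the branch letter. It assumes stderr contains only the JSON envelope; the `ROUTES` name, the `route_error` helper, and the sample `detail` value are hypothetical.

```python
import json
from typing import Optional, Tuple

# Branch letters follow the routing table; codes mapped to None are the
# rows the table marks "No branch" (input or environment errors).
ROUTES = {
    "pdf-not-found": None,
    "template-not-found": None,
    "no-store": None,
    "already-exists": None,
    "unhandled": None,
    "bad-format": "A",
    "rejected": "A",
    "failed": "B",
    "empty-result": "D",
}


def route_error(stderr_text: str) -> Tuple[str, Optional[str]]:
    """Return (error.code, branch letter or None) for one stderr envelope."""
    envelope = json.loads(stderr_text)
    code = envelope["error"]["code"]
    if code not in ROUTES:
        raise ValueError(f"unrecognised error.code: {code}")
    return code, ROUTES[code]


# Illustrative envelope; the detail text is made up for this sketch.
sample = '{"error": {"code": "rejected", "message": "...", "detail": "MalformedTemplate"}}'
print(route_error(sample))  # ('rejected', 'A')
```

Branch C has no entry because the table lists it as a classification symptom rather than an `error.code`, so it has to be detected from the ranked output instead.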
22
+ ## Branch A — RejectedResult
23
+
24
+ ### RejectionReason.InvalidPdf
25
+
26
+ - **Meaning:** the PDF stream could not be parsed.
27
+ - **Diagnose:** run `dotnet script scripts/inspect.csx -- <pdf>`. If `PdfInspection.PageCount` is `0`, the file is unparseable.
28
+ - **Remediation:** confirm the file is a real PDF (magic bytes `%PDF-`), not a renamed image or HTML. If the source is a scan, the engine cannot extract from rasterised content; OCR upstream first.
29
+
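The magic-byte check in the remediation step can be done before invoking any script. This is a small Python sketch under the assumption that a strict check of the first five bytes is enough; `looks_like_pdf` and `file_looks_like_pdf` are hypothetical helper names, and the engine's own parsing rules may be more or less lenient than this.

```python
def looks_like_pdf(head: bytes) -> bool:
    """True when the leading bytes carry the `%PDF-` header marker."""
    return head.startswith(b"%PDF-")


def file_looks_like_pdf(path: str) -> bool:
    """Read only the first five bytes of the file and test them."""
    with open(path, "rb") as handle:
        return looks_like_pdf(handle.read(5))


print(looks_like_pdf(b"%PDF-1.7\n"))   # True
print(looks_like_pdf(b"<html><body>")) # False
```

A file that passes this check can still be a scan with no text layer, so a passing result rules out only the renamed-image and HTML cases.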
30
+ ### RejectionReason.MalformedTemplate
31
+
32
+ - **Meaning:** the `Template` JSON is structurally invalid (schema violations, missing required fields, unknown discriminators).
33
+ - **Diagnose:** run `dotnet script scripts/validate-template.csx -- <template.json>`.
34
+ - **Remediation:** fix every validation error reported. Skipping the validator gains nothing, because the engine applies the same check at runtime.
35
+
36
+ ### RejectionReason.UnknownOutputGenerator
37
+
38
+ - **Meaning:** the generic `TGenerator` passed to `IDocuoriaEngine.ExecuteTemplateAsync<TGenerator, TOptions>` is not registered in DI.
39
+ - **Diagnose:** search host startup for `AddOutputGenerator<TGenerator, TOptions>` (or convenience helpers such as `AddCsvOutputGenerator`).
40
+ - **Remediation:** register the generator before calling execute, or pick an already-registered generator.
41
+
42
+ ### RejectionReason.GeneratorRejected
43
+
44
+ - **Meaning:** the generator refused the extracted data (e.g. multiple collections handed to a CSV generator).
45
+ - **Diagnose:** run `dotnet script scripts/dry-run.csx -- <pdf> <template.json>` — `DryRunSucceeded` shows what the generator would have received.
46
+ - **Remediation:** reshape the template (split into multiple templates, or pick a richer generator that accepts the shape).
47
+
48
+ ## Branch B — FailedResult
49
+
50
+ Read `FailedResult.Step` (enum-typed `StepIdentifier`) first; everything else is diagnostic context.
51
+
52
+ ### Step = Retrieval
53
+
54
+ - **Meaning:** a retrieval provider threw.
55
+ - **Diagnose:** read `FailedResult.ErrorMessage` and `FailedResult.Exception`. For HTTP retrieval check connectivity and 4xx/5xx status.
56
+ - **Remediation:** fix the retrieval source; rerun.
57
+
58
+ ### Step = Extraction
59
+
60
+ - **Meaning:** a pattern, table, or anchor extraction step threw or could not produce a value.
61
+ - **Diagnose:** rerun via `dotnet script scripts/dry-run.csx -- <pdf> <template.json>` (diagnostics on by default). Read `FailedResult.FieldPath` to identify the field, then run `dotnet script scripts/test-pattern.csx` (for `TextPatternExtractionSource` mode `"Pattern"` / `"AllMatches"`) or `dotnet script scripts/inspect.csx` (for `TableRowsExtractionSource` / `TableCellExtractionSource`).
62
+ - **Remediation:** see `pattern-authoring.md` or revisit `decision-tree.md`.
63
+
64
+ ### Step = Transformation
65
+
66
+ - **Meaning:** a field coercion failed (string → `DateOnly`, `decimal`, etc.).
67
+ - **Diagnose:** read the structured fields on `FailedResult`: `FieldPath` (which field), `SourceText` (the raw captured substring, capped at 256 chars with a trailing ellipsis), `TargetTypeName` (the destination type), `InnerDetail` (the coercion exception detail).
68
+ - **Remediation:** tighten the regex so that it captures a coercible substring (see `patterns.md` patterns 4 vs. 3), change the field type, or add a transform step.
69
+
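To see why tightening the capture helps, here is an analogy in Python rather than the engine's own coercion code. The sample line, both regexes, and the `try_decimal` helper are hypothetical; the point is only that a loose capture drags non-numeric text into the conversion while a tight capture does not.

```python
import re
from decimal import Decimal, InvalidOperation

line = "Total due: USD 1234.56 (incl. tax)"


def try_decimal(captured):
    """Return a Decimal, or None when the captured text is not coercible."""
    try:
        return Decimal(captured)
    except InvalidOperation:
        return None


# Loose: captures everything after the label, including currency and notes.
loose = re.search(r"Total due:\s*(.+)", line).group(1)
# Tight: skips an optional currency code and captures only the number.
tight = re.search(r"Total due:\s*(?:[A-Z]{3}\s+)?(\d+\.\d{2})", line).group(1)

print(try_decimal(loose))  # None
print(try_decimal(tight))  # 1234.56
```

In a real failure, `SourceText` plays the role of the loose capture here: it shows what the regex handed to the coercion step.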
70
+ ### Step = Publish
71
+
72
+ - **Meaning:** the output generator threw mid-write (vs. cleanly rejecting, which would yield `RejectionReason.GeneratorRejected`).
73
+ - **Diagnose:** read `FailedResult.Exception`.
74
+ - **Remediation:** typical causes are I/O (path not writable, file locked) or generator bugs — fix infrastructure first, then file a generator issue.
75
+
76
+ ### Step = Unknown
77
+
78
+ - **Meaning:** legacy/default value (assigned `0`). It appears when the legacy three-arg `FailedResult` constructor was used and no `Step` was set.
79
+ - **Diagnose:** treat as `Extraction` until proven otherwise; read `FailedResult.ErrorMessage` and `FailedResult.Exception`.
80
+ - **Remediation:** proceed as for the inferred step.
81
+
82
+ ## Branch C — Classification failure
83
+
84
+ Classification issues come in three forms: no template matched, the wrong template matched, or no template is confident enough. Use `classify.csx` to get a ranked view of all templates with their `confidence` scores — this replaces the binary match/no-match model with a confidence gradient.
85
+
86
+ ### First step: get the ranked classification
87
+
88
+ ```
89
+ dotnet script scripts/classify.csx -- --pdf <pdf> --store-path <templates-dir>
90
+ ```
91
+
92
+ This returns the top-N templates sorted by `confidence` (descending). For the canonical confidence-to-action table, see [`classification.md` § Interpreting the gradient](classification.md#interpreting-the-gradient). If the matches array is empty, no templates are stored at all — author from scratch (`workflow.md` Step 3).
93
+
94
+ ### Scenario: Wrong template matched (misclassification)
95
+
96
+ - **Meaning:** the top-ranked template with high `confidence` produces empty or incorrect data because the PDF belongs to a different document type from the same vendor.
97
+ - **How to detect:** `DryRunSucceeded` with empty collections or null scalars where data should exist (→ Branch D), OR extracted values are nonsensical (wrong field mapped to wrong content).
98
+ - **Diagnose:**
99
+ 1. Run `classify.csx` — check if the correct template appears lower in the ranked list. If so, the correct template's rules are weaker than expected.
100
+ 2. Run `dotnet script scripts/inspect.csx -- --pdf <pdf>` — compare the page structure against the PDF used when authoring the matched template.
101
+ 3. Check the `confidence` gap between the wrong match and the correct template — a narrow gap means the rules lack discrimination.
102
+ - **Remediation:**
103
+ 1. The matched template's `rootMatchRule` is too broad — it matches documents it cannot extract from. See `classification.md` for how to tighten.
104
+ 2. Strengthen the correct template's discriminating rules to boost its `confidence` above the wrong match.
105
+ 3. Validate with `classify.csx` — the correct template must rank #1 with a clear gap over sibling templates.
106
+
107
+ ### Scenario: No match at all
108
+
109
+ - **Meaning:** all templates score near zero `confidence`.
110
+ - **Remediation:** author a new template (back to `workflow.md` Step 3). If any template scores moderately (0.3+), use `load-template.csx` to retrieve it as a structural starting point — its field layout and extraction sources may transfer even if its match rules don't fit.
111
+
112
+ ### Root cause patterns for classification failures
113
+
114
+ | Pattern | Symptom | Fix |
115
+ |---|---|---|
116
+ | Vendor-only tokens | Multiple document types from same vendor score similarly | Add document-type-specific discriminators (see `classification.md`) |
117
+ | `AnyToken` with low threshold | Template matches when only generic tokens are present | Switch to `AllTokens` for critical gates, or raise threshold |
118
+ | No structural rules | Two documents share text tokens but have different layouts | Add `TableMatchRule`, `PageGeometryMatchRule`, or `TextAnchorMatchRule` |
119
+ | Single rule (no composite) | No layered defense | Use `CompositeMatchRule` with weighted discriminators |
120
+ | Never tested against negative examples | Rule seems fine against target but matches siblings too | Always validate against same-vendor PDFs that should NOT match |
121
+ | Narrow confidence gap | Top two templates score within 0.1 of each other | Strengthen discriminators on both to widen the gap |
122
+
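The "narrow confidence gap" row can be checked automatically once the ranked output is parsed. The Python sketch below assumes each match is an object with a numeric `confidence` field, as the text describes; the `templateId` key, the helper names, and the sample scores are hypothetical, and the `0.1` threshold is taken from the table.

```python
def confidence_gap(matches):
    """Gap between the two highest confidence scores, or None with fewer than two."""
    scores = sorted((m["confidence"] for m in matches), reverse=True)
    if len(scores) < 2:
        return None
    return scores[0] - scores[1]


def is_narrow(matches, threshold=0.1):
    """True when the top two templates score within `threshold` of each other."""
    gap = confidence_gap(matches)
    return gap is not None and gap < threshold


# Illustrative ranked list; IDs and scores are made up for this sketch.
ranked = [
    {"templateId": "vendor-invoice", "confidence": 0.82},
    {"templateId": "vendor-statement", "confidence": 0.78},
    {"templateId": "other-vendor", "confidence": 0.12},
]
print(is_narrow(ranked))  # True
```

A `True` result points at the discriminators on the top two templates, not at the lower-ranked ones.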
123
+ ## Branch D — DryRunSucceeded but data is empty or incomplete
124
+
125
+ This is the *silent failure* mode — the pipeline reports success but one or more fields (typically a `RepeatingFieldMapping`) returned `null` or an empty collection. The dry-run does not flag this as an error because the engine has no expectation about how many rows should be extracted.
126
+
127
+ ### Symptom: RepeatingFieldMapping returns `[]` (empty collection)
128
+
129
+ - **Diagnose:**
130
+ 1. Run `dotnet script scripts/dry-run.csx -- <pdf> <template.json>` — confirm the collection field is present but empty.
131
+ 2. Run `dotnet script scripts/inspect.csx -- <pdf> --page <N>` — examine `FlattenedText` and `Tables` for the page containing the expected data.
132
+ 3. If `Tables` shows entries with `TotalRowCount > 1` and meaningful `RowPreviews`, the data IS detectable as a table → switch to `TableRowsExtractionSource`.
133
+ 4. If `Tables` shows only header-like entries (1 row) or no entries, read `FlattenedText` carefully. The block flattening order may differ from the visual PDF layout.
134
+ 5. Run `dotnet script scripts/test-pattern.csx -- <pdf> "<your-regex>"` against the ACTUAL flattened text (not the text you see when viewing the PDF).
135
+
136
+ - **Root causes (in likelihood order):**
137
+
138
+ | Cause | How to confirm | Remediation |
139
+ |---|---|---|
140
+ | **Layout variant** — The template was designed for a different document layout from the same vendor | Compare the PDF's page 2+ structure against the one used during authoring. Look for differences in section headers, column layout, or whether data appears per-product vs. in a single table. | **Split into separate templates** with narrower match rules. See [`classification.md` § Worked example: splitting Microsoft invoices](classification.md#worked-example-splitting-microsoft-invoices) for the canonical procedure. |
141
+ | **Block order differs from visual order** — The PDF's internal text blocks are column-grouped (all labels in one block, all values in another) rather than row-grouped | Run `inspect.csx` and read the `Blocks` array with coordinates. If all row labels share one block and values are in separate column-aligned blocks, the flattened haystack destroys row associations. | If `Tables` detection finds the structure → use `TableRowsExtractionSource`. Otherwise, this layout may not be extractable with the current SDK; document as a known limitation. |
142
+ | **Pattern mismatch** — The AllMatches regex was written for a different text structure | Run `test-pattern.csx` — if `HasMatches` is `false`, the pattern doesn't match the actual haystack. | Rewrite the regex against the actual `FlattenedText` from `inspect.csx`. See `pattern-authoring.md`. |
143
+ | **Page targeting** — The extraction source targets the wrong page | Check `pageNumber` in the extraction source config vs. where the data actually appears. | Correct the page number or remove it to search all pages. |
144
+
145
+ ### Layout variant splitting
146
+
147
+ When a single template classifies multiple document layouts correctly (same vendor, same match tokens) but extraction works for only one layout, follow [`classification.md` § Worked example: splitting Microsoft invoices](classification.md#worked-example-splitting-microsoft-invoices) — the canonical step-by-step procedure for splitting templates, choosing discriminators, and validating mutual exclusivity.
148
+
149
+ ### Symptom: Scalar field returns `null` unexpectedly
150
+
151
+ - **Diagnose:** run `dotnet script scripts/test-pattern.csx -- <pdf> "<regex>"` with the field's regex. If no match, inspect the haystack for the actual text surrounding the expected value.
152
+ - **Remediation:** adjust the regex. If the field genuinely doesn't exist in this document variant, consider making it non-required (`"isRequired": false`) or splitting templates.
153
+
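Because Branch D produces no stderr, the empty fields have to be found in stdout. A minimal Python sketch of that scan follows, assuming the dry-run output is a JSON document; the field names in the sample and the `find_empty_fields` helper are hypothetical, since the actual output shape is not shown here.

```python
import json


def find_empty_fields(node, path=""):
    """Yield dotted paths whose value is null or an empty list."""
    if node is None or node == []:
        yield path or "<root>"
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from find_empty_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_empty_fields(item, f"{path}[{index}]")


# Illustrative stdout payload; field names are made up for this sketch.
stdout_text = (
    '{"invoiceNumber": "INV-001", "dueDate": null, '
    '"lineItems": [], "totals": {"tax": null, "net": 10.5}}'
)
print(list(find_empty_fields(json.loads(stdout_text))))
# ['dueDate', 'lineItems', 'totals.tax']
```

Each reported path still needs a judgment call: an optional field may be legitimately `null`, so compare the list against the fields that should have data for this document.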
154
+ ## Quick reference
155
+
156
+ | Outcome | Enum value | Script | Remediation pointer |
157
+ | --- | --- | --- | --- |
158
+ | `RejectedResult` | `RejectionReason.InvalidPdf` | `inspect.csx` | check magic bytes / OCR upstream |
159
+ | `RejectedResult` | `RejectionReason.MalformedTemplate` | `validate-template.csx` | fix schema errors |
160
+ | `RejectedResult` | `RejectionReason.UnknownOutputGenerator` | (host startup) | register generator in DI |
161
+ | `RejectedResult` | `RejectionReason.GeneratorRejected` | `dry-run.csx` | reshape template or change generator |
162
+ | `FailedResult` | `Step = Retrieval` | (inspect retrieval) | fix retrieval source |
163
+ | `FailedResult` | `Step = Extraction` | `dry-run.csx`, `test-pattern.csx`, `inspect.csx` | `pattern-authoring.md`, `decision-tree.md` |
164
+ | `FailedResult` | `Step = Transformation` | `dry-run.csx` | tighten pattern or change field type |
165
+ | `FailedResult` | `Step = Publish` | (inspect I/O) | fix infrastructure |
166
+ | `FailedResult` | `Step = Unknown` | `dry-run.csx` | treat as Extraction |
167
+ | `DryRunSucceeded` | empty collection | `dry-run.csx`, `inspect.csx`, `test-pattern.csx` | Branch D above — layout variant or block-order mismatch |
168
+ | Classification | no match | `classify.csx`, `evaluate-match.csx` | Branch C — examine ranked results, author or refine template |
169
+ | Classification | wrong match | `classify.csx`, `inspect.csx` | Branch C — tighten rules, strengthen discriminators (see `classification.md`) |