@sidub-inc/docuoria.cli 1.0.15
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/index.js +1056 -0
- package/package.json +56 -0
- package/payload/.claude-plugin/plugin.json +21 -0
- package/payload/MANIFEST.json +322 -0
- package/payload/SKILL.md +88 -0
- package/payload/assets/lib/Docuoria.dll +0 -0
- package/payload/assets/schemas/template-schema.json +413 -0
- package/payload/commands/classify.md +11 -0
- package/payload/commands/diagnose.md +11 -0
- package/payload/commands/extract.md +11 -0
- package/payload/commands/inspect.md +11 -0
- package/payload/commands/validate-template.md +11 -0
- package/payload/examples/01-extract-to-csv.md +49 -0
- package/payload/examples/02-classify-unknown-pdf.md +102 -0
- package/payload/examples/03-diagnose-failed-result.md +68 -0
- package/payload/references/classification.md +363 -0
- package/payload/references/decision-tree.md +43 -0
- package/payload/references/failure-tree.md +169 -0
- package/payload/references/pattern-authoring.md +40 -0
- package/payload/references/patterns.md +97 -0
- package/payload/references/privacy.md +36 -0
- package/payload/references/scripts.md +361 -0
- package/payload/references/template-reference.md +606 -0
- package/payload/references/workflow.md +163 -0
- package/payload/scripts/_common.csx +250 -0
- package/payload/scripts/classify.csx +53 -0
- package/payload/scripts/dry-run.csx +85 -0
- package/payload/scripts/evaluate-match.csx +72 -0
- package/payload/scripts/execute.csx +89 -0
- package/payload/scripts/inspect.csx +43 -0
- package/payload/scripts/list-templates.csx +34 -0
- package/payload/scripts/load-template.csx +54 -0
- package/payload/scripts/save-template.csx +53 -0
- package/payload/scripts/schema-info.csx +84 -0
- package/payload/scripts/test-groups.csx +44 -0
- package/payload/scripts/test-pattern.csx +61 -0
- package/payload/scripts/validate-template.csx +54 -0
- package/payload/skill/SKILL.md +88 -0
- package/payload/skill/assets/lib/Docuoria.dll +0 -0
- package/payload/skill/assets/schemas/template-schema.json +413 -0
- package/payload/skill/examples/01-extract-to-csv.md +49 -0
- package/payload/skill/examples/02-classify-unknown-pdf.md +102 -0
- package/payload/skill/examples/03-diagnose-failed-result.md +68 -0
- package/payload/skill/references/classification.md +363 -0
- package/payload/skill/references/decision-tree.md +43 -0
- package/payload/skill/references/failure-tree.md +169 -0
- package/payload/skill/references/pattern-authoring.md +40 -0
- package/payload/skill/references/patterns.md +97 -0
- package/payload/skill/references/privacy.md +36 -0
- package/payload/skill/references/scripts.md +361 -0
- package/payload/skill/references/template-reference.md +606 -0
- package/payload/skill/references/workflow.md +163 -0
- package/payload/skill/scripts/_common.csx +250 -0
- package/payload/skill/scripts/classify.csx +53 -0
- package/payload/skill/scripts/dry-run.csx +85 -0
- package/payload/skill/scripts/evaluate-match.csx +72 -0
- package/payload/skill/scripts/execute.csx +89 -0
- package/payload/skill/scripts/inspect.csx +43 -0
- package/payload/skill/scripts/list-templates.csx +34 -0
- package/payload/skill/scripts/load-template.csx +54 -0
- package/payload/skill/scripts/save-template.csx +53 -0
- package/payload/skill/scripts/schema-info.csx +84 -0
- package/payload/skill/scripts/test-groups.csx +44 -0
- package/payload/skill/scripts/test-pattern.csx +61 -0
- package/payload/skill/scripts/validate-template.csx +54 -0
@@ -0,0 +1,97 @@
# Patterns

The patterns below are *illustrative*. They demonstrate authoring techniques, not copy-paste solutions. Real PDFs have layout quirks (broken words, Unicode whitespace, OCR drift), so a pattern that matches the visible text will often miss the engine's flattened haystack. Before using any pattern below, run `dotnet script scripts/test-pattern.csx -- --pdf <pdf> --pattern '<regex>'` and adapt it to what `PatternTestResult.Matches` and `PatternTestResult.Gaps` actually report. See `pattern-authoring.md` for techniques.

### 1. ISO 8601 date

- regex: `\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b`
- Matches: `2024-01-15`, `1999-12-31`.
- Non-matches: `2024-13-01`, `24-01-15`, `2024/01/15`.
- Field type: `DateOnly`.
- Teaches: alternation between character-class sequences inside groups to bound the month and day ranges.

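The samples listed for this pattern can be checked mechanically. The sketch below uses Python's `re` module, not the .NET engine Docuoria runs on, so treat it as an approximation that should agree on these ASCII samples. `is_real_date` is a hypothetical helper, added only to show that the regex bounds the month and day ranges but does not reject impossible calendar dates such as `2024-02-31`.

```python
import re
from datetime import date

# Pattern 1 from above, unchanged.
ISO_DATE = re.compile(r"\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")


def is_real_date(text: str) -> bool:
    """Hypothetical post-check: does the matched text name a real calendar day?"""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


matches = [s for s in ("2024-01-15", "1999-12-31") if ISO_DATE.search(s)]
non_matches = [s for s in ("2024-13-01", "24-01-15", "2024/01/15") if not ISO_DATE.search(s)]

# The regex accepts day 31 in any month, so a date parser is still needed afterwards.
shape_only = ISO_DATE.search("2024-02-31") is not None and not is_real_date("2024-02-31")
```

If the field type is `DateOnly`, the equivalent parse step on the .NET side would presumably catch the same impossible dates; that is an assumption about the engine, not something this document states.
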
### 2. US ZIP / CA postal code (combined)

- regex: `\b(?:\d{5}(?:-\d{4})?|[A-Z]\d[A-Z] ?\d[A-Z]\d)\b`
- Matches: `90210`, `90210-1234`, `K1A 0B1`, `K1A0B1`.
- Non-matches: `1234`, `9021O` (letter O instead of zero).
- Field type: `string`.
- Teaches: non-capturing groups and optional whitespace with `?`.

### 3. Currency with symbol

- regex: `(?<currency>[$€£¥])\s?(?<amount>\d{1,3}(?:,\d{3})*(?:\.\d{2})?)`
- Matches: `$1,234.56`, `€ 99.00`, `£10`.
- Non-matches: `1234.56` (no symbol). `$1.234,56` (European grouping) is not matched in full, but because the pattern is unanchored it still yields the partial match `$1.23`.
- Field type: `decimal`.
- Teaches: named groups (consumed downstream by `PatternExtractionSource.PrimaryGroup`).

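The named groups in this pattern can be exercised outside the engine. The sketch below is a Python approximation: Python's `re` spells named groups `(?P<name>...)` rather than .NET's `(?<name>...)`, so the pattern is rewritten accordingly. `parse_amount` is a hypothetical helper, and stripping the thousands separators before conversion is an assumption about what a `decimal` field needs.

```python
import re
from decimal import Decimal

# Pattern 3, with .NET's (?<name>...) rewritten as Python's (?P<name>...).
CURRENCY = re.compile(
    r"(?P<currency>[$€£¥])\s?(?P<amount>\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"
)


def parse_amount(text: str):
    """Return (symbol, Decimal amount) for the first match, or None."""
    m = CURRENCY.search(text)
    if m is None:
        return None
    # Strip thousands separators before converting to a decimal value.
    return m.group("currency"), Decimal(m.group("amount").replace(",", ""))


full = parse_amount("$1,234.56")
spaced = parse_amount("€ 99.00")
# European grouping is not understood; the unanchored pattern stops early.
partial = CURRENCY.search("$1.234,56").group(0)
```

Here `partial` is `$1.23`, which is why European-formatted input is worth testing explicitly before trusting the extracted value.
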
### 4. Currency, symbol-stripped numeric only

- regex: `(?<![\$€£¥\d])-?\d{1,3}(?:,\d{3})*\.\d{2}(?![\d])`
- Matches: `1,234.56`, `-99.00`.
- Non-matches: `1234` (no decimal), `1,234.5` (one fraction digit).
- Field type: `decimal`.
- Teaches: lookbehind/lookahead to reject embedded matches.

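The lookarounds in this pattern are easiest to understand by watching what they accept and reject. The following is a Python sketch (again an approximation of the .NET engine) that runs the pattern unchanged over the documented samples, plus one extra input chosen here for illustration: a symbol-prefixed amount. `find_amounts` is a hypothetical helper.

```python
import re

# Pattern 4 from above, unchanged.
BARE_AMOUNT = re.compile(r"(?<![\$€£¥\d])-?\d{1,3}(?:,\d{3})*\.\d{2}(?![\d])")


def find_amounts(text: str) -> list[str]:
    """Return every bare amount the pattern finds in the text."""
    return BARE_AMOUNT.findall(text)


plain = find_amounts("Subtotal 1,234.56 Discount -99.00")
rejected = find_amounts("1234 and 1,234.5")
# The lookbehind inspects a single character and "," is not in its class,
# so a symbol-prefixed amount still leaks its tail.
leaked = find_amounts("$1,234.56")
```

In this sketch `leaked` is `["234.56"]`. If that matters for a given document, adding `,` to the lookbehind class is one possible tightening, but it should be verified with `test-pattern.csx` against the real haystack.
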
### 5. Integer (standalone)

- regex: `(?<![\d.])\d+(?![\d.])`
- Matches: `42`, `1000`.
- Non-matches: `3.14`, `12345.67`.
- Field type: `int`.
- Teaches: negative lookarounds for "not part of a bigger token".

### 6. Decimal (any precision)

- regex: `(?<![\d.])-?\d+\.\d+(?![\d.])`
- Matches: `3.14`, `-0.001`.
- Non-matches: `3.`, `.5`, `3.14.15`.
- Field type: `decimal`.
- Teaches: balanced lookarounds preventing partial overlap.

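Patterns 5 and 6 are designed to be complementary: on mixed text, each should pick out only its own kind of token. The sketch below checks that with Python's `re` as a stand-in for the .NET engine; the sample sentence is invented for illustration.

```python
import re

# Patterns 5 and 6 from above, unchanged.
INTEGER = re.compile(r"(?<![\d.])\d+(?![\d.])")
DECIMAL = re.compile(r"(?<![\d.])-?\d+\.\d+(?![\d.])")

sample = "qty 42 at 3.14 each, build 1.2.3, total 1000"

integers = INTEGER.findall(sample)
decimals = DECIMAL.findall(sample)

# A trailing full stop counts as "part of a bigger token" for pattern 5.
sentence_end = INTEGER.search("Total: 42.")
```

Neither pattern matches any part of the dotted version string `1.2.3`. The last line shows a side effect of the `.` in the lookahead: an integer that ends a sentence is rejected, so prose-heavy PDFs may need a looser trailing condition.
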
### 7. Percentage

- regex: `(?<value>-?\d+(?:\.\d+)?)\s?%`
- Matches: `50%`, `12.5 %`, `-3%`.
- Non-matches: `% 50`, `fifty percent`.
- Field type: `decimal`.
- Teaches: optional whitespace plus capture before a literal suffix.

### 8. Email (RFC-pragmatic)

- regex: `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`
- Matches: `a@b.co`, `first.last+tag@example.com`.
- Non-matches: `a@b`, `@example.com`.
- Field type: `string`.
- Teaches: word-boundary anchoring and the pragmatic-vs-RFC-strict tradeoff.

### 9. Phone E.164

- regex: `\+[1-9]\d{1,14}\b`
- Matches: `+14165551212`, `+442071838750`.
- Non-matches: `416-555-1212`, `+0123` (leading zero after `+`).
- Field type: `string`.
- Teaches: format normalisation upstream of extraction (the engine expects already-cleaned input).

### 10. URL (http/https)

- regex: `https?://[^\s<>"']+`
- Matches: `https://example.com/path?q=1`, `http://a.b`.
- Non-matches: `ftp://x`, `example.com`.
- Field type: `string`.
- Teaches: negated character class for "until whitespace or quote".

### 11. UUID v1–v5

- regex: `\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b`
- Matches: `550e8400-e29b-41d4-a716-446655440000`.
- Non-matches: `not-a-uuid`, all-zeros (fails the version and variant nibbles).
- Field type: `Guid`.
- Teaches: position-specific character classes to validate format.

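The position-specific classes can be contrasted with a plain UUID parser. In the Python sketch below, the regex is the one above, split across two string literals for readability, and `uuid.UUID` from the standard library stands in for a permissive parser; how .NET's `Guid` parsing treats these inputs is not covered here.

```python
import re
import uuid

# Pattern 11 from above, unchanged.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}"
    r"-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b"
)

sample = "550e8400-e29b-41d4-a716-446655440000"
nil = "00000000-0000-0000-0000-000000000000"

sample_ok = UUID_RE.fullmatch(sample) is not None
nil_ok = UUID_RE.fullmatch(nil) is not None

# uuid.UUID parses any 32 hex digits, so it accepts the nil UUID the regex rejects.
parsed_version = uuid.UUID(sample).version
nil_parses = uuid.UUID(nil).int == 0
```

The regex is therefore stricter than a generic parser: it keeps only identifiers whose version digit is 1 to 5 and whose variant digit is `8`, `9`, `a`, or `b`.
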
## How to verify a pattern

- Run `dotnet script scripts/test-pattern.csx -- --pdf <pdf> --pattern '<regex>'`. If `PatternTestResult.HasMatches` is `false`, the pattern does not match this PDF's haystack; go to `pattern-authoring.md`.
- If matches are partial, run `dotnet script scripts/test-groups.csx -- --pdf <pdf> --pattern '<regex>'` and read `PatternGroupTestResult.Groups[*].MatchesIndependently` to find the failing group.
- If there are too many matches, tighten the pattern with lookarounds (patterns 4–6 demonstrate the technique).
@@ -0,0 +1,36 @@
# Local-processing privacy guarantee

## Claim

When you call `Docuoria`, the PDF bytes you supply never leave the machine the engine runs on. The library reads, extracts, transforms, and renders the PDF entirely in-process. The only outbound network call any first-party component makes is template JSON read/write against an HTTP template store, and that call is opt-in.

## Evidence: extraction is in-process

Every PDF-consuming primitive on `IDocuoriaEngine` (`InspectAsync`, `TestPatternAsync`, `TestGroupsAsync`, `DryRunAsync`, `ExecuteTemplateAsync`, `EvaluateMatchAsync`, `EvaluateMatchRuleAsync`, `ClassifyAsync`) takes a `Stream`, never a URL or remote handle for the PDF. See `src/libs/Docuoria/Contracts/IDocuoriaEngine.cs`; each method's documentation carries the contract phrase "The PDF stream is opened and disposed within the call (D-13)." The implementation in `src/libs/Docuoria/Engine/DocuoriaEngine.cs` resolves the in-process `IPdfDocumentFactory` and walks the configured extraction/transformation/publish steps without leaving the process.

## Evidence: the one network surface

`src/libs/Docuoria/Storage/ApiTemplateStoreProvider.cs` implements `ITemplateStoreProvider` and uses `IHttpClientFactory` (named client `ApiTemplateStoreProvider.HttpClientName = "Docuoria.TemplateStore"`) to read and write *templates*. Templates are JSON describing how to extract; they do not contain PDF bytes. A host that wants entirely local processing simply does not register `ApiTemplateStoreProvider`; the engine functions identically against a local store.

`src/libs/Docuoria/Pipeline/Retrieval/Http/HttpRetrievalProvider.cs` is an inbound fetch path the *host* opts into for retrieval steps; it does not upload PDFs supplied by the caller, it only downloads PDFs the template explicitly references.

## What this does NOT promise

- If you wire a third-party logger, telemetry sink, or background storage handler around the engine, your hosting code may transmit data. The guarantee is about the library, not your host.
- If the host uses `HttpRetrievalProvider` to download a PDF before processing, the URL of that PDF is necessarily known to the network. The guarantee is about what happens *after* the engine has the bytes in memory.
- If you store templates via `ApiTemplateStoreProvider`, template content (which may contain regex patterns derived from the PDF's text) crosses the network. PDFs do not.

## Verifying for yourself

1. Search the library for outbound HTTP usage:
   `Get-ChildItem -Path src/libs/Docuoria -Recurse -Filter *.cs | Select-String -Pattern 'HttpClient|HttpRequestMessage'`.
   Confirm hits are confined to the template-store and retrieval surfaces:
   - `Storage/ApiTemplateStoreProvider.cs`
   - `Storage/Http/TemplateStoreCredentialHandler.cs`
   - `Registration/HttpRetrievalProviderBuilderExtensions.cs`
   - `Registration/TemplateStoreBuilderExtensions.cs`
   - `Pipeline/Retrieval/Http/HttpRetrievalProvider.cs`

   The extra hits beyond `ApiTemplateStoreProvider` and `HttpRetrievalProvider` are credential-handler and DI-registration support for those same two surfaces; they do not introduce new outbound paths.
2. Confirm `IDocuoriaEngine` only accepts `Stream` for PDF input: open `src/libs/Docuoria/Contracts/IDocuoriaEngine.cs` and read every method signature.
3. Run `dotnet test` with the network blocked (e.g. firewall rule) and observe that the extraction tests still pass.
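The search in step 1 can also be scripted so that unexpected hits stand out. The following is a minimal cross-platform sketch in Python, not part of the package. The function names are hypothetical, and the allow-list is copied from the file list in step 1 on the assumption that those paths are relative to the library root.

```python
import re
from pathlib import Path

HTTP_USAGE = re.compile(r"HttpClient|HttpRequestMessage")

# Files the text above lists as expected hits, relative to the library root.
EXPECTED = {
    "Storage/ApiTemplateStoreProvider.cs",
    "Storage/Http/TemplateStoreCredentialHandler.cs",
    "Registration/HttpRetrievalProviderBuilderExtensions.cs",
    "Registration/TemplateStoreBuilderExtensions.cs",
    "Pipeline/Retrieval/Http/HttpRetrievalProvider.cs",
}


def find_http_usage(root: Path) -> set[str]:
    """Return relative paths of .cs files under root that mention HTTP client types."""
    hits = set()
    for path in root.rglob("*.cs"):
        if HTTP_USAGE.search(path.read_text(encoding="utf-8", errors="replace")):
            hits.add(path.relative_to(root).as_posix())
    return hits


def unexpected_hits(root: Path) -> set[str]:
    """Hits outside the documented template-store and retrieval surfaces."""
    return find_http_usage(root) - EXPECTED
```

Calling `unexpected_hits(Path("src/libs/Docuoria"))` from a checkout should return an empty set if the claim holds; any returned path is a file worth reading by hand.
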
@@ -0,0 +1,361 @@
# Agent Scripts

This directory hosts the **agent-facing CLI surface** for `Docuoria`. Each script
is a [`dotnet-script`](https://github.com/dotnet-script/dotnet-script) `.csx` file that
binds a single SDK verb to a deterministic JSON contract. The scripts are designed for
non-interactive automation (LLM agents, CI jobs, shell pipelines) and uphold a strict
output contract:

- **Successful runs** emit a single line of UTF-8 JSON to **stdout**, exit code `0`.
- **Errors** emit a single `{"error":{"code","message","detail"}}` line to **stderr**,
  non-zero exit code.
- All payloads serialize via `DocuoriaJsonOptions.Default` (camelCase, discriminator
  `$type` for polymorphic results, `WhenWritingNull` ignore policy; see Classify for
  the explicit-null exception).

Every script begins with `#load "_common.csx"` to share host bootstrap, argument parsing
(`Cli.Require` / `Cli.Get` / `Cli.Has`), template-store registration, PDF stream
loading, and JSON writers.

> v1.4 invariants: confidence is binary (`1.0` / `0.0`), `ClassifyAsync` returns
> `null` when no template matches (CLS-02), and throws `InvalidOperationException`
> when no store is registered.

> **Distribution:** this directory is the **source** for the AI plugin's `scripts/`
> folder. `skills/build.ps1` copies these `.csx` files into `dist/docuoria/scripts/`
> and rewrites the SDK `#r` line in `_common.csx` to point at the bundled
> `assets/lib/Docuoria.dll`. In-repo development uses the relative
> `bin/Release/...dll` path; downstream consumers receive the bundled DLL.

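A caller can rely on the output contract above roughly as follows. This is a hedged Python sketch of a consumer, not code shipped with the package. `ScriptError` and `parse_script_result` are hypothetical names, and the sketch assumes stderr contains only the single error line (a real run may also carry tool diagnostics that would need filtering first).

```python
import json


class ScriptError(Exception):
    """Raised when a script reports the documented error envelope."""

    def __init__(self, code: str, message: str, detail=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.detail = detail


def parse_script_result(exit_code: int, stdout: str, stderr: str):
    """Interpret one script run according to the output contract above."""
    if exit_code == 0:
        # Success: a single line of JSON on stdout.
        return json.loads(stdout)
    # Failure: a single {"error": {...}} line on stderr.
    error = json.loads(stderr)["error"]
    raise ScriptError(error["code"], error["message"], error.get("detail"))
```
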
## Installation

```powershell
dotnet tool install -g dotnet-script
# build the SDK once so _common.csx can reference the local DLL
dotnet build src/libs/Docuoria/Docuoria.csproj -c Debug
```

Run any script with:

```powershell
dotnet script scripts/<name>.csx -- --pdf path\to\file.pdf
```

The `--` separator forwards subsequent tokens as script arguments (exposed as the
`Args` global, an `IList<string>`).

## Common Store Parameters

Scripts that access the template store (`classify`, `evaluate-match`,
`list-templates`, `load-template`, `save-template`) accept these shared flags
to configure the store backend. When neither `--store-path` nor `--store-url`
is provided, the local store defaults to `./templates`.

| Flag           | Default        | Description                                                              |
| -------------- | -------------- | ------------------------------------------------------------------------ |
| `--store-path` | `./templates`  | Local file-system template store directory.                              |
| `--store-url`  | _(none)_       | API template store URL (mutually exclusive with `--store-path`).         |
| `--store-key`  | _(none)_       | Function key for API store authentication (used with `--store-url`).     |

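The flag rules in the table can be read as a small decision procedure. The Python sketch below illustrates that reading only; it is not the logic in `_common.csx`. `StoreConfig` and `resolve_store` are hypothetical names, and raising an error when both flags are given is an assumption about how "mutually exclusive" is enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoreConfig:
    kind: str                   # "local" or "api"
    location: str               # directory path or base URL
    key: Optional[str] = None   # function key, API store only


def resolve_store(store_path=None, store_url=None, store_key=None) -> StoreConfig:
    """Apply the documented flag rules; an illustration, not the shipped logic."""
    if store_path and store_url:
        raise ValueError("--store-path and --store-url are mutually exclusive")
    if store_url:
        return StoreConfig("api", store_url, store_key)
    # Neither flag given: fall back to the documented default directory.
    return StoreConfig("local", store_path or "./templates")
```
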
## Error JSON Shape

```json
{
  "error": {
    "code": "kebab-case-identifier",
    "message": "Human-readable summary.",
    "detail": "Optional stack trace or full exception text."
  }
}
```

Common `code` values: `pdf-not-found`, `parse-error`, `already-exists`, `no-store`,
`unhandled`, `bad-format`.

---

## inspect.csx

**Synopsis.** Report low-level PDF structure (page count, text blocks, candidate
patterns) for a PDF; this is the primary discovery step before authoring a template.

| Arg       | Required | Description                                  |
| --------- | -------- | -------------------------------------------- |
| `--pdf`   | yes      | Path to the source PDF.                      |
| `--page`  | no       | 1-based page index (default: all pages).     |

**Output schema.** `PdfInspection` payload (`pageCount`, `pages[].blocks[]`, …).

**Exit codes.** `0` success · `1` unhandled · non-zero with error code `pdf-not-found` on missing input.

**Example.**

```powershell
dotnet script scripts/inspect.csx -- --pdf invoice.pdf --page 1
```

---

## test-pattern.csx

**Synopsis.** Evaluate a single extraction pattern against a PDF and report whether
it matched, with the captured value(s).

| Arg                  | Required | Description                                                |
| -------------------- | -------- | ---------------------------------------------------------- |
| `--pattern`          | yes      | Inline pattern source (regex or DSL block).                |
| `--pdf`              | yes      | Path to the source PDF.                                    |
| `--page`             | no       | 1-based page index (default: all pages).                   |
| `--block-separator`  | no       | Override the block-separator regex used during extraction. |

**Output schema.** `{ hasMatches: bool, matches: [...] }`.

**Exit codes.** `0` success · `1` unhandled / parse-error · non-zero with error code `pdf-not-found` on missing input.

**Example.**

```powershell
dotnet script scripts/test-pattern.csx -- --pattern 'Invoice #(\d+)' --pdf invoice.pdf
```

---

## test-groups.csx

**Synopsis.** Evaluate a multi-group pattern and emit each named capture group's
match set; used when authoring repeating-row extractions.

| Arg         | Required | Description                                  |
| ----------- | -------- | -------------------------------------------- |
| `--pattern` | yes      | Multi-group pattern source.                  |
| `--pdf`     | yes      | Path to the source PDF.                      |
| `--page`    | no       | 1-based page index (default: all pages).     |

**Output schema.** `{ groups: { <name>: [matches...] } }`.

**Exit codes.** `0` success · `1` on extraction failure.

**Example.**

```powershell
dotnet script scripts/test-groups.csx -- --pattern @rows.txt --pdf invoice.pdf
```

---

## validate-template.csx

**Synopsis.** Parse a `Template` JSON file and report schema / semantic validation
results without executing it.

| Arg          | Required | Description                          |
| ------------ | -------- | ------------------------------------ |
| `--template` | yes      | Path to the template JSON file.      |

**Output schema.** `{ valid: bool, errors: [string] }`.

**Exit codes.** `0` whenever the template parses (even on `valid:false`) · `1` for `parse-error`.

**Example.**

```powershell
dotnet script scripts/validate-template.csx -- --template templates/invoice.json
```

---

## dry-run.csx

**Synopsis.** Execute extraction + publish steps against a PDF **without** producing a
serialized output payload, which is useful for end-to-end pipeline validation. Optionally
preview formatted output with `--preview-as`.

| Arg            | Required | Description                                                  |
| -------------- | -------- | ------------------------------------------------------------ |
| `--pdf`        | yes      | Path to the source PDF.                                      |
| `--template`   | yes      | Path to the template JSON file.                              |
| `--preview-as` | no       | Preview formatted output: `csv` or `json` (no file written). |

**Output schema.** `{ kind: "SucceededResult"|"FailedResult"|"RejectedResult", result }`.
With `--preview-as`: `{ kind, format, preview }`.

**Exit codes.** `0` success · `1` unhandled.

**Example.**

```powershell
dotnet script scripts/dry-run.csx -- --pdf invoice.pdf --template templates/invoice.json
```

---

## execute.csx

**Synopsis.** Full pipeline run with output generation. Writes a CSV or JSON payload
either to stdout or to `--output`.

| Arg          | Required | Description                                                    |
| ------------ | -------- | -------------------------------------------------------------- |
| `--pdf`      | yes      | Path to the source PDF.                                        |
| `--template` | yes      | Path to the template JSON file.                                |
| `--format`   | yes      | `csv` or `json`.                                               |
| `--output`   | no       | Write binary payload to this path. If omitted, stdout text.    |

**Output schema.** Success: `{ status: "ok", format, output? }` (output is base64 / text).
Failure: `{ status: "rejected"|"failed", result }`.

**Exit codes.** `0` success · `1` rejected/failed · `2` `bad-format`.

**Example.**

```powershell
dotnet script scripts/execute.csx -- --pdf invoice.pdf --template templates/invoice.json --format json --output out.json
```

---

## evaluate-match.csx

**Synopsis.** Compute the aggregated match confidence between a PDF and a single
template. Confidence is `ruleConfidence × extractionProbeScore` (0.0 when either
fails, 1.0 when both are perfect). Template argument may be a file path **or** a
template identifier resolved through the configured store.

| Arg            | Required | Description                                                                        |
| -------------- | -------- | ---------------------------------------------------------------------------------- |
| `--pdf`        | yes      | Path to the source PDF.                                                            |
| `--template`   | yes      | File path (`.json` / contains path separator) **or** template ID for store lookup. |
| `--store-path` | no       | Local template store directory (default: `./templates`).                           |
| `--store-url`  | no       | API template store URL.                                                            |
| `--store-key`  | no       | Function key for API store authentication.                                         |

**Output schema.** `{ confidence, matchedRules }`.

**Exit codes.** `0` success · `1` template not found / unhandled.

**Example.**

```powershell
dotnet script scripts/evaluate-match.csx -- --pdf invoice.pdf --template invoice
```

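Two rules in this section lend themselves to a short illustration: how `--template` is told apart from a store identifier, and how the confidence product behaves. The Python sketch below is a reading of the table and synopsis, not the script's actual code; both function names are hypothetical, and the case-sensitive `.json` check is an assumption.

```python
def looks_like_file_path(template_arg: str) -> bool:
    """Heuristic from the table above: `.json` suffix or a path separator."""
    return (
        template_arg.endswith(".json")
        or "/" in template_arg
        or "\\" in template_arg
    )


def match_confidence(rule_confidence: float, extraction_probe_score: float) -> float:
    """Product form given in the synopsis; 0.0 as soon as either factor is 0.0."""
    return rule_confidence * extraction_probe_score


# "invoice" would be resolved through the store, the other two read from disk.
examples = {arg: looks_like_file_path(arg) for arg in ("invoice", "invoice.json", "templates/invoice")}
```
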
---

## classify.csx

**Synopsis.** Run ranked classification across **all** registered templates and return
the top matches sorted by confidence (descending).

| Arg            | Required | Description                                              |
| -------------- | -------- | -------------------------------------------------------- |
| `--pdf`        | yes      | Path to the source PDF.                                  |
| `--top`        | no       | Maximum number of results to return (default: 5).        |
| `--store-path` | no       | Local template store directory (default: `./templates`). |
| `--store-url`  | no       | API template store URL.                                  |
| `--store-key`  | no       | Function key for API store authentication.               |

**Output schema.** `{ matches: [{ templateId, confidence }, ...] }`. Only functional
matches are included (root rule passes AND extraction probe > 0).

**Exit codes.** `0` success (including `match:null`) · `1` `no-store` if no template
store is registered · `1` unhandled.

**Example.**

```powershell
dotnet script scripts/classify.csx -- --pdf invoice.pdf
```

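The ranking described in the synopsis and output schema can be sketched as a filter, a sort, and a cut-off. The Python below is an illustration under stated assumptions, not the engine's implementation: the input is a hypothetical mapping of template ID to confidence, and breaking ties by template ID is an assumption added to keep the output deterministic.

```python
def rank_matches(scores: dict[str, float], top: int = 5) -> list[dict]:
    """Keep functional matches (confidence > 0), sort descending, cap at `top`."""
    functional = [(tid, conf) for tid, conf in scores.items() if conf > 0.0]
    # Highest confidence first; ties broken by template ID (an assumption).
    functional.sort(key=lambda item: (-item[1], item[0]))
    return [{"templateId": tid, "confidence": conf} for tid, conf in functional[:top]]


def to_payload(scores: dict[str, float], top: int = 5) -> dict:
    """Wrap the ranked list in the documented `{ matches: [...] }` shape."""
    return {"matches": rank_matches(scores, top)}


payload = to_payload({"invoice": 1.0, "receipt": 0.0, "bill": 1.0}, top=1)
```
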
---

## list-templates.csx

**Synopsis.** Enumerate template identifiers from the configured store.

| Arg            | Required | Description                                              |
| -------------- | -------- | -------------------------------------------------------- |
| `--store-path` | no       | Local template store directory (default: `./templates`). |
| `--store-url`  | no       | API template store URL.                                  |
| `--store-key`  | no       | Function key for API store authentication.               |

**Output schema.** `{ templates: [id, ...] }`.

**Exit codes.** `0` success · `1` unhandled.

**Example.**

```powershell
dotnet script scripts/list-templates.csx -- --store-path ./templates
```

---

## load-template.csx

**Synopsis.** Resolve a template by identifier and emit its JSON representation.

| Arg            | Required | Description                                              |
| -------------- | -------- | -------------------------------------------------------- |
| `--id`         | yes      | Template identifier.                                     |
| `--output`     | no       | Write JSON to this file path instead of stdout.          |
| `--store-path` | no       | Local template store directory (default: `./templates`). |
| `--store-url`  | no       | API template store URL.                                  |
| `--store-key`  | no       | Function key for API store authentication.               |

**Output schema.** Without `--output`: full template JSON. With `--output`:
`{ status: "ok", path }`.

**Exit codes.** `0` success · `1` not-found / unhandled.

**Example.**

```powershell
dotnet script scripts/load-template.csx -- --id invoice --output templates/invoice.json
```

---

## save-template.csx

**Synopsis.** Persist a template JSON file to the configured store. Fails with
`already-exists` unless `--overwrite` is supplied.

| Arg            | Required | Description                                                               |
| -------------- | -------- | ------------------------------------------------------------------------- |
| `--template`   | yes      | Path to the template JSON file to persist.                                |
| `--overwrite`  | no       | Boolean switch: overwrite an existing template with the same identifier.  |
| `--store-path` | no       | Local template store directory (default: `./templates`).                  |
| `--store-url`  | no       | API template store URL.                                                   |
| `--store-key`  | no       | Function key for API store authentication.                                |

**Output schema.** `{ status: "ok", identifier }`.

**Exit codes.** `0` success · `1` `already-exists` / parse-error / unhandled.

**Example.**

```powershell
dotnet script scripts/save-template.csx -- --template templates/invoice.json --overwrite
```

---

## Internals: `_common.csx`

`_common.csx` is the **single bootstrap** used by every script. It:

1. References the locally-built `Docuoria.dll` and required NuGet packages
   (`PdfPig`, `Tabula`, `CsvHelper`, `pythonnet`, `Microsoft.Extensions.Hosting` /
   `DependencyInjection` / `Http`).
2. Exposes `Cli.Require / Cli.Get / Cli.Has` for argument parsing (renamed from
   `Args` to avoid shadowing the `dotnet-script` global of the same name).
3. Builds a Generic Host via `ScriptHost.CreateHost(args, includeStore: bool)` which
   wires `AddDocuoriaEngine`, `AddBuiltInMatchRules`, the CSV/JSON output
   generators, and (optionally) the template store selected by the `--store-path` /
   `--store-url` / `--store-key` flags.
4. Provides `JsonOut.Write` / `JsonOut.Error` writers backed by
   `DocuoriaJsonOptions.Default` and a `LoadPdf(path)` helper that exits with
   `pdf-not-found` when the input is missing.

Scripts must declare `#nullable enable` after `#load "_common.csx"` because the
nullable context does not propagate across `#load` boundaries.