@docrouter/mcp 0.1.0
- package/README.md +207 -0
- package/dist/docs/knowledge_base/forms.md +724 -0
- package/dist/docs/knowledge_base/prompts.md +472 -0
- package/dist/docs/knowledge_base/schemas.md +852 -0
- package/dist/index.d.mts +1 -0
- package/dist/index.d.ts +1 -0
- package/dist/index.js +1812 -0
- package/dist/index.js.map +1 -0
- package/dist/index.mjs +1809 -0
- package/dist/index.mjs.map +1 -0
- package/package.json +66 -0
@@ -0,0 +1,852 @@

# DocRouter Schema Definition Manual

## Overview

DocRouter uses **OpenAI's Structured Outputs JSON Schema format** to define extraction schemas for document processing. Schemas ensure that AI-extracted data from documents follows a consistent, validated structure.

## Table of Contents

1. [Schema Format Specification](#schema-format-specification)
2. [Basic Schema Structure](#basic-schema-structure)
3. [Field Types](#field-types)
4. [Required Fields and Strict Mode](#required-fields-and-strict-mode)
5. [Advanced Schema Features](#advanced-schema-features)
6. [Best Practices](#best-practices)
7. [Examples](#examples)
8. [API Integration](#api-integration)
9. [Schema Workflow](#schema-workflow)
10. [Troubleshooting](#troubleshooting)
11. [Version Control](#version-control)
12. [References](#references)

---

## Schema Format Specification

### Root Structure

All DocRouter schemas follow this format:

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": { ... },
      "required": [ ... ],
      "additionalProperties": false
    },
    "strict": true
  }
}
```

### Components

| Component | Type | Required | Description |
|-----------|------|----------|-------------|
| `type` | string | Yes | Must be `"json_schema"` |
| `json_schema` | object | Yes | Container for schema definition |
| `json_schema.name` | string | Yes | Identifier for the schema (typically `"document_extraction"`) |
| `json_schema.schema` | object | Yes | JSON Schema specification following JSON Schema Draft 7 |
| `json_schema.strict` | boolean | Yes | **Must be `true`** - Ensures 100% schema adherence |

### Strict Mode Constraints

When `strict: true` is enabled (mandatory for DocRouter), the following rules apply:

1. **All properties MUST be in the `required` array** - No optional fields allowed
2. **`additionalProperties: false` MUST be set** - At every level, including nested objects
3. **Perfect schema adherence** - The LLM output will always match the schema exactly
4. **Default values for missing data** - Empty strings, zeros, false, or empty arrays/objects
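
The first two rules above can be checked mechanically before a schema is submitted. The following is a minimal sketch, not part of DocRouter or any SDK; the function name `strict_mode_problems` is hypothetical, and it assumes the inner `json_schema.schema` object is passed in as a plain Python dict.

```python
from typing import Any


def strict_mode_problems(schema: dict[str, Any], path: str = "schema") -> list[str]:
    """Return a list of strict-mode violations found in a JSON Schema object."""
    problems: list[str] = []
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        # Rule 1: every declared property must appear in "required".
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in 'required': {missing}")
        # Rule 2: additionalProperties must be exactly false.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: 'additionalProperties' must be false")
        # Recurse so that nested objects are held to the same rules.
        for name, sub in properties.items():
            problems.extend(strict_mode_problems(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_mode_problems(schema["items"], f"{path}[]"))
    return problems
```

An empty list means no violations were found at any nesting level; each entry otherwise names the path of the offending object.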

---

## Basic Schema Structure

### Minimal Schema Example

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "field_name": {
          "type": "string",
          "description": "Human-readable description of this field"
        }
      },
      "required": ["field_name"],
      "additionalProperties": false
    },
    "strict": true
  }
}
```

### Schema Object Properties

| Property | Type | Required | Description |
|----------|------|----------|-------------|
| `type` | string | Yes | Must be `"object"` for root schema |
| `properties` | object | Yes | Defines all extractable fields |
| `required` | array | Yes | **Must list ALL properties** when `strict: true` |
| `additionalProperties` | boolean | Yes | **Must be `false`** when `strict: true` |
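
When schemas are generated in code, the envelope can be built from the field definitions alone. This is an illustrative sketch only; `wrap_schema` is a hypothetical helper, not a DocRouter SDK function, and it fills in `required` for the top level only (nested objects still need their own `required` and `additionalProperties`).

```python
import json
from typing import Any


def wrap_schema(properties: dict[str, Any], name: str = "document_extraction") -> dict[str, Any]:
    """Wrap top-level field definitions in the json_schema envelope."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "schema": {
                "type": "object",
                "properties": properties,
                # Strict mode: every top-level property is listed as required.
                "required": list(properties),
                "additionalProperties": False,
            },
            "strict": True,
        },
    }


envelope = wrap_schema(
    {"field_name": {"type": "string", "description": "Human-readable description of this field"}}
)
print(json.dumps(envelope, indent=2))
```

Deriving `required` from the keys of `properties` keeps the two from drifting apart as fields are added.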

---

## Field Types

DocRouter schemas support standard JSON Schema data types:

### String Fields

```json
{
  "field_name": {
    "type": "string",
    "description": "A text field"
  }
}
```

**Use for:** Names, emails, addresses, free-text descriptions, comma-separated lists

### Number Fields

```json
{
  "amount": {
    "type": "number",
    "description": "A numeric value"
  }
}
```

**Use for:** Quantities, amounts, percentages, measurements

### Integer Fields

```json
{
  "count": {
    "type": "integer",
    "description": "A whole number"
  }
}
```

**Use for:** Counts, years, age, quantity of items

### Boolean Fields

```json
{
  "is_verified": {
    "type": "boolean",
    "description": "True/false indicator"
  }
}
```

**Use for:** Yes/no questions, checkboxes, status flags

### Array Fields

```json
{
  "skills": {
    "type": "array",
    "description": "List of programming skills",
    "items": {
      "type": "string"
    }
  }
}
```

**Use for:** Lists, multiple values, repeated items

### Object Fields (Nested)

```json
{
  "address": {
    "type": "object",
    "description": "Address information",
    "properties": {
      "street": {
        "type": "string",
        "description": "Street address"
      },
      "city": {
        "type": "string",
        "description": "City name"
      },
      "postal_code": {
        "type": "string",
        "description": "Postal code"
      }
    },
    "required": ["street", "city", "postal_code"],
    "additionalProperties": false
  }
}
```

**Use for:** Grouped related fields, structured sub-data

---

## Required Fields and Strict Mode

### Strict Mode Requirements

**IMPORTANT:** When using OpenAI's Structured Outputs with `strict: true` (which DocRouter uses), **ALL properties MUST be listed in the `required` array**. This is a mandatory requirement from OpenAI's API.

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string", "description": "Full name" },
    "email": { "type": "string", "description": "Email address" },
    "middle_name": { "type": "string", "description": "Middle name" }
  },
  "required": ["name", "email", "middle_name"],
  "additionalProperties": false
}
```

### How the LLM Handles Missing Data

Since all fields must be required in strict mode, the LLM handles missing data as follows:

- **String fields**: Returns empty string `""` if data not found in document
- **Number/Integer fields**: Returns `0` if data not found
- **Boolean fields**: Returns `false` if data not found
- **Array fields**: Returns empty array `[]` if data not found
- **Object fields**: Returns object with all nested required fields populated with default values

**Best Practice:** Design your schema knowing that all fields will always be present in the response, but may contain empty/default values when data is not found in the document.
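
Application code can flag these placeholder values after extraction. The sketch below is an assumption about how one might do that, with hypothetical names `is_empty` and `empty_fields`; note that a genuine `0` or `false` in the document cannot be told apart from a placeholder, so numeric and boolean fields may need separate handling.

```python
from typing import Any


def is_empty(value: Any) -> bool:
    """True if a value looks like a strict-mode placeholder for missing data."""
    if isinstance(value, dict):
        # An object counts as empty only if every nested value is empty.
        return all(is_empty(v) for v in value.values())
    # True == 1 in Python, so True is never matched by the 0 in this tuple.
    return value is False or value in ("", 0, [])


def empty_fields(result: dict[str, Any]) -> list[str]:
    """Names of top-level fields whose values look like placeholders."""
    return [name for name, value in result.items() if is_empty(value)]


result = {
    "name": "Ada Lovelace",
    "email": "",
    "middle_name": "",
    "tags": [],
    "address": {"street": "", "city": ""},
}
print(empty_fields(result))  # ['email', 'middle_name', 'tags', 'address']
```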

---

## Advanced Schema Features

**⚠️ PORTABILITY WARNING:** The features below are supported by OpenAI's Structured Outputs, but **not recommended** for DocRouter schemas. These constraints may not be portable across different LLM providers (Anthropic Claude, Google Gemini, etc.). For maximum compatibility and reliability:

- **Use basic types only**: string, number, integer, boolean, array, object
- **Avoid** enums, patterns, minimum/maximum, minItems/maxItems, uniqueItems
- **Handle validation in your application code** instead of in the schema
- **Use detailed descriptions** to guide the LLM rather than strict constraints

### Enums (Restricted Values) - ⚠️ NOT RECOMMENDED

While OpenAI supports limiting field values to specific options, this feature may not work with other LLM providers:

```json
{
  "document_type": {
    "type": "string",
    "description": "Type of document",
    "enum": ["invoice", "receipt", "contract", "bill"]
  }
}
```

**Better approach:**
```json
{
  "document_type": {
    "type": "string",
    "description": "Type of document (e.g., invoice, receipt, contract, or bill)"
  }
}
```
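
With the enum moved into the description, the allowed-value check happens in application code. A minimal sketch, assuming the four values from the example; `normalize_document_type` is a hypothetical helper name.

```python
from typing import Optional

ALLOWED_DOCUMENT_TYPES = {"invoice", "receipt", "contract", "bill"}


def normalize_document_type(raw: str) -> Optional[str]:
    """Return the canonical document type, or None if the value is not allowed."""
    value = raw.strip().lower()
    return value if value in ALLOWED_DOCUMENT_TYPES else None
```

Returning `None` for anything outside the set (including the empty-string default) lets the caller decide whether to reject the result or send it for review.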

### String Patterns - ⚠️ NOT RECOMMENDED

Regex `pattern` constraints are not enforced consistently across LLM providers and reduce portability:

```json
{
  "phone": {
    "type": "string",
    "description": "Phone number in E.164 format",
    "pattern": "^\\+[1-9]\\d{1,14}$"
  }
}
```

**Better approach:**
```json
{
  "phone": {
    "type": "string",
    "description": "Phone number in E.164 format (e.g., +1234567890)"
  }
}
```
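
The same regular expression can then be applied in application code instead. This sketch reuses the pattern from the example above; the name `is_e164` is hypothetical.

```python
import re

# Same pattern as the schema example: "+", a non-zero digit, then 1-14 more digits.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")


def is_e164(phone: str) -> bool:
    """True if the string is a plausible E.164 phone number."""
    return E164_PATTERN.fullmatch(phone) is not None
```

Because missing values come back as empty strings, `is_e164("")` is `False`, so a caller may want to treat empty input as "not provided" before validating.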

### Number Constraints - ⚠️ NOT RECOMMENDED

Minimum, maximum, and multipleOf constraints may not be portable:

```json
{
  "age": {
    "type": "integer",
    "description": "Age in years",
    "minimum": 0,
    "maximum": 150
  },
  "price": {
    "type": "number",
    "description": "Price in USD",
    "minimum": 0,
    "multipleOf": 0.01
  }
}
```

**Better approach:**
```json
{
  "age": {
    "type": "integer",
    "description": "Age in years (0-150)"
  },
  "price": {
    "type": "number",
    "description": "Price in USD (e.g., 19.99)"
  }
}
```
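
The range and precision rules dropped from the schema can be enforced after extraction. A minimal sketch with hypothetical helper names; it assumes ordinary finite JSON numbers and uses `Decimal` to avoid floating-point surprises when checking for two decimal places.

```python
from decimal import Decimal


def valid_age(age: int) -> bool:
    """Mirror of minimum 0 / maximum 150."""
    return 0 <= age <= 150


def valid_price(price: float) -> bool:
    """Mirror of minimum 0 / multipleOf 0.01: non-negative, at most two decimals."""
    amount = Decimal(str(price))
    return amount >= 0 and amount == amount.quantize(Decimal("0.01"))
```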

### Array Constraints - ⚠️ NOT RECOMMENDED

Array size and uniqueness constraints are not universally supported:

```json
{
  "tags": {
    "type": "array",
    "description": "Document tags",
    "items": {
      "type": "string"
    },
    "minItems": 1,
    "maxItems": 10,
    "uniqueItems": true
  }
}
```

**Better approach:**
```json
{
  "tags": {
    "type": "array",
    "description": "Document tags (1-10 unique tags)",
    "items": {
      "type": "string"
    }
  }
}
```
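
Uniqueness and size limits can likewise be applied to the extracted list. This is a sketch under the assumption that truncating to the first ten unique tags is acceptable; `clean_tags` is a hypothetical helper, and the "at least one tag" rule is left to the caller (check whether the result is empty).

```python
def clean_tags(tags: list[str], max_items: int = 10) -> list[str]:
    """Trim whitespace, drop blanks and duplicates, then cap the length."""
    # dict.fromkeys preserves insertion order, so the first occurrence wins.
    unique = list(dict.fromkeys(t.strip() for t in tags if t.strip()))
    return unique[:max_items]
```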

### Complex Nested Objects

```json
{
  "work_history": {
    "type": "array",
    "description": "Employment history",
    "items": {
      "type": "object",
      "properties": {
        "company": {
          "type": "string",
          "description": "Company name"
        },
        "position": {
          "type": "string",
          "description": "Job title"
        },
        "start_date": {
          "type": "string",
          "description": "Start date (YYYY-MM-DD)"
        },
        "end_date": {
          "type": "string",
          "description": "End date (YYYY-MM-DD or 'Present')"
        },
        "responsibilities": {
          "type": "array",
          "items": {
            "type": "string"
          },
          "description": "List of job responsibilities"
        }
      },
      "required": ["company", "position", "start_date", "end_date", "responsibilities"],
      "additionalProperties": false
    }
  }
}
```

---

## Best Practices

### 1. Use Clear, Descriptive Field Names

**Good:**
```json
"current_academic_program": { "type": "string", "description": "Current degree program" }
```

**Avoid:**
```json
"prog": { "type": "string", "description": "program" }
```

### 2. Provide Detailed Descriptions

Descriptions guide the LLM on what to extract:

**Good:**
```json
{
  "total_amount": {
    "type": "string",
    "description": "Total invoice amount including tax, with currency symbol and commas (e.g., $1,234.56)"
  }
}
```

**Avoid:**
```json
{
  "total_amount": {
    "type": "string",
    "description": "total"
  }
}
```

### 3. Choose Appropriate Field Types

- Use **string** for currency values with formatting (e.g., "$1,234.56")
- Use **number** for numeric calculations
- Use **array** for multiple items instead of comma-separated strings
- Use **object** to group related fields
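
When a currency value is kept as a formatted string, application code can still convert it for arithmetic. A minimal sketch, assuming US-style formatting (comma thousands separator, dot decimal) and a leading currency symbol; `parse_amount` is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert a formatted amount such as "$1,234.56" to a Decimal, or None."""
    cleaned = text.strip().replace(",", "").lstrip("$€£")
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        # Covers the empty-string default and free text such as "n/a".
        return None
    # Decimal accepts "NaN" and "Infinity"; treat those as unparseable too.
    return amount if amount.is_finite() else None
```

`Decimal` is used rather than `float` so that values like 1234.56 are represented exactly.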

### 4. Set `additionalProperties: false`

Prevent the LLM from adding unexpected fields:

```json
{
  "type": "object",
  "properties": { ... },
  "additionalProperties": false
}
```

### 5. All Fields Must Be Required (Strict Mode)

- **ALL fields must be listed in the `required` array** when using `strict: true`
- The LLM will return empty/default values for fields not found in the document
- Design your schema to handle empty values gracefully in your application logic
- There are no optional fields in strict mode - this is an OpenAI API requirement

### 6. Avoid Advanced Constraints for Portability

For maximum portability across LLM providers (OpenAI, Anthropic, Gemini, etc.):

- **Use basic types only** and avoid enums, patterns, min/max constraints
- **Put constraints in descriptions** instead: `"Status (paid, unpaid, overdue, or cancelled)"`
- **Validate data in your application** rather than in the schema
- This ensures your schemas work consistently across all supported LLM providers

**Not recommended:**
```json
{
  "invoice_status": {
    "type": "string",
    "enum": ["paid", "unpaid", "overdue", "cancelled"]
  }
}
```

**Recommended:**
```json
{
  "invoice_status": {
    "type": "string",
    "description": "Invoice status (paid, unpaid, overdue, or cancelled)"
  }
}
```

### 7. Document Your Schema

Include clear descriptions that explain:
- What data to extract
- Expected format
- How to handle edge cases

---

## Examples

### Example 1: Invoice Schema

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "Unique invoice identifier"
        },
        "invoice_date": {
          "type": "string",
          "description": "Date of invoice in YYYY-MM-DD format"
        },
        "vendor_name": {
          "type": "string",
          "description": "Name of the vendor/supplier"
        },
        "vendor_address": {
          "type": "string",
          "description": "Complete vendor address"
        },
        "customer_name": {
          "type": "string",
          "description": "Name of the customer/buyer"
        },
        "line_items": {
          "type": "array",
          "description": "List of items on the invoice",
          "items": {
            "type": "object",
            "properties": {
              "description": {
                "type": "string",
                "description": "Item description"
              },
              "quantity": {
                "type": "string",
                "description": "Quantity ordered"
              },
              "unit_price": {
                "type": "string",
                "description": "Price per unit with currency"
              },
              "total": {
                "type": "string",
                "description": "Line total with currency"
              }
            },
            "required": ["description", "quantity", "unit_price", "total"],
            "additionalProperties": false
          }
        },
        "subtotal": {
          "type": "string",
          "description": "Subtotal before tax with currency"
        },
        "tax_amount": {
          "type": "string",
          "description": "Tax amount with currency"
        },
        "total_amount": {
          "type": "string",
          "description": "Total amount due with currency"
        },
        "payment_terms": {
          "type": "string",
          "description": "Payment terms (e.g., Net 30, Due on Receipt)"
        }
      },
      "required": [
        "invoice_number",
        "invoice_date",
        "vendor_name",
        "vendor_address",
        "customer_name",
        "line_items",
        "subtotal",
        "tax_amount",
        "total_amount",
        "payment_terms"
      ],
      "additionalProperties": false
    },
    "strict": true
  }
}
```

### Example 2: Resume/CV Schema

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "Name": {
          "type": "string",
          "description": "Candidate's full name"
        },
        "Email": {
          "type": "string",
          "description": "Email address"
        },
        "Telephone": {
          "type": "string",
          "description": "Phone number"
        },
        "Current Academic Program": {
          "type": "string",
          "description": "Current degree program (e.g., MEng Computing)"
        },
        "Current Grade": {
          "type": "string",
          "description": "Academic year or GPA/grade information"
        },
        "High School Qualification": {
          "type": "string",
          "description": "A-levels, GCSEs, or equivalent qualifications"
        },
        "Programming Languages": {
          "type": "string",
          "description": "Comma-separated list of programming languages"
        },
        "Experiences": {
          "type": "string",
          "description": "Professional or research experiences"
        },
        "Projects": {
          "type": "string",
          "description": "Academic or personal projects with descriptions"
        },
        "Awards": {
          "type": "string",
          "description": "Academic awards, honors, competition placements"
        },
        "Work Experience": {
          "type": "string",
          "description": "Employment history with companies and roles"
        },
        "Extracurricular": {
          "type": "string",
          "description": "Clubs, hobbies, volunteer work, sports"
        },
        "Languages": {
          "type": "string",
          "description": "Spoken languages and proficiency levels"
        }
      },
      "required": [
        "Name",
        "Email",
        "Telephone",
        "Current Academic Program",
        "Current Grade",
        "High School Qualification",
        "Programming Languages",
        "Experiences",
        "Projects",
        "Awards",
        "Work Experience",
        "Extracurricular",
        "Languages"
      ],
      "additionalProperties": false
    },
    "strict": true
  }
}
```

### Example 3: Financial Statement Schema

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "document_extraction",
    "schema": {
      "type": "object",
      "properties": {
        "net_interest_income": {
          "type": "string",
          "description": "Net interest income in thousands with formatting"
        },
        "net_fee_and_commission_income": {
          "type": "string",
          "description": "Net fee and commission income"
        },
        "other_operating_income": {
          "type": "string",
          "description": "Other operating income"
        },
        "credit_loss_expense": {
          "type": "string",
          "description": "Credit loss expense (negative values in parentheses)"
        },
        "net_operating_income": {
          "type": "string",
          "description": "Net operating income"
        },
        "personnel_expenses": {
          "type": "string",
          "description": "Personnel expenses"
        },
        "other_operating_expenses": {
          "type": "string",
          "description": "Other operating expenses"
        },
        "total_expenses": {
          "type": "string",
          "description": "Total expenses"
        },
        "profit_loss_before_tax": {
          "type": "string",
          "description": "Profit/loss before tax"
        },
        "tax_expense_credit": {
          "type": "string",
          "description": "Tax expense or credit"
        },
        "profit_loss_for_the_year": {
          "type": "string",
          "description": "Final profit/loss for the year"
        }
      },
      "required": [
        "net_interest_income",
        "net_fee_and_commission_income",
        "other_operating_income",
        "credit_loss_expense",
        "net_operating_income",
        "personnel_expenses",
        "other_operating_expenses",
        "total_expenses",
        "profit_loss_before_tax",
        "tax_expense_credit",
        "profit_loss_for_the_year"
      ],
      "additionalProperties": false
    },
    "strict": true
  }
}
```
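
Because this schema keeps statement values as formatted strings, including negatives written in parentheses, downstream code has to interpret that notation. The following sketch assumes comma thousands separators and the accounting convention that `(567)` means -567; `parse_statement_value` is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_statement_value(text: str) -> Optional[Decimal]:
    """Parse values such as "1,234" or "(567)"; parentheses mean negative."""
    cleaned = text.strip().replace(",", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        # Empty-string defaults and non-numeric text end up here.
        return None
    if not value.is_finite():
        return None
    return -value if negative else value
```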

---

## API Integration

DocRouter provides multiple ways to interact with schemas programmatically:

- **TypeScript/JavaScript SDK** - Type-safe client library for Node.js and browsers (see `packages/typescript/docrouter-sdk/`)
- **Python SDK** - Type-safe Python client library (see `packages/docrouter_sdk/`)
- **REST API** - Direct HTTP requests (see API documentation for endpoints)
- **MCP (Model Context Protocol)** - Integration with AI assistants like Claude Code

All methods support the same schema operations: create, list, retrieve, update, delete, and validate against schemas.

---

## Schema Workflow

### 1. Design Phase
- Identify document type and key fields to extract
- Choose appropriate data types for each field
- Design nested structures for complex data
- Remember: ALL fields will be required in strict mode

### 2. Creation Phase
- Create schema using API or UI
- Test with sample documents
- Iterate based on extraction results

### 3. Prompt Integration
- Link schema to extraction prompt
- Configure LLM model (e.g., gpt-4o-mini, gemini-2.0-flash)
- Associate with document tags for automatic processing

### 4. Processing Phase
- Upload documents with appropriate tags
- LLM extracts data according to schema
- Results available via `getLLMResult` API

### 5. Validation Phase
- Review extracted data
- Verify against schema requirements
- Mark as verified when accurate

---

## Troubleshooting

### Common Issues

**Issue:** LLM returns empty strings for all fields
- **Solution:** Check prompt content, ensure it references the schema, verify document has OCR text

**Issue:** Extra fields appear in extraction
- **Solution:** Ensure `additionalProperties: false` is set in schema

**Issue:** Error "all properties must be required when strict is true"
- **Solution:** Ensure ALL properties are listed in the `required` array at every level (including nested objects)

**Issue:** Error "additionalProperties must be false when strict is true"
- **Solution:** Set `additionalProperties: false` on all objects in the schema, including nested objects

**Issue:** Required fields missing from extraction
- **Solution:** This should not happen with `strict: true`. Verify schema matches exactly, check field names

**Issue:** Number fields returned as strings
- **Solution:** For formatted numbers (with commas, currency), use string type. For calculations, use number type.

**Issue:** Array fields contain single concatenated string
- **Solution:** Update prompt to explicitly instruct LLM to return array of items
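
For the last issue, fixing the prompt is the primary remedy, but a defensive fallback in application code can catch results that still arrive concatenated. This is a sketch only; `split_concatenated` is a hypothetical helper, and it assumes commas never appear inside a legitimate single item.

```python
def split_concatenated(items: list[str]) -> list[str]:
    """If a list holds one comma-joined string, split it into separate items."""
    if len(items) == 1 and "," in items[0]:
        return [part.strip() for part in items[0].split(",") if part.strip()]
    # Lists that already have several items (or none) are returned unchanged.
    return items
```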

---

## Version Control

DocRouter maintains schema versioning:

- Each schema update creates a new version
- `schema_version` increments with each change
- `schema_revid` uniquely identifies each version
- Previous versions remain accessible for historical extractions

---

## References

- [JSON Schema Specification](https://json-schema.org/)
- [OpenAI Structured Outputs Documentation](https://platform.openai.com/docs/guides/structured-outputs)
- [DocRouter API Documentation](../README.md)

---

**Document Version:** 1.0
**Last Updated:** 2025-10-11
**Maintained by:** DocRouter Development Team