@nestbox-ai/cli 1.0.59 → 1.0.61
- package/dist/agents/docProc/CONFIG_GUIDE.md +1381 -0
- package/dist/agents/docProc/EVAL_GUIDE.md +800 -0
- package/dist/agents/docProc/SYSTEM_PROMPT.md +24 -0
- package/dist/agents/docProc/config.schema.yaml +564 -0
- package/dist/agents/docProc/eval-test-cases.schema.yaml +248 -0
- package/dist/agents/docProc/index.d.ts +20 -0
- package/dist/agents/docProc/index.js +221 -0
- package/dist/agents/docProc/index.js.map +1 -0
- package/dist/commands/generate/docProc.d.ts +2 -0
- package/dist/commands/generate/docProc.js +111 -0
- package/dist/commands/generate/docProc.js.map +1 -0
- package/dist/commands/generate.js +2 -0
- package/dist/commands/generate.js.map +1 -1
- package/package.json +4 -2
@@ -0,0 +1,1381 @@

# Nestbox Document Pipeline — Configuration Guide

This guide explains every option in the pipeline configuration file (profile), with recommendations for each setting based on document type and use case.

The config file is a single YAML file that controls three stages of the pipeline:

1. **Docling** — document extraction (PDF/DOCX → structured text, tables, images)
2. **Chunking** — text segmentation for RAG
3. **GraphRAG** — knowledge graph construction for semantic search and Q&A

---

## Table of Contents

- [Quick Start](#quick-start)
- [File Structure](#file-structure)
- [Docling — Document Extraction](#docling--document-extraction)
  - [Layout Model](#layout-model)
  - [OCR Engine](#ocr-engine)
  - [Tables](#tables)
  - [Pictures](#pictures)
  - [Accelerator](#accelerator)
  - [Limits](#limits)
- [Chunking — Text Segmentation](#chunking--text-segmentation)
  - [Strategy](#strategy)
  - [Token Sizes](#token-sizes)
  - [Metadata](#metadata)
- [GraphRAG — Knowledge Graph](#graphrag--knowledge-graph)
  - [Models](#models)
  - [Entity Types](#entity-types)
  - [Writing the Entity Extraction Prompt](#writing-the-entity-extraction-prompt)
  - [Relationships](#relationships)
  - [Summarize Descriptions](#summarize-descriptions)
  - [Claim Extraction](#claim-extraction)
  - [Community Detection](#community-detection)
  - [Community Reports Prompt](#community-reports-prompt)
  - [Local Search](#local-search)
  - [Global Search](#global-search)
  - [DRIFT Search](#drift-search)
  - [Clustering & Cache](#clustering--cache)
- [API Keys](#api-keys)
- [Recommendations by Document Type](#recommendations-by-document-type)
- [Complete Example Configs](#complete-example-configs)

---

## Quick Start

The minimum valid config requires only a `name`:

```yaml
name: "My Pipeline"
```

Everything else defaults to sensible values. To customise for your domain, at minimum change:

1. `graphrag.entityExtraction.entityTypes` — what concepts to extract
2. `graphrag.entityExtraction.prompt` — instructions for the LLM
3. `docling.layout.model` — based on your hardware and document quality
4. `docling.ocr.engine` — whether documents are digital or scanned

---

## File Structure

```yaml
name: "..."  # required
description: "..."  # optional

docling:
  layout: ...
  ocr: ...
  tables: ...
  pictures: ...
  accelerator: ...
  limits: ...

chunking:
  strategy: ...
  maxTokens: ...
  overlapTokens: ...
  tokenizer: ...
  mergePeers: ...
  contextualize: ...
  output: ...
  metadata: ...

graphrag:
  enabled: true
  models: ...
  entityExtraction: ...
  summarizeDescriptions: ...
  claimExtraction: ...
  communities: ...
  communityReports: ...
  embeddings: ...
  localSearch: ...
  globalSearch: ...
  driftSearch: ...
  clusterGraph: ...
  cache: ...

apiKeys:
  openai: ${OPENAI_API_KEY}
```

---

## Docling — Document Extraction

Docling converts your source documents (PDF, DOCX, HTML, PPTX, Markdown) into structured JSON and Markdown with extracted tables and figures.

### Layout Model

The layout model detects document structure: headings, paragraphs, tables, figures, columns.

```yaml
docling:
  layout:
    model: docling-layout-egret-large  # default
    createOrphanClusters: true
    keepEmptyClusters: true
```

| Model | GPU Memory | Speed | Accuracy | When to Use |
|-------|-----------|-------|----------|-------------|
| `docling-layout-heron` | ~2 GB | Fastest | Good | High volume, simple documents (plain text PDFs, single-column articles) |
| `docling-layout-heron-101` | ~2 GB | Fast | Better | Simple documents where heron is insufficient |
| `docling-layout-egret-medium` | ~4 GB | Medium | High | Balanced choice for most office documents |
| `docling-layout-egret-large` | ~6 GB | Slower | Very high | **Default. Best for complex layouts: multi-column, mixed tables/text, legal docs** |
| `docling-layout-egret-xlarge` | ~10 GB | Slowest | Highest | Dense academic papers, financial reports with complex tables, maximum accuracy needed |

**Recommendations by document type:**

- **Commercial leases, contracts** → `egret-large` (complex formatting, tables, multi-column)
- **Scanned old documents** → `egret-large` or `egret-xlarge` (needs best layout detection to assist OCR)
- **Simple text PDFs / reports** → `egret-medium` or `heron` (faster processing)
- **Academic papers with equations** → `egret-xlarge`
- **High-volume batch processing** → `heron` or `heron-101` (trade accuracy for throughput)

**Other layout options:**

- `createOrphanClusters: true` — groups floating text elements (headers, footers, captions) into clusters. Recommended: keep `true`.
- `keepEmptyClusters: true` — preserves empty structural elements for layout fidelity. Recommended: keep `true`.
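
The model table can be read as a selection rule: take the most accurate model whose approximate GPU memory footprint fits the hardware. The following is a minimal sketch of that rule, assuming the approximate figures from the table; `pick_layout_model` and `gpu_memory_gb` are hypothetical names for illustration and are not part of the Nestbox CLI.

```python
# Approximate GPU memory per layout model (GB), copied from the table,
# ordered from least to most accurate.
LAYOUT_MODELS = [
    ("docling-layout-heron", 2),
    ("docling-layout-heron-101", 2),
    ("docling-layout-egret-medium", 4),
    ("docling-layout-egret-large", 6),
    ("docling-layout-egret-xlarge", 10),
]


def pick_layout_model(gpu_memory_gb: float) -> str:
    """Return the most accurate model whose approximate footprint fits.

    Hypothetical helper: falls back to the smallest model when nothing
    fits (for example on a CPU-only machine).
    """
    best = LAYOUT_MODELS[0][0]
    for name, required_gb in LAYOUT_MODELS:
        if required_gb <= gpu_memory_gb:
            best = name  # later entries are more accurate, so keep the last fit
    return best


choice_8gb = pick_layout_model(8)
choice_3gb = pick_layout_model(3)
```

The figures cover the layout model only, so in practice it may be worth leaving headroom for OCR and picture description running on the same GPU.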

---

### OCR Engine

OCR is used when documents contain scanned pages or images rather than selectable text.

```yaml
docling:
  ocr:
    enabled: true
    engine: rapidocr  # rapidocr | tesseract | easyocr | mac
    backend: torch  # torch | onnx | cpu
    languages: [en]
    textScore: 0.5
    forceFullPageOcr: true
```

| Engine | Backend | Speed | Multi-Language | Best For |
|--------|---------|-------|---------------|----------|
| `rapidocr` | `torch` / `onnx` | Fast | Limited | **Default. GPU-accelerated, excellent for English documents** |
| `rapidocr` | `onnx` | Medium | Limited | CPU-only servers, Docker deployments without GPU |
| `tesseract` | `cpu` | Slow | Very good | Legacy systems, broad language support, well-known engine |
| `easyocr` | `torch` | Medium | Excellent | Documents with multiple languages, Asian scripts, mixed content |
| `mac` | native | Fast | Good | macOS development environments only |

**Recommendations:**

- **English business documents on GPU server** → `engine: rapidocr`, `backend: torch`
- **CPU-only deployment** → `engine: rapidocr`, `backend: onnx`
- **Documents with French, Spanish, German, etc.** → `engine: easyocr`, add languages: `[en, fr]`
- **Japanese, Chinese, Arabic scripts** → `engine: easyocr`, add the relevant language codes
- **macOS development** → `engine: mac`

**Key parameters:**

- `textScore` (0–1): Confidence threshold for accepting detected text. Lower (0.3) = accept more text but more noise. Higher (0.7) = stricter. **Recommended: 0.5 for most documents, 0.3 for poor quality scans.**
- `forceFullPageOcr: true`: Process the entire page even when digital text is detected. Use when documents mix selectable text and scanned images. Set to `false` for pure digital PDFs to speed up processing.
- `languages`: ISO codes. Examples: `[en]`, `[en, fr]`, `[en, zh]`
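
To make the effect of `textScore` concrete, here is a small sketch of confidence thresholding. The detections, the `filter_by_text_score` helper, and the use of `>=` for the comparison are all assumptions for illustration; the OCR engine's internal behaviour may differ.

```python
from typing import List, Tuple

# Hypothetical OCR detections as (text, confidence) pairs.
Detection = Tuple[str, float]


def filter_by_text_score(detections: List[Detection], text_score: float) -> List[str]:
    """Keep only detections whose confidence meets the threshold."""
    if not 0.0 <= text_score <= 1.0:
        raise ValueError("text_score must be between 0 and 1")
    return [text for text, score in detections if score >= text_score]


detections = [("Minimum Rent", 0.92), ("$135.00", 0.64), ("p3r sq f00t", 0.35)]
strict = filter_by_text_score(detections, 0.7)   # only the cleanest text survives
default = filter_by_text_score(detections, 0.5)  # the guide's recommended value
lenient = filter_by_text_score(detections, 0.3)  # keeps the noisy fragment too
```

The lenient setting recovers more text from poor scans at the price of letting garbled fragments through, which is the trade-off described in the `textScore` bullet.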

**When to disable OCR:**

Set `enabled: false` only for fully digital PDFs with selectable text. This significantly speeds up processing.

```yaml
ocr:
  enabled: false  # pure digital PDFs only
```

---

### Tables

```yaml
docling:
  tables:
    enabled: true
    mode: accurate  # accurate | fast
    doCellMatching: true
```

- `mode: accurate` — uses a more thorough table recognition algorithm. **Always recommended** unless processing speed is critical.
- `mode: fast` — quicker but may miss table structure in complex tables (merged cells, nested headers).
- `doCellMatching: true` — matches detected cells to the table grid structure. Keep `true` for structured output.

**When to use `fast`:** Simple tables, high-volume processing where table accuracy is secondary.
**When to use `accurate`:** Financial statements, lease schedules, data-heavy reports.

---

### Pictures

```yaml
docling:
  pictures:
    enabled: true
    enableClassification: true
    enableDescription: true
    descriptionProvider: openai  # openai | local
    descriptionModel: gpt-4o  # gpt-4o | gpt-4o-mini
    imagesScale: 2.0
    descriptionPrompt: |
      Describe this image...
```

- `enabled: false` — skip all image processing (fastest, use when documents have no relevant images)
- `enableClassification` — classifies each image as chart, diagram, floor plan, photo, etc.
- `enableDescription` — uses a vision LLM to generate a text description of each image, making images searchable
- `descriptionModel`:
  - `gpt-4o` — highest quality descriptions, more expensive
  - `gpt-4o-mini` — good quality, significantly cheaper (recommended for most cases)
- `imagesScale` (0.1–4.0): Resolution multiplier for image extraction. Higher = better quality but larger files.
  - `1.0` — original resolution
  - `2.0` — **default, good balance**
  - `3.0–4.0` — for documents with fine details (architectural drawings, technical schematics)

**Custom description prompt:**

The `descriptionPrompt` tells the vision model how to describe images. Write it based on what images appear in your documents:

```yaml
# For property/real estate documents
descriptionPrompt: |
  Analyze this image from a real estate document.
  If it is a floor plan: list all rooms with labels and measurements.
  If it is a chart: extract all data points, labels, and axis values.
  If it is a photo: describe the property features visible.
  Include all visible text, numbers, and dimensions.

# For financial documents
descriptionPrompt: |
  Analyze this image from a financial report.
  If it is a chart or graph: extract all data values, axis labels, legend entries, and trends.
  If it is a table rendered as image: transcribe all cell values.
  Be precise with all numbers and percentages.

# For technical manuals
descriptionPrompt: |
  Analyze this technical diagram or figure.
  List all labeled components and their connections.
  Describe flow directions, measurements, and specifications.
  Include part numbers and annotations if visible.
```

---

### Accelerator

```yaml
docling:
  accelerator:
    device: auto  # auto | cpu | cuda | mps
    numThreads: 4
    cudaUseFlashAttention2: false
```

- `device: auto` — auto-detects the best available hardware (recommended)
- `device: cuda` — force NVIDIA GPU
- `device: mps` — Apple Silicon GPU (M1/M2/M3)
- `device: cpu` — force CPU (slow, use only if no GPU available)
- `numThreads` — CPU threads for parallel processing (1–32). Match to your server's core count.
- `cudaUseFlashAttention2: true` — enables Flash Attention 2 optimization on NVIDIA GPUs (A100, H100, RTX 3090+). Leave `false` for older GPUs.

---

### Limits

```yaml
docling:
  limits:
    documentTimeout: 300  # seconds per document
    maxPages: 100  # optional: skip pages beyond this
    maxFileSize: 104857600  # optional: bytes (100MB)
```

- `documentTimeout` — maximum seconds to spend on one document (60–3600). **For large PDFs (100+ pages), increase to 600–900.**
- `maxPages` — omit to process all pages; set a limit to cap processing costs
- `maxFileSize` — omit for no limit; useful to reject unexpectedly large uploads
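
As an illustration of how the two optional limits behave, the sketch below mirrors the semantics described in the list: an oversized file is rejected, while pages beyond `maxPages` are skipped. The `apply_limits` function and its parameter names are hypothetical; only the byte arithmetic (100 × 1024 × 1024 = 104857600) comes directly from the example config.

```python
from typing import Optional, Tuple

MIB = 1024 * 1024  # 100 * MIB equals 104857600, the maxFileSize value shown in the config


def apply_limits(
    file_size_bytes: int,
    page_count: int,
    max_file_size: Optional[int] = None,
    max_pages: Optional[int] = None,
) -> Tuple[bool, int]:
    """Return (accepted, pages_to_process) for one document.

    Hypothetical pre-flight check: None means "no limit", mirroring an
    option that is omitted from the config.
    """
    if max_file_size is not None and file_size_bytes > max_file_size:
        return False, 0  # oversized upload is rejected outright
    if max_pages is not None:
        return True, min(page_count, max_pages)  # pages beyond the cap are skipped
    return True, page_count


capped = apply_limits(50 * MIB, 250, max_file_size=100 * MIB, max_pages=100)
rejected = apply_limits(200 * MIB, 10, max_file_size=100 * MIB)
```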

---

## Chunking — Text Segmentation

Chunking splits extracted text into pieces suitable for embedding and GraphRAG.

### Strategy

```yaml
chunking:
  strategy: docling_hybrid  # docling_hybrid | sentence | paragraph | fixed
```

| Strategy | How It Works | Best For |
|----------|-------------|----------|
| `docling_hybrid` | Respects document structure (headings, sections, tables). Splits at natural boundaries. | **Default. Best for structured documents: contracts, reports, manuals** |
| `sentence` | Splits on sentence boundaries | Narrative text, articles, news |
| `paragraph` | Splits on paragraph breaks | General documents without clear section structure |
| `fixed` | Fixed token windows | When structure is irrelevant; simple text corpora |

**Use `docling_hybrid` in almost all cases.** It understands the document's original structure and produces semantically coherent chunks that improve retrieval quality significantly.

---

### Token Sizes

```yaml
chunking:
  maxTokens: 1200  # tokens per chunk (100–8000)
  overlapTokens: 200  # token overlap between chunks (0–1000)
  tokenizer: cl100k_base
  mergePeers: true
  contextualize: true
```

**maxTokens recommendations:**

| Use Case | Recommended maxTokens | Reasoning |
|----------|----------------------|-----------|
| GraphRAG (default) | **1200** | Optimal balance for entity extraction. Smaller = more precise entities but less context |
| Dense financial/legal | 800–1000 | Shorter chunks improve extraction of specific values |
| Long narrative text | 1500–2000 | Preserves more context per chunk |
| Very short documents | 400–600 | Avoid splitting small sections |
| Simple Q&A retrieval | 512 | Fast, precise retrieval |

**overlapTokens recommendations:**

| Document Type | Recommended Overlap |
|--------------|--------------------|
| Contracts (clauses span sections) | 200–300 |
| Technical manuals | 100–200 |
| Plain text / articles | 50–100 |
| Self-contained sections (reports) | 0–100 |

The overlap ensures that context at the boundary between chunks isn't lost. **200 is a safe default.**
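
The interaction between `maxTokens` and `overlapTokens` is easiest to see with plain sliding windows. The sketch below assumes the simplest case (the `fixed` strategy, where each window starts `maxTokens - overlapTokens` tokens after the previous one); `fixed_windows` is a hypothetical helper, and `docling_hybrid` will produce different boundaries because it splits at structural breaks.

```python
from typing import List, Tuple


def fixed_windows(n_tokens: int, max_tokens: int, overlap_tokens: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for overlapping fixed-size chunks."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    stride = max_tokens - overlap_tokens  # how far each new chunk advances
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reached the end of the document
        start += stride
    return spans


# A 3000-token document with the default sizes from the config above.
spans = fixed_windows(3000, max_tokens=1200, overlap_tokens=200)
# spans == [(0, 1200), (1000, 2200), (2000, 3000)]
```

Each chunk repeats the final 200 tokens of its predecessor, so a clause that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.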

**Other options:**

- `tokenizer: cl100k_base` — use for GPT-4 / text-embedding-3 family (default, recommended)
- `mergePeers: true` — merges adjacent small chunks from the same section before splitting. Produces cleaner chunks. Keep `true`.
- `contextualize: true` — prepends the section heading path to each chunk (e.g., "Article 5 > Payment Terms > Section 5.2"). Dramatically improves retrieval. **Always keep `true`.**

---

### Metadata

```yaml
chunking:
  output:
    format: text_files  # text_files | json
    includeMetadataHeader: true

  metadata:
    includeHeadings: true
    includePageNumbers: true
    includePosition: true
    includeSource: true
```

- `format: text_files` — required for GraphRAG (one text file per chunk). Use `json` for API/custom consumption.
- `includeMetadataHeader: true` — adds a metadata block at the top of each chunk (source file, page, headings). Keep `true` for GraphRAG; it improves retrieval context.
- Keep all `metadata` flags `true`. They add provenance information used in citations and search results.

---

## GraphRAG — Knowledge Graph

GraphRAG builds a knowledge graph from your chunks using an LLM to extract entities, relationships, and community summaries. This enables both precise entity-level retrieval (local search) and broad thematic analysis (global search).

### Models

```yaml
graphrag:
  models:
    chatModel: gpt-4o-mini  # for entity extraction + summarization
    embeddingModel: text-embedding-3-large
    temperature: 0
    maxTokens: 4096
    embeddingBatchSize: 16
```

- `chatModel`: The LLM used for all extraction and summarization.
  - `gpt-4o-mini` — **recommended default**. Good accuracy, lower cost, sufficient for most domains.
  - `gpt-4o` — higher accuracy for complex extractions, ambiguous entities, or demanding domains. Substantially more expensive per token than `gpt-4o-mini` (well over 10x at current OpenAI list prices).
  - `gpt-4-turbo` — alternative for large context requirements.
- `embeddingModel`: Used for vector embeddings.
  - `text-embedding-3-large` — **recommended**. 3072 dimensions, best quality.
  - `text-embedding-3-small` — cheaper, 1536 dimensions. Use for large corpora on tight budget.
- `temperature: 0` — always use 0 for extraction tasks. Higher temperature introduces inconsistency in structured outputs.
- `maxTokens: 4096` — response token limit. Increase to 8192 if entity lists are being truncated.
- `embeddingBatchSize` — how many texts to embed per API call. Increase to 32–64 on fast connections to speed up indexing.
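
To see why a larger `embeddingBatchSize` speeds up indexing, it helps to count the API calls. The sketch below only groups texts into batches; it makes no network calls, and `batch_texts` is a hypothetical helper rather than the pipeline's actual batching code.

```python
from typing import List, Sequence


def batch_texts(texts: Sequence[str], batch_size: int) -> List[List[str]]:
    """Group texts into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(texts[i:i + batch_size]) for i in range(0, len(texts), batch_size)]


chunks = [f"chunk {i}" for i in range(100)]
calls_default = len(batch_texts(chunks, 16))  # 7 embedding requests
calls_larger = len(batch_texts(chunks, 64))   # 2 embedding requests
```

Fewer requests means less per-call latency overhead, though each request carries more tokens, so provider rate limits still apply.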

---

### Entity Types

Entity types define what concepts the LLM should extract from your documents. **This is the most impactful configuration decision.**

```yaml
graphrag:
  entityExtraction:
    entityTypes:
      - PERSON
      - ORGANIZATION
      - LOCATION
      ...
```

**Rules for defining entity types:**

1. **Use domain-specific types** — generic types (PERSON, ORGANIZATION) produce weaker graphs than domain-specific ones (LANDLORD, TENANT, PREMISES)
2. **Keep types UPPER_CASE** — this is the expected format
3. **Use 8–25 types** — too few miss distinctions; too many confuse the LLM
4. **Name types for what they represent, not what they contain** — `MINIMUM_RENT` is better than `FINANCIAL_VALUE`
5. **Include relational types** (who the parties are) alongside content types (what the obligations are)
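
Two of these rules (UPPER_CASE naming and the 8–25 range) can be checked mechanically before a profile is used. The following is a minimal sketch of such a check; `check_entity_types` and the regular expression are assumptions for illustration, not a validator shipped with the CLI, and the remaining rules still need human judgment.

```python
import re
from typing import List

# Upper-case words joined by single underscores, e.g. MINIMUM_RENT.
UPPER_CASE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


def check_entity_types(entity_types: List[str]) -> List[str]:
    """Return warnings; an empty list means both mechanical rules are met."""
    warnings = []
    for name in entity_types:
        if not UPPER_CASE.match(name):
            warnings.append(f"{name!r} is not UPPER_CASE")
    if not 8 <= len(entity_types) <= 25:
        warnings.append(f"{len(entity_types)} types defined; the guide suggests 8-25")
    return warnings


lease_types = [
    "LANDLORD", "TENANT", "GUARANTOR", "PREMISES", "BUILDING", "MINIMUM_RENT",
    "ADDITIONAL_RENT", "SECURITY_DEPOSIT", "TERM", "COMMENCEMENT_DATE",
    "EXPIRY_DATE", "EXTENSION_OPTION", "TERMINATION_RIGHT",
    "INSURANCE_REQUIREMENT", "MAINTENANCE_OBLIGATION", "USE_RESTRICTION", "DEFAULT",
]
lease_warnings = check_entity_types(lease_types)
generic_warnings = check_entity_types(["Person", "ORG"])
```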

**Examples by domain:**

```yaml
# Commercial Leases
entityTypes:
  - LANDLORD
  - TENANT
  - GUARANTOR
  - PREMISES
  - BUILDING
  - MINIMUM_RENT
  - ADDITIONAL_RENT
  - SECURITY_DEPOSIT
  - TERM
  - COMMENCEMENT_DATE
  - EXPIRY_DATE
  - EXTENSION_OPTION
  - TERMINATION_RIGHT
  - INSURANCE_REQUIREMENT
  - MAINTENANCE_OBLIGATION
  - USE_RESTRICTION
  - DEFAULT

# Financial Reports / Investment Documents
entityTypes:
  - COMPANY
  - FUND
  - INVESTOR
  - ASSET
  - REVENUE
  - EXPENSE
  - PROFIT
  - VALUATION
  - RISK_FACTOR
  - FINANCIAL_PERIOD
  - PROJECTION
  - REGULATORY_REQUIREMENT

# Medical / Clinical Documents
entityTypes:
  - PATIENT
  - DIAGNOSIS
  - MEDICATION
  - DOSAGE
  - TREATMENT
  - PROCEDURE
  - PHYSICIAN
  - INSTITUTION
  - OUTCOME
  - ADVERSE_EVENT
  - TRIAL_PHASE
  - CONTRAINDICATION

# Software / Technical Documentation
entityTypes:
  - SERVICE
  - API_ENDPOINT
  - PARAMETER
  - RETURN_VALUE
  - ERROR_CODE
  - DEPENDENCY
  - VERSION
  - CONFIGURATION
  - WORKFLOW
  - PERMISSION
  - DATA_MODEL
```

---

### Writing the Entity Extraction Prompt

The entity extraction prompt is the most important customisation. A well-written prompt dramatically improves the quality and consistency of extracted entities and relationships.

**Structure of an effective prompt:**

```
-Goal-
Brief description of the extraction task and document domain.

-Entity Types-
List allowed types. State clearly that NO other types are allowed.

Type Definitions:
- TYPE_NAME: what it means, what to include/exclude

-Steps-
1. Entity extraction format
2. Relationship extraction format
3. Delimiter instructions

IMPORTANT: Rules to prevent common mistakes.

-Examples-
######################
[3+ examples showing input text → expected output]

-Real Data-
######################
entity_types: [...]
text: {input_text}
######################
output:
```

**Required placeholders** (GraphRAG injects these automatically):

| Placeholder | Description |
|-------------|-------------|
| `{input_text}` | The actual chunk text to process |
| `{tuple_delimiter}` | Separator between fields within a record |
| `{record_delimiter}` | Separator between records |
| `{completion_delimiter}` | End-of-output marker |

**Entity output format:**
```
("entity"{tuple_delimiter}<NAME>{tuple_delimiter}<TYPE>{tuple_delimiter}<DESCRIPTION>)
```

**Relationship output format:**
```
("relationship"{tuple_delimiter}<SOURCE>{tuple_delimiter}<TARGET>{tuple_delimiter}<DESCRIPTION>{tuple_delimiter}<STRENGTH>)
```

- Relationship strength: integer 1–10 (10 = core to the document's purpose)
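
The record formats above become clearer once the placeholders are replaced by concrete strings. The sketch below parses a small sample of model output into dictionaries. The three delimiter values are assumptions chosen for illustration (the pipeline injects the real ones), and `parse_records` is a hypothetical helper, not the parser GraphRAG itself uses.

```python
from typing import Dict, List

# Assumed delimiter values for illustration only; the pipeline injects the real ones.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_records(output: str) -> List[Dict[str, object]]:
    """Split raw LLM output into entity and relationship dictionaries."""
    body = output.split(COMPLETION_DELIMITER)[0]  # ignore anything after the end marker
    records: List[Dict[str, object]] = []
    for raw in body.split(RECORD_DELIMITER):
        raw = raw.strip()
        if not (raw.startswith("(") and raw.endswith(")")):
            continue  # skip blank or malformed fragments
        fields = [field.strip() for field in raw[1:-1].split(TUPLE_DELIMITER)]
        kind = fields[0].strip('"')
        if kind == "entity" and len(fields) == 4:
            records.append({"kind": kind, "name": fields[1],
                            "type": fields[2], "description": fields[3]})
        elif kind == "relationship" and len(fields) == 5:
            records.append({"kind": kind, "source": fields[1], "target": fields[2],
                            "description": fields[3], "strength": int(fields[4])})
    return records


sample = (
    '("entity"<|>5 YEAR TERM<|>TERM<|>Initial lease term of five years)'
    "##"
    '("entity"<|>JANUARY 15, 2026<|>COMMENCEMENT_DATE<|>Date when the lease term commences)'
    "##"
    '("relationship"<|>5 YEAR TERM<|>JANUARY 15, 2026<|>Term begins on this date<|>9)'
    "<|COMPLETE|>"
)
records = parse_records(sample)
```

Because a custom prompt usually contains other literal braces, substituting placeholders with `str.replace` tends to be safer than `str.format` when rendering a prompt by hand for testing.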

---

**Full example prompt — Commercial Leases:**

```yaml
graphrag:
  entityExtraction:
    entityTypes:
      - LANDLORD
      - TENANT
      - GUARANTOR
      - PREMISES
      - BUILDING
      - MINIMUM_RENT
      - ADDITIONAL_RENT
      - SECURITY_DEPOSIT
      - TERM
      - COMMENCEMENT_DATE
      - EXPIRY_DATE
      - EXTENSION_OPTION
      - TERMINATION_RIGHT
      - NOTICE_PERIOD
      - INSURANCE_REQUIREMENT
      - MAINTENANCE_OBLIGATION
      - USE_RESTRICTION
      - DEFAULT

    maxGleanings: 1

    prompt: |
      -Goal-
      Extract entities and relationships from commercial lease documents. Always include specific values (dollar amounts, dates, percentages, square footage) directly in entity names.

      -Entity Types-
      You MUST use ONLY these entity types:
      [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]

      Type Definitions:
      - LANDLORD: The property owner or lessor granting the lease
      - TENANT: The lessee or occupant paying rent
      - GUARANTOR: Party (often parent company) guaranteeing the tenant's obligations
      - PREMISES: The specific leased space (include address, unit, square footage)
      - BUILDING: The building containing the premises
      - MINIMUM_RENT: Base rent — include $/sqft AND total annual/monthly amounts
      - ADDITIONAL_RENT: Operating costs, taxes, utilities passed to tenant beyond base rent
      - SECURITY_DEPOSIT: Upfront deposit securing tenant obligations (include exact amount)
      - TERM: Total lease duration (e.g., "5 YEAR TERM")
      - COMMENCEMENT_DATE: When lease term officially starts
      - EXPIRY_DATE: When lease ends
      - EXTENSION_OPTION: Right to extend — include number of options and duration
      - TERMINATION_RIGHT: Right to end lease early — include conditions
      - NOTICE_PERIOD: Required advance notice for any action
      - INSURANCE_REQUIREMENT: Required coverage types and minimum amounts
      - MAINTENANCE_OBLIGATION: Who is responsible for repairs and what
      - USE_RESTRICTION: Permitted or prohibited uses of the premises
      - DEFAULT: Events constituting a breach of the lease

      -Steps-
      1. Identify all entities. For each:
         - entity_name: Descriptive WITH specific values (e.g., "MINIMUM RENT $135.00/SQFT $291,060/YEAR - YEARS 1-2")
         - entity_type: MUST be one of the types listed above
         - entity_description: Complete details including all dollar amounts, dates, percentages, conditions
         Format: ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

      2. Identify all relationships. For each:
         - source_entity: Entity name from step 1
         - target_entity: Entity name from step 1
         - relationship_description: How they relate or interact
         - relationship_strength: 1-10 (10 = core lease term like rent or parties; 1 = minor reference)
         Format: ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)

      3. Use {record_delimiter} between records. End with {completion_delimiter}.

      IMPORTANT: Never leave entity_type blank. If an entity doesn't perfectly match a type, use the closest one. Never invent new types.

      -Examples-
      ######################

      Example 1:

      entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
      text:
      The Tenant, EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN, agrees to lease the Premises from the Landlord, YORKVILLE OFFICE RETAIL CORPORATION. The Rentable Area is approximately 2,156 square feet at 135 Yorkville Avenue, Units 2 and 3. The Term is five (5) years commencing January 15, 2026.
      ------------------------
      output:
      ("entity"{tuple_delimiter}EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN{tuple_delimiter}TENANT{tuple_delimiter}Tenant corporation operating as Bang & Olufsen, high-end consumer electronics retailer, lessee under the lease)
      {record_delimiter}
      ("entity"{tuple_delimiter}YORKVILLE OFFICE RETAIL CORPORATION{tuple_delimiter}LANDLORD{tuple_delimiter}Landlord and property owner, lessor of 135 Yorkville Avenue)
      {record_delimiter}
      ("entity"{tuple_delimiter}135 YORKVILLE UNITS 2-3 - 2,156 SQFT{tuple_delimiter}PREMISES{tuple_delimiter}Commercial retail premises at 135 Yorkville Avenue, Units 2 and 3, approximately 2,156 square feet rentable area)
      {record_delimiter}
      ("entity"{tuple_delimiter}5 YEAR TERM{tuple_delimiter}TERM{tuple_delimiter}Initial lease term of five years)
      {record_delimiter}
      ("entity"{tuple_delimiter}JANUARY 15, 2026{tuple_delimiter}COMMENCEMENT_DATE{tuple_delimiter}Date when the lease term officially commences)
      {record_delimiter}
      ("relationship"{tuple_delimiter}EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN{tuple_delimiter}YORKVILLE OFFICE RETAIL CORPORATION{tuple_delimiter}Tenant leases premises from Landlord under this commercial lease{tuple_delimiter}10)
      {record_delimiter}
      ("relationship"{tuple_delimiter}EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN{tuple_delimiter}135 YORKVILLE UNITS 2-3 - 2,156 SQFT{tuple_delimiter}Tenant occupies and leases this premises for retail operations{tuple_delimiter}10)
      {record_delimiter}
      ("relationship"{tuple_delimiter}5 YEAR TERM{tuple_delimiter}JANUARY 15, 2026{tuple_delimiter}Lease term of five years begins on this commencement date{tuple_delimiter}9)
      {completion_delimiter}
      #############################

      Example 2:

      entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
      text:
      Minimum Rent for Years 1 and 2 is $135.00 per square foot per annum ($291,060.00 annually, $24,255.00 monthly). Year 3 increases to $140.00 per square foot ($301,840.00 annually). Years 4 and 5 are $145.00 per square foot ($312,620.00 annually). All amounts are plus HST. The Security Deposit is $71,464.22 inclusive of HST.
      ------------------------
      output:
      ("entity"{tuple_delimiter}MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2{tuple_delimiter}MINIMUM_RENT{tuple_delimiter}Base rent for Years 1-2: $135.00 per square foot per annum, totaling $291,060.00 annually ($24,255.00 monthly) plus HST)
      {record_delimiter}
      ("entity"{tuple_delimiter}MINIMUM RENT $140/SQFT - $301,840/YEAR - YEAR 3{tuple_delimiter}MINIMUM_RENT{tuple_delimiter}Base rent for Year 3: $140.00 per square foot per annum, totaling $301,840.00 annually plus HST)
      {record_delimiter}
      ("entity"{tuple_delimiter}MINIMUM RENT $145/SQFT - $312,620/YEAR - YEARS 4-5{tuple_delimiter}MINIMUM_RENT{tuple_delimiter}Base rent for Years 4-5: $145.00 per square foot per annum, totaling $312,620.00 annually plus HST)
|
|
681
|
+
{record_delimiter}
|
|
682
|
+
("entity"{tuple_delimiter}SECURITY DEPOSIT $71,464.22 INCLUDING HST{tuple_delimiter}SECURITY_DEPOSIT{tuple_delimiter}Security deposit of $71,464.22 inclusive of HST, held by Landlord to secure Tenant's obligations)
|
|
683
|
+
{record_delimiter}
|
|
684
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2{tuple_delimiter}Tenant pays this base rent during Years 1 and 2 of the lease{tuple_delimiter}10)
|
|
685
|
+
{record_delimiter}
|
|
686
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}MINIMUM RENT $140/SQFT - $301,840/YEAR - YEAR 3{tuple_delimiter}Tenant pays this escalated base rent during Year 3{tuple_delimiter}10)
|
|
687
|
+
{record_delimiter}
|
|
688
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}MINIMUM RENT $145/SQFT - $312,620/YEAR - YEARS 4-5{tuple_delimiter}Tenant pays this escalated base rent during Years 4 and 5{tuple_delimiter}10)
|
|
689
|
+
{record_delimiter}
|
|
690
|
+
("relationship"{tuple_delimiter}MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2{tuple_delimiter}MINIMUM RENT $140/SQFT - $301,840/YEAR - YEAR 3{tuple_delimiter}Rent escalates by $5/sqft from Years 1-2 to Year 3{tuple_delimiter}8)
|
|
691
|
+
{record_delimiter}
|
|
692
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}SECURITY DEPOSIT $71,464.22 INCLUDING HST{tuple_delimiter}Tenant provides deposit to secure lease obligations{tuple_delimiter}9)
|
|
693
|
+
{completion_delimiter}
|
|
694
|
+
#############################
|
|
695
|
+
|
|
696
|
+
Example 3:
|
|
697
|
+
|
|
698
|
+
entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
|
|
699
|
+
text:
|
|
700
|
+
The Tenant shall maintain comprehensive general liability insurance of not less than $5,000,000 per occurrence. The Tenant has two (2) options to extend the Term, each for five (5) years, provided 180 days' prior written notice is given. The Premises may only be used for the retail sale of luxury consumer electronics and related accessories.
|
|
701
|
+
------------------------
|
|
702
|
+
output:
|
|
703
|
+
("entity"{tuple_delimiter}LIABILITY INSURANCE - $5,000,000 PER OCCURRENCE{tuple_delimiter}INSURANCE_REQUIREMENT{tuple_delimiter}Tenant must maintain comprehensive general liability insurance with minimum coverage of $5,000,000 per occurrence)
|
|
704
|
+
{record_delimiter}
|
|
705
|
+
("entity"{tuple_delimiter}TWO 5-YEAR EXTENSION OPTIONS{tuple_delimiter}EXTENSION_OPTION{tuple_delimiter}Tenant holds two options to extend the lease term, each for five years, exercisable upon proper notice)
|
|
706
|
+
{record_delimiter}
|
|
707
|
+
("entity"{tuple_delimiter}180 DAYS PRIOR WRITTEN NOTICE - EXTENSION{tuple_delimiter}NOTICE_PERIOD{tuple_delimiter}Required advance notice of 180 days to exercise extension option)
|
|
708
|
+
{record_delimiter}
|
|
709
|
+
("entity"{tuple_delimiter}LUXURY CONSUMER ELECTRONICS RETAIL ONLY{tuple_delimiter}USE_RESTRICTION{tuple_delimiter}Permitted use of the Premises restricted to retail sale of luxury consumer electronics and related accessories only)
|
|
710
|
+
{record_delimiter}
|
|
711
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}LIABILITY INSURANCE - $5,000,000 PER OCCURRENCE{tuple_delimiter}Tenant is required to maintain this insurance coverage throughout the lease term{tuple_delimiter}9)
|
|
712
|
+
{record_delimiter}
|
|
713
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}TWO 5-YEAR EXTENSION OPTIONS{tuple_delimiter}Tenant holds the right to exercise these extension options{tuple_delimiter}8)
|
|
714
|
+
{record_delimiter}
|
|
715
|
+
("relationship"{tuple_delimiter}TWO 5-YEAR EXTENSION OPTIONS{tuple_delimiter}180 DAYS PRIOR WRITTEN NOTICE - EXTENSION{tuple_delimiter}Extension option must be exercised with 180 days prior written notice{tuple_delimiter}8)
|
|
716
|
+
{record_delimiter}
|
|
717
|
+
("relationship"{tuple_delimiter}TENANT{tuple_delimiter}LUXURY CONSUMER ELECTRONICS RETAIL ONLY{tuple_delimiter}Tenant's use of premises is restricted to this permitted use{tuple_delimiter}7)
|
|
718
|
+
{completion_delimiter}
|
|
719
|
+
#############################
|
|
720
|
+
|
|
721
|
+
-Real Data-
|
|
722
|
+
######################
|
|
723
|
+
entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
|
|
724
|
+
text: {input_text}
|
|
725
|
+
######################
|
|
726
|
+
output:
|
|
727
|
+
```
|
|
728
|
+
|
|
729
|
+
---
|
|
730
|
+
|
|
731
|
+
### Relationships
|
|
732
|
+
|
|
733
|
+
Relationships are defined in the same extraction prompt as entities. Key principles:
|
|
734
|
+
|
|
735
|
+
**Strength scale (1–10):**
|
|
736
|
+
|
|
737
|
+
| Strength | Meaning |
|
|
738
|
+
|----------|---------|
|
|
739
|
+
| 10 | Core document relationship (party to party, party to primary obligation) |
|
|
740
|
+
| 8–9 | Important contractual link (obligation to deadline, right to condition) |
|
|
741
|
+
| 6–7 | Supporting relationship (secondary obligation, cross-reference) |
|
|
742
|
+
| 3–5 | Contextual link (location to building, general description) |
|
|
743
|
+
| 1–2 | Weak or incidental mention |
|
|
744
|
+
|
|
745
|
+
**Always extract relationships between:**
|
|
746
|
+
- Parties ↔ Parties (Tenant ↔ Landlord)
|
|
747
|
+
- Parties ↔ Obligations (Tenant → Insurance requirement)
|
|
748
|
+
- Obligations ↔ Conditions (Extension option → Notice period required)
|
|
749
|
+
- Financial terms ↔ Time periods (Rent $135 → Years 1-2)
|
|
750
|
+
- Escalation chains (Rent Year 1 → Rent Year 2 → Rent Year 3)
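
The record layout and strength tiers above can be sketched as a small parser. This is only an illustration: the delimiter values `<|>`, `##` and `<|COMPLETE|>` are assumptions (the prompt names only the placeholders), and `strength_tier` and `parse_relationships` are hypothetical helpers, not part of the pipeline.

```python
# Assumed delimiter values; the extraction prompt only names the placeholders
# {tuple_delimiter}, {record_delimiter} and {completion_delimiter}.
TUPLE_DELIM = "<|>"
RECORD_DELIM = "##"
COMPLETION_DELIM = "<|COMPLETE|>"


def strength_tier(strength: int) -> str:
    """Map a 1-10 strength to a shorthand label for the table rows above."""
    if strength >= 10:
        return "core"
    if strength >= 8:
        return "important"
    if strength >= 6:
        return "supporting"
    if strength >= 3:
        return "contextual"
    return "weak"


def parse_relationships(raw: str) -> list[dict]:
    """Return the relationship records found in one extraction response."""
    body = raw.split(COMPLETION_DELIM)[0]
    relationships = []
    for record in body.split(RECORD_DELIM):
        record = record.strip()
        if not (record.startswith("(") and record.endswith(")")):
            continue
        fields = [field.strip() for field in record[1:-1].split(TUPLE_DELIM)]
        # Relationship records have 5 fields; entity records have 4 and are skipped.
        if len(fields) != 5 or fields[0] != '"relationship"':
            continue
        _, source, target, description, strength = fields
        relationships.append({
            "source": source,
            "target": target,
            "description": description,
            "strength": int(strength),
            "tier": strength_tier(int(strength)),
        })
    return relationships


sample = (
    '("entity"<|>5 YEAR TERM<|>TERM<|>Initial lease term of five years)'
    "##"
    '("relationship"<|>5 YEAR TERM<|>JANUARY 15, 2026<|>'
    "Lease term of five years begins on this commencement date<|>9)"
    "<|COMPLETE|>"
)
rels = parse_relationships(sample)
print(rels[0]["source"], "->", rels[0]["target"], rels[0]["tier"])
```

A check like this can help spot prompts whose examples produce mostly low-strength links.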
|
|
751
|
+
|
|
752
|
+
---
|
|
753
|
+
|
|
754
|
+
### Summarize Descriptions
|
|
755
|
+
|
|
756
|
+
When the same entity appears in multiple chunks, its descriptions are merged using the summarization prompt.
|
|
757
|
+
|
|
758
|
+
```yaml
|
|
759
|
+
graphrag:
|
|
760
|
+
summarizeDescriptions:
|
|
761
|
+
maxLength: 500 # characters per summarized description
|
|
762
|
+
maxInputLength: 8000 # input character limit before truncation
|
|
763
|
+
prompt: |
|
|
764
|
+
You are consolidating descriptions for entities from [domain] documents.
|
|
765
|
+
|
|
766
|
+
Given multiple descriptions of the same entity, produce one comprehensive description that:
|
|
767
|
+
1. Preserves ALL specific values (amounts, dates, percentages, names)
|
|
768
|
+
2. Combines unique details without duplication
|
|
769
|
+
3. Never rounds, approximates, or generalizes numbers
|
|
770
|
+
|
|
771
|
+
Entity: {entity_name}
|
|
772
|
+
Descriptions: {description_list}
|
|
773
|
+
|
|
774
|
+
Consolidated Description:
|
|
775
|
+
```
|
|
776
|
+
|
|
777
|
+
- `maxLength: 500` — good default. Increase to 800–1000 for entities that accumulate many details (e.g., a complex rent schedule).
|
|
778
|
+
- The prompt receives `{entity_name}` and `{description_list}` — always reference both.
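
A custom prompt can be checked for the two required placeholders before a run. The sketch below assumes the template engine follows Python `str.format` conventions, which this guide does not state; `placeholders` and `missing_placeholders` are hypothetical helper names.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the named fields a str.format-style template refers to."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_placeholders(template: str, required: set[str]) -> set[str]:
    """Return the required placeholder names the template never references."""
    return required - placeholders(template)


REQUIRED = {"entity_name", "description_list"}

good_prompt = "Entity: {entity_name}\nDescriptions: {description_list}\n"
bad_prompt = "Descriptions: {description_list}\nConsolidated Description:\n"

print(missing_placeholders(good_prompt, REQUIRED))
print(missing_placeholders(bad_prompt, REQUIRED))
```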
|
|
779
|
+
|
|
780
|
+
---
|
|
781
|
+
|
|
782
|
+
### Claim Extraction
|
|
783
|
+
|
|
784
|
+
Claims are facts, obligations, or assertions extracted separately from entities.
|
|
785
|
+
|
|
786
|
+
```yaml
|
|
787
|
+
graphrag:
|
|
788
|
+
claimExtraction:
|
|
789
|
+
enabled: false # disabled by default — adds API cost
|
|
790
|
+
description: "Explicit obligations and factual claims in the document"
|
|
791
|
+
maxGleanings: 1
|
|
792
|
+
prompt: |
|
|
793
|
+
Extract specific claims, obligations, and factual statements...
|
|
794
|
+
```
|
|
795
|
+
|
|
796
|
+
**Enable for:** Legal documents, compliance documents, regulatory filings where individual claims need to be tracked separately from entities.
|
|
797
|
+
**Leave disabled for:** General documents, when entity extraction already captures the needed information.
|
|
798
|
+
|
|
799
|
+
---
|
|
800
|
+
|
|
801
|
+
### Community Detection
|
|
802
|
+
|
|
803
|
+
GraphRAG groups related entities into communities and generates summaries for each.
|
|
804
|
+
|
|
805
|
+
```yaml
|
|
806
|
+
graphrag:
|
|
807
|
+
communities:
|
|
808
|
+
algorithm: leiden # leiden | louvain
|
|
809
|
+
resolution: 1.0
|
|
810
|
+
minCommunitySize: 3
|
|
811
|
+
maxLevels: 3
|
|
812
|
+
```
|
|
813
|
+
|
|
814
|
+
- `algorithm: leiden`: **recommended**. Leiden guarantees that each community is internally connected, which Louvain does not, so it tends to produce cleaner communities.
|
|
815
|
+
- `resolution` (0.1–10): Controls granularity.
|
|
816
|
+
- Lower (0.5) = fewer, larger communities (broader summaries)
|
|
817
|
+
- Higher (2.0) = more, smaller communities (more specific summaries)
|
|
818
|
+
- **1.0 is a good default for most document types**
|
|
819
|
+
- `minCommunitySize: 3` — minimum entities to form a community. Prevents trivial communities.
|
|
820
|
+
- `maxLevels: 3` — depth of hierarchical community structure. Increase to 4–5 for very large corpora.
|
|
821
|
+
|
|
822
|
+
---
|
|
823
|
+
|
|
824
|
+
### Community Reports Prompt
|
|
825
|
+
|
|
826
|
+
Community reports summarize each entity cluster. This is what global search uses to answer broad questions.
|
|
827
|
+
|
|
828
|
+
```yaml
|
|
829
|
+
graphrag:
|
|
830
|
+
communityReports:
|
|
831
|
+
maxLength: 2000
|
|
832
|
+
maxInputLength: 8000
|
|
833
|
+
prompt: |
|
|
834
|
+
You are a [domain] expert. Analyze communities of entities to produce actionable summaries.
|
|
835
|
+
|
|
836
|
+
# Goal
|
|
837
|
+
Write a comprehensive report about a community of related entities from [domain] documents.
|
|
838
|
+
|
|
839
|
+
# Report Structure
|
|
840
|
+
- TITLE: Short descriptive name including key entities
|
|
841
|
+
- SUMMARY: Executive summary with specific values and dates
|
|
842
|
+
- RATING: Float 0-10 (10 = most critical/central to domain)
|
|
843
|
+
- RATING EXPLANATION: One sentence
|
|
844
|
+
- FINDINGS: 5-10 specific insights with data references
|
|
845
|
+
|
|
846
|
+
Return as JSON:
|
|
847
|
+
{{
|
|
848
|
+
"title": <title>,
|
|
849
|
+
"summary": <summary>,
|
|
850
|
+
"rating": <rating>,
|
|
851
|
+
"rating_explanation": <explanation>,
|
|
852
|
+
"findings": [
|
|
853
|
+
{{
|
|
854
|
+
"summary": <finding>,
|
|
855
|
+
"explanation": <explanation with data references>
|
|
856
|
+
}}
|
|
857
|
+
]
|
|
858
|
+
}}
|
|
859
|
+
|
|
860
|
+
# Grounding Rules
|
|
861
|
+
Reference data as: [Data: Entities (ids); Relationships (ids)]
|
|
862
|
+
Use max 5 IDs per reference, add "+more" if needed.
|
|
863
|
+
|
|
864
|
+
# Data
|
|
865
|
+
{input_text}
|
|
866
|
+
Output:
|
|
867
|
+
```
|
|
868
|
+
|
|
869
|
+
**Note:** Write the literal braces of the JSON template as `{{` and `}}` (double braces) so that GraphRAG's template engine does not treat them as placeholders like `{input_text}`.
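
The effect of brace escaping can be reproduced with Python's built-in `str.format`. This is a sketch under the assumption that the template engine uses the same escaping rules; the prompt text here is shortened for illustration.

```python
# Doubled braces survive formatting as single literal braces,
# while {input_text} is substituted as a placeholder.
template = (
    "Return as JSON:\n"
    "{{\n"
    '  "title": <title>\n'
    "}}\n"
    "# Data\n"
    "{input_text}\n"
)
rendered = template.format(input_text="entity rows go here")
print(rendered)

# With single braces, the JSON body is parsed as a placeholder and formatting fails.
unescaped = '{\n  "title": <title>\n}\n{input_text}\n'
try:
    unescaped.format(input_text="entity rows go here")
    failed = False
except (KeyError, ValueError, IndexError):
    failed = True
print("unescaped template failed:", failed)
```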
|
|
870
|
+
|
|
871
|
+
- `maxLength: 2000` — report character limit. Increase to 3000–4000 for complex communities with many entities.
|
|
872
|
+
- `maxInputLength: 8000` — input limit before truncation. Increase for large communities.
|
|
873
|
+
|
|
874
|
+
---
|
|
875
|
+
|
|
876
|
+
### Local Search
|
|
877
|
+
|
|
878
|
+
Local search answers specific, entity-focused questions ("What is the rent?", "Who are the parties?").
|
|
879
|
+
|
|
880
|
+
```yaml
|
|
881
|
+
graphrag:
|
|
882
|
+
localSearch:
|
|
883
|
+
topKEntities: 10 # top entities to retrieve
|
|
884
|
+
topKRelationships: 10 # top relationships to retrieve
|
|
885
|
+
topKCommunityReports: 5 # community reports to include
|
|
886
|
+
maxContextTokens: 12000 # total context window
|
|
887
|
+
prompt: |
|
|
888
|
+
---Role---
|
|
889
|
+
You are a [domain] expert answering questions using extracted knowledge graph data.
|
|
890
|
+
|
|
891
|
+
---Domain Knowledge---
|
|
892
|
+
[Define what each entity type means and how to interpret it]
|
|
893
|
+
|
|
894
|
+
---Goal---
|
|
895
|
+
Answer the user's question using ONLY the data tables provided.
|
|
896
|
+
- Be specific: include exact values, dates, and amounts from the data
|
|
897
|
+
- Cite sources: [Data: Entities (ids); Relationships (ids)]
|
|
898
|
+
- State clearly if information is not in the data
|
|
899
|
+
|
|
900
|
+
---Target response length and format---
|
|
901
|
+
{response_type}
|
|
902
|
+
|
|
903
|
+
---Data tables---
|
|
904
|
+
{context_data}
|
|
905
|
+
|
|
906
|
+
Style the response in markdown.
|
|
907
|
+
```
|
|
908
|
+
|
|
909
|
+
- `topKEntities` / `topKRelationships`: increase to 20–30 for complex queries that touch many entities. Watch `maxContextTokens` to avoid overflowing the context budget.
|
|
910
|
+
- `maxContextTokens: 12000` — total context budget. Use up to 16000 for GPT-4o, 32000 for GPT-4-turbo.
|
|
911
|
+
|
|
912
|
+
---
|
|
913
|
+
|
|
914
|
+
### Global Search
|
|
915
|
+
|
|
916
|
+
Global search answers broad questions ("Summarize all lease obligations", "What are the key financial terms across all documents?").
|
|
917
|
+
|
|
918
|
+
```yaml
|
|
919
|
+
graphrag:
|
|
920
|
+
globalSearch:
|
|
921
|
+
maxCommunities: 10
|
|
922
|
+
mapMaxTokens: 4000
|
|
923
|
+
reduceMaxTokens: 8000
|
|
924
|
+
|
|
925
|
+
knowledgePrompt: |
|
|
926
|
+
---Role---
|
|
927
|
+
You are a [domain] expert with deep knowledge of [document type].
|
|
928
|
+
|
|
929
|
+
---Domain Knowledge---
|
|
930
|
+
[Terminology, concepts, and interpretation rules for your domain]
|
|
931
|
+
|
|
932
|
+
---Goal---
|
|
933
|
+
Use this expertise to interpret the provided data accurately.
|
|
934
|
+
|
|
935
|
+
---Data---
|
|
936
|
+
{context_data}
|
|
937
|
+
|
|
938
|
+
mapPrompt: |
|
|
939
|
+
---Role---
|
|
940
|
+
You are a [domain] expert analyzing a community report to answer a question.
|
|
941
|
+
|
|
942
|
+
---Goal---
|
|
943
|
+
From this community report, extract:
|
|
944
|
+
1. A relevance score (0-100) for the question
|
|
945
|
+
2. Key points relevant to the answer
|
|
946
|
+
|
|
947
|
+
---Target response length and format---
|
|
948
|
+
{response_type}
|
|
949
|
+
|
|
950
|
+
---Community Report---
|
|
951
|
+
{context_data}
|
|
952
|
+
|
|
953
|
+
reducePrompt: |
|
|
954
|
+
---Role---
|
|
955
|
+
You are a [domain] expert synthesizing community analyses.
|
|
956
|
+
|
|
957
|
+
---Goal---
|
|
958
|
+
Combine the community analyses into a comprehensive answer:
|
|
959
|
+
1. Prioritize higher-scored communities
|
|
960
|
+
2. Include all specific values (amounts, dates, percentages)
|
|
961
|
+
3. Note contradictions or variations
|
|
962
|
+
4. State clearly when information is unavailable
|
|
963
|
+
|
|
964
|
+
---Target response length and format---
|
|
965
|
+
{response_type}
|
|
966
|
+
|
|
967
|
+
---Community Analyses---
|
|
968
|
+
{report_data}
|
|
969
|
+
```
|
|
970
|
+
|
|
971
|
+
- `maxCommunities: 10` — how many community reports to scan per query. Increase to 20–30 for large corpora.
|
|
972
|
+
- `mapMaxTokens: 4000` / `reduceMaxTokens: 8000` — token budgets for each phase. Increase if responses are being cut off.
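
The map and reduce phases can be pictured as "score each community report, then combine the best ones within a budget". The sketch below covers only the ranking step named in `reducePrompt` ("Prioritize higher-scored communities"). Treating the budget as an input-token limit, the whitespace token estimate, and the names `MapResult`, `rough_tokens` and `reduce_inputs` are all assumptions, not the pipeline's confirmed behavior.

```python
from dataclasses import dataclass


@dataclass
class MapResult:
    community: str
    score: int   # 0-100 relevance, as requested by mapPrompt
    points: str  # key points extracted from the community report


def rough_tokens(text: str) -> int:
    """Crude token estimate (whitespace words); a real tokenizer will differ."""
    return len(text.split())


def reduce_inputs(results: list[MapResult], budget_tokens: int) -> list[MapResult]:
    """Keep the highest-scoring map results that fit in the token budget."""
    ranked = sorted(
        (r for r in results if r.score > 0),
        key=lambda r: r.score,
        reverse=True,
    )
    kept, used = [], 0
    for result in ranked:
        cost = rough_tokens(result.points)
        if used + cost > budget_tokens:
            break
        kept.append(result)
        used += cost
    return kept


results = [
    MapResult("rent schedule", 80, "one two three"),
    MapResult("boilerplate", 0, "irrelevant"),
    MapResult("parties", 95, "four five"),
    MapResult("insurance", 40, "six seven eight nine"),
]
selected = reduce_inputs(results, budget_tokens=5)
print([r.community for r in selected])
```

If answers come back cut off, this is the kind of budget pressure that raising the token limits relieves.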
|
|
973
|
+
|
|
974
|
+
---
|
|
975
|
+
|
|
976
|
+
### DRIFT Search
|
|
977
|
+
|
|
978
|
+
DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) is an experimental search mode that iteratively refines queries.
|
|
979
|
+
|
|
980
|
+
```yaml
|
|
981
|
+
graphrag:
|
|
982
|
+
driftSearch:
|
|
983
|
+
enabled: false # experimental, disabled by default
|
|
984
|
+
prompt: |
|
|
985
|
+
...
|
|
986
|
+
reducePrompt: |
|
|
987
|
+
...
|
|
988
|
+
```
|
|
989
|
+
|
|
990
|
+
Enable only if you need theme discovery across large, varied document collections. For most use cases, local and global search are sufficient.
|
|
991
|
+
|
|
992
|
+
---
|
|
993
|
+
|
|
994
|
+
### Clustering & Cache
|
|
995
|
+
|
|
996
|
+
```yaml
|
|
997
|
+
graphrag:
|
|
998
|
+
clusterGraph:
|
|
999
|
+
maxClusterSize: 10 # max entities per cluster
|
|
1000
|
+
useLcc: true # use largest connected component
|
|
1001
|
+
seed: 42 # reproducibility seed
|
|
1002
|
+
|
|
1003
|
+
cache:
|
|
1004
|
+
enabled: true
|
|
1005
|
+
type: file # file | memory | none
|
|
1006
|
+
```
|
|
1007
|
+
|
|
1008
|
+
- `maxClusterSize: 10` — limits cluster size for community reports. Reduce to 5–7 for very large graphs with many communities.
|
|
1009
|
+
- `useLcc: true`: restricts clustering to the largest connected component of the graph, discarding entities that are not connected to it. Keep `true`.
|
|
1010
|
+
- `seed: 42` — for reproducible community detection across runs.
|
|
1011
|
+
- `cache.type: file`: caches LLM calls to disk, so re-runs can reuse stored responses instead of calling the API again. This dramatically speeds up re-runs and reduces API costs. Always keep enabled in production.
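
The idea behind a file cache can be sketched in a few lines: key each response by a hash of the model and prompt, and reuse the stored file on repeat calls. This is a minimal illustration, not the pipeline's cache format; `FileCache`, `get_or_call` and `fake_llm` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class FileCache:
    """Store one JSON file per (model, prompt) pair."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, model: str, prompt: str) -> Path:
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        return self.directory / f"{key}.json"

    def get_or_call(self, model: str, prompt: str, call: Callable[[str], str]) -> str:
        path = self._path(model, prompt)
        if path.exists():
            # Cache hit: no API call is made.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = call(prompt)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; records each invocation."""
    calls.append(prompt)
    return f"summary of: {prompt}"


cache = FileCache(Path(tempfile.mkdtemp()))
first = cache.get_or_call("gpt-4o-mini", "Describe TENANT", fake_llm)
second = cache.get_or_call("gpt-4o-mini", "Describe TENANT", fake_llm)
print(first == second, len(calls))
```

Because the key includes the prompt text, editing a prompt invalidates the affected entries.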
|
|
1012
|
+
|
|
1013
|
+
---
|
|
1014
|
+
|
|
1015
|
+
### maxGleanings
|
|
1016
|
+
|
|
1017
|
+
```yaml
|
|
1018
|
+
graphrag:
|
|
1019
|
+
entityExtraction:
|
|
1020
|
+
maxGleanings: 1 # 0 | 1 | 2 | 3
|
|
1021
|
+
```
|
|
1022
|
+
|
|
1023
|
+
Controls how many additional extraction passes the LLM performs on each chunk to find missed entities.
|
|
1024
|
+
|
|
1025
|
+
| Value | Cost | When to Use |
|
|
1026
|
+
|-------|------|-------------|
|
|
1027
|
+
| 0 | Minimal | High volume, cost-sensitive, documents with simple structure |
|
|
1028
|
+
| 1 | Low | **Default. Good balance for most documents** |
|
|
1029
|
+
| 2 | Medium | Complex documents where thoroughness matters (dense contracts) |
|
|
1030
|
+
| 3+ | High | Maximum extraction; only for critical documents |
|
|
1031
|
+
|
|
1032
|
+
Each gleaning pass costs additional API tokens. With `maxGleanings: 2`, every chunk gets one initial pass plus two gleaning passes, so for a 500-chunk document expect roughly 3× the API cost vs `maxGleanings: 0`.
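
The 3× figure follows from counting passes per chunk. The sketch below is a rough estimate only: it counts calls rather than tokens, and it ignores that later passes may resend earlier output. `extraction_calls` is a hypothetical helper.

```python
def extraction_calls(chunks: int, max_gleanings: int) -> int:
    """One initial extraction pass per chunk plus one call per gleaning pass."""
    return chunks * (1 + max_gleanings)


baseline = extraction_calls(500, 0)
with_two = extraction_calls(500, 2)
ratio = with_two / baseline
print(baseline, with_two, ratio)
```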
|
|
1033
|
+
|
|
1034
|
+
---
|
|
1035
|
+
|
|
1036
|
+
## API Keys
|
|
1037
|
+
|
|
1038
|
+
```yaml
|
|
1039
|
+
apiKeys:
|
|
1040
|
+
openai: ${OPENAI_API_KEY} # reads from environment variable
|
|
1041
|
+
# baseUrl: https://api.openai.com/v1 # optional: custom endpoint
|
|
1042
|
+
```
|
|
1043
|
+
|
|
1044
|
+
Always use environment variable syntax (`${VAR_NAME}`) rather than hardcoding keys in config files.
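
How `${VAR_NAME}` references might be resolved can be sketched as follows. The guide does not say how the pipeline expands them, so this is an assumption for illustration; `resolve_env` is a hypothetical helper that fails loudly when a variable is missing rather than passing an empty key along.

```python
import os
import re
from typing import Mapping, Optional

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in value with the matching environment variable."""
    source = os.environ if env is None else env

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in source:
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _VAR.sub(substitute, value)


fake_env = {"OPENAI_API_KEY": "sk-test"}
print(resolve_env("${OPENAI_API_KEY}", fake_env))
```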
|
|
1045
|
+
|
|
1046
|
+
**Custom endpoints** (`baseUrl`) allow you to point the pipeline at:
|
|
1047
|
+
- Azure OpenAI: `https://your-resource.openai.azure.com/`
|
|
1048
|
+
- Local LLMs (vLLM, Ollama with OpenAI compat): `http://localhost:8080/v1`
|
|
1049
|
+
- OpenRouter: `https://openrouter.ai/api/v1`
|
|
1050
|
+
|
|
1051
|
+
The `baseUrl` applies to all model calls: GraphRAG chat, embeddings, and Docling picture descriptions.
|
|
1052
|
+
|
|
1053
|
+
---
|
|
1054
|
+
|
|
1055
|
+
## Recommendations by Document Type
|
|
1056
|
+
|
|
1057
|
+
### Commercial Leases / Contracts
|
|
1058
|
+
|
|
1059
|
+
```yaml
|
|
1060
|
+
docling:
|
|
1061
|
+
layout:
|
|
1062
|
+
model: docling-layout-egret-large
|
|
1063
|
+
ocr:
|
|
1064
|
+
enabled: true
|
|
1065
|
+
engine: rapidocr
|
|
1066
|
+
backend: torch
|
|
1067
|
+
forceFullPageOcr: true
|
|
1068
|
+
tables:
|
|
1069
|
+
mode: accurate
|
|
1070
|
+
|
|
1071
|
+
chunking:
|
|
1072
|
+
strategy: docling_hybrid
|
|
1073
|
+
maxTokens: 1000
|
|
1074
|
+
overlapTokens: 250
|
|
1075
|
+
contextualize: true
|
|
1076
|
+
|
|
1077
|
+
graphrag:
|
|
1078
|
+
models:
|
|
1079
|
+
chatModel: gpt-4o-mini
|
|
1080
|
+
entityExtraction:
|
|
1081
|
+
maxGleanings: 1
|
|
1082
|
+
entityTypes: [domain-specific types as above]
|
|
1083
|
+
```
|
|
1084
|
+
|
|
1085
|
+
**Rationale:** Contracts have complex formatting and precise values that must not be missed. High overlap (250) because clauses often reference terms defined pages earlier. `maxGleanings: 1` for balance between cost and completeness.
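
The interaction between `maxTokens` and `overlapTokens` can be illustrated with a plain sliding window. This sketch shows the overlap arithmetic only: `docling_hybrid` is described in this guide as a distinct strategy and is not reproduced here, and `window_chunks` is a hypothetical helper.

```python
def window_chunks(tokens: list[str], max_tokens: int, overlap_tokens: int) -> list[list[str]]:
    """Split tokens into windows of max_tokens that share overlap_tokens with the previous window."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = [str(i) for i in range(10)]
chunks = window_chunks(tokens, max_tokens=4, overlap_tokens=1)
for chunk in chunks:
    print(chunk)
```

With `maxTokens: 1000` and `overlapTokens: 250`, each new window in this scheme advances by 750 tokens, so a quarter of every chunk repeats the end of the previous one.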
|
|
1086
|
+
|
|
1087
|
+
---
|
|
1088
|
+
|
|
1089
|
+
### Financial Reports / Investor Documents
|
|
1090
|
+
|
|
1091
|
+
```yaml
|
|
1092
|
+
docling:
|
|
1093
|
+
layout:
|
|
1094
|
+
model: docling-layout-egret-xlarge # complex tables, charts
|
|
1095
|
+
tables:
|
|
1096
|
+
mode: accurate
|
|
1097
|
+
pictures:
|
|
1098
|
+
enabled: true
|
|
1099
|
+
enableDescription: true
|
|
1100
|
+
descriptionModel: gpt-4o # charts need high accuracy
|
|
1101
|
+
|
|
1102
|
+
chunking:
|
|
1103
|
+
strategy: docling_hybrid
|
|
1104
|
+
maxTokens: 800
|
|
1105
|
+
overlapTokens: 150
|
|
1106
|
+
|
|
1107
|
+
graphrag:
|
|
1108
|
+
models:
|
|
1109
|
+
chatModel: gpt-4o # financial data needs accuracy
|
|
1110
|
+
entityExtraction:
|
|
1111
|
+
entityTypes:
|
|
1112
|
+
- COMPANY
|
|
1113
|
+
- FUND
|
|
1114
|
+
- REVENUE
|
|
1115
|
+
- EXPENSE
|
|
1116
|
+
- ASSET
|
|
1117
|
+
- VALUATION
|
|
1118
|
+
- FINANCIAL_PERIOD
|
|
1119
|
+
- RISK_FACTOR
|
|
1120
|
+
maxGleanings: 2
|
|
1121
|
+
```
|
|
1122
|
+
|
|
1123
|
+
**Rationale:** Financial reports have complex tables and charts. `docling-layout-egret-xlarge` gives the best table detection. Using `gpt-4o` for extraction reduces numeric errors. Smaller chunks (800) improve the precision of financial figure extraction.
|
|
1124
|
+
|
|
1125
|
+
---
|
|
1126
|
+
|
|
1127
|
+
### Scanned Documents / Poor Quality PDFs
|
|
1128
|
+
|
|
1129
|
+
```yaml
|
|
1130
|
+
docling:
|
|
1131
|
+
layout:
|
|
1132
|
+
model: docling-layout-egret-large
|
|
1133
|
+
ocr:
|
|
1134
|
+
enabled: true
|
|
1135
|
+
engine: rapidocr
|
|
1136
|
+
backend: torch
|
|
1137
|
+
textScore: 0.3 # lower threshold for poor quality
|
|
1138
|
+
forceFullPageOcr: true
|
|
1139
|
+
limits:
|
|
1140
|
+
documentTimeout: 600 # scanned docs take longer
|
|
1141
|
+
|
|
1142
|
+
chunking:
|
|
1143
|
+
strategy: docling_hybrid
|
|
1144
|
+
maxTokens: 1200
|
|
1145
|
+
overlapTokens: 300 # higher overlap for OCR errors at boundaries
|
|
1146
|
+
```
|
|
1147
|
+
|
|
1148
|
+
**Rationale:** Lower `textScore` (0.3) accepts more text even when confidence is lower — better than missing content on degraded scans. Higher timeout for processing. More overlap compensates for OCR errors at chunk boundaries.
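
The trade-off behind `textScore` can be shown with a toy filter, assuming the setting acts as a minimum confidence per recognized text box. The sample data and the `keep_confident` helper are hypothetical.

```python
def keep_confident(boxes: list[tuple[str, float]], text_score: float) -> list[str]:
    """Keep recognized text whose confidence is at least text_score."""
    return [text for text, confidence in boxes if confidence >= text_score]


# Hypothetical OCR output from a degraded scan: (text, confidence) pairs.
boxes = [
    ("Minimum Rent", 0.92),
    ("$291,060.00", 0.44),
    ("per annum", 0.35),
    ("~~|", 0.12),
]
default = keep_confident(boxes, 0.5)
lenient = keep_confident(boxes, 0.3)
print(default)
print(lenient)
```

In this made-up sample, the lower threshold recovers the rent figure that the default drops, while the noise fragment is rejected at both settings.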
|
|
1149
|
+
|
|
1150
|
+
---
|
|
1151
|
+
|
|
1152
|
+
### Multi-Language Documents
|
|
1153
|
+
|
|
1154
|
+
```yaml
|
|
1155
|
+
docling:
|
|
1156
|
+
ocr:
|
|
1157
|
+
enabled: true
|
|
1158
|
+
engine: easyocr
|
|
1159
|
+
languages: [en, fr, de] # list all languages present
|
|
1160
|
+
backend: torch
|
|
1161
|
+
|
|
1162
|
+
chunking:
|
|
1163
|
+
tokenizer: cl100k_base # works well across languages
|
|
1164
|
+
|
|
1165
|
+
graphrag:
|
|
1166
|
+
models:
|
|
1167
|
+
chatModel: gpt-4o # better multilingual than gpt-4o-mini
|
|
1168
|
+
```
|
|
1169
|
+
|
|
1170
|
+
**Rationale:** `easyocr` has the best multilingual OCR support. `gpt-4o` handles non-English entity extraction more reliably.
|
|
1171
|
+
|
|
1172
|
+
---
|
|
1173
|
+
|
|
1174
|
+
### Technical Manuals / Documentation
|
|
1175
|
+
|
|
1176
|
+
```yaml
|
|
1177
|
+
docling:
|
|
1178
|
+
layout:
|
|
1179
|
+
model: docling-layout-egret-medium # manuals have predictable structure
|
|
1180
|
+
pictures:
|
|
1181
|
+
enabled: true
|
|
1182
|
+
enableDescription: true
|
|
1183
|
+
descriptionModel: gpt-4o-mini
|
|
1184
|
+
descriptionPrompt: |
|
|
1185
|
+
Analyze this technical diagram or figure.
|
|
1186
|
+
List all labeled components and their connections.
|
|
1187
|
+
Describe flow directions, measurement values, and specifications.
|
|
1188
|
+
Include part numbers, error codes, and annotations.
|
|
1189
|
+
|
|
1190
|
+
chunking:
|
|
1191
|
+
strategy: docling_hybrid
|
|
1192
|
+
maxTokens: 1500 # technical sections can be longer
|
|
1193
|
+
overlapTokens: 100
|
|
1194
|
+
|
|
1195
|
+
graphrag:
|
|
1196
|
+
entityExtraction:
|
|
1197
|
+
entityTypes:
|
|
1198
|
+
- COMPONENT
|
|
1199
|
+
- API_ENDPOINT
|
|
1200
|
+
- PARAMETER
|
|
1201
|
+
- CONFIGURATION
|
|
1202
|
+
- ERROR_CODE
|
|
1203
|
+
- WORKFLOW
|
|
1204
|
+
- DEPENDENCY
|
|
1205
|
+
- VERSION
|
|
1206
|
+
```
|
|
1207
|
+
|
|
1208
|
+
---
|
|
1209
|
+
|
|
1210
|
+
### High-Volume Batch Processing (Speed Priority)
|
|
1211
|
+
|
|
1212
|
+
```yaml
|
|
1213
|
+
docling:
|
|
1214
|
+
layout:
|
|
1215
|
+
model: docling-layout-heron # fastest
|
|
1216
|
+
ocr:
|
|
1217
|
+
engine: rapidocr
|
|
1218
|
+
backend: onnx # CPU-optimized
|
|
1219
|
+
forceFullPageOcr: false # only OCR where needed
|
|
1220
|
+
pictures:
|
|
1221
|
+
enableDescription: false # skip for speed
|
|
1222
|
+
|
|
1223
|
+
chunking:
|
|
1224
|
+
strategy: docling_hybrid
|
|
1225
|
+
maxTokens: 1500 # fewer, larger chunks
|
|
1226
|
+
|
|
1227
|
+
graphrag:
|
|
1228
|
+
models:
|
|
1229
|
+
chatModel: gpt-4o-mini
|
|
1230
|
+
entityExtraction:
|
|
1231
|
+
maxGleanings: 0 # single pass only
|
|
1232
|
+
```
|
|
1233
|
+
|
|
1234
|
+
---
|
|
1235
|
+
|
|
1236
|
+
## Complete Example Configs
|
|
1237
|
+
|
|
1238
|
+
### Minimal Config (defaults only)
|
|
1239
|
+
|
|
1240
|
+
```yaml
|
|
1241
|
+
name: "My Pipeline"
|
|
1242
|
+
apiKeys:
|
|
1243
|
+
openai: ${OPENAI_API_KEY}
|
|
1244
|
+
```
|
|
1245
|
+
|
|
1246
|
+
---
|
|
1247
|
+
|
|
1248
|
+
### General Purpose Document Pipeline
|
|
1249
|
+
|
|
1250
|
+
```yaml
|
|
1251
|
+
name: "General Document Pipeline"
|
|
1252
|
+
description: "Balanced config for mixed document types"
|
|
1253
|
+
|
|
1254
|
+
docling:
|
|
1255
|
+
layout:
|
|
1256
|
+
model: docling-layout-egret-large
|
|
1257
|
+
ocr:
|
|
1258
|
+
enabled: true
|
|
1259
|
+
engine: rapidocr
|
|
1260
|
+
backend: torch
|
|
1261
|
+
languages: [en]
|
|
1262
|
+
textScore: 0.5
|
|
1263
|
+
forceFullPageOcr: true
|
|
1264
|
+
tables:
|
|
1265
|
+
enabled: true
|
|
1266
|
+
mode: accurate
|
|
1267
|
+
pictures:
|
|
1268
|
+
enabled: true
|
|
1269
|
+
enableDescription: true
|
|
1270
|
+
descriptionModel: gpt-4o-mini
|
|
1271
|
+
accelerator:
|
|
1272
|
+
device: auto
|
|
1273
|
+
numThreads: 4
|
|
1274
|
+
limits:
|
|
1275
|
+
documentTimeout: 300
|
|
1276
|
+
|
|
1277
|
+
chunking:
|
|
1278
|
+
strategy: docling_hybrid
|
|
1279
|
+
maxTokens: 1200
|
|
1280
|
+
overlapTokens: 200
|
|
1281
|
+
tokenizer: cl100k_base
|
|
1282
|
+
mergePeers: true
|
|
1283
|
+
contextualize: true
|
|
1284
|
+
output:
|
|
1285
|
+
format: text_files
|
|
1286
|
+
includeMetadataHeader: true
|
|
1287
|
+
metadata:
|
|
1288
|
+
includeHeadings: true
|
|
1289
|
+
includePageNumbers: true
|
|
1290
|
+
includePosition: true
|
|
1291
|
+
includeSource: true
|
|
1292
|
+
|
|
1293
|
+
graphrag:
|
|
1294
|
+
enabled: true
|
|
1295
|
+
models:
|
|
1296
|
+
chatModel: gpt-4o-mini
|
|
1297
|
+
embeddingModel: text-embedding-3-large
|
|
1298
|
+
temperature: 0
|
|
1299
|
+
maxTokens: 4096
|
|
1300
|
+
entityExtraction:
|
|
1301
|
+
entityTypes:
|
|
1302
|
+
- PERSON
|
|
1303
|
+
- ORGANIZATION
|
|
1304
|
+
- LOCATION
|
|
1305
|
+
- DATE
|
|
1306
|
+
- MONEY
|
|
1307
|
+
- DOCUMENT
|
|
1308
|
+
- OBLIGATION
|
|
1309
|
+
- CONDITION
|
|
1310
|
+
maxGleanings: 1
|
|
1311
|
+
summarizeDescriptions:
|
|
1312
|
+
maxLength: 500
|
|
1313
|
+
communities:
|
|
1314
|
+
algorithm: leiden
|
|
1315
|
+
resolution: 1.0
|
|
1316
|
+
minCommunitySize: 3
|
|
1317
|
+
communityReports:
|
|
1318
|
+
maxLength: 2000
|
|
1319
|
+
localSearch:
|
|
1320
|
+
topKEntities: 10
|
|
1321
|
+
topKRelationships: 10
|
|
1322
|
+
maxContextTokens: 12000
|
|
1323
|
+
globalSearch:
|
|
1324
|
+
maxCommunities: 10
|
|
1325
|
+
clusterGraph:
|
|
1326
|
+
maxClusterSize: 10
|
|
1327
|
+
useLcc: true
|
|
1328
|
+
seed: 42
|
|
1329
|
+
cache:
|
|
1330
|
+
enabled: true
|
|
1331
|
+
type: file
|
|
1332
|
+
|
|
1333
|
+
apiKeys:
|
|
1334
|
+
openai: ${OPENAI_API_KEY}
|
|
1335
|
+
```
|
|
1336
|
+
|
|
1337
|
+
---
|
|
1338
|
+
|
|
1339
|
+
### Lean Config (CPU server, cost-sensitive)
|
|
1340
|
+
|
|
1341
|
+
```yaml
|
|
1342
|
+
name: "Lean CPU Pipeline"
|
|
1343
|
+
|
|
1344
|
+
docling:
|
|
1345
|
+
layout:
|
|
1346
|
+
model: docling-layout-egret-medium
|
|
1347
|
+
ocr:
|
|
1348
|
+
enabled: true
|
|
1349
|
+
engine: rapidocr
|
|
1350
|
+
backend: onnx
|
|
1351
|
+
forceFullPageOcr: false
|
|
1352
|
+
pictures:
|
|
1353
|
+
enabled: false
|
|
1354
|
+
accelerator:
|
|
1355
|
+
device: cpu
|
|
1356
|
+
numThreads: 8
|
|
1357
|
+
|
|
1358
|
+
chunking:
|
|
1359
|
+
strategy: docling_hybrid
|
|
1360
|
+
maxTokens: 1500
|
|
1361
|
+
overlapTokens: 100
|
|
1362
|
+
|
|
1363
|
+
graphrag:
|
|
1364
|
+
enabled: true
|
|
1365
|
+
models:
|
|
1366
|
+
chatModel: gpt-4o-mini
|
|
1367
|
+
embeddingModel: text-embedding-3-small
|
|
1368
|
+
entityExtraction:
|
|
1369
|
+
maxGleanings: 0
|
|
1370
|
+
cache:
|
|
1371
|
+
enabled: true
|
|
1372
|
+
type: file
|
|
1373
|
+
|
|
1374
|
+
apiKeys:
|
|
1375
|
+
openai: ${OPENAI_API_KEY}
|
|
1376
|
+
```
|
|
1377
|
+
|
|
1378
|
+
---
|
|
1379
|
+
|
|
1380
|
+
*For schema reference, see: `packages/nest-doc-processing-api/src/schemas/config.schema.yaml`*
|
|
1381
|
+
*For a full annotated template, see: `packages/nest-doc-processing-worker/templates/config.default.yaml`*
|