@nestbox-ai/cli 1.0.59 → 1.0.61

@@ -0,0 +1,1381 @@
1
+ # Nestbox Document Pipeline — Configuration Guide
2
+
3
+ This guide explains every option in the pipeline configuration file (profile), with recommendations for each setting based on document type and use case.
4
+
5
+ The config file is a single YAML file that controls three stages of the pipeline:
6
+
7
+ 1. **Docling** — document extraction (PDF/DOCX → structured text, tables, images)
8
+ 2. **Chunking** — text segmentation for RAG
9
+ 3. **GraphRAG** — knowledge graph construction for semantic search and Q&A
10
+
11
+ ---
12
+
13
+ ## Table of Contents
14
+
15
+ - [Quick Start](#quick-start)
16
+ - [File Structure](#file-structure)
17
+ - [Docling — Document Extraction](#docling--document-extraction)
18
+   - [Layout Model](#layout-model)
19
+   - [OCR Engine](#ocr-engine)
20
+   - [Tables](#tables)
21
+   - [Pictures](#pictures)
22
+   - [Accelerator](#accelerator)
23
+   - [Limits](#limits)
24
+ - [Chunking — Text Segmentation](#chunking--text-segmentation)
25
+   - [Strategy](#strategy)
26
+   - [Token Sizes](#token-sizes)
27
+   - [Metadata](#metadata)
28
+ - [GraphRAG — Knowledge Graph](#graphrag--knowledge-graph)
29
+   - [Models](#models)
30
+   - [Entity Types](#entity-types)
31
+   - [Writing the Entity Extraction Prompt](#writing-the-entity-extraction-prompt)
32
+   - [Relationships](#relationships)
33
+   - [Summarize Descriptions](#summarize-descriptions)
34
+   - [Claim Extraction](#claim-extraction)
35
+   - [Community Detection](#community-detection)
36
+   - [Community Reports Prompt](#community-reports-prompt)
37
+   - [Local Search](#local-search)
38
+   - [Global Search](#global-search)
39
+   - [DRIFT Search](#drift-search)
40
+   - [Clustering & Cache](#clustering--cache)
41
+ - [API Keys](#api-keys)
42
+ - [Recommendations by Document Type](#recommendations-by-document-type)
43
+ - [Complete Example Configs](#complete-example-configs)
44
+
45
+ ---
46
+
47
+ ## Quick Start
48
+
49
+ The minimum valid config requires only a `name`:
50
+
51
+ ```yaml
52
+ name: "My Pipeline"
53
+ ```
54
+
55
+ Everything else defaults to sensible values. To customise for your domain, at minimum change:
56
+
57
+ 1. `graphrag.entityExtraction.entityTypes` — what concepts to extract
58
+ 2. `graphrag.entityExtraction.prompt` — instructions for the LLM
59
+ 3. `docling.layout.model` — based on your hardware and document quality
60
+ 4. `docling.ocr.engine` — whether documents are digital or scanned
61
+
62
+ ---
63
+
64
+ ## File Structure
65
+
66
+ ```yaml
67
+ name: "..." # required
68
+ description: "..." # optional
69
+
70
+ docling:
71
+   layout: ...
72
+   ocr: ...
73
+   tables: ...
74
+   pictures: ...
75
+   accelerator: ...
76
+   limits: ...
77
+
78
+ chunking:
79
+   strategy: ...
80
+   maxTokens: ...
81
+   overlapTokens: ...
82
+   tokenizer: ...
83
+   mergePeers: ...
84
+   contextualize: ...
85
+   output: ...
86
+   metadata: ...
87
+
88
+ graphrag:
89
+   enabled: true
90
+   models: ...
91
+   entityExtraction: ...
92
+   summarizeDescriptions: ...
93
+   claimExtraction: ...
94
+   communities: ...
95
+   communityReports: ...
96
+   embeddings: ...
97
+   localSearch: ...
98
+   globalSearch: ...
99
+   driftSearch: ...
100
+   clusterGraph: ...
101
+   cache: ...
102
+
103
+ apiKeys:
104
+   openai: ${OPENAI_API_KEY}
105
+ ```
106
+
107
+ ---
108
+
109
+ ## Docling — Document Extraction
110
+
111
+ Docling converts your source documents (PDF, DOCX, HTML, PPTX, Markdown) into structured JSON and Markdown with extracted tables and figures.
112
+
113
+ ### Layout Model
114
+
115
+ The layout model detects document structure: headings, paragraphs, tables, figures, columns.
116
+
117
+ ```yaml
118
+ docling:
119
+   layout:
120
+     model: docling-layout-egret-large # default
121
+     createOrphanClusters: true
122
+     keepEmptyClusters: true
123
+ ```
124
+
125
+ | Model | GPU Memory | Speed | Accuracy | When to Use |
126
+ |-------|-----------|-------|----------|-------------|
127
+ | `docling-layout-heron` | ~2 GB | Fastest | Good | High volume, simple documents (plain text PDFs, single-column articles) |
128
+ | `docling-layout-heron-101` | ~2 GB | Fast | Better | Simple documents where heron is insufficient |
129
+ | `docling-layout-egret-medium` | ~4 GB | Medium | High | Balanced choice for most office documents |
130
+ | `docling-layout-egret-large` | ~6 GB | Slower | Highest | **Default. Best for complex layouts: multi-column, mixed tables/text, legal docs** |
131
+ | `docling-layout-egret-xlarge` | ~10 GB | Slowest | Best | Dense academic papers, financial reports with complex tables, maximum accuracy needed |
132
+
133
+ **Recommendations by document type:**
134
+
135
+ - **Commercial leases, contracts** → `egret-large` (complex formatting, tables, multi-column)
136
+ - **Scanned old documents** → `egret-large` or `egret-xlarge` (needs best layout detection to assist OCR)
137
+ - **Simple text PDFs / reports** → `egret-medium` or `heron` (faster processing)
138
+ - **Academic papers with equations** → `egret-xlarge`
139
+ - **High-volume batch processing** → `heron` or `heron-101` (trade accuracy for throughput)
140
+
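As a rough illustration of the memory trade-off in the table above, the sketch below picks the largest layout model that fits a given GPU budget. The memory figures are the approximate values from the table; the helper name `largest_model_that_fits` is hypothetical and not part of the Nestbox CLI.

```python
from typing import Optional

# Approximate GPU memory per layout model in GB, copied from the table above.
LAYOUT_MODEL_GPU_GB = {
    "docling-layout-heron": 2,
    "docling-layout-heron-101": 2,
    "docling-layout-egret-medium": 4,
    "docling-layout-egret-large": 6,
    "docling-layout-egret-xlarge": 10,
}


def largest_model_that_fits(available_gb: float) -> Optional[str]:
    """Hypothetical helper: return the most memory-hungry model that fits.

    Returns None when even the smallest model does not fit.
    """
    fitting = [
        (gb, name)
        for name, gb in LAYOUT_MODEL_GPU_GB.items()
        if gb <= available_gb
    ]
    if not fitting:
        return None
    # max() compares memory first, then the name as a tie-breaker.
    return max(fitting)[1]


if __name__ == "__main__":
    print(largest_model_that_fits(8))
```

Memory is only one input; for high-volume batches a smaller model may still be the better choice even when a larger one fits.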
141
+ **Other layout options:**
142
+
143
+ - `createOrphanClusters: true` — groups floating text elements (headers, footers, captions) into clusters. Recommended: keep `true`.
144
+ - `keepEmptyClusters: true` — preserves empty structural elements for layout fidelity. Recommended: keep `true`.
145
+
146
+ ---
147
+
148
+ ### OCR Engine
149
+
150
+ OCR is used when documents contain scanned pages or images rather than selectable text.
151
+
152
+ ```yaml
153
+ docling:
154
+   ocr:
155
+     enabled: true
156
+     engine: rapidocr # rapidocr | tesseract | easyocr | mac
157
+     backend: torch # torch | onnx | cpu
158
+     languages: [en]
159
+     textScore: 0.5
160
+     forceFullPageOcr: true
161
+ ```
162
+
163
+ | Engine | Backend | Speed | Multi-Language | Best For |
164
+ |--------|---------|-------|---------------|----------|
165
+ | `rapidocr` | `torch` | Fast | Limited | **Default. GPU-accelerated, excellent for English documents** |
166
+ | `rapidocr` | `onnx` | Medium | Limited | CPU-only servers, Docker deployments without GPU |
167
+ | `tesseract` | `cpu` | Slow | Very good | Legacy systems, broad language support, well-known engine |
168
+ | `easyocr` | `torch` | Medium | Excellent | Documents with multiple languages, Asian scripts, mixed content |
169
+ | `mac` | native | Fast | Good | macOS development environments only |
170
+
171
+ **Recommendations:**
172
+
173
+ - **English business documents on GPU server** → `engine: rapidocr`, `backend: torch`
174
+ - **CPU-only deployment** → `engine: rapidocr`, `backend: onnx`
175
+ - **Documents with French, Spanish, German, etc.** → `engine: easyocr`, and set `languages` to match, e.g. `[en, fr]`
176
+ - **Japanese, Chinese, Arabic scripts** → `engine: easyocr`, add the relevant language codes
177
+ - **macOS development** → `engine: mac`
178
+
179
+ **Key parameters:**
180
+
181
+ - `textScore` (0–1): Confidence threshold for accepting detected text. Lower (0.3) = accept more text but more noise. Higher (0.7) = stricter. **Recommended: 0.5 for most documents, 0.3 for poor quality scans.**
182
+ - `forceFullPageOcr: true`: Process the entire page even when digital text is detected. Use when documents mix selectable text and scanned images. Set to `false` for pure digital PDFs to speed up processing.
183
+ - `languages`: ISO codes. Examples: `[en]`, `[en, fr]`, `[en, zh]`
184
+
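The effect of `textScore` can be pictured as a simple confidence filter. The sketch below is illustrative only: the detection tuples are invented sample data, and treating the threshold as inclusive is an assumption rather than documented engine behaviour.

```python
def filter_by_text_score(detections, text_score=0.5):
    """Keep OCR detections whose confidence meets the threshold.

    `detections` is a list of (text, confidence) pairs. This is a
    hypothetical stand-in for what the OCR engine does internally.
    """
    return [text for text, score in detections if score >= text_score]


# Invented sample data: a clean word, a noisy word, a borderline number.
sample = [("Lease", 0.92), ("Agreernent", 0.41), ("2026", 0.55)]

if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.7):
        print(threshold, filter_by_text_score(sample, threshold))
```

Lowering the threshold to 0.3 admits the misread word along with the good ones, which is the "more text but more noise" trade-off described above.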
185
+ **When to disable OCR:**
186
+
187
+ Set `enabled: false` only for fully digital PDFs with selectable text. This significantly speeds up processing.
188
+
189
+ ```yaml
190
+ ocr:
191
+   enabled: false # pure digital PDFs only
192
+ ```
193
+
194
+ ---
195
+
196
+ ### Tables
197
+
198
+ ```yaml
199
+ docling:
200
+   tables:
201
+     enabled: true
202
+     mode: accurate # accurate | fast
203
+     doCellMatching: true
204
+ ```
205
+
206
+ - `mode: accurate` — uses a more thorough table recognition algorithm. **Always recommended** unless processing speed is critical.
207
+ - `mode: fast` — quicker but may miss table structure in complex tables (merged cells, nested headers).
208
+ - `doCellMatching: true` — matches detected cells to the table grid structure. Keep `true` for structured output.
209
+
210
+ **When to use `fast`:** Simple tables, high-volume processing where table accuracy is secondary.
211
+ **When to use `accurate`:** Financial statements, lease schedules, data-heavy reports.
212
+
213
+ ---
214
+
215
+ ### Pictures
216
+
217
+ ```yaml
218
+ docling:
219
+   pictures:
220
+     enabled: true
221
+     enableClassification: true
222
+     enableDescription: true
223
+     descriptionProvider: openai # openai | local
224
+     descriptionModel: gpt-4o # gpt-4o | gpt-4o-mini
225
+     imagesScale: 2.0
226
+     descriptionPrompt: |
227
+       Describe this image...
228
+ ```
229
+
230
+ - `enabled: false` — skip all image processing (fastest, use when documents have no relevant images)
231
+ - `enableClassification` — classifies each image as chart, diagram, floor plan, photo, etc.
232
+ - `enableDescription` — uses a vision LLM to generate a text description of each image, making images searchable
233
+ - `descriptionModel`:
234
+   - `gpt-4o` — highest quality descriptions, more expensive
235
+   - `gpt-4o-mini` — good quality, significantly cheaper (recommended for most cases)
236
+ - `imagesScale` (0.1–4.0): Resolution multiplier for image extraction. Higher = better quality but larger files.
237
+   - `1.0` — original resolution
238
+   - `2.0` — **default, good balance**
239
+   - `3.0–4.0` — for documents with fine details (architectural drawings, technical schematics)
240
+
241
+ **Custom description prompt:**
242
+
243
+ The `descriptionPrompt` tells the vision model how to describe images. Write it based on what images appear in your documents:
244
+
245
+ ```yaml
246
+ # For property/real estate documents
247
+ descriptionPrompt: |
248
+   Analyze this image from a real estate document.
249
+   If it is a floor plan: list all rooms with labels and measurements.
250
+   If it is a chart: extract all data points, labels, and axis values.
251
+   If it is a photo: describe the property features visible.
252
+   Include all visible text, numbers, and dimensions.
253
+
254
+ # For financial documents
255
+ descriptionPrompt: |
256
+   Analyze this image from a financial report.
257
+   If it is a chart or graph: extract all data values, axis labels, legend entries, and trends.
258
+   If it is a table rendered as image: transcribe all cell values.
259
+   Be precise with all numbers and percentages.
260
+
261
+ # For technical manuals
262
+ descriptionPrompt: |
263
+   Analyze this technical diagram or figure.
264
+   List all labeled components and their connections.
265
+   Describe flow directions, measurements, and specifications.
266
+   Include part numbers and annotations if visible.
267
+ ```
268
+
269
+ ---
270
+
271
+ ### Accelerator
272
+
273
+ ```yaml
274
+ docling:
275
+   accelerator:
276
+     device: auto # auto | cpu | cuda | mps
277
+     numThreads: 4
278
+     cudaUseFlashAttention2: false
279
+ ```
280
+
281
+ - `device: auto` — auto-detects the best available hardware (recommended)
282
+ - `device: cuda` — force NVIDIA GPU
283
+ - `device: mps` — Apple Silicon GPU (M1/M2/M3)
284
+ - `device: cpu` — force CPU (slow, use only if no GPU available)
285
+ - `numThreads` — CPU threads for parallel processing (1–32). Match to your server's core count.
286
+ - `cudaUseFlashAttention2: true` — enables Flash Attention 2 optimization on NVIDIA GPUs (A100, H100, RTX 3090+). Leave `false` for older GPUs.
287
+
288
+ ---
289
+
290
+ ### Limits
291
+
292
+ ```yaml
293
+ docling:
294
+   limits:
295
+     documentTimeout: 300 # seconds per document
296
+     maxPages: 100 # optional: skip pages beyond this
297
+     maxFileSize: 104857600 # optional: bytes (100MB)
298
+ ```
299
+
300
+ - `documentTimeout` — maximum seconds to spend on one document (60–3600). **For large PDFs (100+ pages), increase to 600–900.**
301
+ - `maxPages` — omit to process all pages; set a limit to cap processing costs
302
+ - `maxFileSize` — omit for no limit; useful to reject unexpectedly large uploads
303
+
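The limits above can be sanity-checked before a run. The sketch below assumes only the ranges stated in this section (60–3600 seconds for `documentTimeout`); the function names are hypothetical and the CLI may validate differently. It also shows where the `104857600` value comes from: 100 × 1024 × 1024 bytes.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes, e.g. for maxFileSize."""
    return mib * 1024 * 1024


def check_limits(document_timeout, max_pages=None, max_file_size=None):
    """Hypothetical pre-flight check; returns a list of problem strings."""
    problems = []
    if not 60 <= document_timeout <= 3600:
        problems.append("documentTimeout must be between 60 and 3600 seconds")
    # maxPages and maxFileSize are optional; None means "no limit".
    if max_pages is not None and max_pages < 1:
        problems.append("maxPages must be at least 1 when set")
    if max_file_size is not None and max_file_size < 1:
        problems.append("maxFileSize must be at least 1 byte when set")
    return problems


if __name__ == "__main__":
    print(mib_to_bytes(100))
    print(check_limits(300, max_pages=100, max_file_size=mib_to_bytes(100)))
```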
304
+ ---
305
+
306
+ ## Chunking — Text Segmentation
307
+
308
+ Chunking splits extracted text into pieces suitable for embedding and GraphRAG.
309
+
310
+ ### Strategy
311
+
312
+ ```yaml
313
+ chunking:
314
+   strategy: docling_hybrid # docling_hybrid | sentence | paragraph | fixed
315
+ ```
316
+
317
+ | Strategy | How It Works | Best For |
318
+ |----------|-------------|----------|
319
+ | `docling_hybrid` | Respects document structure (headings, sections, tables). Splits at natural boundaries. | **Default. Best for structured documents: contracts, reports, manuals** |
320
+ | `sentence` | Splits on sentence boundaries | Narrative text, articles, news |
321
+ | `paragraph` | Splits on paragraph breaks | General documents without clear section structure |
322
+ | `fixed` | Fixed token windows | When structure is irrelevant; simple text corpora |
323
+
324
+ **Use `docling_hybrid` in almost all cases.** It understands the document's original structure and produces semantically coherent chunks that improve retrieval quality significantly.
325
+
326
+ ---
327
+
328
+ ### Token Sizes
329
+
330
+ ```yaml
331
+ chunking:
332
+   maxTokens: 1200 # tokens per chunk (100–8000)
333
+   overlapTokens: 200 # token overlap between chunks (0–1000)
334
+   tokenizer: cl100k_base
335
+   mergePeers: true
336
+   contextualize: true
337
+ ```
338
+
339
+ **maxTokens recommendations:**
340
+
341
+ | Use Case | Recommended maxTokens | Reasoning |
342
+ |----------|----------------------|-----------|
343
+ | GraphRAG (default) | **1200** | Optimal balance for entity extraction. Smaller = more precise entities but less context |
344
+ | Dense financial/legal | 800–1000 | Shorter chunks improve extraction of specific values |
345
+ | Long narrative text | 1500–2000 | Preserves more context per chunk |
346
+ | Very short documents | 400–600 | Avoid splitting small sections |
347
+ | Simple Q&A retrieval | 512 | Fast, precise retrieval |
348
+
349
+ **overlapTokens recommendations:**
350
+
351
+ | Document Type | Recommended Overlap |
352
+ |--------------|--------------------|
353
+ | Contracts (clauses span sections) | 200–300 |
354
+ | Technical manuals | 100–200 |
355
+ | Plain text / articles | 50–100 |
356
+ | Self-contained sections (reports) | 0–100 |
357
+
358
+ The overlap ensures that context at the boundary between chunks isn't lost. **200 is a safe default.**
359
+
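To see how `maxTokens` and `overlapTokens` interact, the sketch below computes window boundaries for plain fixed-size chunking. It is a simplified model: `docling_hybrid` splits at structural boundaries, so real chunk edges will differ, and `chunk_spans` is a hypothetical name.

```python
def chunk_spans(total_tokens, max_tokens=1200, overlap_tokens=200):
    """Return (start, end) token offsets for fixed windows with overlap.

    Each window starts (max_tokens - overlap_tokens) tokens after the
    previous one, so neighbouring windows share `overlap_tokens` tokens.
    """
    if overlap_tokens >= max_tokens:
        raise ValueError("overlapTokens must be smaller than maxTokens")
    stride = max_tokens - overlap_tokens
    spans = []
    start = 0
    while start < total_tokens:
        end = min(start + max_tokens, total_tokens)
        spans.append((start, end))
        if end == total_tokens:
            break
        start += stride
    return spans


if __name__ == "__main__":
    print(chunk_spans(3000))
```

With the defaults, a 3000-token document yields three windows, and every boundary region is covered by two chunks. Raising the overlap shrinks the stride, so chunk count and indexing cost grow.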
360
+ **Other options:**
361
+
362
+ - `tokenizer: cl100k_base` — use for GPT-4 / text-embedding-3 family (default, recommended)
363
+ - `mergePeers: true` — merges adjacent small chunks from the same section before splitting. Produces cleaner chunks. Keep `true`.
364
+ - `contextualize: true` — prepends the section heading path to each chunk (e.g., "Article 5 > Payment Terms > Section 5.2"). Dramatically improves retrieval. **Always keep `true`.**
365
+
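The idea behind `contextualize` can be sketched in a few lines. The separator and layout below are assumptions chosen to match the "Article 5 > Payment Terms > Section 5.2" example; the exact format the pipeline writes may differ.

```python
def contextualize(chunk_text, heading_path):
    """Prepend the section heading path to a chunk (hypothetical format)."""
    if not heading_path:
        return chunk_text
    return " > ".join(heading_path) + "\n\n" + chunk_text


if __name__ == "__main__":
    print(contextualize(
        "Rent is payable on the first day of each month.",
        ["Article 5", "Payment Terms", "Section 5.2"],
    ))
```

A chunk that only says "payable on the first day of each month" becomes retrievable for queries about payment terms once the heading path travels with it.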
366
+ ---
367
+
368
+ ### Metadata
369
+
370
+ ```yaml
371
+ chunking:
372
+   output:
373
+     format: text_files # text_files | json
374
+     includeMetadataHeader: true
375
+
376
+   metadata:
377
+     includeHeadings: true
378
+     includePageNumbers: true
379
+     includePosition: true
380
+     includeSource: true
381
+ ```
382
+
383
+ - `format: text_files` — required for GraphRAG (one text file per chunk). Use `json` for API/custom consumption.
384
+ - `includeMetadataHeader: true` — adds a metadata block at the top of each chunk (source file, page, headings). Keep `true` for GraphRAG, since it improves retrieval context.
385
+ - Keep all `metadata` flags `true`. They add provenance information used in citations and search results.
386
+
387
+ ---
388
+
389
+ ## GraphRAG — Knowledge Graph
390
+
391
+ GraphRAG builds a knowledge graph from your chunks using an LLM to extract entities, relationships, and community summaries. This enables both precise entity-level retrieval (local search) and broad thematic analysis (global search).
392
+
393
+ ### Models
394
+
395
+ ```yaml
396
+ graphrag:
397
+   models:
398
+     chatModel: gpt-4o-mini # for entity extraction + summarization
399
+     embeddingModel: text-embedding-3-large
400
+     temperature: 0
401
+     maxTokens: 4096
402
+     embeddingBatchSize: 16
403
+ ```
404
+
405
+ - `chatModel`: The LLM used for all extraction and summarization.
406
+   - `gpt-4o-mini` — **recommended default**. Good accuracy, lower cost, sufficient for most domains.
407
+   - `gpt-4o` — higher accuracy for complex extractions, ambiguous entities, or demanding domains. 5–10x more expensive.
408
+   - `gpt-4-turbo` — alternative for large context requirements.
409
+ - `embeddingModel`: Used for vector embeddings.
410
+   - `text-embedding-3-large` — **recommended**. 3072 dimensions, best quality.
411
+   - `text-embedding-3-small` — cheaper, 1536 dimensions. Use for large corpora on tight budget.
412
+ - `temperature: 0` — always use 0 for extraction tasks. Higher temperature introduces inconsistency in structured outputs.
413
+ - `maxTokens: 4096` — response token limit. Increase to 8192 if entity lists are being truncated.
414
+ - `embeddingBatchSize` — how many texts to embed per API call. Increase to 32–64 on fast connections to speed up indexing.
415
+
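The effect of `embeddingBatchSize` on the number of embedding API calls is simple arithmetic, sketched below. The helper names are hypothetical; no real API is called.

```python
import math


def batches(texts, batch_size=16):
    """Split a list of texts into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


def api_calls_needed(num_texts, batch_size=16):
    """Number of embedding requests for num_texts inputs."""
    return math.ceil(num_texts / batch_size)


if __name__ == "__main__":
    chunks = [f"chunk {i}" for i in range(100)]
    print(len(batches(chunks, 16)), api_calls_needed(100, 64))
```

For 100 chunks, a batch size of 16 needs 7 requests while 64 needs 2, which is where the indexing speed-up comes from.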
416
+ ---
417
+
418
+ ### Entity Types
419
+
420
+ Entity types define what concepts the LLM should extract from your documents. **This is the most impactful configuration decision.**
421
+
422
+ ```yaml
423
+ graphrag:
424
+   entityExtraction:
425
+     entityTypes:
426
+       - PERSON
427
+       - ORGANIZATION
428
+       - LOCATION
429
+       ...
430
+ ```
431
+
432
+ **Rules for defining entity types:**
433
+
434
+ 1. **Use domain-specific types** — generic types (PERSON, ORGANIZATION) produce weaker graphs than domain-specific ones (LANDLORD, TENANT, PREMISES)
435
+ 2. **Keep types UPPER_CASE** — this is the expected format
436
+ 3. **Use 8–25 types** — too few miss distinctions; too many confuse the LLM
437
+ 4. **Name types for what they represent, not what they contain** — `MINIMUM_RENT` is better than `FINANCIAL_VALUE`
438
+ 5. **Include relational types** (who the parties are) alongside content types (what the obligations are)
439
+
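The mechanical rules above (UPPER_CASE names, 8–25 types) can be checked automatically before a long indexing run. The sketch below is a hypothetical lint helper, not something the CLI provides; the duplicate check is an extra assumption beyond the listed rules.

```python
import re

# UPPER_CASE with optional underscores and digits, e.g. MINIMUM_RENT.
UPPER_CASE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


def check_entity_types(types, min_count=8, max_count=25):
    """Return a list of problems found in an entityTypes list."""
    problems = []
    if not min_count <= len(types) <= max_count:
        problems.append(
            f"expected {min_count}-{max_count} types, found {len(types)}"
        )
    for name in types:
        if not UPPER_CASE.match(name):
            problems.append(f"{name!r} is not UPPER_CASE")
    seen = set()
    for name in types:
        if name in seen:
            problems.append(f"{name!r} is listed more than once")
        seen.add(name)
    return problems


if __name__ == "__main__":
    print(check_entity_types(["Person", "ORGANIZATION"]))
```

Rules 1, 4 and 5 are judgment calls about the domain and cannot be linted this way.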
440
+ **Examples by domain:**
441
+
442
+ ```yaml
443
+ # Commercial Leases
444
+ entityTypes:
445
+ - LANDLORD
446
+ - TENANT
447
+ - GUARANTOR
448
+ - PREMISES
449
+ - BUILDING
450
+ - MINIMUM_RENT
451
+ - ADDITIONAL_RENT
452
+ - SECURITY_DEPOSIT
453
+ - TERM
454
+ - COMMENCEMENT_DATE
455
+ - EXPIRY_DATE
456
+ - EXTENSION_OPTION
457
+ - TERMINATION_RIGHT
458
+ - INSURANCE_REQUIREMENT
459
+ - MAINTENANCE_OBLIGATION
460
+ - USE_RESTRICTION
461
+ - DEFAULT
462
+
463
+ # Financial Reports / Investment Documents
464
+ entityTypes:
465
+ - COMPANY
466
+ - FUND
467
+ - INVESTOR
468
+ - ASSET
469
+ - REVENUE
470
+ - EXPENSE
471
+ - PROFIT
472
+ - VALUATION
473
+ - RISK_FACTOR
474
+ - FINANCIAL_PERIOD
475
+ - PROJECTION
476
+ - REGULATORY_REQUIREMENT
477
+
478
+ # Medical / Clinical Documents
479
+ entityTypes:
480
+ - PATIENT
481
+ - DIAGNOSIS
482
+ - MEDICATION
483
+ - DOSAGE
484
+ - TREATMENT
485
+ - PROCEDURE
486
+ - PHYSICIAN
487
+ - INSTITUTION
488
+ - OUTCOME
489
+ - ADVERSE_EVENT
490
+ - TRIAL_PHASE
491
+ - CONTRAINDICATION
492
+
493
+ # Software / Technical Documentation
494
+ entityTypes:
495
+ - SERVICE
496
+ - API_ENDPOINT
497
+ - PARAMETER
498
+ - RETURN_VALUE
499
+ - ERROR_CODE
500
+ - DEPENDENCY
501
+ - VERSION
502
+ - CONFIGURATION
503
+ - WORKFLOW
504
+ - PERMISSION
505
+ - DATA_MODEL
506
+ ```
507
+
508
+ ---
509
+
510
+ ### Writing the Entity Extraction Prompt
511
+
512
+ The entity extraction prompt is the most important customisation. A well-written prompt dramatically improves the quality and consistency of extracted entities and relationships.
513
+
514
+ **Structure of an effective prompt:**
515
+
516
+ ```
517
+ -Goal-
518
+ Brief description of the extraction task and document domain.
519
+
520
+ -Entity Types-
521
+ List allowed types. State clearly that NO other types are allowed.
522
+
523
+ Type Definitions:
524
+ - TYPE_NAME: what it means, what to include/exclude
525
+
526
+ -Steps-
527
+ 1. Entity extraction format
528
+ 2. Relationship extraction format
529
+ 3. Delimiter instructions
530
+
531
+ IMPORTANT: Rules to prevent common mistakes.
532
+
533
+ -Examples-
534
+ ######################
535
+ [3+ examples showing input text → expected output]
536
+
537
+ -Real Data-
538
+ ######################
539
+ entity_types: [...]
540
+ text: {input_text}
541
+ ######################
542
+ output:
543
+ ```
544
+
545
+ **Required placeholders** (GraphRAG injects these automatically):
546
+
547
+ | Placeholder | Description |
548
+ |-------------|-------------|
549
+ | `{input_text}` | The actual chunk text to process |
550
+ | `{tuple_delimiter}` | Separator between fields within a record |
551
+ | `{record_delimiter}` | Separator between records |
552
+ | `{completion_delimiter}` | End-of-output marker |
553
+
554
+ **Entity output format:**
555
+ ```
556
+ ("entity"{tuple_delimiter}<NAME>{tuple_delimiter}<TYPE>{tuple_delimiter}<DESCRIPTION>)
557
+ ```
558
+
559
+ **Relationship output format:**
560
+ ```
561
+ ("relationship"{tuple_delimiter}<SOURCE>{tuple_delimiter}<TARGET>{tuple_delimiter}<DESCRIPTION>{tuple_delimiter}<STRENGTH>)
562
+ ```
563
+
564
+ - Relationship strength: integer 1–10 (10 = core to the document's purpose)
565
+
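To make the two record formats concrete, the sketch below parses a small piece of model output into entities and relationships. The delimiter values (`<|>`, `##`, `<|COMPLETE|>`) are assumptions based on values commonly used with GraphRAG; the pipeline injects its own, and `parse_records` is a hypothetical helper rather than part of any library.

```python
# Assumed delimiter values; the real ones are injected by the pipeline.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_records(raw):
    """Parse extraction output into (entities, relationships) lists."""
    body = raw.split(COMPLETION_DELIMITER)[0]
    entities, relationships = [], []
    for record in body.split(RECORD_DELIMITER):
        record = record.strip()
        if not record:
            continue
        # Strip only the outer parentheses; descriptions may contain their own.
        if record.startswith("(") and record.endswith(")"):
            record = record[1:-1]
        fields = [f.strip() for f in record.split(TUPLE_DELIMITER)]
        kind = fields[0].strip('"')
        if kind == "entity" and len(fields) == 4:
            entities.append(
                {"name": fields[1], "type": fields[2], "description": fields[3]}
            )
        elif kind == "relationship" and len(fields) == 5:
            relationships.append(
                {
                    "source": fields[1],
                    "target": fields[2],
                    "description": fields[3],
                    "strength": int(fields[4]),
                }
            )
    return entities, relationships


SAMPLE = (
    '("entity"<|>5 YEAR TERM<|>TERM<|>Initial lease term of five years)\n##\n'
    '("entity"<|>JANUARY 15, 2026<|>COMMENCEMENT_DATE<|>Date the term commences)\n##\n'
    '("relationship"<|>5 YEAR TERM<|>JANUARY 15, 2026<|>Term begins on this date<|>9)\n'
    "<|COMPLETE|>"
)

if __name__ == "__main__":
    ents, rels = parse_records(SAMPLE)
    print(len(ents), len(rels), rels[0]["strength"])
```

Records with the wrong field count are silently skipped here; a stricter parser might log them, since malformed records usually point to a prompt problem.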
566
+ ---
567
+
568
+ **Full example prompt — Commercial Leases:**
569
+
570
+ ```yaml
571
+ graphrag:
572
+ entityExtraction:
573
+ entityTypes:
574
+ - LANDLORD
575
+ - TENANT
576
+ - GUARANTOR
577
+ - PREMISES
578
+ - BUILDING
579
+ - MINIMUM_RENT
580
+ - ADDITIONAL_RENT
581
+ - SECURITY_DEPOSIT
582
+ - TERM
583
+ - COMMENCEMENT_DATE
584
+ - EXPIRY_DATE
585
+ - EXTENSION_OPTION
586
+ - TERMINATION_RIGHT
587
+ - NOTICE_PERIOD
588
+ - INSURANCE_REQUIREMENT
589
+ - MAINTENANCE_OBLIGATION
590
+ - USE_RESTRICTION
591
+ - DEFAULT
592
+
593
+ maxGleanings: 1
594
+
595
+ prompt: |
596
+ -Goal-
597
+ Extract entities and relationships from commercial lease documents. Always include specific values (dollar amounts, dates, percentages, square footage) directly in entity names.
598
+
599
+ -Entity Types-
600
+ You MUST use ONLY these entity types:
601
+ [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
602
+
603
+ Type Definitions:
604
+ - LANDLORD: The property owner or lessor granting the lease
605
+ - TENANT: The lessee or occupant paying rent
606
+ - GUARANTOR: Party (often parent company) guaranteeing the tenant's obligations
607
+ - PREMISES: The specific leased space (include address, unit, square footage)
608
+ - BUILDING: The building containing the premises
609
+ - MINIMUM_RENT: Base rent — include $/sqft AND total annual/monthly amounts
610
+ - ADDITIONAL_RENT: Operating costs, taxes, utilities passed to tenant beyond base rent
611
+ - SECURITY_DEPOSIT: Upfront deposit securing tenant obligations (include exact amount)
612
+ - TERM: Total lease duration (e.g., "5 YEAR TERM")
613
+ - COMMENCEMENT_DATE: When lease term officially starts
614
+ - EXPIRY_DATE: When lease ends
615
+ - EXTENSION_OPTION: Right to extend — include number of options and duration
616
+ - TERMINATION_RIGHT: Right to end lease early — include conditions
617
+ - NOTICE_PERIOD: Required advance notice for any action
618
+ - INSURANCE_REQUIREMENT: Required coverage types and minimum amounts
619
+ - MAINTENANCE_OBLIGATION: Who is responsible for repairs and what
620
+ - USE_RESTRICTION: Permitted or prohibited uses of the premises
621
+ - DEFAULT: Events constituting a breach of the lease
622
+
623
+ -Steps-
624
+ 1. Identify all entities. For each:
625
+ - entity_name: Descriptive WITH specific values (e.g., "MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2")
626
+ - entity_type: MUST be one of the types listed above
627
+ - entity_description: Complete details including all dollar amounts, dates, percentages, conditions
628
+ Format: ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
629
+
630
+ 2. Identify all relationships. For each:
631
+ - source_entity: Entity name from step 1
632
+ - target_entity: Entity name from step 1
633
+ - relationship_description: How they relate or interact
634
+ - relationship_strength: 1-10 (10 = core lease term like rent or parties; 1 = minor reference)
635
+ Format: ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
636
+
637
+ 3. Use {record_delimiter} between records. End with {completion_delimiter}.
638
+
639
+ IMPORTANT: Never leave entity_type blank. If an entity doesn't perfectly match a type, use the closest one. Never invent new types.
640
+
641
+ -Examples-
642
+ ######################
643
+
644
+ Example 1:
645
+
646
+ entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
647
+ text:
648
+ The Tenant, EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN, agrees to lease the Premises from the Landlord, YORKVILLE OFFICE RETAIL CORPORATION. The Rentable Area is approximately 2,156 square feet at 135 Yorkville Avenue, Units 2 and 3. The Term is five (5) years commencing January 15, 2026.
649
+ ------------------------
650
+ output:
651
+ ("entity"{tuple_delimiter}EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN{tuple_delimiter}TENANT{tuple_delimiter}Tenant corporation operating as Bang & Olufsen, lessee under the lease)
652
+ {record_delimiter}
653
+ ("entity"{tuple_delimiter}YORKVILLE OFFICE RETAIL CORPORATION{tuple_delimiter}LANDLORD{tuple_delimiter}Landlord and property owner, lessor of 135 Yorkville Avenue)
654
+ {record_delimiter}
655
+ ("entity"{tuple_delimiter}135 YORKVILLE UNITS 2-3 - 2,156 SQFT{tuple_delimiter}PREMISES{tuple_delimiter}Commercial retail premises at 135 Yorkville Avenue, Units 2 and 3, approximately 2,156 square feet rentable area)
656
+ {record_delimiter}
657
+ ("entity"{tuple_delimiter}5 YEAR TERM{tuple_delimiter}TERM{tuple_delimiter}Initial lease term of five years)
658
+ {record_delimiter}
659
+ ("entity"{tuple_delimiter}JANUARY 15, 2026{tuple_delimiter}COMMENCEMENT_DATE{tuple_delimiter}Date when the lease term officially commences)
660
+ {record_delimiter}
661
+ ("relationship"{tuple_delimiter}EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN{tuple_delimiter}YORKVILLE OFFICE RETAIL CORPORATION{tuple_delimiter}Tenant leases premises from Landlord under this commercial lease{tuple_delimiter}10)
662
+ {record_delimiter}
663
+ ("relationship"{tuple_delimiter}EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN{tuple_delimiter}135 YORKVILLE UNITS 2-3 - 2,156 SQFT{tuple_delimiter}Tenant occupies and leases this premises for retail operations{tuple_delimiter}10)
664
+ {record_delimiter}
665
+ ("relationship"{tuple_delimiter}5 YEAR TERM{tuple_delimiter}JANUARY 15, 2026{tuple_delimiter}Lease term of five years begins on this commencement date{tuple_delimiter}9)
666
+ {completion_delimiter}
667
+ #############################
668
+
669
+ Example 2:
670
+
671
+ entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
672
+ text:
673
+ Minimum Rent for Years 1 and 2 is $135.00 per square foot per annum ($291,060.00 annually, $24,255.00 monthly). Year 3 increases to $140.00 per square foot ($301,840.00 annually). Years 4 and 5 are $145.00 per square foot ($312,620.00 annually). All amounts are plus HST. The Security Deposit is $71,464.22 inclusive of HST.
674
+ ------------------------
675
+ output:
676
+ ("entity"{tuple_delimiter}MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2{tuple_delimiter}MINIMUM_RENT{tuple_delimiter}Base rent for Years 1-2: $135.00 per square foot per annum, totaling $291,060.00 annually ($24,255.00 monthly) plus HST)
677
+ {record_delimiter}
678
+ ("entity"{tuple_delimiter}MINIMUM RENT $140/SQFT - $301,840/YEAR - YEAR 3{tuple_delimiter}MINIMUM_RENT{tuple_delimiter}Base rent for Year 3: $140.00 per square foot per annum, totaling $301,840.00 annually plus HST)
679
+ {record_delimiter}
680
+ ("entity"{tuple_delimiter}MINIMUM RENT $145/SQFT - $312,620/YEAR - YEARS 4-5{tuple_delimiter}MINIMUM_RENT{tuple_delimiter}Base rent for Years 4-5: $145.00 per square foot per annum, totaling $312,620.00 annually plus HST)
681
+ {record_delimiter}
682
+ ("entity"{tuple_delimiter}SECURITY DEPOSIT $71,464.22 INCLUDING HST{tuple_delimiter}SECURITY_DEPOSIT{tuple_delimiter}Security deposit of $71,464.22 inclusive of HST, held by Landlord to secure Tenant's obligations)
683
+ {record_delimiter}
684
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2{tuple_delimiter}Tenant pays this base rent during Years 1 and 2 of the lease{tuple_delimiter}10)
685
+ {record_delimiter}
686
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}MINIMUM RENT $140/SQFT - $301,840/YEAR - YEAR 3{tuple_delimiter}Tenant pays this escalated base rent during Year 3{tuple_delimiter}10)
687
+ {record_delimiter}
688
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}MINIMUM RENT $145/SQFT - $312,620/YEAR - YEARS 4-5{tuple_delimiter}Tenant pays this escalated base rent during Years 4 and 5{tuple_delimiter}10)
689
+ {record_delimiter}
690
+ ("relationship"{tuple_delimiter}MINIMUM RENT $135/SQFT - $291,060/YEAR - YEARS 1-2{tuple_delimiter}MINIMUM RENT $140/SQFT - $301,840/YEAR - YEAR 3{tuple_delimiter}Rent escalates by $5/sqft from Years 1-2 to Year 3{tuple_delimiter}8)
691
+ {record_delimiter}
692
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}SECURITY DEPOSIT $71,464.22 INCLUDING HST{tuple_delimiter}Tenant provides deposit to secure lease obligations{tuple_delimiter}9)
693
+ {completion_delimiter}
694
+ #############################
695
+
696
+ Example 3:
697
+
698
+ entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
699
+ text:
700
+ The Tenant shall maintain comprehensive general liability insurance of not less than $5,000,000 per occurrence. The Tenant has two (2) options to extend the Term, each for five (5) years, provided 180 days' prior written notice is given. The Premises may only be used for the retail sale of luxury consumer electronics and related accessories.
701
+ ------------------------
702
+ output:
703
+ ("entity"{tuple_delimiter}LIABILITY INSURANCE - $5,000,000 PER OCCURRENCE{tuple_delimiter}INSURANCE_REQUIREMENT{tuple_delimiter}Tenant must maintain comprehensive general liability insurance with minimum coverage of $5,000,000 per occurrence)
704
+ {record_delimiter}
705
+ ("entity"{tuple_delimiter}TWO 5-YEAR EXTENSION OPTIONS{tuple_delimiter}EXTENSION_OPTION{tuple_delimiter}Tenant holds two options to extend the lease term, each for five years, exercisable upon proper notice)
706
+ {record_delimiter}
707
+ ("entity"{tuple_delimiter}180 DAYS PRIOR WRITTEN NOTICE - EXTENSION{tuple_delimiter}NOTICE_PERIOD{tuple_delimiter}Required advance notice of 180 days to exercise extension option)
708
+ {record_delimiter}
709
+ ("entity"{tuple_delimiter}LUXURY CONSUMER ELECTRONICS RETAIL ONLY{tuple_delimiter}USE_RESTRICTION{tuple_delimiter}Permitted use of the Premises restricted to retail sale of luxury consumer electronics and related accessories only)
710
+ {record_delimiter}
711
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}LIABILITY INSURANCE - $5,000,000 PER OCCURRENCE{tuple_delimiter}Tenant is required to maintain this insurance coverage throughout the lease term{tuple_delimiter}9)
712
+ {record_delimiter}
713
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}TWO 5-YEAR EXTENSION OPTIONS{tuple_delimiter}Tenant holds the right to exercise these extension options{tuple_delimiter}8)
714
+ {record_delimiter}
715
+ ("relationship"{tuple_delimiter}TWO 5-YEAR EXTENSION OPTIONS{tuple_delimiter}180 DAYS PRIOR WRITTEN NOTICE - EXTENSION{tuple_delimiter}Extension option must be exercised with 180 days prior written notice{tuple_delimiter}8)
716
+ {record_delimiter}
717
+ ("relationship"{tuple_delimiter}TENANT{tuple_delimiter}LUXURY CONSUMER ELECTRONICS RETAIL ONLY{tuple_delimiter}Tenant's use of premises is restricted to this permitted use{tuple_delimiter}7)
718
+ {completion_delimiter}
719
+ #############################
720
+
721
+ -Real Data-
722
+ ######################
723
+ entity_types: [LANDLORD, TENANT, GUARANTOR, PREMISES, BUILDING, MINIMUM_RENT, ADDITIONAL_RENT, SECURITY_DEPOSIT, TERM, COMMENCEMENT_DATE, EXPIRY_DATE, EXTENSION_OPTION, TERMINATION_RIGHT, NOTICE_PERIOD, INSURANCE_REQUIREMENT, MAINTENANCE_OBLIGATION, USE_RESTRICTION, DEFAULT]
724
+ text: {input_text}
725
+ ######################
726
+ output:
727
+ ```
728
+
729
+ ---
730
+
731
+ ### Relationships
732
+
733
+ Relationships are defined in the same extraction prompt as entities. Key principles:
734
+
735
+ **Strength scale (1–10):**
736
+
737
+ | Strength | Meaning |
738
+ |----------|---------|
739
+ | 10 | Core document relationship (party to party, party to primary obligation) |
740
+ | 8–9 | Important contractual link (obligation to deadline, right to condition) |
741
+ | 6–7 | Supporting relationship (secondary obligation, cross-reference) |
742
+ | 3–5 | Contextual link (location to building, general description) |
743
+ | 1–2 | Weak or incidental mention |
744
+
745
+ **Always extract relationships between:**
746
+ - Parties ↔ Parties (Tenant ↔ Landlord)
747
+ - Parties ↔ Obligations (Tenant → Insurance requirement)
748
+ - Obligations ↔ Conditions (Extension option → Notice period required)
749
+ - Financial terms ↔ Time periods (Rent $135 → Years 1-2)
750
+ - Escalation chains (Rent Year 1 → Rent Year 2 → Rent Year 3)
751
+
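The strength scale above can be read as a simple lookup. The sketch below is illustrative only: `strength_band` and the band labels are hypothetical names that are not part of the pipeline, and it assumes integer scores from 1 to 10.

```python
# Hypothetical helper: maps a relationship strength (1-10) to the band
# described in the strength table. Not part of the Nestbox pipeline.
BANDS = [
    (10, 10, "core"),        # party to party, party to primary obligation
    (8, 9, "important"),     # obligation to deadline, right to condition
    (6, 7, "supporting"),    # secondary obligation, cross-reference
    (3, 5, "contextual"),    # location to building, general description
    (1, 2, "weak"),          # incidental mention
]


def strength_band(score: int) -> str:
    """Return the band label for a relationship strength score."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"strength must be between 1 and 10, got {score}")


print(strength_band(9))  # important
```

A check like this can be handy when reviewing extraction output, for example to flag a party-to-party relationship that was scored as "weak".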
752
+ ---
753
+
754
+ ### Summarize Descriptions
755
+
756
+ When the same entity appears in multiple chunks, its descriptions are merged using the summarization prompt.
757
+
758
+ ```yaml
759
+ graphrag:
760
+ summarizeDescriptions:
761
+ maxLength: 500 # characters per summarized description
762
+ maxInputLength: 8000 # input character limit before truncation
763
+ prompt: |
764
+ You are consolidating descriptions for entities from [domain] documents.
765
+
766
+ Given multiple descriptions of the same entity, produce one comprehensive description that:
767
+ 1. Preserves ALL specific values (amounts, dates, percentages, names)
768
+ 2. Combines unique details without duplication
769
+ 3. Never rounds, approximates, or generalizes numbers
770
+
771
+ Entity: {entity_name}
772
+ Descriptions: {description_list}
773
+
774
+ Consolidated Description:
775
+ ```
776
+
777
+ - `maxLength: 500` — good default. Increase to 800–1000 for entities that accumulate many details (e.g., a complex rent schedule).
778
+ - The prompt receives `{entity_name}` and `{description_list}` — always reference both.
779
+
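One way to catch a prompt that forgets a placeholder is to parse it before running the pipeline. This is a minimal sketch, assuming the placeholders use Python-style `{name}` syntax; `missing_placeholders` is a hypothetical helper, not a pipeline function.

```python
from string import Formatter


def missing_placeholders(prompt: str, required: set[str]) -> set[str]:
    """Return the required placeholder names that the prompt never references."""
    # Formatter.parse yields (literal, field_name, format_spec, conversion);
    # field_name is None for plain text and for escaped braces ({{ and }}).
    found = {field for _, field, _, _ in Formatter().parse(prompt) if field}
    return required - found


prompt = "Entity: {entity_name}\nConsolidated Description:"
print(missing_placeholders(prompt, {"entity_name", "description_list"}))
# {'description_list'}
```

An empty result means every required placeholder appears at least once in the prompt.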
780
+ ---
781
+
782
+ ### Claim Extraction
783
+
784
+ Claims are facts, obligations, or assertions extracted separately from entities.
785
+
786
+ ```yaml
787
+ graphrag:
788
+ claimExtraction:
789
+ enabled: false # disabled by default — adds API cost
790
+ description: "Explicit obligations and factual claims in the document"
791
+ maxGleanings: 1
792
+ prompt: |
793
+ Extract specific claims, obligations, and factual statements...
794
+ ```
795
+
796
+ - **Enable for:** legal documents, compliance documents, and regulatory filings where individual claims need to be tracked separately from entities.
796
+ - **Leave disabled for:** general documents, or any case where entity extraction already captures the needed information.
798
+
799
+ ---
800
+
801
+ ### Community Detection
802
+
803
+ GraphRAG groups related entities into communities and generates summaries for each.
804
+
805
+ ```yaml
806
+ graphrag:
807
+ communities:
808
+ algorithm: leiden # leiden | louvain
809
+ resolution: 1.0
810
+ minCommunitySize: 3
811
+ maxLevels: 3
812
+ ```
813
+
814
+ - `algorithm: leiden`: **recommended**. Leiden refines Louvain and guarantees that each community is internally connected, which Louvain does not.
815
+ - `resolution` (0.1–10): Controls granularity.
816
+ - Lower (0.5) = fewer, larger communities (broader summaries)
817
+ - Higher (2.0) = more, smaller communities (more specific summaries)
818
+ - **1.0 is a good default for most document types**
819
+ - `minCommunitySize: 3` — minimum entities to form a community. Prevents trivial communities.
820
+ - `maxLevels: 3` — depth of hierarchical community structure. Increase to 4–5 for very large corpora.
821
+
822
+ ---
823
+
824
+ ### Community Reports Prompt
825
+
826
+ Community reports summarize each entity cluster. Global search reads these reports to answer broad questions.
827
+
828
+ ```yaml
829
+ graphrag:
830
+ communityReports:
831
+ maxLength: 2000
832
+ maxInputLength: 8000
833
+ prompt: |
834
+ You are a [domain] expert. Analyze communities of entities to produce actionable summaries.
835
+
836
+ # Goal
837
+ Write a comprehensive report about a community of related entities from [domain] documents.
838
+
839
+ # Report Structure
840
+ - TITLE: Short descriptive name including key entities
841
+ - SUMMARY: Executive summary with specific values and dates
842
+ - RATING: Float 0-10 (10 = most critical/central to domain)
843
+ - RATING EXPLANATION: One sentence
844
+ - FINDINGS: 5-10 specific insights with data references
845
+
846
+ Return as JSON:
847
+ {{
848
+ "title": <title>,
849
+ "summary": <summary>,
850
+ "rating": <rating>,
851
+ "rating_explanation": <explanation>,
852
+ "findings": [
853
+ {{
854
+ "summary": <finding>,
855
+ "explanation": <explanation with data references>
856
+ }}
857
+ ]
858
+ }}
859
+
860
+ # Grounding Rules
861
+ Reference data as: [Data: Entities (ids); Relationships (ids)]
862
+ Use max 5 IDs per reference, add "+more" if needed.
863
+
864
+ # Data
865
+ {input_text}
866
+ Output:
867
+ ```
868
+
869
+ **Note:** Write the literal braces of the JSON template as `{{` and `}}` (double braces) so that GraphRAG's template engine does not treat them as placeholders such as `{input_text}`.
870
+
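The effect of the double braces can be seen with plain Python string formatting. This sketch assumes the prompt is rendered with `str.format`-style substitution; the template here is a shortened stand-in for the prompt above, and the input text is made up.

```python
# Assumption: prompts are rendered with Python-style str.format substitution.
template = (
    "Return as JSON:\n"
    "{{\n"
    '    "title": <title>\n'
    "}}\n"
    "# Data\n"
    "{input_text}\n"
)

# Doubled braces come out as single literal braces; {input_text} is replaced.
rendered = template.format(input_text="entity rows go here")
print(rendered)
```

Under the same assumption, a single unescaped `{` in the JSON block would be parsed as the start of a placeholder and the render would fail.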
871
+ - `maxLength: 2000` — report character limit. Increase to 3000–4000 for complex communities with many entities.
872
+ - `maxInputLength: 8000` — input limit before truncation. Increase for large communities.
873
+
874
+ ---
875
+
876
+ ### Local Search
877
+
878
+ Local search answers specific, entity-focused questions ("What is the rent?", "Who are the parties?").
879
+
880
+ ```yaml
881
+ graphrag:
882
+ localSearch:
883
+ topKEntities: 10 # top entities to retrieve
884
+ topKRelationships: 10 # top relationships to retrieve
885
+ topKCommunityReports: 5 # community reports to include
886
+ maxContextTokens: 12000 # total context window
887
+ prompt: |
888
+ ---Role---
889
+ You are a [domain] expert answering questions using extracted knowledge graph data.
890
+
891
+ ---Domain Knowledge---
892
+ [Define what each entity type means and how to interpret it]
893
+
894
+ ---Goal---
895
+ Answer the user's question using ONLY the data tables provided.
896
+ - Be specific: include exact values, dates, and amounts from the data
897
+ - Cite sources: [Data: Entities (ids); Relationships (ids)]
898
+ - State clearly if information is not in the data
899
+
900
+ ---Target response length and format---
901
+ {response_type}
902
+
903
+ ---Data tables---
904
+ {context_data}
905
+
906
+ Style the response in markdown.
907
+ ```
908
+
909
+ - `topKEntities` / `topKRelationships`: increase to 20–30 for complex queries that touch many entities. Watch `maxContextTokens` to avoid overflow.
910
+ - `maxContextTokens: 12000` — total context budget. Use up to 16000 for GPT-4o, 32000 for GPT-4-turbo.
911
+
912
+ ---
913
+
914
+ ### Global Search
915
+
916
+ Global search answers broad questions ("Summarise all lease obligations", "What are the key financial terms across all documents?").
917
+
918
+ ```yaml
919
+ graphrag:
920
+ globalSearch:
921
+ maxCommunities: 10
922
+ mapMaxTokens: 4000
923
+ reduceMaxTokens: 8000
924
+
925
+ knowledgePrompt: |
926
+ ---Role---
927
+ You are a [domain] expert with deep knowledge of [document type].
928
+
929
+ ---Domain Knowledge---
930
+ [Terminology, concepts, and interpretation rules for your domain]
931
+
932
+ ---Goal---
933
+ Use this expertise to interpret the provided data accurately.
934
+
935
+ ---Data---
936
+ {context_data}
937
+
938
+ mapPrompt: |
939
+ ---Role---
940
+ You are a [domain] expert analyzing a community report to answer a question.
941
+
942
+ ---Goal---
943
+ From this community report, extract:
944
+ 1. A relevance score (0-100) for the question
945
+ 2. Key points relevant to the answer
946
+
947
+ ---Target response length and format---
948
+ {response_type}
949
+
950
+ ---Community Report---
951
+ {context_data}
952
+
953
+ reducePrompt: |
954
+ ---Role---
955
+ You are a [domain] expert synthesizing community analyses.
956
+
957
+ ---Goal---
958
+ Combine the community analyses into a comprehensive answer:
959
+ 1. Prioritize higher-scored communities
960
+ 2. Include all specific values (amounts, dates, percentages)
961
+ 3. Note contradictions or variations
962
+ 4. State clearly when information is unavailable
963
+
964
+ ---Target response length and format---
965
+ {response_type}
966
+
967
+ ---Community Analyses---
968
+ {report_data}
969
+ ```
970
+
971
+ - `maxCommunities: 10` — how many community reports to scan per query. Increase to 20–30 for large corpora.
972
+ - `mapMaxTokens: 4000` / `reduceMaxTokens: 8000` — token budgets for each phase. Increase if responses are being cut off.
973
+
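The map and reduce phases described above can be pictured as "score each community report, then combine the best ones within a budget". The sketch below is a simplified model, not the pipeline's implementation: `MappedReport` and `select_for_reduce` are hypothetical names, and whitespace word counts stand in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class MappedReport:
    community_id: str
    score: int        # relevance score (0-100) produced by the map step
    key_points: str   # key points extracted by the map step


def select_for_reduce(mapped: list[MappedReport], max_tokens: int) -> list[MappedReport]:
    """Keep the highest-scored map outputs that fit within the reduce budget."""
    selected, used = [], 0
    for report in sorted(mapped, key=lambda r: r.score, reverse=True):
        if report.score == 0:
            continue  # not relevant to the question
        cost = len(report.key_points.split())  # crude stand-in for a token count
        if used + cost > max_tokens:
            break  # budget exhausted; lower-scored reports are dropped
        selected.append(report)
        used += cost
    return selected


mapped = [
    MappedReport("rent", 90, "Minimum rent escalates by $5/sqft"),
    MappedReport("insurance", 40, "Liability cover of $5,000,000"),
    MappedReport("parking", 0, "No relevant data"),
]
print([r.community_id for r in select_for_reduce(mapped, max_tokens=8000)])
# ['rent', 'insurance']
```

In this model, a larger `reduceMaxTokens` lets more lower-scored communities contribute to the final answer, which matches the advice to raise the budgets when responses are cut off.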
974
+ ---
975
+
976
+ ### DRIFT Search
977
+
978
+ DRIFT (Dynamic Reasoning and Inference with Flexible Traversal) is an experimental search mode that iteratively refines queries.
979
+
980
+ ```yaml
981
+ graphrag:
982
+ driftSearch:
983
+ enabled: false # experimental, disabled by default
984
+ prompt: |
985
+ ...
986
+ reducePrompt: |
987
+ ...
988
+ ```
989
+
990
+ Enable only if you need theme discovery across large, varied document collections. For most use cases, local and global search are sufficient.
991
+
992
+ ---
993
+
994
+ ### Clustering & Cache
995
+
996
+ ```yaml
997
+ graphrag:
998
+ clusterGraph:
999
+ maxClusterSize: 10 # max entities per cluster
1000
+ useLcc: true # use largest connected component
1001
+ seed: 42 # reproducibility seed
1002
+
1003
+ cache:
1004
+ enabled: true
1005
+ type: file # file | memory | none
1006
+ ```
1007
+
1008
+ - `maxClusterSize: 10` — limits cluster size for community reports. Reduce to 5–7 for very large graphs with many communities.
1009
+ - `useLcc: true` — focuses GraphRAG on the main connected subgraph, discarding outliers. Keep `true`.
1010
+ - `seed: 42` — for reproducible community detection across runs.
1011
+ - `cache.type: file`: caches LLM calls to disk. Re-runs reuse cached responses instead of calling the API again, which speeds them up and reduces API costs. Always keep enabled in production.
1012
+
1013
+ ---
1014
+
1015
+ ### maxGleanings
1016
+
1017
+ ```yaml
1018
+ graphrag:
1019
+ entityExtraction:
1020
+ maxGleanings: 1 # 0 | 1 | 2 | 3
1021
+ ```
1022
+
1023
+ Controls how many additional extraction passes the LLM performs on each chunk to find missed entities.
1024
+
1025
+ | Value | Cost | When to Use |
1026
+ |-------|------|-------------|
1027
+ | 0 | Minimal | High volume, cost-sensitive, documents with simple structure |
1028
+ | 1 | Low | **Default. Good balance for most documents** |
1029
+ | 2 | Medium | Complex documents where thoroughness matters (dense contracts) |
1030
+ | 3+ | High | Maximum extraction; only for critical documents |
1031
+
1032
+ Each gleaning pass costs additional API tokens. With `maxGleanings: 2`, every chunk gets one initial extraction pass plus two gleaning passes, so for a 500-chunk document expect roughly 3× the entity-extraction API cost of `maxGleanings: 0`.
1033
+
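The cost estimate follows from counting extraction passes. This is a back-of-the-envelope sketch: `extraction_passes` is a hypothetical helper, and it counts passes only, ignoring that later passes may carry earlier output as context and therefore use more tokens each.

```python
def extraction_passes(chunks: int, max_gleanings: int) -> int:
    """Count entity-extraction passes for a document."""
    # One initial extraction pass per chunk, plus one pass per gleaning.
    return chunks * (1 + max_gleanings)


baseline = extraction_passes(500, 0)       # 500 passes
with_gleaning = extraction_passes(500, 2)  # 1500 passes
print(with_gleaning / baseline)            # 3.0
```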
1034
+ ---
1035
+
1036
+ ## API Keys
1037
+
1038
+ ```yaml
1039
+ apiKeys:
1040
+ openai: ${OPENAI_API_KEY} # reads from environment variable
1041
+ # baseUrl: https://api.openai.com/v1 # optional: custom endpoint
1042
+ ```
1043
+
1044
+ Always use environment variable syntax (`${VAR_NAME}`) rather than hardcoding keys in config files.
1045
+
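The `${VAR_NAME}` syntax behaves like ordinary environment-variable substitution. The sketch below shows the idea using Python's standard library; `expand_env` is a hypothetical helper, and the pipeline's own loader may resolve variables differently.

```python
import os
from string import Template


def expand_env(value: str, env=None) -> str:
    """Replace ${VAR_NAME} references with values from the environment."""
    env = os.environ if env is None else env
    # substitute() raises KeyError if a referenced variable is not set,
    # which surfaces a missing key early instead of sending an empty one.
    return Template(value).substitute(env)


print(expand_env("${OPENAI_API_KEY}", {"OPENAI_API_KEY": "sk-example"}))
# sk-example
```

Failing loudly on an unset variable is a deliberate choice in this sketch: a silent empty string would only show up later as an authentication error.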
1046
+ **Custom endpoints** (`baseUrl`) allow you to point the pipeline at:
1047
+ - Azure OpenAI: `https://your-resource.openai.azure.com/`
1048
+ - Local LLMs (vLLM, Ollama with OpenAI compat): `http://localhost:8080/v1`
1049
+ - OpenRouter: `https://openrouter.ai/api/v1`
1050
+
1051
+ The `baseUrl` applies to all model calls: GraphRAG chat, embeddings, and Docling picture descriptions.
1052
+
1053
+ ---
1054
+
1055
+ ## Recommendations by Document Type
1056
+
1057
+ ### Commercial Leases / Contracts
1058
+
1059
+ ```yaml
1060
+ docling:
1061
+ layout:
1062
+ model: docling-layout-egret-large
1063
+ ocr:
1064
+ enabled: true
1065
+ engine: rapidocr
1066
+ backend: torch
1067
+ forceFullPageOcr: true
1068
+ tables:
1069
+ mode: accurate
1070
+
1071
+ chunking:
1072
+ strategy: docling_hybrid
1073
+ maxTokens: 1000
1074
+ overlapTokens: 250
1075
+ contextualize: true
1076
+
1077
+ graphrag:
1078
+ models:
1079
+ chatModel: gpt-4o-mini
1080
+ entityExtraction:
1081
+ maxGleanings: 1
1082
+ entityTypes: [domain-specific types as above]
1083
+ ```
1084
+
1085
+ **Rationale:** Contracts have complex formatting and precise values that must not be missed. The high overlap (250 tokens) keeps a clause that straddles a chunk boundary together with the preceding text it refers to. `maxGleanings: 1` balances cost against completeness.
1086
+
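Higher overlap also means more chunks, and therefore more extraction calls. The sketch below estimates the chunk count under a simple fixed-window model; this is an assumption for illustration, since `docling_hybrid` splits on document structure and real counts will differ. `estimated_chunks` is a hypothetical helper.

```python
import math


def estimated_chunks(total_tokens: int, max_tokens: int, overlap_tokens: int) -> int:
    """Estimate chunk count for fixed-size windows with overlap."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    if total_tokens <= max_tokens:
        return 1
    # Each chunk after the first advances by (max_tokens - overlap_tokens).
    stride = max_tokens - overlap_tokens
    return 1 + math.ceil((total_tokens - max_tokens) / stride)


# A 30,000-token lease with the settings above versus no overlap.
print(estimated_chunks(30_000, 1000, 250))  # 40
print(estimated_chunks(30_000, 1000, 0))    # 30
```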
1087
+ ---
1088
+
1089
+ ### Financial Reports / Investor Documents
1090
+
1091
+ ```yaml
1092
+ docling:
1093
+ layout:
1094
+ model: docling-layout-egret-xlarge # complex tables, charts
1095
+ tables:
1096
+ mode: accurate
1097
+ pictures:
1098
+ enabled: true
1099
+ enableDescription: true
1100
+ descriptionModel: gpt-4o # charts need high accuracy
1101
+
1102
+ chunking:
1103
+ strategy: docling_hybrid
1104
+ maxTokens: 800
1105
+ overlapTokens: 150
1106
+
1107
+ graphrag:
1108
+ models:
1109
+ chatModel: gpt-4o # financial data needs accuracy
1110
+ entityExtraction:
1111
+ entityTypes:
1112
+ - COMPANY
1113
+ - FUND
1114
+ - REVENUE
1115
+ - EXPENSE
1116
+ - ASSET
1117
+ - VALUATION
1118
+ - FINANCIAL_PERIOD
1119
+ - RISK_FACTOR
1120
+ maxGleanings: 2
1121
+ ```
1122
+
1123
+ **Rationale:** Financial reports have complex tables and charts. `egret-xlarge` gives best table detection. `gpt-4o` for extraction reduces numeric errors. Smaller chunks (800) improve precision of financial figure extraction.
1124
+
1125
+ ---
1126
+
1127
+ ### Scanned Documents / Poor Quality PDFs
1128
+
1129
+ ```yaml
1130
+ docling:
1131
+ layout:
1132
+ model: docling-layout-egret-large
1133
+ ocr:
1134
+ enabled: true
1135
+ engine: rapidocr
1136
+ backend: torch
1137
+ textScore: 0.3 # lower threshold for poor quality
1138
+ forceFullPageOcr: true
1139
+ limits:
1140
+ documentTimeout: 600 # scanned docs take longer
1141
+
1142
+ chunking:
1143
+ strategy: docling_hybrid
1144
+ maxTokens: 1200
1145
+ overlapTokens: 300 # higher overlap for OCR errors at boundaries
1146
+ ```
1147
+
1148
+ **Rationale:** A lower `textScore` (0.3) accepts text even when OCR confidence is low, which is better than missing content on degraded scans. The higher `documentTimeout` (600) allows for slower full-page OCR. More overlap compensates for OCR errors at chunk boundaries.
1149
+
1150
+ ---
1151
+
1152
+ ### Multi-Language Documents
1153
+
1154
+ ```yaml
1155
+ docling:
1156
+ ocr:
1157
+ enabled: true
1158
+ engine: easyocr
1159
+ languages: [en, fr, de] # list all languages present
1160
+ backend: torch
1161
+
1162
+ chunking:
1163
+ tokenizer: cl100k_base # works well across languages
1164
+
1165
+ graphrag:
1166
+ models:
1167
+ chatModel: gpt-4o # better multilingual than gpt-4o-mini
1168
+ ```
1169
+
1170
+ **Rationale:** `easyocr` has the best multilingual OCR support. `gpt-4o` handles non-English entity extraction more reliably.
1171
+
1172
+ ---
1173
+
1174
+ ### Technical Manuals / Documentation
1175
+
1176
+ ```yaml
1177
+ docling:
1178
+ layout:
1179
+ model: docling-layout-egret-medium # manuals have predictable structure
1180
+ pictures:
1181
+ enabled: true
1182
+ enableDescription: true
1183
+ descriptionModel: gpt-4o-mini
1184
+ descriptionPrompt: |
1185
+ Analyze this technical diagram or figure.
1186
+ List all labeled components and their connections.
1187
+ Describe flow directions, measurement values, and specifications.
1188
+ Include part numbers, error codes, and annotations.
1189
+
1190
+ chunking:
1191
+ strategy: docling_hybrid
1192
+ maxTokens: 1500 # technical sections can be longer
1193
+ overlapTokens: 100
1194
+
1195
+ graphrag:
1196
+ entityExtraction:
1197
+ entityTypes:
1198
+ - COMPONENT
1199
+ - API_ENDPOINT
1200
+ - PARAMETER
1201
+ - CONFIGURATION
1202
+ - ERROR_CODE
1203
+ - WORKFLOW
1204
+ - DEPENDENCY
1205
+ - VERSION
1206
+ ```
1207
+
1208
+ ---
1209
+
1210
+ ### High-Volume Batch Processing (Speed Priority)
1211
+
1212
+ ```yaml
1213
+ docling:
1214
+ layout:
1215
+ model: docling-layout-heron # fastest
1216
+ ocr:
1217
+ engine: rapidocr
1218
+ backend: onnx # CPU-optimized
1219
+ forceFullPageOcr: false # only OCR where needed
1220
+ pictures:
1221
+ enableDescription: false # skip for speed
1222
+
1223
+ chunking:
1224
+ strategy: docling_hybrid
1225
+ maxTokens: 1500 # fewer, larger chunks
1226
+
1227
+ graphrag:
1228
+ models:
1229
+ chatModel: gpt-4o-mini
1230
+ entityExtraction:
1231
+ maxGleanings: 0 # single pass only
1232
+ ```
1233
+
1234
+ ---
1235
+
1236
+ ## Complete Example Configs
1237
+
1238
+ ### Minimal Config (defaults only)
1239
+
1240
+ ```yaml
1241
+ name: "My Pipeline"
1242
+ apiKeys:
1243
+ openai: ${OPENAI_API_KEY}
1244
+ ```
1245
+
1246
+ ---
1247
+
1248
+ ### General Purpose Document Pipeline
1249
+
1250
+ ```yaml
1251
+ name: "General Document Pipeline"
1252
+ description: "Balanced config for mixed document types"
1253
+
1254
+ docling:
1255
+ layout:
1256
+ model: docling-layout-egret-large
1257
+ ocr:
1258
+ enabled: true
1259
+ engine: rapidocr
1260
+ backend: torch
1261
+ languages: [en]
1262
+ textScore: 0.5
1263
+ forceFullPageOcr: true
1264
+ tables:
1265
+ enabled: true
1266
+ mode: accurate
1267
+ pictures:
1268
+ enabled: true
1269
+ enableDescription: true
1270
+ descriptionModel: gpt-4o-mini
1271
+ accelerator:
1272
+ device: auto
1273
+ numThreads: 4
1274
+ limits:
1275
+ documentTimeout: 300
1276
+
1277
+ chunking:
1278
+ strategy: docling_hybrid
1279
+ maxTokens: 1200
1280
+ overlapTokens: 200
1281
+ tokenizer: cl100k_base
1282
+ mergePeers: true
1283
+ contextualize: true
1284
+ output:
1285
+ format: text_files
1286
+ includeMetadataHeader: true
1287
+ metadata:
1288
+ includeHeadings: true
1289
+ includePageNumbers: true
1290
+ includePosition: true
1291
+ includeSource: true
1292
+
1293
+ graphrag:
1294
+ enabled: true
1295
+ models:
1296
+ chatModel: gpt-4o-mini
1297
+ embeddingModel: text-embedding-3-large
1298
+ temperature: 0
1299
+ maxTokens: 4096
1300
+ entityExtraction:
1301
+ entityTypes:
1302
+ - PERSON
1303
+ - ORGANIZATION
1304
+ - LOCATION
1305
+ - DATE
1306
+ - MONEY
1307
+ - DOCUMENT
1308
+ - OBLIGATION
1309
+ - CONDITION
1310
+ maxGleanings: 1
1311
+ summarizeDescriptions:
1312
+ maxLength: 500
1313
+ communities:
1314
+ algorithm: leiden
1315
+ resolution: 1.0
1316
+ minCommunitySize: 3
1317
+ communityReports:
1318
+ maxLength: 2000
1319
+ localSearch:
1320
+ topKEntities: 10
1321
+ topKRelationships: 10
1322
+ maxContextTokens: 12000
1323
+ globalSearch:
1324
+ maxCommunities: 10
1325
+ clusterGraph:
1326
+ maxClusterSize: 10
1327
+ useLcc: true
1328
+ seed: 42
1329
+ cache:
1330
+ enabled: true
1331
+ type: file
1332
+
1333
+ apiKeys:
1334
+ openai: ${OPENAI_API_KEY}
1335
+ ```
1336
+
1337
+ ---
1338
+
1339
+ ### Lean Config (CPU server, cost-sensitive)
1340
+
1341
+ ```yaml
1342
+ name: "Lean CPU Pipeline"
1343
+
1344
+ docling:
1345
+ layout:
1346
+ model: docling-layout-egret-medium
1347
+ ocr:
1348
+ enabled: true
1349
+ engine: rapidocr
1350
+ backend: onnx
1351
+ forceFullPageOcr: false
1352
+ pictures:
1353
+ enabled: false
1354
+ accelerator:
1355
+ device: cpu
1356
+ numThreads: 8
1357
+
1358
+ chunking:
1359
+ strategy: docling_hybrid
1360
+ maxTokens: 1500
1361
+ overlapTokens: 100
1362
+
1363
+ graphrag:
1364
+ enabled: true
1365
+ models:
1366
+ chatModel: gpt-4o-mini
1367
+ embeddingModel: text-embedding-3-small
1368
+ entityExtraction:
1369
+ maxGleanings: 0
1370
+ cache:
1371
+ enabled: true
1372
+ type: file
1373
+
1374
+ apiKeys:
1375
+ openai: ${OPENAI_API_KEY}
1376
+ ```
1377
+
1378
+ ---
1379
+
1380
+ *For schema reference, see: `packages/nest-doc-processing-api/src/schemas/config.schema.yaml`*
1381
+ *For a full annotated template, see: `packages/nest-doc-processing-worker/templates/config.default.yaml`*