@nestbox-ai/cli 1.0.59 → 1.0.60
- package/dist/agents/docProc/CONFIG_GUIDE.md +1381 -0
- package/dist/agents/docProc/EVAL_GUIDE.md +800 -0
- package/dist/agents/docProc/SYSTEM_PROMPT.md +24 -0
- package/dist/agents/docProc/config.schema.yaml +564 -0
- package/dist/agents/docProc/eval-test-cases.schema.yaml +248 -0
- package/dist/agents/docProc/index.d.ts +20 -0
- package/dist/agents/docProc/index.js +212 -0
- package/dist/agents/docProc/index.js.map +1 -0
- package/dist/commands/generate/docProc.d.ts +2 -0
- package/dist/commands/generate/docProc.js +99 -0
- package/dist/commands/generate/docProc.js.map +1 -0
- package/dist/commands/generate.js +2 -0
- package/dist/commands/generate.js.map +1 -1
- package/package.json +4 -2
@@ -0,0 +1,800 @@
# Nestbox Document Pipeline — Evaluation (Eval) Guide

The eval file lets you measure the quality of your pipeline configuration against a set of known questions and expected answers. It runs real queries against a processed document and scores the responses automatically using semantic similarity.

---

## Table of Contents

- [How Evaluation Works](#how-evaluation-works)
- [File Structure](#file-structure)
- [eval_params — Query Parameters](#eval_params--query-parameters)
- [Basic Search Test Cases](#basic-search-test-cases)
- [Local Search Test Cases](#local-search-test-cases)
- [Global Search Test Cases](#global-search-test-cases)
- [Writing Good expected_answer](#writing-good-expected_answer)
- [Writing Good bad_answer](#writing-good-bad_answer)
- [Scoring and Results](#scoring-and-results)
- [How Many Test Cases to Write](#how-many-test-cases-to-write)
- [Running Evaluations](#running-evaluations)
- [Interpreting Results](#interpreting-results)
- [Recommendations by Document Type](#recommendations-by-document-type)
- [Complete Example Eval Files](#complete-example-eval-files)

---

## How Evaluation Works

When you run an eval, the pipeline:

1. Submits each question to GraphRAG using the specified search mode (basic / local / global)
2. Receives a real response from the knowledge graph
3. Gets OpenAI embeddings for three texts: the **response**, your **expected_answer**, and your **bad_answer**
4. Computes cosine similarity: `sim_good` (response ↔ expected) and `sim_bad` (response ↔ bad)
5. Computes **delta** = `sim_good − sim_bad`
6. Classifies the result:
   - **GOOD** — delta ≥ 0.10 (response is meaningfully closer to your expected answer)
   - **BAD** — delta < 0.10 (response is not sufficiently better than the bad answer)
   - **ERROR** — query failed or response was empty

The evaluation is **semantic**, not keyword-based. The pipeline embeds the meaning of the response and compares it to the meaning of your expected and bad answers. This means exact wording doesn't matter — what matters is whether the response conveys the correct information.

**Key implication:** Your `expected_answer` should describe the correct information in natural language, and your `bad_answer` should describe a plausible but wrong or vague answer.
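The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names `cosine_similarity` and `classify` are hypothetical, and toy 3-dimensional vectors stand in for the real OpenAI embeddings. Only the 0.10 threshold and the GOOD / BAD / ERROR labels come from this guide.

```python
import math

GOOD_THRESHOLD = 0.10  # delta threshold stated in this guide


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def classify(response_vec, expected_vec, bad_vec):
    """Return (sim_good, sim_bad, delta, label) for one test case."""
    if not response_vec:  # treat a missing embedding as a failed/empty response
        return None, None, None, "ERROR"
    sim_good = cosine_similarity(response_vec, expected_vec)
    sim_bad = cosine_similarity(response_vec, bad_vec)
    delta = sim_good - sim_bad
    label = "GOOD" if delta >= GOOD_THRESHOLD else "BAD"
    return sim_good, sim_bad, delta, label


# Toy vectors (assumption): the response points mostly in the "expected" direction.
response = [0.9, 0.1, 0.0]
expected = [1.0, 0.0, 0.0]
bad = [0.0, 1.0, 0.0]
print(classify(response, expected, bad))
```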

---

## File Structure

```yaml
# eval.yaml
eval_params:          # optional: query-level parameters per mode
  basic_search: ...
  local_search: ...
  global_search: ...

basic_search:         # simple factual retrieval test cases
  - question: "..."
    expected_answer: "..."
    bad_answer: "..."

local_search:         # entity-focused question test cases
  - question: "..."
    expected_answer: "..."
    bad_answer: "..."

global_search:        # summary and thematic question test cases
  - question: "..."
    expected_answer: "..."
    bad_answer: "..."
```

At least one of `basic_search`, `local_search`, or `global_search` must be present. You can include any combination.
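As a rough illustration of these structural rules, the sketch below checks an eval file that has already been parsed into a Python dict. The function name `validate_eval` is hypothetical, and treating `question`, `expected_answer`, and `bad_answer` as all required is an assumption based on the examples in this guide; the authoritative check is the `nestdoc eval validate` command.

```python
MODES = ("basic_search", "local_search", "global_search")
# Assumption: every test case needs all three keys, as in the guide's examples.
REQUIRED_KEYS = ("question", "expected_answer", "bad_answer")


def validate_eval(data):
    """Return a list of problems found in a parsed eval file (empty if none)."""
    problems = []
    present = [mode for mode in MODES if data.get(mode)]
    if not present:
        problems.append(
            "at least one of basic_search, local_search, global_search is required"
        )
    for mode in present:
        for index, case in enumerate(data[mode]):
            for key in REQUIRED_KEYS:
                value = case.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"{mode}[{index}]: missing or empty '{key}'")
    return problems


example = {
    "basic_search": [
        {
            "question": "What is the security deposit amount?",
            "expected_answer": "The security deposit is $71,464.22 including HST.",
            "bad_answer": "There is no security deposit mentioned in the document.",
        }
    ],
    # Deliberately incomplete case to show what gets reported.
    "local_search": [{"question": "Who is the guarantor?", "expected_answer": ""}],
}
for problem in validate_eval(example):
    print(problem)
```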

---

## eval_params — Query Parameters

`eval_params` lets you override GraphRAG query parameters for all test cases in each mode. These are optional — defaults from your pipeline config apply if omitted.

```yaml
eval_params:
  basic_search:
    k: 10
    temperature: 0
    max_tokens: 4096

  local_search:
    top_k_entities: 20
    top_k_relationships: 20
    text_unit_prop: 0.5
    community_prop: 0.3
    max_context_tokens: 12000
    temperature: 0
    max_tokens: 4096

  global_search:
    max_context_tokens: 16000
    data_max_tokens: 12000
    map_max_length: 1000
    reduce_max_length: 2000
    dynamic_search_threshold: 1
    dynamic_search_keep_parent: true
    dynamic_search_num_repeats: 1
    dynamic_search_use_summary: false
    dynamic_search_max_level: 3
    temperature: 0
    max_tokens: 4096
```

### Basic Search Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `k` | 10 | Number of text chunks to retrieve by vector similarity |
| `temperature` | 0 | LLM temperature. Keep at 0 for deterministic eval results |
| `max_tokens` | 4096 | Maximum tokens in the response |

### Local Search Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `top_k_entities` | 10 | How many entity matches to retrieve. Increase to 20–30 for complex questions |
| `top_k_relationships` | 10 | How many relationship matches to retrieve |
| `text_unit_prop` | 0.5 | Proportion of context budget allocated to raw text units (0–1) |
| `community_prop` | 0.3 | Proportion of context budget allocated to community reports (0–1) |
| `conversation_history_max_turns` | 0 | For multi-turn conversations. Leave 0 for eval |
| `max_context_tokens` | 12000 | Total context window for the query |
| `temperature` | 0 | Keep at 0 for eval |
| `max_tokens` | 4096 | Maximum response length |

### Global Search Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_context_tokens` | 16000 | Total tokens available for community report context |
| `data_max_tokens` | 12000 | Token budget for the data portion |
| `map_max_length` | 1000 | Max response tokens per community in the map phase |
| `reduce_max_length` | 2000 | Max tokens for the final reduce/synthesis response |
| `dynamic_search_threshold` | 1 | Community rating threshold to include (0–10). Lower = include more communities |
| `dynamic_search_keep_parent` | true | Whether to include parent communities even when children qualify |
| `dynamic_search_num_repeats` | 1 | How many times to repeat community scoring for confidence |
| `dynamic_search_use_summary` | false | Use community summaries instead of full report text |
| `dynamic_search_max_level` | 3 | Maximum hierarchy depth to search (matches your `communities.maxLevels`) |
| `temperature` | 0 | Keep at 0 for eval |
| `max_tokens` | 4096 | Maximum response length |

**For eval runs always set `temperature: 0`** across all modes. Non-zero temperature introduces randomness that makes results inconsistent between runs.
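One way to picture how overrides interact with defaults is a per-key merge, sketched below. This is an assumption about the behaviour, not a description of the pipeline's internals: the helper `effective_params` is hypothetical, the `DEFAULTS` values are copied from the tables in this section (in practice the defaults come from your pipeline config), and rejecting unknown keys is a choice made for this sketch.

```python
# Defaults copied from the parameter tables in this section.
DEFAULTS = {
    "basic_search": {"k": 10, "temperature": 0, "max_tokens": 4096},
    "local_search": {
        "top_k_entities": 10,
        "top_k_relationships": 10,
        "text_unit_prop": 0.5,
        "community_prop": 0.3,
        "conversation_history_max_turns": 0,
        "max_context_tokens": 12000,
        "temperature": 0,
        "max_tokens": 4096,
    },
    "global_search": {
        "max_context_tokens": 16000,
        "data_max_tokens": 12000,
        "map_max_length": 1000,
        "reduce_max_length": 2000,
        "dynamic_search_threshold": 1,
        "dynamic_search_keep_parent": True,
        "dynamic_search_num_repeats": 1,
        "dynamic_search_use_summary": False,
        "dynamic_search_max_level": 3,
        "temperature": 0,
        "max_tokens": 4096,
    },
}


def effective_params(mode, eval_params):
    """Merge per-mode overrides from eval_params over the defaults (assumed semantics)."""
    merged = dict(DEFAULTS[mode])
    overrides = (eval_params or {}).get(mode) or {}
    unknown = sorted(set(overrides) - set(merged))
    if unknown:
        raise ValueError(f"unknown {mode} parameter(s): {', '.join(unknown)}")
    merged.update(overrides)
    return merged


overrides = {"local_search": {"top_k_entities": 20, "top_k_relationships": 20}}
params = effective_params("local_search", overrides)
print(params["top_k_entities"], params["max_context_tokens"])
if params["temperature"] != 0:
    print("warning: non-zero temperature makes eval runs inconsistent")
```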

---

## Basic Search Test Cases

Basic search is vector similarity retrieval — it finds the most relevant text chunks and uses them to answer the question. It does **not** use the knowledge graph entities or relationships.

**When to use basic search:**
- Simple factual lookups that are answered by a single sentence or paragraph
- Keyword-rich questions where the answer is likely verbatim in the document
- Testing whether the chunking and embedding pipeline captured a specific piece of information

**When NOT to use basic search:**
- Questions that require reasoning across multiple parts of the document
- Questions about relationships between entities
- Summary or thematic questions

**Question style:** Direct, specific, self-contained. The answer should exist in a single chunk.

```yaml
basic_search:
  - question: "What is the security deposit amount?"
    expected_answer: "The security deposit is $71,464.22 including HST."
    bad_answer: "There is no security deposit mentioned in the document."

  - question: "What OCR language is configured?"
    expected_answer: "The OCR engine is configured for English language recognition."
    bad_answer: "No language settings are specified."

  - question: "What is the building address?"
    expected_answer: "The building is located at 135 Yorkville Avenue, Toronto."
    bad_answer: "The document does not mention a specific address."
```

**Rules for basic_search questions:**

1. Ask about a single concrete fact (a number, a name, a date, a yes/no)
2. The answer must exist somewhere in the document text
3. Avoid "list all..." or "summarise..." — these belong in global_search
4. Avoid "how does X relate to Y" — these belong in local_search
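Rules 3 and 4 can be turned into a rough lint for question phrasing. The sketch below is a hypothetical heuristic, not part of the `nestdoc` CLI: the function `suggest_mode` and its regular expressions are assumptions that only cover the phrasings named in the rules, so treat the result as a hint rather than a verdict.

```python
import re

# Hypothetical phrase patterns derived from rules 3 and 4 above.
GLOBAL_PATTERNS = (r"^\s*list all\b", r"^\s*summari[sz]e\b")
LOCAL_PATTERNS = (r"\bhow does .+ relate to\b",)


def suggest_mode(question):
    """Suggest a search mode for a question based on its phrasing."""
    text = question.lower()
    if any(re.search(pattern, text) for pattern in GLOBAL_PATTERNS):
        return "global_search"
    if any(re.search(pattern, text) for pattern in LOCAL_PATTERNS):
        return "local_search"
    return "basic_search"


for question in (
    "What is the building address?",
    "Summarise all financial obligations of the tenant under this lease.",
    "How does the minimum rent relate to the rentable area?",
):
    print(f"{suggest_mode(question):14} {question}")
```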

---

## Local Search Test Cases

Local search traverses the knowledge graph — it retrieves entities, relationships, and community reports related to the question, then synthesises an answer. It is the most powerful mode for document Q&A.

**When to use local search:**
- Questions about specific entities and their properties
- Questions about relationships between parties or concepts
- Questions that require combining information from multiple places in the document
- Questions about obligations, rights, financial terms, timelines

**When NOT to use local search:**
- Very broad "tell me everything" questions (use global search)
- Questions where the answer is a single verbatim sentence (basic search may be faster)

**Question style:** Entity-focused, relational, specific but potentially spread across the document.

```yaml
local_search:
  - question: "What is the total minimum rent over the entire lease term?"
    expected_answer: "Over the 5-year term, minimum rent totals $1,509,200: $291,060/year for Years 1-2 ($582,120), $301,840 in Year 3, and $312,620/year for Years 4-5 ($625,240)."
    bad_answer: "The rent is $135 per square foot."

  - question: "What are the tenant's insurance obligations?"
    expected_answer: "The tenant must maintain comprehensive general liability insurance of at least $5,000,000 per occurrence throughout the lease term."
    bad_answer: "The tenant has some insurance requirements."

  - question: "What are the conditions for exercising the extension option?"
    expected_answer: "The tenant may exercise up to two 5-year extension options by providing 180 days prior written notice to the landlord before the expiry of the current term."
    bad_answer: "There is an extension option available."

  - question: "Who is the guarantor and what are they guaranteeing?"
    expected_answer: "Bang & Olufsen A/S is the guarantor, guaranteeing all obligations of the tenant EPIC LUXURY SYSTEMS INC. under the lease, including rent payments and compliance with all lease terms."
    bad_answer: "Someone guarantees the lease."

  - question: "What happens in the event of a default?"
    expected_answer: "Upon default, the landlord may terminate the lease, re-enter the premises, and pursue damages. The tenant has a cure period of 10 days for monetary defaults and 30 days for non-monetary defaults."
    bad_answer: "Default has consequences for the tenant."
```
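Because specific values anchor the comparison, it is worth checking the arithmetic in an `expected_answer` before relying on it. The short sketch below sums the yearly minimum-rent figures quoted in the first test case and cross-checks them against the per-square-foot rates and the 2,156 sq ft rentable area used in this guide's other lease examples. The variable names are illustrative only.

```python
# Annual minimum rent per lease year, as quoted in this guide's lease examples.
annual_rent = {1: 291_060, 2: 291_060, 3: 301_840, 4: 312_620, 5: 312_620}

# Rate per square foot for each lease year, and the rentable area from the
# guide's "rentable area" example.
rate_per_sqft = {1: 135, 2: 135, 3: 140, 4: 145, 5: 145}
AREA_SQFT = 2_156

# Each yearly figure should equal rate x area.
for year, rent in annual_rent.items():
    assert rate_per_sqft[year] * AREA_SQFT == rent, f"Year {year} is inconsistent"

total = sum(annual_rent.values())
print(f"Total minimum rent over 5 years: ${total:,}")  # prints "$1,509,200"
```

A total that does not match the yearly figures would pull the embedding of `expected_answer` away from a correct response, so a check like this is cheap insurance.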

**Rules for local_search questions:**

1. Reference specific entity types from your config (parties, financial terms, dates, obligations)
2. Ask questions whose answers span multiple clauses or sections
3. The `expected_answer` should name specific entities with their values
4. The `bad_answer` should be a vague or incomplete version of the correct answer, not completely wrong — this tests whether the model extracts sufficient detail
5. Test both easy questions (parties, dates) and hard ones (cross-references, conditions)

**Testing relationship extraction quality:**

Local search results depend heavily on how well your entity extraction prompt captured relationships. Good test cases for this:

```yaml
local_search:
  # Tests that rent escalation relationships were captured
  - question: "How does the minimum rent change over the lease term?"
    expected_answer: "Minimum rent starts at $135/sqft ($291,060/year) for Years 1-2, increases to $140/sqft ($301,840/year) in Year 3, then to $145/sqft ($312,620/year) for Years 4-5."
    bad_answer: "The rent increases over time."

  # Tests that party-obligation relationships were captured
  - question: "What are the landlord's maintenance obligations?"
    expected_answer: "The landlord is responsible for maintaining the structural elements of the building, roof, common areas, and building systems including HVAC, plumbing, and electrical not within the premises."
    bad_answer: "The landlord has some maintenance duties."
```

---

## Global Search Test Cases

Global search uses community reports — summaries of entity clusters built during indexing — to answer broad, thematic questions about the document. It does not look at individual entity details but synthesises across communities.

**When to use global search:**
- "Summarise all..." questions
- "What are the main themes..." questions
- "List all obligations/rights/financial terms..."
- Questions that require awareness of the full document scope
- Portfolio-level questions when multiple documents were indexed together

**When NOT to use global search:**
- Specific factual lookups (use basic or local)
- Questions about exact values (local search is more precise)
- Short documents with few entities (local search works better)

**Question style:** Broad, thematic, synthesis-oriented.

```yaml
global_search:
  - question: "Summarise all financial obligations of the tenant under this lease."
    expected_answer: "The tenant's financial obligations include: Minimum Rent starting at $291,060/year ($135/sqft) escalating to $312,620/year ($145/sqft); Additional Rent covering operating costs and property taxes; a Security Deposit of $71,464.22; HST on all payments; and liability insurance of at least $5,000,000."
    bad_answer: "The tenant must pay rent and other fees."

  - question: "What are all the key dates and time periods in this lease?"
    expected_answer: "Key dates include: Commencement Date January 15 2026; Fixturing Period of 60 days preceding commencement; initial 5-year Term expiring January 14 2031; two 5-year extension options potentially running to 2041; and various notice periods (180 days for extension, 90 days for termination)."
    bad_answer: "The lease starts in 2026 and lasts 5 years."

  - question: "What rights does the tenant have under this lease?"
    expected_answer: "The tenant holds the following rights: two 5-year extension options exercisable with 180 days notice; a right of first refusal on adjacent space; a termination right if the premises are damaged beyond 50% and not restored within 180 days; and a right to sublet with landlord consent not to be unreasonably withheld."
    bad_answer: "The tenant has some rights to extend the lease."

  - question: "What are the landlord's and tenant's respective maintenance responsibilities?"
    expected_answer: "The landlord maintains structural elements, roof, common areas, and building systems (HVAC, plumbing, electrical) outside the premises. The tenant is responsible for all interior non-structural maintenance, interior finishes, fixtures, signage, and HVAC equipment serving only the premises."
    bad_answer: "Both parties have maintenance obligations."
```

**Rules for global_search questions:**

1. Questions should genuinely require synthesising across the full document
2. `expected_answer` should be a mini-summary — it can be multi-sentence
3. `bad_answer` should be a shallow or incomplete version: "The tenant has some responsibilities" vs. "The tenant must pay $X, maintain Y, and comply with Z"
4. Avoid questions whose answers come from a single paragraph — those belong in local or basic
5. Include "list all..." and "summarise all..." framing — this is where global search excels
6. Test the quality of your community reports (these are what global search draws from)

---

## Writing Good expected_answer

The `expected_answer` is used as the "gold standard" for semantic similarity comparison. The pipeline embeds it and measures how close the GraphRAG response is to it.

**Principles:**

### 1. Include the actual values

Bad (too vague — similar to a bad answer):
```yaml
expected_answer: "The tenant pays rent to the landlord."
```

Good (specific values anchor the semantic comparison):
```yaml
expected_answer: "The tenant pays minimum rent of $135.00 per square foot ($291,060 annually) for Years 1-2, escalating to $145.00/sqft ($312,620 annually) for Years 4-5."
```

### 2. Match the scope of the question

If the question asks about one thing, the `expected_answer` should cover one thing well, not everything tangentially related.

```yaml
question: "What is the security deposit?"
expected_answer: "The security deposit is $71,464.22 including HST, held by the landlord to secure the tenant's obligations under the lease."
# Don't add: "The tenant also pays $291,060/year in rent..." — out of scope
```

### 3. Use natural language, not bullet points

The embedding model handles prose better than structured lists for similarity comparison.

```yaml
# Less effective
expected_answer: |
  - Rent: $291,060
  - Deposit: $71,464.22
  - Insurance: $5,000,000

# More effective
expected_answer: "The tenant's main financial obligations are minimum rent of $291,060 annually, a security deposit of $71,464.22, and comprehensive liability insurance of $5,000,000 per occurrence."
```

### 4. Write it as a complete answer, not a description of the answer

```yaml
# Wrong — describes what the answer is, not the answer itself
expected_answer: "A dollar amount for the rent is stated."

# Correct — is the answer
expected_answer: "The minimum rent is $135.00 per square foot per annum, totalling $291,060.00 annually for Years 1 and 2."
```

### 5. For global search, write a comprehensive summary

Global search synthesises across the document. The `expected_answer` should reflect that breadth:

```yaml
# Too narrow for a global question
question: "Summarise all tenant financial obligations."
expected_answer: "The tenant pays $291,060 in rent."

# Appropriate for global
expected_answer: "Tenant financial obligations encompass minimum rent escalating from $291,060 to $312,620 annually over five years, additional rent covering proportionate operating costs and taxes, a security deposit of $71,464.22, HST on all amounts, and maintenance of $5,000,000 liability insurance."
```

---

## Writing Good bad_answer

The `bad_answer` defines the lower bound of the similarity comparison. A response must be meaningfully closer to `expected_answer` than to `bad_answer` (a delta of at least 0.10 in cosine similarity) to be classified as GOOD.

**The bad_answer should be:**
- Plausible-sounding but wrong or incomplete
- Not obviously nonsensical (if it's too different from everything, it doesn't provide contrast)
- The type of answer a poorly-configured pipeline or a hallucinating LLM might give

**Common bad_answer patterns:**

### Pattern 1: Vague / Non-committal
```yaml
# For a question about rent
bad_answer: "The tenant is required to make regular payments to the landlord."
```

### Pattern 2: "Not found" / Empty
```yaml
bad_answer: "The document does not contain information about this topic."
```
Use this when the `expected_answer` is a specific value, because a "not found" response is then clearly bad.

### Pattern 3: Wrong value / Plausibly wrong
```yaml
# For a question about the security deposit
bad_answer: "The security deposit is three months' rent, held in escrow by the landlord."
# (Plausible but wrong — actual is a specific dollar amount, not formula-based)
```

### Pattern 4: Incomplete — right topic, missing the key detail
```yaml
# For a question about extension options
bad_answer: "The tenant has the option to extend the lease beyond the initial term."
# Missing: number of options, duration, notice period requirements
```

Pattern 4 is often the most useful because it tests whether the model extracts **sufficient detail**, not just whether it found the right topic.

### Pattern 5: Off-topic but superficially related
```yaml
# For a question about landlord maintenance obligations
bad_answer: "The tenant is responsible for all maintenance and repairs within the premises."
# Right topic (maintenance) but wrong party
```

**Avoid:**
- Completely unrelated answers ("The sky is blue") — too easy, doesn't test the model
- Answers that are identical to the expected answer — delta will be ~0
- Answers that are more complete than the `expected_answer` — the model could reasonably score higher on the bad answer

---

## Scoring and Results

### Per-question metrics

| Metric | Description |
|--------|-------------|
| `similarityToGood` | Cosine similarity (0–1) between the response and `expected_answer`. Higher is better. |
| `similarityToBad` | Cosine similarity (0–1) between the response and `bad_answer`. Lower is better. |
| `deltaScore` | `similarityToGood − similarityToBad`. Must be ≥ 0.10 to classify as GOOD. |
| `classification` | GOOD / BAD / ERROR |

**Example result breakdown:**

```
question: "What is the security deposit?"
graphragResponse: "The security deposit is $71,464.22 including HST, per Section 4.3 of the lease."
similarityToGood: 0.9241 (very close to expected answer)
similarityToBad: 0.4823 (clearly different from bad answer)
deltaScore: 0.4418 ✓ GOOD (well above 0.10 threshold)
```

```
question: "What are the operating cost escalations?"
graphragResponse: "The tenant pays a proportionate share of operating costs, subject to annual adjustment."
similarityToGood: 0.7234
similarityToBad: 0.6891
deltaScore: 0.0343 ✗ BAD (below 0.10 — response too vague, close to bad answer)
```

### Summary metrics

| Metric | Description |
|--------|-------------|
| `accuracy` | Proportion of GOOD results among non-ERROR results |
| `avgDeltaScore` | Average delta across all test cases. Higher = better overall quality |
| `avgSimilarityToGood` | Average cosine similarity to expected answers |
| `avgSimilarityToBad` | Average cosine similarity to bad answers |
| `f1Score` | Harmonic mean of precision and recall for classification |
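The summary metrics can be reproduced from per-question results with a short aggregation, sketched below using the two example results above plus one hypothetical ERROR case. The function name `summarise` and the choice to leave ERROR cases out of the averages are assumptions of this sketch (an ERROR has no similarity scores to average). `f1Score` is omitted because this guide does not define which class counts as positive.

```python
def summarise(results):
    """Aggregate per-question results into summary metrics (sketch)."""
    scored = [r for r in results if r["classification"] != "ERROR"]
    if not scored:
        return None  # nothing to average
    count = len(scored)
    good = sum(1 for r in scored if r["classification"] == "GOOD")
    return {
        "accuracy": good / count,
        "avgDeltaScore": sum(r["deltaScore"] for r in scored) / count,
        "avgSimilarityToGood": sum(r["similarityToGood"] for r in scored) / count,
        "avgSimilarityToBad": sum(r["similarityToBad"] for r in scored) / count,
    }


results = [
    {"similarityToGood": 0.9241, "similarityToBad": 0.4823,
     "deltaScore": 0.4418, "classification": "GOOD"},
    {"similarityToGood": 0.7234, "similarityToBad": 0.6891,
     "deltaScore": 0.0343, "classification": "BAD"},
    {"classification": "ERROR"},  # hypothetical failed query
]
summary = summarise(results)
print(summary)
```

With these inputs, `accuracy` is 0.5 (one GOOD out of two scored cases) and `avgDeltaScore` is about 0.238.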

### Interpreting delta scores

| avgDeltaScore | Interpretation |
|---------------|---------------|
| > 0.40 | Excellent — responses are strongly aligned with expected answers |
| 0.25–0.40 | Good — solid extraction and retrieval |
| 0.10–0.25 | Acceptable — passing threshold but room for improvement |
| 0.05–0.10 | Borderline — many BAD results, review entity/prompt config |
| < 0.05 | Poor — responses are barely distinguishable from bad answers |
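The bands in this table translate directly into a lookup function. The sketch below is illustrative: the name `interpret_avg_delta` is hypothetical, and because the table's ranges share their endpoints (0.25 and 0.10 each appear in two rows), assigning a boundary value to the higher band is an assumption.

```python
def interpret_avg_delta(avg_delta):
    """Map an avgDeltaScore to the interpretation bands in the table above."""
    if avg_delta > 0.40:
        return "Excellent"
    if avg_delta >= 0.25:  # boundary values go to the higher band (assumption)
        return "Good"
    if avg_delta >= 0.10:
        return "Acceptable"
    if avg_delta >= 0.05:
        return "Borderline"
    return "Poor"


for value in (0.4418, 0.30, 0.238, 0.0343):
    print(f"{value:.4f} -> {interpret_avg_delta(value)}")
```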

---

## How Many Test Cases to Write

| Document Type | basic_search | local_search | global_search | Total |
|--------------|-------------|-------------|--------------|-------|
| Short contract (< 20 pages) | 3–5 | 8–12 | 3–5 | 15–22 |
| Long contract (20–100 pages) | 5–8 | 15–20 | 5–8 | 25–36 |
| Multi-document collection | 5–10 | 15–25 | 8–15 | 30–50 |
| Technical manual | 5–10 | 10–15 | 3–5 | 18–30 |
| Financial report | 3–5 | 12–18 | 5–10 | 20–33 |

**Minimum recommended:** 5 local_search cases. This is the highest-signal mode for documents with a configured knowledge graph.

**Distribution principle:** Allocate most test cases to the mode you rely on most in production. If you mostly use local search for Q&A, weight your eval there.

**Coverage principle:** Each entity type in your config should appear in at least one test case. If you defined `EXTENSION_OPTION` as an entity type, write a local_search question that requires it.
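The minimum and coverage principles lend themselves to a quick self-check on a parsed eval file. The sketch below is a hypothetical helper, not a `nestdoc` feature: `coverage_report` counts cases per mode and flags entity types that no question mentions, using a crude substring match (`EXTENSION_OPTION` counts as covered if "extension option" appears in a question). That matching rule is an assumption and will miss paraphrases.

```python
MODES = ("basic_search", "local_search", "global_search")
MIN_LOCAL_CASES = 5  # "Minimum recommended" figure from this guide


def coverage_report(eval_data, entity_types):
    """Return (counts per mode, uncovered entity types, warnings)."""
    counts = {mode: len(eval_data.get(mode) or []) for mode in MODES}
    questions = " ".join(
        case["question"].lower()
        for mode in MODES
        for case in eval_data.get(mode) or []
    )
    # Crude heuristic: an entity type is "covered" if its name, with
    # underscores replaced by spaces, appears in some question.
    uncovered = [
        name for name in entity_types
        if name.lower().replace("_", " ") not in questions
    ]
    warnings = []
    if counts["local_search"] < MIN_LOCAL_CASES:
        warnings.append(
            f"only {counts['local_search']} local_search case(s); "
            f"at least {MIN_LOCAL_CASES} recommended"
        )
    return counts, uncovered, warnings


eval_data = {
    "local_search": [
        {"question": "What are the conditions for exercising the extension option?"},
    ],
}
counts, uncovered, warnings = coverage_report(eval_data, ["EXTENSION_OPTION", "GUARANTOR"])
print(counts, uncovered, warnings)
```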

---

## Running Evaluations

```bash
# Validate your eval file before running
nestdoc eval validate --file ./eval.yaml --verbose

# Run evaluation against a processed document
nestdoc eval run --document doc-abc123 --test-file ./eval.yaml --watch

# Run and save results to file
nestdoc eval run --document doc-abc123 --test-file ./eval.yaml --watch --save --output ./results.json

# Check progress of a running eval
nestdoc eval status --eval eval-abc123

# View detailed results
nestdoc eval results --document doc-abc123 --eval eval-abc123 --show-details

# Generate a report
nestdoc eval report --document doc-abc123 --eval eval-abc123 --format markdown

# Compare two pipeline configs (run evals on both, then compare)
nestdoc eval compare --document doc-abc123 --eval-a eval-aaa111 --eval-b eval-bbb222
```
|
|
518
|
+
|
|
519
|
+
---
|
|
520
|
+
|
|
521
|
+
## Interpreting Results
|
|
522
|
+
|
|
523
|
+
### If accuracy is low on basic_search
|
|
524
|
+
|
|
525
|
+
- The document text was not captured correctly by Docling (OCR issue, wrong layout model)
|
|
526
|
+
- Chunks are too small and the relevant sentence was split across two chunks
|
|
527
|
+
- The answer exists in a table or image that wasn't extracted
|
|
528
|
+
- **Fix:** Review Docling config — try a stronger layout model or enable/improve OCR
|
|
529
|
+
|
|
530
|
+
### If accuracy is low on local_search
|
|
531
|
+
|
|
532
|
+
Most common cause: entity extraction is missing entities or relationships.
|
|
533
|
+
|
|
534
|
+
- Check that your entity types match what you're asking about
|
|
535
|
+
- Review the entity extraction prompt — add more examples for the failing question types
|
|
536
|
+
- Increase `maxGleanings` to 1 or 2
|
|
537
|
+
- Increase `top_k_entities` in `eval_params.local_search` to 20–30
|
|
538
|
+
- If specific values (dollar amounts, dates) are missing, check that your prompt includes instructions to include values in entity names
|
|
539
|
+
|
|
540
|
+
### If accuracy is low on global_search
|
|
541
|
+
|
|
542
|
+
- Community detection is not grouping related entities together
|
|
543
|
+
- Community reports are not detailed enough
|
|
544
|
+
- **Fix:** Increase `communityReports.maxLength`, improve the community report prompt
|
|
545
|
+
- Try reducing `dynamic_search_threshold` to 0 or 1 (include more communities)
|
|
546
|
+
- Increase `max_context_tokens` in `eval_params.global_search`
|
|
547
|
+
|
|
548
|
+
### If deltaScore is near 0 for many cases

- Your `expected_answer` and `bad_answer` may be too similar to each other
- Or the model is returning very generic responses that don't commit to either
- **Fix:** Make `expected_answer` more specific (add exact values) and make `bad_answer` more vague

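One way to catch overly similar answer pairs before running an eval is to measure how much each `expected_answer` overlaps with its `bad_answer`. The sketch below uses a crude lexical proxy (Jaccard overlap of word sets), not the semantic similarity the evaluator itself applies. The function names and the 0.5 threshold are assumptions chosen for illustration, and the first test case is deliberately written as a poor pair.

```python
import re


def token_set(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two strings' token sets (0.0 = disjoint, 1.0 = identical)."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def flag_similar_pairs(cases, threshold=0.5):
    """Return the questions whose expected and bad answers overlap heavily."""
    return [
        case["question"]
        for case in cases
        if jaccard(case["expected_answer"], case["bad_answer"]) >= threshold
    ]


# Test cases shaped like the entries in an eval file (plain dicts here,
# so no YAML parser is needed for the sketch).
cases = [
    {
        "question": "What is the security deposit amount?",
        "expected_answer": "The security deposit is $71,464.22 including HST.",
        # Too close to the expected answer: only the amount differs.
        "bad_answer": "The security deposit is an amount including HST.",
    },
    {
        "question": "What is the rentable area of the premises?",
        "expected_answer": "The rentable area is approximately 2,156 square feet.",
        "bad_answer": "The premises is a retail unit.",
    },
]

print(flag_similar_pairs(cases))
```

With these inputs only the security deposit case is flagged. A lexical check like this can miss paraphrases, so treat it as a quick filter and not as a substitute for the real score.
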
### If all results are BAD but responses look reasonable

- Check your `expected_answer`: it might contain information that is genuinely not in the document
- Or the question is in the wrong mode (e.g. a summary question in basic_search)
- Try increasing `top_k_entities` or `max_context_tokens`

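Since the advice above is organised by search mode, it helps to break accuracy down per mode before deciding which subsection applies. The following is a minimal sketch: the result records are hypothetical dicts with `mode` and `passed` keys, not the CLI's actual output format, and both helper names are assumptions.

```python
from collections import defaultdict

# Hypothetical result records; the real eval output format may differ.
results = [
    {"mode": "basic_search", "passed": True},
    {"mode": "basic_search", "passed": True},
    {"mode": "local_search", "passed": False},
    {"mode": "local_search", "passed": True},
    {"mode": "global_search", "passed": False},
]


def accuracy_by_mode(records):
    """Return {mode: fraction of records that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record["mode"]] += 1
        passes[record["mode"]] += int(record["passed"])
    return {mode: passes[mode] / totals[mode] for mode in totals}


def weakest_mode(records):
    """Return the mode with the lowest accuracy (first one wins on ties)."""
    scores = accuracy_by_mode(records)
    return min(scores, key=scores.get)


print(accuracy_by_mode(results))
print(weakest_mode(results))
```

Start troubleshooting with the weakest mode. Problems in basic_search often point to extraction issues that also affect the other two modes.
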
---

## Recommendations by Document Type

### Commercial Leases / Contracts

```yaml
eval_params:
  local_search:
    top_k_entities: 20
    top_k_relationships: 20
    max_context_tokens: 16000
    temperature: 0
  global_search:
    dynamic_search_threshold: 1
    max_context_tokens: 16000
    temperature: 0

basic_search:
  # Test verbatim value extraction
  - question: "What is the rentable area of the premises?"
    expected_answer: "The rentable area is approximately 2,156 square feet."
    bad_answer: "The premises is a retail unit."

local_search:
  # Test financial entity extraction
  - question: "What are the minimum rent amounts and how do they escalate over the lease term?"
    expected_answer: "Minimum rent is $135.00/sqft ($291,060/year) for Years 1-2, $140.00/sqft ($301,840/year) in Year 3, and $145.00/sqft ($312,620/year) for Years 4-5, all plus HST."
    bad_answer: "Rent increases over time."

  # Test relationship extraction between parties and obligations
  - question: "Who are the parties to the lease and what are their primary obligations?"
    expected_answer: "YORKVILLE OFFICE RETAIL CORPORATION is the Landlord, obligated to deliver and maintain the premises. EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN is the Tenant, obligated to pay minimum and additional rent, maintain the interior, and comply with use restrictions. BANG & OLUFSEN A/S is the Guarantor."
    bad_answer: "There is a landlord and a tenant."

  # Test notice/condition chain
  - question: "What notice is required to exercise the extension option and when must it be given?"
    expected_answer: "The tenant must give 180 days prior written notice to exercise an extension option, before the expiry of the current term or option period."
    bad_answer: "The tenant must give advance notice to extend."

global_search:
  # Test community synthesis of all financial terms
  - question: "Provide a complete summary of all financial terms in this lease."
    expected_answer: "Financial terms include: Minimum Rent of $291,060–$312,620/year escalating over 5 years; Additional Rent covering proportionate share of operating costs and taxes; Security Deposit of $71,464.22 including HST; liability insurance requirement of $5,000,000; and all payments are subject to HST."
    bad_answer: "The tenant pays rent and other charges."
```

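Because exact figures carry most of the scoring signal, it is worth confirming that the numbers inside an `expected_answer` agree with each other before committing them. As a sketch, the annual rents quoted in the lease example can be recomputed from the per-square-foot rates and the 2,156 sq ft rentable area. The helper names `annual_rent` and `mismatches` are assumptions introduced for illustration.

```python
RENTABLE_AREA_SQFT = 2156  # from the basic_search expected_answer

# (label, rate in dollars per sq ft, annual figure quoted in expected_answer)
RENT_STEPS = [
    ("Years 1-2", 135, 291_060),
    ("Year 3", 140, 301_840),
    ("Years 4-5", 145, 312_620),
]


def annual_rent(rate_per_sqft, area_sqft=RENTABLE_AREA_SQFT):
    """Annual minimum rent in whole dollars, before HST."""
    return rate_per_sqft * area_sqft


def mismatches(steps=RENT_STEPS):
    """Return the labels whose quoted annual rent disagrees with rate x area."""
    return [label for label, rate, quoted in steps if annual_rent(rate) != quoted]


# An empty list means every quoted figure is internally consistent.
print(mismatches())
```

A check like this only shows that the figures agree with each other. They still have to be verified against the source document.
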
### Financial Reports

```yaml
eval_params:
  local_search:
    top_k_entities: 25
    max_context_tokens: 16000
    temperature: 0
  global_search:
    dynamic_search_threshold: 0
    max_context_tokens: 20000
    temperature: 0

basic_search:
  - question: "What was total revenue for the fiscal year?"
    expected_answer: "Total revenue for the fiscal year was $4.2 billion, a 12% increase year-over-year."
    bad_answer: "Revenue information is not available."

local_search:
  - question: "What are the primary risk factors identified in the report?"
    expected_answer: "The report identifies three primary risk factors: interest rate sensitivity affecting the loan portfolio by approximately $45M per 100bps movement; concentration risk with the top 10 clients representing 38% of revenue; and regulatory compliance risk from pending Basel IV implementation."
    bad_answer: "The company faces various risks."

global_search:
  - question: "Summarise the company's financial performance and outlook."
    expected_answer: "Revenue grew 12% to $4.2B driven by commercial lending expansion. Net income increased 8% to $620M. The outlook projects 8–10% revenue growth supported by the acquisition of three regional banks. Key risks include interest rate exposure and regulatory changes."
    bad_answer: "The company performed well and expects continued growth."
```

### Technical Documentation

```yaml
eval_params:
  basic_search:
    k: 15
    temperature: 0
  local_search:
    top_k_entities: 15
    max_context_tokens: 12000
    temperature: 0

basic_search:
  - question: "What is the maximum payload size for the upload endpoint?"
    expected_answer: "The upload endpoint accepts a maximum payload of 100MB per request."
    bad_answer: "There are limits on file upload sizes."

local_search:
  - question: "What authentication methods does the API support?"
    expected_answer: "The API supports three authentication methods: API key via Authorization header, OAuth 2.0 Bearer tokens with 1-hour expiry, and HMAC-SHA256 request signing for server-to-server calls."
    bad_answer: "The API requires authentication."

  - question: "What are the rate limits and how are they enforced?"
    expected_answer: "Rate limits are 1,000 requests per minute per API key and 10,000 per day. Exceeded limits return HTTP 429 with a Retry-After header. Enterprise accounts have custom limits."
    bad_answer: "There are rate limits on API usage."

global_search:
  - question: "What are all the error codes documented and what do they mean?"
    expected_answer: "Documented error codes include: 400 Bad Request for invalid parameters; 401 Unauthorized for missing/invalid API key; 403 Forbidden for insufficient permissions; 404 Not Found for missing resources; 429 Too Many Requests for rate limit exceeded; 500 Internal Server Error for system failures; and 503 Service Unavailable during maintenance."
    bad_answer: "The API returns standard HTTP error codes."
```

---

## Complete Example Eval Files

### Minimal eval file

```yaml
local_search:
  - question: "Who are the landlord and tenant in this lease?"
    expected_answer: "The landlord is YORKVILLE OFFICE RETAIL CORPORATION and the tenant is EPIC LUXURY SYSTEMS INC. operating as Bang & Olufsen."
    bad_answer: "There is a landlord and tenant named in the document."
```

---

### Full commercial lease eval file

```yaml
eval_params:
  basic_search:
    k: 10
    temperature: 0
    max_tokens: 2048

  local_search:
    top_k_entities: 20
    top_k_relationships: 20
    text_unit_prop: 0.5
    community_prop: 0.3
    max_context_tokens: 16000
    temperature: 0
    max_tokens: 4096

  global_search:
    max_context_tokens: 16000
    dynamic_search_threshold: 1
    dynamic_search_keep_parent: true
    temperature: 0
    max_tokens: 4096

# ---------------------------------------------------------------------------
# BASIC SEARCH: simple factual lookups from document text
# ---------------------------------------------------------------------------
basic_search:
  - question: "What is the rentable area of the leased premises?"
    expected_answer: "The rentable area is approximately 2,156 square feet."
    bad_answer: "The premises size is not specified."

  - question: "What is the address of the leased property?"
    expected_answer: "The property is located at 135 Yorkville Avenue, Toronto, Ontario."
    bad_answer: "The document contains a property address somewhere."

  - question: "What is the security deposit amount?"
    expected_answer: "The security deposit is $71,464.22 including HST."
    bad_answer: "A security deposit is required but the amount is not mentioned."

  - question: "What is the fixturing period length?"
    expected_answer: "The tenant receives a 60-day fixturing period prior to the commencement date for tenant improvements, during which no minimum rent is payable."
    bad_answer: "There is a period before the lease starts for the tenant to prepare."

  - question: "What insurance coverage amounts are required?"
    expected_answer: "The tenant must maintain comprehensive general liability insurance of not less than $5,000,000 per occurrence."
    bad_answer: "The tenant must have insurance."

# ---------------------------------------------------------------------------
# LOCAL SEARCH: entity and relationship focused questions
# ---------------------------------------------------------------------------
local_search:
  - question: "Who are the parties to this lease agreement?"
    expected_answer: "The parties are: YORKVILLE OFFICE RETAIL CORPORATION as Landlord; EPIC LUXURY SYSTEMS INC. o/a BANG & OLUFSEN as Tenant; and BANG & OLUFSEN A/S as Guarantor guaranteeing the tenant's obligations."
    bad_answer: "There is a landlord and a tenant in this agreement."

  - question: "What are the minimum rent amounts for each year of the lease term?"
    expected_answer: "Minimum rent is $135.00 per square foot ($291,060.00 annually, $24,255.00 monthly) for Years 1–2; $140.00 per square foot ($301,840.00 annually) for Year 3; and $145.00 per square foot ($312,620.00 annually) for Years 4–5. All amounts are plus HST."
    bad_answer: "The rent increases each year over the course of the lease."

  - question: "What are the tenant's extension rights and what are the conditions to exercise them?"
    expected_answer: "The tenant has two options to extend the lease term for five years each, exercisable by providing 180 days prior written notice before expiry of the then-current term, provided the tenant is not in default and has not assigned or sublet the premises."
    bad_answer: "The tenant can potentially extend the lease."

  - question: "What is the commencement date and when does the lease expire?"
    expected_answer: "The lease commences on January 15, 2026, following a 60-day fixturing period. The initial 5-year term expires on January 14, 2031. If both extension options are exercised, the lease could run until January 14, 2041."
    bad_answer: "The lease starts in 2026 and runs for 5 years."

  - question: "What use restrictions apply to the premises?"
    expected_answer: "The premises may only be used for the retail sale of luxury consumer electronics and related accessories under the Bang & Olufsen brand. Any change of use or co-tenancy requires prior written consent of the landlord."
    bad_answer: "The tenant must use the space for its business."

  - question: "What maintenance obligations does the tenant have?"
    expected_answer: "The tenant is responsible for maintaining the interior of the premises in good repair, including all non-structural elements, interior finishes, fixtures, tenant's equipment, HVAC units serving only the premises, and signage. The tenant must redecorate every five years."
    bad_answer: "The tenant must keep the premises in good condition."

  - question: "What are the events of default and what cure periods apply?"
    expected_answer: "Events of default include: failure to pay rent (10-day cure period after written notice); breach of any non-monetary obligation (30-day cure period, or such longer period as reasonably required); insolvency or bankruptcy of the tenant; and abandonment of the premises."
    bad_answer: "Non-payment of rent and other breaches are defaults."

  - question: "What additional rent components does the tenant pay?"
    expected_answer: "Additional rent includes: the tenant's proportionate share (approximately 4.2%) of building operating costs; proportionate share of property and realty taxes; utility charges for the premises; HVAC maintenance costs; and waste removal charges."
    bad_answer: "The tenant pays more than just base rent."

  - question: "What are the landlord's obligations regarding the building and common areas?"
    expected_answer: "The landlord must maintain the structural elements of the building, roof, exterior walls, common areas, and building systems (HVAC, plumbing, electrical) serving areas beyond the premises. The landlord must keep common areas clean and in good repair."
    bad_answer: "The landlord is responsible for some building maintenance."

  - question: "What happens to the lease if the premises are damaged or destroyed?"
    expected_answer: "If the premises are damaged, the landlord must restore them within 180 days unless the damage affects more than 50% of the building, in which case the landlord may terminate the lease on 60 days notice. If restoration takes more than 180 days, the tenant may terminate."
    bad_answer: "Damage provisions are addressed in the lease."

# ---------------------------------------------------------------------------
# GLOBAL SEARCH: thematic and summary questions
# ---------------------------------------------------------------------------
global_search:
  - question: "Provide a complete financial summary of this commercial lease including all costs the tenant must pay."
    expected_answer: "The tenant's total financial obligations include: minimum rent escalating from $291,060/year ($135/sqft) in Years 1-2 to $312,620/year ($145/sqft) in Years 4-5; proportionate share (4.2%) of operating costs and property taxes as additional rent; a security deposit of $71,464.22 including HST; annual HVAC maintenance; and a comprehensive general liability insurance requirement of $5,000,000 per occurrence. All monetary amounts are subject to HST."
    bad_answer: "The tenant must pay rent and various fees over the lease term."

  - question: "Summarise all rights the tenant holds under this lease."
    expected_answer: "The tenant's rights include: two 5-year extension options (180 days notice required); a right of first refusal on adjacent available space; a termination right if the premises are not restored within 180 days after damage; a co-tenancy right permitting rent reduction if anchor tenants vacate; and a right to sublet or assign with landlord consent not to be unreasonably withheld."
    bad_answer: "The tenant has various rights regarding the lease term and property use."

  - question: "What are the key terms and timeline of this lease from start to finish?"
    expected_answer: "Timeline: 60-day fixturing period (November 16 – January 14 2026, rent-free for minimum rent); Commencement January 15 2026; initial 5-year term expiring January 14 2031; First extension option to January 14 2036; Second extension option to January 14 2041. Rent escalates in three steps. 180-day notice required for extension. 90-day notice required for termination where applicable."
    bad_answer: "The lease runs for five years starting in 2026 with options to extend."

  - question: "What obligations does each party have under this lease?"
    expected_answer: "Landlord obligations: deliver premises in shell condition; maintain structure, roof, common areas, and building systems; provide parking; remedy defaults within 30 days notice. Tenant obligations: pay minimum and additional rent; maintain interior; carry $5M liability insurance; restrict use to luxury electronics retail; not assign without consent; redecorate every 5 years; restore premises on expiry. Guarantor: guarantee all tenant obligations unconditionally."
    bad_answer: "The landlord and tenant each have obligations to perform under the agreement."
```

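Date chains are another place where an `expected_answer` can silently contradict itself. As a sketch, the timeline quoted in the lease example can be re-derived with Python's standard `datetime` module. The helper `add_years` and the constant names are assumptions introduced for illustration, and the calculation assumes the 60-day fixturing period ends the day before commencement.

```python
from datetime import date, timedelta

COMMENCEMENT = date(2026, 1, 15)  # from the local_search expected_answer
FIXTURING_DAYS = 60
TERM_YEARS = 5
EXTENSION_YEARS = 5


def add_years(d, years):
    """Shift a date by whole years (safe here: no Feb 29 is involved)."""
    return d.replace(year=d.year + years)


# The fixturing period runs for the 60 days immediately before commencement.
fixturing_start = COMMENCEMENT - timedelta(days=FIXTURING_DAYS)

# A term expires the day before the matching anniversary of commencement.
initial_expiry = add_years(COMMENCEMENT, TERM_YEARS) - timedelta(days=1)
first_extension_expiry = add_years(initial_expiry, EXTENSION_YEARS)
second_extension_expiry = add_years(initial_expiry, 2 * EXTENSION_YEARS)

for label, value in [
    ("Fixturing starts", fixturing_start),
    ("Initial term expires", initial_expiry),
    ("First extension expires", first_extension_expiry),
    ("Second extension expires", second_extension_expiry),
]:
    print(f"{label}: {value.isoformat()}")
```

If a derived date disagrees with the document, trust the document and correct the `expected_answer` or the assumption behind the calculation.
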
---

*For schema reference, see: `packages/nest-doc-processing-cli/src/schemas/eval-test-cases.schema.yaml`*

*To generate a blank template: `nestdoc eval init --output ./eval.yaml`*