@claritylabs/cl-sdk 0.3.1 → 0.6.0

package/README.md CHANGED
@@ -1,8 +1,8 @@
1
- # CL-0 SDK
1
+ # CL-SDK
2
2
 
3
- [Clarity Labs](https://claritylabs.inc) allows insurers to understand their clients as well as they know themselves. Having a better understanding of clients means insurers can automate servicing to reduce costs and identify coverage gaps to cross-sell products.
3
+ [Clarity Labs](https://claritylabs.inc) allows insurers to understand their clients as well as they know themselves. A better understanding of clients means insurers can automate servicing to reduce costs and identify coverage gaps to cross-sell products.
4
4
 
5
- CL-0 SDK is the open infrastructure layer that makes this possible: a shared intelligence system that any product or agent can import to understand, reason about, and act on insurance documents and workflows.
5
+ CL-SDK is the open infrastructure layer that makes this possible: a pure TypeScript library for extracting, reasoning about, and acting on insurance documents. Provider-agnostic by design: bring any LLM, any embedding model, any storage backend.
6
6
 
7
7
  ## Installation
8
8
 
@@ -25,199 +25,587 @@ npm install better-sqlite3
25
25
 
26
26
  ### Document Extraction
27
27
 
28
- The v6 extraction pipeline uses a coordinator/worker pattern with provider-agnostic callbacks:
28
+ CL-SDK extracts structured data from insurance PDFs using a multi-agent pipeline. You provide two callback functions — `generateText` and `generateObject` — and the SDK handles the rest:
29
29
 
30
30
  ```typescript
31
31
  import { createExtractor } from "@claritylabs/cl-sdk";
32
- import { anthropic } from "@ai-sdk/anthropic"; // or any provider
33
- import { generateText, generateObject } from "ai";
34
32
 
35
- const extract = createExtractor({
33
+ const extractor = createExtractor({
36
34
  generateText: async ({ prompt, system, maxTokens }) => {
37
- const { text, usage } = await generateText({
38
- model: anthropic("claude-sonnet-4-6"),
39
- prompt,
40
- system,
41
- maxTokens,
42
- });
43
- return { text, usage };
35
+ // Wrap your preferred LLM provider
36
+ const result = await yourProvider.generate({ prompt, system, maxTokens });
37
+ return { text: result.text, usage: result.usage };
44
38
  },
45
39
  generateObject: async ({ prompt, system, schema, maxTokens }) => {
46
- const { object, usage } = await generateObject({
47
- model: anthropic("claude-sonnet-4-6"),
48
- prompt,
49
- system,
50
- schema,
51
- maxTokens,
52
- });
53
- return { object, usage };
40
+ // schema is a Zod schema; use it for structured output
41
+ const result = await yourProvider.generateStructured({ prompt, system, schema, maxTokens });
42
+ return { object: result.object, usage: result.usage };
54
43
  },
55
44
  });
56
45
 
57
- const pdfBase64 = "..."; // base64-encoded PDF
58
- const result = await extract.extract(pdfBase64);
59
- console.log(result.document); // Structured InsuranceDocument
46
+ const pdfBase64 = "..."; // base64-encoded insurance PDF
47
+ const result = await extractor.extract(pdfBase64);
48
+ console.log(result.document); // Typed InsuranceDocument (policy or quote)
49
+ console.log(result.chunks); // DocumentChunk[] ready for vector storage
60
50
  ```
61
51
 
62
- ### With PDF to Image Conversion
52
+ ### With PDF-to-Image Conversion
63
53
 
64
- For providers that don't support native PDF input:
54
+ For providers that don't support native PDF input (e.g., OpenAI):
65
55
 
66
56
  ```typescript
67
- const extract = createExtractor({
57
+ const extractor = createExtractor({
68
58
  generateText: /* ... */,
69
59
  generateObject: /* ... */,
70
60
  convertPdfToImages: async (pdfBase64, startPage, endPage) => {
71
61
  // Convert PDF pages to images using your preferred library
72
- return [
73
- { imageBase64: "...", mimeType: "image/png" },
74
- // ... one per page
75
- ];
62
+ return [{ imageBase64: "...", mimeType: "image/png" }]; // one per page
76
63
  },
77
64
  });
78
65
  ```
79
66
 
80
- ### Storage (Optional)
81
-
82
- ```typescript
83
- import { createExtractor } from "@claritylabs/cl-sdk";
84
- import { SQLiteDocumentStore, SQLiteMemoryStore } from "@claritylabs/cl-sdk/storage/sqlite";
85
-
86
- const documentStore = new SQLiteDocumentStore("./docs.db");
87
- const memoryStore = new SQLiteMemoryStore("./memory.db");
88
-
89
- const extract = createExtractor({
90
- generateText: /* ... */,
91
- generateObject: /* ... */,
92
- documentStore,
93
- memoryStore,
94
- });
95
- ```
96
-
97
67
  ## Architecture
98
68
 
99
- ### Provider-Agnostic Design
69
+ ### Provider-Agnostic Callbacks
100
70
 
101
- CL-0 SDK has **zero framework dependencies**. You provide simple callback functions:
71
+ CL-SDK has **zero framework dependencies**. All LLM interaction happens through two callback types:
102
72
 
103
73
  ```typescript
104
74
  type GenerateText = (params: {
105
75
  prompt: string;
106
76
  system?: string;
107
77
  maxTokens: number;
108
- }) => Promise<{ text: string; usage?: TokenUsage }>;
78
+ providerOptions?: Record<string, unknown>;
79
+ }) => Promise<{ text: string; usage?: { inputTokens: number; outputTokens: number } }>;
109
80
 
110
81
  type GenerateObject<T> = (params: {
111
82
  prompt: string;
112
83
  system?: string;
113
84
  schema: ZodSchema<T>;
114
85
  maxTokens: number;
115
- }) => Promise<{ object: T; usage?: TokenUsage }>;
86
+ providerOptions?: Record<string, unknown>;
87
+ }) => Promise<{ object: T; usage?: { inputTokens: number; outputTokens: number } }>;
116
88
  ```
117
89
 
118
- Works with any provider: OpenAI, Anthropic, Google, Mistral, Bedrock, Azure, Ollama, etc.
90
+ Works with any provider: Anthropic, OpenAI, Google, Mistral, Bedrock, Azure, Ollama, etc. You write the adapter once; the SDK calls it throughout the pipeline.
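
As a rough illustration of "write the adapter once", the sketch below wraps a made-up provider client in the `GenerateText` shape. `FakeProviderClient`, its `complete` method, and the `prompt_tokens` / `completion_tokens` field names are hypothetical stand-ins rather than part of CL-SDK or any particular vendor SDK; only the callback signature is taken from the types above.

```typescript
// Callback shape copied from the README's GenerateText type.
type TokenUsage = { inputTokens: number; outputTokens: number };
type GenerateText = (params: {
  prompt: string;
  system?: string;
  maxTokens: number;
  providerOptions?: Record<string, unknown>;
}) => Promise<{ text: string; usage?: TokenUsage }>;

// Hypothetical provider response; real SDKs name these fields differently.
interface FakeProviderResponse {
  output: string;
  usage?: { prompt_tokens: number; completion_tokens: number };
}

// Hypothetical provider client, used only for this sketch.
interface FakeProviderClient {
  complete(req: {
    prompt: string;
    system?: string;
    max_tokens: number;
  }): Promise<FakeProviderResponse>;
}

// Map the provider's usage fields onto the shape the callbacks return.
function normalizeUsage(
  usage: FakeProviderResponse["usage"],
): TokenUsage | undefined {
  if (!usage) return undefined;
  return {
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
  };
}

// Build a GenerateText callback around any client with the shape above.
function makeGenerateText(client: FakeProviderClient): GenerateText {
  return async ({ prompt, system, maxTokens }) => {
    const res = await client.complete({ prompt, system, max_tokens: maxTokens });
    return { text: res.output, usage: normalizeUsage(res.usage) };
  };
}
```

Keeping the field mapping in one small function means a provider swap only touches the adapter, not the code that consumes `usage`.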
119
91
 
120
92
  ### Extraction Pipeline
121
93
 
122
- The `createExtractor` function returns an extraction engine:
94
+ The extraction system uses a **coordinator/worker pattern** — a coordinator agent plans the work, specialized extractor agents execute in parallel, and a review loop ensures completeness.
95
+
96
+ ```
97
+ ┌─────────────┐ ┌─────────────┐ ┌──────────────────────┐
98
+ │ 1. CLASSIFY │────▶│ 2. PLAN │────▶│ 3. EXTRACT (parallel)│
99
+ │ │ │ │ │ │
100
+ │ Document │ │ Select │ │ Run focused │
101
+ │ type, line │ │ template, │ │ extractors against │
102
+ │ of business │ │ assign │ │ assigned page │
103
+ │ │ │ extractors │ │ ranges │
104
+ │ │ │ to pages │ │ │
105
+ └─────────────┘ └─────────────┘ └──────────┬───────────┘
106
+
107
+ ┌─────────────┐ ┌─────────────┐ ┌──────────▼───────────┐
108
+ │ 6. FORMAT │◀────│ 5. ASSEMBLE │◀────│ 4. REVIEW │
109
+ │ │ │ │ │ │
110
+ │ Clean up │ │ Merge all │ │ Check completeness │
111
+ │ markdown │ │ results │ │ against template, │
112
+ │ tables, │ │ into final │ │ dispatch follow-up │
113
+ │ spacing │ │ document │ │ extractors for gaps │
114
+ └──────┬──────┘ └─────────────┘ └──────────────────────┘
115
+
116
+ ┌──────▼──────┐
117
+ │ 7. CHUNK │
118
+ │ Break into │
119
+ │ retrieval- │
120
+ │ ready │
121
+ │ chunks │
122
+ └─────────────┘
123
+ ```
124
+
125
+ #### Phase 1: Classify
126
+
127
+ The coordinator sends the document to `generateObject` with the `ClassifyResultSchema`. It determines:
128
+ - **Document type** — policy or quote
129
+ - **Policy types** — one or more lines of business (e.g., `general_liability`, `workers_comp`)
130
+ - **Confidence score**
131
+
132
+ #### Phase 2: Plan
133
+
134
+ Based on the classification, the coordinator selects a **line-of-business template** (e.g., `workers_comp`, `cyber`, `homeowners_ho3`) that defines expected sections and page hints. It then generates an **extraction plan** — a list of tasks that map specific extractors to page ranges within the PDF.
135
+
136
+ #### Phase 3: Extract
137
+
138
+ Focused extractor agents are dispatched **in parallel** (concurrency-limited, default 2). Each extractor targets a specific data domain against its assigned page range. The 11 extractor types are:
123
139
 
124
- 1. **Classify**: Determine document type (policy/quote) and line of business
125
- 2. **Plan**: Generate extraction plan using line-specific templates
126
- 3. **Extract**: Dispatch focused extractors in parallel (concurrency-limited, default 2)
127
- 4. **Review**: Check completeness against template requirements (up to 2 review rounds)
128
- 5. **Assemble**: Merge results into final `InsuranceDocument`
140
+ | Extractor | What It Extracts |
141
+ |-----------|-----------------|
142
+ | `carrier_info` | Carrier name, NAIC, AM Best rating, MGA, underwriter, broker |
143
+ | `named_insured` | Insured name, DBA, address, entity type, FEIN, SIC/NAICS |
144
+ | `declarations` | Line-specific structured declarations (varies by policy type) |
145
+ | `coverage_limits` | Coverage names, limits, deductibles, forms, triggers |
146
+ | `endorsements` | Form numbers, titles, types, content, affected parties |
147
+ | `exclusions` | Exclusion titles, content, applicability |
148
+ | `conditions` | Duties after loss, cancellation, other insurance, etc. |
149
+ | `premium_breakdown` | Premium amounts, taxes, fees, payment plans, rating basis |
150
+ | `loss_history` | Loss runs, claim records, experience modification |
151
+ | `supplementary` | Regulatory context, contacts, TPA, claims contacts |
152
+ | `sections` | Raw section content (fallback for unmatched sections) |
153
+
154
+ Each extractor writes its results to an in-memory `Map`. Results accumulate across all extractors.
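
The dispatch step can be pictured with a minimal sketch like the one below: a fixed number of workers pull tasks from a shared queue and append results to a `Map` keyed by extractor name. `ExtractorTask`, `runExtractors`, and the per-extractor array layout are assumptions made for illustration; this is not the SDK's internal scheduler or its exported `pLimit`.

```typescript
// Hypothetical task shape: one extractor assigned to one page range.
type ExtractorTask = { extractor: string; startPage: number; endPage: number };
type ExtractorFn = (task: ExtractorTask) => Promise<unknown>;

// Run tasks with at most `concurrency` in flight; accumulate results per extractor.
async function runExtractors(
  tasks: ExtractorTask[],
  run: ExtractorFn,
  concurrency = 2,
): Promise<Map<string, unknown[]>> {
  const results = new Map<string, unknown[]>();
  let next = 0; // index of the next unclaimed task

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const task = tasks[next++];
      if (!task) break;
      const value = await run(task);
      const bucket = results.get(task.extractor) ?? [];
      bucket.push(value);
      results.set(task.extractor, bucket);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because JavaScript runs each worker's synchronous sections one at a time, claiming a task with `next++` needs no lock.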
155
+
156
+ #### Phase 4: Review
157
+
158
+ After initial extraction, a review loop (up to `maxReviewRounds`, default 2) checks completeness against the template's expected sections. If gaps are found, additional extractor tasks are dispatched to fill missing data. This iterative refinement ensures comprehensive extraction.
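
One way to picture the review loop is the sketch below. It uses a deterministic completeness check as a stand-in, since the README does not say how the reviewer judges gaps (it may well be an LLM call). `findMissingSections`, `reviewLoop`, and the `dispatch` callback are hypothetical names.

```typescript
// Accumulated extractor output, keyed by section/extractor name.
type Results = Map<string, unknown>;

// A section counts as missing if it is absent, null, or an empty array.
function findMissingSections(expected: string[], results: Results): string[] {
  return expected.filter((name) => {
    const value = results.get(name);
    if (value === undefined || value === null) return true;
    return Array.isArray(value) && value.length === 0;
  });
}

// Re-dispatch extractors for gaps until nothing is missing or rounds run out.
async function reviewLoop(
  expected: string[],
  results: Results,
  dispatch: (missing: string[]) => Promise<void>,
  maxReviewRounds = 2,
): Promise<{ rounds: number; stillMissing: string[] }> {
  let rounds = 0;
  let missing = findMissingSections(expected, results);
  while (missing.length > 0 && rounds < maxReviewRounds) {
    await dispatch(missing); // follow-up extractors write into `results`
    rounds++;
    missing = findMissingSections(expected, results);
  }
  return { rounds, stillMissing: missing };
}
```

The round cap bounds cost: a section that truly does not exist in the PDF is reported as still missing instead of being retried indefinitely.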
159
+
160
+ #### Phase 5: Assemble
161
+
162
+ All extractor results are merged into a final validated `InsuranceDocument`.
163
+
164
+ #### Phase 6: Format
165
+
166
+ A formatting agent pass cleans up markdown in all content-bearing string fields (sections, subsections, endorsements, exclusions, conditions, summary). It fixes:
167
+
168
+ - **Pipe tables missing separator rows** — adds `| --- | --- |` and leading/trailing pipes
169
+ - **Space-aligned tables** — converts whitespace-padded columns into proper markdown tables
170
+ - **Sub-items mixed into tables** — pulls indented sub-items out of tables into lists
171
+ - **Mixed table/prose content** — handles each segment independently
172
+ - **General cleanup** — excessive blank lines, trailing whitespace, orphaned formatting markers
173
+
174
+ Content is batched (up to 20 fields per call) and sent through `generateText` for formatting cleanup. Token usage is tracked the same as other pipeline steps.
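
The batching described above amounts to slicing the list of content fields into groups of at most 20 before each `generateText` call. A minimal sketch, assuming a hypothetical `batchFields` helper (not an SDK export):

```typescript
// Split items into consecutive batches of at most `batchSize` elements.
function batchFields<T>(fields: T[], batchSize = 20): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < fields.length; i += batchSize) {
    batches.push(fields.slice(i, i + batchSize));
  }
  return batches;
}
```

With 45 content fields this yields three batches (20, 20, and 5), so the formatting pass would cost three LLM calls.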
175
+
176
+ #### Phase 7: Chunk
177
+
178
+ The formatted document is split into `DocumentChunk[]` for vector storage. Chunk IDs are deterministic and follow the pattern `${documentId}:${type}:${index}`.
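
A small sketch of how IDs in that pattern could be produced. It assumes `index` counts chunks per type, starting at 0; the README does not specify whether the counter is per type or global, and `ChunkInput`, `ChunkSketch`, and `assignChunkIds` are hypothetical names.

```typescript
type ChunkInput = { type: string; text: string };
type ChunkSketch = { id: string; documentId: string; type: string; text: string };

// Assign IDs of the form `${documentId}:${type}:${index}` with a per-type counter.
function assignChunkIds(documentId: string, chunks: ChunkInput[]): ChunkSketch[] {
  const counters = new Map<string, number>();
  return chunks.map(({ type, text }) => {
    const index = counters.get(type) ?? 0;
    counters.set(type, index + 1);
    return { id: `${documentId}:${type}:${index}`, documentId, type, text };
  });
}
```

Since the IDs depend only on the document ID and chunk order, re-chunking the same document yields the same IDs, which makes re-indexing an upsert rather than a duplicate insert.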
179
+
180
+ ### Configuration
129
181
 
130
182
  ```typescript
131
- const extract = createExtractor({
183
+ const extractor = createExtractor({
184
+ // Required: LLM callbacks
132
185
  generateText,
133
186
  generateObject,
134
- concurrency: 2, // Parallel extractor limit
135
- maxReviewRounds: 2, // Review loop iterations
136
- onTokenUsage: (usage) => {
137
- console.log(`${usage.inputTokens} in, ${usage.outputTokens} out`);
138
- },
187
+
188
+ // Optional: PDF vision mode
189
+ convertPdfToImages: async (pdfBase64, startPage, endPage) => [{ imageBase64: "...", mimeType: "image/png" }], // one image per page
190
+
191
+ // Optional: storage backends
192
+ documentStore, // Persist extracted documents
193
+ memoryStore, // Vector search over chunks + conversation history
194
+
195
+ // Optional: tuning
196
+ concurrency: 2, // Max parallel extractors (default: 2)
197
+ maxReviewRounds: 2, // Review loop iterations (default: 2)
198
+
199
+ // Optional: observability
200
+ onTokenUsage: (usage) => console.log(`${usage.inputTokens} in, ${usage.outputTokens} out`),
201
+ onProgress: (message) => console.log(message),
202
+ log: async (message) => logger.info(message),
203
+ providerOptions: {}, // Passed through to every LLM call
139
204
  });
140
205
  ```
141
206
 
142
- ### Document Types
207
+ ### Line-of-Business Templates
208
+
209
+ Templates define what the extraction pipeline expects for each policy type. Each template specifies expected sections, page hints, and required vs. optional fields.
210
+
211
+ **Personal lines:** homeowners (HO-3, HO-5), renters (HO-4), condo (HO-6), dwelling fire, personal auto, personal umbrella, personal inland marine, flood (NFIP + private), earthquake, watercraft, recreational vehicle, farm/ranch, mobile home
212
+
213
+ **Commercial lines:** general liability, commercial property, commercial auto, workers' comp, umbrella/excess, professional liability, cyber, directors & officers, crime/fidelity
143
214
 
144
- Comprehensive TypeScript types for the insurance domain:
215
+ ## Storage
216
+
217
+ CL-SDK defines two storage interfaces (`DocumentStore` and `MemoryStore`) and ships a reference SQLite implementation. You can implement these interfaces with any backend.
218
+
219
+ ### DocumentStore
220
+
221
+ CRUD for extracted `InsuranceDocument` objects:
145
222
 
146
223
  ```typescript
147
- import type {
148
- InsuranceDocument, // PolicyDocument | QuoteDocument
149
- PolicyDocument,
150
- QuoteDocument,
151
- Coverage,
152
- Endorsement,
153
- Declaration, // 20+ line types
154
- Platform,
155
- AgentContext,
156
- } from "@claritylabs/cl-sdk";
224
+ interface DocumentStore {
225
+ save(doc: InsuranceDocument): Promise<void>;
226
+ get(id: string): Promise<InsuranceDocument | null>;
227
+ query(filters: DocumentFilters): Promise<InsuranceDocument[]>;
228
+ delete(id: string): Promise<void>;
229
+ }
157
230
  ```
158
231
 
159
- ## API Reference
232
+ Filters support: `type` (policy/quote), `carrier` (fuzzy), `insuredName` (fuzzy), `policyNumber` (exact), `quoteNumber` (exact).
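
For a custom backend, the filter semantics could be implemented along these lines. The sketch assumes "fuzzy" means a case-insensitive substring match and uses a flattened, hypothetical `StoredDoc` record in place of the real `InsuranceDocument`, whose field layout is not shown here.

```typescript
type DocKind = "policy" | "quote";

// Filter fields as listed in the README.
type DocumentFilters = {
  type?: DocKind;
  carrier?: string;
  insuredName?: string;
  policyNumber?: string;
  quoteNumber?: string;
};

// Hypothetical flattened record; the real InsuranceDocument shape may differ.
type StoredDoc = {
  id: string;
  type: DocKind;
  carrier?: string;
  insuredName?: string;
  policyNumber?: string;
  quoteNumber?: string;
};

// Assumed meaning of "fuzzy": case-insensitive substring match.
function fuzzy(haystack: string | undefined, needle: string): boolean {
  return (haystack ?? "").toLowerCase().includes(needle.trim().toLowerCase());
}

// All provided filters must match (logical AND).
function matchesFilters(doc: StoredDoc, f: DocumentFilters): boolean {
  if (f.type !== undefined && doc.type !== f.type) return false;
  if (f.carrier !== undefined && !fuzzy(doc.carrier, f.carrier)) return false;
  if (f.insuredName !== undefined && !fuzzy(doc.insuredName, f.insuredName)) return false;
  if (f.policyNumber !== undefined && doc.policyNumber !== f.policyNumber) return false;
  if (f.quoteNumber !== undefined && doc.quoteNumber !== f.quoteNumber) return false;
  return true;
}
```

An in-memory `query` could then be `docs.filter((d) => matchesFilters(d, filters))`; a SQL backend would translate the same rules into `LIKE` and `=` clauses.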
160
233
 
161
- ### Core Functions
234
+ ### MemoryStore
162
235
 
163
- | Function | Description |
164
- |----------|-------------|
165
- | `createExtractor(config)` | Create extraction engine with callbacks |
166
- | `extract.extract(pdfBase64, documentId?)` | Run full extraction pipeline |
167
- | `chunkDocument(text, maxChunkSize?)` | Chunk text for vector storage |
236
+ Vector-searchable storage for document chunks and conversation history. Requires an `EmbedText` callback for generating embeddings:
168
237
 
169
- ### Agent System
238
+ ```typescript
239
+ type EmbedText = (text: string) => Promise<number[]>;
240
+
241
+ interface MemoryStore {
242
+ // Document chunks with embeddings
243
+ addChunks(chunks: DocumentChunk[]): Promise<void>;
244
+ search(query: string, options?: { limit?: number; filter?: ChunkFilter }): Promise<DocumentChunk[]>;
245
+
246
+ // Conversation turns with embeddings
247
+ addTurn(turn: ConversationTurn): Promise<void>;
248
+ getHistory(conversationId: string, options?: { limit?: number }): Promise<ConversationTurn[]>;
249
+ searchHistory(query: string, conversationId?: string): Promise<ConversationTurn[]>;
250
+ }
251
+ ```
252
+
253
+ Search uses **cosine similarity** over embeddings to find semantically relevant chunks or conversation turns. Embedding failures are non-fatal — chunks are still stored, just not searchable by vector.
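
For anyone implementing `MemoryStore` on another backend, the ranking step can be sketched as below. The `EmbeddedChunk` type and `topK` helper are hypothetical. Chunks without an embedding are skipped, matching the note that they remain stored but are not vector-searchable.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical stored chunk; `embedding` is absent when embedding failed.
type EmbeddedChunk = { id: string; embedding?: number[] };

// Rank chunks by similarity to the query embedding and keep the best `limit`.
function topK(
  query: number[],
  chunks: EmbeddedChunk[],
  limit = 5,
): { id: string; score: number }[] {
  return chunks
    .filter((c): c is EmbeddedChunk & { embedding: number[] } => Array.isArray(c.embedding))
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

A linear scan like this is fine for a few thousand chunks; larger corpora would typically move to an approximate nearest-neighbor index.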
254
+
255
+ ### SQLite Reference Implementation
170
256
 
171
257
  ```typescript
172
- import {
173
- buildAgentSystemPrompt,
174
- buildIdentityPrompt,
175
- buildSafetyPrompt,
176
- buildCoverageGapPrompt,
177
- } from "@claritylabs/cl-sdk";
258
+ import { createSqliteStore } from "@claritylabs/cl-sdk/storage/sqlite";
259
+
260
+ const store = createSqliteStore({
261
+ path: "./cl-sdk.db",
262
+ embed: async (text) => {
263
+ // Your embedding function (OpenAI, Cohere, local model, etc.)
264
+ return await yourEmbeddingProvider.embed(text);
265
+ },
266
+ });
267
+
268
+ // Use with extractor
269
+ const extractor = createExtractor({
270
+ generateText,
271
+ generateObject,
272
+ documentStore: store.documents,
273
+ memoryStore: store.memory,
274
+ });
275
+
276
+ // Or use standalone
277
+ await store.documents.save(document);
278
+ const results = await store.memory.search("what is the deductible?", { limit: 5 });
279
+
280
+ // Clean up
281
+ store.close();
282
+ ```
283
+
284
+ ## Agent System
285
+
286
+ CL-SDK includes a composable prompt system for building insurance-aware AI agents. The `buildAgentSystemPrompt` function assembles modular prompt segments based on the agent's context:
287
+
288
+ ```typescript
289
+ import { buildAgentSystemPrompt } from "@claritylabs/cl-sdk";
178
290
 
179
291
  const systemPrompt = buildAgentSystemPrompt({
180
- platform: "email",
181
- intent: "direct",
292
+ platform: "email", // email | chat | sms | slack | discord
293
+ intent: "direct", // direct | mediated | observed
182
294
  userName: "John",
183
295
  companyName: "Acme Insurance",
184
296
  });
185
297
  ```
186
298
 
187
- ### Tool Definitions
299
+ ### Prompt Modules
300
+
301
+ The system prompt is composed from these modules:
302
+
303
+ | Module | Purpose |
304
+ |--------|---------|
305
+ | **identity** | Agent role, company context, professional persona |
306
+ | **intent** | Behavioral rules based on platform and interaction mode |
307
+ | **formatting** | Output formatting rules (markdown for chat, plaintext for email/SMS) |
308
+ | **safety** | Security guardrails, prompt injection resistance, data handling |
309
+ | **coverage-gaps** | Coverage gap disclosure rules (only in mediated/observed mode) |
310
+ | **coi-routing** | Certificate of Insurance request handling |
311
+ | **quotes-policies** | Guidance for distinguishing quotes vs. active policies |
312
+ | **conversation-memory** | Context about conversation history and document retrieval |
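
The table can be read as a selection rule over the agent context. The sketch below is one interpretation, not the SDK's implementation: `selectModules` and `outputStyle` are hypothetical helpers, the module order is assumed, and the only conditional behavior encoded is what the table states (coverage-gaps only in mediated/observed mode; markdown for chat, plaintext for email/SMS).

```typescript
type Platform = "email" | "chat" | "sms" | "slack" | "discord";
type Intent = "direct" | "mediated" | "observed";
type ModuleName =
  | "identity"
  | "intent"
  | "formatting"
  | "safety"
  | "coverage-gaps"
  | "coi-routing"
  | "quotes-policies"
  | "conversation-memory";

// Pick the prompt modules for a given context (order is an assumption).
function selectModules(ctx: { platform: Platform; intent: Intent }): ModuleName[] {
  const modules: ModuleName[] = ["identity", "intent", "formatting", "safety"];
  if (ctx.intent === "mediated" || ctx.intent === "observed") {
    modules.push("coverage-gaps");
  }
  modules.push("coi-routing", "quotes-policies", "conversation-memory");
  return modules;
}

// Formatting rule from the table; slack/discord are not specified there.
function outputStyle(platform: Platform): "markdown" | "plaintext" | "unspecified" {
  if (platform === "chat") return "markdown";
  if (platform === "email" || platform === "sms") return "plaintext";
  return "unspecified";
}
```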
313
+
314
+ ### Message Intent Classification
315
+
316
+ Classify incoming messages to route them appropriately:
317
+
318
+ ```typescript
319
+ import { buildClassifyMessagePrompt } from "@claritylabs/cl-sdk";
320
+
321
+ const prompt = buildClassifyMessagePrompt("email");
322
+ // Returns classification prompt for intents:
323
+ // policy_question, coi_request, renewal_inquiry, claim_report,
324
+ // coverage_shopping, general, unrelated
325
+ ```
326
+
327
+ ## Application Processing Pipeline
328
+
329
+ The application pipeline processes insurance applications through an agentic coordinator/worker system — small focused agents handle classification, field extraction, auto-fill, question batching, reply routing, and PDF mapping. Supports persistent state and vector-based answer backfill from prior applications.
330
+
331
+ ### Quick Start
332
+
333
+ ```typescript
334
+ import { createApplicationPipeline } from "@claritylabs/cl-sdk";
335
+
336
+ const pipeline = createApplicationPipeline({
337
+ generateText,
338
+ generateObject,
339
+ applicationStore, // persistent state storage
340
+ documentStore, // for policy/quote lookups during auto-fill
341
+ memoryStore, // for vector-based answer backfill
342
+ orgContext: [ // business context for auto-fill
343
+ { key: "company_name", value: "Acme Corp", category: "company_info" },
344
+ { key: "company_address", value: "123 Main St", category: "company_info" },
345
+ ],
346
+ });
347
+
348
+ // Process a new application PDF
349
+ const { state } = await pipeline.processApplication({
350
+ pdfBase64: "...",
351
+ applicationId: "app-123",
352
+ });
353
+ // state.fields → extracted fields, some already auto-filled
354
+ // state.batches → question batches ready for user collection
355
+
356
+ // Generate email for current batch
357
+ const { text: emailBody } = await pipeline.generateCurrentBatchEmail("app-123", {
358
+ companyName: "Acme Corp",
359
+ });
360
+
361
+ // Process user's reply
362
+ const { state: updated, fieldsFilled, responseText } = await pipeline.processReply({
363
+ applicationId: "app-123",
364
+ replyText: "1. Yes\n2. $1,000,000\n3. Check our website for revenue",
365
+ });
366
+ ```
367
+
368
+ ### Pipeline Phases
369
+
370
+ ```
371
+ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐
372
+ │ 1. CLASSIFY │────>│ 2. EXTRACT │────>│ 3. BACKFILL + │
373
+ │ │ │ FIELDS │ │ AUTO-FILL │
374
+ │ Is this an │ │ │ │ (parallel) │
375
+ │ application? │ │ All fillable │ │ │
376
+ │ │ │ fields as │ │ • vector backfill │
377
+ │ │ │ structured │ │ • context auto-fill │
378
+ │ │ │ data │ │ • document search │
379
+ └──────────────┘ └──────────────┘ └──────────┬──────────┘
380
+
381
+ ┌──────────────┐ ┌──────────v──────────┐
382
+ │ REPLY LOOP │<────│ 4. BATCH QUESTIONS │
383
+ │ │ │ │
384
+ │ Route intent │ │ Group unfilled │
385
+ │ Parse answers│ │ fields by topic │
386
+ │ Handle lookup│ │ Generate emails │
387
+ │ Explain field│ │ │
388
+ └──────┬───────┘ └─────────────────────┘
389
+
390
+ ┌──────v───────┐
391
+ │ 5. CONFIRM + │
392
+ │ MAP PDF │
393
+ └──────────────┘
394
+ ```
395
+
396
+ ### Focused Agents (8 types)
397
+
398
+ | Agent | Task | Model Size |
399
+ |-------|------|-----------|
400
+ | `classifier` | Detect if PDF is an application | Tiny |
401
+ | `field-extractor` | Extract all form fields | Medium |
402
+ | `auto-filler` | Match fields to business context | Small |
403
+ | `batcher` | Group fields into topic batches | Small |
404
+ | `reply-router` | Classify reply intent | Tiny |
405
+ | `answer-parser` | Extract answers from replies | Small |
406
+ | `lookup-filler` | Fill from policy/record lookups | Small |
407
+ | `email-generator` | Generate professional batch emails | Small |
408
+
409
+ ### Vector-Based Answer Backfill
410
+
411
+ The `BackfillProvider` interface enables searching prior application answers and extracted document data to pre-fill new applications:
412
+
413
+ ```typescript
414
+ interface BackfillProvider {
415
+ searchPriorAnswers(
416
+ fields: { id: string; label: string; section: string; fieldType: string }[],
417
+ options?: { limit?: number },
418
+ ): Promise<PriorAnswer[]>;
419
+ }
420
+ ```
421
+
422
+ This runs in parallel with context-based auto-fill, so the pipeline fills as many fields as possible before asking the user anything.
423
+
424
+ ### Application Prompts (for advanced use)
425
+
426
+ The individual prompt functions are still exported for custom pipelines:
427
+
428
+ ```typescript
429
+ import {
430
+ buildFieldExtractionPrompt,
431
+ buildAutoFillPrompt,
432
+ buildQuestionBatchPrompt,
433
+ buildAnswerParsingPrompt,
434
+ buildConfirmationSummaryPrompt,
435
+ buildBatchEmailGenerationPrompt,
436
+ buildReplyIntentClassificationPrompt,
437
+ buildFieldExplanationPrompt,
438
+ buildFlatPdfMappingPrompt,
439
+ buildAcroFormMappingPrompt,
440
+ buildLookupFillPrompt,
441
+ } from "@claritylabs/cl-sdk";
442
+ ```
443
+
444
+ ## Query Agent Pipeline
445
+
446
+ The query agent answers user questions against stored documents with citation-backed provenance. It mirrors the extraction pipeline's coordinator/worker pattern: a classifier decomposes questions, retrievers pull evidence in parallel, reasoners answer from evidence only, and a verifier checks grounding.
447
+
448
+ ### Quick Start
449
+
450
+ ```typescript
451
+ import { createQueryAgent } from "@claritylabs/cl-sdk";
452
+
453
+ const agent = createQueryAgent({
454
+ generateText,
455
+ generateObject,
456
+ documentStore, // where extracted documents are stored
457
+ memoryStore, // where document chunks + conversation history live
458
+ });
459
+
460
+ const result = await agent.query({
461
+ question: "What is the deductible on our GL policy?",
462
+ conversationId: "conv-123",
463
+ });
464
+
465
+ console.log(result.answer); // Natural language answer
466
+ console.log(result.citations); // Source references with exact quotes
467
+ console.log(result.confidence); // 0-1 confidence score
468
+ ```
469
+
470
+ ### Pipeline Phases
471
+
472
+ ```
473
+ ┌─────────────┐ ┌──────────────┐ ┌────────────────────┐
474
+ │ 1. CLASSIFY │────>│ 2. RETRIEVE │────>│ 3. REASON │
475
+ │ │ │ (parallel) │ │ (parallel) │
476
+ │ Intent + │ │ │ │ │
477
+ │ sub-question │ │ chunk search │ │ Answer each sub-Q │
478
+ │ decomposition│ │ doc lookup │ │ from evidence only │
479
+ │ │ │ conv history │ │ │
480
+ └──────────────┘ └──────────────┘ └─────────┬──────────┘
481
+
482
+ ┌──────────────┐ ┌─────────v──────────┐
483
+ │ 5. RESPOND │<────│ 4. VERIFY │
484
+ │ │ │ │
485
+ │ Format with │ │ Grounding check │
486
+ │ citations, │ │ Consistency check │
487
+ │ store turn │ │ Completeness check │
488
+ └──────────────┘ └────────────────────┘
489
+ ```
490
+
491
+ **Phase 1 — Classify:** Determines intent (`policy_question`, `coverage_comparison`, `document_search`, `claims_inquiry`, `general_knowledge`) and decomposes complex questions into atomic sub-questions. Each sub-question specifies which chunk types and document filters to use for retrieval.
492
+
493
+ **Phase 2 — Retrieve (parallel):** For each sub-question, a retriever searches chunk embeddings, does structured document lookups, and pulls conversation history — all in parallel. Returns ranked evidence items.
494
+
495
+ **Phase 3 — Reason (parallel):** For each sub-question, a reasoner receives only the retrieved evidence (never the full document) and produces a sub-answer with citations. Intent-specific prompts guide reasoning (e.g., coverage questions get prompts tuned for interpreting limits and endorsements).
496
+
497
+ **Phase 4 — Verify:** The verifier checks that every claim is grounded in a citation, sub-answers don't contradict each other, and no evidence was overlooked. If issues are found, it can trigger re-retrieval with broader context.
498
+
499
+ **Phase 5 — Respond:** Merges verified sub-answers into a single natural-language response with inline citations (`[1]`, `[2]`), deduplicates references, and stores the exchange as conversation turns.
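
The merge in Phase 5 could look roughly like the sketch below, which deduplicates citations across sub-answers and rewrites each sub-answer's local `[n]` markers to the merged numbering. `SubAnswer` and `mergeSubAnswers` are hypothetical, the `Citation` type is trimmed to the fields used here, and treating two citations as duplicates when both `chunkId` and `quote` match is an assumption.

```typescript
interface Citation {
  index: number;   // [1], [2], etc.
  chunkId: string;
  documentId: string;
  quote: string;
  relevance: number;
}

// Hypothetical sub-answer: text with local [n] markers plus its own citations.
type SubAnswer = { text: string; citations: Citation[] };

function mergeSubAnswers(subAnswers: SubAnswer[]): { answer: string; citations: Citation[] } {
  const merged = new Map<string, Citation>(); // key: chunkId + quote
  const parts: string[] = [];

  for (const sub of subAnswers) {
    const localToGlobal = new Map<number, number>();
    for (const c of sub.citations) {
      const key = `${c.chunkId}|${c.quote}`;
      let existing = merged.get(key);
      if (!existing) {
        existing = { ...c, index: merged.size + 1 };
        merged.set(key, existing);
      }
      localToGlobal.set(c.index, existing.index);
    }
    // Rewrite [n] markers; leave unknown markers untouched.
    parts.push(
      sub.text.replace(/\[(\d+)\]/g, (marker: string, n: string) => {
        const globalIndex = localToGlobal.get(Number(n));
        return globalIndex === undefined ? marker : `[${globalIndex}]`;
      }),
    );
  }

  return { answer: parts.join(" "), citations: Array.from(merged.values()) };
}
```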
500
+
501
+ ### Configuration
502
+
503
+ ```typescript
504
+ const agent = createQueryAgent({
505
+ // Required
506
+ generateText,
507
+ generateObject,
508
+ documentStore,
509
+ memoryStore,
510
+
511
+ // Optional: tuning
512
+ concurrency: 3, // max parallel retrievers/reasoners (default: 3)
513
+ maxVerifyRounds: 1, // verification loop iterations (default: 1)
514
+ retrievalLimit: 10, // max evidence items per sub-question (default: 10)
515
+
516
+ // Optional: observability
517
+ onTokenUsage: (usage) => console.log(`${usage.inputTokens} in, ${usage.outputTokens} out`),
518
+ onProgress: (message) => console.log(message),
519
+ log: async (message) => logger.info(message),
520
+ providerOptions: {},
521
+ });
522
+ ```
523
+
524
+ ### Citations
525
+
526
+ Every factual claim in the answer references its source:
527
+
528
+ ```typescript
529
+ interface Citation {
530
+ index: number; // [1], [2], etc.
531
+ chunkId: string; // e.g. "doc-123:coverage:2"
532
+ documentId: string;
533
+ documentType?: "policy" | "quote";
534
+ field?: string; // e.g. "coverages[0].deductible"
535
+ quote: string; // exact text from source
536
+ relevance: number; // 0-1 similarity score
537
+ }
538
+ ```
539
+
540
+ ## PDF Operations
188
541
 
189
542
  ```typescript
190
543
  import {
191
- AGENT_TOOLS,
192
- DOCUMENT_LOOKUP_TOOL,
193
- COI_GENERATION_TOOL,
544
+ extractPageRange, // Extract specific pages from a PDF
545
+ getPdfPageCount, // Get total page count
546
+ getAcroFormFields, // Enumerate form fields (text, checkbox, dropdown, radio)
547
+ fillAcroForm, // Fill and flatten AcroForm fields
548
+ overlayTextOnPdf, // Overlay text at coordinates on flat PDFs
194
549
  } from "@claritylabs/cl-sdk";
195
550
  ```
196
551
 
197
- ### PDF Operations
552
+ ## Tool Definitions
553
+
554
+ Claude `tool_use`-compatible schemas for agent integrations:
198
555
 
199
556
  ```typescript
200
557
  import {
201
- getAcroFormFields,
202
- fillAcroForm,
203
- overlayTextOnPdf,
558
+ AGENT_TOOLS, // All tools as an array
559
+ DOCUMENT_LOOKUP_TOOL, // Search/retrieve policies and quotes
560
+ COI_GENERATION_TOOL, // Generate Certificates of Insurance
561
+ COVERAGE_COMPARISON_TOOL, // Compare coverages across documents
562
+ } from "@claritylabs/cl-sdk";
563
+ ```
564
+
565
+ These are schema-only definitions (input schemas + descriptions). You provide the implementations that call your storage and PDF layers.
566
+
567
+ ## Document Types
568
+
569
+ All types are derived from Zod schemas, providing both runtime validation and TypeScript types:
570
+
571
+ ```typescript
572
+ import type {
573
+ InsuranceDocument, // PolicyDocument | QuoteDocument (discriminated union)
574
+ PolicyDocument, // Extracted policy with all enrichments
575
+ QuoteDocument, // Extracted quote with subjectivities, premium breakdown
576
+ Coverage, // Coverage name, limits, deductibles, form
577
+ EnrichedCoverage, // Coverage + additional metadata
578
+ Endorsement, // Form number, title, type, content
579
+ Exclusion, // Title, content, applicability
580
+ Condition, // Type, title, content
581
+ Declaration, // Line-specific declarations (19 types)
582
+ Platform, // email | chat | sms | slack | discord
583
+ AgentContext, // Platform + intent + user/company context
204
584
  } from "@claritylabs/cl-sdk";
205
585
  ```
206
586
 
207
- ### Storage
587
+ ### Supported Policy Types
588
+
589
+ 42 policy types across personal and commercial lines — including general liability, commercial property, workers' comp, cyber, D&O, homeowners (HO-3/HO-5/HO-4/HO-6), personal auto, flood (NFIP + private), earthquake, and more.
590
+
591
+ ## Core Utilities
208
592
 
209
593
  ```typescript
210
- import { SQLiteDocumentStore, SQLiteMemoryStore } from "@claritylabs/cl-sdk/storage/sqlite";
594
+ import {
595
+ withRetry, // Exponential backoff with jitter (5 retries, 2–32s) for rate limits
596
+ pLimit, // Concurrency limiter for parallel async tasks
597
+ sanitizeNulls, // Recursively convert null → undefined (for database compatibility)
598
+ stripFences, // Remove markdown code fences from LLM JSON responses
599
+ } from "@claritylabs/cl-sdk";
211
600
  ```
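
To make the comments above concrete, here is a sketch of what two of these helpers might do internally. It is not the SDK's source: the `*Sketch` names are hypothetical, the 2s base and 32s cap come from the "5 retries, 2–32s" comment, and the jitter strategy (adding up to 25%, then capping) is an assumption.

```typescript
// Delay before retry `attempt` (0-based): 2s, 4s, 8s, 16s, 32s before jitter.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const baseMs = 2000;
  const capMs = 32000;
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt));
  const jitter = exponential * 0.25 * random(); // assumed jitter strategy
  return Math.min(capMs, Math.round(exponential + jitter));
}

// Retry an async function up to `maxRetries` times with backoff between attempts.
async function withRetrySketch<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

// Three backticks, built indirectly so this snippet stays fence-safe.
const FENCE = "`".repeat(3);

// Remove one surrounding markdown code fence (with optional language tag).
function stripFencesSketch(text: string): string {
  const trimmed = text.trim();
  if (!trimmed.startsWith(FENCE) || !trimmed.endsWith(FENCE)) return trimmed;
  const firstNewline = trimmed.indexOf("\n");
  if (firstNewline === -1) return trimmed;
  return trimmed.slice(firstNewline + 1, trimmed.length - FENCE.length).trim();
}
```

Injecting `random` and `sleep` keeps both functions deterministic under test while defaulting to real randomness and timers.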
212
601
 
213
602
  ## Development
214
603
 
215
604
  ```bash
216
605
  npm install
217
- npm run build # Build ESM + CJS + types
606
+ npm run build # Build ESM + CJS + types via tsup
218
607
  npm run dev # Watch mode
219
- npm run typecheck # Type check
220
- npm run test # Run tests (vitest)
608
+ npm run typecheck # Type check (tsc --noEmit)
221
609
  ```
222
610
 
223
611
  Zero framework dependencies. Peer deps: `pdf-lib`, `zod`. Optional: `better-sqlite3`.