@claritylabs/cl-sdk 0.5.0 → 0.7.0

package/README.md CHANGED
@@ -104,14 +104,22 @@ The extraction system uses a **coordinator/worker pattern** — a coordinator ag
  │             │     │ to pages    │     │                      │
  └─────────────┘     └─────────────┘     └──────────┬───────────┘
  
-                     ┌─────────────┐     ┌──────────▼───────────┐
-                     │ 5. ASSEMBLE │◀────│ 4. REVIEW            │
-                     │             │     │                      │
-                     │ Merge all   │     │ Check completeness   │
-                     │ results,    │     │ against template,    │
-                     │ validate,   │     │ dispatch follow-up   │
-                     │ chunk       │     │ extractors for gaps  │
-                     └─────────────┘     └──────────────────────┘
+ ┌─────────────┐     ┌─────────────┐     ┌──────────▼───────────┐
+ │ 6. FORMAT   │◀────│ 5. ASSEMBLE │◀────│ 4. REVIEW            │
+ │             │     │             │     │                      │
+ │ Clean up    │     │ Merge all   │     │ Check completeness   │
+ │ markdown    │     │ results     │     │ against template,    │
+ │ tables,     │     │ into final  │     │ dispatch follow-up   │
+ │ spacing     │     │ document    │     │ extractors for gaps  │
+ └──────┬──────┘     └─────────────┘     └──────────────────────┘
+
+ ┌──────▼──────┐
+ │ 7. CHUNK    │
+ │ Break into  │
+ │ retrieval-  │
+ │ ready       │
+ │ chunks      │
+ └─────────────┘
  ```
  
  #### Phase 1: Classify
@@ -151,7 +159,23 @@ After initial extraction, a review loop (up to `maxReviewRounds`, default 2) che
  
  #### Phase 5: Assemble
  
- All extractor results are merged into a final validated `InsuranceDocument`, then chunked into `DocumentChunk[]` for vector storage. Chunks are deterministically IDed as `${documentId}:${type}:${index}`.
+ All extractor results are merged into a final validated `InsuranceDocument`.
+
+ #### Phase 6: Format
+
+ A formatting agent pass cleans up markdown in all content-bearing string fields (sections, subsections, endorsements, exclusions, conditions, summary). It fixes:
+
+ - **Pipe tables missing separator rows** — adds `| --- | --- |` and leading/trailing pipes
+ - **Space-aligned tables** — converts whitespace-padded columns into proper markdown tables
+ - **Sub-items mixed into tables** — pulls indented sub-items out of tables into lists
+ - **Mixed table/prose content** — handles each segment independently
+ - **General cleanup** — excessive blank lines, trailing whitespace, orphaned formatting markers
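Review note: the first fix in the list above (a pipe table with no separator row) can be illustrated with a small deterministic sketch. This is not how the SDK does it; according to the README the cleanup goes through an LLM formatting pass. The helper name `addSeparatorRow` is hypothetical, and the sketch assumes the input block contains only table rows with no escaped pipes inside cells.

```typescript
// Hypothetical helper for illustration only; not part of @claritylabs/cl-sdk.
// Assumes `block` holds only table rows and that cells contain no escaped pipes.
function addSeparatorRow(block: string): string {
  // Trim rows, drop blank lines, and add missing leading/trailing pipes.
  const rows = block
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .map((line) => {
      let row = line;
      if (!row.startsWith("|")) row = "| " + row;
      if (!row.endsWith("|")) row = row + " |";
      return row;
    });

  const [header, ...body] = rows;
  if (header === undefined) return block; // nothing to fix

  // A separator row looks like `| --- | :---: | ---: |`.
  const isSeparator = (row: string): boolean =>
    /^\|(\s*:?-{3,}:?\s*\|)+$/.test(row);
  const first = body[0];
  if (first !== undefined && isSeparator(first)) return rows.join("\n");

  // Cells are the segments between the outer pipes of the header row.
  const columns = header.split("|").length - 2;
  const separator = "|" + " --- |".repeat(columns);
  return [header, separator, ...body].join("\n");
}

// Usage: a two-column table missing both its outer pipes and its separator row.
console.log(addSeparatorRow("Coverage | Limit\nGL | $1M"));
// | Coverage | Limit |
// | --- | --- |
// | GL | $1M |
```

Running the helper on its own output returns the same text, which is the property a formatting pass of this kind would want so that repeated runs do not keep altering the document.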
+
+ Content is batched (up to 20 fields per call) and sent through `generateText` for formatting cleanup. Token usage is tracked the same as other pipeline steps.
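Review note: the batching described above can be sketched as a plain chunking helper. The name `batchFields` and the `ContentField` shape are assumptions made for illustration; the README does not show the SDK's internal types, and the call to `generateText` is deliberately left out.

```typescript
// Hypothetical shape: the SDK's real field representation is not shown in the README.
interface ContentField {
  path: string; // e.g. "sections[0].content" (illustrative)
  text: string;
}

// Split a list into consecutive batches of at most `batchSize` items.
// The default of 20 mirrors the "up to 20 fields per call" limit in the README.
function batchFields<T>(fields: readonly T[], batchSize = 20): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < fields.length; i += batchSize) {
    batches.push(fields.slice(i, i + batchSize));
  }
  return batches;
}

// Usage: 45 fields end up in three formatting calls of 20, 20, and 5 fields.
const fields: ContentField[] = Array.from({ length: 45 }, (_, i) => ({
  path: `sections[${i}].content`,
  text: `raw text ${i}`,
}));
console.log(batchFields(fields).map((batch) => batch.length)); // [ 20, 20, 5 ]
```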
+
+ #### Phase 7: Chunk
+
+ The formatted document is chunked into `DocumentChunk[]` for vector storage. Chunks are deterministically IDed as `${documentId}:${type}:${index}`.
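Review note: a minimal sketch of the ID scheme described above. The README gives only the template `${documentId}:${type}:${index}`; whether `index` counts per chunk type or across the whole document is not stated, so this sketch assumes a per-type counter. The `ChunkInput` and `SketchChunk` shapes and the `assignChunkIds` name are hypothetical, and the real `DocumentChunk` likely carries more fields.

```typescript
// Hypothetical minimal shapes for illustration; not the SDK's DocumentChunk type.
interface ChunkInput {
  type: string;
  text: string;
}

interface SketchChunk extends ChunkInput {
  id: string;
}

// Assign IDs of the form `${documentId}:${type}:${index}`.
// Assumption: `index` restarts at 0 for each chunk type.
function assignChunkIds(documentId: string, chunks: readonly ChunkInput[]): SketchChunk[] {
  const nextIndex = new Map<string, number>();
  return chunks.map((chunk) => {
    const index = nextIndex.get(chunk.type) ?? 0;
    nextIndex.set(chunk.type, index + 1);
    return { ...chunk, id: `${documentId}:${chunk.type}:${index}` };
  });
}

// Usage: the same input always produces the same IDs.
const ids = assignChunkIds("doc-1", [
  { type: "coverage", text: "General liability ..." },
  { type: "exclusion", text: "Pollution ..." },
  { type: "coverage", text: "Auto ..." },
]).map((chunk) => chunk.id);
console.log(ids); // [ 'doc-1:coverage:0', 'doc-1:exclusion:0', 'doc-1:coverage:1' ]
```

Because the IDs depend only on the document ID, the chunk type, and the position, re-running the pipeline on an unchanged document should yield identical IDs, which makes it possible to upsert chunks into a vector store rather than accumulate duplicates.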
  
  ### Configuration