@heripo/document-processor 0.1.0

package/README.md ADDED
@@ -0,0 +1,332 @@
+ # @heripo/document-processor
+
+ > LLM-based document structure analysis and processing library
+
+ [![npm version](https://img.shields.io/npm/v/@heripo/document-processor.svg)](https://www.npmjs.com/package/@heripo/document-processor)
+ [![Node.js](https://img.shields.io/badge/Node.js-%3E%3D22-339933?logo=node.js&logoColor=white)](https://nodejs.org/)
+ ![coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](../../LICENSE)
+
+ **English** | [한국어](./README.ko.md)
+
+ > **Note**: Please check the [root README](../../README.md) first for project overview, installation instructions, and roadmap.
+
+ `@heripo/document-processor` is a library that transforms DoclingDocument into ProcessedDocument, optimized for LLM analysis.
+
+ ## Table of Contents
+
+ - [Key Features](#key-features)
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Processing Pipeline](#processing-pipeline)
+ - [API Documentation](#api-documentation)
+ - [License](#license)
+
+ ## Key Features
+
+ - **TOC Extraction**: Automatic TOC recognition with rule-based + LLM fallback
+ - **Hierarchical Structure**: Automatic generation of chapter/section/subsection hierarchy
+ - **Page Mapping**: Actual page number mapping using Vision LLM
+ - **Caption Parsing**: Automatic parsing of image and table captions
+ - **LLM Flexibility**: Support for various LLMs including OpenAI, Anthropic, Google
+ - **Fallback Retry**: Automatic retry with fallback model on failure
+
+ ## Installation
+
+ ```bash
+ # Install with npm
+ npm install @heripo/document-processor @heripo/model
+
+ # Install with pnpm
+ pnpm add @heripo/document-processor @heripo/model
+
+ # Install with yarn
+ yarn add @heripo/document-processor @heripo/model
+ ```
+
+ Additionally, LLM provider SDKs are required:
+
+ ```bash
+ # Vercel AI SDK and provider packages
+ npm install ai @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google
+ ```
+
+ ## Usage
+
+ ### Basic Usage
+
+ ```typescript
+ import { anthropic } from '@ai-sdk/anthropic';
+ import { DocumentProcessor } from '@heripo/document-processor';
+ import { Logger } from '@heripo/logger';
+
+ const logger = Logger(...);
+
+ // Basic usage - specify fallback model only
+ const processor = new DocumentProcessor({
+   logger,
+   fallbackModel: anthropic('claude-opus-4-5'),
+   textCleanerBatchSize: 10,
+   captionParserBatchSize: 5,
+   captionValidatorBatchSize: 5,
+ });
+
+ // Process document
+ const processedDoc = await processor.process(
+   doclingDocument, // PDF parser output
+   'report-001', // Report ID
+   outputPath, // Directory containing images/pages
+ );
+
+ // Use results
+ console.log('TOC:', processedDoc.chapters);
+ console.log('Images:', processedDoc.images);
+ console.log('Tables:', processedDoc.tables);
+ ```
+
+ ### Advanced Usage - Per-Component Model Specification
+
+ ```typescript
+ import { anthropic } from '@ai-sdk/anthropic';
+ import { openai } from '@ai-sdk/openai';
+ import { DocumentProcessor } from '@heripo/document-processor';
+
+ // `logger`, `doclingDocument`, and `outputPath` are set up as in Basic Usage.
+ const processor = new DocumentProcessor({
+   logger,
+   // Fallback model (for retry on failure)
+   fallbackModel: anthropic('claude-opus-4-5'),
+
+   // Per-component model specification
+   pageRangeParserModel: openai('gpt-5.1'), // Vision required
+   tocExtractorModel: openai('gpt-5.1'), // Structured output
+   validatorModel: openai('gpt-5.2'), // Simple validation
+   visionTocExtractorModel: openai('gpt-5.1'), // Vision required
+   captionParserModel: openai('gpt-5-mini'), // Caption parsing
+
+   // Batch size settings
+   textCleanerBatchSize: 20, // Synchronous processing (can be large)
+   captionParserBatchSize: 10, // LLM calls (medium)
+   captionValidatorBatchSize: 10, // LLM calls (medium)
+
+   // Retry settings
+   maxRetries: 3,
+   enableFallbackRetry: true, // Automatic retry with fallback model
+ });
+
+ const processedDoc = await processor.process(
+   doclingDocument,
+   'report-001',
+   outputPath,
+ );
+ ```
+
+ ## Processing Pipeline
+
+ DocumentProcessor processes documents through a 5-stage pipeline:
+
+ ### 1. Text Cleaning (TextCleaner)
+
+ - Unicode normalization (NFC)
+ - Whitespace cleanup
+ - Invalid text filtering (numbers-only text, empty text)
+
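The cleaning rules listed for this stage can be sketched in a few lines. This is a minimal illustration, not the library's `TextCleaner`: the helper names `cleanText` and `cleanTexts` are hypothetical, and treating any text made up only of digits and whitespace as "numbers-only" is an assumption.

```typescript
// Hypothetical helper; the real TextCleaner API is not shown in this README.
function cleanText(raw: string): string | null {
  // NFC composes decomposed sequences (for example "e" + U+0301 becomes "é").
  const normalized = raw.normalize('NFC');
  // Collapse runs of whitespace (including newlines) and trim both ends.
  const collapsed = normalized.replace(/\s+/g, ' ').trim();
  // Assumption: "invalid" means empty, or digits and whitespace only.
  if (collapsed === '' || /^[\d\s]+$/.test(collapsed)) {
    return null;
  }
  return collapsed;
}

function cleanTexts(texts: string[]): string[] {
  return texts
    .map((text) => cleanText(text))
    .filter((text): text is string => text !== null);
}

const cleanedTexts = cleanTexts([
  '  Site   survey\nreport ',
  '42',
  '',
  'Cafe\u0301',
]);
console.log(cleanedTexts);
```

Because every step is a local string operation, this kind of work needs no LLM call, which matches the README's note that the text cleaner batch size can be large.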
+ ### 2. Page Range Mapping (PageRangeParser - Vision LLM)
+
+ - Extract actual page numbers from page images
+ - PDF page to document logical page mapping
+ - Handle page number mismatches due to scanning errors
+
+ ### 3. TOC Extraction (5-Stage Pipeline)
+
+ #### Stage 1: TocFinder (Rule-Based)
+
+ - Keyword search (Table of Contents, Contents, etc.)
+ - Structure analysis (lists/tables with page number patterns)
+ - Multi-page TOC detection with continuation markers
+
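A rule-based check of this kind might look like the sketch below. The keyword list, the entry pattern, the `minEntries` threshold, and the function name `looksLikeToc` are all assumptions made for illustration; the README does not document the rules `TocFinder` actually applies, and multi-page detection is left out.

```typescript
// Assumed keyword list; the library's real list is not documented here.
const TOC_KEYWORDS = ['table of contents', 'contents'];

// Assumed entry shape: a title, then dots or spaces, then a page number,
// such as "1.2 Survey Methods ..... 17" or "Appendix 203".
const TOC_ENTRY_PATTERN = /^.+?[.\s]+\d+$/;

function looksLikeToc(lines: string[], minEntries = 3): boolean {
  // Keyword search: some line is exactly a TOC heading.
  const hasKeyword = lines.some((line) =>
    TOC_KEYWORDS.includes(line.trim().toLowerCase()),
  );
  // Structure analysis: count lines that end in a page number.
  const entryCount = lines.filter((line) =>
    TOC_ENTRY_PATTERN.test(line.trim()),
  ).length;
  return hasKeyword && entryCount >= minEntries;
}

const tocCandidate = [
  'Contents',
  '1 Introduction ..... 1',
  '2 Methods ..... 12',
  '3 Results ..... 30',
];
console.log(looksLikeToc(tocCandidate));
```

Requiring both a keyword and several entry-like lines is a design choice of this sketch: it trades recall for fewer false positives, which seems reasonable when an LLM validation stage and a vision fallback follow.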
+ #### Stage 2: MarkdownConverter
+
+ - Group → Indented list format
+ - Table → Markdown table format
+ - Preserve hierarchy for LLM processing
+
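The group-to-list conversion can be pictured as a small recursive function. The `GroupNode` shape below is a hypothetical, simplified stand-in for a Docling group item, and `groupToMarkdown` is an illustrative name rather than the library's API.

```typescript
// Hypothetical, simplified stand-in for a nested Docling group item.
interface GroupNode {
  text: string;
  children: GroupNode[];
}

// Render each node as a list item, indenting two spaces per nesting level
// so the hierarchy stays visible to the LLM that reads the Markdown.
function groupToMarkdown(nodes: GroupNode[], depth = 0): string {
  return nodes
    .map((node) => {
      const line = `${'  '.repeat(depth)}- ${node.text}`;
      const nested = groupToMarkdown(node.children, depth + 1);
      return nested ? `${line}\n${nested}` : line;
    })
    .join('\n');
}

const tocGroup: GroupNode[] = [
  {
    text: 'Chapter 1',
    children: [{ text: '1.1 Background', children: [] }],
  },
  { text: 'Chapter 2', children: [] },
];
const tocMarkdown = groupToMarkdown(tocGroup);
console.log(tocMarkdown);
```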
+ #### Stage 3: TocContentValidator (LLM Validation)
+
+ - Verify if extracted content is actual TOC
+ - Return confidence score and reason
+
+ #### Stage 4: VisionTocExtractor (Vision LLM Fallback)
+
+ - Used when rule-based extraction or validation fails
+ - Extract TOC directly from page images
+
+ #### Stage 5: TocExtractor (LLM Structuring)
+
+ - Extract hierarchical TocEntry[] (title, level, pageNo)
+ - Recursive children structure for nested sections
+
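To make the recursive structure concrete, here is a sketch of a TOC entry type and a depth-first flatten. The field names `title`, `level`, and `pageNo` come from the list above; the type name `TocEntrySketch`, the optional `children` field, and the `flattenToc` helper are assumptions for illustration, and the actual `TocEntry` type should be taken from the package's own definitions.

```typescript
// Illustrative type; not the package's own TocEntry definition.
interface TocEntrySketch {
  title: string;
  level: number;
  pageNo: number;
  children?: TocEntrySketch[]; // assumed optional for leaf entries
}

// Depth-first walk that returns entries in reading order.
function flattenToc(entries: TocEntrySketch[]): TocEntrySketch[] {
  const flat: TocEntrySketch[] = [];
  for (const entry of entries) {
    flat.push(entry, ...flattenToc(entry.children ?? []));
  }
  return flat;
}

const sampleToc: TocEntrySketch[] = [
  {
    title: 'Chapter 1',
    level: 1,
    pageNo: 1,
    children: [{ title: '1.1 Background', level: 2, pageNo: 3 }],
  },
  { title: 'Chapter 2', level: 1, pageNo: 10 },
];
const flatTitles = flattenToc(sampleToc).map((entry) => entry.title);
console.log(flatTitles);
```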
+ ### 4. Resource Transformation
+
+ - **Images**: Caption extraction and parsing with CaptionParser
+ - **Tables**: Grid data transformation and caption parsing
+ - **Caption Validation**: Parsing result validation with CaptionValidator
+
+ ### 5. Chapter Conversion (ChapterConverter)
+
+ - Build chapter tree based on TOC
+ - Create Chapter hierarchy
+ - Link text blocks to chapters by page range
+ - Connect image/table IDs to appropriate chapters
+ - Fallback: Create single "Document" chapter when TOC is empty
+
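Linking text blocks by page range, together with the empty-TOC fallback, can be sketched as follows. All type and function names here (`TocItemSketch`, `PageTextSketch`, `ChapterSketch`, `buildChapters`) are hypothetical, and the sketch assumes a flat TOC in which each chapter runs until the page before the next chapter starts. The real `ChapterConverter` builds a hierarchy and also links images and tables, which this example omits.

```typescript
interface TocItemSketch {
  title: string;
  pageNo: number;
}

interface PageTextSketch {
  text: string;
  pageNo: number;
}

interface ChapterSketch {
  title: string;
  startPage: number;
  endPage: number;
  texts: string[];
}

function buildChapters(
  toc: TocItemSketch[],
  blocks: PageTextSketch[],
): ChapterSketch[] {
  if (toc.length === 0) {
    // Fallback from the list above: one "Document" chapter holds everything.
    return [
      {
        title: 'Document',
        startPage: 1,
        endPage: Number.POSITIVE_INFINITY,
        texts: blocks.map((block) => block.text),
      },
    ];
  }
  const sorted = [...toc].sort((a, b) => a.pageNo - b.pageNo);
  const chapters: ChapterSketch[] = sorted.map((item, index) => {
    const next = sorted[index + 1];
    return {
      title: item.title,
      startPage: item.pageNo,
      // Assumption: a chapter ends on the page before the next one starts.
      endPage: next ? next.pageNo - 1 : Number.POSITIVE_INFINITY,
      texts: [],
    };
  });
  for (const block of blocks) {
    const owner = chapters.find(
      (chapter) =>
        block.pageNo >= chapter.startPage && block.pageNo <= chapter.endPage,
    );
    if (owner) {
      owner.texts.push(block.text);
    }
  }
  return chapters;
}

const sampleBlocks: PageTextSketch[] = [
  { text: 'a', pageNo: 2 },
  { text: 'b', pageNo: 5 },
  { text: 'c', pageNo: 9 },
];
const sampleChapters = buildChapters(
  [
    { title: 'Introduction', pageNo: 1 },
    { title: 'Methods', pageNo: 5 },
  ],
  sampleBlocks,
);
console.log(sampleChapters);
```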
+ ## API Documentation
+
+ ### DocumentProcessor Class
+
+ #### Constructor Options
+
+ ```typescript
+ interface DocumentProcessorOptions {
+   logger: Logger; // Logger instance (required)
+
+   // LLM model settings
+   fallbackModel: LanguageModel; // Fallback model (required)
+   pageRangeParserModel?: LanguageModel; // For page range parser
+   tocExtractorModel?: LanguageModel; // For TOC extraction
+   validatorModel?: LanguageModel; // For validation
+   visionTocExtractorModel?: LanguageModel; // For Vision TOC extraction
+   captionParserModel?: LanguageModel; // For caption parser
+
+   // Batch processing settings
+   textCleanerBatchSize?: number; // Text cleaning (default: 10)
+   captionParserBatchSize?: number; // Caption parsing (default: 5)
+   captionValidatorBatchSize?: number; // Caption validation (default: 5)
+
+   // Retry settings
+   maxRetries?: number; // LLM API retry count (default: 3)
+   enableFallbackRetry?: boolean; // Enable fallback retry (default: true)
+ }
+ ```
+
+ #### Methods
+
+ ##### `process(doclingDoc, reportId, outputPath): Promise<ProcessedDocument>`
+
+ Transforms DoclingDocument into ProcessedDocument.
+
+ **Parameters:**
+
+ - `doclingDoc` (DoclingDocument): PDF parser output
+ - `reportId` (string): Report ID
+ - `outputPath` (string): Output directory containing images/pages
+
+ **Returns:**
+
+ - `Promise<ProcessedDocument>`: Processed document
+
+ ### Fallback Retry Mechanism
+
+ When `enableFallbackRetry: true` is set, LLM components automatically retry with `fallbackModel` on failure:
+
+ ```typescript
+ const processor = new DocumentProcessor({
+   logger,
+   fallbackModel: anthropic('claude-opus-4-5'), // For retry
+   pageRangeParserModel: openai('gpt-5.2'), // First attempt
+   enableFallbackRetry: true, // Use fallback on failure
+ });
+
+ // If pageRangeParserModel fails, automatically retries with fallbackModel
+ const result = await processor.process(doc, 'id', 'path');
+ ```
+
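The general retry-then-fallback pattern can be sketched independently of any LLM SDK. The helper `callWithFallback` below is hypothetical and is not the library's internal code. In particular, the order shown (exhaust `maxRetries` attempts on the primary model, then try the fallback once) is an assumption, since the README does not say how `maxRetries` and `enableFallbackRetry` interact.

```typescript
// Generic sketch: `run` stands in for any model call.
async function callWithFallback<M, T>(
  primary: M,
  fallback: M,
  run: (model: M) => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await run(primary);
    } catch {
      // Swallow the error and try the primary model again.
    }
  }
  // Assumption: a single fallback attempt; its error propagates to the caller.
  return run(fallback);
}

// Usage with stand-in "models" (plain strings) instead of real model objects.
const attempts: string[] = [];
const demo = callWithFallback<string, string>(
  'primary',
  'fallback',
  async (model: string) => {
    attempts.push(model);
    if (model === 'primary') {
      throw new Error('primary failed');
    }
    return `answer from ${model}`;
  },
  2,
);
demo.then((answer) => console.log(answer, attempts));
```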
+ ### Batch Size Parameters
+
+ - **textCleanerBatchSize**: Batch size for synchronous text normalization and filtering. Large values are fine because the work is local.
+ - **captionParserBatchSize**: Batch size for LLM-based caption parsing. Keep it small to limit API request concurrency and cost.
+ - **captionValidatorBatchSize**: Batch size for LLM-based caption validation. Keep it small to limit the concurrency of validation requests.
+
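One common way a batch size bounds concurrency is to run the items of a batch in parallel and the batches in sequence. The sketch below illustrates that idea under stated assumptions: `toBatches` and `processInBatches` are hypothetical helpers, and the README does not confirm that the library schedules its LLM calls exactly this way.

```typescript
// Split items into consecutive chunks of at most `batchSize` elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Items within a batch run concurrently; batches run one after another,
// so at most `batchSize` worker calls are in flight at any time.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

const captionBatches = toBatches(['a', 'b', 'c', 'd', 'e'], 2);
console.log(captionBatches);
```

Under this model a batch size of 5 means at most five requests are in flight at once, which is why a small value helps with rate limits and cost.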
+ ## Error Handling
+
+ ### TocExtractError
+
+ Errors thrown when TOC extraction fails:
+
+ - `TocNotFoundError`: TOC not found in document
+ - `TocParseError`: LLM response parsing failed
+ - `TocValidationError`: TOC validation failed
+
+ ```typescript
+ // Assumption: the TOC error classes are exported from the package entry
+ // point, like PageRangeParseError and the caption errors.
+ import {
+   TocNotFoundError,
+   TocParseError,
+ } from '@heripo/document-processor';
+
+ try {
+   const result = await processor.process(doc, 'id', 'path');
+ } catch (error) {
+   if (error instanceof TocNotFoundError) {
+     console.log('TOC not found. Processing as single chapter.');
+   } else if (error instanceof TocParseError) {
+     console.error('TOC parsing failed:', error.message);
+   }
+ }
+ ```
+
+ ### PageRangeParseError
+
+ Thrown when page range parsing fails:
+
+ ```typescript
+ import { PageRangeParseError } from '@heripo/document-processor';
+ ```
+
+ ### CaptionParseError & CaptionValidationError
+
+ Thrown when caption parsing or caption validation fails:
+
+ ```typescript
+ import {
+   CaptionParseError,
+   CaptionValidationError,
+ } from '@heripo/document-processor';
+ ```
+
+ ## Token Usage Tracking
+
+ Major LLM components return token usage:
+
+ ```typescript
+ // PageRangeParser
+ const { pageRangeMap, tokenUsage } = await pageRangeParser.parse(doc);
+ console.log('Token usage:', tokenUsage);
+
+ // TocExtractor (renamed on destructuring to avoid redeclaring `tokenUsage`)
+ const { entries, tokenUsage: tocTokenUsage } = await tocExtractor.extract(markdown);
+ console.log('Token usage:', tocTokenUsage);
+ ```
+
+ ## Related Packages
+
+ - [@heripo/pdf-parser](../pdf-parser) - PDF parsing and OCR
+ - [@heripo/model](../model) - Data models and type definitions
+
+ ## License
+
+ This package is distributed under the [Apache License 2.0](../../LICENSE).
+
+ ## Contributing
+
+ Contributions are always welcome! Please see the [Contributing Guide](../../CONTRIBUTING.md).
+
+ ## Issues and Support
+
+ - **Bug Reports**: [GitHub Issues](https://github.com/heripo-lab/heripo-engine/issues)
+ - **Discussions**: [GitHub Discussions](https://github.com/heripo-lab/heripo-engine/discussions)
+
+ ## Project-Wide Information
+
+ For project-wide information not covered in this package, see the [root README](../../README.md):
+
+ - **Citation and Attribution**: Academic citation (BibTeX) and attribution methods
+ - **Contributing Guidelines**: Development guidelines, commit rules, PR procedures
+ - **Community**: Issue tracker, discussions, security policy
+ - **Roadmap**: Project development plans
+
+ ---
+
+ **heripo lab** | [GitHub](https://github.com/heripo-lab) | [heripo engine](https://github.com/heripo-lab/heripo-engine)