@opendataloader/pdf 2.3.0 → 2.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
  name: opendataloader-pdf
3
3
  category: PDF data extraction, PDF accessibility automation
4
4
  license: Apache-2.0
5
- solves: [PDF to structured data for RAG/LLM pipelines, automate PDF accessibility compliance — layout analysis + auto-tagging to Tagged PDF (first open-source end-to-end)]
5
+ solves: [PDF to structured data for RAG/LLM pipelines, accelerate PDF accessibility remediation — layout analysis + auto-tagging to Tagged PDF as foundation for PDF/UA (first open-source end-to-end)]
6
6
  input: PDF files (digital, scanned, tagged)
7
7
  output: Markdown, JSON (with bounding boxes), HTML, Tagged PDF, PDF/UA (enterprise)
8
8
  sdk: Python, Node.js, Java
@@ -32,10 +32,10 @@ key-differentiators: [benchmark #1 PDF parser, deterministic output, bounding bo
32
32
  - **Tables, formulas, images, charts?** — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
33
33
  - **How do I use this for RAG?** — `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))
34
34
 
35
- ♿ **PDF accessibility automation** — The same layout analysis engine also powers auto-tagging. First open-source tool to generate Tagged PDFs end-to-end (coming Q2 2026).
35
+ ♿ **PDF accessibility automation** — Auto-tag untagged PDFs into screen-reader-ready Tagged PDFs at scale. First open-source tool to generate Tagged PDFs end-to-end.
36
36
 
37
37
  - **What's the problem?** — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--pdfua-conversion))
38
- - **What's free?** — Layout analysis + auto-tagging (Q2 2026, Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency ([auto-tagging preview](#auto-tagging-preview-coming-q2-2026))
38
+ - **What's free?** — Layout analysis + auto-tagging (Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency ([auto-tagging](#auto-tagging))
39
39
  - **What about PDF/UA compliance?** — Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step ([pipeline](#accessibility-pipeline))
40
40
  - **Why trust this?** — Built in collaboration with [Dual Lab](https://duallab.com) ([veraPDF](https://verapdf.org) developers) based on [PDF Association](https://pdfa.org) specifications, best practice guides and expertise of the [PDF Community](https://pdfa.org/community/). Auto-tagging follows the [Well-Tagged PDF specification](https://pdfa.org/wtpdf/), validated with veraPDF ([collaboration](https://opendataloader.org/docs/tagged-pdf-collaboration))
41
41
 
@@ -70,7 +70,7 @@ opendataloader_pdf.convert(
70
70
  |---------|----------|--------|
71
71
  | **PDF structure lost during parsing** — wrong reading order, broken tables, no element coordinates | Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order | Shipped |
72
72
  | **Complex tables, scanned PDFs, formulas, charts** need AI-level understanding | Hybrid mode routes complex pages to AI backend (#1 in benchmarks) | Shipped |
73
- | **PDF accessibility compliance** — EAA, ADA, Section 508 enforced. Manual remediation $50–200/doc | Auto-tagging: layout analysis Tagged PDF (free, Q2 2026). Built with PDF Association & veraPDF validation. PDF/UA export (enterprise add-on) | Auto-tag: Q2 2026 |
73
+ | **Manual PDF remediation cost** — Accessibility regulations (EAA, ADA, Section 508) demand Tagged PDFs. Manual remediation costs $50–200/doc | Auto-tag untagged PDFs into Tagged PDFs (free, Apache 2.0). Foundation for PDF/UA workflows; full PDF/UA-1/2 export is an enterprise add-on | Auto-tag: Shipped. PDF/UA export: Enterprise |
74
74
 
75
75
  ## Capability Matrix
76
76
 
@@ -91,7 +91,7 @@ opendataloader_pdf.convert(
91
91
  | AI safety (prompt injection filtering) | Yes | Free |
92
92
  | Header/footer/watermark filtering | Yes | Free |
93
93
  | **Accessibility** | | |
94
- | Auto-tagging → Tagged PDF for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0) |
94
+ | Auto-tagging → Tagged PDF for untagged PDFs | Yes | Free (Apache 2.0) |
95
95
  | PDF/UA-1, PDF/UA-2 export | 💼 Available | Enterprise |
96
96
  | Accessibility studio (visual editor) | 💼 Available | Enterprise |
97
97
  | **Limitations** | | |
@@ -133,7 +133,7 @@ opendataloader_pdf.convert(
133
133
  | Non-English scanned PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
134
134
  | Mathematical formulas | Hybrid + formula | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-formula` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
135
135
  | Charts needing description | Hybrid + picture | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-picture-description` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
136
- | Untagged PDFs needing accessibility | Auto-tagging → Tagged PDF | Coming Q2 2026 | | |
136
+ | Untagged PDFs needing accessibility | Auto-tagging → Tagged PDF | `pip install opendataloader-pdf` | None needed | `opendataloader-pdf --format tagged-pdf file1.pdf file2.pdf folder/` |
137
137
 
138
138
  ## Quick Start
139
139
 
@@ -416,37 +416,48 @@ opendataloader_pdf.convert(
416
416
  | Step | Feature | Status | Tier |
417
417
  |------|---------|--------|------|
418
418
  | 1. **Audit** | Read existing PDF tags, detect untagged PDFs | Shipped | Free |
419
- | 2. **Auto-tag → Tagged PDF** | Generate structure tags for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0) |
419
+ | 2. **Auto-tag → Tagged PDF** | Generate structure tags for untagged PDFs | Shipped | Free (Apache 2.0) |
420
420
  | 3. **Export PDF/UA** | Convert to PDF/UA-1 or PDF/UA-2 compliant files | 💼 Available | Enterprise |
421
421
  | 4. **Visual editing** | Accessibility studio — review and fix tags | 💼 Available | Enterprise |
422
422
 
423
423
  > **💼 Enterprise features** are available on request. [Contact us](https://opendataloader.org/contact) to get started.
424
424
 
425
- ### Auto-Tagging Preview (Coming Q2 2026)
425
+ ### Auto-Tagging
426
+
427
+ Generate Tagged PDFs from untagged PDFs — output is a screen-reader-ready PDF with structure tags (headings, paragraphs, lists, tables, reading order).
426
428
 
427
429
  ```python
428
- # API shape preview — available Q2 2026
430
+ import opendataloader_pdf
431
+
432
+ # Untagged PDF in → Tagged PDF out
429
433
  opendataloader_pdf.convert(
430
434
  input_path=["file1.pdf", "file2.pdf", "folder/"],
431
435
  output_dir="output/",
432
- auto_tag=True # Generate structure tags for untagged PDFs
436
+ format="tagged-pdf"
433
437
  )
434
438
  ```
435
439
 
440
+ ```bash
441
+ # CLI
442
+ opendataloader-pdf --format tagged-pdf file1.pdf file2.pdf folder/
443
+ ```
444
+
445
+ Combine with other formats: `format="json,tagged-pdf"`.
446
+
436
447
  ### End-to-End Compliance Workflow
437
448
 
438
449
  ```
439
450
  Existing PDFs (untagged)
440
451
 
441
452
 
442
- ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
443
- │ 1. Audit │───>│ 2. Auto-Tag │───>│ 3. Export │───>│ 4. Studio │
444
- │ (check tags) │ │ (→ Tagged PDF) │ │ (PDF/UA) │ │ (visual editor) │
445
- └─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
446
- │ │
447
- ▼ ▼
448
- use_struct_tree auto_tag PDF/UA export Accessibility Studio
449
- (Available now) (Q2 2026, Apache 2.0) (Enterprise) (Enterprise)
453
+ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌──────────────────┐
454
+ │ 1. Audit │───>│ 2. Auto-Tag │───>│ 3. Export │───>│ 4. Studio │
455
+ │ (check tags) │ │ (→ Tagged PDF) │ │ (PDF/UA) │ │ (visual editor) │
456
+ └─────────────────┘ └──────────────────┘ └─────────────────┘ └──────────────────┘
457
+ │ │
458
+ ▼ ▼
459
+ use_struct_tree format="tagged-pdf" PDF/UA export Accessibility Studio
460
+ (Available now) (Available, Apache 2.0) (Enterprise) (Enterprise)
450
461
  ```
451
462
 
452
463
  [PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)
@@ -455,9 +466,8 @@ Existing PDFs (untagged)
455
466
 
456
467
  | Feature | Timeline | Tier |
457
468
  |---------|----------|------|
458
- | **Auto-tagging → Tagged PDF** — Generate Tagged PDFs from untagged PDFs | Q2 2026 | Free |
459
469
  | **[Hancom Data Loader](https://sdk.hancom.com/en/services/1?utm_source=github&utm_medium=readme&utm_campaign=opendataloader-pdf)** — Enterprise AI document analysis, customer-customized models, VLM-based chart/image understanding, production-grade OCR | Q2-Q3 2026 | Planned |
460
- | **Structure validation** — Verify PDF tag trees | Q2 2026 | Planned |
470
+ | **Structure validation** — Verify PDF tag trees | Q3 2026 | Planned |
461
471
 
462
472
  [Full Roadmap](https://opendataloader.org/docs/upcoming-roadmap)
463
473
 
@@ -542,7 +552,7 @@ OpenDataLoader preserves heading hierarchy, table structure, and reading order i
542
552
 
543
553
  ### Is there an automated PDF accessibility remediation tool?
544
554
 
545
- Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging (Q2 2026) converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.
555
+ Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. Use `format="tagged-pdf"` (Python/Node.js) or `--format tagged-pdf` (CLI). For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.
546
556
 
547
557
  ### Is this really the first open-source PDF auto-tagging tool?
548
558
 
@@ -550,15 +560,15 @@ Yes. Existing tools either depend on proprietary SDKs for writing structure tags
550
560
 
551
561
  ### How do I convert existing PDFs to PDF/UA?
552
562
 
553
- OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (`use_struct_tree=True`), auto-tag untagged PDFs into Tagged PDFs (Q2 2026, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. [Contact us](https://opendataloader.org/contact) for enterprise integration.
563
+ OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (`use_struct_tree=True`), auto-tag untagged PDFs into Tagged PDFs (`format="tagged-pdf"`, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. [Contact us](https://opendataloader.org/contact) for enterprise integration.
554
564
 
555
565
  ### How do I make my PDFs accessible for EAA compliance?
556
566
 
557
- The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF will be open-sourced under Apache 2.0 (Q2 2026). PDF/UA export and accessibility studio are enterprise add-ons. See our [Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance).
567
+ The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF is open-source under Apache 2.0. PDF/UA export and accessibility studio are enterprise add-ons. See our [Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance).
558
568
 
559
569
  ### Is OpenDataLoader PDF free?
560
570
 
561
- The core library is **open-source under Apache 2.0** — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF (Q2 2026). We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.
571
+ The core library is **open-source under Apache 2.0** — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF. We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.
562
572
 
563
573
  ### Why did the license change from MPL 2.0 to Apache 2.0?
564
574
 
package/dist/cli.cjs CHANGED
@@ -34,6 +34,7 @@ var import_commander = require("commander");
34
34
  var import_child_process = require("child_process");
35
35
  var path = __toESM(require("path"), 1);
36
36
  var fs = __toESM(require("fs"), 1);
37
+ var import_string_decoder = require("string_decoder");
37
38
  var import_url = require("url");
38
39
 
39
40
  // src/convert-options.generated.ts
@@ -114,9 +115,21 @@ function buildConvertOptions(cliOptions) {
114
115
  if (cliOptions.hybridFallback) {
115
116
  convertOptions.hybridFallback = true;
116
117
  }
118
+ if (cliOptions.hybridHancomAiRegionlistStrategy) {
119
+ convertOptions.hybridHancomAiRegionlistStrategy = cliOptions.hybridHancomAiRegionlistStrategy;
120
+ }
121
+ if (cliOptions.hybridHancomAiOcrStrategy) {
122
+ convertOptions.hybridHancomAiOcrStrategy = cliOptions.hybridHancomAiOcrStrategy;
123
+ }
124
+ if (cliOptions.hybridHancomAiImageCache) {
125
+ convertOptions.hybridHancomAiImageCache = cliOptions.hybridHancomAiImageCache;
126
+ }
117
127
  if (cliOptions.toStdout) {
118
128
  convertOptions.toStdout = true;
119
129
  }
130
+ if (cliOptions.threads) {
131
+ convertOptions.threads = cliOptions.threads;
132
+ }
120
133
  return convertOptions;
121
134
  }
122
135
  function buildArgs(options) {
@@ -208,9 +221,21 @@ function buildArgs(options) {
208
221
  if (options.hybridFallback) {
209
222
  args.push("--hybrid-fallback");
210
223
  }
224
+ if (options.hybridHancomAiRegionlistStrategy) {
225
+ args.push("--hybrid-hancom-ai-regionlist-strategy", options.hybridHancomAiRegionlistStrategy);
226
+ }
227
+ if (options.hybridHancomAiOcrStrategy) {
228
+ args.push("--hybrid-hancom-ai-ocr-strategy", options.hybridHancomAiOcrStrategy);
229
+ }
230
+ if (options.hybridHancomAiImageCache) {
231
+ args.push("--hybrid-hancom-ai-image-cache", options.hybridHancomAiImageCache);
232
+ }
211
233
  if (options.toStdout) {
212
234
  args.push("--to-stdout");
213
235
  }
236
+ if (options.threads) {
237
+ args.push("--threads", options.threads);
238
+ }
214
239
  return args;
215
240
  }
216
241
 
@@ -232,21 +257,39 @@ function executeJar(args, executionOptions = {}) {
232
257
  const javaProcess = (0, import_child_process.spawn)(command, commandArgs);
233
258
  let stdout = "";
234
259
  let stderr = "";
260
+ const stdoutDecoder = new import_string_decoder.StringDecoder("utf8");
261
+ const stderrDecoder = new import_string_decoder.StringDecoder("utf8");
235
262
  javaProcess.stdout.on("data", (data) => {
236
- const chunk = data.toString();
263
+ const chunk = stdoutDecoder.write(data);
264
+ if (chunk.length === 0) return;
237
265
  if (streamOutput) {
238
266
  process.stdout.write(chunk);
267
+ } else {
268
+ stdout += chunk;
239
269
  }
240
- stdout += chunk;
241
270
  });
242
271
  javaProcess.stderr.on("data", (data) => {
243
- const chunk = data.toString();
272
+ const chunk = stderrDecoder.write(data);
273
+ if (chunk.length === 0) return;
244
274
  if (streamOutput) {
245
275
  process.stderr.write(chunk);
246
276
  }
247
277
  stderr += chunk;
248
278
  });
249
279
  javaProcess.on("close", (code) => {
280
+ const stdoutTail = stdoutDecoder.end();
281
+ const stderrTail = stderrDecoder.end();
282
+ if (stdoutTail.length > 0) {
283
+ if (streamOutput) {
284
+ process.stdout.write(stdoutTail);
285
+ } else {
286
+ stdout += stdoutTail;
287
+ }
288
+ }
289
+ if (stderrTail.length > 0) {
290
+ if (streamOutput) process.stderr.write(stderrTail);
291
+ stderr += stderrTail;
292
+ }
250
293
  if (code === 0) {
251
294
  resolve(stdout);
252
295
  } else {
@@ -272,20 +315,24 @@ ${errorOutput}`
272
315
  });
273
316
  });
274
317
  }
275
- function convert(inputPaths, options = {}) {
318
+ function buildJarArgs(inputPaths, options) {
276
319
  const inputList = Array.isArray(inputPaths) ? inputPaths : [inputPaths];
277
320
  if (inputList.length === 0) {
278
- return Promise.reject(new Error("At least one input path must be provided."));
321
+ return new Error("At least one input path must be provided.");
279
322
  }
280
323
  for (const input of inputList) {
281
324
  if (!fs.existsSync(input)) {
282
- return Promise.reject(new Error(`Input file or folder not found: ${input}`));
325
+ return new Error(`Input file or folder not found: ${input}`);
283
326
  }
284
327
  }
285
- const args = [...inputList, ...buildArgs(options)];
286
- return executeJar(args, {
287
- streamOutput: !options.quiet
288
- });
328
+ return [...inputList, ...buildArgs(options)];
329
+ }
330
+ async function _runForCli(inputPaths, options = {}) {
331
+ const argsOrError = buildJarArgs(inputPaths, options);
332
+ if (argsOrError instanceof Error) {
333
+ throw argsOrError;
334
+ }
335
+ await executeJar(argsOrError, { streamOutput: true });
289
336
  }
290
337
 
291
338
  // src/cli-options.generated.ts
@@ -310,12 +357,16 @@ function registerCliOptions(program) {
310
357
  program.option("--pages <value>", 'Pages to extract (e.g., "1,3,5-7"). Default: all pages');
311
358
  program.option("--include-header-footer", "Include page headers and footers in output");
312
359
  program.option("--detect-strikethrough", "Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental)");
313
- program.option("--hybrid <value>", 'Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast');
360
+ program.option("--hybrid <value>", 'Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai');
314
361
  program.option("--hybrid-mode <value>", "Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)");
315
362
  program.option("--hybrid-url <value>", "Hybrid backend server URL (overrides default)");
316
363
  program.option("--hybrid-timeout <value>", "Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0");
317
364
  program.option("--hybrid-fallback", "Opt in to Java fallback on hybrid backend error (default: disabled)");
365
+ program.option("--hybrid-hancom-ai-regionlist-strategy <value>", "DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list)");
366
+ program.option("--hybrid-hancom-ai-ocr-strategy <value>", "OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only)");
367
+ program.option("--hybrid-hancom-ai-image-cache <value>", "Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk");
318
368
  program.option("--to-stdout", "Write output to stdout instead of file (single format only)");
369
+ program.option("--threads <value>", "Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode");
319
370
  }
320
371
 
321
372
  // src/cli.ts
@@ -354,13 +405,7 @@ async function main() {
354
405
  const inputPaths = program.args;
355
406
  const convertOptions = buildConvertOptions(cliOptions);
356
407
  try {
357
- const output = await convert(inputPaths, convertOptions);
358
- if (output && !convertOptions.quiet) {
359
- process.stdout.write(output);
360
- if (!output.endsWith("\n")) {
361
- process.stdout.write("\n");
362
- }
363
- }
408
+ await _runForCli(inputPaths, convertOptions);
364
409
  return 0;
365
410
  } catch (err) {
366
411
  const message = err instanceof Error ? err.message : String(err);
package/dist/cli.cjs.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"sources":["../node_modules/.pnpm/tsup@8.5.1_postcss@8.5.9_tsx@4.20.5_typescript@6.0.2/node_modules/tsup/assets/cjs_shims.js","../src/cli.ts","../src/index.ts","../src/convert-options.generated.ts","../src/cli-options.generated.ts"],"sourcesContent":["// Shim globals in cjs bundle\n// There's a weird bug that esbuild will always inject importMetaUrl\n// if we export it as `const importMetaUrl = ... __filename ...`\n// But using a function will not cause this issue\n\nconst getImportMetaUrl = () => \n typeof document === \"undefined\" \n ? new URL(`file:${__filename}`).href \n : (document.currentScript && document.currentScript.tagName.toUpperCase() === 'SCRIPT') \n ? document.currentScript.src \n : new URL(\"main.js\", document.baseURI).href;\n\nexport const importMetaUrl = /* @__PURE__ */ getImportMetaUrl()\n","#!/usr/bin/env node\nimport { Command, CommanderError } from 'commander';\nimport { convert } from './index.js';\nimport { CliOptions, buildConvertOptions } from './convert-options.generated.js';\nimport { registerCliOptions } from './cli-options.generated.js';\n\nfunction createProgram(): Command {\n const program = new Command();\n\n program\n .name('opendataloader-pdf')\n .usage('[options] <input...>')\n .description('Convert PDFs using the OpenDataLoader CLI.')\n .showHelpAfterError(\"Use '--help' to see available options.\")\n .showSuggestionAfterError(false)\n .argument('<input...>', 'Input files or directories to convert');\n\n // Register CLI options from auto-generated file\n registerCliOptions(program);\n\n program.configureOutput({\n writeErr: (str) => {\n console.error(str.trimEnd());\n },\n outputError: (str, write) => {\n write(str);\n },\n });\n\n return program;\n}\n\nasync function main(): Promise<number> {\n const program = createProgram();\n\n program.exitOverride();\n\n try {\n program.parse(process.argv);\n } catch (err) {\n if (err instanceof CommanderError) {\n if (err.code === 'commander.helpDisplayed') {\n return 0;\n }\n return err.exitCode ?? 1;\n }\n\n const message = err instanceof Error ? err.message : String(err);\n console.error(message);\n console.error(\"Use '--help' to see available options.\");\n return 1;\n }\n\n const cliOptions = program.opts<CliOptions>();\n const inputPaths = program.args;\n const convertOptions = buildConvertOptions(cliOptions);\n\n try {\n const output = await convert(inputPaths, convertOptions);\n if (output && !convertOptions.quiet) {\n process.stdout.write(output);\n if (!output.endsWith('\\n')) {\n process.stdout.write('\\n');\n }\n }\n return 0;\n } catch (err) {\n const message = err instanceof Error ? err.message : String(err);\n console.error(message);\n return 1;\n }\n}\n\nmain().then((code) => {\n if (code !== 0) {\n process.exit(code);\n }\n});\n","import { spawn } from 'child_process';\nimport * as path from 'path';\nimport * as fs from 'fs';\nimport { fileURLToPath } from 'url';\n\n// Re-export types and utilities from auto-generated file\nexport type { ConvertOptions } from './convert-options.generated.js';\nexport { buildArgs } from './convert-options.generated.js';\nimport type { ConvertOptions } from './convert-options.generated.js';\nimport { buildArgs } from './convert-options.generated.js';\n\nconst __filename = fileURLToPath(import.meta.url);\nconst __dirname = path.dirname(__filename);\n\nconst JAR_NAME = 'opendataloader-pdf-cli.jar';\n\ninterface JarExecutionOptions {\n streamOutput?: boolean;\n}\n\nfunction executeJar(args: string[], executionOptions: JarExecutionOptions = {}): Promise<string> {\n const { streamOutput = false } = executionOptions;\n\n return new Promise((resolve, reject) => {\n const jarPath = path.join(__dirname, '..', 'lib', JAR_NAME);\n\n if (!fs.existsSync(jarPath)) {\n return reject(\n new Error(`JAR file not found at ${jarPath}. Please run the build script first.`),\n );\n }\n\n const command = 'java';\n const commandArgs = ['-jar', jarPath, ...args];\n\n const javaProcess = spawn(command, commandArgs);\n\n let stdout = '';\n let stderr = '';\n\n javaProcess.stdout.on('data', (data) => {\n const chunk = data.toString();\n if (streamOutput) {\n process.stdout.write(chunk);\n }\n stdout += chunk;\n });\n\n javaProcess.stderr.on('data', (data) => {\n const chunk = data.toString();\n if (streamOutput) {\n process.stderr.write(chunk);\n }\n stderr += chunk;\n });\n\n javaProcess.on('close', (code) => {\n if (code === 0) {\n resolve(stdout);\n } else {\n const errorOutput = stderr || stdout;\n const error = new Error(\n `The opendataloader-pdf CLI exited with code ${code}.\\n\\n${errorOutput}`,\n );\n reject(error);\n }\n });\n\n javaProcess.on('error', (err: Error) => {\n if (err.message.includes('ENOENT')) {\n reject(\n new Error(\n \"'java' command not found. Please ensure Java is installed and in your system's PATH.\",\n ),\n );\n } else {\n reject(err);\n }\n });\n });\n}\n\nexport function convert(\n inputPaths: string | string[],\n options: ConvertOptions = {},\n): Promise<string> {\n const inputList = Array.isArray(inputPaths) ? inputPaths : [inputPaths];\n if (inputList.length === 0) {\n return Promise.reject(new Error('At least one input path must be provided.'));\n }\n\n for (const input of inputList) {\n if (!fs.existsSync(input)) {\n return Promise.reject(new Error(`Input file or folder not found: ${input}`));\n }\n }\n\n const args: string[] = [...inputList, ...buildArgs(options)];\n\n return executeJar(args, {\n streamOutput: !options.quiet,\n });\n}\n\n/**\n * @deprecated Use `convert()` and `ConvertOptions` instead. This function will be removed in a future version.\n */\nexport interface RunOptions {\n outputFolder?: string;\n password?: string;\n replaceInvalidChars?: string;\n generateMarkdown?: boolean;\n generateHtml?: boolean;\n generateAnnotatedPdf?: boolean;\n keepLineBreaks?: boolean;\n contentSafetyOff?: string;\n htmlInMarkdown?: boolean;\n addImageToMarkdown?: boolean;\n noJson?: boolean;\n debug?: boolean;\n useStructTree?: boolean;\n}\n\n/**\n * @deprecated Use `convert()` instead. This function will be removed in a future version.\n */\nexport function run(inputPath: string, options: RunOptions = {}): Promise<string> {\n console.warn(\n 'Warning: run() is deprecated and will be removed in a future version. Use convert() instead.',\n );\n\n // Build format array based on legacy boolean options\n const formats: string[] = [];\n if (!options.noJson) {\n formats.push('json');\n }\n if (options.generateMarkdown) {\n if (options.addImageToMarkdown) {\n formats.push('markdown-with-images');\n } else if (options.htmlInMarkdown) {\n formats.push('markdown-with-html');\n } else {\n formats.push('markdown');\n }\n }\n if (options.generateHtml) {\n formats.push('html');\n }\n if (options.generateAnnotatedPdf) {\n formats.push('pdf');\n }\n\n return convert(inputPath, {\n outputDir: options.outputFolder,\n password: options.password,\n replaceInvalidChars: options.replaceInvalidChars,\n keepLineBreaks: options.keepLineBreaks,\n contentSafetyOff: options.contentSafetyOff,\n useStructTree: options.useStructTree,\n format: formats.length > 0 ? formats : undefined,\n quiet: !options.debug,\n });\n}\n","// AUTO-GENERATED FROM options.json - DO NOT EDIT DIRECTLY\n// Run `npm run generate-options` to regenerate\n\n/**\n * Options for the convert function.\n */\nexport interface ConvertOptions {\n /** Directory where output files are written. Default: input file directory */\n outputDir?: string;\n /** Password for encrypted PDF files */\n password?: string;\n /** Output formats (comma-separated). Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images, tagged-pdf. Default: json */\n format?: string | string[];\n /** Suppress console logging output */\n quiet?: boolean;\n /** Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg */\n contentSafetyOff?: string | string[];\n /** Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders */\n sanitize?: boolean;\n /** Preserve original line breaks in extracted text */\n keepLineBreaks?: boolean;\n /** Replacement character for invalid/unrecognized characters. Default: space */\n replaceInvalidChars?: string;\n /** Use PDF structure tree (tagged PDF) for reading order and semantic structure */\n useStructTree?: boolean;\n /** Table detection method. Values: default (border-based), cluster (border + cluster). Default: default */\n tableMethod?: string;\n /** Reading order algorithm. Values: off, xycut. Default: xycut */\n readingOrder?: string;\n /** Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none */\n markdownPageSeparator?: string;\n /** Separator between pages in text output. Use %page-number% for page numbers. Default: none */\n textPageSeparator?: string;\n /** Separator between pages in HTML output. Use %page-number% for page numbers. Default: none */\n htmlPageSeparator?: string;\n /** Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external */\n imageOutput?: string;\n /** Output format for extracted images. Values: png, jpeg. Default: png */\n imageFormat?: string;\n /** Directory for extracted images */\n imageDir?: string;\n /** Pages to extract (e.g., \"1,3,5-7\"). Default: all pages */\n pages?: string;\n /** Include page headers and footers in output */\n includeHeaderFooter?: boolean;\n /** Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental) */\n detectStrikethrough?: boolean;\n /** Hybrid backend (requires a running server). Quick start: pip install \"opendataloader-pdf[hybrid]\" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast */\n hybrid?: string;\n /** Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) */\n hybridMode?: string;\n /** Hybrid backend server URL (overrides default) */\n hybridUrl?: string;\n /** Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 */\n hybridTimeout?: string;\n /** Opt in to Java fallback on hybrid backend error (default: disabled) */\n hybridFallback?: boolean;\n /** Write output to stdout instead of file (single format only) */\n toStdout?: boolean;\n}\n\n/**\n * Options as parsed from CLI (all values are strings from commander).\n */\nexport interface CliOptions {\n outputDir?: string;\n password?: string;\n format?: string;\n quiet?: boolean;\n contentSafetyOff?: string;\n sanitize?: boolean;\n keepLineBreaks?: boolean;\n replaceInvalidChars?: string;\n useStructTree?: boolean;\n tableMethod?: string;\n readingOrder?: string;\n markdownPageSeparator?: string;\n textPageSeparator?: string;\n htmlPageSeparator?: string;\n imageOutput?: string;\n imageFormat?: string;\n imageDir?: string;\n pages?: string;\n includeHeaderFooter?: boolean;\n detectStrikethrough?: boolean;\n hybrid?: string;\n hybridMode?: string;\n hybridUrl?: string;\n hybridTimeout?: string;\n hybridFallback?: boolean;\n toStdout?: boolean;\n}\n\n/**\n * Convert CLI options to ConvertOptions.\n */\nexport function buildConvertOptions(cliOptions: CliOptions): ConvertOptions {\n const convertOptions: ConvertOptions = {};\n\n if (cliOptions.outputDir) {\n convertOptions.outputDir = cliOptions.outputDir;\n }\n if (cliOptions.password) {\n convertOptions.password = cliOptions.password;\n }\n if (cliOptions.format) {\n convertOptions.format = cliOptions.format;\n }\n if (cliOptions.quiet) {\n convertOptions.quiet = true;\n }\n if (cliOptions.contentSafetyOff) {\n convertOptions.contentSafetyOff = cliOptions.contentSafetyOff;\n }\n if (cliOptions.sanitize) {\n convertOptions.sanitize = true;\n }\n if (cliOptions.keepLineBreaks) {\n convertOptions.keepLineBreaks = true;\n }\n if (cliOptions.replaceInvalidChars) {\n convertOptions.replaceInvalidChars = cliOptions.replaceInvalidChars;\n }\n if (cliOptions.useStructTree) {\n convertOptions.useStructTree = true;\n }\n if (cliOptions.tableMethod) {\n convertOptions.tableMethod = cliOptions.tableMethod;\n }\n if (cliOptions.readingOrder) {\n convertOptions.readingOrder = cliOptions.readingOrder;\n }\n if (cliOptions.markdownPageSeparator) {\n convertOptions.markdownPageSeparator = cliOptions.markdownPageSeparator;\n }\n if (cliOptions.textPageSeparator) {\n convertOptions.textPageSeparator = cliOptions.textPageSeparator;\n }\n if (cliOptions.htmlPageSeparator) {\n convertOptions.htmlPageSeparator = cliOptions.htmlPageSeparator;\n }\n if (cliOptions.imageOutput) {\n convertOptions.imageOutput = cliOptions.imageOutput;\n }\n if (cliOptions.imageFormat) {\n convertOptions.imageFormat = cliOptions.imageFormat;\n }\n if (cliOptions.imageDir) {\n convertOptions.imageDir = cliOptions.imageDir;\n }\n if (cliOptions.pages) {\n convertOptions.pages = cliOptions.pages;\n }\n if (cliOptions.includeHeaderFooter) {\n convertOptions.includeHeaderFooter = true;\n }\n if (cliOptions.detectStrikethrough) {\n convertOptions.detectStrikethrough = true;\n }\n if (cliOptions.hybrid) {\n convertOptions.hybrid = cliOptions.hybrid;\n }\n if (cliOptions.hybridMode) {\n convertOptions.hybridMode = cliOptions.hybridMode;\n }\n if (cliOptions.hybridUrl) {\n convertOptions.hybridUrl = cliOptions.hybridUrl;\n }\n if (cliOptions.hybridTimeout) {\n convertOptions.hybridTimeout = cliOptions.hybridTimeout;\n }\n if (cliOptions.hybridFallback) {\n convertOptions.hybridFallback = true;\n }\n if (cliOptions.toStdout) {\n convertOptions.toStdout = true;\n }\n\n return convertOptions;\n}\n\n/**\n * Build CLI arguments array from ConvertOptions.\n */\nexport function buildArgs(options: ConvertOptions): string[] {\n const args: string[] = [];\n\n if (options.outputDir) {\n args.push('--output-dir', options.outputDir);\n }\n if (options.password) {\n args.push('--password', options.password);\n }\n if (options.format) {\n if (Array.isArray(options.format)) {\n if (options.format.length > 0) {\n args.push('--format', options.format.join(','));\n }\n } else {\n args.push('--format', options.format);\n }\n }\n if (options.quiet) {\n args.push('--quiet');\n }\n if (options.contentSafetyOff) {\n if (Array.isArray(options.contentSafetyOff)) {\n if (options.contentSafetyOff.length > 0) {\n args.push('--content-safety-off', options.contentSafetyOff.join(','));\n }\n } else {\n args.push('--content-safety-off', options.contentSafetyOff);\n }\n }\n if (options.sanitize) {\n args.push('--sanitize');\n }\n if (options.keepLineBreaks) {\n args.push('--keep-line-breaks');\n }\n if (options.replaceInvalidChars) {\n args.push('--replace-invalid-chars', options.replaceInvalidChars);\n }\n if (options.useStructTree) {\n args.push('--use-struct-tree');\n }\n if (options.tableMethod) {\n args.push('--table-method', options.tableMethod);\n }\n if (options.readingOrder) {\n args.push('--reading-order', options.readingOrder);\n }\n if (options.markdownPageSeparator) {\n args.push('--markdown-page-separator', options.markdownPageSeparator);\n }\n if (options.textPageSeparator) {\n args.push('--text-page-separator', options.textPageSeparator);\n }\n if (options.htmlPageSeparator) {\n args.push('--html-page-separator', options.htmlPageSeparator);\n }\n if (options.imageOutput) {\n args.push('--image-output', options.imageOutput);\n }\n if (options.imageFormat) {\n args.push('--image-format', options.imageFormat);\n }\n if (options.imageDir) {\n args.push('--image-dir', options.imageDir);\n }\n if (options.pages) {\n args.push('--pages', options.pages);\n }\n if (options.includeHeaderFooter) {\n args.push('--include-header-footer');\n }\n if (options.detectStrikethrough) {\n args.push('--detect-strikethrough');\n }\n if (options.hybrid) {\n args.push('--hybrid', options.hybrid);\n }\n if (options.hybridMode) {\n args.push('--hybrid-mode', options.hybridMode);\n }\n if (options.hybridUrl) {\n args.push('--hybrid-url', options.hybridUrl);\n }\n if (options.hybridTimeout) {\n args.push('--hybrid-timeout', options.hybridTimeout);\n }\n if (options.hybridFallback) {\n args.push('--hybrid-fallback');\n }\n if (options.toStdout) {\n args.push('--to-stdout');\n }\n\n return args;\n}\n","// AUTO-GENERATED FROM options.json - DO NOT EDIT DIRECTLY\n// Run `npm run generate-options` to regenerate\n\nimport { Command } from 'commander';\n\n/**\n * Register all CLI options on the given Commander program.\n */\nexport function registerCliOptions(program: Command): void {\n program.option('-o, --output-dir <value>', 'Directory where output files are written. Default: input file directory');\n program.option('-p, --password <value>', 'Password for encrypted PDF files');\n program.option('-f, --format <value>', 'Output formats (comma-separated). Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images, tagged-pdf. Default: json');\n program.option('-q, --quiet', 'Suppress console logging output');\n program.option('--content-safety-off <value>', 'Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg');\n program.option('--sanitize', 'Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders');\n program.option('--keep-line-breaks', 'Preserve original line breaks in extracted text');\n program.option('--replace-invalid-chars <value>', 'Replacement character for invalid/unrecognized characters. Default: space');\n program.option('--use-struct-tree', 'Use PDF structure tree (tagged PDF) for reading order and semantic structure');\n program.option('--table-method <value>', 'Table detection method. Values: default (border-based), cluster (border + cluster). Default: default');\n program.option('--reading-order <value>', 'Reading order algorithm. Values: off, xycut. Default: xycut');\n program.option('--markdown-page-separator <value>', 'Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none');\n program.option('--text-page-separator <value>', 'Separator between pages in text output. Use %page-number% for page numbers. Default: none');\n program.option('--html-page-separator <value>', 'Separator between pages in HTML output. Use %page-number% for page numbers. Default: none');\n program.option('--image-output <value>', 'Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external');\n program.option('--image-format <value>', 'Output format for extracted images. Values: png, jpeg. Default: png');\n program.option('--image-dir <value>', 'Directory for extracted images');\n program.option('--pages <value>', 'Pages to extract (e.g., \"1,3,5-7\"). Default: all pages');\n program.option('--include-header-footer', 'Include page headers and footers in output');\n program.option('--detect-strikethrough', 'Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental)');\n program.option('--hybrid <value>', 'Hybrid backend (requires a running server). Quick start: pip install \"opendataloader-pdf[hybrid]\" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast');\n program.option('--hybrid-mode <value>', 'Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)');\n program.option('--hybrid-url <value>', 'Hybrid backend server URL (overrides default)');\n program.option('--hybrid-timeout <value>', 'Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0');\n program.option('--hybrid-fallback', 'Opt in to Java fallback on hybrid backend error (default: disabled)');\n program.option('--to-stdout', 'Write output to stdout instead of file (single format only)');\n}\n"],"mappings":";;;;;;;;;;;;;;;;;;;;;;;;;;AAKA,IAAM,mBAAmB,MACvB,OAAO,aAAa,cAChB,IAAI,IAAI,QAAQ,UAAU,EAAE,EAAE,OAC7B,SAAS,iBAAiB,SAAS,cAAc,QAAQ,YAAY,MAAM,WAC1E,SAAS,cAAc,MACvB,IAAI,IAAI,WAAW,SAAS,OAAO,EAAE;AAEtC,IAAM,gBAAgC,iCAAiB;;;ACX9D,uBAAwC;;;ACDxC,2BAAsB;AACtB,WAAsB;AACtB,SAAoB;AACpB,iBAA8B;;;AC6FvB,SAAS,oBAAoB,YAAwC;AAC1E,QAAM,iBAAiC,CAAC;AAExC,MAAI,WAAW,WAAW;AACxB,mBAAe,YAAY,WAAW;AAAA,EACxC;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW,WAAW;AAAA,EACvC;AACA,MAAI,WAAW,QAAQ;AACrB,mBAAe,SAAS,WAAW;AAAA,EACrC;AACA,MAAI,WAAW,OAAO;AACpB,mBAAe,QAAQ;AAAA,EACzB;AACA,MAAI,WAAW,kBAAkB;AAC/B,mBAAe,mBAAmB,WAAW;AAAA,EAC/C;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW;AAAA,EAC5B;AACA,MAAI,WAAW,gBAAgB;AAC7B,mBAAe,iBAAiB;AAAA,EAClC;AACA,MAAI,WAAW,qBAAqB;AAClC,mBAAe,sBAAsB,WAAW;AAAA,EAClD;AACA,MAAI,WAAW,eAAe;AAC5B,mBAAe,gBAAgB;AAAA,EACjC;AACA,MAAI,WAAW,aAAa;AAC1B,mBAAe,cAAc,WAAW;AAAA,EAC1C;AACA,MAAI,WAAW,cAAc;AAC3B,mBAAe,eAAe,WAAW;AAAA,EAC3C;AACA,MAAI,WAAW,uBAAuB;AACpC,mBAAe,wBAAwB,WAAW;AAAA,EACpD;AACA,MAAI,WAAW,mBAAmB;AAChC,mBAAe,oBAAoB,WAAW;AAAA,EAChD;AACA,MAAI,WAAW,mBAAmB;AAChC,mBAAe,oBAAoB,WAAW;AAAA,EAChD;AACA,MAAI,WAAW,aAAa;AAC1B,mBAAe,cAAc,WAAW;AAAA,EAC1C;AACA,MAAI,WAAW,aAAa;AAC1B,mBAAe,cAAc,WAAW;AAAA,EAC1C;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW,WAAW;AAAA,EACvC;AACA,MAAI,WAAW,OAAO;AACpB,mBAAe,QAAQ,WAAW;AAAA,EACpC;AACA,MAAI,WAAW,qBAAqB;AAClC,mBAAe,sBAAsB;AAAA,EACvC;AACA,MAAI,WAAW,qBAAqB;AAClC,mBAAe,sBAAsB;AAAA,EACvC;AACA,MAAI,WAAW,QAAQ;AACrB,mBAAe,SAAS,WAAW;AAAA,EACrC;AACA,MAAI,WAAW,YAAY;AACzB,mBAAe,aAAa,WAAW;AAAA,EACzC;AACA,MAAI,WAAW,WAAW;AACxB,mBAAe,YAAY,WAAW;AAAA,EACxC;AACA,MAAI,WAAW,eAAe;AAC5B,mBAAe,gBAAgB,WAAW;AAAA,EAC5C;AACA,MAAI,WAAW,gBAAgB;AAC7B,mBAAe,iBAAiB;AAAA,EAClC;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW;AAAA,EAC5B;AAEA,SAAO;AACT;AAKO,SAAS,UAAU,SAAmC;AAC3D,QAAM,OAAiB,CAAC;AAExB,MAAI,QAAQ,WAAW;AACrB,SAAK,KAAK,gBAAgB,QAAQ,SAAS;AAAA,EAC7C;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,cAAc,QAAQ,QAAQ;AAAA,EAC1C;AACA,MAAI,QAAQ,QAAQ;AAClB,QAAI,MAAM,QAAQ,QAAQ,MAAM,GAAG;AACjC,UAAI,QAAQ,OAAO,SAAS,GAAG;AAC7B,aAAK,KAAK,YAAY,QAAQ,OAAO,KAAK,GAAG,CAAC;AAAA,MAChD;AAAA,IACF,OAAO;AACL,WAAK,KAAK,YAAY,QAAQ,MAAM;AAAA,IACtC;AAAA,EACF;AACA,MAAI,QAAQ,OAAO;AACjB,SAAK,KAAK,SAAS;AAAA,EACrB;AACA,MAAI,QAAQ,kBAAkB;AAC5B,QAAI,MAAM,QAAQ,QAAQ,gBAAgB,GAAG;AAC3C,UAAI,QAAQ,iBAAiB,SAAS,GAAG;AACvC,aAAK,KAAK,wBAAwB,QAAQ,iBAAiB,KAAK,GAAG,CAAC;AAAA,MACtE;AAAA,IACF,OAAO;AACL,WAAK,KAAK,wBAAwB,QAAQ,gBAAgB;AAAA,IAC5D;AAAA,EACF;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,YAAY;AAAA,EACxB;AACA,MAAI,QAAQ,gBAAgB;AAC1B,SAAK,KAAK,oBAAoB;AAAA,EAChC;AACA,MAAI,QAAQ,qBAAqB;AAC/B,SAAK,KAAK,2BAA2B,QAAQ,mBAAmB;AAAA,EAClE;AACA,MAAI,QAAQ,eAAe;AACzB,SAAK,KAAK,mBAAmB;AAAA,EAC/B;AACA,MAAI,QAAQ,aAAa;AACvB,SAAK,KAAK,kBAAkB,QAAQ,WAAW;AAAA,EACjD;AACA,MAAI,QAAQ,cAAc;AACxB,SAAK,KAAK,mBAAmB,QAAQ,YAAY;AAAA,EACnD;AACA,MAAI,QAAQ,uBAAuB;AACjC,SAAK,KAAK,6BAA6B,QAAQ,qBAAqB;AAAA,EACtE;AACA,MAAI,QAAQ,mBAAmB;AAC7B,SAAK,KAAK,yBAAyB,QAAQ,iBAAiB;AAAA,EAC9D;AACA,MAAI,QAAQ,mBAAmB;AAC7B,SAAK,KAAK,yBAAyB,QAAQ,iBAAiB;AAAA,EAC9D;AACA,MAAI,QAAQ,aAAa;AACvB,SAAK,KAAK,kBAAkB,QAAQ,WAAW;AAAA,EACjD;AACA,MAAI,QAAQ,aAAa;AACvB,SAAK,KAAK,kBAAkB,QAAQ,WAAW;AAAA,EACjD;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,eAAe,QAAQ,QAAQ;AAAA,EAC3C;AACA,MAAI,QAAQ,OAAO;AACjB,SAAK,KAAK,WAAW,QAAQ,KAAK;AAAA,EACpC;AACA,MAAI,QAAQ,qBAAqB;AAC/B,SAAK,KAAK,yBAAyB;AAAA,EACrC;AACA,MAAI,QAAQ,qBAAqB;AAC/B,SAAK,KAAK,wBAAwB;AAAA,EACpC;AACA,MAAI,QAAQ,QAAQ;AAClB,SAAK,KAAK,YAAY,QAAQ,MAAM;AAAA,EACtC;AACA,MAAI,QAAQ,YAAY;AACtB,SAAK,KAAK,iBAAiB,QAAQ,UAAU;AAAA,EAC/C;AACA,MAAI,QAAQ,WAAW;AACrB,SAAK,KAAK,gBAAgB,QAAQ,SAAS;AAAA,EAC7C;AACA,MAAI,QAAQ,eAAe;AACzB,SAAK,KAAK,oBAAoB,QAAQ,aAAa;AAAA,EACrD;AACA,MAAI,QAAQ,gBAAgB;AAC1B,SAAK,KAAK,mBAAmB;AAAA,EAC/B;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,aAAa;AAAA,EACzB;AAEA,SAAO;AACT;;;AD5QA,IAAMA,kBAAa,0BAAc,aAAe;AAChD,IAAM,YAAiB,aAAQA,WAAU;AAEzC,IAAM,WAAW;AAMjB,SAAS,WAAW,MAAgB,mBAAwC,CAAC,GAAoB;AAC/F,QAAM,EAAE,eAAe,MAAM,IAAI;AAEjC,SAAO,IAAI,QAAQ,CAAC,SAAS,WAAW;AACtC,UAAM,UAAe,UAAK,WAAW,MAAM,OAAO,QAAQ;AAE1D,QAAI,CAAI,cAAW,OAAO,GAAG;AAC3B,aAAO;AAAA,QACL,IAAI,MAAM,yBAAyB,OAAO,sCAAsC;AAAA,MAClF;AAAA,IACF;AAEA,UAAM,UAAU;AAChB,UAAM,cAAc,CAAC,QAAQ,SAAS,GAAG,IAAI;AAE7C,UAAM,kBAAc,4BAAM,SAAS,WAAW;AAE9C,QAAI,SAAS;AACb,QAAI,SAAS;AAEb,gBAAY,OAAO,GAAG,QAAQ,CAAC,SAAS;AACtC,YAAM,QAAQ,KAAK,SAAS;AAC5B,UAAI,cAAc;AAChB,gBAAQ,OAAO,MAAM,KAAK;AAAA,MAC5B;AACA,gBAAU;AAAA,IACZ,CAAC;AAED,gBAAY,OAAO,GAAG,QAAQ,CAAC,SAAS;AACtC,YAAM,QAAQ,KAAK,SAAS;AAC5B,UAAI,cAAc;AAChB,gBAAQ,OAAO,MAAM,KAAK;AAAA,MAC5B;AACA,gBAAU;AAAA,IACZ,CAAC;AAED,gBAAY,GAAG,SAAS,CAAC,SAAS;AAChC,UAAI,SAAS,GAAG;AACd,gBAAQ,MAAM;AAAA,MAChB,OAAO;AACL,cAAM,cAAc,UAAU;AAC9B,cAAM,QAAQ,IAAI;AAAA,UAChB,+CAA+C,IAAI;AAAA;AAAA,EAAQ,WAAW;AAAA,QACxE;AACA,eAAO,KAAK;AAAA,MACd;AAAA,IACF,CAAC;AAED,gBAAY,GAAG,SAAS,CAAC,QAAe;AACtC,UAAI,IAAI,QAAQ,SAAS,QAAQ,GAAG;AAClC;AAAA,UACE,IAAI;AAAA,YACF;AAAA,UACF;AAAA,QACF;AAAA,MACF,OAAO;AACL,eAAO,GAAG;AAAA,MACZ;AAAA,IACF,CAAC;AAAA,EACH,CAAC;AACH;AAEO,SAAS,QACd,YACA,UAA0B,CAAC,GACV;AACjB,QAAM,YAAY,MAAM,QAAQ,UAAU,IAAI,aAAa,CAAC,UAAU;AACtE,MAAI,UAAU,WAAW,GAAG;AAC1B,WAAO,QAAQ,OAAO,IAAI,MAAM,2CAA2C,CAAC;AAAA,EAC9E;AAEA,aAAW,SAAS,WAAW;AAC7B,QAAI,CAAI,cAAW,KAAK,GAAG;AACzB,aAAO,QAAQ,OAAO,IAAI,MAAM,mCAAmC,KAAK,EAAE,CAAC;AAAA,IAC7E;AAAA,EACF;AAEA,QAAM,OAAiB,CAAC,GAAG,WAAW,GAAG,UAAU,OAAO,CAAC;AAE3D,SAAO,WAAW,MAAM;AAAA,IACtB,cAAc,CAAC,QAAQ;AAAA,EACzB,CAAC;AACH;;;AE9FO,SAAS,mBAAmB,SAAwB;AACzD,UAAQ,OAAO,4BAA4B,yEAAyE;AACpH,UAAQ,OAAO,0BAA0B,kCAAkC;AAC3E,UAAQ,OAAO,wBAAwB,gJAAgJ;AACvL,UAAQ,OAAO,eAAe,iCAAiC;AAC/D,UAAQ,OAAO,gCAAgC,sFAAsF;AACrI,UAAQ,OAAO,cAAc,mHAAmH;AAChJ,UAAQ,OAAO,sBAAsB,iDAAiD;AACtF,UAAQ,OAAO,mCAAmC,2EAA2E;AAC7H,UAAQ,OAAO,qBAAqB,8EAA8E;AAClH,UAAQ,OAAO,0BAA0B,sGAAsG;AAC/I,UAAQ,OAAO,2BAA2B,6DAA6D;AACvG,UAAQ,OAAO,qCAAqC,+FAA+F;AACnJ,UAAQ,OAAO,iCAAiC,2FAA2F;AAC3I,UAAQ,OAAO,iCAAiC,2FAA2F;AAC3I,UAAQ,OAAO,0BAA0B,wHAAwH;AACjK,UAAQ,OAAO,0BAA0B,qEAAqE;AAC9G,UAAQ,OAAO,uBAAuB,gCAAgC;AACtE,UAAQ,OAAO,mBAAmB,wDAAwD;AAC1F,UAAQ,OAAO,2BAA2B,4CAA4C;AACtF,UAAQ,OAAO,0BAA0B,gHAAgH;AACzJ,UAAQ,OAAO,oBAAoB,sNAAsN;AACzP,UAAQ,OAAO,yBAAyB,sGAAsG;AAC9I,UAAQ,OAAO,wBAAwB,+CAA+C;AACtF,UAAQ,OAAO,4BAA4B,6EAA6E;AACxH,UAAQ,OAAO,qBAAqB,qEAAqE;AACzG,UAAQ,OAAO,eAAe,6DAA6D;AAC7F;;;AH7BA,SAAS,gBAAyB;AAChC,QAAM,UAAU,IAAI,yBAAQ;AAE5B,UACG,KAAK,oBAAoB,EACzB,MAAM,sBAAsB,EAC5B,YAAY,4CAA4C,EACxD,mBAAmB,wCAAwC,EAC3D,yBAAyB,KAAK,EAC9B,SAAS,cAAc,uCAAuC;AAGjE,qBAAmB,OAAO;AAE1B,UAAQ,gBAAgB;AAAA,IACtB,UAAU,CAAC,QAAQ;AACjB,cAAQ,MAAM,IAAI,QAAQ,CAAC;AAAA,IAC7B;AAAA,IACA,aAAa,CAAC,KAAK,UAAU;AAC3B,YAAM,GAAG;AAAA,IACX;AAAA,EACF,CAAC;AAED,SAAO;AACT;AAEA,eAAe,OAAwB;AACrC,QAAM,UAAU,cAAc;AAE9B,UAAQ,aAAa;AAErB,MAAI;AACF,YAAQ,MAAM,QAAQ,IAAI;AAAA,EAC5B,SAAS,KAAK;AACZ,QAAI,eAAe,iCAAgB;AACjC,UAAI,IAAI,SAAS,2BAA2B;AAC1C,eAAO;AAAA,MACT;AACA,aAAO,IAAI,YAAY;AAAA,IACzB;AAEA,UAAM,UAAU,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG;AAC/D,YAAQ,MAAM,OAAO;AACrB,YAAQ,MAAM,wCAAwC;AACtD,WAAO;AAAA,EACT;AAEA,QAAM,aAAa,QAAQ,KAAiB;AAC5C,QAAM,aAAa,QAAQ;AAC3B,QAAM,iBAAiB,oBAAoB,UAAU;AAErD,MAAI;AACF,UAAM,SAAS,MAAM,QAAQ,YAAY,cAAc;AACvD,QAAI,UAAU,CAAC,eAAe,OAAO;AACnC,cAAQ,OAAO,MAAM,MAAM;AAC3B,UAAI,CAAC,OAAO,SAAS,IAAI,GAAG;AAC1B,gBAAQ,OAAO,MAAM,IAAI;AAAA,MAC3B;AAAA,IACF;AACA,WAAO;AAAA,EACT,SAAS,KAAK;AACZ,UAAM,UAAU,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG;AAC/D,YAAQ,MAAM,OAAO;AACrB,WAAO;AAAA,EACT;AACF;AAEA,KAAK,EAAE,KAAK,CAAC,SAAS;AACpB,MAAI,SAAS,GAAG;AACd,YAAQ,KAAK,IAAI;AAAA,EACnB;AACF,CAAC;","names":["__filename"]}
1
+ {"version":3,"sources":["../node_modules/.pnpm/tsup@8.5.1_postcss@8.5.9_tsx@4.20.5_typescript@6.0.2/node_modules/tsup/assets/cjs_shims.js","../src/cli.ts","../src/index.ts","../src/convert-options.generated.ts","../src/cli-options.generated.ts"],"sourcesContent":["// Shim globals in cjs bundle\n// There's a weird bug that esbuild will always inject importMetaUrl\n// if we export it as `const importMetaUrl = ... __filename ...`\n// But using a function will not cause this issue\n\nconst getImportMetaUrl = () => \n typeof document === \"undefined\" \n ? new URL(`file:${__filename}`).href \n : (document.currentScript && document.currentScript.tagName.toUpperCase() === 'SCRIPT') \n ? document.currentScript.src \n : new URL(\"main.js\", document.baseURI).href;\n\nexport const importMetaUrl = /* @__PURE__ */ getImportMetaUrl()\n","#!/usr/bin/env node\nimport { Command, CommanderError } from 'commander';\nimport { _runForCli } from './index.js';\nimport { CliOptions, buildConvertOptions } from './convert-options.generated.js';\nimport { registerCliOptions } from './cli-options.generated.js';\n\nfunction createProgram(): Command {\n const program = new Command();\n\n program\n .name('opendataloader-pdf')\n .usage('[options] <input...>')\n .description('Convert PDFs using the OpenDataLoader CLI.')\n .showHelpAfterError(\"Use '--help' to see available options.\")\n .showSuggestionAfterError(false)\n .argument('<input...>', 'Input files or directories to convert');\n\n // Register CLI options from auto-generated file\n registerCliOptions(program);\n\n program.configureOutput({\n writeErr: (str) => {\n console.error(str.trimEnd());\n },\n outputError: (str, write) => {\n write(str);\n },\n });\n\n return program;\n}\n\nasync function main(): Promise<number> {\n const program = createProgram();\n\n program.exitOverride();\n\n try {\n program.parse(process.argv);\n } catch (err) {\n if (err instanceof CommanderError) {\n if (err.code === 'commander.helpDisplayed') {\n return 0;\n }\n return err.exitCode ?? 1;\n }\n\n const message = err instanceof Error ? err.message : String(err);\n console.error(message);\n console.error(\"Use '--help' to see available options.\");\n return 1;\n }\n\n const cliOptions = program.opts<CliOptions>();\n const inputPaths = program.args;\n const convertOptions = buildConvertOptions(cliOptions);\n\n try {\n // _runForCli streams stdout/stderr to the parent process as they arrive;\n // we deliberately do not re-print anything here. (Issue #398.)\n await _runForCli(inputPaths, convertOptions);\n return 0;\n } catch (err) {\n const message = err instanceof Error ? err.message : String(err);\n console.error(message);\n return 1;\n }\n}\n\nmain().then((code) => {\n if (code !== 0) {\n process.exit(code);\n }\n});\n","import { spawn } from 'child_process';\nimport * as path from 'path';\nimport * as fs from 'fs';\nimport { StringDecoder } from 'string_decoder';\nimport { fileURLToPath } from 'url';\n\n// Re-export types and utilities from auto-generated file\nexport type { ConvertOptions } from './convert-options.generated.js';\nexport { buildArgs } from './convert-options.generated.js';\nimport type { ConvertOptions } from './convert-options.generated.js';\nimport { buildArgs } from './convert-options.generated.js';\n\nconst __filename = fileURLToPath(import.meta.url);\nconst __dirname = path.dirname(__filename);\n\nconst JAR_NAME = 'opendataloader-pdf-cli.jar';\n\ninterface JarExecutionOptions {\n /**\n * When true, forwards Java's stdout and stderr chunks to the parent\n * process in real time as well as accumulating them. Used by the bundled\n * CLI so long-running conversions show progress as it happens.\n */\n streamOutput?: boolean;\n}\n\nfunction executeJar(args: string[], executionOptions: JarExecutionOptions = {}): Promise<string> {\n const { streamOutput = false } = executionOptions;\n\n return new Promise((resolve, reject) => {\n const jarPath = path.join(__dirname, '..', 'lib', JAR_NAME);\n\n if (!fs.existsSync(jarPath)) {\n return reject(\n new Error(`JAR file not found at ${jarPath}. Please run the build script first.`),\n );\n }\n\n const command = 'java';\n const commandArgs = ['-jar', jarPath, ...args];\n\n const javaProcess = spawn(command, commandArgs);\n\n let stdout = '';\n let stderr = '';\n // StringDecoder buffers incomplete multi-byte UTF-8 sequences across\n // chunk boundaries — Buffer.toString() alone would emit replacement\n // characters when, e.g., a 3-byte Korean codepoint splits across two\n // 'data' events. One decoder per stream so they don't share state.\n const stdoutDecoder = new StringDecoder('utf8');\n const stderrDecoder = new StringDecoder('utf8');\n\n javaProcess.stdout.on('data', (data: Buffer) => {\n const chunk = stdoutDecoder.write(data);\n if (chunk.length === 0) return;\n if (streamOutput) {\n // Stream-only: don't also accumulate, or a multi-hour conversion\n // would buffer its entire (potentially gigabyte) stdout in memory\n // for no consumer.\n process.stdout.write(chunk);\n } else {\n stdout += chunk;\n }\n });\n\n javaProcess.stderr.on('data', (data: Buffer) => {\n const chunk = stderrDecoder.write(data);\n if (chunk.length === 0) return;\n if (streamOutput) {\n process.stderr.write(chunk);\n }\n // stderr is always accumulated (progress logs are small and we need\n // them for error messages on non-zero exit).\n stderr += chunk;\n });\n\n javaProcess.on('close', (code) => {\n // Flush any trailing bytes the decoder is still holding (always emit\n // them — if we drop them on error paths, error messages with non-ASCII\n // characters lose their tail).\n const stdoutTail = stdoutDecoder.end();\n const stderrTail = stderrDecoder.end();\n if (stdoutTail.length > 0) {\n if (streamOutput) {\n process.stdout.write(stdoutTail);\n } else {\n stdout += stdoutTail;\n }\n }\n if (stderrTail.length > 0) {\n if (streamOutput) process.stderr.write(stderrTail);\n stderr += stderrTail;\n }\n\n if (code === 0) {\n resolve(stdout);\n } else {\n const errorOutput = stderr || stdout;\n const error = new Error(\n `The opendataloader-pdf CLI exited with code ${code}.\\n\\n${errorOutput}`,\n );\n reject(error);\n }\n });\n\n javaProcess.on('error', (err: Error) => {\n if (err.message.includes('ENOENT')) {\n reject(\n new Error(\n \"'java' command not found. Please ensure Java is installed and in your system's PATH.\",\n ),\n );\n } else {\n reject(err);\n }\n });\n });\n}\n\nfunction buildJarArgs(\n inputPaths: string | string[],\n options: ConvertOptions,\n): string[] | Error {\n const inputList = Array.isArray(inputPaths) ? inputPaths : [inputPaths];\n if (inputList.length === 0) {\n return new Error('At least one input path must be provided.');\n }\n\n for (const input of inputList) {\n if (!fs.existsSync(input)) {\n return new Error(`Input file or folder not found: ${input}`);\n }\n }\n\n return [...inputList, ...buildArgs(options)];\n}\n\nexport function convert(\n inputPaths: string | string[],\n options: ConvertOptions = {},\n): Promise<string> {\n const argsOrError = buildJarArgs(inputPaths, options);\n if (argsOrError instanceof Error) {\n return Promise.reject(argsOrError);\n }\n // Library API: never streams to the parent process. Returns the full stdout\n // string so callers can do `const out = await convert(...)` without surprise\n // side-effects on process.stdout / process.stderr.\n return executeJar(argsOrError, { streamOutput: false });\n}\n\n/**\n * Internal entry point used by the bundled CLI. Streams Java's stdout and\n * stderr to the parent process in real time (so long-running conversions like\n * hybrid mode show progress as it happens) and resolves without a stdout\n * payload — preventing the caller from re-printing what was already streamed.\n *\n * Not part of the public API: do not import this from application code. Use\n * {@link convert} instead.\n *\n * @internal\n */\nexport async function _runForCli(\n inputPaths: string | string[],\n options: ConvertOptions = {},\n): Promise<void> {\n const argsOrError = buildJarArgs(inputPaths, options);\n if (argsOrError instanceof Error) {\n throw argsOrError;\n }\n await executeJar(argsOrError, { streamOutput: true });\n}\n\n/**\n * @deprecated Use `convert()` and `ConvertOptions` instead. This function will be removed in a future version.\n */\nexport interface RunOptions {\n outputFolder?: string;\n password?: string;\n replaceInvalidChars?: string;\n generateMarkdown?: boolean;\n generateHtml?: boolean;\n generateAnnotatedPdf?: boolean;\n keepLineBreaks?: boolean;\n contentSafetyOff?: string;\n htmlInMarkdown?: boolean;\n addImageToMarkdown?: boolean;\n noJson?: boolean;\n debug?: boolean;\n useStructTree?: boolean;\n}\n\n/**\n * @deprecated Use `convert()` instead. This function will be removed in a future version.\n */\nexport function run(inputPath: string, options: RunOptions = {}): Promise<string> {\n console.warn(\n 'Warning: run() is deprecated and will be removed in a future version. Use convert() instead.',\n );\n\n // Build format array based on legacy boolean options\n const formats: string[] = [];\n if (!options.noJson) {\n formats.push('json');\n }\n if (options.generateMarkdown) {\n if (options.addImageToMarkdown) {\n formats.push('markdown-with-images');\n } else if (options.htmlInMarkdown) {\n formats.push('markdown-with-html');\n } else {\n formats.push('markdown');\n }\n }\n if (options.generateHtml) {\n formats.push('html');\n }\n if (options.generateAnnotatedPdf) {\n formats.push('pdf');\n }\n\n return convert(inputPath, {\n outputDir: options.outputFolder,\n password: options.password,\n replaceInvalidChars: options.replaceInvalidChars,\n keepLineBreaks: options.keepLineBreaks,\n contentSafetyOff: options.contentSafetyOff,\n useStructTree: options.useStructTree,\n format: formats.length > 0 ? formats : undefined,\n quiet: !options.debug,\n });\n}\n","// AUTO-GENERATED FROM options.json - DO NOT EDIT DIRECTLY\n// Run `npm run generate-options` to regenerate\n\n/**\n * Options for the convert function.\n */\nexport interface ConvertOptions {\n /** Directory where output files are written. Default: input file directory */\n outputDir?: string;\n /** Password for encrypted PDF files */\n password?: string;\n /** Output formats (comma-separated). Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images, tagged-pdf. Default: json */\n format?: string | string[];\n /** Suppress console logging output */\n quiet?: boolean;\n /** Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg */\n contentSafetyOff?: string | string[];\n /** Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders */\n sanitize?: boolean;\n /** Preserve original line breaks in extracted text */\n keepLineBreaks?: boolean;\n /** Replacement character for invalid/unrecognized characters. Default: space */\n replaceInvalidChars?: string;\n /** Use PDF structure tree (tagged PDF) for reading order and semantic structure */\n useStructTree?: boolean;\n /** Table detection method. Values: default (border-based), cluster (border + cluster). Default: default */\n tableMethod?: string;\n /** Reading order algorithm. Values: off, xycut. Default: xycut */\n readingOrder?: string;\n /** Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none */\n markdownPageSeparator?: string;\n /** Separator between pages in text output. Use %page-number% for page numbers. Default: none */\n textPageSeparator?: string;\n /** Separator between pages in HTML output. Use %page-number% for page numbers. Default: none */\n htmlPageSeparator?: string;\n /** Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external */\n imageOutput?: string;\n /** Output format for extracted images. Values: png, jpeg. Default: png */\n imageFormat?: string;\n /** Directory for extracted images */\n imageDir?: string;\n /** Pages to extract (e.g., \"1,3,5-7\"). Default: all pages */\n pages?: string;\n /** Include page headers and footers in output */\n includeHeaderFooter?: boolean;\n /** Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental) */\n detectStrikethrough?: boolean;\n /** Hybrid backend (requires a running server). Quick start: pip install \"opendataloader-pdf[hybrid]\" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai */\n hybrid?: string;\n /** Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) */\n hybridMode?: string;\n /** Hybrid backend server URL (overrides default) */\n hybridUrl?: string;\n /** Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0 */\n hybridTimeout?: string;\n /** Opt in to Java fallback on hybrid backend error (default: disabled) */\n hybridFallback?: boolean;\n /** DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list) */\n hybridHancomAiRegionlistStrategy?: string;\n /** OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only) */\n hybridHancomAiOcrStrategy?: string;\n /** Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk */\n hybridHancomAiImageCache?: string;\n /** Write output to stdout instead of file (single format only) */\n toStdout?: boolean;\n /** Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode */\n threads?: string;\n}\n\n/**\n * Options as parsed from CLI (all values are strings from commander).\n */\nexport interface CliOptions {\n outputDir?: string;\n password?: string;\n format?: string;\n quiet?: boolean;\n contentSafetyOff?: string;\n sanitize?: boolean;\n keepLineBreaks?: boolean;\n replaceInvalidChars?: string;\n useStructTree?: boolean;\n tableMethod?: string;\n readingOrder?: string;\n markdownPageSeparator?: string;\n textPageSeparator?: string;\n htmlPageSeparator?: string;\n imageOutput?: string;\n imageFormat?: string;\n imageDir?: string;\n pages?: string;\n includeHeaderFooter?: boolean;\n detectStrikethrough?: boolean;\n hybrid?: string;\n hybridMode?: string;\n hybridUrl?: string;\n hybridTimeout?: string;\n hybridFallback?: boolean;\n hybridHancomAiRegionlistStrategy?: string;\n hybridHancomAiOcrStrategy?: string;\n hybridHancomAiImageCache?: string;\n toStdout?: boolean;\n threads?: string;\n}\n\n/**\n * Convert CLI options to ConvertOptions.\n */\nexport function buildConvertOptions(cliOptions: CliOptions): ConvertOptions {\n const convertOptions: ConvertOptions = {};\n\n if (cliOptions.outputDir) {\n convertOptions.outputDir = cliOptions.outputDir;\n }\n if (cliOptions.password) {\n convertOptions.password = cliOptions.password;\n }\n if (cliOptions.format) {\n convertOptions.format = cliOptions.format;\n }\n if (cliOptions.quiet) {\n convertOptions.quiet = true;\n }\n if (cliOptions.contentSafetyOff) {\n convertOptions.contentSafetyOff = cliOptions.contentSafetyOff;\n }\n if (cliOptions.sanitize) {\n convertOptions.sanitize = true;\n }\n if (cliOptions.keepLineBreaks) {\n convertOptions.keepLineBreaks = true;\n }\n if (cliOptions.replaceInvalidChars) {\n convertOptions.replaceInvalidChars = cliOptions.replaceInvalidChars;\n }\n if (cliOptions.useStructTree) {\n convertOptions.useStructTree = true;\n }\n if (cliOptions.tableMethod) {\n convertOptions.tableMethod = cliOptions.tableMethod;\n }\n if (cliOptions.readingOrder) {\n convertOptions.readingOrder = cliOptions.readingOrder;\n }\n if (cliOptions.markdownPageSeparator) {\n convertOptions.markdownPageSeparator = cliOptions.markdownPageSeparator;\n }\n if (cliOptions.textPageSeparator) {\n convertOptions.textPageSeparator = cliOptions.textPageSeparator;\n }\n if (cliOptions.htmlPageSeparator) {\n convertOptions.htmlPageSeparator = cliOptions.htmlPageSeparator;\n }\n if (cliOptions.imageOutput) {\n convertOptions.imageOutput = cliOptions.imageOutput;\n }\n if (cliOptions.imageFormat) {\n convertOptions.imageFormat = cliOptions.imageFormat;\n }\n if (cliOptions.imageDir) {\n convertOptions.imageDir = cliOptions.imageDir;\n }\n if (cliOptions.pages) {\n convertOptions.pages = cliOptions.pages;\n }\n if (cliOptions.includeHeaderFooter) {\n convertOptions.includeHeaderFooter = true;\n }\n if (cliOptions.detectStrikethrough) {\n convertOptions.detectStrikethrough = true;\n }\n if (cliOptions.hybrid) {\n convertOptions.hybrid = cliOptions.hybrid;\n }\n if (cliOptions.hybridMode) {\n convertOptions.hybridMode = cliOptions.hybridMode;\n }\n if (cliOptions.hybridUrl) {\n convertOptions.hybridUrl = cliOptions.hybridUrl;\n }\n if (cliOptions.hybridTimeout) {\n convertOptions.hybridTimeout = cliOptions.hybridTimeout;\n }\n if (cliOptions.hybridFallback) {\n convertOptions.hybridFallback = true;\n }\n if (cliOptions.hybridHancomAiRegionlistStrategy) {\n convertOptions.hybridHancomAiRegionlistStrategy = cliOptions.hybridHancomAiRegionlistStrategy;\n }\n if (cliOptions.hybridHancomAiOcrStrategy) {\n convertOptions.hybridHancomAiOcrStrategy = cliOptions.hybridHancomAiOcrStrategy;\n }\n if (cliOptions.hybridHancomAiImageCache) {\n convertOptions.hybridHancomAiImageCache = cliOptions.hybridHancomAiImageCache;\n }\n if (cliOptions.toStdout) {\n convertOptions.toStdout = true;\n }\n if (cliOptions.threads) {\n convertOptions.threads = cliOptions.threads;\n }\n\n return convertOptions;\n}\n\n/**\n * Build CLI arguments array from ConvertOptions.\n */\nexport function buildArgs(options: ConvertOptions): string[] {\n const args: string[] = [];\n\n if (options.outputDir) {\n args.push('--output-dir', options.outputDir);\n }\n if (options.password) {\n args.push('--password', options.password);\n }\n if (options.format) {\n if (Array.isArray(options.format)) {\n if (options.format.length > 0) {\n args.push('--format', options.format.join(','));\n }\n } else {\n args.push('--format', options.format);\n }\n }\n if (options.quiet) {\n args.push('--quiet');\n }\n if (options.contentSafetyOff) {\n if (Array.isArray(options.contentSafetyOff)) {\n if (options.contentSafetyOff.length > 0) {\n args.push('--content-safety-off', options.contentSafetyOff.join(','));\n }\n } else {\n args.push('--content-safety-off', options.contentSafetyOff);\n }\n }\n if (options.sanitize) {\n args.push('--sanitize');\n }\n if (options.keepLineBreaks) {\n args.push('--keep-line-breaks');\n }\n if (options.replaceInvalidChars) {\n args.push('--replace-invalid-chars', options.replaceInvalidChars);\n }\n if (options.useStructTree) {\n args.push('--use-struct-tree');\n }\n if (options.tableMethod) {\n args.push('--table-method', options.tableMethod);\n }\n if (options.readingOrder) {\n args.push('--reading-order', options.readingOrder);\n }\n if (options.markdownPageSeparator) {\n args.push('--markdown-page-separator', options.markdownPageSeparator);\n }\n if (options.textPageSeparator) {\n args.push('--text-page-separator', options.textPageSeparator);\n }\n if (options.htmlPageSeparator) {\n args.push('--html-page-separator', options.htmlPageSeparator);\n }\n if (options.imageOutput) {\n args.push('--image-output', options.imageOutput);\n }\n if (options.imageFormat) {\n args.push('--image-format', options.imageFormat);\n }\n if (options.imageDir) {\n args.push('--image-dir', options.imageDir);\n }\n if (options.pages) {\n args.push('--pages', options.pages);\n }\n if (options.includeHeaderFooter) {\n args.push('--include-header-footer');\n }\n if (options.detectStrikethrough) {\n args.push('--detect-strikethrough');\n }\n if (options.hybrid) {\n args.push('--hybrid', options.hybrid);\n }\n if (options.hybridMode) {\n args.push('--hybrid-mode', options.hybridMode);\n }\n if (options.hybridUrl) {\n args.push('--hybrid-url', options.hybridUrl);\n }\n if (options.hybridTimeout) {\n args.push('--hybrid-timeout', options.hybridTimeout);\n }\n if (options.hybridFallback) {\n args.push('--hybrid-fallback');\n }\n if (options.hybridHancomAiRegionlistStrategy) {\n args.push('--hybrid-hancom-ai-regionlist-strategy', options.hybridHancomAiRegionlistStrategy);\n }\n if (options.hybridHancomAiOcrStrategy) {\n args.push('--hybrid-hancom-ai-ocr-strategy', options.hybridHancomAiOcrStrategy);\n }\n if (options.hybridHancomAiImageCache) {\n args.push('--hybrid-hancom-ai-image-cache', options.hybridHancomAiImageCache);\n }\n if (options.toStdout) {\n args.push('--to-stdout');\n }\n if (options.threads) {\n args.push('--threads', options.threads);\n }\n\n return args;\n}\n","// AUTO-GENERATED FROM options.json - DO NOT EDIT DIRECTLY\n// Run `npm run generate-options` to regenerate\n\nimport { Command } from 'commander';\n\n/**\n * Register all CLI options on the given Commander program.\n */\nexport function registerCliOptions(program: Command): void {\n program.option('-o, --output-dir <value>', 'Directory where output files are written. Default: input file directory');\n program.option('-p, --password <value>', 'Password for encrypted PDF files');\n program.option('-f, --format <value>', 'Output formats (comma-separated). Values: json, text, html, pdf, markdown, markdown-with-html, markdown-with-images, tagged-pdf. Default: json');\n program.option('-q, --quiet', 'Suppress console logging output');\n program.option('--content-safety-off <value>', 'Disable content safety filters. Values: all, hidden-text, off-page, tiny, hidden-ocg');\n program.option('--sanitize', 'Enable sensitive data sanitization. Replaces emails, phone numbers, IPs, credit cards, and URLs with placeholders');\n program.option('--keep-line-breaks', 'Preserve original line breaks in extracted text');\n program.option('--replace-invalid-chars <value>', 'Replacement character for invalid/unrecognized characters. Default: space');\n program.option('--use-struct-tree', 'Use PDF structure tree (tagged PDF) for reading order and semantic structure');\n program.option('--table-method <value>', 'Table detection method. Values: default (border-based), cluster (border + cluster). Default: default');\n program.option('--reading-order <value>', 'Reading order algorithm. Values: off, xycut. Default: xycut');\n program.option('--markdown-page-separator <value>', 'Separator between pages in Markdown output. Use %page-number% for page numbers. Default: none');\n program.option('--text-page-separator <value>', 'Separator between pages in text output. Use %page-number% for page numbers. Default: none');\n program.option('--html-page-separator <value>', 'Separator between pages in HTML output. Use %page-number% for page numbers. Default: none');\n program.option('--image-output <value>', 'Image output mode. Values: off (no images), embedded (Base64 data URIs), external (file references). Default: external');\n program.option('--image-format <value>', 'Output format for extracted images. Values: png, jpeg. Default: png');\n program.option('--image-dir <value>', 'Directory for extracted images');\n program.option('--pages <value>', 'Pages to extract (e.g., \"1,3,5-7\"). Default: all pages');\n program.option('--include-header-footer', 'Include page headers and footers in output');\n program.option('--detect-strikethrough', 'Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental)');\n program.option('--hybrid <value>', 'Hybrid backend (requires a running server). Quick start: pip install \"opendataloader-pdf[hybrid]\" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai');\n program.option('--hybrid-mode <value>', 'Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)');\n program.option('--hybrid-url <value>', 'Hybrid backend server URL (overrides default)');\n program.option('--hybrid-timeout <value>', 'Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0');\n program.option('--hybrid-fallback', 'Opt in to Java fallback on hybrid backend error (default: disabled)');\n program.option('--hybrid-hancom-ai-regionlist-strategy <value>', 'DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list)');\n program.option('--hybrid-hancom-ai-ocr-strategy <value>', 'OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only)');\n program.option('--hybrid-hancom-ai-image-cache <value>', 'Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk');\n program.option('--to-stdout', 'Write output to stdout instead of file (single format only)');\n program.option('--threads <value>', 'Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode');\n}\n"],"mappings":";;;;;;;;;;;;;;;;;;;;;;;;;;AAKA,IAAM,mBAAmB,MACvB,OAAO,aAAa,cAChB,IAAI,IAAI,QAAQ,UAAU,EAAE,EAAE,OAC7B,SAAS,iBAAiB,SAAS,cAAc,QAAQ,YAAY,MAAM,WAC1E,SAAS,cAAc,MACvB,IAAI,IAAI,WAAW,SAAS,OAAO,EAAE;AAEtC,IAAM,gBAAgC,iCAAiB;;;ACX9D,uBAAwC;;;ACDxC,2BAAsB;AACtB,WAAsB;AACtB,SAAoB;AACpB,4BAA8B;AAC9B,iBAA8B;;;ACwGvB,SAAS,oBAAoB,YAAwC;AAC1E,QAAM,iBAAiC,CAAC;AAExC,MAAI,WAAW,WAAW;AACxB,mBAAe,YAAY,WAAW;AAAA,EACxC;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW,WAAW;AAAA,EACvC;AACA,MAAI,WAAW,QAAQ;AACrB,mBAAe,SAAS,WAAW;AAAA,EACrC;AACA,MAAI,WAAW,OAAO;AACpB,mBAAe,QAAQ;AAAA,EACzB;AACA,MAAI,WAAW,kBAAkB;AAC/B,mBAAe,mBAAmB,WAAW;AAAA,EAC/C;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW;AAAA,EAC5B;AACA,MAAI,WAAW,gBAAgB;AAC7B,mBAAe,iBAAiB;AAAA,EAClC;AACA,MAAI,WAAW,qBAAqB;AAClC,mBAAe,sBAAsB,WAAW;AAAA,EAClD;AACA,MAAI,WAAW,eAAe;AAC5B,mBAAe,gBAAgB;AAAA,EACjC;AACA,MAAI,WAAW,aAAa;AAC1B,mBAAe,cAAc,WAAW;AAAA,EAC1C;AACA,MAAI,WAAW,cAAc;AAC3B,mBAAe,eAAe,WAAW;AAAA,EAC3C;AACA,MAAI,WAAW,uBAAuB;AACpC,mBAAe,wBAAwB,WAAW;AAAA,EACpD;AACA,MAAI,WAAW,mBAAmB;AAChC,mBAAe,oBAAoB,WAAW;AAAA,EAChD;AACA,MAAI,WAAW,mBAAmB;AAChC,mBAAe,oBAAoB,WAAW;AAAA,EAChD;AACA,MAAI,WAAW,aAAa;AAC1B,mBAAe,cAAc,WAAW;AAAA,EAC1C;AACA,MAAI,WAAW,aAAa;AAC1B,mBAAe,cAAc,WAAW;AAAA,EAC1C;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW,WAAW;AAAA,EACvC;AACA,MAAI,WAAW,OAAO;AACpB,mBAAe,QAAQ,WAAW;AAAA,EACpC;AACA,MAAI,WAAW,qBAAqB;AAClC,mBAAe,sBAAsB;AAAA,EACvC;AACA,MAAI,WAAW,qBAAqB;AAClC,mBAAe,sBAAsB;AAAA,EACvC;AACA,MAAI,WAAW,QAAQ;AACrB,mBAAe,SAAS,WAAW;AAAA,EACrC;AACA,MAAI,WAAW,YAAY;AACzB,mBAAe,aAAa,WAAW;AAAA,EACzC;AACA,MAAI,WAAW,WAAW;AACxB,mBAAe,YAAY,WAAW;AAAA,EACxC;AACA,MAAI,WAAW,eAAe;AAC5B,mBAAe,gBAAgB,WAAW;AAAA,EAC5C;AACA,MAAI,WAAW,gBAAgB;AAC7B,mBAAe,iBAAiB;AAAA,EAClC;AACA,MAAI,WAAW,kCAAkC;AAC/C,mBAAe,mCAAmC,WAAW;AAAA,EAC/D;AACA,MAAI,WAAW,2BAA2B;AACxC,mBAAe,4BAA4B,WAAW;AAAA,EACxD;AACA,MAAI,WAAW,0BAA0B;AACvC,mBAAe,2BAA2B,WAAW;AAAA,EACvD;AACA,MAAI,WAAW,UAAU;AACvB,mBAAe,WAAW;AAAA,EAC5B;AACA,MAAI,WAAW,SAAS;AACtB,mBAAe,UAAU,WAAW;AAAA,EACtC;AAEA,SAAO;AACT;AAKO,SAAS,UAAU,SAAmC;AAC3D,QAAM,OAAiB,CAAC;AAExB,MAAI,QAAQ,WAAW;AACrB,SAAK,KAAK,gBAAgB,QAAQ,SAAS;AAAA,EAC7C;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,cAAc,QAAQ,QAAQ;AAAA,EAC1C;AACA,MAAI,QAAQ,QAAQ;AAClB,QAAI,MAAM,QAAQ,QAAQ,MAAM,GAAG;AACjC,UAAI,QAAQ,OAAO,SAAS,GAAG;AAC7B,aAAK,KAAK,YAAY,QAAQ,OAAO,KAAK,GAAG,CAAC;AAAA,MAChD;AAAA,IACF,OAAO;AACL,WAAK,KAAK,YAAY,QAAQ,MAAM;AAAA,IACtC;AAAA,EACF;AACA,MAAI,QAAQ,OAAO;AACjB,SAAK,KAAK,SAAS;AAAA,EACrB;AACA,MAAI,QAAQ,kBAAkB;AAC5B,QAAI,MAAM,QAAQ,QAAQ,gBAAgB,GAAG;AAC3C,UAAI,QAAQ,iBAAiB,SAAS,GAAG;AACvC,aAAK,KAAK,wBAAwB,QAAQ,iBAAiB,KAAK,GAAG,CAAC;AAAA,MACtE;AAAA,IACF,OAAO;AACL,WAAK,KAAK,wBAAwB,QAAQ,gBAAgB;AAAA,IAC5D;AAAA,EACF;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,YAAY;AAAA,EACxB;AACA,MAAI,QAAQ,gBAAgB;AAC1B,SAAK,KAAK,oBAAoB;AAAA,EAChC;AACA,MAAI,QAAQ,qBAAqB;AAC/B,SAAK,KAAK,2BAA2B,QAAQ,mBAAmB;AAAA,EAClE;AACA,MAAI,QAAQ,eAAe;AACzB,SAAK,KAAK,mBAAmB;AAAA,EAC/B;AACA,MAAI,QAAQ,aAAa;AACvB,SAAK,KAAK,kBAAkB,QAAQ,WAAW;AAAA,EACjD;AACA,MAAI,QAAQ,cAAc;AACxB,SAAK,KAAK,mBAAmB,QAAQ,YAAY;AAAA,EACnD;AACA,MAAI,QAAQ,uBAAuB;AACjC,SAAK,KAAK,6BAA6B,QAAQ,qBAAqB;AAAA,EACtE;AACA,MAAI,QAAQ,mBAAmB;AAC7B,SAAK,KAAK,yBAAyB,QAAQ,iBAAiB;AAAA,EAC9D;AACA,MAAI,QAAQ,mBAAmB;AAC7B,SAAK,KAAK,yBAAyB,QAAQ,iBAAiB;AAAA,EAC9D;AACA,MAAI,QAAQ,aAAa;AACvB,SAAK,KAAK,kBAAkB,QAAQ,WAAW;AAAA,EACjD;AACA,MAAI,QAAQ,aAAa;AACvB,SAAK,KAAK,kBAAkB,QAAQ,WAAW;AAAA,EACjD;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,eAAe,QAAQ,QAAQ;AAAA,EAC3C;AACA,MAAI,QAAQ,OAAO;AACjB,SAAK,KAAK,WAAW,QAAQ,KAAK;AAAA,EACpC;AACA,MAAI,QAAQ,qBAAqB;AAC/B,SAAK,KAAK,yBAAyB;AAAA,EACrC;AACA,MAAI,QAAQ,qBAAqB;AAC/B,SAAK,KAAK,wBAAwB;AAAA,EACpC;AACA,MAAI,QAAQ,QAAQ;AAClB,SAAK,KAAK,YAAY,QAAQ,MAAM;AAAA,EACtC;AACA,MAAI,QAAQ,YAAY;AACtB,SAAK,KAAK,iBAAiB,QAAQ,UAAU;AAAA,EAC/C;AACA,MAAI,QAAQ,WAAW;AACrB,SAAK,KAAK,gBAAgB,QAAQ,SAAS;AAAA,EAC7C;AACA,MAAI,QAAQ,eAAe;AACzB,SAAK,KAAK,oBAAoB,QAAQ,aAAa;AAAA,EACrD;AACA,MAAI,QAAQ,gBAAgB;AAC1B,SAAK,KAAK,mBAAmB;AAAA,EAC/B;AACA,MAAI,QAAQ,kCAAkC;AAC5C,SAAK,KAAK,0CAA0C,QAAQ,gCAAgC;AAAA,EAC9F;AACA,MAAI,QAAQ,2BAA2B;AACrC,SAAK,KAAK,mCAAmC,QAAQ,yBAAyB;AAAA,EAChF;AACA,MAAI,QAAQ,0BAA0B;AACpC,SAAK,KAAK,kCAAkC,QAAQ,wBAAwB;AAAA,EAC9E;AACA,MAAI,QAAQ,UAAU;AACpB,SAAK,KAAK,aAAa;AAAA,EACzB;AACA,MAAI,QAAQ,SAAS;AACnB,SAAK,KAAK,aAAa,QAAQ,OAAO;AAAA,EACxC;AAEA,SAAO;AACT;;;AD/SA,IAAMA,kBAAa,0BAAc,aAAe;AAChD,IAAM,YAAiB,aAAQA,WAAU;AAEzC,IAAM,WAAW;AAWjB,SAAS,WAAW,MAAgB,mBAAwC,CAAC,GAAoB;AAC/F,QAAM,EAAE,eAAe,MAAM,IAAI;AAEjC,SAAO,IAAI,QAAQ,CAAC,SAAS,WAAW;AACtC,UAAM,UAAe,UAAK,WAAW,MAAM,OAAO,QAAQ;AAE1D,QAAI,CAAI,cAAW,OAAO,GAAG;AAC3B,aAAO;AAAA,QACL,IAAI,MAAM,yBAAyB,OAAO,sCAAsC;AAAA,MAClF;AAAA,IACF;AAEA,UAAM,UAAU;AAChB,UAAM,cAAc,CAAC,QAAQ,SAAS,GAAG,IAAI;AAE7C,UAAM,kBAAc,4BAAM,SAAS,WAAW;AAE9C,QAAI,SAAS;AACb,QAAI,SAAS;AAKb,UAAM,gBAAgB,IAAI,oCAAc,MAAM;AAC9C,UAAM,gBAAgB,IAAI,oCAAc,MAAM;AAE9C,gBAAY,OAAO,GAAG,QAAQ,CAAC,SAAiB;AAC9C,YAAM,QAAQ,cAAc,MAAM,IAAI;AACtC,UAAI,MAAM,WAAW,EAAG;AACxB,UAAI,cAAc;AAIhB,gBAAQ,OAAO,MAAM,KAAK;AAAA,MAC5B,OAAO;AACL,kBAAU;AAAA,MACZ;AAAA,IACF,CAAC;AAED,gBAAY,OAAO,GAAG,QAAQ,CAAC,SAAiB;AAC9C,YAAM,QAAQ,cAAc,MAAM,IAAI;AACtC,UAAI,MAAM,WAAW,EAAG;AACxB,UAAI,cAAc;AAChB,gBAAQ,OAAO,MAAM,KAAK;AAAA,MAC5B;AAGA,gBAAU;AAAA,IACZ,CAAC;AAED,gBAAY,GAAG,SAAS,CAAC,SAAS;AAIhC,YAAM,aAAa,cAAc,IAAI;AACrC,YAAM,aAAa,cAAc,IAAI;AACrC,UAAI,WAAW,SAAS,GAAG;AACzB,YAAI,cAAc;AAChB,kBAAQ,OAAO,MAAM,UAAU;AAAA,QACjC,OAAO;AACL,oBAAU;AAAA,QACZ;AAAA,MACF;AACA,UAAI,WAAW,SAAS,GAAG;AACzB,YAAI,aAAc,SAAQ,OAAO,MAAM,UAAU;AACjD,kBAAU;AAAA,MACZ;AAEA,UAAI,SAAS,GAAG;AACd,gBAAQ,MAAM;AAAA,MAChB,OAAO;AACL,cAAM,cAAc,UAAU;AAC9B,cAAM,QAAQ,IAAI;AAAA,UAChB,+CAA+C,IAAI;AAAA;AAAA,EAAQ,WAAW;AAAA,QACxE;AACA,eAAO,KAAK;AAAA,MACd;AAAA,IACF,CAAC;AAED,gBAAY,GAAG,SAAS,CAAC,QAAe;AACtC,UAAI,IAAI,QAAQ,SAAS,QAAQ,GAAG;AAClC;AAAA,UACE,IAAI;AAAA,YACF;AAAA,UACF;AAAA,QACF;AAAA,MACF,OAAO;AACL,eAAO,GAAG;AAAA,MACZ;AAAA,IACF,CAAC;AAAA,EACH,CAAC;AACH;AAEA,SAAS,aACP,YACA,SACkB;AAClB,QAAM,YAAY,MAAM,QAAQ,UAAU,IAAI,aAAa,CAAC,UAAU;AACtE,MAAI,UAAU,WAAW,GAAG;AAC1B,WAAO,IAAI,MAAM,2CAA2C;AAAA,EAC9D;AAEA,aAAW,SAAS,WAAW;AAC7B,QAAI,CAAI,cAAW,KAAK,GAAG;AACzB,aAAO,IAAI,MAAM,mCAAmC,KAAK,EAAE;AAAA,IAC7D;AAAA,EACF;AAEA,SAAO,CAAC,GAAG,WAAW,GAAG,UAAU,OAAO,CAAC;AAC7C;AA2BA,eAAsB,WACpB,YACA,UAA0B,CAAC,GACZ;AACf,QAAM,cAAc,aAAa,YAAY,OAAO;AACpD,MAAI,uBAAuB,OAAO;AAChC,UAAM;AAAA,EACR;AACA,QAAM,WAAW,aAAa,EAAE,cAAc,KAAK,CAAC;AACtD;;;AEnKO,SAAS,mBAAmB,SAAwB;AACzD,UAAQ,OAAO,4BAA4B,yEAAyE;AACpH,UAAQ,OAAO,0BAA0B,kCAAkC;AAC3E,UAAQ,OAAO,wBAAwB,gJAAgJ;AACvL,UAAQ,OAAO,eAAe,iCAAiC;AAC/D,UAAQ,OAAO,gCAAgC,sFAAsF;AACrI,UAAQ,OAAO,cAAc,mHAAmH;AAChJ,UAAQ,OAAO,sBAAsB,iDAAiD;AACtF,UAAQ,OAAO,mCAAmC,2EAA2E;AAC7H,UAAQ,OAAO,qBAAqB,8EAA8E;AAClH,UAAQ,OAAO,0BAA0B,sGAAsG;AAC/I,UAAQ,OAAO,2BAA2B,6DAA6D;AACvG,UAAQ,OAAO,qCAAqC,+FAA+F;AACnJ,UAAQ,OAAO,iCAAiC,2FAA2F;AAC3I,UAAQ,OAAO,iCAAiC,2FAA2F;AAC3I,UAAQ,OAAO,0BAA0B,wHAAwH;AACjK,UAAQ,OAAO,0BAA0B,qEAAqE;AAC9G,UAAQ,OAAO,uBAAuB,gCAAgC;AACtE,UAAQ,OAAO,mBAAmB,wDAAwD;AAC1F,UAAQ,OAAO,2BAA2B,4CAA4C;AACtF,UAAQ,OAAO,0BAA0B,gHAAgH;AACzJ,UAAQ,OAAO,oBAAoB,iOAAiO;AACpQ,UAAQ,OAAO,yBAAyB,sGAAsG;AAC9I,UAAQ,OAAO,wBAAwB,+CAA+C;AACtF,UAAQ,OAAO,4BAA4B,6EAA6E;AACxH,UAAQ,OAAO,qBAAqB,qEAAqE;AACzG,UAAQ,OAAO,kDAAkD,8JAA8J;AAC/N,UAAQ,OAAO,2CAA2C,oIAAoI;AAC9L,UAAQ,OAAO,0CAA0C,uFAAuF;AAChJ,UAAQ,OAAO,eAAe,6DAA6D;AAC3F,UAAQ,OAAO,qBAAqB,iTAAiT;AACvV;;;AHjCA,SAAS,gBAAyB;AAChC,QAAM,UAAU,IAAI,yBAAQ;AAE5B,UACG,KAAK,oBAAoB,EACzB,MAAM,sBAAsB,EAC5B,YAAY,4CAA4C,EACxD,mBAAmB,wCAAwC,EAC3D,yBAAyB,KAAK,EAC9B,SAAS,cAAc,uCAAuC;AAGjE,qBAAmB,OAAO;AAE1B,UAAQ,gBAAgB;AAAA,IACtB,UAAU,CAAC,QAAQ;AACjB,cAAQ,MAAM,IAAI,QAAQ,CAAC;AAAA,IAC7B;AAAA,IACA,aAAa,CAAC,KAAK,UAAU;AAC3B,YAAM,GAAG;AAAA,IACX;AAAA,EACF,CAAC;AAED,SAAO;AACT;AAEA,eAAe,OAAwB;AACrC,QAAM,UAAU,cAAc;AAE9B,UAAQ,aAAa;AAErB,MAAI;AACF,YAAQ,MAAM,QAAQ,IAAI;AAAA,EAC5B,SAAS,KAAK;AACZ,QAAI,eAAe,iCAAgB;AACjC,UAAI,IAAI,SAAS,2BAA2B;AAC1C,eAAO;AAAA,MACT;AACA,aAAO,IAAI,YAAY;AAAA,IACzB;AAEA,UAAM,UAAU,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG;AAC/D,YAAQ,MAAM,OAAO;AACrB,YAAQ,MAAM,wCAAwC;AACtD,WAAO;AAAA,EACT;AAEA,QAAM,aAAa,QAAQ,KAAiB;AAC5C,QAAM,aAAa,QAAQ;AAC3B,QAAM,iBAAiB,oBAAoB,UAAU;AAErD,MAAI;AAGF,UAAM,WAAW,YAAY,cAAc;AAC3C,WAAO;AAAA,EACT,SAAS,KAAK;AACZ,UAAM,UAAU,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG;AAC/D,YAAQ,MAAM,OAAO;AACrB,WAAO;AAAA,EACT;AACF;AAEA,KAAK,EAAE,KAAK,CAAC,SAAS;AACpB,MAAI,SAAS,GAAG;AACd,YAAQ,KAAK,IAAI;AAAA,EACnB;AACF,CAAC;","names":["__filename"]}
package/dist/cli.js CHANGED
@@ -7,6 +7,7 @@ import { Command, CommanderError } from "commander";
7
7
  import { spawn } from "child_process";
8
8
  import * as path from "path";
9
9
  import * as fs from "fs";
10
+ import { StringDecoder } from "string_decoder";
10
11
  import { fileURLToPath } from "url";
11
12
 
12
13
  // src/convert-options.generated.ts
@@ -87,9 +88,21 @@ function buildConvertOptions(cliOptions) {
87
88
  if (cliOptions.hybridFallback) {
88
89
  convertOptions.hybridFallback = true;
89
90
  }
91
+ if (cliOptions.hybridHancomAiRegionlistStrategy) {
92
+ convertOptions.hybridHancomAiRegionlistStrategy = cliOptions.hybridHancomAiRegionlistStrategy;
93
+ }
94
+ if (cliOptions.hybridHancomAiOcrStrategy) {
95
+ convertOptions.hybridHancomAiOcrStrategy = cliOptions.hybridHancomAiOcrStrategy;
96
+ }
97
+ if (cliOptions.hybridHancomAiImageCache) {
98
+ convertOptions.hybridHancomAiImageCache = cliOptions.hybridHancomAiImageCache;
99
+ }
90
100
  if (cliOptions.toStdout) {
91
101
  convertOptions.toStdout = true;
92
102
  }
103
+ if (cliOptions.threads) {
104
+ convertOptions.threads = cliOptions.threads;
105
+ }
93
106
  return convertOptions;
94
107
  }
95
108
  function buildArgs(options) {
@@ -181,9 +194,21 @@ function buildArgs(options) {
181
194
  if (options.hybridFallback) {
182
195
  args.push("--hybrid-fallback");
183
196
  }
197
+ if (options.hybridHancomAiRegionlistStrategy) {
198
+ args.push("--hybrid-hancom-ai-regionlist-strategy", options.hybridHancomAiRegionlistStrategy);
199
+ }
200
+ if (options.hybridHancomAiOcrStrategy) {
201
+ args.push("--hybrid-hancom-ai-ocr-strategy", options.hybridHancomAiOcrStrategy);
202
+ }
203
+ if (options.hybridHancomAiImageCache) {
204
+ args.push("--hybrid-hancom-ai-image-cache", options.hybridHancomAiImageCache);
205
+ }
184
206
  if (options.toStdout) {
185
207
  args.push("--to-stdout");
186
208
  }
209
+ if (options.threads) {
210
+ args.push("--threads", options.threads);
211
+ }
187
212
  return args;
188
213
  }
189
214
 
@@ -205,21 +230,39 @@ function executeJar(args, executionOptions = {}) {
205
230
  const javaProcess = spawn(command, commandArgs);
206
231
  let stdout = "";
207
232
  let stderr = "";
233
+ const stdoutDecoder = new StringDecoder("utf8");
234
+ const stderrDecoder = new StringDecoder("utf8");
208
235
  javaProcess.stdout.on("data", (data) => {
209
- const chunk = data.toString();
236
+ const chunk = stdoutDecoder.write(data);
237
+ if (chunk.length === 0) return;
210
238
  if (streamOutput) {
211
239
  process.stdout.write(chunk);
240
+ } else {
241
+ stdout += chunk;
212
242
  }
213
- stdout += chunk;
214
243
  });
215
244
  javaProcess.stderr.on("data", (data) => {
216
- const chunk = data.toString();
245
+ const chunk = stderrDecoder.write(data);
246
+ if (chunk.length === 0) return;
217
247
  if (streamOutput) {
218
248
  process.stderr.write(chunk);
219
249
  }
220
250
  stderr += chunk;
221
251
  });
222
252
  javaProcess.on("close", (code) => {
253
+ const stdoutTail = stdoutDecoder.end();
254
+ const stderrTail = stderrDecoder.end();
255
+ if (stdoutTail.length > 0) {
256
+ if (streamOutput) {
257
+ process.stdout.write(stdoutTail);
258
+ } else {
259
+ stdout += stdoutTail;
260
+ }
261
+ }
262
+ if (stderrTail.length > 0) {
263
+ if (streamOutput) process.stderr.write(stderrTail);
264
+ stderr += stderrTail;
265
+ }
223
266
  if (code === 0) {
224
267
  resolve(stdout);
225
268
  } else {
@@ -245,20 +288,24 @@ ${errorOutput}`
245
288
  });
246
289
  });
247
290
  }
248
- function convert(inputPaths, options = {}) {
291
+ function buildJarArgs(inputPaths, options) {
249
292
  const inputList = Array.isArray(inputPaths) ? inputPaths : [inputPaths];
250
293
  if (inputList.length === 0) {
251
- return Promise.reject(new Error("At least one input path must be provided."));
294
+ return new Error("At least one input path must be provided.");
252
295
  }
253
296
  for (const input of inputList) {
254
297
  if (!fs.existsSync(input)) {
255
- return Promise.reject(new Error(`Input file or folder not found: ${input}`));
298
+ return new Error(`Input file or folder not found: ${input}`);
256
299
  }
257
300
  }
258
- const args = [...inputList, ...buildArgs(options)];
259
- return executeJar(args, {
260
- streamOutput: !options.quiet
261
- });
301
+ return [...inputList, ...buildArgs(options)];
302
+ }
303
+ async function _runForCli(inputPaths, options = {}) {
304
+ const argsOrError = buildJarArgs(inputPaths, options);
305
+ if (argsOrError instanceof Error) {
306
+ throw argsOrError;
307
+ }
308
+ await executeJar(argsOrError, { streamOutput: true });
262
309
  }
263
310
 
264
311
  // src/cli-options.generated.ts
@@ -283,12 +330,16 @@ function registerCliOptions(program) {
283
330
  program.option("--pages <value>", 'Pages to extract (e.g., "1,3,5-7"). Default: all pages');
284
331
  program.option("--include-header-footer", "Include page headers and footers in output");
285
332
  program.option("--detect-strikethrough", "Detect strikethrough text and wrap with ~~ in Markdown output or <del></del> tag in HTML output (experimental)");
286
- program.option("--hybrid <value>", 'Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast');
333
+ program.option("--hybrid <value>", 'Hybrid backend (requires a running server). Quick start: pip install "opendataloader-pdf[hybrid]" && opendataloader-pdf-hybrid --port 5002. For remote servers use --hybrid-url. Values: off (default), docling-fast, hancom-ai');
287
334
  program.option("--hybrid-mode <value>", "Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)");
288
335
  program.option("--hybrid-url <value>", "Hybrid backend server URL (overrides default)");
289
336
  program.option("--hybrid-timeout <value>", "Hybrid backend request timeout in milliseconds (0 = no timeout). Default: 0");
290
337
  program.option("--hybrid-fallback", "Opt in to Java fallback on hybrid backend error (default: disabled)");
338
+ program.option("--hybrid-hancom-ai-regionlist-strategy <value>", "DLA label 7 (regionlist) handling. Requires --hybrid=hancom-ai. Values: table-first (default; check TSR overlap), list-only (skip TSR, always treat as list)");
339
+ program.option("--hybrid-hancom-ai-ocr-strategy <value>", "OCR strategy. Requires --hybrid=hancom-ai. Values: off (stream-only), auto (default; stream first, OCR fallback), force (OCR-only)");
340
+ program.option("--hybrid-hancom-ai-image-cache <value>", "Page image cache backing. Requires --hybrid=hancom-ai. Values: memory (default), disk");
291
341
  program.option("--to-stdout", "Write output to stdout instead of file (single format only)");
342
+ program.option("--threads <value>", "Number of worker threads for per-page processing. Default: 1 (sequential, stable). Values >1 (experimental) run pages in parallel for faster throughput; output may vary slightly on some PDFs. Capped at the number of available CPU cores. Applies to the native Java pipeline only; ignored in --hybrid mode");
292
343
  }
293
344
 
294
345
  // src/cli.ts
@@ -327,13 +378,7 @@ async function main() {
327
378
  const inputPaths = program.args;
328
379
  const convertOptions = buildConvertOptions(cliOptions);
329
380
  try {
330
- const output = await convert(inputPaths, convertOptions);
331
- if (output && !convertOptions.quiet) {
332
- process.stdout.write(output);
333
- if (!output.endsWith("\n")) {
334
- process.stdout.write("\n");
335
- }
336
- }
381
+ await _runForCli(inputPaths, convertOptions);
337
382
  return 0;
338
383
  } catch (err) {
339
384
  const message = err instanceof Error ? err.message : String(err);