@shuji-bonji/pdf-reader-mcp 0.2.3 → 0.4.0

Files changed (37)
  1. package/CHANGELOG.md +35 -0
  2. package/README.ja.md +45 -13
  3. package/README.md +45 -13
  4. package/dist/constants.d.ts +1 -1
  5. package/dist/constants.js +1 -1
  6. package/dist/schemas/tier1.d.ts +17 -0
  7. package/dist/schemas/tier1.d.ts.map +1 -1
  8. package/dist/schemas/tier1.js +22 -0
  9. package/dist/schemas/tier1.js.map +1 -1
  10. package/dist/schemas/tier2.d.ts +15 -0
  11. package/dist/schemas/tier2.d.ts.map +1 -1
  12. package/dist/schemas/tier2.js +8 -0
  13. package/dist/schemas/tier2.js.map +1 -1
  14. package/dist/services/pdfjs-service.d.ts +35 -3
  15. package/dist/services/pdfjs-service.d.ts.map +1 -1
  16. package/dist/services/pdfjs-service.js +257 -9
  17. package/dist/services/pdfjs-service.js.map +1 -1
  18. package/dist/tools/index.d.ts.map +1 -1
  19. package/dist/tools/index.js +2 -0
  20. package/dist/tools/index.js.map +1 -1
  21. package/dist/tools/tier1/read-text.d.ts.map +1 -1
  22. package/dist/tools/tier1/read-text.js +8 -3
  23. package/dist/tools/tier1/read-text.js.map +1 -1
  24. package/dist/tools/tier1/read-url.d.ts.map +1 -1
  25. package/dist/tools/tier1/read-url.js +7 -2
  26. package/dist/tools/tier1/read-url.js.map +1 -1
  27. package/dist/tools/tier2/extract-tables.d.ts +11 -0
  28. package/dist/tools/tier2/extract-tables.d.ts.map +1 -0
  29. package/dist/tools/tier2/extract-tables.js +66 -0
  30. package/dist/tools/tier2/extract-tables.js.map +1 -0
  31. package/dist/types.d.ts +41 -0
  32. package/dist/types.d.ts.map +1 -1
  33. package/dist/utils/formatter.d.ts +8 -1
  34. package/dist/utils/formatter.d.ts.map +1 -1
  35. package/dist/utils/formatter.js +52 -0
  36. package/dist/utils/formatter.js.map +1 -1
  37. package/package.json +1 -1
package/CHANGELOG.md CHANGED
@@ -5,6 +5,39 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [0.4.0] - 2026-05-07
9
+
10
+ ### Added
11
+
12
+ - **`split_columns` parameter on `read_text` and `read_url`** (Issue #3): opt-in column-aware reordering for **untagged** multi-column PDFs. When `split_columns: 2` (or `3`) is passed, text items are first bucketed by X-coordinate into N equal-width columns, then each bucket is Y-sorted independently and concatenated left-to-right with a blank-line separator. Designed for older 新旧対照表 PDFs and similar two-column documents that lack a structure tree (Tagged PDFs with proper `<Table>` markup should use `extract_tables` instead).
13
+ - `splitColumns: 1` (default / undefined) is unchanged — existing single-column Y-sort behaviour is preserved as a regression-tested baseline.
14
+ - `extractText` / `extractTextFromDoc` now accept an `ExtractTextOptions` object with `splitColumns?: number`. Internally, line-grouping logic is factored out into `itemsToText` so the column path reuses the same Y-sort.
15
+ - **Test fixture `tests/fixtures/two-column.pdf`**: 1-page A4 untagged PDF with `LEFT-1..4` and `RIGHT-1..4` placed at paired Y-coordinates so plain Y-sort interleaves them — the regression target for `split_columns`.
16
+ - **E2E tests** in `tests/e2e/02-tier1-text.test.ts`: 4 new cases (RT-SC-1..4) covering the failure mode without `split_columns`, the success mode with `split_columns: 2`, the `split_columns: 1` regression guard, and a sanity check that `split_columns: 2` on a single-column PDF preserves all content.
17
+
18
+ ### Changed
19
+
20
+ - **`read_text` / `read_url` tool descriptions**: documented the new `split_columns` parameter with guidance to prefer `extract_tables` for Tagged PDFs.
21
+
22
+ ## [0.3.0] - 2026-05-06
23
+
24
+ ### Added
25
+
26
+ - **`extract_tables` (Tier 2)**: New tool that walks a Tagged PDF's structure tree and emits every `<Table>` subtree as a Markdown table or a JSON document. Designed for documents whose meaning depends on multi-column layout — e.g. 国税庁 新旧対照表 (kaisei tsutatsu) PDFs where reading-order extraction merges 改正後 / 改正前 columns into ambiguous text. Internals:
27
+ - `extractTables(filePath, pages?)` / `extractTablesFromDoc(doc, pages?)` in `services/pdfjs-service.ts`. Walks the StructTree, identifies `<Table>` → `<THead> | <TBody> | <TFoot>` → `<TR>` → `<TH> | <TD>`, then resolves cell text by mapping each leaf node's marked-content `id` (e.g. `p715R_mc4`) to the corresponding `beginMarkedContentProps` boundary in `getTextContent({ includeMarkedContent: true })`.
28
+ - Cell text post-processing: collapses whitespace runs (incl. U+3000), folds per-character kerning runs ("消 費 税 法" → "消費税法") while preserving natural inter-word spacing ("事業者 法人番号"), escapes Markdown table delimiters.
29
+ - Untagged PDFs return `isTagged: false`, an empty `tables` array, and a `note` recommending column-aware extraction (planned in Issue #3) as the fallback.
30
+ - colspan / rowspan and nested tables are skipped in this initial release; cells appear in source order.
31
+ - **Types**: `TableCell`, `TableRow`, `ExtractedTable`, `TablesExtractionResult` in `types.ts`.
32
+ - **Schema**: `ExtractTablesSchema` in `schemas/tier2.ts` (file_path + pages + response_format).
33
+ - **Markdown formatter**: `formatTablesMarkdown` in `utils/formatter.ts` renders results as `# Extracted Tables` summary block followed by `## Page N — Table M` GFM tables.
34
+ - **E2E tests**: 5 new tests in `tests/e2e/04-tier2-structure.test.ts` covering untagged → note path, tagged-but-empty path, formatter shape, and pages filter.
35
+
36
+ ### Changed
37
+
38
+ - **Tool count**: 15 → 16 tools (Tier 2 now has 6).
39
+ - README / README.ja.md tool tables and architecture diagram updated accordingly.
40
+
8
41
  ## [0.2.3] - 2026-05-06
9
42
 
10
43
  ### Fixed
@@ -72,6 +105,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
72
105
  - Y-coordinate-based text extraction preserving natural reading order
73
106
  - Unit tests for core utilities and pdfjs-service
74
107
 
108
+ [0.4.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.3.0...v0.4.0
109
+ [0.3.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.3...v0.3.0
75
110
  [0.2.3]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.2...v0.2.3
76
111
  [0.2.2]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.1...v0.2.2
77
112
  [0.2.1]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.0...v0.2.1
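The column-aware reordering described in the 0.4.0 changelog entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the `TextItem` shape, the `yTolerance` value, the `pageWidth` argument, and the function `splitColumnsText` are hypothetical, and only the name `itemsToText` is borrowed from the changelog.

```typescript
// Hypothetical item shape: `x`/`y` are PDF user-space coordinates
// (origin at bottom-left, so a larger y is higher on the page).
interface TextItem {
  str: string;
  x: number;
  y: number;
}

// Group items into lines top-to-bottom, then order each line left-to-right.
// The 2-unit tolerance is an assumed value for this sketch.
function itemsToText(items: TextItem[], yTolerance = 2): string {
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last[0].y - item.y) <= yTolerance) {
      last.push(item);
    } else {
      lines.push([item]);
    }
  }
  return lines
    .map((line) => line.sort((a, b) => a.x - b.x).map((i) => i.str).join(" "))
    .join("\n");
}

// Bucket items into N equal-width columns by X, Y-sort each bucket
// independently, and join the buckets left-to-right with a blank line.
function splitColumnsText(items: TextItem[], pageWidth: number, columns: number): string {
  if (columns <= 1) return itemsToText(items);
  const columnWidth = pageWidth / columns;
  const buckets: TextItem[][] = Array.from({ length: columns }, () => []);
  for (const item of items) {
    const index = Math.min(columns - 1, Math.max(0, Math.floor(item.x / columnWidth)));
    buckets[index].push(item);
  }
  return buckets
    .filter((bucket) => bucket.length > 0) // assumption: empty columns are dropped
    .map((bucket) => itemsToText(bucket))
    .join("\n\n");
}
```

With items placed at paired Y-coordinates, as in the `two-column.pdf` fixture the changelog describes, plain `itemsToText` interleaves left and right on each line, while `splitColumnsText(items, pageWidth, 2)` emits the whole left column before the right one.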
package/README.ja.md CHANGED
@@ -12,7 +12,7 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
12
12
 
13
13
  ## 機能
14
14
 
15
- **15 ツール** を 3 層構成で提供します。
15
+ **16 ツール** を 3 層構成で提供します。
16
16
 
17
17
  ### Tier 1: 基本機能
18
18
 
@@ -20,7 +20,7 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
20
20
  | ---------------- | ----------------------------------------------------- |
21
21
  | `get_page_count` | ページ数の軽量取得 |
22
22
  | `get_metadata` | メタデータ抽出(タイトル、著者、PDF版、タグ有無等) |
23
- | `read_text` | テキスト抽出(Y座標ベースの読み順保持) |
23
+ | `read_text` | テキスト抽出(Y座標ベースの読み順保持。`split_columns: 2 \| 3` で **タグなし** 多カラム PDF にも対応) |
24
24
  | `search_text` | 全文検索(前後コンテキスト付き) |
25
25
  | `read_images` | 画像抽出(base64、メタデータ付き) |
26
26
  | `read_url` | URLからリモートPDFを取得して処理 |
@@ -28,13 +28,14 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
28
28
 
29
29
  ### Tier 2: 構造解析
30
30
 
31
- | ツール | 説明 |
32
- | --------------------- | -------------------------------------------- |
33
- | `inspect_structure` | オブジェクトツリー・カタログ辞書の解析 |
34
- | `inspect_tags` | Tagged PDF のタグツリー可視化 |
35
- | `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定) |
36
- | `inspect_annotations` | 注釈一覧(タイプ別分類) |
37
- | `inspect_signatures` | 電子署名フィールドの構造解析 |
31
+ | ツール | 説明 |
32
+ | --------------------- | -------------------------------------------------------- |
33
+ | `inspect_structure` | オブジェクトツリー・カタログ辞書の解析 |
34
+ | `inspect_tags` | Tagged PDF のタグツリー可視化 |
35
+ | `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定) |
36
+ | `inspect_annotations` | 注釈一覧(タイプ別分類) |
37
+ | `inspect_signatures` | 電子署名フィールドの構造解析 |
38
+ | `extract_tables` | Tagged PDF の `<Table>` を Markdown テーブルとして抽出 |
38
39
 
39
40
  ### Tier 3: 検証・分析
40
41
 
@@ -145,12 +146,43 @@ compare_structure({
145
146
  | Tagged | true | true | ✅ |
146
147
  ```
147
148
 
149
+ ### Tagged PDF からテーブル抽出
150
+
151
+ ```
152
+ extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
153
+ → # Extracted Tables
154
+ - **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
155
+
156
+ ## Page 1 — Table 1
157
+
158
+ | 改正後 | 改正前 |
159
+ | --- | --- |
160
+ | …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
161
+ ```
162
+
163
+ タグ無し PDF では空結果と note を返し、下記の column-aware 抽出への
164
+ フォールバックを推奨します。
165
+
166
+ ### タグなし多カラム PDF をカラム単位で読む
167
+
168
+ ```
169
+ read_text({ file_path: "/path/to/older-shinkyu.pdf", split_columns: 2 })
170
+ → // 通常の Y ソートだとカラムが交互連結:
171
+ // "改正後セル1 改正前セル1\n 改正後セル2 改正前セル2..."
172
+ //
173
+ // split_columns: 2 では左カラム → 空行 → 右カラムの順:
174
+ // "改正後セル1\n改正後セル2\n…\n\n改正前セル1\n改正前セル2\n…"
175
+ ```
176
+
177
+ **タグなし** 多カラム PDF (古い 新旧対照表など) で `split_columns: 2 \| 3` を指定。
178
+ タグ付き PDF (`<Table>` 構造あり) では `extract_tables` の方が表構造を保持できます。
179
+
148
180
  ## 技術スタック
149
181
 
150
182
  - **TypeScript** + MCP TypeScript SDK
151
183
  - **pdfjs-dist** (Mozilla) — テキスト/画像抽出、タグツリー、注釈
152
184
  - **pdf-lib** — 低レベルオブジェクト構造解析
153
- - **Vitest** — Unit + E2E テスト(159 tests)
185
+ - **Vitest** — Unit + E2E テスト(168 tests)
154
186
  - **Biome** — lint + format
155
187
  - **Zod** — 入力バリデーション
156
188
 
@@ -158,7 +190,7 @@ compare_structure({
158
190
 
159
191
  ```bash
160
192
  npm test # 全テスト実行(Unit: 39 tests)
161
- npm run test:e2e # E2E のみ(120 tests)
193
+ npm run test:e2e # E2E のみ(129 tests)
162
194
  npm run test:watch # ウォッチモード
163
195
  ```
164
196
 
@@ -172,7 +204,7 @@ pdf-reader-mcp/
172
204
  │ ├── types.ts # 型定義
173
205
  │ ├── tools/
174
206
  │ │ ├── tier1/ # 基本ツール (7)
175
- │ │ ├── tier2/ # 構造解析 (5)
207
+ │ │ ├── tier2/ # 構造解析 (6)
176
208
  │ │ ├── tier3/ # 検証・分析 (3)
177
209
  │ │ └── index.ts # ツール登録
178
210
  │ ├── services/
@@ -188,7 +220,7 @@ pdf-reader-mcp/
188
220
  │ └── error-handler.ts # エラーハンドリング
189
221
  └── tests/
190
222
  ├── tier1/ # Unit tests
191
- └── e2e/ # E2E tests (9 suites, 120 tests)
223
+ └── e2e/ # E2E tests (9 suites, 129 tests)
192
224
  ```
193
225
 
194
226
  ## pdf-spec-mcp との連携
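The cell text post-processing that the 0.3.0 changelog entry lists (collapsing whitespace runs including U+3000, folding per-character kerning runs, escaping Markdown table delimiters) might look roughly like the sketch below. The function name `normalizeCellText`, the CJK character range, and the minimum run length of 3 are assumptions made for illustration; the package's actual heuristic is not shown in this diff.

```typescript
// Assumed heuristic: a "kerning run" is three or more consecutive
// single-character kana/CJK tokens separated by single spaces.
const CJK_CHAR = /^[\u3040-\u30ff\u3400-\u9fff]$/;
const MIN_KERNING_RUN = 3;

function normalizeCellText(raw: string): string {
  // 1. Collapse whitespace runs (including ideographic space U+3000).
  const tokens = raw.replace(/[\s\u3000]+/g, " ").trim().split(" ");

  // 2. Fold runs of single CJK characters; leave ordinary words spaced.
  const out: string[] = [];
  let run: string[] = [];
  const flush = (): void => {
    if (run.length >= MIN_KERNING_RUN) out.push(run.join(""));
    else out.push(...run);
    run = [];
  };
  for (const token of tokens) {
    if (CJK_CHAR.test(token)) {
      run.push(token);
    } else {
      flush();
      out.push(token);
    }
  }
  flush();

  // 3. Escape the Markdown table delimiter so it cannot split a cell.
  return out.join(" ").replace(/\|/g, "\\|");
}
```

Under these assumptions, a kerned run such as "消 費 税 法" folds into one word, while multi-character words such as "事業者 法人番号" keep their single separating space.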
package/README.md CHANGED
@@ -12,7 +12,7 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
12
12
 
13
13
  ## Features
14
14
 
15
- **15 tools** organized into three tiers:
15
+ **16 tools** organized into three tiers:
16
16
 
17
17
  ### Tier 1: Basic Operations
18
18
 
@@ -20,7 +20,7 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
20
20
  | ---------------- | -------------------------------------------------------- |
21
21
  | `get_page_count` | Lightweight page count retrieval |
22
22
  | `get_metadata` | Full metadata extraction (title, author, PDF version...) |
23
- | `read_text` | Text extraction with Y-coordinate reading order |
23
+ | `read_text` | Text extraction with Y-coordinate reading order (opt-in `split_columns: 2 \| 3` for untagged multi-column PDFs) |
24
24
  | `search_text` | Full-text search with surrounding context |
25
25
  | `read_images` | Image extraction as base64 with metadata |
26
26
  | `read_url` | Fetch and process remote PDFs from URLs |
@@ -28,13 +28,14 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
28
28
 
29
29
  ### Tier 2: Structure Inspection
30
30
 
31
- | Tool | Description |
32
- | --------------------- | ------------------------------------------------ |
33
- | `inspect_structure` | Object tree and catalog dictionary analysis |
34
- | `inspect_tags` | Tagged PDF structure tree visualization |
35
- | `inspect_fonts` | Font inventory (embedded/subset/type detection) |
36
- | `inspect_annotations` | Annotation listing (categorized by subtype) |
37
- | `inspect_signatures` | Digital signature field structure analysis |
31
+ | Tool | Description |
32
+ | --------------------- | ------------------------------------------------------------------- |
33
+ | `inspect_structure` | Object tree and catalog dictionary analysis |
34
+ | `inspect_tags` | Tagged PDF structure tree visualization |
35
+ | `inspect_fonts` | Font inventory (embedded/subset/type detection) |
36
+ | `inspect_annotations` | Annotation listing (categorized by subtype) |
37
+ | `inspect_signatures` | Digital signature field structure analysis |
38
+ | `extract_tables` | Tagged PDF `<Table>` subtree → Markdown table (preserves columns) |
38
39
 
39
40
  ### Tier 3: Validation & Analysis
40
41
 
@@ -145,12 +146,43 @@ compare_structure({
145
146
  | Tagged | true | true | ✅ |
146
147
  ```
147
148
 
149
+ ### Extract Tables (Tagged PDF)
150
+
151
+ ```
152
+ extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
153
+ → # Extracted Tables
154
+ - **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
155
+
156
+ ## Page 1 — Table 1
157
+
158
+ | 改正後 | 改正前 |
159
+ | --- | --- |
160
+ | …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
161
+ ```
162
+
163
+ Untagged PDFs return an empty result with a `note` recommending the
164
+ column-aware fallback below.
165
+
166
+ ### Read Untagged Multi-Column PDF
167
+
168
+ ```
169
+ read_text({ file_path: "/path/to/older-shinkyu.pdf", split_columns: 2 })
170
+ → // Plain Y-sort would interleave columns:
171
+ // "改正後セル1 改正前セル1\n 改正後セル2 改正前セル2..."
172
+ //
173
+ // With split_columns: 2 the left column is emitted first, then the right:
174
+ // "改正後セル1\n改正後セル2\n…\n\n改正前セル1\n改正前セル2\n…"
175
+ ```
176
+
177
+ Use `split_columns: 2 | 3` for **untagged** multi-column PDFs. For Tagged
178
+ PDFs with proper `<Table>` markup, `extract_tables` (above) is preferred.
179
+
148
180
  ## Tech Stack
149
181
 
150
182
  - **TypeScript** + MCP TypeScript SDK
151
183
  - **pdfjs-dist** (Mozilla) — text/image extraction, tag tree, annotations
152
184
  - **pdf-lib** — low-level object structure analysis
153
- - **Vitest** — unit + E2E testing (159 tests)
185
+ - **Vitest** — unit + E2E testing (168 tests)
154
186
  - **Biome** — linting + formatting
155
187
  - **Zod** — input validation
156
188
 
@@ -158,7 +190,7 @@ compare_structure({
158
190
 
159
191
  ```bash
160
192
  npm test # Run all tests (unit: 39 tests)
161
- npm run test:e2e # E2E tests only (120 tests)
193
+ npm run test:e2e # E2E tests only (129 tests)
162
194
  npm run test:watch # Watch mode
163
195
  ```
164
196
 
@@ -172,7 +204,7 @@ pdf-reader-mcp/
172
204
  │ ├── types.ts # Type definitions
173
205
  │ ├── tools/
174
206
  │ │ ├── tier1/ # Basic tools (7)
175
- │ │ ├── tier2/ # Structure inspection (5)
207
+ │ │ ├── tier2/ # Structure inspection (6)
176
208
  │ │ ├── tier3/ # Validation & analysis (3)
177
209
  │ │ └── index.ts # Tool registration
178
210
  │ ├── services/
@@ -188,7 +220,7 @@ pdf-reader-mcp/
188
220
  │ └── error-handler.ts # Error handling
189
221
  └── tests/
190
222
  ├── tier1/ # Unit tests
191
- └── e2e/ # E2E tests (9 suites, 120 tests)
223
+ └── e2e/ # E2E tests (9 suites, 129 tests)
192
224
  ```
193
225
 
194
226
  ## Pairing with pdf-spec-mcp
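The Markdown shape shown in the README's `extract_tables` example (a `# Extracted Tables` summary followed by one GFM table per `## Page N` heading) could be produced by a formatter along these lines. This is a sketch only: `SketchTable`, `SketchResult`, and `formatTablesSketch` are hypothetical names, the real `ExtractedTable` and `TablesExtractionResult` types in `types.ts` may have different fields, and per-page table numbering is an assumption.

```typescript
// Hypothetical shapes for illustration; not the package's real types.
interface SketchTable {
  page: number;
  rows: string[][]; // first row is treated as the header row
}

interface SketchResult {
  isTagged: boolean;
  pagesScanned: number;
  tables: SketchTable[];
  note?: string;
}

function renderRow(cells: string[]): string {
  return `| ${cells.join(" | ")} |`;
}

function formatTablesSketch(result: SketchResult): string {
  const out: string[] = [
    "# Extracted Tables",
    `- **Tagged**: ${result.isTagged ? "Yes" : "No"} / **Pages Scanned**: ${result.pagesScanned} / **Tables Found**: ${result.tables.length}`,
  ];
  // Untagged PDFs carry a note instead of tables.
  if (result.note) out.push("", result.note);

  // Assumption: tables are numbered per page, starting at 1.
  const perPage = new Map<number, number>();
  for (const table of result.tables) {
    const n = (perPage.get(table.page) ?? 0) + 1;
    perPage.set(table.page, n);
    const [header = [], ...body] = table.rows;
    out.push("", `## Page ${table.page} \u2014 Table ${n}`, "");
    out.push(renderRow(header));
    out.push(renderRow(header.map(() => "---")));
    for (const row of body) out.push(renderRow(row));
  }
  return out.join("\n");
}
```

Cells are assumed to be already escaped before they reach the formatter, so `renderRow` only joins them.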
package/dist/constants.d.ts CHANGED
@@ -13,7 +13,7 @@ export declare const MAX_SEARCH_RESULTS = 100;
13
13
  export declare const DEFAULT_SEARCH_CONTEXT = 80;
14
14
  /** Server info */
15
15
  export declare const SERVER_NAME = "pdf-reader-mcp";
16
- export declare const SERVER_VERSION = "0.2.3";
16
+ export declare const SERVER_VERSION = "0.4.0";
17
17
  /** Response format enum */
18
18
  export declare enum ResponseFormat {
19
19
  MARKDOWN = "markdown",
package/dist/constants.js CHANGED
@@ -13,7 +13,7 @@ export const MAX_SEARCH_RESULTS = 100;
13
13
  export const DEFAULT_SEARCH_CONTEXT = 80;
14
14
  /** Server info */
15
15
  export const SERVER_NAME = 'pdf-reader-mcp';
16
- export const SERVER_VERSION = '0.2.3';
16
+ export const SERVER_VERSION = '0.4.0';
17
17
  /** Response format enum */
18
18
  export var ResponseFormat;
19
19
  (function (ResponseFormat) {
package/dist/schemas/tier1.d.ts CHANGED
@@ -21,19 +21,33 @@ export declare const GetMetadataSchema: z.ZodObject<{
21
21
  file_path: string;
22
22
  response_format?: import("../constants.js").ResponseFormat | undefined;
23
23
  }>;
24
+ /**
25
+ * `split_columns` — Issue #3: column-aware extraction for untagged
26
+ * multi-column PDFs.
27
+ *
28
+ * Acts as an opt-in override for the default reading-order strategy. When
29
+ * `>= 2`, items are bucketed by X-coordinate into N equal-width columns and
30
+ * the buckets are concatenated left-to-right. `1` (default) preserves the
31
+ * existing Y-sort behaviour. Tagged PDFs with proper `<Table>` markup should
32
+ * use `extract_tables` instead — `split_columns` is for untagged cases.
33
+ */
34
+ export declare const SplitColumnsSchema: z.ZodOptional<z.ZodNumber>;
24
35
  /** read_text */
25
36
  export declare const ReadTextSchema: z.ZodObject<{
26
37
  file_path: z.ZodString;
27
38
  pages: z.ZodOptional<z.ZodString>;
28
39
  response_format: z.ZodDefault<z.ZodNativeEnum<typeof import("../constants.js").ResponseFormat>>;
40
+ split_columns: z.ZodOptional<z.ZodNumber>;
29
41
  }, "strict", z.ZodTypeAny, {
30
42
  file_path: string;
31
43
  response_format: import("../constants.js").ResponseFormat;
32
44
  pages?: string | undefined;
45
+ split_columns?: number | undefined;
33
46
  }, {
34
47
  file_path: string;
35
48
  response_format?: import("../constants.js").ResponseFormat | undefined;
36
49
  pages?: string | undefined;
50
+ split_columns?: number | undefined;
37
51
  }>;
38
52
  /** search_text */
39
53
  export declare const SearchTextSchema: z.ZodObject<{
@@ -74,14 +88,17 @@ export declare const ReadUrlSchema: z.ZodObject<{
74
88
  url: z.ZodString;
75
89
  pages: z.ZodOptional<z.ZodString>;
76
90
  response_format: z.ZodDefault<z.ZodNativeEnum<typeof import("../constants.js").ResponseFormat>>;
91
+ split_columns: z.ZodOptional<z.ZodNumber>;
77
92
  }, "strict", z.ZodTypeAny, {
78
93
  response_format: import("../constants.js").ResponseFormat;
79
94
  url: string;
80
95
  pages?: string | undefined;
96
+ split_columns?: number | undefined;
81
97
  }, {
82
98
  url: string;
83
99
  response_format?: import("../constants.js").ResponseFormat | undefined;
84
100
  pages?: string | undefined;
101
+ split_columns?: number | undefined;
85
102
  }>;
86
103
  /** summarize */
87
104
  export declare const SummarizeSchema: z.ZodObject<{
package/dist/schemas/tier1.d.ts.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"file":"tier1.d.ts","sourceRoot":"","sources":["../../src/schemas/tier1.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAIxB,qBAAqB;AACrB,eAAO,MAAM,kBAAkB;;;;;;EAIpB,CAAC;AAEZ,mBAAmB;AACnB,eAAO,MAAM,iBAAiB;;;;;;;;;EAKnB,CAAC;AAEZ,gBAAgB;AAChB,eAAO,MAAM,cAAc;;;;;;;;;;;;EAMhB,CAAC;AAEZ,kBAAkB;AAClB,eAAO,MAAM,gBAAgB;;;;;;;;;;;;;;;;;;;;;EAyBlB,CAAC;AAEZ,kBAAkB;AAClB,eAAO,MAAM,gBAAgB;;;;;;;;;EAKlB,CAAC;AAEZ,eAAe;AACf,eAAO,MAAM,aAAa;;;;;;;;;;;;EAMf,CAAC;AAEZ,gBAAgB;AAChB,eAAO,MAAM,eAAe;;;;;;;;;EAKjB,CAAC;AAGZ,MAAM,MAAM,iBAAiB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,kBAAkB,CAAC,CAAC;AACnE,MAAM,MAAM,gBAAgB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,iBAAiB,CAAC,CAAC;AACjE,MAAM,MAAM,aAAa,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,cAAc,CAAC,CAAC;AAC3D,MAAM,MAAM,eAAe,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,gBAAgB,CAAC,CAAC;AAC/D,MAAM,MAAM,eAAe,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,gBAAgB,CAAC,CAAC;AAC/D,MAAM,MAAM,YAAY,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,aAAa,CAAC,CAAC;AACzD,MAAM,MAAM,cAAc,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,eAAe,CAAC,CAAC"}
1
+ {"version":3,"file":"tier1.d.ts","sourceRoot":"","sources":["../../src/schemas/tier1.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAIxB,qBAAqB;AACrB,eAAO,MAAM,kBAAkB;;;;;;EAIpB,CAAC;AAEZ,mBAAmB;AACnB,eAAO,MAAM,iBAAiB;;;;;;;;;EAKnB,CAAC;AAEZ;;;;;;;;;GASG;AACH,eAAO,MAAM,kBAAkB,4BAW5B,CAAC;AAEJ,gBAAgB;AAChB,eAAO,MAAM,cAAc;;;;;;;;;;;;;;;EAOhB,CAAC;AAEZ,kBAAkB;AAClB,eAAO,MAAM,gBAAgB;;;;;;;;;;;;;;;;;;;;;EAyBlB,CAAC;AAEZ,kBAAkB;AAClB,eAAO,MAAM,gBAAgB;;;;;;;;;EAKlB,CAAC;AAEZ,eAAe;AACf,eAAO,MAAM,aAAa;;;;;;;;;;;;;;;EAOf,CAAC;AAEZ,gBAAgB;AAChB,eAAO,MAAM,eAAe;;;;;;;;;EAKjB,CAAC;AAGZ,MAAM,MAAM,iBAAiB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,kBAAkB,CAAC,CAAC;AACnE,MAAM,MAAM,gBAAgB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,iBAAiB,CAAC,CAAC;AACjE,MAAM,MAAM,aAAa,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,cAAc,CAAC,CAAC;AAC3D,MAAM,MAAM,eAAe,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,gBAAgB,CAAC,CAAC;AAC/D,MAAM,MAAM,eAAe,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,gBAAgB,CAAC,CAAC;AAC/D,MAAM,MAAM,YAAY,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,aAAa,CAAC,CAAC;AACzD,MAAM,MAAM,cAAc,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,eAAe,CAAC,CAAC"}
package/dist/schemas/tier1.js CHANGED
@@ -17,12 +17,33 @@ export const GetMetadataSchema = z
17
17
  response_format: ResponseFormatSchema,
18
18
  })
19
19
  .strict();
20
+ /**
21
+ * `split_columns` — Issue #3: column-aware extraction for untagged
22
+ * multi-column PDFs.
23
+ *
24
+ * Acts as an opt-in override for the default reading-order strategy. When
25
+ * `>= 2`, items are bucketed by X-coordinate into N equal-width columns and
26
+ * the buckets are concatenated left-to-right. `1` (default) preserves the
27
+ * existing Y-sort behaviour. Tagged PDFs with proper `<Table>` markup should
28
+ * use `extract_tables` instead — `split_columns` is for untagged cases.
29
+ */
30
+ export const SplitColumnsSchema = z
31
+ .number()
32
+ .int()
33
+ .min(1)
34
+ .max(3)
35
+ .optional()
36
+ .describe('Number of columns to use when reordering text. 1 (default) = existing Y-sort. ' +
37
+ '2 or 3 = bucket by X-coordinate left-to-right. Use for untagged 新旧対照表 / ' +
38
+ 'two-column PDFs where Y-sort would interleave columns. Tagged PDFs with proper ' +
39
+ '<Table> markup should use extract_tables instead.');
20
40
  /** read_text */
21
41
  export const ReadTextSchema = z
22
42
  .object({
23
43
  file_path: FilePathSchema,
24
44
  pages: PagesSchema,
25
45
  response_format: ResponseFormatSchema,
46
+ split_columns: SplitColumnsSchema,
26
47
  })
27
48
  .strict();
28
49
  /** search_text */
@@ -65,6 +86,7 @@ export const ReadUrlSchema = z
65
86
  url: UrlSchema,
66
87
  pages: PagesSchema,
67
88
  response_format: ResponseFormatSchema,
89
+ split_columns: SplitColumnsSchema,
68
90
  })
69
91
  .strict();
70
92
  /** summarize */
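To make the accepted range of `split_columns` concrete, here is a dependency-free check that mirrors the constraints declared on `SplitColumnsSchema` above (integer, minimum 1, maximum 3, optional). It is a hand-written stand-in, not Zod and not the package's code; `checkSplitColumns`, its result type, and the error messages are hypothetical.

```typescript
type SplitColumnsResult =
  | { ok: true; value: number | undefined }
  | { ok: false; error: string };

// Mirrors number().int().min(1).max(3).optional() without importing Zod.
function checkSplitColumns(input: unknown): SplitColumnsResult {
  // Omitted parameter: the caller falls back to the default Y-sort.
  if (input === undefined) return { ok: true, value: undefined };
  if (typeof input !== "number" || !Number.isInteger(input)) {
    return { ok: false, error: "split_columns must be an integer" };
  }
  if (input < 1 || input > 3) {
    return { ok: false, error: "split_columns must be between 1 and 3" };
  }
  return { ok: true, value: input };
}
```

Under this reading, `1`, `2`, `3`, and an omitted value pass, while `0`, `4`, `1.5`, and the string `"2"` are rejected before any PDF is opened.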
package/dist/schemas/tier1.js.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"file":"tier1.js","sourceRoot":"","sources":["../../src/schemas/tier1.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AACxB,OAAO,EAAE,sBAAsB,EAAE,kBAAkB,EAAE,MAAM,iBAAiB,CAAC;AAC7E,OAAO,EAAE,cAAc,EAAE,WAAW,EAAE,oBAAoB,EAAE,SAAS,EAAE,MAAM,aAAa,CAAC;AAE3F,qBAAqB;AACrB,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;CAC1B,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mBAAmB;AACnB,MAAM,CAAC,MAAM,iBAAiB,GAAG,CAAC;KAC/B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,gBAAgB;AAChB,MAAM,CAAC,MAAM,cAAc,GAAG,CAAC;KAC5B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,kBAAkB;AAClB,MAAM,CAAC,MAAM,gBAAgB,GAAG,CAAC;KAC9B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,CAAC;SACL,MAAM,EAAE;SACR,GAAG,CAAC,CAAC,EAAE,0BAA0B,CAAC;SAClC,GAAG,CAAC,GAAG,EAAE,sCAAsC,CAAC;SAChD,QAAQ,CAAC,uCAAuC,CAAC;IACpD,KAAK,EAAE,WAAW;IAClB,aAAa,EAAE,CAAC;SACb,MAAM,EAAE;SACR,GAAG,EAAE;SACL,GAAG,CAAC,CAAC,CAAC;SACN,GAAG,CAAC,GAAG,CAAC;SACR,OAAO,CAAC,sBAAsB,CAAC;SAC/B,QAAQ,CAAC,0DAA0D,CAAC;IACvE,WAAW,EAAE,CAAC;SACX,MAAM,EAAE;SACR,GAAG,EAAE;SACL,GAAG,CAAC,CAAC,CAAC;SACN,GAAG,CAAC,kBAAkB,CAAC;SACvB,OAAO,CAAC,EAAE,CAAC;SACX,QAAQ,CAAC,qCAAqC,CAAC;IAClD,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,kBAAkB;AAClB,MAAM,CAAC,MAAM,gBAAgB,GAAG,CAAC;KAC9B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;CACnB,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,eAAe;AACf,MAAM,CAAC,MAAM,aAAa,GAAG,CAAC;KAC3B,MAAM,CAAC;IACN,GAAG,EAAE,SAAS;IACd,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,gBAAgB;AAChB,MAAM,CAAC,MAAM,eAAe,GAAG,CAAC;KAC7B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC"}
1
+ {"version":3,"file":"tier1.js","sourceRoot":"","sources":["../../src/schemas/tier1.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AACxB,OAAO,EAAE,sBAAsB,EAAE,kBAAkB,EAAE,MAAM,iBAAiB,CAAC;AAC7E,OAAO,EAAE,cAAc,EAAE,WAAW,EAAE,oBAAoB,EAAE,SAAS,EAAE,MAAM,aAAa,CAAC;AAE3F,qBAAqB;AACrB,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;CAC1B,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mBAAmB;AACnB,MAAM,CAAC,MAAM,iBAAiB,GAAG,CAAC;KAC/B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ;;;;;;;;;GASG;AACH,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,EAAE;KACR,GAAG,EAAE;KACL,GAAG,CAAC,CAAC,CAAC;KACN,GAAG,CAAC,CAAC,CAAC;KACN,QAAQ,EAAE;KACV,QAAQ,CACP,gFAAgF;IAC9E,0EAA0E;IAC1E,iFAAiF;IACjF,mDAAmD,CACtD,CAAC;AAEJ,gBAAgB;AAChB,MAAM,CAAC,MAAM,cAAc,GAAG,CAAC;KAC5B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;IACrC,aAAa,EAAE,kBAAkB;CAClC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,kBAAkB;AAClB,MAAM,CAAC,MAAM,gBAAgB,GAAG,CAAC;KAC9B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,CAAC;SACL,MAAM,EAAE;SACR,GAAG,CAAC,CAAC,EAAE,0BAA0B,CAAC;SAClC,GAAG,CAAC,GAAG,EAAE,sCAAsC,CAAC;SAChD,QAAQ,CAAC,uCAAuC,CAAC;IACpD,KAAK,EAAE,WAAW;IAClB,aAAa,EAAE,CAAC;SACb,MAAM,EAAE;SACR,GAAG,EAAE;SACL,GAAG,CAAC,CAAC,CAAC;SACN,GAAG,CAAC,GAAG,CAAC;SACR,OAAO,CAAC,sBAAsB,CAAC;SAC/B,QAAQ,CAAC,0DAA0D,CAAC;IACvE,WAAW,EAAE,CAAC;SACX,MAAM,EAAE;SACR,GAAG,EAAE;SACL,GAAG,CAAC,CAAC,CAAC;SACN,GAAG,CAAC,kBAAkB,CAAC;SACvB,OAAO,CAAC,EAAE,CAAC;SACX,QAAQ,CAAC,qCAAqC,CAAC;IAClD,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,kBAAkB;AAClB,MAAM,CAAC,MAAM,gBAAgB,GAAG,CAAC;KAC9B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;CACnB,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,eAAe;AACf,MAAM,CAAC,MAAM,aAAa,GAAG,CAAC;KAC3B,MAAM,CAAC;IACN,GAAG,EAAE,SAAS;IACd,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;IACrC,aAAa,EAAE,kBAAkB;CAClC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,gBAAgB;AAChB,MAAM,CAAC,MAAM,eAAe,GAAG,CAAC;KAC7B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CA
AC;KACD,MAAM,EAAE,CAAC"}
package/dist/schemas/tier2.d.ts CHANGED
@@ -60,9 +60,24 @@ export declare const InspectSignaturesSchema: z.ZodObject<{
60
60
  file_path: string;
61
61
  response_format?: import("../constants.js").ResponseFormat | undefined;
62
62
  }>;
63
+ /** extract_tables — Tagged PDF Table → Markdown */
64
+ export declare const ExtractTablesSchema: z.ZodObject<{
65
+ file_path: z.ZodString;
66
+ pages: z.ZodOptional<z.ZodString>;
67
+ response_format: z.ZodDefault<z.ZodNativeEnum<typeof import("../constants.js").ResponseFormat>>;
68
+ }, "strict", z.ZodTypeAny, {
69
+ file_path: string;
70
+ response_format: import("../constants.js").ResponseFormat;
71
+ pages?: string | undefined;
72
+ }, {
73
+ file_path: string;
74
+ response_format?: import("../constants.js").ResponseFormat | undefined;
75
+ pages?: string | undefined;
76
+ }>;
63
77
  export type InspectStructureInput = z.infer<typeof InspectStructureSchema>;
64
78
  export type InspectTagsInput = z.infer<typeof InspectTagsSchema>;
65
79
  export type InspectFontsInput = z.infer<typeof InspectFontsSchema>;
66
80
  export type InspectAnnotationsInput = z.infer<typeof InspectAnnotationsSchema>;
67
81
  export type InspectSignaturesInput = z.infer<typeof InspectSignaturesSchema>;
82
+ export type ExtractTablesInput = z.infer<typeof ExtractTablesSchema>;
68
83
  //# sourceMappingURL=tier2.d.ts.map
package/dist/schemas/tier2.d.ts.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"file":"tier2.d.ts","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAGxB,wBAAwB;AACxB,eAAO,MAAM,sBAAsB;;;;;;;;;EAKxB,CAAC;AAEZ,mBAAmB;AACnB,eAAO,MAAM,iBAAiB;;;;;;;;;EAKnB,CAAC;AAEZ,oBAAoB;AACpB,eAAO,MAAM,kBAAkB;;;;;;;;;EAKpB,CAAC;AAEZ,0BAA0B;AAC1B,eAAO,MAAM,wBAAwB;;;;;;;;;;;;EAM1B,CAAC;AAEZ,yBAAyB;AACzB,eAAO,MAAM,uBAAuB;;;;;;;;;EAKzB,CAAC;AAGZ,MAAM,MAAM,qBAAqB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,sBAAsB,CAAC,CAAC;AAC3E,MAAM,MAAM,gBAAgB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,iBAAiB,CAAC,CAAC;AACjE,MAAM,MAAM,iBAAiB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,kBAAkB,CAAC,CAAC;AACnE,MAAM,MAAM,uBAAuB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,wBAAwB,CAAC,CAAC;AAC/E,MAAM,MAAM,sBAAsB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,uBAAuB,CAAC,CAAC"}
1
+ {"version":3,"file":"tier2.d.ts","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAGxB,wBAAwB;AACxB,eAAO,MAAM,sBAAsB;;;;;;;;;EAKxB,CAAC;AAEZ,mBAAmB;AACnB,eAAO,MAAM,iBAAiB;;;;;;;;;EAKnB,CAAC;AAEZ,oBAAoB;AACpB,eAAO,MAAM,kBAAkB;;;;;;;;;EAKpB,CAAC;AAEZ,0BAA0B;AAC1B,eAAO,MAAM,wBAAwB;;;;;;;;;;;;EAM1B,CAAC;AAEZ,yBAAyB;AACzB,eAAO,MAAM,uBAAuB;;;;;;;;;EAKzB,CAAC;AAEZ,mDAAmD;AACnD,eAAO,MAAM,mBAAmB;;;;;;;;;;;;EAMrB,CAAC;AAGZ,MAAM,MAAM,qBAAqB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,sBAAsB,CAAC,CAAC;AAC3E,MAAM,MAAM,gBAAgB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,iBAAiB,CAAC,CAAC;AACjE,MAAM,MAAM,iBAAiB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,kBAAkB,CAAC,CAAC;AACnE,MAAM,MAAM,uBAAuB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,wBAAwB,CAAC,CAAC;AAC/E,MAAM,MAAM,sBAAsB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,uBAAuB,CAAC,CAAC;AAC7E,MAAM,MAAM,kBAAkB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,mBAAmB,CAAC,CAAC"}
package/dist/schemas/tier2.js CHANGED
@@ -39,4 +39,12 @@ export const InspectSignaturesSchema = z
39
39
  response_format: ResponseFormatSchema,
40
40
  })
41
41
  .strict();
42
+ /** extract_tables — Tagged PDF Table → Markdown */
43
+ export const ExtractTablesSchema = z
44
+ .object({
45
+ file_path: FilePathSchema,
46
+ pages: PagesSchema,
47
+ response_format: ResponseFormatSchema,
48
+ })
49
+ .strict();
42
50
  //# sourceMappingURL=tier2.js.map
package/dist/schemas/tier2.js.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"file":"tier2.js","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AACxB,OAAO,EAAE,cAAc,EAAE,WAAW,EAAE,oBAAoB,EAAE,MAAM,aAAa,CAAC;AAEhF,wBAAwB;AACxB,MAAM,CAAC,MAAM,sBAAsB,GAAG,CAAC;KACpC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mBAAmB;AACnB,MAAM,CAAC,MAAM,iBAAiB,GAAG,CAAC;KAC/B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,oBAAoB;AACpB,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,0BAA0B;AAC1B,MAAM,CAAC,MAAM,wBAAwB,GAAG,CAAC;KACtC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,yBAAyB;AACzB,MAAM,CAAC,MAAM,uBAAuB,GAAG,CAAC;KACrC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC"}
1
+ {"version":3,"file":"tier2.js","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AACxB,OAAO,EAAE,cAAc,EAAE,WAAW,EAAE,oBAAoB,EAAE,MAAM,aAAa,CAAC;AAEhF,wBAAwB;AACxB,MAAM,CAAC,MAAM,sBAAsB,GAAG,CAAC;KACpC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mBAAmB;AACnB,MAAM,CAAC,MAAM,iBAAiB,GAAG,CAAC;KAC/B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,oBAAoB;AACpB,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,0BAA0B;AAC1B,MAAM,CAAC,MAAM,wBAAwB,GAAG,CAAC;KACtC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,yBAAyB;AACzB,MAAM,CAAC,MAAM,uBAAuB,GAAG,CAAC;KACrC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mDAAmD;AACnD,MAAM,CAAC,MAAM,mBAAmB,GAAG,CAAC;KACjC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC"}
package/dist/services/pdfjs-service.d.ts CHANGED
@@ -4,7 +4,7 @@
  * Centralizes all pdfjs-dist interactions for reuse across tools.
  */
  import { type PDFDocumentProxy } from 'pdfjs-dist/legacy/build/pdf.mjs';
- import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TagsAnalysis } from '../types.js';
+ import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TablesExtractionResult, TagsAnalysis } from '../types.js';
  /**
  * Load a PDF document from a file path.
  */
@@ -22,15 +22,26 @@ export declare function getMetadata(filePath: string): Promise<PdfMetadata>;
  * Does NOT destroy the document — caller is responsible for lifecycle.
  */
  export declare function getMetadataFromDoc(doc: PDFDocumentProxy, filePath: string): Promise<PdfMetadata>;
+ /**
+ * Options for text extraction.
+ *
+ * `splitColumns` controls Issue #3 column-aware reordering. When `>= 2`,
+ * text items are bucketed into N equal-width columns by X-coordinate and
+ * concatenated left-to-right. `1` (default / undefined) preserves the
+ * existing single-column Y-sort behaviour.
+ */
+ export interface ExtractTextOptions {
+ splitColumns?: number;
+ }
  /**
  * Extract text from a pre-loaded PDFDocumentProxy.
  * Does NOT destroy the document — caller is responsible for lifecycle.
  */
- export declare function extractTextFromDoc(doc: PDFDocumentProxy, pages?: string): Promise<PageText[]>;
+ export declare function extractTextFromDoc(doc: PDFDocumentProxy, pages?: string, options?: ExtractTextOptions): Promise<PageText[]>;
  /**
  * Extract text from specified pages (1-based).
  */
- export declare function extractText(filePath: string, pages?: string): Promise<PageText[]>;
+ export declare function extractText(filePath: string, pages?: string, options?: ExtractTextOptions): Promise<PageText[]>;
  /**
  * Search for text across all pages.
  */
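The `splitColumns` doc comment in this hunk describes bucketing text items into N equal-width columns by X-coordinate and concatenating them left to right. The package's actual implementation is not visible in this declaration diff, so the following is only a minimal sketch of that idea. The `TextItem` shape, the `reorderByColumns` name, and the `pageWidth` parameter are hypothetical, and the sketch assumes PDF-style coordinates where a larger `y` is higher on the page.

```typescript
// Hypothetical item shape: only the fields this sketch needs.
interface TextItem {
  str: string;
  x: number; // left edge of the item in page units
  y: number; // baseline; assumed to grow upward, as in PDF user space
}

// Sketch of column-aware reordering (not the package's real code).
// splitColumns <= 1 yields a single bucket, i.e. a plain top-to-bottom sort.
function reorderByColumns(
  items: TextItem[],
  pageWidth: number,
  splitColumns = 1,
): string {
  const columnCount = Math.max(1, Math.floor(splitColumns));
  const columnWidth = pageWidth / columnCount;
  const buckets: TextItem[][] = Array.from({ length: columnCount }, () => []);

  for (const item of items) {
    // Clamp so items at or beyond the page edges still land in a valid bucket.
    const index = Math.min(
      columnCount - 1,
      Math.max(0, Math.floor(item.x / columnWidth)),
    );
    const bucket = buckets[index];
    if (bucket) bucket.push(item);
  }

  return buckets
    .map((bucket) =>
      bucket
        .slice()
        .sort((a, b) => b.y - a.y || a.x - b.x) // top first, then left to right
        .map((item) => item.str)
        .join(' '),
    )
    .filter((text) => text.length > 0)
    .join('\n'); // columns are emitted left to right
}

const sample: TextItem[] = [
  { str: 'R1', x: 320, y: 700 },
  { str: 'L1', x: 50, y: 700 },
  { str: 'L2', x: 50, y: 680 },
  { str: 'R2', x: 320, y: 680 },
];
console.log(reorderByColumns(sample, 600, 2));
```

With a single column, the same four items would interleave line by line (`L1 R1 L2 R2`), which is the reading-order problem that a value of `2` or more is meant to avoid.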
@@ -58,6 +69,27 @@ export declare function analyzeTagsFromDoc(doc: PDFDocumentProxy): Promise<TagsA
  * Analyze Tagged PDF structure tree.
  */
  export declare function analyzeTags(filePath: string): Promise<TagsAnalysis>;
+ /**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ * by removing single ASCII spaces between two CJK characters.
+ */
+ export declare function extractTablesFromDoc(doc: PDFDocumentProxy, pages?: string): Promise<TablesExtractionResult>;
+ export declare function extractTables(filePath: string, pages?: string): Promise<TablesExtractionResult>;
  /**
  * Analyze annotations across all pages.
  */
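The three cell-text post-processing rules listed in the `extractTablesFromDoc` doc comment can be illustrated with a small sketch. This is not the package's code: the `normalizeCellText` name, the order of the steps, and the exact CJK ranges (Hiragana, Katakana, CJK Unified Ideographs) are assumptions made for illustration. It also assumes line breaks have already been turned into `\n` characters in the input, whereas the real service works from the `hasEOL` flag on text items.

```typescript
// Assumed CJK ranges for this sketch: Hiragana, Katakana, CJK Unified Ideographs.
const CJK_RANGE = '\\u3040-\\u30ff\\u4e00-\\u9fff';

// A single ASCII space with a CJK character on both sides.
// Lookarounds keep the neighbours unconsumed, so every gap in a run is matched.
const KERNING_SPACE = new RegExp(`(?<=[${CJK_RANGE}]) (?=[${CJK_RANGE}])`, 'g');

// Hypothetical helper mirroring the three documented rules.
function normalizeCellText(raw: string): string {
  return raw
    .replace(/\r?\n/g, ' ') // 1. newlines become single spaces
    .replace(/[\s\u3000]+/g, ' ') // 2. whitespace runs (incl. U+3000) collapse to one
    .replace(KERNING_SPACE, ''); // 3. fold per-character kerning spaces in CJK text
}

console.log(normalizeCellText('消 費 税 法'));
console.log(normalizeCellText('Tax\nrate'));
```

Collapsing whitespace before folding means a fullwidth or doubled space between two CJK characters is also removed. Whether the real implementation behaves that way is not stated in the diff.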
@@ -1 +1 @@
- {"version":3,"file":"pdfjs-service.d.ts","sourceRoot":"","sources":["../../src/services/pdfjs-service.ts"],"names":[],"mappings":"AAAA;;;;GAIG;AAEH,OAAO,EAGL,KAAK,gBAAgB,EAEtB,MAAM,iCAAiC,CAAC;AAGzC,OAAO,KAAK,EAEV,mBAAmB,EAEnB,qBAAqB,EACrB,QAAQ,EACR,WAAW,EACX,WAAW,EAEX,YAAY,EACb,MAAM,aAAa,CAAC;AAWrB;;GAEG;AACH,wBAAsB,YAAY,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAI9E;AAED;;GAEG;AACH,wBAAsB,oBAAoB,CAAC,IAAI,EAAE,UAAU,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAGtF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,WAAW,CAAC,CAOxE;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,QAAQ,EAAE,MAAM,GACf,OAAO,CAAC,WAAW,CAAC,CA6BtB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,QAAQ,EAAE,CAAC,CAarB;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,QAAQ,EAAE,CAAC,CAQvF;AAED;;GAEG;AACH,wBAAsB,UAAU,CAC9B,QAAQ,EAAE,MAAM,EAChB,KAAK,EAAE,MAAM,EACb,YAAY,GAAE,MAA+B,EAC7C,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,WAAW,EAAE,CAAC,CAsDxB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAmB/F;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAQnF;AAED;;;GAGG;AACH,wBAAsB,aAAa,CACjC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,qBAAqB,CAAC,CAoEhC;AAgGD;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,GAAG,OAAO,CAAC,YAAY,CAAC,CAmErF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,YAAY,CAAC,CAOzE;AAED;;GAEG;AACH,wBAAsB,kBAAkB,CACtC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,mBAAmB,CAAC,CAmG9B"}
+ {"version":3,"file":"pdfjs-service.d.ts","sourceRoot":"","sources":["../../src/services/pdfjs-service.ts"],"names":[],"mappings":"AAAA;;;;GAIG;AAEH,OAAO,EAGL,KAAK,gBAAgB,EAEtB,MAAM,iCAAiC,CAAC;AAGzC,OAAO,KAAK,EAEV,mBAAmB,EAGnB,qBAAqB,EACrB,QAAQ,EACR,WAAW,EACX,WAAW,EAGX,sBAAsB,EAEtB,YAAY,EACb,MAAM,aAAa,CAAC;AAWrB;;GAEG;AACH,wBAAsB,YAAY,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAI9E;AAED;;GAEG;AACH,wBAAsB,oBAAoB,CAAC,IAAI,EAAE,UAAU,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAGtF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,WAAW,CAAC,CAOxE;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,QAAQ,EAAE,MAAM,GACf,OAAO,CAAC,WAAW,CAAC,CA6BtB;AAED;;;;;;;GAOG;AACH,MAAM,WAAW,kBAAkB;IACjC,YAAY,CAAC,EAAE,MAAM,CAAC;CACvB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,KAAK,CAAC,EAAE,MAAM,EACd,OAAO,GAAE,kBAAuB,GAC/B,OAAO,CAAC,QAAQ,EAAE,CAAC,CAarB;AAED;;GAEG;AACH,wBAAsB,WAAW,CAC/B,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,EACd,OAAO,GAAE,kBAAuB,GAC/B,OAAO,CAAC,QAAQ,EAAE,CAAC,CAQrB;AAED;;GAEG;AACH,wBAAsB,UAAU,CAC9B,QAAQ,EAAE,MAAM,EAChB,KAAK,EAAE,MAAM,EACb,YAAY,GAAE,MAA+B,EAC7C,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,WAAW,EAAE,CAAC,CAsDxB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAmB/F;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAQnF;AAED;;;GAGG;AACH,wBAAsB,aAAa,CACjC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,qBAAqB,CAAC,CAoEhC;AAyID;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,GAAG,OAAO,CAAC,YAAY,CAAC,CAmErF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,YAAY,CAAC,CAOzE;AAID;;;;;;;;;;;;;;;;;;GAkBG;AACH,wBAAsB,oBAAoB,CACxC,GAAG,EAAE,gBAAgB,EACrB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,sBAAsB,CAAC,CA8CjC;AAED,wBAAsB,aAAa,CACjC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,sBAAsB,CAAC,CAOjC;AA2KD;;GAEG;AACH,wBAAsB,kBAAkB,CACtC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,mBAAmB,CAAC,CAmG9B"}