@shuji-bonji/pdf-reader-mcp 0.2.2 → 0.3.0

package/CHANGELOG.md CHANGED
@@ -5,6 +5,40 @@ All notable changes to this project will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [0.3.0] - 2026-05-06
+
+ ### Added
+
+ - **`extract_tables` (Tier 2)**: New tool that walks a Tagged PDF's structure tree and emits every `<Table>` subtree as a Markdown table or a JSON document. Designed for documents whose meaning depends on multi-column layout — e.g. Japan's National Tax Agency (国税庁) old/new comparison tables (新旧対照表) in kaisei tsutatsu PDFs, where reading-order extraction merges the 改正後 (after amendment) / 改正前 (before amendment) columns into ambiguous text. Internals:
+   - `extractTables(filePath, pages?)` / `extractTablesFromDoc(doc, pages?)` in `services/pdfjs-service.ts`. Walks the StructTree, identifies `<Table>` → `<THead> | <TBody> | <TFoot>` → `<TR>` → `<TH> | <TD>`, then resolves cell text by mapping each leaf node's marked-content `id` (e.g. `p715R_mc4`) to the corresponding `beginMarkedContentProps` boundary in `getTextContent({ includeMarkedContent: true })`.
+   - Cell text post-processing: collapses whitespace runs (incl. U+3000), folds per-character kerning runs ("消 費 税 法" → "消費税法") while preserving natural inter-word spacing ("事業者 法人番号"), and escapes Markdown table delimiters.
+   - Untagged PDFs return `isTagged: false`, an empty `tables` array, and a `note` recommending column-aware extraction (planned in Issue #3) as the fallback.
+   - colspan / rowspan and nested tables are skipped in this initial release; cells appear in source order.
+ - **Types**: `TableCell`, `TableRow`, `ExtractedTable`, `TablesExtractionResult` in `types.ts`.
+ - **Schema**: `ExtractTablesSchema` in `schemas/tier2.ts` (`file_path` + `pages` + `response_format`).
+ - **Markdown formatter**: `formatTablesMarkdown` in `utils/formatter.ts` renders results as an `# Extracted Tables` summary block followed by `## Page N — Table M` GFM tables.
+ - **E2E tests**: 5 new tests in `tests/e2e/04-tier2-structure.test.ts` covering the untagged → note path, the tagged-but-empty path, formatter shape, and the pages filter.
+
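The kerning-fold rule in the cell post-processing above can be sketched in isolation. The character class and the `{2,}` threshold mirror the shipped `compactCellText` helper; `foldKerning` is a name used here only for illustration:

```typescript
// Sketch of the CJK kerning-fold step (assumed standalone shape).
const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
// A run of at least three single CJK characters separated by single ASCII
// spaces is treated as kerning noise; anything shorter is a word boundary.
const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');

function foldKerning(s: string): string {
  // Remove the intra-run spaces, leave everything else untouched.
  return s.replace(kerningRun, (m) => m.replace(/ /g, ''));
}

console.log(foldKerning('消 費 税 法'));   // kerning run → '消費税法'
console.log(foldKerning('事業者 法人番号')); // single gap → unchanged
```

Requiring at least three single CJK characters in a row is what keeps genuine word boundaries such as 「事業者 法人番号」 intact while still gluing per-character runs back together.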
+ ### Changed
+
+ - **Tool count**: 15 → 16 tools (Tier 2 now has 6).
+ - README / README.ja.md tool tables and architecture diagram updated accordingly.
+
+ ## [0.2.3] - 2026-05-06
+
+ ### Fixed
+
+ - **`inspect_structure` / `inspect_fonts`**: No longer throw `Expected instance of PDFDict, but got instance of undefined` on Linearized PDFs whose cross-reference cannot be fully resolved by pdf-lib. Public-sector publishers (e.g. Japanese government agencies) routinely emit Linearized PDFs from Microsoft Word; previously every such file produced a hard error. After the fix:
+   - `analyzeStructure` returns a partial result. The page count is filled in via pdfjs-dist when pdf-lib's page-tree traversal fails, and a `note` field describes the fallback.
+   - `analyzeFontsWithPdfLib` returns an empty font map with an explanatory `note` instead of throwing.
+   - All pdf-lib call sites (`loadWithPdfLib`, `analyzeStructure`, `analyzeFontsWithPdfLib`, `analyzeSignatures`, `detectEncryption`) now run inside `withSuppressedPdfLibLogs`, preventing pdf-lib's `console.log`-based parse diagnostics from polluting the JSON-RPC stream when the server runs over stdio.
+ - **Types**: `StructureAnalysis` and `FontsAnalysis` gained an optional `note?: string` field. `FontAnalysisResult` (internal) gained the same field.
+ - **Markdown formatter**: `formatStructureMarkdown` and `formatFontsMarkdown` now render the `note` as a `> ...` blockquote when present.
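The page-count fallback described in the Fixed notes above can be sketched as follows. The helper name and the injected callbacks are illustrative (the real logic lives inside `analyzeStructure`): `primary` stands in for pdf-lib's page-tree walk, `fallback` for pdfjs-dist's `doc.numPages`.

```typescript
// Hypothetical sketch of the partial-result fallback, not the shipped code.
interface PartialPageCount {
  pageCount: number;
  note?: string; // present only when the fallback path was taken
}

function countPagesWithFallback(
  primary: () => number,   // e.g. pdf-lib's getPageCount()
  fallback: () => number,  // e.g. pdfjs-dist's doc.numPages
): PartialPageCount {
  try {
    return { pageCount: primary() };
  } catch {
    // pdf-lib could not resolve the xref / page tree (common for
    // Linearized PDFs); fill the count from pdfjs-dist and say so.
    return {
      pageCount: fallback(),
      note: 'pdf-lib page-tree traversal failed; page count filled in via pdfjs-dist.',
    };
  }
}
```

The point of the design is that a broken xref degrades to a partial result with an explanation instead of a hard error.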
+
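A minimal sketch of what a `withSuppressedPdfLibLogs` wrapper has to do (assumed shape; the changelog describes only the helper's behaviour). Over stdio, stdout carries the JSON-RPC stream, so pdf-lib's `console.log`-based parse diagnostics must be silenced for the duration of the call:

```typescript
// Assumed shape of the log-suppression wrapper described above.
async function withSuppressedPdfLibLogs<T>(fn: () => Promise<T>): Promise<T> {
  const originalLog = console.log;
  console.log = () => {}; // swallow pdf-lib parse diagnostics
  try {
    return await fn();
  } finally {
    console.log = originalLog; // always restore, even when fn throws
  }
}
```

The `finally` restore is the important part: a thrown parse error must not leave the process with a dead `console.log`.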
+ ### Added
+
+ - **Test fixture**: `tests/fixtures/linearized.pdf` (regenerated by `create-test-pdf.ts` via `qpdf --linearize`) and matching e2e regression tests in `tests/e2e/04-tier2-structure.test.ts`.
+
  ## [0.2.2] - 2026-04-18
 
  ### Fixed
@@ -57,6 +91,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
  - Y-coordinate-based text extraction preserving natural reading order
  - Unit tests for core utilities and pdfjs-service
 
+ [0.3.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.3...v0.3.0
+ [0.2.3]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.2...v0.2.3
  [0.2.2]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.1...v0.2.2
  [0.2.1]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.0...v0.2.1
  [0.2.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.1.0...v0.2.0
package/README.ja.md CHANGED
@@ -12,7 +12,7 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
 
  ## 機能
 
- **15 ツール** を 3 層構成で提供します。
+ **16 ツール** を 3 層構成で提供します。
 
  ### Tier 1: 基本機能
 
@@ -28,13 +28,14 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
 
  ### Tier 2: 構造解析
 
- | ツール | 説明 |
- | --------------------- | -------------------------------------------- |
- | `inspect_structure` | オブジェクトツリー・カタログ辞書の解析 |
- | `inspect_tags` | Tagged PDF のタグツリー可視化 |
- | `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定) |
- | `inspect_annotations` | 注釈一覧(タイプ別分類) |
- | `inspect_signatures` | 電子署名フィールドの構造解析 |
+ | ツール | 説明 |
+ | --------------------- | -------------------------------------------------------- |
+ | `inspect_structure` | オブジェクトツリー・カタログ辞書の解析 |
+ | `inspect_tags` | Tagged PDF のタグツリー可視化 |
+ | `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定) |
+ | `inspect_annotations` | 注釈一覧(タイプ別分類) |
+ | `inspect_signatures` | 電子署名フィールドの構造解析 |
+ | `extract_tables` | Tagged PDF の `<Table>` を Markdown テーブルとして抽出 |
 
  ### Tier 3: 検証・分析
 
@@ -145,12 +146,29 @@ compare_structure({
  | Tagged | true | true | ✅ |
  ```
 
+ ### Tagged PDF からテーブル抽出
+
+ ```
+ extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
+ → # Extracted Tables
+ - **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
+
+ ## Page 1 — Table 1
+
+ | 改正後 | 改正前 |
+ | --- | --- |
+ | …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
+ ```
+
+ タグ無し PDF では空結果と note を返し、column-aware 抽出(ロードマップ Issue #3)への
+ フォールバックを推奨します。
+
  ## 技術スタック
 
  - **TypeScript** + MCP TypeScript SDK
  - **pdfjs-dist** (Mozilla) — テキスト/画像抽出、タグツリー、注釈
  - **pdf-lib** — 低レベルオブジェクト構造解析
- - **Vitest** — Unit + E2E テスト(157 tests)
+ - **Vitest** — Unit + E2E テスト(164 tests)
  - **Biome** — lint + format
  - **Zod** — 入力バリデーション
 
@@ -158,7 +176,7 @@ compare_structure({
 
  ```bash
  npm test # 全テスト実行(Unit: 39 tests)
- npm run test:e2e # E2E のみ(118 tests)
+ npm run test:e2e # E2E のみ(125 tests)
  npm run test:watch # ウォッチモード
  ```
 
@@ -172,7 +190,7 @@ pdf-reader-mcp/
  │ ├── types.ts # 型定義
  │ ├── tools/
  │ │ ├── tier1/ # 基本ツール (7)
- │ │ ├── tier2/ # 構造解析 (5)
+ │ │ ├── tier2/ # 構造解析 (6)
  │ │ ├── tier3/ # 検証・分析 (3)
  │ │ └── index.ts # ツール登録
  │ ├── services/
@@ -188,7 +206,7 @@ pdf-reader-mcp/
  │ └── error-handler.ts # エラーハンドリング
  └── tests/
  ├── tier1/ # Unit tests
- └── e2e/ # E2E tests (9 suites, 118 tests)
+ └── e2e/ # E2E tests (9 suites, 125 tests)
  ```
 
  ## pdf-spec-mcp との連携
package/README.md CHANGED
@@ -12,7 +12,7 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
 
  ## Features
 
- **15 tools** organized into three tiers:
+ **16 tools** organized into three tiers:
 
  ### Tier 1: Basic Operations
 
@@ -28,13 +28,14 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
 
  ### Tier 2: Structure Inspection
 
- | Tool | Description |
- | --------------------- | ------------------------------------------------ |
- | `inspect_structure` | Object tree and catalog dictionary analysis |
- | `inspect_tags` | Tagged PDF structure tree visualization |
- | `inspect_fonts` | Font inventory (embedded/subset/type detection) |
- | `inspect_annotations` | Annotation listing (categorized by subtype) |
- | `inspect_signatures` | Digital signature field structure analysis |
+ | Tool | Description |
+ | --------------------- | ------------------------------------------------------------------- |
+ | `inspect_structure` | Object tree and catalog dictionary analysis |
+ | `inspect_tags` | Tagged PDF structure tree visualization |
+ | `inspect_fonts` | Font inventory (embedded/subset/type detection) |
+ | `inspect_annotations` | Annotation listing (categorized by subtype) |
+ | `inspect_signatures` | Digital signature field structure analysis |
+ | `extract_tables` | Tagged PDF `<Table>` subtree → Markdown table (preserves columns) |
 
  ### Tier 3: Validation & Analysis
 
@@ -145,12 +146,29 @@ compare_structure({
  | Tagged | true | true | ✅ |
  ```
 
+ ### Extract Tables (Tagged PDF)
+
+ ```
+ extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
+ → # Extracted Tables
+ - **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
+
+ ## Page 1 — Table 1
+
+ | 改正後 | 改正前 |
+ | --- | --- |
+ | …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
+ ```
+
+ Untagged PDFs return an empty result with a `note` recommending column-aware
+ fallback (see roadmap Issue #3).
+
  ## Tech Stack
 
  - **TypeScript** + MCP TypeScript SDK
  - **pdfjs-dist** (Mozilla) — text/image extraction, tag tree, annotations
  - **pdf-lib** — low-level object structure analysis
- - **Vitest** — unit + E2E testing (157 tests)
+ - **Vitest** — unit + E2E testing (164 tests)
  - **Biome** — linting + formatting
  - **Zod** — input validation
 
@@ -158,7 +176,7 @@ compare_structure({
 
  ```bash
  npm test # Run all tests (unit: 39 tests)
- npm run test:e2e # E2E tests only (118 tests)
+ npm run test:e2e # E2E tests only (125 tests)
  npm run test:watch # Watch mode
  ```
 
@@ -172,7 +190,7 @@ pdf-reader-mcp/
  │ ├── types.ts # Type definitions
  │ ├── tools/
  │ │ ├── tier1/ # Basic tools (7)
- │ │ ├── tier2/ # Structure inspection (5)
+ │ │ ├── tier2/ # Structure inspection (6)
  │ │ ├── tier3/ # Validation & analysis (3)
  │ │ └── index.ts # Tool registration
  │ ├── services/
@@ -188,7 +206,7 @@ pdf-reader-mcp/
  │ └── error-handler.ts # Error handling
  └── tests/
  ├── tier1/ # Unit tests
- └── e2e/ # E2E tests (9 suites, 118 tests)
+ └── e2e/ # E2E tests (9 suites, 125 tests)
  ```
 
  ## Pairing with pdf-spec-mcp
package/dist/constants.d.ts CHANGED
@@ -13,7 +13,7 @@ export declare const MAX_SEARCH_RESULTS = 100;
  export declare const DEFAULT_SEARCH_CONTEXT = 80;
  /** Server info */
  export declare const SERVER_NAME = "pdf-reader-mcp";
- export declare const SERVER_VERSION = "0.2.2";
+ export declare const SERVER_VERSION = "0.3.0";
  /** Response format enum */
  export declare enum ResponseFormat {
  MARKDOWN = "markdown",
package/dist/constants.js CHANGED
@@ -13,7 +13,7 @@ export const MAX_SEARCH_RESULTS = 100;
  export const DEFAULT_SEARCH_CONTEXT = 80;
  /** Server info */
  export const SERVER_NAME = 'pdf-reader-mcp';
- export const SERVER_VERSION = '0.2.2';
+ export const SERVER_VERSION = '0.3.0';
  /** Response format enum */
  export var ResponseFormat;
  (function (ResponseFormat) {
package/dist/schemas/tier2.d.ts CHANGED
@@ -60,9 +60,24 @@ export declare const InspectSignaturesSchema: z.ZodObject<{
  file_path: string;
  response_format?: import("../constants.js").ResponseFormat | undefined;
  }>;
+ /** extract_tables — Tagged PDF Table → Markdown */
+ export declare const ExtractTablesSchema: z.ZodObject<{
+ file_path: z.ZodString;
+ pages: z.ZodOptional<z.ZodString>;
+ response_format: z.ZodDefault<z.ZodNativeEnum<typeof import("../constants.js").ResponseFormat>>;
+ }, "strict", z.ZodTypeAny, {
+ file_path: string;
+ response_format: import("../constants.js").ResponseFormat;
+ pages?: string | undefined;
+ }, {
+ file_path: string;
+ response_format?: import("../constants.js").ResponseFormat | undefined;
+ pages?: string | undefined;
+ }>;
  export type InspectStructureInput = z.infer<typeof InspectStructureSchema>;
  export type InspectTagsInput = z.infer<typeof InspectTagsSchema>;
  export type InspectFontsInput = z.infer<typeof InspectFontsSchema>;
  export type InspectAnnotationsInput = z.infer<typeof InspectAnnotationsSchema>;
  export type InspectSignaturesInput = z.infer<typeof InspectSignaturesSchema>;
+ export type ExtractTablesInput = z.infer<typeof ExtractTablesSchema>;
  //# sourceMappingURL=tier2.d.ts.map
package/dist/schemas/tier2.d.ts.map CHANGED
@@ -1 +1 @@
- {"version":3,"file":"tier2.d.ts","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAGxB,wBAAwB;AACxB,eAAO,MAAM,sBAAsB;;;;;;;;;EAKxB,CAAC;AAEZ,mBAAmB;AACnB,eAAO,MAAM,iBAAiB;;;;;;;;;EAKnB,CAAC;AAEZ,oBAAoB;AACpB,eAAO,MAAM,kBAAkB;;;;;;;;;EAKpB,CAAC;AAEZ,0BAA0B;AAC1B,eAAO,MAAM,wBAAwB;;;;;;;;;;;;EAM1B,CAAC;AAEZ,yBAAyB;AACzB,eAAO,MAAM,uBAAuB;;;;;;;;;EAKzB,CAAC;AAGZ,MAAM,MAAM,qBAAqB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,sBAAsB,CAAC,CAAC;AAC3E,MAAM,MAAM,gBAAgB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,iBAAiB,CAAC,CAAC;AACjE,MAAM,MAAM,iBAAiB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,kBAAkB,CAAC,CAAC;AACnE,MAAM,MAAM,uBAAuB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,wBAAwB,CAAC,CAAC;AAC/E,MAAM,MAAM,sBAAsB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,uBAAuB,CAAC,CAAC"}
+ {"version":3,"file":"tier2.d.ts","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AAGxB,wBAAwB;AACxB,eAAO,MAAM,sBAAsB;;;;;;;;;EAKxB,CAAC;AAEZ,mBAAmB;AACnB,eAAO,MAAM,iBAAiB;;;;;;;;;EAKnB,CAAC;AAEZ,oBAAoB;AACpB,eAAO,MAAM,kBAAkB;;;;;;;;;EAKpB,CAAC;AAEZ,0BAA0B;AAC1B,eAAO,MAAM,wBAAwB;;;;;;;;;;;;EAM1B,CAAC;AAEZ,yBAAyB;AACzB,eAAO,MAAM,uBAAuB;;;;;;;;;EAKzB,CAAC;AAEZ,mDAAmD;AACnD,eAAO,MAAM,mBAAmB;;;;;;;;;;;;EAMrB,CAAC;AAGZ,MAAM,MAAM,qBAAqB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,sBAAsB,CAAC,CAAC;AAC3E,MAAM,MAAM,gBAAgB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,iBAAiB,CAAC,CAAC;AACjE,MAAM,MAAM,iBAAiB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,kBAAkB,CAAC,CAAC;AACnE,MAAM,MAAM,uBAAuB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,wBAAwB,CAAC,CAAC;AAC/E,MAAM,MAAM,sBAAsB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,uBAAuB,CAAC,CAAC;AAC7E,MAAM,MAAM,kBAAkB,GAAG,CAAC,CAAC,KAAK,CAAC,OAAO,mBAAmB,CAAC,CAAC"}
package/dist/schemas/tier2.js CHANGED
@@ -39,4 +39,12 @@ export const InspectSignaturesSchema = z
  response_format: ResponseFormatSchema,
  })
  .strict();
+ /** extract_tables — Tagged PDF Table → Markdown */
+ export const ExtractTablesSchema = z
+ .object({
+ file_path: FilePathSchema,
+ pages: PagesSchema,
+ response_format: ResponseFormatSchema,
+ })
+ .strict();
  //# sourceMappingURL=tier2.js.map
package/dist/schemas/tier2.js.map CHANGED
@@ -1 +1 @@
- {"version":3,"file":"tier2.js","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AACxB,OAAO,EAAE,cAAc,EAAE,WAAW,EAAE,oBAAoB,EAAE,MAAM,aAAa,CAAC;AAEhF,wBAAwB;AACxB,MAAM,CAAC,MAAM,sBAAsB,GAAG,CAAC;KACpC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mBAAmB;AACnB,MAAM,CAAC,MAAM,iBAAiB,GAAG,CAAC;KAC/B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,oBAAoB;AACpB,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,0BAA0B;AAC1B,MAAM,CAAC,MAAM,wBAAwB,GAAG,CAAC;KACtC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,yBAAyB;AACzB,MAAM,CAAC,MAAM,uBAAuB,GAAG,CAAC;KACrC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC"}
+ {"version":3,"file":"tier2.js","sourceRoot":"","sources":["../../src/schemas/tier2.ts"],"names":[],"mappings":"AAAA;;GAEG;AAEH,OAAO,EAAE,CAAC,EAAE,MAAM,KAAK,CAAC;AACxB,OAAO,EAAE,cAAc,EAAE,WAAW,EAAE,oBAAoB,EAAE,MAAM,aAAa,CAAC;AAEhF,wBAAwB;AACxB,MAAM,CAAC,MAAM,sBAAsB,GAAG,CAAC;KACpC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mBAAmB;AACnB,MAAM,CAAC,MAAM,iBAAiB,GAAG,CAAC;KAC/B,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,oBAAoB;AACpB,MAAM,CAAC,MAAM,kBAAkB,GAAG,CAAC;KAChC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,0BAA0B;AAC1B,MAAM,CAAC,MAAM,wBAAwB,GAAG,CAAC;KACtC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,yBAAyB;AACzB,MAAM,CAAC,MAAM,uBAAuB,GAAG,CAAC;KACrC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC;AAEZ,mDAAmD;AACnD,MAAM,CAAC,MAAM,mBAAmB,GAAG,CAAC;KACjC,MAAM,CAAC;IACN,SAAS,EAAE,cAAc;IACzB,KAAK,EAAE,WAAW;IAClB,eAAe,EAAE,oBAAoB;CACtC,CAAC;KACD,MAAM,EAAE,CAAC"}
package/dist/services/pdfjs-service.d.ts CHANGED
@@ -4,7 +4,7 @@
  * Centralizes all pdfjs-dist interactions for reuse across tools.
  */
  import { type PDFDocumentProxy } from 'pdfjs-dist/legacy/build/pdf.mjs';
- import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TagsAnalysis } from '../types.js';
+ import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TablesExtractionResult, TagsAnalysis } from '../types.js';
  /**
  * Load a PDF document from a file path.
  */
@@ -58,6 +58,27 @@ export declare function analyzeTagsFromDoc(doc: PDFDocumentProxy): Promise<TagsA
  * Analyze Tagged PDF structure tree.
  */
  export declare function analyzeTags(filePath: string): Promise<TagsAnalysis>;
+ /**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ * by removing single ASCII spaces between two CJK characters.
+ */
+ export declare function extractTablesFromDoc(doc: PDFDocumentProxy, pages?: string): Promise<TablesExtractionResult>;
+ export declare function extractTables(filePath: string, pages?: string): Promise<TablesExtractionResult>;
  /**
  * Analyze annotations across all pages.
  */
package/dist/services/pdfjs-service.d.ts.map CHANGED
@@ -1 +1 @@
- {"version":3,"file":"pdfjs-service.d.ts","sourceRoot":"","sources":["../../src/services/pdfjs-service.ts"],"names":[],"mappings":"AAAA;;;;GAIG;AAEH,OAAO,EAGL,KAAK,gBAAgB,EAEtB,MAAM,iCAAiC,CAAC;AAGzC,OAAO,KAAK,EAEV,mBAAmB,EAEnB,qBAAqB,EACrB,QAAQ,EACR,WAAW,EACX,WAAW,EAEX,YAAY,EACb,MAAM,aAAa,CAAC;AAWrB;;GAEG;AACH,wBAAsB,YAAY,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAI9E;AAED;;GAEG;AACH,wBAAsB,oBAAoB,CAAC,IAAI,EAAE,UAAU,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAGtF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,WAAW,CAAC,CAOxE;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,QAAQ,EAAE,MAAM,GACf,OAAO,CAAC,WAAW,CAAC,CA6BtB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,QAAQ,EAAE,CAAC,CAarB;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,QAAQ,EAAE,CAAC,CAQvF;AAED;;GAEG;AACH,wBAAsB,UAAU,CAC9B,QAAQ,EAAE,MAAM,EAChB,KAAK,EAAE,MAAM,EACb,YAAY,GAAE,MAA+B,EAC7C,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,WAAW,EAAE,CAAC,CAsDxB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAmB/F;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAQnF;AAED;;;GAGG;AACH,wBAAsB,aAAa,CACjC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,qBAAqB,CAAC,CAoEhC;AAgGD;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,GAAG,OAAO,CAAC,YAAY,CAAC,CAmErF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,YAAY,CAAC,CAOzE;AAED;;GAEG;AACH,wBAAsB,kBAAkB,CACtC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,mBAAmB,CAAC,CAmG9B"}
+ {"version":3,"file":"pdfjs-service.d.ts","sourceRoot":"","sources":["../../src/services/pdfjs-service.ts"],"names":[],"mappings":"AAAA;;;;GAIG;AAEH,OAAO,EAGL,KAAK,gBAAgB,EAEtB,MAAM,iCAAiC,CAAC;AAGzC,OAAO,KAAK,EAEV,mBAAmB,EAGnB,qBAAqB,EACrB,QAAQ,EACR,WAAW,EACX,WAAW,EAGX,sBAAsB,EAEtB,YAAY,EACb,MAAM,aAAa,CAAC;AAWrB;;GAEG;AACH,wBAAsB,YAAY,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAI9E;AAED;;GAEG;AACH,wBAAsB,oBAAoB,CAAC,IAAI,EAAE,UAAU,GAAG,OAAO,CAAC,gBAAgB,CAAC,CAGtF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,WAAW,CAAC,CAOxE;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,QAAQ,EAAE,MAAM,GACf,OAAO,CAAC,WAAW,CAAC,CA6BtB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CACtC,GAAG,EAAE,gBAAgB,EACrB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,QAAQ,EAAE,CAAC,CAarB;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,QAAQ,EAAE,CAAC,CAQvF;AAED;;GAEG;AACH,wBAAsB,UAAU,CAC9B,QAAQ,EAAE,MAAM,EAChB,KAAK,EAAE,MAAM,EACb,YAAY,GAAE,MAA+B,EAC7C,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,WAAW,EAAE,CAAC,CAsDxB;AAED;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAmB/F;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,EAAE,KAAK,CAAC,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAQnF;AAED;;;GAGG;AACH,wBAAsB,aAAa,CACjC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,qBAAqB,CAAC,CAoEhC;AAgGD;;;GAGG;AACH,wBAAsB,kBAAkB,CAAC,GAAG,EAAE,gBAAgB,GAAG,OAAO,CAAC,YAAY,CAAC,CAmErF;AAED;;GAEG;AACH,wBAAsB,WAAW,CAAC,QAAQ,EAAE,MAAM,GAAG,OAAO,CAAC,YAAY,CAAC,CAOzE;AAID;;;;;;;;;;;;;;;;;;GAkBG;AACH,wBAAsB,oBAAoB,CACxC,GAAG,EAAE,gBAAgB,EACrB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,sBAAsB,CAAC,CA8CjC;AAED,wBAAsB,aAAa,CACjC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,sBAAsB,CAAC,CAOjC;AA2KD;;GAEG;AACH,wBAAsB,kBAAkB,CACtC,QAAQ,EAAE,MAAM,EAChB,KAAK,CAAC,EAAE,MAAM,GACb,OAAO,CAAC,mBAAmB,CAAC,CAmG9B"}
package/dist/services/pdfjs-service.js CHANGED
@@ -391,6 +391,221 @@ export async function analyzeTags(filePath) {
  await doc.destroy();
  }
  }
+ // ─── extract_tables ────────────────────────────────────────────────────────
+ /**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ * by removing single ASCII spaces between two CJK characters.
+ */
+ export async function extractTablesFromDoc(doc, pages) {
+ const markInfo = await getMarkInfo(doc);
+ const isTagged = markInfo?.Marked === true;
+ if (!isTagged) {
+ return {
+ isTagged: false,
+ tables: [],
+ totalTables: 0,
+ pagesScanned: 0,
+ note: 'Document is not tagged. extract_tables relies on /MarkInfo /Marked true ' +
+ 'and a StructTree. For untagged two-column PDFs, fall back to a ' +
+ 'column-aware reading strategy (see pdf-reader-mcp Issue #3).',
+ };
+ }
+ const pageNumbers = resolvePageNumbers(pages, doc.numPages);
+ const perPage = await Promise.all(pageNumbers.map(async (pageNum) => {
+ const page = await doc.getPage(pageNum);
+ try {
+ const [tree, textContent] = await Promise.all([
+ page.getStructTree(),
+ page.getTextContent({ includeMarkedContent: true }),
+ ]);
+ if (!tree)
+ return [];
+ const idToText = buildIdToTextMap(textContent.items);
+ const tables = [];
+ collectTables(tree, pageNum, idToText, tables);
+ return tables;
+ }
+ catch {
+ return [];
+ }
+ }));
+ const tables = perPage.flat();
+ return {
+ isTagged: true,
+ tables,
+ totalTables: tables.length,
+ pagesScanned: pageNumbers.length,
+ };
+ }
+ export async function extractTables(filePath, pages) {
+ const doc = await loadDocument(filePath);
+ try {
+ return await extractTablesFromDoc(doc, pages);
+ }
+ finally {
+ await doc.destroy();
+ }
+ }
+ /**
+ * Build a map from a marked-content `id` (e.g. `p715R_mc4`) to the concatenated
+ * raw text inside the corresponding `beginMarkedContentProps`/`endMarkedContent`
+ * pair. Nested marked content is supported via a stack — text counts toward
+ * every active id (so a `<Span>` inside a `<P>` contributes to both).
+ *
+ * Items with `tag === 'Artifact'` are page-level artifacts (page numbers,
+ * running headers, etc.) outside the structure tree, and are skipped.
+ */
+ function buildIdToTextMap(items) {
+ const map = new Map();
+ const stack = [];
+ for (const item of items) {
+ const t = item.type;
+ if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
+ const isArtifact = item.tag === 'Artifact';
+ const id = item.id ?? null;
+ stack.push({ id, isArtifact });
+ continue;
+ }
+ if (t === 'endMarkedContent') {
+ stack.pop();
+ continue;
+ }
+ if (t !== undefined)
+ continue; // unknown marker
+ // Text item
+ if (stack.some((s) => s.isArtifact))
+ continue;
+ const str = item.hasEOL ? ' ' : (item.str ?? '');
+ if (!str)
+ continue;
+ for (const frame of stack) {
+ if (frame.id) {
+ const buf = map.get(frame.id);
+ if (buf)
+ buf.push(str);
+ else
+ map.set(frame.id, [str]);
+ }
+ }
+ }
+ const out = new Map();
+ for (const [id, parts] of map)
+ out.set(id, parts.join(''));
+ return out;
+ }
+ /** Walk the StructTree and append every `<Table>` subtree as an ExtractedTable. */
+ function collectTables(node, pageNum, idToText, out) {
+ if (node.type === 'content')
+ return;
+ if (node.role === 'Table') {
+ const headerRows = [];
+ const bodyRows = [];
+ const footerRows = [];
+ for (const child of node.children ?? []) {
+ if (child.type === 'content')
+ continue;
+ if (child.role === 'THead') {
+ appendTableRowsFromSection(child, idToText, headerRows);
+ }
+ else if (child.role === 'TBody') {
+ appendTableRowsFromSection(child, idToText, bodyRows);
+ }
+ else if (child.role === 'TFoot') {
+ appendTableRowsFromSection(child, idToText, footerRows);
+ }
+ else if (child.role === 'TR') {
+ // Tables sometimes omit THead/TBody and place TRs directly under <Table>.
+ const row = buildRowFromTR(child, idToText);
+ if (row)
+ bodyRows.push(row);
+ }
+ }
+ // `out` is a per-page accumulator passed in by the caller, so
+ // `out.length + 1` is the next index within this page (1-based).
+ out.push({
+ page: pageNum,
+ index: out.length + 1,
+ headerRows,
+ bodyRows,
+ footerRows,
+ });
+ return; // Don't recurse into a Table — nested tables are uncommon and
+ // would confuse the per-page index. (Add nested-table support later.)
+ }
+ for (const child of node.children ?? []) {
+ collectTables(child, pageNum, idToText, out);
+ }
+ }
+ function appendTableRowsFromSection(section, idToText, out) {
+ for (const child of section.children ?? []) {
+ if (child.type === 'content')
+ continue;
+ if (child.role === 'TR') {
+ const row = buildRowFromTR(child, idToText);
+ if (row)
+ out.push(row);
+ }
+ }
+ }
+ function buildRowFromTR(tr, idToText) {
+ const cells = [];
+ for (const child of tr.children ?? []) {
+ if (child.type === 'content')
+ continue;
+ if (child.role === 'TH' || child.role === 'TD') {
+ const text = compactCellText(collectTextUnder(child, idToText));
+ cells.push({ text, isHeader: child.role === 'TH' });
+ }
+ }
+ return cells.length === 0 ? null : { cells };
+ }
+ function collectTextUnder(node, idToText) {
+ if (node.type === 'content') {
+ return node.id ? (idToText.get(node.id) ?? '') : '';
+ }
+ const parts = [];
+ for (const child of node.children ?? []) {
+ parts.push(collectTextUnder(child, idToText));
+ }
+ return parts.join(' ');
+ }
+ /**
+ * Normalise raw cell text:
+ * 1. Collapse any whitespace run (`\s` + U+3000) to a single ASCII space.
+ * 2. Fold per-character kerning runs between CJK characters
+ * (e.g. "消 費 税 法" → "消費税法") — but only when at least three
+ * single CJK chars are separated by single spaces in a row, so that
+ * natural inter-word spacing like "事業者 法人番号" is preserved.
+ * 3. Trim and Markdown-escape pipes / newlines.
+ */
+ function compactCellText(s) {
+ if (!s)
+ return '';
+ // Step 1: collapse whitespace runs (incl. U+3000) to one ASCII space.
+ let t = s.replace(/[\s　]+/g, ' ').trim();
+ // Step 2: fold runs of `CJK + space` repeated at least twice followed by
+ // a final CJK char. Anything shorter is treated as a real word boundary.
+ const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
+ const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
+ t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
+ // Step 3: escape Markdown table delimiters.
+ return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
+ }
  /**
  * Analyze annotations across all pages.
  */