@shuji-bonji/pdf-reader-mcp 0.2.2 → 0.3.0
- package/CHANGELOG.md +36 -0
- package/README.ja.md +30 -12
- package/README.md +30 -12
- package/dist/constants.d.ts +1 -1
- package/dist/constants.js +1 -1
- package/dist/schemas/tier2.d.ts +15 -0
- package/dist/schemas/tier2.d.ts.map +1 -1
- package/dist/schemas/tier2.js +8 -0
- package/dist/schemas/tier2.js.map +1 -1
- package/dist/services/pdfjs-service.d.ts +22 -1
- package/dist/services/pdfjs-service.d.ts.map +1 -1
- package/dist/services/pdfjs-service.js +215 -0
- package/dist/services/pdfjs-service.js.map +1 -1
- package/dist/services/pdflib-service.d.ts +21 -0
- package/dist/services/pdflib-service.d.ts.map +1 -1
- package/dist/services/pdflib-service.js +128 -24
- package/dist/services/pdflib-service.js.map +1 -1
- package/dist/tools/index.d.ts.map +1 -1
- package/dist/tools/index.js +2 -0
- package/dist/tools/index.js.map +1 -1
- package/dist/tools/tier2/extract-tables.d.ts +11 -0
- package/dist/tools/tier2/extract-tables.d.ts.map +1 -0
- package/dist/tools/tier2/extract-tables.js +66 -0
- package/dist/tools/tier2/extract-tables.js.map +1 -0
- package/dist/tools/tier2/inspect-fonts.d.ts.map +1 -1
- package/dist/tools/tier2/inspect-fonts.js +1 -0
- package/dist/tools/tier2/inspect-fonts.js.map +1 -1
- package/dist/types.d.ts +54 -0
- package/dist/types.d.ts.map +1 -1
- package/dist/utils/formatter.d.ts +8 -1
- package/dist/utils/formatter.d.ts.map +1 -1
- package/dist/utils/formatter.js +61 -0
- package/dist/utils/formatter.js.map +1 -1
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -5,6 +5,40 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.3.0] - 2026-05-06
+
+### Added
+
+- **`extract_tables` (Tier 2)**: New tool that walks a Tagged PDF's structure tree and emits every `<Table>` subtree as a Markdown table or a JSON document. Designed for documents whose meaning depends on multi-column layout — e.g. Japan's National Tax Agency (国税庁) old/new comparison tables (新旧対照表) in amendment circulars (kaisei tsutatsu), where reading-order extraction merges the 改正後 (after amendment) / 改正前 (before amendment) columns into ambiguous text. Internals:
+  - `extractTables(filePath, pages?)` / `extractTablesFromDoc(doc, pages?)` in `services/pdfjs-service.ts`. Walks the StructTree, identifies `<Table>` → `<THead> | <TBody> | <TFoot>` → `<TR>` → `<TH> | <TD>`, then resolves cell text by mapping each leaf node's marked-content `id` (e.g. `p715R_mc4`) to the corresponding `beginMarkedContentProps` boundary in `getTextContent({ includeMarkedContent: true })`.
+  - Cell text post-processing: collapses whitespace runs (incl. U+3000), folds per-character kerning runs ("消 費 税 法" → "消費税法") while preserving natural inter-word spacing ("事業者 法人番号"), and escapes Markdown table delimiters.
+  - Untagged PDFs return `isTagged: false`, an empty `tables` array, and a `note` recommending column-aware extraction (planned in Issue #3) as the fallback.
+  - colspan / rowspan and nested tables are skipped in this initial release; cells appear in source order.
+- **Types**: `TableCell`, `TableRow`, `ExtractedTable`, `TablesExtractionResult` in `types.ts`.
+- **Schema**: `ExtractTablesSchema` in `schemas/tier2.ts` (file_path + pages + response_format).
+- **Markdown formatter**: `formatTablesMarkdown` in `utils/formatter.ts` renders results as an `# Extracted Tables` summary block followed by `## Page N — Table M` GFM tables.
+- **E2E tests**: 5 new tests in `tests/e2e/04-tier2-structure.test.ts` covering the untagged → note path, the tagged-but-empty path, formatter shape, and the pages filter.
+
+### Changed
+
+- **Tool count**: 15 → 16 tools (Tier 2 now has 6).
+- README / README.ja.md tool tables and architecture diagram updated accordingly.
+
+## [0.2.3] - 2026-05-06
+
+### Fixed
+
+- **`inspect_structure` / `inspect_fonts`**: No longer throw `Expected instance of PDFDict, but got instance of undefined` on Linearized PDFs whose cross-reference table cannot be fully resolved by pdf-lib. Public-sector publishers (e.g. Japanese government agencies) routinely emit Linearized PDFs from Microsoft Word; previously every such file produced a hard error. After the fix:
+  - `analyzeStructure` returns a partial result. The page count is filled in via pdfjs-dist when pdf-lib's page-tree traversal fails, and a `note` field describes the fallback.
+  - `analyzeFontsWithPdfLib` returns an empty font map with an explanatory `note` instead of throwing.
+  - All pdf-lib call sites (`loadWithPdfLib`, `analyzeStructure`, `analyzeFontsWithPdfLib`, `analyzeSignatures`, `detectEncryption`) now run inside `withSuppressedPdfLibLogs`, preventing pdf-lib's `console.log`-based parse diagnostics from polluting the JSON-RPC stream when the server runs over stdio.
+- **Types**: `StructureAnalysis` and `FontsAnalysis` gained an optional `note?: string` field. `FontAnalysisResult` (internal) gained the same field.
+- **Markdown formatter**: `formatStructureMarkdown` and `formatFontsMarkdown` now render the `note` as a `> ...` blockquote when present.
+
+### Added
+
+- **Test fixture**: `tests/fixtures/linearized.pdf` (regenerated by `create-test-pdf.ts` via `qpdf --linearize`) and matching e2e regression tests in `tests/e2e/04-tier2-structure.test.ts`.
+
 ## [0.2.2] - 2026-04-18
 
 ### Fixed
@@ -57,6 +91,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Y-coordinate-based text extraction preserving natural reading order
 - Unit tests for core utilities and pdfjs-service
 
+[0.3.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.3...v0.3.0
+[0.2.3]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.2...v0.2.3
 [0.2.2]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.1...v0.2.2
 [0.2.1]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.0...v0.2.1
 [0.2.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.1.0...v0.2.0
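The cell-text post-processing described above (whitespace collapse, kerning-run folding, delimiter escaping) can be sketched as follows. The logic mirrors the `compactCellText` helper shipped in `dist/services/pdfjs-service.js` later in this diff; the standalone function here is only an illustration.

```typescript
// Sketch of the cell-text normalization described in the 0.3.0 changelog entry.
// Mirrors compactCellText from dist/services/pdfjs-service.js.
function compactCellText(s: string): string {
  if (!s) return '';
  // 1. Collapse whitespace runs (including U+3000 fullwidth space) to one space.
  let t = s.replace(/[\s\u3000]+/g, ' ').trim();
  // 2. Fold kerning runs: three or more single CJK chars separated by single
  //    spaces are joined; a lone space stays, preserving real word boundaries.
  const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
  const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
  t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
  // 3. Escape Markdown table delimiters.
  return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
}

console.log(compactCellText('消 費 税 法'));     // kerning run folded: 消費税法
console.log(compactCellText('事業者 法人番号')); // single space kept: 事業者 法人番号
```

Note that the fold only fires on runs of at least three space-separated CJK characters, which is why the two-word example keeps its space.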
package/README.ja.md
CHANGED
@@ -12,7 +12,7 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
 
 ## 機能
 
-**
+**16 ツール** を 3 層構成で提供します。
 
 ### Tier 1: 基本機能
 
@@ -28,13 +28,14 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
 
 ### Tier 2: 構造解析
 
-| ツール | 説明
-| --------------------- |
-| `inspect_structure` | オブジェクトツリー・カタログ辞書の解析
-| `inspect_tags` | Tagged PDF のタグツリー可視化
-| `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定)
-| `inspect_annotations` | 注釈一覧(タイプ別分類)
-| `inspect_signatures` | 電子署名フィールドの構造解析
+| ツール | 説明 |
+| --------------------- | -------------------------------------------------------- |
+| `inspect_structure` | オブジェクトツリー・カタログ辞書の解析 |
+| `inspect_tags` | Tagged PDF のタグツリー可視化 |
+| `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定) |
+| `inspect_annotations` | 注釈一覧(タイプ別分類) |
+| `inspect_signatures` | 電子署名フィールドの構造解析 |
+| `extract_tables` | Tagged PDF の `<Table>` を Markdown テーブルとして抽出 |
 
 ### Tier 3: 検証・分析
 
@@ -145,12 +146,29 @@ compare_structure({
 | Tagged | true | true | ✅ |
 ```
 
+### Tagged PDF からテーブル抽出
+
+```
+extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
+→ # Extracted Tables
+- **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
+
+## Page 1 — Table 1
+
+| 改正後 | 改正前 |
+| --- | --- |
+| …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
+```
+
+タグ無し PDF では空結果と note を返し、column-aware 抽出(ロードマップ Issue #3)への
+フォールバックを推奨します。
+
 ## 技術スタック
 
 - **TypeScript** + MCP TypeScript SDK
 - **pdfjs-dist** (Mozilla) — テキスト/画像抽出、タグツリー、注釈
 - **pdf-lib** — 低レベルオブジェクト構造解析
-- **Vitest** — Unit + E2E テスト(
+- **Vitest** — Unit + E2E テスト(164 tests)
 - **Biome** — lint + format
 - **Zod** — 入力バリデーション
 
@@ -158,7 +176,7 @@ compare_structure({
 
 ```bash
 npm test # 全テスト実行(Unit: 39 tests)
-npm run test:e2e # E2E のみ(
+npm run test:e2e # E2E のみ(125 tests)
 npm run test:watch # ウォッチモード
 ```
 
@@ -172,7 +190,7 @@ pdf-reader-mcp/
 │ ├── types.ts # 型定義
 │ ├── tools/
 │ │ ├── tier1/ # 基本ツール (7)
-│ │ ├── tier2/ # 構造解析 (
+│ │ ├── tier2/ # 構造解析 (6)
 │ │ ├── tier3/ # 検証・分析 (3)
 │ │ └── index.ts # ツール登録
 │ ├── services/
@@ -188,7 +206,7 @@ pdf-reader-mcp/
 │ └── error-handler.ts # エラーハンドリング
 └── tests/
 ├── tier1/ # Unit tests
-└── e2e/ # E2E tests (9 suites,
+└── e2e/ # E2E tests (9 suites, 125 tests)
 ```
 
 ## pdf-spec-mcp との連携
package/README.md
CHANGED
@@ -12,7 +12,7 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
 
 ## Features
 
-**
+**16 tools** organized into three tiers:
 
 ### Tier 1: Basic Operations
 
@@ -28,13 +28,14 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
 
 ### Tier 2: Structure Inspection
 
-| Tool | Description
-| --------------------- |
-| `inspect_structure` | Object tree and catalog dictionary analysis
-| `inspect_tags` | Tagged PDF structure tree visualization
-| `inspect_fonts` | Font inventory (embedded/subset/type detection)
-| `inspect_annotations` | Annotation listing (categorized by subtype)
-| `inspect_signatures` | Digital signature field structure analysis
+| Tool | Description |
+| --------------------- | ------------------------------------------------------------------- |
+| `inspect_structure` | Object tree and catalog dictionary analysis |
+| `inspect_tags` | Tagged PDF structure tree visualization |
+| `inspect_fonts` | Font inventory (embedded/subset/type detection) |
+| `inspect_annotations` | Annotation listing (categorized by subtype) |
+| `inspect_signatures` | Digital signature field structure analysis |
+| `extract_tables` | Tagged PDF `<Table>` subtree → Markdown table (preserves columns) |
 
 ### Tier 3: Validation & Analysis
 
@@ -145,12 +146,29 @@ compare_structure({
 | Tagged | true | true | ✅ |
 ```
 
+### Extract Tables (Tagged PDF)
+
+```
+extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
+→ # Extracted Tables
+- **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
+
+## Page 1 — Table 1
+
+| 改正後 | 改正前 |
+| --- | --- |
+| …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
+```
+
+Untagged PDFs return an empty result with a `note` recommending column-aware
+fallback (see roadmap Issue #3).
+
 ## Tech Stack
 
 - **TypeScript** + MCP TypeScript SDK
 - **pdfjs-dist** (Mozilla) — text/image extraction, tag tree, annotations
 - **pdf-lib** — low-level object structure analysis
-- **Vitest** — unit + E2E testing (
+- **Vitest** — unit + E2E testing (164 tests)
 - **Biome** — linting + formatting
 - **Zod** — input validation
 
@@ -158,7 +176,7 @@ compare_structure({
 
 ```bash
 npm test # Run all tests (unit: 39 tests)
-npm run test:e2e # E2E tests only (
+npm run test:e2e # E2E tests only (125 tests)
 npm run test:watch # Watch mode
 ```
 
@@ -172,7 +190,7 @@ pdf-reader-mcp/
 │ ├── types.ts # Type definitions
 │ ├── tools/
 │ │ ├── tier1/ # Basic tools (7)
-│ │ ├── tier2/ # Structure inspection (
+│ │ ├── tier2/ # Structure inspection (6)
 │ │ ├── tier3/ # Validation & analysis (3)
 │ │ └── index.ts # Tool registration
 │ ├── services/
@@ -188,7 +206,7 @@ pdf-reader-mcp/
 │ └── error-handler.ts # Error handling
 └── tests/
 ├── tier1/ # Unit tests
-└── e2e/ # E2E tests (9 suites,
+└── e2e/ # E2E tests (9 suites, 125 tests)
 ```
 
 ## Pairing with pdf-spec-mcp
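The `formatTablesMarkdown` output shown in the README example above can be approximated with a small renderer. This is a hypothetical sketch, not the package's actual formatter; the `TableCell`/`TableRow` shapes are simplified from what `types.ts` declares, and `renderTable` is an illustrative name.

```typescript
// Hypothetical sketch of rendering one extracted table as a GFM block in the
// "## Page N — Table M" style the README shows. Shapes simplified from types.ts.
interface TableCell { text: string; isHeader: boolean; }
interface TableRow { cells: TableCell[]; }

function renderTable(page: number, index: number, header: TableRow, body: TableRow[]): string {
  // One GFM row per TableRow; cell text is assumed already pipe-escaped.
  const line = (row: TableRow) => `| ${row.cells.map((c) => c.text).join(' | ')} |`;
  const sep = `| ${header.cells.map(() => '---').join(' | ')} |`;
  return [`## Page ${page} — Table ${index}`, '', line(header), sep, ...body.map(line)].join('\n');
}

const md = renderTable(
  1, 1,
  { cells: [{ text: '改正後', isHeader: true }, { text: '改正前', isHeader: true }] },
  [{ cells: [{ text: '…第2条…', isHeader: false }, { text: '…第2条…', isHeader: false }] }],
);
console.log(md);
```

The separator row is derived from the header's cell count, which is why the real tool skips colspan/rowspan in this release: a spanned cell would break the one-delimiter-per-column assumption.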
package/dist/constants.d.ts
CHANGED
@@ -13,7 +13,7 @@ export declare const MAX_SEARCH_RESULTS = 100;
 export declare const DEFAULT_SEARCH_CONTEXT = 80;
 /** Server info */
 export declare const SERVER_NAME = "pdf-reader-mcp";
-export declare const SERVER_VERSION = "0.
+export declare const SERVER_VERSION = "0.3.0";
 /** Response format enum */
 export declare enum ResponseFormat {
     MARKDOWN = "markdown",
package/dist/constants.js
CHANGED
@@ -13,7 +13,7 @@ export const MAX_SEARCH_RESULTS = 100;
 export const DEFAULT_SEARCH_CONTEXT = 80;
 /** Server info */
 export const SERVER_NAME = 'pdf-reader-mcp';
-export const SERVER_VERSION = '0.
+export const SERVER_VERSION = '0.3.0';
 /** Response format enum */
 export var ResponseFormat;
 (function (ResponseFormat) {
package/dist/schemas/tier2.d.ts
CHANGED
@@ -60,9 +60,24 @@ export declare const InspectSignaturesSchema: z.ZodObject<{
     file_path: string;
     response_format?: import("../constants.js").ResponseFormat | undefined;
 }>;
+/** extract_tables — Tagged PDF Table → Markdown */
+export declare const ExtractTablesSchema: z.ZodObject<{
+    file_path: z.ZodString;
+    pages: z.ZodOptional<z.ZodString>;
+    response_format: z.ZodDefault<z.ZodNativeEnum<typeof import("../constants.js").ResponseFormat>>;
+}, "strict", z.ZodTypeAny, {
+    file_path: string;
+    response_format: import("../constants.js").ResponseFormat;
+    pages?: string | undefined;
+}, {
+    file_path: string;
+    response_format?: import("../constants.js").ResponseFormat | undefined;
+    pages?: string | undefined;
+}>;
 export type InspectStructureInput = z.infer<typeof InspectStructureSchema>;
 export type InspectTagsInput = z.infer<typeof InspectTagsSchema>;
 export type InspectFontsInput = z.infer<typeof InspectFontsSchema>;
 export type InspectAnnotationsInput = z.infer<typeof InspectAnnotationsSchema>;
 export type InspectSignaturesInput = z.infer<typeof InspectSignaturesSchema>;
+export type ExtractTablesInput = z.infer<typeof ExtractTablesSchema>;
 //# sourceMappingURL=tier2.d.ts.map
package/dist/schemas/tier2.js
CHANGED
@@ -39,4 +39,12 @@ export const InspectSignaturesSchema = z
     response_format: ResponseFormatSchema,
 })
     .strict();
+/** extract_tables — Tagged PDF Table → Markdown */
+export const ExtractTablesSchema = z
+    .object({
+    file_path: FilePathSchema,
+    pages: PagesSchema,
+    response_format: ResponseFormatSchema,
+})
+    .strict();
 //# sourceMappingURL=tier2.js.map
package/dist/services/pdfjs-service.d.ts
CHANGED
@@ -4,7 +4,7 @@
  * Centralizes all pdfjs-dist interactions for reuse across tools.
  */
 import { type PDFDocumentProxy } from 'pdfjs-dist/legacy/build/pdf.mjs';
-import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TagsAnalysis } from '../types.js';
+import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TablesExtractionResult, TagsAnalysis } from '../types.js';
 /**
  * Load a PDF document from a file path.
  */
@@ -58,6 +58,27 @@ export declare function analyzeTagsFromDoc(doc: PDFDocumentProxy): Promise<TagsA
  * Analyze Tagged PDF structure tree.
  */
 export declare function analyzeTags(filePath: string): Promise<TagsAnalysis>;
+/**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ *   by removing single ASCII spaces between two CJK characters.
+ */
+export declare function extractTablesFromDoc(doc: PDFDocumentProxy, pages?: string): Promise<TablesExtractionResult>;
+export declare function extractTables(filePath: string, pages?: string): Promise<TablesExtractionResult>;
 /**
  * Analyze annotations across all pages.
  */
package/dist/services/pdfjs-service.js
CHANGED
@@ -391,6 +391,221 @@ export async function analyzeTags(filePath) {
         await doc.destroy();
     }
 }
+// ─── extract_tables ────────────────────────────────────────────────────────
+/**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ *   by removing single ASCII spaces between two CJK characters.
+ */
+export async function extractTablesFromDoc(doc, pages) {
+    const markInfo = await getMarkInfo(doc);
+    const isTagged = markInfo?.Marked === true;
+    if (!isTagged) {
+        return {
+            isTagged: false,
+            tables: [],
+            totalTables: 0,
+            pagesScanned: 0,
+            note: 'Document is not tagged. extract_tables relies on /MarkInfo /Marked true ' +
+                'and a StructTree. For untagged two-column PDFs, fall back to a ' +
+                'column-aware reading strategy (see pdf-reader-mcp Issue #3).',
+        };
+    }
+    const pageNumbers = resolvePageNumbers(pages, doc.numPages);
+    const perPage = await Promise.all(pageNumbers.map(async (pageNum) => {
+        const page = await doc.getPage(pageNum);
+        try {
+            const [tree, textContent] = await Promise.all([
+                page.getStructTree(),
+                page.getTextContent({ includeMarkedContent: true }),
+            ]);
+            if (!tree)
+                return [];
+            const idToText = buildIdToTextMap(textContent.items);
+            const tables = [];
+            collectTables(tree, pageNum, idToText, tables);
+            return tables;
+        }
+        catch {
+            return [];
+        }
+    }));
+    const tables = perPage.flat();
+    return {
+        isTagged: true,
+        tables,
+        totalTables: tables.length,
+        pagesScanned: pageNumbers.length,
+    };
+}
+export async function extractTables(filePath, pages) {
+    const doc = await loadDocument(filePath);
+    try {
+        return await extractTablesFromDoc(doc, pages);
+    }
+    finally {
+        await doc.destroy();
+    }
+}
+/**
+ * Build a map from a marked-content `id` (e.g. `p715R_mc4`) to the concatenated
+ * raw text inside the corresponding `beginMarkedContentProps`/`endMarkedContent`
+ * pair. Nested marked content is supported via a stack — text counts toward
+ * every active id (so a `<Span>` inside a `<P>` contributes to both).
+ *
+ * Items with `tag === 'Artifact'` are page-level artifacts (page numbers,
+ * running headers, etc.) outside the structure tree, and are skipped.
+ */
+function buildIdToTextMap(items) {
+    const map = new Map();
+    const stack = [];
+    for (const item of items) {
+        const t = item.type;
+        if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
+            const isArtifact = item.tag === 'Artifact';
+            const id = item.id ?? null;
+            stack.push({ id, isArtifact });
+            continue;
+        }
+        if (t === 'endMarkedContent') {
+            stack.pop();
+            continue;
+        }
+        if (t !== undefined)
+            continue; // unknown marker
+        // Text item
+        if (stack.some((s) => s.isArtifact))
+            continue;
+        const str = item.hasEOL ? ' ' : (item.str ?? '');
+        if (!str)
+            continue;
+        for (const frame of stack) {
+            if (frame.id) {
+                const buf = map.get(frame.id);
+                if (buf)
+                    buf.push(str);
+                else
+                    map.set(frame.id, [str]);
+            }
+        }
+    }
+    const out = new Map();
+    for (const [id, parts] of map)
+        out.set(id, parts.join(''));
+    return out;
+}
+/** Walk the StructTree and append every `<Table>` subtree as an ExtractedTable. */
+function collectTables(node, pageNum, idToText, out) {
+    if (node.type === 'content')
+        return;
+    if (node.role === 'Table') {
+        const headerRows = [];
+        const bodyRows = [];
+        const footerRows = [];
+        for (const child of node.children ?? []) {
+            if (child.type === 'content')
+                continue;
+            if (child.role === 'THead') {
+                appendTableRowsFromSection(child, idToText, headerRows);
+            }
+            else if (child.role === 'TBody') {
+                appendTableRowsFromSection(child, idToText, bodyRows);
+            }
+            else if (child.role === 'TFoot') {
+                appendTableRowsFromSection(child, idToText, footerRows);
+            }
+            else if (child.role === 'TR') {
+                // Tables sometimes omit THead/TBody and place TRs directly under <Table>.
+                const row = buildRowFromTR(child, idToText);
+                if (row)
+                    bodyRows.push(row);
+            }
+        }
+        // `out` is a per-page accumulator passed in by the caller, so
+        // `out.length + 1` is the next index within this page (1-based).
+        out.push({
+            page: pageNum,
+            index: out.length + 1,
+            headerRows,
+            bodyRows,
+            footerRows,
+        });
+        return; // Don't recurse into a Table — nested tables are uncommon and
+        // would confuse the per-page index. (Add nested-table support later.)
+    }
+    for (const child of node.children ?? []) {
+        collectTables(child, pageNum, idToText, out);
+    }
+}
+function appendTableRowsFromSection(section, idToText, out) {
+    for (const child of section.children ?? []) {
+        if (child.type === 'content')
+            continue;
+        if (child.role === 'TR') {
+            const row = buildRowFromTR(child, idToText);
+            if (row)
+                out.push(row);
+        }
+    }
+}
+function buildRowFromTR(tr, idToText) {
+    const cells = [];
+    for (const child of tr.children ?? []) {
+        if (child.type === 'content')
+            continue;
+        if (child.role === 'TH' || child.role === 'TD') {
+            const text = compactCellText(collectTextUnder(child, idToText));
+            cells.push({ text, isHeader: child.role === 'TH' });
+        }
+    }
+    return cells.length === 0 ? null : { cells };
+}
+function collectTextUnder(node, idToText) {
+    if (node.type === 'content') {
+        return node.id ? (idToText.get(node.id) ?? '') : '';
+    }
+    const parts = [];
+    for (const child of node.children ?? []) {
+        parts.push(collectTextUnder(child, idToText));
+    }
+    return parts.join(' ');
+}
+/**
+ * Normalise raw cell text:
+ * 1. Collapse any whitespace run (`\s` + U+3000) to a single ASCII space.
+ * 2. Fold per-character kerning runs between CJK characters
+ *    (e.g. "消 費 税 法" → "消費税法") — but only when at least three
+ *    single CJK chars are separated by single spaces in a row, so that
+ *    natural inter-word spacing like "事業者 法人番号" is preserved.
+ * 3. Trim and Markdown-escape pipes / newlines.
+ */
+function compactCellText(s) {
+    if (!s)
+        return '';
+    // Step 1: collapse whitespace runs (incl. U+3000) to one ASCII space.
+    let t = s.replace(/[\s　]+/g, ' ').trim();
+    // Step 2: fold runs of `CJK + space` repeated at least twice followed by
+    // a final CJK char. Anything shorter is treated as a real word boundary.
+    const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
+    const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
+    t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
+    // Step 3: escape Markdown table delimiters.
+    return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
+}
 /**
  * Analyze annotations across all pages.
  */
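The marked-content mapping that `buildIdToTextMap` performs can be exercised on synthetic `getTextContent({ includeMarkedContent: true })` items. This slim restatement of the function from the diff above (item shapes abbreviated from pdfjs-dist's output) shows the two behaviors the helper's doc comment promises: text accumulates under every active marked-content id, and `Artifact` content such as running page numbers is dropped.

```typescript
// Slim restatement of buildIdToTextMap from dist/services/pdfjs-service.js,
// run against hand-built items shaped like pdfjs-dist marked-content output.
interface MarkedItem { type?: string; tag?: string; id?: string; str?: string; hasEOL?: boolean; }

function buildIdToTextMap(items: MarkedItem[]): Map<string, string> {
  const map = new Map<string, string[]>();
  const stack: { id: string | null; isArtifact: boolean }[] = [];
  for (const item of items) {
    const t = item.type;
    if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
      stack.push({ id: item.id ?? null, isArtifact: item.tag === 'Artifact' });
      continue;
    }
    if (t === 'endMarkedContent') { stack.pop(); continue; }
    if (t !== undefined) continue;                 // unknown marker kinds
    if (stack.some((s) => s.isArtifact)) continue; // skip Artifact content
    const str = item.hasEOL ? ' ' : (item.str ?? '');
    if (!str) continue;
    for (const frame of stack) {                   // credit every active id
      if (frame.id) {
        const buf = map.get(frame.id);
        if (buf) buf.push(str); else map.set(frame.id, [str]);
      }
    }
  }
  const out = new Map<string, string>();
  for (const [id, parts] of map) out.set(id, parts.join(''));
  return out;
}

const items: MarkedItem[] = [
  { type: 'beginMarkedContentProps', tag: 'P', id: 'p1R_mc0' },
  { str: '改正後' },
  { type: 'endMarkedContent' },
  { type: 'beginMarkedContent', tag: 'Artifact' },
  { str: '- 12 -' },   // running page number, outside the structure tree
  { type: 'endMarkedContent' },
];
const idToText = buildIdToTextMap(items);
console.log(idToText.get('p1R_mc0'));  // 改正後
```

A cell's leaf nodes then look up their `id` in this map, which is how `<TH>`/`<TD>` text survives even when the page's raw reading order interleaves both columns.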