@shuji-bonji/pdf-reader-mcp 0.2.2 → 0.3.0
- package/CHANGELOG.md +36 -0
- package/README.ja.md +30 -12
- package/README.md +30 -12
- package/dist/constants.d.ts +1 -1
- package/dist/constants.js +1 -1
- package/dist/schemas/tier2.d.ts +15 -0
- package/dist/schemas/tier2.d.ts.map +1 -1
- package/dist/schemas/tier2.js +8 -0
- package/dist/schemas/tier2.js.map +1 -1
- package/dist/services/pdfjs-service.d.ts +22 -1
- package/dist/services/pdfjs-service.d.ts.map +1 -1
- package/dist/services/pdfjs-service.js +215 -0
- package/dist/services/pdfjs-service.js.map +1 -1
- package/dist/services/pdflib-service.d.ts +21 -0
- package/dist/services/pdflib-service.d.ts.map +1 -1
- package/dist/services/pdflib-service.js +128 -24
- package/dist/services/pdflib-service.js.map +1 -1
- package/dist/tools/index.d.ts.map +1 -1
- package/dist/tools/index.js +2 -0
- package/dist/tools/index.js.map +1 -1
- package/dist/tools/tier2/extract-tables.d.ts +11 -0
- package/dist/tools/tier2/extract-tables.d.ts.map +1 -0
- package/dist/tools/tier2/extract-tables.js +66 -0
- package/dist/tools/tier2/extract-tables.js.map +1 -0
- package/dist/tools/tier2/inspect-fonts.d.ts.map +1 -1
- package/dist/tools/tier2/inspect-fonts.js +1 -0
- package/dist/tools/tier2/inspect-fonts.js.map +1 -1
- package/dist/types.d.ts +54 -0
- package/dist/types.d.ts.map +1 -1
- package/dist/utils/formatter.d.ts +8 -1
- package/dist/utils/formatter.d.ts.map +1 -1
- package/dist/utils/formatter.js +61 -0
- package/dist/utils/formatter.js.map +1 -1
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -5,6 +5,40 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.3.0] - 2026-05-06
+
+### Added
+
+- **`extract_tables` (Tier 2)**: New tool that walks a Tagged PDF's structure tree and emits every `<Table>` subtree as a Markdown table or a JSON document. Designed for documents whose meaning depends on multi-column layout — e.g. Japan's National Tax Agency (国税庁) old/new comparison tables (新旧対照表) in amendment circulars (kaisei tsutatsu), where reading-order extraction merges the 改正後 (after amendment) / 改正前 (before amendment) columns into ambiguous text. Internals:
+  - `extractTables(filePath, pages?)` / `extractTablesFromDoc(doc, pages?)` in `services/pdfjs-service.ts`. Walks the StructTree, identifies `<Table>` → `<THead> | <TBody> | <TFoot>` → `<TR>` → `<TH> | <TD>`, then resolves cell text by mapping each leaf node's marked-content `id` (e.g. `p715R_mc4`) to the corresponding `beginMarkedContentProps` boundary in `getTextContent({ includeMarkedContent: true })`.
+  - Cell text post-processing: collapses whitespace runs (incl. U+3000), folds per-character kerning runs ("消 費 税 法" → "消費税法") while preserving natural inter-word spacing ("事業者 法人番号"), and escapes Markdown table delimiters.
+  - Untagged PDFs return `isTagged: false`, an empty `tables` array, and a `note` recommending column-aware extraction (planned in Issue #3) as the fallback.
+  - colspan / rowspan and nested tables are skipped in this initial release; cells appear in source order.
+- **Types**: `TableCell`, `TableRow`, `ExtractedTable`, `TablesExtractionResult` in `types.ts`.
+- **Schema**: `ExtractTablesSchema` in `schemas/tier2.ts` (file_path + pages + response_format).
+- **Markdown formatter**: `formatTablesMarkdown` in `utils/formatter.ts` renders results as an `# Extracted Tables` summary block followed by `## Page N — Table M` GFM tables.
+- **E2E tests**: 5 new tests in `tests/e2e/04-tier2-structure.test.ts` covering the untagged → note path, the tagged-but-empty path, formatter shape, and the pages filter.
+
+### Changed
+
+- **Tool count**: 15 → 16 tools (Tier 2 now has 6).
+- README / README.ja.md tool tables and architecture diagram updated accordingly.
+
+## [0.2.3] - 2026-05-06
+
+### Fixed
+
+- **`inspect_structure` / `inspect_fonts`**: No longer throw `Expected instance of PDFDict, but got instance of undefined` on Linearized PDFs whose cross-reference table cannot be fully resolved by pdf-lib. Public-sector publishers (e.g. Japanese government agencies) routinely emit Linearized PDFs from Microsoft Word; previously every such file produced a hard error. After the fix:
+  - `analyzeStructure` returns a partial result. The page count is filled in via pdfjs-dist when pdf-lib's page-tree traversal fails, and a `note` field describes the fallback.
+  - `analyzeFontsWithPdfLib` returns an empty font map with an explanatory `note` instead of throwing.
+  - All pdf-lib call sites (`loadWithPdfLib`, `analyzeStructure`, `analyzeFontsWithPdfLib`, `analyzeSignatures`, `detectEncryption`) now run inside `withSuppressedPdfLibLogs`, preventing pdf-lib's `console.log`-based parse diagnostics from polluting the JSON-RPC stream when the server runs over stdio.
+- **Types**: `StructureAnalysis` and `FontsAnalysis` gained an optional `note?: string` field. `FontAnalysisResult` (internal) gained the same field.
+- **Markdown formatter**: `formatStructureMarkdown` and `formatFontsMarkdown` now render the `note` as a `> ...` blockquote when present.
+
+### Added
+
+- **Test fixture**: `tests/fixtures/linearized.pdf` (regenerated by `create-test-pdf.ts` via `qpdf --linearize`) and matching e2e regression tests in `tests/e2e/04-tier2-structure.test.ts`.
+
 ## [0.2.2] - 2026-04-18
 
 ### Fixed
@@ -57,6 +91,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Y-coordinate-based text extraction preserving natural reading order
 - Unit tests for core utilities and pdfjs-service
 
+[0.3.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.3...v0.3.0
+[0.2.3]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.2...v0.2.3
 [0.2.2]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.1...v0.2.2
 [0.2.1]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.2.0...v0.2.1
 [0.2.0]: https://github.com/shuji-bonji/pdf-reader-mcp/compare/v0.1.0...v0.2.0
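The cell-text post-processing described above (whitespace collapse, kerning-run folding, delimiter escaping) can be sketched as follows. The logic mirrors the `compactCellText` helper shipped in `dist/services/pdfjs-service.js` later in this diff; the standalone function here is only an illustration.

```typescript
// Sketch of the cell-text normalization described in the 0.3.0 changelog entry.
// Mirrors compactCellText from dist/services/pdfjs-service.js.
function compactCellText(s: string): string {
  if (!s) return '';
  // 1. Collapse whitespace runs (including U+3000 fullwidth space) to one space.
  let t = s.replace(/[\s\u3000]+/g, ' ').trim();
  // 2. Fold kerning runs: three or more single CJK chars separated by single
  //    spaces are joined; a lone space stays, preserving real word boundaries.
  const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
  const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
  t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
  // 3. Escape Markdown table delimiters.
  return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
}

console.log(compactCellText('消 費 税 法'));     // kerning run folded: 消費税法
console.log(compactCellText('事業者 法人番号')); // single space kept: 事業者 法人番号
```

Note that the fold only fires on runs of at least three space-separated CJK characters, which is why the two-word example keeps its space.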
package/README.ja.md
CHANGED
@@ -12,7 +12,7 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
 
 ## 機能
 
-**
+**16 ツール** を 3 層構成で提供します。
 
 ### Tier 1: 基本機能
 
@@ -28,13 +28,14 @@ PDF 内部構造解析に特化した MCP (Model Context Protocol) サーバー
 
 ### Tier 2: 構造解析
 
-| ツール | 説明
-| --------------------- |
-| `inspect_structure` | オブジェクトツリー・カタログ辞書の解析
-| `inspect_tags` | Tagged PDF のタグツリー可視化
-| `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定)
-| `inspect_annotations` | 注釈一覧(タイプ別分類)
-| `inspect_signatures` | 電子署名フィールドの構造解析
+| ツール | 説明 |
+| --------------------- | -------------------------------------------------------- |
+| `inspect_structure` | オブジェクトツリー・カタログ辞書の解析 |
+| `inspect_tags` | Tagged PDF のタグツリー可視化 |
+| `inspect_fonts` | フォント一覧(埋め込み/サブセット/Type判定) |
+| `inspect_annotations` | 注釈一覧(タイプ別分類) |
+| `inspect_signatures` | 電子署名フィールドの構造解析 |
+| `extract_tables` | Tagged PDF の `<Table>` を Markdown テーブルとして抽出 |
 
 ### Tier 3: 検証・分析
 
@@ -145,12 +146,29 @@ compare_structure({
 | Tagged | true | true | ✅ |
 ```
 
+### Tagged PDF からテーブル抽出
+
+```
+extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
+→ # Extracted Tables
+- **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
+
+## Page 1 — Table 1
+
+| 改正後 | 改正前 |
+| --- | --- |
+| …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
+```
+
+タグ無し PDF では空結果と note を返し、column-aware 抽出(ロードマップ Issue #3)への
+フォールバックを推奨します。
+
 ## 技術スタック
 
 - **TypeScript** + MCP TypeScript SDK
 - **pdfjs-dist** (Mozilla) — テキスト/画像抽出、タグツリー、注釈
 - **pdf-lib** — 低レベルオブジェクト構造解析
-- **Vitest** — Unit + E2E テスト(
+- **Vitest** — Unit + E2E テスト(164 tests)
 - **Biome** — lint + format
 - **Zod** — 入力バリデーション
 
@@ -158,7 +176,7 @@ compare_structure({
 
 ```bash
 npm test # 全テスト実行(Unit: 39 tests)
-npm run test:e2e # E2E のみ(
+npm run test:e2e # E2E のみ(125 tests)
 npm run test:watch # ウォッチモード
 ```
 
@@ -172,7 +190,7 @@ pdf-reader-mcp/
 │ ├── types.ts # 型定義
 │ ├── tools/
 │ │ ├── tier1/ # 基本ツール (7)
-│ │ ├── tier2/ # 構造解析 (
+│ │ ├── tier2/ # 構造解析 (6)
 │ │ ├── tier3/ # 検証・分析 (3)
 │ │ └── index.ts # ツール登録
 │ ├── services/
@@ -188,7 +206,7 @@ pdf-reader-mcp/
 │ └── error-handler.ts # エラーハンドリング
 └── tests/
 ├── tier1/ # Unit tests
-└── e2e/ # E2E tests (9 suites,
+└── e2e/ # E2E tests (9 suites, 125 tests)
 ```
 
 ## pdf-spec-mcp との連携
package/README.md
CHANGED
@@ -12,7 +12,7 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
 
 ## Features
 
-**
+**16 tools** organized into three tiers:
 
 ### Tier 1: Basic Operations
 
@@ -28,13 +28,14 @@ While typical PDF MCP servers are thin wrappers for text extraction, this projec
 
 ### Tier 2: Structure Inspection
 
-| Tool | Description
-| --------------------- |
-| `inspect_structure` | Object tree and catalog dictionary analysis
-| `inspect_tags` | Tagged PDF structure tree visualization
-| `inspect_fonts` | Font inventory (embedded/subset/type detection)
-| `inspect_annotations` | Annotation listing (categorized by subtype)
-| `inspect_signatures` | Digital signature field structure analysis
+| Tool | Description |
+| --------------------- | ------------------------------------------------------------------- |
+| `inspect_structure` | Object tree and catalog dictionary analysis |
+| `inspect_tags` | Tagged PDF structure tree visualization |
+| `inspect_fonts` | Font inventory (embedded/subset/type detection) |
+| `inspect_annotations` | Annotation listing (categorized by subtype) |
+| `inspect_signatures` | Digital signature field structure analysis |
+| `extract_tables` | Tagged PDF `<Table>` subtree → Markdown table (preserves columns) |
 
 ### Tier 3: Validation & Analysis
 
@@ -145,12 +146,29 @@ compare_structure({
 | Tagged | true | true | ✅ |
 ```
 
+### Extract Tables (Tagged PDF)
+
+```
+extract_tables({ file_path: "/path/to/kaisei-tsutatsu.pdf", pages: "1" })
+→ # Extracted Tables
+- **Tagged**: Yes / **Pages Scanned**: 1 / **Tables Found**: 1
+
+## Page 1 — Table 1
+
+| 改正後 | 改正前 |
+| --- | --- |
+| …第2条第 16 項《定義》… | …第2条第 15 項《定義》… |
+```
+
+Untagged PDFs return an empty result with a `note` recommending column-aware
+fallback (see roadmap Issue #3).
+
 ## Tech Stack
 
 - **TypeScript** + MCP TypeScript SDK
 - **pdfjs-dist** (Mozilla) — text/image extraction, tag tree, annotations
 - **pdf-lib** — low-level object structure analysis
-- **Vitest** — unit + E2E testing (
+- **Vitest** — unit + E2E testing (164 tests)
 - **Biome** — linting + formatting
 - **Zod** — input validation
 
@@ -158,7 +176,7 @@ compare_structure({
 
 ```bash
 npm test # Run all tests (unit: 39 tests)
-npm run test:e2e # E2E tests only (
+npm run test:e2e # E2E tests only (125 tests)
 npm run test:watch # Watch mode
 ```
 
@@ -172,7 +190,7 @@ pdf-reader-mcp/
 │ ├── types.ts # Type definitions
 │ ├── tools/
 │ │ ├── tier1/ # Basic tools (7)
-│ │ ├── tier2/ # Structure inspection (
+│ │ ├── tier2/ # Structure inspection (6)
 │ │ ├── tier3/ # Validation & analysis (3)
 │ │ └── index.ts # Tool registration
 │ ├── services/
@@ -188,7 +206,7 @@ pdf-reader-mcp/
 │ └── error-handler.ts # Error handling
 └── tests/
 ├── tier1/ # Unit tests
-└── e2e/ # E2E tests (9 suites,
+└── e2e/ # E2E tests (9 suites, 125 tests)
 ```
 
 ## Pairing with pdf-spec-mcp
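The `formatTablesMarkdown` output shown in the README example above can be approximated with a small renderer. This is a hypothetical sketch, not the package's actual formatter; the `TableCell`/`TableRow` shapes are simplified from what `types.ts` declares, and `renderTable` is an illustrative name.

```typescript
// Hypothetical sketch of rendering one extracted table as a GFM block in the
// "## Page N — Table M" style the README shows. Shapes simplified from types.ts.
interface TableCell { text: string; isHeader: boolean; }
interface TableRow { cells: TableCell[]; }

function renderTable(page: number, index: number, header: TableRow, body: TableRow[]): string {
  // One GFM row per TableRow; cell text is assumed already pipe-escaped.
  const line = (row: TableRow) => `| ${row.cells.map((c) => c.text).join(' | ')} |`;
  const sep = `| ${header.cells.map(() => '---').join(' | ')} |`;
  return [`## Page ${page} — Table ${index}`, '', line(header), sep, ...body.map(line)].join('\n');
}

const md = renderTable(
  1, 1,
  { cells: [{ text: '改正後', isHeader: true }, { text: '改正前', isHeader: true }] },
  [{ cells: [{ text: '…第2条…', isHeader: false }, { text: '…第2条…', isHeader: false }] }],
);
console.log(md);
```

The separator row is derived from the header's cell count, which is why the real tool skips colspan/rowspan in this release: a spanned cell would break the one-delimiter-per-column assumption.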
package/dist/constants.d.ts
CHANGED
@@ -13,7 +13,7 @@ export declare const MAX_SEARCH_RESULTS = 100;
 export declare const DEFAULT_SEARCH_CONTEXT = 80;
 /** Server info */
 export declare const SERVER_NAME = "pdf-reader-mcp";
-export declare const SERVER_VERSION = "0.
+export declare const SERVER_VERSION = "0.3.0";
 /** Response format enum */
 export declare enum ResponseFormat {
     MARKDOWN = "markdown",
package/dist/constants.js
CHANGED
@@ -13,7 +13,7 @@ export const MAX_SEARCH_RESULTS = 100;
 export const DEFAULT_SEARCH_CONTEXT = 80;
 /** Server info */
 export const SERVER_NAME = 'pdf-reader-mcp';
-export const SERVER_VERSION = '0.
+export const SERVER_VERSION = '0.3.0';
 /** Response format enum */
 export var ResponseFormat;
 (function (ResponseFormat) {
package/dist/schemas/tier2.d.ts
CHANGED
@@ -60,9 +60,24 @@ export declare const InspectSignaturesSchema: z.ZodObject<{
     file_path: string;
     response_format?: import("../constants.js").ResponseFormat | undefined;
 }>;
+/** extract_tables — Tagged PDF Table → Markdown */
+export declare const ExtractTablesSchema: z.ZodObject<{
+    file_path: z.ZodString;
+    pages: z.ZodOptional<z.ZodString>;
+    response_format: z.ZodDefault<z.ZodNativeEnum<typeof import("../constants.js").ResponseFormat>>;
+}, "strict", z.ZodTypeAny, {
+    file_path: string;
+    response_format: import("../constants.js").ResponseFormat;
+    pages?: string | undefined;
+}, {
+    file_path: string;
+    response_format?: import("../constants.js").ResponseFormat | undefined;
+    pages?: string | undefined;
+}>;
 export type InspectStructureInput = z.infer<typeof InspectStructureSchema>;
 export type InspectTagsInput = z.infer<typeof InspectTagsSchema>;
 export type InspectFontsInput = z.infer<typeof InspectFontsSchema>;
 export type InspectAnnotationsInput = z.infer<typeof InspectAnnotationsSchema>;
 export type InspectSignaturesInput = z.infer<typeof InspectSignaturesSchema>;
+export type ExtractTablesInput = z.infer<typeof ExtractTablesSchema>;
 //# sourceMappingURL=tier2.d.ts.map
package/dist/schemas/tier2.js
CHANGED
@@ -39,4 +39,12 @@ export const InspectSignaturesSchema = z
     response_format: ResponseFormatSchema,
 })
     .strict();
+/** extract_tables — Tagged PDF Table → Markdown */
+export const ExtractTablesSchema = z
+    .object({
+    file_path: FilePathSchema,
+    pages: PagesSchema,
+    response_format: ResponseFormatSchema,
+})
+    .strict();
 //# sourceMappingURL=tier2.js.map
package/dist/services/pdfjs-service.d.ts
CHANGED
@@ -4,7 +4,7 @@
  * Centralizes all pdfjs-dist interactions for reuse across tools.
  */
 import { type PDFDocumentProxy } from 'pdfjs-dist/legacy/build/pdf.mjs';
-import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TagsAnalysis } from '../types.js';
+import type { AnnotationsAnalysis, ImageExtractionResult, PageText, PdfMetadata, SearchMatch, TablesExtractionResult, TagsAnalysis } from '../types.js';
 /**
  * Load a PDF document from a file path.
  */
@@ -58,6 +58,27 @@ export declare function analyzeTagsFromDoc(doc: PDFDocumentProxy): Promise<TagsA
  * Analyze Tagged PDF structure tree.
  */
 export declare function analyzeTags(filePath: string): Promise<TagsAnalysis>;
+/**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ *   by removing single ASCII spaces between two CJK characters.
+ */
+export declare function extractTablesFromDoc(doc: PDFDocumentProxy, pages?: string): Promise<TablesExtractionResult>;
+export declare function extractTables(filePath: string, pages?: string): Promise<TablesExtractionResult>;
 /**
  * Analyze annotations across all pages.
  */
package/dist/services/pdfjs-service.js
CHANGED
@@ -391,6 +391,221 @@ export async function analyzeTags(filePath) {
         await doc.destroy();
     }
 }
+// ─── extract_tables ────────────────────────────────────────────────────────
+/**
+ * Extract tables from a Tagged PDF as structured rows/cells.
+ *
+ * The strategy is: for each page, walk the StructTree, identify `<Table>`
+ * subtrees, then walk down `<THead>/<TBody>/<TFoot>` → `<TR>` → `<TH>/<TD>`.
+ * Cell text is reconstructed by mapping each Span/P/Lbl/LBody leaf node's
+ * `id` (e.g. `p715R_mc4`) onto the corresponding `beginMarkedContentProps`
+ * boundary in `getTextContent({ includeMarkedContent: true })`.
+ *
+ * Untagged PDFs return `isTagged: false`, an empty `tables` array, and a
+ * `note` recommending the column-aware extraction (planned in a future
+ * release) as the fallback for two-column layouts without a structure tree.
+ *
+ * Cell text is post-processed:
+ * - Newlines (`hasEOL`) become single spaces.
+ * - Repeated whitespace runs (including U+3000 fullwidth space) collapse to one.
+ * - Per-character kerning spaces (e.g. `"消 費 税 法"`) are folded
+ *   by removing single ASCII spaces between two CJK characters.
+ */
+export async function extractTablesFromDoc(doc, pages) {
+    const markInfo = await getMarkInfo(doc);
+    const isTagged = markInfo?.Marked === true;
+    if (!isTagged) {
+        return {
+            isTagged: false,
+            tables: [],
+            totalTables: 0,
+            pagesScanned: 0,
+            note: 'Document is not tagged. extract_tables relies on /MarkInfo /Marked true ' +
+                'and a StructTree. For untagged two-column PDFs, fall back to a ' +
+                'column-aware reading strategy (see pdf-reader-mcp Issue #3).',
+        };
+    }
+    const pageNumbers = resolvePageNumbers(pages, doc.numPages);
+    const perPage = await Promise.all(pageNumbers.map(async (pageNum) => {
+        const page = await doc.getPage(pageNum);
+        try {
+            const [tree, textContent] = await Promise.all([
+                page.getStructTree(),
+                page.getTextContent({ includeMarkedContent: true }),
+            ]);
+            if (!tree)
+                return [];
+            const idToText = buildIdToTextMap(textContent.items);
+            const tables = [];
+            collectTables(tree, pageNum, idToText, tables);
+            return tables;
+        }
+        catch {
+            return [];
+        }
+    }));
+    const tables = perPage.flat();
+    return {
+        isTagged: true,
+        tables,
+        totalTables: tables.length,
+        pagesScanned: pageNumbers.length,
+    };
+}
+export async function extractTables(filePath, pages) {
+    const doc = await loadDocument(filePath);
+    try {
+        return await extractTablesFromDoc(doc, pages);
+    }
+    finally {
+        await doc.destroy();
+    }
+}
+/**
+ * Build a map from a marked-content `id` (e.g. `p715R_mc4`) to the concatenated
+ * raw text inside the corresponding `beginMarkedContentProps`/`endMarkedContent`
+ * pair. Nested marked content is supported via a stack — text counts toward
+ * every active id (so a `<Span>` inside a `<P>` contributes to both).
+ *
+ * Items with `tag === 'Artifact'` are page-level artifacts (page numbers,
+ * running headers, etc.) outside the structure tree, and are skipped.
+ */
+function buildIdToTextMap(items) {
+    const map = new Map();
+    const stack = [];
+    for (const item of items) {
+        const t = item.type;
+        if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
+            const isArtifact = item.tag === 'Artifact';
+            const id = item.id ?? null;
+            stack.push({ id, isArtifact });
+            continue;
+        }
+        if (t === 'endMarkedContent') {
+            stack.pop();
+            continue;
+        }
+        if (t !== undefined)
+            continue; // unknown marker
+        // Text item
+        if (stack.some((s) => s.isArtifact))
+            continue;
+        const str = item.hasEOL ? ' ' : (item.str ?? '');
+        if (!str)
+            continue;
+        for (const frame of stack) {
+            if (frame.id) {
+                const buf = map.get(frame.id);
+                if (buf)
+                    buf.push(str);
+                else
+                    map.set(frame.id, [str]);
+            }
+        }
+    }
+    const out = new Map();
+    for (const [id, parts] of map)
+        out.set(id, parts.join(''));
+    return out;
+}
+/** Walk the StructTree and append every `<Table>` subtree as an ExtractedTable. */
+function collectTables(node, pageNum, idToText, out) {
+    if (node.type === 'content')
+        return;
+    if (node.role === 'Table') {
+        const headerRows = [];
+        const bodyRows = [];
+        const footerRows = [];
+        for (const child of node.children ?? []) {
+            if (child.type === 'content')
+                continue;
+            if (child.role === 'THead') {
+                appendTableRowsFromSection(child, idToText, headerRows);
+            }
+            else if (child.role === 'TBody') {
+                appendTableRowsFromSection(child, idToText, bodyRows);
+            }
+            else if (child.role === 'TFoot') {
+                appendTableRowsFromSection(child, idToText, footerRows);
+            }
+            else if (child.role === 'TR') {
+                // Tables sometimes omit THead/TBody and place TRs directly under <Table>.
+                const row = buildRowFromTR(child, idToText);
+                if (row)
+                    bodyRows.push(row);
+            }
+        }
+        // `out` is a per-page accumulator passed in by the caller, so
+        // `out.length + 1` is the next index within this page (1-based).
+        out.push({
+            page: pageNum,
+            index: out.length + 1,
+            headerRows,
+            bodyRows,
+            footerRows,
+        });
+        return; // Don't recurse into a Table — nested tables are uncommon and
+        // would confuse the per-page index. (Add nested-table support later.)
+    }
+    for (const child of node.children ?? []) {
+        collectTables(child, pageNum, idToText, out);
+    }
+}
+function appendTableRowsFromSection(section, idToText, out) {
+    for (const child of section.children ?? []) {
+        if (child.type === 'content')
+            continue;
+        if (child.role === 'TR') {
+            const row = buildRowFromTR(child, idToText);
+            if (row)
+                out.push(row);
+        }
+    }
+}
+function buildRowFromTR(tr, idToText) {
+    const cells = [];
+    for (const child of tr.children ?? []) {
+        if (child.type === 'content')
+            continue;
+        if (child.role === 'TH' || child.role === 'TD') {
+            const text = compactCellText(collectTextUnder(child, idToText));
+            cells.push({ text, isHeader: child.role === 'TH' });
+        }
+    }
+    return cells.length === 0 ? null : { cells };
+}
+function collectTextUnder(node, idToText) {
+    if (node.type === 'content') {
+        return node.id ? (idToText.get(node.id) ?? '') : '';
+    }
+    const parts = [];
+    for (const child of node.children ?? []) {
+        parts.push(collectTextUnder(child, idToText));
+    }
+    return parts.join(' ');
+}
+/**
+ * Normalise raw cell text:
+ * 1. Collapse any whitespace run (`\s` + U+3000) to a single ASCII space.
+ * 2. Fold per-character kerning runs between CJK characters
+ *    (e.g. "消 費 税 法" → "消費税法") — but only when at least three
+ *    single CJK chars are separated by single spaces in a row, so that
+ *    natural inter-word spacing like "事業者 法人番号" is preserved.
+ * 3. Trim and Markdown-escape pipes / newlines.
+ */
+function compactCellText(s) {
+    if (!s)
+        return '';
+    // Step 1: collapse whitespace runs (incl. U+3000) to one ASCII space.
+    let t = s.replace(/[\s　]+/g, ' ').trim();
+    // Step 2: fold runs of `CJK + space` repeated at least twice followed by
+    // a final CJK char. Anything shorter is treated as a real word boundary.
+    const cjk = '[\\u3040-\\u30ff\\u3400-\\u9fff\\uff00-\\uffef]';
+    const kerningRun = new RegExp(`(?:${cjk} ){2,}${cjk}`, 'g');
+    t = t.replace(kerningRun, (m) => m.replace(/ /g, ''));
+    // Step 3: escape Markdown table delimiters.
+    return t.replace(/\|/g, '\\|').replace(/\n/g, ' ');
+}
 /**
  * Analyze annotations across all pages.
  */
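The marked-content mapping that `buildIdToTextMap` performs can be exercised on synthetic `getTextContent({ includeMarkedContent: true })` items. This slim restatement of the function from the diff above (item shapes abbreviated from pdfjs-dist's output) shows the two behaviors the helper's doc comment promises: text accumulates under every active marked-content id, and `Artifact` content such as running page numbers is dropped.

```typescript
// Slim restatement of buildIdToTextMap from dist/services/pdfjs-service.js,
// run against hand-built items shaped like pdfjs-dist marked-content output.
interface MarkedItem { type?: string; tag?: string; id?: string; str?: string; hasEOL?: boolean; }

function buildIdToTextMap(items: MarkedItem[]): Map<string, string> {
  const map = new Map<string, string[]>();
  const stack: { id: string | null; isArtifact: boolean }[] = [];
  for (const item of items) {
    const t = item.type;
    if (t === 'beginMarkedContent' || t === 'beginMarkedContentProps') {
      stack.push({ id: item.id ?? null, isArtifact: item.tag === 'Artifact' });
      continue;
    }
    if (t === 'endMarkedContent') { stack.pop(); continue; }
    if (t !== undefined) continue;                 // unknown marker kinds
    if (stack.some((s) => s.isArtifact)) continue; // skip Artifact content
    const str = item.hasEOL ? ' ' : (item.str ?? '');
    if (!str) continue;
    for (const frame of stack) {                   // credit every active id
      if (frame.id) {
        const buf = map.get(frame.id);
        if (buf) buf.push(str); else map.set(frame.id, [str]);
      }
    }
  }
  const out = new Map<string, string>();
  for (const [id, parts] of map) out.set(id, parts.join(''));
  return out;
}

const items: MarkedItem[] = [
  { type: 'beginMarkedContentProps', tag: 'P', id: 'p1R_mc0' },
  { str: '改正後' },
  { type: 'endMarkedContent' },
  { type: 'beginMarkedContent', tag: 'Artifact' },
  { str: '- 12 -' },   // running page number, outside the structure tree
  { type: 'endMarkedContent' },
];
const idToText = buildIdToTextMap(items);
console.log(idToText.get('p1R_mc0'));  // 改正後
```

A cell's leaf nodes then look up their `id` in this map, which is how `<TH>`/`<TD>` text survives even when the page's raw reading order interleaves both columns.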