tech-book-extractor-skills 1.0.6 → 1.0.8
- package/README.md +44 -13
- package/bin/install.js +8 -0
- package/package.json +5 -2
- package/scripts/extract_book.py +107 -0
- package/scripts/extract_chapter.py +115 -0
- package/scripts/pdf_extract_utils.py +540 -0
- package/skills/{tech-book-stage1 → book-map}/SKILL.md +17 -2
- package/skills/{tech-book-stage2 → chapter-drill}/SKILL.md +16 -3
package/README.md
CHANGED
|
@@ -2,30 +2,61 @@
|
|
|
2
2
|
|
|
3
3
|
Claude Code skills: deep extraction from technical books, built as a two-stage pipeline.
|
|
4
4
|
|
|
5
|
-
##
|
|
5
|
+
## Installation
|
|
6
6
|
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
| Stage 2 | [stage2/skill-stage2-chapter-extractor.md](stage2/skill-stage2-chapter-extractor.md) | Single-chapter deep extraction: six-layer structure plus deep scaffolding for complexity hotspots |
|
|
7
|
+
```bash
|
|
8
|
+
npx tech-book-extractor-skills
|
|
9
|
+
```
|
|
11
10
|
|
|
12
|
-
|
|
11
|
+
**Update:**
|
|
13
12
|
|
|
14
13
|
```bash
|
|
15
|
-
npx tech-book-extractor-skills
|
|
14
|
+
npx tech-book-extractor-skills@latest
|
|
16
15
|
```
|
|
17
16
|
|
|
18
|
-
|
|
17
|
+
## Usage
|
|
18
|
+
|
|
19
|
+
### Daily workflow: just use `/chapter-drill`
|
|
20
|
+
|
|
21
|
+
```
|
|
22
|
+
/chapter-drill path/to/book.epub Chapter 3
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
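Done in one step. On first use, the knowledge skeleton is generated automatically in the background (transparently), and later chapters reuse it. There is no need to run `/book-map` manually.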
|
|
26
|
+
|
|
27
|
+
### Optional: `/book-map` to review the skeleton separately
|
|
19
28
|
|
|
20
|
-
|
|
29
|
+
If you want to adjust a chapter's weight, outdated-content flags, or reading route before extracting:
|
|
30
|
+
|
|
31
|
+
```
|
|
32
|
+
/book-map path/to/book.epub
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
Once the skeleton is generated, you can edit `stage1-skeleton.json` by hand; `/chapter-drill` then picks up your changes.
|
|
36
|
+
|
|
37
|
+
### Examples
|
|
21
38
|
|
|
22
39
|
```bash
|
|
23
|
-
|
|
40
|
+
# Drill directly; the skeleton is built automatically the first time
|
|
41
|
+
/chapter-drill ~/books/jvm.epub Chapter 2
|
|
42
|
+
|
|
43
|
+
# Keep drilling; the skeleton is reused
|
|
44
|
+
/chapter-drill ~/books/jvm.epub Chapter 3
|
|
45
|
+
|
|
46
|
+
# Run occasionally to review the skeleton contents
|
|
47
|
+
/book-map ~/books/jvm.epub
|
|
24
48
|
```
|
|
25
49
|
|
|
50
|
+
## Commands
|
|
51
|
+
|
|
52
|
+
| Command | Day-to-day frequency | Responsibility |
|
|
53
|
+
|------|------------|------|
|
|
54
|
+
| `/chapter-drill` | ⭐⭐⭐ Run every time | Single-chapter extraction plus automatic skeleton creation. Subsections with complexity=high get deep scaffolding |
|
|
55
|
+
| `/book-map` | ⭐ Run when reviewing | Explicitly rebuilds the whole-book skeleton, so you can adjust metadata by hand and improve extraction quality |
|
|
56
|
+
|
|
26
57
|
## Design
|
|
27
58
|
|
|
28
|
-
- **Stage 1
|
|
59
|
+
- **Stage 1 draws the map**: marks how knowledge value is distributed and focuses on high-weight chapters
|
|
29
60
|
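- **Stage 2 builds scaffolding**: complex subsections (such as ZGC colored pointers) are not distilled but broken down step by step: what it does → why it is needed → what happens otherwise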
|
|
30
|
-
-
|
|
31
|
-
- **一套模板 +
|
|
61
|
+
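- **Objective complexity detection**: subsection length, footnote density, and cross-chapter reference count → complexity is determined automatically rather than by the LLM's subjective judgment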
|
|
62
|
+
- **One template plus increments**: the six-layer format is uniform; complexity=high only appends a step-by-step breakdown to the mental-model layer
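The objective complexity check described in the list above could be sketched roughly as follows. This is only an illustration: the thresholds, the footnote marker pattern, the cross-reference pattern, and the names `score_subsection` and `ComplexityReport` are all assumptions, not taken from `complexity_scanner.py`, whose source is not part of this diff.

```python
import re
from dataclasses import dataclass

# Hypothetical thresholds; the real scanner may use different signals and cut-offs.
LONG_SECTION_CHARS = 6000
FOOTNOTES_PER_1K_CHARS = 1.5
MIN_CROSS_REFS = 3

# Assumption: footnotes appear as [1], [2], ... in the extracted text.
FOOTNOTE_RE = re.compile(r"\[\d+\]")
# Assumption: cross-chapter references look like "Chapter 5" or "第 5 章".
CROSS_REF_RE = re.compile(r"(?:Chapter|chapter)\s+\d+|第\s*\d+\s*章")


@dataclass
class ComplexityReport:
    length: int
    footnote_density: float  # footnote markers per 1000 characters
    cross_refs: int
    complexity: str          # "low" | "medium" | "high"


def score_subsection(text: str) -> ComplexityReport:
    """Derive a complexity label from countable signals only (no LLM involved)."""
    length = len(text)
    footnotes = len(FOOTNOTE_RE.findall(text))
    density = footnotes / (length / 1000) if length else 0.0
    cross_refs = len(CROSS_REF_RE.findall(text))

    # Count how many of the three signals cross their threshold.
    hits = sum([
        length >= LONG_SECTION_CHARS,
        density >= FOOTNOTES_PER_1K_CHARS,
        cross_refs >= MIN_CROSS_REFS,
    ])
    complexity = "high" if hits >= 2 else "medium" if hits == 1 else "low"
    return ComplexityReport(length, density, cross_refs, complexity)
```

Because every input is a count, the same text always yields the same label, which is the property the bullet above asks for.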
|
package/bin/install.js
CHANGED
|
@@ -39,4 +39,12 @@ const scannerDest = path.join(scriptsDest, "complexity_scanner.py");
|
|
|
39
39
|
fs.copyFileSync(scriptsSrc, scannerDest);
|
|
40
40
|
console.log(`✓ script: complexity_scanner.py → ${scannerDest}`);
|
|
41
41
|
|
|
42
|
+
// Copy the PDF text-extraction scripts
|
|
43
|
+
const scriptsDir = path.join(__dirname, "..", "scripts");
|
|
44
|
+
for (const pyScript of ["pdf_extract_utils.py", "extract_book.py", "extract_chapter.py"]) {
|
|
45
|
+
const dest = path.join(scriptsDest, pyScript);
|
|
46
|
+
fs.copyFileSync(path.join(scriptsDir, pyScript), dest);
|
|
47
|
+
console.log(`✓ script: ${pyScript} → ${dest}`);
|
|
48
|
+
}
|
|
49
|
+
|
|
42
50
|
console.log(`\n${skills.length} skill(s) installed.`);
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "tech-book-extractor-skills",
|
|
3
|
-
"version": "1.0.
|
|
3
|
+
"version": "1.0.8",
|
|
4
4
|
"description": "Claude Code skills for deep technical book reading — structure parsing (Stage 1) and chapter extraction (Stage 2).",
|
|
5
5
|
"bin": {
|
|
6
6
|
"tech-book-extractor-skills": "bin/install.js"
|
|
@@ -8,7 +8,10 @@
|
|
|
8
8
|
"files": [
|
|
9
9
|
"skills/",
|
|
10
10
|
"bin/",
|
|
11
|
-
"stage1/complexity_scanner.py"
|
|
11
|
+
"stage1/complexity_scanner.py",
|
|
12
|
+
"scripts/pdf_extract_utils.py",
|
|
13
|
+
"scripts/extract_book.py",
|
|
14
|
+
"scripts/extract_chapter.py"
|
|
12
15
|
],
|
|
13
16
|
"scripts": {
|
|
14
17
|
"install-skills": "node bin/install.js",
|
|
@@ -0,0 +1,107 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""
|
|
3
|
+
extract_book.py: whole-book text extraction script
|
|
4
|
+
=======================================
|
|
5
|
+
Extracts the full text of a book directly from the PDF text layer (no OCR or GPU needed) and writes a Markdown file.
|
|
6
|
+
The output can be fed directly to the book-map skill (Mode B).
|
|
7
|
+
|
|
8
|
+
How it works: PyMuPDF reads the text layer embedded in the PDF directly, just like pressing Ctrl+C in a PDF reader.
|
|
9
|
+
For scanned PDFs (empty text layer), the script detects this and warns.
|
|
10
|
+
|
|
11
|
+
Usage:
|
|
12
|
+
python extract_book.py ./深入理解Java虚拟机.pdf
|
|
13
|
+
python extract_book.py ./book.pdf --output ./note/book/
|
|
14
|
+
python extract_book.py ./book.pdf --max-pages 20 # for testing
|
|
15
|
+
python extract_book.py ./book.pdf --export-images # also export embedded images
|
|
16
|
+
|
|
17
|
+
Output:
|
|
18
|
+
{output_dir}/{book_name}/{book_name}-fulltext.md
|
|
19
|
+
{output_dir}/{book_name}/images/ # only with --export-images
|
|
20
|
+
"""
|
|
21
|
+
|
|
22
|
+
from __future__ import annotations
|
|
23
|
+
|
|
24
|
+
import argparse
|
|
25
|
+
import sys
|
|
26
|
+
from pathlib import Path
|
|
27
|
+
|
|
28
|
+
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
|
29
|
+
|
|
30
|
+
from pdf_extract_utils import (
|
|
31
|
+
pdf_page_count,
|
|
32
|
+
extract_pages,
|
|
33
|
+
save_results,
|
|
34
|
+
extract_book_name,
|
|
35
|
+
)
|
|
36
|
+
|
|
37
|
+
|
|
38
|
+
def main():
|
|
39
|
+
parser = argparse.ArgumentParser(
|
|
40
|
+
description="整本书文本提取 — 从 PDF 文字层直接提取,对接 book-map skill",
|
|
41
|
+
epilog="注意:仅适用于文字版 PDF(排版导出的),扫描件 PDF 请用 OCR 工具。",
|
|
42
|
+
)
|
|
43
|
+
parser.add_argument("pdf", help="PDF 文件路径")
|
|
44
|
+
parser.add_argument(
|
|
45
|
+
"--output", "-o", default="./note/book/",
|
|
46
|
+
help="输出根目录(默认: ./note/book/)",
|
|
47
|
+
)
|
|
48
|
+
parser.add_argument(
|
|
49
|
+
"--max-pages", type=int, default=0,
|
|
50
|
+
help="最多处理多少页(0=全部,用于测试)",
|
|
51
|
+
)
|
|
52
|
+
parser.add_argument(
|
|
53
|
+
"--start-page", type=int, default=1,
|
|
54
|
+
help="起始页码(1-based,默认: 1)",
|
|
55
|
+
)
|
|
56
|
+
parser.add_argument(
|
|
57
|
+
"--export-images", action="store_true",
|
|
58
|
+
help="同时导出 PDF 中的嵌入图片到 images/ 目录",
|
|
59
|
+
)
|
|
60
|
+
args = parser.parse_args()
|
|
61
|
+
|
|
62
|
+
pdf_path = Path(args.pdf)
|
|
63
|
+
if not pdf_path.exists():
|
|
64
|
+
sys.exit(f"❌ PDF 文件不存在: {pdf_path}")
|
|
65
|
+
|
|
66
|
+
# 书名 & 输出路径
|
|
67
|
+
book_name = extract_book_name(pdf_path.name)
|
|
68
|
+
output_dir = Path(args.output) / book_name
|
|
69
|
+
output_path = output_dir / f"{book_name}-fulltext.md"
|
|
70
|
+
image_dir = str(output_dir / "images") if args.export_images else ""
|
|
71
|
+
|
|
72
|
+
# 页码范围
|
|
73
|
+
total_pages = pdf_page_count(pdf_path)
|
|
74
|
+
end_page = total_pages
|
|
75
|
+
if args.max_pages > 0:
|
|
76
|
+
end_page = min(args.start_page + args.max_pages - 1, total_pages)
|
|
77
|
+
|
|
78
|
+
print(f"📖 {book_name}")
|
|
79
|
+
print(f" 总页数: {total_pages}")
|
|
80
|
+
print(f" 处理范围: 第 {args.start_page}-{end_page} 页")
|
|
81
|
+
print(f" 输出: {output_path}")
|
|
82
|
+
if args.export_images:
|
|
83
|
+
print(f" 图片: {image_dir}/")
|
|
84
|
+
print()
|
|
85
|
+
|
|
86
|
+
# 提取
|
|
87
|
+
print(f"🔍 提取文本 ({end_page - args.start_page + 1} 页) ...")
|
|
88
|
+
results = extract_pages(
|
|
89
|
+
pdf_path,
|
|
90
|
+
page_range=(args.start_page, end_page),
|
|
91
|
+
export_images=args.export_images,
|
|
92
|
+
image_dir=image_dir,
|
|
93
|
+
)
|
|
94
|
+
|
|
95
|
+
# 保存
|
|
96
|
+
save_results(results, output_path, title=book_name)
|
|
97
|
+
|
|
98
|
+
# 下一步提示
|
|
99
|
+
print(f"""
|
|
100
|
+
📎 下一步 — 生成知识骨架:
|
|
101
|
+
用 book-map skill,选择 Mode B(全书全文),把以下文件内容贴进去:
|
|
102
|
+
{output_path.resolve()}
|
|
103
|
+
""")
|
|
104
|
+
|
|
105
|
+
|
|
106
|
+
if __name__ == "__main__":
|
|
107
|
+
main()
|
|
@@ -0,0 +1,115 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""
|
|
3
|
+
extract_chapter.py: single-chapter text extraction script
|
|
4
|
+
========================================
|
|
5
|
+
Extracts the text of a given page range from a PDF and writes a Markdown file.
|
|
6
|
+
The output can be fed directly to the chapter-drill skill.
|
|
7
|
+
|
|
8
|
+
Usage:
|
|
9
|
+
# Extract pages 45-78 (Chapter 3)
|
|
10
|
+
python extract_chapter.py ./book.pdf --pages 45-78 --chapter "ch03"
|
|
11
|
+
|
|
12
|
+
# Specify the output directory
|
|
13
|
+
python extract_chapter.py ./book.pdf --pages 45-78 --chapter "ch03" --output ./note/book/
|
|
14
|
+
|
|
15
|
+
# Also export embedded images
|
|
16
|
+
python extract_chapter.py ./book.pdf --pages 45-78 --chapter "ch03" --export-images
|
|
17
|
+
|
|
18
|
+
Output:
|
|
19
|
+
{output_dir}/{book_name}/chapters/{chapter_id}-raw.md
|
|
20
|
+
"""
|
|
21
|
+
|
|
22
|
+
from __future__ import annotations
|
|
23
|
+
|
|
24
|
+
import argparse
|
|
25
|
+
import sys
|
|
26
|
+
from pathlib import Path
|
|
27
|
+
|
|
28
|
+
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
|
29
|
+
|
|
30
|
+
from pdf_extract_utils import (
|
|
31
|
+
pdf_page_count,
|
|
32
|
+
extract_pages,
|
|
33
|
+
save_results,
|
|
34
|
+
extract_book_name,
|
|
35
|
+
parse_page_range,
|
|
36
|
+
)
|
|
37
|
+
|
|
38
|
+
|
|
39
|
+
def main():
|
|
40
|
+
parser = argparse.ArgumentParser(
|
|
41
|
+
description="单章文本提取 — 从 PDF 提取指定页码范围,对接 chapter-drill skill",
|
|
42
|
+
)
|
|
43
|
+
parser.add_argument("pdf", help="PDF 文件路径")
|
|
44
|
+
parser.add_argument(
|
|
45
|
+
"--pages", required=True,
|
|
46
|
+
help="页码范围,如 '45-78' 或 '45'(单页)",
|
|
47
|
+
)
|
|
48
|
+
parser.add_argument(
|
|
49
|
+
"--chapter", "-c", default="",
|
|
50
|
+
help="章节标识,如 'ch03'。输出文件名: ch03-raw.md",
|
|
51
|
+
)
|
|
52
|
+
parser.add_argument(
|
|
53
|
+
"--output", "-o", default="./note/book/",
|
|
54
|
+
help="输出根目录(默认: ./note/book/)",
|
|
55
|
+
)
|
|
56
|
+
parser.add_argument(
|
|
57
|
+
"--export-images", action="store_true",
|
|
58
|
+
help="同时导出 PDF 中的嵌入图片到 images/ 目录",
|
|
59
|
+
)
|
|
60
|
+
args = parser.parse_args()
|
|
61
|
+
|
|
62
|
+
pdf_path = Path(args.pdf)
|
|
63
|
+
if not pdf_path.exists():
|
|
64
|
+
sys.exit(f"❌ PDF 文件不存在: {pdf_path}")
|
|
65
|
+
|
|
66
|
+
# 页码范围
|
|
67
|
+
page_start, page_end = parse_page_range(args.pages)
|
|
68
|
+
if page_start > page_end:
|
|
69
|
+
page_start, page_end = page_end, page_start
|
|
70
|
+
|
|
71
|
+
total_pages = pdf_page_count(pdf_path)
|
|
72
|
+
if page_end > total_pages:
|
|
73
|
+
print(f"⚠️ 页码范围超出(PDF 共 {total_pages} 页),自动截断")
|
|
74
|
+
page_end = total_pages
|
|
75
|
+
|
|
76
|
+
# 章节标识
|
|
77
|
+
chapter_id = args.chapter if args.chapter else f"p{page_start}-{page_end}"
|
|
78
|
+
|
|
79
|
+
# 书名 & 输出路径
|
|
80
|
+
book_name = extract_book_name(pdf_path.name)
|
|
81
|
+
output_dir = Path(args.output) / book_name / "chapters"
|
|
82
|
+
output_path = output_dir / f"{chapter_id}-raw.md"
|
|
83
|
+
image_dir = str(Path(args.output) / book_name / "images") if args.export_images else ""
|
|
84
|
+
|
|
85
|
+
page_count = page_end - page_start + 1
|
|
86
|
+
print(f"📖 {book_name}")
|
|
87
|
+
print(f" 章节: {chapter_id} (第 {page_start}-{page_end} 页, 共 {page_count} 页)")
|
|
88
|
+
print(f" 输出: {output_path}")
|
|
89
|
+
print()
|
|
90
|
+
|
|
91
|
+
# 提取
|
|
92
|
+
print(f"🔍 提取文本 ({page_count} 页) ...")
|
|
93
|
+
results = extract_pages(
|
|
94
|
+
pdf_path,
|
|
95
|
+
page_range=(page_start, page_end),
|
|
96
|
+
export_images=args.export_images,
|
|
97
|
+
image_dir=image_dir,
|
|
98
|
+
)
|
|
99
|
+
|
|
100
|
+
# 保存
|
|
101
|
+
chapter_title = f"{book_name} · {chapter_id}"
|
|
102
|
+
save_results(results, output_path, title=chapter_title)
|
|
103
|
+
|
|
104
|
+
# 下一步提示
|
|
105
|
+
print(f"""
|
|
106
|
+
📎 下一步 — 深度萃取:
|
|
107
|
+
用 chapter-drill skill,将以下文件内容作为章节全文输入:
|
|
108
|
+
{output_path.resolve()}
|
|
109
|
+
|
|
110
|
+
同时提供 Stage 1 骨架片段(type, weight, keyQuestions 等)。
|
|
111
|
+
""")
|
|
112
|
+
|
|
113
|
+
|
|
114
|
+
if __name__ == "__main__":
|
|
115
|
+
main()
|
|
@@ -0,0 +1,540 @@
|
|
|
1
|
+
"""
|
|
2
|
+
PDF text extraction utilities: read the PDF text layer directly, with no OCR or GPU dependency
|
|
3
|
+
==========================================================
|
|
4
|
+
Shared helpers for extract_book.py and extract_chapter.py:
|
|
5
|
+
- Text extraction (PyMuPDF Page.get_text, on the order of milliseconds on CPU)
|
|
6
|
+
- Table region detection (grid analysis based on text coordinates)
|
|
7
|
+
- Embedded image export (flags unreadable content)
|
|
8
|
+
- Result aggregation and output formatting
|
|
9
|
+
|
|
10
|
+
How it works: most technical-book PDFs are exported from typesetting software and contain an embedded text layer.
|
|
11
|
+
PyMuPDF reads that layer directly, the same principle as Ctrl+C.
|
|
12
|
+
For scanned PDFs (all images) the text layer is empty; the script detects this and warns.
|
|
13
|
+
"""
|
|
14
|
+
|
|
15
|
+
from __future__ import annotations
|
|
16
|
+
|
|
17
|
+
import os
|
|
18
|
+
import re
|
|
19
|
+
import sys
|
|
20
|
+
import time
|
|
21
|
+
from collections import defaultdict
|
|
22
|
+
from dataclasses import dataclass, field
|
|
23
|
+
from pathlib import Path
|
|
24
|
+
from typing import Optional
|
|
25
|
+
|
|
26
|
+
|
|
27
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
28
|
+
# 数据结构
|
|
29
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
30
|
+
|
|
31
|
+
@dataclass
|
|
32
|
+
class TextBlock:
|
|
33
|
+
"""PDF 中的一个文本块(段落/标题/单元格)"""
|
|
34
|
+
text: str
|
|
35
|
+
x0: float
|
|
36
|
+
y0: float
|
|
37
|
+
x1: float
|
|
38
|
+
y1: float
|
|
39
|
+
block_type: str = "" # "text" | "image"
|
|
40
|
+
font_size: float = 0.0
|
|
41
|
+
font_name: str = ""
|
|
42
|
+
|
|
43
|
+
|
|
44
|
+
@dataclass
|
|
45
|
+
class PageResult:
|
|
46
|
+
"""单页提取结果"""
|
|
47
|
+
page_num: int # 页码(1-based)
|
|
48
|
+
text: str # 提取的文本
|
|
49
|
+
tables: list[TableRegion] = field(default_factory=list) # 检测到的表格
|
|
50
|
+
image_count: int = 0 # 嵌入图片数
|
|
51
|
+
has_text_layer: bool = True # 是否有文字层
|
|
52
|
+
elapsed: float = 0.0 # 耗时(秒)
|
|
53
|
+
|
|
54
|
+
|
|
55
|
+
@dataclass
|
|
56
|
+
class TableRegion:
|
|
57
|
+
"""检测到的表格区域"""
|
|
58
|
+
y_start: float
|
|
59
|
+
y_end: float
|
|
60
|
+
rows: int # 估计行数
|
|
61
|
+
cols: int # 估计列数
|
|
62
|
+
raw_text: str = "" # 从该区域提取的原始文本
|
|
63
|
+
page_num: int = 0
|
|
64
|
+
|
|
65
|
+
|
|
66
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
67
|
+
# PDF 工具
|
|
68
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
69
|
+
|
|
70
|
+
def pdf_page_count(pdf_path: str | Path) -> int:
|
|
71
|
+
"""返回 PDF 总页数"""
|
|
72
|
+
import fitz
|
|
73
|
+
doc = fitz.open(str(pdf_path))
|
|
74
|
+
count = len(doc)
|
|
75
|
+
doc.close()
|
|
76
|
+
return count
|
|
77
|
+
|
|
78
|
+
|
|
79
|
+
def _group_blocks_by_line(blocks: list[TextBlock], y_tolerance: float = 4.0) -> list[list[TextBlock]]:
|
|
80
|
+
"""
|
|
81
|
+
将文本块按 y 坐标分组为"行"。
|
|
82
|
+
同一行内 block 按 x 坐标从左到右排序。
|
|
83
|
+
"""
|
|
84
|
+
if not blocks:
|
|
85
|
+
return []
|
|
86
|
+
|
|
87
|
+
# 按 y 排序
|
|
88
|
+
sorted_blocks = sorted(blocks, key=lambda b: (b.y0, b.x0))
|
|
89
|
+
|
|
90
|
+
lines: list[list[TextBlock]] = []
|
|
91
|
+
current_line: list[TextBlock] = [sorted_blocks[0]]
|
|
92
|
+
|
|
93
|
+
for block in sorted_blocks[1:]:
|
|
94
|
+
# 如果 y0 和上一行的平均 y0 接近,属于同一行
|
|
95
|
+
avg_y = sum(b.y0 for b in current_line) / len(current_line)
|
|
96
|
+
if abs(block.y0 - avg_y) < y_tolerance:
|
|
97
|
+
current_line.append(block)
|
|
98
|
+
else:
|
|
99
|
+
lines.append(sorted(current_line, key=lambda b: b.x0))
|
|
100
|
+
current_line = [block]
|
|
101
|
+
|
|
102
|
+
lines.append(sorted(current_line, key=lambda b: b.x0))
|
|
103
|
+
return lines
|
|
104
|
+
|
|
105
|
+
|
|
106
|
+
def _detect_tables(blocks: list[TextBlock]) -> list[TableRegion]:
|
|
107
|
+
"""
|
|
108
|
+
Detect table regions from text-block coordinates.
|
|
109
|
+
|
|
110
|
+
Table characteristics:
|
|
111
|
+
1. The x coordinates of multiple text lines align column-wise (forming a grid)
|
|
112
|
+
2. Row spacing is uniform (unlike the separation pattern of body paragraphs)
|
|
113
|
+
3. A single line contains several independent blocks (multiple columns)
|
|
114
|
+
|
|
115
|
+
Returns the list of detected table regions.
|
|
116
|
+
"""
|
|
117
|
+
if len(blocks) < 4: # 至少 2 行 × 2 列才有表格意义
|
|
118
|
+
return []
|
|
119
|
+
|
|
120
|
+
# 过滤掉太小的块(可能是页码、页眉)
|
|
121
|
+
text_blocks = [b for b in blocks if len(b.text.strip()) > 1]
|
|
122
|
+
if len(text_blocks) < 4:
|
|
123
|
+
return []
|
|
124
|
+
|
|
125
|
+
lines = _group_blocks_by_line(text_blocks, y_tolerance=5.0)
|
|
126
|
+
|
|
127
|
+
tables: list[TableRegion] = []
|
|
128
|
+
i = 0
|
|
129
|
+
while i < len(lines):
|
|
130
|
+
line = lines[i]
|
|
131
|
+
# 单行有 ≥2 个对齐的 block → 可能是表格行
|
|
132
|
+
if len(line) >= 2:
|
|
133
|
+
# 向后扫描,看有多少连续行有相同数量的 block(列数一致)
|
|
134
|
+
table_lines = [line]
|
|
135
|
+
col_count = len(line)
|
|
136
|
+
j = i + 1
|
|
137
|
+
while j < len(lines) and len(lines[j]) >= 2:
|
|
138
|
+
# 检查列宽是否大致对齐
|
|
139
|
+
if _columns_aligned(table_lines[-1], lines[j], tolerance=10.0):
|
|
140
|
+
table_lines.append(lines[j])
|
|
141
|
+
j += 1
|
|
142
|
+
else:
|
|
143
|
+
break
|
|
144
|
+
|
|
145
|
+
if len(table_lines) >= 2: # 至少 2 行才算表格
|
|
146
|
+
y0 = min(b.y0 for line in table_lines for b in line)
|
|
147
|
+
y1 = max(b.y1 for line in table_lines for b in line)
|
|
148
|
+
# 聚合表格内所有文本(按行组织)
|
|
149
|
+
raw = ""
|
|
150
|
+
for tl in table_lines:
|
|
151
|
+
cells = [b.text.strip() for b in tl]
|
|
152
|
+
raw += " | ".join(cells) + "\n"
|
|
153
|
+
tables.append(TableRegion(
|
|
154
|
+
y_start=y0,
|
|
155
|
+
y_end=y1,
|
|
156
|
+
rows=len(table_lines),
|
|
157
|
+
cols=col_count,
|
|
158
|
+
raw_text=raw.strip(),
|
|
159
|
+
))
|
|
160
|
+
i = j
|
|
161
|
+
continue
|
|
162
|
+
i += 1
|
|
163
|
+
|
|
164
|
+
# 去重:合并重叠的表格区域
|
|
165
|
+
return _merge_overlapping_tables(tables)
|
|
166
|
+
|
|
167
|
+
|
|
168
|
+
def _columns_aligned(line_a: list[TextBlock], line_b: list[TextBlock], tolerance: float = 10.0) -> bool:
|
|
169
|
+
"""
|
|
170
|
+
检查两行文本的列是否对齐。
|
|
171
|
+
比较每列的 x0 坐标是否在容差范围内。
|
|
172
|
+
"""
|
|
173
|
+
if len(line_a) != len(line_b):
|
|
174
|
+
return False
|
|
175
|
+
for ba, bb in zip(line_a, line_b):
|
|
176
|
+
if abs(ba.x0 - bb.x0) > tolerance:
|
|
177
|
+
return False
|
|
178
|
+
return True
|
|
179
|
+
|
|
180
|
+
|
|
181
|
+
def _merge_overlapping_tables(tables: list[TableRegion]) -> list[TableRegion]:
|
|
182
|
+
"""合并 y 范围有重叠的表格检测结果"""
|
|
183
|
+
if len(tables) <= 1:
|
|
184
|
+
return tables
|
|
185
|
+
sorted_tables = sorted(tables, key=lambda t: t.y_start)
|
|
186
|
+
merged = [sorted_tables[0]]
|
|
187
|
+
for t in sorted_tables[1:]:
|
|
188
|
+
last = merged[-1]
|
|
189
|
+
if t.y_start <= last.y_end + 10: # 允许小间隙
|
|
190
|
+
last.y_end = max(last.y_end, t.y_end)
|
|
191
|
+
last.rows = max(last.rows, t.rows)
|
|
192
|
+
last.cols = max(last.cols, t.cols)
|
|
193
|
+
last.raw_text += "\n" + t.raw_text
|
|
194
|
+
else:
|
|
195
|
+
merged.append(t)
|
|
196
|
+
return merged
|
|
197
|
+
|
|
198
|
+
|
|
199
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
200
|
+
# 核心:单页提取
|
|
201
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
202
|
+
|
|
203
|
+
def extract_page(
|
|
204
|
+
page,
|
|
205
|
+
page_num: int,
|
|
206
|
+
export_images: bool = False,
|
|
207
|
+
image_dir: str = "",
|
|
208
|
+
) -> PageResult:
|
|
209
|
+
"""
|
|
210
|
+
提取单个 PDF 页面的文本、表格和图片信息。
|
|
211
|
+
|
|
212
|
+
Args:
|
|
213
|
+
page: PyMuPDF Page 对象
|
|
214
|
+
page_num: 页码(1-based)
|
|
215
|
+
export_images: 是否导出嵌入图片
|
|
216
|
+
image_dir: 图片导出目录
|
|
217
|
+
|
|
218
|
+
Returns:
|
|
219
|
+
PageResult
|
|
220
|
+
"""
|
|
221
|
+
import fitz
|
|
222
|
+
|
|
223
|
+
t0 = time.time()
|
|
224
|
+
|
|
225
|
+
# 1. 提取文本(dict 模式,保留坐标)
|
|
226
|
+
text_dict = page.get_text("dict")
|
|
227
|
+
blocks_raw = text_dict.get("blocks", [])
|
|
228
|
+
|
|
229
|
+
# 2. 提取嵌入图片
|
|
230
|
+
image_count = 0
|
|
231
|
+
image_refs: list[str] = []
|
|
232
|
+
try:
|
|
233
|
+
image_list = page.get_images(full=True)
|
|
234
|
+
for img_idx, img_info in enumerate(image_list):
|
|
235
|
+
image_count += 1
|
|
236
|
+
if export_images and image_dir:
|
|
237
|
+
xref = img_info[0]
|
|
238
|
+
base_image = page.parent.extract_image(xref)
|
|
239
|
+
if base_image:
|
|
240
|
+
ext = base_image["ext"]
|
|
241
|
+
img_path = os.path.join(image_dir, f"p{page_num:04d}_img{img_idx+1}.{ext}")
|
|
242
|
+
os.makedirs(os.path.dirname(img_path), exist_ok=True)
|
|
243
|
+
with open(img_path, "wb") as f:
|
|
244
|
+
f.write(base_image["image"])
|
|
245
|
+
image_refs.append(img_path)
|
|
246
|
+
except Exception:
|
|
247
|
+
pass # 某些 PDF 图片提取可能失败,不阻塞流程
|
|
248
|
+
|
|
249
|
+
# 3. 分离文本块和图片块
|
|
250
|
+
text_blocks: list[TextBlock] = []
|
|
251
|
+
for block in blocks_raw:
|
|
252
|
+
if block.get("type") == 0: # 文本块
|
|
253
|
+
for line in block.get("lines", []):
|
|
254
|
+
line_text = ""
|
|
255
|
+
x0, y0, x1, y1 = float("inf"), float("inf"), 0, 0
|
|
256
|
+
font_size = 0.0
|
|
257
|
+
font_name = ""
|
|
258
|
+
for span in line.get("spans", []):
|
|
259
|
+
line_text += span.get("text", "")
|
|
260
|
+
bbox = span.get("bbox", (0, 0, 0, 0))
|
|
261
|
+
x0 = min(x0, bbox[0])
|
|
262
|
+
y0 = min(y0, bbox[1])
|
|
263
|
+
x1 = max(x1, bbox[2])
|
|
264
|
+
y1 = max(y1, bbox[3])
|
|
265
|
+
if span.get("size", 0) > font_size:
|
|
266
|
+
font_size = span.get("size", 0)
|
|
267
|
+
font_name = span.get("font", "")
|
|
268
|
+
if line_text.strip():
|
|
269
|
+
text_blocks.append(TextBlock(
|
|
270
|
+
text=line_text.strip(),
|
|
271
|
+
x0=x0, y0=y0, x1=x1, y1=y1,
|
|
272
|
+
block_type="text",
|
|
273
|
+
font_size=font_size,
|
|
274
|
+
font_name=font_name,
|
|
275
|
+
))
|
|
276
|
+
|
|
277
|
+
# 4. 检测表格
|
|
278
|
+
tables = _detect_tables(text_blocks)
|
|
279
|
+
|
|
280
|
+
# 5. 构建输出文本
|
|
281
|
+
# 策略:按 y 坐标从上到下排列文本块,表格区域插入特殊标记
|
|
282
|
+
has_text = len(text_blocks) > 0
|
|
283
|
+
|
|
284
|
+
if not has_text:
|
|
285
|
+
# 可能是扫描件 PDF — 没有文字层
|
|
286
|
+
elapsed = time.time() - t0
|
|
287
|
+
return PageResult(
|
|
288
|
+
page_num=page_num,
|
|
289
|
+
text=f"> ⚠️ 本页无文字层,可能是扫描件图片。需要 OCR 才能提取文字。\n",
|
|
290
|
+
tables=[],
|
|
291
|
+
image_count=image_count,
|
|
292
|
+
has_text_layer=False,
|
|
293
|
+
elapsed=elapsed,
|
|
294
|
+
)
|
|
295
|
+
|
|
296
|
+
# 按 y 坐标排序所有文本块
|
|
297
|
+
sorted_blocks = sorted(text_blocks, key=lambda b: (b.y0, b.x0))
|
|
298
|
+
|
|
299
|
+
# 标记哪些 block 属于表格区域
|
|
300
|
+
table_block_indices: set[int] = set()
|
|
301
|
+
for table in tables:
|
|
302
|
+
for idx, block in enumerate(sorted_blocks):
|
|
303
|
+
if table.y_start - 5 <= block.y0 <= table.y_end + 5:
|
|
304
|
+
table_block_indices.add(idx)
|
|
305
|
+
|
|
306
|
+
# 生成文本输出
|
|
307
|
+
output_lines: list[str] = []
|
|
308
|
+
output_lines.append(f"\n<!-- 第 {page_num} 页 -->\n")
|
|
309
|
+
|
|
310
|
+
prev_y = -100
|
|
311
|
+
prev_was_table = False
|
|
312
|
+
inside_table = False
|
|
313
|
+
|
|
314
|
+
for idx, block in enumerate(sorted_blocks):
|
|
315
|
+
in_table = idx in table_block_indices
|
|
316
|
+
|
|
317
|
+
# 表格区域开始
|
|
318
|
+
if in_table and not inside_table:
|
|
319
|
+
inside_table = True
|
|
320
|
+
# 找到对应的 TableRegion
|
|
321
|
+
for table in tables:
|
|
322
|
+
if table.y_start - 5 <= block.y0 <= table.y_end + 5:
|
|
323
|
+
output_lines.append(f"\n> 📊 **表格区域**({table.rows}行 × {table.cols}列)— 建议查阅原书第 {page_num} 页确认结构\n>\n")
|
|
324
|
+
# 输出提取的原始文本
|
|
325
|
+
for line in table.raw_text.split("\n"):
|
|
326
|
+
output_lines.append(f"> | {line}\n")
|
|
327
|
+
output_lines.append(f">\n> ⚠️ 上方表格的单元格内容已提取,但行列关系可能不准确\n")
|
|
328
|
+
break
|
|
329
|
+
prev_was_table = True
|
|
330
|
+
continue
|
|
331
|
+
|
|
332
|
+
# 表格区域内的后续行跳过(已在 header 中输出)
|
|
333
|
+
if in_table and inside_table:
|
|
334
|
+
prev_y = block.y0
|
|
335
|
+
continue
|
|
336
|
+
|
|
337
|
+
# 离开表格区域
|
|
338
|
+
if not in_table and inside_table:
|
|
339
|
+
inside_table = False
|
|
340
|
+
output_lines.append("\n")
|
|
341
|
+
|
|
342
|
+
# 判断是否为标题(字号较大)
|
|
343
|
+
is_heading = block.font_size >= 14.0 and len(block.text) < 80
|
|
344
|
+
|
|
345
|
+
# 段落间距判断
|
|
346
|
+
if prev_y > 0 and block.y0 - prev_y > 20:
|
|
347
|
+
output_lines.append("")
|
|
348
|
+
|
|
349
|
+
if is_heading and len(block.text) > 2:
|
|
350
|
+
output_lines.append(f"## {block.text}")
|
|
351
|
+
else:
|
|
352
|
+
output_lines.append(block.text)
|
|
353
|
+
|
|
354
|
+
prev_y = block.y0
|
|
355
|
+
prev_was_table = False
|
|
356
|
+
|
|
357
|
+
# 6. 图片引用
|
|
358
|
+
if image_count > 0:
|
|
359
|
+
output_lines.append(f"\n> 🖼️ 本页含 {image_count} 张嵌入图片,建议查原书\n")
|
|
360
|
+
|
|
361
|
+
elapsed = time.time() - t0
|
|
362
|
+
return PageResult(
|
|
363
|
+
page_num=page_num,
|
|
364
|
+
text="\n".join(s.rstrip("\n") for s in output_lines),  # join with newlines so consecutive text lines do not run together
|
|
365
|
+
tables=tables,
|
|
366
|
+
image_count=image_count,
|
|
367
|
+
has_text_layer=True,
|
|
368
|
+
elapsed=elapsed,
|
|
369
|
+
)
|
|
370
|
+
|
|
371
|
+
|
|
372
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
373
|
+
# 批量提取
|
|
374
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
375
|
+
|
|
376
|
+
def extract_pages(
|
|
377
|
+
pdf_path: str | Path,
|
|
378
|
+
page_range: Optional[tuple[int, int]] = None,
|
|
379
|
+
export_images: bool = False,
|
|
380
|
+
image_dir: str = "",
|
|
381
|
+
progress_callback=None,
|
|
382
|
+
) -> list[PageResult]:
|
|
383
|
+
"""
|
|
384
|
+
从 PDF 批量提取页面文本。
|
|
385
|
+
|
|
386
|
+
Args:
|
|
387
|
+
pdf_path: PDF 文件路径
|
|
388
|
+
page_range: (start, end) 1-based,None 表示全部
|
|
389
|
+
export_images: 是否导出嵌入图片
|
|
390
|
+
image_dir: 图片导出目录
|
|
391
|
+
progress_callback: fn(page_num, status)
|
|
392
|
+
|
|
393
|
+
Returns:
|
|
394
|
+
[PageResult, ...]
|
|
395
|
+
"""
|
|
396
|
+
import fitz
|
|
397
|
+
|
|
398
|
+
doc = fitz.open(str(pdf_path))
|
|
399
|
+
total = len(doc)
|
|
400
|
+
|
|
401
|
+
start = 1
|
|
402
|
+
end = total
|
|
403
|
+
if page_range:
|
|
404
|
+
start, end = page_range
|
|
405
|
+
start = max(1, start)
|
|
406
|
+
end = min(total, end)
|
|
407
|
+
|
|
408
|
+
results: list[PageResult] = []
|
|
409
|
+
scan_pages = 0
|
|
410
|
+
|
|
411
|
+
for i in range(start - 1, end):
|
|
412
|
+
page_num = i + 1
|
|
413
|
+
page = doc[i]
|
|
414
|
+
result = extract_page(page, page_num, export_images, image_dir)
|
|
415
|
+
results.append(result)
|
|
416
|
+
|
|
417
|
+
if not result.has_text_layer:
|
|
418
|
+
scan_pages += 1
|
|
419
|
+
|
|
420
|
+
if progress_callback:
|
|
421
|
+
status = "scan" if not result.has_text_layer else "ok"
|
|
422
|
+
progress_callback(page_num, status)
|
|
423
|
+
else:
|
|
424
|
+
flag = " ⚠️ 扫描件" if not result.has_text_layer else ""
|
|
425
|
+
print(f" ✓ 第 {page_num} 页 ({result.elapsed:.3f}s){flag}")
|
|
426
|
+
|
|
427
|
+
doc.close()
|
|
428
|
+
|
|
429
|
+
# 汇总报告
|
|
430
|
+
if scan_pages > 0:
|
|
431
|
+
print(f"\n⚠️ {scan_pages}/{len(results)} 页无文字层(可能是扫描件),需要 OCR。")
|
|
432
|
+
if scan_pages == len(results):
|
|
433
|
+
print(" 这本书似乎是纯扫描件 PDF,建议使用 OCR 工具(如 Unlimited-OCR)处理。")
|
|
434
|
+
|
|
435
|
+
return results
|
|
436
|
+
|
|
437
|
+
|
|
438
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
439
|
+
# 输出格式化
|
|
440
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
441
|
+
|
|
442
|
+
def merge_to_markdown(
|
|
443
|
+
results: list[PageResult],
|
|
444
|
+
title: str = "",
|
|
445
|
+
) -> str:
|
|
446
|
+
"""
|
|
447
|
+
将多页提取结果合并为一个 Markdown 文档。
|
|
448
|
+
|
|
449
|
+
Args:
|
|
450
|
+
results: 按页码排序的结果列表
|
|
451
|
+
title: 文档标题
|
|
452
|
+
|
|
453
|
+
Returns:
|
|
454
|
+
完整的 Markdown 文本
|
|
455
|
+
"""
|
|
456
|
+
lines = []
|
|
457
|
+
if title:
|
|
458
|
+
lines.append(f"# {title}\n")
|
|
459
|
+
|
|
460
|
+
# 报告统计
|
|
461
|
+
total_tables = sum(len(r.tables) for r in results)
|
|
462
|
+
total_images = sum(r.image_count for r in results)
|
|
463
|
+
scan_pages = sum(1 for r in results if not r.has_text_layer)
|
|
464
|
+
|
|
465
|
+
if total_tables > 0 or total_images > 0 or scan_pages > 0:
|
|
466
|
+
lines.append("> 📋 **提取报告**\n")
|
|
467
|
+
if total_tables > 0:
|
|
468
|
+
lines.append(f"> - 检测到 {total_tables} 个表格区域,建议查原书验证结构\n")
|
|
469
|
+
if total_images > 0:
|
|
470
|
+
lines.append(f"> - {total_images} 张嵌入图片无法提取文字\n")
|
|
471
|
+
if scan_pages > 0:
|
|
472
|
+
lines.append(f"> - {scan_pages} 页无文字层(扫描件),内容缺失\n")
|
|
473
|
+
lines.append(">\n")
|
|
474
|
+
|
|
475
|
+
for r in results:
|
|
476
|
+
lines.append(r.text)
|
|
477
|
+
lines.append("")
|
|
478
|
+
|
|
479
|
+
return "\n".join(lines)
|
|
480
|
+
|
|
481
|
+
|
|
482
|
+
def save_results(
|
|
483
|
+
results: list[PageResult],
|
|
484
|
+
output_path: str | Path,
|
|
485
|
+
title: str = "",
|
|
486
|
+
) -> str:
|
|
487
|
+
"""
|
|
488
|
+
保存提取结果为 Markdown 文件。
|
|
489
|
+
|
|
490
|
+
Args:
|
|
491
|
+
results: 提取结果列表
|
|
492
|
+
output_path: 输出文件路径
|
|
493
|
+
title: 文档标题
|
|
494
|
+
|
|
495
|
+
Returns:
|
|
496
|
+
实际写入的文件路径
|
|
497
|
+
"""
|
|
498
|
+
output_path = Path(output_path)
|
|
499
|
+
output_path.parent.mkdir(parents=True, exist_ok=True)
|
|
500
|
+
|
|
501
|
+
markdown = merge_to_markdown(results, title)
|
|
502
|
+
output_path.write_text(markdown, encoding="utf-8")
|
|
503
|
+
|
|
504
|
+
total_ok = sum(1 for r in results if r.has_text_layer)
|
|
505
|
+
total_time = sum(r.elapsed for r in results)
|
|
506
|
+
total_tables = sum(len(r.tables) for r in results)
|
|
507
|
+
total_images = sum(r.image_count for r in results)
|
|
508
|
+
|
|
509
|
+
print(f"\n✅ 保存到: {output_path}")
|
|
510
|
+
print(f" 文字页: {total_ok}/{len(results)}, "
|
|
511
|
+
f"表格: {total_tables}, 图片: {total_images}, "
|
|
512
|
+
f"耗时: {total_time:.2f}s")
|
|
513
|
+
|
|
514
|
+
return str(output_path)
|
|
515
|
+
|
|
516
|
+
|
|
517
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
518
|
+
# 辅助函数
|
|
519
|
+
# ═══════════════════════════════════════════════════════════════════
|
|
520
|
+
|
|
521
|
+
def extract_book_name(pdf_path: str | Path) -> str:
|
|
522
|
+
"""从 PDF 文件名提取书名(去后缀、去特殊字符)"""
|
|
523
|
+
name = Path(pdf_path).stem
|
|
524
|
+
name = re.sub(r"[((]第?\d+版[))]", "", name)
|
|
525
|
+
name = re.sub(r"[::].*", "", name)
|
|
526
|
+
return name.strip()
|
|
527
|
+
|
|
528
|
+
|
|
529
|
+
def parse_page_range(range_str: str) -> tuple[int, int]:
|
|
530
|
+
"""
|
|
531
|
+
解析页码范围字符串。
|
|
532
|
+
|
|
533
|
+
Examples:
|
|
534
|
+
"45-78" → (45, 78)
|
|
535
|
+
"45" → (45, 45)
|
|
536
|
+
"""
|
|
537
|
+
parts = range_str.split("-")
|
|
538
|
+
if len(parts) == 2:
|
|
539
|
+
return int(parts[0].strip()), int(parts[1].strip())
|
|
540
|
+
return int(parts[0].strip()), int(parts[0].strip())
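As written, `parse_page_range` accepts only well-formed input: a value such as `"a-b"` fails inside `int()` with a generic message, and `"45-78-90"` is silently read as page 45 alone because the split yields three parts. A stricter variant might look like the sketch below. The name `parse_page_range_strict`, the acceptance of an en dash as separator, and the swap of reversed bounds are assumptions made for illustration; they are not part of the package.

```python
from __future__ import annotations

import re

# One page number, optionally followed by a hyphen or en dash and a second number.
_RANGE_RE = re.compile(r"^\s*(\d+)\s*(?:[-\u2013]\s*(\d+))?\s*$")


def parse_page_range_strict(range_str: str) -> tuple[int, int]:
    """Parse '45-78' or '45' into a 1-based (start, end) tuple, rejecting anything else."""
    match = _RANGE_RE.match(range_str)
    if match is None:
        raise ValueError(f"invalid page range: {range_str!r} (expected e.g. '45-78' or '45')")

    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start

    if start < 1 or end < 1:
        raise ValueError(f"page numbers are 1-based, got: {range_str!r}")
    if start > end:
        # Tolerate reversed input, as extract_chapter.py already does after parsing.
        start, end = end, start
    return start, end
```

Raising `ValueError` with the offending input keeps the failure close to the command-line argument that caused it.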
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
|
-
name:
|
|
3
|
-
description: "
|
|
2
|
+
name: book-map
|
|
3
|
+
description: "技术书结构解析——生成知识骨架地图。触发词:/book-map、画地图、分析结构。"
|
|
4
4
|
---
|
|
5
5
|
# Stage 1 Skill: Technical Book Structure Parser
|
|
6
6
|
|
|
@@ -10,6 +10,21 @@ description: "技术书精读前的知识骨架生成器。当用户想在读书
|
|
|
10
10
|
|
|
11
11
|
---
|
|
12
12
|
|
|
13
|
+
## Step -1: Extract text from the PDF (if you have one)
|
|
14
|
+
|
|
15
|
+
If you have a PDF rather than plain text, first use the extraction script to get the book's full text:
|
|
16
|
+
|
|
17
|
+
```bash
|
|
18
|
+
# Whole-book extraction → writes {book_name}-fulltext.md
|
|
19
|
+
python ~/.claude/scripts/extract_book.py ./your-book.pdf --output ./
|
|
20
|
+
|
|
21
|
+
# Test: only the first 20 pages
|
|
22
|
+
python ~/.claude/scripts/extract_book.py ./your-book.pdf --max-pages 20
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
> How it works: PyMuPDF reads the PDF text layer directly (the same as Ctrl+C), so no GPU or OCR is needed.
|
|
26
|
+
> Scanned PDFs have no text layer; the script detects this and warns.
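A minimal sketch of the scanned-PDF check, under the assumption that a page counts as scanned when its extracted text is empty or whitespace only. The helper names `read_page_texts`, `classify_pages`, and `is_fully_scanned` are hypothetical; the packaged script makes the equivalent decision per page inside `extract_page` by checking whether any text blocks were found.

```python
from __future__ import annotations


def read_page_texts(pdf_path: str, max_pages: int = 20) -> list[str]:
    """Read plain text from the first pages of a PDF via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the pure helpers below work without it

    doc = fitz.open(pdf_path)
    try:
        return [doc[i].get_text() for i in range(min(max_pages, len(doc)))]
    finally:
        doc.close()


def classify_pages(page_texts: list[str]) -> dict[str, int]:
    """Count pages with and without a usable text layer."""
    scanned = sum(1 for text in page_texts if not text.strip())
    return {
        "total": len(page_texts),
        "scanned": scanned,
        "text": len(page_texts) - scanned,
    }


def is_fully_scanned(page_texts: list[str]) -> bool:
    """True only when at least one page was sampled and none of them has text."""
    return bool(page_texts) and all(not text.strip() for text in page_texts)
```

Sampling a handful of pages first, for example `is_fully_scanned(read_page_texts("./your-book.pdf"))`, is a cheap way to decide whether an OCR tool is needed before running a full extraction.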
|
|
27
|
+
|
|
13
28
|
## Step 0: Run the preprocessing script (must be done first)
|
|
14
29
|
|
|
15
30
|
```bash
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
|
-
name:
|
|
3
|
-
description: "
|
|
2
|
+
name: chapter-drill
|
|
3
|
+
description: "技术书章节深度萃取——把一章钻透。自动生成骨架(如需要)。触发词:/chapter-drill、钻读、萃取。"
|
|
4
4
|
---
|
|
5
5
|
# Stage 2 Skill: Technical Book Chapter Deep Extractor
|
|
6
6
|
|
|
@@ -8,11 +8,24 @@ description: "技术书章节深度萃取器。当用户想精读某章节、把
|
|
|
8
8
|
|
|
9
9
|
---
|
|
10
10
|
|
|
11
|
+
## Prerequisite: extract a single chapter's text from the PDF
|
|
12
|
+
|
|
13
|
+
If the chapter's original text is in a PDF, first use the extraction script to get the text:
|
|
14
|
+
|
|
15
|
+
```bash
|
|
16
|
+
# Extract pages 45-78 (for example, Chapter 3)
|
|
17
|
+
python ~/.claude/scripts/extract_chapter.py ./your-book.pdf --pages 45-78 --chapter "ch03" --output ./
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
This writes `{book_name}/chapters/ch03-raw.md`, which can be used directly as the "chapter full text" (章节全文) input below.
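To make the output location concrete, the sketch below mirrors how `extract_chapter.py` appears to derive it: the book name comes from the PDF file name, and the chapter id falls back to `p{start}-{end}` when `--chapter` is omitted. The function names are hypothetical, and the full-width bracket and colon characters in the patterns are an assumption about what the original regular expressions were meant to match.

```python
from __future__ import annotations

import re
from pathlib import Path


def book_name_from_pdf(pdf_path: str | Path) -> str:
    """Derive a book name from the file name: drop the suffix, edition marker, and subtitle."""
    name = Path(pdf_path).stem
    # Remove edition markers such as "(第3版)", with ASCII or full-width brackets (assumed).
    name = re.sub(r"[(\uFF08]第?\d+版[)\uFF09]", "", name)
    # Drop everything after an ASCII or full-width colon (assumed subtitle separator).
    name = re.sub(r"[:\uFF1A].*", "", name)
    return name.strip()


def chapter_output_path(
    pdf_path: str | Path,
    pages: tuple[int, int],
    chapter: str = "",
    output: str | Path = "./note/book/",
) -> Path:
    """Compute {output}/{book_name}/chapters/{chapter_id}-raw.md."""
    start, end = pages
    chapter_id = chapter if chapter else f"p{start}-{end}"
    return Path(output) / book_name_from_pdf(pdf_path) / "chapters" / f"{chapter_id}-raw.md"
```

For the command shown above, `chapter_output_path("./your-book.pdf", (45, 78), "ch03", "./")` resolves to `your-book/chapters/ch03-raw.md` relative to the working directory.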
|
|
21
|
+
|
|
22
|
+
---
|
|
23
|
+
|
|
11
24
|
## Input
|
|
12
25
|
|
|
13
26
|
| Field | Source | Description |
|
|
14
27
|
|------|------|------|
|
|
15
|
-
| 章节全文 |
|
|
28
|
+
| Chapter full text (章节全文) | Provided by the user (or produced by `extract_chapter.py`) | Complete original text of a single chapter |
|
|
16
29
|
| Chapter skeleton | Stage 1 output | This chapter's type, weight, keyQuestions, outdatedRisks, prerequisites, subsections.skipIf, subsections.complexity |
|
|
17
30
|
| complexityHotspots | Stage 1 output | List of subsections across the book with complexity=high; Stage 2 applies deep scaffolding to these instead of distillation |
|
|
18
31
|
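| Prior chapter extractions | Earlier Stage 2 output (optional) | "Key Q&A" from already processed chapters, used to check that terminology references stay consistent |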
|