chem-pdf2ppt 2.0.0

package/README.md ADDED
@@ -0,0 +1,235 @@
1
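+ # PDF2PPT: Chemistry Academic Paper → Presentation Converter
2
+
3
+ Converts chemistry academic paper PDFs into professional academic presentations. Supports two output formats: **PPTX** and **HTML**.
4
+
5
+ ## Key Features
6
+
7
+ - **Automatic paper type recognition**: experimental, theoretical/computational, or hybrid (experimental + theoretical) chemistry, each matched to a suitable narrative structure
8
+ - **Dual output formats**: PPTX (python-pptx) and single-file HTML (horizontal paging, figures embedded as base64)
9
+ - **Deep chemistry domain support**: catalysis, materials, organic synthesis, computational chemistry, electrochemistry, energy, environmental chemistry, radiation chemistry, and more
10
+ - **Intelligent content generation**: extracts real information from the paper and never emits "please add XX manually" placeholders
11
+ - **Multi-strategy figure extraction**: vector-drawing clustering + embedded images + full-page rendering fallback, compatible with PyMuPDF 1.19–1.23+
12
+ - **Error tracking and reporting**: JSON reports across the whole pipeline (analysis → extraction → build), safe under Windows console encodings
13
+ - **4 academic color themes**: Academic Classic / Molecular Tech / Green Chemistry / Nature Style
14
+ - **7 slide types**: title, section divider, content, figure (4 layouts), data table, summary, thank you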
15
+
16
+ ---
17
+
18
+ ## Installation
19
+
20
+ ```bash
21
+ pip install -r requirements.txt
22
+ ```
23
+
24
+ **Dependencies**: `pymupdf>=1.19.0` · `python-pptx>=0.6.23` · `pdfplumber>=0.10.0` · `Pillow>=10.0.0`
25
+
26
+ `pdf2image` is optional (requires Poppler to be installed on the system).
27
+
28
+ ---
29
+
30
+ ## Quick Start
31
+
32
+ ### Complete Workflow
33
+
34
+ ```bash
35
+ # Step 1: Analyze the paper's type and structure
36
+ python scripts/analyze_paper.py paper.pdf --json analysis.json
37
+
38
+ # Step 2: Extract figures (multi-strategy with automatic fallback)
39
+ python scripts/extract_charts.py paper.pdf output/figures 300 --report
40
+
41
+ # Step 3: Build the PPTX or HTML
42
+ python build_my_ppt.py
43
+ ```
44
+
45
+ ### PPTX Format
46
+
47
+ ```python
48
+ import sys
49
+ sys.path.insert(0, 'scripts')
50
+ from create_ppt import ChemistryPPT
51
+
52
+ ppt = ChemistryPPT(theme='academic')
53
+
54
+ ppt.add_title_slide(
55
+ title_cn='Ru₁/Cu Single-Atom Alloy for Efficient CO₂ Electroreduction',
56
+ title_en='Single-Atom Ru Alloyed with Cu for Efficient CO₂RR',
57
+ authors='Zhang, L. et al.',
58
+ journal='J. Am. Chem. Soc., 2024, 146, 12345',
59
+ doi='10.1021/jacs.4c01234'
60
+ )
61
+
62
+ ppt.add_section_slide('Background')
63
+ ppt.add_content_slide(
64
+ title='Core Challenges in Electrocatalytic CO₂ Reduction',
65
+ bullets=[
66
+ 'CO₂RR yields a wide product distribution, so selectivity is hard to control',
67
+ 'Cu-based catalysts typically reach C₂₊ FE < 50%',
68
+ 'Key bottleneck: the trade-off between *CO adsorption energy and C-C coupling kinetics'
69
+ ]
70
+ )
71
+ ppt.add_figure_slide(
72
+ title='HAADF-STEM Confirms Atomically Dispersed Ru',
73
+ figure_path='figures/p3_fig1.png',
74
+ bullets=['Bright spots are uniformly dispersed, with no clusters', 'EDS mapping confirms a uniform distribution'],
75
+ figure_label='Figure 1',
76
+ layout='figure_right'
77
+ )
78
+ ppt.add_table_slide(
79
+ title='Catalytic Performance Comparison',
80
+ headers=['Catalyst', 'FE(C₂₊)%', 'j (mA/cm²)', 'Stability (h)'],
81
+ rows=[['Ru₁/Cu', '82%', '300', '100'], ['Cu NPs', '45%', '150', '20']]
82
+ )
83
+ ppt.add_summary_slide(
84
+ title='Summary',
85
+ bullets=['Key finding 1', 'Key finding 2', 'Key finding 3']
86
+ )
87
+ ppt.add_thankyou_slide()
88
+
89
+ ppt.save('output/presentation.pptx')
90
+ ppt.save_report('output/presentation.pptx') # Generates a JSON build report
91
+ ```
92
+
93
+ ### HTML Format
94
+
95
+ ```python
96
+ import sys
97
+ sys.path.insert(0, 'scripts')
98
+ from generate_html import HtmlPPT
99
+
100
+ html = HtmlPPT(title="Academic Report", theme="molecular")
101
+
102
+ # Same slide methods as ChemistryPPT (add_image_grid_slide is PPTX-only)
103
+ html.add_title_slide("Title", title_en="Title EN", authors="...", journal="...")
104
+ html.add_section_slide("Part 1")
105
+ html.add_content_slide("Key Point", ["bullet 1", "bullet 2"])
106
+ html.add_figure_slide("Figure", figure_path="figures/fig1.png",
107
+ bullets=["Description"], figure_label="Figure 1",
108
+ layout="figure_right")
109
+ html.add_summary_slide("Summary", ["Conclusion 1", "Conclusion 2"])
110
+ html.add_thankyou_slide()
111
+
112
+ html.save('output/presentation.html') # Single file; opens directly in a browser
113
+ ```
114
+
115
+ **HTML features**:
116
+ - Figures are embedded as base64: a single file with zero dependencies
117
+ - Horizontal paging: keyboard ← → Home End, mouse wheel, touch swipe, and dot navigation at the bottom
118
+ - Page tracking + keyboard hints
119
+ - Responsive design that adapts to projectors and mobile devices
120
+
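The single-file HTML output described above depends on inlining each figure as a base64 data URI. The following is a minimal sketch of that technique using only the standard library; `image_to_data_uri` and `img_tag` are hypothetical helper names chosen for illustration, not functions confirmed to exist in `generate_html.py`.

```python
import base64
import html
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Return the image at `path` as a base64 data URI (hypothetical helper)."""
    path = Path(path)
    # Guess the MIME type from the file extension; assume PNG if unknown.
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        mime = "image/png"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def img_tag(path, alt=""):
    """Build an <img> tag whose pixels travel inside the HTML file itself."""
    return f'<img src="{image_to_data_uri(path)}" alt="{html.escape(alt, quote=True)}">'
```

Base64 inflates each image by roughly one third, so a deck with many high-resolution figures can become a large HTML file; that is the trade-off for having no external assets.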
121
+ ---
122
+
123
+ ## Color Themes
124
+
125
+ | Theme | PPTX parameter | HTML parameter | Best for |
126
+ |------|-----------|-----------|------|
127
+ | Academic Classic | `theme="academic"` | `theme="academic"` | General chemistry (default) |
128
+ | Molecular Tech | `theme="molecular"` | `theme="molecular"` | Computational chemistry / materials |
129
+ | Green Chemistry | `theme="green"` | `theme="green"` | Catalysis / energy / environment |
130
+ | Nature Style | `theme="nature"` | `theme="nature"` | CNS journal presentations |
131
+
132
+ ---
133
+
134
+ ## Slide Types
135
+
136
+ | Method | PPTX | HTML | Purpose |
137
+ |------|------|------|------|
138
+ | `add_title_slide()` | ✓ | ✓ | Title slide (Chinese and English titles, authors, journal, DOI) |
139
+ | `add_section_slide()` | ✓ | ✓ | Section divider (dark background) |
140
+ | `add_content_slide()` | ✓ | ✓ | Text slide (title + bullets + notes) |
141
+ | `add_figure_slide()` | ✓ | ✓ | Figure + explanation (4 layouts: right/top/left/full) |
142
+ | `add_table_slide()` | ✓ | ✓ | Data comparison table (zebra striping, colored header) |
143
+ | `add_image_grid_slide()` | ✓ | — | Multi-image grid |
144
+ | `add_summary_slide()` | ✓ | ✓ | Summary (light background) |
145
+ | `add_thankyou_slide()` | ✓ | ✓ | Thank-you / Q&A slide |
146
+
147
+ ---
148
+
149
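+ ## Figure Extraction: Multi-Strategy + Version Compatibility
150
+
151
+ ```
152
+ Strategy 1: cluster_drawings() with the default tolerance (3,3)
153
+ ↓ fewer than 3 results
154
+ Strategy 2: retry with multiple tolerances (6,6) → (10,10) → (15,15) → (20,20)
155
+
156
+ Strategy 3: extract embedded bitmaps (get_images)
157
+ ↓ fewer than 3 results
158
+ Strategy 4: full-page rendering fallback for pages containing figures
159
+ ```
160
+
161
+ - Compatible with PyMuPDF 1.19+ (manual clustering via `get_drawings`) and 1.23+ (`cluster_drawings`)
162
+ - `--report` writes `extraction_report.json` (per-strategy extraction details)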
163
+
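The fallback chain above can be sketched as a generic cascade that tries each strategy in turn until enough figures are found. This is an illustration under stated assumptions, not the code in `extract_charts.py`: the function name `extract_with_fallback`, the log layout, and the choice to keep the largest result set (rather than merging results) are all hypothetical, and the real strategies are replaced by stand-in callables.

```python
from typing import Callable, List, Sequence, Tuple

# Each strategy is a (name, function) pair; the function returns the figures
# it found. Real extractors would call PyMuPDF; stand-ins are used here.
Strategy = Tuple[str, Callable[[], List[str]]]


def extract_with_fallback(strategies: Sequence[Strategy], min_results: int = 3):
    """Run strategies in order until at least `min_results` figures are found.

    Returns (figures, log); `log` records what each attempted strategy produced.
    Assumption: the largest result set wins, results are not merged.
    """
    best: List[str] = []
    log = []
    for name, run in strategies:
        try:
            found = run()
        except Exception as exc:  # a failing strategy must not abort the chain
            log.append({"strategy": name, "count": 0, "error": str(exc)})
            continue
        log.append({"strategy": name, "count": len(found)})
        if len(found) > len(best):
            best = found
        if len(best) >= min_results:
            break  # threshold reached, later strategies are skipped
    return best, log


# Example with stand-in strategies: the chain stops after the third one.
figures, log = extract_with_fallback([
    ("cluster_default", lambda: ["fig1"]),
    ("cluster_wide_tolerance", lambda: ["fig1", "fig2"]),
    ("embedded_images", lambda: ["img1", "img2", "img3"]),
    ("page_render", lambda: ["page3"]),
])
```

Recording every attempt, including failures, is what makes a per-strategy report possible afterwards.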
164
+ ---
165
+
166
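+ ## Error Handling and Reports
167
+
168
+ JSON reports across the whole pipeline make automated integration and troubleshooting easier:
169
+
170
+ | Stage | Report file | Generated by |
171
+ |------|---------|---------|
172
+ | Paper analysis | `analysis.json` | `analyze_paper.py --json analysis.json` |
173
+ | Figure extraction | `extraction_report.json` | `extract_charts.py --report` |
174
+ | PPTX build | `presentation_report.json` | `ppt.save_report("output.pptx")` |
175
+
176
+ **Windows encoding safety**: all scripts use `_safe_print()` to avoid crashes caused by GBK encoding.
177
+
178
+ **Automatic diagnosis of common issues**:
179
+
180
+ | Symptom | Likely cause | Script behavior |
181
+ |------|---------|---------|
182
+ | 0 vector figures extracted | PyMuPDF < 1.23 or unusual PDF rendering | Falls back automatically to manual clustering via `get_drawings` |
183
+ | Total figure count still too low | All figures are embedded bitmaps | Strategy 3 covers this automatically |
184
+ | Windows `print` crash | Unicode characters (e.g. − ₂) | `_safe_print` falls back to ASCII |
185
+ | Paper type misclassified | Characterization terms appear in the references | Weighted detection + confidence annotation |
186
+ | Images missing in the PPT | Image path does not exist | Recorded in `missing_images`; the build is not interrupted |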
187
+
188
+ ---
189
+
190
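+ ## Paper Type Adaptation
191
+
192
+ | Experimental chemistry | Theoretical/computational chemistry | Hybrid (experimental + theoretical) |
193
+ |---------|------------|-------------|
194
+ | Synthesis → Characterization → Performance → Mechanism | Method → Model → Electronic structure → Energetics → Mechanism | Experiment → Computation → Cross-validation → Unified mechanism |
195
+ | Catalysis / materials / organic / energy | DFT / MM / AIMD / electronic structure | Combined experiment + DFT |
196
+
197
+ Detailed templates are in `references/chemistry_templates.md`.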
198
+
199
+ ---
200
+
201
+ ## File Structure
202
+
203
+ ```
204
+ PDF2PPT/
205
+ ├── SKILL.md # Skill main file
206
+ ├── README.md # This file
207
+ ├── requirements.txt
208
+ ├── assets/
209
+ │ └── academic_template.html # HTML PPT template (CSS + paging JS)
210
+ ├── scripts/
211
+ │ ├── create_ppt.py # PPTX builder (ChemistryPPT)
212
+ │ ├── generate_html.py # HTML builder (HtmlPPT)
213
+ │ ├── extract_charts.py # Multi-strategy figure extraction
214
+ │ ├── analyze_paper.py # Paper analysis + type classification
215
+ │ └── convert_to_images.py # PDF pages → images
216
+ ├── references/
217
+ │ ├── chemistry_templates.md # Slide-by-slide templates for the three paper types
218
+ │ └── visual_style.md # Visual design guidelines for academic PPTs
219
+ └── examples/
220
+ └── example_usage.py # Complete examples for the three chemistry paper types
221
+ ```
222
+
223
+ ---
224
+
225
+ ## Compatibility
226
+
227
+ - **OS**: macOS / Linux / Windows
228
+ - **Python**: 3.8+
229
+ - **PyMuPDF**: 1.19+ (automatically adapts to both the old and new APIs)
230
+ - **Environment**: Claude Code / Claude Desktop / Cursor / VS Code / any Python environment
231
+ - **HTML output**: any modern browser (Chrome / Firefox / Edge / Safari)
232
+
233
+ ## License
234
+
235
+ MIT License
package/README_EN.md ADDED
@@ -0,0 +1,239 @@
1
+ # PDF2PPT — Chemistry Academic Paper → Presentation Converter
2
+
3
+ Convert chemistry academic paper PDFs into professional presentations for group meetings, defenses, and academic reports. Supports both **PPTX** and **HTML** output formats.
4
+
5
+ ## Key Features
6
+
7
+ - **Automatic paper type recognition**: Experimental / Computational / Hybrid chemistry — auto-matched narrative structure
8
+ - **Dual output formats**: PPTX (python-pptx) and single-file HTML (horizontal-slide, base64-embedded figures)
9
+ - **Deep chemistry domain support**: Catalysis, materials, organic synthesis, computational chemistry, electrochemistry, energy, environmental, radiation chemistry, and more
10
+ - **Intelligent content generation**: Extracts real information from papers — no "please fill in XX" placeholder content
11
+ - **Multi-strategy figure extraction**: Vector clustering + embedded images + page rendering fallback, compatible with PyMuPDF 1.19–1.23+
12
+ - **Error tracking & reporting**: Full-chain JSON reports (analysis → extraction → build), Windows encoding-safe
13
+ - **4 academic color themes**: Academic Classic / Molecular Tech / Green Chemistry / Nature Style
14
+ - **7 slide types**: Title, section divider, content, figure (4 layouts), data table, summary, thank you (plus a PPTX-only multi-image grid)
15
+
16
+ ---
17
+
18
+ ## Installation
19
+
20
+ ```bash
21
+ pip install -r requirements.txt
22
+ ```
23
+
24
+ **Dependencies**: `pymupdf>=1.19.0` · `python-pptx>=0.6.23` · `pdfplumber>=0.10.0` · `Pillow>=10.0.0`
25
+
26
+ `pdf2image` is optional (requires system Poppler installation).
27
+
28
+ ---
29
+
30
+ ## Quick Start
31
+
32
+ ### Complete Workflow
33
+
34
+ ```bash
35
+ # Step 1: Analyze paper type and structure
36
+ python scripts/analyze_paper.py paper.pdf --json analysis.json
37
+
38
+ # Step 2: Extract figures (multi-strategy + auto-fallback)
39
+ python scripts/extract_charts.py paper.pdf output/figures 300 --report
40
+
41
+ # Step 3: Build PPTX or HTML
42
+ python build_my_ppt.py
43
+ ```
44
+
45
+ ### PPTX Format
46
+
47
+ ```python
48
+ import sys
49
+ sys.path.insert(0, 'scripts')
50
+ from create_ppt import ChemistryPPT
51
+
52
+ ppt = ChemistryPPT(theme='academic')
53
+
54
+ ppt.add_title_slide(
55
+ title_cn='Ru₁/Cu Single-Atom Alloy for Efficient CO₂ Electroreduction',
56
+ title_en='Single-Atom Ru Alloyed with Cu for Efficient CO₂RR',
57
+ authors='Zhang, L. et al.',
58
+ journal='J. Am. Chem. Soc., 2024, 146, 12345',
59
+ doi='10.1021/jacs.4c01234'
60
+ )
61
+
62
+ ppt.add_section_slide('Background')
63
+ ppt.add_content_slide(
64
+ title='Core Challenge in Electrocatalytic CO₂ Reduction',
65
+ bullets=[
66
+ 'CO₂RR produces diverse products — selectivity control is difficult',
67
+ 'Cu-based catalysts achieve C₂₊ FE typically < 50%',
68
+ 'Key bottleneck: conflicting demands on *CO binding and C-C coupling'
69
+ ]
70
+ )
71
+ ppt.add_figure_slide(
72
+ title='HAADF-STEM Confirms Single-Atom Ru Dispersion',
73
+ figure_path='figures/p3_fig1.png',
74
+ bullets=['Bright dots uniformly dispersed — no clusters', 'EDS mapping confirms uniformity'],
75
+ figure_label='Figure 1',
76
+ layout='figure_right'
77
+ )
78
+ ppt.add_table_slide(
79
+ title='Performance Comparison',
80
+ headers=['Catalyst', 'FE(C₂₊)%', 'j (mA/cm²)', 'Stability (h)'],
81
+ rows=[['Ru₁/Cu', '82%', '300', '100'], ['Cu NPs', '45%', '150', '20']]
82
+ )
83
+ ppt.add_summary_slide(
84
+ title='Summary',
85
+ bullets=['Finding 1', 'Finding 2', 'Finding 3']
86
+ )
87
+ ppt.add_thankyou_slide()
88
+
89
+ ppt.save('output/presentation.pptx')
90
+ ppt.save_report('output/presentation.pptx') # JSON build report
91
+ ```
92
+
93
+ ### HTML Format
94
+
95
+ ```python
96
+ import sys
97
+ sys.path.insert(0, 'scripts')
98
+ from generate_html import HtmlPPT
99
+
100
+ html = HtmlPPT(title="Academic Report", theme="molecular")
101
+
102
+ # Same slide methods as ChemistryPPT (add_image_grid_slide is PPTX-only)
103
+ html.add_title_slide("Title", title_en="Title EN", authors="...", journal="...")
104
+ html.add_section_slide("Part 1")
105
+ html.add_content_slide("Key Point", ["bullet 1", "bullet 2"])
106
+ html.add_figure_slide("Figure", figure_path="figures/fig1.png",
107
+ bullets=["description"], figure_label="Figure 1",
108
+ layout="figure_right")
109
+ html.add_summary_slide("Summary", ["finding 1", "finding 2"])
110
+ html.add_thankyou_slide()
111
+
112
+ html.save('output/presentation.html') # Single file, open directly in browser
113
+ ```
114
+
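Since `HtmlPPT` exposes the same `add_*` slide methods as `ChemistryPPT`, the slide content can be written once and handed to either builder. The sketch below assumes both builders accept the call shapes shown in the HTML example above (in particular, a positional title for `add_title_slide` and `add_content_slide`, which the README shows only for `HtmlPPT`). `build_deck` and the `RecordingDeck` stand-in are hypothetical names used for illustration; the stand-in lets the sketch run without the package installed.

```python
def build_deck(deck, figure_path="figures/fig1.png"):
    """Add the same slides to any builder exposing the shared add_* methods."""
    deck.add_title_slide("Title", title_en="Title EN", authors="...", journal="...")
    deck.add_section_slide("Background")
    deck.add_content_slide("Key Point", ["bullet 1", "bullet 2"])
    deck.add_figure_slide("Figure", figure_path=figure_path,
                          bullets=["description"], figure_label="Figure 1",
                          layout="figure_right")
    deck.add_summary_slide("Summary", ["finding 1", "finding 2"])
    deck.add_thankyou_slide()
    return deck


class RecordingDeck:
    """Stand-in builder (hypothetical): records calls instead of rendering."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes not found normally; emulate add_* methods.
        if not name.startswith("add_"):
            raise AttributeError(name)

        def record(*args, **kwargs):
            self.calls.append(name)

        return record


demo = build_deck(RecordingDeck())
```

With the real package, the same function would presumably be called as `build_deck(ChemistryPPT(theme='academic'))` and `build_deck(HtmlPPT(title="Academic Report", theme="molecular"))`, followed by each builder's `save()`.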
115
+ **HTML features**:
116
+ - Figures embedded as base64 — single file, zero dependencies
117
+ - Horizontal navigation: keyboard ← → Home End, scroll wheel, touch swipe, dot navigation
118
+ - Page counter + keyboard hints
119
+ - Responsive design for projectors and mobile
120
+
121
+ ---
122
+
123
+ ## Color Themes
124
+
125
+ | Theme | PPTX param | HTML param | Best for |
126
+ |-------|-----------|------------|----------|
127
+ | Academic Classic | `theme="academic"` | `theme="academic"` | General chemistry (default) |
128
+ | Molecular Tech | `theme="molecular"` | `theme="molecular"` | Computational/materials |
129
+ | Green Chemistry | `theme="green"` | `theme="green"` | Catalysis/energy/environment |
130
+ | Nature Style | `theme="nature"` | `theme="nature"` | CNS journal presentations |
131
+
132
+ ---
133
+
134
+ ## Slide Types
135
+
136
+ | Method | PPTX | HTML | Purpose |
137
+ |--------|------|------|---------|
138
+ | `add_title_slide()` | ✓ | ✓ | Cover: bilingual title, authors, journal, DOI |
139
+ | `add_section_slide()` | ✓ | ✓ | Section divider (dark background) |
140
+ | `add_content_slide()` | ✓ | ✓ | Text content: title + bullets + notes |
141
+ | `add_figure_slide()` | ✓ | ✓ | Figure + explanation (4 layouts: right/top/left/full) |
142
+ | `add_table_slide()` | ✓ | ✓ | Data comparison table (striped rows, colored header) |
143
+ | `add_image_grid_slide()` | ✓ | — | Multi-image grid |
144
+ | `add_summary_slide()` | ✓ | ✓ | Summary (light background) |
145
+ | `add_thankyou_slide()` | ✓ | ✓ | Thank you / Q&A |
146
+
147
+ ---
148
+
149
+ ## Figure Extraction: Multi-Strategy + Version Compatibility
150
+
151
+ ```
152
+ Strategy 1: cluster_drawings() default tolerance (3,3)
153
+ ↓ < 3 results
154
+ Strategy 2: Multi-tolerance retry (6,6) → (10,10) → (15,15) → (20,20)
155
+
156
+ Strategy 3: Extract embedded bitmaps (get_images)
157
+ ↓ < 3 results
158
+ Strategy 4: Full-page rendering fallback for figure pages
159
+ ```
160
+
161
+ - Compatible with PyMuPDF 1.19+ (`get_drawings` manual clustering) and 1.23+ (`cluster_drawings`)
162
+ - `--report` outputs `extraction_report.json` (per-strategy details)
163
+
164
+ ---
165
+
166
+ ## Error Handling & Reports
167
+
168
+ Full-chain JSON reports for workflow automation and diagnostics:
169
+
170
+ | Stage | Report File | Generated By |
171
+ |-------|------------|--------------|
172
+ | Paper analysis | `analysis.json` | `analyze_paper.py --json analysis.json` |
173
+ | Figure extraction | `extraction_report.json` | `extract_charts.py --report` |
174
+ | PPTX build | `presentation_report.json` | `ppt.save_report("output.pptx")` |
175
+
176
+ **Windows encoding safety**: All scripts use `_safe_print()` to prevent GBK encoding crashes.
177
+
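The package's `_safe_print()` implementation is not shown in this README. As a rough illustration of the idea, the sketch below writes text normally and degrades to ASCII only when the output stream cannot encode it. The name `safe_print`, its `stream` parameter, and its return value are assumptions made for this example.

```python
import sys


def safe_print(text, stream=None):
    """Write `text` plus a newline; degrade to ASCII if the stream cannot encode it.

    Returns the string that was actually written (without the newline).
    """
    stream = stream if stream is not None else sys.stdout
    try:
        stream.write(text + "\n")
        return text
    except UnicodeEncodeError:
        # Replace every non-ASCII character with "?" so output never crashes.
        fallback = text.encode("ascii", errors="replace").decode("ascii")
        stream.write(fallback + "\n")
        return fallback
```

On a console whose encoding cannot represent characters such as `₂` or `−`, the message is still printed with `?` in their place instead of raising `UnicodeEncodeError`.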
178
+ **Common issue diagnostics**:
179
+
180
+ | Symptom | Likely Cause | Script Behavior |
181
+ |---------|-------------|-----------------|
182
+ | 0 vector figures extracted | PyMuPDF < 1.23 or unusual PDF rendering | Falls back automatically to manual clustering via `get_drawings` |
183
+ | Total figure count still too low | All figures are embedded bitmaps | Strategy 3 (embedded bitmap extraction) covers this automatically |
184
+ | Windows `print` crash | Unicode characters (e.g. − ₂) | `_safe_print` falls back to ASCII |
185
+ | Paper type misclassified | Characterization terms appear in the reference list | Weighted detection with a confidence annotation |
186
+ | Images missing in PPT | Image file path does not exist | Recorded in `missing_images`; the build continues |
187
+
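The last row of the table describes a "record and continue" policy for missing images. A minimal sketch of that pattern follows; only the `missing_images` key is taken from this README, while the `BuildReport` class, its `slides` counter, and the overall report layout are assumptions made for illustration.

```python
import json
from pathlib import Path


class BuildReport:
    """Collects non-fatal build problems (hypothetical structure)."""

    def __init__(self):
        self.slides = 0
        self.missing_images = []

    def add_figure(self, figure_path):
        """Count a figure slide; record the path if the file does not exist."""
        self.slides += 1
        if not Path(figure_path).is_file():
            # Do not raise: the slide is kept and the gap is reported later.
            self.missing_images.append(str(figure_path))
            return False
        return True

    def save(self, report_path):
        """Write the collected data as JSON and return it as a dict."""
        data = {"slides": self.slides, "missing_images": self.missing_images}
        Path(report_path).write_text(
            json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
        return data
```

An automated pipeline could then load the JSON file and fail the job only if `missing_images` is non-empty, rather than losing the whole deck to one bad path.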
188
+ ---
189
+
190
+ ## Paper Type Adaptations
191
+
192
+ | Experimental Chemistry | Computational Chemistry | Experimental + Theoretical |
193
+ |------------------------|------------------------|---------------------------|
194
+ | Synthesis → Characterization → Performance → Mechanism | Method → Model → Electronic Structure → Energetics → Mechanism | Experiment → Computation → Cross-Validation → Unified Mechanism |
195
+ | Catalysis / Materials / Organic / Energy | DFT / MM / AIMD / Electronic Structure | Experiment + DFT joint studies |
196
+
197
+ Detailed templates in `references/chemistry_templates.md`.
198
+
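The three paper types above are selected by what this README calls weighted detection with a confidence annotation. The sketch below shows one way such a classifier could work: keyword hits are weighted, and hits inside the reference list are down-weighted so that cited characterization or DFT terms do not dominate. The keyword lists, weights, thresholds, and the `classify` function are all hypothetical and are not taken from `analyze_paper.py`.

```python
# Hypothetical keyword weights; the package's actual vocabulary is not shown here.
EXPERIMENTAL = {"synthesis": 2.0, "xrd": 1.0, "haadf-stem": 1.5}
COMPUTATIONAL = {"dft": 2.0, "aimd": 2.0, "basis set": 1.5}
REFERENCE_WEIGHT = 0.2  # assumed: hits inside the reference list count for less


def _score(text, keywords, section_weight):
    """Sum keyword weights over naive substring counts in `text`."""
    lowered = text.lower()
    return sum(w * lowered.count(k) for k, w in keywords.items()) * section_weight


def classify(body, references=""):
    """Return (label, confidence); label is experimental, computational, or hybrid."""
    exp = _score(body, EXPERIMENTAL, 1.0) + _score(references, EXPERIMENTAL, REFERENCE_WEIGHT)
    comp = _score(body, COMPUTATIONAL, 1.0) + _score(references, COMPUTATIONAL, REFERENCE_WEIGHT)
    total = exp + comp
    if total == 0:
        return "unknown", 0.0
    share = exp / total  # fraction of the evidence that is experimental
    if share >= 0.7:
        return "experimental", round(share, 2)
    if share <= 0.3:
        return "computational", round(1 - share, 2)
    # In between: more confident the closer the split is to 50/50.
    return "hybrid", round(1 - abs(share - 0.5) / 0.2, 2)
```

In this sketch, an experimental paper that merely cites two DFT studies still classifies as experimental, whereas counting the references at full weight would pull the experimental share below the 0.7 threshold and label it hybrid.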
199
+ ---
200
+
201
+ ## File Structure
202
+
203
+ ```
204
+ PDF2PPT/
205
+ ├── SKILL.md # Skill main file (Chinese)
206
+ ├── SKILL_EN.md # Skill main file (English)
207
+ ├── README.md # Chinese README
208
+ ├── README_EN.md # This file (English)
209
+ ├── requirements.txt
210
+ ├── assets/
211
+ │ └── academic_template.html # HTML PPT template (CSS + navigation JS)
212
+ ├── scripts/
213
+ │ ├── create_ppt.py # PPTX builder (ChemistryPPT)
214
+ │ ├── generate_html.py # HTML builder (HtmlPPT)
215
+ │ ├── extract_charts.py # Multi-strategy figure extraction
216
+ │ ├── analyze_paper.py # Paper analysis + type classification
217
+ │ └── convert_to_images.py # PDF page → image conversion
218
+ ├── references/
219
+ │ ├── chemistry_templates.md # Slide-by-slide templates for 3 paper types
220
+ │ ├── chemistry_templates_en.md # (English version)
221
+ │ ├── visual_style.md # Academic PPT visual design spec
222
+ │ └── visual_style_en.md # (English version)
223
+ └── examples/
224
+ └── example_usage.py # Complete examples for 3 chemistry paper types
225
+ ```
226
+
227
+ ---
228
+
229
+ ## Compatibility
230
+
231
+ - **OS**: macOS / Linux / Windows
232
+ - **Python**: 3.8+
233
+ - **PyMuPDF**: 1.19+ (auto-detects and adapts API)
234
+ - **Environment**: Claude Code / Claude Desktop / Cursor / VS Code / any Python environment
235
+ - **HTML output**: Any modern browser (Chrome / Firefox / Edge / Safari)
236
+
237
+ ## License
238
+
239
+ MIT License