rc-docparser 0.2.0__tar.gz
- rc_docparser-0.2.0/.gitignore +37 -0
- rc_docparser-0.2.0/CHANGELOG.md +59 -0
- rc_docparser-0.2.0/LICENSE +21 -0
- rc_docparser-0.2.0/PKG-INFO +344 -0
- rc_docparser-0.2.0/README.md +243 -0
- rc_docparser-0.2.0/pyproject.toml +148 -0
- rc_docparser-0.2.0/src/docparser/__init__.py +87 -0
- rc_docparser-0.2.0/src/docparser/cli.py +209 -0
- rc_docparser-0.2.0/src/docparser/common.py +163 -0
- rc_docparser-0.2.0/src/docparser/csvtab.py +131 -0
- rc_docparser-0.2.0/src/docparser/docx.py +488 -0
- rc_docparser-0.2.0/src/docparser/epub.py +349 -0
- rc_docparser-0.2.0/src/docparser/html.py +322 -0
- rc_docparser-0.2.0/src/docparser/image.py +343 -0
- rc_docparser-0.2.0/src/docparser/localvlm.py +103 -0
- rc_docparser-0.2.0/src/docparser/ocr.py +68 -0
- rc_docparser-0.2.0/src/docparser/orchestrator.py +304 -0
- rc_docparser-0.2.0/src/docparser/pdf.py +430 -0
- rc_docparser-0.2.0/src/docparser/pdf_backends.py +89 -0
- rc_docparser-0.2.0/src/docparser/pptx.py +332 -0
- rc_docparser-0.2.0/src/docparser/py.typed +0 -0
- rc_docparser-0.2.0/src/docparser/text.py +189 -0
- rc_docparser-0.2.0/src/docparser/xlsx.py +319 -0
- rc_docparser-0.2.0/tests/conftest.py +247 -0
- rc_docparser-0.2.0/tests/test_cli.py +41 -0
- rc_docparser-0.2.0/tests/test_common.py +77 -0
- rc_docparser-0.2.0/tests/test_csv.py +17 -0
- rc_docparser-0.2.0/tests/test_docx.py +64 -0
- rc_docparser-0.2.0/tests/test_epub.py +48 -0
- rc_docparser-0.2.0/tests/test_html.py +30 -0
- rc_docparser-0.2.0/tests/test_image.py +149 -0
- rc_docparser-0.2.0/tests/test_localvlm.py +35 -0
- rc_docparser-0.2.0/tests/test_orchestrator.py +70 -0
- rc_docparser-0.2.0/tests/test_pdf.py +30 -0
- rc_docparser-0.2.0/tests/test_pdf_backends.py +55 -0
- rc_docparser-0.2.0/tests/test_pptx.py +52 -0
- rc_docparser-0.2.0/tests/test_text.py +27 -0
- rc_docparser-0.2.0/tests/test_xlsx.py +40 -0
@@ -0,0 +1,37 @@
+# --- secrets / env ---
+.env
+.env.*
+!.env.example
+
+# --- Python ---
+.venv/
+venv/
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.Python
+*.egg-info/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.coverage
+.coverage.*
+htmlcov/
+
+# --- build artefacts ---
+build/
+dist/
+*.whl
+*.tar.gz
+
+# --- macOS ---
+.DS_Store
+
+# --- editor ---
+.vscode/
+.idea/
+*.swp
+
+# --- pipeline cache ---
+.cache/
@@ -0,0 +1,59 @@
+# Changelog
+
+All notable changes to **docparser** are documented here.
+Format follows [Keep a Changelog](https://keepachangelog.com/) and the project
+follows [Semantic Versioning](https://semver.org/).
+
+## [Unreleased]
+
+## [0.2.0] - 2026-06-16
+### Added
+- PPTX parser (extra `[pptx]`): walks slides in order; emits per-slide
+  headings, bulleted text, tables, images, and speaker notes; image captioning.
+- Plain-text and Markdown parser (core): `.txt` / `.md` into structured blocks
+  (headings, list items, code fences, paragraphs).
+- CSV/TSV parser (core): delimiter sniffing, header detection, Markdown table
+  + per-row JSON records.
+- EPUB parser (extra `[epub]`): spine-ordered chapter extraction with a
+  BeautifulSoup structural walk and embedded-image captioning.
+- Pluggable high-fidelity PDF backends via `parse_pdf(backend=...)` and
+  `docparser.pdf_backends`: `pymupdf4llm`, `docling`, `marker` (each an opt-in
+  extra; Marker/PyMuPDF4LLM licenses noted).
+- OCR for scanned/low-text PDFs via `parse_pdf(ocr="off|auto|force")` and an
+  image OCR helper (`docparser.ocr`, extra `[ocr]` using `rapidocr-onnxruntime`).
+- Better PDF tables via `parse_pdf(extract_tables=True)` using `pdfplumber`
+  (extra `[tables]`), emitting real `table` blocks.
+- Multi-provider captioning: `caption_image(provider=...)` presets for
+  `openrouter` / `openai` / `gemini` / `local` (any OpenAI-compatible endpoint),
+  plus a fully-local `transformers` backend (`docparser.localvlm`, extra
+  `[localvlm]`).
+- CLI flags: `--vlm-provider`, `--vlm-model`, `--pdf-backend`, `--ocr`,
+  `--pdf-tables`; `run_all`/`parse_path` thread these options through.
+- Typing: PEP 561 `py.typed` marker and a `mypy` configuration.
+- GitHub Actions: CI (lint + mypy + tests across Python 3.10-3.12 + build) and
+  a tag-triggered PyPI Trusted-Publishing workflow.
+
+### Changed
+- `parse_path` and `run_all` accept the new PDF and captioning options;
+  non-PDF parsers ignore PDF-only keyword arguments.
+
+## [0.1.0] - 2026-06-12
+### Added
+- Initial public release.
+- DOCX parser: walks the document body in order; preserves headings, lists,
+  paragraphs, tables, and inline images; associates figure captions and
+  surrounding context with each image.
+- XLSX parser: traverses every sheet, every row, every column; preserves cell
+  types, formulas, hyperlinks, comments, merged ranges, frozen panes, and
+  embedded images.
+- PDF parser (extra `[pdf]`): page-by-page text + image extraction via PyMuPDF;
+  best-effort heading detection from font sizing.
+- HTML parser (extra `[html]`): article-grade text extraction via trafilatura
+  with a BeautifulSoup-based fallback that preserves headings, lists, and
+  tables.
+- Image semantic captioner (extra `[vlm]`): OpenRouter-backed VLM client with
+  on-disk caching keyed by image SHA-1; returns
+  `{caption, description, visible_text, tags, image_kind, domain_relevance}`.
+- `docparser` CLI: `parse` (single file), `parse-all` (directory walk),
+  `version`.
+- `WorkspaceLayout` dataclass and `parse_path` dispatcher for library use.
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Research Commons
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,344 @@
+Metadata-Version: 2.4
+Name: rc-docparser
+Version: 0.2.0
+Summary: Convert research literature (.docx, .xlsx, .pdf, .html, .pptx, .epub, .txt, .md, .csv) into structured Markdown + JSON corpora, with optional VLM image semantic captioning.
+Project-URL: Homepage, https://github.com/Research-Commons/docparser
+Project-URL: Repository, https://github.com/Research-Commons/docparser
+Project-URL: Issues, https://github.com/Research-Commons/docparser/issues
+Project-URL: Changelog, https://github.com/Research-Commons/docparser/blob/main/CHANGELOG.md
+Author-email: Research Commons <shubhankitsingh@researchcommons.ai>
+License: MIT License
+
+Copyright (c) 2026 Research Commons
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+License-File: LICENSE
+Keywords: corpus,csv,docx,epub,html,literature,markdown,ocr,parser,pdf,pptx,rag,vlm,xlsx
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Text Processing :: Markup
+Requires-Python: >=3.10
+Requires-Dist: lxml>=5.3.0
+Requires-Dist: openpyxl>=3.1.5
+Requires-Dist: pillow>=10.4.0
+Requires-Dist: python-docx>=1.1.2
+Requires-Dist: python-dotenv>=1.0.1
+Requires-Dist: pyyaml>=6.0.2
+Requires-Dist: tqdm>=4.66.5
+Provides-Extra: all
+Requires-Dist: beautifulsoup4>=4.12.0; extra == 'all'
+Requires-Dist: ebooklib>=0.18; extra == 'all'
+Requires-Dist: numpy>=1.24.0; extra == 'all'
+Requires-Dist: pdfplumber>=0.11.0; extra == 'all'
+Requires-Dist: pymupdf>=1.24.0; extra == 'all'
+Requires-Dist: python-pptx>=1.0.0; extra == 'all'
+Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == 'all'
+Requires-Dist: requests>=2.32.3; extra == 'all'
+Requires-Dist: trafilatura>=1.12.0; extra == 'all'
+Provides-Extra: dev
+Requires-Dist: build>=1.2.0; extra == 'dev'
+Requires-Dist: ebooklib>=0.18; extra == 'dev'
+Requires-Dist: mypy>=1.10.0; extra == 'dev'
+Requires-Dist: pandas>=2.2.0; extra == 'dev'
+Requires-Dist: pdfplumber>=0.11.0; extra == 'dev'
+Requires-Dist: pytest-cov>=5.0; extra == 'dev'
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: python-pptx>=1.0.0; extra == 'dev'
+Requires-Dist: ruff>=0.6.0; extra == 'dev'
+Requires-Dist: twine>=5.1.0; extra == 'dev'
+Provides-Extra: docling
+Requires-Dist: docling>=2.0.0; extra == 'docling'
+Provides-Extra: epub
+Requires-Dist: beautifulsoup4>=4.12.0; extra == 'epub'
+Requires-Dist: ebooklib>=0.18; extra == 'epub'
+Provides-Extra: html
+Requires-Dist: beautifulsoup4>=4.12.0; extra == 'html'
+Requires-Dist: trafilatura>=1.12.0; extra == 'html'
+Provides-Extra: localvlm
+Requires-Dist: pillow>=10.4.0; extra == 'localvlm'
+Requires-Dist: torch>=2.2.0; extra == 'localvlm'
+Requires-Dist: transformers>=4.40.0; extra == 'localvlm'
+Provides-Extra: marker
+Requires-Dist: marker-pdf>=1.0.0; extra == 'marker'
+Provides-Extra: ocr
+Requires-Dist: numpy>=1.24.0; extra == 'ocr'
+Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == 'ocr'
+Provides-Extra: pdf
+Requires-Dist: pymupdf>=1.24.0; extra == 'pdf'
+Provides-Extra: pptx
+Requires-Dist: python-pptx>=1.0.0; extra == 'pptx'
+Provides-Extra: pymupdf4llm
+Requires-Dist: pymupdf4llm>=0.0.17; extra == 'pymupdf4llm'
+Provides-Extra: tables
+Requires-Dist: pdfplumber>=0.11.0; extra == 'tables'
+Provides-Extra: vlm
+Requires-Dist: requests>=2.32.3; extra == 'vlm'
+Description-Content-Type: text/markdown
+
+# docparser
+
+Convert research literature (`.docx`, `.xlsx`, `.pdf`, `.html`, `.pptx`,
+`.epub`, `.txt`, `.md`, `.csv`) into a clean, reproducible **Markdown + JSON**
+corpus, with optional **vision-language captioning** of every embedded figure
+via OpenRouter, OpenAI, Gemini, a local server, or a fully-local model.
+
+```text
+                         ┌────────────────┐
+data/raw/*.docx          │   docparser    │      data/parsed/<slug>/
+data/raw/*.xlsx          │ - parse_docx   │        document.md
+data/raw/*.pdf   ─────►  │ - parse_xlsx   │ ────►  document.json
+data/raw/*.html          │ - parse_pdf    │
+data/raw/*.pptx          │ - parse_html   │      data/assets/<slug>/
+data/raw/*.epub          │ - parse_pptx   │        img-*.png
+data/raw/*.txt|md|csv    │ - parse_epub   │
+                         │ - VLM caption  │
+                         └────────────────┘
+```
+
+## Install
+
+```bash
+pip install docparser # core: docx + xlsx + txt/md + csv/tsv
+pip install 'docparser[pdf]' # + PyMuPDF for PDFs
+pip install 'docparser[html]' # + trafilatura + bs4 for HTML
+pip install 'docparser[pptx]' # + python-pptx for PowerPoint
+pip install 'docparser[epub]' # + EbookLib + bs4 for EPUB
+pip install 'docparser[vlm]' # + requests for API VLM captions
+pip install 'docparser[all]' # everything above (recommended)
+```
+
+Higher-fidelity / heavier features are separate opt-in extras (so the core
+install stays small and MIT):
+
+```bash
+pip install 'docparser[tables]' # + pdfplumber for PDF table extraction
+pip install 'docparser[ocr]' # + rapidocr-onnxruntime for scanned PDFs
+pip install 'docparser[pymupdf4llm]' # PyMuPDF4LLM PDF backend (AGPL/commercial)
+pip install 'docparser[docling]' # IBM Docling PDF backend (MIT)
+pip install 'docparser[marker]' # Datalab Marker PDF backend (GPL-3.0)
+pip install 'docparser[localvlm]' # transformers/torch local captioning
+```
+
+`docparser` requires Python 3.10+.
+
+## Quick start (library)
+
+```python
+from docparser import WorkspaceLayout, run_all
+
+layout = WorkspaceLayout.under("./project") # data/raw, data/parsed, data/assets, .cache
+layout.ensure()
+
+run_all(layout, use_vlm=False) # parse everything in data/raw
+```
+
+For a single file:
+
+```python
+from docparser import parse_path, WorkspaceLayout
+
+layout = WorkspaceLayout.under(".")
+payload = parse_path("paper.pdf", layout)
+print(payload["stats"])
+```
+
+## Quick start (CLI)
+
+```bash
+# parse a single file
+docparser parse paper.pdf --workspace ./out --no-vlm
+
+# walk a whole directory
+docparser parse-all --workspace ./project --no-vlm
+
+# enable VLM captioning (requires OPENROUTER_API_KEY in env or .env)
+export OPENROUTER_API_KEY=sk-or-v1-...
+docparser parse-all --workspace ./project --max-images 50
+
+# higher-fidelity PDF: pick a backend, OCR scanned pages, extract tables
+docparser parse paper.pdf --pdf-backend docling --ocr auto --pdf-tables --no-vlm
+
+# caption with a different provider
+docparser parse-all --workspace ./project --vlm-provider openai --vlm-model gpt-4o-mini
+
+docparser version
+```
+
+## What gets captured
+
+### `.docx`
+- Walks the document body in **document order** (paragraphs + tables + drawings).
+- Preserves heading hierarchy (`section_path`) on every block.
+- Extracts every embedded image to `data/assets/<slug>/` with a stable
+  `img-<seq>-<sha10>.<ext>` name.
+- Detects figure/table captions (style `Caption` or text matching
+  `Figure 1: …` / `Fig. 1.` / `Table 1.`) and associates the caption with the
+  preceding image.
+- Captures `context_before` and `context_after` for every image so the VLM has
+  document-grounded context.
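Reviewer sketch (not part of the package): the caption rule in the `.docx` bullets above could look roughly like the following. The regex and the name `looks_like_caption` are assumptions built only from the three example shapes the README lists; the pattern actually used in `docx.py` may differ.

```python
import re

# Hypothetical pattern covering the README's examples:
# "Figure 1: ...", "Fig. 1." and "Table 1." (the real regex is not shown in this diff).
CAPTION_RE = re.compile(r"^(Figure|Fig\.|Table)\s+(\d+)\s*[:.]", re.IGNORECASE)


def looks_like_caption(text: str, style_name: str = "") -> bool:
    """Return True when a paragraph is plausibly a figure/table caption."""
    # A paragraph styled "Caption" counts regardless of its text.
    if style_name == "Caption":
        return True
    # Otherwise fall back to matching the leading label and number.
    return CAPTION_RE.match(text.strip()) is not None
```

A prose sentence such as "Figure 12 shows the pipeline" does not match, because the number must be followed by `:` or `.`.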
+
+### `.xlsx`
+- Iterates **every sheet, every row, every column**.
+- For each cell stores: address, row/col indices, value, openpyxl `data_type`,
+  `number_format`, `hyperlink`, `comment`, and the **formula** (from a second
+  pass with `data_only=False`).
+- Stores `merged_ranges`, `frozen_panes`, and any embedded images.
+- Markdown rendering uses the first non-empty row as a header heuristic and
+  preserves multi-line cells with `<br>`.
+
+### `.pdf`
+- Page-by-page text extraction in reading order via PyMuPDF's blocks API.
+- Best-effort heading detection from font size (≥120% of the body-text median
+  promotes a line to a heading; bold flag tracked).
+- Embedded raster images extracted via `doc.extract_image(xref)`.
+- **Pluggable backends** (`backend="pymupdf4llm" | "docling" | "marker"`) route
+  conversion to a higher-fidelity engine; their Markdown is normalized into the
+  same block schema. Images are still extracted via PyMuPDF.
+- **OCR** (`ocr="auto" | "force"`, `[ocr]` extra) recognizes text on scanned /
+  low-text pages; OCR'd blocks carry `"ocr": true`.
+- **Tables** (`extract_tables=True`, `[tables]` extra) emit real `table` blocks
+  via `pdfplumber`.
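Reviewer sketch (not part of the package): the font-size heuristic in the `.pdf` bullets can be illustrated with the standard library alone. The input shape `(text, font_size, is_bold)` and the function name `classify_lines` are assumptions for illustration; the README states only the 120%-of-median rule and that the bold flag is tracked.

```python
from statistics import median


def classify_lines(lines: list[tuple[str, float, bool]], ratio: float = 1.2) -> list[dict]:
    """Label each (text, font_size, is_bold) line as heading or paragraph."""
    if not lines:
        return []
    # Treat the median font size of all lines as the body-text size.
    body_size = median(size for _, size, _ in lines)
    blocks = []
    for text, size, bold in lines:
        # A line at or above ratio * body size is promoted to a heading.
        kind = "heading" if size >= ratio * body_size else "paragraph"
        # The bold flag is recorded but does not affect the decision here.
        blocks.append({"type": kind, "text": text, "font_size": size, "bold": bold})
    return blocks
```

The median keeps a few large title lines from shifting the body-size estimate, which is presumably why a median rather than a mean is used.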
+
+### `.html`
+- Article-grade body extraction via `trafilatura`.
+- Plus a **structural** BeautifulSoup walk that emits typed blocks
+  (`heading` / `paragraph` / `list` / `table` / `image`) so downstream RAG
+  layers can rely on the JSON.
+
+### `.pptx`
+- Walks slides in presentation order; each slide becomes a section.
+- Emits per-slide headings (slide title), bulleted text frames (with list
+  level), tables, pictures, and **speaker notes**.
+- Embedded pictures extracted and optionally captioned.
+
+### `.epub`
+- Walks the spine in reading order; per-chapter BeautifulSoup structural walk.
+- Captures metadata (title/author/language), headings, paragraphs, lists,
+  tables, and embedded images (resolved from the EPUB image manifest).
+
+### `.txt` / `.md` and `.csv` / `.tsv` (core, no extras)
+- Plain text is split into paragraph blocks; Markdown is passed through and
+  also decomposed into heading / list / code / paragraph blocks.
+- CSV/TSV: delimiter sniffing, header detection, a Markdown table, and one JSON
+  record per row.
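Reviewer sketch (not part of the package): the CSV/TSV behaviour described above maps naturally onto the standard library's `csv.Sniffer`. Whether `csvtab.py` uses `csv.Sniffer` is not visible in this diff, so treat `sniff_table` and its return shape as hypothetical.

```python
import csv
import io
from typing import Optional


def sniff_table(text: str, has_header: Optional[bool] = None) -> dict:
    """Guess the delimiter, then build a Markdown table and per-row records."""
    sample = text[:4096]
    sniffer = csv.Sniffer()
    try:
        delimiter = sniffer.sniff(sample, delimiters=",\t;|").delimiter
    except csv.Error:
        delimiter = ","  # fall back to plain CSV when sniffing is inconclusive
    if has_header is None:
        try:
            has_header = sniffer.has_header(sample)
        except csv.Error:
            has_header = True
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return {"delimiter": delimiter, "markdown": "", "records": []}
    if has_header:
        header, body = rows[0], rows[1:]
    else:
        # Synthesize column names when the first row is data.
        header, body = [f"col{i + 1}" for i in range(len(rows[0]))], rows
    md = ["| " + " | ".join(header) + " |",
          "| " + " | ".join("---" for _ in header) + " |"]
    md += ["| " + " | ".join(row) + " |" for row in body]
    records = [dict(zip(header, row)) for row in body]
    return {"delimiter": delimiter, "markdown": "\n".join(md), "records": records}
```

`csv.Sniffer.has_header` is a heuristic and can misjudge small or all-text files, so the sketch lets the caller override it.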
+
+### Images (`[vlm]` extra)
+Each image is sent to a vision-language model (default provider OpenRouter,
+model `anthropic/claude-sonnet-4`) along with its surrounding caption +
+context. Any OpenAI-compatible provider works via `--vlm-provider`
+(`openrouter` / `openai` / `gemini` / `local`), or use a fully-local
+`transformers` model with `--vlm-provider transformers` (`[localvlm]` extra).
+The model returns a strict JSON object:
+
+```json
+{
+  "caption": "one-sentence figure caption",
+  "description": "2–5 sentence paragraph",
+  "visible_text": "OCR-style transcription",
+  "tags": ["world-model", "diagram", "..."],
+  "image_kind": "diagram | plot | screenshot | photo | equation | table | ...",
+  "domain_relevance": "how this relates to the document's topic"
+}
+```
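Reviewer sketch (not part of the package): a consumer of the JSON object above may want to validate a reply before trusting it. The six keys come from the README; the class name `CaptionRecord` and the validation rules are assumptions, and the package's own `VLMResult` type may be shaped differently.

```python
import json
from dataclasses import dataclass, field

EXPECTED_KEYS = ("caption", "description", "visible_text",
                 "tags", "image_kind", "domain_relevance")


@dataclass
class CaptionRecord:  # hypothetical container, not the package's VLMResult
    caption: str = ""
    description: str = ""
    visible_text: str = ""
    tags: list = field(default_factory=list)
    image_kind: str = ""
    domain_relevance: str = ""


def parse_caption_reply(raw: str) -> CaptionRecord:
    """Parse a model reply and check that it has the documented keys."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in EXPECTED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["tags"], list):
        raise ValueError("'tags' must be a list")
    # Unknown extra keys are dropped rather than rejected.
    return CaptionRecord(**{key: data[key] for key in EXPECTED_KEYS})
```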
+
+Results are cached on disk at `<cache_dir>/vlm/<model>/<sha1>.json`, keyed by
+**SHA-1 of the image bytes × model**, so re-runs are free until the source
+image bytes change.
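Reviewer sketch (not part of the package): the cache layout `<cache_dir>/vlm/<model>/<sha1>.json` can be reproduced with `hashlib` and `pathlib`. How the package maps a model id containing `/` (such as `anthropic/claude-sonnet-4`) to a directory name is not shown in this diff; the `__` replacement below is an assumption, as are the function names.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, model: str, image_bytes: bytes) -> Path:
    """Build <cache_dir>/vlm/<model>/<sha1>.json for one image and model."""
    digest = hashlib.sha1(image_bytes).hexdigest()
    # Assumption: flatten "/" so each model id maps to a single directory.
    safe_model = model.replace("/", "__")
    return cache_dir / "vlm" / safe_model / f"{digest}.json"


def load_cached(cache_dir: Path, model: str, image_bytes: bytes) -> Optional[dict]:
    """Return the cached result, or None when this image/model pair is new."""
    path = cache_path(cache_dir, model, image_bytes)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def store(cache_dir: Path, model: str, image_bytes: bytes, result: dict) -> Path:
    """Write a result so later runs with identical bytes skip the API call."""
    path = cache_path(cache_dir, model, image_bytes)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return path
```

Because the key is derived from the bytes rather than the file name, renaming an image does not invalidate its cached caption, while re-encoding it does.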
+
+## Configuration (`.env`)
+
+| Var | Default | Purpose |
+| --- | --- | --- |
+| `DOCPARSER_VLM_PROVIDER` | `openrouter` | `openrouter` / `openai` / `gemini` / `local` |
+| `OPENROUTER_API_KEY` | _required for OpenRouter_ | OpenRouter key (`sk-or-...`) |
+| `OPENROUTER_VLM_MODEL` | `anthropic/claude-sonnet-4` | any vision-capable OpenRouter model |
+| `OPENROUTER_BASE_URL` | `https://openrouter.ai/api/v1` | override for a proxy |
+| `OPENROUTER_REFERER` / `OPENROUTER_TITLE` | repo URL / `docparser` | OpenRouter attribution headers |
+| `OPENAI_API_KEY` / `OPENAI_VLM_MODEL` | _required for OpenAI_ / `gpt-4o-mini` | OpenAI provider |
+| `GEMINI_API_KEY` / `GEMINI_VLM_MODEL` | _required for Gemini_ / `gemini-1.5-flash` | Gemini provider |
+| `DOCPARSER_VLM_BASE_URL` | `http://localhost:11434/v1` | base URL for the `local` provider |
+| `DOCPARSER_VLM_API_KEY` / `DOCPARSER_VLM_MODEL` | — / `llava` | key + model for the `local` provider |
+| `DOCPARSER_LOCAL_VLM_MODEL` | `Salesforce/blip-image-captioning-large` | model for the `transformers` backend |
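Reviewer sketch (not part of the package): the table above implies a small "provider preset" lookup. The variable names and defaults below are copied from the table; the `PRESETS` structure and `resolve_vlm_settings` are hypothetical, and base URLs for `openai` and `gemini` are left as `None` because the table does not document them.

```python
import os
from typing import Mapping, Optional

# (key variable, model variable, default model, base-URL variable, default base URL)
PRESETS = {
    "openrouter": ("OPENROUTER_API_KEY", "OPENROUTER_VLM_MODEL",
                   "anthropic/claude-sonnet-4",
                   "OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
    "openai": ("OPENAI_API_KEY", "OPENAI_VLM_MODEL", "gpt-4o-mini", None, None),
    "gemini": ("GEMINI_API_KEY", "GEMINI_VLM_MODEL", "gemini-1.5-flash", None, None),
    "local": ("DOCPARSER_VLM_API_KEY", "DOCPARSER_VLM_MODEL", "llava",
              "DOCPARSER_VLM_BASE_URL", "http://localhost:11434/v1"),
}


def resolve_vlm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick provider, key, model and base URL from environment-style settings."""
    env = os.environ if env is None else env
    provider = env.get("DOCPARSER_VLM_PROVIDER", "openrouter")
    if provider not in PRESETS:
        raise ValueError(f"unknown provider: {provider!r}")
    key_var, model_var, default_model, url_var, default_url = PRESETS[provider]
    return {
        "provider": provider,
        "api_key": env.get(key_var),  # None when the key is not set
        "model": env.get(model_var, default_model),
        "base_url": env.get(url_var, default_url) if url_var else None,
    }
```

Passing a plain dict instead of reading `os.environ` directly makes the lookup easy to test without touching the real environment.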
+
+## API reference (highlights)
+
+- `WorkspaceLayout(raw_dir, parsed_dir, assets_dir, cache_dir)` —
+  dataclass describing where parser output lives. Use `.under(root)` for the
+  default `data/raw + data/parsed + data/assets + .cache` layout under a root.
+- `parse_docx(source, layout=None, *, captioner=None, write_outputs=True)` →
+  payload dict.
+- `parse_xlsx(source, layout=None, *, captioner=None, write_outputs=True)` →
+  payload dict.
+- `parse_pdf(source, layout=None, *, captioner=None, write_outputs=True, extract_images=True, backend="builtin", ocr="off", extract_tables=False)` →
+  payload dict. (requires `[pdf]`; backends/OCR/tables require their extras)
+- `parse_html(source, layout=None, *, captioner=None, write_outputs=True, use_trafilatura=True)` →
+  payload dict. `source` may be a path or `http(s)://` URL. (requires `[html]`)
+- `parse_pptx(source, layout=None, *, captioner=None, write_outputs=True)` →
+  payload dict. (requires `[pptx]`)
+- `parse_epub(source, layout=None, *, captioner=None, write_outputs=True)` →
+  payload dict. (requires `[epub]`)
+- `parse_text(source, layout=None, ...)` / `parse_csv(source, layout=None, ...)` —
+  core parsers for `.txt`/`.md` and `.csv`/`.tsv`.
+- `parse_path(source, layout=None, **kwargs)` — dispatches by extension;
+  PDF-only kwargs (`backend`, `ocr`, `extract_tables`) are forwarded to PDFs.
+- `run_all(layout, *, use_vlm=True, only=None, max_images=None, continue_on_error=False, vlm_provider=None, vlm_model=None, pdf_backend="builtin", ocr="off", extract_tables=False)` —
+  walks `layout.raw_dir`, parses everything supported, writes a top-level
+  `CORPUS.md` and `data/parsed/corpus.json`.
+- `caption_image(image_bytes, *, mime, doc_name, nearby_caption, context, provider=None, model=None, layout=None, ...)` →
+  `VLMResult`. (requires `[vlm]`)
+
+## Development
+
+```bash
+git clone https://github.com/Research-Commons/docparser
+cd docparser
+python -m venv .venv && source .venv/bin/activate
+pip install -e ".[all,dev]"
+pytest -ra
+ruff check src tests
+mypy
+python -m build # produces dist/*.whl + *.tar.gz
+twine check dist/*
+```
+
+### Publishing
+
+CI runs lint + mypy + tests on Python 3.10-3.12 and builds the distribution on
+every push/PR (`.github/workflows/ci.yml`). Pushing a version tag (e.g.
+`v0.2.0`) triggers `.github/workflows/publish.yml`, which builds and uploads to
+PyPI via **Trusted Publishing** (OIDC, no stored token) — configure a trusted
+publisher for the project on PyPI first. To publish manually instead:
+
+```bash
+python -m build
+twine check dist/*
+twine upload dist/*
+```
+
+## License
+
+MIT — see [LICENSE](LICENSE).