saraldocling-1.0.0.tar.gz
- saraldocling-1.0.0/PKG-INFO +394 -0
- saraldocling-1.0.0/README.md +360 -0
- saraldocling-1.0.0/pyproject.toml +72 -0
- saraldocling-1.0.0/saraldocling/__init__.py +21 -0
- saraldocling-1.0.0/saraldocling/__main__.py +227 -0
- saraldocling-1.0.0/saraldocling/_bundled_model.py +19 -0
- saraldocling-1.0.0/saraldocling/image_extractor.py +277 -0
- saraldocling-1.0.0/saraldocling/models/yolov8n-doclaynet.onnx +0 -0
- saraldocling-1.0.0/saraldocling/pdf_processor.py +95 -0
- saraldocling-1.0.0/saraldocling/pipeline.py +93 -0
- saraldocling-1.0.0/saraldocling.egg-info/PKG-INFO +394 -0
- saraldocling-1.0.0/saraldocling.egg-info/SOURCES.txt +16 -0
- saraldocling-1.0.0/saraldocling.egg-info/dependency_links.txt +1 -0
- saraldocling-1.0.0/saraldocling.egg-info/entry_points.txt +2 -0
- saraldocling-1.0.0/saraldocling.egg-info/requires.txt +15 -0
- saraldocling-1.0.0/saraldocling.egg-info/top_level.txt +1 -0
- saraldocling-1.0.0/setup.cfg +4 -0
- saraldocling-1.0.0/tests/test_basic.py +141 -0
@@ -0,0 +1,394 @@
Metadata-Version: 2.4
Name: saraldocling
Version: 1.0.0
Summary: Fast PDF text and image/table extraction using PyMuPDF + YOLOv8 DocLayNet ONNX
Author: SARAL
License-Expression: AGPL-3.0-only
Project-URL: Homepage, https://github.com/DemocratiseResearch/SARALDocling
Project-URL: Issues, https://github.com/DemocratiseResearch/SARALDocling/issues
Keywords: pdf,parsing,yolo,document-layout,onnx,table-extraction
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: numpy<3.0,>=1.24.0
Requires-Dist: Pillow>=10.0.0
Provides-Extra: cpu
Requires-Dist: onnxruntime>=1.17.0; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: onnxruntime-gpu>=1.17.0; extra == "gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"

# SARAL Docling

Fast PDF text and image/table extraction.
Powered by **PyMuPDF** + **YOLOv8 DocLayNet** (bundled) — no model file needed.

```bash
pip install "saraldocling[cpu]"
saraldocling paper.pdf
```

> **License:** This package bundles `yolov8n-doclaynet.onnx`, which is released
> under **AGPL-3.0**. The entire `saraldocling` package is therefore also
> distributed under **AGPL-3.0**.
> Full text: https://www.gnu.org/licenses/agpl-3.0.html

> **No OCR.** Only PDFs with a native text layer are supported.
> For scanned PDFs, pre-process with `ocrmypdf` first (see [Limitations](#limitations)).

---

## Table of contents

- [How it works](#how-it-works)
- [Installation](#installation)
- [CLI usage](#cli-usage)
- [Python API](#python-api)
- [Output structure](#output-structure)
- [FastAPI / server deployment](#fastapi--server-deployment)
- [Configuration reference](#configuration-reference-parseconfig)
- [DocLayNet labels](#doclaynet-labels)
- [Limitations](#limitations)
- [License](#license)

---

## How it works

| Stage                 | What happens                                                                                                                   |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| **1 — Text**          | PyMuPDF extracts the native text layer from every page                                                                         |
| **2 — Images/Tables** | YOLOv8 DocLayNet (bundled ONNX) detects Picture and Table regions, re-renders those regions at 300 DPI, and saves them as PNGs |

The ONNX model (`yolov8n-doclaynet.onnx`) is included inside the wheel — there
is no separate download step.
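
Stage 2 re-renders each detected region at 300 DPI, which implies a conversion between the pixel grid the detector sees and PDF points (1/72 inch). The sketch below shows that arithmetic only. The 96 DPI detection render and the helper names `box_to_points` and `crop_size_px` are assumptions made for illustration; the README does not state the detection resolution, and these are not functions of the package.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def box_to_points(box_px: Box, render_dpi: float) -> Box:
    """Convert a pixel box on a page rendered at render_dpi into PDF points."""
    scale = POINTS_PER_INCH / render_dpi
    x0, y0, x1, y1 = box_px
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)


def crop_size_px(box_pt: Box, target_dpi: float = 300.0) -> Tuple[int, int]:
    """Pixel size of a region (given in points) when rendered at target_dpi."""
    zoom = target_dpi / POINTS_PER_INCH
    x0, y0, x1, y1 = box_pt
    return (round((x1 - x0) * zoom), round((y1 - y0) * zoom))


# Hypothetical detection on a page rasterized at 96 DPI (assumed value).
detected_px = (96.0, 192.0, 480.0, 384.0)
region_pt = box_to_points(detected_px, render_dpi=96.0)
print(region_pt)                # (72.0, 144.0, 360.0, 288.0)
print(crop_size_px(region_pt))  # (1200, 600)
```

Working in points keeps the crop independent of whatever resolution was used for detection, so the saved PNG can be sharper than the image the detector saw.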

---

## Installation

> Pick **exactly one** of `[cpu]` or `[gpu]`. Installing both puts two competing
> `onnxruntime` builds in your environment.
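
One way to check for the conflict described above is to ask the installed-package metadata which ONNX Runtime distributions are present. This is a minimal sketch using only the standard library; the helper name `installed_onnxruntime_builds` is hypothetical, and the candidate list is simply the distribution names mentioned in this README.

```python
from importlib import metadata
from typing import Dict

# Distribution names that appear in this README's install instructions.
CANDIDATES = ("onnxruntime", "onnxruntime-gpu", "onnxruntime-silicon")


def installed_onnxruntime_builds() -> Dict[str, str]:
    """Return {distribution name: version} for each candidate that is installed."""
    found: Dict[str, str] = {}
    for name in CANDIDATES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this build is not installed in the current environment
    return found


builds = installed_onnxruntime_builds()
if len(builds) > 1:
    print("Competing ONNX Runtime builds installed:", builds)
elif builds:
    print("Single ONNX Runtime build installed:", builds)
else:
    print("No ONNX Runtime build found; install saraldocling[cpu] or saraldocling[gpu].")
```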

### CPU — works everywhere, no GPU required

```bash
pip install "saraldocling[cpu]"
```

### GPU — NVIDIA CUDA

Requires CUDA 11.8+ and cuDNN on the host.
Compatibility: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html

```bash
pip install "saraldocling[gpu]"
```

### Apple Silicon — CoreML acceleration (M1/M2/M3/M4)

The `[cpu]` wheel works fine on ARM. For ANE-accelerated inference:

```bash
pip install "saraldocling[cpu]"
pip install onnxruntime-silicon   # replaces the cpu wheel
```

### Linux / Debian server — one-time system packages

```bash
sudo apt-get install -y libgl1 libglib2.0-0
pip install "saraldocling[cpu]"
```

### From source (development)

```bash
git clone https://github.com/DemocratiseResearch/SARALDocling
cd SARALDocling
pip install -e ".[cpu,dev]"
```

---

## CLI usage

The bundled model runs automatically. No `--model` flag needed.

```bash
# Full pipeline: text + image/table crops (default)
saraldocling paper.pdf

# Text only — skip YOLO, much faster
saraldocling paper.pdf --no-images

# Custom output directory
saraldocling paper.pdf -o ./results

# Preview extracted text in the terminal
saraldocling paper.pdf --preview 800

# One .txt file per page
saraldocling paper.pdf --pages

# GPU (CUDA) — needs saraldocling[gpu]
saraldocling paper.pdf --gpu

# Apple Silicon CoreML
saraldocling paper.pdf --coreml

# Debug: see every step logged
saraldocling paper.pdf -v

# All flags
saraldocling --help
```

### What gets written

```
paper_out/
├── extracted_text.txt          ← all pages joined
├── manifest.json               ← JSON summary of the run
├── pages/                      ← only with --pages
│   ├── page_001.txt
│   └── page_002.txt
└── extracted_images/           ← Picture + Table crops at 300 DPI
    ├── p0_Picture_231.png
    └── p2_Table_445.png
```

Output directory is auto-named `<pdf_stem>_out/` next to the PDF unless you
pass `-o`.
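
The naming rule above can be reproduced with `pathlib` when a script needs to predict where the CLI will write. This is a sketch of the documented rule only; `default_output_dir` is a hypothetical helper, not part of the package.

```python
from pathlib import Path
from typing import Optional


def default_output_dir(pdf_path: str, output_dir: Optional[str] = None) -> Path:
    """Apply the documented rule: `<pdf_stem>_out/` beside the PDF unless overridden."""
    if output_dir is not None:
        return Path(output_dir)  # corresponds to passing -o
    pdf = Path(pdf_path)
    return pdf.parent / f"{pdf.stem}_out"


print(default_output_dir("reports/paper.pdf").as_posix())       # reports/paper_out
print(default_output_dir("paper.pdf", "./results").as_posix())  # results
```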

---

## Python API

### One-call pipeline (recommended)

```python
from saraldocling import parse_pdf, ParseConfig

# Full pipeline — bundled model used automatically
result = parse_pdf(ParseConfig(
    pdf_path="paper.pdf",
    output_dir="./out",
))

print(result.num_pages)      # int
print(result.text[:500])     # full extracted text
print(result.image_paths)    # list[str] of saved PNG paths
```

```python
from saraldocling import parse_pdf, ParseConfig

# Text only — skip YOLO
result = parse_pdf(ParseConfig(
    pdf_path="paper.pdf",
    output_dir="./out",
    extract_images=False,
))
```

### Lower-level: text extraction only

```python
from saraldocling import PDFProcessor

with PDFProcessor("paper.pdf") as proc:
    full_text  = proc.extract_text()
    page_text  = proc.extract_text_by_page(1)          # second page (pages are 0-indexed)
    page_image = proc.extract_page_image(0, dpi=150)   # first page as a PIL.Image
```

### Lower-level: YOLO extraction only

```python
from saraldocling import ImageExtractor

# CPU (default)
with ImageExtractor() as ext:
    paths = ext.extract_images_from_pdf("paper.pdf", "./out")

# GPU
with ImageExtractor(providers=["CUDAExecutionProvider", "CPUExecutionProvider"]) as ext:
    paths = ext.extract_images_from_pdf("paper.pdf", "./out")

# Apple Silicon
with ImageExtractor(providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]) as ext:
    paths = ext.extract_images_from_pdf("paper.pdf", "./out")
```

### Using a custom model (advanced)

```python
from saraldocling import parse_pdf, ParseConfig

result = parse_pdf(ParseConfig(
    pdf_path="paper.pdf",
    output_dir="./out",
    model_path="/path/to/custom.onnx",   # overrides the bundled model
))
```

---

## Output structure

`manifest.json` example:

```json
{
  "pdf": "/abs/path/to/paper.pdf",
  "pages": 12,
  "text_chars": 38201,
  "text_file": "/abs/path/to/paper_out/extracted_text.txt",
  "per_page_files": [],
  "image_paths": [
    "/abs/path/to/paper_out/extracted_images/p0_Picture_231.png",
    "/abs/path/to/paper_out/extracted_images/p3_Table_112.png"
  ]
}
```

Crop filenames follow the pattern `p<page>_<Label>_<x>.png` where `<Label>`
is either `Picture` or `Table`.
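
Because the page number and label are encoded in the filename, a consumer can group crops without opening the images. The sketch below parses the pattern with a regular expression and groups the manifest's `image_paths` by label. It assumes `<x>` is a non-negative integer, as in the examples shown; `parse_crop_name` and `crops_by_label` are hypothetical helpers, not package API.

```python
import json
import re
from typing import Dict, List, Optional, Tuple

# Pattern from the README: p<page>_<Label>_<x>.png, Label in {Picture, Table}.
# Treating <x> as an integer is an assumption based on the example names.
CROP_RE = re.compile(r"^p(\d+)_(Picture|Table)_(\d+)\.png$")


def parse_crop_name(path: str) -> Optional[Tuple[int, str, int]]:
    """Return (page, label, x) for a crop path, or None if the name does not match."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    m = CROP_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), int(m.group(3))


def crops_by_label(manifest: dict) -> Dict[str, List[int]]:
    """Group page numbers by label using the manifest's image_paths list."""
    grouped: Dict[str, List[int]] = {}
    for p in manifest.get("image_paths", []):
        parsed = parse_crop_name(p)
        if parsed is not None:
            page, label, _ = parsed
            grouped.setdefault(label, []).append(page)
    return grouped


# The manifest example above, trimmed to the field used here.
manifest = json.loads("""
{
  "pages": 12,
  "image_paths": [
    "/abs/path/to/paper_out/extracted_images/p0_Picture_231.png",
    "/abs/path/to/paper_out/extracted_images/p3_Table_112.png"
  ]
}
""")
print(crops_by_label(manifest))  # {'Picture': [0], 'Table': [3]}
```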

---

## FastAPI / server deployment

`saraldocling` is a plain pip package — drop it into any FastAPI app on Debian
with no special system configuration beyond the apt packages above.

### Minimal endpoint

```python
# main.py
import shutil
import uuid
from pathlib import Path

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from saraldocling import parse_pdf, ParseConfig

OUTPUT_ROOT = Path("/tmp/saraldocling")

app = FastAPI()

# Plain `def`: FastAPI runs it in a worker thread, so the blocking
# parse_pdf() call does not stall the event loop.
@app.post("/parse")
def parse(file: UploadFile = File(...)):
    job_dir = OUTPUT_ROOT / str(uuid.uuid4())
    job_dir.mkdir(parents=True, exist_ok=True)
    # Fixed name: the client-supplied filename is untrusted and could
    # contain path separators.
    pdf_path = job_dir / "upload.pdf"

    with open(pdf_path, "wb") as f:
        shutil.copyfileobj(file.file, f)

    result = parse_pdf(ParseConfig(
        pdf_path=str(pdf_path),
        output_dir=str(job_dir),
    ))

    return JSONResponse({
        "pages": result.num_pages,
        "text": result.text,
        "image_paths": result.image_paths,
    })
```

```bash
pip install "saraldocling[cpu]" fastapi uvicorn python-multipart
uvicorn main:app --host 0.0.0.0 --port 8000
```

### Pre-load the ONNX session at startup

Creating an `ImageExtractor` loads and optimises the ONNX graph (~300 ms).
Do this once at startup and reuse it — `session.run()` is thread-safe:

```python
from contextlib import asynccontextmanager
from typing import Optional

from fastapi import FastAPI
from saraldocling import ImageExtractor

# Optional[...] instead of `ImageExtractor | None`: the `|` form fails at
# import time on Python 3.9, which this package supports.
extractor: Optional[ImageExtractor] = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global extractor
    extractor = ImageExtractor()   # loads bundled model
    try:
        yield
    finally:
        extractor.close()          # release the session on shutdown

app = FastAPI(lifespan=lifespan)

# then in your endpoint, call extractor.extract_images_from_pdf() directly
```

---

## Configuration reference (`ParseConfig`)

| Field            | Default                    | Description                                     |
| ---------------- | -------------------------- | ----------------------------------------------- |
| `pdf_path`       | required                   | Path to the source PDF                          |
| `output_dir`     | required                   | Root directory for all output files             |
| `extract_images` | `True`                     | Run YOLO stage. `False` = text only             |
| `model_path`     | `None`                     | Custom `.onnx` path. `None` = use bundled model |
| `conf_threshold` | `0.30`                     | YOLO confidence cutoff                          |
| `nms_threshold`  | `0.45`                     | NMS IoU cutoff                                  |
| `min_box_size`   | `30`                       | Minimum crop side-length in pixels              |
| `num_workers`    | `cpu_count`                | Thread-pool size for page processing            |
| `onnx_providers` | `["CPUExecutionProvider"]` | ONNX Runtime execution providers                |
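
To make the three detection thresholds concrete, the sketch below applies them to a handful of made-up boxes using a generic greedy non-maximum suppression. It illustrates what such cutoffs conventionally mean and is not the package's own implementation: the order in which the filters are applied, the coordinate space of `min_box_size`, and the names `iou` and `filter_detections` are all assumptions.

```python
from typing import List, Tuple

Detection = Tuple[float, float, float, float, float]  # (x0, y0, x1, y1, score)


def iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    dets: List[Detection],
    conf_threshold: float = 0.30,
    nms_threshold: float = 0.45,
    min_box_size: float = 30,
) -> List[Detection]:
    """Drop weak or tiny boxes, then keep the best box of each overlapping group."""
    candidates = [
        d for d in dets
        if d[4] >= conf_threshold
        and min(d[2] - d[0], d[3] - d[1]) >= min_box_size
    ]
    candidates.sort(key=lambda d: d[4], reverse=True)  # highest score first
    kept: List[Detection] = []
    for d in candidates:
        # Keep d only if it does not overlap an already-kept box too much.
        if all(iou(d, k) <= nms_threshold for k in kept):
            kept.append(d)
    return kept


dets = [
    (100, 100, 300, 300, 0.90),  # strong detection, kept
    (110, 110, 310, 310, 0.60),  # IoU with the first is about 0.82, suppressed
    (400, 400, 420, 420, 0.80),  # 20 px sides, below min_box_size
    (500, 100, 700, 300, 0.20),  # below conf_threshold
    (500, 400, 700, 600, 0.50),  # isolated, kept
]
print(filter_detections(dets))
# [(100, 100, 300, 300, 0.9), (500, 400, 700, 600, 0.5)]
```

Raising `conf_threshold` trades recall for fewer false crops, while lowering `nms_threshold` merges near-duplicate boxes more aggressively.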

---

## DocLayNet labels

The model detects 11 document element types.
Only **`Picture`** and **`Table`** regions are saved as crops.

```
Caption       Footnote   Formula          List-item   Page-footer
Page-header   Picture    Section-header   Table       Text          Title
```
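
The filtering rule above amounts to mapping each predicted class index to its label and keeping only two of them. In this minimal sketch, the assumption is that the model's class indices follow the order in which the labels are listed here; the README does not confirm that ordering, and `saved_regions` is a hypothetical helper.

```python
from typing import List, Tuple

# Label order as printed in the README. That it matches the model's class
# indices is an assumption.
LABELS = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]
SAVED = {"Picture", "Table"}  # the only labels written out as crops


def saved_regions(class_ids: List[int]) -> List[Tuple[int, str]]:
    """Return (detection index, label) for detections that would be cropped."""
    return [(i, LABELS[c]) for i, c in enumerate(class_ids) if LABELS[c] in SAVED]


print(len(LABELS))                     # 11
print(saved_regions([9, 6, 8, 0, 6]))  # [(1, 'Picture'), (2, 'Table'), (4, 'Picture')]
```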

---

## Limitations

**Not OCR** — requires a native text layer in the PDF. For scanned documents:

```bash
pip install ocrmypdf
ocrmypdf scanned.pdf scanned_ocr.pdf
saraldocling scanned_ocr.pdf
```

**Thread safety** — both `PDFProcessor` and `ImageExtractor` use internal
mutex locks (mirroring the Go `SafeDocument`). Safe for concurrent use.

---

## License

This package is distributed under **AGPL-3.0** because it bundles
`yolov8n-doclaynet.onnx`, which carries that licence.

What this means in practice:

- Free to use for any purpose, including commercial.
- If you modify the source and run it as a network service, you must make
  your modified source available under AGPL-3.0.
- The full licence text is at https://www.gnu.org/licenses/agpl-3.0.html