saraldocling 1.0.0__tar.gz

@@ -0,0 +1,394 @@
1
+ Metadata-Version: 2.4
2
+ Name: saraldocling
3
+ Version: 1.0.0
4
+ Summary: Fast PDF text and image/table extraction using PyMuPDF + YOLOv8 DocLayNet ONNX
5
+ Author: SARAL
6
+ License-Expression: AGPL-3.0-only
7
+ Project-URL: Homepage, https://github.com/DemocratiseResearch/SARALDocling
8
+ Project-URL: Issues, https://github.com/DemocratiseResearch/SARALDocling/issues
9
+ Keywords: pdf,parsing,yolo,document-layout,onnx,table-extraction
10
+ Classifier: Development Status :: 3 - Alpha
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
19
+ Classifier: Topic :: Text Processing :: General
20
+ Requires-Python: >=3.9
21
+ Description-Content-Type: text/markdown
22
+ Requires-Dist: pymupdf>=1.24.0
23
+ Requires-Dist: numpy<3.0,>=1.24.0
24
+ Requires-Dist: Pillow>=10.0.0
25
+ Provides-Extra: cpu
26
+ Requires-Dist: onnxruntime>=1.17.0; extra == "cpu"
27
+ Provides-Extra: gpu
28
+ Requires-Dist: onnxruntime-gpu>=1.17.0; extra == "gpu"
29
+ Provides-Extra: dev
30
+ Requires-Dist: pytest>=7.0; extra == "dev"
31
+ Requires-Dist: pytest-cov; extra == "dev"
32
+ Requires-Dist: ruff; extra == "dev"
33
+ Requires-Dist: mypy; extra == "dev"
34
+
35
+ # SARAL Docling
36
+
37
+ Fast PDF text and image/table extraction.
38
+ Powered by **PyMuPDF** + **YOLOv8 DocLayNet** (bundled) — no model file needed.
39
+
40
+ ```bash
41
+ pip install "saraldocling[cpu]"
42
+ saraldocling paper.pdf
43
+ ```
44
+
45
+ > **License:** This package bundles `yolov8n-doclaynet.onnx` which is released
46
+ > under **AGPL-3.0**. The entire `saraldocling` package is therefore also
47
+ > distributed under **AGPL-3.0**.
48
+ > Full text: https://www.gnu.org/licenses/agpl-3.0.html
49
+
50
+ > **No OCR.** Only PDFs with a native text layer are supported.
51
+ > For scanned PDFs, pre-process with `ocrmypdf` first (see [Limitations](#limitations)).
52
+
53
+ ---
54
+
55
+ ## Table of contents
56
+
57
+ - [How it works](#how-it-works)
58
+ - [Installation](#installation)
59
+ - [CLI usage](#cli-usage)
60
+ - [Python API](#python-api)
61
+ - [Output structure](#output-structure)
62
+ - [FastAPI / server deployment](#fastapi--server-deployment)
63
+ - [Configuration reference](#configuration-reference-parseconfig)
64
+ - [Limitations](#limitations)
65
+
66
+ ---
67
+
68
+ ## How it works
69
+
70
+ | Stage | What happens |
71
+ | --------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
72
+ | **1 — Text** | PyMuPDF extracts the native text layer from every page |
73
+ | **2 — Images/Tables** | YOLOv8 DocLayNet (bundled ONNX) detects Picture and Table regions, re-renders those regions at 300 DPI, and saves them as PNGs |
74
+
75
+ The ONNX model (`yolov8n-doclaynet.onnx`) is included inside the wheel — there
76
+ is no separate download step.
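
To make the 300 DPI re-render in stage 2 concrete, here is a minimal arithmetic sketch. It relies only on the PDF convention that one point is 1/72 inch. The helper name `crop_size_px` and the example box are hypothetical and are not part of the `saraldocling` API.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def crop_size_px(box_pt, dpi=300):
    """Return (width, height) in pixels for a region given in PDF points.

    box_pt is (x0, y0, x1, y1) in points; dpi is the render resolution.
    """
    x0, y0, x1, y1 = box_pt
    scale = dpi / PDF_POINTS_PER_INCH
    return round((x1 - x0) * scale), round((y1 - y0) * scale)


# A hypothetical table region that is 3 inches wide and 2 inches tall
print(crop_size_px((72, 144, 288, 288)))  # (900, 600)
```

At 300 DPI each point becomes roughly 4.17 pixels, which is why the crops are much larger than a typical screen-resolution page render.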
77
+
78
+ ---
79
+
80
+ ## Installation
81
+
82
+ > Pick **exactly one** of `[cpu]` or `[gpu]`. Installing both puts two competing
83
+ > `onnxruntime` builds in your environment.
84
+
85
+ ### CPU — works everywhere, no GPU required
86
+
87
+ ```bash
88
+ pip install "saraldocling[cpu]"
89
+ ```
90
+
91
+ ### GPU — NVIDIA CUDA
92
+
93
+ Requires CUDA 11.8+ and cuDNN on the host.
94
+ Compatibility: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
95
+
96
+ ```bash
97
+ pip install "saraldocling[gpu]"
98
+ ```
99
+
100
+ ### Apple Silicon — CoreML acceleration (M1/M2/M3/M4)
101
+
102
+ The `[cpu]` wheel works fine on ARM. For ANE-accelerated inference:
103
+
104
+ ```bash
105
+ pip install "saraldocling[cpu]"
106
+ pip install onnxruntime-silicon # replaces the cpu wheel
107
+ ```
108
+
109
+ ### Linux / Debian server — one-time system packages
110
+
111
+ ```bash
112
+ sudo apt-get install -y libgl1 libglib2.0-0
113
+ pip install "saraldocling[cpu]"
114
+ ```
115
+
116
+ ### From source (development)
117
+
118
+ ```bash
119
+ git clone https://github.com/DemocratiseResearch/SARALDocling
120
+ cd SARALDocling
121
+ pip install -e ".[cpu,dev]"
122
+ ```
123
+
124
+ ---
125
+
126
+ ## CLI usage
127
+
128
+ The bundled model runs automatically. No `--model` flag needed.
129
+
130
+ ```bash
131
+ # Full pipeline: text + image/table crops (default)
132
+ saraldocling paper.pdf
133
+
134
+ # Text only — skip YOLO, much faster
135
+ saraldocling paper.pdf --no-images
136
+
137
+ # Custom output directory
138
+ saraldocling paper.pdf -o ./results
139
+
140
+ # Preview extracted text in the terminal
141
+ saraldocling paper.pdf --preview 800
142
+
143
+ # One .txt file per page
144
+ saraldocling paper.pdf --pages
145
+
146
+ # GPU (CUDA) — needs saraldocling[gpu]
147
+ saraldocling paper.pdf --gpu
148
+
149
+ # Apple Silicon CoreML
150
+ saraldocling paper.pdf --coreml
151
+
152
+ # Debug: see every step logged
153
+ saraldocling paper.pdf -v
154
+
155
+ # All flags
156
+ saraldocling --help
157
+ ```
158
+
159
+ ### What gets written
160
+
161
+ ```
162
+ paper_out/
163
+ ├── extracted_text.txt ← all pages joined
164
+ ├── manifest.json ← JSON summary of the run
165
+ ├── pages/ ← only with --pages
166
+ │   ├── page_001.txt
167
+ │   └── page_002.txt
168
+ └── extracted_images/ ← Picture + Table crops at 300 DPI
169
+     ├── p0_Picture_231.png
170
+     └── p2_Table_445.png
171
+ ```
172
+
173
+ Output directory is auto-named `<pdf_stem>_out/` next to the PDF unless you
174
+ pass `-o`.
175
+
176
+ ---
177
+
178
+ ## Python API
179
+
180
+ ### One-call pipeline (recommended)
181
+
182
+ ```python
183
+ from saraldocling import parse_pdf, ParseConfig
184
+
185
+ # Full pipeline — bundled model used automatically
186
+ result = parse_pdf(ParseConfig(
187
+     pdf_path="paper.pdf",
188
+     output_dir="./out",
189
+ ))
190
+
191
+ print(result.num_pages) # int
192
+ print(result.text[:500]) # full extracted text
193
+ print(result.image_paths) # list[str] of saved PNG paths
194
+ ```
195
+
196
+ ```python
197
+ # Text only — skip YOLO
198
+ result = parse_pdf(ParseConfig(
199
+     pdf_path="paper.pdf",
200
+     output_dir="./out",
201
+     extract_images=False,
202
+ ))
203
+ ```
204
+
205
+ ### Lower-level: text extraction only
206
+
207
+ ```python
208
+ from saraldocling import PDFProcessor
209
+
210
+ with PDFProcessor("paper.pdf") as proc:
211
+     full_text = proc.extract_text()
211
+     page_text = proc.extract_text_by_page(1)  # second page (pages are 0-indexed)
212
+     page_image = proc.extract_page_image(0, dpi=150)  # PIL.Image
214
+ ```
215
+
216
+ ### Lower-level: YOLO extraction only
217
+
218
+ ```python
219
+ from saraldocling import ImageExtractor
220
+
221
+ # CPU (default)
222
+ with ImageExtractor() as ext:
223
+     paths = ext.extract_images_from_pdf("paper.pdf", "./out")
224
+
225
+ # GPU
226
+ with ImageExtractor(providers=["CUDAExecutionProvider", "CPUExecutionProvider"]) as ext:
227
+     paths = ext.extract_images_from_pdf("paper.pdf", "./out")
228
+
229
+ # Apple Silicon
230
+ with ImageExtractor(providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]) as ext:
231
+     paths = ext.extract_images_from_pdf("paper.pdf", "./out")
232
+ ```
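
The provider lists above can also be chosen at runtime. The sketch below is one possible way to do that, not part of `saraldocling`: `pick_providers` and `PREFERENCE` are hypothetical names. In real use the `available` argument would come from `onnxruntime.get_available_providers()`; it is passed in here so the example runs without ONNX Runtime installed.

```python
# Preferred order: hardware-accelerated providers first, CPU as the fallback.
PREFERENCE = [
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available):
    """Return the preferred providers that are actually available.

    CPUExecutionProvider is always kept last as a fallback.
    """
    chosen = [p for p in PREFERENCE if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Example with a hypothetical Apple Silicon build of ONNX Runtime
print(pick_providers(["CoreMLExecutionProvider", "CPUExecutionProvider"]))
# ['CoreMLExecutionProvider', 'CPUExecutionProvider']
```

The returned list has the same shape as the `providers=` argument shown above and as the `onnx_providers` field of `ParseConfig`.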
233
+
234
+ ### Using a custom model (advanced)
235
+
236
+ ```python
237
+ result = parse_pdf(ParseConfig(
238
+     pdf_path="paper.pdf",
238
+     output_dir="./out",
239
+     model_path="/path/to/custom.onnx",  # overrides the bundled model
241
+ ))
242
+ ```
243
+
244
+ ---
245
+
246
+ ## Output structure
247
+
248
+ `manifest.json` example:
249
+
250
+ ```json
251
+ {
251
+   "pdf": "/abs/path/to/paper.pdf",
252
+   "pages": 12,
253
+   "text_chars": 38201,
254
+   "text_file": "/abs/path/to/paper_out/extracted_text.txt",
255
+   "per_page_files": [],
256
+   "image_paths": [
257
+     "/abs/path/to/paper_out/extracted_images/p0_Picture_231.png",
258
+     "/abs/path/to/paper_out/extracted_images/p3_Table_112.png"
259
+   ]
260
+ }
262
+ ```
263
+
264
+ Crop filenames follow the pattern `p<page>_<Label>_<x>.png`, where `<page>` is the
264
+ page index (`p0` in the examples, so apparently 0-based) and `<Label>` is either `Picture` or `Table`.
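
As a rough illustration of consuming this output, the sketch below groups the crops listed in a manifest by label, using only the filename pattern described above. `group_crops` and `CROP_NAME` are hypothetical helpers rather than package API, and the manifest is inlined so the example runs on its own.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Matches names such as p0_Picture_231.png or p3_Table_112.png
CROP_NAME = re.compile(r"^p(?P<page>\d+)_(?P<label>Picture|Table)_(?P<x>\d+)\.png$")


def group_crops(manifest):
    """Map each label to a sorted list of (page, path) tuples."""
    groups = defaultdict(list)
    for path in manifest.get("image_paths", []):
        match = CROP_NAME.match(Path(path).name)
        if match:  # skip anything that does not follow the pattern
            groups[match["label"]].append((int(match["page"]), path))
    return {label: sorted(items) for label, items in groups.items()}


manifest = json.loads("""
{
  "pages": 12,
  "image_paths": [
    "/abs/path/to/paper_out/extracted_images/p0_Picture_231.png",
    "/abs/path/to/paper_out/extracted_images/p3_Table_112.png"
  ]
}
""")

for label, items in group_crops(manifest).items():
    print(label, [page for page, _ in items])
```

In practice the manifest would be read from `manifest.json` in the output directory instead of an inline string.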
266
+
267
+ ---
268
+
269
+ ## FastAPI / server deployment
270
+
271
+ `saraldocling` is a plain pip package — drop it into any FastAPI app on Debian
272
+ with no special system configuration beyond the apt packages above.
273
+
274
+ ### Minimal endpoint
275
+
276
+ ```python
277
+ # main.py
278
+ import shutil, uuid
279
+ from pathlib import Path
280
+
281
+ from fastapi import FastAPI, File, UploadFile
282
+ from fastapi.responses import JSONResponse
283
+ from saraldocling import parse_pdf, ParseConfig
284
+
285
+ OUTPUT_ROOT = Path("/tmp/saraldocling")
286
+
287
+ app = FastAPI()
288
+
289
+ @app.post("/parse")
290
+ def parse(file: UploadFile = File(...)):  # plain def: FastAPI runs it in a worker thread, so parse_pdf does not block the event loop
291
+     job_dir = OUTPUT_ROOT / str(uuid.uuid4())
292
+     job_dir.mkdir(parents=True, exist_ok=True)
293
+     pdf_path = job_dir / Path(file.filename or "upload.pdf").name  # basename only, guards against path traversal
294
+
295
+     with open(pdf_path, "wb") as f:
296
+         shutil.copyfileobj(file.file, f)
297
+
298
+     result = parse_pdf(ParseConfig(
299
+         pdf_path=str(pdf_path),
300
+         output_dir=str(job_dir),
301
+     ))
302
+
303
+     return JSONResponse({
304
+         "pages": result.num_pages,
305
+         "text": result.text,
306
+         "image_paths": result.image_paths,
307
+     })
308
+ ```
309
+
310
+ ```bash
311
+ pip install "saraldocling[cpu]" fastapi uvicorn python-multipart
312
+ uvicorn main:app --host 0.0.0.0 --port 8000
313
+ ```
314
+
315
+ ### Pre-load the ONNX session at startup
316
+
317
+ Creating an `ImageExtractor` loads and optimises the ONNX graph (~300 ms).
318
+ Do this once at startup and reuse it — `session.run()` is thread-safe:
319
+
320
+ ```python
321
+ from contextlib import asynccontextmanager
322
+ from fastapi import FastAPI
323
+ from saraldocling import ImageExtractor
324
+
325
+ extractor: "ImageExtractor | None" = None  # quoted annotation: `X | None` fails at runtime on Python 3.9
326
+
327
+ @asynccontextmanager
328
+ async def lifespan(app: FastAPI):
329
+     global extractor
330
+     extractor = ImageExtractor()  # loads bundled model
331
+     yield
332
+     extractor.close()
333
+
334
+ app = FastAPI(lifespan=lifespan)
335
+
336
+ # then in your endpoint, call extractor.extract_images_from_pdf() directly
337
+ ```
338
+
339
+ ---
340
+
341
+ ## Configuration reference (`ParseConfig`)
342
+
343
+ | Field | Default | Description |
344
+ | ---------------- | -------------------------- | ----------------------------------------------- |
345
+ | `pdf_path` | required | Path to the source PDF |
346
+ | `output_dir` | required | Root directory for all output files |
347
+ | `extract_images` | `True` | Run YOLO stage. `False` = text only |
348
+ | `model_path` | `None` | Custom `.onnx` path. `None` = use bundled model |
349
+ | `conf_threshold` | `0.30` | YOLO confidence cutoff |
350
+ | `nms_threshold` | `0.45` | NMS IoU cutoff |
351
+ | `min_box_size` | `30` | Minimum crop side-length in pixels |
352
+ | `num_workers` | `cpu_count` | Thread-pool size for page processing |
353
+ | `onnx_providers` | `["CPUExecutionProvider"]` | ONNX Runtime execution providers |
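
To show what `conf_threshold`, `nms_threshold` and `min_box_size` typically control, here is a generic greedy non-maximum suppression sketch in plain Python. It is not taken from the package: `iou` and `filter_detections` are hypothetical names, and the order in which the three filters are applied is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1, ...)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_threshold=0.30, nms_threshold=0.45, min_box_size=30):
    """Keep confident, large-enough boxes and drop overlapping duplicates.

    dets is a list of (x0, y0, x1, y1, score) tuples in pixels.
    """
    candidates = [
        d for d in dets
        if d[4] >= conf_threshold                        # confidence cutoff
        and min(d[2] - d[0], d[3] - d[1]) >= min_box_size  # shortest side
    ]
    candidates.sort(key=lambda d: d[4], reverse=True)    # best score first
    kept = []
    for d in candidates:
        # Greedy NMS: keep a box only if it does not overlap a better one too much
        if all(iou(d, k) <= nms_threshold for k in kept):
            kept.append(d)
    return kept


dets = [
    (0, 0, 100, 100, 0.90),      # kept
    (5, 5, 105, 105, 0.80),      # overlaps the first box (IoU about 0.82)
    (200, 200, 220, 220, 0.95),  # shorter than min_box_size
    (300, 300, 400, 400, 0.20),  # below conf_threshold
]
print(filter_detections(dets))  # [(0, 0, 100, 100, 0.9)]
```

Raising `conf_threshold` yields fewer, more certain crops, while lowering `nms_threshold` suppresses overlapping boxes more aggressively.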
354
+
355
+ ---
356
+
357
+ ## DocLayNet labels
358
+
359
+ The model detects 11 document element types.
360
+ Only **`Picture`** and **`Table`** regions are saved as crops.
361
+
362
+ ```
363
+ Caption Footnote Formula List-item Page-footer
364
+ Page-header Picture Section-header Table Text Title
365
+ ```
366
+
367
+ ---
368
+
369
+ ## Limitations
370
+
371
+ **Not OCR** — requires a native text layer in the PDF. For scanned documents:
372
+
373
+ ```bash
374
+ pip install ocrmypdf
375
+ ocrmypdf scanned.pdf scanned_ocr.pdf
376
+ saraldocling scanned_ocr.pdf
377
+ ```
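
One way to decide whether a PDF needs the `ocrmypdf` step is to check how much text each page yields. The heuristic below is only a sketch: `looks_scanned` is a hypothetical helper and both thresholds are arbitrary assumptions. The per-page strings could come from `PDFProcessor.extract_text_by_page` as shown in the Python API section.

```python
def looks_scanned(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Guess whether a PDF lacks a native text layer.

    page_texts is a list with one extracted-text string per page.
    Returns True when more than max_empty_ratio of the pages are nearly empty.
    """
    if not page_texts:
        return True  # no pages with text at all
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return empty / len(page_texts) > max_empty_ratio


print(looks_scanned(["", "  ", "x" * 100]))  # True: 2 of 3 pages are empty
print(looks_scanned(["x" * 100] * 3))        # False: every page has text
```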
378
+
379
+ **Thread safety** — both `PDFProcessor` and `ImageExtractor` use internal
380
+ mutex locks (mirroring the Go `SafeDocument`). Safe for concurrent use.
381
+
382
+ ---
383
+
384
+ ## License
385
+
386
+ This package is distributed under **AGPL-3.0** because it bundles
387
+ `yolov8n-doclaynet.onnx`, which carries that licence.
388
+
389
+ What this means in practice:
390
+
391
+ - Free to use for any purpose, including commercial.
392
+ - If you modify the source and let users interact with it over a network, you must
393
+   offer those users your modified source under AGPL-3.0.
394
+ - The full licence text is at https://www.gnu.org/licenses/agpl-3.0.html