docling-serve 0.3.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 International Business Machines
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,483 @@
1
+ Metadata-Version: 2.2
2
+ Name: docling-serve
3
+ Version: 0.3.0
4
+ Summary: Running Docling as a service
5
+ Author-email: Michele Dolfi <dol@zurich.ibm.com>, Guillaume Moutier <gmoutier@redhat.com>, Anil Vishnoi <avishnoi@redhat.com>, Panos Vagenas <pva@zurich.ibm.com>, Panos Vagenas <pva@zurich.ibm.com>, Christoph Auer <cau@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>
6
+ Maintainer-email: Michele Dolfi <dol@zurich.ibm.com>, Anil Vishnoi <avishnoi@redhat.com>, Panos Vagenas <pva@zurich.ibm.com>, Christoph Auer <cau@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>
7
+ License: MIT
8
+ Project-URL: Homepage, https://github.com/DS4SD/docling-serve
9
+ Project-URL: Repository, https://github.com/DS4SD/docling-serve
10
+ Project-URL: Issues, https://github.com/DS4SD/docling-serve/issues
11
+ Project-URL: Changelog, https://github.com/DS4SD/docling-serve/blob/main/CHANGELOG.md
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Operating System :: OS Independent
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Typing :: Typed
16
+ Classifier: Programming Language :: Python :: 3
17
+ Requires-Python: >=3.10
18
+ Description-Content-Type: text/markdown
19
+ License-File: LICENSE
20
+ Requires-Dist: docling~=2.23
21
+ Requires-Dist: fastapi[standard]~=0.115
22
+ Requires-Dist: httpx~=0.28
23
+ Requires-Dist: pydantic~=2.10
24
+ Requires-Dist: pydantic-settings~=2.4
25
+ Requires-Dist: python-multipart<0.1.0,>=0.0.14
26
+ Requires-Dist: typer~=0.12
27
+ Requires-Dist: uvicorn[standard]<1.0.0,>=0.29.0
28
+ Provides-Extra: ui
29
+ Requires-Dist: gradio~=5.9; extra == "ui"
30
+ Provides-Extra: tesserocr
31
+ Requires-Dist: tesserocr~=2.7; extra == "tesserocr"
32
+ Provides-Extra: rapidocr
33
+ Requires-Dist: rapidocr-onnxruntime~=1.4; python_version < "3.13" and extra == "rapidocr"
34
+ Requires-Dist: onnxruntime~=1.7; extra == "rapidocr"
35
+ Provides-Extra: cpu
36
+ Requires-Dist: torch>=2.6.0; extra == "cpu"
37
+ Requires-Dist: torchvision>=0.21.0; extra == "cpu"
38
+ Provides-Extra: cu124
39
+ Requires-Dist: torch>=2.6.0; extra == "cu124"
40
+ Requires-Dist: torchvision>=0.21.0; extra == "cu124"
41
+
42
+ # Docling Serve
43
+
44
+ Running [Docling](https://github.com/DS4SD/docling) as an API service.
45
+
46
+ ## Usage
47
+
48
+ The API provides two endpoints: one for URLs and one for files. The separate file endpoint makes it possible to send files directly in binary format instead of as base64-encoded strings.
49
+
50
+ ### Common parameters
51
+
52
+ In addition to the document source (see below), both endpoints accept the same parameters, which are almost the same as those of the Docling CLI.
53
+
54
+ - `from_formats` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats.
55
+ - `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`.
56
+ - `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`.
57
+ - `image_export_mode` (str): Image export mode for the document (only applies to JSON, Markdown, or HTML output). Allowed values: `embedded`, `placeholder`, `referenced`. Defaults to `embedded`.
58
+ - `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`.
59
+ - `ocr_engine` (str): OCR engine to use. Allowed values: `easyocr`, `tesseract_cli`, `tesseract`, `rapidocr`, `ocrmac`. Defaults to `easyocr`.
60
+ - `ocr_lang` (List[str]): List of languages used by the OCR engine. Note that each OCR engine has different values for the language names. Defaults to empty.
61
+ - `pdf_backend` (str): PDF backend to use. Allowed values: `pypdfium2`, `dlparse_v1`, `dlparse_v2`. Defaults to `dlparse_v2`.
62
+ - `table_mode` (str): Table mode to use. Allowed values: `fast`, `accurate`. Defaults to `fast`.
63
+ - `abort_on_error` (bool): If enabled, abort on error. Defaults to `False`.
64
+ - `return_as_file` (bool): If enabled, return the output as a file. Defaults to `False`.
65
+ - `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to `True`.
66
+ - `include_images` (bool): If enabled, images will be extracted from the document. Defaults to `True`.
67
+ - `images_scale` (float): Scale factor for images. Defaults to 2.0.
68
+
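As a rough illustration of the parameters above, the following Python sketch assembles an `options` dictionary and checks the enumerated parameters against the allowed values listed in this section. The helper name `build_options` and the client-side check are assumptions made for this example; they are not part of docling-serve.

```python
# Allowed values copied from the parameter list above.
ALLOWED = {
    "to_formats": {"md", "json", "html", "text", "doctags"},
    "ocr_engine": {"easyocr", "tesseract_cli", "tesseract", "rapidocr", "ocrmac"},
    "pdf_backend": {"pypdfium2", "dlparse_v1", "dlparse_v2"},
    "table_mode": {"fast", "accurate"},
    "image_export_mode": {"embedded", "placeholder", "referenced"},
}


def build_options(**overrides):
    """Hypothetical helper: return an options dict after checking enumerated values."""
    for name, value in overrides.items():
        allowed = ALLOWED.get(name)
        if allowed is None:
            continue  # free-form parameter (bool, float, language list, ...)
        values = value if isinstance(value, list) else [value]
        invalid = [v for v in values if v not in allowed]
        if invalid:
            raise ValueError(f"{name}: unsupported value(s) {invalid}")
    return dict(overrides)


# Parameters that are left out fall back to the documented server defaults.
options = build_options(to_formats=["md", "json"], ocr_engine="easyocr", do_ocr=True)
```

The resulting dictionary can be sent as the `options` field of a JSON payload.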
69
+ ### URL endpoint
70
+
71
+ The endpoint is `/v1alpha/convert/source`, listening for POST requests of JSON payloads.
72
+
73
+ In addition to the above parameters, you must provide the document(s) you want to process using either the `http_sources` or the `file_sources` field.
74
+ `http_sources` fetches the document(s) from the given URL(s), optionally with extra headers; `file_sources` lets you provide documents as base64-encoded strings.
75
+ The `options` field is optional; its entries can be partially or completely omitted.
76
+
77
+ Simple payload example:
78
+
79
+ ```json
80
+ {
81
+ "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
82
+ }
83
+ ```
84
+
85
+ <details>
86
+
87
+ <summary>Complete payload example:</summary>
88
+
89
+ ```json
+ {
+   "options": {
+     "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
+     "to_formats": ["md", "json", "html", "text", "doctags"],
+     "image_export_mode": "placeholder",
+     "do_ocr": true,
+     "force_ocr": false,
+     "ocr_engine": "easyocr",
+     "ocr_lang": ["en"],
+     "pdf_backend": "dlparse_v2",
+     "table_mode": "fast",
+     "abort_on_error": false,
+     "return_as_file": false
+   },
+   "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
+ }
106
+ ```
107
+
108
+ </details>
109
+
110
+ <details>
111
+
112
+ <summary>CURL example:</summary>
113
+
114
+ ```sh
+ curl -X 'POST' \
+   'http://localhost:5001/v1alpha/convert/source' \
+   -H 'accept: application/json' \
+   -H 'Content-Type: application/json' \
+   -d '{
+     "options": {
+       "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
+       "to_formats": ["md", "json", "html", "text", "doctags"],
+       "image_export_mode": "placeholder",
+       "do_ocr": true,
+       "force_ocr": false,
+       "ocr_engine": "easyocr",
+       "ocr_lang": ["fr", "de", "es", "en"],
+       "pdf_backend": "dlparse_v2",
+       "table_mode": "fast",
+       "abort_on_error": false,
+       "return_as_file": false,
+       "do_table_structure": true,
+       "include_images": true,
+       "images_scale": 2
+     },
+     "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
+   }'
152
+ ```
153
+
154
+ </details>
155
+
156
+ <details>
157
+ <summary>Python example:</summary>
158
+
159
+ ```python
+ import asyncio
+
+ import httpx
+
+ url = "http://localhost:5001/v1alpha/convert/source"
+ payload = {
+     "options": {
+         "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
+         "to_formats": ["md", "json", "html", "text", "doctags"],
+         "image_export_mode": "placeholder",
+         "do_ocr": True,
+         "force_ocr": False,
+         "ocr_engine": "easyocr",
+         "ocr_lang": ["en"],
+         "pdf_backend": "dlparse_v2",
+         "table_mode": "fast",
+         "abort_on_error": False,
+         "return_as_file": False,
+     },
+     "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}],
+ }
+
+
+ async def main() -> dict:
+     # Use the client as a context manager so connections are closed afterwards.
+     async with httpx.AsyncClient(timeout=60.0) as async_client:
+         response = await async_client.post(url, json=payload)
+     return response.json()
+
+
+ data = asyncio.run(main())
184
+ ```
185
+
186
+ </details>
187
+
188
+ #### File as base64
189
+
190
+ The `file_sources` argument of the endpoint allows sending files as base64-encoded strings.
191
+ When your PDF or other file type is too large, encoding it and passing it inline to curl
192
+ can lead to an "Argument list too long" error on some systems. To avoid this, we write
193
+ the JSON request body to a file and have curl read from that file.
194
+
195
+ <details>
196
+ <summary>CURL steps:</summary>
197
+
198
+ ```sh
199
+ # 1. Base64-encode the file
200
+ B64_DATA=$(base64 -w 0 /path/to/file/pdf-to-convert.pdf)
201
+
202
+ # 2. Build the JSON with your options
203
+ cat <<EOF > /tmp/request_body.json
204
+ {
205
+ "options": {
206
+ },
207
+ "file_sources": [{
208
+ "base64_string": "${B64_DATA}",
209
+ "filename": "pdf-to-convert.pdf"
210
+ }]
211
+ }
212
+ EOF
213
+
214
+ # 3. POST the request to the docling service
215
+ curl -X POST "localhost:5001/v1alpha/convert/source" \
216
+ -H "Content-Type: application/json" \
217
+ -d @/tmp/request_body.json
218
+ ```
219
+
220
+ </details>
221
+
222
+ ### File endpoint
223
+
224
+ The endpoint is `/v1alpha/convert/file`, listening for POST requests with form payloads (necessary because the files are sent as `multipart/form-data`). You can send one or multiple files.
225
+
226
+ <details>
227
+ <summary>CURL example:</summary>
228
+
229
+ ```sh
230
+ curl -X 'POST' \
231
+ 'http://127.0.0.1:5001/v1alpha/convert/file' \
232
+ -H 'accept: application/json' \
233
+ -H 'Content-Type: multipart/form-data' \
234
+ -F 'ocr_engine=easyocr' \
235
+ -F 'pdf_backend=dlparse_v2' \
236
+ -F 'from_formats=pdf' \
237
+ -F 'from_formats=docx' \
238
+ -F 'force_ocr=false' \
239
+ -F 'image_export_mode=embedded' \
240
+ -F 'ocr_lang=en' \
241
+ -F 'ocr_lang=pl' \
242
+ -F 'table_mode=fast' \
243
+ -F 'files=@2206.01062v1.pdf;type=application/pdf' \
244
+ -F 'abort_on_error=false' \
245
+ -F 'to_formats=md' \
246
+ -F 'to_formats=text' \
247
+ -F 'return_as_file=false' \
248
+ -F 'do_ocr=true'
249
+ ```
250
+
251
+ </details>
252
+
253
+ <details>
254
+ <summary>Python example:</summary>
255
+
256
+ ```python
+ import asyncio
+ import json
+ import os
+
+ import httpx
+
+ url = "http://localhost:5001/v1alpha/convert/file"
+ parameters = {
+     "from_formats": ["docx", "pptx", "html", "image", "pdf", "asciidoc", "md", "xlsx"],
+     "to_formats": ["md", "json", "html", "text", "doctags"],
+     "image_export_mode": "placeholder",
+     "do_ocr": True,
+     "force_ocr": False,
+     "ocr_engine": "easyocr",
+     "ocr_lang": ["en"],
+     "pdf_backend": "dlparse_v2",
+     "table_mode": "fast",
+     "abort_on_error": False,
+     "return_as_file": False,
+ }
+
+ current_dir = os.path.dirname(__file__)
+ file_path = os.path.join(current_dir, "2206.01062v1.pdf")
+
+
+ async def main() -> dict:
+     # Open the file in binary mode and close it once the upload is done.
+     with open(file_path, "rb") as pdf_file:
+         files = {"files": ("2206.01062v1.pdf", pdf_file, "application/pdf")}
+         async with httpx.AsyncClient(timeout=60.0) as async_client:
+             response = await async_client.post(
+                 url, files=files, data={"parameters": json.dumps(parameters)}
+             )
+     assert response.status_code == 200, "Response should be 200 OK"
+     return response.json()
+
+
+ data = asyncio.run(main())
286
+ ```
287
+
288
+ </details>
289
+
290
+ ### Response format
291
+
292
+ The response can be a JSON Document or a File.
293
+
294
+ - If you process only one file, the response will be a JSON document with the following format:
295
+
296
+ ```jsonc
297
+ {
298
+ "document": {
299
+ "md_content": "",
300
+ "json_content": {},
301
+ "html_content": "",
302
+ "text_content": "",
303
+ "doctags_content": ""
304
+ },
305
+ "status": "<success|partial_success|skipped|failure>",
306
+ "processing_time": 0.0,
307
+ "timings": {},
308
+ "errors": []
309
+ }
310
+ ```
311
+
312
+ Depending on the value you set in `to_formats`, the different items will be populated with their respective results or left empty.
313
+
314
+ `processing_time` is the Docling processing time in seconds, and `timings` (when enabled in the backend) provides the detailed
315
+ timing of all the internal Docling components.
316
+
317
+ - If you set the parameter `return_as_file` to True, the response will be a zip file.
318
+ - If multiple files are generated (multiple inputs, or one input but multiple outputs with `return_as_file` True), the response will be a zip file.
319
+
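A client typically only needs the outputs that were actually produced. The following sketch, which assumes the single-document JSON layout shown above and uses a hypothetical helper name (`populated_outputs`) and made-up sample values, filters out the empty entries:

```python
def populated_outputs(result: dict) -> dict:
    """Hypothetical helper: keep only the `*_content` entries that are non-empty."""
    document = result.get("document", {})
    return {key: value for key, value in document.items() if value}


# Sample data shaped like the JSON response documented above (values are made up).
sample = {
    "document": {
        "md_content": "# Title",
        "json_content": {},
        "html_content": "",
        "text_content": "Title",
        "doctags_content": "",
    },
    "status": "success",
    "processing_time": 1.5,
    "timings": {},
    "errors": [],
}

outputs = populated_outputs(sample)
```

The truthiness check treats empty strings, empty objects, and null values alike, so it does not depend on how the service represents an output that was not requested.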
320
+ ## Helpers
321
+
322
+ - A full Swagger UI is available at the `/docs` endpoint.
323
+
324
+ ![swagger.png](img/swagger.png)
325
+
326
+ - An easy-to-use UI is available at the `/ui` endpoint.
327
+
328
+ ![ui-input.png](img/ui-input.png)
329
+
330
+ ![ui-output.png](img/ui-output.png)
331
+
332
+ ## Development
333
+
334
+ ### CPU only
335
+
336
+ ```sh
337
+ # Install uv if not already available
338
+ curl -LsSf https://astral.sh/uv/install.sh | sh
339
+
340
+ # Install dependencies
341
+ uv sync --extra cpu
342
+ ```
343
+
344
+ ### CUDA GPU
345
+
346
+ For GPU support use the following command:
347
+
348
+ ```sh
349
+ # Install dependencies
350
+ uv sync
351
+ ```
352
+
353
+ ### Gradio UI and different OCR backends
354
+
355
+ The `/ui` endpoint (built with `gradio`) and additional OCR backends can be enabled via package extras:
356
+
357
+ ```sh
358
+ # Enable ui and rapidocr
359
+ uv sync --extra ui --extra rapidocr
360
+ ```
361
+
362
+ ```sh
363
+ # Enable tesserocr
364
+ uv sync --extra tesserocr
365
+ ```
366
+
367
+ See the `[project.optional-dependencies]` section in `pyproject.toml` for the full list of options.
368
+
369
+ ### Run the server
370
+
371
+ The `docling-serve` executable is a convenient script for launching the webserver both in
372
+ development and production mode.
373
+
374
+ ```sh
375
+ # Run the server in development mode
376
+ # - reload is enabled by default
377
+ # - listening on the 127.0.0.1 address
378
+ # - ui is enabled by default
379
+ docling-serve dev
380
+
381
+ # Run the server in production mode
382
+ # - reload is disabled by default
383
+ # - listening on the 0.0.0.0 address
384
+ # - ui is disabled by default
385
+ docling-serve run
386
+ ```
387
+
388
+ ### Options
389
+
390
+ The `docling-serve` executable is controlled with both command-line
391
+ options and environment variables.
392
+
393
+ <details>
394
+ <summary>`docling-serve` help message</summary>
395
+
396
+ ```sh
397
+ $ docling-serve dev --help
398
+
399
+ Usage: docling-serve dev [OPTIONS]
400
+
401
+ Run a Docling Serve app in development mode. 🧪
402
+ This is equivalent to docling-serve run but with reload
403
+ enabled and listening on the 127.0.0.1 address.
404
+
405
+ Options can be set also with the corresponding ENV variable, with the exception
406
+ of --enable-ui, --host and --reload.
407
+
+ ╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────╮
+ │ --host                                  TEXT     The host to serve on. For local development in localhost │
+ │                                                  use 127.0.0.1. To enable public access, e.g. in a        │
+ │                                                  container, use all the IP addresses available with       │
+ │                                                  0.0.0.0.                                                 │
+ │                                                  [default: 127.0.0.1]                                     │
+ │ --port                                  INTEGER  The port to serve on. [default: 5001]                    │
+ │ --reload           --no-reload                   Enable auto-reload of the server when (code) files       │
+ │                                                  change. This is resource intensive, use it only during   │
+ │                                                  development.                                             │
+ │                                                  [default: reload]                                        │
+ │ --root-path                             TEXT     The root path is used to tell your app that it is being  │
+ │                                                  served to the outside world with some path prefix set up │
+ │                                                  in some termination proxy or similar.                    │
+ │ --proxy-headers    --no-proxy-headers            Enable/Disable X-Forwarded-Proto, X-Forwarded-For,       │
+ │                                                  X-Forwarded-Port to populate remote address info.        │
+ │                                                  [default: proxy-headers]                                 │
+ │ --artifacts-path                        PATH     If set to a valid directory, the model weights will be   │
+ │                                                  loaded from this path.                                   │
+ │                                                  [default: None]                                          │
+ │ --enable-ui        --no-enable-ui                Enable the development UI. [default: enable-ui]          │
+ │ --help                                           Show this message and exit.                              │
+ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
431
+ ```
432
+
433
+ </details>
434
+
435
+ #### Environment variables
436
+
437
+ The environment variables controlling the `uvicorn` execution can be specified with the `UVICORN_` prefix:
438
+
439
+ - `UVICORN_WORKERS`: Number of workers to use.
440
+ - `UVICORN_RELOAD`: If `True`, this will enable auto-reload when you modify files, useful for development.
441
+
442
+ The environment variables controlling specifics of the Docling Serve app can be specified with the
443
+ `DOCLING_SERVE_` prefix:
444
+
445
+ - `DOCLING_SERVE_ARTIFACTS_PATH`: If set, Docling will use only the local model weights found at this path, for example `/opt/app-root/src/.cache/docling/models`.
446
+ - `DOCLING_SERVE_ENABLE_UI`: If `True`, the Gradio UI will be available at `/ui`.
447
+
448
+ Others:
449
+
450
+ - `TESSDATA_PREFIX`: Tesseract data location, for example `/usr/share/tesseract/tessdata/`.
451
+
452
+ ## Get help and support
453
+
454
+ Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
455
+
456
+ ## Contributing
457
+
458
+ Please read [Contributing to Docling Serve](https://github.com/DS4SD/docling-serve/blob/main/CONTRIBUTING.md) for details.
459
+
460
+ ## References
461
+
462
+ If you use Docling in your projects, please consider citing the following:
463
+
464
+ ```bib
465
+ @techreport{Docling,
466
+ author = {Deep Search Team},
467
+ month = {8},
468
+ title = {Docling Technical Report},
469
+ url = {https://arxiv.org/abs/2408.09869},
470
+ eprint = {2408.09869},
471
+ doi = {10.48550/arXiv.2408.09869},
472
+ version = {1.0.0},
473
+ year = {2024}
474
+ }
475
+ ```
476
+
477
+ ## License
478
+
479
+ The Docling Serve codebase is released under the MIT license.
480
+
481
+ ## IBM ❤️ Open Source AI
482
+
483
+ Docling has been brought to you by IBM.