docling 0.1.2.tar.gz → 1.1.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. docling-1.1.2/PKG-INFO +183 -0
  2. docling-1.1.2/README.md +146 -0
  3. {docling-0.1.2 → docling-1.1.2}/docling/backend/pypdfium2_backend.py +1 -7
  4. {docling-0.1.2 → docling-1.1.2}/docling/datamodel/base_models.py +24 -4
  5. {docling-0.1.2 → docling-1.1.2}/docling/datamodel/document.py +8 -11
  6. {docling-0.1.2 → docling-1.1.2}/docling/document_converter.py +56 -0
  7. {docling-0.1.2 → docling-1.1.2}/docling/models/page_assemble_model.py +0 -12
  8. {docling-0.1.2 → docling-1.1.2}/docling/models/table_structure_model.py +47 -12
  9. {docling-0.1.2 → docling-1.1.2}/docling/pipeline/standard_model_pipeline.py +1 -1
  10. {docling-0.1.2 → docling-1.1.2}/pyproject.toml +26 -11
  11. docling-0.1.2/PKG-INFO +0 -132
  12. docling-0.1.2/README.md +0 -99
  13. {docling-0.1.2 → docling-1.1.2}/LICENSE +0 -0
  14. {docling-0.1.2 → docling-1.1.2}/docling/__init__.py +0 -0
  15. {docling-0.1.2 → docling-1.1.2}/docling/backend/__init__.py +0 -0
  16. {docling-0.1.2 → docling-1.1.2}/docling/backend/abstract_backend.py +0 -0
  17. {docling-0.1.2 → docling-1.1.2}/docling/datamodel/__init__.py +0 -0
  18. {docling-0.1.2 → docling-1.1.2}/docling/datamodel/settings.py +0 -0
  19. {docling-0.1.2 → docling-1.1.2}/docling/models/__init__.py +0 -0
  20. {docling-0.1.2 → docling-1.1.2}/docling/models/ds_glm_model.py +0 -0
  21. {docling-0.1.2 → docling-1.1.2}/docling/models/easyocr_model.py +0 -0
  22. {docling-0.1.2 → docling-1.1.2}/docling/models/layout_model.py +0 -0
  23. {docling-0.1.2 → docling-1.1.2}/docling/pipeline/__init__.py +0 -0
  24. {docling-0.1.2 → docling-1.1.2}/docling/pipeline/base_model_pipeline.py +0 -0
  25. {docling-0.1.2 → docling-1.1.2}/docling/utils/__init__.py +0 -0
  26. {docling-0.1.2 → docling-1.1.2}/docling/utils/layout_utils.py +0 -0
  27. {docling-0.1.2 → docling-1.1.2}/docling/utils/utils.py +0 -0
docling-1.1.2/PKG-INFO ADDED
@@ -0,0 +1,183 @@
+ Metadata-Version: 2.1
+ Name: docling
+ Version: 1.1.2
+ Summary: Docling PDF conversion package
+ Home-page: https://github.com/DS4SD/docling
+ License: MIT
+ Keywords: docling,convert,document,pdf,layout model,segmentation,table structure,table former
+ Author: Christoph Auer
+ Author-email: cau@zurich.ibm.com
+ Requires-Python: >=3.10,<4.0
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Operating System :: POSIX :: Linux
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Provides-Extra: easyocr
+ Provides-Extra: ocr
+ Requires-Dist: deepsearch-glm (>=0.19.0,<1)
+ Requires-Dist: docling-core (>=1.1.2,<2.0.0)
+ Requires-Dist: docling-ibm-models (>=1.1.0,<2.0.0)
+ Requires-Dist: easyocr (>=1.7,<2.0) ; extra == "easyocr" or extra == "ocr"
+ Requires-Dist: filetype (>=1.2.0,<2.0.0)
+ Requires-Dist: huggingface_hub (>=0.23,<1)
+ Requires-Dist: pydantic (>=2.0.0,<3.0.0)
+ Requires-Dist: pydantic-settings (>=2.3.0,<3.0.0)
+ Requires-Dist: pypdfium2 (>=4.30.0,<5.0.0)
+ Requires-Dist: requests (>=2.32.3,<3.0.0)
+ Project-URL: Repository, https://github.com/DS4SD/docling
+ Description-Content-Type: text/markdown
+
+ <p align="center">
+   <a href="https://github.com/ds4sd/docling">
+     <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+   </a>
+ </p>
+
+ # Docling
+
+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
+
+ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+
+ ## Features
+ * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+ * 📑 Understands detailed page layout, reading order and recovers table structures
+ * 📝 Extracts metadata from the document, such as title, authors, references and language
+ * 🔍 Optionally applies OCR (use with scanned PDFs)
+
+ ## Installation
+
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
+ pip install docling
+ ```
+
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.
+
+ ### Development setup
+
+ To develop for Docling, you need Python 3.10 / 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+ ```bash
+ poetry install --all-extras
+ ```
+
+ ## Usage
+
+ ### Convert a single document
+
+ To convert individual PDF documents, use `convert_single()`, for example:
+ ```python
+ from docling.document_converter import DocumentConverter
+
+ source = "https://arxiv.org/pdf/2206.01062"  # PDF path or URL
+ converter = DocumentConverter()
+ doc = converter.convert_single(source)
+ print(doc.export_to_markdown())  # output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [...]"
+ ```
+
+ ### Convert a batch of documents
+
+ For an example of batch-converting documents, see [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py).
+
+ From a local repo clone, you can run it with:
+
+ ```
+ python examples/convert.py
+ ```
+ The output of the above command will be written to `./scratch`.
+
+ ### Adjust pipeline features
+
+ #### Control pipeline options
+
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
+ ```python
+ doc_converter = DocumentConverter(
+     artifacts_path=artifacts_path,
+     pipeline_options=PipelineOptions(
+         do_table_structure=False,  # controls if table structure is recovered
+         do_ocr=True,  # controls if OCR is applied (ignores programmatic content)
+     ),
+ )
+ ```
+
+ #### Control table extraction options
+
+ You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
+ This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+
+
+ ```python
+ pipeline_options = PipelineOptions(do_table_structure=True)
+ pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model
+
+ doc_converter = DocumentConverter(
+     artifacts_path=artifacts_path,
+     pipeline_options=pipeline_options,
+ )
+ ```
+
+ ### Impose limits on the document size
+
+ You can limit the file size and number of pages which should be allowed to process per document:
+ ```python
+ conv_input = DocumentConversionInput.from_paths(
+     paths=[Path("./test/data/2206.01062.pdf")],
+     limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ )
+ ```
+
+ ### Convert from binary PDF streams
+
+ You can convert PDFs from a binary stream instead of from the filesystem as follows:
+ ```python
+ buf = BytesIO(your_binary_stream)
+ docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
+ ```
+ ### Limit resource usage
+
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+
+
+ ## Contributing
+
+ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+
+
+ ## References
+
+ If you use Docling in your projects, please consider citing the following:
+
+ ```bib
+ @software{Docling,
+ author = {Deep Search Team},
+ month = {7},
+ title = {{Docling}},
+ url = {https://github.com/DS4SD/docling},
+ version = {main},
+ year = {2024}
+ }
+ ```
+
+ ## License
+
+ The Docling codebase is under MIT license.
+ For individual model usage, please refer to the model licenses found in the original packages.
+
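The "Limit resource usage" section above has no accompanying snippet; a minimal sketch of applying the thread cap, assuming (as is typical for OpenMP-backed libraries) that `OMP_NUM_THREADS` must be set before the heavy imports happen:

```python
import os

# Cap the worker threads before docling and the numerical libraries it pulls in
# are imported; per the README above, the default is 4 CPU threads.
os.environ["OMP_NUM_THREADS"] = "2"

from docling.document_converter import DocumentConverter  # noqa: E402

converter = DocumentConverter()
```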
docling-1.1.2/README.md ADDED
@@ -0,0 +1,146 @@
(The 146 added lines are identical to the Markdown long description embedded in docling-1.1.2/PKG-INFO above, from the `<p align="center">` logo block down to the License section, and are not repeated here.)
{docling-0.1.2 → docling-1.1.2}/docling/backend/pypdfium2_backend.py
@@ -201,13 +201,7 @@ class PyPdfiumPageBackend(PdfPageBackend):
  class PyPdfiumDocumentBackend(PdfDocumentBackend):
      def __init__(self, path_or_stream: Iterable[Union[BytesIO, Path]]):
          super().__init__(path_or_stream)
-
-         if isinstance(path_or_stream, Path):
-             self._pdoc = pdfium.PdfDocument(path_or_stream)
-         elif isinstance(path_or_stream, BytesIO):
-             self._pdoc = pdfium.PdfDocument(
-                 path_or_stream
-             )  # TODO Fix me, won't accept bytes.
+         self._pdoc = pdfium.PdfDocument(path_or_stream)

      def page_count(self) -> int:
          return len(self._pdoc)
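The deleted `isinstance` branching leans on pypdfium2 resolving the input type on its own; a small sketch of that assumption, reusing the sample file path from the README examples:

```python
from io import BytesIO
from pathlib import Path

import pypdfium2 as pdfium

pdf_path = Path("./test/data/2206.01062.pdf")

# pypdfium2 accepts a filesystem path as well as a seekable binary buffer,
# which is why the backend can now pass `path_or_stream` through unchanged.
doc_from_path = pdfium.PdfDocument(pdf_path)
doc_from_stream = pdfium.PdfDocument(BytesIO(pdf_path.read_bytes()))
print(len(doc_from_path), len(doc_from_stream))  # both report the same page count
```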
{docling-0.1.2 → docling-1.1.2}/docling/datamodel/base_models.py
@@ -1,3 +1,4 @@
+ import copy
  from enum import Enum, auto
  from io import BytesIO
  from typing import Any, Dict, List, Optional, Tuple, Union
@@ -47,6 +48,15 @@ class BoundingBox(BaseModel):
      def height(self):
          return abs(self.t - self.b)

+     def scaled(self, scale: float) -> "BoundingBox":
+         out_bbox = copy.deepcopy(self)
+         out_bbox.l *= scale
+         out_bbox.r *= scale
+         out_bbox.t *= scale
+         out_bbox.b *= scale
+
+         return out_bbox
+
      def as_tuple(self):
          if self.coord_origin == CoordOrigin.TOPLEFT:
              return (self.l, self.t, self.r, self.b)
@@ -180,8 +190,7 @@ class TableStructurePrediction(BaseModel):
      table_map: Dict[int, TableElement] = {}


- class TextElement(BasePageElement):
-     ...
+ class TextElement(BasePageElement): ...


  class FigureData(BaseModel):
@@ -242,6 +251,17 @@ class DocumentStream(BaseModel):
      stream: BytesIO


+ class TableStructureOptions(BaseModel):
+     do_cell_matching: bool = (
+         True
+         # True: Matches predictions back to PDF cells. Can break table output if PDF cells
+         # are merged across table columns.
+         # False: Let table structure model define the text cells, ignore PDF cells.
+     )
+
+
  class PipelineOptions(BaseModel):
-     do_table_structure: bool = True
-     do_ocr: bool = False
+     do_table_structure: bool = True  # True: perform table structure extraction
+     do_ocr: bool = False  # True: perform OCR, replace programmatic PDF text
+
+     table_structure_options: TableStructureOptions = TableStructureOptions()
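A short sketch of how the additions fit together: `BoundingBox.scaled()` returns a scaled copy, and `PipelineOptions` now nests a `TableStructureOptions` object. The import of `CoordOrigin` from the same module is assumed from the `as_tuple()` context above; the values are illustrative:

```python
from docling.datamodel.base_models import (
    BoundingBox,
    CoordOrigin,
    PipelineOptions,
    TableStructureOptions,
)

# scaled() multiplies all four coordinates and leaves the original box untouched.
bbox = BoundingBox(l=10.0, t=20.0, r=110.0, b=220.0, coord_origin=CoordOrigin.TOPLEFT)
print(bbox.scaled(2.0).as_tuple())  # (20.0, 40.0, 220.0, 440.0)

# The nested options object replaces the previously hard-coded do_cell_matching=False.
opts = PipelineOptions(
    do_table_structure=True,
    do_ocr=False,
    table_structure_options=TableStructureOptions(do_cell_matching=False),
)
```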
{docling-0.1.2 → docling-1.1.2}/docling/datamodel/document.py
@@ -3,7 +3,6 @@ from io import BytesIO
  from pathlib import Path, PurePath
  from typing import ClassVar, Dict, Iterable, List, Optional, Type, Union

- from deepsearch.documents.core.export import export_to_markdown
  from docling_core.types import BaseCell, BaseText
  from docling_core.types import BoundingBox as DsBoundingBox
  from docling_core.types import Document as DsDocument
@@ -117,16 +116,16 @@ class ConvertedDocument(BaseModel):
      errors: List[Dict] = []  # structure to keep errors

      pages: List[Page] = []
-     assembled: AssembledUnit = None
+     assembled: Optional[AssembledUnit] = None

-     output: DsDocument = None
+     output: Optional[DsDocument] = None

      def to_ds_document(self) -> DsDocument:
          title = ""
          desc = DsDocumentDescription(logs=[])

          page_hashes = [
-             PageReference(hash=p.page_hash, page=p.page_no, model="default")
+             PageReference(hash=p.page_hash, page=p.page_no + 1, model="default")
              for p in self.pages
          ]
@@ -160,7 +159,7 @@ class ConvertedDocument(BaseModel):
              prov=[
                  Prov(
                      bbox=target_bbox,
-                     page=element.page_no,
+                     page=element.page_no + 1,
                      span=[0, len(element.text)],
                  )
              ],
@@ -243,7 +242,7 @@ class ConvertedDocument(BaseModel):
              prov=[
                  Prov(
                      bbox=target_bbox,
-                     page=element.page_no,
+                     page=element.page_no + 1,
                      span=[0, 0],
                  )
              ],
@@ -265,7 +264,7 @@ class ConvertedDocument(BaseModel):
              prov=[
                  Prov(
                      bbox=target_bbox,
-                     page=element.page_no,
+                     page=element.page_no + 1,
                      span=[0, 0],
                  )
              ],
@@ -275,7 +274,7 @@ class ConvertedDocument(BaseModel):
          )

          page_dimensions = [
-             PageDimensions(page=p.page_no, height=p.size.height, width=p.size.width)
+             PageDimensions(page=p.page_no + 1, height=p.size.height, width=p.size.width)
              for p in self.pages
          ]

@@ -299,9 +298,7 @@ class ConvertedDocument(BaseModel):

      def render_as_markdown(self):
          if self.output:
-             return export_to_markdown(
-                 self.output.model_dump(by_alias=True, exclude_none=True)
-             )
+             return self.output.export_to_markdown()
          else:
              return ""

{docling-0.1.2 → docling-1.1.2}/docling/document_converter.py
@@ -1,11 +1,15 @@
  import functools
  import logging
+ import tempfile
  import time
  import traceback
  from pathlib import Path
  from typing import Iterable, Optional, Type, Union

+ import requests
+ from docling_core.types import Document
  from PIL import ImageDraw
+ from pydantic import AnyHttpUrl, TypeAdapter, ValidationError

  from docling.backend.abstract_backend import PdfDocumentBackend
  from docling.datamodel.base_models import (
@@ -32,6 +36,7 @@ _log = logging.getLogger(__name__)
  class DocumentConverter:
      _layout_model_path = "model_artifacts/layout/beehive_v0.0.5"
      _table_model_path = "model_artifacts/tableformer"
+     _default_download_filename = "file.pdf"

      def __init__(
          self,
@@ -80,6 +85,57 @@ class DocumentConverter:
          # Note: Pdfium backend is not thread-safe, thread pool usage was disabled.
          yield from map(self.process_document, input_batch)

+     def convert_single(self, source: Path | AnyHttpUrl | str) -> Document:
+         """Convert a single document.
+
+         Args:
+             source (Path | AnyHttpUrl | str): The PDF input source. Can be a path or URL.
+
+         Raises:
+             ValueError: If source is of unexpected type.
+             RuntimeError: If conversion fails.
+
+         Returns:
+             Document: The converted document object.
+         """
+         with tempfile.TemporaryDirectory() as temp_dir:
+             try:
+                 http_url: AnyHttpUrl = TypeAdapter(AnyHttpUrl).validate_python(source)
+                 res = requests.get(http_url, stream=True)
+                 res.raise_for_status()
+                 fname = None
+                 # try to get filename from response header
+                 if cont_disp := res.headers.get("Content-Disposition"):
+                     for par in cont_disp.strip().split(";"):
+                         # currently only handling directive "filename" (not "*filename")
+                         if (split := par.split("=")) and split[0].strip() == "filename":
+                             fname = "=".join(split[1:]).strip().strip("'\"") or None
+                             break
+                 # otherwise, use name from URL:
+                 if fname is None:
+                     fname = Path(http_url.path).name or self._default_download_filename
+                 local_path = Path(temp_dir) / fname
+                 with open(local_path, "wb") as f:
+                     for chunk in res.iter_content(chunk_size=1024):  # using 1-KB chunks
+                         f.write(chunk)
+             except ValidationError:
+                 try:
+                     local_path = TypeAdapter(Path).validate_python(source)
+                 except ValidationError:
+                     raise ValueError(
+                         f"Unexpected file path type encountered: {type(source)}"
+                     )
+             conv_inp = DocumentConversionInput.from_paths(paths=[local_path])
+             converted_docs_iter = self.convert(conv_inp)
+             converted_doc: ConvertedDocument = next(converted_docs_iter)
+         if converted_doc.status not in {
+             ConversionStatus.SUCCESS,
+             ConversionStatus.SUCCESS_WITH_ERRORS,
+         }:
+             raise RuntimeError(f"Conversion failed with status: {converted_doc.status}")
+         doc = converted_doc.to_ds_document()
+         return doc
+
      def process_document(self, in_doc: InputDocument) -> ConvertedDocument:
          start_doc_time = time.time()
          converted_doc = ConvertedDocument(input=in_doc)
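A usage sketch for the new `convert_single()`, mirroring the README examples earlier in this diff (the URL and local path are the same ones the README uses):

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# URL input: the file is first downloaded into a temporary directory, with the
# filename taken from Content-Disposition when the server provides one.
doc = converter.convert_single("https://arxiv.org/pdf/2206.01062")
print(doc.export_to_markdown()[:80])

# A local path takes the except-ValidationError branch and skips the download.
doc = converter.convert_single(Path("./test/data/2206.01062.pdf"))
```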
{docling-0.1.2 → docling-1.1.2}/docling/models/page_assemble_model.py
@@ -19,18 +19,6 @@ class PageAssembleModel:
      def __init__(self, config):
          self.config = config

-         # self.line_wrap_pattern = re.compile(r'(?<=[^\W_])- \n(?=\w)')
-
-     # def sanitize_text_poor(self, lines):
-     #     text = '\n'.join(lines)
-     #
-     #     # treat line wraps.
-     #     sanitized_text = self.line_wrap_pattern.sub('', text)
-     #
-     #     sanitized_text = sanitized_text.replace('\n', ' ')
-     #
-     #     return sanitized_text
-
      def sanitize_text(self, lines):
          if len(lines) <= 1:
              return " ".join(lines)
{docling-0.1.2 → docling-1.1.2}/docling/models/table_structure_model.py
@@ -1,7 +1,10 @@
- from typing import Iterable
+ import copy
+ import random
+ from typing import Iterable, List

  import numpy
  from docling_ibm_models.tableformer.data_management.tf_predictor import TFPredictor
+ from PIL import ImageDraw

  from docling.datamodel.base_models import (
      BoundingBox,
@@ -28,6 +31,21 @@ class TableStructureModel:
          self.tm_model_type = self.tm_config["model"]["type"]

          self.tf_predictor = TFPredictor(self.tm_config)
+         self.scale = 2.0  # Scale up table input images to 144 dpi
+
+     def draw_table_and_cells(self, page: Page, tbl_list: List[TableElement]):
+         image = page._backend.get_page_image()
+         draw = ImageDraw.Draw(image)
+
+         for table_element in tbl_list:
+             x0, y0, x1, y1 = table_element.cluster.bbox.as_tuple()
+             draw.rectangle([(x0, y0), (x1, y1)], outline="red")
+
+             for tc in table_element.table_cells:
+                 x0, y0, x1, y1 = tc.bbox.as_tuple()
+                 draw.rectangle([(x0, y0), (x1, y1)], outline="blue")
+
+         image.show()

      def __call__(self, page_batch: Iterable[Page]) -> Iterable[Page]:

@@ -36,16 +54,17 @@ class TableStructureModel:
              return

          for page in page_batch:
+
              page.predictions.tablestructure = TableStructurePrediction()  # dummy

              in_tables = [
                  (
                      cluster,
                      [
-                         round(cluster.bbox.l),
-                         round(cluster.bbox.t),
-                         round(cluster.bbox.r),
-                         round(cluster.bbox.b),
+                         round(cluster.bbox.l) * self.scale,
+                         round(cluster.bbox.t) * self.scale,
+                         round(cluster.bbox.r) * self.scale,
+                         round(cluster.bbox.b) * self.scale,
                      ],
                  )
                  for cluster in page.predictions.layout.clusters
@@ -65,20 +84,29 @@ class TableStructureModel:
              ):
                  # Only allow non empty stings (spaces) into the cells of a table
                  if len(c.text.strip()) > 0:
-                     tokens.append(c.model_dump())
+                     new_cell = copy.deepcopy(c)
+                     new_cell.bbox = new_cell.bbox.scaled(scale=self.scale)
+
+                     tokens.append(new_cell.model_dump())

-             iocr_page = {
-                 "image": numpy.asarray(page.image),
+             page_input = {
                  "tokens": tokens,
-                 "width": page.size.width,
-                 "height": page.size.height,
+                 "width": page.size.width * self.scale,
+                 "height": page.size.height * self.scale,
              }
+             # add image to page input.
+             if self.scale == 1.0:
+                 page_input["image"] = numpy.asarray(page.image)
+             else:  # render new page image on the fly at desired scale
+                 page_input["image"] = numpy.asarray(
+                     page._backend.get_page_image(scale=self.scale)
+                 )

              table_clusters, table_bboxes = zip(*in_tables)

              if len(table_bboxes):
                  tf_output = self.tf_predictor.multi_table_predict(
-                     iocr_page, table_bboxes, do_matching=self.do_cell_matching
+                     page_input, table_bboxes, do_matching=self.do_cell_matching
                  )

                  for table_cluster, table_out in zip(table_clusters, tf_output):
@@ -86,11 +114,15 @@ class TableStructureModel:
                      for element in table_out["tf_responses"]:

                          if not self.do_cell_matching:
-                             the_bbox = BoundingBox.model_validate(element["bbox"])
+                             the_bbox = BoundingBox.model_validate(
+                                 element["bbox"]
+                             ).scaled(1 / self.scale)
                              text_piece = page._backend.get_text_in_rect(the_bbox)
                              element["bbox"]["token"] = text_piece

                          tc = TableCell.model_validate(element)
+                         if self.do_cell_matching:
+                             tc.bbox = tc.bbox.scaled(1 / self.scale)
                          table_cells.append(tc)

                      # Retrieving cols/rows, after post processing:
@@ -111,4 +143,7 @@ class TableStructureModel:

                      page.predictions.tablestructure.table_map[table_cluster.id] = tbl

+             # For debugging purposes:
+             # self.draw_table_and_cells(page, page.predictions.tablestructure.table_map.values())
+
              yield page
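The scaling logic amounts to a round trip: table inputs are blown up by `self.scale` for the higher-resolution prediction, and predicted cell boxes are mapped back with `1 / self.scale`. A small illustration using the `BoundingBox.scaled()` helper added in base_models.py (the coordinates are made up):

```python
from docling.datamodel.base_models import BoundingBox, CoordOrigin

scale = 2.0  # table crops are rendered at 2x, i.e. roughly 144 dpi

# A PDF cell in 72 dpi page coordinates is scaled up before being handed to the
# TableFormer predictor together with the 2x page image ...
pdf_cell = BoundingBox(l=100.0, t=50.0, r=300.0, b=90.0, coord_origin=CoordOrigin.TOPLEFT)
model_space = pdf_cell.scaled(scale)

# ... and the predicted cells are scaled back so that TableCell bboxes end up in
# the original page coordinate system again.
assert model_space.scaled(1 / scale).as_tuple() == pdf_cell.as_tuple()
```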
{docling-0.1.2 → docling-1.1.2}/docling/pipeline/standard_model_pipeline.py
@@ -34,7 +34,7 @@ class StandardModelPipeline(BaseModelPipeline):
                      "artifacts_path": artifacts_path
                      / StandardModelPipeline._table_model_path,
                      "enabled": pipeline_options.do_table_structure,
-                     "do_cell_matching": False,
+                     "do_cell_matching": pipeline_options.table_structure_options.do_cell_matching,
                  }
              ),
          ]
{docling-0.1.2 → docling-1.1.2}/pyproject.toml
@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "docling"
- version = "0.1.2"
+ version = "1.1.2"  # DO NOT EDIT, updated automatically
  description = "Docling PDF conversion package"
  authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
  license = "MIT"
@@ -21,19 +21,17 @@ keywords= ["docling", "convert", "document", "pdf", "layout model", "segmentatio
  packages = [{include = "docling"}]

  [tool.poetry.dependencies]
- python = "^3.11"
+ python = "^3.10"
  pydantic = "^2.0.0"
- docling-core = "^0.2.0"
- docling-ibm-models = "^0.2.0"
- deepsearch-glm = ">=0.18.4,<1"
- deepsearch-toolkit = ">=0.47.0,<1"
+ docling-core = "^1.1.2"
+ docling-ibm-models = "^1.1.0"
+ deepsearch-glm = ">=0.19.0,<1"
  filetype = "^1.2.0"
  pypdfium2 = "^4.30.0"
  pydantic-settings = "^2.3.0"
  huggingface_hub = ">=0.23,<1"
-
- [tool.poetry.group.ocr.dependencies]
- easyocr = "^1.7"
+ requests = "^2.32.3"
+ easyocr = { version = "^1.7", optional = true }

  [tool.poetry.group.dev.dependencies]
  black = {extras = ["jupyter"], version = "^24.4.2"}
@@ -49,13 +47,17 @@ types-requests = "^2.31.0.2"
  flake8-pyproject = "^1.2.3"
  pylint = "^2.17.5"

+ [tool.poetry.extras]
+ easyocr = ["easyocr"]
+ ocr = ["easyocr"]
+
  [build-system]
  requires = ["poetry-core"]
  build-backend = "poetry.core.masonry.api"

  [tool.black]
  line-length = 88
- target-version = ["py311"]
+ target-version = ["py310"]
  include = '\.pyi?$'

  [tool.isort]
@@ -67,8 +69,21 @@ py_version=311
  pretty = true
  # strict = true
  no_implicit_optional = true
- python_version = "3.11"
+ python_version = "3.10"

  [tool.flake8]
  max-line-length = 88
  extend-ignore = ["E203", "E501"]
+
+ [tool.semantic_release]
+ # for default values check:
+ # https://github.com/python-semantic-release/python-semantic-release/blob/v7.32.2/semantic_release/defaults.cfg
+
+ version_source = "tag_only"
+ branch = "main"
+
+ # configure types which should trigger minor and patch version bumps respectively
+ # (note that they must be a subset of the configured allowed types):
+ parser_angular_allowed_types = "build,chore,ci,docs,feat,fix,perf,style,refactor,test"
+ parser_angular_minor_types = "feat"
+ parser_angular_patch_types = "fix,perf"
docling-0.1.2/PKG-INFO DELETED
@@ -1,132 +0,0 @@
- Metadata-Version: 2.1
- Name: docling
- Version: 0.1.2
- Summary: Docling PDF conversion package
- Home-page: https://github.com/DS4SD/docling
- License: MIT
- Keywords: docling,convert,document,pdf,layout model,segmentation,table structure,table former
- Author: Christoph Auer
- Author-email: cau@zurich.ibm.com
- Requires-Python: >=3.11,<4.0
- Classifier: Development Status :: 5 - Production/Stable
- Classifier: Intended Audience :: Developers
- Classifier: Intended Audience :: Science/Research
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Operating System :: MacOS :: MacOS X
- Classifier: Operating System :: POSIX :: Linux
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Dist: deepsearch-glm (>=0.18.4,<1)
- Requires-Dist: deepsearch-toolkit (>=0.47.0,<1)
- Requires-Dist: docling-core (>=0.2.0,<0.3.0)
- Requires-Dist: docling-ibm-models (>=0.2.0,<0.3.0)
- Requires-Dist: filetype (>=1.2.0,<2.0.0)
- Requires-Dist: huggingface_hub (>=0.23,<1)
- Requires-Dist: pydantic (>=2.0.0,<3.0.0)
- Requires-Dist: pydantic-settings (>=2.3.0,<3.0.0)
- Requires-Dist: pypdfium2 (>=4.30.0,<5.0.0)
- Project-URL: Repository, https://github.com/DS4SD/docling
- Description-Content-Type: text/markdown
-
- <p align="center">
-   <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
- </p>
-
- # Docling
-
- Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
-
- ## Features
- * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
- * 📑 Understands detailed page layout, reading order and recovers table structures
- * 📝 Extracts metadata from the document, such as title, authors, references and language
- * 🔍 Optionally applies OCR (use with scanned PDFs)
-
- ## Setup
-
- You need Python 3.11 and poetry. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed, create an environment and install the package:
-
- ```bash
- poetry env use $(which python3.11)
- poetry shell
- poetry install
- ```
-
- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
-
-
- ## Usage
-
- For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
-
- ```
- python examples/convert.py
- ```
- The output of the above command will be written to `./scratch`.
-
- ### Enable or disable pipeline features
-
- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
- ```python
- doc_converter = DocumentConverter(
-     artifacts_path=artifacts_path,
-     pipeline_options=PipelineOptions(do_table_structure=False,  # Controls if table structure is recovered.
-                                      do_ocr=True),  # Controls if OCR is applied (ignores programmatic content)
- )
- ```
-
- ### Impose limits on the document size
-
- You can limit the file size and number of pages which should be allowed to process per document.
- ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
-     paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
- )
- ```
-
- ### Convert from binary PDF streams
-
- You can convert PDFs from a binary stream instead of from the filesystem as follows:
- ```python
- buf = BytesIO(your_binary_stream)
- docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
- ```
- ### Limit resource usage
-
- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
-
-
- ## Contributing
-
- Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
-
-
- ## References
-
- If you use `Docling` in your projects, please consider citing the following:
-
- ```bib
- @software{Docling,
- author = {Deep Search Team},
- month = {7},
- title = {{Docling}},
- url = {https://github.com/DS4SD/docling},
- version = {main},
- year = {2024}
- }
- ```
-
- ## License
-
- The `Docling` codebase is under MIT license.
- For individual model usage, please refer to the model licenses found in the original packages.
-
docling-0.1.2/README.md DELETED
@@ -1,99 +0,0 @@
(The 99 removed lines are identical to the Markdown long description embedded in docling-0.1.2/PKG-INFO above and are not repeated here.)