PyPI - docling - Versions diffs - 0.1.2__tar.gz → 1.5.0__tar.gz - Mend

docling 0.1.2tar.gz → 1.5.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

docling-1.5.0/PKG-INFO ADDED

@@ -0,0 +1,192 @@
+Metadata-Version: 2.1
+Name: docling
+Version: 1.5.0
+Summary: Docling PDF conversion package
+Home-page: https://github.com/DS4SD/docling
+License: MIT
+Keywords: docling,convert,document,pdf,layout model,segmentation,table structure,table former
+Author: Christoph Auer
+Author-email: cau@zurich.ibm.com
+Requires-Python: >=3.10,<4.0
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: MacOS :: MacOS X
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Provides-Extra: easyocr
+Provides-Extra: ocr
+Requires-Dist: certifi (>=2024.7.4)
+Requires-Dist: deepsearch-glm (>=0.19.0,<1)
+Requires-Dist: docling-core (>=1.1.2,<2.0.0)
+Requires-Dist: docling-ibm-models (>=1.1.1,<2.0.0)
+Requires-Dist: docling-parse (>=0.2.0,<0.3.0)
+Requires-Dist: easyocr (>=1.7,<2.0) ; extra == "easyocr" or extra == "ocr"
+Requires-Dist: filetype (>=1.2.0,<2.0.0)
+Requires-Dist: huggingface_hub (>=0.23,<1)
+Requires-Dist: pydantic (>=2.0.0,<3.0.0)
+Requires-Dist: pydantic-settings (>=2.3.0,<3.0.0)
+Requires-Dist: pypdfium2 (>=4.30.0,<5.0.0)
+Requires-Dist: requests (>=2.32.3,<3.0.0)
+Project-URL: Repository, https://github.com/DS4SD/docling
+Description-Content-Type: text/markdown
+<p align="center">
+  <a href="https://github.com/ds4sd/docling">
+    <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+  </a>
+</p>
+# Docling
+[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
+[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
+[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
+Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+## Features
+* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+* 📑 Understands detailed page layout, reading order and recovers table structures
+* 📝 Extracts metadata from the document, such as title, authors, references and language
+* 🔍 Optionally applies OCR (use with scanned PDFs)
+## Installation
+To use Docling, simply install `docling` from your package manager, e.g. pip:
+```bash
+pip install docling
+```
+> [!NOTE]
+> Works on macOS and Linux environments. Windows platforms are currently not tested.
+### Development setup
+To develop for Docling, you need Python 3.10 / 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+```bash
+poetry install --all-extras
+```
+## Usage
+### Convert a single document
+To convert individual PDF documents, use `convert_single()`, for example:
+```python
+from docling.document_converter import DocumentConverter
+source = "https://arxiv.org/pdf/2206.01062"  # PDF path or URL
+converter = DocumentConverter()
+doc = converter.convert_single(source)
+print(doc.export_to_markdown())  # output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [...]"
+```
+### Convert a batch of documents
+For an example of batch-converting documents, see [batch_convert.py](https://github.com/DS4SD/docling/blob/main/examples/batch_convert.py).
+From a local repo clone, you can run it with:
+```
+python examples/batch_convert.py
+```
+The output of the above command will be written to `./scratch`.
+### Adjust pipeline features
+The example file [custom_convert.py](https://github.com/DS4SD/docling/blob/main/examples/custom_convert.py) contains multiple ways
+one can adjust the conversion pipeline and features.
+#### Control pipeline options
+You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
+```python
+doc_converter = DocumentConverter(
+    artifacts_path=artifacts_path,
+    pipeline_options=PipelineOptions(
+        do_table_structure=False,  # controls if table structure is recovered
+        do_ocr=True,  # controls if OCR is applied (ignores programmatic content)
+    ),
+)
+```
+#### Control table extraction options
+You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
+This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+```python
+pipeline_options = PipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model
+doc_converter = DocumentConverter(
+    artifacts_path=artifacts_path,
+    pipeline_options=pipeline_options,
+)
+```
+### Impose limits on the document size
+You can limit the file size and the number of pages allowed per document:
+```python
+conv_input = DocumentConversionInput.from_paths(
+    paths=[Path("./test/data/2206.01062.pdf")],
+    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+)
+```
+### Convert from binary PDF streams
+You can convert PDFs from a binary stream instead of from the filesystem as follows:
+```python
+buf = BytesIO(your_binary_stream)
+docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+conv_input = DocumentConversionInput.from_streams(docs)
+converted_docs = doc_converter.convert(conv_input)
+```
+### Limit resource usage
+You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+## Contributing
+Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+## References
+If you use Docling in your projects, please consider citing the following:
+```bib
+@techreport{Docling,
+  author = {Deep Search Team},
+  month = {8},
+  title = {{Docling Technical Report}},
+  url={https://arxiv.org/abs/2408.09869},
+  eprint={2408.09869},
+  doi = "10.48550/arXiv.2408.09869",
+  version = {1.0.0},
+  year = {2024}
+}
+```
+## License
+The Docling codebase is under MIT license.
+For individual model usage, please refer to the model licenses found in the original packages.

docling-1.5.0/README.md ADDED

@@ -0,0 +1,153 @@
+<p align="center">
+  <a href="https://github.com/ds4sd/docling">
+    <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+  </a>
+</p>
+# Docling
+[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
+[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
+[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
+Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+## Features
+* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+* 📑 Understands detailed page layout, reading order and recovers table structures
+* 📝 Extracts metadata from the document, such as title, authors, references and language
+* 🔍 Optionally applies OCR (use with scanned PDFs)
+## Installation
+To use Docling, simply install `docling` from your package manager, e.g. pip:
+```bash
+pip install docling
+```
+> [!NOTE]
+> Works on macOS and Linux environments. Windows platforms are currently not tested.
+### Development setup
+To develop for Docling, you need Python 3.10 / 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+```bash
+poetry install --all-extras
+```
+## Usage
+### Convert a single document
+To convert individual PDF documents, use `convert_single()`, for example:
+```python
+from docling.document_converter import DocumentConverter
+source = "https://arxiv.org/pdf/2206.01062"  # PDF path or URL
+converter = DocumentConverter()
+doc = converter.convert_single(source)
+print(doc.export_to_markdown())  # output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [...]"
+```
+### Convert a batch of documents
+For an example of batch-converting documents, see [batch_convert.py](https://github.com/DS4SD/docling/blob/main/examples/batch_convert.py).
+From a local repo clone, you can run it with:
+```
+python examples/batch_convert.py
+```
+The output of the above command will be written to `./scratch`.
+### Adjust pipeline features
+The example file [custom_convert.py](https://github.com/DS4SD/docling/blob/main/examples/custom_convert.py) contains multiple ways
+one can adjust the conversion pipeline and features.
+#### Control pipeline options
+You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
+```python
+doc_converter = DocumentConverter(
+    artifacts_path=artifacts_path,
+    pipeline_options=PipelineOptions(
+        do_table_structure=False,  # controls if table structure is recovered
+        do_ocr=True,  # controls if OCR is applied (ignores programmatic content)
+    ),
+)
+```
+#### Control table extraction options
+You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
+This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+```python
+pipeline_options = PipelineOptions(do_table_structure=True)
+pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model
+doc_converter = DocumentConverter(
+    artifacts_path=artifacts_path,
+    pipeline_options=pipeline_options,
+)
+```
+### Impose limits on the document size
+You can limit the file size and the number of pages allowed per document:
+```python
+conv_input = DocumentConversionInput.from_paths(
+    paths=[Path("./test/data/2206.01062.pdf")],
+    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+)
+```
+### Convert from binary PDF streams
+You can convert PDFs from a binary stream instead of from the filesystem as follows:
+```python
+buf = BytesIO(your_binary_stream)
+docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+conv_input = DocumentConversionInput.from_streams(docs)
+converted_docs = doc_converter.convert(conv_input)
+```
+### Limit resource usage
+You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+## Contributing
+Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+## References
+If you use Docling in your projects, please consider citing the following:
+```bib
+@techreport{Docling,
+  author = {Deep Search Team},
+  month = {8},
+  title = {{Docling Technical Report}},
+  url={https://arxiv.org/abs/2408.09869},
+  eprint={2408.09869},
+  doi = "10.48550/arXiv.2408.09869",
+  version = {1.0.0},
+  year = {2024}
+}
+```
+## License
+The Docling codebase is under MIT license.
+For individual model usage, please refer to the model licenses found in the original packages.

{docling-0.1.2 → docling-1.5.0}/docling/backend/abstract_backend.py RENAMED

@@ -35,7 +35,7 @@ class PdfPageBackend(ABC):
 class PdfDocumentBackend(ABC):
     @abstractmethod
-    def __init__(self, path_or_stream: Iterable[Union[BytesIO, Path]]):
+    def __init__(self, path_or_stream: Union[BytesIO, Path]):
         pass
     @abstractmethod
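
Note on the hunk above: the constructor annotation changes from an iterable of inputs to a single path or stream, so each backend instance wraps exactly one document. A minimal sketch of that contract follows. The class names `DocumentBackendSketch` and `ByteCountingBackend` are hypothetical stand-ins and are not part of docling; only the `Union[BytesIO, Path]` signature is taken from the diff.

```python
from abc import ABC, abstractmethod
from io import BytesIO
from pathlib import Path
from typing import Union


class DocumentBackendSketch(ABC):
    """Hypothetical stand-in for PdfDocumentBackend: one instance per document."""

    @abstractmethod
    def __init__(self, path_or_stream: Union[BytesIO, Path]):
        pass

    @abstractmethod
    def page_count(self) -> int:
        pass


class ByteCountingBackend(DocumentBackendSketch):
    """Toy backend (assumption, for illustration) that only measures its input."""

    def __init__(self, path_or_stream: Union[BytesIO, Path]):
        super().__init__(path_or_stream)
        if isinstance(path_or_stream, BytesIO):
            # In-memory input: measure the buffer directly.
            self.num_bytes = len(path_or_stream.getvalue())
        else:
            # Filesystem input: ask the OS for the file size.
            self.num_bytes = path_or_stream.stat().st_size

    def page_count(self) -> int:
        # Placeholder logic: a non-empty input counts as one "page".
        return 1 if self.num_bytes > 0 else 0


payload = b"%PDF-1.7 placeholder"
backend = ByteCountingBackend(BytesIO(payload))
```

Callers that previously held several inputs would, under this signature, construct one backend per input instead of passing the whole collection.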

docling-1.5.0/docling/backend/docling_parse_backend.py ADDED

@@ -0,0 +1,187 @@
+import logging
+import random
+import time
+from io import BytesIO
+from pathlib import Path
+from typing import Iterable, List, Optional, Union
+import pypdfium2 as pdfium
+from docling_parse.docling_parse import pdf_parser
+from PIL import Image, ImageDraw
+from pypdfium2 import PdfPage
+from docling.backend.abstract_backend import PdfDocumentBackend, PdfPageBackend
+from docling.datamodel.base_models import BoundingBox, Cell, CoordOrigin, PageSize
+_log = logging.getLogger(__name__)
+class DoclingParsePageBackend(PdfPageBackend):
+    def __init__(self, page_obj: PdfPage, docling_page_obj):
+        super().__init__(page_obj)
+        self._ppage = page_obj
+        self._dpage = docling_page_obj
+        self.text_page = None
+    def get_text_in_rect(self, bbox: BoundingBox) -> str:
+        # Find intersecting cells on the page
+        text_piece = ""
+        page_size = self.get_size()
+        parser_width = self._dpage["width"]
+        parser_height = self._dpage["height"]
+        scale = (
+            1  # FIX - Replace with param in get_text_in_rect across backends (optional)
+        )
+        for i in range(len(self._dpage["cells"])):
+            rect = self._dpage["cells"][i]["box"]["device"]
+            x0, y0, x1, y1 = rect
+            cell_bbox = BoundingBox(
+                l=x0 * scale * page_size.width / parser_width,
+                b=y0 * scale * page_size.height / parser_height,
+                r=x1 * scale * page_size.width / parser_width,
+                t=y1 * scale * page_size.height / parser_height,
+                coord_origin=CoordOrigin.BOTTOMLEFT,
+            ).to_top_left_origin(page_size.height * scale)
+            overlap_frac = cell_bbox.intersection_area_with(bbox) / cell_bbox.area()
+            if overlap_frac > 0.5:
+                if len(text_piece) > 0:
+                    text_piece += " "
+                text_piece += self._dpage["cells"][i]["content"]["rnormalized"]
+        return text_piece
+    def get_text_cells(self) -> Iterable[Cell]:
+        cells = []
+        cell_counter = 0
+        page_size = self.get_size()
+        parser_width = self._dpage["width"]
+        parser_height = self._dpage["height"]
+        for i in range(len(self._dpage["cells"])):
+            rect = self._dpage["cells"][i]["box"]["device"]
+            x0, y0, x1, y1 = rect
+            text_piece = self._dpage["cells"][i]["content"]["rnormalized"]
+            cells.append(
+                Cell(
+                    id=cell_counter,
+                    text=text_piece,
+                    bbox=BoundingBox(
+                        # l=x0, b=y0, r=x1, t=y1,
+                        l=x0 * page_size.width / parser_width,
+                        b=y0 * page_size.height / parser_height,
+                        r=x1 * page_size.width / parser_width,
+                        t=y1 * page_size.height / parser_height,
+                        coord_origin=CoordOrigin.BOTTOMLEFT,
+                    ).to_top_left_origin(page_size.height),
+                )
+            )
+            cell_counter += 1
+        def draw_clusters_and_cells():
+            image = (
+                self.get_page_image()
+            )  # make new image to avoid drawing on the saved ones
+            draw = ImageDraw.Draw(image)
+            for c in cells:
+                x0, y0, x1, y1 = c.bbox.as_tuple()
+                cell_color = (
+                    random.randint(30, 140),
+                    random.randint(30, 140),
+                    random.randint(30, 140),
+                )
+                draw.rectangle([(x0, y0), (x1, y1)], outline=cell_color)
+            image.show()
+        # before merge:
+        # draw_clusters_and_cells()
+        # cells = merge_horizontal_cells(cells)
+        # after merge:
+        # draw_clusters_and_cells()
+        return cells
+    def get_page_image(
+        self, scale: int = 1, cropbox: Optional[BoundingBox] = None
+    ) -> Image.Image:
+        page_size = self.get_size()
+        if not cropbox:
+            cropbox = BoundingBox(
+                l=0,
+                r=page_size.width,
+                t=0,
+                b=page_size.height,
+                coord_origin=CoordOrigin.TOPLEFT,
+            )
+            padbox = BoundingBox(
+                l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
+            )
+        else:
+            padbox = cropbox.to_bottom_left_origin(page_size.height)
+            padbox.r = page_size.width - padbox.r
+            padbox.t = page_size.height - padbox.t
+        image = (
+            self._ppage.render(
+                scale=scale * 1.5,
+                rotation=0,  # no additional rotation
+                crop=padbox.as_tuple(),
+            )
+            .to_pil()
+            .resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
+        )  # We resize the image from 1.5x the given scale to make it sharper.
+        return image
+    def get_size(self) -> PageSize:
+        return PageSize(width=self._ppage.get_width(), height=self._ppage.get_height())
+    def unload(self):
+        self._ppage = None
+        self._dpage = None
+        self.text_page = None
+class DoclingParseDocumentBackend(PdfDocumentBackend):
+    def __init__(self, path_or_stream: Union[BytesIO, Path]):
+        super().__init__(path_or_stream)
+        self._pdoc = pdfium.PdfDocument(path_or_stream)
+        # Parsing cells with docling_parser call
+        parser = pdf_parser()
+        start_pb_time = time.time()
+        if isinstance(path_or_stream, BytesIO):
+            self._parser_doc = parser.find_cells_from_bytesio(path_or_stream)
+        else:
+            self._parser_doc = parser.find_cells(str(path_or_stream))
+        end_pb_time = time.time() - start_pb_time
+        _log.info(
+            f"Time to parse {path_or_stream.name} with docling-parse: time={end_pb_time:.3f}"
+        )
+    def page_count(self) -> int:
+        return len(self._parser_doc["pages"])
+    def load_page(self, page_no: int) -> PdfPage:
+        return DoclingParsePageBackend(
+            self._pdoc[page_no], self._parser_doc["pages"][page_no]
+        )
+    def is_valid(self) -> bool:
+        return self.page_count() > 0
+    def unload(self):
+        self._pdoc.close()
+        self._pdoc = None
+        self._parser_doc = None
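
Note on the file above: `get_text_in_rect` rescales each parser cell from the parser's page dimensions to the rendered page size, flips it from a bottom-left to a top-left origin, and keeps the cell when more than half of its area lies inside the query box. The following is a rough, dependency-free sketch of that geometry. `Box`, `parser_rect_to_page_box` and `text_in_rect` are hypothetical names, the `(rect, text)` cell layout is an assumption, and the zero-area guard is an addition that the original code does not have.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box with a top-left origin (y grows downward)."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Box") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return w * h if w > 0 and h > 0 else 0.0


def parser_rect_to_page_box(rect, parser_size, page_size) -> Box:
    """Rescale a bottom-left-origin (x0, y0, x1, y1) rect, then flip to top-left."""
    x0, y0, x1, y1 = rect
    sx = page_size[0] / parser_size[0]
    sy = page_size[1] / parser_size[1]
    page_height = page_size[1]
    # Flipping the origin swaps which y value is the top edge.
    return Box(l=x0 * sx, t=page_height - y1 * sy, r=x1 * sx, b=page_height - y0 * sy)


def text_in_rect(cells, query: Box, parser_size, page_size) -> str:
    """Join the text of every cell whose area is more than 50% inside `query`."""
    pieces = []
    for rect, text in cells:
        box = parser_rect_to_page_box(rect, parser_size, page_size)
        # Guard against degenerate cells (not present in the original code).
        if box.area() > 0 and box.intersection_area(query) / box.area() > 0.5:
            pieces.append(text)
    return " ".join(pieces)


cells = [
    ((100, 1800, 300, 1900), "Title"),   # near the top of the page
    ((100, 200, 300, 300), "Footer"),    # near the bottom of the page
]
top_strip = Box(l=0, t=0, r=500, b=200)
found = text_in_rect(cells, top_strip, parser_size=(1000, 2000), page_size=(500, 1000))
```

Here the parser reports a 1000 × 2000 page while the rendered page is 500 × 1000, so every coordinate is halved before the vertical flip.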

{docling-0.1.2 → docling-1.5.0}/docling/backend/pypdfium2_backend.py RENAMED

@@ -134,7 +134,9 @@ class PyPdfiumPageBackend(PdfPageBackend):
             return merged_cells
         def draw_clusters_and_cells():
-            image = self.get_page_image()
+            image = (
+                self.get_page_image()
+            )  # make new image to avoid drawing on the saved ones
             draw = ImageDraw.Draw(image)
             for c in cells:
                 x0, y0, x1, y1 = c.bbox.as_tuple()
@@ -199,15 +201,9 @@ class PyPdfiumPageBackend(PdfPageBackend):
 class PyPdfiumDocumentBackend(PdfDocumentBackend):
-    def __init__(self, path_or_stream: Iterable[Union[BytesIO, Path]]):
+    def __init__(self, path_or_stream: Union[BytesIO, Path]):
         super().__init__(path_or_stream)
-        if isinstance(path_or_stream, Path):
-            self._pdoc = pdfium.PdfDocument(path_or_stream)
-        elif isinstance(path_or_stream, BytesIO):
-            self._pdoc = pdfium.PdfDocument(
-                path_or_stream
-            )  # TODO Fix me, won't accept bytes.
+        self._pdoc = pdfium.PdfDocument(path_or_stream)
     def page_count(self) -> int:
         return len(self._pdoc)

{docling-0.1.2 → docling-1.5.0}/docling/datamodel/base_models.py RENAMED

@@ -1,9 +1,12 @@
+import copy
+import warnings
 from enum import Enum, auto
 from io import BytesIO
-from typing import Any, Dict, List, Optional, Tuple, Union
+from typing import Annotated, Any, Dict, List, Optional, Tuple, Union
 from PIL.Image import Image
-from pydantic import BaseModel, ConfigDict, model_validator
+from pydantic import BaseModel, ConfigDict, Field, model_validator
+from typing_extensions import Self
 from docling.backend.abstract_backend import PdfPageBackend
@@ -47,6 +50,15 @@ class BoundingBox(BaseModel):
     def height(self):
         return abs(self.t - self.b)
+    def scaled(self, scale: float) -> "BoundingBox":
+        out_bbox = copy.deepcopy(self)
+        out_bbox.l *= scale
+        out_bbox.r *= scale
+        out_bbox.t *= scale
+        out_bbox.b *= scale
+        return out_bbox
     def as_tuple(self):
         if self.coord_origin == CoordOrigin.TOPLEFT:
             return (self.l, self.t, self.r, self.b)
@@ -180,8 +192,7 @@ class TableStructurePrediction(BaseModel):
     table_map: Dict[int, TableElement] = {}
-class TextElement(BasePageElement):
-    ...
+class TextElement(BasePageElement): ...
 class FigureData(BaseModel):
@@ -225,14 +236,30 @@ class Page(BaseModel):
     model_config = ConfigDict(arbitrary_types_allowed=True)
     page_no: int
-    page_hash: str = None
-    size: PageSize = None
-    image: Image = None
+    page_hash: Optional[str] = None
+    size: Optional[PageSize] = None
     cells: List[Cell] = None
     predictions: PagePredictions = PagePredictions()
-    assembled: AssembledUnit = None
+    assembled: Optional[AssembledUnit] = None
+    _backend: Optional[PdfPageBackend] = (
+        None  # Internal PDF backend. By default it is cleared during assembling.
+    )
+    _default_image_scale: float = 1.0  # Default image scale for external usage.
+    _image_cache: Dict[float, Image] = (
+        {}
+    )  # Cache of images in different scales. By default it is cleared during assembling.
+    def get_image(self, scale: float = 1.0) -> Optional[Image]:
+        if self._backend is None:
+            return self._image_cache.get(scale, None)
+        if not scale in self._image_cache:
+            self._image_cache[scale] = self._backend.get_page_image(scale=scale)
+        return self._image_cache[scale]
-    _backend: PdfPageBackend = None  # Internal PDF backend
+    @property
+    def image(self) -> Optional[Image]:
+        return self.get_image(scale=self._default_image_scale)
 class DocumentStream(BaseModel):
@@ -242,6 +269,36 @@ class DocumentStream(BaseModel):
     stream: BytesIO
+class TableStructureOptions(BaseModel):
+    do_cell_matching: bool = (
+        True
+        # True:  Matches predictions back to PDF cells. Can break table output if PDF cells
+        #        are merged across table columns.
+        # False: Let table structure model define the text cells, ignore PDF cells.
+    )
 class PipelineOptions(BaseModel):
-    do_table_structure: bool = True
-    do_ocr: bool = False
+    do_table_structure: bool = True  # True: perform table structure extraction
+    do_ocr: bool = False  # True: perform OCR, replace programmatic PDF text
+    table_structure_options: TableStructureOptions = TableStructureOptions()
+class AssembleOptions(BaseModel):
+    keep_page_images: Annotated[
+        bool,
+        Field(
+            deprecated="`keep_page_images` is depreacted, set the value of `images_scale` instead"
+        ),
+    ] = False  # False: page images are removed in the assemble step
+    images_scale: Optional[float] = None  # if set, the scale for generated images
+    @model_validator(mode="after")
+    def set_page_images_from_deprecated(self) -> Self:
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", DeprecationWarning)
+            default_scale = 1.0
+            if self.keep_page_images and self.images_scale is None:
+                self.images_scale = default_scale
+        return self
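
Note on `AssembleOptions` above: the validator keeps the deprecated `keep_page_images` flag working by translating it into `images_scale` when no scale was set explicitly. The same fallback rule can be sketched with a plain dataclass. `AssembleOptionsSketch` and `DEFAULT_IMAGES_SCALE` are hypothetical names used for illustration; the real class is a pydantic model and additionally suppresses the deprecation warning while reading the old field.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_IMAGES_SCALE = 1.0  # mirrors `default_scale` in the validator above


@dataclass
class AssembleOptionsSketch:
    keep_page_images: bool = False        # deprecated flag, kept for compatibility
    images_scale: Optional[float] = None  # preferred setting

    def __post_init__(self) -> None:
        # Only fall back to the default when the old flag is set
        # and the caller did not choose a scale explicitly.
        if self.keep_page_images and self.images_scale is None:
            self.images_scale = DEFAULT_IMAGES_SCALE


default_opts = AssembleOptionsSketch()
legacy_opts = AssembleOptionsSketch(keep_page_images=True)
explicit_opts = AssembleOptionsSketch(keep_page_images=True, images_scale=2.0)
```

An explicit `images_scale` always wins, so callers who have already migrated are unaffected by the legacy flag.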
