PyPI - docling - Versions diffs - 2.26.0__tar.gz → 2.27.0__tar.gz - Mend

docling 2.26.0__tar.gz → 2.27.0__tar.gz

Files changed (82)

{docling-2.26.0 → docling-2.27.0}/PKG-INFO RENAMED

@@ -1,8 +1,8 @@
 Metadata-Version: 2.1
 Name: docling
-Version: 2.26.0
+Version: 2.27.0
 Summary: SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.
-Home-page: https://github.com/DS4SD/docling
+Home-page: https://github.com/docling-project/docling
 License: MIT
 Keywords: docling,convert,document,pdf,docx,html,markdown,layout model,segmentation,table structure,table former
 Author: Christoph Auer
@@ -28,9 +28,9 @@ Provides-Extra: vlm
 Requires-Dist: accelerate (>=1.2.1,<2.0.0) ; (sys_platform != "darwin" or platform_machine != "x86_64") and (extra == "vlm")
 Requires-Dist: beautifulsoup4 (>=4.12.3,<5.0.0)
 Requires-Dist: certifi (>=2024.7.4)
-Requires-Dist: docling-core[chunking] (>=2.19.0,<3.0.0)
+Requires-Dist: docling-core[chunking] (>=2.23.0,<3.0.0)
 Requires-Dist: docling-ibm-models (>=3.4.0,<4.0.0)
-Requires-Dist: docling-parse (>=3.3.0,<4.0.0)
+Requires-Dist: docling-parse (>=4.0.0,<5.0.0)
 Requires-Dist: easyocr (>=1.7,<2.0)
 Requires-Dist: filetype (>=1.2.0,<2.0.0)
 Requires-Dist: huggingface_hub (>=0.23,<1)
@@ -42,8 +42,10 @@ Requires-Dist: onnxruntime (>=1.7.0,<2.0.0) ; (python_version >= "3.10") and (ex
 Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
 Requires-Dist: pandas (>=2.1.4,<3.0.0)
 Requires-Dist: pillow (>=10.0.0,<12.0.0)
+Requires-Dist: pluggy (>=1.0.0,<2.0.0)
 Requires-Dist: pydantic (>=2.0.0,<3.0.0)
 Requires-Dist: pydantic-settings (>=2.3.0,<3.0.0)
+Requires-Dist: pylatexenc (>=2.10,<3.0)
 Requires-Dist: pypdfium2 (>=4.30.0,<5.0.0)
 Requires-Dist: python-docx (>=1.1.2,<2.0.0)
 Requires-Dist: python-pptx (>=1.0.2,<2.0.0)
@@ -57,12 +59,12 @@ Requires-Dist: tqdm (>=4.65.0,<5.0.0)
 Requires-Dist: transformers (>=4.42.0,<4.43.0) ; (sys_platform == "darwin" and platform_machine == "x86_64") and (extra == "vlm")
 Requires-Dist: transformers (>=4.46.0,<5.0.0) ; (sys_platform != "darwin" or platform_machine != "x86_64") and (extra == "vlm")
 Requires-Dist: typer (>=0.12.5,<0.13.0)
-Project-URL: Repository, https://github.com/DS4SD/docling
+Project-URL: Repository, https://github.com/docling-project/docling
 Description-Content-Type: text/markdown
 <p align="center">
-  <a href="https://github.com/ds4sd/docling">
-    <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
+  <a href="https://github.com/docling-project/docling">
+    <img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
   </a>
 </p>
@@ -73,7 +75,7 @@ Description-Content-Type: text/markdown
 </p>
 [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
-[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)
+[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://docling-project.github.io/docling/)
 [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/)
 [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
@@ -81,8 +83,9 @@ Description-Content-Type: text/markdown
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
 [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
-[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
+[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
+[![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
 Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@@ -113,7 +116,7 @@ pip install docling
 Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
-More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
+More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.
 ## Getting started
@@ -128,28 +131,54 @@ result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
 ```
-More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in
+More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
 the docs.
 ## Documentation
-Check out Docling's [documentation](https://ds4sd.github.io/docling/), for details on
+Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
 installation, usage, concepts, recipes, extensions, and more.
 ## Examples
-Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/),
+Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
 demonstrating how to address different application use cases with Docling.
 ## Integrations
 To further accelerate your AI application development, check out Docling's native
-[integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks
+[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
 and tools.
+## Apify Actor
+<a href="https://apify.com/vancura/docling?fpr=docling"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Docling Actor on Apify" width="176" height="39" /></a>
+You can run Docling in the cloud without installation using the [Docling Actor](https://apify.com/vancura/docling?fpr=docling) on Apify platform. Simply provide a document URL and get the processed result:
+```bash
+apify call vancura/docling -i '{
+  "options": {
+    "to_formats": ["md", "json", "html", "text", "doctags"]
+  },
+  "http_sources": [
+    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
+    {"url": "https://arxiv.org/pdf/2408.09869"}
+  ]
+}'
+```
+The Actor stores results in:
+* Processed document in key-value store (`OUTPUT_RESULT`)
+* Processing logs (`DOCLING_LOG`)
+* Dataset record with result URL and status
+Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.
 ## Get help and support
-Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
+Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
 ## Technical report
@@ -157,7 +186,7 @@ For more details on Docling's inner workings, check out the [Docling Technical R
 ## Contributing
-Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.
 ## References
@@ -185,7 +214,7 @@ For individual model usage, please refer to the model licenses found in the orig
 Docling has been brought to you by IBM.
-[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
-[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
-[integrations]: https://ds4sd.github.io/docling/integrations/
+[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
+[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
+[integrations]: https://docling-project.github.io/docling/integrations/
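
Note on the dependency changes in PKG-INFO: the `docling-parse` requirement moves from `>=3.3.0,<4.0.0` to `>=4.0.0,<5.0.0`, so the old and new ranges do not overlap and no single `docling-parse` release satisfies both docling versions. The sketch below illustrates that with a deliberately simplified comparison. The helper names `parse_version` and `in_range` are hypothetical, and the parsing only handles plain dotted numeric versions. Real installers apply the full PEP 440 rules.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as "4.0.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper, compared numerically."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# docling-parse ranges copied from the Requires-Dist lines in the diff above.
OLD_RANGE = ("3.3.0", "4.0.0")  # docling 2.26.0
NEW_RANGE = ("4.0.0", "5.0.0")  # docling 2.27.0

for candidate in ("3.4.0", "4.0.0"):
    print(
        candidate,
        "old range:", in_range(candidate, *OLD_RANGE),
        "new range:", in_range(candidate, *NEW_RANGE),
    )
```

Because the upper bound of the old range equals the lower bound of the new one, an environment pinned to a 3.x `docling-parse` would need that pin lifted before upgrading docling.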

{docling-2.26.0 → docling-2.27.0}/README.md RENAMED

@@ -1,6 +1,6 @@
 <p align="center">
-  <a href="https://github.com/ds4sd/docling">
-    <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
+  <a href="https://github.com/docling-project/docling">
+    <img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
   </a>
 </p>
@@ -11,7 +11,7 @@
 </p>
 [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
-[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)
+[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://docling-project.github.io/docling/)
 [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/)
 [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
@@ -19,8 +19,9 @@
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
 [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
-[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)
+[![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
+[![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
 Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
@@ -51,7 +52,7 @@ pip install docling
 Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
-More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
+More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.
 ## Getting started
@@ -66,28 +67,54 @@ result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
 ```
-More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in
+More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
 the docs.
 ## Documentation
-Check out Docling's [documentation](https://ds4sd.github.io/docling/), for details on
+Check out Docling's [documentation](https://docling-project.github.io/docling/), for details on
 installation, usage, concepts, recipes, extensions, and more.
 ## Examples
-Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/),
+Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
 demonstrating how to address different application use cases with Docling.
 ## Integrations
 To further accelerate your AI application development, check out Docling's native
-[integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks
+[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
 and tools.
+## Apify Actor
+<a href="https://apify.com/vancura/docling?fpr=docling"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Docling Actor on Apify" width="176" height="39" /></a>
+You can run Docling in the cloud without installation using the [Docling Actor](https://apify.com/vancura/docling?fpr=docling) on Apify platform. Simply provide a document URL and get the processed result:
+```bash
+apify call vancura/docling -i '{
+  "options": {
+    "to_formats": ["md", "json", "html", "text", "doctags"]
+  },
+  "http_sources": [
+    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
+    {"url": "https://arxiv.org/pdf/2408.09869"}
+  ]
+}'
+```
+The Actor stores results in:
+* Processed document in key-value store (`OUTPUT_RESULT`)
+* Processing logs (`DOCLING_LOG`)
+* Dataset record with result URL and status
+Read more about the [Docling Actor](.actor/README.md), including how to use it via the Apify API and CLI.
 ## Get help and support
-Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
+Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
 ## Technical report
@@ -95,7 +122,7 @@ For more details on Docling's inner workings, check out the [Docling Technical R
 ## Contributing
-Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.
 ## References
@@ -123,6 +150,6 @@ For individual model usage, please refer to the model licenses found in the orig
 Docling has been brought to you by IBM.
-[supported_formats]: https://ds4sd.github.io/docling/usage/supported_formats/
-[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
-[integrations]: https://ds4sd.github.io/docling/integrations/
+[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
+[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
+[integrations]: https://docling-project.github.io/docling/integrations/

{docling-2.26.0 → docling-2.27.0}/docling/backend/asciidoc_backend.py RENAMED

@@ -380,7 +380,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
                     end_row_offset_idx=row_idx + row_span,
                     start_col_offset_idx=col_idx,
                     end_col_offset_idx=col_idx + col_span,
-                    col_header=False,
+                    column_header=row_idx == 0,
                     row_header=False,
                 )
                 data.table_cells.append(cell)

{docling-2.26.0 → docling-2.27.0}/docling/backend/csv_backend.py RENAMED

@@ -111,7 +111,7 @@ class CsvDocumentBackend(DeclarativeDocumentBackend):
                             end_row_offset_idx=row_idx + 1,
                             start_col_offset_idx=col_idx,
                             end_col_offset_idx=col_idx + 1,
-                            col_header=row_idx == 0,  # First row as header
+                            column_header=row_idx == 0,  # First row as header
                             row_header=False,
                         )
                         table_data.table_cells.append(cell)
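
Note on the table backends: both the AsciiDoc and CSV hunks now pass `column_header=row_idx == 0` instead of a `col_header` keyword, so every cell in the first row is flagged as a column header. The sketch below shows that rule in isolation. `SketchCell` and `cells_from_rows` are hypothetical stand-ins written for this note, not the docling-core classes, and they carry only the fields needed to show the flag.

```python
from dataclasses import dataclass


@dataclass
class SketchCell:
    """Hypothetical stand-in for a table cell; not the docling-core class."""

    text: str
    row_idx: int
    col_idx: int
    column_header: bool
    row_header: bool = False


def cells_from_rows(rows: list[list[str]]) -> list[SketchCell]:
    """Flatten a grid of strings into cells, flagging the first row as header."""
    cells = []
    for row_idx, row in enumerate(rows):
        for col_idx, text in enumerate(row):
            cells.append(
                SketchCell(
                    text=text,
                    row_idx=row_idx,
                    col_idx=col_idx,
                    column_header=row_idx == 0,  # same rule as in the diff
                )
            )
    return cells


cells = cells_from_rows([["name", "qty"], ["bolt", "4"]])
print([(c.text, c.column_header) for c in cells])
```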

{docling-2.26.0 → docling-2.27.0}/docling/backend/docling_parse_backend.py RENAMED

@@ -6,12 +6,12 @@ from typing import Iterable, List, Optional, Union
 import pypdfium2 as pdfium
 from docling_core.types.doc import BoundingBox, CoordOrigin, Size
+from docling_core.types.doc.page import BoundingRectangle, SegmentedPdfPage, TextCell
 from docling_parse.pdf_parsers import pdf_parser_v1
 from PIL import Image, ImageDraw
 from pypdfium2 import PdfPage
 from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
-from docling.datamodel.base_models import Cell
 from docling.datamodel.document import InputDocument
 _log = logging.getLogger(__name__)
@@ -68,8 +68,11 @@ class DoclingParsePageBackend(PdfPageBackend):
         return text_piece
-    def get_text_cells(self) -> Iterable[Cell]:
-        cells: List[Cell] = []
+    def get_segmented_page(self) -> Optional[SegmentedPdfPage]:
+        return None
+    def get_text_cells(self) -> Iterable[TextCell]:
+        cells: List[TextCell] = []
         cell_counter = 0
         if not self.valid:
@@ -91,19 +94,24 @@ class DoclingParsePageBackend(PdfPageBackend):
             text_piece = self._dpage["cells"][i]["content"]["rnormalized"]
             cells.append(
-                Cell(
-                    id=cell_counter,
+                TextCell(
+                    index=cell_counter,
                     text=text_piece,
-                    bbox=BoundingBox(
-                        # l=x0, b=y0, r=x1, t=y1,
-                        l=x0 * page_size.width / parser_width,
-                        b=y0 * page_size.height / parser_height,
-                        r=x1 * page_size.width / parser_width,
-                        t=y1 * page_size.height / parser_height,
-                        coord_origin=CoordOrigin.BOTTOMLEFT,
+                    orig=text_piece,
+                    from_ocr=False,
+                    rect=BoundingRectangle.from_bounding_box(
+                        BoundingBox(
+                            # l=x0, b=y0, r=x1, t=y1,
+                            l=x0 * page_size.width / parser_width,
+                            b=y0 * page_size.height / parser_height,
+                            r=x1 * page_size.width / parser_width,
+                            t=y1 * page_size.height / parser_height,
+                            coord_origin=CoordOrigin.BOTTOMLEFT,
+                        )
                     ).to_top_left_origin(page_size.height),
                 )
             )
             cell_counter += 1
         def draw_clusters_and_cells():
@@ -112,7 +120,7 @@ class DoclingParsePageBackend(PdfPageBackend):
             )  # make new image to avoid drawing on the saved ones
             draw = ImageDraw.Draw(image)
             for c in cells:
-                x0, y0, x1, y1 = c.bbox.as_tuple()
+                x0, y0, x1, y1 = c.rect.to_bounding_box().as_tuple()
                 cell_color = (
                     random.randint(30, 140),
                     random.randint(30, 140),
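
Note on the coordinate handling in `get_text_cells`: the hunk above still rescales each cell from parser units to page units and then converts from a bottom-left to a top-left origin. Only the wrapping type changes, from `Cell` with a `bbox` to `TextCell` with a `rect`. The arithmetic can be sketched without docling-core as follows. `Box` and `to_page_top_left` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical top-left-origin box: l/r are x values, t/b are y values."""

    l: float
    t: float
    r: float
    b: float


def to_page_top_left(
    x0: float, y0: float, x1: float, y1: float,
    parser_w: float, parser_h: float,
    page_w: float, page_h: float,
) -> Box:
    """Rescale parser coordinates to page units, then flip the y axis."""
    # Step 1: rescale, still in a bottom-left-origin system (y grows upward).
    left = x0 * page_w / parser_w
    bottom = y0 * page_h / parser_h
    right = x1 * page_w / parser_w
    top = y1 * page_h / parser_h
    # Step 2: flip y so the origin is the top-left corner (y grows downward).
    return Box(l=left, t=page_h - top, r=right, b=page_h - bottom)


box = to_page_top_left(100, 200, 300, 400, 1000, 2000, 500, 1000)
print(box)
```

After the flip, `t` is smaller than `b`, which matches the `l, t, r, b` order the drawing code unpacks from `c.rect.to_bounding_box().as_tuple()`.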

{docling-2.26.0 → docling-2.27.0}/docling/backend/docling_parse_v2_backend.py RENAMED

@@ -6,12 +6,13 @@ from typing import TYPE_CHECKING, Iterable, List, Optional, Union
 import pypdfium2 as pdfium
 from docling_core.types.doc import BoundingBox, CoordOrigin
+from docling_core.types.doc.page import BoundingRectangle, SegmentedPdfPage, TextCell
 from docling_parse.pdf_parsers import pdf_parser_v2
 from PIL import Image, ImageDraw
 from pypdfium2 import PdfPage
 from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
-from docling.datamodel.base_models import Cell, Size
+from docling.datamodel.base_models import Size
 from docling.utils.locks import pypdfium2_lock
 if TYPE_CHECKING:
@@ -78,8 +79,11 @@ class DoclingParseV2PageBackend(PdfPageBackend):
         return text_piece
-    def get_text_cells(self) -> Iterable[Cell]:
-        cells: List[Cell] = []
+    def get_segmented_page(self) -> Optional[SegmentedPdfPage]:
+        return None
+    def get_text_cells(self) -> Iterable[TextCell]:
+        cells: List[TextCell] = []
         cell_counter = 0
         if not self.valid:
@@ -106,16 +110,20 @@ class DoclingParseV2PageBackend(PdfPageBackend):
             text_piece = cell_data[cells_header.index("text")]
             cells.append(
-                Cell(
-                    id=cell_counter,
+                TextCell(
+                    index=cell_counter,
                     text=text_piece,
-                    bbox=BoundingBox(
-                        # l=x0, b=y0, r=x1, t=y1,
-                        l=x0 * page_size.width / parser_width,
-                        b=y0 * page_size.height / parser_height,
-                        r=x1 * page_size.width / parser_width,
-                        t=y1 * page_size.height / parser_height,
-                        coord_origin=CoordOrigin.BOTTOMLEFT,
+                    orig=text_piece,
+                    from_ocr=False,
+                    rect=BoundingRectangle.from_bounding_box(
+                        BoundingBox(
+                            # l=x0, b=y0, r=x1, t=y1,
+                            l=x0 * page_size.width / parser_width,
+                            b=y0 * page_size.height / parser_height,
+                            r=x1 * page_size.width / parser_width,
+                            t=y1 * page_size.height / parser_height,
+                            coord_origin=CoordOrigin.BOTTOMLEFT,
+                        )
                     ).to_top_left_origin(page_size.height),
                 )
             )

docling-2.27.0/docling/backend/docling_parse_v4_backend.py ADDED

@@ -0,0 +1,185 @@
+import logging
+import random
+from io import BytesIO
+from pathlib import Path
+from typing import TYPE_CHECKING, Iterable, List, Optional, Union
+import pypdfium2 as pdfium
+from docling_core.types.doc import BoundingBox, CoordOrigin
+from docling_core.types.doc.page import SegmentedPdfPage, TextCell
+from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument
+from PIL import Image, ImageDraw
+from pypdfium2 import PdfPage
+from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
+from docling.datamodel.base_models import Size
+from docling.utils.locks import pypdfium2_lock
+if TYPE_CHECKING:
+    from docling.datamodel.document import InputDocument
+_log = logging.getLogger(__name__)
+class DoclingParseV4PageBackend(PdfPageBackend):
+    def __init__(self, parsed_page: SegmentedPdfPage, page_obj: PdfPage):
+        self._ppage = page_obj
+        self._dpage = parsed_page
+        self.valid = parsed_page is not None
+    def is_valid(self) -> bool:
+        return self.valid
+    def get_text_in_rect(self, bbox: BoundingBox) -> str:
+        # Find intersecting cells on the page
+        text_piece = ""
+        page_size = self.get_size()
+        scale = (
+            1  # FIX - Replace with param in get_text_in_rect across backends (optional)
+        )
+        for i, cell in enumerate(self._dpage.textline_cells):
+            cell_bbox = (
+                cell.rect.to_bounding_box()
+                .to_top_left_origin(page_height=page_size.height)
+                .scaled(scale)
+            )
+            overlap_frac = cell_bbox.intersection_area_with(bbox) / cell_bbox.area()
+            if overlap_frac > 0.5:
+                if len(text_piece) > 0:
+                    text_piece += " "
+                text_piece += cell.text
+        return text_piece
+    def get_segmented_page(self) -> Optional[SegmentedPdfPage]:
+        return self._dpage
+    def get_text_cells(self) -> Iterable[TextCell]:
+        page_size = self.get_size()
+        [tc.to_top_left_origin(page_size.height) for tc in self._dpage.textline_cells]
+        # for cell in self._dpage.textline_cells:
+        #     rect = cell.rect
+        #
+        #     assert (
+        #         rect.to_bounding_box().l <= rect.to_bounding_box().r
+        #     ), f"left is > right on bounding box {rect.to_bounding_box()} of rect {rect}"
+        #     assert (
+        #         rect.to_bounding_box().t <= rect.to_bounding_box().b
+        #     ), f"top is > bottom on bounding box {rect.to_bounding_box()} of rect {rect}"
+        return self._dpage.textline_cells
+    def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
+        AREA_THRESHOLD = 0  # 32 * 32
+        images = self._dpage.bitmap_resources
+        for img in images:
+            cropbox = img.rect.to_bounding_box().to_top_left_origin(
+                self.get_size().height
+            )
+            if cropbox.area() > AREA_THRESHOLD:
+                cropbox = cropbox.scaled(scale=scale)
+                yield cropbox
+    def get_page_image(
+        self, scale: float = 1, cropbox: Optional[BoundingBox] = None
+    ) -> Image.Image:
+        page_size = self.get_size()
+        if not cropbox:
+            cropbox = BoundingBox(
+                l=0,
+                r=page_size.width,
+                t=0,
+                b=page_size.height,
+                coord_origin=CoordOrigin.TOPLEFT,
+            )
+            padbox = BoundingBox(
+                l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
+            )
+        else:
+            padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
+            padbox.r = page_size.width - padbox.r
+            padbox.t = page_size.height - padbox.t
+        image = (
+            self._ppage.render(
+                scale=scale * 1.5,
+                rotation=0,  # no additional rotation
+                crop=padbox.as_tuple(),
+            )
+            .to_pil()
+            .resize(size=(round(cropbox.width * scale), round(cropbox.height * scale)))
+        )  # We resize the image from 1.5x the given scale to make it sharper.
+        return image
+    def get_size(self) -> Size:
+        return Size(
+            width=self._dpage.dimension.width,
+            height=self._dpage.dimension.height,
+        )
+    def unload(self):
+        self._ppage = None
+        self._dpage = None
+class DoclingParseV4DocumentBackend(PdfDocumentBackend):
+    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
+        super().__init__(in_doc, path_or_stream)
+        with pypdfium2_lock:
+            self._pdoc = pdfium.PdfDocument(self.path_or_stream)
+        self.parser = DoclingPdfParser(loglevel="fatal")
+        self.dp_doc: PdfDocument = self.parser.load(path_or_stream=self.path_or_stream)
+        success = self.dp_doc is not None
+        if not success:
+            raise RuntimeError(
+                f"docling-parse v4 could not load document {self.document_hash}."
+            )
+    def page_count(self) -> int:
+        # return len(self._pdoc)  # To be replaced with docling-parse API
+        len_1 = len(self._pdoc)
+        len_2 = self.dp_doc.number_of_pages()
+        if len_1 != len_2:
+            _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")
+        return len_2
+    def load_page(
+        self, page_no: int, create_words: bool = True, create_textlines: bool = True
+    ) -> DoclingParseV4PageBackend:
+        with pypdfium2_lock:
+            return DoclingParseV4PageBackend(
+                self.dp_doc.get_page(
+                    page_no + 1,
+                    create_words=create_words,
+                    create_textlines=create_textlines,
+                ),
+                self._pdoc[page_no],
+            )
+    def is_valid(self) -> bool:
+        return self.page_count() > 0
+    def unload(self):
+        super().unload()
+        self.dp_doc.unload()
+        with pypdfium2_lock:
+            self._pdoc.close()
+        self._pdoc = None
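
Note on `get_text_in_rect` in the new v4 backend: a text line is included when more than half of its own area falls inside the query box, and the selected texts are joined with single spaces. The sketch below reproduces that selection rule with a hypothetical `Rect` class in place of the docling-core `BoundingBox`. The zero-area guard is an addition of this sketch and does not appear in the diffed code.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Hypothetical top-left-origin rectangle (t <= b); not a docling type."""

    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "Rect") -> float:
        width = min(self.r, other.r) - max(self.l, other.l)
        height = min(self.b, other.b) - max(self.t, other.t)
        return width * height if width > 0 and height > 0 else 0.0


def text_in_rect(
    cells: list[tuple[Rect, str]], query: Rect, min_overlap: float = 0.5
) -> str:
    """Join the text of cells whose overlap with `query` exceeds `min_overlap`."""
    text_piece = ""
    for rect, text in cells:
        cell_area = rect.area()
        if cell_area == 0:
            continue  # guard added in this sketch to avoid dividing by zero
        overlap_frac = rect.intersection_area(query) / cell_area
        if overlap_frac > min_overlap:
            if text_piece:
                text_piece += " "
            text_piece += text
    return text_piece


cells = [
    (Rect(0, 0, 10, 10), "inside"),   # fully covered by the query
    (Rect(8, 0, 18, 10), "edge"),     # only 40% of its area is covered
    (Rect(50, 50, 60, 60), "far"),    # no overlap at all
]
print(text_in_rect(cells, Rect(0, 0, 12, 12)))
```

Normalising by the cell's area rather than the query's means a small cell entirely inside a large query box always qualifies, while a long line that only clips the box does not.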
