PyPI - docling - Versions diffs - 2.39.0__tar.gz → 2.66.0__tar.gz - Mend

docling 2.39.0__tar.gz → 2.66.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (190)

{docling-2.39.0 → docling-2.66.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docling
-Version: 2.39.0
+Version: 2.66.0
 Summary: SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.
 Author-email: Christoph Auer <cau@zurich.ibm.com>, Michele Dolfi <dol@zurich.ibm.com>, Maxim Lysak <mly@zurich.ibm.com>, Nikos Livathinos <nli@zurich.ibm.com>, Ahmed Nassar <ahn@zurich.ibm.com>, Panos Vagenas <pva@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>
 License-Expression: MIT
@@ -22,34 +22,40 @@ Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
 Requires-Python: <4.0,>=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: pydantic<3.0.0,>=2.0.0
-Requires-Dist: docling-core[chunking]<3.0.0,>=2.39.0
-Requires-Dist: docling-ibm-models<4.0.0,>=3.4.4
-Requires-Dist: docling-parse<5.0.0,>=4.0.0
+Requires-Dist: docling-core[chunking]<3.0.0,>=2.50.1
+Requires-Dist: docling-parse<5.0.0,>=4.7.0
+Requires-Dist: docling-ibm-models<4,>=3.9.1
 Requires-Dist: filetype<2.0.0,>=1.2.0
-Requires-Dist: pypdfium2<5.0.0,>=4.30.0
+Requires-Dist: pypdfium2!=4.30.1,<5.0.0,>=4.30.0
 Requires-Dist: pydantic-settings<3.0.0,>=2.3.0
 Requires-Dist: huggingface_hub<1,>=0.23
 Requires-Dist: requests<3.0.0,>=2.32.2
-Requires-Dist: easyocr<2.0,>=1.7
+Requires-Dist: ocrmac<2.0.0,>=1.0.0; sys_platform == "darwin"
+Requires-Dist: rapidocr<4.0.0,>=3.3
 Requires-Dist: certifi>=2024.7.4
 Requires-Dist: rtree<2.0.0,>=1.3.0
-Requires-Dist: typer<0.17.0,>=0.12.5
+Requires-Dist: typer<0.20.0,>=0.12.5
 Requires-Dist: python-docx<2.0.0,>=1.1.2
 Requires-Dist: python-pptx<2.0.0,>=1.0.2
 Requires-Dist: beautifulsoup4<5.0.0,>=4.12.3
 Requires-Dist: pandas<3.0.0,>=2.1.4
 Requires-Dist: marko<3.0.0,>=2.1.2
 Requires-Dist: openpyxl<4.0.0,>=3.1.5
-Requires-Dist: lxml<6.0.0,>=4.0.0
+Requires-Dist: lxml<7.0.0,>=4.0.0
 Requires-Dist: pillow<12.0.0,>=10.0.0
 Requires-Dist: tqdm<5.0.0,>=4.65.0
 Requires-Dist: pluggy<2.0.0,>=1.0.0
 Requires-Dist: pylatexenc<3.0,>=2.10
 Requires-Dist: scipy<2.0.0,>=1.6.0
+Requires-Dist: accelerate<2,>=1.0.0
+Requires-Dist: polyfactory>=2.22.2
+Provides-Extra: easyocr
+Requires-Dist: easyocr<2.0,>=1.7; extra == "easyocr"
 Provides-Extra: tesserocr
 Requires-Dist: tesserocr<3.0.0,>=2.7.1; extra == "tesserocr"
 Provides-Extra: ocrmac
@@ -57,12 +63,15 @@ Requires-Dist: ocrmac<2.0.0,>=1.0.0; sys_platform == "darwin" and extra == "ocrm
 Provides-Extra: vlm
 Requires-Dist: transformers<5.0.0,>=4.46.0; extra == "vlm"
 Requires-Dist: accelerate<2.0.0,>=1.2.1; extra == "vlm"
-Requires-Dist: mlx-vlm>=0.1.22; (python_version >= "3.10" and sys_platform == "darwin" and platform_machine == "arm64") and extra == "vlm"
+Requires-Dist: mlx-vlm<1.0.0,>=0.3.0; (python_version >= "3.10" and python_version < "3.14" and sys_platform == "darwin" and platform_machine == "arm64") and extra == "vlm"
+Requires-Dist: vllm<1.0.0,>=0.10.0; (python_version >= "3.10" and python_version < "3.14" and sys_platform == "linux" and platform_machine == "x86_64") and extra == "vlm"
+Requires-Dist: qwen-vl-utils>=0.0.11; extra == "vlm"
 Provides-Extra: rapidocr
-Requires-Dist: rapidocr-onnxruntime<2.0.0,>=1.4.0; python_version < "3.13" and extra == "rapidocr"
-Requires-Dist: onnxruntime<2.0.0,>=1.7.0; extra == "rapidocr"
+Requires-Dist: rapidocr<4.0.0,>=3.3; extra == "rapidocr"
+Requires-Dist: onnxruntime<2.0.0,>=1.7.0; python_version < "3.14" and extra == "rapidocr"
 Provides-Extra: asr
-Requires-Dist: openai-whisper>=20240930; extra == "asr"
+Requires-Dist: mlx-whisper>=0.4.3; (python_version >= "3.10" and python_version < "3.14" and sys_platform == "darwin" and platform_machine == "arm64") and extra == "asr"
+Requires-Dist: openai-whisper>=20250625; python_version < "3.14" and extra == "asr"
 Dynamic: license-file
 <p align="center">
@@ -88,6 +97,8 @@ Dynamic: license-file
 [![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 [![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
+[![Chat with Dosu](https://dosu.dev/dosu-chat-badge.svg)](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github)
+[![Discord](https://img.shields.io/discord/1399788921306746971?color=6A7EC2&logo=discord&logoColor=ffffff)](https://docling.ai/discord)
 [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/10101/badge)](https://www.bestpractices.dev/projects/10101)
 [![LF AI & Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation&logoColor=fff&color=0094ff&labelColor=003778)](https://lfaidata.foundation/projects/)
@@ -95,17 +106,24 @@ Docling simplifies document processing, parsing diverse formats — including ad
 ## Features
-* 🗂️  Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️  Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
-* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
-* 🎙️  Support for Audio with Automatic Speech Recognition (ASR) models
+* 👓 Support of several Visual Language Models ([GraniteDocling](https://huggingface.co/ibm-granite/granite-docling-258M))
+* 🎙️ Audio support with Automatic Speech Recognition (ASR) models
+* 🔌 Connect to any agent using the [MCP server](https://docling-project.github.io/docling/usage/mcp/)
 * 💻 Simple and convenient CLI
+### What's new
+* 📤 Structured [information extraction][extraction] \[🧪 beta\]
+* 📑 New layout model (**Heron**) by default, for faster PDF parsing
+* 🔌 [MCP server](https://docling-project.github.io/docling/usage/mcp/) for agentic applications
+* 💬 Parsing of Web Video Text Tracks (WebVTT) files
 ### Coming soon
 * 📝 Metadata extraction, including title, authors, references & language
@@ -136,7 +154,7 @@ result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
 ```
-More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
+More [advanced usage options](https://docling-project.github.io/docling/usage/advanced_options/) are available in
 the docs.
 ## CLI
@@ -147,9 +165,9 @@ Docling has a built-in CLI to run conversions.
 docling https://arxiv.org/pdf/2206.01062
 ```
-You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
+You can also use 🥚[GraniteDocling](https://huggingface.co/ibm-granite/granite-docling-258M) and other VLMs via Docling CLI:
 ```bash
-docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
+docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062
 ```
 This will use MLX acceleration on supported Apple Silicon hardware.
@@ -216,3 +234,4 @@ The project was started by the AI for knowledge team at IBM Research Zurich.
 [supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
 [docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
 [integrations]: https://docling-project.github.io/docling/integrations/
+[extraction]: https://docling-project.github.io/docling/examples/extraction/
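
The dependency changes in the hunk above are easier to read as a summary of added, removed, and re-pinned core requirements. The following is a minimal sketch that uses only the `Requires-Dist` values shown in this diff; the helper names (`parse`, `core_requirements`) are hypothetical, and the regular expression is a simplification rather than a full dependency-specifier parser.

```python
import re

# Requires-Dist values copied from the removed (-) and added (+) lines above.
OLD = [
    "docling-core[chunking]<3.0.0,>=2.39.0",
    "docling-ibm-models<4.0.0,>=3.4.4",
    "docling-parse<5.0.0,>=4.0.0",
    "pypdfium2<5.0.0,>=4.30.0",
    "easyocr<2.0,>=1.7",
    "typer<0.17.0,>=0.12.5",
    "lxml<6.0.0,>=4.0.0",
]
NEW = [
    "docling-core[chunking]<3.0.0,>=2.50.1",
    "docling-parse<5.0.0,>=4.7.0",
    "docling-ibm-models<4,>=3.9.1",
    "pypdfium2!=4.30.1,<5.0.0,>=4.30.0",
    'ocrmac<2.0.0,>=1.0.0; sys_platform == "darwin"',
    "rapidocr<4.0.0,>=3.3",
    "typer<0.20.0,>=0.12.5",
    "lxml<7.0.0,>=4.0.0",
    "accelerate<2,>=1.0.0",
    "polyfactory>=2.22.2",
    'easyocr<2.0,>=1.7; extra == "easyocr"',
]


def parse(req):
    """Split a requirement into (name, version specifier, marker)."""
    spec, _, marker = req.partition(";")
    match = re.match(r"\s*([A-Za-z0-9_.-]+)(\[[^\]]*\])?\s*(.*)", spec)
    return match.group(1).lower(), match.group(3).strip(), marker.strip()


def core_requirements(reqs):
    """Keep only requirements that are not gated behind an extra."""
    core = {}
    for req in reqs:
        name, spec, marker = parse(req)
        if "extra ==" not in marker:
            core[name] = spec
    return core


old_core = core_requirements(OLD)
new_core = core_requirements(NEW)

added = sorted(set(new_core) - set(old_core))
removed = sorted(set(old_core) - set(new_core))
changed = sorted(
    name for name in set(old_core) & set(new_core)
    if old_core[name] != new_core[name]
)

print("added:  ", added)
print("removed:", removed)
print("changed:", changed)
```

Read this way, `easyocr` leaves the core requirements and reappears only under the `easyocr` extra, while `rapidocr`, `accelerate`, `polyfactory`, and (on macOS) `ocrmac` become core requirements.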

{docling-2.39.0 → docling-2.66.0}/README.md RENAMED

@@ -21,6 +21,8 @@
 [![License MIT](https://img.shields.io/github/license/docling-project/docling)](https://opensource.org/licenses/MIT)
 [![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)
 [![Docling Actor](https://apify.com/actor-badge?actor=vancura/docling?fpr=docling)](https://apify.com/vancura/docling)
+[![Chat with Dosu](https://dosu.dev/dosu-chat-badge.svg)](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github)
+[![Discord](https://img.shields.io/discord/1399788921306746971?color=6A7EC2&logo=discord&logoColor=ffffff)](https://docling.ai/discord)
 [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/10101/badge)](https://www.bestpractices.dev/projects/10101)
 [![LF AI & Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation&logoColor=fff&color=0094ff&labelColor=003778)](https://lfaidata.foundation/projects/)
@@ -28,17 +30,24 @@ Docling simplifies document processing, parsing diverse formats — including ad
 ## Features
-* 🗂️  Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️  Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
-* 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
-* 🎙️  Support for Audio with Automatic Speech Recognition (ASR) models
+* 👓 Support of several Visual Language Models ([GraniteDocling](https://huggingface.co/ibm-granite/granite-docling-258M))
+* 🎙️ Audio support with Automatic Speech Recognition (ASR) models
+* 🔌 Connect to any agent using the [MCP server](https://docling-project.github.io/docling/usage/mcp/)
 * 💻 Simple and convenient CLI
+### What's new
+* 📤 Structured [information extraction][extraction] \[🧪 beta\]
+* 📑 New layout model (**Heron**) by default, for faster PDF parsing
+* 🔌 [MCP server](https://docling-project.github.io/docling/usage/mcp/) for agentic applications
+* 💬 Parsing of Web Video Text Tracks (WebVTT) files
 ### Coming soon
 * 📝 Metadata extraction, including title, authors, references & language
@@ -69,7 +78,7 @@ result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
 ```
-More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
+More [advanced usage options](https://docling-project.github.io/docling/usage/advanced_options/) are available in
 the docs.
 ## CLI
@@ -80,9 +89,9 @@ Docling has a built-in CLI to run conversions.
 docling https://arxiv.org/pdf/2206.01062
 ```
-You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via Docling CLI:
+You can also use 🥚[GraniteDocling](https://huggingface.co/ibm-granite/granite-docling-258M) and other VLMs via Docling CLI:
 ```bash
-docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
+docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062
 ```
 This will use MLX acceleration on supported Apple Silicon hardware.
@@ -149,3 +158,4 @@ The project was started by the AI for knowledge team at IBM Research Zurich.
 [supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
 [docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
 [integrations]: https://docling-project.github.io/docling/integrations/
+[extraction]: https://docling-project.github.io/docling/examples/extraction/

{docling-2.39.0 → docling-2.66.0}/docling/backend/abstract_backend.py RENAMED

@@ -1,10 +1,16 @@
 from abc import ABC, abstractmethod
 from io import BytesIO
 from pathlib import Path
-from typing import TYPE_CHECKING, Set, Union
+from typing import TYPE_CHECKING, Union
 from docling_core.types.doc import DoclingDocument
+from docling.datamodel.backend_options import (
+    BackendOptions,
+    BaseBackendOptions,
+    DeclarativeBackendOptions,
+)
 if TYPE_CHECKING:
     from docling.datamodel.base_models import InputFormat
     from docling.datamodel.document import InputDocument
@@ -12,11 +18,17 @@ if TYPE_CHECKING:
 class AbstractDocumentBackend(ABC):
     @abstractmethod
-    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
+    def __init__(
+        self,
+        in_doc: "InputDocument",
+        path_or_stream: Union[BytesIO, Path],
+        options: BaseBackendOptions = BaseBackendOptions(),
+    ):
         self.file = in_doc.file
         self.path_or_stream = path_or_stream
         self.document_hash = in_doc.document_hash
         self.input_format = in_doc.format
+        self.options = options
     @abstractmethod
     def is_valid(self) -> bool:
@@ -35,7 +47,7 @@ class AbstractDocumentBackend(ABC):
     @classmethod
     @abstractmethod
-    def supported_formats(cls) -> Set["InputFormat"]:
+    def supported_formats(cls) -> set["InputFormat"]:
         pass
@@ -58,6 +70,15 @@ class DeclarativeDocumentBackend(AbstractDocumentBackend):
     straight without a recognition pipeline.
     """
+    @abstractmethod
+    def __init__(
+        self,
+        in_doc: "InputDocument",
+        path_or_stream: Union[BytesIO, Path],
+        options: BackendOptions = DeclarativeBackendOptions(),
+    ) -> None:
+        super().__init__(in_doc, path_or_stream, options)
     @abstractmethod
     def convert(self) -> DoclingDocument:
         pass
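
The change above threads an `options` object through the backend constructors, with each subclass supplying a more specific default. The sketch below illustrates that pattern in isolation. All classes here (`BaseOptions`, `PdfOptions`, `AbstractBackend`, `PdfBackend`) are hypothetical stand-ins built on `dataclasses`, not docling's own models, and the `password` field is only loosely modeled on the PDF backend shown later in this diff.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseOptions:
    """Hypothetical base options type (stand-in, not docling's class)."""

    kind: str = "base"


@dataclass(frozen=True)
class PdfOptions(BaseOptions):
    """Hypothetical PDF-specific options with an optional password."""

    kind: str = "pdf"
    password: Optional[str] = None


class AbstractBackend(ABC):
    @abstractmethod
    def __init__(self, source: str, options: BaseOptions = BaseOptions()):
        # An abstract __init__ may still carry shared setup logic that
        # subclasses reach through super().__init__(...).
        self.source = source
        self.options = options

    @abstractmethod
    def is_valid(self) -> bool:
        ...


class PdfBackend(AbstractBackend):
    def __init__(self, source: str, options: PdfOptions = PdfOptions()):
        # Forward the more specific options object to the base class.
        super().__init__(source, options)

    def is_valid(self) -> bool:
        return bool(self.source)

    def needs_password(self) -> bool:
        return self.options.password is not None


plain = PdfBackend("report.pdf")
protected = PdfBackend("secret.pdf", PdfOptions(password="s3cret"))
print(plain.needs_password(), protected.needs_password())
```

One design point worth noting: a default such as `options=PdfOptions()` is evaluated once, when the function is defined, so every call that omits `options` shares the same instance. The sketch makes the stand-in options immutable (`frozen=True`) so that sharing is harmless.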

{docling-2.39.0 → docling-2.66.0}/docling/backend/asciidoc_backend.py RENAMED

@@ -2,7 +2,7 @@ import logging
 import re
 from io import BytesIO
 from pathlib import Path
-from typing import Final, Set, Union
+from typing import Final, Union
 from docling_core.types.doc import (
     DocItemLabel,
@@ -27,7 +27,7 @@ DEFAULT_IMAGE_HEIGHT: Final = 128
 class AsciiDocBackend(DeclarativeDocumentBackend):
-    def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
+    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
         super().__init__(in_doc, path_or_stream)
         self.path_or_stream = path_or_stream
@@ -58,7 +58,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
         return
     @classmethod
-    def supported_formats(cls) -> Set[InputFormat]:
+    def supported_formats(cls) -> set[InputFormat]:
         return {InputFormat.ASCIIDOC}
     def convert(self) -> DoclingDocument:
@@ -78,7 +78,7 @@ class AsciiDocBackend(DeclarativeDocumentBackend):
         return doc
-    def _parse(self, doc: DoclingDocument):  # noqa: C901
+    def _parse(self, doc: DoclingDocument):
         """
         Main function that orchestrates the parsing by yielding components:
         title, section headers, text, lists, and tables.

{docling-2.39.0 → docling-2.66.0}/docling/backend/docling_parse_v4_backend.py RENAMED

@@ -12,6 +12,7 @@ from PIL import Image
 from pypdfium2 import PdfPage
 from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
+from docling.datamodel.backend_options import PdfBackendOptions
 from docling.datamodel.base_models import Size
 from docling.utils.locks import pypdfium2_lock
@@ -22,15 +23,64 @@ _log = logging.getLogger(__name__)
 class DoclingParseV4PageBackend(PdfPageBackend):
-    def __init__(self, parsed_page: SegmentedPdfPage, page_obj: PdfPage):
+    def __init__(
+        self,
+        *,
+        dp_doc: PdfDocument,
+        page_obj: PdfPage,
+        page_no: int,
+        create_words: bool = True,
+        create_textlines: bool = True,
+        keep_chars: bool = False,
+        keep_lines: bool = False,
+        keep_images: bool = True,
+    ):
         self._ppage = page_obj
-        self._dpage = parsed_page
-        self.valid = parsed_page is not None
+        self._dp_doc = dp_doc
+        self._page_no = page_no
+        self._create_words = create_words
+        self._create_textlines = create_textlines
+        self._keep_chars = keep_chars
+        self._keep_lines = keep_lines
+        self._keep_images = keep_images
+        self._dpage: Optional[SegmentedPdfPage] = None
+        self._unloaded = False
+        self.valid = (self._ppage is not None) and (self._dp_doc is not None)
+    def _ensure_parsed(self) -> None:
+        if self._dpage is not None:
+            return
+        seg_page = self._dp_doc.get_page(
+            self._page_no + 1,
+            keep_chars=self._keep_chars,
+            keep_lines=self._keep_lines,
+            keep_bitmaps=self._keep_images,
+            create_words=self._create_words,
+            create_textlines=self._create_textlines,
+            enforce_same_font=True,
+        )
+        # In Docling, all TextCell instances are expected with top-left origin.
+        [
+            tc.to_top_left_origin(seg_page.dimension.height)
+            for tc in seg_page.textline_cells
+        ]
+        [tc.to_top_left_origin(seg_page.dimension.height) for tc in seg_page.char_cells]
+        [tc.to_top_left_origin(seg_page.dimension.height) for tc in seg_page.word_cells]
+        self._dpage = seg_page
     def is_valid(self) -> bool:
         return self.valid
     def get_text_in_rect(self, bbox: BoundingBox) -> str:
+        self._ensure_parsed()
+        assert self._dpage is not None
         # Find intersecting cells on the page
         text_piece = ""
         page_size = self.get_size()
@@ -56,12 +106,19 @@ class DoclingParseV4PageBackend(PdfPageBackend):
         return text_piece
     def get_segmented_page(self) -> Optional[SegmentedPdfPage]:
+        self._ensure_parsed()
         return self._dpage
     def get_text_cells(self) -> Iterable[TextCell]:
+        self._ensure_parsed()
+        assert self._dpage is not None
         return self._dpage.textline_cells
     def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
+        self._ensure_parsed()
+        assert self._dpage is not None
         AREA_THRESHOLD = 0  # 32 * 32
         images = self._dpage.bitmap_resources
@@ -123,18 +180,33 @@ class DoclingParseV4PageBackend(PdfPageBackend):
         # )
     def unload(self):
+        if not self._unloaded and self._dp_doc is not None:
+            self._dp_doc.unload_pages((self._page_no + 1, self._page_no + 2))
+            self._unloaded = True
         self._ppage = None
         self._dpage = None
+        self._dp_doc = None
 class DoclingParseV4DocumentBackend(PdfDocumentBackend):
-    def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
-        super().__init__(in_doc, path_or_stream)
+    def __init__(
+        self,
+        in_doc: "InputDocument",
+        path_or_stream: Union[BytesIO, Path],
+        options: PdfBackendOptions = PdfBackendOptions(),
+    ):
+        super().__init__(in_doc, path_or_stream, options)
+        password = (
+            self.options.password.get_secret_value() if self.options.password else None
+        )
         with pypdfium2_lock:
-            self._pdoc = pdfium.PdfDocument(self.path_or_stream)
+            self._pdoc = pdfium.PdfDocument(self.path_or_stream, password=password)
         self.parser = DoclingPdfParser(loglevel="fatal")
-        self.dp_doc: PdfDocument = self.parser.load(path_or_stream=self.path_or_stream)
+        self.dp_doc: PdfDocument = self.parser.load(
+            path_or_stream=self.path_or_stream, password=password
+        )
         success = self.dp_doc is not None
         if not success:
@@ -157,37 +229,32 @@ class DoclingParseV4DocumentBackend(PdfDocumentBackend):
         self, page_no: int, create_words: bool = True, create_textlines: bool = True
     ) -> DoclingParseV4PageBackend:
         with pypdfium2_lock:
-            seg_page = self.dp_doc.get_page(
-                page_no + 1,
-                create_words=create_words,
-                create_textlines=create_textlines,
-            )
-            # In Docling, all TextCell instances are expected with top-left origin.
-            [
-                tc.to_top_left_origin(seg_page.dimension.height)
-                for tc in seg_page.textline_cells
-            ]
-            [
-                tc.to_top_left_origin(seg_page.dimension.height)
-                for tc in seg_page.char_cells
-            ]
-            [
-                tc.to_top_left_origin(seg_page.dimension.height)
-                for tc in seg_page.word_cells
-            ]
-            return DoclingParseV4PageBackend(
-                seg_page,
-                self._pdoc[page_no],
-            )
+            ppage = self._pdoc[page_no]
+        return DoclingParseV4PageBackend(
+            dp_doc=self.dp_doc,
+            page_obj=ppage,
+            page_no=page_no,
+            create_words=create_words,
+            create_textlines=create_textlines,
+        )
     def is_valid(self) -> bool:
         return self.page_count() > 0
     def unload(self):
         super().unload()
-        self.dp_doc.unload()
-        with pypdfium2_lock:
-            self._pdoc.close()
-        self._pdoc = None
+        # Unload docling-parse document first
+        if self.dp_doc is not None:
+            self.dp_doc.unload()
+            self.dp_doc = None
+        # Then close pypdfium2 document with proper locking
+        if self._pdoc is not None:
+            with pypdfium2_lock:
+                try:
+                    self._pdoc.close()
+                except Exception:
+                    # Ignore cleanup errors
+                    pass
+            self._pdoc = None
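
The page backend above moves from eager parsing in `load_page` to parsing on first access (`_ensure_parsed`), and makes `unload` safe to call more than once. A minimal sketch of that lazy-load pattern follows. `FakeParsedDoc` and `LazyPage` are hypothetical stand-ins that only mimic the call shape seen in the diff (one-based page numbers, a `(start, end)` tuple passed to `unload_pages`); they are not the docling-parse API.

```python
from typing import Optional


class FakeParsedDoc:
    """Hypothetical document handle that records how it is used."""

    def __init__(self) -> None:
        self.get_page_calls = 0
        self.unloaded_ranges = []

    def get_page(self, page_number: int) -> dict:
        self.get_page_calls += 1
        return {"page": page_number, "cells": ["hello", "world"]}

    def unload_pages(self, page_range) -> None:
        self.unloaded_ranges.append(page_range)


class LazyPage:
    def __init__(self, *, doc: FakeParsedDoc, page_no: int) -> None:
        self._doc: Optional[FakeParsedDoc] = doc
        self._page_no = page_no  # zero-based, as in the diff
        self._parsed: Optional[dict] = None
        self._unloaded = False

    def _ensure_parsed(self) -> None:
        # Parse only on first access; later calls reuse the cached result.
        if self._parsed is not None:
            return
        self._parsed = self._doc.get_page(self._page_no + 1)

    def get_text_cells(self):
        self._ensure_parsed()
        assert self._parsed is not None
        return self._parsed["cells"]

    def unload(self) -> None:
        # Release the page once, then drop references; repeat calls are no-ops.
        if not self._unloaded and self._doc is not None:
            self._doc.unload_pages((self._page_no + 1, self._page_no + 2))
            self._unloaded = True
        self._parsed = None
        self._doc = None


doc = FakeParsedDoc()
page = LazyPage(doc=doc, page_no=0)
calls_before_access = doc.get_page_calls  # nothing parsed yet
cells = page.get_text_cells()
page.get_text_cells()  # served from the cached parse
page.unload()
page.unload()  # second call does nothing
print(calls_before_access, doc.get_page_calls, doc.unloaded_ranges)
```

With this shape, constructing a page object is cheap, and pages that are never read are never parsed.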

docling-2.66.0/docling/backend/docx/drawingml/utils.py ADDED

@@ -0,0 +1,131 @@
+import os
+import shutil
+import subprocess
+from pathlib import Path
+from tempfile import mkdtemp
+from typing import Callable, Optional
+import pypdfium2
+from docx.document import Document
+from PIL import Image, ImageChops
+def get_libreoffice_cmd(raise_if_unavailable: bool = False) -> Optional[str]:
+    """Return the libreoffice cmd and optionally test it."""
+    libreoffice_cmd = (
+        shutil.which("libreoffice")
+        or shutil.which("soffice")
+        or (
+            "/Applications/LibreOffice.app/Contents/MacOS/soffice"
+            if os.path.isfile("/Applications/LibreOffice.app/Contents/MacOS/soffice")
+            else None
+        )
+    )
+    if raise_if_unavailable:
+        if libreoffice_cmd is None:
+            raise RuntimeError("Libreoffice not found")
+        # The following test will raise if the libreoffice_cmd cannot be used
+        subprocess.run(
+            [
+                libreoffice_cmd,
+                "-h",
+            ],
+            stdout=subprocess.DEVNULL,
+            stderr=subprocess.DEVNULL,
+            check=True,
+        )
+    return libreoffice_cmd
+def get_docx_to_pdf_converter() -> Optional[Callable]:
+    """
+    Detects the best available DOCX to PDF tool and returns a conversion function.
+    The returned function accepts (input_path, output_path).
+    Returns None if no tool is available.
+    """
+    # Try LibreOffice
+    libreoffice_cmd = get_libreoffice_cmd()
+    if libreoffice_cmd:
+        def convert_with_libreoffice(input_path, output_path):
+            subprocess.run(
+                [
+                    libreoffice_cmd,
+                    "--headless",
+                    "--convert-to",
+                    "pdf",
+                    "--outdir",
+                    os.path.dirname(output_path),
+                    input_path,
+                ],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.DEVNULL,
+                check=True,
+            )
+            expected_output = os.path.join(
+                os.path.dirname(output_path),
+                os.path.splitext(os.path.basename(input_path))[0] + ".pdf",
+            )
+            if expected_output != output_path:
+                os.rename(expected_output, output_path)
+        return convert_with_libreoffice
+    ## Space for other DOCX to PDF converters if available
+    # No tools found
+    return None
+def crop_whitespace(image: Image.Image, bg_color=None, padding=0) -> Image.Image:
+    if bg_color is None:
+        bg_color = image.getpixel((0, 0))
+    bg = Image.new(image.mode, image.size, bg_color)
+    diff = ImageChops.difference(image, bg)
+    bbox = diff.getbbox()
+    if bbox:
+        left, upper, right, lower = bbox
+        left = max(0, left - padding)
+        upper = max(0, upper - padding)
+        right = min(image.width, right + padding)
+        lower = min(image.height, lower + padding)
+        return image.crop((left, upper, right, lower))
+    else:
+        return image
+def get_pil_from_dml_docx(
+    docx: Document, converter: Optional[Callable]
+) -> Optional[Image.Image]:
+    if converter is None:
+        return None
+    temp_dir = Path(mkdtemp())
+    temp_docx = Path(temp_dir / "drawing_only.docx")
+    temp_pdf = Path(temp_dir / "drawing_only.pdf")
+    # 1) Save docx temporarily
+    docx.save(str(temp_docx))
+    # 2) Export to PDF
+    converter(temp_docx, temp_pdf)
+    # 3) Load PDF as PNG
+    pdf = pypdfium2.PdfDocument(temp_pdf)
+    page = pdf[0]
+    image = crop_whitespace(page.render(scale=2).to_pil())
+    page.close()
+    pdf.close()
+    shutil.rmtree(temp_dir, ignore_errors=True)
+    return image
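
In the converter above, the code computes an `expected_output` path from the input file's base name and the output directory, then renames that file if it differs from the requested `output_path`. The sketch below isolates that path logic. The function names are hypothetical, and the sketch normalizes arguments with `os.fspath` so that `str` and `pathlib.Path` inputs compare consistently, which is an addition of this example rather than something the diff does.

```python
import os
from pathlib import Path


def converted_pdf_path(input_path, output_path) -> str:
    """Path where a '<input stem>.pdf' file would land in output_path's directory."""
    out_dir = os.path.dirname(os.fspath(output_path))
    stem = os.path.splitext(os.path.basename(os.fspath(input_path)))[0]
    return os.path.join(out_dir, stem + ".pdf")


def needs_rename(input_path, output_path) -> bool:
    """True when the converted file must be renamed to match output_path."""
    return converted_pdf_path(input_path, output_path) != os.fspath(output_path)


work = Path("work")
same_stem = needs_rename(work / "drawing_only.docx", work / "drawing_only.pdf")
other_stem = needs_rename(work / "drawing_only.docx", work / "figure.pdf")
print(same_stem, other_stem)
```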

{docling-2.39.0 → docling-2.66.0}/docling/backend/docx/latex/latex_dict.py RENAMED

@@ -65,6 +65,11 @@ CHR_BO = {
     "\u2210": "\\coprod",
     "\u2211": "\\sum",
     "\u222b": "\\int",
+    "\u222c": "\\iint",
+    "\u222d": "\\iiint",
+    "\u222e": "\\oint",
+    "\u222f": "\\oiint",
+    "\u2230": "\\oiiint",
     "\u22c0": "\\bigwedge",
     "\u22c1": "\\bigvee",
     "\u22c2": "\\bigcap",

{docling-2.39.0 → docling-2.66.0}/docling/backend/docx/latex/omml.py RENAMED

@@ -260,7 +260,15 @@ class oMath2Latex(Tag2Method):
         the fraction object
         """
         c_dict = self.process_children_dict(elm)
-        pr = c_dict["fPr"]
+        pr = c_dict.get("fPr")
+        if pr is None:
+            # Handle missing fPr element gracefully
+            _log.debug("Missing fPr element in fraction, using default formatting")
+            latex_s = F_DEFAULT
+            return latex_s.format(
+                num=c_dict.get("num"),
+                den=c_dict.get("den"),
+            )
         latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
         return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
@@ -373,7 +381,8 @@ class oMath2Latex(Tag2Method):
         bo = ""
         for stag, t, e in self.process_children_list(elm):
             if stag == "naryPr":
-                bo = get_val(t.chr, store=CHR_BO)
+                # if <m:naryPr> contains no <m:chr>, the n-ary represents an integral
+                bo = get_val(t.chr, default="\\int", store=CHR_BO)
             else:
                 res.append(t)
         return bo + BLANK.join(res)
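
The two `omml.py` hunks above add fallbacks: a fraction without an `fPr` child uses the default fraction template, and an n-ary operator without a `chr` value is treated as an integral. The sketch below shows both fallbacks on plain dictionaries. `nary_operator` and `fraction_latex` are hypothetical helpers, not docling's `get_val` or `do_f`, and the `F_DEFAULT` template is an assumed value chosen for illustration; only the integral entries of the operator table are taken from this diff.

```python
# Integral-family entries as they appear in CHR_BO in this diff.
CHR_BO = {
    "\u222b": "\\int",
    "\u222c": "\\iint",
    "\u222d": "\\iiint",
    "\u222e": "\\oint",
    "\u222f": "\\oiint",
    "\u2230": "\\oiiint",
}

# Assumed default fraction template (illustrative, not copied from docling).
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"


def nary_operator(chr_value, default="\\int", store=CHR_BO):
    """Map an operator character to LaTeX; a missing character means an integral."""
    if chr_value is None:
        return default
    # Unknown characters pass through unchanged in this sketch.
    return store.get(chr_value, chr_value)


def fraction_latex(children: dict) -> str:
    """Render a fraction, tolerating a missing 'fPr' properties entry."""
    if children.get("fPr") is None:
        return F_DEFAULT.format(num=children.get("num"), den=children.get("den"))
    # Property-specific formatting is out of scope here; use the default template.
    return F_DEFAULT.format(num=children.get("num"), den=children.get("den"))


print(nary_operator(None))
print(nary_operator("\u222e"))
print(fraction_latex({"num": "a", "den": "b"}))
```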
