docling-core 2.31.2__tar.gz → 2.33.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of docling-core might be problematic.

Files changed (109)
  1. docling_core-2.33.0/PKG-INFO +143 -0
  2. docling_core-2.33.0/README.md +99 -0
  3. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/hybrid_chunker.py +7 -8
  4. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/tokenizer/huggingface.py +20 -12
  5. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/common.py +23 -6
  6. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/html.py +42 -11
  7. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/markdown.py +21 -0
  8. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/base.py +38 -0
  9. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/document.py +24 -4
  10. docling_core-2.33.0/docling_core.egg-info/PKG-INFO +143 -0
  11. docling_core-2.33.0/docling_core.egg-info/SOURCES.txt +104 -0
  12. docling_core-2.33.0/docling_core.egg-info/dependency_links.txt +1 -0
  13. docling_core-2.33.0/docling_core.egg-info/entry_points.txt +2 -0
  14. docling_core-2.33.0/docling_core.egg-info/requires.txt +18 -0
  15. docling_core-2.33.0/docling_core.egg-info/top_level.txt +1 -0
  16. docling_core-2.33.0/pyproject.toml +157 -0
  17. docling_core-2.33.0/setup.cfg +4 -0
  18. docling_core-2.33.0/test/test_base.py +295 -0
  19. docling_core-2.33.0/test/test_collection.py +158 -0
  20. docling_core-2.33.0/test/test_data_gen_flag.py +9 -0
  21. docling_core-2.33.0/test/test_doc_base.py +45 -0
  22. docling_core-2.33.0/test/test_doc_legacy_convert.py +40 -0
  23. docling_core-2.33.0/test/test_doc_schema.py +147 -0
  24. docling_core-2.33.0/test/test_doc_schema_extractor.py +31 -0
  25. docling_core-2.33.0/test/test_docling_doc.py +1603 -0
  26. docling_core-2.33.0/test/test_doctags_load.py +143 -0
  27. docling_core-2.33.0/test/test_hierarchical_chunker.py +73 -0
  28. docling_core-2.33.0/test/test_hybrid_chunker.py +384 -0
  29. docling_core-2.33.0/test/test_json_schema_to_search_mapper.py +106 -0
  30. docling_core-2.33.0/test/test_nlp_qa.py +46 -0
  31. docling_core-2.33.0/test/test_otsl_table_export.py +284 -0
  32. docling_core-2.33.0/test/test_page.py +79 -0
  33. docling_core-2.33.0/test/test_rec_schema.py +268 -0
  34. docling_core-2.33.0/test/test_search_meta.py +49 -0
  35. docling_core-2.33.0/test/test_serialization.py +372 -0
  36. docling_core-2.33.0/test/test_utils.py +94 -0
  37. docling_core-2.33.0/test/test_visualization.py +42 -0
  38. docling_core-2.31.2/PKG-INFO +0 -143
  39. docling_core-2.31.2/README.md +0 -97
  40. docling_core-2.31.2/pyproject.toml +0 -168
  41. {docling_core-2.31.2 → docling_core-2.33.0}/LICENSE +0 -0
  42. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/__init__.py +0 -0
  43. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/cli/__init__.py +0 -0
  44. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/cli/view.py +0 -0
  45. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/experimental/__init__.py +0 -0
  46. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/py.typed +0 -0
  47. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/doc/ANN.json +0 -0
  48. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/doc/DOC.json +0 -0
  49. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/doc/OCR-output.json +0 -0
  50. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/doc/RAW.json +0 -0
  51. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/generated/ccs_document_schema.json +0 -0
  52. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/generated/minimal_document_schema_flat.json +0 -0
  53. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/search/search_doc_mapping.json +0 -0
  54. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/resources/schemas/search/search_doc_mapping_v2.json +0 -0
  55. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/search/__init__.py +0 -0
  56. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/search/json_schema_to_search_mapper.py +0 -0
  57. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/search/mapping.py +0 -0
  58. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/search/meta.py +0 -0
  59. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/search/package.py +0 -0
  60. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/__init__.py +0 -0
  61. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/__init__.py +0 -0
  62. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/base.py +0 -0
  63. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/hierarchical_chunker.py +0 -0
  64. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/tokenizer/__init__.py +0 -0
  65. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/tokenizer/base.py +0 -0
  66. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/chunker/tokenizer/openai.py +0 -0
  67. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/__init__.py +0 -0
  68. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/base.py +0 -0
  69. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/doctags.py +0 -0
  70. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/serializer/html_styles.py +0 -0
  71. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/visualizer/__init__.py +0 -0
  72. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/visualizer/base.py +0 -0
  73. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/visualizer/layout_visualizer.py +0 -0
  74. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/transforms/visualizer/reading_order_visualizer.py +0 -0
  75. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/__init__.py +0 -0
  76. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/base.py +0 -0
  77. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/__init__.py +0 -0
  78. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/labels.py +0 -0
  79. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/page.py +0 -0
  80. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/tokens.py +0 -0
  81. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/doc/utils.py +0 -0
  82. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/gen/__init__.py +0 -0
  83. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/gen/generic.py +0 -0
  84. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/io/__init__.py +0 -0
  85. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/__init__.py +0 -0
  86. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/base.py +0 -0
  87. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/doc_ann.py +0 -0
  88. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/doc_ocr.py +0 -0
  89. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/doc_raw.py +0 -0
  90. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/document.py +0 -0
  91. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/legacy_doc/tokens.py +0 -0
  92. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/nlp/__init__.py +0 -0
  93. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/nlp/qa.py +0 -0
  94. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/nlp/qa_labels.py +0 -0
  95. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/__init__.py +0 -0
  96. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/attribute.py +0 -0
  97. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/base.py +0 -0
  98. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/predicate.py +0 -0
  99. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/record.py +0 -0
  100. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/statement.py +0 -0
  101. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/types/rec/subject.py +0 -0
  102. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/__init__.py +0 -0
  103. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/alias.py +0 -0
  104. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/file.py +0 -0
  105. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/generate_docs.py +0 -0
  106. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/generate_jsonschema.py +0 -0
  107. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/legacy.py +0 -0
  108. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/validate.py +0 -0
  109. {docling_core-2.31.2 → docling_core-2.33.0}/docling_core/utils/validators.py +0 -0
docling_core-2.33.0/PKG-INFO

@@ -0,0 +1,143 @@
+ Metadata-Version: 2.4
+ Name: docling-core
+ Version: 2.33.0
+ Summary: A python library to define and validate data types in Docling.
+ Author-email: Cesar Berrospi Ramis <ceb@zurich.ibm.com>, Panos Vagenas <pva@zurich.ibm.com>, Michele Dolfi <dol@zurich.ibm.com>, Christoph Auer <cau@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>
+ Maintainer-email: Panos Vagenas <pva@zurich.ibm.com>, Michele Dolfi <dol@zurich.ibm.com>, Christoph Auer <cau@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>, Cesar Berrospi Ramis <ceb@zurich.ibm.com>
+ License-Expression: MIT
+ Project-URL: homepage, https://github.com/docling-project
+ Project-URL: repository, https://github.com/docling-project/docling-core
+ Project-URL: issues, https://github.com/docling-project/docling-core/issues
+ Project-URL: changelog, https://github.com/docling-project/docling-core/blob/main/CHANGELOG.md
+ Keywords: docling,discovery,etl,information retrieval,analytics,database,database schema,schema,JSON
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Natural Language :: English
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Topic :: Database
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Typing :: Typed
+ Classifier: Programming Language :: Python :: 3
+ Requires-Python: <4.0,>=3.9
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: jsonschema<5.0.0,>=4.16.0
+ Requires-Dist: pydantic!=2.10.0,!=2.10.1,!=2.10.2,<3.0.0,>=2.6.0
+ Requires-Dist: jsonref<2.0.0,>=1.1.0
+ Requires-Dist: tabulate<0.10.0,>=0.9.0
+ Requires-Dist: pandas<3.0.0,>=2.1.4
+ Requires-Dist: pillow<12.0.0,>=10.0.0
+ Requires-Dist: pyyaml<7.0.0,>=5.1
+ Requires-Dist: typing-extensions<5.0.0,>=4.12.2
+ Requires-Dist: typer<0.16.0,>=0.12.5
+ Requires-Dist: latex2mathml<4.0.0,>=3.77.0
+ Provides-Extra: chunking
+ Requires-Dist: semchunk<3.0.0,>=2.2.0; extra == "chunking"
+ Requires-Dist: transformers<5.0.0,>=4.34.0; extra == "chunking"
+ Provides-Extra: chunking-openai
+ Requires-Dist: semchunk; extra == "chunking-openai"
+ Requires-Dist: tiktoken<0.10.0,>=0.9.0; extra == "chunking-openai"
+ Dynamic: license-file
+
+ # Docling Core
+
+ [![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
+ ![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13-blue)
+ [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/docling-project/docling-core)](https://opensource.org/licenses/MIT)
+
+ Docling Core is a library that defines core data types and transformations in [Docling](https://github.com/docling-project/docling).
+
+ ## Installation
+
+ To use Docling Core, simply install `docling-core` from your package manager, e.g. pip:
+ ```bash
+ pip install docling-core
+ ```
+
+ ### Development setup
+
+ To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and uv. You can then install from your local clone's root dir:
+ ```bash
+ uv sync --all-extras
+ ```
+
+ To run the pytest suite, execute:
+ ```
+ uv run pytest -s test
+ ```
+
+ ## Main features
+
+ Docling Core provides the foundational DoclingDocument data model and API, as well as
+ additional APIs for tasks like serialization and chunking, which are key to developing
+ generative AI applications using Docling.
+
+ ### DoclingDocument
+
+ Docling Core defines the DoclingDocument as a Pydantic model, allowing for advanced
+ data model control, customizability, and interoperability.
+
+ In addition to specifying the schema, it provides a handy API for building documents,
+ as well as for basic operations, e.g. exporting to various formats, like Markdown, HTML,
+ and others.
+
+ 👉 More details:
+ - [Architecture docs](https://docling-project.github.io/docling/concepts/architecture/)
+ - [DoclingDocument docs](https://docling-project.github.io/docling/concepts/docling_document/)
+
+ ### Serialization
+
+ Different users can have varying requirements when it comes to serialization.
+ To address this, the Serialization API introduces a design that allows easy extension,
+ while providing feature-rich built-in implementations (on which the respective
+ DoclingDocument helpers are actually based).
+
+ 👉 More details:
+ - [Serialization docs](https://docling-project.github.io/docling/concepts/serialization/)
+ - [Serialization example](https://docling-project.github.io/docling/examples/serialization/)
+
+ ### Chunking
+
+ Similarly to above, the Chunking API provides built-in chunking capabilities as well as
+ a design that enables easy extension, this way tackling customization requirements of
+ different use cases.
+
+ 👉 More details:
+ - [Chunking docs](https://docling-project.github.io/docling/concepts/chunking/)
+ - [Hybrid chunking example](https://docling-project.github.io/docling/examples/hybrid_chunking/)
+ - [Advanced chunking and serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/)
+
+ ## Contributing
+
+ Please read [Contributing to Docling Core](./CONTRIBUTING.md) for details.
+
+ ## References
+
+ If you use Docling Core in your projects, please consider citing the following:
+
+ ```bib
+ @techreport{Docling,
+ author = "Deep Search Team",
+ month = 8,
+ title = "Docling Technical Report",
+ url = "https://arxiv.org/abs/2408.09869",
+ eprint = "2408.09869",
+ doi = "10.48550/arXiv.2408.09869",
+ version = "1.0.0",
+ year = 2024
+ }
+ ```
+
+ ## License
+
+ The Docling Core codebase is under MIT license.
+ For individual model usage, please refer to the model licenses found in the original packages.

docling_core-2.33.0/README.md

@@ -0,0 +1,99 @@
+ # Docling Core
+
+ [![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
+ ![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13-blue)
+ [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/docling-project/docling-core)](https://opensource.org/licenses/MIT)
+
+ Docling Core is a library that defines core data types and transformations in [Docling](https://github.com/docling-project/docling).
+
+ ## Installation
+
+ To use Docling Core, simply install `docling-core` from your package manager, e.g. pip:
+ ```bash
+ pip install docling-core
+ ```
+
+ ### Development setup
+
+ To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and uv. You can then install from your local clone's root dir:
+ ```bash
+ uv sync --all-extras
+ ```
+
+ To run the pytest suite, execute:
+ ```
+ uv run pytest -s test
+ ```
+
+ ## Main features
+
+ Docling Core provides the foundational DoclingDocument data model and API, as well as
+ additional APIs for tasks like serialization and chunking, which are key to developing
+ generative AI applications using Docling.
+
+ ### DoclingDocument
+
+ Docling Core defines the DoclingDocument as a Pydantic model, allowing for advanced
+ data model control, customizability, and interoperability.
+
+ In addition to specifying the schema, it provides a handy API for building documents,
+ as well as for basic operations, e.g. exporting to various formats, like Markdown, HTML,
+ and others.
+
+ 👉 More details:
+ - [Architecture docs](https://docling-project.github.io/docling/concepts/architecture/)
+ - [DoclingDocument docs](https://docling-project.github.io/docling/concepts/docling_document/)
+
+ ### Serialization
+
+ Different users can have varying requirements when it comes to serialization.
+ To address this, the Serialization API introduces a design that allows easy extension,
+ while providing feature-rich built-in implementations (on which the respective
+ DoclingDocument helpers are actually based).
+
+ 👉 More details:
+ - [Serialization docs](https://docling-project.github.io/docling/concepts/serialization/)
+ - [Serialization example](https://docling-project.github.io/docling/examples/serialization/)
+
+ ### Chunking
+
+ Similarly to above, the Chunking API provides built-in chunking capabilities as well as
+ a design that enables easy extension, this way tackling customization requirements of
+ different use cases.
+
+ 👉 More details:
+ - [Chunking docs](https://docling-project.github.io/docling/concepts/chunking/)
+ - [Hybrid chunking example](https://docling-project.github.io/docling/examples/hybrid_chunking/)
+ - [Advanced chunking and serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/)
+
+ ## Contributing
+
+ Please read [Contributing to Docling Core](./CONTRIBUTING.md) for details.
+
+ ## References
+
+ If you use Docling Core in your projects, please consider citing the following:
+
+ ```bib
+ @techreport{Docling,
+ author = "Deep Search Team",
+ month = 8,
+ title = "Docling Technical Report",
+ url = "https://arxiv.org/abs/2408.09869",
+ eprint = "2408.09869",
+ doi = "10.48550/arXiv.2408.09869",
+ version = "1.0.0",
+ year = 2024
+ }
+ ```
+
+ ## License
+
+ The Docling Core codebase is under MIT license.
+ For individual model usage, please refer to the model licenses found in the original packages.

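As a quick orientation for the README content above, here is a minimal sketch of building and exporting a document. The `add_heading`/`add_text` helpers, label enum, and import paths are assumptions based on the documented DoclingDocument API, not part of this diff.

```python
# Minimal sketch (assumed API: DoclingDocument building helpers and Markdown export).
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.labels import DocItemLabel

doc = DoclingDocument(name="sample")
doc.add_heading(text="Hello Docling")  # adds a section header item to the body
doc.add_text(label=DocItemLabel.PARAGRAPH, text="A paragraph built via the API.")
print(doc.export_to_markdown())  # same helper whose signature changes below
```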
docling_core/transforms/chunker/hybrid_chunker.py

@@ -88,7 +88,6 @@ class HybridChunker(BaseChunker):
  "For updated usage check out "
  "https://docling-project.github.io/docling/examples/hybrid_chunking/",
  DeprecationWarning,
- stacklevel=3,
  )

  if isinstance(tokenizer, str):
@@ -156,7 +155,6 @@ class HybridChunker(BaseChunker):
  meta = DocMeta(
  doc_items=doc_items,
  headings=doc_chunk.meta.headings,
- captions=doc_chunk.meta.captions,
  origin=doc_chunk.meta.origin,
  )
  window_text = (
@@ -235,7 +233,9 @@ class HybridChunker(BaseChunker):
  )
  if available_length <= 0:
  warnings.warn(
- f"Headers and captions for this chunk are longer than the total amount of size for the chunk, chunk will be ignored: {doc_chunk.text=}" # noqa
+ "Headers and captions for this chunk are longer than the total "
+ "amount of size for the chunk, chunk will be ignored: "
+ f"{doc_chunk.text=}"
  )
  return []
  text = doc_chunk.text
@@ -250,10 +250,10 @@ class HybridChunker(BaseChunker):
  num_chunks = len(chunks)
  while window_end < num_chunks:
  chunk = chunks[window_end]
- headings_and_captions = (chunk.meta.headings, chunk.meta.captions)
+ headings = chunk.meta.headings
  ready_to_append = False
  if window_start == window_end:
- current_headings_and_captions = headings_and_captions
+ current_headings = headings
  window_end += 1
  first_chunk_of_window = chunk
  else:
@@ -264,13 +264,12 @@ class HybridChunker(BaseChunker):
  text=self.delim.join([chk.text for chk in chks]),
  meta=DocMeta(
  doc_items=doc_items,
- headings=current_headings_and_captions[0],
- captions=current_headings_and_captions[1],
+ headings=current_headings,
  origin=chunk.meta.origin,
  ),
  )
  if (
- headings_and_captions == current_headings_and_captions
+ headings == current_headings
  and self._count_chunk_tokens(doc_chunk=candidate) <= self.max_tokens
  ):
  # there is room to include the new chunk so add it to the window and

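For context on the HybridChunker changes above (peer merging is now keyed on headings only, and captions are dropped from `DocMeta`), here is a minimal usage sketch; `doc` and `tokenizer` are assumed to exist, and `merge_peers` reflects the chunker's public option rather than anything introduced in this diff.

```python
# Minimal sketch (assumes `doc` is a DoclingDocument and `tokenizer` a BaseTokenizer).
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker

chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
for chunk in chunker.chunk(dl_doc=doc):
    # after this release, chunks merged within a window share headings only
    print(chunk.meta.headings, "->", chunk.text[:60])
```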
docling_core/transforms/chunker/tokenizer/huggingface.py

@@ -1,10 +1,11 @@
  """HuggingFace tokenization."""

- import sys
+ import json
  from os import PathLike
  from typing import Optional, Union

- from pydantic import ConfigDict, PositiveInt, TypeAdapter, model_validator
+ from huggingface_hub import hf_hub_download
+ from pydantic import ConfigDict, model_validator
  from typing_extensions import Self

  from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer
@@ -28,16 +29,23 @@ class HuggingFaceTokenizer(BaseTokenizer):

  @model_validator(mode="after")
  def _patch(self) -> Self:
- if hasattr(self.tokenizer, "model_max_length"):
- model_max_tokens: PositiveInt = TypeAdapter(PositiveInt).validate_python(
- self.tokenizer.model_max_length
- )
- user_max_tokens = self.max_tokens or sys.maxsize
- self.max_tokens = min(model_max_tokens, user_max_tokens)
- elif self.max_tokens is None:
- raise ValueError(
- "max_tokens must be defined as model does not define model_max_length"
- )
+ if self.max_tokens is None:
+ try:
+ # try to use SentenceTransformers-specific config as that seems to be
+ # reliable (whenever available)
+ config_name = "sentence_bert_config.json"
+ config_path = hf_hub_download(
+ repo_id=self.tokenizer.name_or_path,
+ filename=config_name,
+ )
+ with open(config_path) as f:
+ data = json.load(f)
+ self.max_tokens = int(data["max_seq_length"])
+ except Exception as e:
+ raise RuntimeError(
+ "max_tokens could not be determined automatically; please set "
+ "explicitly."
+ ) from e
  return self

  def count_tokens(self, text: str):

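The tokenizer change above means that when `max_tokens` is not given, it is now resolved from the model's `sentence_bert_config.json` on the Hub, and a `RuntimeError` is raised if that lookup fails; passing `max_tokens` explicitly sidesteps the lookup entirely. A minimal sketch (the model id is just an example):

```python
from transformers import AutoTokenizer

from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

hf_tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=256,  # explicit value skips the hf_hub_download fallback shown above
)
print(hf_tokenizer.count_tokens("Docling Core"))
```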
docling_core/transforms/serializer/common.py

@@ -11,7 +11,7 @@ from functools import cached_property
  from pathlib import Path
  from typing import Any, Iterable, Optional, Tuple, Union

- from pydantic import AnyUrl, BaseModel, NonNegativeInt, computed_field
+ from pydantic import AnyUrl, BaseModel, ConfigDict, NonNegativeInt, computed_field
  from typing_extensions import Self, override

  from docling_core.transforms.serializer.base import (
@@ -39,7 +39,11 @@ from docling_core.types.doc.document import (
  KeyValueItem,
  NodeItem,
  OrderedList,
+ PictureClassificationData,
+ PictureDataType,
+ PictureDescriptionData,
  PictureItem,
+ PictureMoleculeData,
  TableItem,
  TextItem,
  UnorderedList,
@@ -118,6 +122,23 @@ def _iterate_items(
  yield item


+ def _get_picture_annotation_text(annotation: PictureDataType) -> Optional[str]:
+ result = None
+ if isinstance(annotation, PictureClassificationData):
+ predicted_class = (
+ annotation.predicted_classes[0].class_name
+ if annotation.predicted_classes
+ else None
+ )
+ if predicted_class is not None:
+ result = predicted_class.replace("_", " ")
+ elif isinstance(annotation, PictureDescriptionData):
+ result = annotation.text
+ elif isinstance(annotation, PictureMoleculeData):
+ result = annotation.smi
+ return result
+
+
  def create_ser_result(
  *,
  text: str = "",
@@ -176,11 +197,7 @@ class CommonParams(BaseModel):
  class DocSerializer(BaseModel, BaseDocSerializer):
  """Class for document serializers."""

- class Config:
- """Pydantic config."""
-
- arbitrary_types_allowed = True
- extra = "forbid"
+ model_config = ConfigDict(arbitrary_types_allowed=True, extra="forbid")

  doc: DoclingDocument

docling_core/transforms/serializer/html.py

@@ -35,6 +35,7 @@ from docling_core.transforms.serializer.base import (
  from docling_core.transforms.serializer.common import (
  CommonParams,
  DocSerializer,
+ _get_picture_annotation_text,
  create_ser_result,
  )
  from docling_core.transforms.serializer.html_styles import (
@@ -110,6 +111,8 @@ class HTMLParams(CommonParams):
  # Enable charts to be printed into HTML as tables
  enable_chart_tables: bool = True

+ include_annotations: bool = True
+

  class HTMLTextSerializer(BaseModel, BaseTextSerializer):
  """HTML-specific text item serializer."""
@@ -943,18 +946,46 @@ class HTMLDocSerializer(DocSerializer):
  params = self.params.merge_with_patch(patch=kwargs)
  results: list[SerializationResult] = []
  text_res = ""
+ excluded_refs = self.get_excluded_refs(**kwargs)
+
  if DocItemLabel.CAPTION in params.labels:
- results = [
- create_ser_result(text=it.text, span_source=it)
- for cap in item.captions
- if isinstance(it := cap.resolve(self.doc), TextItem)
- and it.self_ref not in self.get_excluded_refs(**kwargs)
- ]
- text_res = params.caption_delim.join([r.text for r in results])
- if text_res:
- text_dir = get_text_direction(text_res)
- dir_str = f' dir="{text_dir}"' if text_dir == "rtl" else ""
- text_res = f"<{tag}{dir_str}>{html.escape(text_res)}</{tag}>"
+ for cap in item.captions:
+ if (
+ isinstance(it := cap.resolve(self.doc), TextItem)
+ and it.self_ref not in excluded_refs
+ ):
+ text_cap = it.text
+ text_dir = get_text_direction(text_cap)
+ dir_str = f' dir="{text_dir}"' if text_dir == "rtl" else ""
+ cap_ser_res = create_ser_result(
+ text=(
+ f'<div class="caption"{dir_str}>'
+ f"{html.escape(text_cap)}"
+ f"</div>"
+ ),
+ span_source=it,
+ )
+ results.append(cap_ser_res)
+
+ if params.include_annotations and item.self_ref not in excluded_refs:
+ if isinstance(item, PictureItem):
+ for ann in item.annotations:
+ if ann_text := _get_picture_annotation_text(annotation=ann):
+ text_dir = get_text_direction(ann_text)
+ dir_str = f' dir="{text_dir}"' if text_dir == "rtl" else ""
+ ann_ser_res = create_ser_result(
+ text=(
+ f'<div data-annotation-kind="{ann.kind}"{dir_str}>'
+ f"{html.escape(ann_text)}"
+ f"</div>"
+ ),
+ span_source=item,
+ )
+ results.append(ann_ser_res)
+
+ text_res = params.caption_delim.join([r.text for r in results])
+ if text_res:
+ text_res = f"<{tag}>{text_res}</{tag}>"
  return create_ser_result(text=text_res, span_source=results)

  def _generate_head(self) -> str:

docling_core/transforms/serializer/markdown.py

@@ -29,6 +29,7 @@ from docling_core.transforms.serializer.base import (
  from docling_core.transforms.serializer.common import (
  CommonParams,
  DocSerializer,
+ _get_picture_annotation_text,
  _PageBreakSerResult,
  create_ser_result,
  )
@@ -69,6 +70,8 @@ class MarkdownParams(CommonParams):
  page_break_placeholder: Optional[str] = None # e.g. "<!-- page break -->"
  escape_underscores: bool = True
  escape_html: bool = True
+ include_annotations: bool = True
+ mark_annotations: bool = False


  class MarkdownTextSerializer(BaseModel, BaseTextSerializer):
@@ -210,6 +213,24 @@ class MarkdownPictureSerializer(BasePictureSerializer):
  res_parts.append(cap_res)

  if item.self_ref not in doc_serializer.get_excluded_refs(**kwargs):
+ if params.include_annotations:
+
+ for ann in item.annotations:
+ if ann_text := _get_picture_annotation_text(annotation=ann):
+ ann_ser_res = create_ser_result(
+ text=(
+ (
+ f'<!--<annotation kind="{ann.kind}">-->'
+ f"{ann_text}"
+ f"<!--<annotation/>-->"
+ )
+ if params.mark_annotations
+ else ann_text
+ ),
+ span_source=item,
+ )
+ res_parts.append(ann_ser_res)
+
  img_res = self._serialize_image_part(
  item=item,
  doc=doc,

docling_core/types/doc/base.py

@@ -395,3 +395,41 @@ class BoundingBox(BaseModel):
  raise ValueError("BoundingBoxes have different CoordOrigin")

  return cls(l=left, t=top, r=right, b=bottom, coord_origin=origin)
+
+ def x_overlap_with(self, other: "BoundingBox") -> float:
+ """Calculates the horizontal overlap with another bounding box."""
+ if self.coord_origin != other.coord_origin:
+ raise ValueError("BoundingBoxes have different CoordOrigin")
+ return max(0.0, min(self.r, other.r) - max(self.l, other.l))
+
+ def y_overlap_with(self, other: "BoundingBox") -> float:
+ """Calculates the vertical overlap with another bounding box, respecting coordinate origin."""
+ if self.coord_origin != other.coord_origin:
+ raise ValueError("BoundingBoxes have different CoordOrigin")
+ if self.coord_origin == CoordOrigin.TOPLEFT:
+ return max(0.0, min(self.b, other.b) - max(self.t, other.t))
+ elif self.coord_origin == CoordOrigin.BOTTOMLEFT:
+ return max(0.0, min(self.t, other.t) - max(self.b, other.b))
+ raise ValueError("Unsupported CoordOrigin")
+
+ def union_area_with(self, other: "BoundingBox") -> float:
+ """Calculates the union area with another bounding box."""
+ if self.coord_origin != other.coord_origin:
+ raise ValueError("BoundingBoxes have different CoordOrigin")
+ return self.area() + other.area() - self.intersection_area_with(other)
+
+ def x_union_with(self, other: "BoundingBox") -> float:
+ """Calculates the horizontal union dimension with another bounding box."""
+ if self.coord_origin != other.coord_origin:
+ raise ValueError("BoundingBoxes have different CoordOrigin")
+ return max(0.0, max(self.r, other.r) - min(self.l, other.l))
+
+ def y_union_with(self, other: "BoundingBox") -> float:
+ """Calculates the vertical union dimension with another bounding box, respecting coordinate origin."""
+ if self.coord_origin != other.coord_origin:
+ raise ValueError("BoundingBoxes have different CoordOrigin")
+ if self.coord_origin == CoordOrigin.TOPLEFT:
+ return max(0.0, max(self.b, other.b) - min(self.t, other.t))
+ elif self.coord_origin == CoordOrigin.BOTTOMLEFT:
+ return max(0.0, max(self.t, other.t) - min(self.b, other.b))
+ raise ValueError("Unsupported CoordOrigin")

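A minimal sketch exercising the new BoundingBox helpers added above, including an IoU computed from the existing `intersection_area_with` plus the new `union_area_with`; the coordinates are made up.

```python
from docling_core.types.doc.base import BoundingBox, CoordOrigin

a = BoundingBox(l=0, t=0, r=10, b=10, coord_origin=CoordOrigin.TOPLEFT)
b = BoundingBox(l=5, t=5, r=15, b=15, coord_origin=CoordOrigin.TOPLEFT)

print(a.x_overlap_with(b))  # 5.0
print(a.y_overlap_with(b))  # 5.0 (TOPLEFT: min of bottoms minus max of tops)
print(a.x_union_with(b))    # 15.0

iou = a.intersection_area_with(b) / a.union_area_with(b)
print(round(iou, 3))        # 25 / 175 -> 0.143
```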
docling_core/types/doc/document.py

@@ -11,6 +11,7 @@ import os
  import re
  import sys
  import typing
+ import warnings
  from enum import Enum
  from io import BytesIO
  from pathlib import Path
@@ -2924,6 +2925,7 @@ class DoclingDocument(BaseModel):
  page_no: Optional[int] = None,
  included_content_layers: Optional[set[ContentLayer]] = None,
  page_break_placeholder: Optional[str] = None,
+ include_annotations: bool = True,
  ):
  """Save to markdown."""
  if isinstance(filename, str):
@@ -2951,6 +2953,7 @@ class DoclingDocument(BaseModel):
  page_no=page_no,
  included_content_layers=included_content_layers,
  page_break_placeholder=page_break_placeholder,
+ include_annotations=include_annotations,
  )

  with open(filename, "w", encoding="utf-8") as fw:
@@ -2972,6 +2975,8 @@ class DoclingDocument(BaseModel):
  page_no: Optional[int] = None,
  included_content_layers: Optional[set[ContentLayer]] = None,
  page_break_placeholder: Optional[str] = None, # e.g. "<!-- page break -->",
+ include_annotations: bool = True,
+ mark_annotations: bool = False,
  ) -> str:
  r"""Serialize to Markdown.

@@ -2991,9 +2996,9 @@ class DoclingDocument(BaseModel):
  :type labels: Optional[set[DocItemLabel]] = None
  :param strict_text: Deprecated.
  :type strict_text: bool = False
- :param escaping_underscores: bool: Whether to escape underscores in the
+ :param escape_underscores: bool: Whether to escape underscores in the
  text content of the document. (Default value = True).
- :type escaping_underscores: bool = True
+ :type escape_underscores: bool = True
  :param image_placeholder: The placeholder to include to position
  images in the markdown. (Default value = "\<!-- image --\>").
  :type image_placeholder: str = "<!-- image -->"
@@ -3009,6 +3014,12 @@ class DoclingDocument(BaseModel):
  :param page_break_placeholder: The placeholder to include for marking page
  breaks. None means no page break placeholder will be used.
  :type page_break_placeholder: Optional[str] = None
+ :param include_annotations: bool: Whether to include annotations in the export.
+ (Default value = True).
+ :type include_annotations: bool = True
+ :param mark_annotations: bool: Whether to mark annotations in the export; only
+ relevant if include_annotations is True. (Default value = False).
+ :type mark_annotations: bool = False
  :returns: The exported Markdown representation.
  :rtype: str
  """
@@ -3038,6 +3049,8 @@ class DoclingDocument(BaseModel):
  indent=indent,
  wrap_width=text_width if text_width > 0 else None,
  page_break_placeholder=page_break_placeholder,
+ include_annotations=include_annotations,
+ mark_annotations=mark_annotations,
  ),
  )
  ser_res = serializer.serialize()
@@ -3087,6 +3100,7 @@ class DoclingDocument(BaseModel):
  html_head: str = "null", # should be deprecated
  included_content_layers: Optional[set[ContentLayer]] = None,
  split_page_view: bool = False,
+ include_annotations: bool = True,
  ):
  """Save to HTML."""
  if isinstance(filename, str):
@@ -3112,6 +3126,7 @@ class DoclingDocument(BaseModel):
  html_head=html_head,
  included_content_layers=included_content_layers,
  split_page_view=split_page_view,
+ include_annotations=include_annotations,
  )

  with open(filename, "w", encoding="utf-8") as fw:
@@ -3164,6 +3179,7 @@ class DoclingDocument(BaseModel):
  html_head: str = "null", # should be deprecated ...
  included_content_layers: Optional[set[ContentLayer]] = None,
  split_page_view: bool = False,
+ include_annotations: bool = True,
  ) -> str:
  r"""Serialize to HTML."""
  from docling_core.transforms.serializer.html import (
@@ -3195,6 +3211,7 @@ class DoclingDocument(BaseModel):
  html_head=html_head,
  html_lang=html_lang,
  output_style=output_style,
+ include_annotations=include_annotations,
  )

  if html_head == "null":
@@ -4109,7 +4126,10 @@ class DoclingDocument(BaseModel):
  @classmethod
  def validate_document(cls, d: "DoclingDocument"):
  """validate_document."""
- if not d.validate_tree(d.body) or not d.validate_tree(d.furniture):
- raise ValueError("Document hierachy is inconsistent.")
+ with warnings.catch_warnings():
+ # ignore warning from deprecated furniture
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ if not d.validate_tree(d.body) or not d.validate_tree(d.furniture):
+ raise ValueError("Document hierachy is inconsistent.")

  return d
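Tying the serializer and document changes together, here is a minimal sketch of the new annotation-related export switches; it assumes `doc` is a DoclingDocument whose pictures carry description, classification, or molecule annotations.

```python
# include_annotations / mark_annotations correspond to the parameters added above.
md = doc.export_to_markdown(include_annotations=True, mark_annotations=True)
html = doc.export_to_html(include_annotations=True)

doc.save_as_markdown("out.md", include_annotations=True)
doc.save_as_html("out.html", include_annotations=True)
```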