PyPI - deepdoctection - Versions diffs - 0.34__tar.gz → 0.36__tar.gz - Mend

deepdoctection 0.34tar.gz → 0.36tar.gz


Potentially problematic release.

This version of deepdoctection might be problematic.

Files changed (155)

{deepdoctection-0.34 → deepdoctection-0.36}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: deepdoctection
-Version: 0.34
+Version: 0.36
 Summary: Repository for Document AI
 Home-page: https://github.com/deepdoctection/deepdoctection
 Author: Dr. Janis Meyer
@@ -17,7 +17,7 @@ Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: catalogue==2.0.10
-Requires-Dist: huggingface_hub>=0.12.0
+Requires-Dist: huggingface_hub<0.26,>=0.12.0
 Requires-Dist: importlib-metadata>=5.0.0
 Requires-Dist: jsonlines==3.1.0
 Requires-Dist: lazy-imports==0.3.1
@@ -27,6 +27,7 @@ Requires-Dist: numpy<2.0,>=1.21
 Requires-Dist: packaging>=20.0
 Requires-Dist: Pillow>=10.0.0
 Requires-Dist: pypdf>=3.16.0
+Requires-Dist: pypdfium2>=4.30.0
 Requires-Dist: pyyaml>=6.0.1
 Requires-Dist: pyzmq>=16
 Requires-Dist: scipy>=1.13.1
@@ -35,7 +36,7 @@ Requires-Dist: tabulate>=0.7.7
 Requires-Dist: tqdm==4.64.0
 Provides-Extra: tf
 Requires-Dist: catalogue==2.0.10; extra == "tf"
-Requires-Dist: huggingface_hub>=0.12.0; extra == "tf"
+Requires-Dist: huggingface_hub<0.26,>=0.12.0; extra == "tf"
 Requires-Dist: importlib-metadata>=5.0.0; extra == "tf"
 Requires-Dist: jsonlines==3.1.0; extra == "tf"
 Requires-Dist: lazy-imports==0.3.1; extra == "tf"
@@ -45,6 +46,7 @@ Requires-Dist: numpy<2.0,>=1.21; extra == "tf"
 Requires-Dist: packaging>=20.0; extra == "tf"
 Requires-Dist: Pillow>=10.0.0; extra == "tf"
 Requires-Dist: pypdf>=3.16.0; extra == "tf"
+Requires-Dist: pypdfium2>=4.30.0; extra == "tf"
 Requires-Dist: pyyaml>=6.0.1; extra == "tf"
 Requires-Dist: pyzmq>=16; extra == "tf"
 Requires-Dist: scipy>=1.13.1; extra == "tf"
@@ -66,7 +68,7 @@ Requires-Dist: distance==0.1.3; extra == "tf"
 Requires-Dist: lxml>=4.9.1; extra == "tf"
 Provides-Extra: pt
 Requires-Dist: catalogue==2.0.10; extra == "pt"
-Requires-Dist: huggingface_hub>=0.12.0; extra == "pt"
+Requires-Dist: huggingface_hub<0.26,>=0.12.0; extra == "pt"
 Requires-Dist: importlib-metadata>=5.0.0; extra == "pt"
 Requires-Dist: jsonlines==3.1.0; extra == "pt"
 Requires-Dist: lazy-imports==0.3.1; extra == "pt"
@@ -76,6 +78,7 @@ Requires-Dist: numpy<2.0,>=1.21; extra == "pt"
 Requires-Dist: packaging>=20.0; extra == "pt"
 Requires-Dist: Pillow>=10.0.0; extra == "pt"
 Requires-Dist: pypdf>=3.16.0; extra == "pt"
+Requires-Dist: pypdfium2>=4.30.0; extra == "pt"
 Requires-Dist: pyyaml>=6.0.1; extra == "pt"
 Requires-Dist: pyzmq>=16; extra == "pt"
 Requires-Dist: scipy>=1.13.1; extra == "pt"
@@ -172,13 +175,17 @@ pipelines. Its core function does not depend on any specific deep learning libra
  - Document layout analysis and table recognition now runs with
    [**Torchscript**](https://pytorch.org/docs/stable/jit.html) (CPU) as well and [**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) is not required
    anymore for basic inference.
- - [**new**] More angle predictors for determining the rotation of a document based on [**Tesseract**](https://github.com/tesseract-ocr/tesseract) and [**DocTr**](https://github.com/mindee/doctr)
-   (not contained in the built-in Analyzer).
- - [**new**] Token classification with [**LiLT**](https://github.com/jpWang/LiLT) via
+ - More angle predictors for determining the rotation of a document based on [**Tesseract**](https://github.com/tesseract-ocr/tesseract) and [**DocTr**](https://github.com/mindee/doctr)
+ - Token classification with [**LiLT**](https://github.com/jpWang/LiLT) via
    [**transformers**](https://github.com/huggingface/transformers).
    We have added a model wrapper for token classification with LiLT and added a some LiLT models to the model catalog
    that seem to look promising, especially if you want to train a model on non-english data. The training script for
-   LayoutLM can be used for LiLT as well and we will be providing a notebook on how to train a model on a custom dataset soon.
+   LayoutLM can be used for LiLT as well.
+ - [**new**] There are two notebooks available that show, how to write a
+   [custom predictor](https://github.com/deepdoctection/notebooks/blob/main/Doclaynet_Analyzer_Config.ipynb) based on
+   a third party library that has not been supported yet and how to use
+   [advanced configuration](https://github.com/deepdoctection/notebooks/blob/main/Doclaynet_Analyzer_Config.ipynb) to
+   get links between layout segments e.g. captions and tables or figures.
 **deep**doctection provides on top of that methods for pre-processing inputs to models like cropping or resizing and to
 post-process results, like validating duplicate outputs, relating words to detected layout segments or ordering words
@@ -263,7 +270,7 @@ documentation.
 ## Requirements
-![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/requirements_deepdoctection.png)
+![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/requirements_deepdoctection_081124.png)
 Everything in the overview listed below the **deep**doctection layer are necessary requirements and have to be installed
 separately.
@@ -272,13 +279,16 @@ separately.
 - Python >= 3.9
 - 1.13 <= PyTorch  **or** 2.11 <= Tensorflow < 2.16. (For lower Tensorflow versions the code will only run on a GPU).
 In general, if you want to train or fine-tune models, a GPU is required.
-- **deep**doctection uses Python wrappers for [Poppler](https://poppler.freedesktop.org/) to convert PDF documents into
-images.
 - With respect to the Deep Learning framework, you must decide between [Tensorflow](https://www.tensorflow.org/install?hl=en)
   and [PyTorch](https://pytorch.org/get-started/locally/).
 - [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine will be used through a Python wrapper. The core
   engine has to be installed separately.
+- For release `v.0.34.0` and below **deep**doctection uses Python wrappers for [Poppler](https://poppler.freedesktop.org/) to convert PDF
+  documents into images. For release `v.0.35.0` this dependency will be optional.
 The following overview shows the availability of the models in conjunction with the DL framework.
 | Task                                          | PyTorch | Torchscript    |  Tensorflow  |
@@ -396,8 +406,8 @@ to develop this framework.
 ## Problems
 We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this
-repo and try to address them as quickly as possible. Bug fixes or enhancements will be deployed in a new release every 4
-to 6 weeks.
+repo and try to address them as quickly as possible. Bug fixes or enhancements will be deployed in a new release every 10
+to 12 weeks.
 ## If you like **deep**doctection ...
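
The new requirement `huggingface_hub<0.26,>=0.12.0` in the hunks above combines a lower and an upper bound. The following is a minimal sketch of what that range admits, assuming plain numeric release strings only. `parse_release` and `satisfies` are hypothetical helper names introduced for illustration; they are not part of deepdoctection or of any installer, and they ignore pre-releases and other version forms that real resolvers handle.

```python
def parse_release(version, width=3):
    """Turn '0.25.2' into (0, 25, 2), padding short versions with zeros."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, lower="0.12.0", upper="0.26"):
    """Return True if lower <= version < upper (plain numeric releases only)."""
    return parse_release(lower) <= parse_release(version) < parse_release(upper)


# Try a few candidate huggingface_hub versions against the pinned range.
for candidate in ("0.11.9", "0.12.0", "0.25.2", "0.26.0"):
    print(candidate, satisfies(candidate))
```

Under these assumptions the range accepts `0.12.0` and `0.25.2` and rejects `0.11.9` and `0.26.0`, so an environment that already has `huggingface_hub` 0.26 or later would need a downgrade to install 0.36.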

{deepdoctection-0.34 → deepdoctection-0.36}/README.md RENAMED

@@ -45,13 +45,17 @@ pipelines. Its core function does not depend on any specific deep learning libra
  - Document layout analysis and table recognition now runs with
    [**Torchscript**](https://pytorch.org/docs/stable/jit.html) (CPU) as well and [**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) is not required
    anymore for basic inference.
- - [**new**] More angle predictors for determining the rotation of a document based on [**Tesseract**](https://github.com/tesseract-ocr/tesseract) and [**DocTr**](https://github.com/mindee/doctr)
-   (not contained in the built-in Analyzer).
- - [**new**] Token classification with [**LiLT**](https://github.com/jpWang/LiLT) via
+ - More angle predictors for determining the rotation of a document based on [**Tesseract**](https://github.com/tesseract-ocr/tesseract) and [**DocTr**](https://github.com/mindee/doctr)
+ - Token classification with [**LiLT**](https://github.com/jpWang/LiLT) via
    [**transformers**](https://github.com/huggingface/transformers).
    We have added a model wrapper for token classification with LiLT and added a some LiLT models to the model catalog
    that seem to look promising, especially if you want to train a model on non-english data. The training script for
-   LayoutLM can be used for LiLT as well and we will be providing a notebook on how to train a model on a custom dataset soon.
+   LayoutLM can be used for LiLT as well.
+ - [**new**] There are two notebooks available that show, how to write a
+   [custom predictor](https://github.com/deepdoctection/notebooks/blob/main/Doclaynet_Analyzer_Config.ipynb) based on
+   a third party library that has not been supported yet and how to use
+   [advanced configuration](https://github.com/deepdoctection/notebooks/blob/main/Doclaynet_Analyzer_Config.ipynb) to
+   get links between layout segments e.g. captions and tables or figures.
 **deep**doctection provides on top of that methods for pre-processing inputs to models like cropping or resizing and to
 post-process results, like validating duplicate outputs, relating words to detected layout segments or ordering words
@@ -136,7 +140,7 @@ documentation.
 ## Requirements
-![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/requirements_deepdoctection.png)
+![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/requirements_deepdoctection_081124.png)
 Everything in the overview listed below the **deep**doctection layer are necessary requirements and have to be installed
 separately.
@@ -145,13 +149,16 @@ separately.
 - Python >= 3.9
 - 1.13 <= PyTorch  **or** 2.11 <= Tensorflow < 2.16. (For lower Tensorflow versions the code will only run on a GPU).
 In general, if you want to train or fine-tune models, a GPU is required.
-- **deep**doctection uses Python wrappers for [Poppler](https://poppler.freedesktop.org/) to convert PDF documents into
-images.
 - With respect to the Deep Learning framework, you must decide between [Tensorflow](https://www.tensorflow.org/install?hl=en)
   and [PyTorch](https://pytorch.org/get-started/locally/).
 - [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine will be used through a Python wrapper. The core
   engine has to be installed separately.
+- For release `v.0.34.0` and below **deep**doctection uses Python wrappers for [Poppler](https://poppler.freedesktop.org/) to convert PDF
+  documents into images. For release `v.0.35.0` this dependency will be optional.
 The following overview shows the availability of the models in conjunction with the DL framework.
 | Task                                          | PyTorch | Torchscript    |  Tensorflow  |
@@ -269,8 +276,8 @@ to develop this framework.
 ## Problems
 We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this
-repo and try to address them as quickly as possible. Bug fixes or enhancements will be deployed in a new release every 4
-to 6 weeks.
+repo and try to address them as quickly as possible. Bug fixes or enhancements will be deployed in a new release every 10
+to 12 weeks.
 ## If you like **deep**doctection ...

{deepdoctection-0.34 → deepdoctection-0.36}/deepdoctection/__init__.py RENAMED

@@ -18,26 +18,16 @@ if importlib.util.find_spec("dotenv") is not None:
 import sys
 from typing import TYPE_CHECKING
-from .utils.env_info import collect_env_info
+from .utils.env_info import auto_select_pdf_render_framework, collect_env_info
 from .utils.file_utils import _LazyModule, get_tf_version, pytorch_available, tf_available
 from .utils.logger import LoggingRecord, logger
 # pylint: enable=wrong-import-position
-__version__ = 0.34
+__version__ = 0.36
 _IMPORT_STRUCTURE = {
-    "analyzer": [
-        "config_sanity_checks",
-        "build_detector",
-        "build_padder",
-        "build_service",
-        "build_sub_image_service",
-        "build_ocr",
-        "build_doctr_word",
-        "get_dd_analyzer",
-        "build_analyzer",
-    ],
+    "analyzer": ["config_sanity_checks", "get_dd_analyzer", "ServiceFactory"],
     "configs": [],
     "dataflow": [
         "DataFlowTerminated",
@@ -197,6 +187,7 @@ _IMPORT_STRUCTURE = {
         "print_model_infos",
         "ModelDownloadManager",
         "PdfPlumberTextDetector",
+        "Pdfmium2TextDetector",
         "TesseractOcrDetector",
         "TesseractRotationTransformer",
         "TextractOcrDetector",
@@ -304,6 +295,7 @@ _IMPORT_STRUCTURE = {
         "timed_operation",
         "collect_env_info",
         "auto_select_viz_library",
+        "auto_select_pdf_render_framework",
         "get_tensorflow_requirement",
         "tf_addons_available",
         "get_tf_addons_requirements",
@@ -383,6 +375,7 @@ _IMPORT_STRUCTURE = {
         "get_pdf_file_writer",
         "PDFStreamer",
         "pdf_to_np_array",
+        "split_pdf",
         "ObjectTypes",
         "TypeOrStr",
         "object_types_registry",
@@ -427,7 +420,7 @@ _IMPORT_STRUCTURE = {
 # Setting some environment variables so that standard functions can be invoked with available hardware
 env_info = collect_env_info()
 logger.debug(LoggingRecord(msg=env_info))
+auto_select_pdf_render_framework()
 # Direct imports for type-checking
 if TYPE_CHECKING:
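
The `_IMPORT_STRUCTURE` mapping edited above pairs each subpackage with the names it exports, and the file imports a `_LazyModule` helper whose implementation is not part of this diff. As a rough sketch of the general technique such a mapping can drive, the class below resolves a name only on first access. `LazyNamespace` is a hypothetical stand-in written for illustration, and the demo maps standard-library modules rather than deepdoctection subpackages.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Hypothetical stand-in: resolves exported names on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, i.e. before the name is cached.
        try:
            module_name = self._name_to_module[attr]
        except KeyError:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(module_name), attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Demo with standard-library modules standing in for subpackages.
lazy = LazyNamespace("demo", {"json": ["dumps", "loads"], "math": ["sqrt"]})
print(lazy.sqrt(9.0))
print(lazy.dumps([1, 2]))
```

With this pattern, adding an entry such as `"split_pdf"` to the mapping is enough to expose a new name without importing its module at package import time.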

{deepdoctection-0.34 → deepdoctection-0.36}/deepdoctection/analyzer/__init__.py RENAMED

@@ -20,3 +20,4 @@ Package for pre-built pipelines
 """
 from .dd import *
+from .factory import *

deepdoctection-0.36/deepdoctection/analyzer/_config.py ADDED

@@ -0,0 +1,142 @@
+# -*- coding: utf-8 -*-
+# File: config.py
+# Copyright 2024 Dr. Janis Meyer. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Pipeline configuration for deepdoctection analyzer. Do not change the defaults in this file. """
+from ..datapoint.view import IMAGE_DEFAULTS
+from ..utils.metacfg import AttrDict
+from ..utils.settings import CellType, LayoutType
+cfg = AttrDict()
+cfg.LANGUAGE = None
+cfg.LIB = None
+cfg.DEVICE = None
+cfg.USE_ROTATOR = False
+cfg.USE_LAYOUT = True
+cfg.USE_TABLE_SEGMENTATION = True
+cfg.TF.LAYOUT.WEIGHTS = "layout/model-800000_inf_only.data-00000-of-00001"
+cfg.TF.LAYOUT.FILTER = None
+cfg.TF.CELL.WEIGHTS = "cell/model-1800000_inf_only.data-00000-of-00001"
+cfg.TF.CELL.FILTER = None
+cfg.TF.ITEM.WEIGHTS = "item/model-1620000_inf_only.data-00000-of-00001"
+cfg.TF.ITEM.FILTER = None
+cfg.PT.LAYOUT.WEIGHTS = "layout/d2_model_0829999_layout_inf_only.pt"
+cfg.PT.LAYOUT.WEIGHTS_TS = "layout/d2_model_0829999_layout_inf_only.ts"
+cfg.PT.LAYOUT.FILTER = None
+cfg.PT.LAYOUT.PAD.TOP = 60
+cfg.PT.LAYOUT.PAD.RIGHT = 60
+cfg.PT.LAYOUT.PAD.BOTTOM = 60
+cfg.PT.LAYOUT.PAD.LEFT = 60
+cfg.PT.ITEM.WEIGHTS = "item/d2_model_1639999_item_inf_only.pt"
+cfg.PT.ITEM.WEIGHTS_TS = "item/d2_model_1639999_item_inf_only.ts"
+cfg.PT.ITEM.FILTER = None
+cfg.PT.ITEM.PAD.TOP = 60
+cfg.PT.ITEM.PAD.RIGHT = 60
+cfg.PT.ITEM.PAD.BOTTOM = 60
+cfg.PT.ITEM.PAD.LEFT = 60
+cfg.PT.CELL.WEIGHTS = "cell/d2_model_1849999_cell_inf_only.pt"
+cfg.PT.CELL.WEIGHTS_TS = "cell/d2_model_1849999_cell_inf_only.ts"
+cfg.PT.CELL.FILTER = None
+cfg.USE_LAYOUT_NMS = False
+cfg.LAYOUT_NMS_PAIRS.COMBINATIONS = None
+cfg.LAYOUT_NMS_PAIRS.THRESHOLDS = None
+cfg.LAYOUT_NMS_PAIRS.PRIORITY = None
+cfg.SEGMENTATION.ASSIGNMENT_RULE = "ioa"
+cfg.SEGMENTATION.THRESHOLD_ROWS = 0.4
+cfg.SEGMENTATION.THRESHOLD_COLS = 0.4
+cfg.SEGMENTATION.FULL_TABLE_TILING = True
+cfg.SEGMENTATION.REMOVE_IOU_THRESHOLD_ROWS = 0.001
+cfg.SEGMENTATION.REMOVE_IOU_THRESHOLD_COLS = 0.001
+cfg.SEGMENTATION.CELL_CATEGORY_ID = 12
+cfg.SEGMENTATION.TABLE_NAME = LayoutType.TABLE
+cfg.SEGMENTATION.PUBTABLES_CELL_NAMES = [
+    CellType.SPANNING,
+    CellType.ROW_HEADER,
+    CellType.COLUMN_HEADER,
+    CellType.PROJECTED_ROW_HEADER,
+    LayoutType.CELL,
+]
+cfg.SEGMENTATION.PUBTABLES_SPANNING_CELL_NAMES = [
+    CellType.SPANNING,
+    CellType.ROW_HEADER,
+    CellType.COLUMN_HEADER,
+    CellType.PROJECTED_ROW_HEADER,
+]
+cfg.SEGMENTATION.PUBTABLES_ITEM_NAMES = [LayoutType.ROW, LayoutType.COLUMN]
+cfg.SEGMENTATION.PUBTABLES_SUB_ITEM_NAMES = [CellType.ROW_NUMBER, CellType.COLUMN_NUMBER]
+cfg.SEGMENTATION.CELL_NAMES = [CellType.HEADER, CellType.BODY, LayoutType.CELL]
+cfg.SEGMENTATION.ITEM_NAMES = [LayoutType.ROW, LayoutType.COLUMN]
+cfg.SEGMENTATION.SUB_ITEM_NAMES = [CellType.ROW_NUMBER, CellType.COLUMN_NUMBER]
+cfg.SEGMENTATION.STRETCH_RULE = "equal"
+cfg.USE_TABLE_REFINEMENT = True
+cfg.USE_PDF_MINER = False
+cfg.PDF_MINER.X_TOLERANCE = 3
+cfg.PDF_MINER.Y_TOLERANCE = 3
+cfg.USE_OCR = True
+cfg.OCR.USE_TESSERACT = True
+cfg.OCR.USE_DOCTR = False
+cfg.OCR.USE_TEXTRACT = False
+cfg.OCR.CONFIG.TESSERACT = "dd/conf_tesseract.yaml"
+cfg.OCR.WEIGHTS.DOCTR_WORD.TF = "doctr/db_resnet50/tf/db_resnet50-adcafc63.zip"
+cfg.OCR.WEIGHTS.DOCTR_WORD.PT = "doctr/db_resnet50/pt/db_resnet50-ac60cadc.pt"
+cfg.OCR.WEIGHTS.DOCTR_RECOGNITION.TF = "doctr/crnn_vgg16_bn/tf/crnn_vgg16_bn-76b7f2c6.zip"
+cfg.OCR.WEIGHTS.DOCTR_RECOGNITION.PT = "doctr/crnn_vgg16_bn/pt/crnn_vgg16_bn-9762b0b0.pt"
+cfg.TEXT_CONTAINER = IMAGE_DEFAULTS["text_container"]
+cfg.WORD_MATCHING.PARENTAL_CATEGORIES = [
+    LayoutType.TEXT,
+    LayoutType.TITLE,
+    LayoutType.LIST,
+    LayoutType.CELL,
+    CellType.COLUMN_HEADER,
+    CellType.PROJECTED_ROW_HEADER,
+    CellType.SPANNING,
+    CellType.ROW_HEADER,
+]
+cfg.WORD_MATCHING.RULE = "ioa"
+cfg.WORD_MATCHING.THRESHOLD = 0.6
+cfg.WORD_MATCHING.MAX_PARENT_ONLY = True
+cfg.TEXT_ORDERING.TEXT_BLOCK_CATEGORIES = IMAGE_DEFAULTS["text_block_categories"]
+cfg.TEXT_ORDERING.FLOATING_TEXT_BLOCK_CATEGORIES = IMAGE_DEFAULTS["floating_text_block_categories"]
+cfg.TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER = False
+cfg.TEXT_ORDERING.STARTING_POINT_TOLERANCE = 0.005
+cfg.TEXT_ORDERING.BROKEN_LINE_TOLERANCE = 0.003
+cfg.TEXT_ORDERING.HEIGHT_TOLERANCE = 2.0
+cfg.TEXT_ORDERING.PARAGRAPH_BREAK = 0.035
+cfg.USE_LAYOUT_LINK = False
+cfg.LAYOUT_LINK.PARENTAL_CATEGORIES = []
+cfg.LAYOUT_LINK.CHILD_CATEGORIES = []
+cfg.freeze()
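
The file above builds its defaults through nested attribute assignment (`cfg.PT.LAYOUT.PAD.TOP = 60`) and then calls `cfg.freeze()`. The real `AttrDict` lives in `utils/metacfg` and is not shown in this diff, so the following is only a sketch of how such a pattern can work. `MiniCfg` is a hypothetical class written for illustration and may differ from `AttrDict` in every detail.

```python
class MiniCfg:
    """Hypothetical stand-in for a nested attribute config; not deepdoctection's AttrDict."""

    def __init__(self):
        object.__setattr__(self, "_frozen", False)

    def __getattr__(self, name):
        # Called only when `name` is missing: create a nested node on demand.
        if name.startswith("_") or self._frozen:
            raise AttributeError(name)
        node = MiniCfg()
        object.__setattr__(self, name, node)
        return node

    def __setattr__(self, name, value):
        if self._frozen:
            raise AttributeError(f"config is frozen, cannot set {name!r}")
        object.__setattr__(self, name, value)

    def freeze(self, frozen=True):
        # Apply the flag to this node and, recursively, to every nested node.
        object.__setattr__(self, "_frozen", frozen)
        for value in vars(self).values():
            if isinstance(value, MiniCfg):
                value.freeze(frozen)


cfg = MiniCfg()
cfg.USE_OCR = True
cfg.PT.LAYOUT.PAD.TOP = 60  # intermediate nodes are created automatically
cfg.freeze()
```

In this sketch, assigning to `cfg.USE_OCR` after `freeze()` raises `AttributeError`, and `cfg.freeze(False)` makes the tree writable again. It is a simplified analogue of the freeze and unfreeze sequence that `get_dd_analyzer` performs before overwriting values.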

deepdoctection-0.36/deepdoctection/analyzer/dd.py ADDED

@@ -0,0 +1,154 @@
+# -*- coding: utf-8 -*-
+# File: dd.py
+# Copyright 2021 Dr. Janis Meyer. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Module for **deep**doctection analyzer.
+-factory build_analyzer for a given config
+-user factory with a reduced config setting
+"""
+from __future__ import annotations
+import os
+from typing import Optional
+from ..extern.pt.ptutils import get_torch_device
+from ..extern.tp.tfutils import disable_tp_layer_logging, get_tf_device
+from ..pipe.doctectionpipe import DoctectionPipe
+from ..utils.env_info import ENV_VARS_TRUE
+from ..utils.error import DependencyError
+from ..utils.file_utils import tensorpack_available
+from ..utils.fs import get_configs_dir_path, get_package_path, maybe_copy_config_to_cache
+from ..utils.logger import LoggingRecord, logger
+from ..utils.metacfg import set_config_by_yaml
+from ..utils.types import PathLikeOrStr
+from ._config import cfg
+from .factory import ServiceFactory
+__all__ = [
+    "config_sanity_checks",
+    "get_dd_analyzer",
+]
+_DD_ONE = "deepdoctection/configs/conf_dd_one.yaml"
+_TESSERACT = "deepdoctection/configs/conf_tesseract.yaml"
+_MODEL_CHOICES = {
+    "layout": [
+        "layout/d2_model_0829999_layout_inf_only.pt",
+        "xrf_layout/model_final_inf_only.pt",
+        "microsoft/table-transformer-detection/pytorch_model.bin",
+    ],
+    "segmentation": [
+        "item/model-1620000_inf_only.data-00000-of-00001",
+        "xrf_item/model_final_inf_only.pt",
+        "microsoft/table-transformer-structure-recognition/pytorch_model.bin",
+        "deepdoctection/tatr_tab_struct_v2/pytorch_model.bin",
+    ],
+    "ocr": ["Tesseract", "DocTr", "Textract"],
+    "doctr_word": ["doctr/db_resnet50/pt/db_resnet50-ac60cadc.pt"],
+    "doctr_recognition": [
+        "doctr/crnn_vgg16_bn/pt/crnn_vgg16_bn-9762b0b0.pt",
+        "doctr/crnn_vgg16_bn/pt/pytorch_model.bin",
+    ],
+    "llm": ["gpt-3.5-turbo", "gpt-4"],
+    "segmentation_choices": {
+        "item/model-1620000_inf_only.data-00000-of-00001": "cell/model-1800000_inf_only.data-00000-of-00001",
+        "xrf_item/model_final_inf_only.pt": "xrf_cell/model_final_inf_only.pt",
+        "microsoft/table-transformer-structure-recognition/pytorch_model.bin": None,
+        "deepdoctection/tatr_tab_struct_v2/pytorch_model.bin": None,
+    },
+}
+def config_sanity_checks() -> None:
+    """Some config sanity checks"""
+    if cfg.USE_PDF_MINER and cfg.USE_OCR and cfg.OCR.USE_DOCTR:
+        raise ValueError("Configuration USE_PDF_MINER= True and USE_OCR=True and USE_DOCTR=True is not allowed")
+    if cfg.USE_OCR:
+        if cfg.OCR.USE_TESSERACT + cfg.OCR.USE_DOCTR + cfg.OCR.USE_TEXTRACT != 1:
+            raise ValueError(
+                "Choose either OCR.USE_TESSERACT=True or OCR.USE_DOCTR=True or OCR.USE_TEXTRACT=True "
+                "and set the other two to False. Only one OCR system can be activated."
+            )
+def get_dd_analyzer(
+    reset_config_file: bool = True,
+    config_overwrite: Optional[list[str]] = None,
+    path_config_file: Optional[PathLikeOrStr] = None,
+) -> DoctectionPipe:
+    """
+    Factory function for creating the built-in **deep**doctection analyzer.
+    The Standard Analyzer is a pipeline that comprises the following analysis components:
+    - Document layout analysis
+    - Table segmentation
+    - Text extraction/OCR
+    - Reading order
+    We refer to the various notebooks and docs for running an analyzer and changing the configs.
+    :param reset_config_file: This will copy the `.yaml` file with default variables to the `.cache` and therefore
+                              resetting all configurations if set to `True`.
+    :param config_overwrite: Passing a list of string arguments and values to overwrite the `.yaml` configuration with
+                             highest priority, e.g. ["USE_TABLE_SEGMENTATION=False",
+                                                     "USE_OCR=False",
+                                                     "TF.LAYOUT.WEIGHTS=my_fancy_pytorch_model"]
+    :param path_config_file: Path to a custom config file. Can be outside of the .cache directory.
+    :return: A DoctectionPipe instance with given configs
+    """
+    config_overwrite = [] if config_overwrite is None else config_overwrite
+    lib = "TF" if os.environ.get("DD_USE_TF", "0") in ENV_VARS_TRUE else "PT"
+    if lib == "TF":
+        device = get_tf_device()
+    elif lib == "PT":
+        device = get_torch_device()
+    else:
+        raise DependencyError("At least one of the env variables DD_USE_TF or DD_USE_TORCH must be set.")
+    dd_one_config_path = maybe_copy_config_to_cache(
+        get_package_path(), get_configs_dir_path() / "dd", _DD_ONE, reset_config_file
+    )
+    maybe_copy_config_to_cache(get_package_path(), get_configs_dir_path() / "dd", _TESSERACT)
+    # Set up of the configuration and logging
+    file_cfg = set_config_by_yaml(dd_one_config_path if not path_config_file else path_config_file)
+    cfg.freeze(freezed=False)
+    cfg.overwrite_config(file_cfg)
+    cfg.freeze(freezed=False)
+    cfg.LANGUAGE = None
+    cfg.LIB = lib
+    cfg.DEVICE = device
+    cfg.freeze()
+    if config_overwrite:
+        cfg.update_args(config_overwrite)
+    config_sanity_checks()
+    logger.info(LoggingRecord(f"Config: \n {str(cfg)}", cfg.to_dict()))  # type: ignore
+    # will silent all TP logging while building the tower
+    if tensorpack_available():
+        disable_tp_layer_logging()
+    return ServiceFactory.build_analyzer(cfg)
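
`config_sanity_checks` above enforces that exactly one OCR engine is active by summing three boolean flags. The same idea is shown here in isolation as a small sketch. `check_single_ocr_engine` is a hypothetical function name, and it takes plain arguments instead of reading the global `cfg`.

```python
def check_single_ocr_engine(use_tesseract, use_doctr, use_textract):
    """Raise ValueError unless exactly one of the three flags is True."""
    # bool is a subclass of int, so True + False + False == 1.
    if use_tesseract + use_doctr + use_textract != 1:
        raise ValueError("Exactly one OCR engine must be enabled.")


# The default combination from _config.py passes the check.
check_single_ocr_engine(use_tesseract=True, use_doctr=False, use_textract=False)
```

Enabling two engines at once, or none, raises `ValueError`. This mirrors the guard that runs after `config_overwrite` entries such as `"OCR.USE_DOCTR=True"` have been applied.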
