deepdoctection 0.42.0__py3-none-any.whl → 0.43__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of deepdoctection might be problematic.

Files changed (124)
  1. deepdoctection/__init__.py +2 -1
  2. deepdoctection/analyzer/__init__.py +2 -1
  3. deepdoctection/analyzer/config.py +904 -0
  4. deepdoctection/analyzer/dd.py +36 -62
  5. deepdoctection/analyzer/factory.py +311 -141
  6. deepdoctection/configs/conf_dd_one.yaml +100 -44
  7. deepdoctection/configs/profiles.jsonl +32 -0
  8. deepdoctection/dataflow/__init__.py +9 -6
  9. deepdoctection/dataflow/base.py +33 -15
  10. deepdoctection/dataflow/common.py +96 -75
  11. deepdoctection/dataflow/custom.py +36 -29
  12. deepdoctection/dataflow/custom_serialize.py +135 -91
  13. deepdoctection/dataflow/parallel_map.py +33 -31
  14. deepdoctection/dataflow/serialize.py +15 -10
  15. deepdoctection/dataflow/stats.py +41 -28
  16. deepdoctection/datapoint/__init__.py +4 -6
  17. deepdoctection/datapoint/annotation.py +104 -66
  18. deepdoctection/datapoint/box.py +190 -130
  19. deepdoctection/datapoint/convert.py +66 -39
  20. deepdoctection/datapoint/image.py +151 -95
  21. deepdoctection/datapoint/view.py +383 -236
  22. deepdoctection/datasets/__init__.py +2 -6
  23. deepdoctection/datasets/adapter.py +11 -11
  24. deepdoctection/datasets/base.py +118 -81
  25. deepdoctection/datasets/dataflow_builder.py +18 -12
  26. deepdoctection/datasets/info.py +76 -57
  27. deepdoctection/datasets/instances/__init__.py +6 -2
  28. deepdoctection/datasets/instances/doclaynet.py +17 -14
  29. deepdoctection/datasets/instances/fintabnet.py +16 -22
  30. deepdoctection/datasets/instances/funsd.py +11 -6
  31. deepdoctection/datasets/instances/iiitar13k.py +9 -9
  32. deepdoctection/datasets/instances/layouttest.py +9 -9
  33. deepdoctection/datasets/instances/publaynet.py +9 -9
  34. deepdoctection/datasets/instances/pubtables1m.py +13 -13
  35. deepdoctection/datasets/instances/pubtabnet.py +13 -15
  36. deepdoctection/datasets/instances/rvlcdip.py +8 -8
  37. deepdoctection/datasets/instances/xfund.py +11 -9
  38. deepdoctection/datasets/registry.py +18 -11
  39. deepdoctection/datasets/save.py +12 -11
  40. deepdoctection/eval/__init__.py +3 -2
  41. deepdoctection/eval/accmetric.py +72 -52
  42. deepdoctection/eval/base.py +29 -10
  43. deepdoctection/eval/cocometric.py +14 -12
  44. deepdoctection/eval/eval.py +56 -41
  45. deepdoctection/eval/registry.py +6 -3
  46. deepdoctection/eval/tedsmetric.py +24 -9
  47. deepdoctection/eval/tp_eval_callback.py +13 -12
  48. deepdoctection/extern/__init__.py +1 -1
  49. deepdoctection/extern/base.py +176 -97
  50. deepdoctection/extern/d2detect.py +127 -92
  51. deepdoctection/extern/deskew.py +19 -10
  52. deepdoctection/extern/doctrocr.py +157 -106
  53. deepdoctection/extern/fastlang.py +25 -17
  54. deepdoctection/extern/hfdetr.py +137 -60
  55. deepdoctection/extern/hflayoutlm.py +329 -248
  56. deepdoctection/extern/hflm.py +67 -33
  57. deepdoctection/extern/model.py +108 -762
  58. deepdoctection/extern/pdftext.py +37 -12
  59. deepdoctection/extern/pt/nms.py +15 -1
  60. deepdoctection/extern/pt/ptutils.py +13 -9
  61. deepdoctection/extern/tessocr.py +87 -54
  62. deepdoctection/extern/texocr.py +29 -14
  63. deepdoctection/extern/tp/tfutils.py +36 -8
  64. deepdoctection/extern/tp/tpcompat.py +54 -16
  65. deepdoctection/extern/tp/tpfrcnn/config/config.py +20 -4
  66. deepdoctection/extern/tpdetect.py +4 -2
  67. deepdoctection/mapper/__init__.py +1 -1
  68. deepdoctection/mapper/cats.py +117 -76
  69. deepdoctection/mapper/cocostruct.py +35 -17
  70. deepdoctection/mapper/d2struct.py +56 -29
  71. deepdoctection/mapper/hfstruct.py +32 -19
  72. deepdoctection/mapper/laylmstruct.py +221 -185
  73. deepdoctection/mapper/maputils.py +71 -35
  74. deepdoctection/mapper/match.py +76 -62
  75. deepdoctection/mapper/misc.py +68 -44
  76. deepdoctection/mapper/pascalstruct.py +13 -12
  77. deepdoctection/mapper/prodigystruct.py +33 -19
  78. deepdoctection/mapper/pubstruct.py +42 -32
  79. deepdoctection/mapper/tpstruct.py +39 -19
  80. deepdoctection/mapper/xfundstruct.py +20 -13
  81. deepdoctection/pipe/__init__.py +1 -2
  82. deepdoctection/pipe/anngen.py +104 -62
  83. deepdoctection/pipe/base.py +226 -107
  84. deepdoctection/pipe/common.py +206 -123
  85. deepdoctection/pipe/concurrency.py +74 -47
  86. deepdoctection/pipe/doctectionpipe.py +108 -47
  87. deepdoctection/pipe/language.py +41 -24
  88. deepdoctection/pipe/layout.py +45 -18
  89. deepdoctection/pipe/lm.py +146 -78
  90. deepdoctection/pipe/order.py +196 -113
  91. deepdoctection/pipe/refine.py +111 -63
  92. deepdoctection/pipe/registry.py +1 -1
  93. deepdoctection/pipe/segment.py +213 -142
  94. deepdoctection/pipe/sub_layout.py +76 -46
  95. deepdoctection/pipe/text.py +52 -33
  96. deepdoctection/pipe/transform.py +8 -6
  97. deepdoctection/train/d2_frcnn_train.py +87 -69
  98. deepdoctection/train/hf_detr_train.py +72 -40
  99. deepdoctection/train/hf_layoutlm_train.py +85 -46
  100. deepdoctection/train/tp_frcnn_train.py +56 -28
  101. deepdoctection/utils/concurrency.py +59 -16
  102. deepdoctection/utils/context.py +40 -19
  103. deepdoctection/utils/develop.py +25 -17
  104. deepdoctection/utils/env_info.py +85 -36
  105. deepdoctection/utils/error.py +16 -10
  106. deepdoctection/utils/file_utils.py +246 -62
  107. deepdoctection/utils/fs.py +162 -43
  108. deepdoctection/utils/identifier.py +29 -16
  109. deepdoctection/utils/logger.py +49 -32
  110. deepdoctection/utils/metacfg.py +83 -21
  111. deepdoctection/utils/pdf_utils.py +119 -62
  112. deepdoctection/utils/settings.py +24 -10
  113. deepdoctection/utils/tqdm.py +10 -5
  114. deepdoctection/utils/transform.py +182 -46
  115. deepdoctection/utils/utils.py +61 -28
  116. deepdoctection/utils/viz.py +150 -104
  117. deepdoctection-0.43.dist-info/METADATA +376 -0
  118. deepdoctection-0.43.dist-info/RECORD +149 -0
  119. {deepdoctection-0.42.0.dist-info → deepdoctection-0.43.dist-info}/WHEEL +1 -1
  120. deepdoctection/analyzer/_config.py +0 -146
  121. deepdoctection-0.42.0.dist-info/METADATA +0 -431
  122. deepdoctection-0.42.0.dist-info/RECORD +0 -148
  123. {deepdoctection-0.42.0.dist-info → deepdoctection-0.43.dist-info}/licenses/LICENSE +0 -0
  124. {deepdoctection-0.42.0.dist-info → deepdoctection-0.43.dist-info}/top_level.txt +0 -0

deepdoctection/analyzer/_config.py (removed)
@@ -1,146 +0,0 @@
- # -*- coding: utf-8 -*-
- # File: config.py
-
- # Copyright 2024 Dr. Janis Meyer. All rights reserved.
- #
- # Licensed under the Apache License, Version 2.0 (the "License");
- # you may not use this file except in compliance with the License.
- # You may obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- # See the License for the specific language governing permissions and
- # limitations under the License.
-
- """Pipeline configuration for deepdoctection analyzer. Do not change the defaults in this file. """
-
- from ..datapoint.view import IMAGE_DEFAULTS
- from ..utils.metacfg import AttrDict
- from ..utils.settings import CellType, LayoutType
-
- cfg = AttrDict()
-
-
- cfg.LANGUAGE = None
- cfg.LIB = None
- cfg.DEVICE = None
- cfg.USE_ROTATOR = False
- cfg.USE_LAYOUT = True
- cfg.USE_TABLE_SEGMENTATION = True
-
- cfg.TF.LAYOUT.WEIGHTS = "layout/model-800000_inf_only.data-00000-of-00001"
- cfg.TF.LAYOUT.FILTER = None
-
- cfg.TF.CELL.WEIGHTS = "cell/model-1800000_inf_only.data-00000-of-00001"
- cfg.TF.CELL.FILTER = None
-
- cfg.TF.ITEM.WEIGHTS = "item/model-1620000_inf_only.data-00000-of-00001"
- cfg.TF.ITEM.FILTER = None
-
- cfg.PT.ENFORCE_WEIGHTS.LAYOUT = True
- cfg.PT.LAYOUT.WEIGHTS = "layout/d2_model_0829999_layout_inf_only.pt"
- cfg.PT.LAYOUT.WEIGHTS_TS = "layout/d2_model_0829999_layout_inf_only.ts"
- cfg.PT.LAYOUT.FILTER = None
- cfg.PT.LAYOUT.PAD.TOP = 60
- cfg.PT.LAYOUT.PAD.RIGHT = 60
- cfg.PT.LAYOUT.PAD.BOTTOM = 60
- cfg.PT.LAYOUT.PAD.LEFT = 60
-
- cfg.PT.ENFORCE_WEIGHTS.ITEM = True
- cfg.PT.ITEM.WEIGHTS = "item/d2_model_1639999_item_inf_only.pt"
- cfg.PT.ITEM.WEIGHTS_TS = "item/d2_model_1639999_item_inf_only.ts"
- cfg.PT.ITEM.FILTER = None
- cfg.PT.ITEM.PAD.TOP = 60
- cfg.PT.ITEM.PAD.RIGHT = 60
- cfg.PT.ITEM.PAD.BOTTOM = 60
- cfg.PT.ITEM.PAD.LEFT = 60
-
- cfg.PT.ENFORCE_WEIGHTS.CELL = True
- cfg.PT.CELL.WEIGHTS = "cell/d2_model_1849999_cell_inf_only.pt"
- cfg.PT.CELL.WEIGHTS_TS = "cell/d2_model_1849999_cell_inf_only.ts"
- cfg.PT.CELL.FILTER = None
-
- cfg.USE_LAYOUT_NMS = False
- cfg.LAYOUT_NMS_PAIRS.COMBINATIONS = None
- cfg.LAYOUT_NMS_PAIRS.THRESHOLDS = None
- cfg.LAYOUT_NMS_PAIRS.PRIORITY = None
-
- cfg.SEGMENTATION.ASSIGNMENT_RULE = "ioa"
- cfg.SEGMENTATION.THRESHOLD_ROWS = 0.4
- cfg.SEGMENTATION.THRESHOLD_COLS = 0.4
- cfg.SEGMENTATION.FULL_TABLE_TILING = True
- cfg.SEGMENTATION.REMOVE_IOU_THRESHOLD_ROWS = 0.001
- cfg.SEGMENTATION.REMOVE_IOU_THRESHOLD_COLS = 0.001
- cfg.SEGMENTATION.TABLE_NAME = LayoutType.TABLE
- cfg.SEGMENTATION.PUBTABLES_CELL_NAMES = [
-     CellType.SPANNING,
-     CellType.ROW_HEADER,
-     CellType.COLUMN_HEADER,
-     CellType.PROJECTED_ROW_HEADER,
-     LayoutType.CELL,
- ]
- cfg.SEGMENTATION.PUBTABLES_SPANNING_CELL_NAMES = [
-     CellType.SPANNING,
-     CellType.ROW_HEADER,
-     CellType.COLUMN_HEADER,
-     CellType.PROJECTED_ROW_HEADER,
- ]
- cfg.SEGMENTATION.PUBTABLES_ITEM_NAMES = [LayoutType.ROW, LayoutType.COLUMN]
- cfg.SEGMENTATION.PUBTABLES_SUB_ITEM_NAMES = [CellType.ROW_NUMBER, CellType.COLUMN_NUMBER]
- cfg.SEGMENTATION.CELL_NAMES = [CellType.HEADER, CellType.BODY, LayoutType.CELL]
- cfg.SEGMENTATION.ITEM_NAMES = [LayoutType.ROW, LayoutType.COLUMN]
- cfg.SEGMENTATION.SUB_ITEM_NAMES = [CellType.ROW_NUMBER, CellType.COLUMN_NUMBER]
- cfg.SEGMENTATION.PUBTABLES_ITEM_HEADER_CELL_NAMES = [CellType.COLUMN_HEADER, CellType.ROW_HEADER]
- cfg.SEGMENTATION.PUBTABLES_ITEM_HEADER_THRESHOLDS = [0.6, 0.0001]
- cfg.SEGMENTATION.STRETCH_RULE = "equal"
-
- cfg.USE_TABLE_REFINEMENT = True
- cfg.USE_PDF_MINER = False
-
- cfg.PDF_MINER.X_TOLERANCE = 3
- cfg.PDF_MINER.Y_TOLERANCE = 3
-
- cfg.USE_OCR = True
-
- cfg.OCR.USE_TESSERACT = True
- cfg.OCR.USE_DOCTR = False
- cfg.OCR.USE_TEXTRACT = False
- cfg.OCR.CONFIG.TESSERACT = "dd/conf_tesseract.yaml"
-
- cfg.OCR.WEIGHTS.DOCTR_WORD.TF = "doctr/db_resnet50/tf/db_resnet50-adcafc63.zip"
- cfg.OCR.WEIGHTS.DOCTR_WORD.PT = "doctr/db_resnet50/pt/db_resnet50-ac60cadc.pt"
- cfg.OCR.WEIGHTS.DOCTR_RECOGNITION.TF = "doctr/crnn_vgg16_bn/tf/crnn_vgg16_bn-76b7f2c6.zip"
- cfg.OCR.WEIGHTS.DOCTR_RECOGNITION.PT = "doctr/crnn_vgg16_bn/pt/crnn_vgg16_bn-9762b0b0.pt"
-
- cfg.TEXT_CONTAINER = IMAGE_DEFAULTS["text_container"]
- cfg.WORD_MATCHING.PARENTAL_CATEGORIES = [
-     LayoutType.TEXT,
-     LayoutType.TITLE,
-     LayoutType.LIST,
-     LayoutType.CELL,
-     CellType.COLUMN_HEADER,
-     CellType.PROJECTED_ROW_HEADER,
-     CellType.SPANNING,
-     CellType.ROW_HEADER,
- ]
- cfg.WORD_MATCHING.RULE = "ioa"
- cfg.WORD_MATCHING.THRESHOLD = 0.6
- cfg.WORD_MATCHING.MAX_PARENT_ONLY = True
-
- cfg.TEXT_ORDERING.TEXT_BLOCK_CATEGORIES = IMAGE_DEFAULTS["text_block_categories"]
- cfg.TEXT_ORDERING.FLOATING_TEXT_BLOCK_CATEGORIES = IMAGE_DEFAULTS["floating_text_block_categories"]
- cfg.TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER = False
- cfg.TEXT_ORDERING.STARTING_POINT_TOLERANCE = 0.005
- cfg.TEXT_ORDERING.BROKEN_LINE_TOLERANCE = 0.003
- cfg.TEXT_ORDERING.HEIGHT_TOLERANCE = 2.0
- cfg.TEXT_ORDERING.PARAGRAPH_BREAK = 0.035
-
- cfg.USE_LAYOUT_LINK = False
- cfg.USE_LINE_MATCHER = False
- cfg.LAYOUT_LINK.PARENTAL_CATEGORIES = []
- cfg.LAYOUT_LINK.CHILD_CATEGORIES = []
-
- cfg.freeze()
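
The removed `_config.py` above held the analyzer defaults in a frozen `AttrDict`; in `0.43` the defaults appear to move to the new `deepdoctection/analyzer/config.py` and `profiles.jsonl` listed among the changed files. For orientation, here is a minimal sketch of how such defaults are typically overridden at construction time rather than by editing the file. It assumes that `dd.get_dd_analyzer` accepts a `config_overwrite` list of `KEY=value` strings, as described in the analyzer configuration notebook referenced in the README below; the keys mirror the removed defaults and may be renamed in newer releases.

```python
import deepdoctection as dd

# Sketch only: override a few of the defaults shown above at runtime instead of
# editing the config file. Assumes get_dd_analyzer(config_overwrite=...) accepts
# "KEY=value" strings; the key names follow the removed _config.py and may differ
# in 0.43.
analyzer = dd.get_dd_analyzer(
    config_overwrite=[
        "USE_OCR=False",                    # e.g. skip OCR for native PDFs ...
        "USE_PDF_MINER=True",               # ... and rely on pdfplumber text extraction instead
        "SEGMENTATION.THRESHOLD_ROWS=0.5",  # tighten row assignment during table segmentation
    ]
)
```

Keeping the shipped defaults frozen (`cfg.freeze()`) and overriding per run preserves a reproducible baseline while still allowing experiments.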

deepdoctection-0.42.0.dist-info/METADATA (removed)
@@ -1,431 +0,0 @@
- Metadata-Version: 2.4
- Name: deepdoctection
- Version: 0.42.0
- Summary: Repository for Document AI
- Home-page: https://github.com/deepdoctection/deepdoctection
- Author: Dr. Janis Meyer
- License: Apache License 2.0
- Classifier: Development Status :: 4 - Beta
- Classifier: License :: OSI Approved :: Apache Software License
- Classifier: Natural Language :: English
- Classifier: Operating System :: POSIX :: Linux
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: catalogue==2.0.10
- Requires-Dist: huggingface_hub>=0.26.0
- Requires-Dist: importlib-metadata>=5.0.0
- Requires-Dist: jsonlines==3.1.0
- Requires-Dist: lazy-imports==0.3.1
- Requires-Dist: mock==4.0.3
- Requires-Dist: networkx>=2.7.1
- Requires-Dist: numpy<2.0,>=1.21
- Requires-Dist: packaging>=20.0
- Requires-Dist: Pillow>=10.0.0
- Requires-Dist: pypdf>=3.16.0
- Requires-Dist: pypdfium2>=4.30.0
- Requires-Dist: pyyaml>=6.0.1
- Requires-Dist: pyzmq>=16
- Requires-Dist: scipy>=1.13.1
- Requires-Dist: termcolor>=1.1
- Requires-Dist: tabulate>=0.7.7
- Requires-Dist: tqdm==4.64.0
- Provides-Extra: tf
- Requires-Dist: catalogue==2.0.10; extra == "tf"
- Requires-Dist: huggingface_hub>=0.26.0; extra == "tf"
- Requires-Dist: importlib-metadata>=5.0.0; extra == "tf"
- Requires-Dist: jsonlines==3.1.0; extra == "tf"
- Requires-Dist: lazy-imports==0.3.1; extra == "tf"
- Requires-Dist: mock==4.0.3; extra == "tf"
- Requires-Dist: networkx>=2.7.1; extra == "tf"
- Requires-Dist: numpy<2.0,>=1.21; extra == "tf"
- Requires-Dist: packaging>=20.0; extra == "tf"
- Requires-Dist: Pillow>=10.0.0; extra == "tf"
- Requires-Dist: pypdf>=3.16.0; extra == "tf"
- Requires-Dist: pypdfium2>=4.30.0; extra == "tf"
- Requires-Dist: pyyaml>=6.0.1; extra == "tf"
- Requires-Dist: pyzmq>=16; extra == "tf"
- Requires-Dist: scipy>=1.13.1; extra == "tf"
- Requires-Dist: termcolor>=1.1; extra == "tf"
- Requires-Dist: tabulate>=0.7.7; extra == "tf"
- Requires-Dist: tqdm==4.64.0; extra == "tf"
- Requires-Dist: tensorpack==0.11; extra == "tf"
- Requires-Dist: protobuf==3.20.1; extra == "tf"
- Requires-Dist: tensorflow-addons>=0.17.1; extra == "tf"
- Requires-Dist: tf2onnx>=1.9.2; extra == "tf"
- Requires-Dist: python-doctr==0.9.0; extra == "tf"
- Requires-Dist: pycocotools>=2.0.2; extra == "tf"
- Requires-Dist: boto3==1.34.102; extra == "tf"
- Requires-Dist: pdfplumber>=0.11.0; extra == "tf"
- Requires-Dist: fasttext-wheel; extra == "tf"
- Requires-Dist: jdeskew>=0.2.2; extra == "tf"
- Requires-Dist: apted==1.0.3; extra == "tf"
- Requires-Dist: distance==0.1.3; extra == "tf"
- Requires-Dist: lxml>=4.9.1; extra == "tf"
- Provides-Extra: pt
- Requires-Dist: catalogue==2.0.10; extra == "pt"
- Requires-Dist: huggingface_hub>=0.26.0; extra == "pt"
- Requires-Dist: importlib-metadata>=5.0.0; extra == "pt"
- Requires-Dist: jsonlines==3.1.0; extra == "pt"
- Requires-Dist: lazy-imports==0.3.1; extra == "pt"
- Requires-Dist: mock==4.0.3; extra == "pt"
- Requires-Dist: networkx>=2.7.1; extra == "pt"
- Requires-Dist: numpy<2.0,>=1.21; extra == "pt"
- Requires-Dist: packaging>=20.0; extra == "pt"
- Requires-Dist: Pillow>=10.0.0; extra == "pt"
- Requires-Dist: pypdf>=3.16.0; extra == "pt"
- Requires-Dist: pypdfium2>=4.30.0; extra == "pt"
- Requires-Dist: pyyaml>=6.0.1; extra == "pt"
- Requires-Dist: pyzmq>=16; extra == "pt"
- Requires-Dist: scipy>=1.13.1; extra == "pt"
- Requires-Dist: termcolor>=1.1; extra == "pt"
- Requires-Dist: tabulate>=0.7.7; extra == "pt"
- Requires-Dist: tqdm==4.64.0; extra == "pt"
- Requires-Dist: timm>=0.9.16; extra == "pt"
- Requires-Dist: transformers>=4.48.0; extra == "pt"
- Requires-Dist: accelerate>=0.29.1; extra == "pt"
- Requires-Dist: python-doctr==0.9.0; extra == "pt"
- Requires-Dist: boto3==1.34.102; extra == "pt"
- Requires-Dist: pdfplumber>=0.11.0; extra == "pt"
- Requires-Dist: fasttext-wheel; extra == "pt"
- Requires-Dist: jdeskew>=0.2.2; extra == "pt"
- Requires-Dist: apted==1.0.3; extra == "pt"
- Requires-Dist: distance==0.1.3; extra == "pt"
- Requires-Dist: lxml>=4.9.1; extra == "pt"
- Provides-Extra: docs
- Requires-Dist: tensorpack==0.11; extra == "docs"
- Requires-Dist: boto3==1.34.102; extra == "docs"
- Requires-Dist: transformers>=4.48.0; extra == "docs"
- Requires-Dist: accelerate>=0.29.1; extra == "docs"
- Requires-Dist: pdfplumber>=0.11.0; extra == "docs"
- Requires-Dist: lxml>=4.9.1; extra == "docs"
- Requires-Dist: lxml-stubs>=0.5.1; extra == "docs"
- Requires-Dist: jdeskew>=0.2.2; extra == "docs"
- Requires-Dist: jinja2==3.0.3; extra == "docs"
- Requires-Dist: mkdocs-material; extra == "docs"
- Requires-Dist: mkdocstrings-python; extra == "docs"
- Requires-Dist: griffe==0.25.0; extra == "docs"
- Provides-Extra: dev
- Requires-Dist: python-dotenv==1.0.0; extra == "dev"
- Requires-Dist: click; extra == "dev"
- Requires-Dist: black==23.7.0; extra == "dev"
- Requires-Dist: isort==5.13.2; extra == "dev"
- Requires-Dist: pylint==2.17.4; extra == "dev"
- Requires-Dist: mypy==1.4.1; extra == "dev"
- Requires-Dist: wandb; extra == "dev"
- Requires-Dist: types-PyYAML>=6.0.12.12; extra == "dev"
- Requires-Dist: types-termcolor>=1.1.3; extra == "dev"
- Requires-Dist: types-tabulate>=0.9.0.3; extra == "dev"
- Requires-Dist: types-tqdm>=4.66.0.5; extra == "dev"
- Requires-Dist: lxml-stubs>=0.5.1; extra == "dev"
- Requires-Dist: types-Pillow>=10.2.0.20240406; extra == "dev"
- Requires-Dist: types-urllib3>=1.26.25.14; extra == "dev"
- Provides-Extra: test
- Requires-Dist: pytest==8.0.2; extra == "test"
- Requires-Dist: pytest-cov; extra == "test"
- Dynamic: author
- Dynamic: classifier
- Dynamic: description
- Dynamic: description-content-type
- Dynamic: home-page
- Dynamic: license
- Dynamic: license-file
- Dynamic: provides-extra
- Dynamic: requires-dist
- Dynamic: requires-python
- Dynamic: summary
-
-
- <p align="center">
- <img src="https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_logo.png" alt="Deep Doctection Logo" width="60%">
- <h3 align="center">
- A Document AI Package
- </h3>
- </p>
-
-
- **deep**doctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does
- not implement models but enables you to build pipelines using well-established libraries for object detection, OCR
- and selected NLP tasks, and it provides an integrated framework for fine-tuning, evaluating and running models. For more
- specific text-processing tasks, use one of the many other great NLP libraries.
-
- **deep**doctection focuses on applications and is made for those who want to solve real-world problems related to
- document extraction from PDFs or scans in various image formats.
-
- Check the demo of a document layout analysis pipeline with OCR on
- :hugs: [**Hugging Face spaces**](https://huggingface.co/spaces/deepdoctection/deepdoctection).
-
- # Overview
-
- **deep**doctection provides model wrappers for supported libraries so that various tasks can be integrated into
- pipelines. Its core functionality does not depend on any specific deep learning library. Selected models for the following
- tasks are currently supported:
-
- - Document layout analysis including table recognition in Tensorflow with [**Tensorpack**](https://github.com/tensorpack),
- or PyTorch with [**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2),
- - OCR with support for [**Tesseract**](https://github.com/tesseract-ocr/tesseract), [**DocTr**](https://github.com/mindee/doctr)
- (Tensorflow and PyTorch implementations available) and a wrapper around the API of a commercial solution,
- - Text mining for native PDFs with [**pdfplumber**](https://github.com/jsvine/pdfplumber),
- - Language detection with [**fastText**](https://github.com/facebookresearch/fastText),
- - Deskewing and rotating images with [**jdeskew**](https://github.com/phamquiluan/jdeskew).
- - Document and token classification with all LayoutLM models provided by the
- [**Transformers library**](https://github.com/huggingface/transformers).
- (Yes, you can use any LayoutLM model with any of the provided OCR or pdfplumber tools straight away!)
- - Table detection and table structure recognition with
- [**table-transformer**](https://github.com/microsoft/table-transformer).
- - There is a small dataset for token classification [available](https://huggingface.co/datasets/deepdoctection/FRFPE)
- and a lot of new [tutorials](https://github.com/deepdoctection/notebooks/blob/main/Layoutlm_v2_on_custom_token_classification.ipynb)
- showing how to train and evaluate on this dataset using LayoutLMv1, LayoutLMv2, LayoutXLM and LayoutLMv3.
- - Comprehensive configuration of the **analyzer**, e.g. choosing different models, output parsing and OCR selection.
- Check this [notebook](https://github.com/deepdoctection/notebooks/blob/main/Analyzer_Configuration.ipynb) or the
- [docs](https://deepdoctection.readthedocs.io/en/latest/tutorials/analyzer_configuration_notebook/) for more information.
- - Document layout analysis and table recognition now also run with
- [**Torchscript**](https://pytorch.org/docs/stable/jit.html) (CPU), and [**Detectron2**](https://github.com/facebookresearch/detectron2/tree/main/detectron2) is no longer required
- for basic inference.
- - More angle predictors for determining the rotation of a document based on [**Tesseract**](https://github.com/tesseract-ocr/tesseract) and [**DocTr**](https://github.com/mindee/doctr).
- - Token classification with [**LiLT**](https://github.com/jpWang/LiLT) via
- [**transformers**](https://github.com/huggingface/transformers).
- We have added a model wrapper for token classification with LiLT and added some LiLT models to the model catalog
- that look promising, especially if you want to train a model on non-English data. The training script for
- LayoutLM can be used for LiLT as well.
- - [**new**] There are two notebooks available that show how to write a
- [custom predictor](https://github.com/deepdoctection/notebooks/blob/main/Doclaynet_Analyzer_Config.ipynb) based on
- a third-party library that is not yet supported, and how to use
- [advanced configuration](https://github.com/deepdoctection/notebooks/blob/main/Doclaynet_Analyzer_Config.ipynb) to
- get links between layout segments, e.g. captions and tables or figures.
-
- On top of that, **deep**doctection provides methods for pre-processing model inputs, such as cropping or resizing, and for
- post-processing results, such as validating duplicate outputs, relating words to detected layout segments or ordering words
- into contiguous text. You will get an output in JSON format that you can customize even further yourself.
-
- Have a look at the [**introduction notebook**](https://github.com/deepdoctection/notebooks/blob/main/Get_Started.ipynb) in the
- [notebook repo](https://github.com/deepdoctection/notebooks) for an easy start.
-
- Check the [**release notes**](https://github.com/deepdoctection/deepdoctection/releases) for recent updates.
-
- ## Models
-
- **deep**doctection or its support libraries provide pre-trained models that are in most cases available at the
- [**Hugging Face Model Hub**](https://huggingface.co/deepdoctection) or that will be downloaded automatically once
- requested. For instance, you can find pre-trained object detection models from the Tensorpack or Detectron2 framework
- for coarse layout analysis, table cell detection and table recognition.
-
- ## Datasets and training scripts
-
- Training is a substantial part of getting pipelines ready for a specific domain, be it document layout analysis,
- document classification or NER. **deep**doctection provides training scripts for models that are based on trainers
- developed by the library that hosts the model code. Moreover, **deep**doctection hosts code for some well-established
- datasets like **Publaynet**, which makes it easy to experiment. It also contains mappings from widely used data
- formats like COCO and has a dataset framework (akin to [**datasets**](https://github.com/huggingface/datasets)) so that
- setting up training on a custom dataset becomes very easy. [**This notebook**](https://github.com/deepdoctection/notebooks/blob/main/Datasets_and_Eval.ipynb)
- shows you how to do this.
-
- ## Evaluation
-
- **deep**doctection comes equipped with a framework that allows you to evaluate predictions of a single model or multiple
- models in a pipeline against some ground truth. Check again [**here**](https://github.com/deepdoctection/notebooks/blob/main/Datasets_and_Eval.ipynb) to see how it is
- done.
-
- ## Inference
-
- Once a pipeline has been set up, it takes only a few lines of code to instantiate it, and a simple for loop processes all pages
- through the pipeline.
-
- ```python
- import deepdoctection as dd
- from IPython.core.display import HTML
- from matplotlib import pyplot as plt
-
- analyzer = dd.get_dd_analyzer() # instantiate the built-in analyzer similar to the Hugging Face space demo
-
- df = analyzer.analyze(path = "/path/to/your/doc.pdf") # setting up pipeline
- df.reset_state() # Trigger some initialization
-
- doc = iter(df)
- page = next(doc)
-
- image = page.viz()
- plt.figure(figsize = (25,17))
- plt.axis('off')
- plt.imshow(image)
- ```
-
- ![text](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_sample.png)
-
- ```
- HTML(page.tables[0].html)
- ```
-
- ![table](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_table.png)
-
-
- ```
- print(page.text)
- ```
-
- ![table](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/dd_rm_text.png)
-
-
- ## Documentation
-
- There is extensive [**documentation**](https://deepdoctection.readthedocs.io/en/latest/index.html#) available,
- containing tutorials, design concepts and the API. We want to present things as comprehensively and understandably
- as possible. However, we are aware that there are still many areas where significant improvements can be made in terms
- of clarity, grammar and correctness. We look forward to every hint and comment that improves the quality of the
- documentation.
-
-
- ## Requirements
-
- ![requirements](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/_imgs/requirements_deepdoctection_081124.png)
-
- Everything listed in the overview below the **deep**doctection layer is a necessary requirement and has to be installed
- separately.
-
- - Linux or macOS. (Windows is not supported but there is a [Dockerfile](./docker/pytorch-cpu-jupyter/Dockerfile) available)
- - Python >= 3.9
- - PyTorch >= 1.13 **or** 2.11 <= Tensorflow < 2.16 (for lower Tensorflow versions the code will only run on a GPU).
- In general, a GPU is required if you want to train or fine-tune models.
-
- - With respect to the Deep Learning framework, you must decide between [Tensorflow](https://www.tensorflow.org/install?hl=en)
- and [PyTorch](https://pytorch.org/get-started/locally/).
- - The [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine is used through a Python wrapper. The core
- engine has to be installed separately.
-
-
- - For release `v.0.34.0` and below, **deep**doctection uses Python wrappers for [Poppler](https://poppler.freedesktop.org/) to convert PDF
- documents into images. From release `v.0.35.0` on, this dependency is optional.
-
- The following overview shows the availability of the models in conjunction with the DL framework.
-
- | Task                                        | PyTorch | Torchscript   | Tensorflow   |
- |---------------------------------------------|:-------:|:-------------:|:------------:|
- | Layout detection via Detectron2/Tensorpack  | ✅      | ✅ (CPU only) | ✅ (GPU only) |
- | Table recognition via Detectron2/Tensorpack | ✅      | ✅ (CPU only) | ✅ (GPU only) |
- | Table transformer via Transformers          | ✅      | ❌            | ❌           |
- | DocTr                                       | ✅      | ❌            | ✅           |
- | LayoutLM (v1, v2, v3, XLM) via Transformers | ✅      | ❌            | ❌           |
-
-
-
- ## Installation
-
- We recommend using a virtual environment. You can install the package via pip or from source.
-
- ### Install with pip from PyPI
-
- #### Minimal installation
-
- If you want to get started with a minimal setting (e.g. running the **deep**doctection analyzer with
- default configuration or trying the 'Get started notebook'), install **deep**doctection with
-
- ```
- pip install deepdoctection
- ```
-
- If you want to use the Tensorflow framework, please install Tensorpack separately. Detectron2 will not be installed,
- and layout/table recognition models will run with Torchscript on a CPU.
-
- #### Full installation
-
- The following installation will give you ALL models available within the Deep Learning framework as well as all models
- that are independent of Tensorflow/PyTorch. Please note that the dependencies are very complex. We try hard to keep
- the requirements up to date, though.
-
- For **Tensorflow**, run
-
- ```
- pip install deepdoctection[tf]
- ```
-
- For **PyTorch**,
-
- first install **Detectron2** separately as it is not distributed via PyPI. Check the instructions
- [here](https://detectron2.readthedocs.io/en/latest/tutorials/install.html). Then run
-
- ```
- pip install deepdoctection[pt]
- ```
-
- This will install **deep**doctection with all dependencies listed above the **deep**doctection layer. Use this setting
- if you want to get started or want to explore all features.
-
- If you want more control over your installation and are looking for fewer dependencies, then
- install **deep**doctection with the basic setup only.
-
- ```
- pip install deepdoctection
- ```
-
- This will ignore all model libraries (layers above the **deep**doctection layer in the diagram) and you
- will be responsible for installing them yourself. Note that you will not be able to run any pipeline with this setup.
-
- For further information, please consult the [**full installation instructions**](https://deepdoctection.readthedocs.io/en/latest/install/).
-
-
- ### Installation from source
-
- Download the repository or clone via
-
- ```
- git clone https://github.com/deepdoctection/deepdoctection.git
- ```
-
- To get started with **Tensorflow**, run:
-
- ```
- cd deepdoctection
- pip install ".[tf]"
- ```
-
- Installing the full **PyTorch** setup from source will also install **Detectron2** for you:
-
- ```
- cd deepdoctection
- pip install ".[source-pt]"
- ```
-
- ### Running a Docker container from Docker Hub
-
- Starting from release `v.0.27.0`, pre-existing Docker images can be downloaded from
- [Docker Hub](https://hub.docker.com/r/deepdoctection/deepdoctection).
-
- ```
- docker pull deepdoctection/deepdoctection:<release_tag>
- ```
-
- To start the container, you can use the Docker compose file `./docker/pytorch-gpu/docker-compose.yaml`.
- In the `.env` file provided, specify the host directory where **deep**doctection's cache should be stored.
- This directory will be mounted. Additionally, specify a working directory to mount files to be processed into the
- container.
-
- ```
- docker compose up -d
- ```
-
- will start the container.
-
- ## Credits
-
- We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible
- to develop this framework.
-
- ## Problems
-
- We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this
- repo and try to address them as quickly as possible. Bug fixes or enhancements will be deployed in a new release every 10
- to 12 weeks.
-
- ## If you like **deep**doctection ...
-
- ...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.
-
-
- ## License
-
- Distributed under the Apache 2.0 License. Check [LICENSE](https://github.com/deepdoctection/deepdoctection/blob/master/LICENSE)
- for additional information.
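
The inference snippet quoted in the README above stops after visualizing a single page. As a rough sketch of how the output is usually consumed further, assuming the `Page` view exposes `text`, `tables` and `layouts` as in that example (attribute names such as `category_name` and `reading_order` are assumptions and may vary between releases):

```python
import deepdoctection as dd

analyzer = dd.get_dd_analyzer()
df = analyzer.analyze(path="/path/to/your/doc.pdf")
df.reset_state()

for page in df:
    # full page text in reading order
    print(page.text)

    # detected layout segments with their category and reading order
    for layout in page.layouts:
        print(layout.category_name, layout.reading_order)

    # tables as HTML, e.g. for further processing with pandas.read_html
    for table in page.tables:
        print(table.html)
```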