docling 0.2.0__tar.gz → 0.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (26)
  1. {docling-0.2.0 → docling-0.3.1}/PKG-INFO +36 -25
  2. docling-0.3.1/README.md +110 -0
  3. {docling-0.2.0 → docling-0.3.1}/docling/datamodel/document.py +2 -2
  4. {docling-0.2.0 → docling-0.3.1}/pyproject.toml +2 -2
  5. docling-0.2.0/README.md +0 -99
  6. {docling-0.2.0 → docling-0.3.1}/LICENSE +0 -0
  7. {docling-0.2.0 → docling-0.3.1}/docling/__init__.py +0 -0
  8. {docling-0.2.0 → docling-0.3.1}/docling/backend/__init__.py +0 -0
  9. {docling-0.2.0 → docling-0.3.1}/docling/backend/abstract_backend.py +0 -0
  10. {docling-0.2.0 → docling-0.3.1}/docling/backend/pypdfium2_backend.py +0 -0
  11. {docling-0.2.0 → docling-0.3.1}/docling/datamodel/__init__.py +0 -0
  12. {docling-0.2.0 → docling-0.3.1}/docling/datamodel/base_models.py +0 -0
  13. {docling-0.2.0 → docling-0.3.1}/docling/datamodel/settings.py +0 -0
  14. {docling-0.2.0 → docling-0.3.1}/docling/document_converter.py +0 -0
  15. {docling-0.2.0 → docling-0.3.1}/docling/models/__init__.py +0 -0
  16. {docling-0.2.0 → docling-0.3.1}/docling/models/ds_glm_model.py +0 -0
  17. {docling-0.2.0 → docling-0.3.1}/docling/models/easyocr_model.py +0 -0
  18. {docling-0.2.0 → docling-0.3.1}/docling/models/layout_model.py +0 -0
  19. {docling-0.2.0 → docling-0.3.1}/docling/models/page_assemble_model.py +0 -0
  20. {docling-0.2.0 → docling-0.3.1}/docling/models/table_structure_model.py +0 -0
  21. {docling-0.2.0 → docling-0.3.1}/docling/pipeline/__init__.py +0 -0
  22. {docling-0.2.0 → docling-0.3.1}/docling/pipeline/base_model_pipeline.py +0 -0
  23. {docling-0.2.0 → docling-0.3.1}/docling/pipeline/standard_model_pipeline.py +0 -0
  24. {docling-0.2.0 → docling-0.3.1}/docling/utils/__init__.py +0 -0
  25. {docling-0.2.0 → docling-0.3.1}/docling/utils/layout_utils.py +0 -0
  26. {docling-0.2.0 → docling-0.3.1}/docling/utils/utils.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: docling
- Version: 0.2.0
+ Version: 0.3.1
  Summary: Docling PDF conversion package
  Home-page: https://github.com/DS4SD/docling
  License: MIT
@@ -18,7 +18,7 @@ Classifier: Programming Language :: Python :: 3
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Dist: deepsearch-glm (>=0.18.4,<1)
+ Requires-Dist: deepsearch-glm (>=0.19.0,<1)
  Requires-Dist: deepsearch-toolkit (>=0.47.0,<1)
  Requires-Dist: docling-core (>=0.2.0,<0.3.0)
  Requires-Dist: docling-ibm-models (>=0.2.0,<0.3.0)
@@ -31,11 +31,20 @@ Project-URL: Repository, https://github.com/DS4SD/docling
  Description-Content-Type: text/markdown
 
  <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
  </p>
 
  # Docling
 
+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
  Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
 
  ## Features
@@ -44,22 +53,23 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
  * 📝 Extracts metadata from the document, such as title, authors, references and language
  * 🔍 Optionally applies OCR (use with scanned PDFs)
 
- ## Setup
+ ## Installation
+
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
+ pip install docling
+ ```
 
- You need Python 3.11 and poetry. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.
 
- Once you have `poetry` installed, create an environment and install the package:
+ ### Development setup
 
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
  ```bash
- poetry env use $(which python3.11)
- poetry shell
  poetry install
  ```
 
- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
-
-
  ## Usage
 
  For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
@@ -71,23 +81,24 @@ The output of the above command will be written to `./scratch`.
 
  ### Enable or disable pipeline features
 
- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
  ```python
  doc_converter = DocumentConverter(
  artifacts_path=artifacts_path,
- pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
- do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+ pipeline_options=PipelineOptions(
+ do_table_structure=False, # controls if table structure is recovered
+ do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+ ),
  )
  ```
 
  ### Impose limits on the document size
 
- You can limit the file size and number of pages which should be allowed to process per document.
+ You can limit the file size and number of pages which should be allowed to process per document:
  ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
- paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ conv_input = DocumentConversionInput.from_paths(
+ paths=[Path("./test/data/2206.01062.pdf")],
+ limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
  )
  ```
 
@@ -97,12 +108,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
  ```python
  buf = BytesIO(your_binary_stream)
  docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
  ```
  ### Limit resource usage
 
- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
 
 
  ## Contributing
@@ -112,7 +123,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main
 
  ## References
 
- If you use `Docling` in your projects, please consider citing the following:
+ If you use Docling in your projects, please consider citing the following:
 
  ```bib
  @software{Docling,
@@ -127,6 +138,6 @@ year = {2024}
 
  ## License
 
- The `Docling` codebase is under MIT license.
+ The Docling codebase is under MIT license.
  For individual model usage, please refer to the model licenses found in the original packages.
 
@@ -0,0 +1,110 @@
+ <p align="center">
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+ </p>
+
+ # Docling
+
+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
+ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+
+ ## Features
+ * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+ * 📑 Understands detailed page layout, reading order and recovers table structures
+ * 📝 Extracts metadata from the document, such as title, authors, references and language
+ * 🔍 Optionally applies OCR (use with scanned PDFs)
+
+ ## Installation
+
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
+ pip install docling
+ ```
+
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.
+
+ ### Development setup
+
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+ ```bash
+ poetry install
+ ```
+
+ ## Usage
+
+ For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
+
+ ```
+ python examples/convert.py
+ ```
+ The output of the above command will be written to `./scratch`.
+
+ ### Enable or disable pipeline features
+
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
+ ```python
+ doc_converter = DocumentConverter(
+ artifacts_path=artifacts_path,
+ pipeline_options=PipelineOptions(
+ do_table_structure=False, # controls if table structure is recovered
+ do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+ ),
+ )
+ ```
+
+ ### Impose limits on the document size
+
+ You can limit the file size and number of pages which should be allowed to process per document:
+ ```python
+ conv_input = DocumentConversionInput.from_paths(
+ paths=[Path("./test/data/2206.01062.pdf")],
+ limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ )
+ ```
+
+ ### Convert from binary PDF streams
+
+ You can convert PDFs from a binary stream instead of from the filesystem as follows:
+ ```python
+ buf = BytesIO(your_binary_stream)
+ docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
+ ```
+ ### Limit resource usage
+
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+
+
+ ## Contributing
+
+ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+
+
+ ## References
+
+ If you use Docling in your projects, please consider citing the following:
+
+ ```bib
+ @software{Docling,
+ author = {Deep Search Team},
+ month = {7},
+ title = {{Docling}},
+ url = {https://github.com/DS4SD/docling},
+ version = {main},
+ year = {2024}
+ }
+ ```
+
+ ## License
+
+ The Docling codebase is under MIT license.
+ For individual model usage, please refer to the model licenses found in the original packages.
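Note on the README's "Limit resource usage" section: it relies on the `OMP_NUM_THREADS` environment variable. A minimal sketch of setting it from Python follows. It assumes the variable must be in place before any OpenMP-backed library is loaded, which is the usual behavior of OpenMP runtimes but is not confirmed for Docling by this diff. `limit_omp_threads` is a hypothetical helper, not part of the package.

```python
import os


def limit_omp_threads(num_threads: int) -> str:
    """Set OMP_NUM_THREADS for the current process and return the stored value.

    Hypothetical helper for illustration only; the assumption is that it is
    called before docling (or any other OpenMP-backed library) is imported.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    # Environment variables are strings, so the integer is converted first.
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    return os.environ["OMP_NUM_THREADS"]


if __name__ == "__main__":
    # Request 2 threads instead of the default of 4 stated in the README.
    print(limit_omp_threads(2))
```

Setting the variable in the shell before launching Python should have the same effect and avoids any dependence on import order.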
@@ -117,9 +117,9 @@ class ConvertedDocument(BaseModel):
  errors: List[Dict] = [] # structure to keep errors
 
  pages: List[Page] = []
- assembled: AssembledUnit = None
+ assembled: Optional[AssembledUnit] = None
 
- output: DsDocument = None
+ output: Optional[DsDocument] = None
 
  def to_ds_document(self) -> DsDocument:
  title = ""
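Note on the hunk above: both fields already defaulted to `None`, and the change only makes the annotations admit it. As far as I know, Pydantic v2 does not validate default values unless `validate_default` is enabled, so the old form could be constructed but described the field incorrectly to type checkers. The sketch below uses standard-library dataclasses rather than Pydantic to flag that kind of mismatch. `AssembledUnit`, `Before`, `After`, and `none_defaults_without_optional` are hypothetical stand-ins, not Docling code.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, get_args


class AssembledUnit:
    """Hypothetical stand-in for docling's AssembledUnit."""


@dataclass
class Before:
    # Annotation claims the value is always an AssembledUnit, yet defaults to None.
    assembled: AssembledUnit = None


@dataclass
class After:
    # Annotation now states that None is a legal value.
    assembled: Optional[AssembledUnit] = None


def none_defaults_without_optional(cls) -> List[str]:
    """Return names of fields that default to None but whose annotation excludes None."""
    offenders = []
    for field in fields(cls):
        # get_args(Optional[X]) yields (X, NoneType); a plain class yields ().
        allows_none = type(None) in get_args(field.type)
        if field.default is None and not allows_none:
            offenders.append(field.name)
    return offenders


if __name__ == "__main__":
    print(none_defaults_without_optional(Before))
    print(none_defaults_without_optional(After))
```

A static type checker such as mypy would report the same mismatch without running any code. The runtime check is only meant to make the difference between the two annotations visible.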
@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "docling"
- version = "0.2.0" # DO NOT EDIT, updated automatically
+ version = "0.3.1" # DO NOT EDIT, updated automatically
  description = "Docling PDF conversion package"
  authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
  license = "MIT"
@@ -25,7 +25,7 @@ python = "^3.11"
  pydantic = "^2.0.0"
  docling-core = "^0.2.0"
  docling-ibm-models = "^0.2.0"
- deepsearch-glm = ">=0.18.4,<1"
+ deepsearch-glm = ">=0.19.0,<1"
  deepsearch-toolkit = ">=0.47.0,<1"
  filetype = "^1.2.0"
  pypdfium2 = "^4.30.0"
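Note on the hunk above: the lower bound for `deepsearch-glm` moves from 0.18.4 to 0.19.0 while the upper bound stays at `<1`. The sketch below shows which versions each range admits. It is a deliberate simplification that compares purely numeric dotted versions as integer tuples and ignores the pre-release and post-release rules of PEP 440. `parse_version` and `in_range` are hypothetical helpers, and the candidate versions are illustrative, not a list of actual releases.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a purely numeric dotted version such as '0.19.0' into (0, 19, 0)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, the shape of the '>=X,<Y' constraints above."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    # Illustrative candidates only; they are not taken from a release list.
    for candidate in ("0.18.4", "0.19.0", "0.25.3", "1.0.0"):
        old_ok = in_range(candidate, "0.18.4", "1")  # constraint in 0.2.0
        new_ok = in_range(candidate, "0.19.0", "1")  # constraint in 0.3.1
        print(f"{candidate}: old={old_ok} new={new_ok}")
```

For real dependency resolution, a PEP 440 aware library is the safer choice. The tuple comparison here only covers simple numeric versions.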
docling-0.2.0/README.md DELETED
@@ -1,99 +0,0 @@
- <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
- </p>
-
- # Docling
-
- Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
-
- ## Features
- * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
- * 📑 Understands detailed page layout, reading order and recovers table structures
- * 📝 Extracts metadata from the document, such as title, authors, references and language
- * 🔍 Optionally applies OCR (use with scanned PDFs)
-
- ## Setup
-
- You need Python 3.11 and poetry. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed, create an environment and install the package:
-
- ```bash
- poetry env use $(which python3.11)
- poetry shell
- poetry install
- ```
-
- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
-
-
- ## Usage
-
- For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
-
- ```
- python examples/convert.py
- ```
- The output of the above command will be written to `./scratch`.
-
- ### Enable or disable pipeline features
-
- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
- ```python
- doc_converter = DocumentConverter(
- artifacts_path=artifacts_path,
- pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
- do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
- )
- ```
-
- ### Impose limits on the document size
-
- You can limit the file size and number of pages which should be allowed to process per document.
- ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
- paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
- )
- ```
-
- ### Convert from binary PDF streams
-
- You can convert PDFs from a binary stream instead of from the filesystem as follows:
- ```python
- buf = BytesIO(your_binary_stream)
- docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
- ```
- ### Limit resource usage
-
- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
-
-
- ## Contributing
-
- Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
-
-
- ## References
-
- If you use `Docling` in your projects, please consider citing the following:
-
- ```bib
- @software{Docling,
- author = {Deep Search Team},
- month = {7},
- title = {{Docling}},
- url = {https://github.com/DS4SD/docling},
- version = {main},
- year = {2024}
- }
- ```
-
- ## License
-
- The `Docling` codebase is under MIT license.
- For individual model usage, please refer to the model licenses found in the original packages.
File without changes