docling 0.3.0__tar.gz → 0.3.1__tar.gz

Files changed (26)
  1. {docling-0.3.0 → docling-0.3.1}/PKG-INFO +31 -26
  2. docling-0.3.1/README.md +110 -0
  3. {docling-0.3.0 → docling-0.3.1}/docling/datamodel/document.py +2 -2
  4. {docling-0.3.0 → docling-0.3.1}/pyproject.toml +1 -1
  5. docling-0.3.0/README.md +0 -105
  6. {docling-0.3.0 → docling-0.3.1}/LICENSE +0 -0
  7. {docling-0.3.0 → docling-0.3.1}/docling/__init__.py +0 -0
  8. {docling-0.3.0 → docling-0.3.1}/docling/backend/__init__.py +0 -0
  9. {docling-0.3.0 → docling-0.3.1}/docling/backend/abstract_backend.py +0 -0
  10. {docling-0.3.0 → docling-0.3.1}/docling/backend/pypdfium2_backend.py +0 -0
  11. {docling-0.3.0 → docling-0.3.1}/docling/datamodel/__init__.py +0 -0
  12. {docling-0.3.0 → docling-0.3.1}/docling/datamodel/base_models.py +0 -0
  13. {docling-0.3.0 → docling-0.3.1}/docling/datamodel/settings.py +0 -0
  14. {docling-0.3.0 → docling-0.3.1}/docling/document_converter.py +0 -0
  15. {docling-0.3.0 → docling-0.3.1}/docling/models/__init__.py +0 -0
  16. {docling-0.3.0 → docling-0.3.1}/docling/models/ds_glm_model.py +0 -0
  17. {docling-0.3.0 → docling-0.3.1}/docling/models/easyocr_model.py +0 -0
  18. {docling-0.3.0 → docling-0.3.1}/docling/models/layout_model.py +0 -0
  19. {docling-0.3.0 → docling-0.3.1}/docling/models/page_assemble_model.py +0 -0
  20. {docling-0.3.0 → docling-0.3.1}/docling/models/table_structure_model.py +0 -0
  21. {docling-0.3.0 → docling-0.3.1}/docling/pipeline/__init__.py +0 -0
  22. {docling-0.3.0 → docling-0.3.1}/docling/pipeline/base_model_pipeline.py +0 -0
  23. {docling-0.3.0 → docling-0.3.1}/docling/pipeline/standard_model_pipeline.py +0 -0
  24. {docling-0.3.0 → docling-0.3.1}/docling/utils/__init__.py +0 -0
  25. {docling-0.3.0 → docling-0.3.1}/docling/utils/layout_utils.py +0 -0
  26. {docling-0.3.0 → docling-0.3.1}/docling/utils/utils.py +0 -0
{docling-0.3.0 → docling-0.3.1}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: docling
- Version: 0.3.0
+ Version: 0.3.1
  Summary: Docling PDF conversion package
  Home-page: https://github.com/DS4SD/docling
  License: MIT
@@ -31,11 +31,20 @@ Project-URL: Repository, https://github.com/DS4SD/docling
  Description-Content-Type: text/markdown

  <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
  </p>

  # Docling

+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
  Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

  ## Features
@@ -44,25 +53,20 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
  * 📝 Extracts metadata from the document, such as title, authors, references and language
  * 🔍 Optionally applies OCR (use with scanned PDFs)

- ## Setup
+ ## Installation

- For general usage, you can simply install `docling` through `pip` from the pypi package index.
- ```
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
  pip install docling
  ```

- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.

  ### Development setup

- To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
  ```bash
- poetry env use $(which python3.11)
- poetry shell
  poetry install
  ```

@@ -77,23 +81,24 @@ The output of the above command will be written to `./scratch`.

  ### Enable or disable pipeline features

- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
  ```python
  doc_converter = DocumentConverter(
      artifacts_path=artifacts_path,
-     pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-         do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+     pipeline_options=PipelineOptions(
+         do_table_structure=False, # controls if table structure is recovered
+         do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+     ),
  )
  ```

  ### Impose limits on the document size

- You can limit the file size and number of pages which should be allowed to process per document.
+ You can limit the file size and number of pages which should be allowed to process per document:
  ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
-     paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ conv_input = DocumentConversionInput.from_paths(
+     paths=[Path("./test/data/2206.01062.pdf")],
+     limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
  )
  ```

@@ -103,12 +108,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
  ```python
  buf = BytesIO(your_binary_stream)
  docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
  ```
  ### Limit resource usage

- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.


  ## Contributing
@@ -118,7 +123,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main

  ## References

- If you use `Docling` in your projects, please consider citing the following:
+ If you use Docling in your projects, please consider citing the following:

  ```bib
  @software{Docling,
@@ -133,6 +138,6 @@ year = {2024}

  ## License

- The `Docling` codebase is under MIT license.
+ The Docling codebase is under MIT license.
  For individual model usage, please refer to the model licenses found in the original packages.

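The README text in the hunks above says Docling's CPU thread usage can be capped through the `OMP_NUM_THREADS` environment variable. A minimal sketch of doing that from Python follows; `limit_omp_threads` is a hypothetical helper, not part of docling, and it assumes the variable is set before any library that reads it is imported.

```python
import os
from typing import MutableMapping


def limit_omp_threads(num_threads: int, env: MutableMapping[str, str] = os.environ) -> None:
    """Hypothetical helper: request at most `num_threads` OpenMP threads."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    # Environment variable values are always strings.
    env["OMP_NUM_THREADS"] = str(num_threads)


# Call this first, then import the libraries that should respect the limit.
limit_omp_threads(2)
```

Exporting the variable in the shell before starting Python has the same effect and avoids any import-order concerns.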
docling-0.3.1/README.md
@@ -0,0 +1,110 @@
+ <p align="center">
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+ </p>
+
+ # Docling
+
+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
+ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+
+ ## Features
+ * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+ * 📑 Understands detailed page layout, reading order and recovers table structures
+ * 📝 Extracts metadata from the document, such as title, authors, references and language
+ * 🔍 Optionally applies OCR (use with scanned PDFs)
+
+ ## Installation
+
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
+ pip install docling
+ ```
+
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.
+
+ ### Development setup
+
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+ ```bash
+ poetry install
+ ```
+
+ ## Usage
+
+ For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
+
+ ```
+ python examples/convert.py
+ ```
+ The output of the above command will be written to `./scratch`.
+
+ ### Enable or disable pipeline features
+
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
+ ```python
+ doc_converter = DocumentConverter(
+     artifacts_path=artifacts_path,
+     pipeline_options=PipelineOptions(
+         do_table_structure=False, # controls if table structure is recovered
+         do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+     ),
+ )
+ ```
+
+ ### Impose limits on the document size
+
+ You can limit the file size and number of pages which should be allowed to process per document:
+ ```python
+ conv_input = DocumentConversionInput.from_paths(
+     paths=[Path("./test/data/2206.01062.pdf")],
+     limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ )
+ ```
+
+ ### Convert from binary PDF streams
+
+ You can convert PDFs from a binary stream instead of from the filesystem as follows:
+ ```python
+ buf = BytesIO(your_binary_stream)
+ docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
+ ```
+ ### Limit resource usage
+
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+
+
+ ## Contributing
+
+ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+
+
+ ## References
+
+ If you use Docling in your projects, please consider citing the following:
+
+ ```bib
+ @software{Docling,
+ author = {Deep Search Team},
+ month = {7},
+ title = {{Docling}},
+ url = {https://github.com/DS4SD/docling},
+ version = {main},
+ year = {2024}
+ }
+ ```
+
+ ## License
+
+ The Docling codebase is under MIT license.
+ For individual model usage, please refer to the model licenses found in the original packages.
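The README's size-limit example passes `max_file_size=20971520` as a bare number. Assuming the value is a byte count, it equals 20 MiB (20 × 1024 × 1024), and a small sketch makes the arithmetic explicit. `MIB` and `mib_to_bytes` are illustrative names, not part of docling.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to a byte count."""
    return mib * MIB


max_file_size = mib_to_bytes(20)
print(max_file_size)  # 20971520, the literal used in the README example
```

Writing the limit as `mib_to_bytes(20)` rather than the raw literal may make the intended size easier to read and adjust.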
{docling-0.3.0 → docling-0.3.1}/docling/datamodel/document.py
@@ -117,9 +117,9 @@ class ConvertedDocument(BaseModel):
      errors: List[Dict] = [] # structure to keep errors

      pages: List[Page] = []
-     assembled: AssembledUnit = None
+     assembled: Optional[AssembledUnit] = None

-     output: DsDocument = None
+     output: Optional[DsDocument] = None

      def to_ds_document(self) -> DsDocument:
          title = ""
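The only code change in this release is the pair of annotations in `document.py`, where fields defaulting to `None` are now declared `Optional[...]`. The sketch below uses only the standard library to show what that changes at the annotation level. `AssembledUnit` here is an empty stand-in class and `allows_none` is a hypothetical helper; neither is docling's own code.

```python
from typing import Optional, get_args


class AssembledUnit:
    """Empty stand-in for the real model class, for illustration only."""


def allows_none(annotation) -> bool:
    """Return True if the annotation is a union that includes None."""
    return type(None) in get_args(annotation)


print(allows_none(AssembledUnit))            # False: the 0.3.0 annotation
print(allows_none(Optional[AssembledUnit]))  # True: the 0.3.1 annotation
```

Tools that read annotations, such as static type checkers and validation libraries, can then treat `None` as a legitimate value for these fields instead of a mismatch with the declared type.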
{docling-0.3.0 → docling-0.3.1}/pyproject.toml
@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "docling"
- version = "0.3.0" # DO NOT EDIT, updated automatically
+ version = "0.3.1" # DO NOT EDIT, updated automatically
  description = "Docling PDF conversion package"
  authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
  license = "MIT"
docling-0.3.0/README.md DELETED
@@ -1,105 +0,0 @@
- <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
- </p>
-
- # Docling
-
- Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
-
- ## Features
- * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
- * 📑 Understands detailed page layout, reading order and recovers table structures
- * 📝 Extracts metadata from the document, such as title, authors, references and language
- * 🔍 Optionally applies OCR (use with scanned PDFs)
-
- ## Setup
-
- For general usage, you can simply install `docling` through `pip` from the pypi package index.
- ```
- pip install docling
- ```
-
- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
-
- ### Development setup
-
- To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
- ```bash
- poetry env use $(which python3.11)
- poetry shell
- poetry install
- ```
-
- ## Usage
-
- For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
-
- ```
- python examples/convert.py
- ```
- The output of the above command will be written to `./scratch`.
-
- ### Enable or disable pipeline features
-
- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
- ```python
- doc_converter = DocumentConverter(
-     artifacts_path=artifacts_path,
-     pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-         do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
- )
- ```
-
- ### Impose limits on the document size
-
- You can limit the file size and number of pages which should be allowed to process per document.
- ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
-     paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
- )
- ```
-
- ### Convert from binary PDF streams
-
- You can convert PDFs from a binary stream instead of from the filesystem as follows:
- ```python
- buf = BytesIO(your_binary_stream)
- docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
- ```
- ### Limit resource usage
-
- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
-
-
- ## Contributing
-
- Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
-
-
- ## References
-
- If you use `Docling` in your projects, please consider citing the following:
-
- ```bib
- @software{Docling,
- author = {Deep Search Team},
- month = {7},
- title = {{Docling}},
- url = {https://github.com/DS4SD/docling},
- version = {main},
- year = {2024}
- }
- ```
-
- ## License
-
- The `Docling` codebase is under MIT license.
- For individual model usage, please refer to the model licenses found in the original packages.