docling 0.2.0__py3-none-any.whl → 0.3.1__py3-none-any.whl

@@ -117,9 +117,9 @@ class ConvertedDocument(BaseModel):
      errors: List[Dict] = [] # structure to keep errors

      pages: List[Page] = []
-     assembled: AssembledUnit = None
+     assembled: Optional[AssembledUnit] = None

-     output: DsDocument = None
+     output: Optional[DsDocument] = None

      def to_ds_document(self) -> DsDocument:
          title = ""
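The annotation change in this hunk declares that `assembled` and `output` may legitimately be `None`, which the old annotations did not say even though `None` was the default. The following is a minimal sketch of that difference, using standard-library dataclasses as a stand-in for the Pydantic `BaseModel`. `AssembledUnit`, `LooseSketch` and `ConvertedDocumentSketch` below are hypothetical placeholders, not Docling's definitions.

```python
from dataclasses import dataclass
from typing import Optional, get_args, get_type_hints


@dataclass
class AssembledUnit:
    """Hypothetical placeholder for the real Docling type."""
    elements: int = 0


@dataclass
class LooseSketch:
    """Old style: default is None, but the annotation does not admit None."""
    assembled: AssembledUnit = None  # static type checkers flag this default


@dataclass
class ConvertedDocumentSketch:
    """New style: Optional[X] is shorthand for Union[X, None]."""
    assembled: Optional[AssembledUnit] = None


def allows_none(cls: type, field_name: str) -> bool:
    """Return True if the annotation of `field_name` explicitly admits None."""
    hint = get_type_hints(cls)[field_name]
    return type(None) in get_args(hint)


doc = ConvertedDocumentSketch()
print(allows_none(LooseSketch, "assembled"))
print(allows_none(ConvertedDocumentSketch, "assembled"))
```

Both classes behave the same at runtime here. The gain is that tools reading the annotation (type checkers, and validators that enforce field types) now see `None` as an accepted value.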
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: docling
- Version: 0.2.0
+ Version: 0.3.1
  Summary: Docling PDF conversion package
  Home-page: https://github.com/DS4SD/docling
  License: MIT
@@ -18,7 +18,7 @@ Classifier: Programming Language :: Python :: 3
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Dist: deepsearch-glm (>=0.18.4,<1)
+ Requires-Dist: deepsearch-glm (>=0.19.0,<1)
  Requires-Dist: deepsearch-toolkit (>=0.47.0,<1)
  Requires-Dist: docling-core (>=0.2.0,<0.3.0)
  Requires-Dist: docling-ibm-models (>=0.2.0,<0.3.0)
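The only dependency change in this hunk raises the lower bound for `deepsearch-glm` from 0.18.4 to 0.19.0 and keeps the `<1` upper bound. As a rough illustration of which releases each range admits, here is a simplified sketch that handles only plain dotted numeric versions. It is not a full PEP 440 implementation (the `packaging` library is the usual tool for that), and the helper names are made up for this example.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as "0.19.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, i.e. a ">=lower,<upper" specifier."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Old bounds from the removed line, new bounds from the added line.
old_spec = ("0.18.4", "1")
new_spec = ("0.19.0", "1")

for candidate in ("0.18.4", "0.19.0", "1.0.0"):
    print(candidate, in_range(candidate, *old_spec), in_range(candidate, *new_spec))
```

Under these assumptions, 0.18.4 satisfies only the old range, 0.19.0 satisfies both, and 1.0.0 satisfies neither.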
@@ -31,11 +31,20 @@ Project-URL: Repository, https://github.com/DS4SD/docling
  Description-Content-Type: text/markdown

  <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
  </p>

  # Docling

+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
  Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

  ## Features
@@ -44,22 +53,23 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
  * 📝 Extracts metadata from the document, such as title, authors, references and language
  * 🔍 Optionally applies OCR (use with scanned PDFs)

- ## Setup
+ ## Installation
+
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
+ pip install docling
+ ```

- You need Python 3.11 and poetry. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.

- Once you have `poetry` installed, create an environment and install the package:
+ ### Development setup

+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
  ```bash
- poetry env use $(which python3.11)
- poetry shell
  poetry install
  ```

- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
-
-
  ## Usage

  For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
@@ -71,23 +81,24 @@ The output of the above command will be written to `./scratch`.

  ### Enable or disable pipeline features

- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
  ```python
  doc_converter = DocumentConverter(
      artifacts_path=artifacts_path,
-     pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-                                      do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+     pipeline_options=PipelineOptions(
+         do_table_structure=False, # controls if table structure is recovered
+         do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+     ),
  )
  ```

  ### Impose limits on the document size

- You can limit the file size and number of pages which should be allowed to process per document.
+ You can limit the file size and number of pages which should be allowed to process per document:
  ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
-     paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ conv_input = DocumentConversionInput.from_paths(
+     paths=[Path("./test/data/2206.01062.pdf")],
+     limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
  )
  ```

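The `max_file_size=20971520` literal in the README example above equals 20 MiB, assuming the limit is expressed in bytes (the diff does not state the unit). The sketch below derives the constant and shows a hypothetical size pre-check on a file. `within_size_limit` is an illustrative helper, not part of Docling, and whether Docling treats the limit as inclusive is not known from this diff.

```python
import os
import tempfile

# 20 MiB expressed in bytes; equal to the literal used in the README example.
MAX_FILE_SIZE = 20 * 1024 * 1024


def within_size_limit(path: str, max_bytes: int = MAX_FILE_SIZE) -> bool:
    """Hypothetical pre-check: True if the file at `path` is at most `max_bytes`."""
    return os.path.getsize(path) <= max_bytes


# Demonstrate with a small temporary file standing in for a PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 placeholder")
    sample_path = tmp.name

ok = within_size_limit(sample_path)
os.remove(sample_path)
print(MAX_FILE_SIZE, ok)
```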
@@ -97,12 +108,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
  ```python
  buf = BytesIO(your_binary_stream)
  docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
  ```
  ### Limit resource usage

- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.


  ## Contributing
@@ -112,7 +123,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main

  ## References

- If you use `Docling` in your projects, please consider citing the following:
+ If you use Docling in your projects, please consider citing the following:

  ```bib
  @software{Docling,
@@ -127,6 +138,6 @@ year = {2024}

  ## License

- The `Docling` codebase is under MIT license.
+ The Docling codebase is under MIT license.
  For individual model usage, please refer to the model licenses found in the original packages.

@@ -4,7 +4,7 @@ docling/backend/abstract_backend.py,sha256=dINr8oTax9Fq31Y1AR0CGWNZtAHN5aqB_M7TA
  docling/backend/pypdfium2_backend.py,sha256=sJMoActFyc3qdKB6RFly3auHXuXM4noQAG0ypUlj26o,7647
  docling/datamodel/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  docling/datamodel/base_models.py,sha256=GKeRryRuCS6mPWJf0IPJ5manXwiuS0v8wFOnVXF38b0,6128
- docling/datamodel/document.py,sha256=JIp4TRl9NrVLXwNU-9llkbrFGUKly9B2pwzXaH0GEsE,12615
+ docling/datamodel/document.py,sha256=S4USz13mqLS9WUwTgEkoocykcmY6B3cC3f4JlfTSYcM,12635
  docling/datamodel/settings.py,sha256=t5g6wrEJnPa9gBzMMl8ppgBRUYz-8xgopEtfMS0ZH28,733
  docling/document_converter.py,sha256=MZw23oPlRmRi1ggzoD1PukUnqo-6boO3RZB06dZ5Xt0,7305
  docling/models/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
@@ -19,7 +19,7 @@ docling/pipeline/standard_model_pipeline.py,sha256=pDbgVO0oOJry7Q-3KYdMuaypXCQOd
  docling/utils/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  docling/utils/layout_utils.py,sha256=FOFbL0hKzUoWXdZaeUvEtFqKv0IkPifIr4sdGW4suKs,31804
  docling/utils/utils.py,sha256=llhXSbIDNZ1MHOwBEfLHBAoJIAYI7QlPIonlI1jLUJ0,1208
- docling-0.2.0.dist-info/LICENSE,sha256=ACwmltkrXIz5VsEQcrqljq-fat6ZXAMepjXGoe40KtE,1069
- docling-0.2.0.dist-info/METADATA,sha256=uI_41YwSSWv5K_xUDsJA9Ju8FZQ3zwVgAFOA1glhuyM,4455
- docling-0.2.0.dist-info/WHEEL,sha256=sP946D7jFCHeNz5Iq4fL4Lu-PrWrFsgfLXbbkciIZwg,88
- docling-0.2.0.dist-info/RECORD,,
+ docling-0.3.1.dist-info/LICENSE,sha256=ACwmltkrXIz5VsEQcrqljq-fat6ZXAMepjXGoe40KtE,1069
+ docling-0.3.1.dist-info/METADATA,sha256=5OpesJEMNC_jdf88GO7drN0XHZkmpmk0J13mM1E50rk,5390
+ docling-0.3.1.dist-info/WHEEL,sha256=sP946D7jFCHeNz5Iq4fL4Lu-PrWrFsgfLXbbkciIZwg,88
+ docling-0.3.1.dist-info/RECORD,,
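
Each RECORD line above lists a path, a hash and a size in bytes. In the wheel format the hash is the SHA-256 digest encoded as URL-safe base64 with the trailing `=` padding removed. The sketch below rebuilds such a line with the standard library. `record_entry` is a name chosen for this example, not a packaging API.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line: path, urlsafe-base64 SHA-256 without padding, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# An empty file reproduces the hash listed for the zero-byte __init__.py modules.
entry = record_entry("docling/utils/__init__.py", b"")
print(entry)
```

This also explains why every zero-byte `__init__.py` in the listing carries the same `47DEQpj8...` hash: it is the digest of empty content. Only `docling/datamodel/document.py` changes hash and size between the two versions among the package modules shown.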