docling 0.3.0__py3-none-any.whl → 0.3.1__py3-none-any.whl

@@ -117,9 +117,9 @@ class ConvertedDocument(BaseModel):
  errors: List[Dict] = [] # structure to keep errors
 
  pages: List[Page] = []
- assembled: AssembledUnit = None
+ assembled: Optional[AssembledUnit] = None
 
- output: DsDocument = None
+ output: Optional[DsDocument] = None
 
  def to_ds_document(self) -> DsDocument:
  title = ""
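Note on the two annotation changes in this hunk: `AssembledUnit = None` declares a type that does not admit its own default value, whereas `Optional[AssembledUnit] = None` does. The diff does not state the motivation, so the following is only a minimal standard-library sketch of the difference at the annotation level. `AssembledUnit` here is an empty stand-in class and `admits_none` is a hypothetical helper; neither is taken from docling.

```python
from typing import Optional, Union, get_args, get_origin


class AssembledUnit:
    """Empty stand-in for the docling class of the same name (assumption)."""


def admits_none(annotation) -> bool:
    """Return True if the annotation explicitly allows None."""
    if annotation is type(None):
        return True
    # Optional[X] is shorthand for Union[X, None].
    return get_origin(annotation) is Union and type(None) in get_args(annotation)


print(admits_none(AssembledUnit))            # old annotation -> False
print(admits_none(Optional[AssembledUnit]))  # new annotation -> True
```

Static type checkers apply the same distinction when they see `None` assigned to the field.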
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: docling
- Version: 0.3.0
+ Version: 0.3.1
  Summary: Docling PDF conversion package
  Home-page: https://github.com/DS4SD/docling
  License: MIT
@@ -31,11 +31,20 @@ Project-URL: Repository, https://github.com/DS4SD/docling
  Description-Content-Type: text/markdown
 
  <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
  </p>
 
  # Docling
 
+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
  Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
 
  ## Features
@@ -44,25 +53,20 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
  * 📝 Extracts metadata from the document, such as title, authors, references and language
  * 🔍 Optionally applies OCR (use with scanned PDFs)
 
- ## Setup
+ ## Installation
 
- For general usage, you can simply install `docling` through `pip` from the pypi package index.
- ```
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
  pip install docling
  ```
 
- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.
 
  ### Development setup
 
- To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
  ```bash
- poetry env use $(which python3.11)
- poetry shell
  poetry install
  ```
 
@@ -77,23 +81,24 @@ The output of the above command will be written to `./scratch`.
 
  ### Enable or disable pipeline features
 
- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
  ```python
  doc_converter = DocumentConverter(
  artifacts_path=artifacts_path,
- pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
- do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+ pipeline_options=PipelineOptions(
+ do_table_structure=False, # controls if table structure is recovered
+ do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+ ),
  )
  ```
 
  ### Impose limits on the document size
 
- You can limit the file size and number of pages which should be allowed to process per document.
+ You can limit the file size and number of pages which should be allowed to process per document:
  ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
- paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ conv_input = DocumentConversionInput.from_paths(
+ paths=[Path("./test/data/2206.01062.pdf")],
+ limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
  )
  ```
 
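Note on the `max_file_size=20971520` literal in the hunk above: the README does not name the unit, but the value works out to exactly 20 MiB if it is read as a byte count (an assumption). A small standard-library check, with `MIB` as an illustrative constant that is not part of docling:

```python
MIB = 1024 * 1024  # bytes per mebibyte (illustrative constant)

max_file_size = 20971520  # literal copied from the README snippet

print(max_file_size // MIB)       # -> 20
print(max_file_size == 20 * MIB)  # -> True
```

Writing the limit as `20 * 1024 * 1024` at the call site would make the intent easier to read than the bare literal.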
@@ -103,12 +108,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
  ```python
  buf = BytesIO(your_binary_stream)
  docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
  ```
  ### Limit resource usage
 
- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
 
 
  ## Contributing
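Note on the `OMP_NUM_THREADS` paragraph in the hunk above: the README names the variable but does not show how to set it. One way, sketched here under the assumption that the variable is read when the underlying numerical libraries initialise, is to set it from Python before importing anything that uses them. The value `"2"` is an arbitrary example.

```python
import os

# Assumption: this must run before the libraries that read the variable
# are first imported; "2" is an arbitrary example value.
os.environ["OMP_NUM_THREADS"] = "2"

print(os.environ["OMP_NUM_THREADS"])  # -> 2
```

Setting the variable in the shell before launching the process avoids the import-order concern altogether.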
@@ -118,7 +123,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main
 
  ## References
 
- If you use `Docling` in your projects, please consider citing the following:
+ If you use Docling in your projects, please consider citing the following:
 
  ```bib
  @software{Docling,
@@ -133,6 +138,6 @@ year = {2024}
 
  ## License
 
- The `Docling` codebase is under MIT license.
+ The Docling codebase is under MIT license.
  For individual model usage, please refer to the model licenses found in the original packages.
 
@@ -4,7 +4,7 @@ docling/backend/abstract_backend.py,sha256=dINr8oTax9Fq31Y1AR0CGWNZtAHN5aqB_M7TA
  docling/backend/pypdfium2_backend.py,sha256=sJMoActFyc3qdKB6RFly3auHXuXM4noQAG0ypUlj26o,7647
  docling/datamodel/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  docling/datamodel/base_models.py,sha256=GKeRryRuCS6mPWJf0IPJ5manXwiuS0v8wFOnVXF38b0,6128
- docling/datamodel/document.py,sha256=JIp4TRl9NrVLXwNU-9llkbrFGUKly9B2pwzXaH0GEsE,12615
+ docling/datamodel/document.py,sha256=S4USz13mqLS9WUwTgEkoocykcmY6B3cC3f4JlfTSYcM,12635
  docling/datamodel/settings.py,sha256=t5g6wrEJnPa9gBzMMl8ppgBRUYz-8xgopEtfMS0ZH28,733
  docling/document_converter.py,sha256=MZw23oPlRmRi1ggzoD1PukUnqo-6boO3RZB06dZ5Xt0,7305
  docling/models/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
@@ -19,7 +19,7 @@ docling/pipeline/standard_model_pipeline.py,sha256=pDbgVO0oOJry7Q-3KYdMuaypXCQOd
  docling/utils/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  docling/utils/layout_utils.py,sha256=FOFbL0hKzUoWXdZaeUvEtFqKv0IkPifIr4sdGW4suKs,31804
  docling/utils/utils.py,sha256=llhXSbIDNZ1MHOwBEfLHBAoJIAYI7QlPIonlI1jLUJ0,1208
- docling-0.3.0.dist-info/LICENSE,sha256=ACwmltkrXIz5VsEQcrqljq-fat6ZXAMepjXGoe40KtE,1069
- docling-0.3.0.dist-info/METADATA,sha256=kCXvIw9BqMqcht7yhkIp-B4Z7MIw3GmeSpGh_iVm0sA,4667
- docling-0.3.0.dist-info/WHEEL,sha256=sP946D7jFCHeNz5Iq4fL4Lu-PrWrFsgfLXbbkciIZwg,88
- docling-0.3.0.dist-info/RECORD,,
+ docling-0.3.1.dist-info/LICENSE,sha256=ACwmltkrXIz5VsEQcrqljq-fat6ZXAMepjXGoe40KtE,1069
+ docling-0.3.1.dist-info/METADATA,sha256=5OpesJEMNC_jdf88GO7drN0XHZkmpmk0J13mM1E50rk,5390
+ docling-0.3.1.dist-info/WHEEL,sha256=sP946D7jFCHeNz5Iq4fL4Lu-PrWrFsgfLXbbkciIZwg,88
+ docling-0.3.1.dist-info/RECORD,,
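Note on reading the RECORD lines above: each entry has the form `path,sha256=<digest>,<size>`, where the digest is the file's SHA-256 hash in URL-safe base64 with the trailing `=` padding removed and the size is in bytes. The sketch below rebuilds one entry using only the standard library; `record_entry` is a hypothetical helper name, not part of any packaging tool.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line for a file with the given contents."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the trailing '=' padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# An empty file reproduces the digest listed for the empty __init__.py files.
print(record_entry("docling/utils/__init__.py", b""))
# -> docling/utils/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
```

The same computation explains why only `document.py` and `METADATA` show new digests and sizes in this release, while the unchanged `LICENSE` and `WHEEL` entries keep their hashes and differ only in the `dist-info` directory name.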