docling 0.3.0__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (26)
  1. {docling-0.3.0 → docling-0.4.0}/PKG-INFO +51 -27
  2. docling-0.4.0/README.md +129 -0
  3. {docling-0.3.0 → docling-0.4.0}/docling/datamodel/base_models.py +23 -2
  4. {docling-0.3.0 → docling-0.4.0}/docling/datamodel/document.py +2 -2
  5. {docling-0.3.0 → docling-0.4.0}/docling/models/page_assemble_model.py +0 -12
  6. {docling-0.3.0 → docling-0.4.0}/docling/models/table_structure_model.py +43 -11
  7. {docling-0.3.0 → docling-0.4.0}/docling/pipeline/standard_model_pipeline.py +1 -1
  8. {docling-0.3.0 → docling-0.4.0}/pyproject.toml +1 -1
  9. docling-0.3.0/README.md +0 -105
  10. {docling-0.3.0 → docling-0.4.0}/LICENSE +0 -0
  11. {docling-0.3.0 → docling-0.4.0}/docling/__init__.py +0 -0
  12. {docling-0.3.0 → docling-0.4.0}/docling/backend/__init__.py +0 -0
  13. {docling-0.3.0 → docling-0.4.0}/docling/backend/abstract_backend.py +0 -0
  14. {docling-0.3.0 → docling-0.4.0}/docling/backend/pypdfium2_backend.py +0 -0
  15. {docling-0.3.0 → docling-0.4.0}/docling/datamodel/__init__.py +0 -0
  16. {docling-0.3.0 → docling-0.4.0}/docling/datamodel/settings.py +0 -0
  17. {docling-0.3.0 → docling-0.4.0}/docling/document_converter.py +0 -0
  18. {docling-0.3.0 → docling-0.4.0}/docling/models/__init__.py +0 -0
  19. {docling-0.3.0 → docling-0.4.0}/docling/models/ds_glm_model.py +0 -0
  20. {docling-0.3.0 → docling-0.4.0}/docling/models/easyocr_model.py +0 -0
  21. {docling-0.3.0 → docling-0.4.0}/docling/models/layout_model.py +0 -0
  22. {docling-0.3.0 → docling-0.4.0}/docling/pipeline/__init__.py +0 -0
  23. {docling-0.3.0 → docling-0.4.0}/docling/pipeline/base_model_pipeline.py +0 -0
  24. {docling-0.3.0 → docling-0.4.0}/docling/utils/__init__.py +0 -0
  25. {docling-0.3.0 → docling-0.4.0}/docling/utils/layout_utils.py +0 -0
  26. {docling-0.3.0 → docling-0.4.0}/docling/utils/utils.py +0 -0
{docling-0.3.0 → docling-0.4.0}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: docling
- Version: 0.3.0
+ Version: 0.4.0
  Summary: Docling PDF conversion package
  Home-page: https://github.com/DS4SD/docling
  License: MIT
@@ -31,11 +31,20 @@ Project-URL: Repository, https://github.com/DS4SD/docling
  Description-Content-Type: text/markdown

  <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
  </p>

  # Docling

+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
  Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

  ## Features
@@ -44,25 +53,20 @@ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-co
  * 📝 Extracts metadata from the document, such as title, authors, references and language
  * 🔍 Optionally applies OCR (use with scanned PDFs)

- ## Setup
+ ## Installation

- For general usage, you can simply install `docling` through `pip` from the pypi package index.
- ```
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
  pip install docling
  ```

- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.

  ### Development setup

- To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
  ```bash
- poetry env use $(which python3.11)
- poetry shell
  poetry install
  ```

@@ -75,25 +79,45 @@ python examples/convert.py
  ```
  The output of the above command will be written to `./scratch`.

- ### Enable or disable pipeline features
+ ### Adjust pipeline features

- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
+ **Control pipeline options**
+
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
  ```python
  doc_converter = DocumentConverter(
      artifacts_path=artifacts_path,
-     pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-                                      do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
+     pipeline_options=PipelineOptions(
+         do_table_structure=False, # controls if table structure is recovered
+         do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+     ),
  )
  ```

- ### Impose limits on the document size
+ **Control table extraction options**
+
+ You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
+ This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+

- You can limit the file size and number of pages which should be allowed to process per document.
  ```python
- paths = [Path("./test/data/2206.01062.pdf")]

- input = DocumentConversionInput.from_paths(
-     paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ pipeline_options = PipelineOptions(do_table_structure=True)
+ pipeline_options.table_structure_options.do_cell_matching = False # Uses text cells predicted from table structure model
+
+ doc_converter = DocumentConverter(
+     artifacts_path=artifacts_path,
+     pipeline_options=pipeline_options,
+ )
+ ```
+
+ ### Impose limits on the document size
+
+ You can limit the file size and number of pages which should be allowed to process per document:
+ ```python
+ conv_input = DocumentConversionInput.from_paths(
+     paths=[Path("./test/data/2206.01062.pdf")],
+     limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
  )
  ```

@@ -103,12 +127,12 @@ You can convert PDFs from a binary stream instead of from the filesystem as foll
  ```python
  buf = BytesIO(your_binary_stream)
  docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
  ```
  ### Limit resource usage

- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.


  ## Contributing
@@ -118,7 +142,7 @@ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main

  ## References

- If you use `Docling` in your projects, please consider citing the following:
+ If you use Docling in your projects, please consider citing the following:

  ```bib
  @software{Docling,
@@ -133,6 +157,6 @@ year = {2024}

  ## License

- The `Docling` codebase is under MIT license.
+ The Docling codebase is under MIT license.
  For individual model usage, please refer to the model licenses found in the original packages.

docling-0.4.0/README.md ADDED
@@ -0,0 +1,129 @@
+ <p align="center">
+ <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" />
+ </p>
+
+ # Docling
+
+ [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
+ ![Python](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
+ [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+ [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
+ [![License MIT](https://img.shields.io/github/license/ds4sd/deepsearch-toolkit)](https://opensource.org/licenses/MIT)
+
+ Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
+
+ ## Features
+ * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
+ * 📑 Understands detailed page layout, reading order and recovers table structures
+ * 📝 Extracts metadata from the document, such as title, authors, references and language
+ * 🔍 Optionally applies OCR (use with scanned PDFs)
+
+ ## Installation
+
+ To use Docling, simply install `docling` from your package manager, e.g. pip:
+ ```bash
+ pip install docling
+ ```
+
+ > [!NOTE]
+ > Works on macOS and Linux environments. Windows platforms are currently not tested.
+
+ ### Development setup
+
+ To develop for Docling, you need Python 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+ ```bash
+ poetry install
+ ```
+
+ ## Usage
+
+ For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
+
+ ```
+ python examples/convert.py
+ ```
+ The output of the above command will be written to `./scratch`.
+
+ ### Adjust pipeline features
+
+ **Control pipeline options**
+
+ You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`:
+ ```python
+ doc_converter = DocumentConverter(
+     artifacts_path=artifacts_path,
+     pipeline_options=PipelineOptions(
+         do_table_structure=False, # controls if table structure is recovered
+         do_ocr=True, # controls if OCR is applied (ignores programmatic content)
+     ),
+ )
+ ```
+
+ **Control table extraction options**
+
+ You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself.
+ This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
+
+
+ ```python
+
+ pipeline_options = PipelineOptions(do_table_structure=True)
+ pipeline_options.table_structure_options.do_cell_matching = False # Uses text cells predicted from table structure model
+
+ doc_converter = DocumentConverter(
+     artifacts_path=artifacts_path,
+     pipeline_options=pipeline_options,
+ )
+ ```
+
+ ### Impose limits on the document size
+
+ You can limit the file size and number of pages which should be allowed to process per document:
+ ```python
+ conv_input = DocumentConversionInput.from_paths(
+     paths=[Path("./test/data/2206.01062.pdf")],
+     limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
+ )
+ ```
+
+ ### Convert from binary PDF streams
+
+ You can convert PDFs from a binary stream instead of from the filesystem as follows:
+ ```python
+ buf = BytesIO(your_binary_stream)
+ docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
+ conv_input = DocumentConversionInput.from_streams(docs)
+ converted_docs = doc_converter.convert(conv_input)
+ ```
+ ### Limit resource usage
+
+ You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
+
+
+ ## Contributing
+
+ Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
+
+
+ ## References
+
+ If you use Docling in your projects, please consider citing the following:
+
+ ```bib
+ @software{Docling,
+ author = {Deep Search Team},
+ month = {7},
+ title = {{Docling}},
+ url = {https://github.com/DS4SD/docling},
+ version = {main},
+ year = {2024}
+ }
+ ```
+
+ ## License
+
+ The Docling codebase is under MIT license.
+ For individual model usage, please refer to the model licenses found in the original packages.
{docling-0.3.0 → docling-0.4.0}/docling/datamodel/base_models.py
@@ -1,3 +1,4 @@
+ import copy
  from enum import Enum, auto
  from io import BytesIO
  from typing import Any, Dict, List, Optional, Tuple, Union
@@ -47,6 +48,15 @@ class BoundingBox(BaseModel):
      def height(self):
          return abs(self.t - self.b)

+     def scaled(self, scale: float) -> "BoundingBox":
+         out_bbox = copy.deepcopy(self)
+         out_bbox.l *= scale
+         out_bbox.r *= scale
+         out_bbox.t *= scale
+         out_bbox.b *= scale
+
+         return out_bbox
+
      def as_tuple(self):
          if self.coord_origin == CoordOrigin.TOPLEFT:
              return (self.l, self.t, self.r, self.b)
@@ -241,6 +251,17 @@ class DocumentStream(BaseModel):
      stream: BytesIO


+ class TableStructureOptions(BaseModel):
+     do_cell_matching: bool = (
+         True
+         # True: Matches predictions back to PDF cells. Can break table output if PDF cells
+         # are merged across table columns.
+         # False: Let table structure model define the text cells, ignore PDF cells.
+     )
+
+
  class PipelineOptions(BaseModel):
-     do_table_structure: bool = True
-     do_ocr: bool = False
+     do_table_structure: bool = True # True: perform table structure extraction
+     do_ocr: bool = False # True: perform OCR, replace programmatic PDF text
+
+     table_structure_options: TableStructureOptions = TableStructureOptions()
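The `scaled` method added in this file returns a scaled copy and leaves the box it was called on unchanged, which is what lets the table model scale coordinates up for prediction and back down afterwards. A minimal stand-alone sketch of that behavior follows; the `Box` dataclass is a hypothetical stand-in for the pydantic `BoundingBox` and omits `coord_origin`, and the coordinates are made up.

```python
import copy
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for docling's BoundingBox (no coord_origin)."""

    l: float
    t: float
    r: float
    b: float

    def scaled(self, scale: float) -> "Box":
        # Work on a copy so the caller's box is not mutated.
        out = copy.deepcopy(self)
        out.l *= scale
        out.r *= scale
        out.t *= scale
        out.b *= scale
        return out


box = Box(l=10.0, t=20.0, r=110.0, b=70.0)
up = box.scaled(2.0)       # coordinates in the scaled-up image space
back = up.scaled(1 / 2.0)  # mapped back to page coordinates
```

With a scale of 2.0 the round trip is exact for these values; for other scale factors, floating-point rounding may leave small differences.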
{docling-0.3.0 → docling-0.4.0}/docling/datamodel/document.py
@@ -117,9 +117,9 @@ class ConvertedDocument(BaseModel):
      errors: List[Dict] = [] # structure to keep errors

      pages: List[Page] = []
-     assembled: AssembledUnit = None
+     assembled: Optional[AssembledUnit] = None

-     output: DsDocument = None
+     output: Optional[DsDocument] = None

      def to_ds_document(self) -> DsDocument:
          title = ""
{docling-0.3.0 → docling-0.4.0}/docling/models/page_assemble_model.py
@@ -19,18 +19,6 @@ class PageAssembleModel:
      def __init__(self, config):
          self.config = config

-         # self.line_wrap_pattern = re.compile(r'(?<=[^\W_])- \n(?=\w)')
-
-     # def sanitize_text_poor(self, lines):
-     # text = '\n'.join(lines)
-     #
-     # # treat line wraps.
-     # sanitized_text = self.line_wrap_pattern.sub('', text)
-     #
-     # sanitized_text = sanitized_text.replace('\n', ' ')
-     #
-     # return sanitized_text
-
      def sanitize_text(self, lines):
          if len(lines) <= 1:
              return " ".join(lines)
{docling-0.3.0 → docling-0.4.0}/docling/models/table_structure_model.py
@@ -1,7 +1,10 @@
- from typing import Iterable
+ import copy
+ import random
+ from typing import Iterable, List

  import numpy
  from docling_ibm_models.tableformer.data_management.tf_predictor import TFPredictor
+ from PIL import ImageDraw

  from docling.datamodel.base_models import (
      BoundingBox,
@@ -28,6 +31,21 @@ class TableStructureModel:
              self.tm_model_type = self.tm_config["model"]["type"]

              self.tf_predictor = TFPredictor(self.tm_config)
+             self.scale = 2.0 # Scale up table input images to 144 dpi
+
+     def draw_table_and_cells(self, page: Page, tbl_list: List[TableElement]):
+         image = page._backend.get_page_image()
+         draw = ImageDraw.Draw(image)
+
+         for table_element in tbl_list:
+             x0, y0, x1, y1 = table_element.cluster.bbox.as_tuple()
+             draw.rectangle([(x0, y0), (x1, y1)], outline="red")
+
+             for tc in table_element.table_cells:
+                 x0, y0, x1, y1 = tc.bbox.as_tuple()
+                 draw.rectangle([(x0, y0), (x1, y1)], outline="blue")
+
+         image.show()

      def __call__(self, page_batch: Iterable[Page]) -> Iterable[Page]:

@@ -36,16 +54,17 @@ class TableStructureModel:
              return

          for page in page_batch:
+
              page.predictions.tablestructure = TableStructurePrediction() # dummy

              in_tables = [
                  (
                      cluster,
                      [
-                         round(cluster.bbox.l),
-                         round(cluster.bbox.t),
-                         round(cluster.bbox.r),
-                         round(cluster.bbox.b),
+                         round(cluster.bbox.l) * self.scale,
+                         round(cluster.bbox.t) * self.scale,
+                         round(cluster.bbox.r) * self.scale,
+                         round(cluster.bbox.b) * self.scale,
                      ],
                  )
                  for cluster in page.predictions.layout.clusters
@@ -65,20 +84,29 @@ class TableStructureModel:
                          ):
                              # Only allow non empty stings (spaces) into the cells of a table
                              if len(c.text.strip()) > 0:
-                                 tokens.append(c.model_dump())
+                                 new_cell = copy.deepcopy(c)
+                                 new_cell.bbox = new_cell.bbox.scaled(scale=self.scale)
+
+                                 tokens.append(new_cell.model_dump())

-             iocr_page = {
-                 "image": numpy.asarray(page.image),
+             page_input = {
                  "tokens": tokens,
-                 "width": page.size.width,
-                 "height": page.size.height,
+                 "width": page.size.width * self.scale,
+                 "height": page.size.height * self.scale,
              }
+             # add image to page input.
+             if self.scale == 1.0:
+                 page_input["image"] = numpy.asarray(page.image)
+             else: # render new page image on the fly at desired scale
+                 page_input["image"] = numpy.asarray(
+                     page._backend.get_page_image(scale=self.scale)
+                 )

              table_clusters, table_bboxes = zip(*in_tables)

              if len(table_bboxes):
                  tf_output = self.tf_predictor.multi_table_predict(
-                     iocr_page, table_bboxes, do_matching=self.do_cell_matching
+                     page_input, table_bboxes, do_matching=self.do_cell_matching
                  )

                  for table_cluster, table_out in zip(table_clusters, tf_output):
@@ -91,6 +119,7 @@ class TableStructureModel:
                              element["bbox"]["token"] = text_piece

                          tc = TableCell.model_validate(element)
+                         tc.bbox = tc.bbox.scaled(1 / self.scale)
                          table_cells.append(tc)

                      # Retrieving cols/rows, after post processing:
@@ -111,4 +140,7 @@ class TableStructureModel:

                      page.predictions.tablestructure.table_map[table_cluster.id] = tbl

+             # For debugging purposes:
+             # self.draw_table_and_cells(page, page.predictions.tablestructure.table_map.values())
+
              yield page
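In the `in_tables` hunk of this file, each table bbox edge is rounded first and then multiplied by `self.scale`, so the scaled coordinates always land on multiples of the scale factor. The sketch below uses made-up coordinates to show how that differs from rounding after scaling; whether the difference matters for the table structure model's input is not something this diff establishes.

```python
scale = 2.0  # same factor the diff assigns to self.scale
coords = [10.6, 20.4, 110.2, 70.49]  # hypothetical bbox edges in page units

# Order used in the diff: round in page space, then scale.
round_then_scale = [round(c) * scale for c in coords]

# Alternative order, for comparison only: scale, then round.
scale_then_round = [round(c * scale) for c in coords]

print(round_then_scale)  # [22.0, 40.0, 220.0, 140.0]
print(scale_then_round)  # [21, 41, 220, 141]
```

The two orders differ by at most one scaled pixel per edge in this example.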
{docling-0.3.0 → docling-0.4.0}/docling/pipeline/standard_model_pipeline.py
@@ -34,7 +34,7 @@ class StandardModelPipeline(BaseModelPipeline):
                      "artifacts_path": artifacts_path
                      / StandardModelPipeline._table_model_path,
                      "enabled": pipeline_options.do_table_structure,
-                     "do_cell_matching": False,
+                     "do_cell_matching": pipeline_options.table_structure_options.do_cell_matching,
                  }
              ),
          ]
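The one-line change in this file stops hard-coding `do_cell_matching` to `False` and reads it from the new `TableStructureOptions`, whose default in this release is `True`. A minimal sketch of that wiring is shown below, with dataclasses standing in for the pydantic models and a hypothetical `build_table_config` helper in place of the pipeline constructor.

```python
from dataclasses import dataclass, field


@dataclass
class TableStructureOptions:
    # Default taken from the base_models.py hunk in this diff.
    do_cell_matching: bool = True


@dataclass
class PipelineOptions:
    do_table_structure: bool = True
    do_ocr: bool = False
    table_structure_options: TableStructureOptions = field(
        default_factory=TableStructureOptions
    )


def build_table_config(opts: PipelineOptions) -> dict:
    # Hypothetical helper mirroring the dict handed to TableStructureModel.
    return {
        "enabled": opts.do_table_structure,
        "do_cell_matching": opts.table_structure_options.do_cell_matching,
    }


default_cfg = build_table_config(PipelineOptions())

opts = PipelineOptions()
opts.table_structure_options.do_cell_matching = False
custom_cfg = build_table_config(opts)
```

On this reading of the diff, callers who want the cell-matching behavior that 0.3.0 hard-coded would need to set `do_cell_matching = False` explicitly, as the new README section does.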
{docling-0.3.0 → docling-0.4.0}/pyproject.toml
@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "docling"
- version = "0.3.0" # DO NOT EDIT, updated automatically
+ version = "0.4.0" # DO NOT EDIT, updated automatically
  description = "Docling PDF conversion package"
  authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
  license = "MIT"
docling-0.3.0/README.md DELETED
@@ -1,105 +0,0 @@
- <p align="center">
- <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
- </p>
-
- # Docling
-
- Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.
-
- ## Features
- * ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
- * 📑 Understands detailed page layout, reading order and recovers table structures
- * 📝 Extracts metadata from the document, such as title, authors, references and language
- * 🔍 Optionally applies OCR (use with scanned PDFs)
-
- ## Setup
-
- For general usage, you can simply install `docling` through `pip` from the pypi package index.
- ```
- pip install docling
- ```
-
- **Notes**:
- * Works on macOS and Linux environments. Windows platforms are currently not tested.
-
- ### Development setup
-
- To develop for `docling`, you need Python 3.11 and `poetry`. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).
-
- Once you have `poetry` installed and cloned this repo, create an environment and install `docling` from the repo root:
-
- ```bash
- poetry env use $(which python3.11)
- poetry shell
- poetry install
- ```
-
- ## Usage
-
- For basic usage, see the [convert.py](https://github.com/DS4SD/docling/blob/main/examples/convert.py) example module. Run with:
-
- ```
- python examples/convert.py
- ```
- The output of the above command will be written to `./scratch`.
-
- ### Enable or disable pipeline features
-
- You can control if table structure recognition or OCR should be performed by arguments passed to `DocumentConverter`
- ```python
- doc_converter = DocumentConverter(
-     artifacts_path=artifacts_path,
-     pipeline_options=PipelineOptions(do_table_structure=False, # Controls if table structure is recovered.
-                                      do_ocr=True), # Controls if OCR is applied (ignores programmatic content)
- )
- ```
-
- ### Impose limits on the document size
-
- You can limit the file size and number of pages which should be allowed to process per document.
- ```python
- paths = [Path("./test/data/2206.01062.pdf")]
-
- input = DocumentConversionInput.from_paths(
-     paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
- )
- ```
-
- ### Convert from binary PDF streams
-
- You can convert PDFs from a binary stream instead of from the filesystem as follows:
- ```python
- buf = BytesIO(your_binary_stream)
- docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
- input = DocumentConversionInput.from_streams(docs)
- converted_docs = doc_converter.convert(input)
- ```
- ### Limit resource usage
-
- You can limit the CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
-
-
- ## Contributing
-
- Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
-
-
- ## References
-
- If you use `Docling` in your projects, please consider citing the following:
-
- ```bib
- @software{Docling,
- author = {Deep Search Team},
- month = {7},
- title = {{Docling}},
- url = {https://github.com/DS4SD/docling},
- version = {main},
- year = {2024}
- }
- ```
-
- ## License
-
- The `Docling` codebase is under MIT license.
- For individual model usage, please refer to the model licenses found in the original packages.