magic-pdf 0.6.2b1__py3-none-any.whl → 0.7.0b1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. magic_pdf/dict2md/ocr_mkcontent.py +10 -3
  2. magic_pdf/libs/Constants.py +4 -1
  3. magic_pdf/libs/config_reader.py +10 -10
  4. magic_pdf/libs/draw_bbox.py +66 -1
  5. magic_pdf/libs/ocr_content_type.py +14 -0
  6. magic_pdf/libs/version.py +1 -1
  7. magic_pdf/model/doc_analyze_by_custom_model.py +10 -4
  8. magic_pdf/model/magic_model.py +4 -0
  9. magic_pdf/model/pdf_extract_kit.py +83 -39
  10. magic_pdf/model/pek_sub_modules/structeqtable/StructTableModel.py +22 -0
  11. magic_pdf/resources/model_config/model_configs.yaml +4 -0
  12. magic_pdf/rw/AbsReaderWriter.py +1 -18
  13. magic_pdf/rw/DiskReaderWriter.py +32 -24
  14. magic_pdf/rw/S3ReaderWriter.py +83 -48
  15. magic_pdf/tools/cli.py +79 -0
  16. magic_pdf/tools/cli_dev.py +155 -0
  17. magic_pdf/tools/common.py +122 -0
  18. magic_pdf-0.7.0b1.dist-info/METADATA +421 -0
  19. {magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/RECORD +25 -27
  20. {magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/WHEEL +1 -1
  21. magic_pdf-0.7.0b1.dist-info/entry_points.txt +3 -0
  22. magic_pdf/cli/magicpdf.py +0 -359
  23. magic_pdf/pdf_parse_for_train.py +0 -685
  24. magic_pdf/train_utils/convert_to_train_format.py +0 -65
  25. magic_pdf/train_utils/extract_caption.py +0 -59
  26. magic_pdf/train_utils/remove_footer_header.py +0 -159
  27. magic_pdf/train_utils/vis_utils.py +0 -327
  28. magic_pdf-0.6.2b1.dist-info/METADATA +0 -344
  29. magic_pdf-0.6.2b1.dist-info/entry_points.txt +0 -2
  30. /magic_pdf/{cli → model/pek_sub_modules/structeqtable}/__init__.py +0 -0
  31. /magic_pdf/{train_utils → tools}/__init__.py +0 -0
  32. {magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/LICENSE.md +0 -0
  33. {magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/top_level.txt +0 -0
magic_pdf-0.6.2b1.dist-info/METADATA +0 -344
@@ -1,344 +0,0 @@
- Metadata-Version: 2.1
- Name: magic-pdf
- Version: 0.6.2b1
- Summary: A practical tool for converting PDF to Markdown
- Home-page: https://github.com/opendatalab/MinerU
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown
- License-File: LICENSE.md
- Requires-Dist: boto3 >=1.28.43
- Requires-Dist: Brotli >=1.1.0
- Requires-Dist: click >=8.1.7
- Requires-Dist: PyMuPDF >=1.24.9
- Requires-Dist: loguru >=0.6.0
- Requires-Dist: numpy <2.0.0,>=1.21.6
- Requires-Dist: fast-langdetect ==0.2.0
- Requires-Dist: wordninja >=2.0.0
- Requires-Dist: scikit-learn >=1.0.2
- Requires-Dist: pdfminer.six ==20231228
- Provides-Extra: full
- Requires-Dist: unimernet ==0.1.6 ; extra == 'full'
- Requires-Dist: matplotlib ; extra == 'full'
- Requires-Dist: ultralytics ; extra == 'full'
- Requires-Dist: paddleocr ==2.7.3 ; extra == 'full'
- Requires-Dist: detectron2 ; extra == 'full'
- Requires-Dist: paddlepaddle ==3.0.0b1 ; (platform_system == "Linux") and extra == 'full'
- Requires-Dist: paddlepaddle ==2.6.1 ; (platform_system == "Windows" or platform_system == "Darwin") and extra == 'full'
- Provides-Extra: lite
- Requires-Dist: paddleocr ==2.7.3 ; extra == 'lite'
- Requires-Dist: paddlepaddle ==3.0.0b1 ; (platform_system == "Linux") and extra == 'lite'
- Requires-Dist: paddlepaddle ==2.6.1 ; (platform_system == "Windows" or platform_system == "Darwin") and extra == 'lite'
-
- <div id="top">
-
- <p align="center">
- <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
- </p>
-
- </div>
- <div align="center">
-
- [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
- [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
- [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
- [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
- [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
- [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
- [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
-
- <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
-
-
-
-
- [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
-
- </div>
-
- <div align="center">
- <p align="center">
- <a href="https://github.com/opendatalab/MinerU">MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.</a>🚀🚀🚀<br>
- <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction</a>🔥🔥🔥
- </p>
-
- <p align="center">
- 👋 join us on <a href="https://discord.gg/AsQMhuMN" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
- </p>
- </div>
-
- # MinerU
-
-
- ## Introduction
-
- MinerU is a one-stop, open-source, high-quality data extraction tool, includes the following primary features:
-
- - [Magic-PDF](#Magic-PDF) PDF Document Extraction
- - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
-
-
- # Magic-PDF
-
-
- ## Introduction
-
- Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
-
- Key features include:
-
- - Support for multiple front-end model inputs
- - Removal of headers, footers, footnotes, and page numbers
- - Human-readable layout formatting
- - Retains the original document's structure and formatting, including headings, paragraphs, lists, and more
- - Extraction and display of images and tables within markdown
- - Conversion of equations into LaTeX format
- - Automatic detection and conversion of garbled PDFs
- - Compatibility with CPU and GPU environments
- - Available for Windows, Linux, and macOS platforms
-
-
- https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
-
-
-
- ## Project Panorama
-
- ![Project Panorama](docs/images/project_panorama_en.png)
-
-
- ## Flowchart
-
- ![Flowchart](docs/images/flowchart_en.png)
-
- ### Dependency repositorys
-
- - [PDF-Extract-Kit : A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀
-
- ## Getting Started
-
- ### Requirements
-
- - Python >= 3.9
-
- Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable.
- For example:
- ```bash
- conda create -n MinerU python=3.10
- conda activate MinerU
- ```
-
- ### Installation and Configuration
-
- #### 1. Install Magic-PDF
-
- Install the full-feature package with pip:
- >Note: The pip-installed package supports CPU-only and is ideal for quick tests.
- >
- >For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
-
- ```bash
- pip install magic-pdf[full-cpu]
- ```
- The full-feature package depends on detectron2, which requires a compilation installation.
- If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
- Alternatively, you can directly use our precompiled whl package (limited to Python 3.10):
-
- ```bash
- pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
- ```
-
-
- #### 2. Downloading model weights files
-
- For detailed references, please see below [how_to_download_models](docs/how_to_download_models_en.md)
-
- After downloading the model weights, move the 'models' directory to a directory on a larger disk space, preferably an SSD.
-
-
- #### 3. Copy the Configuration File and Make Configurations
- You can get the [magic-pdf.template.json](magic-pdf.template.json) file in the repository root directory.
- ```bash
- cp magic-pdf.template.json ~/magic-pdf.json
- ```
- In magic-pdf.json, configure "models-dir" to point to the directory where the model weights files are located.
-
- ```json
- {
- "models-dir": "/tmp/models"
- }
- ```
-
-
- #### 4. Acceleration Using CUDA or MPS
- If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can leverage acceleration with CUDA or MPS respectively.
- ##### CUDA
-
- You need to install the corresponding PyTorch version according to your CUDA version.
- This example installs the CUDA 11.8 version.More information https://pytorch.org/get-started/locally/
- ```bash
- pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
- ```
- Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
- ```json
- {
- "device-mode":"cuda"
- }
- ```
-
- ##### MPS
-
- For macOS users with M-series chip devices, you can use MPS for inference acceleration.
- You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.
- ```json
- {
- "device-mode":"mps"
- }
- ```
-
-
- ### Usage
-
- #### 1.Usage via Command Line
-
- ###### simple
-
- ```bash
- magic-pdf pdf-command --pdf "pdf_path" --inside_model true
- ```
- After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
- You can find the corresponding xxx_model.json file in the markdown directory.
- If you intend to do secondary development on the post-processing pipeline, you can use the command:
- ```bash
- magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
- ```
- In this way, you won't need to re-run the model data, making debugging more convenient.
-
-
- ###### more
-
- ```bash
- magic-pdf --help
- ```
-
-
- #### 2. Usage via Api
-
- ###### Local
- ```python
- image_writer = DiskReaderWriter(local_image_dir)
- image_dir = str(os.path.basename(local_image_dir))
- jso_useful_key = {"_pdf_type": "", "model_list": []}
- pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
- pipe.pipe_classify()
- pipe.pipe_parse()
- md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
- ```
-
- ###### Object Storage
- ```python
- s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
- image_dir = "s3://img_bucket/"
- s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
- pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
- jso_useful_key = {"_pdf_type": "", "model_list": []}
- pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
- pipe.pipe_classify()
- pipe.pipe_parse()
- md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
- ```
-
- Demo can be referred to [demo.py](demo/demo.py)
-
-
- # Magic-Doc
-
-
- ## Introduction
-
- Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
-
- Key Features Include:
-
- - Web Page Extraction
- - Cross-modal precise parsing of text, images, tables, and formula information.
-
- - E-Book Document Extraction
- - Supports various document formats including epub, mobi, with full adaptation for text and images.
-
- - Language Type Identification
- - Accurate recognition of 176 languages.
-
- https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
-
-
-
- https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
-
-
-
- https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
-
-
-
-
- ## Project Repository
-
- - [Magic-Doc](https://github.com/InternLM/magic-doc)
- Outstanding Webpage and E-book Extraction Tool
-
-
- # All Thanks To Our Contributors
-
- <a href="https://github.com/opendatalab/MinerU/graphs/contributors">
- <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
- </a>
-
-
- # License Information
-
- [LICENSE.md](LICENSE.md)
-
- The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
-
-
- # Acknowledgments
-
- - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
-
-
- # Citation
-
- ```bibtex
- @article{he2024opendatalab,
- title={Opendatalab: Empowering general artificial intelligence with open datasets},
- author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
- journal={arXiv preprint arXiv:2407.13773},
- year={2024}
- }
-
- @misc{2024mineru,
- title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
- author={MinerU Contributors},
- howpublished = {\url{https://github.com/opendatalab/MinerU}},
- year={2024}
- }
- ```
-
-
- # Star History
-
- <a>
- <picture>
- <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
- <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
- <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
- </picture>
- </a>
-
- # Links
- - [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- - [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- - [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
magic_pdf-0.6.2b1.dist-info/entry_points.txt +0 -2
@@ -1,2 +0,0 @@
- [console_scripts]
- magic-pdf = magic_pdf.cli.magicpdf:cli
File without changes