magic-pdf 0.7.0a1__py3-none-any.whl → 0.7.1__py3-none-any.whl

@@ -1,362 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: magic-pdf
3
- Version: 0.7.0a1
4
- Summary: A practical tool for converting PDF to Markdown
5
- Home-page: https://github.com/opendatalab/MinerU
6
- Requires-Python: >=3.9
7
- Description-Content-Type: text/markdown
8
- License-File: LICENSE.md
9
- Requires-Dist: boto3>=1.28.43
10
- Requires-Dist: Brotli>=1.1.0
11
- Requires-Dist: click>=8.1.7
12
- Requires-Dist: PyMuPDF>=1.24.9
13
- Requires-Dist: loguru>=0.6.0
14
- Requires-Dist: numpy<2.0.0,>=1.21.6
15
- Requires-Dist: fast-langdetect==0.2.0
16
- Requires-Dist: wordninja>=2.0.0
17
- Requires-Dist: scikit-learn>=1.0.2
18
- Requires-Dist: pdfminer.six==20231228
19
- Provides-Extra: full
20
- Requires-Dist: unimernet==0.1.6; extra == "full"
21
- Requires-Dist: ultralytics; extra == "full"
22
- Requires-Dist: paddleocr==2.7.3; extra == "full"
23
- Requires-Dist: pypandoc; extra == "full"
24
- Requires-Dist: struct-eqtable==0.1.0; extra == "full"
25
- Requires-Dist: detectron2; extra == "full"
26
- Requires-Dist: paddlepaddle==3.0.0b1; platform_system == "Linux" and extra == "full"
27
- Requires-Dist: matplotlib; (platform_system == "Linux" or platform_system == "Darwin") and extra == "full"
28
- Requires-Dist: matplotlib<=3.9.0; platform_system == "Windows" and extra == "full"
29
- Requires-Dist: paddlepaddle==2.6.1; (platform_system == "Windows" or platform_system == "Darwin") and extra == "full"
30
- Provides-Extra: lite
31
- Requires-Dist: paddleocr==2.7.3; extra == "lite"
32
- Requires-Dist: paddlepaddle==3.0.0b1; platform_system == "Linux" and extra == "lite"
33
- Requires-Dist: paddlepaddle==2.6.1; (platform_system == "Windows" or platform_system == "Darwin") and extra == "lite"
34
-
35
- <div id="top">
36
-
37
- <p align="center">
38
- <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
39
- </p>
40
-
41
- </div>
42
- <div align="center">
43
-
44
- [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
45
- [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
46
- [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
47
- [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
48
- [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
49
- [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
50
- [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
51
-
52
- <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
53
-
54
-
55
-
56
-
57
- [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
58
-
59
- </div>
60
-
61
- <div align="center">
62
- <p align="center">
63
- <a href="https://github.com/opendatalab/MinerU">MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.</a>🚀🚀🚀<br>
64
- <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction</a>🔥🔥🔥
65
- </p>
66
-
67
- <p align="center">
68
- 👋 join us on <a href="https://discord.gg/gPxmVeGC" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
69
- </p>
70
- </div>
71
-
72
- # MinerU
73
-
74
-
75
- ## Introduction
76
-
77
- MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:
78
-
79
- - [Magic-PDF](#Magic-PDF) PDF Document Extraction
80
- - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
81
-
82
-
83
- # Magic-PDF
84
-
85
-
86
- ## Introduction
87
-
88
- Magic-PDF is a tool designed to convert PDF documents into Markdown format. It can process files stored locally or on object storage that supports the S3 protocol.
89
-
90
- Key features include:
91
-
92
- - Support for multiple front-end model inputs
93
- - Removal of headers, footers, footnotes, and page numbers
94
- - Human-readable layout formatting
95
- - Preservation of the original document's structure and formatting, including headings, paragraphs, lists, and more
96
- - Extraction and display of images and tables within markdown
97
- - Conversion of equations into LaTeX format
98
- - Automatic detection and conversion of garbled PDFs
99
- - Compatibility with CPU and GPU environments
100
- - Available for Windows, Linux, and macOS platforms
101
-
102
-
103
- https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
104
-
105
-
106
-
107
- ## Project Panorama
108
-
109
- ![Project Panorama](docs/images/project_panorama_en.png)
110
-
111
-
112
- ## Flowchart
113
-
114
- ![Flowchart](docs/images/flowchart_en.png)
115
-
116
- ### Dependency repositories
117
-
118
- - [PDF-Extract-Kit : A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀
119
-
120
- ## Getting Started
121
-
122
- ### Requirements
123
-
124
- - Python >= 3.9
125
-
126
- Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable.
127
- For example:
128
- ```bash
129
- conda create -n MinerU python=3.10
130
- conda activate MinerU
131
- ```
132
-
133
- ### Installation and Configuration
134
-
135
- #### 1. Install Magic-PDF
136
-
137
- **1. Install dependencies**
138
-
139
- The full-feature package depends on detectron2, which must be compiled during installation.
140
- If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
141
- Alternatively, you can install our precompiled wheel directly (limited to Python 3.10):
142
-
143
- ```bash
144
- pip install detectron2 --extra-index-url https://wheels.myhloli.com
145
- ```
146
-
147
- **2. Install the full-feature package with pip**
149
- >Note: The pip-installed package supports CPU only and is ideal for quick tests.
149
- >
150
- >For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
151
-
152
- ```bash
153
- pip install magic-pdf[full]==0.6.2b1
154
- ```
155
- > ❗️❗️❗️
156
- > We have pre-released the 0.6.2 beta version, addressing numerous issues mentioned in our logs. However, this build has not undergone full QA testing and does not represent the final release quality. Should you encounter any problems, please promptly report them to us via issues or revert to using version 0.6.1.
157
- > ```bash
158
- > pip install magic-pdf[full-cpu]==0.6.1
159
- > ```
160
-
161
-
162
-
163
- #### 2. Download the model weight files
164
-
165
- For details, see [how_to_download_models](docs/how_to_download_models_en.md).
166
-
167
- After downloading the model weights, move the 'models' directory to a disk with ample free space, preferably an SSD.
168
-
169
-
170
- #### 3. Copy the Configuration File and Make Configurations
171
- You can find the [magic-pdf.template.json](magic-pdf.template.json) file in the repository's root directory.
172
- ```bash
173
- cp magic-pdf.template.json ~/magic-pdf.json
174
- ```
175
- In magic-pdf.json, configure "models-dir" to point to the directory where the model weights files are located.
176
-
177
- ```json
178
- {
179
- "models-dir": "/tmp/models"
180
- }
181
- ```
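As a minimal sketch of the configuration step above, the following Python checks that a config file contains "models-dir" and that the entry points to an existing directory. The helper name `load_models_dir` is hypothetical and not part of magic-pdf; only the `~/magic-pdf.json` location and the "models-dir" key come from the instructions above.

```python
import json
from pathlib import Path


def load_models_dir(config_path: Path) -> Path:
    """Return the "models-dir" path from a magic-pdf.json-style config.

    Hypothetical helper for illustration; not part of the magic-pdf API.
    """
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    if "models-dir" not in config:
        raise KeyError('"models-dir" is missing from the config')
    models_dir = Path(config["models-dir"]).expanduser()
    if not models_dir.is_dir():
        raise FileNotFoundError(f"models-dir does not exist: {models_dir}")
    return models_dir


if __name__ == "__main__":
    # Assumes the template was copied to the home directory as shown above.
    print(load_models_dir(Path.home() / "magic-pdf.json"))
```

Running a check like this before the first conversion surfaces a wrong path early, rather than partway through model loading.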
182
-
183
-
184
- #### 4. Acceleration Using CUDA or MPS
185
- If you have an NVIDIA GPU or are using a Mac with Apple Silicon, you can use CUDA or MPS acceleration, respectively.
186
- ##### CUDA
187
-
188
- You need to install the corresponding PyTorch version according to your CUDA version.
189
- This example installs the CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/
190
- ```bash
191
- pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
192
- ```
193
- > ❗️ Make sure to specify the versions
194
- > ```bash
195
- > torch==2.3.1 torchvision==0.18.1
196
- > ```
197
- > in the command, as these are the highest versions we support. Failing to specify the versions may result in automatically installing higher versions which can cause the program to fail.
198
-
199
- Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
200
- ```json
201
- {
202
- "device-mode":"cuda"
203
- }
204
- ```
205
-
206
- ##### MPS
207
-
208
- On macOS devices with M-series chips, you can use MPS for inference acceleration.
209
- You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.
210
- ```json
211
- {
212
- "device-mode":"mps"
213
- }
214
- ```
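Both acceleration modes come down to a single key in magic-pdf.json. The sketch below rewrites "device-mode" in place while keeping the other keys. `set_device_mode` is a hypothetical helper, not part of magic-pdf, and the "cpu" value is an assumption, since the text above only shows "cuda" and "mps".

```python
import json
from pathlib import Path

# "cuda" and "mps" appear in the README; "cpu" is an assumed fallback value.
ALLOWED_MODES = ("cpu", "cuda", "mps")


def set_device_mode(config_path: Path, mode: str) -> dict:
    """Set "device-mode" in a magic-pdf.json-style file and return the config.

    Hypothetical helper for illustration; not part of the magic-pdf API.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported device-mode: {mode!r}")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    # Write the whole config back so existing keys such as "models-dir" survive.
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

For example, `set_device_mode(Path.home() / "magic-pdf.json", "cuda")` would switch an existing config to CUDA without touching "models-dir".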
215
-
216
-
217
- ### Usage
218
-
219
- #### 1. Usage via Command Line
220
-
221
- ###### simple
222
-
223
- ```bash
224
- magic-pdf pdf-command --pdf "pdf_path" --inside_model true
225
- ```
226
- After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
227
- You can find the corresponding xxx_model.json file in the markdown directory.
228
- If you intend to do secondary development on the post-processing pipeline, you can use the command:
229
- ```bash
230
- magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
231
- ```
232
- This way, you don't need to re-run the model, which makes debugging more convenient.
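The two invocations above differ only in whether a saved model JSON is passed. As a rough sketch, a script could build the argument list like this. `build_pdf_command` and `run_pdf_command` are hypothetical helpers, and only the flags shown in the commands above are used.

```python
import subprocess
from typing import List, Optional


def build_pdf_command(pdf_path: str, model_json_path: Optional[str] = None) -> List[str]:
    """Build the argument list for the two invocations shown above.

    Hypothetical helper; only flags that appear in the README are used.
    """
    cmd = ["magic-pdf", "pdf-command", "--pdf", pdf_path]
    if model_json_path is None:
        # First run: let the built-in model analyze the PDF.
        cmd += ["--inside_model", "true"]
    else:
        # Later runs: reuse the saved xxx_model.json instead of re-running the model.
        cmd += ["--model", model_json_path]
    return cmd


def run_pdf_command(pdf_path: str, model_json_path: Optional[str] = None) -> int:
    """Run the command and return its exit code (requires magic-pdf on PATH)."""
    return subprocess.run(build_pdf_command(pdf_path, model_json_path)).returncode
```

Passing the arguments as a list avoids shell quoting problems when the PDF path contains spaces.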
233
-
234
-
235
- ###### more
236
-
237
- ```bash
238
- magic-pdf --help
239
- ```
240
-
241
-
242
- #### 2. Usage via API
243
-
244
- ###### Local
245
- ```python
246
- image_writer = DiskReaderWriter(local_image_dir)
247
- image_dir = str(os.path.basename(local_image_dir))
248
- jso_useful_key = {"_pdf_type": "", "model_list": []}
249
- pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
250
- pipe.pipe_classify()
251
- pipe.pipe_parse()
252
- md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
253
- ```
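The local example above leaves out the file handling around the pipeline. The sketch below shows one way to supply `pdf_bytes` and `local_image_dir` and to save `md_content`. `convert_file` and its `convert` callback are hypothetical names: the callback stands in for the UNIPipe calls shown above and is assumed to return the Markdown as a string, so nothing here is part of the magic-pdf API.

```python
import os
from typing import Callable


def convert_file(pdf_path: str, out_dir: str,
                 convert: Callable[[bytes, str], str]) -> str:
    """Read a PDF, run `convert`, and write the Markdown beside an images folder.

    `convert(pdf_bytes, local_image_dir)` is a hypothetical callback that would
    wrap the UNIPipe steps from the README and return the Markdown text.
    """
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    local_image_dir = os.path.join(out_dir, "images")
    os.makedirs(local_image_dir, exist_ok=True)
    md_content = convert(pdf_bytes, local_image_dir)
    # Name the output after the input file, e.g. report.pdf -> report.md.
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    md_path = os.path.join(out_dir, f"{stem}.md")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write(md_content)
    return md_path
```

Keeping the pipeline behind a callback means the file handling can be tested with a stub before any model weights are downloaded.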
254
-
255
- ###### Object Storage
256
- ```python
257
- s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
258
- image_dir = "s3://img_bucket/"
259
- s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
260
- pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
261
- jso_useful_key = {"_pdf_type": "", "model_list": []}
262
- pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
263
- pipe.pipe_classify()
264
- pipe.pipe_parse()
265
- md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
266
- ```
267
-
268
- For a complete demo, see [demo.py](demo/demo.py).
269
-
270
-
271
- # Magic-Doc
272
-
273
-
274
- ## Introduction
275
-
276
- Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
277
-
278
- Key Features Include:
279
-
280
- - Web Page Extraction
281
- - Cross-modal precise parsing of text, images, tables, and formula information.
282
-
283
- - E-Book Document Extraction
284
- - Supports various document formats, including epub and mobi, with full adaptation for text and images.
285
-
286
- - Language Type Identification
287
- - Accurate recognition of 176 languages.
288
-
289
- https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
290
-
291
-
292
-
293
- https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
294
-
295
-
296
-
297
- https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
298
-
299
-
300
-
301
-
302
- ## Project Repository
303
-
304
- - [Magic-Doc](https://github.com/InternLM/magic-doc)
305
- Outstanding Webpage and E-book Extraction Tool
306
-
307
-
308
- # All Thanks To Our Contributors
309
-
310
- <a href="https://github.com/opendatalab/MinerU/graphs/contributors">
311
- <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
312
- </a>
313
-
314
-
315
- # License Information
316
-
317
- [LICENSE.md](LICENSE.md)
318
-
319
- The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
320
-
321
-
322
- # Acknowledgments
323
-
324
- - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
325
- - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
326
- - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
327
- - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
328
-
329
-
330
- # Citation
331
-
332
- ```bibtex
333
- @article{he2024opendatalab,
334
- title={Opendatalab: Empowering general artificial intelligence with open datasets},
335
- author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
336
- journal={arXiv preprint arXiv:2407.13773},
337
- year={2024}
338
- }
339
-
340
- @misc{2024mineru,
341
- title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
342
- author={MinerU Contributors},
343
- howpublished = {\url{https://github.com/opendatalab/MinerU}},
344
- year={2024}
345
- }
346
- ```
347
-
348
-
349
- # Star History
350
-
351
- <a>
352
- <picture>
353
- <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
354
- <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
355
- <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
356
- </picture>
357
- </a>
358
-
359
- # Links
360
- - [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
361
- - [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
362
- - [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)