magic-pdf 0.6.0__py3-none-any.whl → 0.6.2b1__py3-none-any.whl

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
@@ -1,241 +0,0 @@
- Metadata-Version: 2.1
- Name: magic-pdf
- Version: 0.6.0
- Summary: A practical tool for converting PDF to Markdown
- Home-page: https://github.com/opendatalab/MinerU
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown
- License-File: LICENSE.md
- Requires-Dist: boto3 >=1.28.43
- Requires-Dist: Brotli >=1.1.0
- Requires-Dist: click >=8.1.7
- Requires-Dist: PyMuPDF >=1.24.7
- Requires-Dist: loguru >=0.6.0
- Requires-Dist: numpy >=1.21.6
- Requires-Dist: fast-langdetect >=0.2.1
- Requires-Dist: wordninja >=2.0.0
- Requires-Dist: scikit-learn >=1.0.2
- Requires-Dist: pdfminer.six >=20231228
- Provides-Extra: cpu
- Requires-Dist: paddleocr ==2.7.3 ; extra == 'cpu'
- Requires-Dist: paddlepaddle ; extra == 'cpu'
- Provides-Extra: full-cpu
- Requires-Dist: unimernet ; extra == 'full-cpu'
- Requires-Dist: matplotlib ; extra == 'full-cpu'
- Requires-Dist: ultralytics ; extra == 'full-cpu'
- Requires-Dist: paddleocr ==2.7.3 ; extra == 'full-cpu'
- Requires-Dist: paddlepaddle ; extra == 'full-cpu'
- Provides-Extra: gpu
- Requires-Dist: paddleocr ==2.7.3 ; extra == 'gpu'
- Requires-Dist: paddlepaddle-gpu ; extra == 'gpu'
-
- <div id="top"></div>
- <div align="center">
-
- [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
- [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
- [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
- [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
- [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
- [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
- [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
-
-
-
-
- [English](README.md) | [简体中文](README_zh-CN.md)
-
- </div>
-
- <div align="center">
-
- </div>
-
- # MinerU
-
-
- ## Introduction
-
- MinerU is a one-stop, open-source, high-quality data extraction tool, includes the following primary features:
-
- - [Magic-PDF](#Magic-PDF) PDF Document Extraction
- - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
-
-
- # Magic-PDF
-
-
- ## Introduction
-
- Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
-
- Key features include:
-
- - Support for multiple front-end model inputs
- - Removal of headers, footers, footnotes, and page numbers
- - Human-readable layout formatting
- - Retains the original document's structure and formatting, including headings, paragraphs, lists, and more
- - Extraction and display of images and tables within markdown
- - Conversion of equations into LaTeX format
- - Automatic detection and conversion of garbled PDFs
- - Compatibility with CPU and GPU environments
- - Available for Windows, Linux, and macOS platforms
-
-
- https://github.com/opendatalab/MinerU/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070
-
-
-
- ## Project Panorama
-
- ![Project Panorama](docs/images/project_panorama_en.png)
-
-
- ## Flowchart
-
- ![Flowchart](docs/images/flowchart_en.png)
-
- ### Submodule Repositories
-
- - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- - A Comprehensive Toolkit for High-Quality PDF Content Extraction
-
- ## Getting Started
-
- ### Requirements
-
- - Python >= 3.9
-
- ### Usage Instructions
-
- #### 1. Install Magic-PDF
-
- ```bash
- pip install magic-pdf
- ```
-
- #### 2. Usage via Command Line
-
- ###### simple
-
- ```bash
- cp magic-pdf.template.json ~/magic-pdf.json
- magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
- ```
- After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
-
- ###### more
-
- ```bash
- magic-pdf --help
- ```
-
- #### 3. Usage via Api
-
- ###### Local
- ```python
- image_writer = DiskReaderWriter(local_image_dir)
- image_dir = str(os.path.basename(local_image_dir))
- jso_useful_key = {"_pdf_type": "", "model_list": model_json}
- pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
- pipe.pipe_classify()
- pipe.pipe_parse()
- md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
- ```
-
- ###### Object Storage
- ```python
- s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
- image_dir = "s3://img_bucket/"
- s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
- pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
- jso_useful_key = {"_pdf_type": "", "model_list": model_json}
- pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
- pipe.pipe_classify()
- pipe.pipe_parse()
- md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
- ```
-
- Demo can be referred to [demo.py](demo/demo.py)
-
-
- # Magic-Doc
-
-
- ## Introduction
-
- Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
-
- Key Features Include:
-
- - Web Page Extraction
- - Cross-modal precise parsing of text, images, tables, and formula information.
-
- - E-Book Document Extraction
- - Supports various document formats including epub, mobi, with full adaptation for text and images.
-
- - Language Type Identification
- - Accurate recognition of 176 languages.
-
- https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
-
-
-
- https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
-
-
-
- https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
-
-
-
-
- ## Project Repository
-
- - [Magic-Doc](https://github.com/InternLM/magic-doc)
- Outstanding Webpage and E-book Extraction Tool
-
-
- # All Thanks To Our Contributors
-
- <a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
- <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
- </a>
-
-
- # License Information
-
- [LICENSE.md](LICENSE.md)
-
- The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
-
-
- # Acknowledgments
-
- - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
-
221
- # Citation
222
-
223
- ```bibtex
224
- @misc{2024mineru,
225
- title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
226
- author={MinerU Contributors},
227
- howpublished = {\url{https://github.com/opendatalab/MinerU}},
228
- year={2024}
229
- }
230
- ```
231
-
232
-
233
- # Star History
234
-
235
- <a>
236
- <picture>
237
- <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
238
- <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
239
- <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
240
- </picture>
241
- </a>
File without changes