magic-pdf 0.6.0__py3-none-any.whl → 0.6.2b1__py3-none-any.whl

@@ -0,0 +1,344 @@
1
+ Metadata-Version: 2.1
2
+ Name: magic-pdf
3
+ Version: 0.6.2b1
4
+ Summary: A practical tool for converting PDF to Markdown
5
+ Home-page: https://github.com/opendatalab/MinerU
6
+ Requires-Python: >=3.9
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE.md
9
+ Requires-Dist: boto3 >=1.28.43
10
+ Requires-Dist: Brotli >=1.1.0
11
+ Requires-Dist: click >=8.1.7
12
+ Requires-Dist: PyMuPDF >=1.24.9
13
+ Requires-Dist: loguru >=0.6.0
14
+ Requires-Dist: numpy <2.0.0,>=1.21.6
15
+ Requires-Dist: fast-langdetect ==0.2.0
16
+ Requires-Dist: wordninja >=2.0.0
17
+ Requires-Dist: scikit-learn >=1.0.2
18
+ Requires-Dist: pdfminer.six ==20231228
19
+ Provides-Extra: full
20
+ Requires-Dist: unimernet ==0.1.6 ; extra == 'full'
21
+ Requires-Dist: matplotlib ; extra == 'full'
22
+ Requires-Dist: ultralytics ; extra == 'full'
23
+ Requires-Dist: paddleocr ==2.7.3 ; extra == 'full'
24
+ Requires-Dist: detectron2 ; extra == 'full'
25
+ Requires-Dist: paddlepaddle ==3.0.0b1 ; (platform_system == "Linux") and extra == 'full'
26
+ Requires-Dist: paddlepaddle ==2.6.1 ; (platform_system == "Windows" or platform_system == "Darwin") and extra == 'full'
27
+ Provides-Extra: lite
28
+ Requires-Dist: paddleocr ==2.7.3 ; extra == 'lite'
29
+ Requires-Dist: paddlepaddle ==3.0.0b1 ; (platform_system == "Linux") and extra == 'lite'
30
+ Requires-Dist: paddlepaddle ==2.6.1 ; (platform_system == "Windows" or platform_system == "Darwin") and extra == 'lite'
31
+
32
+ <div id="top">
33
+
34
+ <p align="center">
35
+ <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
36
+ </p>
37
+
38
+ </div>
39
+ <div align="center">
40
+
41
+ [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
42
+ [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
43
+ [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
44
+ [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
45
+ [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
46
+ [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
47
+ [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
48
+
49
+ <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
50
+
51
+
52
+
53
+
54
+ [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
55
+
56
+ </div>
57
+
58
+ <div align="center">
59
+ <p align="center">
60
+ <a href="https://github.com/opendatalab/MinerU">MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.</a>🚀🚀🚀<br>
61
+ <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction</a>🔥🔥🔥
62
+ </p>
63
+
64
+ <p align="center">
65
+ 👋 join us on <a href="https://discord.gg/AsQMhuMN" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
66
+ </p>
67
+ </div>
68
+
69
+ # MinerU
70
+
71
+
72
+ ## Introduction
73
+
74
+ MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:
75
+
76
+ - [Magic-PDF](#Magic-PDF) PDF Document Extraction
77
+ - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
78
+
79
+
80
+ # Magic-PDF
81
+
82
+
83
+ ## Introduction
84
+
85
+ Magic-PDF is a tool designed to convert PDF documents into Markdown format. It can process files stored locally or on object storage that supports the S3 protocol.
86
+
87
+ Key features include:
88
+
89
+ - Support for multiple front-end model inputs
90
+ - Removal of headers, footers, footnotes, and page numbers
91
+ - Human-readable layout formatting
92
+ - Retention of the original document's structure and formatting, including headings, paragraphs, lists, and more
93
+ - Extraction and display of images and tables within markdown
94
+ - Conversion of equations into LaTeX format
95
+ - Automatic detection and conversion of garbled PDFs
96
+ - Compatibility with CPU and GPU environments
97
+ - Available for Windows, Linux, and macOS platforms
98
+
99
+
100
+ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
101
+
102
+
103
+
104
+ ## Project Panorama
105
+
106
+ ![Project Panorama](docs/images/project_panorama_en.png)
107
+
108
+
109
+ ## Flowchart
110
+
111
+ ![Flowchart](docs/images/flowchart_en.png)
112
+
113
+ ### Dependency repositories
114
+
115
+ - [PDF-Extract-Kit : A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀
116
+
117
+ ## Getting Started
118
+
119
+ ### Requirements
120
+
121
+ - Python >= 3.9
122
+
123
+ Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable.
124
+ For example:
125
+ ```bash
126
+ conda create -n MinerU python=3.10
127
+ conda activate MinerU
128
+ ```
129
+
130
+ ### Installation and Configuration
131
+
132
+ #### 1. Install Magic-PDF
133
+
134
+ Install the full-feature package with pip:
135
+ >Note: The pip-installed package supports CPU only and is ideal for quick tests.
136
+ >
137
+ >For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
138
+
139
+ ```bash
140
+ pip install magic-pdf[full]
141
+ ```
142
+ The full-feature package depends on detectron2, which must be compiled during installation.
143
+ If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
144
+ Alternatively, you can directly use our precompiled whl package (limited to Python 3.10):
145
+
146
+ ```bash
147
+ pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
148
+ ```
149
+
150
+
151
+ #### 2. Download the Model Weight Files
152
+
153
+ For details, see [how_to_download_models](docs/how_to_download_models_en.md).
154
+
155
+ After downloading the model weights, move the 'models' directory to a disk with ample free space, preferably an SSD.
156
+
157
+
158
+ #### 3. Copy the Configuration File and Make Configurations
159
+ You can get the [magic-pdf.template.json](magic-pdf.template.json) file in the repository root directory.
160
+ ```bash
161
+ cp magic-pdf.template.json ~/magic-pdf.json
162
+ ```
163
+ In magic-pdf.json, set "models-dir" to the directory where the model weight files are located.
164
+
165
+ ```json
166
+ {
167
+ "models-dir": "/tmp/models"
168
+ }
169
+ ```
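As a minimal sketch of this step, the same setting can be written from Python. The helper name `write_config` is hypothetical; only the "models-dir" key and the `~/magic-pdf.json` location come from the instructions above. Any other keys already in the file (for example those copied from magic-pdf.template.json) are left untouched.

```python
import json
from pathlib import Path


def write_config(models_dir, config_path=None):
    """Set "models-dir" in magic-pdf.json, keeping any other keys already present.

    `write_config` is an illustrative helper, not part of magic-pdf.
    """
    # Default location follows the README: a magic-pdf.json file in the home directory.
    path = Path(config_path) if config_path else Path.home() / "magic-pdf.json"

    # Start from the existing file if there is one, otherwise from an empty config.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    config["models-dir"] = str(models_dir)
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

Calling `write_config("/tmp/models")` would produce the JSON fragment shown above when no configuration file exists yet.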
170
+
171
+
172
+ #### 4. Acceleration Using CUDA or MPS
173
+ If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can leverage acceleration with CUDA or MPS respectively.
174
+ ##### CUDA
175
+
176
+ You need to install the corresponding PyTorch version according to your CUDA version.
177
+ This example installs the CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/
178
+ ```bash
179
+ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
180
+ ```
181
+ Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
182
+ ```json
183
+ {
184
+ "device-mode":"cuda"
185
+ }
186
+ ```
187
+
188
+ ##### MPS
189
+
190
+ On macOS devices with M-series chips, you can use MPS for inference acceleration.
191
+ You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.
192
+ ```json
193
+ {
194
+ "device-mode":"mps"
195
+ }
196
+ ```
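The choice of "device-mode" can be sketched as a small decision function. This is an illustration only: the function names are hypothetical, and the `"cpu"` fallback value is an assumption (the text above names only `"cuda"` and `"mps"`). The detection helper relies on PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` and falls back to CPU when PyTorch is not installed.

```python
def choose_device_mode(has_cuda: bool, has_mps: bool) -> str:
    """Pick a "device-mode" value, preferring CUDA, then MPS."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"  # assumed fallback value; not stated in the README text above


def detect_device_mode() -> str:
    """Probe PyTorch for available accelerators (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate with.
        return choose_device_mode(False, False)
    return choose_device_mode(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```

The returned string is what would be written to the "device-mode" key in magic-pdf.json.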
197
+
198
+
199
+ ### Usage
200
+
201
+ #### 1. Usage via Command Line
202
+
203
+ ###### simple
204
+
205
+ ```bash
206
+ magic-pdf pdf-command --pdf "pdf_path" --inside_model true
207
+ ```
208
+ After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
209
+ You can find the corresponding xxx_model.json file in the markdown directory.
210
+ If you intend to do secondary development on the post-processing pipeline, you can use the command:
211
+ ```bash
212
+ magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
213
+ ```
214
+ This way, you don't need to re-run model inference, which makes debugging more convenient.
215
+
216
+
217
+ ###### more
218
+
219
+ ```bash
220
+ magic-pdf --help
221
+ ```
222
+
223
+
224
+ #### 2. Usage via API
225
+
226
+ ###### Local
227
+ ```python
228
+ image_writer = DiskReaderWriter(local_image_dir)
229
+ image_dir = str(os.path.basename(local_image_dir))
230
+ jso_useful_key = {"_pdf_type": "", "model_list": []}
231
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
232
+ pipe.pipe_classify()
233
+ pipe.pipe_parse()
234
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
235
+ ```
236
+
237
+ ###### Object Storage
238
+ ```python
239
+ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
240
+ image_dir = "s3://img_bucket/"
241
+ s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
242
+ pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
243
+ jso_useful_key = {"_pdf_type": "", "model_list": []}
244
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
245
+ pipe.pipe_classify()
246
+ pipe.pipe_parse()
247
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
248
+ ```
249
+
250
+ For a complete demo, see [demo.py](demo/demo.py).
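The Local snippet above omits its imports and the code that reads the PDF. A self-contained sketch might look like the following. The wrapper `pdf_to_markdown` and its `pipe_cls` / `writer_cls` parameters are hypothetical, and the import path `magic_pdf.pipe.UNIPipe` is an assumption (`magic_pdf/rw/DiskReaderWriter.py` does appear in the RECORD listing). The pipeline calls themselves are copied from the snippet; demo/demo.py remains the authoritative reference and may perform additional steps.

```python
import os


def pdf_to_markdown(pdf_path, local_image_dir, pipe_cls=None, writer_cls=None):
    """Run the README's local pipeline on one PDF and return the Markdown text.

    pipe_cls / writer_cls exist only so the flow can be exercised without
    magic-pdf installed; by default the real classes are imported lazily.
    """
    if pipe_cls is None:
        from magic_pdf.pipe.UNIPipe import UNIPipe  # assumed import path
        pipe_cls = UNIPipe
    if writer_cls is None:
        from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
        writer_cls = DiskReaderWriter

    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()

    # Same sequence of calls as the Local snippet.
    image_writer = writer_cls(local_image_dir)
    image_dir = str(os.path.basename(local_image_dir))
    jso_useful_key = {"_pdf_type": "", "model_list": []}
    pipe = pipe_cls(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    pipe.pipe_parse()
    return pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

With magic-pdf installed and configured, `pdf_to_markdown("paper.pdf", "./images")` would return the Markdown string, with image links relative to `images`.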
251
+
252
+
253
+ # Magic-Doc
254
+
255
+
256
+ ## Introduction
257
+
258
+ Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
259
+
260
+ Key features include:
261
+
262
+ - Web Page Extraction
263
+ - Cross-modal precise parsing of text, images, tables, and formula information.
264
+
265
+ - E-Book Document Extraction
266
+ - Supports various document formats, including epub and mobi, with full adaptation for text and images.
267
+
268
+ - Language Type Identification
269
+ - Accurate recognition of 176 languages.
270
+
271
+ https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
272
+
273
+
274
+
275
+ https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
276
+
277
+
278
+
279
+ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
280
+
281
+
282
+
283
+
284
+ ## Project Repository
285
+
286
+ - [Magic-Doc](https://github.com/InternLM/magic-doc)
287
+ Outstanding Webpage and E-book Extraction Tool
288
+
289
+
290
+ # All Thanks To Our Contributors
291
+
292
+ <a href="https://github.com/opendatalab/MinerU/graphs/contributors">
293
+ <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
294
+ </a>
295
+
296
+
297
+ # License Information
298
+
299
+ [LICENSE.md](LICENSE.md)
300
+
301
+ The project currently uses PyMuPDF to deliver advanced functionality; however, PyMuPDF is licensed under the AGPL, which may impose limitations on certain use cases. In future iterations, we intend to explore and transition to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.
302
+
303
+
304
+ # Acknowledgments
305
+
306
+ - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
307
+ - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
308
+ - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
309
+ - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
310
+
311
+
312
+ # Citation
313
+
314
+ ```bibtex
315
+ @article{he2024opendatalab,
316
+ title={Opendatalab: Empowering general artificial intelligence with open datasets},
317
+ author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
318
+ journal={arXiv preprint arXiv:2407.13773},
319
+ year={2024}
320
+ }
321
+
322
+ @misc{2024mineru,
323
+ title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
324
+ author={MinerU Contributors},
325
+ howpublished = {\url{https://github.com/opendatalab/MinerU}},
326
+ year={2024}
327
+ }
328
+ ```
329
+
330
+
331
+ # Star History
332
+
333
+ <a>
334
+ <picture>
335
+ <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
336
+ <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
337
+ <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
338
+ </picture>
339
+ </a>
340
+
341
+ # Links
342
+ - [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
343
+ - [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
344
+ - [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
@@ -2,13 +2,13 @@ magic_pdf/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
2
  magic_pdf/pdf_parse_by_ocr.py,sha256=IWnSWt1Z-d35xRqspzdLR2iUtma_SAu4W7K4kEk8SHc,638
3
3
  magic_pdf/pdf_parse_by_txt.py,sha256=KUSH7Gh83CZmdyWw59pqDskwyJ2Kg-jU-9fnQGJQEs4,537
4
4
  magic_pdf/pdf_parse_for_train.py,sha256=Oby61DMjJ716Jj_ri7lwXfv2Chus0pbBR2RPXrmBW08,28661
5
- magic_pdf/pdf_parse_union_core.py,sha256=a67iQuEfuslAEF-wQplGZKXUuz5mT3HiCyvuR52E6Gw,10584
5
+ magic_pdf/pdf_parse_union_core.py,sha256=jNly6l9pGcCf7wr6s6PgQhITJZ1m9PaI32Q26zxFngs,10620
6
6
  magic_pdf/user_api.py,sha256=CVQH-VSiZpz0bSkyMT4czk1epZriIPSJsLsPbluPa9Q,3054
7
7
  magic_pdf/cli/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
8
- magic_pdf/cli/magicpdf.py,sha256=EcTiX-MaiDc4Fv9qZ_UdjHt5tYnBEu6vlbp0w030sA0,12691
8
+ magic_pdf/cli/magicpdf.py,sha256=jzGSxLcx_tmjFiZfIKpomfU9rfvAwZDyMm4bmNGQmJw,13541
9
9
  magic_pdf/dict2md/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
10
10
  magic_pdf/dict2md/mkcontent.py,sha256=rWUY-2opd0jeowEUEVOV_uWcKum1Q7ng4nOoT6-ka_s,17459
11
- magic_pdf/dict2md/ocr_mkcontent.py,sha256=RyxebPtvFfNce_HCa-_YGxwFx_srzL-BfMKc85V9JG0,15442
11
+ magic_pdf/dict2md/ocr_mkcontent.py,sha256=L5D6aLB3vSATfDmsXRI_Mnq79ijH6yPYdSxSJb5aFh8,15678
12
12
  magic_pdf/filter/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
13
13
  magic_pdf/filter/pdf_classify_by_type.py,sha256=spmDO-f2ihAl1d6-EP-j271Yi50oyu6mw4X2kRd_m0s,42320
14
14
  magic_pdf/filter/pdf_meta_scan.py,sha256=5R2XDiBZw0xd4ugbDxuyk6fztGlT5jFsGN85hLvo-hQ,17390
@@ -25,7 +25,7 @@ magic_pdf/libs/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
25
25
  magic_pdf/libs/boxbase.py,sha256=MvD0DypR4sTEF3T2RrI_yJ8mPDUBYHAqAaau2mnBSxY,15343
26
26
  magic_pdf/libs/calc_span_stats.py,sha256=5vnU27DcbkFDRSAoLqAmX0KQ3I9ehWkEgh_t9hxg_zI,10147
27
27
  magic_pdf/libs/commons.py,sha256=6Zu9-OyamyCNDY7qj0SxR-rux-ggj9im3CVPtC4ubB8,7108
28
- magic_pdf/libs/config_reader.py,sha256=wB0Zn6qEwuAWmv2Icz9owPIqxrhFEH5i6sUr8Nt5ULo,1806
28
+ magic_pdf/libs/config_reader.py,sha256=Wlig8KRhKpz8fVNc7dlspxoccX5SnyT7FmwKj9i-gWE,2435
29
29
  magic_pdf/libs/convert_utils.py,sha256=Ov-lsfCLBPz_15iSJXIslBNmrSf_E_1g_XDWJy8NgO8,143
30
30
  magic_pdf/libs/coordinate_transform.py,sha256=Bbop2cP2uz2ZG0U0gwd7J6EKkgABq5Rv03qf2LMPw80,429
31
31
  magic_pdf/libs/detect_language_from_model.py,sha256=Uln8F9qs8EJOw4EgI7KRlaU3lD_mK8KMTlADLFtz8fk,816
@@ -34,9 +34,9 @@ magic_pdf/libs/drop_reason.py,sha256=IfjPSrPLMmVziqjOXPep7r_ioQKFRahDgbOW1SD-Tuw
34
34
  magic_pdf/libs/drop_tag.py,sha256=bZDg3bIVWvBT1Ec1icwj5WLOkt5-hI6eRYZ2tX9_a74,673
35
35
  magic_pdf/libs/hash_utils.py,sha256=VEKK9WfFoZgrPfi8kfITjLpr8Ahufs8tXh9R1Y5lAL8,404
36
36
  magic_pdf/libs/json_compressor.py,sha256=6-KCu0lb5ksmyqWtQGb4QqmP-FjRb5dP7P-Hevcn68g,875
37
- magic_pdf/libs/language.py,sha256=l0LGIz-dlerU9Xct-7ypNKGNEI_q-CTadsJAnVTF9VY,692
37
+ magic_pdf/libs/language.py,sha256=Hj5-lrGoNExxdHLbkcNG-c27U4AjJ9AZPdZblaNSehU,1099
38
+ magic_pdf/libs/local_math.py,sha256=tqljQOgqh3fZc146HYhO88JXJaiXMVwArBkk_CSGICc,177
38
39
  magic_pdf/libs/markdown_utils.py,sha256=cLxLXjRhrNp_wCHvtglrGA_FVdrvfd1KULeTtj1p18w,944
39
- magic_pdf/libs/math.py,sha256=tqljQOgqh3fZc146HYhO88JXJaiXMVwArBkk_CSGICc,177
40
40
  magic_pdf/libs/nlp_utils.py,sha256=-X9W3-Ns5ZdDYFvyyEq6i6P2b5hCATaFEZeOjwNOH9M,6901
41
41
  magic_pdf/libs/ocr_content_type.py,sha256=DiGTYppd6WlibwCAeVpIy3NHCQkglfIAQsJ_ffu5BPw,526
42
42
  magic_pdf/libs/path_utils.py,sha256=Hykw_l5CU736b2egHV9P7B-qh3QNKO4nZSGCbsi0Z8E,1043
@@ -44,22 +44,22 @@ magic_pdf/libs/pdf_check.py,sha256=MAe8wzwT0qvPf_I72wEZG7k1g4haNHS7oUtLqkB5rlE,2
44
44
  magic_pdf/libs/pdf_image_tools.py,sha256=CAd01giTKr_UJz1_QtDOARG9G9z69GFpzRZwcWSfLtE,1282
45
45
  magic_pdf/libs/safe_filename.py,sha256=ckwcM_eqoysTb5id8czp-tXq2G9da0-l3pshZDCHQtE,236
46
46
  magic_pdf/libs/textbase.py,sha256=SC1Frhz3Fb7V7n2SFRBsl7Bmg0JZdlvZskq0lfW1vIk,732
47
- magic_pdf/libs/version.py,sha256=cID1jLnC_vj48GgMN6Yb1FA3JsQ95zNmCHmRYE8TFhY,22
47
+ magic_pdf/libs/version.py,sha256=ZMr9EH74nzdVNg9PXVs5XLZcabTX-_J2XZD1VgXHRBE,24
48
48
  magic_pdf/libs/vis_utils.py,sha256=hTOTEakKV0pGMbk0tbRkVI_tku7A3dGc96ynObZ4kwI,10207
49
- magic_pdf/model/__init__.py,sha256=XeYcF4RMZ3DosyLqiz0_n1JVa2k5RhTwUXwKt5sAjEQ,53
50
- magic_pdf/model/doc_analyze_by_custom_model.py,sha256=kssz_Nn6zTYED_iEgGuFRjus947xoK5dTqj88FOehE0,3256
51
- magic_pdf/model/magic_model.py,sha256=2H6Gz1mg0f0YCvz-TLIWrAWXCQLgZftBXJNRPlSIjwc,25077
49
+ magic_pdf/model/__init__.py,sha256=1QcfMKET0xQhSaZMjNQHi_TjzSSDR6PI5mjkmaXHPe8,52
50
+ magic_pdf/model/doc_analyze_by_custom_model.py,sha256=0YyZoDUQBFQIDiHDK59JEql9FKSkZMr8aGN0Tw1eL8k,4064
51
+ magic_pdf/model/magic_model.py,sha256=tIGi-vtn6OUDqs3mtW2YLzoFvSPu4kkBiHJUv6o0Kic,25233
52
52
  magic_pdf/model/model_list.py,sha256=AqxAtKGLDn7VVXWYwk0l9LnACxDLyU2jwOJ7vjPZj04,72
53
- magic_pdf/model/pdf_extract_kit.py,sha256=hiK1zDrwn5QhqUwI7BvM1JOoq_JIab4uVx_flHrBmWE,8374
54
- magic_pdf/model/pp_structure_v2.py,sha256=apYWwWiCjlks5CLXolcynnuPV7llCm2PdP-6tg0-Kt0,2903
53
+ magic_pdf/model/pdf_extract_kit.py,sha256=ecoTBn7ROmfU1KxFpFzu7yXw7aezmPV6xOht7PRmPHA,10692
54
+ magic_pdf/model/pp_structure_v2.py,sha256=1sn8IJK0d5ZmqJ2XFt9FdaSdI0RQf-iwNAWBrVrIeuc,2872
55
55
  magic_pdf/model/pek_sub_modules/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
56
56
  magic_pdf/model/pek_sub_modules/post_process.py,sha256=HzRxV2sVR3Qo8XKYEHhT6tae-bYTb6dnAfGP6gfVNaM,1135
57
- magic_pdf/model/pek_sub_modules/self_modify.py,sha256=XiwLUCiY_E0JkaIQr5m1hOD75-iGrgkMVe-1bzeF_Go,10522
57
+ magic_pdf/model/pek_sub_modules/self_modify.py,sha256=wl7HPzfUyKrjsSRp_iMXlbM2q1brVk5TD9NbgyY6UWU,11397
58
58
  magic_pdf/model/pek_sub_modules/layoutlmv3/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
59
59
  magic_pdf/model/pek_sub_modules/layoutlmv3/backbone.py,sha256=1cvSCczgvwOLdvzWyqttoYPMHsXmnzI3w9abJ1bAXoM,7106
60
60
  magic_pdf/model/pek_sub_modules/layoutlmv3/beit.py,sha256=e-INve6bpEx_0FM5wYbQcEcelc79tzDlCljTVHaGt1w,30450
61
61
  magic_pdf/model/pek_sub_modules/layoutlmv3/deit.py,sha256=Qyn5UWutZ-0GJczexCh-oMMSXtav_g3ovumMFJp8Om4,17000
62
- magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py,sha256=POs5s4_9rS-GlE7f_iHBuZpTwOuyfI6VE3DUb37fgxA,4483
62
+ magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py,sha256=PhWqqRwgSSmXTaUlLIjGqnBUNjzxwYDKgMzKjnxNy1k,4528
63
63
  magic_pdf/model/pek_sub_modules/layoutlmv3/rcnn_vl.py,sha256=nI4G6AeLRmjavNhs5S2USKh0ozn-ftMuW0F0m_eVy3c,6649
64
64
  magic_pdf/model/pek_sub_modules/layoutlmv3/visualizer.py,sha256=H6UYeCCbaN2gbDjGthTkKkPoyWxfE3azRjsR7fVBwnw,49797
65
65
  magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/__init__.py,sha256=C4N9gXJr7is7uznvQefQ7dOhlzEhdp86Lgh-7p0Y-08,186
@@ -99,7 +99,7 @@ magic_pdf/post_proc/detect_para.py,sha256=5LX86ueHQGOV9CNimAxqZH4R3KTi78leum1de_
99
99
  magic_pdf/post_proc/pdf_post_filter.py,sha256=FeZceyjGG_UvBrBoa51Ohge5edQzCoJtZTaocidKCHg,2530
100
100
  magic_pdf/post_proc/remove_footnote.py,sha256=701P7xRu6gzLaEHfb2xkYpLZI4CwK2FAo7Ggho4bOTI,7596
101
101
  magic_pdf/pre_proc/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
102
- magic_pdf/pre_proc/citationmarker_remove.py,sha256=81iFoNLhZ2vOZi1g2zBH0S2UEfyZdhBY43qKKdxr6mo,6521
102
+ magic_pdf/pre_proc/citationmarker_remove.py,sha256=IitOERaK9fGaktsYMyiaaL_71uMIrlG5ZdmpZaR6dsA,6640
103
103
  magic_pdf/pre_proc/construct_page_dict.py,sha256=lp3zBmInlWYYIcGC1-NSqT9s44AjDvlnWxDPeZoBVSY,3043
104
104
  magic_pdf/pre_proc/cut_image.py,sha256=bbeELTg2-SFyHkVEnGAL_7S6k8hyy1xtDSoFmXDQDOA,2768
105
105
  magic_pdf/pre_proc/detect_equation.py,sha256=9omDHKTI8QO9Qd46eVFHWhZeMmTNx7XDuWRgjXI-KFA,6627
@@ -114,10 +114,10 @@ magic_pdf/pre_proc/equations_replace.py,sha256=fXj7ZV7F3YtkDYrAhE9g5tHk4_3pVUyLb
114
114
  magic_pdf/pre_proc/fix_image.py,sha256=5MOfkXc8abfIp49g-68vll40wwTUZ5tcQ2gtsJuFmvs,11486
115
115
  magic_pdf/pre_proc/fix_table.py,sha256=20sqJe27fAXcL7_C0qQ9mpsggmH37WuX-wPYWyRgACA,13227
116
116
  magic_pdf/pre_proc/main_text_font.py,sha256=1gkjvPuBdKC4oVFkLvnRm2zghsLtVlfAEMKXouyVonM,1048
117
- magic_pdf/pre_proc/ocr_detect_all_bboxes.py,sha256=K5VHG_hNzqR6va0TB24SONMS7NGYBrbY4pxgEijLNlk,5305
117
+ magic_pdf/pre_proc/ocr_detect_all_bboxes.py,sha256=aM_RZWg-2nPBQtqOVoMuvexmP5g1CuP94WxKxXuqnW4,6574
118
118
  magic_pdf/pre_proc/ocr_detect_layout.py,sha256=DW0_HXzmcbW22cXKIYFsyZNFh8mEjSHXIFVjXndJsvQ,5878
119
119
  magic_pdf/pre_proc/ocr_dict_merge.py,sha256=lz1viJJkjZAklKZYmkWP39YBnv5BoWVr3fv6mmNr33E,12620
120
- magic_pdf/pre_proc/ocr_span_list_modify.py,sha256=nIkbEcV0EBBQbwzvfnjDPseHxVzYxFRQmtel3GnXdPo,11290
120
+ magic_pdf/pre_proc/ocr_span_list_modify.py,sha256=jqBheXF8EuYCfS9tn6typr-aE57nfMoeBC36J5GjpbQ,11519
121
121
  magic_pdf/pre_proc/pdf_pre_filter.py,sha256=FIMwe8Lei9LI2RmkqiaSyTHV5b7ViADbpyBwgVwZH-c,2687
122
122
  magic_pdf/pre_proc/post_layout_split.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
123
123
  magic_pdf/pre_proc/remove_bbox_overlap.py,sha256=u_ObNLkZ8pPDNBUkSMpA9ffiSpfz42B4807cdBPZmLU,3085
@@ -127,9 +127,10 @@ magic_pdf/pre_proc/remove_rotate_bbox.py,sha256=0FlBXeiEwjZAGAWo-DiMptclFOj04POu
127
127
  magic_pdf/pre_proc/resolve_bbox_conflict.py,sha256=bJiegofPUeDyi--oZjfipQ5Q5RLm6TOCW0TLXbPii_Q,7307
128
128
  magic_pdf/pre_proc/solve_line_alien.py,sha256=aNoQptPcC38Sm1I2ABhgw8jeH_5kjsRHx3VYlFFtm1g,853
129
129
  magic_pdf/pre_proc/statistics.py,sha256=_9jGlXq0iXd03UMxB92ZqCiu7cjNkG5vHvFlTF_9ytA,220
130
+ magic_pdf/resources/fasttext-langdetect/lid.176.ftz,sha256=jzRyz-hzintgmejpmcPL-uDc0VaWqsfXc4qAOdtgPoM,938013
130
131
  magic_pdf/resources/model_config/model_configs.yaml,sha256=C_9UfFMlHOX-iSgcwCHjyHKazKKuwpy1RcGHeTQD1kY,139
131
132
  magic_pdf/resources/model_config/UniMERNet/demo.yaml,sha256=al9_--m3n2j9zEn9OjlmmpfQbqVBAYFakXc_hY4vDXo,807
132
- magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml,sha256=noqVE3GmZUG86NYDfs9DiFgdQFjXaICyCU7KPzgR3II,6174
133
+ magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml,sha256=9aNAEYgpHTAWpcUrDvuPG2y4V-Qw8QdcJefi96y8yDU,6109
133
134
  magic_pdf/rw/AbsReaderWriter.py,sha256=1Hd6Xo2g12CaRAo5Sze-R_GSQA6GQ0rQwSmgQvw4V_c,1297
134
135
  magic_pdf/rw/DiskReaderWriter.py,sha256=0tt8lbRyqrOfFgGlhjt24YMdj2xN7QUIVysfhFIxPgo,2113
135
136
  magic_pdf/rw/S3ReaderWriter.py,sha256=O7Quf3CUqXBjMz4sIE7kNVI3TIQROeg5PuXneAacieY,4474
@@ -141,9 +142,9 @@ magic_pdf/train_utils/convert_to_train_format.py,sha256=ifo2FAoBMa_etCvz0O4v03xO
141
142
  magic_pdf/train_utils/extract_caption.py,sha256=gommEqIEWLplSDEJWD7_66daqlOBsWhpRBW1DHpkny4,1825
142
143
  magic_pdf/train_utils/remove_footer_header.py,sha256=pyeNNdJ-th3wl5Xwb10ZLYNaFN4-6BmahoMFE8VTNNs,5978
143
144
  magic_pdf/train_utils/vis_utils.py,sha256=MV9N9cT3ifJ35u7LFKGF9I_bOIQrtU1zcsxu2hj3aqM,10111
144
- magic_pdf-0.6.0.dist-info/LICENSE.md,sha256=hIahDEOTzuHCU5J2nd07LWwkLW7Hko4UFO__ffsvB-8,34523
145
- magic_pdf-0.6.0.dist-info/METADATA,sha256=rqkC7PCOuqDzqnsVWW8DSqwogF3jHwGp6-sS8xjCi6o,7093
146
- magic_pdf-0.6.0.dist-info/WHEEL,sha256=GJ7t_kWBFywbagK5eo9IoUwLW6oyOeTKmQ-9iHFVNxQ,92
147
- magic_pdf-0.6.0.dist-info/entry_points.txt,sha256=NbSkSmE08UuTwdoJD8Uofq8iyufySA4x7jmIIk4YCzI,57
148
- magic_pdf-0.6.0.dist-info/top_level.txt,sha256=J9I0AzmHWGkp9c6DL8Oe4mEx3yYphLzkRn4H25Lg1rE,10
149
- magic_pdf-0.6.0.dist-info/RECORD,,
145
+ magic_pdf-0.6.2b1.dist-info/LICENSE.md,sha256=hIahDEOTzuHCU5J2nd07LWwkLW7Hko4UFO__ffsvB-8,34523
146
+ magic_pdf-0.6.2b1.dist-info/METADATA,sha256=DVFHn5RBQilga9OZp87sGf4e4wAE-XT8QpB6KSVUgoo,11541
147
+ magic_pdf-0.6.2b1.dist-info/WHEEL,sha256=GJ7t_kWBFywbagK5eo9IoUwLW6oyOeTKmQ-9iHFVNxQ,92
148
+ magic_pdf-0.6.2b1.dist-info/entry_points.txt,sha256=NbSkSmE08UuTwdoJD8Uofq8iyufySA4x7jmIIk4YCzI,57
149
+ magic_pdf-0.6.2b1.dist-info/top_level.txt,sha256=J9I0AzmHWGkp9c6DL8Oe4mEx3yYphLzkRn4H25Lg1rE,10
150
+ magic_pdf-0.6.2b1.dist-info/RECORD,,
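Each RECORD entry above has the form `path,sha256=<digest>,size`, where the digest is the URL-safe Base64 encoding of the file's SHA-256 hash with trailing `=` padding removed. A minimal sketch for parsing an entry and recomputing its digest follows; the function names are illustrative and not part of any packaging tool.

```python
import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style digest: URL-safe Base64 of SHA-256, no padding."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split one RECORD row into (path, algorithm, digest, size).

    RECORD is a CSV file, so the csv module handles any quoted paths.
    The RECORD file's own row has empty hash and size fields.
    """
    path, hash_field, size = next(csv.reader([line.strip()]))
    algo, _, digest = hash_field.partition("=")
    return path, algo, digest, int(size) if size else None
```

For instance, `record_digest(b"")` reproduces the `47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU` value listed for every empty `__init__.py`. Comparing digests this way also shows that `magic_pdf/libs/local_math.py` carries the same hash and size as the removed `magic_pdf/libs/math.py`, which indicates a rename rather than a content change.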