PyPI - magic-pdf - Versions diffs - 0.6.2b1__py3-none-any.whl → 0.7.0b1__py3-none-any.whl - Mend

Files changed (33)

magic_pdf/dict2md/ocr_mkcontent.py +10 -3
magic_pdf/libs/Constants.py +4 -1
magic_pdf/libs/config_reader.py +10 -10
magic_pdf/libs/draw_bbox.py +66 -1
magic_pdf/libs/ocr_content_type.py +14 -0
magic_pdf/libs/version.py +1 -1
magic_pdf/model/doc_analyze_by_custom_model.py +10 -4
magic_pdf/model/magic_model.py +4 -0
magic_pdf/model/pdf_extract_kit.py +83 -39
magic_pdf/model/pek_sub_modules/structeqtable/StructTableModel.py +22 -0
magic_pdf/resources/model_config/model_configs.yaml +4 -0
magic_pdf/rw/AbsReaderWriter.py +1 -18
magic_pdf/rw/DiskReaderWriter.py +32 -24
magic_pdf/rw/S3ReaderWriter.py +83 -48
magic_pdf/tools/cli.py +79 -0
magic_pdf/tools/cli_dev.py +155 -0
magic_pdf/tools/common.py +122 -0
magic_pdf-0.7.0b1.dist-info/METADATA +421 -0
{magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/RECORD +25 -27
{magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/WHEEL +1 -1
magic_pdf-0.7.0b1.dist-info/entry_points.txt +3 -0
magic_pdf/cli/magicpdf.py +0 -359
magic_pdf/pdf_parse_for_train.py +0 -685
magic_pdf/train_utils/convert_to_train_format.py +0 -65
magic_pdf/train_utils/extract_caption.py +0 -59
magic_pdf/train_utils/remove_footer_header.py +0 -159
magic_pdf/train_utils/vis_utils.py +0 -327
magic_pdf-0.6.2b1.dist-info/METADATA +0 -344
magic_pdf-0.6.2b1.dist-info/entry_points.txt +0 -2
magic_pdf/{cli → model/pek_sub_modules/structeqtable}/__init__.py +0 -0
magic_pdf/{train_utils → tools}/__init__.py +0 -0
{magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/LICENSE.md +0 -0
{magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/top_level.txt +0 -0
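
The per-file counts in the list above can be totalled with a short script. This is a minimal sketch, assuming every entry ends in `+added -removed` as it does in this listing; the helper name `tally` and the three-line sample are illustrative only and are not part of the package.

```python
import re

# Matches "<path> +<added> -<removed>" at the end of a file-list entry.
STAT = re.compile(r"^(?P<path>.+?)\s\+(?P<added>\d+)\s-(?P<removed>\d+)$")


def tally(lines):
    """Return (total_added, total_removed) over all entries that match STAT."""
    added = removed = 0
    for line in lines:
        m = STAT.match(line.strip())
        if m is None:
            continue  # skip headings and blank lines
        added += int(m["added"])
        removed += int(m["removed"])
    return added, removed


# A small excerpt of the file list, copied from the entries above.
sample = [
    "magic_pdf/libs/version.py +1 -1",
    "magic_pdf/tools/cli.py +79 -0",
    "magic_pdf/cli/magicpdf.py +0 -359",
]
print(tally(sample))
```

Feeding the full list to `tally` would give the overall size of the change between the two wheels.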

magic_pdf-0.6.2b1.dist-info/METADATA DELETED

@@ -1,344 +0,0 @@
-Metadata-Version: 2.1
-Name: magic-pdf
-Version: 0.6.2b1
-Summary: A practical tool for converting PDF to Markdown
-Home-page: https://github.com/opendatalab/MinerU
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-License-File: LICENSE.md
-Requires-Dist: boto3 >=1.28.43
-Requires-Dist: Brotli >=1.1.0
-Requires-Dist: click >=8.1.7
-Requires-Dist: PyMuPDF >=1.24.9
-Requires-Dist: loguru >=0.6.0
-Requires-Dist: numpy <2.0.0,>=1.21.6
-Requires-Dist: fast-langdetect ==0.2.0
-Requires-Dist: wordninja >=2.0.0
-Requires-Dist: scikit-learn >=1.0.2
-Requires-Dist: pdfminer.six ==20231228
-Provides-Extra: full
-Requires-Dist: unimernet ==0.1.6 ; extra == 'full'
-Requires-Dist: matplotlib ; extra == 'full'
-Requires-Dist: ultralytics ; extra == 'full'
-Requires-Dist: paddleocr ==2.7.3 ; extra == 'full'
-Requires-Dist: detectron2 ; extra == 'full'
-Requires-Dist: paddlepaddle ==3.0.0b1 ; (platform_system == "Linux") and extra == 'full'
-Requires-Dist: paddlepaddle ==2.6.1 ; (platform_system == "Windows" or platform_system == "Darwin") and extra == 'full'
-Provides-Extra: lite
-Requires-Dist: paddleocr ==2.7.3 ; extra == 'lite'
-Requires-Dist: paddlepaddle ==3.0.0b1 ; (platform_system == "Linux") and extra == 'lite'
-Requires-Dist: paddlepaddle ==2.6.1 ; (platform_system == "Windows" or platform_system == "Darwin") and extra == 'lite'
-<div id="top">
-<p align="center">
-  <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
-</p>
-</div>
-<div align="center">
-[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
-[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
-[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
-[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
-[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
-<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
-[English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
-</div>
-<div align="center">
-<p align="center">
-<a href="https://github.com/opendatalab/MinerU">MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.</a>🚀🚀🚀<br>
-<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction</a>🔥🔥🔥
-</p>
-<p align="center">
-    👋 join us on <a href="https://discord.gg/AsQMhuMN" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
-</p>
-</div>
-# MinerU
-## Introduction
-MinerU is a one-stop, open-source, high-quality data extraction tool, includes the following primary features:
-- [Magic-PDF](#Magic-PDF)  PDF Document Extraction
-- [Magic-Doc](#Magic-Doc)  Webpage & E-book Extraction
-# Magic-PDF
-## Introduction
-Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
-Key features include:
-- Support for multiple front-end model inputs
-- Removal of headers, footers, footnotes, and page numbers
-- Human-readable layout formatting
-- Retains the original document's structure and formatting, including headings, paragraphs, lists, and more
-- Extraction and display of images and tables within markdown
-- Conversion of equations into LaTeX format
-- Automatic detection and conversion of garbled PDFs
-- Compatibility with CPU and GPU environments
-- Available for Windows, Linux, and macOS platforms
-https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
-## Project Panorama
-![Project Panorama](docs/images/project_panorama_en.png)
-## Flowchart
-![Flowchart](docs/images/flowchart_en.png)
-### Dependency repositorys
-- [PDF-Extract-Kit : A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀
-## Getting Started
-### Requirements
-- Python >= 3.9
-Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable.
-For example:
-```bash
-conda create -n MinerU python=3.10
-conda activate MinerU
-```
-### Installation and Configuration
-#### 1. Install Magic-PDF
-Install the full-feature package with pip:
->Note: The pip-installed package supports CPU-only and is ideal for quick tests.
->
->For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
-```bash
-pip install magic-pdf[full-cpu]
-```
-The full-feature package depends on detectron2, which requires a compilation installation.
-If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
-Alternatively, you can directly use our precompiled whl package (limited to Python 3.10):
-```bash
-pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
-```
-#### 2. Downloading model weights files
-For detailed references, please see below [how_to_download_models](docs/how_to_download_models_en.md)
-After downloading the model weights, move the 'models' directory to a directory on a larger disk space, preferably an SSD.
-#### 3. Copy the Configuration File and Make Configurations
-You can get the [magic-pdf.template.json](magic-pdf.template.json) file in the repository root directory.
-```bash
-cp magic-pdf.template.json ~/magic-pdf.json
-```
-In magic-pdf.json, configure "models-dir" to point to the directory where the model weights files are located.
-```json
-{
-  "models-dir": "/tmp/models"
-}
-```
-#### 4. Acceleration Using CUDA or MPS
-If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can leverage acceleration with CUDA or MPS respectively.
-##### CUDA
-You need to install the corresponding PyTorch version according to your CUDA version.
-This example installs the CUDA 11.8 version.More information https://pytorch.org/get-started/locally/
-```bash
-pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
-```
-Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
-```json
-{
-  "device-mode":"cuda"
-}
-```
-##### MPS
-For macOS users with M-series chip devices, you can use MPS for inference acceleration.
-You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.
-```json
-{
-  "device-mode":"mps"
-}
-```
-### Usage
-#### 1.Usage via Command Line
-###### simple
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --inside_model true
-```
-After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
-You can find the corresponding xxx_model.json file in the markdown directory.
-If you intend to do secondary development on the post-processing pipeline, you can use the command:
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
-```
-In this way, you won't need to re-run the model data, making debugging more convenient.
-###### more
-```bash
-magic-pdf --help
-```
-#### 2. Usage via Api
-###### Local
-```python
-image_writer = DiskReaderWriter(local_image_dir)
-image_dir = str(os.path.basename(local_image_dir))
-jso_useful_key = {"_pdf_type": "", "model_list": []}
-pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
-pipe.pipe_classify()
-pipe.pipe_parse()
-md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
-```
-###### Object Storage
-```python
-s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
-image_dir = "s3://img_bucket/"
-s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
-pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
-jso_useful_key = {"_pdf_type": "", "model_list": []}
-pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
-pipe.pipe_classify()
-pipe.pipe_parse()
-md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
-```
-Demo can be referred to [demo.py](demo/demo.py)
-# Magic-Doc
-## Introduction
-Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
-Key Features Include:
-- Web Page Extraction
-  - Cross-modal precise parsing of text, images, tables, and formula information.
-- E-Book Document Extraction
-  - Supports various document formats including epub, mobi, with full adaptation for text and images.
-- Language Type Identification
-  - Accurate recognition of 176 languages.
-https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
-https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
-https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
-## Project Repository
-- [Magic-Doc](https://github.com/InternLM/magic-doc)
-  Outstanding Webpage and E-book Extraction Tool
-# All Thanks To Our Contributors
-<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
-  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
-</a>
-# License Information
-[LICENSE.md](LICENSE.md)
-The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
-# Acknowledgments
-- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
-- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
-- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
-- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
-# Citation
-```bibtex
-@article{he2024opendatalab,
-  title={Opendatalab: Empowering general artificial intelligence with open datasets},
-  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
-  journal={arXiv preprint arXiv:2407.13773},
-  year={2024}
-}
-@misc{2024mineru,
-    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
-    author={MinerU Contributors},
-    howpublished = {\url{https://github.com/opendatalab/MinerU}},
-    year={2024}
-}
-```
-# Star History
-<a>
- <picture>
-   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
-   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
-   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
- </picture>
-</a>
-# Links
-- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
-- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
-- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
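
The header portion of the deleted METADATA file uses RFC 822 style fields, so it can be read with Python's standard `email.parser`. The sketch below is an illustration only: it uses a shortened copy of the headers shown in the hunk, and it assumes a blank line separates the headers from the README body (the rendering above does not show blank lines). The function name `split_requirements` is hypothetical.

```python
import re
from email.parser import Parser

# Shortened copy of the headers from the deleted METADATA file.
METADATA = """\
Metadata-Version: 2.1
Name: magic-pdf
Version: 0.6.2b1
Requires-Python: >=3.9
Requires-Dist: click >=8.1.7
Requires-Dist: numpy <2.0.0,>=1.21.6
Provides-Extra: full
Requires-Dist: unimernet ==0.1.6 ; extra == 'full'
Provides-Extra: lite
Requires-Dist: paddleocr ==2.7.3 ; extra == 'lite'

# MinerU
"""

# Finds the extra name inside an environment marker such as: extra == 'full'
EXTRA = re.compile(r"extra\s*==\s*['\"]([^'\"]+)['\"]")


def split_requirements(text):
    """Split Requires-Dist entries into core requirements and per-extra lists."""
    msg = Parser().parsestr(text)
    core, extras = [], {}
    for req in msg.get_all("Requires-Dist", []):
        spec, _, marker = req.partition(";")
        m = EXTRA.search(marker)
        if m:
            extras.setdefault(m.group(1), []).append(spec.strip())
        else:
            core.append(spec.strip())
    return msg["Name"], msg["Version"], core, extras


name, version, core, extras = split_requirements(METADATA)
print(name, version)
print("core:", core)
print("extras:", extras)
```

Run against the full header block, this would list only `full` and `lite` as extras, which are the two `Provides-Extra` values declared in the hunk.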

magic_pdf-0.6.2b1.dist-info/entry_points.txt DELETED

@@ -1,2 +0,0 @@
-[console_scripts]
-magic-pdf = magic_pdf.cli.magicpdf:cli
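
The deleted entry point maps the `magic-pdf` command to the callable `cli` in the module `magic_pdf.cli.magicpdf`, a module the file list shows as removed in 0.7.0b1. The contents of the new `entry_points.txt` are not shown in this section, so the sketch below parses only the old file. It is a minimal illustration using the standard `configparser`; the helper name `console_scripts` is hypothetical.

```python
from configparser import ConfigParser

# Contents of the deleted 0.6.2b1 entry_points.txt, as shown in the hunk.
ENTRY_POINTS = """\
[console_scripts]
magic-pdf = magic_pdf.cli.magicpdf:cli
"""


def console_scripts(text):
    """Map each console command to its (module, attribute) target."""
    parser = ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep command names case-sensitive
    parser.read_string(text)
    result = {}
    for name, target in parser.items("console_scripts"):
        module, _, attr = target.partition(":")
        result[name] = (module.strip(), attr.strip())
    return result


print(console_scripts(ENTRY_POINTS))
```

Restricting `delimiters` to `=` keeps the `:` in `module:callable` from being read as a key/value separator.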

magic_pdf/{cli → model/pek_sub_modules/structeqtable}/__init__.py RENAMED

File without changes

magic_pdf/{train_utils → tools}/__init__.py RENAMED

File without changes

{magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/LICENSE.md RENAMED

File without changes

{magic_pdf-0.6.2b1.dist-info → magic_pdf-0.7.0b1.dist-info}/top_level.txt RENAMED

File without changes
