infinity-parser2 0.1.0__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/PKG-INFO +151 -13
  2. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/README.md +150 -12
  3. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/__init__.py +1 -1
  4. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/cli.py +9 -9
  5. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/parser.py +11 -7
  6. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/PKG-INFO +151 -13
  7. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/setup.py +1 -1
  8. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/__main__.py +0 -0
  9. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/__init__.py +0 -0
  10. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/base.py +0 -0
  11. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/transformers.py +0 -0
  12. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/vllm_engine.py +0 -0
  13. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/vllm_server.py +0 -0
  14. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/prompts.py +0 -0
  15. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/__init__.py +0 -0
  16. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/file.py +0 -0
  17. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/image.py +0 -0
  18. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/model.py +0 -0
  19. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/pdf.py +0 -0
  20. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/utils.py +0 -0
  21. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/SOURCES.txt +0 -0
  22. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/dependency_links.txt +0 -0
  23. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/entry_points.txt +0 -0
  24. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/requires.txt +0 -0
  25. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/top_level.txt +0 -0
  26. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/setup.cfg +0 -0
  27. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/__init__.py +0 -0
  28. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/test_backends.py +0 -0
  29. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/test_parser.py +0 -0
  30. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/test_utils.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: infinity_parser2
- Version: 0.1.0
+ Version: 0.3.0
  Summary: Document parsing Python package supporting PDF and image parsing using Infinity-Parser2-Pro model.
  Home-page: https://github.com/infly-ai/INF-MLLM
  Author: INF Tech
@@ -53,22 +53,148 @@ Dynamic: summary
 
  # Infinity-Parser2
 
- Infinity-Parser2 is a document parsing tool powered by the Infinity-Parser2-Pro model. It converts **PDF files** and **images** (PNG, JPG, WEBP) into structured Markdown or JSON with layout information.
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/logo.png" width="400"/>
+ <p>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/infly/Infinity-Parser2-Pro">Model</a> |
+ 📊 <a>Dataset (coming soon...)</a> |
+ 📄 <a>Paper (coming soon...)</a> |
+ 🚀 <a>Demo (coming soon...)</a>
+ </p>
+
+ ## Introduction
+
+ We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model that achieves a new state-of-the-art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model Infinity-Parser-7B, we have significantly enhanced our data engine and multi-task reinforcement learning approach. This enables the model to consolidate robust multi-modal parsing capabilities into a unified architecture, delivering brand-new zero-shot capabilities for diverse real-world business scenarios.
+
+ ### Key Features
+
+ - **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By generating over 1 million diverse full-text samples covering a wide range of document layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
+ - **Breakthrough Parsing Performance**: It substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
+ - **Inference Acceleration**: By adopting the highly efficient MoE architecture, our inference throughput has increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and costs.
+
+ ## Performance
+
+ <p align="left">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/document_parsing_performance_evaluation.png" width="1200"/>
+ <p>
 
  ## Quick Start
 
- ### Installation
+ ### 1. Minimal "Hello World" (Native Transformers)
+
+ If you are looking for a minimal script to parse a single image to Markdown using the native `transformers` library, here is a simple snippet:
+
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+ "infly/Infinity-Parser2-Pro",
+ torch_dtype="float16",
+ device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048 # 32 * 64
+ max_pixels = 16777216 # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+ - Figure: For the 'figure' category, the text field should be empty string.
+ - Formula: Format its text as LaTeX.
+ - Table: Format its text as HTML.
+ - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+ - The output text must be the original text from the image, with no translation.
+ - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "image",
+ "image": pil_image,
+ "min_pixels": min_pixels,
+ "max_pixels": max_pixels,
+ },
+ {"type": "text", "text": prompt},
+ ],
+ }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+ messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+ text=text,
+ images=image_inputs,
+ do_resize=False,
+ padding=True,
+ return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+ k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+ for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+ **inputs,
+ max_new_tokens=32768,
+ temperature=0.0,
+ top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+ out_ids[len(in_ids) :]
+ for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
+
+ ### 2. Advanced Pipeline (infinity_parser2)
+
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our infinity_parser2 wrapper.
 
  #### Pre-requisites
 
  ```bash
- # Install PyTorch (CUDA). Find the proper version on the [official site](https://pytorch.org/get-started/previous-versions) based on your CUDA version.
+ # Create a Conda environment (Optional)
+ conda create -n infinity_parser2 python=3.12
+ conda activate infinity_parser2
+
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
  pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
 
- # Install FlashAttention (required for NVIDIA GPUs).
- # This command builds flash-attn from source, which can take 10 to 30 minutes.
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
+ # Standard install (compiles from source, ~10-30 min):
  pip install flash-attn==2.8.3 --no-build-isolation
- # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See the [official guide](https://github.com/Dao-AILab/flash-attention).
+ # Faster install: download wheel from https://github.com/Dao-AILab/flash-attention/releases. Then run: pip install /path/to/<wheel_filename>.whl
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
+ # NOTE: The code will prioritize detecting FlashAttention-3. If not found, it falls back to FlashAttention-2.
 
  # Install vLLM
  # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
@@ -78,23 +204,29 @@ pip install vllm==0.17.1
 
  #### Install infinity_parser2
 
+ Install from PyPI
+
  ```bash
- # From PyPI
  pip install infinity_parser2
+ ```
+
+ Install from source code
 
- # From source
+ ```bash
  git clone https://github.com/infly-ai/INF-MLLM.git
  cd INF-MLLM/Infinity-Parser2
  pip install -e .
  ```
 
- ### Usage
+ #### Usage
 
- #### Command Line
+ ##### Command Line
 
  The `parser` command is the fastest way to get started.
 
  ```bash
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  # Parse a PDF (outputs Markdown by default)
  parser demo_data/demo.pdf
 
@@ -119,9 +251,11 @@ parser demo_data/demo.png --task doc2md
  parser --help
  ```
 
- #### Python API
+ ##### Python API
 
  ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  from infinity_parser2 import InfinityParser2
 
  parser = InfinityParser2()
@@ -154,7 +288,7 @@ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
 
  # Custom prompt
  result = parser.parse("demo_data/demo.pdf", task_type="custom",
- custom_prompt="Extract the title and authors only.")
+ custom_prompt="Please transform the document's contents into Markdown format.")
 
  # Batch processing with custom batch size
  result = parser.parse("demo_data", batch_size=8)
@@ -308,3 +442,7 @@ print(cache.resolve_model_path("infly/Infinity-Parser2-Pro"))
  - Python 3.12+
  - CUDA-compatible GPU
  - See `setup.py` for full dependency list.
+
+ ## Acknowledgments
+
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmocr](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing dataset, code and models.
@@ -1,21 +1,147 @@
  # Infinity-Parser2
 
- Infinity-Parser2 is a document parsing tool powered by the Infinity-Parser2-Pro model. It converts **PDF files** and **images** (PNG, JPG, WEBP) into structured Markdown or JSON with layout information.
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/logo.png" width="400"/>
+ <p>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/infly/Infinity-Parser2-Pro">Model</a> |
+ 📊 <a>Dataset (coming soon...)</a> |
+ 📄 <a>Paper (coming soon...)</a> |
+ 🚀 <a>Demo (coming soon...)</a>
+ </p>
+
+ ## Introduction
+
+ We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model that achieves a new state-of-the-art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model Infinity-Parser-7B, we have significantly enhanced our data engine and multi-task reinforcement learning approach. This enables the model to consolidate robust multi-modal parsing capabilities into a unified architecture, delivering brand-new zero-shot capabilities for diverse real-world business scenarios.
+
+ ### Key Features
+
+ - **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By generating over 1 million diverse full-text samples covering a wide range of document layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
+ - **Breakthrough Parsing Performance**: It substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
+ - **Inference Acceleration**: By adopting the highly efficient MoE architecture, our inference throughput has increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and costs.
+
+ ## Performance
+
+ <p align="left">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/document_parsing_performance_evaluation.png" width="1200"/>
+ <p>
 
  ## Quick Start
 
- ### Installation
+ ### 1. Minimal "Hello World" (Native Transformers)
+
+ If you are looking for a minimal script to parse a single image to Markdown using the native `transformers` library, here is a simple snippet:
+
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+ "infly/Infinity-Parser2-Pro",
+ torch_dtype="float16",
+ device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048 # 32 * 64
+ max_pixels = 16777216 # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+ - Figure: For the 'figure' category, the text field should be empty string.
+ - Formula: Format its text as LaTeX.
+ - Table: Format its text as HTML.
+ - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+ - The output text must be the original text from the image, with no translation.
+ - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "image",
+ "image": pil_image,
+ "min_pixels": min_pixels,
+ "max_pixels": max_pixels,
+ },
+ {"type": "text", "text": prompt},
+ ],
+ }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+ messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+ text=text,
+ images=image_inputs,
+ do_resize=False,
+ padding=True,
+ return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+ k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+ for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+ **inputs,
+ max_new_tokens=32768,
+ temperature=0.0,
+ top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+ out_ids[len(in_ids) :]
+ for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
+
+ ### 2. Advanced Pipeline (infinity_parser2)
+
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our infinity_parser2 wrapper.
 
  #### Pre-requisites
 
  ```bash
- # Install PyTorch (CUDA). Find the proper version on the [official site](https://pytorch.org/get-started/previous-versions) based on your CUDA version.
+ # Create a Conda environment (Optional)
+ conda create -n infinity_parser2 python=3.12
+ conda activate infinity_parser2
+
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
  pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
 
- # Install FlashAttention (required for NVIDIA GPUs).
- # This command builds flash-attn from source, which can take 10 to 30 minutes.
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
+ # Standard install (compiles from source, ~10-30 min):
  pip install flash-attn==2.8.3 --no-build-isolation
- # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See the [official guide](https://github.com/Dao-AILab/flash-attention).
+ # Faster install: download wheel from https://github.com/Dao-AILab/flash-attention/releases. Then run: pip install /path/to/<wheel_filename>.whl
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
+ # NOTE: The code will prioritize detecting FlashAttention-3. If not found, it falls back to FlashAttention-2.
 
  # Install vLLM
  # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
@@ -25,23 +151,29 @@ pip install vllm==0.17.1
 
  #### Install infinity_parser2
 
+ Install from PyPI
+
  ```bash
- # From PyPI
  pip install infinity_parser2
+ ```
+
+ Install from source code
 
- # From source
+ ```bash
  git clone https://github.com/infly-ai/INF-MLLM.git
  cd INF-MLLM/Infinity-Parser2
  pip install -e .
  ```
 
- ### Usage
+ #### Usage
 
- #### Command Line
+ ##### Command Line
 
  The `parser` command is the fastest way to get started.
 
  ```bash
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  # Parse a PDF (outputs Markdown by default)
  parser demo_data/demo.pdf
 
@@ -66,9 +198,11 @@ parser demo_data/demo.png --task doc2md
  parser --help
  ```
 
- #### Python API
+ ##### Python API
 
  ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  from infinity_parser2 import InfinityParser2
 
  parser = InfinityParser2()
@@ -101,7 +235,7 @@ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
 
  # Custom prompt
  result = parser.parse("demo_data/demo.pdf", task_type="custom",
- custom_prompt="Extract the title and authors only.")
+ custom_prompt="Please transform the document's contents into Markdown format.")
 
  # Batch processing with custom batch size
  result = parser.parse("demo_data", batch_size=8)
@@ -255,3 +389,7 @@ print(cache.resolve_model_path("infly/Infinity-Parser2-Pro"))
  - Python 3.12+
  - CUDA-compatible GPU
  - See `setup.py` for full dependency list.
+
+ ## Acknowledgments
+
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmocr](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing dataset, code and models.
@@ -1,6 +1,6 @@
  """Infinity-Parser2: Document parsing Python package."""
 
- __version__ = "0.1.0"
+ __version__ = "0.3.0"
 
  from .parser import InfinityParser2
  from .backends import (
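The `__version__` bump above (0.1.0 → 0.3.0) can be sanity-checked with a plain tuple comparison; this is an illustrative sketch, not code from the package:

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))

old, new = parse_version("0.1.0"), parse_version("0.3.0")
assert new > old           # 0.3.0 sorts after 0.1.0
assert new[0] == old[0]    # major version unchanged, so no breaking change is signaled
```

Comparing tuples of ints rather than raw strings keeps ordering correct past single digits (e.g. `(0, 10, 0) > (0, 9, 0)`, whereas `"0.10.0" < "0.9.0"` as strings).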
@@ -26,28 +26,28 @@ def build_parser() -> argparse.ArgumentParser:
  epilog="""
  Examples:
  # Parse a PDF file (default: doc2json -> markdown output)
- parser document.pdf
+ parser demo_data/demo.pdf
 
  # Parse with doc2md task type
- parser document.pdf --task doc2md
+ parser demo_data/demo.pdf --task doc2md
 
  # Parse with custom prompt
- parser document.pdf --task custom --prompt "Extract the title and authors"
+ parser demo_data/demo.pdf --task custom --prompt "Please transform the document's contents into Markdown format."
 
  # Parse multiple files
- parser doc1.pdf doc2.png --output-dir ./results
+ parser demo_data/demo.pdf demo_data/demo.png --output-dir ./results
 
  # Parse a directory
- parser ./docs --output-dir ./results
+ parser demo_data --output-dir ./results
 
  # Output raw JSON
- parser document.pdf --output-format json
+ parser demo_data/demo.pdf --output-format json
 
  # Use transformers backend
- parser document.pdf --backend transformers
+ parser demo_data/demo.pdf --backend transformers
 
  # Use vllm-server backend
- parser document.pdf --backend vllm-server --api-url http://localhost:8000/v1/chat/completions
+ parser demo_data/demo.pdf --backend vllm-server --api-url http://localhost:8000/v1/chat/completions
  """,
  )
 
@@ -136,7 +136,7 @@ Examples:
  parser.add_argument(
  "--version",
  action="version",
- version="Infinity-Parser2 0.1.0",
+ version="Infinity-Parser2 0.3.0",
  )
 
  return parser
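The `--version` string updated in `cli.py` is a standard argparse version action; a minimal standalone sketch (the real `build_parser` defines many more arguments):

```python
import argparse

def build_version_parser() -> argparse.ArgumentParser:
    # A "version" action prints the version string and exits with status 0.
    p = argparse.ArgumentParser(prog="parser")
    p.add_argument("--version", action="version", version="Infinity-Parser2 0.3.0")
    return p
```

Running `parser --version` therefore prints `Infinity-Parser2 0.3.0` and exits immediately, without touching any other arguments.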
@@ -52,7 +52,7 @@ class InfinityParser2:
  Example:
  >>> from infinity_parser2 import InfinityParser2
  >>> parser = InfinityParser2(model_name="infly/Infinity-Parser2-Pro")
- >>> result = parser.parse("document.pdf")
+ >>> result = parser.parse("demo_data/demo.pdf")
  """
 
  def __init__(
@@ -86,8 +86,11 @@ class InfinityParser2:
  self.kwargs = kwargs
 
  # Initialize model cache and resolve model path (stored separately)
- cache = get_model_cache(model_cache_dir)
- self._model_path = cache.resolve_model_path(self.model_name)
+ if self.backend_name == "vllm-server":
+ self._model_path = self.model_name
+ else:
+ cache = get_model_cache(model_cache_dir)
+ self._model_path = cache.resolve_model_path(self.model_name)
 
  self._backend: BaseBackend = self._init_backend()
 
@@ -183,13 +186,13 @@ class InfinityParser2:
  Example:
  >>> parser = InfinityParser2()
  >>> # Single file, returns str
- >>> result = parser.parse("document.pdf")
+ >>> result = parser.parse("demo_data/demo.pdf")
  >>> # Multiple files, returns List[str]
- >>> result = parser.parse(["doc1.pdf", "doc2.pdf"])
+ >>> result = parser.parse(["demo_data/demo.pdf", "demo_data/demo.png"])
  >>> # Directory, returns Dict[str, str]
- >>> result = parser.parse("/path/to/docs")
+ >>> result = parser.parse("./demo_data")
  >>> # Save results to output_dir, returns None
- >>> parser.parse("document.pdf", output_dir="./output")
+ >>> parser.parse("demo_data/demo.pdf", output_dir="./output")
  """
  if task_type not in SUPPORTED_TASK_TYPES:
  raise ValueError(f"task_type must be one of {SUPPORTED_TASK_TYPES}, got '{task_type}'")
@@ -204,6 +207,7 @@ class InfinityParser2:
  )
 
  prompt = self._resolve_prompt(task_type, custom_prompt)
+ print(f"[Infinity-Parser2] task_type: {task_type}, prompt: {prompt}")
 
  is_directory = isinstance(input_data, str) and os.path.isdir(input_data)
  file_paths = normalize_input(input_data)
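The model-path hunk in `parser.py` above adds a special case: with the `vllm-server` backend the model name is passed through to the remote server, while local backends still resolve a cached path. A standalone sketch of that control flow (function and argument names here are illustrative, not the package's API):

```python
from typing import Callable

def resolve_model_path(
    model_name: str,
    backend_name: str,
    cache_resolver: Callable[[str], str],
) -> str:
    # vllm-server sends requests to an already-running server over HTTP,
    # so no local weights are needed and the name is forwarded unchanged.
    if backend_name == "vllm-server":
        return model_name
    # Local backends (e.g. transformers, vllm-engine) resolve a cached path.
    return cache_resolver(model_name)
```

This avoids the 0.1.0 behavior, where even a server-only configuration triggered a local model-cache lookup.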
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: infinity_parser2
3
- Version: 0.1.0
3
+ Version: 0.3.0
4
4
  Summary: Document parsing Python package supporting PDF and image parsing using Infinity-Parser2-Pro model.
5
5
  Home-page: https://github.com/infly-ai/INF-MLLM
6
6
  Author: INF Tech
@@ -53,22 +53,148 @@ Dynamic: summary
53
53
 
54
54
  # Infinity-Parser2
55
55
 
56
- Infinity-Parser2 is a document parsing tool powered by the Infinity-Parser2-Pro model. It converts **PDF files** and **images** (PNG, JPG, WEBP) into structured Markdown or JSON with layout information.
56
+ <p align="center">
57
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/logo.png" width="400"/>
58
+ <p>
59
+
60
+ <p align="center">
61
+ 🤗 <a href="https://huggingface.co/infly/Infinity-Parser2-Pro">Model</a> |
62
+ 📊 <a>Dataset (coming soon...)</a> |
63
+ 📄 <a>Paper (coming soon...)</a> |
64
+ 🚀 <a>Demo (coming soon...)</a>
65
+ </p>
66
+
67
+ ## Introduction
68
+
69
+ We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model that achieves a new state-of-the-art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model Infinity-Parser-7B, we have significantly enhanced our data engine and multi-task reinforcement learning approach. This enables the model to consolidate robust multi-modal parsing capabilities into a unified architecture, delivering brand-new zero-shot capabilities for diverse real-world business scenarios.
70
+
71
+ ### Key Features
72
+
73
+ - **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By generating over 1 million diverse full-text samples covering a wide range of document layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
74
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
75
+ - **Breakthrough Parsing Performance**: It substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
76
+ - **Inference Acceleration**: By adopting the highly efficient MoE architecture, our inference throughput has increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and costs.
77
+
78
+ ## Performance
79
+
80
+ <p align="left">
81
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/document_parsing_performance_evaluation.png" width="1200"/>
82
+ <p>
57
83
 
58
84
  ## Quick Start
59
85
 
60
- ### Installation
86
+ ### 1. Minimal "Hello World" (Native Transformers)
87
+
88
+ If you are looking for a minimal script to parse a single image to Markdown using the native `transformers` library, here is a simple snippet:
89
+
90
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+     "infly/Infinity-Parser2-Pro",
+     torch_dtype="float16",
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048  # 32 * 64
+ max_pixels = 16777216  # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+     - Figure: For the 'figure' category, the text field should be empty string.
+     - Formula: Format its text as LaTeX.
+     - Table: Format its text as HTML.
+     - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+     - The output text must be the original text from the image, with no translation.
+     - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": pil_image,
+                 "min_pixels": min_pixels,
+                 "max_pixels": max_pixels,
+             },
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+     text=text,
+     images=image_inputs,
+     do_resize=False,
+     padding=True,
+     return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+     k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+     for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+     **inputs,
+     max_new_tokens=32768,
+     temperature=0.0,
+     top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :]
+     for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
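
The prompt above asks the model for a single JSON object whose layout elements carry a bbox, a category, and text. As a hedged sketch of post-processing that output (the exact key names `category`/`text` and the top-level shape are assumptions based on the prompt, not a documented schema), the decoded string could be rendered to Markdown like this:

```python
import json

def layout_json_to_markdown(raw_output: str) -> str:
    # Parse the model's layout JSON and join element texts in reading order.
    # NOTE: the key names ("category", "text") mirror the prompt above but
    # are assumptions; adjust them to match the real output schema.
    elements = json.loads(raw_output)
    if isinstance(elements, dict):  # unwrap a possible top-level wrapper key
        elements = next(iter(elements.values()))
    parts = []
    for el in elements:
        text = el.get("text", "")
        if el.get("category") == "figure" or not text:
            continue  # figures carry no text per the prompt rules
        parts.append(text)
    return "\n\n".join(parts)

# Synthetic example response (not real model output):
sample = (
    '[{"bbox": [40, 30, 560, 70], "category": "title", "text": "# Demo"},'
    ' {"bbox": [40, 90, 560, 140], "category": "figure", "text": ""},'
    ' {"bbox": [40, 160, 560, 210], "category": "text", "text": "Hello."}]'
)
print(layout_json_to_markdown(sample))
```

Since the prompt formats tables as HTML and formulas as LaTeX inside the `text` fields, a simple concatenation like this yields Markdown with embedded HTML/LaTeX fragments.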
+
+ ### 2. Advanced Pipeline (infinity_parser2)
+
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our `infinity_parser2` wrapper.
 
 #### Pre-requisites
 
 ```bash
- # Install PyTorch (CUDA). Find the proper version on the [official site](https://pytorch.org/get-started/previous-versions) based on your CUDA version.
+ # Create a Conda environment (optional)
+ conda create -n infinity_parser2 python=3.12
+ conda activate infinity_parser2
+
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
 pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
 
- # Install FlashAttention (required for NVIDIA GPUs).
- # This command builds flash-attn from source, which can take 10 to 30 minutes.
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
+ # Standard install (compiles from source, ~10-30 min):
 pip install flash-attn==2.8.3 --no-build-isolation
- # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See the [official guide](https://github.com/Dao-AILab/flash-attention).
+ # Faster install: download a wheel from https://github.com/Dao-AILab/flash-attention/releases, then run: pip install /path/to/<wheel_filename>.whl
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
+ # NOTE: The code prioritizes FlashAttention-3 and falls back to FlashAttention-2 if it is not found.
 
 # Install vLLM
 # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
@@ -78,23 +204,29 @@ pip install vllm==0.17.1
 
 #### Install infinity_parser2
 
+ Install from PyPI:
+
 ```bash
- # From PyPI
 pip install infinity_parser2
+ ```
+
+ Install from source:
 
- # From source
+ ```bash
 git clone https://github.com/infly-ai/INF-MLLM.git
 cd INF-MLLM/Infinity-Parser2
 pip install -e .
 ```
 
- ### Usage
+ #### Usage
 
- #### Command Line
+ ##### Command Line
 
 The `parser` command is the fastest way to get started.
 
 ```bash
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
 # Parse a PDF (outputs Markdown by default)
 parser demo_data/demo.pdf
 
@@ -119,9 +251,11 @@ parser demo_data/demo.png --task doc2md
 parser --help
 ```
 
- #### Python API
+ ##### Python API
 
 ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
 from infinity_parser2 import InfinityParser2
 
 parser = InfinityParser2()
@@ -154,7 +288,7 @@ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
 
 # Custom prompt
 result = parser.parse("demo_data/demo.pdf", task_type="custom",
-                       custom_prompt="Extract the title and authors only.")
+                       custom_prompt="Please transform the document's contents into Markdown format.")
 
 # Batch processing with custom batch size
 result = parser.parse("demo_data", batch_size=8)
@@ -308,3 +442,7 @@ print(cache.resolve_model_path("infly/Infinity-Parser2-Pro"))
 - Python 3.12+
 - CUDA-compatible GPU
 - See `setup.py` for the full dependency list.
+
+ ## Acknowledgments
+
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmOCR](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), and [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing datasets, code, and models.
@@ -32,7 +32,7 @@ install_requires = [
 
 setup(
     name="infinity_parser2",
-     version="0.1.0",
+     version="0.3.0",
     description="Document parsing Python package supporting PDF and image parsing using Infinity-Parser2-Pro model.",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",