infinity-parser2 0.1.0__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/PKG-INFO +151 -13
  2. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/README.md +150 -12
  3. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/__init__.py +1 -1
  4. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/cli.py +9 -9
  5. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/parser.py +11 -7
  6. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/PKG-INFO +151 -13
  7. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/setup.py +1 -1
  8. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/__main__.py +0 -0
  9. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/__init__.py +0 -0
  10. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/base.py +0 -0
  11. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/transformers.py +0 -0
  12. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/vllm_engine.py +0 -0
  13. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/backends/vllm_server.py +0 -0
  14. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/prompts.py +0 -0
  15. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/__init__.py +0 -0
  16. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/file.py +0 -0
  17. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/image.py +0 -0
  18. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/model.py +0 -0
  19. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/pdf.py +0 -0
  20. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2/utils/utils.py +0 -0
  21. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/SOURCES.txt +0 -0
  22. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/dependency_links.txt +0 -0
  23. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/entry_points.txt +0 -0
  24. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/requires.txt +0 -0
  25. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/infinity_parser2.egg-info/top_level.txt +0 -0
  26. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/setup.cfg +0 -0
  27. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/__init__.py +0 -0
  28. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/test_backends.py +0 -0
  29. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/test_parser.py +0 -0
  30. {infinity_parser2-0.1.0 → infinity_parser2-0.3.0}/tests/test_utils.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: infinity_parser2
- Version: 0.1.0
+ Version: 0.3.0
  Summary: Document parsing Python package supporting PDF and image parsing using Infinity-Parser2-Pro model.
  Home-page: https://github.com/infly-ai/INF-MLLM
  Author: INF Tech
@@ -53,22 +53,148 @@ Dynamic: summary
 
  # Infinity-Parser2
 
- Infinity-Parser2 is a document parsing tool powered by the Infinity-Parser2-Pro model. It converts **PDF files** and **images** (PNG, JPG, WEBP) into structured Markdown or JSON with layout information.
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/logo.png" width="400"/>
+ <p>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/infly/Infinity-Parser2-Pro">Model</a> |
+ 📊 <a>Dataset (coming soon...)</a> |
+ 📄 <a>Paper (coming soon...)</a> |
+ 🚀 <a>Demo (coming soon...)</a>
+ </p>
+
+ ## Introduction
+
+ We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model that achieves a new state-of-the-art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model Infinity-Parser-7B, we have significantly enhanced our data engine and multi-task reinforcement learning approach. This enables the model to consolidate robust multi-modal parsing capabilities into a unified architecture, delivering brand-new zero-shot capabilities for diverse real-world business scenarios.
+
+ ### Key Features
+
+ - **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By generating over 1 million diverse full-text samples covering a wide range of document layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
+ - **Breakthrough Parsing Performance**: It substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
+ - **Inference Acceleration**: By adopting the highly efficient MoE architecture, our inference throughput has increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and costs.
+
+ ## Performance
+
+ <p align="left">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/document_parsing_performance_evaluation.png" width="1200"/>
+ <p>
 
  ## Quick Start
 
- ### Installation
+ ### 1. Minimal "Hello World" (Native Transformers)
+
+ If you are looking for a minimal script to parse a single image to Markdown using the native `transformers` library, here is a simple snippet:
+
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+ "infly/Infinity-Parser2-Pro",
+ torch_dtype="float16",
+ device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048 # 32 * 64
+ max_pixels = 16777216 # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+ - Figure: For the 'figure' category, the text field should be empty string.
+ - Formula: Format its text as LaTeX.
+ - Table: Format its text as HTML.
+ - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+ - The output text must be the original text from the image, with no translation.
+ - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "image",
+ "image": pil_image,
+ "min_pixels": min_pixels,
+ "max_pixels": max_pixels,
+ },
+ {"type": "text", "text": prompt},
+ ],
+ }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+ messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+ text=text,
+ images=image_inputs,
+ do_resize=False,
+ padding=True,
+ return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+ k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+ for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+ **inputs,
+ max_new_tokens=32768,
+ temperature=0.0,
+ top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+ out_ids[len(in_ids) :]
+ for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
+
+ ### 2. Advanced Pipeline (infinity_parser2)
+
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our infinity_parser2 wrapper.
 
  #### Pre-requisites
 
  ```bash
- # Install PyTorch (CUDA). Find the proper version on the [official site](https://pytorch.org/get-started/previous-versions) based on your CUDA version.
+ # Create a Conda environment (Optional)
+ conda create -n infinity_parser2 python=3.12
+ conda activate infinity_parser2
+
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
  pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
 
- # Install FlashAttention (required for NVIDIA GPUs).
- # This command builds flash-attn from source, which can take 10 to 30 minutes.
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
+ # Standard install (compiles from source, ~10-30 min):
  pip install flash-attn==2.8.3 --no-build-isolation
- # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See the [official guide](https://github.com/Dao-AILab/flash-attention).
+ # Faster install: download wheel from https://github.com/Dao-AILab/flash-attention/releases. Then run: pip install /path/to/<wheel_filename>.whl
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
+ # NOTE: The code will prioritize detecting FlashAttention-3. If not found, it falls back to FlashAttention-2.
 
  # Install vLLM
  # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
@@ -78,23 +204,29 @@ pip install vllm==0.17.1
 
  #### Install infinity_parser2
 
+ Install from PyPI
+
  ```bash
- # From PyPI
  pip install infinity_parser2
+ ```
+
+ Install from source code
 
- # From source
+ ```bash
  git clone https://github.com/infly-ai/INF-MLLM.git
  cd INF-MLLM/Infinity-Parser2
  pip install -e .
  ```
 
- ### Usage
+ #### Usage
 
- #### Command Line
+ ##### Command Line
 
  The `parser` command is the fastest way to get started.
 
  ```bash
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  # Parse a PDF (outputs Markdown by default)
  parser demo_data/demo.pdf
 
@@ -119,9 +251,11 @@ parser demo_data/demo.png --task doc2md
  parser --help
  ```
 
- #### Python API
+ ##### Python API
 
  ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  from infinity_parser2 import InfinityParser2
 
  parser = InfinityParser2()
@@ -154,7 +288,7 @@ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
 
  # Custom prompt
  result = parser.parse("demo_data/demo.pdf", task_type="custom",
- custom_prompt="Extract the title and authors only.")
+ custom_prompt="Please transform the document's contents into Markdown format.")
 
  # Batch processing with custom batch size
  result = parser.parse("demo_data", batch_size=8)
@@ -308,3 +442,7 @@ print(cache.resolve_model_path("infly/Infinity-Parser2-Pro"))
  - Python 3.12+
  - CUDA-compatible GPU
  - See `setup.py` for full dependency list.
+
+ ## Acknowledgments
+
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmocr](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing dataset, code and models.
@@ -1,21 +1,147 @@
  # Infinity-Parser2
 
- Infinity-Parser2 is a document parsing tool powered by the Infinity-Parser2-Pro model. It converts **PDF files** and **images** (PNG, JPG, WEBP) into structured Markdown or JSON with layout information.
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/logo.png" width="400"/>
+ <p>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/infly/Infinity-Parser2-Pro">Model</a> |
+ 📊 <a>Dataset (coming soon...)</a> |
+ 📄 <a>Paper (coming soon...)</a> |
+ 🚀 <a>Demo (coming soon...)</a>
+ </p>
+
+ ## Introduction
+
+ We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model that achieves a new state-of-the-art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model Infinity-Parser-7B, we have significantly enhanced our data engine and multi-task reinforcement learning approach. This enables the model to consolidate robust multi-modal parsing capabilities into a unified architecture, delivering brand-new zero-shot capabilities for diverse real-world business scenarios.
+
+ ### Key Features
+
+ - **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By generating over 1 million diverse full-text samples covering a wide range of document layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
+ - **Breakthrough Parsing Performance**: It substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
+ - **Inference Acceleration**: By adopting the highly efficient MoE architecture, our inference throughput has increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and costs.
+
+ ## Performance
+
+ <p align="left">
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/document_parsing_performance_evaluation.png" width="1200"/>
+ <p>
 
  ## Quick Start
 
- ### Installation
+ ### 1. Minimal "Hello World" (Native Transformers)
+
+ If you are looking for a minimal script to parse a single image to Markdown using the native `transformers` library, here is a simple snippet:
+
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+ "infly/Infinity-Parser2-Pro",
+ torch_dtype="float16",
+ device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048 # 32 * 64
+ max_pixels = 16777216 # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+ - Figure: For the 'figure' category, the text field should be empty string.
+ - Formula: Format its text as LaTeX.
+ - Table: Format its text as HTML.
+ - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+ - The output text must be the original text from the image, with no translation.
+ - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "image",
+ "image": pil_image,
+ "min_pixels": min_pixels,
+ "max_pixels": max_pixels,
+ },
+ {"type": "text", "text": prompt},
+ ],
+ }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+ messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+ text=text,
+ images=image_inputs,
+ do_resize=False,
+ padding=True,
+ return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+ k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+ for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+ **inputs,
+ max_new_tokens=32768,
+ temperature=0.0,
+ top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+ out_ids[len(in_ids) :]
+ for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
+
+ ### 2. Advanced Pipeline (infinity_parser2)
+
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our infinity_parser2 wrapper.
 
  #### Pre-requisites
 
  ```bash
- # Install PyTorch (CUDA). Find the proper version on the [official site](https://pytorch.org/get-started/previous-versions) based on your CUDA version.
+ # Create a Conda environment (Optional)
+ conda create -n infinity_parser2 python=3.12
+ conda activate infinity_parser2
+
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
  pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
 
- # Install FlashAttention (required for NVIDIA GPUs).
- # This command builds flash-attn from source, which can take 10 to 30 minutes.
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
+ # Standard install (compiles from source, ~10-30 min):
  pip install flash-attn==2.8.3 --no-build-isolation
- # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See the [official guide](https://github.com/Dao-AILab/flash-attention).
+ # Faster install: download wheel from https://github.com/Dao-AILab/flash-attention/releases. Then run: pip install /path/to/<wheel_filename>.whl
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
+ # NOTE: The code will prioritize detecting FlashAttention-3. If not found, it falls back to FlashAttention-2.
 
  # Install vLLM
  # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
@@ -25,23 +151,29 @@ pip install vllm==0.17.1
 
  #### Install infinity_parser2
 
+ Install from PyPI
+
  ```bash
- # From PyPI
  pip install infinity_parser2
+ ```
+
+ Install from source code
 
- # From source
+ ```bash
  git clone https://github.com/infly-ai/INF-MLLM.git
  cd INF-MLLM/Infinity-Parser2
  pip install -e .
  ```
 
- ### Usage
+ #### Usage
 
- #### Command Line
+ ##### Command Line
 
  The `parser` command is the fastest way to get started.
 
  ```bash
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  # Parse a PDF (outputs Markdown by default)
  parser demo_data/demo.pdf
 
@@ -66,9 +198,11 @@ parser demo_data/demo.png --task doc2md
  parser --help
  ```
 
- #### Python API
+ ##### Python API
 
  ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
  from infinity_parser2 import InfinityParser2
 
  parser = InfinityParser2()
@@ -101,7 +235,7 @@ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
 
  # Custom prompt
  result = parser.parse("demo_data/demo.pdf", task_type="custom",
- custom_prompt="Extract the title and authors only.")
+ custom_prompt="Please transform the document's contents into Markdown format.")
 
  # Batch processing with custom batch size
  result = parser.parse("demo_data", batch_size=8)
@@ -255,3 +389,7 @@ print(cache.resolve_model_path("infly/Infinity-Parser2-Pro"))
  - Python 3.12+
  - CUDA-compatible GPU
  - See `setup.py` for full dependency list.
+
+ ## Acknowledgments
+
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmocr](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing dataset, code and models.
@@ -1,6 +1,6 @@
  """Infinity-Parser2: Document parsing Python package."""
 
- __version__ = "0.1.0"
+ __version__ = "0.3.0"
 
  from .parser import InfinityParser2
  from .backends import (
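The `__version__` bump above (0.1.0 → 0.3.0) can be sanity-checked with a plain tuple comparison; this is an illustrative sketch, not code from the package:

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))

old, new = parse_version("0.1.0"), parse_version("0.3.0")
assert new > old           # 0.3.0 sorts after 0.1.0
assert new[0] == old[0]    # major version unchanged, so no breaking change is signaled
```

Comparing tuples of ints rather than raw strings keeps ordering correct past single digits (e.g. `(0, 10, 0) > (0, 9, 0)`, whereas `"0.10.0" < "0.9.0"` as strings).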
@@ -26,28 +26,28 @@ def build_parser() -> argparse.ArgumentParser:
  epilog="""
  Examples:
  # Parse a PDF file (default: doc2json -> markdown output)
- parser document.pdf
+ parser demo_data/demo.pdf
 
  # Parse with doc2md task type
- parser document.pdf --task doc2md
+ parser demo_data/demo.pdf --task doc2md
 
  # Parse with custom prompt
- parser document.pdf --task custom --prompt "Extract the title and authors"
+ parser demo_data/demo.pdf --task custom --prompt "Please transform the document's contents into Markdown format."
 
  # Parse multiple files
- parser doc1.pdf doc2.png --output-dir ./results
+ parser demo_data/demo.pdf demo_data/demo.png --output-dir ./results
 
  # Parse a directory
- parser ./docs --output-dir ./results
+ parser demo_data --output-dir ./results
 
  # Output raw JSON
- parser document.pdf --output-format json
+ parser demo_data/demo.pdf --output-format json
 
  # Use transformers backend
- parser document.pdf --backend transformers
+ parser demo_data/demo.pdf --backend transformers
 
  # Use vllm-server backend
- parser document.pdf --backend vllm-server --api-url http://localhost:8000/v1/chat/completions
+ parser demo_data/demo.pdf --backend vllm-server --api-url http://localhost:8000/v1/chat/completions
  """,
  )
 
@@ -136,7 +136,7 @@ Examples:
  parser.add_argument(
  "--version",
  action="version",
- version="Infinity-Parser2 0.1.0",
+ version="Infinity-Parser2 0.3.0",
  )
 
  return parser
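The `--version` string updated in `cli.py` is a standard argparse version action; a minimal standalone sketch (the real `build_parser` defines many more arguments):

```python
import argparse

def build_version_parser() -> argparse.ArgumentParser:
    # A "version" action prints the version string and exits with status 0.
    p = argparse.ArgumentParser(prog="parser")
    p.add_argument("--version", action="version", version="Infinity-Parser2 0.3.0")
    return p
```

Running `parser --version` therefore prints `Infinity-Parser2 0.3.0` and exits immediately, without touching any other arguments.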
@@ -52,7 +52,7 @@ class InfinityParser2:
  Example:
  >>> from infinity_parser2 import InfinityParser2
  >>> parser = InfinityParser2(model_name="infly/Infinity-Parser2-Pro")
- >>> result = parser.parse("document.pdf")
+ >>> result = parser.parse("demo_data/demo.pdf")
  """
 
  def __init__(
@@ -86,8 +86,11 @@ class InfinityParser2:
  self.kwargs = kwargs
 
  # Initialize model cache and resolve model path (stored separately)
- cache = get_model_cache(model_cache_dir)
- self._model_path = cache.resolve_model_path(self.model_name)
+ if self.backend_name == "vllm-server":
+ self._model_path = self.model_name
+ else:
+ cache = get_model_cache(model_cache_dir)
+ self._model_path = cache.resolve_model_path(self.model_name)
 
  self._backend: BaseBackend = self._init_backend()
 
@@ -183,13 +186,13 @@ class InfinityParser2:
  Example:
  >>> parser = InfinityParser2()
  >>> # Single file, returns str
- >>> result = parser.parse("document.pdf")
+ >>> result = parser.parse("demo_data/demo.pdf")
  >>> # Multiple files, returns List[str]
- >>> result = parser.parse(["doc1.pdf", "doc2.pdf"])
+ >>> result = parser.parse(["demo_data/demo.pdf", "demo_data/demo.png"])
  >>> # Directory, returns Dict[str, str]
- >>> result = parser.parse("/path/to/docs")
+ >>> result = parser.parse("./demo_data")
  >>> # Save results to output_dir, returns None
- >>> parser.parse("document.pdf", output_dir="./output")
+ >>> parser.parse("demo_data/demo.pdf", output_dir="./output")
  """
  if task_type not in SUPPORTED_TASK_TYPES:
  raise ValueError(f"task_type must be one of {SUPPORTED_TASK_TYPES}, got '{task_type}'")
@@ -204,6 +207,7 @@ class InfinityParser2:
  )
 
  prompt = self._resolve_prompt(task_type, custom_prompt)
+ print(f"[Infinity-Parser2] task_type: {task_type}, prompt: {prompt}")
 
  is_directory = isinstance(input_data, str) and os.path.isdir(input_data)
  file_paths = normalize_input(input_data)
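The model-path hunk in `parser.py` above adds a special case: with the `vllm-server` backend the model name is passed through to the remote server, while local backends still resolve a cached path. A standalone sketch of that control flow (function and argument names here are illustrative, not the package's API):

```python
from typing import Callable

def resolve_model_path(
    model_name: str,
    backend_name: str,
    cache_resolver: Callable[[str], str],
) -> str:
    # vllm-server sends requests to an already-running server over HTTP,
    # so no local weights are needed and the name is forwarded unchanged.
    if backend_name == "vllm-server":
        return model_name
    # Local backends (e.g. transformers, vllm-engine) resolve a cached path.
    return cache_resolver(model_name)
```

This avoids the 0.1.0 behavior, where even a server-only configuration triggered a local model-cache lookup.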
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: infinity_parser2
3
- Version: 0.1.0
3
+ Version: 0.3.0
4
4
  Summary: Document parsing Python package supporting PDF and image parsing using Infinity-Parser2-Pro model.
5
5
  Home-page: https://github.com/infly-ai/INF-MLLM
6
6
  Author: INF Tech
@@ -53,22 +53,148 @@ Dynamic: summary
53
53
 
54
54
  # Infinity-Parser2
55
55
 
56
- Infinity-Parser2 is a document parsing tool powered by the Infinity-Parser2-Pro model. It converts **PDF files** and **images** (PNG, JPG, WEBP) into structured Markdown or JSON with layout information.
56
+ <p align="center">
57
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/logo.png" width="400"/>
58
+ <p>
59
+
60
+ <p align="center">
61
+ 🤗 <a href="https://huggingface.co/infly/Infinity-Parser2-Pro">Model</a> |
62
+ 📊 <a>Dataset (coming soon...)</a> |
63
+ 📄 <a>Paper (coming soon...)</a> |
64
+ 🚀 <a>Demo (coming soon...)</a>
65
+ </p>
66
+
67
+ ## Introduction
68
+
69
+ We are excited to release Infinity-Parser2-Pro, our latest flagship document understanding model that achieves a new state-of-the-art on olmOCR-Bench with a score of 86.7%, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr. Building on our previous model Infinity-Parser-7B, we have significantly enhanced our data engine and multi-task reinforcement learning approach. This enables the model to consolidate robust multi-modal parsing capabilities into a unified architecture, delivering brand-new zero-shot capabilities for diverse real-world business scenarios.
70
+
71
+ ### Key Features
72
+
73
+ - **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By generating over 1 million diverse full-text samples covering a wide range of document layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
74
+ - **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including doc2json and doc2markdown.
75
+ - **Breakthrough Parsing Performance**: It substantially outperforms our previous 7B model, achieving 86.7% on olmOCR-Bench, surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and dots.mocr.
76
+ - **Inference Acceleration**: By adopting the highly efficient MoE architecture, our inference throughput has increased by 21% (from 441 to 534 tokens/sec), reducing deployment latency and costs.
77
+
78
+ ## Performance
79
+
80
+ <p align="left">
81
+ <img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/document_parsing_performance_evaluation.png" width="1200"/>
82
+ <p>
57
83
 
58
84
  ## Quick Start
59
85
 
60
- ### Installation
86
+ ### 1. Minimal "Hello World" (Native Transformers)
87
+
88
+ If you are looking for a minimal script to parse a single image to Markdown using the native `transformers` library, here is a simple snippet:
89
+
90
+ ```python
+ from PIL import Image
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model = AutoModelForImageTextToText.from_pretrained(
+     "infly/Infinity-Parser2-Pro",
+     torch_dtype="float16",
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("infly/Infinity-Parser2-Pro")
+
+ # Build the messages for the model
+ pil_image = Image.open("demo_data/demo.png").convert("RGB")
+ min_pixels = 2048  # 32 * 64
+ max_pixels = 16777216  # 4096 * 4096
+ prompt = """
+ Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
+ 1. Bbox format: [x1, y1, x2, y2]
+ 2. Layout Categories: The possible categories are ['header', 'title', 'text', 'figure', 'table', 'formula', 'figure_caption', 'table_caption', 'formula_caption', 'figure_footnote', 'table_footnote', 'page_footnote', 'footer'].
+ 3. Text Extraction & Formatting Rules:
+     - Figure: For the 'figure' category, the text field should be empty string.
+     - Formula: Format its text as LaTeX.
+     - Table: Format its text as HTML.
+     - All Others (Text, Title, etc.): Format their text as Markdown.
+ 4. Constraints:
+     - The output text must be the original text from the image, with no translation.
+     - All layout elements must be sorted according to human reading order.
+ 5. Final Output: The entire output must be a single JSON object.
+ """
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": pil_image,
+                 "min_pixels": min_pixels,
+                 "max_pixels": max_pixels,
+             },
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ chat_template_kwargs = {"enable_thinking": False}
+
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True, **chat_template_kwargs
+ )
+ image_inputs, _ = process_vision_info(messages, image_patch_size=16)
+
+ inputs = processor(
+     text=text,
+     images=image_inputs,
+     do_resize=False,
+     padding=True,
+     return_tensors="pt",
+ )
+
+ # Move all tensors to the same device as the model
+ inputs = {
+     k: v.to(model.device) if isinstance(v, torch.Tensor) else v
+     for k, v in inputs.items()
+ }
+
+ # Generate the response
+ generated_ids = model.generate(
+     **inputs,
+     max_new_tokens=32768,
+     temperature=0.0,
+     top_p=1.0,
+ )
+
+ # Strip input tokens, keeping only the newly generated response
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :]
+     for in_ids, out_ids in zip(inputs["input_ids"], generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
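
The prompt above asks the model for a single JSON object whose layout elements carry a bbox, a category, and text. As a hedged sketch of post-processing that output (the exact key names `category`/`text` and the top-level shape are assumptions based on the prompt, not a documented schema), the decoded string could be rendered to Markdown like this:

```python
import json

def layout_json_to_markdown(raw_output: str) -> str:
    # Parse the model's layout JSON and join element texts in reading order.
    # NOTE: the key names ("category", "text") mirror the prompt above but
    # are assumptions; adjust them to match the real output schema.
    elements = json.loads(raw_output)
    if isinstance(elements, dict):  # unwrap a possible top-level wrapper key
        elements = next(iter(elements.values()))
    parts = []
    for el in elements:
        text = el.get("text", "")
        if el.get("category") == "figure" or not text:
            continue  # figures carry no text per the prompt rules
        parts.append(text)
    return "\n\n".join(parts)

# Synthetic example response (not real model output):
sample = (
    '[{"bbox": [40, 30, 560, 70], "category": "title", "text": "# Demo"},'
    ' {"bbox": [40, 90, 560, 140], "category": "figure", "text": ""},'
    ' {"bbox": [40, 160, 560, 210], "category": "text", "text": "Hello."}]'
)
print(layout_json_to_markdown(sample))
```

Since the prompt formats tables as HTML and formulas as LaTeX inside the `text` fields, a simple concatenation like this yields Markdown with embedded HTML/LaTeX fragments.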
+
+ ### 2. Advanced Pipeline (infinity_parser2)
+
+ For bulk processing, advanced features, or an end-to-end PDF parsing pipeline, we recommend using our `infinity_parser2` wrapper.
 
 #### Pre-requisites
 
 ```bash
- # Install PyTorch (CUDA). Find the proper version on the [official site](https://pytorch.org/get-started/previous-versions) based on your CUDA version.
+ # Create a Conda environment (optional)
+ conda create -n infinity_parser2 python=3.12
+ conda activate infinity_parser2
+
+ # Install PyTorch (CUDA). Find the proper version at https://pytorch.org/get-started/previous-versions based on your CUDA version.
 pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 --index-url https://download.pytorch.org/whl/cu128
 
- # Install FlashAttention (required for NVIDIA GPUs).
- # This command builds flash-attn from source, which can take 10 to 30 minutes.
+ # Install FlashAttention (FlashAttention-2 is recommended by default)
+ # Standard install (compiles from source, ~10-30 min):
 pip install flash-attn==2.8.3 --no-build-isolation
- # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See the [official guide](https://github.com/Dao-AILab/flash-attention).
+ # Faster install: download a wheel from https://github.com/Dao-AILab/flash-attention/releases, then run: pip install /path/to/<wheel_filename>.whl
+ # For Hopper GPUs (e.g. H100, H800), we recommend FlashAttention-3 instead. See: https://github.com/Dao-AILab/flash-attention
+ # NOTE: The code prioritizes FlashAttention-3 and falls back to FlashAttention-2 if it is not found.
 
 # Install vLLM
 # NOTE: you may need to run the command below to resolve triton and numpy conflicts before installing vllm.
@@ -78,23 +204,29 @@ pip install vllm==0.17.1
 
 #### Install infinity_parser2
 
+ Install from PyPI:
+
 ```bash
- # From PyPI
 pip install infinity_parser2
+ ```
+
+ Install from source:
 
- # From source
+ ```bash
 git clone https://github.com/infly-ai/INF-MLLM.git
 cd INF-MLLM/Infinity-Parser2
 pip install -e .
 ```
 
- ### Usage
+ #### Usage
 
- #### Command Line
+ ##### Command Line
 
 The `parser` command is the fastest way to get started.
 
 ```bash
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
 # Parse a PDF (outputs Markdown by default)
 parser demo_data/demo.pdf
 
@@ -119,9 +251,11 @@ parser demo_data/demo.png --task doc2md
 parser --help
 ```
 
- #### Python API
+ ##### Python API
 
 ```python
+ # NOTE: The Infinity-Parser2 model will be automatically downloaded on the first run.
+
 from infinity_parser2 import InfinityParser2
 
 parser = InfinityParser2()
@@ -154,7 +288,7 @@ result = parser.parse("demo_data/demo.pdf", task_type="doc2md")
 
 # Custom prompt
 result = parser.parse("demo_data/demo.pdf", task_type="custom",
-                       custom_prompt="Extract the title and authors only.")
+                       custom_prompt="Please transform the document's contents into Markdown format.")
 
 # Batch processing with custom batch size
 result = parser.parse("demo_data", batch_size=8)
@@ -308,3 +442,7 @@ print(cache.resolve_model_path("infly/Infinity-Parser2-Pro"))
 - Python 3.12+
 - CUDA-compatible GPU
 - See `setup.py` for the full dependency list.
+
+ ## Acknowledgments
+
+ We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmOCR](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), and [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing datasets, code, and models.
@@ -32,7 +32,7 @@ install_requires = [
 
 setup(
     name="infinity_parser2",
-     version="0.1.0",
+     version="0.3.0",
     description="Document parsing Python package supporting PDF and image parsing using Infinity-Parser2-Pro model.",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",