sparrow-parse 1.1.1__tar.gz → 1.1.2__tar.gz
- sparrow-parse-1.1.2/PKG-INFO +405 -0
- sparrow-parse-1.1.2/README.md +377 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/setup.py +1 -1
- sparrow-parse-1.1.2/sparrow_parse/__init__.py +1 -0
- sparrow-parse-1.1.2/sparrow_parse.egg-info/PKG-INFO +405 -0
- sparrow-parse-1.1.1/PKG-INFO +0 -187
- sparrow-parse-1.1.1/README.md +0 -159
- sparrow-parse-1.1.1/sparrow_parse/__init__.py +0 -1
- sparrow-parse-1.1.1/sparrow_parse.egg-info/PKG-INFO +0 -187
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/setup.cfg +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/__main__.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/extractors/__init__.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/extractors/vllm_extractor.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/helpers/__init__.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/helpers/image_optimizer.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/helpers/pdf_optimizer.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/processors/__init__.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/processors/table_structure_processor.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/text_extraction.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/__init__.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/huggingface_inference.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/inference_base.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/inference_factory.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/local_gpu_inference.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/mlx_inference.py +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/SOURCES.txt +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/dependency_links.txt +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/entry_points.txt +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/requires.txt +0 -0
- {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/top_level.txt +0 -0
@@ -0,0 +1,405 @@
Metadata-Version: 2.1
Name: sparrow-parse
Version: 1.1.2
Summary: Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.
Home-page: https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
Author: Andrej Baranovskij
Author-email: andrejus.baranovskis@gmail.com
Project-URL: Homepage, https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
Project-URL: Repository, https://github.com/katanaml/sparrow
Keywords: llm,vllm,ocr,vision
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Topic :: Software Development
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: rich
Requires-Dist: transformers>=4.51.3
Requires-Dist: torchvision>=0.22.0
Requires-Dist: torch>=2.7.0
Requires-Dist: sentence-transformers>=4.1.0
Requires-Dist: numpy>=2.2.5
Requires-Dist: pypdf>=5.5.0
Requires-Dist: gradio_client>=1.7.2
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: mlx>=0.25.2; sys_platform == "darwin" and platform_machine == "arm64"
Requires-Dist: mlx-vlm==0.1.26; sys_platform == "darwin" and platform_machine == "arm64"

# Sparrow Parse

[PyPI version](https://badge.fury.io/py/sparrow-parse)
[Python 3.10+](https://www.python.org/downloads/)
[License: GPL v3](https://www.gnu.org/licenses/gpl-3.0)

A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the [Sparrow](https://github.com/katanaml/sparrow) ecosystem for intelligent document processing.

## ✨ Features

- 🔍 **Document Data Extraction**: Extract structured data from invoices, forms, tables, and complex documents
- 🤖 **Multiple Backend Support**: MLX (Apple Silicon), Hugging Face Cloud GPU, and local GPU inference
- 📄 **Multi-format Support**: Images (PNG, JPG, JPEG) and multi-page PDFs
- 🎯 **Schema Validation**: JSON schema-based extraction with automatic validation
- 📊 **Table Processing**: Specialized table detection and extraction capabilities
- 🖼️ **Image Annotation**: Bounding box annotations for extracted data
- 💬 **Text Instructions**: Support for instruction-based text processing
- ⚡ **Optimized Processing**: Image cropping, resizing, and preprocessing capabilities

## 🚀 Quick Start

### Installation

```bash
pip install sparrow-parse
```

**Additional Requirements:**
- For PDF processing: `brew install poppler` (macOS) or `apt-get install poppler-utils` (Linux)
- For MLX backend: Apple Silicon Mac required
- For Hugging Face: Valid HF token with GPU access
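
Because `pdf2image` relies on Poppler's command-line tools, one way to confirm the PDF prerequisite before processing is to look for `pdftoppm` on the `PATH`. This is a minimal sketch using only the standard library; the helper name `poppler_available` is hypothetical and not part of sparrow-parse.

```python
import shutil


def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm binary is on the PATH."""
    # pdf2image shells out to pdftoppm (or pdftocairo) to rasterize PDF pages.
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    if poppler_available():
        print("Poppler found: PDF inputs should work.")
    else:
        print("Poppler not found: install poppler (macOS) or poppler-utils (Linux).")
```

Running the check once at startup gives a clearer message than a failure in the middle of a PDF conversion.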

### Basic Usage

```python
from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

# Initialize extractor
extractor = VLLMExtractor()

# Configure backend (MLX example)
config = {
    "method": "mlx",
    "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
}

# Create inference instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

# Prepare input data
input_data = [{
    "file_path": "path/to/your/document.png",
    "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
}]

# Run inference
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    debug=True
)

print(f"Extracted data: {results[0]}")
```
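
The API Reference section of this README gives `run_inference` the return type `Tuple[List[str], int]`, so each entry in `results` is a string. Assuming the model was asked for JSON, the strings can be decoded as sketched below. The helper name `parse_results` is hypothetical, and the sample strings are illustrative, not real model output.

```python
import json
from typing import Any, List


def parse_results(results: List[str]) -> List[Any]:
    """Decode each result string as JSON; keep the raw string if decoding fails."""
    parsed = []
    for raw in results:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            # Model output is not guaranteed to be valid JSON.
            parsed.append(raw)
    return parsed


# Illustrative input only, not real model output.
sample = ['[{"field_name": "Total", "amount": 120}]', "not json"]
print(parse_results(sample))
```

Keeping the raw string on failure lets the caller log or retry a page instead of losing the response.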

## 📖 Detailed Usage

### Backend Configuration

#### MLX Backend (Apple Silicon)
```python
config = {
    "method": "mlx",
    "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
}
```
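
The package metadata installs `mlx` and `mlx-vlm` only when `sys_platform == "darwin" and platform_machine == "arm64"`. A script can mirror that condition before choosing the MLX backend. This is a sketch; `mlx_supported` is a hypothetical helper, not a sparrow-parse function.

```python
import platform
import sys


def mlx_supported(sys_platform: str = sys.platform,
                  machine: str = platform.machine()) -> bool:
    """Mirror the marker: sys_platform == "darwin" and platform_machine == "arm64"."""
    return sys_platform == "darwin" and machine == "arm64"


print("MLX backend usable here:", mlx_supported())
```

The parameters default to the current machine but can be overridden, which makes the check easy to exercise on any platform.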

#### Hugging Face Backend
```python
import os
config = {
    "method": "huggingface",
    "hf_space": "your-username/your-space",
    "hf_token": os.getenv('HF_TOKEN')
}
```
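
`os.getenv('HF_TOKEN')` returns `None` when the variable is unset, which would only surface later as an authentication failure. The sketch below fails early instead. The helper `build_hf_config` is hypothetical; the config keys are the ones shown in the Hugging Face example.

```python
import os
from typing import Dict, Mapping


def build_hf_config(hf_space: str, env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the Hugging Face config, failing early if HF_TOKEN is unset."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running inference.")
    return {"method": "huggingface", "hf_space": hf_space, "hf_token": token}
```

A call such as `build_hf_config("your-username/your-space")` then either returns a complete config or stops with a clear message.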

#### Local GPU Backend
```python
config = {
    "method": "local_gpu",
    "device": "cuda",
    "model_path": "path/to/model.pth"
}
```

### Input Data Formats

#### Document Processing
```python
input_data = [{
    "file_path": "invoice.pdf",
    "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
}]
```

#### Text-Only Processing
```python
input_data = [{
    "file_path": None,
    "text_input": "Summarize the key points about renewable energy."
}]
```

### Advanced Options

#### Table Extraction Only
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    tables_only=True # Extract only tables from document
)
```

#### Image Cropping
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    crop_size=60 # Crop 60 pixels from all borders
)
```
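
To make "crop 60 pixels from all borders" concrete, the geometry can be sketched as a crop box in the usual `(left, upper, right, lower)` convention. This only illustrates the arithmetic; how sparrow-parse implements cropping internally is not shown in this README, and `border_crop_box` is a hypothetical helper.

```python
from typing import Tuple


def border_crop_box(width: int, height: int, crop_size: int) -> Tuple[int, int, int, int]:
    """Return (left, upper, right, lower) after trimming crop_size pixels per edge."""
    if crop_size < 0:
        raise ValueError("crop_size must be non-negative")
    if 2 * crop_size >= min(width, height):
        raise ValueError("crop_size would remove the whole image")
    return (crop_size, crop_size, width - crop_size, height - crop_size)


print(border_crop_box(1000, 800, 60))  # (60, 60, 940, 740)
```

A 1000x800 image cropped this way keeps an 880x680 region, so the value should stay small relative to the page margins.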

#### Bounding Box Annotations
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    apply_annotation=True # Include bounding box coordinates
)
```

#### Generic Data Extraction
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    generic_query=True # Extract all available data
)
```

## 🛠️ Utility Functions

### PDF Processing
```python
from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()
num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
    file_path="document.pdf",
    debug_dir="./debug",
    convert_to_images=True
)
```

### Image Optimization
```python
from sparrow_parse.helpers.image_optimizer import ImageOptimizer

image_optimizer = ImageOptimizer()
cropped_path = image_optimizer.crop_image_borders(
    file_path="image.jpg",
    temp_dir="./temp",
    debug_dir="./debug",
    crop_size=50
)
```

### Table Detection
```python
from sparrow_parse.processors.table_structure_processor import TableDetector

detector = TableDetector()
cropped_tables = detector.detect_tables(
    file_path="document.png",
    local=True,
    debug=True
)
```

## 🎯 Use Cases & Examples

### Invoice Processing
```python
import json

invoice_schema = {
    "invoice_number": "str",
    "date": "str",
    "vendor_name": "str",
    "total_amount": 0,
    "line_items": [{
        "description": "str",
        "quantity": 0,
        "price": 0.0
    }]
}

input_data = [{
    "file_path": "invoice.pdf",
    "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
}]
```

### Financial Tables
```python
import json

table_schema = [{
    "instrument_name": "str",
    "valuation": 0,
    "currency": "str or null"
}]

input_data = [{
    "file_path": "financial_report.png",
    "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
}]
```
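
Once a response for the table schema comes back, a quick structural check can confirm that every row carries the requested fields. This is a minimal sketch of such a check, not the library's own validation; `rows_match_schema` is a hypothetical helper and the response string is illustrative.

```python
import json
from typing import Any, Dict, List

table_schema = [{"instrument_name": "str", "valuation": 0, "currency": "str or null"}]


def rows_match_schema(raw: str, schema: List[Dict[str, Any]]) -> bool:
    """Check that raw is a JSON list of objects with exactly the schema's keys."""
    try:
        rows = json.loads(raw)
    except json.JSONDecodeError:
        return False
    expected = set(schema[0])
    return isinstance(rows, list) and all(
        isinstance(row, dict) and set(row) == expected for row in rows
    )


# Illustrative response string, not real model output.
good = '[{"instrument_name": "Bond A", "valuation": 1000, "currency": null}]'
print(rows_match_schema(good, table_schema))
```

The check compares key names only; value types would need a separate pass if they matter downstream.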

### Form Processing
```python
form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{
        "field_name": "str",
        "field_value": "str or null"
    }]
}
```
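
The form schema can be turned into `input_data` the same way as the other examples, by serializing it into the `text_input` prompt. The sketch below wraps that step in a hypothetical helper `build_input`; the file name `application_form.png` is a placeholder.

```python
import json
from typing import Any, Dict, List, Optional

PROMPT_PREFIX = "retrieve "
PROMPT_SUFFIX = ". return response in JSON format"


def build_input(file_path: Optional[str], schema: Any) -> List[Dict[str, Optional[str]]]:
    """Wrap a schema in the 'retrieve ... return response in JSON format' prompt."""
    query = f"{PROMPT_PREFIX}{json.dumps(schema)}{PROMPT_SUFFIX}"
    return [{"file_path": file_path, "text_input": query}]


form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{"field_name": "str", "field_value": "str or null"}],
}

# Placeholder path for illustration.
input_data = build_input("application_form.png", form_schema)
print(input_data[0]["text_input"])
```

Using `json.dumps` avoids hand-escaping quotes inside the prompt string.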

## ⚙️ Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tables_only` | bool | False | Extract only tables from documents |
| `generic_query` | bool | False | Extract all available data without schema |
| `crop_size` | int | None | Pixels to crop from image borders |
| `apply_annotation` | bool | False | Include bounding box coordinates |
| `debug_dir` | str | None | Directory to save debug images |
| `debug` | bool | False | Enable debug logging |
| `mode` | str | None | Set to "static" for mock responses |
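
When several of these options are combined, it can help to keep them in one object and pass them as keyword arguments. The sketch below copies the defaults from the table; the `InferenceOptions` dataclass is hypothetical and not part of sparrow-parse.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class InferenceOptions:
    """Defaults copied from the configuration options table."""
    tables_only: bool = False
    generic_query: bool = False
    crop_size: Optional[int] = None
    apply_annotation: bool = False
    debug_dir: Optional[str] = None
    debug: bool = False
    mode: Optional[str] = None

    def as_kwargs(self) -> Dict[str, Any]:
        """Return the options as keyword arguments for run_inference."""
        return asdict(self)


options = InferenceOptions(crop_size=60, debug=True)
print(options.as_kwargs())
# Intended use (not executed here):
# extractor.run_inference(model_inference_instance, input_data, **options.as_kwargs())
```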

## 🔧 Troubleshooting

### Common Issues

**Import Errors:**
```bash
# Skip the MLX packages on machines without Apple Silicon
# (pip install has no --exclude option, so filter requirements.txt first)
pip install sparrow-parse --no-deps
grep -v '^mlx' requirements.txt > requirements-no-mlx.txt
pip install -r requirements-no-mlx.txt

# For missing poppler
brew install poppler # macOS
sudo apt-get install poppler-utils # Ubuntu/Debian
```

**Memory Issues:**
- Use smaller models or reduce image resolution
- Enable image cropping to reduce processing load
- Process single pages instead of entire PDFs

**Model Loading Errors:**
- Verify model name and availability
- Check HF token permissions for private models
- Ensure sufficient disk space for model downloads

### Performance Tips

- **Image Size**: Resize large images before processing
- **Batch Processing**: Process multiple pages together when possible
- **Model Selection**: Choose appropriate model size for your hardware
- **Caching**: Models are cached after first load
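
For the image size tip, the target dimensions can be computed before any resizing is done. The sketch below scales an image so its longer side fits a limit while preserving the aspect ratio. The helper `fit_within` and the 2000-pixel limit are assumptions for illustration; this README does not state a recommended size.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, 2000))  # (2000, 1500)
```

The resulting pair can be passed to whichever image library performs the actual resize.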

## 📚 API Reference

### VLLMExtractor Class

```python
from typing import Dict, List, Optional, Tuple


class VLLMExtractor:
    def run_inference(
        self,
        model_inference_instance,
        input_data: List[Dict],
        tables_only: bool = False,
        generic_query: bool = False,
        crop_size: Optional[int] = None,
        apply_annotation: bool = False,
        debug_dir: Optional[str] = None,
        debug: bool = False,
        mode: Optional[str] = None
    ) -> Tuple[List[str], int]: ...  # signature only, body omitted
```

### InferenceFactory Class

```python
from typing import Dict

class InferenceFactory:
    def __init__(self, config: Dict): ...  # signature only
    def get_inference_instance(self) -> "ModelInference": ...  # signature only
```

## 🏗️ Development

### Building from Source

```bash
# Clone repository
git clone https://github.com/katanaml/sparrow.git
cd sparrow/sparrow-data/parse

# Create virtual environment
python -m venv .env_sparrow_parse
source .env_sparrow_parse/bin/activate # Linux/Mac
# or
.env_sparrow_parse\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt

# Build package
pip install setuptools wheel
python setup.py sdist bdist_wheel

# Install locally
pip install -e .
```

### Running Tests

```bash
python -m pytest tests/
```

## 📄 Supported File Formats

| Format | Extension | Multi-page | Notes |
|--------|-----------|------------|-------|
| PNG | .png | ❌ | Recommended for tables/forms |
| JPEG | .jpg, .jpeg | ❌ | Good for photos/scanned docs |
| PDF | .pdf | ✅ | Automatically split into pages |
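
The format table can be applied as a pre-flight check on input paths. The sketch below rejects unsupported extensions and reports whether a file may contain multiple pages. The mapping is copied from the table; the helper `check_format` is hypothetical.

```python
from pathlib import Path

# Extensions taken from the format table; the boolean marks multi-page support.
SUPPORTED = {".png": False, ".jpg": False, ".jpeg": False, ".pdf": True}


def check_format(file_path: str) -> bool:
    """Return True for multi-page formats, False for single-image formats.

    Raises ValueError if the extension is not in the supported list.
    """
    suffix = Path(file_path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}")
    return SUPPORTED[suffix]


print(check_format("invoice.PDF"))  # True
```

Lower-casing the suffix means files such as `SCAN.JPG` are accepted as well.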

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guidelines](https://github.com/katanaml/sparrow/blob/main/CONTRIBUTING.md) for details.

## 📞 Support

- 📖 [Documentation](https://github.com/katanaml/sparrow)
- 🐛 [Issue Tracker](https://github.com/katanaml/sparrow/issues)
- 💼 [Professional Services](mailto:abaranovskis@redsamuraiconsulting.com)

## 📜 License

Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.

**Commercial Licensing:** Free for organizations with revenue under $5M USD annually. [Contact us](mailto:abaranovskis@redsamuraiconsulting.com) for commercial licensing options.

## 👥 Authors

- **[Katana ML](https://katanaml.io)**
- **[Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)**

---

⭐ **Star us on [GitHub](https://github.com/katanaml/sparrow)** if you find Sparrow Parse useful!