sparrow-parse 1.1.1__tar.gz → 1.1.2__tar.gz

Files changed (30)
  1. sparrow-parse-1.1.2/PKG-INFO +405 -0
  2. sparrow-parse-1.1.2/README.md +377 -0
  3. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/setup.py +1 -1
  4. sparrow-parse-1.1.2/sparrow_parse/__init__.py +1 -0
  5. sparrow-parse-1.1.2/sparrow_parse.egg-info/PKG-INFO +405 -0
  6. sparrow-parse-1.1.1/PKG-INFO +0 -187
  7. sparrow-parse-1.1.1/README.md +0 -159
  8. sparrow-parse-1.1.1/sparrow_parse/__init__.py +0 -1
  9. sparrow-parse-1.1.1/sparrow_parse.egg-info/PKG-INFO +0 -187
  10. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/setup.cfg +0 -0
  11. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/__main__.py +0 -0
  12. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/extractors/__init__.py +0 -0
  13. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/extractors/vllm_extractor.py +0 -0
  14. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/helpers/__init__.py +0 -0
  15. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/helpers/image_optimizer.py +0 -0
  16. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/helpers/pdf_optimizer.py +0 -0
  17. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/processors/__init__.py +0 -0
  18. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/processors/table_structure_processor.py +0 -0
  19. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/text_extraction.py +0 -0
  20. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/__init__.py +0 -0
  21. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/huggingface_inference.py +0 -0
  22. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/inference_base.py +0 -0
  23. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/inference_factory.py +0 -0
  24. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/local_gpu_inference.py +0 -0
  25. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse/vllm/mlx_inference.py +0 -0
  26. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/SOURCES.txt +0 -0
  27. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/dependency_links.txt +0 -0
  28. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/entry_points.txt +0 -0
  29. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/requires.txt +0 -0
  30. {sparrow-parse-1.1.1 → sparrow-parse-1.1.2}/sparrow_parse.egg-info/top_level.txt +0 -0
@@ -0,0 +1,405 @@
+ Metadata-Version: 2.1
+ Name: sparrow-parse
+ Version: 1.1.2
+ Summary: Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.
+ Home-page: https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
+ Author: Andrej Baranovskij
+ Author-email: andrejus.baranovskis@gmail.com
+ Project-URL: Homepage, https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
+ Project-URL: Repository, https://github.com/katanaml/sparrow
+ Keywords: llm,vllm,ocr,vision
+ Classifier: Operating System :: OS Independent
+ Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+ Classifier: Topic :: Software Development
+ Classifier: Programming Language :: Python :: 3.10
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ Requires-Dist: rich
+ Requires-Dist: transformers>=4.51.3
+ Requires-Dist: torchvision>=0.22.0
+ Requires-Dist: torch>=2.7.0
+ Requires-Dist: sentence-transformers>=4.1.0
+ Requires-Dist: numpy>=2.2.5
+ Requires-Dist: pypdf>=5.5.0
+ Requires-Dist: gradio_client>=1.7.2
+ Requires-Dist: pdf2image>=1.17.0
+ Requires-Dist: mlx>=0.25.2; sys_platform == "darwin" and platform_machine == "arm64"
+ Requires-Dist: mlx-vlm==0.1.26; sys_platform == "darwin" and platform_machine == "arm64"
+
+ # Sparrow Parse
+
+ [![PyPI version](https://badge.fury.io/py/sparrow-parse.svg)](https://badge.fury.io/py/sparrow-parse)
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+ [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
+
+ A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the [Sparrow](https://github.com/katanaml/sparrow) ecosystem for intelligent document processing.
+
+ ## ✨ Features
+
+ - 🔍 **Document Data Extraction**: Extract structured data from invoices, forms, tables, and complex documents
+ - 🤖 **Multiple Backend Support**: MLX (Apple Silicon), Hugging Face Cloud GPU, and local GPU inference
+ - 📄 **Multi-format Support**: Images (PNG, JPG, JPEG) and multi-page PDFs
+ - 🎯 **Schema Validation**: JSON schema-based extraction with automatic validation
+ - 📊 **Table Processing**: Specialized table detection and extraction capabilities
+ - 🖼️ **Image Annotation**: Bounding box annotations for extracted data
+ - 💬 **Text Instructions**: Support for instruction-based text processing
+ - ⚡ **Optimized Processing**: Image cropping, resizing, and preprocessing capabilities
+
+ ## 🚀 Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install sparrow-parse
+ ```
+
+ **Additional Requirements:**
+ - For PDF processing: `brew install poppler` (macOS) or `apt-get install poppler-utils` (Linux)
+ - For MLX backend: Apple Silicon Mac required
+ - For Hugging Face: Valid HF token with GPU access
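
The requirements above can be checked before any model is loaded. The following is a minimal sketch that uses only the standard library; `is_apple_silicon`, `has_poppler`, and `preflight` are hypothetical helper names and are not part of sparrow-parse. The platform test mirrors the environment marker on the `mlx` dependencies, and the Poppler test looks for the `pdftoppm` and `pdfinfo` binaries that `pdf2image` invokes.

```python
import platform
import shutil
import sys


def is_apple_silicon() -> bool:
    """Mirror the marker: sys_platform == "darwin" and platform_machine == "arm64"."""
    return sys.platform == "darwin" and platform.machine() == "arm64"


def has_poppler() -> bool:
    """pdf2image calls Poppler's command-line tools, so they must be on PATH."""
    return shutil.which("pdftoppm") is not None and shutil.which("pdfinfo") is not None


def preflight(method: str, needs_pdf: bool) -> list[str]:
    """Return human-readable problems; an empty list means nothing was detected."""
    problems = []
    if method == "mlx" and not is_apple_silicon():
        problems.append("The MLX backend requires an Apple Silicon Mac.")
    if needs_pdf and not has_poppler():
        problems.append("Poppler not found: install poppler (macOS) or poppler-utils (Linux).")
    return problems


for problem in preflight("mlx", needs_pdf=True):
    print(problem)
```

The check only reports what it can see locally; it does not verify a Hugging Face token or GPU access.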
+
+ ### Basic Usage
+
+ ```python
+ from sparrow_parse.vllm.inference_factory import InferenceFactory
+ from sparrow_parse.extractors.vllm_extractor import VLLMExtractor
+
+ # Initialize extractor
+ extractor = VLLMExtractor()
+
+ # Configure backend (MLX example)
+ config = {
+     "method": "mlx",
+     "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
+ }
+
+ # Create inference instance
+ factory = InferenceFactory(config)
+ model_inference_instance = factory.get_inference_instance()
+
+ # Prepare input data
+ input_data = [{
+     "file_path": "path/to/your/document.png",
+     "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
+ }]
+
+ # Run inference
+ results, num_pages = extractor.run_inference(
+     model_inference_instance,
+     input_data,
+     debug=True
+ )
+
+ print(f"Extracted data: {results[0]}")
+ ```
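
`run_inference` is documented as returning a list of strings, so `results[0]` typically still needs decoding before its fields can be used. The following is a minimal sketch, assuming each entry is JSON text that may be wrapped in a Markdown code fence (an assumption about model output, not documented behavior); `parse_result` is a hypothetical helper, not part of sparrow-parse.

```python
import json
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code-fence marker


def parse_result(raw: str) -> Optional[Any]:
    """Decode one result string; return None if it is not valid JSON."""
    text = raw.strip()
    # Assumption: some models wrap their JSON answer in a Markdown code fence.
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines[1:]).strip()  # drop the opening fence line
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


sample = '[{"field_name": "Total", "amount": 120}]'
print(parse_result(sample))
```

Returning `None` rather than raising keeps a batch loop running when a single page yields malformed output; callers that prefer a hard failure can re-raise instead.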
+
+ ## 📖 Detailed Usage
+
+ ### Backend Configuration
+
+ #### MLX Backend (Apple Silicon)
+ ```python
+ config = {
+     "method": "mlx",
+     "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
+ }
+ ```
+
+ #### Hugging Face Backend
+ ```python
+ import os
+ config = {
+     "method": "huggingface",
+     "hf_space": "your-username/your-space",
+     "hf_token": os.getenv('HF_TOKEN')
+ }
+ ```
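
`os.getenv('HF_TOKEN')` returns `None` when the variable is unset, so a missing token would only surface later, at request time. The following is a minimal sketch of a fail-fast variant; `build_hf_config` is a hypothetical helper, and the config keys are the ones shown in the Hugging Face example.

```python
import os
from typing import Mapping


def build_hf_config(hf_space: str, env: Mapping[str, str] = os.environ) -> dict:
    """Build the Hugging Face backend config, failing early if the token is missing."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before creating the config")
    return {"method": "huggingface", "hf_space": hf_space, "hf_token": token}


# Passing an explicit mapping keeps the example independent of the real environment.
config = build_hf_config("your-username/your-space", env={"HF_TOKEN": "hf_example"})
print(config["method"])
```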
+
+ #### Local GPU Backend
+ ```python
+ config = {
+     "method": "local_gpu",
+     "device": "cuda",
+     "model_path": "path/to/model.pth"
+ }
+ ```
+
+ ### Input Data Formats
+
+ #### Document Processing
+ ```python
+ input_data = [{
+     "file_path": "invoice.pdf",
+     "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
+ }]
+ ```
+
+ #### Text-Only Processing
+ ```python
+ input_data = [{
+     "file_path": None,
+     "text_input": "Summarize the key points about renewable energy."
+ }]
+ ```
+
+ ### Advanced Options
+
+ #### Table Extraction Only
+ ```python
+ results, num_pages = extractor.run_inference(
+     model_inference_instance,
+     input_data,
+     tables_only=True  # Extract only tables from document
+ )
+ ```
+
+ #### Image Cropping
+ ```python
+ results, num_pages = extractor.run_inference(
+     model_inference_instance,
+     input_data,
+     crop_size=60  # Crop 60 pixels from all borders
+ )
+ ```
+
+ #### Bounding Box Annotations
+ ```python
+ results, num_pages = extractor.run_inference(
+     model_inference_instance,
+     input_data,
+     apply_annotation=True  # Include bounding box coordinates
+ )
+ ```
+
+ #### Generic Data Extraction
+ ```python
+ results, num_pages = extractor.run_inference(
+     model_inference_instance,
+     input_data,
+     generic_query=True  # Extract all available data
+ )
+ ```
+
+ ## 🛠️ Utility Functions
+
+ ### PDF Processing
+ ```python
+ from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer
+
+ pdf_optimizer = PDFOptimizer()
+ num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
+     file_path="document.pdf",
+     debug_dir="./debug",
+     convert_to_images=True
+ )
+ ```
+
+ ### Image Optimization
+ ```python
+ from sparrow_parse.helpers.image_optimizer import ImageOptimizer
+
+ image_optimizer = ImageOptimizer()
+ cropped_path = image_optimizer.crop_image_borders(
+     file_path="image.jpg",
+     temp_dir="./temp",
+     debug_dir="./debug",
+     crop_size=50
+ )
+ ```
+
+ ### Table Detection
+ ```python
+ from sparrow_parse.processors.table_structure_processor import TableDetector
+
+ detector = TableDetector()
+ cropped_tables = detector.detect_tables(
+     file_path="document.png",
+     local=True,
+     debug=True
+ )
+ ```
+
+ ## 🎯 Use Cases & Examples
+
+ ### Invoice Processing
+ ```python
+ import json
+
+ invoice_schema = {
+     "invoice_number": "str",
+     "date": "str",
+     "vendor_name": "str",
+     "total_amount": 0,
+     "line_items": [{
+         "description": "str",
+         "quantity": 0,
+         "price": 0.0
+     }]
+ }
+
+ input_data = [{
+     "file_path": "invoice.pdf",
+     "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
+ }]
+ ```
+
+ ### Financial Tables
+ ```python
+ import json
+
+ table_schema = [{
+     "instrument_name": "str",
+     "valuation": 0,
+     "currency": "str or null"
+ }]
+
+ input_data = [{
+     "file_path": "financial_report.png",
+     "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
+ }]
+ ```
+
+ ### Form Processing
+ ```python
+ form_schema = {
+     "applicant_name": "str",
+     "application_date": "str",
+     "fields": [{
+         "field_name": "str",
+         "field_value": "str or null"
+     }]
+ }
+ ```
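
The Form Processing example stops at the schema. The following is a minimal sketch of turning any such schema into `input_data`, reusing the prompt wording from the Financial Tables example; `build_input` is a hypothetical helper and `form.png` is a placeholder file name.

```python
import json
from typing import Any, Optional


def build_input(file_path: Optional[str], schema: Any) -> list[dict]:
    """Wrap a schema in the 'retrieve ... return response in JSON format' prompt."""
    prompt = f"retrieve {json.dumps(schema)}. return response in JSON format"
    return [{"file_path": file_path, "text_input": prompt}]


form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{
        "field_name": "str",
        "field_value": "str or null"
    }]
}

input_data = build_input("form.png", form_schema)  # placeholder path
print(input_data[0]["text_input"])
```

Serializing with `json.dumps` avoids hand-escaping quotes, which the inline prompts in the earlier examples have to do manually.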
+
+ ## ⚙️ Configuration Options
+
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `tables_only` | bool | False | Extract only tables from documents |
+ | `generic_query` | bool | False | Extract all available data without schema |
+ | `crop_size` | int | None | Pixels to crop from image borders |
+ | `apply_annotation` | bool | False | Include bounding box coordinates |
+ | `debug_dir` | str | None | Directory to save debug images |
+ | `debug` | bool | False | Enable debug logging |
+ | `mode` | str | None | Set to "static" for mock responses |
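
The options in the table can be collected in one place and passed on as keyword arguments. The following is a minimal sketch, assuming the defaults listed in the table; `RunOptions` is a hypothetical container, and the non-negative check on `crop_size` is an added assumption rather than documented library behavior.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunOptions:
    """Option names and defaults copied from the Configuration Options table."""
    tables_only: bool = False
    generic_query: bool = False
    crop_size: Optional[int] = None
    apply_annotation: bool = False
    debug_dir: Optional[str] = None
    debug: bool = False
    mode: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: a negative border crop is never meaningful.
        if self.crop_size is not None and self.crop_size < 0:
            raise ValueError("crop_size must be a non-negative number of pixels")

    def as_kwargs(self) -> dict:
        """Return the options as a plain dict suitable for ** unpacking."""
        return asdict(self)


options = RunOptions(tables_only=True, crop_size=60)
print(options.as_kwargs())
```

With an extractor and inference instance set up as in Basic Usage, the call would then read `extractor.run_inference(model_inference_instance, input_data, **options.as_kwargs())`.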
+
+ ## 🔧 Troubleshooting
+
+ ### Common Issues
+
+ **Import Errors:**
+ ```bash
+ # For MLX backend on non-Apple Silicon
+ pip install sparrow-parse --no-deps
+ # pip has no --exclude option; filter the MLX packages out of requirements.txt instead
+ grep -v '^mlx' requirements.txt > requirements-no-mlx.txt
+ pip install -r requirements-no-mlx.txt
+
+ # For missing poppler
+ brew install poppler                    # macOS
+ sudo apt-get install poppler-utils      # Ubuntu/Debian
+ ```
+
+ **Memory Issues:**
+ - Use smaller models or reduce image resolution
+ - Enable image cropping to reduce processing load
+ - Process single pages instead of entire PDFs
+
+ **Model Loading Errors:**
+ - Verify model name and availability
+ - Check HF token permissions for private models
+ - Ensure sufficient disk space for model downloads
+
+ ### Performance Tips
+
+ - **Image Size**: Resize large images before processing
+ - **Batch Processing**: Process multiple pages together when possible
+ - **Model Selection**: Choose appropriate model size for your hardware
+ - **Caching**: Models are cached after first load
+
+ ## 📚 API Reference
+
+ ### VLLMExtractor Class
+
+ ```python
+ from typing import Dict, List, Optional, Tuple
+
+ class VLLMExtractor:
+     def run_inference(
+         self,
+         model_inference_instance,
+         input_data: List[Dict],
+         tables_only: bool = False,
+         generic_query: bool = False,
+         crop_size: Optional[int] = None,
+         apply_annotation: bool = False,
+         debug_dir: Optional[str] = None,
+         debug: bool = False,
+         mode: Optional[str] = None
+     ) -> Tuple[List[str], int]: ...  # signature only; body omitted
+ ```
+
+ ### InferenceFactory Class
+
+ ```python
+ from typing import Dict
+
+ class InferenceFactory:
+     def __init__(self, config: Dict) -> None: ...  # signature only
+     def get_inference_instance(self) -> "ModelInference": ...  # signature only
+ ```
+
+ ## 🏗️ Development
+
+ ### Building from Source
+
+ ```bash
+ # Clone repository
+ git clone https://github.com/katanaml/sparrow.git
+ cd sparrow/sparrow-data/parse
+
+ # Create virtual environment
+ python -m venv .env_sparrow_parse
+ source .env_sparrow_parse/bin/activate  # Linux/Mac
+ # or
+ .env_sparrow_parse\Scripts\activate     # Windows
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Build package
+ pip install setuptools wheel
+ python setup.py sdist bdist_wheel
+
+ # Install locally
+ pip install -e .
+ ```
+
+ ### Running Tests
+
+ ```bash
+ python -m pytest tests/
+ ```
+
+ ## 📄 Supported File Formats
+
+ | Format | Extension | Multi-page | Notes |
+ |--------|-----------|------------|-------|
+ | PNG | .png | ❌ | Recommended for tables/forms |
+ | JPEG | .jpg, .jpeg | ❌ | Good for photos/scanned docs |
+ | PDF | .pdf | ✅ | Automatically split into pages |
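
The table can be turned into a quick input check before a file is handed to the extractor. The following is a minimal sketch; `SUPPORTED_FORMATS` and `is_multipage` are hypothetical names, and the case-insensitive extension match is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extension -> multi-page flag, copied from the Supported File Formats table.
SUPPORTED_FORMATS = {".png": False, ".jpg": False, ".jpeg": False, ".pdf": True}


def is_multipage(file_path: str) -> bool:
    """Return the multi-page flag for a supported file; raise for anything else."""
    suffix = Path(file_path).suffix.lower()  # assumption: match extensions case-insensitively
    if suffix not in SUPPORTED_FORMATS:
        supported = ", ".join(sorted(SUPPORTED_FORMATS))
        raise ValueError(f"Unsupported extension {suffix!r}; expected one of: {supported}")
    return SUPPORTED_FORMATS[suffix]


print(is_multipage("invoice.pdf"))
```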
+
+ ## 🤝 Contributing
+
+ We welcome contributions! Please see our [Contributing Guidelines](https://github.com/katanaml/sparrow/blob/main/CONTRIBUTING.md) for details.
+
+ ## 📞 Support
+
+ - 📖 [Documentation](https://github.com/katanaml/sparrow)
+ - 🐛 [Issue Tracker](https://github.com/katanaml/sparrow/issues)
+ - 💼 [Professional Services](mailto:abaranovskis@redsamuraiconsulting.com)
+
+ ## 📜 License
+
+ Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.
+
+ **Commercial Licensing:** Free for organizations with revenue under $5M USD annually. [Contact us](mailto:abaranovskis@redsamuraiconsulting.com) for commercial licensing options.
+
+ ## 👥 Authors
+
+ - **[Katana ML](https://katanaml.io)**
+ - **[Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)**
+
+ ---
+
+ ⭐ **Star us on [GitHub](https://github.com/katanaml/sparrow)** if you find Sparrow Parse useful!