sparrow_parse-1.1.0-py3-none-any.whl → sparrow_parse-1.1.2-py3-none-any.whl

sparrow_parse/__init__.py CHANGED
@@ -1 +1 @@
1
- __version__ = '1.1.0'
1
+ __version__ = '1.1.2'
sparrow_parse/vllm/mlx_inference.py CHANGED
@@ -174,7 +174,7 @@ class MLXInference(ModelInference):
174
174
  :return: Generated response
175
175
  """
176
176
  prompt = apply_chat_template(processor, config, messages)
177
- response = generate(
177
+ response, _ = generate(
178
178
  model,
179
179
  processor,
180
180
  prompt,
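The one-line change in this hunk unpacks the return value of `generate`, which suggests that the call returns a pair rather than a bare string. A minimal sketch of a defensive wrapper, assuming only that the first element of the pair is the generated text. `extract_text` and `fake_generate` are hypothetical names used for illustration and are not part of sparrow-parse or mlx-vlm:

```python
def extract_text(result):
    """Return the generated text whether `result` is a string or a (text, extra) pair."""
    if isinstance(result, tuple):
        # Assumed shape: (generated_text, extra_info); keep only the text.
        return result[0]
    return result


def fake_generate(as_tuple):
    """Hypothetical stand-in for a generate() call; not the mlx-vlm API."""
    text = '{"total": 42}'
    return (text, {"tokens": 5}) if as_tuple else text


print(extract_text(fake_generate(as_tuple=True)))
print(extract_text(fake_generate(as_tuple=False)))
```

Both calls print the same text, so downstream code does not need to know which return shape it received.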
sparrow_parse-1.1.2.dist-info/METADATA ADDED
@@ -0,0 +1,405 @@
1
+ Metadata-Version: 2.1
2
+ Name: sparrow-parse
3
+ Version: 1.1.2
4
+ Summary: Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.
5
+ Home-page: https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
6
+ Author: Andrej Baranovskij
7
+ Author-email: andrejus.baranovskis@gmail.com
8
+ Project-URL: Homepage, https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
9
+ Project-URL: Repository, https://github.com/katanaml/sparrow
10
+ Keywords: llm,vllm,ocr,vision
11
+ Classifier: Operating System :: OS Independent
12
+ Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
13
+ Classifier: Topic :: Software Development
14
+ Classifier: Programming Language :: Python :: 3.10
15
+ Requires-Python: >=3.10
16
+ Description-Content-Type: text/markdown
17
+ Requires-Dist: rich
18
+ Requires-Dist: transformers >=4.51.3
19
+ Requires-Dist: torchvision >=0.22.0
20
+ Requires-Dist: torch >=2.7.0
21
+ Requires-Dist: sentence-transformers >=4.1.0
22
+ Requires-Dist: numpy >=2.2.5
23
+ Requires-Dist: pypdf >=5.5.0
24
+ Requires-Dist: gradio-client >=1.7.2
25
+ Requires-Dist: pdf2image >=1.17.0
26
+ Requires-Dist: mlx >=0.25.2 ; sys_platform == "darwin" and platform_machine == "arm64"
27
+ Requires-Dist: mlx-vlm ==0.1.26 ; sys_platform == "darwin" and platform_machine == "arm64"
28
+
29
+ # Sparrow Parse
30
+
31
+ [![PyPI version](https://badge.fury.io/py/sparrow-parse.svg)](https://badge.fury.io/py/sparrow-parse)
32
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
33
+ [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
34
+
35
+ A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the [Sparrow](https://github.com/katanaml/sparrow) ecosystem for intelligent document processing.
36
+
37
+ ## ✨ Features
38
+
39
+ - 🔍 **Document Data Extraction**: Extract structured data from invoices, forms, tables, and complex documents
40
+ - 🤖 **Multiple Backend Support**: MLX (Apple Silicon), Hugging Face Cloud GPU, and local GPU inference
41
+ - 📄 **Multi-format Support**: Images (PNG, JPG, JPEG) and multi-page PDFs
42
+ - 🎯 **Schema Validation**: JSON schema-based extraction with automatic validation
43
+ - 📊 **Table Processing**: Specialized table detection and extraction capabilities
44
+ - 🖼️ **Image Annotation**: Bounding box annotations for extracted data
45
+ - 💬 **Text Instructions**: Support for instruction-based text processing
46
+ - ⚡ **Optimized Processing**: Image cropping, resizing, and preprocessing capabilities
47
+
48
+ ## 🚀 Quick Start
49
+
50
+ ### Installation
51
+
52
+ ```bash
53
+ pip install sparrow-parse
54
+ ```
55
+
56
+ **Additional Requirements:**
57
+ - For PDF processing: `brew install poppler` (macOS) or `apt-get install poppler-utils` (Linux)
58
+ - For MLX backend: Apple Silicon Mac required
59
+ - For Hugging Face: Valid HF token with GPU access
60
+
61
+ ### Basic Usage
62
+
63
+ ```python
64
+ from sparrow_parse.vllm.inference_factory import InferenceFactory
65
+ from sparrow_parse.extractors.vllm_extractor import VLLMExtractor
66
+
67
+ # Initialize extractor
68
+ extractor = VLLMExtractor()
69
+
70
+ # Configure backend (MLX example)
71
+ config = {
72
+ "method": "mlx",
73
+ "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
74
+ }
75
+
76
+ # Create inference instance
77
+ factory = InferenceFactory(config)
78
+ model_inference_instance = factory.get_inference_instance()
79
+
80
+ # Prepare input data
81
+ input_data = [{
82
+ "file_path": "path/to/your/document.png",
83
+ "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
84
+ }]
85
+
86
+ # Run inference
87
+ results, num_pages = extractor.run_inference(
88
+ model_inference_instance,
89
+ input_data,
90
+ debug=True
91
+ )
92
+
93
+ print(f"Extracted data: {results[0]}")
94
+ ```
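The API reference in this README types `run_inference` as returning `Tuple[List[str], int]`, so each entry in `results` is presumably a string. A minimal sketch of decoding those strings, assuming each one holds a JSON document. `parse_results` is a hypothetical helper, not part of sparrow-parse:

```python
import json


def parse_results(results):
    """Decode each result string as JSON; keep the raw value if decoding fails."""
    parsed = []
    for raw in results:
        try:
            parsed.append(json.loads(raw))
        except (json.JSONDecodeError, TypeError):
            # Assumption: non-JSON output is still worth keeping for inspection.
            parsed.append(raw)
    return parsed


sample = ['[{"field_name": "Total", "amount": 120}]', "not json"]
print(parse_results(sample))
```

Keeping the raw value on failure makes it easier to see what the model actually returned when the output is not valid JSON.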
95
+
96
+ ## 📖 Detailed Usage
97
+
98
+ ### Backend Configuration
99
+
100
+ #### MLX Backend (Apple Silicon)
101
+ ```python
102
+ config = {
103
+ "method": "mlx",
104
+ "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
105
+ }
106
+ ```
107
+
108
+ #### Hugging Face Backend
109
+ ```python
110
+ import os
111
+ config = {
112
+ "method": "huggingface",
113
+ "hf_space": "your-username/your-space",
114
+ "hf_token": os.getenv('HF_TOKEN')
115
+ }
116
+ ```
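With the config above, `os.getenv('HF_TOKEN')` yields `None` when the variable is unset, and the problem may only surface later during inference. A minimal sketch that fails early instead. `build_hf_config` is a hypothetical helper, not part of sparrow-parse, and the token value below is a placeholder for the demo only:

```python
import os


def build_hf_config(hf_space, env_var="HF_TOKEN"):
    """Build a Hugging Face backend config, failing early if the token is unset."""
    token = os.getenv(env_var)
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before running inference")
    return {"method": "huggingface", "hf_space": hf_space, "hf_token": token}


# Placeholder so the demo runs; a real token would come from the environment.
os.environ.setdefault("HF_TOKEN", "hf_example_placeholder")
config = build_hf_config("your-username/your-space")
print(config["method"], config["hf_space"])
```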
117
+
118
+ #### Local GPU Backend
119
+ ```python
120
+ config = {
121
+ "method": "local_gpu",
122
+ "device": "cuda",
123
+ "model_path": "path/to/model.pth"
124
+ }
125
+ ```
126
+
127
+ ### Input Data Formats
128
+
129
+ #### Document Processing
130
+ ```python
131
+ input_data = [{
132
+ "file_path": "invoice.pdf",
133
+ "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
134
+ }]
135
+ ```
136
+
137
+ #### Text-Only Processing
138
+ ```python
139
+ input_data = [{
140
+ "file_path": None,
141
+ "text_input": "Summarize the key points about renewable energy."
142
+ }]
143
+ ```
144
+
145
+ ### Advanced Options
146
+
147
+ #### Table Extraction Only
148
+ ```python
149
+ results, num_pages = extractor.run_inference(
150
+ model_inference_instance,
151
+ input_data,
152
+ tables_only=True # Extract only tables from document
153
+ )
154
+ ```
155
+
156
+ #### Image Cropping
157
+ ```python
158
+ results, num_pages = extractor.run_inference(
159
+ model_inference_instance,
160
+ input_data,
161
+ crop_size=60 # Crop 60 pixels from all borders
162
+ )
163
+ ```
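The arithmetic behind "crop N pixels from all borders" can be sketched as follows. This is an illustration only, not the library's implementation. `crop_box` is a hypothetical helper, and the returned tuple follows the (left, upper, right, lower) box convention used by Pillow's `Image.crop`:

```python
def crop_box(width, height, crop_size):
    """Return (left, top, right, bottom) after trimming crop_size pixels per border."""
    if crop_size < 0:
        raise ValueError("crop_size must be non-negative")
    if 2 * crop_size >= min(width, height):
        raise ValueError("crop_size would remove the whole image")
    return (crop_size, crop_size, width - crop_size, height - crop_size)


print(crop_box(1000, 800, 60))  # (60, 60, 940, 740)
```

The guard matters for small images: a `crop_size` of 60 on a 100-pixel-wide scan would leave nothing to process.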
164
+
165
+ #### Bounding Box Annotations
166
+ ```python
167
+ results, num_pages = extractor.run_inference(
168
+ model_inference_instance,
169
+ input_data,
170
+ apply_annotation=True # Include bounding box coordinates
171
+ )
172
+ ```
173
+
174
+ #### Generic Data Extraction
175
+ ```python
176
+ results, num_pages = extractor.run_inference(
177
+ model_inference_instance,
178
+ input_data,
179
+ generic_query=True # Extract all available data
180
+ )
181
+ ```
182
+
183
+ ## 🛠️ Utility Functions
184
+
185
+ ### PDF Processing
186
+ ```python
187
+ from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer
188
+
189
+ pdf_optimizer = PDFOptimizer()
190
+ num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
191
+ file_path="document.pdf",
192
+ debug_dir="./debug",
193
+ convert_to_images=True
194
+ )
195
+ ```
196
+
197
+ ### Image Optimization
198
+ ```python
199
+ from sparrow_parse.helpers.image_optimizer import ImageOptimizer
200
+
201
+ image_optimizer = ImageOptimizer()
202
+ cropped_path = image_optimizer.crop_image_borders(
203
+ file_path="image.jpg",
204
+ temp_dir="./temp",
205
+ debug_dir="./debug",
206
+ crop_size=50
207
+ )
208
+ ```
209
+
210
+ ### Table Detection
211
+ ```python
212
+ from sparrow_parse.processors.table_structure_processor import TableDetector
213
+
214
+ detector = TableDetector()
215
+ cropped_tables = detector.detect_tables(
216
+ file_path="document.png",
217
+ local=True,
218
+ debug=True
219
+ )
220
+ ```
221
+
222
+ ## 🎯 Use Cases & Examples
223
+
224
+ ### Invoice Processing
225
+ ```python
226
+ invoice_schema = {
227
+ "invoice_number": "str",
228
+ "date": "str",
229
+ "vendor_name": "str",
230
+ "total_amount": 0,
231
+ "line_items": [{
232
+ "description": "str",
233
+ "quantity": 0,
234
+ "price": 0.0
235
+ }]
236
+ }
237
+
238
+ input_data = [{
239
+ "file_path": "invoice.pdf",
240
+ "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
241
+ }]
242
+ ```
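The snippet above calls `json.dumps` without showing `import json`, so it raises `NameError` if copied as-is. A self-contained sketch of the same pattern. `build_query` is a hypothetical helper, not part of sparrow-parse:

```python
import json


def build_query(schema, prefix="extract invoice data"):
    """Serialize a schema template and embed it in the text instruction."""
    return f"{prefix}: {json.dumps(schema)}"


invoice_schema = {"invoice_number": "str", "total_amount": 0}
input_data = [{
    "file_path": "invoice.pdf",
    "text_input": build_query(invoice_schema),
}]
print(input_data[0]["text_input"])
```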
243
+
244
+ ### Financial Tables
245
+ ```python
246
+ table_schema = [{
247
+ "instrument_name": "str",
248
+ "valuation": 0,
249
+ "currency": "str or null"
250
+ }]
251
+
252
+ input_data = [{
253
+ "file_path": "financial_report.png",
254
+ "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
255
+ }]
256
+ ```
257
+
258
+ ### Form Processing
259
+ ```python
260
+ form_schema = {
261
+ "applicant_name": "str",
262
+ "application_date": "str",
263
+ "fields": [{
264
+ "field_name": "str",
265
+ "field_value": "str or null"
266
+ }]
267
+ }
268
+ ```
269
+
270
+ ## ⚙️ Configuration Options
271
+
272
+ | Parameter | Type | Default | Description |
273
+ |-----------|------|---------|-------------|
274
+ | `tables_only` | bool | False | Extract only tables from documents |
275
+ | `generic_query` | bool | False | Extract all available data without schema |
276
+ | `crop_size` | int | None | Pixels to crop from image borders |
277
+ | `apply_annotation` | bool | False | Include bounding box coordinates |
278
+ | `debug_dir` | str | None | Directory to save debug images |
279
+ | `debug` | bool | False | Enable debug logging |
280
+ | `mode` | str | None | Set to "static" for mock responses |
281
+
282
+ ## 🔧 Troubleshooting
283
+
284
+ ### Common Issues
285
+
286
+ **Import Errors:**
287
+ ```bash
288
+ # For MLX backend on non-Apple Silicon
289
+ pip install sparrow-parse --no-deps
290
+ pip install -r requirements.txt --exclude mlx-vlm
291
+
292
+ # For missing poppler
293
+ brew install poppler # macOS
294
+ sudo apt-get install poppler-utils # Ubuntu/Debian
295
+ ```
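pip's `install` command does not appear to document an `--exclude` option, so the second command above may fail as written. The `Requires-Dist` markers in this wheel already restrict `mlx` and `mlx-vlm` to `darwin`/`arm64`, so a plain install should skip them on other platforms. If a filtered requirements file is still needed, one option is to strip the unwanted lines first. A minimal sketch, where `drop_requirements` is a hypothetical helper and the sample lines are illustrative:

```python
import re


def drop_requirements(lines, excluded):
    """Return requirement lines whose package name is not in `excluded`."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # The package name ends at the first version operator, marker, extra, or space.
        name = re.split(r"[<>=!~;\[\s]", stripped, maxsplit=1)[0].lower()
        if name not in excluded:
            kept.append(line)
    return kept


sample = ["rich", "mlx>=0.25.2", "mlx-vlm==0.1.26", "torch>=2.7.0"]
print(drop_requirements(sample, {"mlx", "mlx-vlm"}))  # ['rich', 'torch>=2.7.0']
```

The filtered lines can be written to a separate file and passed to `pip install -r`.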
296
+
297
+ **Memory Issues:**
298
+ - Use smaller models or reduce image resolution
299
+ - Enable image cropping to reduce processing load
300
+ - Process single pages instead of entire PDFs
301
+
302
+ **Model Loading Errors:**
303
+ - Verify model name and availability
304
+ - Check HF token permissions for private models
305
+ - Ensure sufficient disk space for model downloads
306
+
307
+ ### Performance Tips
308
+
309
+ - **Image Size**: Resize large images before processing
310
+ - **Batch Processing**: Process multiple pages together when possible
311
+ - **Model Selection**: Choose appropriate model size for your hardware
312
+ - **Caching**: Models are cached after first load
313
+
314
+ ## 📚 API Reference
315
+
316
+ ### VLLMExtractor Class
317
+
318
+ ```python
319
+ class VLLMExtractor:
320
+ def run_inference(
321
+ self,
322
+ model_inference_instance,
323
+ input_data: List[Dict],
324
+ tables_only: bool = False,
325
+ generic_query: bool = False,
326
+ crop_size: Optional[int] = None,
327
+ apply_annotation: bool = False,
328
+ debug_dir: Optional[str] = None,
329
+ debug: bool = False,
330
+ mode: Optional[str] = None
331
+ ) -> Tuple[List[str], int]
332
+ ```
333
+
334
+ ### InferenceFactory Class
335
+
336
+ ```python
337
+ class InferenceFactory:
338
+ def __init__(self, config: Dict)
339
+ def get_inference_instance(self) -> ModelInference
340
+ ```
341
+
342
+ ## 🏗️ Development
343
+
344
+ ### Building from Source
345
+
346
+ ```bash
347
+ # Clone repository
348
+ git clone https://github.com/katanaml/sparrow.git
349
+ cd sparrow/sparrow-data/parse
350
+
351
+ # Create virtual environment
352
+ python -m venv .env_sparrow_parse
353
+ source .env_sparrow_parse/bin/activate # Linux/Mac
354
+ # or
355
+ .env_sparrow_parse\Scripts\activate # Windows
356
+
357
+ # Install dependencies
358
+ pip install -r requirements.txt
359
+
360
+ # Build package
361
+ pip install setuptools wheel
362
+ python setup.py sdist bdist_wheel
363
+
364
+ # Install locally
365
+ pip install -e .
366
+ ```
367
+
368
+ ### Running Tests
369
+
370
+ ```bash
371
+ python -m pytest tests/
372
+ ```
373
+
374
+ ## 📄 Supported File Formats
375
+
376
+ | Format | Extension | Multi-page | Notes |
377
+ |--------|-----------|------------|-------|
378
+ | PNG | .png | ❌ | Recommended for tables/forms |
379
+ | JPEG | .jpg, .jpeg | ❌ | Good for photos/scanned docs |
380
+ | PDF | .pdf | ✅ | Automatically split into pages |
381
+
382
+ ## 🤝 Contributing
383
+
384
+ We welcome contributions! Please see our [Contributing Guidelines](https://github.com/katanaml/sparrow/blob/main/CONTRIBUTING.md) for details.
385
+
386
+ ## 📞 Support
387
+
388
+ - 📖 [Documentation](https://github.com/katanaml/sparrow)
389
+ - 🐛 [Issue Tracker](https://github.com/katanaml/sparrow/issues)
390
+ - 💼 [Professional Services](mailto:abaranovskis@redsamuraiconsulting.com)
391
+
392
+ ## 📜 License
393
+
394
+ Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.
395
+
396
+ **Commercial Licensing:** Free for organizations with revenue under $5M USD annually. [Contact us](mailto:abaranovskis@redsamuraiconsulting.com) for commercial licensing options.
397
+
398
+ ## 👥 Authors
399
+
400
+ - **[Katana ML](https://katanaml.io)**
401
+ - **[Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)**
402
+
403
+ ---
404
+
405
+ ⭐ **Star us on [GitHub](https://github.com/katanaml/sparrow)** if you find Sparrow Parse useful!
sparrow_parse-1.1.0.dist-info/RECORD → sparrow_parse-1.1.2.dist-info/RECORD CHANGED
@@ -1,4 +1,4 @@
1
- sparrow_parse/__init__.py,sha256=XIz3qAg9G9YysQi3Ryp0CN3rtc_JiecHZ9L2vEzcM6s,21
1
+ sparrow_parse/__init__.py,sha256=mJ0xQIqup8FmjqcIVqLNOKlf8ld4M0MtvFilCTGA0Fw,21
2
2
  sparrow_parse/__main__.py,sha256=Xs1bpJV0n08KWOoQE34FBYn6EBXZA9HIYJKrE4ZdG78,153
3
3
  sparrow_parse/text_extraction.py,sha256=uhYVNK5Q2FZnw1Poa3JWjtN-aEL7cyKpvaltdn0m2II,8948
4
4
  sparrow_parse/extractors/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
@@ -13,9 +13,9 @@ sparrow_parse/vllm/huggingface_inference.py,sha256=RqYmP-wh_cm_BZ271HbejnZe30S5E
13
13
  sparrow_parse/vllm/inference_base.py,sha256=AmWF1OUjJLxSEK_WCbcRpXHX3cKk8nPJJHha_X-9Gs4,844
14
14
  sparrow_parse/vllm/inference_factory.py,sha256=FTM65O-dW2WZchHOrNN7_Q3-FlVoAc65iSptuuUuClM,1166
15
15
  sparrow_parse/vllm/local_gpu_inference.py,sha256=SIyprv12fYawwfxgQ7ZOTM5WmMfQqhO_9vbereRpZdk,652
16
- sparrow_parse/vllm/mlx_inference.py,sha256=opTNOxcTBb6McVEStDECMRcsc_3pnzKSFUmm27h08yA,15466
17
- sparrow_parse-1.1.0.dist-info/METADATA,sha256=yq1Fmcu0rmoxIiIAUR6UK-4xqrM2x5NmVAED9-DuWIw,7229
18
- sparrow_parse-1.1.0.dist-info/WHEEL,sha256=yQN5g4mg4AybRjkgi-9yy4iQEFibGQmlz78Pik5Or-A,92
19
- sparrow_parse-1.1.0.dist-info/entry_points.txt,sha256=HV5nnQVtr2m-kn6hzY_ynp0zugNCcGovbmnfmQgOyhw,53
20
- sparrow_parse-1.1.0.dist-info/top_level.txt,sha256=n6b-WtT91zKLyCPZTP7wvne8v_yvIahcsz-4sX8I0rY,14
21
- sparrow_parse-1.1.0.dist-info/RECORD,,
16
+ sparrow_parse/vllm/mlx_inference.py,sha256=bWNojY0BNxRj42Xe-b3p_drN--XFKUcLB7PZCNZ6UqA,15468
17
+ sparrow_parse-1.1.2.dist-info/METADATA,sha256=ryD0TsA9niM1MtsG5NWfv_uFaS7T1K_J27DqqYIXZrg,10974
18
+ sparrow_parse-1.1.2.dist-info/WHEEL,sha256=yQN5g4mg4AybRjkgi-9yy4iQEFibGQmlz78Pik5Or-A,92
19
+ sparrow_parse-1.1.2.dist-info/entry_points.txt,sha256=HV5nnQVtr2m-kn6hzY_ynp0zugNCcGovbmnfmQgOyhw,53
20
+ sparrow_parse-1.1.2.dist-info/top_level.txt,sha256=n6b-WtT91zKLyCPZTP7wvne8v_yvIahcsz-4sX8I0rY,14
21
+ sparrow_parse-1.1.2.dist-info/RECORD,,
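Each RECORD entry has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed. A minimal sketch for recomputing an entry from file contents. `record_digest` and `record_line` are hypothetical helper names:

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the sha256 digest in the form used by wheel RECORD files."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the trailing '=' padding removed.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def record_line(path: str, data: bytes) -> str:
    """Format one RECORD entry as path,sha256=<digest>,<size in bytes>."""
    return f"{path},sha256={record_digest(data)},{len(data)}"


# An empty file reproduces the entry listed for sparrow_parse/extractors/__init__.py.
print(record_line("sparrow_parse/extractors/__init__.py", b""))
```

Running the same function over the bytes of a changed file, such as `sparrow_parse/vllm/mlx_inference.py`, is a way to check an unpacked wheel against the hashes in this hunk.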
sparrow_parse-1.1.0.dist-info/METADATA REMOVED
@@ -1,187 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: sparrow-parse
3
- Version: 1.1.0
4
- Summary: Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.
5
- Home-page: https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
6
- Author: Andrej Baranovskij
7
- Author-email: andrejus.baranovskis@gmail.com
8
- Project-URL: Homepage, https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
9
- Project-URL: Repository, https://github.com/katanaml/sparrow
10
- Keywords: llm,vllm,ocr,vision
11
- Classifier: Operating System :: OS Independent
12
- Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
13
- Classifier: Topic :: Software Development
14
- Classifier: Programming Language :: Python :: 3.10
15
- Requires-Python: >=3.10
16
- Description-Content-Type: text/markdown
17
- Requires-Dist: rich
18
- Requires-Dist: transformers >=4.51.3
19
- Requires-Dist: torchvision >=0.22.0
20
- Requires-Dist: torch >=2.7.0
21
- Requires-Dist: sentence-transformers >=4.1.0
22
- Requires-Dist: numpy >=2.2.5
23
- Requires-Dist: pypdf >=5.5.0
24
- Requires-Dist: gradio-client >=1.7.2
25
- Requires-Dist: pdf2image >=1.17.0
26
- Requires-Dist: mlx >=0.25.2 ; sys_platform == "darwin" and platform_machine == "arm64"
27
- Requires-Dist: mlx-vlm ==0.1.26 ; sys_platform == "darwin" and platform_machine == "arm64"
28
-
29
- # Sparrow Parse
30
-
31
- ## Description
32
-
33
- This module implements Sparrow Parse [library](https://pypi.org/project/sparrow-parse/) library with helpful methods for data pre-processing, parsing and extracting information. Library relies on Visual LLM functionality, Table Transformers and is part of Sparrow. Check main [README](https://github.com/katanaml/sparrow)
34
-
35
- ## Install
36
-
37
- ```
38
- pip install sparrow-parse
39
- ```
40
-
41
- ## Parsing and extraction
42
-
43
- ### Sparrow Parse VL (vision-language model) extractor with local MLX or Hugging Face Cloud GPU infra
44
-
45
- ```
46
- # run locally: python -m sparrow_parse.extractors.vllm_extractor
47
-
48
- from sparrow_parse.vllm.inference_factory import InferenceFactory
49
- from sparrow_parse.extractors.vllm_extractor import VLLMExtractor
50
-
51
- extractor = VLLMExtractor()
52
-
53
- config = {
54
- "method": "mlx", # Could be 'huggingface', 'mlx' or 'local_gpu'
55
- "model_name": "mlx-community/Qwen2-VL-72B-Instruct-4bit",
56
- }
57
-
58
- # Use the factory to get the correct instance
59
- factory = InferenceFactory(config)
60
- model_inference_instance = factory.get_inference_instance()
61
-
62
- input_data = [
63
- {
64
- "file_path": "/Users/andrejb/Work/katana-git/sparrow/sparrow-ml/llm/data/bonds_table.jpg",
65
- "text_input": "retrieve all data. return response in JSON format"
66
- }
67
- ]
68
-
69
- # Now you can run inference without knowing which implementation is used
70
- results_array, num_pages = extractor.run_inference(model_inference_instance, input_data, tables_only=False,
71
- generic_query=False,
72
- crop_size=80,
73
- debug_dir=None,
74
- debug=True,
75
- mode=None)
76
-
77
- for i, result in enumerate(results_array):
78
- print(f"Result for page {i + 1}:", result)
79
- print(f"Number of pages: {num_pages}")
80
- ```
81
-
82
- Use `tables_only=True` if you want to extract only tables.
83
-
84
- Use `crop_size=N` (where `N` is an integer) to crop N pixels from all borders of the input images. This can be helpful for removing unwanted borders or frame artifacts from scanned documents.
85
-
86
- Use `mode="static"` if you want to simulate LLM call, without executing LLM backend.
87
-
88
- Method `run_inference` will return results and number of pages processed.
89
-
90
- To run with Hugging Face backend use these config values:
91
-
92
- ```
93
- config = {
94
- "method": "huggingface", # Could be 'huggingface' or 'local_gpu'
95
- "hf_space": "katanaml/sparrow-qwen2-vl-7b",
96
- "hf_token": os.getenv('HF_TOKEN'),
97
- }
98
- ```
99
-
100
- Note: GPU backend `katanaml/sparrow-qwen2-vl-7b` is private, to be able to run below command, you need to create your own backend on Hugging Face space using [code](https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse/sparrow_parse/vllm/infra/qwen2_vl_7b) from Sparrow Parse.
101
-
102
- ## PDF pre-processing
103
-
104
- ```
105
- from sparrow_parse.extractor.pdf_optimizer import PDFOptimizer
106
-
107
- pdf_optimizer = PDFOptimizer()
108
-
109
- num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(file_path,
110
- debug_dir,
111
- convert_to_images)
112
-
113
- ```
114
-
115
- Example:
116
-
117
- *file_path* - `/data/invoice_1.pdf`
118
-
119
- *debug_dir* - set to not `None`, for debug purposes only
120
-
121
- *convert_to_images* - default `False`, to split into PDF files
122
-
123
- ## Image cropping
124
-
125
- ```
126
- from sparrow_parse.helpers.image_optimizer import ImageOptimizer
127
-
128
- image_optimizer = ImageOptimizer()
129
-
130
- cropped_file_path = image_optimizer.crop_image_borders(file_path, temp_dir, debug_dir, crop_size)
131
- ```
132
-
133
- Example:
134
-
135
- *file_path* - `/data/invoice_1.jpg`
136
-
137
- *temp_dir* - directory to store cropped files
138
-
139
- *debug_dir* - set to not `None`, for debug purposes only
140
-
141
- *crop_size* - Number of pixels to crop from each border
142
-
143
- ## Library build
144
-
145
- Create Python virtual environment
146
-
147
- ```
148
- python -m venv .env_sparrow_parse
149
- ```
150
-
151
- Install Python libraries
152
-
153
- ```
154
- pip install -r requirements.txt
155
- ```
156
-
157
- Build package
158
-
159
- ```
160
- pip install setuptools wheel
161
- python setup.py sdist bdist_wheel
162
- ```
163
-
164
- Upload to PyPI
165
-
166
- ```
167
- pip install twine
168
- twine upload dist/*
169
- ```
170
-
171
- ## Commercial usage
172
-
173
- Sparrow is available under the GPL 3.0 license, promoting freedom to use, modify, and distribute the software while ensuring any modifications remain open source under the same license. This aligns with our commitment to supporting the open-source community and fostering collaboration.
174
-
175
- Additionally, we recognize the diverse needs of organizations, including small to medium-sized enterprises (SMEs). Therefore, Sparrow is also offered for free commercial use to organizations with gross revenue below $5 million USD in the past 12 months, enabling them to leverage Sparrow without the financial burden often associated with high-quality software solutions.
176
-
177
- For businesses that exceed this revenue threshold or require usage terms not accommodated by the GPL 3.0 license—such as integrating Sparrow into proprietary software without the obligation to disclose source code modifications—we offer dual licensing options. Dual licensing allows Sparrow to be used under a separate proprietary license, offering greater flexibility for commercial applications and proprietary integrations. This model supports both the project's sustainability and the business's needs for confidentiality and customization.
178
-
179
- If your organization is seeking to utilize Sparrow under a proprietary license, or if you are interested in custom workflows, consulting services, or dedicated support and maintenance options, please contact us at abaranovskis@redsamuraiconsulting.com. We're here to provide tailored solutions that meet your unique requirements, ensuring you can maximize the benefits of Sparrow for your projects and workflows.
180
-
181
- ## Author
182
-
183
- [Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
184
-
185
- ## License
186
-
187
- Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).