sparrow-parse 0.1.7__tar.gz → 0.1.9__tar.gz
- sparrow_parse-0.1.9/PKG-INFO +81 -0
- sparrow_parse-0.1.9/README.md +56 -0
- {sparrow_parse-0.1.7 → sparrow_parse-0.1.9}/pyproject.toml +6 -3
- sparrow_parse-0.1.9/sparrow_parse/__init__.py +1 -0
- sparrow_parse-0.1.9/sparrow_parse/extractor/file_processor.py +143 -0
- sparrow_parse-0.1.7/PKG-INFO +0 -28
- sparrow_parse-0.1.7/README.md +0 -7
- sparrow_parse-0.1.7/sparrow_parse/__init__.py +0 -1
- sparrow_parse-0.1.7/sparrow_parse/pdf/pdf_processor.py +0 -7
- {sparrow_parse-0.1.7 → sparrow_parse-0.1.9}/sparrow_parse/__main__.py +0 -0
- {sparrow_parse-0.1.7/sparrow_parse/pdf → sparrow_parse-0.1.9/sparrow_parse/extractor}/__init__.py +0 -0
@@ -0,0 +1,81 @@
+Metadata-Version: 2.1
+Name: sparrow-parse
+Version: 0.1.9
+Summary: Sparrow Parse is a Python package for parsing and extracting information from documents.
+Home-page: https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
+License: GPL-3.0
+Keywords: llm,rag,vision
+Author: Andrej Baranovskij
+Author-email: andrejus.baranovskis@gmail.com
+Requires-Python: >=3.9,<3.12
+Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Topic :: Software Development
+Requires-Dist: rich (>=13.7.1,<14.0.0)
+Requires-Dist: torch (==2.2.2)
+Requires-Dist: unstructured-inference (==0.7.29)
+Requires-Dist: unstructured[all-docs] (==0.13.6)
+Project-URL: Repository, https://github.com/katanaml/sparrow
+Description-Content-Type: text/markdown
+
+# Sparrow Parse
+
+## Description
+
+This module implements Sparrow Parse [library](https://pypi.org/project/sparrow-parse/) with helpful methods for data pre-processing.
+
+## Install
+
+```
+pip install sparrow-parse
+```
+
+## Use
+
+Import
+
+```
+from sparrow_parse.pdf.pdf_processor import PDFProcessor
+```
+
+Usage
+
+```
+processor = PDFProcessor()
+result = processor.process_file(file_path, strategy, model_name)
+```
+
+Build for development
+
+```
+poetry build
+```
+
+Publish to PyPi
+
+```
+poetry publish
+```
+
+## Commercial usage
+
+Sparrow is available under the GPL 3.0 license, promoting freedom to use, modify, and distribute the software while ensuring any modifications remain open source under the same license. This aligns with our commitment to supporting the open-source community and fostering collaboration.
+
+Additionally, we recognize the diverse needs of organizations, including small to medium-sized enterprises (SMEs). Therefore, Sparrow is also offered for free commercial use to organizations with gross revenue below $5 million USD in the past 12 months, enabling them to leverage Sparrow without the financial burden often associated with high-quality software solutions.
+
+For businesses that exceed this revenue threshold or require usage terms not accommodated by the GPL 3.0 license—such as integrating Sparrow into proprietary software without the obligation to disclose source code modifications—we offer dual licensing options. Dual licensing allows Sparrow to be used under a separate proprietary license, offering greater flexibility for commercial applications and proprietary integrations. This model supports both the project's sustainability and the business's needs for confidentiality and customization.
+
+If your organization is seeking to utilize Sparrow under a proprietary license, or if you are interested in custom workflows, consulting services, or dedicated support and maintenance options, please contact us at abaranovskis@redsamuraiconsulting.com. We're here to provide tailored solutions that meet your unique requirements, ensuring you can maximize the benefits of Sparrow for your projects and workflows.
+
+## Author
+
+[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
+
+## License
+
+Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).
+
@@ -0,0 +1,56 @@
+# Sparrow Parse
+
+## Description
+
+This module implements Sparrow Parse [library](https://pypi.org/project/sparrow-parse/) with helpful methods for data pre-processing.
+
+## Install
+
+```
+pip install sparrow-parse
+```
+
+## Use
+
+Import
+
+```
+from sparrow_parse.pdf.pdf_processor import PDFProcessor
+```
+
+Usage
+
+```
+processor = PDFProcessor()
+result = processor.process_file(file_path, strategy, model_name)
+```
+
+Build for development
+
+```
+poetry build
+```
+
+Publish to PyPi
+
+```
+poetry publish
+```
+
+## Commercial usage
+
+Sparrow is available under the GPL 3.0 license, promoting freedom to use, modify, and distribute the software while ensuring any modifications remain open source under the same license. This aligns with our commitment to supporting the open-source community and fostering collaboration.
+
+Additionally, we recognize the diverse needs of organizations, including small to medium-sized enterprises (SMEs). Therefore, Sparrow is also offered for free commercial use to organizations with gross revenue below $5 million USD in the past 12 months, enabling them to leverage Sparrow without the financial burden often associated with high-quality software solutions.
+
+For businesses that exceed this revenue threshold or require usage terms not accommodated by the GPL 3.0 license—such as integrating Sparrow into proprietary software without the obligation to disclose source code modifications—we offer dual licensing options. Dual licensing allows Sparrow to be used under a separate proprietary license, offering greater flexibility for commercial applications and proprietary integrations. This model supports both the project's sustainability and the business's needs for confidentiality and customization.
+
+If your organization is seeking to utilize Sparrow under a proprietary license, or if you are interested in custom workflows, consulting services, or dedicated support and maintenance options, please contact us at abaranovskis@redsamuraiconsulting.com. We're here to provide tailored solutions that meet your unique requirements, ensuring you can maximize the benefits of Sparrow for your projects and workflows.
+
+## Author
+
+[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
+
+## License
+
+Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).
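Note that the README added in 0.1.9 still imports `PDFProcessor` from `sparrow_parse.pdf.pdf_processor`, a module that this diff lists as deleted; the code that ships in 0.1.9 is the `FileProcessor` class in `sparrow_parse/extractor/file_processor.py`. The following is a minimal sketch of a call against that class, assuming the `extract_data` signature shown in the `file_processor.py` hunk and the argument values from its commented-out example (`'hi_res'`, `'yolox'`, `'tables'`). The helper name `extract_tables_from` is hypothetical, and the import is deferred so that the heavy dependencies load only when the function is actually called.

```python
def extract_tables_from(file_path):
    """Hypothetical helper: return the table HTML extracted from a PDF or image."""
    # Deferred import: sparrow-parse 0.1.9 depends on torch and unstructured,
    # so nothing heavy is loaded until this function runs.
    from sparrow_parse.extractor.file_processor import FileProcessor

    processor = FileProcessor()
    return processor.extract_data(
        file_path,
        "hi_res",     # strategy, passed through to the unstructured partition call
        "yolox",      # model_name, passed through the same way
        "tables",     # options: contains "tables", so only Table elements are kept
        local=False,  # print step descriptions instead of showing a rich spinner
        debug=False,  # write the intermediate JSON to a temporary directory
    )
```

With `debug=True` the intermediate JSON would instead be written next to the input file, with its extension swapped to `.json`.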
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "sparrow-parse"
-version = "0.1.
+version = "0.1.9"
 description = "Sparrow Parse is a Python package for parsing and extracting information from documents."
 authors = ["Andrej Baranovskij <andrejus.baranovskis@gmail.com>"]
 license = "GPL-3.0"
@@ -20,8 +20,11 @@ include = [
 
 
 [tool.poetry.dependencies]
-python = "
-
+python = ">=3.9,<3.12"
+torch = {version = "2.2.2", source = "pypi"}
+unstructured = {version = "0.13.6", extras = ["all-docs"]}
+unstructured-inference = "0.7.29"
+rich = "^13.7.1"
 
 
 [tool.poetry.scripts]
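The `python` constraint added here, `>=3.9,<3.12`, matches the `Requires-Python` field in the 0.1.9 PKG-INFO. It admits Python 3.9 and excludes Python 3.12, which the 0.1.7 metadata still listed as a classifier. Below is a small sketch of what that range admits, using only a tuple comparison; the function name `in_supported_range` is an assumption for illustration and is not part of the package.

```python
import sys


def in_supported_range(version=None):
    """Return True if (major, minor) satisfies the range >=3.9,<3.12."""
    # Fall back to the running interpreter when no version tuple is given.
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 12)
```

For example, `in_supported_range((3, 11, 9))` is true, while `in_supported_range((3, 12))` is false.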
@@ -0,0 +1 @@
+__version__ = '0.1.9'
@@ -0,0 +1,143 @@
+import tempfile
+import os
+from unstructured.partition.pdf import partition_pdf
+from unstructured.partition.image import partition_image
+import json
+from unstructured.staging.base import elements_to_json
+from rich.progress import Progress, SpinnerColumn, TextColumn
+
+
+class FileProcessor(object):
+    def __init__(self):
+        pass
+
+    def extract_data(self, file_path, strategy, model_name, options, local=True, debug=False):
+        # check if string options contains word table
+        extract_tables = False
+        if options is not None and "tables" in options:
+            extract_tables = True
+
+        # Extracts the elements from the PDF
+        elements = self.invoke_pipeline_step(
+            lambda: self.process_file(file_path, strategy, model_name),
+            "Extracting elements from the document...",
+            local
+        )
+
+        if debug:
+            new_extension = 'json' # You can change this to any extension you want
+            new_file_path = self.change_file_extension(file_path, new_extension)
+
+            content = self.invoke_pipeline_step(
+                lambda: self.load_text_data(elements, new_file_path, extract_tables),
+                "Loading text data...",
+                local
+            )
+        else:
+            with tempfile.TemporaryDirectory() as temp_dir:
+                temp_file_path = os.path.join(temp_dir, "file_data.json")
+
+                content = self.invoke_pipeline_step(
+                    lambda: self.load_text_data(elements, temp_file_path, extract_tables),
+                    "Loading text data...",
+                    local
+                )
+
+        return content
+
+    def process_file(self, file_path, strategy, model_name):
+        elements = None
+
+        if file_path.lower().endswith('.pdf'):
+            elements = partition_pdf(
+                filename=file_path,
+                strategy=strategy,
+                infer_table_structure=True,
+                model_name=model_name
+            )
+        elif file_path.lower().endswith(('.jpg', '.jpeg', '.png')):
+            elements = partition_image(
+                filename=file_path,
+                strategy=strategy,
+                infer_table_structure=True,
+                model_name=model_name
+            )
+
+        return elements
+
+    def change_file_extension(self, file_path, new_extension):
+        # Check if the new extension starts with a dot and add one if not
+        if not new_extension.startswith('.'):
+            new_extension = '.' + new_extension
+
+        # Split the file path into two parts: the base (everything before the last dot) and the extension
+        # If there's no dot in the filename, it'll just return the original filename without an extension
+        base = file_path.rsplit('.', 1)[0]
+
+        # Concatenate the base with the new extension
+        new_file_path = base + new_extension
+
+        return new_file_path
+
+    def load_text_data(self, elements, file_path, extract_tables):
+        elements_to_json(elements, filename=file_path)
+        text_file = self.process_json_file(file_path, extract_tables)
+
+        with open(text_file, 'r') as file:
+            content = file.read()
+
+        return content
+
+    def process_json_file(self, input_data, extract_tables):
+        # Read the JSON file
+        with open(input_data, 'r') as file:
+            data = json.load(file)
+
+        # Iterate over the JSON data and extract required table elements
+        extracted_elements = []
+        for entry in data:
+            if entry["type"] == "Table":
+                extracted_elements.append(entry["metadata"]["text_as_html"])
+            elif entry["type"] == "Title" and extract_tables is False:
+                extracted_elements.append(entry["text"])
+            elif entry["type"] == "NarrativeText" and extract_tables is False:
+                extracted_elements.append(entry["text"])
+            elif entry["type"] == "UncategorizedText" and extract_tables is False:
+                extracted_elements.append(entry["text"])
+
+        # Write the extracted elements to the output file
+        new_extension = 'txt' # You can change this to any extension you want
+        new_file_path = self.change_file_extension(input_data, new_extension)
+        with open(new_file_path, 'w') as output_file:
+            for element in extracted_elements:
+                output_file.write(element + "\n\n") # Adding two newlines for separation
+
+        return new_file_path
+
+    def invoke_pipeline_step(self, task_call, task_description, local):
+        if local:
+            with Progress(
+                    SpinnerColumn(),
+                    TextColumn("[progress.description]{task.description}"),
+                    transient=False,
+            ) as progress:
+                progress.add_task(description=task_description, total=None)
+                ret = task_call()
+        else:
+            print(task_description)
+            ret = task_call()
+
+        return ret
+
+
+# if __name__ == "__main__":
+# processor = FileProcessor()
+# content = processor.extract_data('/Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf',
+# 'hi_res',
+# 'yolox',
+# 'tables',
+# False,
+# True)
+# processor.extract_data("/Users/andrejb/Documents/work/lifung/lemming_test/C16E150001_SUPINV.pdf")
+# processor.extract_data("/Users/andrejb/Documents/work/epik/bankstatement/OCBC_1_single.pdf")
+# print(content)
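The branching in `process_json_file` reduces to one rule: `Table` entries are always kept (as their `text_as_html` metadata), while `Title`, `NarrativeText`, and `UncategorizedText` entries are kept only when table-only extraction is off. Any other element type is dropped. Here is a self-contained sketch of that rule on in-memory dictionaries, without the JSON and text file round-trip; the function name `select_elements` and the sample data are illustrative assumptions, not part of the package.

```python
# Element types whose plain text is kept when table-only extraction is off.
TEXT_TYPES = {"Title", "NarrativeText", "UncategorizedText"}


def select_elements(entries, extract_tables):
    """Mirror the filtering rule of process_json_file on a list of dicts."""
    selected = []
    for entry in entries:
        if entry["type"] == "Table":
            # Tables are always kept, in their HTML form.
            selected.append(entry["metadata"]["text_as_html"])
        elif entry["type"] in TEXT_TYPES and not extract_tables:
            selected.append(entry["text"])
    return selected


# Made-up sample data in the shape the hunk reads from the JSON file.
table_html = "<table><tr><td>Qty</td><td>2</td></tr></table>"
sample = [
    {"type": "Title", "text": "Invoice"},
    {"type": "Table", "text": "Qty 2", "metadata": {"text_as_html": table_html}},
    {"type": "Footer", "text": "Page 1"},
]

tables_only = select_elements(sample, extract_tables=True)
text_and_tables = select_elements(sample, extract_tables=False)
```

The sketch tests `not extract_tables` where the original tests `extract_tables is False`; the two agree for the boolean that `extract_data` passes in.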
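One edge case in `change_file_extension` is worth spelling out: it splits on the last dot anywhere in the path, so a path whose file name has no extension but whose directory name contains a dot is cut inside the directory part. Below is a short comparison with `os.path.splitext`, which only considers the final path component. Both function names are illustrative assumptions; the first mirrors the logic in the hunk, and the second is one possible alternative, not something the package does.

```python
import os


def swap_extension_rsplit(file_path, new_extension):
    """Same logic as change_file_extension in the hunk: split on the last dot."""
    if not new_extension.startswith("."):
        new_extension = "." + new_extension
    return file_path.rsplit(".", 1)[0] + new_extension


def swap_extension_splitext(file_path, new_extension):
    """Alternative: let os.path.splitext find the extension of the last component."""
    if not new_extension.startswith("."):
        new_extension = "." + new_extension
    return os.path.splitext(file_path)[0] + new_extension


# Typical input: both variants agree.
typical = swap_extension_rsplit("data/invoice_1.pdf", "json")

# A dot in the directory name and no extension on the file name: they differ.
cut_in_directory = swap_extension_rsplit("scans.v2/invoice", "json")
kept_directory = swap_extension_splitext("scans.v2/invoice", "json")
```

For the inputs that `process_file` accepts (paths ending in `.pdf`, `.jpg`, `.jpeg`, or `.png`), the two variants give the same result, so the difference only matters for other callers.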
sparrow_parse-0.1.7/PKG-INFO
DELETED
@@ -1,28 +0,0 @@
-Metadata-Version: 2.1
-Name: sparrow-parse
-Version: 0.1.7
-Summary: Sparrow Parse is a Python package for parsing and extracting information from documents.
-Home-page: https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
-License: GPL-3.0
-Keywords: llm,rag,vision
-Author: Andrej Baranovskij
-Author-email: andrejus.baranovskis@gmail.com
-Requires-Python: >=3.10,<4.0
-Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
-Classifier: Operating System :: OS Independent
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Classifier: Topic :: Software Development
-Requires-Dist: requests (>=2.31.0,<3.0.0)
-Project-URL: Repository, https://github.com/katanaml/sparrow
-Description-Content-Type: text/markdown
-
-## Author
-
-[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
-
-## License
-
-Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).
sparrow_parse-0.1.7/README.md
DELETED
@@ -1,7 +0,0 @@
-## Author
-
-[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
-
-## License
-
-Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).
@@ -1 +0,0 @@
-__version__ = '0.1.7'
File without changes
{sparrow_parse-0.1.7/sparrow_parse/pdf → sparrow_parse-0.1.9/sparrow_parse/extractor}/__init__.py
RENAMED
File without changes