swarmauri_parser_fitzpdf 0.8.0.dev4__tar.gz → 0.8.0.dev22__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- swarmauri_parser_fitzpdf-0.8.0.dev22/PKG-INFO +107 -0
- swarmauri_parser_fitzpdf-0.8.0.dev22/README.md +82 -0
- {swarmauri_parser_fitzpdf-0.8.0.dev4 → swarmauri_parser_fitzpdf-0.8.0.dev22}/pyproject.toml +12 -1
- swarmauri_parser_fitzpdf-0.8.0.dev4/PKG-INFO +0 -64
- swarmauri_parser_fitzpdf-0.8.0.dev4/README.md +0 -45
- {swarmauri_parser_fitzpdf-0.8.0.dev4 → swarmauri_parser_fitzpdf-0.8.0.dev22}/LICENSE +0 -0
- {swarmauri_parser_fitzpdf-0.8.0.dev4 → swarmauri_parser_fitzpdf-0.8.0.dev22}/swarmauri_parser_fitzpdf/FitzPdfParser.py +0 -0
- {swarmauri_parser_fitzpdf-0.8.0.dev4 → swarmauri_parser_fitzpdf-0.8.0.dev22}/swarmauri_parser_fitzpdf/__init__.py +0 -0
--- /dev/null
+++ swarmauri_parser_fitzpdf-0.8.0.dev22/PKG-INFO
@@ -0,0 +1,107 @@
+Metadata-Version: 2.4
+Name: swarmauri_parser_fitzpdf
+Version: 0.8.0.dev22
+Summary: Fitz PDF Parser for Swarmauri.
+License-Expression: Apache-2.0
+License-File: LICENSE
+Keywords: swarmauri,parser,fitzpdf,fitz,pdf
+Author: Jacob Stewart
+Author-email: jacob@swarmauri.com
+Requires-Python: >=3.10,<3.13
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Natural Language :: English
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
+Requires-Dist: PyMuPDF (>=1.24.12)
+Requires-Dist: swarmauri_base
+Requires-Dist: swarmauri_core
+Requires-Dist: swarmauri_standard
+Description-Content-Type: text/markdown
+
+
+
+<p align="center">
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/dm/swarmauri_parser_fitzpdf" alt="PyPI - Downloads"/></a>
+<a href="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf/">
+<img alt="Hits" src="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf.svg"/></a>
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/pyversions/swarmauri_parser_fitzpdf" alt="PyPI - Python Version"/></a>
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/l/swarmauri_parser_fitzpdf" alt="PyPI - License"/></a>
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/v/swarmauri_parser_fitzpdf?label=swarmauri_parser_fitzpdf&color=green" alt="PyPI - swarmauri_parser_fitzpdf"/></a>
+</p>
+
+---
+
+# Swarmauri Parser Fitz PDF
+
+PDF-to-text parser for Swarmauri built on [PyMuPDF (`pymupdf`)](https://pymupdf.readthedocs.io/). Extracts text from every page of a PDF and returns a `Document` object with the aggregated content and source metadata.
+
+## Features
+
+- Opens PDFs via PyMuPDF and collects text per page.
+- Emits a single `Document` with `content` containing the combined text and `metadata['source']` holding the file path.
+- Raises a clear error if the input is not a file path string; returns an empty list if PyMuPDF encounters parsing failures.
+
+## Prerequisites
+
+- Python 3.10 or newer.
+- PyMuPDF (`pymupdf`) along with system dependencies (X11 libraries on Linux, poppler on some distros). Install OS packages listed in [PyMuPDF docs](https://pymupdf.readthedocs.io/en/latest/installation.html) before pip installing if needed.
+- Read access to the PDF files you plan to parse.
+
+## Installation
+
+```bash
+# pip
+pip install swarmauri_parser_fitzpdf
+
+# poetry
+poetry add swarmauri_parser_fitzpdf
+
+# uv (pyproject-based projects)
+uv add swarmauri_parser_fitzpdf
+```
+
+## Quickstart
+
+```python
+from swarmauri_parser_fitzpdf import FitzPdfParser
+
+parser = FitzPdfParser()
+documents = parser.parse("reports/quarterly.pdf")
+
+for doc in documents:
+    print(doc.metadata["source"])
+    print(doc.content[:500])
+```
+
+## Handling Errors
+
+```python
+from swarmauri_parser_fitzpdf import FitzPdfParser
+
+parser = FitzPdfParser()
+try:
+    docs = parser.parse("missing.pdf")
+    if not docs:
+        print("Parsing failed or returned no content.")
+except ValueError as exc:
+    print(f"Bad input: {exc}")
+```
+
+## Tips
+
+- Pre-process PDFs (deskew, OCR) before parsing if they contain scanned pages without embedded text; PyMuPDF only extracts existing text objects.
+- For multi-document pipelines, pair this parser with Swarmauri token-count measurements or summarizers to chunk large PDFs.
+- Cache parsed output if the same PDF is accessed frequently—parsing large documents repeatedly is expensive.
+
+## Want to help?
+
+If you want to contribute to swarmauri-sdk, read up on our [guidelines for contributing](https://github.com/swarmauri/swarmauri-sdk/blob/master/contributing.md) that will help you get started.
+
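Reviewer note: the Features list in the new package description states a contract (text collected per page, one combined `Document`, an error for non-string input, an empty list on parse failure). The sketch below illustrates that contract in plain Python. It is not the code in `FitzPdfParser.py`, which this diff reports as unchanged and does not show. `SketchDocument`, `parse_pdf`, `fake_reader`, and the injected `read_pages` callable are hypothetical names; joining pages with newlines is an assumption; `ValueError` is inferred from the README's own error-handling example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class SketchDocument:
    """Hypothetical stand-in for the Swarmauri Document type."""
    content: str
    metadata: dict = field(default_factory=dict)


def parse_pdf(path, read_pages: Callable[[str], Iterable[str]]) -> list:
    """Return [SketchDocument] on success and [] when the reader fails."""
    if not isinstance(path, str):
        # The README says non-string input raises; its example catches ValueError.
        raise ValueError("expected a file path string")
    try:
        pages = list(read_pages(path))  # one text string per page
    except Exception:
        return []  # parsing failures yield an empty list, per the README
    # Page separator is an assumption; the README only says "combined text".
    return [SketchDocument(content="\n".join(pages), metadata={"source": path})]


# With PyMuPDF installed, a reader could look like this (assumption):
#   import pymupdf
#   def read_pages(path):
#       with pymupdf.open(path) as pdf:
#           return [page.get_text() for page in pdf]
def fake_reader(path):
    """Illustrative reader so the sketch runs without PyMuPDF."""
    if path == "missing.pdf":
        raise FileNotFoundError(path)
    return ["page one", "page two"]


docs = parse_pdf("reports/quarterly.pdf", fake_reader)
print(docs[0].metadata["source"], len(parse_pdf("missing.pdf", fake_reader)))
```

Injecting the reader keeps the sketch runnable without the PyMuPDF dependency and makes the two failure paths (bad input type, reader error) easy to exercise separately.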
--- /dev/null
+++ swarmauri_parser_fitzpdf-0.8.0.dev22/README.md
@@ -0,0 +1,82 @@
+
+
+<p align="center">
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/dm/swarmauri_parser_fitzpdf" alt="PyPI - Downloads"/></a>
+<a href="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf/">
+<img alt="Hits" src="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf.svg"/></a>
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/pyversions/swarmauri_parser_fitzpdf" alt="PyPI - Python Version"/></a>
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/l/swarmauri_parser_fitzpdf" alt="PyPI - License"/></a>
+<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
+<img src="https://img.shields.io/pypi/v/swarmauri_parser_fitzpdf?label=swarmauri_parser_fitzpdf&color=green" alt="PyPI - swarmauri_parser_fitzpdf"/></a>
+</p>
+
+---
+
+# Swarmauri Parser Fitz PDF
+
+PDF-to-text parser for Swarmauri built on [PyMuPDF (`pymupdf`)](https://pymupdf.readthedocs.io/). Extracts text from every page of a PDF and returns a `Document` object with the aggregated content and source metadata.
+
+## Features
+
+- Opens PDFs via PyMuPDF and collects text per page.
+- Emits a single `Document` with `content` containing the combined text and `metadata['source']` holding the file path.
+- Raises a clear error if the input is not a file path string; returns an empty list if PyMuPDF encounters parsing failures.
+
+## Prerequisites
+
+- Python 3.10 or newer.
+- PyMuPDF (`pymupdf`) along with system dependencies (X11 libraries on Linux, poppler on some distros). Install OS packages listed in [PyMuPDF docs](https://pymupdf.readthedocs.io/en/latest/installation.html) before pip installing if needed.
+- Read access to the PDF files you plan to parse.
+
+## Installation
+
+```bash
+# pip
+pip install swarmauri_parser_fitzpdf
+
+# poetry
+poetry add swarmauri_parser_fitzpdf
+
+# uv (pyproject-based projects)
+uv add swarmauri_parser_fitzpdf
+```
+
+## Quickstart
+
+```python
+from swarmauri_parser_fitzpdf import FitzPdfParser
+
+parser = FitzPdfParser()
+documents = parser.parse("reports/quarterly.pdf")
+
+for doc in documents:
+    print(doc.metadata["source"])
+    print(doc.content[:500])
+```
+
+## Handling Errors
+
+```python
+from swarmauri_parser_fitzpdf import FitzPdfParser
+
+parser = FitzPdfParser()
+try:
+    docs = parser.parse("missing.pdf")
+    if not docs:
+        print("Parsing failed or returned no content.")
+except ValueError as exc:
+    print(f"Bad input: {exc}")
+```
+
+## Tips
+
+- Pre-process PDFs (deskew, OCR) before parsing if they contain scanned pages without embedded text; PyMuPDF only extracts existing text objects.
+- For multi-document pipelines, pair this parser with Swarmauri token-count measurements or summarizers to chunk large PDFs.
+- Cache parsed output if the same PDF is accessed frequently—parsing large documents repeatedly is expensive.
+
+## Want to help?
+
+If you want to contribute to swarmauri-sdk, read up on our [guidelines for contributing](https://github.com/swarmauri/swarmauri-sdk/blob/master/contributing.md) that will help you get started.
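Reviewer note: the Tips section added in this README recommends caching parsed output for PDFs that are read repeatedly, but gives no example. One possible approach, sketched with the standard library only, keys the cache on the absolute path plus the file's modification time and size, so an edited file is parsed again. `CachedParser`, `fake_parse`, and the `parse_fn` argument are hypothetical names; in real use `parse_fn` would presumably be `FitzPdfParser().parse`, which is an assumption rather than something this diff shows.

```python
import os
import tempfile


class CachedParser:
    """Hypothetical wrapper that memoises parse results per (path, mtime, size)."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn
        self._cache = {}
        self.misses = 0  # number of times parse_fn was actually called

    def parse(self, path):
        stat = os.stat(path)  # raises FileNotFoundError for a missing file
        key = (os.path.abspath(path), stat.st_mtime_ns, stat.st_size)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._parse_fn(path)
        return self._cache[key]


def fake_parse(path):
    """Stand-in for a real parser so the sketch runs without PyMuPDF."""
    with open(path, "rb") as fh:
        return [fh.read()]


# Write a small throwaway file to parse twice.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 demo")
    demo_path = tmp.name

cached = CachedParser(fake_parse)
first = cached.parse(demo_path)
second = cached.parse(demo_path)  # served from the cache
os.remove(demo_path)
```

Two caveats with this sketch: entries for superseded file versions are never evicted, and the `os.stat` call raises for a missing path, whereas the README says the parser itself returns an empty list on parsing failures.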
--- swarmauri_parser_fitzpdf-0.8.0.dev4/pyproject.toml
+++ swarmauri_parser_fitzpdf-0.8.0.dev22/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "swarmauri_parser_fitzpdf"
-version = "0.8.0.
+version = "0.8.0.dev22"
 description = "Fitz PDF Parser for Swarmauri."
 license = "Apache-2.0"
 readme = "README.md"
@@ -11,6 +11,10 @@ classifiers = [
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
+    "Natural Language :: English",
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "Topic :: Software Development :: Libraries :: Application Frameworks",
 ]
 authors = [{ name = "Jacob Stewart", email = "jacob@swarmauri.com" }]
 dependencies = [
@@ -19,6 +23,13 @@ dependencies = [
     "swarmauri_base",
     "swarmauri_standard",
 ]
+keywords = [
+    "swarmauri",
+    "parser",
+    "fitzpdf",
+    "fitz",
+    "pdf",
+]
 
 [tool.uv.sources]
 swarmauri_core = { workspace = true }
--- swarmauri_parser_fitzpdf-0.8.0.dev4/PKG-INFO
+++ /dev/null
@@ -1,64 +0,0 @@
-Metadata-Version: 2.3
-Name: swarmauri_parser_fitzpdf
-Version: 0.8.0.dev4
-Summary: Fitz PDF Parser for Swarmauri.
-License: Apache-2.0
-Author: Jacob Stewart
-Author-email: jacob@swarmauri.com
-Requires-Python: >=3.10,<3.13
-Classifier: License :: OSI Approved :: Apache Software License
-Classifier: Programming Language :: Python :: 3.10
-Classifier: Programming Language :: Python :: 3.11
-Classifier: Programming Language :: Python :: 3.12
-Requires-Dist: PyMuPDF (>=1.24.12)
-Requires-Dist: swarmauri_base
-Requires-Dist: swarmauri_core
-Requires-Dist: swarmauri_standard
-Description-Content-Type: text/markdown
-
-
-
-
-<p align="center">
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/dm/swarmauri_parser_fitzpdf" alt="PyPI - Downloads"/></a>
-<a href="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf/">
-<img alt="Hits" src="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf.svg"/></a>
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/pyversions/swarmauri_parser_fitzpdf" alt="PyPI - Python Version"/></a>
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/l/swarmauri_parser_fitzpdf" alt="PyPI - License"/></a>
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/v/swarmauri_parser_fitzpdf?label=swarmauri_parser_fitzpdf&color=green" alt="PyPI - swarmauri_parser_fitzpdf"/></a>
-</p>
-
----
-
-# Swarmauri Parser Fitz PDF
-
-A parser to extract text from PDF files using PyMuPDF.
-
-## Installation
-
-```bash
-pip install swarmauri_parser_fitzpdf
-```
-
-## Usage
-Basic usage examples with code snippets
-```python
-from swarmauri.parsers.FitzPdfParser import PDFtoTextParser
-
-parser = PDFtoTextParser()
-file_path = "path/to/your/pdf/file.pdf"
-documents = parser.parse(file_path)
-
-for document in documents:
-    print(document.content)
-    print(document.metadata)
-```
-
-## Want to help?
-
-If you want to contribute to swarmauri-sdk, read up on our [guidelines for contributing](https://github.com/swarmauri/swarmauri-sdk/blob/master/contributing.md) that will help you get started.
-
--- swarmauri_parser_fitzpdf-0.8.0.dev4/README.md
+++ /dev/null
@@ -1,45 +0,0 @@
-
-
-
-<p align="center">
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/dm/swarmauri_parser_fitzpdf" alt="PyPI - Downloads"/></a>
-<a href="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf/">
-<img alt="Hits" src="https://hits.sh/github.com/swarmauri/swarmauri-sdk/tree/master/pkgs/community/swarmauri_parser_fitzpdf.svg"/></a>
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/pyversions/swarmauri_parser_fitzpdf" alt="PyPI - Python Version"/></a>
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/l/swarmauri_parser_fitzpdf" alt="PyPI - License"/></a>
-<a href="https://pypi.org/project/swarmauri_parser_fitzpdf/">
-<img src="https://img.shields.io/pypi/v/swarmauri_parser_fitzpdf?label=swarmauri_parser_fitzpdf&color=green" alt="PyPI - swarmauri_parser_fitzpdf"/></a>
-</p>
-
----
-
-# Swarmauri Parser Fitz PDF
-
-A parser to extract text from PDF files using PyMuPDF.
-
-## Installation
-
-```bash
-pip install swarmauri_parser_fitzpdf
-```
-
-## Usage
-Basic usage examples with code snippets
-```python
-from swarmauri.parsers.FitzPdfParser import PDFtoTextParser
-
-parser = PDFtoTextParser()
-file_path = "path/to/your/pdf/file.pdf"
-documents = parser.parse(file_path)
-
-for document in documents:
-    print(document.content)
-    print(document.metadata)
-```
-
-## Want to help?
-
-If you want to contribute to swarmauri-sdk, read up on our [guidelines for contributing](https://github.com/swarmauri/swarmauri-sdk/blob/master/contributing.md) that will help you get started.
LICENSE: File without changes
swarmauri_parser_fitzpdf/FitzPdfParser.py: File without changes
swarmauri_parser_fitzpdf/__init__.py: File without changes