natural-pdf 0.1.7__py3-none-any.whl → 0.1.8__py3-none-any.whl
- docs/categorizing-documents/index.md +168 -0
- docs/data-extraction/index.md +87 -0
- docs/element-selection/index.ipynb +218 -164
- docs/element-selection/index.md +20 -0
- docs/index.md +19 -0
- docs/ocr/index.md +63 -16
- docs/tutorials/01-loading-and-extraction.ipynb +1713 -34
- docs/tutorials/02-finding-elements.ipynb +123 -46
- docs/tutorials/03-extracting-blocks.ipynb +24 -19
- docs/tutorials/04-table-extraction.ipynb +17 -12
- docs/tutorials/05-excluding-content.ipynb +37 -32
- docs/tutorials/06-document-qa.ipynb +36 -31
- docs/tutorials/07-layout-analysis.ipynb +45 -40
- docs/tutorials/07-working-with-regions.ipynb +61 -60
- docs/tutorials/08-spatial-navigation.ipynb +76 -71
- docs/tutorials/09-section-extraction.ipynb +160 -155
- docs/tutorials/10-form-field-extraction.ipynb +71 -66
- docs/tutorials/11-enhanced-table-processing.ipynb +11 -6
- docs/tutorials/12-ocr-integration.ipynb +3420 -312
- docs/tutorials/12-ocr-integration.md +68 -106
- docs/tutorials/13-semantic-search.ipynb +641 -251
- natural_pdf/__init__.py +2 -0
- natural_pdf/classification/manager.py +343 -0
- natural_pdf/classification/mixin.py +149 -0
- natural_pdf/classification/results.py +62 -0
- natural_pdf/collections/mixins.py +63 -0
- natural_pdf/collections/pdf_collection.py +321 -15
- natural_pdf/core/element_manager.py +67 -0
- natural_pdf/core/page.py +227 -64
- natural_pdf/core/pdf.py +387 -378
- natural_pdf/elements/collections.py +272 -41
- natural_pdf/elements/region.py +99 -15
- natural_pdf/elements/text.py +5 -2
- natural_pdf/exporters/paddleocr.py +1 -1
- natural_pdf/extraction/manager.py +134 -0
- natural_pdf/extraction/mixin.py +246 -0
- natural_pdf/extraction/result.py +37 -0
- natural_pdf/ocr/engine_easyocr.py +6 -3
- natural_pdf/ocr/ocr_manager.py +85 -25
- natural_pdf/ocr/ocr_options.py +33 -10
- natural_pdf/ocr/utils.py +14 -3
- natural_pdf/qa/document_qa.py +0 -4
- natural_pdf/selectors/parser.py +363 -238
- natural_pdf/templates/finetune/fine_tune_paddleocr.md +10 -5
- natural_pdf/utils/locks.py +8 -0
- natural_pdf/utils/text_extraction.py +52 -1
- natural_pdf/utils/tqdm_utils.py +43 -0
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/METADATA +6 -1
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/RECORD +52 -41
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/WHEEL +1 -1
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/licenses/LICENSE +0 -0
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/top_level.txt +0 -0
docs/element-selection/index.md
CHANGED
@@ -141,6 +141,26 @@ page.find_all('text:bold').show()
 page.find_all('text[size>=11]:bold')
 ```
 
+### Negation Pseudo-class (`:not()`)
+
+You can exclude elements that match a certain selector using the `:not()` pseudo-class. It takes another simple selector as its argument.
+
+```python
+# Find all text elements that are NOT bold
+non_bold_text = page.find_all('text:not(:bold)')
+
+# Find all elements that are NOT regions of type 'table'
+not_tables = page.find_all(':not(region[type=table])')
+
+# Find text elements that do not contain "Total" (case-insensitive)
+relevant_text = page.find_all('text:not(:contains("Total"))', case=False)
+
+# Find text elements that are not empty
+non_empty_text = page.find_all('text:not(:empty)')
+```
+
+**Note:** The selector inside `:not()` follows the same rules as regular selectors but currently does not support combinators (like `>`, `+`, `~`, or descendant space) within `:not()`. You can nest basic type, attribute, and other pseudo-class selectors.
+
 ### Spatial Pseudo-Classes Examples
 
 ```python
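Reviewer note on the `:not()` hunk above: the semantics it documents can be modelled as plain predicate negation. The sketch below is only an illustration under that assumption, not natural-pdf's selector parser. `Element`, `is_type`, `contains`, `negate`, `all_of`, and `select` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """Hypothetical stand-in for a page element; not a natural_pdf class."""
    type: str
    text: str = ""
    bold: bool = False


Predicate = Callable[[Element], bool]


def is_type(name: str) -> Predicate:
    """Match elements of the given type (the 'text' in 'text:bold')."""
    return lambda el: el.type == name


def is_bold(el: Element) -> bool:
    """Toy analogue of the ':bold' pseudo-class."""
    return el.bold


def contains(needle: str, case: bool = True) -> Predicate:
    """Toy analogue of ':contains("...")' with an optional case-insensitive mode."""
    def check(el: Element) -> bool:
        haystack = el.text if case else el.text.lower()
        target = needle if case else needle.lower()
        return target in haystack
    return check


def negate(inner: Predicate) -> Predicate:
    """Toy analogue of ':not(...)': keep elements the inner selector rejects."""
    return lambda el: not inner(el)


def all_of(*preds: Predicate) -> Predicate:
    """Compound selector: every part must match."""
    return lambda el: all(p(el) for p in preds)


def select(elements: List[Element], pred: Predicate) -> List[Element]:
    return [el for el in elements if pred(el)]


elements = [
    Element("text", "Invoice", bold=True),
    Element("text", "Grand TOTAL: 42"),
    Element("text", "Thank you"),
    Element("region"),
]

# Analogue of 'text:not(:bold)'
non_bold = select(elements, all_of(is_type("text"), negate(is_bold)))

# Analogue of 'text:not(:contains("Total"))' with case=False
no_total = select(
    elements, all_of(is_type("text"), negate(contains("Total", case=False)))
)
```

In this model the argument of `negate` is a single compound predicate. That mirrors the documented restriction that `:not()` accepts a simple selector without combinators.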
docs/index.md
CHANGED
@@ -132,6 +132,25 @@ if result.get("found", False):
 
 [Learn about Document QA →](document-qa/index.ipynb)
 
+### Classify Pages and Regions
+
+Categorize pages or specific regions based on their content using text or vision models.
+
+**Note:** Requires `pip install "natural-pdf[classification]"`
+
+```python
+# Classify a page based on text
+categories = ["invoice", "scientific article", "presentation"]
+page.classify(categories=categories, model="text")
+print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+
+
+# Classify a page based on what it looks like
+categories = ["invoice", "scientific article", "presentation"]
+page.classify(categories=categories, model="vision")
+print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+```
+
 ### Visualize Your Work
 
 Debug and understand your extractions visually.
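Reviewer note on the classification hunk above: the added docs print `page.category` and `page.category_confidence` but do not show how to act on a low score. The sketch below is a minimal illustration. It assumes only those two attribute names from the diff; the helper `confident_category`, the `0.6` threshold, and the `SimpleNamespace` stand-ins for pages are hypothetical and not part of natural-pdf.

```python
from types import SimpleNamespace
from typing import Optional


def confident_category(page, threshold: float = 0.6) -> Optional[str]:
    """Return page.category only when its confidence meets the threshold.

    `page` is any object exposing `category` and `category_confidence`
    (the attribute names shown in the diff). The threshold is an
    arbitrary illustrative value, not a library default.
    """
    category = getattr(page, "category", None)
    confidence = getattr(page, "category_confidence", None)
    if category is None or confidence is None:
        return None  # page was never classified
    return category if confidence >= threshold else None


# Stand-in objects so the sketch runs without natural-pdf installed.
sure = SimpleNamespace(category="invoice", category_confidence=0.91)
unsure = SimpleNamespace(category="presentation", category_confidence=0.34)
unclassified = SimpleNamespace()

for label, page in [("sure", sure), ("unsure", unsure), ("unclassified", unclassified)]:
    print(label, "->", confident_category(page))
```

Pages for which the helper returns `None` could then be routed to manual review rather than trusted automatically.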
docs/ocr/index.md
CHANGED
@@ -6,16 +6,16 @@ Natural PDF includes OCR (Optical Character Recognition) to extract text from sc
 
 Natural PDF supports multiple OCR engines:
 
-| Feature | EasyOCR | PaddleOCR | Surya OCR |
-
-| **Installation** | `natural-pdf[easyocr]` | `natural-pdf[paddle]` | `natural-pdf[surya]` |
-| **Primary Strength** | Good general performance, simpler | Excellent Asian language, speed | High accuracy, multilingual lines |
-| **Speed** | Moderate | Fast | Moderate (GPU recommended) |
-| **Memory Usage** | Higher | Efficient | Higher (GPU recommended) |
-| **Paragraph Detect** | Yes (via option) | No | No (focuses on lines) |
-| **Handwritten** | Better support | Limited | Limited |
-| **Small Text** | Moderate | Good | Good |
-| **When to Use** | General documents, handwritten text| Asian languages, speed-critical tasks | Highest accuracy needed, line-level |
+| Feature | EasyOCR | PaddleOCR | Surya OCR | Gemini (Layout + potential OCR) |
+|----------------------|------------------------------------|------------------------------------------|---------------------------------------|--------------------------------------|
+| **Installation** | `natural-pdf[easyocr]` | `natural-pdf[paddle]` | `natural-pdf[surya]` | `natural-pdf[gemini]` |
+| **Primary Strength** | Good general performance, simpler | Excellent Asian language, speed | High accuracy, multilingual lines | Advanced layout analysis (via API) |
+| **Speed** | Moderate | Fast | Moderate (GPU recommended) | API Latency |
+| **Memory Usage** | Higher | Efficient | Higher (GPU recommended) | N/A (API) |
+| **Paragraph Detect** | Yes (via option) | No | No (focuses on lines) | Yes (Layout model) |
+| **Handwritten** | Better support | Limited | Limited | Potentially (API model dependent) |
+| **Small Text** | Moderate | Good | Good | Potentially (API model dependent) |
+| **When to Use** | General documents, handwritten text| Asian languages, speed-critical tasks | Highest accuracy needed, line-level | Complex layouts, API integration |
 
 ## Basic OCR Usage
 
@@ -53,6 +53,7 @@ For advanced, engine-specific settings, use the Options classes:
 
 ```python
 from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
+from natural_pdf.analyzers.layout import GeminiOptions # Note: Gemini is primarily layout
 
 # --- Configure PaddleOCR ---
 paddle_opts = PaddleOCROptions(
@@ -90,6 +91,25 @@ surya_opts = SuryaOCROptions(
     # set via environment variables (see note below).
 )
 ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
+
+# --- Configure Gemini (as layout analyzer, can be used with OCR) ---
+# Gemini requires API key (GOOGLE_API_KEY environment variable)
+# Note: Gemini is used via apply_layout, but its options can influence OCR if used together
+gemini_opts = GeminiOptions(
+    prompt="Extract text content and identify document elements.",
+    # model_name="gemini-1.5-flash-latest" # Specify a model if needed
+    # See GeminiOptions documentation for more parameters
+)
+# Typically used like this (layout first, then potentially OCR on regions)
+layout_elements = page.apply_layout(engine='gemini', options=gemini_opts)
+# If Gemini also performed OCR or you want to OCR layout regions:
+# ocr_elements = some_region.apply_ocr(...)
+
+# It can sometimes be used directly if the model supports it, but less common:
+# try:
+# ocr_elements = page.apply_ocr(engine='gemini', options=gemini_opts)
+# except Exception as e:
+# print(f"Gemini might not be configured for direct OCR via apply_ocr: {e}")
 ```
 
 ## Applying OCR Directly
@@ -105,6 +125,9 @@ print(f"Found {len(ocr_elements)} text elements via OCR")
 title = page.find('text:contains("Title")')
 content_region = title.below(height=300)
 region_ocr_elements = content_region.apply_ocr(engine='paddle', languages=['en'])
+
+# Note: Re-applying OCR to the same page or region will remove any
+# previously generated OCR elements for that area before adding the new ones.
 ```
 
 ## OCR Engines
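Reviewer note on the hunk above: the new comment says that re-running OCR replaces earlier OCR output for the same area. The sketch below is a toy model of that replace-in-area behaviour, not natural-pdf's implementation. `Elem`, `intersects`, and `reapply_ocr` are hypothetical names, and treating "that area" as bounding-box intersection is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


@dataclass(frozen=True)
class Elem:
    """Hypothetical element record; `source` is "native" or "ocr"."""
    text: str
    bbox: BBox
    source: str


def intersects(a: BBox, b: BBox) -> bool:
    """True when two boxes share a non-empty overlapping area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def reapply_ocr(existing: List[Elem], area: BBox, fresh: Iterable[Elem]) -> List[Elem]:
    """Drop earlier OCR elements that overlap `area`, then append the new ones.

    Native (non-OCR) elements and OCR elements outside the area are kept.
    """
    kept = [
        e for e in existing
        if not (e.source == "ocr" and intersects(e.bbox, area))
    ]
    return kept + list(fresh)


elements = [
    Elem("Title", (0, 0, 100, 20), "native"),
    Elem("stale guess", (0, 30, 100, 50), "ocr"),
    Elem("footer scan", (0, 700, 100, 720), "ocr"),
]
area = (0, 20, 100, 320)  # e.g. a 300-unit-tall region below the title
fresh = [Elem("better guess", (0, 30, 100, 50), "ocr")]

result = reapply_ocr(elements, area, fresh)
print([e.text for e in result])
```

Under this model, running OCR twice on the same region never accumulates duplicate text elements, which matches the outcome the note describes.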
@@ -190,15 +213,39 @@ page.correct_ocr(correct)
 # You're done!
 ```
 
-##
+## Interactive OCR Correction / Debugging
 
-
-from natural_pdf.utils.packaging import create_correction_task_package
+Natural PDF includes a utility to package a PDF and its detected elements, along with an interactive web application (SPA) for reviewing and correcting OCR results.
 
-
-
+1. **Package the data:**
+Use the `create_correction_task_package` function to create a zip file containing the necessary data for the SPA.
+
+```python
+from natural_pdf.utils.packaging import create_correction_task_package
+
+# Assuming 'pdf' is your loaded PDF object after running apply_ocr or apply_layout
+create_correction_task_package(pdf, "correction_package.zip", overwrite=True)
+```
+
+2. **Run the SPA:**
+The correction SPA is bundled with the library. You need to run a simple web server from the directory containing the SPA's files. The location of these files might depend on your installation, but you can typically find them within the installed `natural_pdf` package directory under `templates/spa`.
+
+*Example using Python's built-in server (run from your terminal):*
+
+```bash
+# Find the path to the installed natural_pdf package
+# (This command might vary depending on your environment)
+NATURAL_PDF_PATH=$(python -c "import site; print(site.getsitepackages()[0])")/natural_pdf
+
+# Navigate to the SPA directory
+cd $NATURAL_PDF_PATH/templates/spa
+
+# Start the web server (e.g., on port 8000)
+python -m http.server 8000
+```
 
-
+3. **Use the SPA:**
+Open your web browser to `http://localhost:8000`. The SPA should load, allowing you to drag and drop the `correction_package.zip` file you created into the application to view and edit the OCR results.
 
 ## Next Steps
 
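Reviewer note on the SPA instructions above: the added shell snippet takes the first entry of `site.getsitepackages()`, which may not be where the package actually lives, for example with a user-level or editable install. A possibly more robust way to locate the directory is to ask the import system directly. The sketch below assumes only the `templates/spa` sub-path stated in the diff; the helper name `find_package_subdir` is hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_subdir(package: str, *parts: str) -> Optional[Path]:
    """Return <package dir>/<parts...> if it exists, else None.

    Uses importlib to find where a top-level package is installed,
    without importing the package itself.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or a plain module rather than a package
    for location in spec.submodule_search_locations:
        candidate = Path(location).joinpath(*parts)
        if candidate.is_dir():
            return candidate
    return None


# 'templates/spa' is the sub-path named in the documentation above.
spa_dir = find_package_subdir("natural_pdf", "templates", "spa")
if spa_dir is None:
    print("natural_pdf (or its templates/spa directory) was not found")
else:
    # http.server accepts --directory on Python 3.7+, so no `cd` is needed.
    print(f'python -m http.server 8000 --directory "{spa_dir}"')
```

When the directory is found, the script prints a ready-to-run server command. Otherwise it reports that the package or directory is missing.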