natural-pdf 0.1.5__py3-none-any.whl → 0.1.6__py3-none-any.whl
- docs/ocr/index.md +34 -47
- docs/tutorials/01-loading-and-extraction.ipynb +60 -46
- docs/tutorials/02-finding-elements.ipynb +42 -42
- docs/tutorials/03-extracting-blocks.ipynb +17 -17
- docs/tutorials/04-table-extraction.ipynb +12 -12
- docs/tutorials/05-excluding-content.ipynb +30 -30
- docs/tutorials/06-document-qa.ipynb +28 -28
- docs/tutorials/07-layout-analysis.ipynb +63 -35
- docs/tutorials/07-working-with-regions.ipynb +55 -51
- docs/tutorials/07-working-with-regions.md +2 -2
- docs/tutorials/08-spatial-navigation.ipynb +60 -60
- docs/tutorials/09-section-extraction.ipynb +113 -113
- docs/tutorials/10-form-field-extraction.ipynb +78 -50
- docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
- docs/tutorials/12-ocr-integration.ipynb +149 -131
- docs/tutorials/12-ocr-integration.md +0 -13
- docs/tutorials/13-semantic-search.ipynb +313 -873
- natural_pdf/__init__.py +21 -23
- natural_pdf/analyzers/layout/gemini.py +264 -0
- natural_pdf/analyzers/layout/layout_manager.py +28 -1
- natural_pdf/analyzers/layout/layout_options.py +11 -0
- natural_pdf/analyzers/layout/yolo.py +6 -2
- natural_pdf/collections/pdf_collection.py +21 -0
- natural_pdf/core/element_manager.py +16 -13
- natural_pdf/core/page.py +165 -36
- natural_pdf/core/pdf.py +146 -41
- natural_pdf/elements/base.py +11 -17
- natural_pdf/elements/collections.py +100 -38
- natural_pdf/elements/region.py +77 -38
- natural_pdf/elements/text.py +5 -0
- natural_pdf/ocr/__init__.py +49 -36
- natural_pdf/ocr/engine.py +146 -51
- natural_pdf/ocr/engine_easyocr.py +141 -161
- natural_pdf/ocr/engine_paddle.py +107 -193
- natural_pdf/ocr/engine_surya.py +75 -148
- natural_pdf/ocr/ocr_factory.py +114 -0
- natural_pdf/ocr/ocr_manager.py +65 -93
- natural_pdf/ocr/ocr_options.py +7 -17
- natural_pdf/ocr/utils.py +98 -0
- natural_pdf/templates/spa/css/style.css +334 -0
- natural_pdf/templates/spa/index.html +31 -0
- natural_pdf/templates/spa/js/app.js +472 -0
- natural_pdf/templates/spa/words.txt +235976 -0
- natural_pdf/utils/debug.py +32 -0
- natural_pdf/utils/identifiers.py +29 -0
- natural_pdf/utils/packaging.py +418 -0
- {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/METADATA +41 -19
- {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/RECORD +51 -44
- {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/WHEEL +1 -1
- {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/top_level.txt +0 -1
- natural_pdf/templates/ocr_debug.html +0 -517
- tests/test_loading.py +0 -50
- tests/test_optional_deps.py +0 -298
- {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/licenses/LICENSE +0 -0
docs/ocr/index.md
CHANGED
@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
 ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
 ```
 
-## Multiple Languages
-
-OCR supports multiple languages:
-
-```python
-# Recognize English and Spanish text
-pdf = PDF('multilingual.pdf', ocr={
-    'enabled': True,
-    'languages': ['en', 'es']
-})
-
-# Multiple languages with PaddleOCR
-pdf = PDF('multilingual_document.pdf',
-          ocr_engine='paddleocr',
-          ocr={
-              'enabled': True,
-              'languages': ['zh', 'ja', 'ko', 'en']  # Chinese, Japanese, Korean, English
-          })
-```
-
 ## Applying OCR Directly
 
 The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
 high_conf.highlight(color="green", label="High Confidence OCR")
 ```
 
-## OCR
+## Detect + LLM OCR
+
+Sometimes you have a difficult piece of content where you need to use a local model to identify the content, then send it off in pieces to be identified by the LLM. You can do this with Natural PDF!
+
+```python
+from natural_pdf import PDF
+from natural_pdf.ocr.utils import direct_ocr_llm
+import openai
+
+pdf = PDF("needs-ocr.pdf")
+page = pdf.pages[0]
+
+# Detect
+page.apply_ocr('paddle', resolution=120, detect_only=True)
+
+# Build the framework
+client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
+prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
+punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
+The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
+
+# This returns the cleaned-up text
+def correct(region):
+    return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
+
+# Run 'correct' on each text element
+page.correct_ocr(correct)
+
+# You're done!
+```
 
-
+## Debugging OCR
 
 ```python
-
-pdf.debug_ocr("ocr_debug.html")
+from natural_pdf.utils.packaging import create_correction_task_package
 
-
-pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
+create_correction_task_package(pdf, "original.zip", overwrite=True)
 ```
 
-
-- The original image
-- Text found with confidence scores
-- Boxes around each detected word
-- Options to sort and filter results
-
-## OCR Parameter Tuning
-
-### Parameter Recommendation Table
-
-| Issue | Engine | Parameter | Recommended Value | Effect |
-|-------|--------|-----------|-------------------|--------|
-| Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
-| Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
-| Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
-| Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
-| Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
-| Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
-| Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
-| Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
-| Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
+This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
 
 ## Next Steps
 
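The documentation added in the hunk above describes `page.correct_ocr(correct)` only as running `correct` on each text element, where `correct` returns the cleaned-up text. The sketch below illustrates that callback pattern in isolation, without natural-pdf or an LLM. `FakeElement`, `FakePage`, the `source` filter, and the rule that a `None` return leaves the text unchanged are all assumptions made for illustration, not natural-pdf's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for a text element on a page."""
    text: str
    source: str = "ocr"


@dataclass
class FakePage:
    """Hypothetical stand-in for a page that holds text elements."""
    elements: List[FakeElement] = field(default_factory=list)

    def correct_ocr(self, callback: Callable[[FakeElement], Optional[str]]) -> int:
        """Call `callback` on every OCR element and store the text it returns.

        Assumption: a None return means "leave this element unchanged".
        Returns the number of elements whose text was changed.
        """
        changed = 0
        for element in self.elements:
            if element.source != "ocr":
                continue  # skip text that did not come from OCR
            new_text = callback(element)
            if new_text is not None and new_text != element.text:
                element.text = new_text
                changed += 1
        return changed


def correct(element: FakeElement) -> Optional[str]:
    """Toy corrector standing in for the LLM call: fix one known misread."""
    fixes = {"T0TAL": "TOTAL"}
    return fixes.get(element.text)


page = FakePage([
    FakeElement("T0TAL"),
    FakeElement("42"),
    FakeElement("Header", source="native"),
])
changed = page.correct_ocr(correct)
print(changed, [e.text for e in page.elements])
```

In the documented workflow the callback would instead wrap `direct_ocr_llm(...)`, so each detected region is sent to the model and its returned text replaces the element's text.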