PyPI - natural-pdf - Versions diffs - 0.1.5__py3-none-any.whl → 0.1.6__py3-none-any.whl - Mend

Files changed (54)

docs/ocr/index.md +34 -47
docs/tutorials/01-loading-and-extraction.ipynb +60 -46
docs/tutorials/02-finding-elements.ipynb +42 -42
docs/tutorials/03-extracting-blocks.ipynb +17 -17
docs/tutorials/04-table-extraction.ipynb +12 -12
docs/tutorials/05-excluding-content.ipynb +30 -30
docs/tutorials/06-document-qa.ipynb +28 -28
docs/tutorials/07-layout-analysis.ipynb +63 -35
docs/tutorials/07-working-with-regions.ipynb +55 -51
docs/tutorials/07-working-with-regions.md +2 -2
docs/tutorials/08-spatial-navigation.ipynb +60 -60
docs/tutorials/09-section-extraction.ipynb +113 -113
docs/tutorials/10-form-field-extraction.ipynb +78 -50
docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
docs/tutorials/12-ocr-integration.ipynb +149 -131
docs/tutorials/12-ocr-integration.md +0 -13
docs/tutorials/13-semantic-search.ipynb +313 -873
natural_pdf/__init__.py +21 -23
natural_pdf/analyzers/layout/gemini.py +264 -0
natural_pdf/analyzers/layout/layout_manager.py +28 -1
natural_pdf/analyzers/layout/layout_options.py +11 -0
natural_pdf/analyzers/layout/yolo.py +6 -2
natural_pdf/collections/pdf_collection.py +21 -0
natural_pdf/core/element_manager.py +16 -13
natural_pdf/core/page.py +165 -36
natural_pdf/core/pdf.py +146 -41
natural_pdf/elements/base.py +11 -17
natural_pdf/elements/collections.py +100 -38
natural_pdf/elements/region.py +77 -38
natural_pdf/elements/text.py +5 -0
natural_pdf/ocr/__init__.py +49 -36
natural_pdf/ocr/engine.py +146 -51
natural_pdf/ocr/engine_easyocr.py +141 -161
natural_pdf/ocr/engine_paddle.py +107 -193
natural_pdf/ocr/engine_surya.py +75 -148
natural_pdf/ocr/ocr_factory.py +114 -0
natural_pdf/ocr/ocr_manager.py +65 -93
natural_pdf/ocr/ocr_options.py +7 -17
natural_pdf/ocr/utils.py +98 -0
natural_pdf/templates/spa/css/style.css +334 -0
natural_pdf/templates/spa/index.html +31 -0
natural_pdf/templates/spa/js/app.js +472 -0
natural_pdf/templates/spa/words.txt +235976 -0
natural_pdf/utils/debug.py +32 -0
natural_pdf/utils/identifiers.py +29 -0
natural_pdf/utils/packaging.py +418 -0
{natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/METADATA +41 -19
{natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/RECORD +51 -44
{natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/top_level.txt +0 -1
natural_pdf/templates/ocr_debug.html +0 -517
tests/test_loading.py +0 -50
tests/test_optional_deps.py +0 -298
{natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/licenses/LICENSE +0 -0

docs/ocr/index.md CHANGED

@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
 ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
 ```
-## Multiple Languages
-OCR supports multiple languages:
-```python
-# Recognize English and Spanish text
-pdf = PDF('multilingual.pdf', ocr={
-    'enabled': True,
-    'languages': ['en', 'es']
-})
-# Multiple languages with PaddleOCR
-pdf = PDF('multilingual_document.pdf',
-          ocr_engine='paddleocr',
-          ocr={
-              'enabled': True,
-              'languages': ['zh', 'ja', 'ko', 'en']  # Chinese, Japanese, Korean, English
-          })
-```
 ## Applying OCR Directly
 The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
 high_conf.highlight(color="green", label="High Confidence OCR")
 ```
-## OCR Debugging
+## Detect + LLM OCR
+Sometimes you have a difficult piece of content where you need to use a local model to identify the content, then send it off in pieces to be identified by the LLM. You can do this with Natural PDF!
+```python
+from natural_pdf import PDF
+from natural_pdf.ocr.utils import direct_ocr_llm
+import openai
+pdf = PDF("needs-ocr.pdf")
+page = pdf.pages[0]
+# Detect
+page.apply_ocr('paddle', resolution=120, detect_only=True)
+# Build the framework
+client = openai.OpenAI(base_url="https://api.anthropic.com/v1/",  api_key='sk-XXXXX')
+prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
+punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
+The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
+# This returns the cleaned-up text
+def correct(region):
+    return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
+# Run 'correct' on each text element
+page.correct_ocr(correct)
+# You're done!
+```
-For troubleshooting OCR problems:
+## Debugging OCR
 ```python
-# Create an interactive HTML debug report
-pdf.debug_ocr("ocr_debug.html")
+from natural_pdf.utils.packaging import create_correction_task_package
-# Specify which pages to include
-pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
+create_correction_task_package(pdf, "original.zip", overwrite=True)
 ```
-The debug report shows:
-- The original image
-- Text found with confidence scores
-- Boxes around each detected word
-- Options to sort and filter results
-## OCR Parameter Tuning
-### Parameter Recommendation Table
-| Issue | Engine | Parameter | Recommended Value | Effect |
-|-------|--------|-----------|-------------------|--------|
-| Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
-| Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
-| Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
-| Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
-| Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
-| Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
-| Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
-| Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
-| Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
+This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
 ## Next Steps
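
The "Detect + LLM OCR" section added in this hunk describes a two-stage flow: a local engine only detects text boxes, then a callback supplies the text for each box. The sketch below mimics that control flow in plain Python so it runs without natural-pdf, a PDF, or an API key. `TextBox`, `detect_only`, `correct_all`, and `fake_llm_ocr` are hypothetical stand-ins, not natural-pdf APIs, and the way `page.correct_ocr` uses its callback's return value (beyond "returns the cleaned-up text") is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TextBox:
    """Hypothetical stand-in for a detected text element; not a natural_pdf class."""
    bbox: Tuple[int, int, int, int]  # (x0, top, x1, bottom) of the detected box
    text: str = ""                   # left empty by a detect-only pass


def detect_only(boxes: List[Tuple[int, int, int, int]]) -> List[TextBox]:
    """Stage 1 stand-in: produce boxes with no recognized text."""
    return [TextBox(bbox=b) for b in boxes]


def correct_all(elements: List[TextBox],
                correct: Callable[[TextBox], str]) -> List[TextBox]:
    """Stage 2 stand-in: call `correct` once per element and store what it returns."""
    for element in elements:
        element.text = correct(element)
    return elements


def fake_llm_ocr(region: TextBox) -> str:
    """Placeholder for the LLM call; a real callback would send a crop of the region to a model."""
    return f"text@{region.bbox[0]},{region.bbox[1]}"


elements = correct_all(detect_only([(10, 20, 110, 40), (10, 50, 90, 70)]), fake_llm_ocr)
texts = [e.text for e in elements]
```

In the documented flow, the callback would wrap `direct_ocr_llm(...)` instead of `fake_llm_ocr`. Reading the API key from an environment variable rather than hard-coding it, as the example in the hunk does, would be a reasonable precaution.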
