natural-pdf 0.1.5__py3-none-any.whl → 0.1.6__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. docs/ocr/index.md +34 -47
  2. docs/tutorials/01-loading-and-extraction.ipynb +60 -46
  3. docs/tutorials/02-finding-elements.ipynb +42 -42
  4. docs/tutorials/03-extracting-blocks.ipynb +17 -17
  5. docs/tutorials/04-table-extraction.ipynb +12 -12
  6. docs/tutorials/05-excluding-content.ipynb +30 -30
  7. docs/tutorials/06-document-qa.ipynb +28 -28
  8. docs/tutorials/07-layout-analysis.ipynb +63 -35
  9. docs/tutorials/07-working-with-regions.ipynb +55 -51
  10. docs/tutorials/07-working-with-regions.md +2 -2
  11. docs/tutorials/08-spatial-navigation.ipynb +60 -60
  12. docs/tutorials/09-section-extraction.ipynb +113 -113
  13. docs/tutorials/10-form-field-extraction.ipynb +78 -50
  14. docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
  15. docs/tutorials/12-ocr-integration.ipynb +149 -131
  16. docs/tutorials/12-ocr-integration.md +0 -13
  17. docs/tutorials/13-semantic-search.ipynb +313 -873
  18. natural_pdf/__init__.py +21 -23
  19. natural_pdf/analyzers/layout/gemini.py +264 -0
  20. natural_pdf/analyzers/layout/layout_manager.py +28 -1
  21. natural_pdf/analyzers/layout/layout_options.py +11 -0
  22. natural_pdf/analyzers/layout/yolo.py +6 -2
  23. natural_pdf/collections/pdf_collection.py +21 -0
  24. natural_pdf/core/element_manager.py +16 -13
  25. natural_pdf/core/page.py +165 -36
  26. natural_pdf/core/pdf.py +146 -41
  27. natural_pdf/elements/base.py +11 -17
  28. natural_pdf/elements/collections.py +100 -38
  29. natural_pdf/elements/region.py +77 -38
  30. natural_pdf/elements/text.py +5 -0
  31. natural_pdf/ocr/__init__.py +49 -36
  32. natural_pdf/ocr/engine.py +146 -51
  33. natural_pdf/ocr/engine_easyocr.py +141 -161
  34. natural_pdf/ocr/engine_paddle.py +107 -193
  35. natural_pdf/ocr/engine_surya.py +75 -148
  36. natural_pdf/ocr/ocr_factory.py +114 -0
  37. natural_pdf/ocr/ocr_manager.py +65 -93
  38. natural_pdf/ocr/ocr_options.py +7 -17
  39. natural_pdf/ocr/utils.py +98 -0
  40. natural_pdf/templates/spa/css/style.css +334 -0
  41. natural_pdf/templates/spa/index.html +31 -0
  42. natural_pdf/templates/spa/js/app.js +472 -0
  43. natural_pdf/templates/spa/words.txt +235976 -0
  44. natural_pdf/utils/debug.py +32 -0
  45. natural_pdf/utils/identifiers.py +29 -0
  46. natural_pdf/utils/packaging.py +418 -0
  47. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/METADATA +41 -19
  48. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/RECORD +51 -44
  49. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/WHEEL +1 -1
  50. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/top_level.txt +0 -1
  51. natural_pdf/templates/ocr_debug.html +0 -517
  52. tests/test_loading.py +0 -50
  53. tests/test_optional_deps.py +0 -298
  54. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.6.dist-info}/licenses/LICENSE +0 -0
docs/ocr/index.md CHANGED
@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
  ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
  ```
 
- ## Multiple Languages
-
- OCR supports multiple languages:
-
- ```python
- # Recognize English and Spanish text
- pdf = PDF('multilingual.pdf', ocr={
-     'enabled': True,
-     'languages': ['en', 'es']
- })
-
- # Multiple languages with PaddleOCR
- pdf = PDF('multilingual_document.pdf',
-     ocr_engine='paddleocr',
-     ocr={
-         'enabled': True,
-         'languages': ['zh', 'ja', 'ko', 'en'] # Chinese, Japanese, Korean, English
-     })
- ```
-
  ## Applying OCR Directly
 
  The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
  high_conf.highlight(color="green", label="High Confidence OCR")
  ```
 
- ## OCR Debugging
+ ## Detect + LLM OCR
+
+ Sometimes you have a difficult piece of content where you need to use a local model to identify the content, then send it off in pieces to be identified by the LLM. You can do this with Natural PDF!
+
+ ```python
+ from natural_pdf import PDF
+ from natural_pdf.ocr.utils import direct_ocr_llm
+ import openai
+
+ pdf = PDF("needs-ocr.pdf")
+ page = pdf.pages[0]
+
+ # Detect
+ page.apply_ocr('paddle', resolution=120, detect_only=True)
+
+ # Build the framework
+ client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
+ prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
+ punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
+ The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
+
+ # This returns the cleaned-up text
+ def correct(region):
+     return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
+
+ # Run 'correct' on each text element
+ page.correct_ocr(correct)
+
+ # You're done!
+ ```
 
- For troubleshooting OCR problems:
+ ## Debugging OCR
 
  ```python
- # Create an interactive HTML debug report
- pdf.debug_ocr("ocr_debug.html")
+ from natural_pdf.utils.packaging import create_correction_task_package
 
- # Specify which pages to include
- pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
+ create_correction_task_package(pdf, "original.zip", overwrite=True)
  ```
 
- The debug report shows:
- - The original image
- - Text found with confidence scores
- - Boxes around each detected word
- - Options to sort and filter results
-
- ## OCR Parameter Tuning
-
- ### Parameter Recommendation Table
-
- | Issue | Engine | Parameter | Recommended Value | Effect |
- |-------|--------|-----------|-------------------|--------|
- | Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
- | Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
- | Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
- | Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
- | Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
- | Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
- | Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
- | Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
- | Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
+ This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
 
  ## Next Steps