natural-pdf 0.1.5__py3-none-any.whl → 0.1.7__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. docs/finetuning/index.md +176 -0
  2. docs/ocr/index.md +34 -47
  3. docs/tutorials/01-loading-and-extraction.ipynb +34 -1536
  4. docs/tutorials/02-finding-elements.ipynb +42 -42
  5. docs/tutorials/03-extracting-blocks.ipynb +17 -17
  6. docs/tutorials/04-table-extraction.ipynb +12 -12
  7. docs/tutorials/05-excluding-content.ipynb +30 -30
  8. docs/tutorials/06-document-qa.ipynb +28 -28
  9. docs/tutorials/07-layout-analysis.ipynb +63 -35
  10. docs/tutorials/07-working-with-regions.ipynb +55 -51
  11. docs/tutorials/07-working-with-regions.md +2 -2
  12. docs/tutorials/08-spatial-navigation.ipynb +60 -60
  13. docs/tutorials/09-section-extraction.ipynb +113 -113
  14. docs/tutorials/10-form-field-extraction.ipynb +78 -50
  15. docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
  16. docs/tutorials/12-ocr-integration.ipynb +149 -131
  17. docs/tutorials/12-ocr-integration.md +0 -13
  18. docs/tutorials/13-semantic-search.ipynb +313 -873
  19. natural_pdf/__init__.py +21 -22
  20. natural_pdf/analyzers/layout/gemini.py +280 -0
  21. natural_pdf/analyzers/layout/layout_manager.py +28 -1
  22. natural_pdf/analyzers/layout/layout_options.py +11 -0
  23. natural_pdf/analyzers/layout/yolo.py +6 -2
  24. natural_pdf/collections/pdf_collection.py +24 -0
  25. natural_pdf/core/element_manager.py +18 -13
  26. natural_pdf/core/page.py +174 -36
  27. natural_pdf/core/pdf.py +156 -42
  28. natural_pdf/elements/base.py +9 -17
  29. natural_pdf/elements/collections.py +99 -38
  30. natural_pdf/elements/region.py +77 -37
  31. natural_pdf/elements/text.py +5 -0
  32. natural_pdf/exporters/__init__.py +4 -0
  33. natural_pdf/exporters/base.py +61 -0
  34. natural_pdf/exporters/paddleocr.py +345 -0
  35. natural_pdf/ocr/__init__.py +57 -36
  36. natural_pdf/ocr/engine.py +160 -49
  37. natural_pdf/ocr/engine_easyocr.py +178 -157
  38. natural_pdf/ocr/engine_paddle.py +114 -189
  39. natural_pdf/ocr/engine_surya.py +87 -144
  40. natural_pdf/ocr/ocr_factory.py +125 -0
  41. natural_pdf/ocr/ocr_manager.py +65 -89
  42. natural_pdf/ocr/ocr_options.py +8 -13
  43. natural_pdf/ocr/utils.py +113 -0
  44. natural_pdf/templates/finetune/fine_tune_paddleocr.md +415 -0
  45. natural_pdf/templates/spa/css/style.css +334 -0
  46. natural_pdf/templates/spa/index.html +31 -0
  47. natural_pdf/templates/spa/js/app.js +472 -0
  48. natural_pdf/templates/spa/words.txt +235976 -0
  49. natural_pdf/utils/debug.py +34 -0
  50. natural_pdf/utils/identifiers.py +33 -0
  51. natural_pdf/utils/packaging.py +485 -0
  52. natural_pdf/utils/text_extraction.py +44 -64
  53. natural_pdf/utils/visualization.py +1 -1
  54. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.7.dist-info}/METADATA +44 -20
  55. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.7.dist-info}/RECORD +58 -47
  56. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.7.dist-info}/WHEEL +1 -1
  57. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.7.dist-info}/top_level.txt +0 -1
  58. natural_pdf/templates/ocr_debug.html +0 -517
  59. tests/test_loading.py +0 -50
  60. tests/test_optional_deps.py +0 -298
  61. {natural_pdf-0.1.5.dist-info → natural_pdf-0.1.7.dist-info}/licenses/LICENSE +0 -0
@@ -0,0 +1,176 @@
1
+ # OCR Fine-tuning
2
+
3
+ While the built-in OCR engines (EasyOCR, PaddleOCR, Surya) offer good general performance, you might encounter situations where their accuracy isn't sufficient for your specific needs. This is often the case with:
4
+
5
+ * **Unique Fonts:** Documents using unusual or stylized fonts.
6
+ * **Specific Languages:** Languages or scripts not perfectly covered by the default models.
7
+ * **Low Quality Scans:** Noisy or degraded document images.
8
+ * **Specialized Layouts:** Text within complex tables, forms, or unusual arrangements.
9
+
10
+ Fine-tuning allows you to adapt a pre-trained OCR recognition model to your specific data, significantly improving its accuracy on documents similar to those used for training.
11
+
12
+ ## Why Fine-tune?
13
+
14
+ - **Higher Accuracy:** Achieve better text extraction results on your specific document types.
15
+ - **Adaptability:** Train the model to recognize domain-specific terms, symbols, or layouts.
16
+ - **Reduced Errors:** Minimize downstream errors in data extraction and processing pipelines.
17
+
18
+ ## Strategy: Detect + LLM Correct + Export
19
+
20
+ Training an OCR model requires accurate ground truth: images of text snippets paired with their correct transcriptions. Manually creating this data is tedious. A powerful alternative leverages the strengths of different models:
21
+
22
+ 1. **Detect Text Regions:** Use a robust local OCR engine (like Surya or PaddleOCR) primarily for its *detection* capabilities (`detect_only=True`). This identifies the *locations* of text on the page, even if the initial *recognition* isn't perfect. You can combine this with layout analysis or region selections (`.region()`, `.below()`, `.add_exclusion()`) to focus on the specific areas you care about.
23
+ 2. **Correct with LLM:** For each detected text region, send the image snippet to a powerful Large Language Model (LLM) with multimodal capabilities (like GPT-4o, Claude 3.5 Sonnet/Haiku) using the `direct_ocr_llm` utility. The LLM performs high-accuracy OCR on the snippet, providing a "ground truth" transcription.
24
+ 3. **Export for Fine-tuning:** Use the `PaddleOCRRecognitionExporter` to package the original image snippets (from step 1) along with their corresponding LLM-generated text labels (from step 2) into the specific format required by PaddleOCR for fine-tuning its *recognition* model.
25
+
26
+ This approach combines the efficient spatial detection of local models with the superior text recognition of large generative models to create a high-quality fine-tuning dataset with minimal manual effort.
27
+
28
+ ## Example: Fine-tuning for Greek Spreadsheet Text
29
+
30
+ Let's walk through an example of preparing data to fine-tune PaddleOCR for text from a scanned Greek spreadsheet, adapting the process described above.
31
+
32
+ ```python
33
+ # --- 1. Setup and Load PDF ---
34
+ from natural_pdf import PDF
35
+ from natural_pdf.ocr.utils import direct_ocr_llm
36
+ from natural_pdf.exporters import PaddleOCRRecognitionExporter
37
+ import openai # Or your preferred LLM client library
38
+ import os
39
+
40
+ # Ensure your LLM API key is set (using environment variables is recommended)
41
+ # os.environ["OPENAI_API_KEY"] = "sk-..."
42
+ # os.environ["ANTHROPIC_API_KEY"] = "sk-..."
43
+
44
+ # pdf_path = "path/to/your/document.pdf"
45
+ pdf_path = "pdfs/hidden/the-bad-one.pdf" # Replace with your PDF path
46
+ pdf = PDF(pdf_path)
47
+
48
+ # --- 2. (Optional) Exclude Irrelevant Areas ---
49
+ # If the document has consistent headers, footers, or margins you want to ignore
50
+ # Use exclusions *before* detection
51
+ pdf.add_exclusion(lambda page: page.region(right=45)) # Exclude left margin/line numbers
52
+ pdf.add_exclusion(lambda page: page.region(left=500)) # Exclude right margin
53
+
54
+ # --- 3. Detect Text Regions ---
55
+ # Use a good detection engine. Surya is often robust for line detection.
56
+ # We only want the bounding boxes, not the initial (potentially inaccurate) OCR text.
57
+ print("Detecting text regions...")
58
+ # Process only a subset of pages for demonstration if needed
59
+ for page in pdf.pages[:10]:
60
+     # Use a moderate resolution for detection; higher res used for LLM correction later
61
+     page.apply_ocr(engine='surya', resolution=120, detect_only=True)
62
+ print("Detection complete for the first 10 pages.")
63
+
64
+ # (Optional) Visualize detected boxes on a sample page
65
+ # pdf.pages[9].find_all('text[source=ocr]').show()
66
+
67
+ # --- 4. Correct with LLM ---
68
+ # Configure your LLM client (example using OpenAI client, adaptable for others)
69
+ # For Anthropic: client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key=os.environ.get("ANTHROPIC_API_KEY"))
70
+ client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
71
+
72
+ # Craft a clear prompt for the LLM
73
+ # Be as specific as possible! If it's in a specific language, what kinds
74
+ # of characters, etc.
75
+ prompt = """OCR this image patch. Return only the exact text content visible in the image.
76
+ Preserve original spelling, capitalization, punctuation, and symbols.
77
+ Do not add any explanatory text, translations, comments, or quotation marks around the result.
78
+ The text is likely from a Greek document, potentially a spreadsheet, containing Modern Greek words or numbers."""
79
+
80
+ # Define the correction function using direct_ocr_llm
81
+ def correct_text_region(region):
82
+     # Use a high resolution for the LLM call for best accuracy
83
+     return direct_ocr_llm(
84
+         region,
85
+         client,
86
+         prompt=prompt,
87
+         resolution=300,
88
+         # model="claude-3-5-sonnet-20240620" # Example Anthropic model
89
+         model="gpt-4o-mini" # Example OpenAI model
90
+     )
91
+
92
+ # Apply the correction function to the detected text regions
93
+ print("Applying LLM correction to detected regions...")
94
+ for page in pdf.pages[:10]:
95
+     # This finds elements added by apply_ocr and passes their regions to 'correct_text_region'
96
+     # The returned text from the LLM replaces the original OCR text for these elements
97
+     # The source attribute is updated (e.g., to 'ocr-llm-corrected')
98
+     page.correct_ocr(correct_text_region)
99
+ print("LLM correction complete.")
100
+
101
+ # --- 5. Export for PaddleOCR Fine-tuning ---
102
+ print("Configuring exporter...")
103
+ exporter = PaddleOCRRecognitionExporter(
104
+     # Select all of the non-blank OCR text
105
+     # Hopefully it's all been LLM-corrected!
106
+     selector="text[source^=ocr][text!='']",
107
+     resolution=300, # Resolution for the exported image crops
108
+     padding=2, # Add slight padding around text boxes
109
+     split_ratio=0.9, # 90% for training, 10% for validation
110
+     random_seed=42, # For reproducible train/val split
111
+     include_guide=True # Include the Colab fine-tuning notebook
112
+ )
113
+
114
+ # Define the output directory
115
+ output_directory = "./my_paddleocr_finetune_data"
116
+ print(f"Exporting data to {output_directory}...")
117
+
118
+ # Run the export process
119
+ exporter.export(pdf, output_directory)
120
+
121
+ print("Export complete.")
122
+ print(f"Dataset ready for fine-tuning in: {output_directory}")
123
+ print(f"Next step: Upload '{os.path.join(output_directory, 'fine_tune_paddleocr.ipynb')}' and the rest of the contents to Google Colab.")
124
+
125
+ # --- Cleanup ---
126
+ pdf.close()
127
+ ```
128
+
129
+ ## Running the Fine-tuning
130
+
131
+ The `PaddleOCRRecognitionExporter` automatically includes a Jupyter Notebook (`fine_tune_paddleocr.ipynb`) in the output directory. This notebook is pre-configured to guide you through the fine-tuning process on Google Colab (which offers free GPU access):
132
+
133
+ 1. **Upload:** Upload the entire output directory (e.g., `my_paddleocr_finetune_data`) to your Google Drive or directly to your Colab instance.
134
+ 2. **Open Notebook:** Open the `fine_tune_paddleocr.ipynb` notebook in Google Colab.
135
+ 3. **Set Runtime:** Ensure the Colab runtime is set to use a GPU (Runtime -> Change runtime type -> GPU).
136
+ 4. **Run Cells:** Execute the cells in the notebook sequentially. It will:
137
+     * Install necessary libraries (PaddlePaddle, PaddleOCR).
138
+     * Point the training configuration to your uploaded dataset (`images/`, `train.txt`, `val.txt`, `dict.txt`).
139
+     * Download a pre-trained PaddleOCR model (usually a multilingual one).
140
+     * Start the fine-tuning process using your data.
141
+     * Save the fine-tuned model checkpoints.
142
+     * Export the best model into an "inference format" suitable for use with `natural-pdf`.
143
+ 5. **Download Model:** Download the resulting `inference_model` directory from Colab.
144
+
145
+ ## Using the Fine-tuned Model
146
+
147
+ Once you have the `inference_model` directory, you can instruct `natural-pdf` to use it for OCR:
148
+
149
+ ```python
150
+ from natural_pdf import PDF
151
+ from natural_pdf.ocr import PaddleOCROptions
152
+ import os
153
+ # Path to the directory you downloaded from Colab
154
+ finetuned_model_dir = "/path/to/your/downloaded/inference_model"
155
+
156
+ # Specify the path in PaddleOCROptions
157
+ paddle_opts = PaddleOCROptions(
158
+     rec_model_dir=finetuned_model_dir,
159
+     rec_char_dict_path=os.path.join(finetuned_model_dir, 'your_dict.txt'), # Or wherever your dict is
160
+     use_gpu=True # If using GPU locally
161
+ )
162
+
163
+ pdf = PDF("another-similar-document.pdf")
164
+ page = pdf.pages[0]
165
+
166
+ # Apply OCR using your fine-tuned model
167
+ ocr_elements = page.apply_ocr(engine='paddle', options=paddle_opts)
168
+
169
+ # Extract text using the improved results
170
+ text = page.extract_text()
171
+ print(text)
172
+
173
+ pdf.close()
174
+ ```
175
+
176
+ By following this process, you can significantly enhance OCR performance on your specific documents using the power of fine-tuning.
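The `split_ratio` and `random_seed` options in the export step above describe a seeded train/validation split over image/label pairs. The following is a minimal sketch of that idea in plain Python, not the exporter's actual implementation. The tab-separated `path<TAB>text` label layout and the one-character-per-line dictionary are assumptions based on the usual PaddleOCR recognition conventions, and the helper names (`split_labels`, `to_label_lines`, `build_char_dict`) and sample data are hypothetical.

```python
import random


def split_labels(samples, split_ratio=0.9, random_seed=42):
    """Shuffle (image_path, text) pairs reproducibly and split into train/val."""
    rng = random.Random(random_seed)  # local RNG, so global random state is untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]


def to_label_lines(pairs):
    """Format pairs as tab-separated 'path<TAB>text' lines (assumed label layout)."""
    return [f"{path}\t{text}" for path, text in pairs]


def build_char_dict(samples):
    """Collect the unique characters across all labels, sorted for stable output."""
    return sorted({ch for _, text in samples for ch in text})


# Hypothetical sample data: image crop paths paired with corrected transcriptions
samples = [(f"images/crop_{i:03d}.png", f"κείμενο {i}") for i in range(20)]

train, val = split_labels(samples)
train_lines = to_label_lines(train)
val_lines = to_label_lines(val)
char_dict = build_char_dict(samples)
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible across runs without affecting any other code that relies on the global random state.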
docs/ocr/index.md CHANGED
@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
92
92
  ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
93
93
  ```
94
94
 
95
- ## Multiple Languages
96
-
97
- OCR supports multiple languages:
98
-
99
- ```python
100
- # Recognize English and Spanish text
101
- pdf = PDF('multilingual.pdf', ocr={
102
- 'enabled': True,
103
- 'languages': ['en', 'es']
104
- })
105
-
106
- # Multiple languages with PaddleOCR
107
- pdf = PDF('multilingual_document.pdf',
108
- ocr_engine='paddleocr',
109
- ocr={
110
- 'enabled': True,
111
- 'languages': ['zh', 'ja', 'ko', 'en'] # Chinese, Japanese, Korean, English
112
- })
113
- ```
114
-
115
95
  ## Applying OCR Directly
116
96
 
117
97
  The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
179
159
  high_conf.highlight(color="green", label="High Confidence OCR")
180
160
  ```
181
161
 
182
- ## OCR Debugging
162
+ ## Detect + LLM OCR
163
+
164
+ Sometimes a page is difficult enough that you want a local model to detect where the text is, then send each detected piece to an LLM to be read. You can do this with Natural PDF!
165
+
166
+ ```python
167
+ from natural_pdf import PDF
168
+ from natural_pdf.ocr.utils import direct_ocr_llm
169
+ import openai
170
+
171
+ pdf = PDF("needs-ocr.pdf")
172
+ page = pdf.pages[0]
173
+
174
+ # Detect
175
+ page.apply_ocr('paddle', resolution=120, detect_only=True)
176
+
177
+ # Build the framework
178
+ client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
179
+ prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
180
+ punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
181
+ The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
182
+
183
+ # This returns the cleaned-up text
184
+ def correct(region):
185
+     return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
186
+
187
+ # Run 'correct' on each text element
188
+ page.correct_ocr(correct)
189
+
190
+ # You're done!
191
+ ```
183
192
 
184
- For troubleshooting OCR problems:
193
+ ## Debugging OCR
185
194
 
186
195
  ```python
187
- # Create an interactive HTML debug report
188
- pdf.debug_ocr("ocr_debug.html")
196
+ from natural_pdf.utils.packaging import create_correction_task_package
189
197
 
190
- # Specify which pages to include
191
- pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
198
+ create_correction_task_package(pdf, "original.zip", overwrite=True)
192
199
  ```
193
200
 
194
- The debug report shows:
195
- - The original image
196
- - Text found with confidence scores
197
- - Boxes around each detected word
198
- - Options to sort and filter results
199
-
200
- ## OCR Parameter Tuning
201
-
202
- ### Parameter Recommendation Table
203
-
204
- | Issue | Engine | Parameter | Recommended Value | Effect |
205
- |-------|--------|-----------|-------------------|--------|
206
- | Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
207
- | Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
208
- | Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
209
- | Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
210
- | Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
211
- | Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
212
- | Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
213
- | Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
214
- | Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
201
+ This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
215
202
 
216
203
  ## Next Steps
217
204