natural-pdf 0.1.7__py3-none-any.whl → 0.1.8__py3-none-any.whl

Files changed (52)
  1. docs/categorizing-documents/index.md +168 -0
  2. docs/data-extraction/index.md +87 -0
  3. docs/element-selection/index.ipynb +218 -164
  4. docs/element-selection/index.md +20 -0
  5. docs/index.md +19 -0
  6. docs/ocr/index.md +63 -16
  7. docs/tutorials/01-loading-and-extraction.ipynb +1713 -34
  8. docs/tutorials/02-finding-elements.ipynb +123 -46
  9. docs/tutorials/03-extracting-blocks.ipynb +24 -19
  10. docs/tutorials/04-table-extraction.ipynb +17 -12
  11. docs/tutorials/05-excluding-content.ipynb +37 -32
  12. docs/tutorials/06-document-qa.ipynb +36 -31
  13. docs/tutorials/07-layout-analysis.ipynb +45 -40
  14. docs/tutorials/07-working-with-regions.ipynb +61 -60
  15. docs/tutorials/08-spatial-navigation.ipynb +76 -71
  16. docs/tutorials/09-section-extraction.ipynb +160 -155
  17. docs/tutorials/10-form-field-extraction.ipynb +71 -66
  18. docs/tutorials/11-enhanced-table-processing.ipynb +11 -6
  19. docs/tutorials/12-ocr-integration.ipynb +3420 -312
  20. docs/tutorials/12-ocr-integration.md +68 -106
  21. docs/tutorials/13-semantic-search.ipynb +641 -251
  22. natural_pdf/__init__.py +2 -0
  23. natural_pdf/classification/manager.py +343 -0
  24. natural_pdf/classification/mixin.py +149 -0
  25. natural_pdf/classification/results.py +62 -0
  26. natural_pdf/collections/mixins.py +63 -0
  27. natural_pdf/collections/pdf_collection.py +321 -15
  28. natural_pdf/core/element_manager.py +67 -0
  29. natural_pdf/core/page.py +227 -64
  30. natural_pdf/core/pdf.py +387 -378
  31. natural_pdf/elements/collections.py +272 -41
  32. natural_pdf/elements/region.py +99 -15
  33. natural_pdf/elements/text.py +5 -2
  34. natural_pdf/exporters/paddleocr.py +1 -1
  35. natural_pdf/extraction/manager.py +134 -0
  36. natural_pdf/extraction/mixin.py +246 -0
  37. natural_pdf/extraction/result.py +37 -0
  38. natural_pdf/ocr/engine_easyocr.py +6 -3
  39. natural_pdf/ocr/ocr_manager.py +85 -25
  40. natural_pdf/ocr/ocr_options.py +33 -10
  41. natural_pdf/ocr/utils.py +14 -3
  42. natural_pdf/qa/document_qa.py +0 -4
  43. natural_pdf/selectors/parser.py +363 -238
  44. natural_pdf/templates/finetune/fine_tune_paddleocr.md +10 -5
  45. natural_pdf/utils/locks.py +8 -0
  46. natural_pdf/utils/text_extraction.py +52 -1
  47. natural_pdf/utils/tqdm_utils.py +43 -0
  48. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/METADATA +6 -1
  49. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/RECORD +52 -41
  50. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/WHEEL +1 -1
  51. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/licenses/LICENSE +0 -0
  52. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/top_level.txt +0 -0
docs/element-selection/index.md CHANGED
@@ -141,6 +141,26 @@ page.find_all('text:bold').show()
  page.find_all('text[size>=11]:bold')
  ```
 
+ ### Negation Pseudo-class (`:not()`)
+
+ You can exclude elements that match a certain selector using the `:not()` pseudo-class. It takes another simple selector as its argument.
+
+ ```python
+ # Find all text elements that are NOT bold
+ non_bold_text = page.find_all('text:not(:bold)')
+
+ # Find all elements that are NOT regions of type 'table'
+ not_tables = page.find_all(':not(region[type=table])')
+
+ # Find text elements that do not contain "Total" (case-insensitive)
+ relevant_text = page.find_all('text:not(:contains("Total"))', case=False)
+
+ # Find text elements that are not empty
+ non_empty_text = page.find_all('text:not(:empty)')
+ ```
+
+ **Note:** The selector inside `:not()` follows the same rules as regular selectors but currently does not support combinators (like `>`, `+`, `~`, or descendant space) within `:not()`. You can nest basic type, attribute, and other pseudo-class selectors.
+
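Reviewer sketch: the `:not()` examples in this hunk all negate a per-element test and combine it with the rest of the selector. The snippet below illustrates that semantics on plain dictionaries. It is not the parser added in `natural_pdf/selectors/parser.py`; every name in it (`is_type`, `negate`, `all_of`, `select`, the element dictionaries) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Element = Dict[str, object]          # hypothetical stand-in for a page element
Predicate = Callable[[Element], bool]


def is_type(name: str) -> Predicate:
    """Match elements whose 'type' field equals name (like 'text' or 'region')."""
    return lambda el: el.get("type") == name


def attr_equals(key: str, value: object) -> Predicate:
    """Match elements with el[key] == value (like '[type=table]')."""
    return lambda el: el.get(key) == value


def is_bold(el: Element) -> bool:
    """Like ':bold'."""
    return bool(el.get("bold"))


def is_empty(el: Element) -> bool:
    """Like ':empty' for text elements."""
    return el.get("text", "") == ""


def contains(needle: str, case: bool = True) -> Predicate:
    """Like ':contains("...")', optionally case-insensitive."""
    def pred(el: Element) -> bool:
        haystack = str(el.get("text", ""))
        if case:
            return needle in haystack
        return needle.lower() in haystack.lower()
    return pred


def negate(pred: Predicate) -> Predicate:
    """Like ':not(...)': invert a single-element test."""
    return lambda el: not pred(el)


def all_of(*preds: Predicate) -> Predicate:
    """Combine the parts of one simple selector with logical AND."""
    return lambda el: all(p(el) for p in preds)


def select(elements: List[Element], pred: Predicate) -> List[Element]:
    return [el for el in elements if pred(el)]


elements: List[Element] = [
    {"type": "text", "text": "Total: 42", "bold": True},
    {"type": "text", "text": "Invoice", "bold": False},
    {"type": "text", "text": "", "bold": False},
    {"type": "region", "region_type": "table"},
]

# 'text:not(:bold)'
non_bold = select(elements, all_of(is_type("text"), negate(is_bold)))
# ':not(region[type=table])'
not_tables = select(
    elements, negate(all_of(is_type("region"), attr_equals("region_type", "table")))
)
# 'text:not(:contains("Total"))' with case=False
no_total = select(
    elements, all_of(is_type("text"), negate(contains("total", case=False)))
)
# 'text:not(:empty)'
non_empty = select(elements, all_of(is_type("text"), negate(is_empty)))
```

Because `negate` only sees one element at a time, this sketch has no way to express relationships between elements, which is the kind of selector (combinators) the note says `:not()` does not yet accept.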
  ### Spatial Pseudo-Classes Examples
 
  ```python
docs/index.md CHANGED
@@ -132,6 +132,25 @@ if result.get("found", False):
 
  [Learn about Document QA →](document-qa/index.ipynb)
 
+ ### Classify Pages and Regions
+
+ Categorize pages or specific regions based on their content using text or vision models.
+
+ **Note:** Requires `pip install "natural-pdf[classification]"`
+
+ ```python
+ # Classify a page based on text
+ categories = ["invoice", "scientific article", "presentation"]
+ page.classify(categories=categories, model="text")
+ print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+
+
+ # Classify a page based on what it looks like
+ categories = ["invoice", "scientific article", "presentation"]
+ page.classify(categories=categories, model="vision")
+ print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+ ```
+
  ### Visualize Your Work
 
  Debug and understand your extractions visually.
docs/ocr/index.md CHANGED
@@ -6,16 +6,16 @@ Natural PDF includes OCR (Optical Character Recognition) to extract text from sc
 
  Natural PDF supports multiple OCR engines:
 
- | Feature | EasyOCR | PaddleOCR | Surya OCR |
- |----------------------|------------------------------------|------------------------------------------|---------------------------------------|
- | **Installation** | `natural-pdf[easyocr]` | `natural-pdf[paddle]` | `natural-pdf[surya]` |
- | **Primary Strength** | Good general performance, simpler | Excellent Asian language, speed | High accuracy, multilingual lines |
- | **Speed** | Moderate | Fast | Moderate (GPU recommended) |
- | **Memory Usage** | Higher | Efficient | Higher (GPU recommended) |
- | **Paragraph Detect** | Yes (via option) | No | No (focuses on lines) |
- | **Handwritten** | Better support | Limited | Limited |
- | **Small Text** | Moderate | Good | Good |
- | **When to Use** | General documents, handwritten text| Asian languages, speed-critical tasks | Highest accuracy needed, line-level |
+ | Feature | EasyOCR | PaddleOCR | Surya OCR | Gemini (Layout + potential OCR) |
+ |----------------------|------------------------------------|------------------------------------------|---------------------------------------|--------------------------------------|
+ | **Installation** | `natural-pdf[easyocr]` | `natural-pdf[paddle]` | `natural-pdf[surya]` | `natural-pdf[gemini]` |
+ | **Primary Strength** | Good general performance, simpler | Excellent Asian language, speed | High accuracy, multilingual lines | Advanced layout analysis (via API) |
+ | **Speed** | Moderate | Fast | Moderate (GPU recommended) | API Latency |
+ | **Memory Usage** | Higher | Efficient | Higher (GPU recommended) | N/A (API) |
+ | **Paragraph Detect** | Yes (via option) | No | No (focuses on lines) | Yes (Layout model) |
+ | **Handwritten** | Better support | Limited | Limited | Potentially (API model dependent) |
+ | **Small Text** | Moderate | Good | Good | Potentially (API model dependent) |
+ | **When to Use** | General documents, handwritten text| Asian languages, speed-critical tasks | Highest accuracy needed, line-level | Complex layouts, API integration |
 
  ## Basic OCR Usage
 
@@ -53,6 +53,7 @@ For advanced, engine-specific settings, use the Options classes:
 
  ```python
  from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
+ from natural_pdf.analyzers.layout import GeminiOptions # Note: Gemini is primarily layout
 
  # --- Configure PaddleOCR ---
  paddle_opts = PaddleOCROptions(
@@ -90,6 +91,25 @@ surya_opts = SuryaOCROptions(
      # set via environment variables (see note below).
  )
  ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
+
+ # --- Configure Gemini (as layout analyzer, can be used with OCR) ---
+ # Gemini requires API key (GOOGLE_API_KEY environment variable)
+ # Note: Gemini is used via apply_layout, but its options can influence OCR if used together
+ gemini_opts = GeminiOptions(
+     prompt="Extract text content and identify document elements.",
+     # model_name="gemini-1.5-flash-latest" # Specify a model if needed
+     # See GeminiOptions documentation for more parameters
+ )
+ # Typically used like this (layout first, then potentially OCR on regions)
+ layout_elements = page.apply_layout(engine='gemini', options=gemini_opts)
+ # If Gemini also performed OCR or you want to OCR layout regions:
+ # ocr_elements = some_region.apply_ocr(...)
+
+ # It can sometimes be used directly if the model supports it, but less common:
+ # try:
+ #     ocr_elements = page.apply_ocr(engine='gemini', options=gemini_opts)
+ # except Exception as e:
+ #     print(f"Gemini might not be configured for direct OCR via apply_ocr: {e}")
  ```
 
  ## Applying OCR Directly
@@ -105,6 +125,9 @@ print(f"Found {len(ocr_elements)} text elements via OCR")
  title = page.find('text:contains("Title")')
  content_region = title.below(height=300)
  region_ocr_elements = content_region.apply_ocr(engine='paddle', languages=['en'])
+
+ # Note: Re-applying OCR to the same page or region will remove any
+ # previously generated OCR elements for that area before adding the new ones.
  ```
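
Reviewer sketch: the note added in this hunk describes replace-within-area behavior, which can be pictured as follows. The `Element` dataclass and the `reapply_ocr` helper are hypothetical and are not taken from the library. The diff does not say whether "that area" is decided by overlap or by full containment, so the overlap test below is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a text element on a page."""
    text: str
    bbox: BBox
    source: str  # "native" for embedded PDF text, "ocr" for OCR output


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share some interior area."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    return ax0 < bx1 and bx0 < ax1 and atop < bbottom and btop < abottom


def reapply_ocr(
    elements: Iterable[Element], area: BBox, new_ocr: Iterable[Element]
) -> List[Element]:
    """Drop earlier OCR elements inside `area`, then append the new ones.

    Native elements and OCR elements outside the area are left untouched.
    """
    kept = [
        el for el in elements
        if not (el.source == "ocr" and overlaps(el.bbox, area))
    ]
    return kept + list(new_ocr)


page_elements = [
    Element("Title", (0, 0, 100, 20), "native"),
    Element("Tota1 due", (0, 30, 100, 50), "ocr"),   # earlier, noisy OCR pass
    Element("Page 1", (0, 700, 100, 720), "ocr"),    # outside the re-OCR area
]
region = (0, 25, 100, 60)
second_pass = [Element("Total due", (0, 30, 100, 50), "ocr")]

updated = reapply_ocr(page_elements, region, second_pass)
```

Under these assumptions, running OCR twice on the same region does not leave duplicate text behind, while OCR results elsewhere on the page survive.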
 
  ## OCR Engines
@@ -190,15 +213,39 @@ page.correct_ocr(correct)
  # You're done!
  ```
 
- ## Debugging OCR
+ ## Interactive OCR Correction / Debugging
 
- ```python
- from natural_pdf.utils.packaging import create_correction_task_package
+ Natural PDF includes a utility to package a PDF and its detected elements, along with an interactive web application (SPA) for reviewing and correcting OCR results.
 
- create_correction_task_package(pdf, "original.zip", overwrite=True)
- ```
+ 1. **Package the data:**
+     Use the `create_correction_task_package` function to create a zip file containing the necessary data for the SPA.
+
+     ```python
+     from natural_pdf.utils.packaging import create_correction_task_package
+
+     # Assuming 'pdf' is your loaded PDF object after running apply_ocr or apply_layout
+     create_correction_task_package(pdf, "correction_package.zip", overwrite=True)
+     ```
+
+ 2. **Run the SPA:**
+     The correction SPA is bundled with the library. You need to run a simple web server from the directory containing the SPA's files. The location of these files might depend on your installation, but you can typically find them within the installed `natural_pdf` package directory under `templates/spa`.
+
+     *Example using Python's built-in server (run from your terminal):*
+
+     ```bash
+     # Find the path to the installed natural_pdf package
+     # (This command might vary depending on your environment)
+     NATURAL_PDF_PATH=$(python -c "import site; print(site.getsitepackages()[0])")/natural_pdf
+
+     # Navigate to the SPA directory
+     cd $NATURAL_PDF_PATH/templates/spa
+
+     # Start the web server (e.g., on port 8000)
+     python -m http.server 8000
+     ```
 
- This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
+ 3. **Use the SPA:**
+     Open your web browser to `http://localhost:8000`. The SPA should load, allowing you to drag and drop the `correction_package.zip` file you created into the application to view and edit the OCR results.
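
Reviewer sketch: the shell snippet in step 2 relies on `site.getsitepackages()[0]`, which may point at the wrong directory in some environments (for example user-site or editable installs). A possible alternative is to ask the import system where the package lives and serve the SPA directory from Python. The helper names `package_dir`, `make_server`, and `serve_correction_spa` are hypothetical; the `templates/spa` location is taken from the text above and is not verified here.

```python
import importlib.util
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def package_dir(package_name: str) -> Path:
    """Return the directory of an importable package without importing it."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package_name!r} is not an importable package")
    return Path(next(iter(spec.submodule_search_locations)))


def make_server(directory: Path, port: int = 8000) -> ThreadingHTTPServer:
    """Build a local-only static file server rooted at `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve_correction_spa(port: int = 8000) -> None:
    """Serve the bundled SPA until interrupted (assumes templates/spa exists)."""
    spa_dir = package_dir("natural_pdf") / "templates" / "spa"
    if not spa_dir.is_dir():
        raise FileNotFoundError(f"SPA directory not found: {spa_dir}")
    with make_server(spa_dir, port) as server:
        print(f"Serving {spa_dir} at http://127.0.0.1:{port}")
        server.serve_forever()
```

Calling `serve_correction_spa()` from a Python session would play the role of the `cd` and `python -m http.server 8000` steps, and binding to `127.0.0.1` keeps the server reachable only from the local machine.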
 
  ## Next Steps