PyPI - natural-pdf - Versions diffs - 0.1.7__py3-none-any.whl → 0.1.8__py3-none-any.whl - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (52)

docs/categorizing-documents/index.md +168 -0
docs/data-extraction/index.md +87 -0
docs/element-selection/index.ipynb +218 -164
docs/element-selection/index.md +20 -0
docs/index.md +19 -0
docs/ocr/index.md +63 -16
docs/tutorials/01-loading-and-extraction.ipynb +1713 -34
docs/tutorials/02-finding-elements.ipynb +123 -46
docs/tutorials/03-extracting-blocks.ipynb +24 -19
docs/tutorials/04-table-extraction.ipynb +17 -12
docs/tutorials/05-excluding-content.ipynb +37 -32
docs/tutorials/06-document-qa.ipynb +36 -31
docs/tutorials/07-layout-analysis.ipynb +45 -40
docs/tutorials/07-working-with-regions.ipynb +61 -60
docs/tutorials/08-spatial-navigation.ipynb +76 -71
docs/tutorials/09-section-extraction.ipynb +160 -155
docs/tutorials/10-form-field-extraction.ipynb +71 -66
docs/tutorials/11-enhanced-table-processing.ipynb +11 -6
docs/tutorials/12-ocr-integration.ipynb +3420 -312
docs/tutorials/12-ocr-integration.md +68 -106
docs/tutorials/13-semantic-search.ipynb +641 -251
natural_pdf/__init__.py +2 -0
natural_pdf/classification/manager.py +343 -0
natural_pdf/classification/mixin.py +149 -0
natural_pdf/classification/results.py +62 -0
natural_pdf/collections/mixins.py +63 -0
natural_pdf/collections/pdf_collection.py +321 -15
natural_pdf/core/element_manager.py +67 -0
natural_pdf/core/page.py +227 -64
natural_pdf/core/pdf.py +387 -378
natural_pdf/elements/collections.py +272 -41
natural_pdf/elements/region.py +99 -15
natural_pdf/elements/text.py +5 -2
natural_pdf/exporters/paddleocr.py +1 -1
natural_pdf/extraction/manager.py +134 -0
natural_pdf/extraction/mixin.py +246 -0
natural_pdf/extraction/result.py +37 -0
natural_pdf/ocr/engine_easyocr.py +6 -3
natural_pdf/ocr/ocr_manager.py +85 -25
natural_pdf/ocr/ocr_options.py +33 -10
natural_pdf/ocr/utils.py +14 -3
natural_pdf/qa/document_qa.py +0 -4
natural_pdf/selectors/parser.py +363 -238
natural_pdf/templates/finetune/fine_tune_paddleocr.md +10 -5
natural_pdf/utils/locks.py +8 -0
natural_pdf/utils/text_extraction.py +52 -1
natural_pdf/utils/tqdm_utils.py +43 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/METADATA +6 -1
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/RECORD +52 -41
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/licenses/LICENSE +0 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/top_level.txt +0 -0

docs/element-selection/index.md CHANGED

@@ -141,6 +141,26 @@ page.find_all('text:bold').show()
 page.find_all('text[size>=11]:bold')
 ```
+### Negation Pseudo-class (`:not()`)
+You can exclude elements that match a certain selector using the `:not()` pseudo-class. It takes another simple selector as its argument.
+```python
+# Find all text elements that are NOT bold
+non_bold_text = page.find_all('text:not(:bold)')
+# Find all elements that are NOT regions of type 'table'
+not_tables = page.find_all(':not(region[type=table])')
+# Find text elements that do not contain "Total" (case-insensitive)
+relevant_text = page.find_all('text:not(:contains("Total"))', case=False)
+# Find text elements that are not empty
+non_empty_text = page.find_all('text:not(:empty)')
+```
+**Note:** The selector inside `:not()` follows the same rules as regular selectors but currently does not support combinators (like `>`, `+`, `~`, or descendant space) within `:not()`. You can nest basic type, attribute, and other pseudo-class selectors.
 ### Spatial Pseudo-Classes Examples
 ```python

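As a rough mental model for the `:not()` examples in the hunk above, negation can be read as inverting one predicate and combining it with the rest of the selector. The sketch below uses plain dictionaries and hypothetical helper names (`negate`, `select`, `is_text`, `is_bold`); it is not natural-pdf's selector parser, only an illustration of the filtering logic.

```python
from typing import Callable, Dict, List

Element = Dict[str, object]
Predicate = Callable[[Element], bool]


def negate(pred: Predicate) -> Predicate:
    """Return a predicate that matches exactly when `pred` does not."""
    return lambda el: not pred(el)


def select(elements: List[Element], *preds: Predicate) -> List[Element]:
    """Keep the elements that satisfy every predicate (a compound selector)."""
    return [el for el in elements if all(p(el) for p in preds)]


# Hypothetical stand-ins for the selector parts 'text' and ':bold'
is_text: Predicate = lambda el: el["type"] == "text"
is_bold: Predicate = lambda el: bool(el.get("bold"))

elements: List[Element] = [
    {"type": "text", "bold": True, "text": "Total"},
    {"type": "text", "bold": False, "text": "Widgets"},
    {"type": "rect"},
]

# Toy analogue of 'text:not(:bold)': must be text AND must not be bold
non_bold_text = select(elements, is_text, negate(is_bold))

# Toy analogue of ':not(text)': everything that is not a text element
non_text = select(elements, negate(is_text))
```

Because the negated predicate is just another filter, it composes with type and attribute checks the same way the documented examples combine `text` with `:not(...)`.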
docs/index.md CHANGED

@@ -132,6 +132,25 @@ if result.get("found", False):
 [Learn about Document QA →](document-qa/index.ipynb)
+### Classify Pages and Regions
+Categorize pages or specific regions based on their content using text or vision models.
+**Note:** Requires `pip install "natural-pdf[classification]"`
+```python
+# Classify a page based on text
+categories = ["invoice", "scientific article", "presentation"]
+page.classify(categories=categories, model="text")
+print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+# Classify a page based on what it looks like
+categories = ["invoice", "scientific article", "presentation"]
+page.classify(categories=categories, model="vision")
+print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+```
 ### Visualize Your Work
 Debug and understand your extractions visually.

docs/ocr/index.md CHANGED

@@ -6,16 +6,16 @@ Natural PDF includes OCR (Optical Character Recognition) to extract text from sc
 Natural PDF supports multiple OCR engines:
-| Feature              | EasyOCR                            | PaddleOCR                                | Surya OCR                             |
-|----------------------|------------------------------------|------------------------------------------|---------------------------------------|
-| **Installation**     | `natural-pdf[easyocr]`             | `natural-pdf[paddle]`                    | `natural-pdf[surya]`                  |
-| **Primary Strength** | Good general performance, simpler  | Excellent Asian language, speed        | High accuracy, multilingual lines     |
-| **Speed**            | Moderate                           | Fast                                     | Moderate (GPU recommended)            |
-| **Memory Usage**     | Higher                             | Efficient                                | Higher (GPU recommended)            |
-| **Paragraph Detect** | Yes (via option)                   | No                                       | No (focuses on lines)                 |
-| **Handwritten**      | Better support                     | Limited                                  | Limited                               |
-| **Small Text**       | Moderate                           | Good                                     | Good                                  |
-| **When to Use**      | General documents, handwritten text| Asian languages, speed-critical tasks    | Highest accuracy needed, line-level   |
+| Feature              | EasyOCR                            | PaddleOCR                                | Surya OCR                             | Gemini (Layout + potential OCR)      |
+|----------------------|------------------------------------|------------------------------------------|---------------------------------------|--------------------------------------|
+| **Installation**     | `natural-pdf[easyocr]`             | `natural-pdf[paddle]`                    | `natural-pdf[surya]`                  | `natural-pdf[gemini]`                |
+| **Primary Strength** | Good general performance, simpler  | Excellent Asian language, speed        | High accuracy, multilingual lines     | Advanced layout analysis (via API) |
+| **Speed**            | Moderate                           | Fast                                     | Moderate (GPU recommended)            | API Latency                          |
+| **Memory Usage**     | Higher                             | Efficient                                | Higher (GPU recommended)            | N/A (API)                            |
+| **Paragraph Detect** | Yes (via option)                   | No                                       | No (focuses on lines)                 | Yes (Layout model)                 |
+| **Handwritten**      | Better support                     | Limited                                  | Limited                               | Potentially (API model dependent)    |
+| **Small Text**       | Moderate                           | Good                                     | Good                                  | Potentially (API model dependent)    |
+| **When to Use**      | General documents, handwritten text| Asian languages, speed-critical tasks    | Highest accuracy needed, line-level   | Complex layouts, API integration     |
 ## Basic OCR Usage
@@ -53,6 +53,7 @@ For advanced, engine-specific settings, use the Options classes:
 ```python
 from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
+from natural_pdf.analyzers.layout import GeminiOptions # Note: Gemini is primarily layout
 # --- Configure PaddleOCR ---
 paddle_opts = PaddleOCROptions(
@@ -90,6 +91,25 @@ surya_opts = SuryaOCROptions(
     # set via environment variables (see note below).
 )
 ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
+# --- Configure Gemini (as layout analyzer, can be used with OCR) ---
+# Gemini requires API key (GOOGLE_API_KEY environment variable)
+# Note: Gemini is used via apply_layout, but its options can influence OCR if used together
+gemini_opts = GeminiOptions(
+    prompt="Extract text content and identify document elements.",
+    # model_name="gemini-1.5-flash-latest" # Specify a model if needed
+    # See GeminiOptions documentation for more parameters
+)
+# Typically used like this (layout first, then potentially OCR on regions)
+layout_elements = page.apply_layout(engine='gemini', options=gemini_opts)
+# If Gemini also performed OCR or you want to OCR layout regions:
+# ocr_elements = some_region.apply_ocr(...)
+# It can sometimes be used directly if the model supports it, but less common:
+# try:
+#     ocr_elements = page.apply_ocr(engine='gemini', options=gemini_opts)
+# except Exception as e:
+#     print(f"Gemini might not be configured for direct OCR via apply_ocr: {e}")
 ```
 ## Applying OCR Directly
@@ -105,6 +125,9 @@ print(f"Found {len(ocr_elements)} text elements via OCR")
 title = page.find('text:contains("Title")')
 content_region = title.below(height=300)
 region_ocr_elements = content_region.apply_ocr(engine='paddle', languages=['en'])
+# Note: Re-applying OCR to the same page or region will remove any
+# previously generated OCR elements for that area before adding the new ones.
 ```
 ## OCR Engines
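The note added in this hunk says that re-applying OCR replaces earlier OCR output for the same area. One way to picture that behavior is "drop old OCR elements in the area, then add the new ones". The sketch below assumes that "in the area" means bounding-box overlap and that a `source` field marks OCR-generated elements; both are assumptions for illustration, and `Element`, `overlaps`, and `replace_ocr` are hypothetical names, not natural-pdf internals.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


@dataclass
class Element:
    text: str
    bbox: BBox
    source: str  # "native" for embedded PDF text, "ocr" for OCR output


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def replace_ocr(existing: List[Element], area: BBox, new_ocr: List[Element]) -> List[Element]:
    """Remove earlier OCR elements inside `area`, then append the new results."""
    kept = [
        el for el in existing
        if not (el.source == "ocr" and overlaps(el.bbox, area))
    ]
    return kept + new_ocr


page_elements = [
    Element("Header", (0, 0, 100, 20), "native"),
    Element("old gues", (0, 30, 100, 50), "ocr"),
    Element("footer ocr", (0, 700, 100, 720), "ocr"),
]
area = (0, 25, 200, 325)  # region being re-OCRed
updated = replace_ocr(page_elements, area, [Element("new guess", (0, 30, 100, 50), "ocr")])
```

In this toy model, native text and OCR results outside the area survive, so repeated runs on one region do not accumulate duplicate OCR elements.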
@@ -190,15 +213,39 @@ page.correct_ocr(correct)
 # You're done!
 ```
-## Debugging OCR
+## Interactive OCR Correction / Debugging
-```python
-from natural_pdf.utils.packaging import create_correction_task_package
+Natural PDF includes a utility to package a PDF and its detected elements, along with an interactive web application (SPA) for reviewing and correcting OCR results.
-create_correction_task_package(pdf, "original.zip", overwrite=True)
-```
+1.  **Package the data:**
+    Use the `create_correction_task_package` function to create a zip file containing the necessary data for the SPA.
+    ```python
+    from natural_pdf.utils.packaging import create_correction_task_package
+    # Assuming 'pdf' is your loaded PDF object after running apply_ocr or apply_layout
+    create_correction_task_package(pdf, "correction_package.zip", overwrite=True)
+    ```
+2.  **Run the SPA:**
+    The correction SPA is bundled with the library. You need to run a simple web server from the directory containing the SPA's files. The location of these files might depend on your installation, but you can typically find them within the installed `natural_pdf` package directory under `templates/spa`.
+    *Example using Python's built-in server (run from your terminal):*
+    ```bash
+    # Find the path to the installed natural_pdf package
+    # (This command might vary depending on your environment)
+    NATURAL_PDF_PATH=$(python -c "import site; print(site.getsitepackages()[0])")/natural_pdf
+    # Navigate to the SPA directory
+    cd $NATURAL_PDF_PATH/templates/spa
+    # Start the web server (e.g., on port 8000)
+    python -m http.server 8000
+    ```
-This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
+3.  **Use the SPA:**
+    Open your web browser to `http://localhost:8000`. The SPA should load, allowing you to drag and drop the `correction_package.zip` file you created into the application to view and edit the OCR results.
 ## Next Steps
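The shell snippet in the hunk above takes the first entry of `site.getsitepackages()`, which may not be where the package actually lives (for example with user or editable installs). A hedged alternative is to ask Python's import system directly. The sketch below uses only the standard library; `package_dir` is a hypothetical helper, and the `templates/spa` subdirectory is taken from the documentation text above rather than verified here.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the directory of an installed top-level package, or None.

    Returns None when the name is not installed or is a plain module
    rather than a package.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    # The first search location is the package's own directory
    return Path(next(iter(spec.submodule_search_locations)))


pkg = package_dir("natural_pdf")
if pkg is None:
    print("natural_pdf is not installed in this environment")
else:
    spa_dir = pkg / "templates" / "spa"  # location stated in the docs above
    if spa_dir.is_dir():
        print(f"Serve the SPA from: {spa_dir}")
    else:
        print(f"No SPA directory found at: {spa_dir}")
```

The printed path can then be passed to `python -m http.server 8000 --directory <path>` instead of changing directories first.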
