PyPI - natural-pdf - Versions diffs - 0.1.6__py3-none-any.whl → 0.1.8__py3-none-any.whl - Mend

natural-pdf 0.1.6__py3-none-any.whl → 0.1.8__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (66)

docs/categorizing-documents/index.md +168 -0
docs/data-extraction/index.md +87 -0
docs/element-selection/index.ipynb +218 -164
docs/element-selection/index.md +20 -0
docs/finetuning/index.md +176 -0
docs/index.md +19 -0
docs/ocr/index.md +63 -16
docs/tutorials/01-loading-and-extraction.ipynb +411 -248
docs/tutorials/02-finding-elements.ipynb +123 -46
docs/tutorials/03-extracting-blocks.ipynb +24 -19
docs/tutorials/04-table-extraction.ipynb +17 -12
docs/tutorials/05-excluding-content.ipynb +37 -32
docs/tutorials/06-document-qa.ipynb +36 -31
docs/tutorials/07-layout-analysis.ipynb +45 -40
docs/tutorials/07-working-with-regions.ipynb +61 -60
docs/tutorials/08-spatial-navigation.ipynb +76 -71
docs/tutorials/09-section-extraction.ipynb +160 -155
docs/tutorials/10-form-field-extraction.ipynb +71 -66
docs/tutorials/11-enhanced-table-processing.ipynb +11 -6
docs/tutorials/12-ocr-integration.ipynb +3420 -312
docs/tutorials/12-ocr-integration.md +68 -106
docs/tutorials/13-semantic-search.ipynb +641 -251
natural_pdf/__init__.py +3 -0
natural_pdf/analyzers/layout/gemini.py +63 -47
natural_pdf/classification/manager.py +343 -0
natural_pdf/classification/mixin.py +149 -0
natural_pdf/classification/results.py +62 -0
natural_pdf/collections/mixins.py +63 -0
natural_pdf/collections/pdf_collection.py +326 -17
natural_pdf/core/element_manager.py +73 -4
natural_pdf/core/page.py +255 -83
natural_pdf/core/pdf.py +385 -367
natural_pdf/elements/base.py +1 -3
natural_pdf/elements/collections.py +279 -49
natural_pdf/elements/region.py +106 -21
natural_pdf/elements/text.py +5 -2
natural_pdf/exporters/__init__.py +4 -0
natural_pdf/exporters/base.py +61 -0
natural_pdf/exporters/paddleocr.py +345 -0
natural_pdf/extraction/manager.py +134 -0
natural_pdf/extraction/mixin.py +246 -0
natural_pdf/extraction/result.py +37 -0
natural_pdf/ocr/__init__.py +16 -8
natural_pdf/ocr/engine.py +46 -30
natural_pdf/ocr/engine_easyocr.py +86 -42
natural_pdf/ocr/engine_paddle.py +39 -28
natural_pdf/ocr/engine_surya.py +32 -16
natural_pdf/ocr/ocr_factory.py +34 -23
natural_pdf/ocr/ocr_manager.py +98 -34
natural_pdf/ocr/ocr_options.py +38 -10
natural_pdf/ocr/utils.py +59 -33
natural_pdf/qa/document_qa.py +0 -4
natural_pdf/selectors/parser.py +363 -238
natural_pdf/templates/finetune/fine_tune_paddleocr.md +420 -0
natural_pdf/utils/debug.py +4 -2
natural_pdf/utils/identifiers.py +9 -5
natural_pdf/utils/locks.py +8 -0
natural_pdf/utils/packaging.py +172 -105
natural_pdf/utils/text_extraction.py +96 -65
natural_pdf/utils/tqdm_utils.py +43 -0
natural_pdf/utils/visualization.py +1 -1
{natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/METADATA +10 -3
{natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/RECORD +66 -51
{natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/licenses/LICENSE +0 -0
{natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/top_level.txt +0 -0

docs/element-selection/index.md CHANGED

@@ -141,6 +141,26 @@ page.find_all('text:bold').show()
 page.find_all('text[size>=11]:bold')
 ```
+### Negation Pseudo-class (`:not()`)
+You can exclude elements that match a certain selector using the `:not()` pseudo-class. It takes another simple selector as its argument.
+```python
+# Find all text elements that are NOT bold
+non_bold_text = page.find_all('text:not(:bold)')
+# Find all elements that are NOT regions of type 'table'
+not_tables = page.find_all(':not(region[type=table])')
+# Find text elements that do not contain "Total" (case-insensitive)
+relevant_text = page.find_all('text:not(:contains("Total"))', case=False)
+# Find text elements that are not empty
+non_empty_text = page.find_all('text:not(:empty)')
+```
+**Note:** The selector inside `:not()` follows the same rules as regular selectors but currently does not support combinators (like `>`, `+`, `~`, or descendant space) within `:not()`. You can nest basic type, attribute, and other pseudo-class selectors.
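The negation rule described above can be sketched in plain Python. This is a minimal illustration of the semantics only, not the parser in `natural_pdf/selectors/parser.py`; the dictionary-based elements and the helper names (`negate`, `is_type`, `is_bold`, `select`) are hypothetical.

```python
from typing import Callable, Dict, List

Element = Dict[str, object]
Predicate = Callable[[Element], bool]


def negate(inner: Predicate) -> Predicate:
    """Return a predicate that matches exactly when `inner` does not."""
    return lambda el: not inner(el)


def is_type(name: str) -> Predicate:
    """Match elements whose 'type' field equals `name`."""
    return lambda el: el.get("type") == name


def is_bold(el: Element) -> bool:
    """Match elements flagged as bold."""
    return bool(el.get("bold", False))


def select(elements: List[Element], *predicates: Predicate) -> List[Element]:
    """Keep elements that satisfy every predicate (a compound selector)."""
    return [el for el in elements if all(p(el) for p in predicates)]


# Hypothetical stand-ins for page elements.
elements = [
    {"type": "text", "text": "Total", "bold": True},
    {"type": "text", "text": "Widget", "bold": False},
    {"type": "region", "text": "", "bold": False},
]

# Analogue of 'text:not(:bold)': type filter plus a negated pseudo-class.
non_bold_text = select(elements, is_type("text"), negate(is_bold))

# Analogue of ':not(text)': no type restriction outside the negation.
not_text = select(elements, negate(is_type("text")))
```

Because the negated selector is a single predicate over one element, there is no natural place for a combinator, which relates two elements, inside `:not()` in this model.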
 ### Spatial Pseudo-Classes Examples
 ```python

docs/finetuning/index.md ADDED

@@ -0,0 +1,176 @@
+# OCR Fine-tuning
+While the built-in OCR engines (EasyOCR, PaddleOCR, Surya) offer good general performance, you might encounter situations where their accuracy isn't sufficient for your specific needs. This is often the case with:
+*   **Unique Fonts:** Documents using unusual or stylized fonts.
+*   **Specific Languages:** Languages or scripts not perfectly covered by the default models.
+*   **Low Quality Scans:** Noisy or degraded document images.
+*   **Specialized Layouts:** Text within complex tables, forms, or unusual arrangements.
+Fine-tuning allows you to adapt a pre-trained OCR recognition model to your specific data, significantly improving its accuracy on documents similar to those used for training.
+## Why Fine-tune?
+-   **Higher Accuracy:** Achieve better text extraction results on your specific document types.
+-   **Adaptability:** Train the model to recognize domain-specific terms, symbols, or layouts.
+-   **Reduced Errors:** Minimize downstream errors in data extraction and processing pipelines.
+## Strategy: Detect + LLM Correct + Export
+Training an OCR model requires accurate ground truth: images of text snippets paired with their correct transcriptions. Manually creating this data is tedious. A powerful alternative leverages the strengths of different models:
+1.  **Detect Text Regions:** Use a robust local OCR engine (like Surya or PaddleOCR) primarily for its *detection* capabilities (`detect_only=True`). This identifies the *locations* of text on the page, even if the initial *recognition* isn't perfect. You can combine this with layout analysis or region selections (`.region()`, `.below()`, `.add_exclusion()`) to focus on the specific areas you care about.
+2.  **Correct with LLM:** For each detected text region, send the image snippet to a powerful Large Language Model (LLM) with multimodal capabilities (like GPT-4o, Claude 3.5 Sonnet/Haiku) using the `direct_ocr_llm` utility. The LLM performs high-accuracy OCR on the snippet, providing a "ground truth" transcription.
+3.  **Export for Fine-tuning:** Use the `PaddleOCRRecognitionExporter` to package the original image snippets (from step 1) along with their corresponding LLM-generated text labels (from step 2) into the specific format required by PaddleOCR for fine-tuning its *recognition* model.
+This approach combines the efficient spatial detection of local models with the superior text recognition of large generative models to create a high-quality fine-tuning dataset with minimal manual effort.
+## Example: Fine-tuning for Greek Spreadsheet Text
+Let's walk through an example of preparing data to fine-tune PaddleOCR for text from a scanned Greek spreadsheet, adapting the process described above.
+```python
+# --- 1. Setup and Load PDF ---
+from natural_pdf import PDF
+from natural_pdf.ocr.utils import direct_ocr_llm
+from natural_pdf.exporters import PaddleOCRRecognitionExporter
+import openai # Or your preferred LLM client library
+import os
+# Ensure your LLM API key is set (using environment variables is recommended)
+# os.environ["OPENAI_API_KEY"] = "sk-..."
+# os.environ["ANTHROPIC_API_KEY"] = "sk-..."
+# pdf_path = "path/to/your/document.pdf"
+pdf_path = "pdfs/hidden/the-bad-one.pdf" # Replace with your PDF path
+pdf = PDF(pdf_path)
+# --- 2. (Optional) Exclude Irrelevant Areas ---
+# If the document has consistent headers, footers, or margins you want to ignore
+# Use exclusions *before* detection
+pdf.add_exclusion(lambda page: page.region(right=45)) # Exclude left margin/line numbers
+pdf.add_exclusion(lambda page: page.region(left=500)) # Exclude right margin
+# --- 3. Detect Text Regions ---
+# Use a good detection engine. Surya is often robust for line detection.
+# We only want the bounding boxes, not the initial (potentially inaccurate) OCR text.
+print("Detecting text regions...")
+# Process only a subset of pages for demonstration if needed
+for page in pdf.pages[:10]:
+    # Use a moderate resolution for detection; higher res used for LLM correction later
+    page.apply_ocr(engine='surya', resolution=120, detect_only=True)
+print("Detection complete for up to 10 pages.")
+# (Optional) Visualize detected boxes on a sample page
+# pdf.pages[9].find_all('text[source=ocr]').show()
+# --- 4. Correct with LLM ---
+# Configure your LLM client (example using OpenAI client, adaptable for others)
+# For Anthropic: client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key=os.environ.get("ANTHROPIC_API_KEY"))
+client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
+# Craft a clear prompt for the LLM
+# Be as specific as possible! If it's in a specific language, what kinds
+# of characters, etc.
+prompt = """OCR this image patch. Return only the exact text content visible in the image.
+Preserve original spelling, capitalization, punctuation, and symbols.
+Do not add any explanatory text, translations, comments, or quotation marks around the result.
+The text is likely from a Greek document, potentially a spreadsheet, containing Modern Greek words or numbers."""
+# Define the correction function using direct_ocr_llm
+def correct_text_region(region):
+    # Use a high resolution for the LLM call for best accuracy
+    return direct_ocr_llm(
+        region,
+        client,
+        prompt=prompt,
+        resolution=300,
+        # model="claude-3-5-sonnet-20240620" # Example Anthropic model
+        model="gpt-4o-mini" # Example OpenAI model
+    )
+# Apply the correction function to the detected text regions
+print("Applying LLM correction to detected regions...")
+for page in pdf.pages[:10]:  # Same subset of pages used for detection above
+    # This finds elements added by apply_ocr and passes their regions to 'correct_text_region'
+    # The returned text from the LLM replaces the original OCR text for these elements
+    # The source attribute is updated (e.g., to 'ocr-llm-corrected')
+    page.correct_ocr(correct_text_region)
+print("LLM correction complete.")
+# --- 5. Export for PaddleOCR Fine-tuning ---
+print("Configuring exporter...")
+exporter = PaddleOCRRecognitionExporter(
+    # Select all of the non-blank OCR text
+    # Hopefully it's all been LLM-corrected!
+    selector="text[source^=ocr][text!='']",
+    resolution=300,     # Resolution for the exported image crops
+    padding=2,          # Add slight padding around text boxes
+    split_ratio=0.9,    # 90% for training, 10% for validation
+    random_seed=42,     # For reproducible train/val split
+    include_guide=True  # Include the Colab fine-tuning notebook
+)
+# Define the output directory
+output_directory = "./my_paddleocr_finetune_data"
+print(f"Exporting data to {output_directory}...")
+# Run the export process
+exporter.export(pdf, output_directory)
+print("Export complete.")
+print(f"Dataset ready for fine-tuning in: {output_directory}")
+print(f"Next step: Upload '{os.path.join(output_directory, 'fine_tune_paddleocr.ipynb')}' and the rest of the contents to Google Colab.")
+# --- Cleanup ---
+pdf.close()
+```
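The `split_ratio` and `random_seed` arguments in the example above suggest a seeded shuffle followed by a cut. The following is a rough sketch of that idea, not the exporter's actual code: the helper names are hypothetical, and the tab-separated `image_path<TAB>text` label format is an assumption about what PaddleOCR recognition training expects. The file names `train.txt`, `val.txt`, and `dict.txt` come from the guide itself.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, str]  # (relative image path, transcription)


def split_samples(
    samples: Sequence[Sample], split_ratio: float = 0.9, random_seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle with a fixed seed, then cut into train and validation parts."""
    shuffled = list(samples)
    random.Random(random_seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]


def label_lines(samples: Sequence[Sample]) -> List[str]:
    """One 'path<TAB>text' line per sample (assumed label file layout)."""
    return [f"{path}\t{text}" for path, text in samples]


def char_dict(samples: Sequence[Sample]) -> List[str]:
    """Sorted set of every character that appears in the transcriptions."""
    return sorted({ch for _, text in samples for ch in text})


# Hypothetical crops with Greek labels, standing in for LLM-corrected regions.
samples = [(f"images/crop_{i:03d}.png", f"λέξη {i}") for i in range(20)]
train, val = split_samples(samples)

train_txt = "\n".join(label_lines(train))       # would be written to train.txt
val_txt = "\n".join(label_lines(val))           # would be written to val.txt
dict_txt = "\n".join(char_dict(samples))        # would be written to dict.txt
```

Fixing the seed means that re-running the export on the same data yields the same train/validation partition, which keeps validation scores comparable across runs.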
+## Running the Fine-tuning
+The `PaddleOCRRecognitionExporter` automatically includes a Jupyter Notebook (`fine_tune_paddleocr.ipynb`) in the output directory. This notebook is pre-configured to guide you through the fine-tuning process on Google Colab (which offers free GPU access):
+1.  **Upload:** Upload the entire output directory (e.g., `my_paddleocr_finetune_data`) to your Google Drive or directly to your Colab instance.
+2.  **Open Notebook:** Open the `fine_tune_paddleocr.ipynb` notebook in Google Colab.
+3.  **Set Runtime:** Ensure the Colab runtime is set to use a GPU (Runtime -> Change runtime type -> GPU).
+4.  **Run Cells:** Execute the cells in the notebook sequentially. It will:
+    *   Install necessary libraries (PaddlePaddle, PaddleOCR).
+    *   Point the training configuration to your uploaded dataset (`images/`, `train.txt`, `val.txt`, `dict.txt`).
+    *   Download a pre-trained PaddleOCR model (usually a multilingual one).
+    *   Start the fine-tuning process using your data.
+    *   Save the fine-tuned model checkpoints.
+    *   Export the best model into an "inference format" suitable for use with `natural-pdf`.
+5.  **Download Model:** Download the resulting `inference_model` directory from Colab.
+## Using the Fine-tuned Model
+Once you have the `inference_model` directory, you can instruct `natural-pdf` to use it for OCR:
+```python
+import os
+from natural_pdf import PDF
+from natural_pdf.ocr import PaddleOCROptions
+# Path to the directory you downloaded from Colab
+finetuned_model_dir = "/path/to/your/downloaded/inference_model"
+# Specify the path in PaddleOCROptions
+paddle_opts = PaddleOCROptions(
+    rec_model_dir=finetuned_model_dir,
+    rec_char_dict_path=os.path.join(finetuned_model_dir, 'your_dict.txt'), # Or wherever your dict is
+    use_gpu=True # If using GPU locally
+)
+pdf = PDF("another-similar-document.pdf")
+page = pdf.pages[0]
+# Apply OCR using your fine-tuned model
+ocr_elements = page.apply_ocr(engine='paddle', options=paddle_opts)
+# Extract text using the improved results
+text = page.extract_text()
+print(text)
+pdf.close()
+```
+By following this process, you can significantly enhance OCR performance on your specific documents using the power of fine-tuning.

docs/index.md CHANGED

@@ -132,6 +132,25 @@ if result.get("found", False):
 [Learn about Document QA →](document-qa/index.ipynb)
+### Classify Pages and Regions
+Categorize pages or specific regions based on their content using text or vision models.
+**Note:** Requires `pip install "natural-pdf[classification]"`
+```python
+# Classify a page based on text
+categories = ["invoice", "scientific article", "presentation"]
+page.classify(categories=categories, model="text")
+print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+# Classify a page based on what it looks like
+categories = ["invoice", "scientific article", "presentation"]
+page.classify(categories=categories, model="vision")
+print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
+```
 ### Visualize Your Work
 Debug and understand your extractions visually.

docs/ocr/index.md CHANGED

@@ -6,16 +6,16 @@ Natural PDF includes OCR (Optical Character Recognition) to extract text from sc
 Natural PDF supports multiple OCR engines:
-| Feature              | EasyOCR                            | PaddleOCR                                | Surya OCR                             |
-|----------------------|------------------------------------|------------------------------------------|---------------------------------------|
-| **Installation**     | `natural-pdf[easyocr]`             | `natural-pdf[paddle]`                    | `natural-pdf[surya]`                  |
-| **Primary Strength** | Good general performance, simpler  | Excellent Asian language, speed        | High accuracy, multilingual lines     |
-| **Speed**            | Moderate                           | Fast                                     | Moderate (GPU recommended)            |
-| **Memory Usage**     | Higher                             | Efficient                                | Higher (GPU recommended)            |
-| **Paragraph Detect** | Yes (via option)                   | No                                       | No (focuses on lines)                 |
-| **Handwritten**      | Better support                     | Limited                                  | Limited                               |
-| **Small Text**       | Moderate                           | Good                                     | Good                                  |
-| **When to Use**      | General documents, handwritten text| Asian languages, speed-critical tasks    | Highest accuracy needed, line-level   |
+| Feature              | EasyOCR                            | PaddleOCR                                | Surya OCR                             | Gemini (Layout + potential OCR)      |
+|----------------------|------------------------------------|------------------------------------------|---------------------------------------|--------------------------------------|
+| **Installation**     | `natural-pdf[easyocr]`             | `natural-pdf[paddle]`                    | `natural-pdf[surya]`                  | `natural-pdf[gemini]`                |
+| **Primary Strength** | Good general performance, simpler  | Excellent Asian language, speed        | High accuracy, multilingual lines     | Advanced layout analysis (via API) |
+| **Speed**            | Moderate                           | Fast                                     | Moderate (GPU recommended)            | API Latency                          |
+| **Memory Usage**     | Higher                             | Efficient                                | Higher (GPU recommended)            | N/A (API)                            |
+| **Paragraph Detect** | Yes (via option)                   | No                                       | No (focuses on lines)                 | Yes (Layout model)                 |
+| **Handwritten**      | Better support                     | Limited                                  | Limited                               | Potentially (API model dependent)    |
+| **Small Text**       | Moderate                           | Good                                     | Good                                  | Potentially (API model dependent)    |
+| **When to Use**      | General documents, handwritten text| Asian languages, speed-critical tasks    | Highest accuracy needed, line-level   | Complex layouts, API integration     |
 ## Basic OCR Usage
@@ -53,6 +53,7 @@ For advanced, engine-specific settings, use the Options classes:
 ```python
 from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
+from natural_pdf.analyzers.layout import GeminiOptions # Note: Gemini is primarily layout
 # --- Configure PaddleOCR ---
 paddle_opts = PaddleOCROptions(
@@ -90,6 +91,25 @@ surya_opts = SuryaOCROptions(
     # set via environment variables (see note below).
 )
 ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
+# --- Configure Gemini (as layout analyzer, can be used with OCR) ---
+# Gemini requires API key (GOOGLE_API_KEY environment variable)
+# Note: Gemini is used via apply_layout, but its options can influence OCR if used together
+gemini_opts = GeminiOptions(
+    prompt="Extract text content and identify document elements.",
+    # model_name="gemini-1.5-flash-latest" # Specify a model if needed
+    # See GeminiOptions documentation for more parameters
+)
+# Typically used like this (layout first, then potentially OCR on regions)
+layout_elements = page.apply_layout(engine='gemini', options=gemini_opts)
+# If Gemini also performed OCR or you want to OCR layout regions:
+# ocr_elements = some_region.apply_ocr(...)
+# It can sometimes be used directly if the model supports it, but less common:
+# try:
+#     ocr_elements = page.apply_ocr(engine='gemini', options=gemini_opts)
+# except Exception as e:
+#     print(f"Gemini might not be configured for direct OCR via apply_ocr: {e}")
 ```
 ## Applying OCR Directly
@@ -105,6 +125,9 @@ print(f"Found {len(ocr_elements)} text elements via OCR")
 title = page.find('text:contains("Title")')
 content_region = title.below(height=300)
 region_ocr_elements = content_region.apply_ocr(engine='paddle', languages=['en'])
+# Note: Re-applying OCR to the same page or region will remove any
+# previously generated OCR elements for that area before adding the new ones.
 ```
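The note about re-applying OCR can be pictured as a replace-in-area operation. The sketch below is only a model of that behaviour under stated assumptions: `TextElement`, `overlaps`, and `reapply_ocr` are hypothetical names, and using bounding-box overlap to decide which old OCR elements belong to "that area" is a guess, not the library's documented rule.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


@dataclass
class TextElement:
    text: str
    bbox: BBox
    source: str  # "native" for embedded PDF text, "ocr" for OCR output


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def reapply_ocr(
    elements: Sequence[TextElement],
    area: BBox,
    new_results: Sequence[Tuple[str, BBox]],
) -> List[TextElement]:
    """Drop old OCR elements inside `area`, then append the new OCR results."""
    kept = [
        el for el in elements
        if not (el.source == "ocr" and overlaps(el.bbox, area))
    ]
    return kept + [TextElement(text, bbox, "ocr") for text, bbox in new_results]


elements = [
    TextElement("Header", (0, 0, 100, 10), "native"),
    TextElement("old", (10, 20, 50, 30), "ocr"),
    TextElement("elsewhere", (10, 200, 50, 210), "ocr"),
]
result = reapply_ocr(elements, (0, 15, 100, 100), [("new", (10, 20, 50, 30))])
```

In this model, native text and OCR elements outside the area survive, so repeated runs on one region do not accumulate duplicates there.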
 ## OCR Engines
@@ -190,15 +213,39 @@ page.correct_ocr(correct)
 # You're done!
 ```
-## Debugging OCR
+## Interactive OCR Correction / Debugging
-```python
-from natural_pdf.utils.packaging import create_correction_task_package
+Natural PDF includes a utility to package a PDF and its detected elements, along with an interactive web application (SPA) for reviewing and correcting OCR results.
-create_correction_task_package(pdf, "original.zip", overwrite=True)
-```
+1.  **Package the data:**
+    Use the `create_correction_task_package` function to create a zip file containing the necessary data for the SPA.
+    ```python
+    from natural_pdf.utils.packaging import create_correction_task_package
+    # Assuming 'pdf' is your loaded PDF object after running apply_ocr or apply_layout
+    create_correction_task_package(pdf, "correction_package.zip", overwrite=True)
+    ```
+2.  **Run the SPA:**
+    The correction SPA is bundled with the library. You need to run a simple web server from the directory containing the SPA's files. The location of these files might depend on your installation, but you can typically find them within the installed `natural_pdf` package directory under `templates/spa`.
+    *Example using Python's built-in server (run from your terminal):*
+    ```bash
+    # Find the path to the installed natural_pdf package
+    # (asks Python where the package was imported from)
+    NATURAL_PDF_PATH=$(python -c "import os, natural_pdf; print(os.path.dirname(natural_pdf.__file__))")
+    # Navigate to the SPA directory
+    cd "$NATURAL_PDF_PATH/templates/spa"
+    # Start the web server (e.g., on port 8000)
+    python -m http.server 8000
+    ```
-This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
+3.  **Use the SPA:**
+    Open your web browser to `http://localhost:8000`. The SPA should load, allowing you to drag and drop the `correction_package.zip` file you created into the application to view and edit the OCR results.
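The locate-and-serve steps above could also be done from Python rather than the shell. This is a sketch using only the standard library; `package_dir` and `make_server` are hypothetical helpers, and the `templates/spa` subdirectory is taken from the description above rather than verified against an installed copy.

```python
import importlib.util
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def package_dir(package_name: str) -> Path:
    """Return the directory an importable package would be loaded from."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"Cannot locate package {package_name!r}")
    return Path(spec.origin).parent


def make_server(directory: Path, port: int = 8000) -> ThreadingHTTPServer:
    """Build a local-only static file server rooted at `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (assumes natural_pdf is installed and ships templates/spa):
# spa_dir = package_dir("natural_pdf") / "templates" / "spa"
# with make_server(spa_dir) as server:
#     print(f"Serving {spa_dir} at http://localhost:8000")
#     server.serve_forever()
```

Resolving the path through the import system avoids guessing which site-packages directory holds the package, which matters in virtual environments and user-level installs.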
 ## Next Steps
