PyPI - natural-pdf - Versions diffs - 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (134)

natural_pdf/__init__.py +3 -0
natural_pdf/analyzers/layout/base.py +1 -5
natural_pdf/analyzers/layout/gemini.py +61 -51
natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
natural_pdf/analyzers/layout/layout_manager.py +26 -84
natural_pdf/analyzers/layout/layout_options.py +7 -0
natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
natural_pdf/analyzers/layout/surya.py +46 -123
natural_pdf/analyzers/layout/tatr.py +51 -4
natural_pdf/analyzers/text_structure.py +3 -5
natural_pdf/analyzers/utils.py +3 -3
natural_pdf/classification/manager.py +422 -0
natural_pdf/classification/mixin.py +163 -0
natural_pdf/classification/results.py +80 -0
natural_pdf/collections/mixins.py +111 -0
natural_pdf/collections/pdf_collection.py +434 -15
natural_pdf/core/element_manager.py +83 -0
natural_pdf/core/highlighting_service.py +13 -22
natural_pdf/core/page.py +578 -93
natural_pdf/core/pdf.py +912 -460
natural_pdf/elements/base.py +134 -40
natural_pdf/elements/collections.py +712 -109
natural_pdf/elements/region.py +722 -69
natural_pdf/elements/text.py +4 -1
natural_pdf/export/mixin.py +137 -0
natural_pdf/exporters/base.py +3 -3
natural_pdf/exporters/paddleocr.py +5 -4
natural_pdf/extraction/manager.py +135 -0
natural_pdf/extraction/mixin.py +279 -0
natural_pdf/extraction/result.py +23 -0
natural_pdf/ocr/__init__.py +5 -5
natural_pdf/ocr/engine_doctr.py +346 -0
natural_pdf/ocr/engine_easyocr.py +6 -3
natural_pdf/ocr/ocr_factory.py +24 -4
natural_pdf/ocr/ocr_manager.py +122 -26
natural_pdf/ocr/ocr_options.py +94 -11
natural_pdf/ocr/utils.py +19 -6
natural_pdf/qa/document_qa.py +0 -4
natural_pdf/search/__init__.py +20 -34
natural_pdf/search/haystack_search_service.py +309 -265
natural_pdf/search/haystack_utils.py +99 -75
natural_pdf/search/search_service_protocol.py +11 -12
natural_pdf/selectors/parser.py +431 -230
natural_pdf/utils/debug.py +3 -3
natural_pdf/utils/identifiers.py +1 -1
natural_pdf/utils/locks.py +8 -0
natural_pdf/utils/packaging.py +8 -6
natural_pdf/utils/text_extraction.py +60 -1
natural_pdf/utils/tqdm_utils.py +51 -0
natural_pdf/utils/visualization.py +18 -0
natural_pdf/widgets/viewer.py +4 -25
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
natural_pdf-0.1.9.dist-info/RECORD +80 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
docs/api/index.md +0 -386
docs/assets/favicon.png +0 -3
docs/assets/favicon.svg +0 -3
docs/assets/javascripts/custom.js +0 -17
docs/assets/logo.svg +0 -3
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +0 -17
docs/assets/social-preview.svg +0 -17
docs/assets/stylesheets/custom.css +0 -65
docs/document-qa/index.ipynb +0 -435
docs/document-qa/index.md +0 -79
docs/element-selection/index.ipynb +0 -915
docs/element-selection/index.md +0 -229
docs/finetuning/index.md +0 -176
docs/index.md +0 -170
docs/installation/index.md +0 -69
docs/interactive-widget/index.ipynb +0 -962
docs/interactive-widget/index.md +0 -12
docs/layout-analysis/index.ipynb +0 -818
docs/layout-analysis/index.md +0 -185
docs/ocr/index.md +0 -209
docs/pdf-navigation/index.ipynb +0 -314
docs/pdf-navigation/index.md +0 -97
docs/regions/index.ipynb +0 -816
docs/regions/index.md +0 -294
docs/tables/index.ipynb +0 -658
docs/tables/index.md +0 -144
docs/text-analysis/index.ipynb +0 -370
docs/text-analysis/index.md +0 -105
docs/text-extraction/index.ipynb +0 -1478
docs/text-extraction/index.md +0 -292
docs/tutorials/01-loading-and-extraction.ipynb +0 -194
docs/tutorials/01-loading-and-extraction.md +0 -95
docs/tutorials/02-finding-elements.ipynb +0 -340
docs/tutorials/02-finding-elements.md +0 -149
docs/tutorials/03-extracting-blocks.ipynb +0 -147
docs/tutorials/03-extracting-blocks.md +0 -48
docs/tutorials/04-table-extraction.ipynb +0 -114
docs/tutorials/04-table-extraction.md +0 -50
docs/tutorials/05-excluding-content.ipynb +0 -270
docs/tutorials/05-excluding-content.md +0 -109
docs/tutorials/06-document-qa.ipynb +0 -332
docs/tutorials/06-document-qa.md +0 -91
docs/tutorials/07-layout-analysis.ipynb +0 -288
docs/tutorials/07-layout-analysis.md +0 -66
docs/tutorials/07-working-with-regions.ipynb +0 -413
docs/tutorials/07-working-with-regions.md +0 -151
docs/tutorials/08-spatial-navigation.ipynb +0 -508
docs/tutorials/08-spatial-navigation.md +0 -190
docs/tutorials/09-section-extraction.ipynb +0 -2434
docs/tutorials/09-section-extraction.md +0 -256
docs/tutorials/10-form-field-extraction.ipynb +0 -512
docs/tutorials/10-form-field-extraction.md +0 -201
docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
docs/tutorials/11-enhanced-table-processing.md +0 -9
docs/tutorials/12-ocr-integration.ipynb +0 -604
docs/tutorials/12-ocr-integration.md +0 -175
docs/tutorials/13-semantic-search.ipynb +0 -1328
docs/tutorials/13-semantic-search.md +0 -77
docs/visual-debugging/index.ipynb +0 -2970
docs/visual-debugging/index.md +0 -157
docs/visual-debugging/region.png +0 -0
natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
natural_pdf/templates/spa/css/style.css +0 -334
natural_pdf/templates/spa/index.html +0 -31
natural_pdf/templates/spa/js/app.js +0 -472
natural_pdf/templates/spa/words.txt +0 -235976
natural_pdf/widgets/frontend/viewer.js +0 -88
natural_pdf-0.1.7.dist-info/RECORD +0 -145
notebooks/Examples.ipynb +0 -1293
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +0 -543
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0

docs/tutorials/10-form-field-extraction.md DELETED

@@ -1,201 +0,0 @@
-# Form Field Extraction
-Business documents like invoices, forms, and applications contain field-value pairs that need to be extracted. This tutorial shows how to identify and extract these form fields.
-```python
-#%pip install "natural-pdf[all]"
-```
-```python
-from natural_pdf import PDF
-# Load a PDF
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-page = pdf.pages[0]
-# Find fields with labels ending in colon
-labels = page.find_all('text:contains(":")')
-# Visualize the found labels
-labels.show(color="blue", label="Field Labels")
-# Count how many potential fields we found
-len(labels)
-```
-## Extracting Field Values
-```python
-# Extract the value for each field label
-form_data = {}
-for label in labels:
-    # Clean up the label text
-    field_name = label.text.strip().rstrip(':')
-    # Find the value to the right of the label
-    value_region = label.right(width=200)
-    value = value_region.extract_text().strip()
-    # Store in our dictionary
-    form_data[field_name] = value
-# Display the extracted data
-form_data
-```
-## Visualizing Labels and Values
-```python
-# Clear previous highlights
-page.clear_highlights()
-# Highlight both labels and their values
-for label in labels:
-    # Highlight the label in red
-    label.show(color="red", label="Label")
-    # Highlight the value area in blue
-    label.right(width=200).show(color="blue", label="Value")
-# Show the page image with highlighted elements
-page.to_image()
-```
-## Handling Multi-line Values
-```python
-# Extract values that might span multiple lines
-multi_line_data = {}
-for label in labels:
-    # Get the field name
-    field_name = label.text.strip().rstrip(':')
-    # Look both to the right and below
-    right_value = label.right(width=200).extract_text().strip()
-    below_value = label.below(height=50).extract_text().strip()
-    # Combine the values if they're different
-    if right_value in below_value:
-        value = below_value
-    else:
-        value = f"{right_value} {below_value}".strip()
-    # Add to results
-    multi_line_data[field_name] = value
-# Show fields with potential multi-line values
-multi_line_data
-```
-## Finding Pattern-Based Fields
-```python
-import re
-# Find dates in the format "July 31, YYYY"
-date_pattern = r'\b\w+ \d+, \d\d\d\d\b'
-# Search all text elements for dates
-text_elements = page.find_all('text')
-print([elem.text for elem in text_elements])
-dates = text_elements.filter(lambda elem: re.search(date_pattern, elem.text))
-# Visualize the date fields
-dates.show(color="green", label="Date")
-# Extract just the date values
-date_texts = [re.search(date_pattern, elem.text).group(0) for elem in dates]
-date_texts
-```
-## Working with Form Tables
-```python
-# Run layout analysis to find table structures
-page.analyze_layout()
-# Find possible form tables
-tables = page.find_all('region[type=table]')
-if tables:
-    # Visualize the tables
-    tables.show(color="purple", label="Form Table")
-    # Extract data from the first table
-    first_table = tables[0]
-    table_data = first_table.extract_table()
-    table_data
-else:
-    # Try to find form-like structure using text alignment
-    # Create a region where a form might be
-    form_region = page.create_region(50, 200, page.width - 50, 500)
-    # Group text by vertical position
-    rows = {}
-    text_elements = form_region.find_all('text')
-    for elem in text_elements:
-        # Round y-position to group elements in the same row
-        row_pos = round(elem.top / 5) * 5
-        if row_pos not in rows:
-            rows[row_pos] = []
-        rows[row_pos].append(elem)
-    # Extract data from rows (first 5 rows)
-    row_data = []
-    for y in sorted(rows.keys())[:5]:
-        # Sort elements by x-position (left to right)
-        elements = sorted(rows[y], key=lambda e: e.x0)
-        # Show the row
-        row_box = form_region.create_region(
-            min(e.x0 for e in elements),
-            min(e.top for e in elements),
-            max(e.x1 for e in elements),
-            max(e.bottom for e in elements)
-        )
-        row_box.show(color=None, use_color_cycling=True)
-        # Extract text from row
-        row_text = [e.text for e in elements]
-        row_data.append(row_text)
-    # Show the extracted rows
-    row_data
-```
-## Combining Different Extraction Techniques
-```python
-# Combine label-based and pattern-based extraction
-all_fields = {}
-# 1. First get fields with explicit labels
-for label in labels:
-    field_name = label.text.strip().rstrip(':')
-    value = label.right(width=200).extract_text().strip()
-    all_fields[field_name] = value
-# 2. Add date fields that we found with pattern matching
-for date_elem in dates:
-    # Find the nearest label
-    nearby_label = date_elem.nearest('text:contains(":")')
-    if nearby_label:
-        # Extract the label text
-        label_text = nearby_label.text.strip().rstrip(':')
-        # Get the date value
-        date_value = re.search(date_pattern, date_elem.text).group(0)
-        # Add to our results if not already present
-        if label_text not in all_fields:
-            all_fields[label_text] = date_value
-# Show all extracted fields
-all_fields
-```
-Form field extraction enables you to automate data entry and document processing. By combining different techniques like label detection, spatial navigation, and pattern matching, you can handle a wide variety of form layouts.
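
Note on the pattern-matching step in the deleted tutorial above: the date regex can be tried on its own, without natural-pdf or a PDF. The sketch below is illustrative only and was not part of either package version. It reuses the tutorial's pattern unchanged, while the sample strings and the `extract_dates` helper are hypothetical stand-ins for the `elem.text` values the tutorial reads from the page.

```python
import re

# Same pattern as the tutorial: a word, a day number, a comma, a four-digit year.
DATE_PATTERN = re.compile(r"\b\w+ \d+, \d\d\d\d\b")

# Hypothetical strings standing in for elem.text values (not taken from the sample PDF).
sample_texts = [
    "Date: February 3, 1905",
    "Violation Count: 7",
    "Inspected on July 31, 2024 by staff",
]


def extract_dates(texts):
    """Return the first date-like substring from each text that contains one."""
    found = []
    for text in texts:
        match = DATE_PATTERN.search(text)
        if match:
            found.append(match.group(0))
    return found


dates = extract_dates(sample_texts)
print(dates)
```

Because `\w+` accepts any word rather than only month names, a string such as "Room 12, 2024" would also match. A stricter variant could list the month names explicitly, at the cost of a longer pattern.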

docs/tutorials/11-enhanced-table-processing.ipynb DELETED

@@ -1,54 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "7674e123",
-   "metadata": {},
-   "source": [
-    "# Enhanced Table Processing\n",
-    "\n",
-    "Tables are a common way to present structured data in documents, but they can be challenging to extract correctly. This tutorial demonstrates advanced techniques for working with tables in natural-pdf.\n",
-    "\n",
-    "TK"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "08c7c5f0",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-21T21:25:37.324499Z",
-     "iopub.status.busy": "2025-04-21T21:25:37.324337Z",
-     "iopub.status.idle": "2025-04-21T21:25:37.328739Z",
-     "shell.execute_reply": "2025-04-21T21:25:37.328344Z"
-    }
-   },
-   "outputs": [],
-   "source": [
-    "#%pip install \"natural-pdf[all]\""
-   ]
-  }
- ],
- "metadata": {
-  "jupytext": {
-   "cell_metadata_filter": "-all",
-   "main_language": "python",
-   "notebook_metadata_filter": "-all"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.13"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}

docs/tutorials/11-enhanced-table-processing.md DELETED

@@ -1,9 +0,0 @@
-# Enhanced Table Processing
-Tables are a common way to present structured data in documents, but they can be challenging to extract correctly. This tutorial demonstrates advanced techniques for working with tables in natural-pdf.
-TK
-```python
-#%pip install "natural-pdf[all]"
-```
