PyPI - natural-pdf - Versions diffs - 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl - Mend

natural-pdf 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (134)

natural_pdf/__init__.py +3 -0
natural_pdf/analyzers/layout/base.py +1 -5
natural_pdf/analyzers/layout/gemini.py +61 -51
natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
natural_pdf/analyzers/layout/layout_manager.py +26 -84
natural_pdf/analyzers/layout/layout_options.py +7 -0
natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
natural_pdf/analyzers/layout/surya.py +46 -123
natural_pdf/analyzers/layout/tatr.py +51 -4
natural_pdf/analyzers/text_structure.py +3 -5
natural_pdf/analyzers/utils.py +3 -3
natural_pdf/classification/manager.py +422 -0
natural_pdf/classification/mixin.py +163 -0
natural_pdf/classification/results.py +80 -0
natural_pdf/collections/mixins.py +111 -0
natural_pdf/collections/pdf_collection.py +434 -15
natural_pdf/core/element_manager.py +83 -0
natural_pdf/core/highlighting_service.py +13 -22
natural_pdf/core/page.py +578 -93
natural_pdf/core/pdf.py +912 -460
natural_pdf/elements/base.py +134 -40
natural_pdf/elements/collections.py +712 -109
natural_pdf/elements/region.py +722 -69
natural_pdf/elements/text.py +4 -1
natural_pdf/export/mixin.py +137 -0
natural_pdf/exporters/base.py +3 -3
natural_pdf/exporters/paddleocr.py +5 -4
natural_pdf/extraction/manager.py +135 -0
natural_pdf/extraction/mixin.py +279 -0
natural_pdf/extraction/result.py +23 -0
natural_pdf/ocr/__init__.py +5 -5
natural_pdf/ocr/engine_doctr.py +346 -0
natural_pdf/ocr/engine_easyocr.py +6 -3
natural_pdf/ocr/ocr_factory.py +24 -4
natural_pdf/ocr/ocr_manager.py +122 -26
natural_pdf/ocr/ocr_options.py +94 -11
natural_pdf/ocr/utils.py +19 -6
natural_pdf/qa/document_qa.py +0 -4
natural_pdf/search/__init__.py +20 -34
natural_pdf/search/haystack_search_service.py +309 -265
natural_pdf/search/haystack_utils.py +99 -75
natural_pdf/search/search_service_protocol.py +11 -12
natural_pdf/selectors/parser.py +431 -230
natural_pdf/utils/debug.py +3 -3
natural_pdf/utils/identifiers.py +1 -1
natural_pdf/utils/locks.py +8 -0
natural_pdf/utils/packaging.py +8 -6
natural_pdf/utils/text_extraction.py +60 -1
natural_pdf/utils/tqdm_utils.py +51 -0
natural_pdf/utils/visualization.py +18 -0
natural_pdf/widgets/viewer.py +4 -25
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
natural_pdf-0.1.9.dist-info/RECORD +80 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
docs/api/index.md +0 -386
docs/assets/favicon.png +0 -3
docs/assets/favicon.svg +0 -3
docs/assets/javascripts/custom.js +0 -17
docs/assets/logo.svg +0 -3
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +0 -17
docs/assets/social-preview.svg +0 -17
docs/assets/stylesheets/custom.css +0 -65
docs/document-qa/index.ipynb +0 -435
docs/document-qa/index.md +0 -79
docs/element-selection/index.ipynb +0 -915
docs/element-selection/index.md +0 -229
docs/finetuning/index.md +0 -176
docs/index.md +0 -170
docs/installation/index.md +0 -69
docs/interactive-widget/index.ipynb +0 -962
docs/interactive-widget/index.md +0 -12
docs/layout-analysis/index.ipynb +0 -818
docs/layout-analysis/index.md +0 -185
docs/ocr/index.md +0 -209
docs/pdf-navigation/index.ipynb +0 -314
docs/pdf-navigation/index.md +0 -97
docs/regions/index.ipynb +0 -816
docs/regions/index.md +0 -294
docs/tables/index.ipynb +0 -658
docs/tables/index.md +0 -144
docs/text-analysis/index.ipynb +0 -370
docs/text-analysis/index.md +0 -105
docs/text-extraction/index.ipynb +0 -1478
docs/text-extraction/index.md +0 -292
docs/tutorials/01-loading-and-extraction.ipynb +0 -194
docs/tutorials/01-loading-and-extraction.md +0 -95
docs/tutorials/02-finding-elements.ipynb +0 -340
docs/tutorials/02-finding-elements.md +0 -149
docs/tutorials/03-extracting-blocks.ipynb +0 -147
docs/tutorials/03-extracting-blocks.md +0 -48
docs/tutorials/04-table-extraction.ipynb +0 -114
docs/tutorials/04-table-extraction.md +0 -50
docs/tutorials/05-excluding-content.ipynb +0 -270
docs/tutorials/05-excluding-content.md +0 -109
docs/tutorials/06-document-qa.ipynb +0 -332
docs/tutorials/06-document-qa.md +0 -91
docs/tutorials/07-layout-analysis.ipynb +0 -288
docs/tutorials/07-layout-analysis.md +0 -66
docs/tutorials/07-working-with-regions.ipynb +0 -413
docs/tutorials/07-working-with-regions.md +0 -151
docs/tutorials/08-spatial-navigation.ipynb +0 -508
docs/tutorials/08-spatial-navigation.md +0 -190
docs/tutorials/09-section-extraction.ipynb +0 -2434
docs/tutorials/09-section-extraction.md +0 -256
docs/tutorials/10-form-field-extraction.ipynb +0 -512
docs/tutorials/10-form-field-extraction.md +0 -201
docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
docs/tutorials/11-enhanced-table-processing.md +0 -9
docs/tutorials/12-ocr-integration.ipynb +0 -604
docs/tutorials/12-ocr-integration.md +0 -175
docs/tutorials/13-semantic-search.ipynb +0 -1328
docs/tutorials/13-semantic-search.md +0 -77
docs/visual-debugging/index.ipynb +0 -2970
docs/visual-debugging/index.md +0 -157
docs/visual-debugging/region.png +0 -0
natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
natural_pdf/templates/spa/css/style.css +0 -334
natural_pdf/templates/spa/index.html +0 -31
natural_pdf/templates/spa/js/app.js +0 -472
natural_pdf/templates/spa/words.txt +0 -235976
natural_pdf/widgets/frontend/viewer.js +0 -88
natural_pdf-0.1.7.dist-info/RECORD +0 -145
notebooks/Examples.ipynb +0 -1293
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +0 -543
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
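
Each entry in the list above follows the pattern `path +added -removed`. A minimal sketch of how such lines could be tallied is shown below. The regular expression, the `parse_stat_lines` helper, and the four-line sample are illustrative assumptions and are not part of the Mend page or of `natural-pdf`; rename entries are kept as a single path string.

```python
import re

# Matches summary lines such as "natural_pdf/core/pdf.py +912 -460".
# The path part may itself contain spaces (for example rename entries).
STAT_LINE = re.compile(r"^(?P<path>.+) \+(?P<added>\d+) -(?P<removed>\d+)$")


def parse_stat_lines(lines):
    """Return (path, added, removed) tuples for lines in the summary format."""
    stats = []
    for line in lines:
        match = STAT_LINE.match(line.strip())
        if match:
            stats.append((match["path"], int(match["added"]), int(match["removed"])))
    return stats


# A small sample copied from the list above (not the full 134 entries).
sample = [
    "natural_pdf/__init__.py +3 -0",
    "natural_pdf/core/pdf.py +912 -460",
    "docs/tutorials/03-extracting-blocks.md +0 -48",
    "{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1",
]

stats = parse_stat_lines(sample)
total_added = sum(added for _, added, _ in stats)
total_removed = sum(removed for _, _, removed in stats)

# Entries with no additions and at least one removal, e.g. deleted text files.
removed_only = [path for path, added, removed in stats if added == 0 and removed > 0]

print(total_added, total_removed, removed_only)
```

Entries shown as `+0 -0` (binary files and empty placeholders) still parse, but they carry no line counts, so they do not appear in `removed_only`.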

docs/tutorials/03-extracting-blocks.md DELETED

@@ -1,48 +0,0 @@
-# Extracting Text Blocks
-Often, you need a specific section, like a paragraph between two headings. You can find a starting element and select everything below it until an ending element.
-Let's extract the "Summary" section from `01-practice.pdf`. It starts after "Summary:" and ends before the thick horizontal line.
-```python
-#%pip install "natural-pdf[all]"
-```
-```python
-from natural_pdf import PDF
-# Load the PDF and get the page
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-page = pdf.pages[0]
-# Find the starting element ("Summary:")
-start_marker = page.find('text:contains("Summary:")')
-# Select elements below the start_marker, stopping *before*
-# the thick horizontal line (a line with height > 1).
-summary_elements = start_marker.below(
-    include_element=True, # Include the "Summary:" text itself
-    until="line[height > 1]"
-)
-# Visualize the elements found in this block
-summary_elements.highlight(color="lightgreen", label="Summary Block")
-# Extract and display the text from the collection of summary elements
-summary_elements.extract_text()
-```
-```python
-# Display the page image to see the visualization
-page.to_image()
-```
-This selects the elements using `.below(until=...)` and extracts their text. The second code block displays the page image with the visualized section.
-<div class="admonition note">
-<p class="admonition-title">Selector Specificity</p>
-    We used `line[height > 1]` to find the thick horizontal line. You might need to adjust selectors based on the specific PDF structure. Inspecting element properties can help you find reliable start and end markers.
-</div>
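
The "start marker, then everything below it until a stop marker" idea from the deleted tutorial can be illustrated without the library. The sketch below is an assumption-laden stand-in, not how `natural-pdf` implements `.below(until=...)`: elements are plain dictionaries with a hypothetical `top` coordinate, `select_below` is an invented helper, and the sample elements are made up rather than taken from `01-practice.pdf`.

```python
def select_below(elements, start, is_stop, include_start=True):
    """Collect elements at or below `start`, stopping before the first stop marker.

    `elements` are dicts with a "top" coordinate (smaller means higher on the page).
    `is_stop` is a predicate playing the role of a selector like line[height > 1].
    """
    ordered = sorted(elements, key=lambda el: el["top"])
    selected = []
    for el in ordered:
        if el["top"] < start["top"]:
            continue  # above the start marker
        if el is start:
            if include_start:
                selected.append(el)
            continue
        if is_stop(el):
            break  # stop *before* the end marker
        selected.append(el)
    return selected


# Hypothetical page content for illustration only.
elements = [
    {"kind": "text", "top": 10, "text": "Header text"},
    {"kind": "text", "top": 40, "text": "Summary:"},
    {"kind": "text", "top": 55, "text": "First line."},
    {"kind": "line", "top": 62, "height": 0.5},  # thin rule, not a stop marker
    {"kind": "text", "top": 70, "text": "Second line."},
    {"kind": "line", "top": 90, "height": 2},    # thick rule, ends the block
    {"kind": "text", "top": 100, "text": "After the rule"},
]

start = next(el for el in elements if el.get("text") == "Summary:")
block = select_below(
    elements,
    start,
    is_stop=lambda el: el["kind"] == "line" and el["height"] > 1,
)
block_text = "\n".join(el["text"] for el in block if el["kind"] == "text")
print(block_text)
```

The thin rule at `top=62` is kept in the block because it fails the `height > 1` test, which mirrors the tutorial's point that the stop selector has to be specific enough to match only the intended boundary.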

docs/tutorials/04-table-extraction.ipynb DELETED

@@ -1,114 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "24111eee",
-   "metadata": {},
-   "source": [
-    "# Basic Table Extraction\n",
-    "\n",
-    "PDFs often contain tables, and `natural-pdf` provides methods to extract their data, building on `pdfplumber`'s capabilities.\n",
-    "\n",
-    "Let's extract the \"Violations\" table from our practice PDF."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "75f17900",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-21T21:23:59.967091Z",
-     "iopub.status.busy": "2025-04-21T21:23:59.966933Z",
-     "iopub.status.idle": "2025-04-21T21:23:59.971753Z",
-     "shell.execute_reply": "2025-04-21T21:23:59.970980Z"
-    },
-    "lines_to_next_cell": 2
-   },
-   "outputs": [],
-   "source": [
-    "#%pip install \"natural-pdf[all]\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "f1b71280",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-21T21:23:59.974183Z",
-     "iopub.status.busy": "2025-04-21T21:23:59.973996Z",
-     "iopub.status.idle": "2025-04-21T21:24:06.847197Z",
-     "shell.execute_reply": "2025-04-21T21:24:06.846712Z"
-    }
-   },
-   "outputs": [],
-   "source": [
-    "from natural_pdf import PDF\n",
-    "\n",
-    "# Load a PDF\n",
-    "pdf = PDF(\"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf\")\n",
-    "page = pdf.pages[0]\n",
-    "\n",
-    "# Use extract_tables() to find all tables on the page.\n",
-    "# It returns a list of tables, where each table is a list of lists.\n",
-    "tables_data = page.extract_tables()\n",
-    "\n",
-    "# Display the first table found\n",
-    "tables_data[0] if tables_data else \"No tables found\"\n",
-    "\n",
-    "# You can also visualize the general area of the first table \n",
-    "# by finding elements in that region\n",
-    "if tables_data:\n",
-    "    # Find a header element in the table\n",
-    "    statute_header = page.find('text:contains(\"Statute\")')\n",
-    "    if statute_header:\n",
-    "        # Show the area\n",
-    "        statute_header.below(height=100).highlight(color=\"green\", label=\"Table Area\")\n",
-    "        page.to_image()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5c80e397",
-   "metadata": {},
-   "source": [
-    "This code uses `page.extract_tables()` which attempts to automatically detect tables based on visual cues like lines and whitespace. The result is a list of lists, representing the rows and cells of the table.\n",
-    "\n",
-    "<div class=\"admonition note\">\n",
-    "<p class=\"admonition-title\">Table Settings and Limitations</p>\n",
-    "\n",
-    "    The default `extract_tables()` works well for simple, clearly defined tables. However, it might struggle with:\n",
-    "    *   Tables without clear borders or lines.\n",
-    "    *   Complex merged cells.\n",
-    "    *   Tables spanning multiple pages.\n",
-    "\n",
-    "    `pdfplumber` (and thus `natural-pdf`) allows passing `table_settings` dictionaries to `extract_tables()` for more control over the detection strategy (e.g., `\"vertical_strategy\": \"text\"`, `\"horizontal_strategy\": \"text\"`).\n",
-    "\n",
-    "    For even more robust table detection, especially for tables without explicit lines, using Layout Analysis (like `page.analyze_layout(engine='tatr')`) first, finding the table `region`, and then calling `region.extract_table()` can yield better results. We'll explore layout analysis in a later tutorial.\n",
-    "</div> "
-   ]
-  }
- ],
- "metadata": {
-  "jupytext": {
-   "cell_metadata_filter": "-all",
-   "main_language": "python",
-   "notebook_metadata_filter": "-all"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.13"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}

docs/tutorials/04-table-extraction.md DELETED

@@ -1,50 +0,0 @@
-# Basic Table Extraction
-PDFs often contain tables, and `natural-pdf` provides methods to extract their data, building on `pdfplumber`'s capabilities.
-Let's extract the "Violations" table from our practice PDF.
-```python
-#%pip install "natural-pdf[all]"
-```
-```python
-from natural_pdf import PDF
-# Load a PDF
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-page = pdf.pages[0]
-# Use extract_tables() to find all tables on the page.
-# It returns a list of tables, where each table is a list of lists.
-tables_data = page.extract_tables()
-# Display the first table found
-tables_data[0] if tables_data else "No tables found"
-# You can also visualize the general area of the first table
-# by finding elements in that region
-if tables_data:
-    # Find a header element in the table
-    statute_header = page.find('text:contains("Statute")')
-    if statute_header:
-        # Show the area
-        statute_header.below(height=100).highlight(color="green", label="Table Area")
-        page.to_image()
-```
-This code uses `page.extract_tables()` which attempts to automatically detect tables based on visual cues like lines and whitespace. The result is a list of lists, representing the rows and cells of the table.
-<div class="admonition note">
-<p class="admonition-title">Table Settings and Limitations</p>
-    The default `extract_tables()` works well for simple, clearly defined tables. However, it might struggle with:
-    *   Tables without clear borders or lines.
-    *   Complex merged cells.
-    *   Tables spanning multiple pages.
-    `pdfplumber` (and thus `natural-pdf`) allows passing `table_settings` dictionaries to `extract_tables()` for more control over the detection strategy (e.g., `"vertical_strategy": "text"`, `"horizontal_strategy": "text"`).
-    For even more robust table detection, especially for tables without explicit lines, using Layout Analysis (like `page.analyze_layout(engine='tatr')`) first, finding the table `region`, and then calling `region.extract_table()` can yield better results. We'll explore layout analysis in a later tutorial.
-</div>
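
The deleted tutorial describes the result of `extract_tables()` as a list of tables, each a list of lists. A minimal sketch of post-processing one such table with only the standard library follows. The helper names `table_to_records` and `table_to_csv` and the sample rows are hypothetical, and the sketch assumes the first row is a header and that empty cells may come back as `None`.

```python
import csv
import io


def table_to_records(table):
    """Turn a list-of-lists table into a list of dicts keyed by the header row."""
    if not table:
        return []
    header, *rows = table
    keys = [(cell or "").strip() for cell in header]
    return [
        dict(zip(keys, [(cell or "").strip() for cell in row]))
        for row in rows
    ]


def table_to_csv(table):
    """Serialize a list-of-lists table to CSV text, writing None cells as empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(
        ["" if cell is None else cell for cell in row] for row in table
    )
    return buffer.getvalue()


# Hypothetical stand-in for tables_data[0]; not real output from 01-practice.pdf.
sample_table = [
    ["Statute", "Description"],
    ["1.2.3", "Example violation"],
    ["4.5.6", None],
]

records = table_to_records(sample_table)
csv_text = table_to_csv(sample_table)
print(records)
print(csv_text)
```

Keeping the conversion separate from extraction means the same helpers apply whether the rows came from `page.extract_tables()` or from a region-based `extract_table()` call, as long as the shape is a list of rows.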
