PyPI - natural-pdf - Versions diffs - 0.1.8__py3-none-any.whl → 0.1.9__py3-none-any.whl - Mend

natural-pdf 0.1.8py3-none-any.whl → 0.1.9py3-none-any.whl

Files changed (134)

natural_pdf/__init__.py +1 -0
natural_pdf/analyzers/layout/base.py +1 -5
natural_pdf/analyzers/layout/gemini.py +61 -51
natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
natural_pdf/analyzers/layout/layout_manager.py +26 -84
natural_pdf/analyzers/layout/layout_options.py +7 -0
natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
natural_pdf/analyzers/layout/surya.py +46 -123
natural_pdf/analyzers/layout/tatr.py +51 -4
natural_pdf/analyzers/text_structure.py +3 -5
natural_pdf/analyzers/utils.py +3 -3
natural_pdf/classification/manager.py +230 -151
natural_pdf/classification/mixin.py +49 -35
natural_pdf/classification/results.py +64 -46
natural_pdf/collections/mixins.py +68 -20
natural_pdf/collections/pdf_collection.py +177 -64
natural_pdf/core/element_manager.py +30 -14
natural_pdf/core/highlighting_service.py +13 -22
natural_pdf/core/page.py +423 -101
natural_pdf/core/pdf.py +633 -190
natural_pdf/elements/base.py +134 -40
natural_pdf/elements/collections.py +503 -131
natural_pdf/elements/region.py +659 -90
natural_pdf/elements/text.py +1 -1
natural_pdf/export/mixin.py +137 -0
natural_pdf/exporters/base.py +3 -3
natural_pdf/exporters/paddleocr.py +4 -3
natural_pdf/extraction/manager.py +50 -49
natural_pdf/extraction/mixin.py +90 -57
natural_pdf/extraction/result.py +9 -23
natural_pdf/ocr/__init__.py +5 -5
natural_pdf/ocr/engine_doctr.py +346 -0
natural_pdf/ocr/ocr_factory.py +24 -4
natural_pdf/ocr/ocr_manager.py +61 -25
natural_pdf/ocr/ocr_options.py +70 -10
natural_pdf/ocr/utils.py +6 -4
natural_pdf/search/__init__.py +20 -34
natural_pdf/search/haystack_search_service.py +309 -265
natural_pdf/search/haystack_utils.py +99 -75
natural_pdf/search/search_service_protocol.py +11 -12
natural_pdf/selectors/parser.py +219 -143
natural_pdf/utils/debug.py +3 -3
natural_pdf/utils/identifiers.py +1 -1
natural_pdf/utils/locks.py +1 -1
natural_pdf/utils/packaging.py +8 -6
natural_pdf/utils/text_extraction.py +24 -16
natural_pdf/utils/tqdm_utils.py +18 -10
natural_pdf/utils/visualization.py +18 -0
natural_pdf/widgets/viewer.py +4 -25
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +12 -3
natural_pdf-0.1.9.dist-info/RECORD +80 -0
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
docs/api/index.md +0 -386
docs/assets/favicon.png +0 -3
docs/assets/favicon.svg +0 -3
docs/assets/javascripts/custom.js +0 -17
docs/assets/logo.svg +0 -3
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +0 -17
docs/assets/social-preview.svg +0 -17
docs/assets/stylesheets/custom.css +0 -65
docs/categorizing-documents/index.md +0 -168
docs/data-extraction/index.md +0 -87
docs/document-qa/index.ipynb +0 -435
docs/document-qa/index.md +0 -79
docs/element-selection/index.ipynb +0 -969
docs/element-selection/index.md +0 -249
docs/finetuning/index.md +0 -176
docs/index.md +0 -189
docs/installation/index.md +0 -69
docs/interactive-widget/index.ipynb +0 -962
docs/interactive-widget/index.md +0 -12
docs/layout-analysis/index.ipynb +0 -818
docs/layout-analysis/index.md +0 -185
docs/ocr/index.md +0 -256
docs/pdf-navigation/index.ipynb +0 -314
docs/pdf-navigation/index.md +0 -97
docs/regions/index.ipynb +0 -816
docs/regions/index.md +0 -294
docs/tables/index.ipynb +0 -658
docs/tables/index.md +0 -144
docs/text-analysis/index.ipynb +0 -370
docs/text-analysis/index.md +0 -105
docs/text-extraction/index.ipynb +0 -1478
docs/text-extraction/index.md +0 -292
docs/tutorials/01-loading-and-extraction.ipynb +0 -1873
docs/tutorials/01-loading-and-extraction.md +0 -95
docs/tutorials/02-finding-elements.ipynb +0 -417
docs/tutorials/02-finding-elements.md +0 -149
docs/tutorials/03-extracting-blocks.ipynb +0 -152
docs/tutorials/03-extracting-blocks.md +0 -48
docs/tutorials/04-table-extraction.ipynb +0 -119
docs/tutorials/04-table-extraction.md +0 -50
docs/tutorials/05-excluding-content.ipynb +0 -275
docs/tutorials/05-excluding-content.md +0 -109
docs/tutorials/06-document-qa.ipynb +0 -337
docs/tutorials/06-document-qa.md +0 -91
docs/tutorials/07-layout-analysis.ipynb +0 -293
docs/tutorials/07-layout-analysis.md +0 -66
docs/tutorials/07-working-with-regions.ipynb +0 -414
docs/tutorials/07-working-with-regions.md +0 -151
docs/tutorials/08-spatial-navigation.ipynb +0 -513
docs/tutorials/08-spatial-navigation.md +0 -190
docs/tutorials/09-section-extraction.ipynb +0 -2439
docs/tutorials/09-section-extraction.md +0 -256
docs/tutorials/10-form-field-extraction.ipynb +0 -517
docs/tutorials/10-form-field-extraction.md +0 -201
docs/tutorials/11-enhanced-table-processing.ipynb +0 -59
docs/tutorials/11-enhanced-table-processing.md +0 -9
docs/tutorials/12-ocr-integration.ipynb +0 -3712
docs/tutorials/12-ocr-integration.md +0 -137
docs/tutorials/13-semantic-search.ipynb +0 -1718
docs/tutorials/13-semantic-search.md +0 -77
docs/visual-debugging/index.ipynb +0 -2970
docs/visual-debugging/index.md +0 -157
docs/visual-debugging/region.png +0 -0
natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -420
natural_pdf/templates/spa/css/style.css +0 -334
natural_pdf/templates/spa/index.html +0 -31
natural_pdf/templates/spa/js/app.js +0 -472
natural_pdf/templates/spa/words.txt +0 -235976
natural_pdf/widgets/frontend/viewer.js +0 -88
natural_pdf-0.1.8.dist-info/RECORD +0 -156
notebooks/Examples.ipynb +0 -1293
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +0 -543
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0

docs/pdf-navigation/index.ipynb DELETED

@@ -1,314 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "bba1860e",
-   "metadata": {},
-   "source": [
-    "# PDF Navigation\n",
-    "\n",
-    "This guide covers the basics of working with PDFs in Natural PDF - opening documents, accessing pages, and navigating through content.\n",
-    "\n",
-    "## Opening a PDF\n",
-    "\n",
-    "The main entry point to Natural PDF is the `PDF` class:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "56d12ab5",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-03T14:50:38.434157Z",
-     "iopub.status.busy": "2025-04-03T14:50:38.433170Z",
-     "iopub.status.idle": "2025-04-03T14:50:49.768101Z",
-     "shell.execute_reply": "2025-04-03T14:50:49.767384Z"
-    }
-   },
-   "outputs": [],
-   "source": [
-    "from natural_pdf import PDF\n",
-    "\n",
-    "# Open a PDF file\n",
-    "pdf = PDF(\"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/0500000US42001.pdf\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c425482a",
-   "metadata": {},
-   "source": [
-    "## Accessing Pages\n",
-    "\n",
-    "Once you have a PDF object, you can access its pages:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "a3405aa9",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-03T14:50:49.770604Z",
-     "iopub.status.busy": "2025-04-03T14:50:49.770419Z",
-     "iopub.status.idle": "2025-04-03T14:50:50.700808Z",
-     "shell.execute_reply": "2025-04-03T14:50:50.699634Z"
-    }
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "This PDF has 153 pages\n",
-      "Page 1 has 985 characters\n",
-      "Page 2 has 778 characters\n",
-      "Page 3 has 522 characters\n",
-      "Page 4 has 984 characters\n",
-      "Page 5 has 778 characters\n",
-      "Page 6 has 523 characters\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Page 7 has 982 characters\n",
-      "Page 8 has 772 characters\n",
-      "Page 9 has 522 characters\n",
-      "Page 10 has 1008 characters\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Page 11 has 796 characters\n",
-      "Page 12 has 532 characters\n",
-      "Page 13 has 986 characters\n",
-      "Page 14 has 780 characters\n",
-      "Page 15 has 523 characters\n",
-      "Page 16 has 990 characters\n",
-      "Page 17 has 782 characters\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Page 18 has 520 characters\n",
-      "Page 19 has 1006 characters\n",
-      "Page 20 has 795 characters\n"
-     ]
-    }
-   ],
-   "source": [
-    "# Get the total number of pages\n",
-    "num_pages = len(pdf)\n",
-    "print(f\"This PDF has {num_pages} pages\")\n",
-    "\n",
-    "# Get a specific page (0-indexed)\n",
-    "first_page = pdf.pages[0]\n",
-    "last_page = pdf.pages[-1]\n",
-    "\n",
-    "# Iterate through the first 20 pages\n",
-    "for page in pdf.pages[:20]:\n",
-    "    print(f\"Page {page.number} has {len(page.extract_text())} characters\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2eca7327",
-   "metadata": {},
-   "source": [
-    "## Page Properties\n",
-    "\n",
-    "Each `Page` object has useful properties:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "348f28d7",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-03T14:50:50.713325Z",
-     "iopub.status.busy": "2025-04-03T14:50:50.711638Z",
-     "iopub.status.idle": "2025-04-03T14:50:50.738737Z",
-     "shell.execute_reply": "2025-04-03T14:50:50.726839Z"
-    }
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "612 792\n",
-      "20\n",
-      "19\n"
-     ]
-    }
-   ],
-   "source": [
-    "# Page dimensions in points (1/72 inch)\n",
-    "print(page.width, page.height)\n",
-    "\n",
-    "# Page number (1-indexed as shown in PDF viewers)\n",
-    "print(page.number)\n",
-    "\n",
-    "# Page index (0-indexed position in the PDF)\n",
-    "print(page.index)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c7cf1839",
-   "metadata": {},
-   "source": [
-    "## Working Across Pages\n",
-    "\n",
-    "Natural PDF makes it easy to work with content across multiple pages:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "71a8f1ec",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-03T14:50:50.765495Z",
-     "iopub.status.busy": "2025-04-03T14:50:50.764444Z",
-     "iopub.status.idle": "2025-04-03T14:50:57.735494Z",
-     "shell.execute_reply": "2025-04-03T14:50:57.726489Z"
-    }
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "<natural_pdf.core.pdf.PDF at 0x1045224d0>"
-      ]
-     },
-     "execution_count": 4,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# Extract text from all pages\n",
-    "all_text = pdf.extract_text()\n",
-    "\n",
-    "# Find elements across all pages\n",
-    "all_headings = pdf.find_all('text[size>=14]:bold')\n",
-    "\n",
-    "# Add exclusion zones to all pages (like headers/footers)\n",
-    "pdf.add_exclusion(\n",
-    "    lambda page: page.find('text:contains(\"CONFIDENTIAL\")').above() if page.find('text:contains(\"CONFIDENTIAL\")') else None,\n",
-    "    label=\"header\"\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e18051a4",
-   "metadata": {},
-   "source": [
-    "## The Page Collection\n",
-    "\n",
-    "The `pdf.pages` object is a `PageCollection` that allows batch operations on pages:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "e5f1c662",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-03T14:50:57.752240Z",
-     "iopub.status.busy": "2025-04-03T14:50:57.751868Z",
-     "iopub.status.idle": "2025-04-03T14:50:57.770738Z",
-     "shell.execute_reply": "2025-04-03T14:50:57.759415Z"
-    }
-   },
-   "outputs": [],
-   "source": [
-    "# Extract text from specific pages\n",
-    "text = pdf.pages[2:5].extract_text()\n",
-    "\n",
-    "# Find elements across specific pages\n",
-    "elements = pdf.pages[2:5].find_all('text:contains(\"Annual Report\")')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9713e392",
-   "metadata": {},
-   "source": [
-    "## Document Sections Across Pages\n",
-    "\n",
-    "You can extract sections that span across multiple pages:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "d5b89a2b",
-   "metadata": {
-    "execution": {
-     "iopub.execute_input": "2025-04-03T14:50:57.782621Z",
-     "iopub.status.busy": "2025-04-03T14:50:57.781776Z",
-     "iopub.status.idle": "2025-04-03T14:50:57.811508Z",
-     "shell.execute_reply": "2025-04-03T14:50:57.805310Z"
-    }
-   },
-   "outputs": [],
-   "source": [
-    "# Get sections with headings as section starts\n",
-    "sections = pdf.pages.get_sections(\n",
-    "    start_elements='text[size>=14]:bold',\n",
-    "    new_section_on_page_break=False\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f51594ce",
-   "metadata": {},
-   "source": [
-    "## Next Steps\n",
-    "\n",
-    "Now that you know how to navigate PDFs, you can:\n",
-    "\n",
-    "- [Find elements using selectors](../element-selection/index.ipynb)\n",
-    "- [Extract text from your documents](../text-extraction/index.ipynb)\n",
-    "- [Work with specific regions](../regions/index.ipynb)"
-   ]
-  }
- ],
- "metadata": {
-  "jupytext": {
-   "cell_metadata_filter": "-all",
-   "main_language": "python",
-   "notebook_metadata_filter": "-all",
-   "text_representation": {
-    "extension": ".md",
-    "format_name": "markdown"
-   }
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.13"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}

docs/pdf-navigation/index.md DELETED

@@ -1,97 +0,0 @@
-# PDF Navigation
-This guide covers the basics of working with PDFs in Natural PDF - opening documents, accessing pages, and navigating through content.
-## Opening a PDF
-The main entry point to Natural PDF is the `PDF` class:
-```python
-from natural_pdf import PDF
-# Open a PDF file
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/0500000US42001.pdf")
-```
-## Accessing Pages
-Once you have a PDF object, you can access its pages:
-```python
-# Get the total number of pages
-num_pages = len(pdf)
-print(f"This PDF has {num_pages} pages")
-# Get a specific page (0-indexed)
-first_page = pdf.pages[0]
-last_page = pdf.pages[-1]
-# Iterate through the first 20 pages
-for page in pdf.pages[:20]:
-    print(f"Page {page.number} has {len(page.extract_text())} characters")
-```
-## Page Properties
-Each `Page` object has useful properties:
-```python
-# Page dimensions in points (1/72 inch)
-print(page.width, page.height)
-# Page number (1-indexed as shown in PDF viewers)
-print(page.number)
-# Page index (0-indexed position in the PDF)
-print(page.index)
-```
-## Working Across Pages
-Natural PDF makes it easy to work with content across multiple pages:
-```python
-# Extract text from all pages
-all_text = pdf.extract_text()
-# Find elements across all pages
-all_headings = pdf.find_all('text[size>=14]:bold')
-# Add exclusion zones to all pages (like headers/footers)
-pdf.add_exclusion(
-    lambda page: page.find('text:contains("CONFIDENTIAL")').above() if page.find('text:contains("CONFIDENTIAL")') else None,
-    label="header"
-)
-```
-## The Page Collection
-The `pdf.pages` object is a `PageCollection` that allows batch operations on pages:
-```python
-# Extract text from specific pages
-text = pdf.pages[2:5].extract_text()
-# Find elements across specific pages
-elements = pdf.pages[2:5].find_all('text:contains("Annual Report")')
-```
-## Document Sections Across Pages
-You can extract sections that span across multiple pages:
-```python
-# Get sections with headings as section starts
-sections = pdf.pages.get_sections(
-    start_elements='text[size>=14]:bold',
-    new_section_on_page_break=False
-)
-```
-## Next Steps
-Now that you know how to navigate PDFs, you can:
-- [Find elements using selectors](../element-selection/index.ipynb)
-- [Extract text from your documents](../text-extraction/index.ipynb)
-- [Work with specific regions](../regions/index.ipynb)
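
Note on the deleted guide: it distinguishes `page.number` (1-indexed) from `page.index` (0-indexed), and the recorded notebook output (`20` and `19`) reflects the loop variable left over from iterating `pdf.pages[:20]`. The sketch below reproduces that relationship in plain Python. `PageStub` is a hypothetical stand-in invented for illustration, not the `natural_pdf` `Page` class, and it assumes only what the guide states about the two properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStub:
    """Hypothetical stand-in for a page; NOT the natural_pdf Page class."""

    index: int  # 0-indexed position in the document

    @property
    def number(self) -> int:
        # 1-indexed page number, as a PDF viewer would display it
        return self.index + 1


# 153 pages, matching the page count printed in the notebook output
pages = [PageStub(index=i) for i in range(153)]

# After looping over the first 20 pages, `page` still refers to the 20th page
for page in pages[:20]:
    pass

print(page.number)  # 20
print(page.index)   # 19

# A slice such as pages[2:5] covers page numbers 3, 4 and 5
print([p.number for p in pages[2:5]])  # [3, 4, 5]
```

This is why the "Page Properties" cell in the removed notebook prints values for page 20 rather than page 1: it reuses `page` from the preceding loop.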
