PyPI - natural-pdf - Versions diffs - 0.1.4__py3-none-any.whl → 0.1.5__py3-none-any.whl - Mend

natural-pdf 0.1.4__py3-none-any.whl → 0.1.5__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (132)

docs/api/index.md +386 -0
docs/assets/favicon.png +3 -0
docs/assets/favicon.svg +3 -0
docs/assets/javascripts/custom.js +17 -0
docs/assets/logo.svg +3 -0
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +17 -0
docs/assets/social-preview.svg +17 -0
docs/assets/stylesheets/custom.css +65 -0
docs/document-qa/index.ipynb +435 -0
docs/document-qa/index.md +79 -0
docs/element-selection/index.ipynb +915 -0
docs/element-selection/index.md +229 -0
docs/index.md +170 -0
docs/installation/index.md +69 -0
docs/interactive-widget/index.ipynb +962 -0
docs/interactive-widget/index.md +12 -0
docs/layout-analysis/index.ipynb +818 -0
docs/layout-analysis/index.md +185 -0
docs/ocr/index.md +222 -0
docs/pdf-navigation/index.ipynb +314 -0
docs/pdf-navigation/index.md +97 -0
docs/regions/index.ipynb +816 -0
docs/regions/index.md +294 -0
docs/tables/index.ipynb +658 -0
docs/tables/index.md +144 -0
docs/text-analysis/index.ipynb +370 -0
docs/text-analysis/index.md +105 -0
docs/text-extraction/index.ipynb +1478 -0
docs/text-extraction/index.md +292 -0
docs/tutorials/01-loading-and-extraction.ipynb +1696 -0
docs/tutorials/01-loading-and-extraction.md +95 -0
docs/tutorials/02-finding-elements.ipynb +340 -0
docs/tutorials/02-finding-elements.md +149 -0
docs/tutorials/03-extracting-blocks.ipynb +147 -0
docs/tutorials/03-extracting-blocks.md +48 -0
docs/tutorials/04-table-extraction.ipynb +114 -0
docs/tutorials/04-table-extraction.md +50 -0
docs/tutorials/05-excluding-content.ipynb +270 -0
docs/tutorials/05-excluding-content.md +109 -0
docs/tutorials/06-document-qa.ipynb +332 -0
docs/tutorials/06-document-qa.md +91 -0
docs/tutorials/07-layout-analysis.ipynb +260 -0
docs/tutorials/07-layout-analysis.md +66 -0
docs/tutorials/07-working-with-regions.ipynb +409 -0
docs/tutorials/07-working-with-regions.md +151 -0
docs/tutorials/08-spatial-navigation.ipynb +508 -0
docs/tutorials/08-spatial-navigation.md +190 -0
docs/tutorials/09-section-extraction.ipynb +2434 -0
docs/tutorials/09-section-extraction.md +256 -0
docs/tutorials/10-form-field-extraction.ipynb +484 -0
docs/tutorials/10-form-field-extraction.md +201 -0
docs/tutorials/11-enhanced-table-processing.ipynb +54 -0
docs/tutorials/11-enhanced-table-processing.md +9 -0
docs/tutorials/12-ocr-integration.ipynb +586 -0
docs/tutorials/12-ocr-integration.md +188 -0
docs/tutorials/13-semantic-search.ipynb +1888 -0
docs/tutorials/13-semantic-search.md +77 -0
docs/visual-debugging/index.ipynb +2970 -0
docs/visual-debugging/index.md +157 -0
docs/visual-debugging/region.png +0 -0
natural_pdf/__init__.py +39 -20
natural_pdf/analyzers/__init__.py +2 -1
natural_pdf/analyzers/layout/base.py +32 -24
natural_pdf/analyzers/layout/docling.py +131 -72
natural_pdf/analyzers/layout/layout_analyzer.py +156 -113
natural_pdf/analyzers/layout/layout_manager.py +98 -58
natural_pdf/analyzers/layout/layout_options.py +32 -17
natural_pdf/analyzers/layout/paddle.py +152 -95
natural_pdf/analyzers/layout/surya.py +164 -92
natural_pdf/analyzers/layout/tatr.py +149 -84
natural_pdf/analyzers/layout/yolo.py +84 -44
natural_pdf/analyzers/text_options.py +22 -15
natural_pdf/analyzers/text_structure.py +131 -85
natural_pdf/analyzers/utils.py +30 -23
natural_pdf/collections/pdf_collection.py +125 -97
natural_pdf/core/__init__.py +1 -1
natural_pdf/core/element_manager.py +416 -337
natural_pdf/core/highlighting_service.py +268 -196
natural_pdf/core/page.py +907 -513
natural_pdf/core/pdf.py +385 -287
natural_pdf/elements/__init__.py +1 -1
natural_pdf/elements/base.py +302 -214
natural_pdf/elements/collections.py +708 -508
natural_pdf/elements/line.py +39 -36
natural_pdf/elements/rect.py +32 -30
natural_pdf/elements/region.py +854 -883
natural_pdf/elements/text.py +122 -99
natural_pdf/exporters/__init__.py +0 -1
natural_pdf/exporters/searchable_pdf.py +261 -102
natural_pdf/ocr/__init__.py +23 -14
natural_pdf/ocr/engine.py +17 -8
natural_pdf/ocr/engine_easyocr.py +63 -47
natural_pdf/ocr/engine_paddle.py +97 -68
natural_pdf/ocr/engine_surya.py +54 -44
natural_pdf/ocr/ocr_manager.py +88 -62
natural_pdf/ocr/ocr_options.py +16 -10
natural_pdf/qa/__init__.py +1 -1
natural_pdf/qa/document_qa.py +119 -111
natural_pdf/search/__init__.py +37 -31
natural_pdf/search/haystack_search_service.py +312 -189
natural_pdf/search/haystack_utils.py +186 -122
natural_pdf/search/search_options.py +25 -14
natural_pdf/search/search_service_protocol.py +12 -6
natural_pdf/search/searchable_mixin.py +261 -176
natural_pdf/selectors/__init__.py +2 -1
natural_pdf/selectors/parser.py +159 -316
natural_pdf/templates/__init__.py +1 -1
natural_pdf/utils/highlighting.py +8 -2
natural_pdf/utils/reading_order.py +65 -63
natural_pdf/utils/text_extraction.py +195 -0
natural_pdf/utils/visualization.py +70 -61
natural_pdf/widgets/__init__.py +2 -3
natural_pdf/widgets/viewer.py +749 -718
{natural_pdf-0.1.4.dist-info → natural_pdf-0.1.5.dist-info}/METADATA +15 -1
natural_pdf-0.1.5.dist-info/RECORD +134 -0
natural_pdf-0.1.5.dist-info/top_level.txt +5 -0
notebooks/Examples.ipynb +1293 -0
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +543 -0
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
tests/test_loading.py +50 -0
tests/test_optional_deps.py +298 -0
natural_pdf-0.1.4.dist-info/RECORD +0 -61
natural_pdf-0.1.4.dist-info/top_level.txt +0 -1
{natural_pdf-0.1.4.dist-info → natural_pdf-0.1.5.dist-info}/WHEEL +0 -0
{natural_pdf-0.1.4.dist-info → natural_pdf-0.1.5.dist-info}/licenses/LICENSE +0 -0
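
The per-file counts above can be totalled with a few lines of Python. This is a minimal sketch, assuming each entry has the form `path +added -removed` as shown in the list; `parse_stat_line` and `totals` are hypothetical helper names, not part of natural-pdf or of this diff viewer.

```python
import re

# One file-list entry: "<path> +<added> -<removed>" (paths may contain spaces).
STAT_LINE = re.compile(r"^(?P<path>.+) \+(?P<added>\d+) -(?P<removed>\d+)$")


def parse_stat_line(line):
    """Return (path, added, removed), or None if the line is not a stat entry."""
    match = STAT_LINE.match(line.strip())
    if match is None:
        return None
    return match["path"], int(match["added"]), int(match["removed"])


def totals(lines, prefix=""):
    """Sum added/removed counts for entries whose path starts with `prefix`."""
    added = removed = 0
    for line in lines:
        parsed = parse_stat_line(line)
        if parsed is not None and parsed[0].startswith(prefix):
            added += parsed[1]
            removed += parsed[2]
    return added, removed


# Three entries copied from the list above.
sample = [
    "natural_pdf/__init__.py +39 -20",
    "natural_pdf/analyzers/__init__.py +2 -1",
    "pdfs/2014 Statistics.pdf +0 -0",
]
print(totals(sample, prefix="natural_pdf/"))  # (41, 21)
```

Passing a prefix such as `docs/` or `natural_pdf/` separates documentation churn from source changes when reading a large file list.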

docs/element-selection/index.md ADDED

@@ -0,0 +1,229 @@
+# Finding Elements with Selectors
+Natural PDF uses CSS-like selectors to find elements (text, lines, images, etc.) within a PDF page or document. This guide demonstrates how to use these selectors effectively.
+## Setup
+Let's load a sample PDF to work with. We'll use `01-practice.pdf` which has various elements.
+```python
+from natural_pdf import PDF
+# Load the PDF
+pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
+# Select the first page
+page = pdf.pages[0]
+# Display the page
+page.show()
+```
+## Basic Element Finding
+The core methods are `find()` (returns the first match) and `find_all()` (returns all matches as an `ElementCollection`).
+The basic selector structure is `element_type[attribute_filter]:pseudo_class`.
+### Finding Text by Content
+```python
+# Find the first text element containing "Summary"
+summary_text = page.find('text:contains("Summary")')
+summary_text
+```
+```python
+# Find all text elements containing "Inadequate"
+contains_inadequate = page.find_all('text:contains("Inadequate")')
+len(contains_inadequate)
+```
+```python
+summary_text.highlight(label='summary')
+contains_inadequate.highlight(label="inadequate")
+page.to_image(width=700)
+```
+## Selecting by Element Type
+You can select specific types of elements found in PDFs.
+```python
+# Find all text elements
+all_text = page.find_all('text')
+len(all_text)
+```
+```python
+# Find all rectangle elements
+all_rects = page.find_all('rect')
+len(all_rects)
+```
+```python
+# Find all line elements
+all_lines = page.find_all('line')
+len(all_lines)
+```
+```python
+page.find_all('line').show()
+```
+## Filtering by Attributes
+Use square brackets `[]` to filter elements by their properties (attributes).
+### Common Attributes & Operators
+| Attribute     | Example Usage          | Operators | Notes |
+|---------------|------------------------|-----------|-------|
+| `size` (text) | `text[size>=12]`       | `>`, `<`, `>=`, `<=` | Font size in points |
+| `fontname`    | `text[fontname*=Bold]` | `=`, `*=`  | `*=` for contains substring |
+| `color` (text)| `text[color~=red]`     | `~=`      | Approx. match (name, rgb, hex) |
+| `width` (line)| `line[width>1]`        | `>`, `<`, `>=`, `<=` | Line thickness |
+| `source`      | `text[source=ocr]`     | `=`       | `pdf`, `ocr`, `detected` |
+| `type` (region)| `region[type=table]`  | `=`       | Layout analysis region type |
+```python
+# Find large text (size >= 11 points)
+page.find_all('text[size>=11]')
+```
+```python
+# Find text with 'Helvetica' in the font name
+page.find_all('text[fontname*=Helvetica]')
+```
+```python
+# Find red text (using approximate color match)
+# This PDF has text with color (0.8, 0.0, 0.0)
+red_text = page.find_all('text[color~=red]')
+```
+```python
+# Highlight the red text (ignoring existing highlights)
+red_text.show()
+```
+```python
+# Find thick lines (width >= 2)
+page.find_all('line[width>=2]')
+```
+## Using Pseudo-Classes
+Use colons `:` for special conditions (pseudo-classes).
+### Common Pseudo-Classes
+| Pseudo-Class          | Example Usage                           | Notes |
+|-----------------------|-----------------------------------------|-------|
+| `:contains('text')` | `text:contains('Report')`             | Finds elements containing specific text |
+| `:bold`               | `text:bold`                             | Finds text heuristically identified as bold |
+| `:italic`             | `text:italic`                           | Finds text heuristically identified as italic |
+| `:below(selector)`    | `text:below('line[width>=2]')`         | Finds elements physically below the reference element |
+| `:above(selector)`    | `text:above('text:contains("Summary")')`| Finds elements physically above the reference element |
+| `:left-of(selector)`  | `line:left-of('rect')`                 | Finds elements physically left of the reference element |
+| `:right-of(selector)` | `text:right-of('rect')`                | Finds elements physically right of the reference element |
+| `:near(selector)`     | `text:near('image')`                   | Finds elements physically near the reference element |
+*Note: Spatial pseudo-classes like `:below`, `:above` identify elements based on bounding box positions relative to the **first** element matched by the inner selector.*
+```python
+# Find bold text
+page.find_all('text:bold').show()
+```
+```python
+# Combine attribute and pseudo-class: bold text size >= 11
+page.find_all('text[size>=11]:bold')
+```
+### Spatial Pseudo-Classes Examples
+```python
+# Find the thick horizontal line first
+ref_line = page.find('line[width>=2]')
+# Find text elements strictly above that line
+text_above_line = page.find_all('text:above("line[width>=2]")')
+text_above_line
+```
+## Advanced Text Searching Options
+Pass options to `find()` or `find_all()` for more control over text matching.
+```python
+# Case-insensitive search for "summary"
+page.find_all('text:contains("summary")', case=False)
+```
+```python
+# Regular expression search for the inspection ID (e.g., INS-XXX...)
+# The ID is in the red text we found earlier
+page.find_all('text:contains("INS-\\w+")', regex=True)
+```
+```python
+# Combine regex and case-insensitivity
+page.find_all('text:contains("jungle health")', regex=True, case=False)
+```
+## Working with ElementCollections
+`find_all()` returns an `ElementCollection`, which is like a list but with extra PDF-specific methods.
+```python
+# Get all headings (using a selector for large, bold text)
+headings = page.find_all('text[size>=11]:bold')
+headings
+```
+```python
+# Get the first and last heading in reading order
+first = headings.first
+last = headings.last
+(first, last)
+```
+```python
+# Get the physically highest/lowest element in the collection
+highest = headings.highest()
+lowest = headings.lowest()
+(highest, lowest)
+```
+```python
+# Filter the collection further: headings containing "Service"
+service_headings = headings.find_all('text:contains("Service")')
+service_headings
+```
+```python
+# Extract text from all elements in the collection
+headings.extract_text()
+```
+*Remember: `.highest()`, `.lowest()`, `.leftmost()`, `.rightmost()` raise errors if the collection spans multiple pages.*
+## Font Variants
+Sometimes PDFs use font variants (prefixes like `AAAAAB+`) which can be useful for selection.
+```python
+# Find text elements with a specific font variant prefix (if any exist)
+# This example PDF doesn't use variants, but the selector works like this:
+page.find_all('text[font-variant=AAAAAB]')
+```
+## Next Steps
+Now that you can find elements, explore:
+- [Text Extraction](../text-extraction/index.ipynb): Get text content from found elements.
+- [Spatial Navigation](../pdf-navigation/index.ipynb): Use found elements as anchors to navigate (`.above()`, `.below()`, etc.).
+- [Working with Regions](../regions/index.ipynb): Define areas based on found elements.
+- [Visual Debugging](../visual-debugging/index.ipynb): Techniques for highlighting and visualizing elements.
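
The guide above describes the selector structure as `element_type[attribute_filter]:pseudo_class`. The sketch below only illustrates that three-part shape by splitting a selector string with a regular expression. It is a toy written for this note, not the parser in `natural_pdf/selectors/parser.py`; `split_selector` is a hypothetical name, and the real parser may accept forms this one rejects.

```python
import re

# Toy grammar for illustration only; NOT natural_pdf's actual selector parser.
SELECTOR_RE = re.compile(
    r"""
    ^(?P<element_type>[a-z_][\w-]*)   # element type, e.g. text, line, rect, region
    (?P<attributes>(?:\[[^\]]*\])*)   # zero or more bracketed attribute filters
    (?P<pseudo>(?::.+)?)$             # optional pseudo-class tail after a colon
    """,
    re.VERBOSE,
)


def split_selector(selector: str) -> dict:
    """Split a selector into its element type, attribute filters, and pseudo-class."""
    match = SELECTOR_RE.match(selector.strip())
    if match is None:
        raise ValueError(f"not in element_type[attr]:pseudo form: {selector!r}")
    attributes = re.findall(r"\[([^\]]*)\]", match.group("attributes"))
    pseudo = match.group("pseudo")[1:]  # drop the leading colon, if present
    return {
        "element_type": match.group("element_type"),
        "attributes": attributes,
        "pseudo": pseudo,
    }


print(split_selector("text[size>=11]:bold"))
print(split_selector('text:contains("Summary")'))
```

Note that a bracket inside a pseudo-class argument, as in `text:above("line[width>=2]")`, ends up in the pseudo-class part rather than being read as an attribute filter on `text`.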

docs/index.md ADDED

@@ -0,0 +1,170 @@
+# Natural PDF
+A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).
+Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
+- [Live demo here](https://colab.research.google.com/github/jsoma/natural-pdf/blob/main/notebooks/Examples.ipynb)
+<div style="max-width: 400px; margin: auto"><a href="assets/sample-screen.png"><img src="assets/sample-screen.png"></a></div>
+## Installation
+```
+pip install natural_pdf
+# All the extras
+pip install "natural_pdf[all]"
+```
+## Quick Example
+```python
+from natural_pdf import PDF
+pdf = PDF('document.pdf')
+page = pdf.pages[0]
+# Find the title and get content below it
+title = page.find('text:contains("Summary"):bold')
+content = title.below().extract_text()
+# Exclude everything above 'CONFIDENTIAL' and below last line on page
+page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
+page.add_exclusion(page.find_all('line')[-1].below())
+# Get the clean text without header/footer
+clean_text = page.extract_text()
+```
+## Key Features
+Here are a few highlights of what you can do:
+### Find Elements with Selectors
+Use CSS-like selectors to find text, shapes, and more.
+```python
+# Find bold text containing "Revenue"
+page.find('text:contains("Revenue"):bold').extract_text()
+# Find all large text
+page.find_all('text[size>=12]').extract_text()
+```
+[Learn more about selectors →](element-selection/index.ipynb)
+### Navigate Spatially
+Move around the page relative to elements, not just coordinates.
+```python
+# Extract text below a specific heading
+intro_text = page.find('text:contains("Introduction")').below().extract_text()
+# Extract text from one heading to the next
+methods_text = page.find('text:contains("Methods")').below(
+    until='text:contains("Results")'
+).extract_text()
+```
+[Explore more navigation methods →](pdf-navigation/index.ipynb)
+### Extract Clean Text
+Easily extract text content, automatically handling common page elements like headers and footers (if exclusions are set).
+```python
+# Extract all text from the page (respecting exclusions)
+page_text = page.extract_text()
+# Extract text from a specific region
+some_region = page.find(...)
+region_text = some_region.extract_text()
+```
+[Learn about text extraction →](text-extraction/index.ipynb)
+[Learn about exclusion zones →](regions/index.ipynb#exclusion-zones)
+### Apply OCR
+Extract text from scanned documents using various OCR engines.
+```python
+# Apply OCR using the default engine
+ocr_elements = page.apply_ocr()
+# Extract text (will use OCR results if available)
+text = page.extract_text()
+```
+[Explore OCR options →](ocr/index.md)
+### Analyze Document Layout
+Use AI models to detect document structures like titles, paragraphs, and tables.
+```python
+# Detect document structure
+page.analyze_layout()
+# Highlight titles and tables
+page.find_all('region[type=title]').highlight(color="purple")
+page.find_all('region[type=table]').highlight(color="blue")
+# Extract data from the first table
+table_data = page.find('region[type=table]').extract_table()
+```
+[Learn about layout models →](layout-analysis/index.ipynb)
+[Working with tables? →](tables/index.ipynb)
+### Document Question Answering
+Ask natural language questions directly to your documents.
+```python
+# Ask a question
+result = pdf.ask("What was the company's revenue in 2022?")
+if result.get("found", False):
+    print(f"Answer: {result['answer']}")
+```
+[Learn about Document QA →](document-qa/index.ipynb)
+### Visualize Your Work
+Debug and understand your extractions visually.
+```python
+# Highlight headings
+page.find_all('text[size>=14]').highlight(color="red", label="Headings")
+# Launch the interactive viewer (Jupyter)
+# Requires: pip install natural-pdf[interactive]
+page.viewer()
+# Or save an image
+# page.save_image("highlighted.png")
+```
+[See more visualization options →](visual-debugging/index.ipynb)
+## Documentation Topics
+Choose what you want to learn about:
+### Task-based Guides
+- [Getting Started](installation/index.md): Install the library and run your first extraction
+- [PDF Navigation](pdf-navigation/index.ipynb): Open PDFs and work with pages
+- [Element Selection](element-selection/index.ipynb): Find text and other elements using selectors
+- [Text Extraction](text-extraction/index.ipynb): Extract clean text from documents
+- [Regions](regions/index.ipynb): Work with specific areas of a page
+- [Visual Debugging](visual-debugging/index.ipynb): See what you're extracting
+- [OCR](ocr/index.md): Extract text from scanned documents
+- [Layout Analysis](layout-analysis/index.ipynb): Detect document structure
+- [Tables](tables/index.ipynb): Extract tabular data
+- [Document QA](document-qa/index.ipynb): Ask questions to your documents
+### Reference
+- [API Reference](api/index.md): Complete library reference

docs/installation/index.md ADDED

@@ -0,0 +1,69 @@
+# Getting Started with Natural PDF
+Let's get Natural PDF installed and run your first extraction.
+## Installation
+The base installation includes the core library and necessary AI dependencies (like PyTorch and Transformers):
+```bash
+pip install natural-pdf
+```
+### Optional Dependencies
+Natural PDF has modular dependencies for different features. Install them based on your needs:
+```bash
+# --- OCR Engines ---
+# Install support for EasyOCR
+pip install natural-pdf[easyocr]
+# Install support for PaddleOCR (requires paddlepaddle)
+pip install natural-pdf[paddle]
+# Install support for Surya OCR
+pip install natural-pdf[surya]
+# --- Layout Detection ---
+# Install support for YOLO layout model
+pip install natural-pdf[layout_yolo]
+# --- Interactive Widget ---
+# Install support for the interactive .viewer() widget in Jupyter
+pip install natural-pdf[interactive]
+# --- All Features ---
+# Install all optional dependencies
+pip install natural-pdf[all]
+```
+## Your First PDF Extraction
+Here's a quick example to make sure everything is working:
+```python
+from natural_pdf import PDF
+# Open a PDF
+pdf = PDF('your_document.pdf')
+# Get the first page
+page = pdf.pages[0]
+# Extract all text
+text = page.extract_text()
+print(text)
+# Find something specific
+title = page.find('text:bold')
+print(f"Found title: {title.text}")
+```
+## What's Next?
+Now that you have Natural PDF installed, you can:
+- Learn to [navigate PDFs](../pdf-navigation/index.ipynb)
+- Explore how to [select elements](../element-selection/index.ipynb)
+- See how to [extract text](../text-extraction/index.ipynb)
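
The first-extraction example above reads `title.text` straight after `page.find('text:bold')`. On a page with no bold text that access can fail, so a guard is worth considering. The sketch below assumes `find()` returns `None` when nothing matches, which the docs in this diff do not state explicitly. `first_bold_text` is a hypothetical helper, and the two stub classes exist only so the block runs without a PDF.

```python
from typing import Any, Optional


def first_bold_text(page: Any) -> Optional[str]:
    """Return the text of the first bold element, or None if there is none."""
    element = page.find('text:bold')
    if element is None:  # assumption: find() signals "no match" with None
        return None
    return element.text


class StubElement:
    """Stand-in for a text element; only the .text attribute is modelled."""

    def __init__(self, text: str) -> None:
        self.text = text


class StubPage:
    """Stand-in for a page whose find() returns a preset element or None."""

    def __init__(self, element: Optional[StubElement]) -> None:
        self._element = element

    def find(self, selector: str) -> Optional[StubElement]:
        return self._element


print(first_bold_text(StubPage(StubElement("Annual Report"))))
print(first_bold_text(StubPage(None)))
```

With a real `page` object the call would be `first_bold_text(pdf.pages[0])`, and the caller decides what to print when the result is `None`.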
