PyPI - natural-pdf - Versions diffs - 0.1.1__tar.gz → 0.1.2__tar.gz - Mend



Files changed (243)

natural_pdf-0.1.2/PKG-INFO ADDED

@@ -0,0 +1,124 @@
+Metadata-Version: 2.4
+Name: natural-pdf
+Version: 0.1.2
+Summary: A more intuitive interface for working with PDFs
+Author-email: Jonathan Soma <jonathan.soma@gmail.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/jsoma/natural-pdf
+Project-URL: Repository, https://github.com/jsoma/natural-pdf
+Classifier: Programming Language :: Python :: 3
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pdfplumber>=0.7.0
+Requires-Dist: Pillow>=8.0.0
+Requires-Dist: colour>=0.1.5
+Requires-Dist: numpy>=1.20.0
+Requires-Dist: urllib3>=1.26.0
+Requires-Dist: torch>=2.0.0
+Requires-Dist: torchvision>=0.15.0
+Requires-Dist: transformers>=4.30.0
+Requires-Dist: huggingface_hub>=0.19.0
+Provides-Extra: interactive
+Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "interactive"
+Provides-Extra: easyocr
+Requires-Dist: easyocr; extra == "easyocr"
+Provides-Extra: paddle
+Requires-Dist: paddlepaddle; extra == "paddle"
+Requires-Dist: paddleocr; extra == "paddle"
+Provides-Extra: layout-yolo
+Requires-Dist: doclayout_yolo; extra == "layout-yolo"
+Provides-Extra: surya
+Requires-Dist: surya-ocr; extra == "surya"
+Provides-Extra: qa
+Provides-Extra: all
+Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "all"
+Requires-Dist: easyocr; extra == "all"
+Requires-Dist: paddlepaddle; extra == "all"
+Requires-Dist: paddleocr; extra == "all"
+Requires-Dist: doclayout_yolo; extra == "all"
+Requires-Dist: surya-ocr; extra == "all"
+Dynamic: license-file
+# Natural PDF
+A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).
+Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
+- [Complete documentation here](https://jsoma.github.io/natural-pdf)
+- [Live demos here](https://colab.research.google.com/github/jsoma/natural-pdf/)
+<div style="max-width: 400px; margin: auto"><a href="sample-screen.png"><img src="sample-screen.png"></a></div>
+## Installation
+```bash
+pip install natural-pdf
+```
+For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install extras:
+```bash
+# Example: Install with EasyOCR support
+pip install natural-pdf[easyocr]
+pip install natural-pdf[surya]
+pip install natural-pdf[paddle]
+# Example: Install with interactive viewer support
+pip install natural-pdf[interactive]
+# Install everything
+pip install natural-pdf[all]
+```
+See the [installation guide](https://jsoma.github.io/natural-pdf/installation/) for more details on extras.
+## Quick Start
+```python
+from natural_pdf import PDF
+# Open a PDF
+pdf = PDF('document.pdf')
+page = pdf.pages[0]
+# Find elements using CSS-like selectors
+heading = page.find('text:contains("Summary"):bold')
+# Extract content below the heading
+content = heading.below().extract_text()
+print("Content below Summary:", content[:100] + "...")
+# Exclude headers/footers automatically (example)
+# You might define these based on common text or position
+page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
+page.add_exclusion(page.find_all('line')[-1].below())
+# Extract clean text from the page
+clean_text = page.extract_text()
+print("\nClean page text:", clean_text[:200] + "...")
+# Highlight the heading and view the page
+heading.highlight(color='red')
+page.to_image()
+```
+And as a fun bonus, `page.viewer()` will provide an interactive method to explore the PDF.
+## Key Features
+Natural PDF offers a range of features for working with PDFs:
+*   **CSS-like Selectors:** Find elements using intuitive query strings (`page.find('text:bold')`).
+*   **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
+*   **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
+*   **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
+*   **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using AI models.
+*   **Document QA:** Ask natural language questions about your document's content.
+*   **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
+## Learn More
+Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).

natural_pdf-0.1.2/README.md ADDED

@@ -0,0 +1,81 @@
+# Natural PDF
+A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).
+Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
+- [Complete documentation here](https://jsoma.github.io/natural-pdf)
+- [Live demos here](https://colab.research.google.com/github/jsoma/natural-pdf/)
+<div style="max-width: 400px; margin: auto"><a href="sample-screen.png"><img src="sample-screen.png"></a></div>
+## Installation
+```bash
+pip install natural-pdf
+```
+For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install extras:
+```bash
+# Example: Install with EasyOCR support
+pip install natural-pdf[easyocr]
+pip install natural-pdf[surya]
+pip install natural-pdf[paddle]
+# Example: Install with interactive viewer support
+pip install natural-pdf[interactive]
+# Install everything
+pip install natural-pdf[all]
+```
+See the [installation guide](https://jsoma.github.io/natural-pdf/installation/) for more details on extras.
+## Quick Start
+```python
+from natural_pdf import PDF
+# Open a PDF
+pdf = PDF('document.pdf')
+page = pdf.pages[0]
+# Find elements using CSS-like selectors
+heading = page.find('text:contains("Summary"):bold')
+# Extract content below the heading
+content = heading.below().extract_text()
+print("Content below Summary:", content[:100] + "...")
+# Exclude headers/footers automatically (example)
+# You might define these based on common text or position
+page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
+page.add_exclusion(page.find_all('line')[-1].below())
+# Extract clean text from the page
+clean_text = page.extract_text()
+print("\nClean page text:", clean_text[:200] + "...")
+# Highlight the heading and view the page
+heading.highlight(color='red')
+page.to_image()
+```
+And as a fun bonus, `page.viewer()` will provide an interactive method to explore the PDF.
+## Key Features
+Natural PDF offers a range of features for working with PDFs:
+*   **CSS-like Selectors:** Find elements using intuitive query strings (`page.find('text:bold')`).
+*   **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
+*   **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
+*   **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
+*   **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using AI models.
+*   **Document QA:** Ask natural language questions about your document's content.
+*   **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
+## Learn More
+Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).

natural_pdf-0.1.2/docs/assets/sample-screen.png ADDED

Binary file

natural_pdf-0.1.2/docs/index.md ADDED

@@ -0,0 +1,170 @@
+# Natural PDF
+A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).
+Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
+- [Live demo here](https://colab.research.google.com/github/jsoma/natural-pdf/blob/main/notebooks/Examples.ipynb)
+<div style="max-width: 400px; margin: auto"><a href="assets/sample-screen.png"><img src="assets/sample-screen.png"></a></div>
+## Installation
+```
+pip install natural_pdf
+# All the extras
+pip install "natural_pdf[all]"
+```
+## Quick Example
+```python
+from natural_pdf import PDF
+pdf = PDF('document.pdf')
+page = pdf.pages[0]
+# Find the title and get content below it
+title = page.find('text:contains("Summary"):bold')
+content = title.below().extract_text()
+# Exclude everything above 'CONFIDENTIAL' and below last line on page
+page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
+page.add_exclusion(page.find_all('line')[-1].below())
+# Get the clean text without header/footer
+clean_text = page.extract_text()
+```
+## Key Features
+Here are a few highlights of what you can do:
+### Find Elements with Selectors
+Use CSS-like selectors to find text, shapes, and more.
+```python
+# Find bold text containing "Revenue"
+page.find('text:contains("Revenue"):bold').extract_text()
+# Find all large text
+page.find_all('text[size>=12]').extract_text()
+```
+[Learn more about selectors →](element-selection/index.ipynb)
+### Navigate Spatially
+Move around the page relative to elements, not just coordinates.
+```python
+# Extract text below a specific heading
+intro_text = page.find('text:contains("Introduction")').below().extract_text()
+# Extract text from one heading to the next
+methods_text = page.find('text:contains("Methods")').below(
+    until='text:contains("Results")'
+).extract_text()
+```
+[Explore more navigation methods →](pdf-navigation/index.ipynb)
+### Extract Clean Text
+Easily extract text content, automatically handling common page elements like headers and footers (if exclusions are set).
+```python
+# Extract all text from the page (respecting exclusions)
+page_text = page.extract_text()
+# Extract text from a specific region
+some_region = page.find(...)
+region_text = some_region.extract_text()
+```
+[Learn about text extraction →](text-extraction/index.ipynb)
+[Learn about exclusion zones →](regions/index.ipynb#exclusion-zones)
+### Apply OCR
+Extract text from scanned documents using various OCR engines.
+```python
+# Apply OCR using the default engine
+ocr_elements = page.apply_ocr()
+# Extract text (will use OCR results if available)
+text = page.extract_text()
+```
+[Explore OCR options →](ocr/index.md)
+### Analyze Document Layout
+Use AI models to detect document structures like titles, paragraphs, and tables.
+```python
+# Detect document structure
+page.analyze_layout()
+# Highlight titles and tables
+page.find_all('region[type=title]').highlight(color="purple")
+page.find_all('region[type=table]').highlight(color="blue")
+# Extract data from the first table
+table_data = page.find('region[type=table]').extract_table()
+```
+[Learn about layout models →](layout-analysis/index.ipynb)
+[Working with tables? →](tables/index.ipynb)
+### Document Question Answering
+Ask natural language questions directly to your documents.
+```python
+# Ask a question
+result = pdf.ask("What was the company's revenue in 2022?")
+if result.get("found", False):
+    print(f"Answer: {result['answer']}")
+```
+[Learn about Document QA →](document-qa/index.ipynb)
+### Visualize Your Work
+Debug and understand your extractions visually.
+```python
+# Highlight headings
+page.find_all('text[size>=14]').highlight(color="red", label="Headings")
+# Launch the interactive viewer (Jupyter)
+# Requires: pip install natural-pdf[interactive]
+page.viewer()
+# Or save an image
+# page.save_image("highlighted.png")
+```
+[See more visualization options →](visual-debugging/index.ipynb)
+## Documentation Topics
+Choose what you want to learn about:
+### Task-based Guides
+- [Getting Started](installation/index.md): Install the library and run your first extraction
+- [PDF Navigation](pdf-navigation/index.ipynb): Open PDFs and work with pages
+- [Element Selection](element-selection/index.ipynb): Find text and other elements using selectors
+- [Text Extraction](text-extraction/index.ipynb): Extract clean text from documents
+- [Regions](regions/index.ipynb): Work with specific areas of a page
+- [Visual Debugging](visual-debugging/index.ipynb): See what you're extracting
+- [OCR](ocr/index.md): Extract text from scanned documents
+- [Layout Analysis](layout-analysis/index.ipynb): Detect document structure
+- [Tables](tables/index.ipynb): Extract tabular data
+- [Document QA](document-qa/index.ipynb): Ask questions to your documents
+### Reference
+- [API Reference](api/index.md): Complete library reference

{natural_pdf-0.1.1 → natural_pdf-0.1.2}/docs/installation/index.md RENAMED

@@ -57,8 +57,7 @@ print(text)
 # Find something specific
 title = page.find('text:bold')
-if title:
-    print(f"Found title: {title.text}")
+print(f"Found title: {title.text}")
 ```
 ## What's Next?
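
Note on the hunk above: 0.1.2 drops the `if title:` guard, so the documented example now assumes that `page.find` always returns a match. The guard in 0.1.1 suggests that a failed lookup may yield a falsy value, in which case `title.text` would raise an `AttributeError`. A minimal sketch of a defensive variant follows. It runs against hypothetical stand-ins rather than the real library: `FakeElement`, `FakePage`, and `describe_title` are illustrative names, and the None-on-miss behavior is an assumption, not something this diff confirms.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for an element returned by page.find()."""
    text: str


class FakePage:
    """Hypothetical stand-in for a page; find() returns None on a miss (assumed)."""

    def __init__(self, elements: Dict[str, FakeElement]):
        self._elements = dict(elements)

    def find(self, selector: str) -> Optional[FakeElement]:
        # Look up the selector; a missing key models "no match".
        return self._elements.get(selector)


def describe_title(page, selector: str = "text:bold") -> str:
    """Build the printed message without assuming that a match exists."""
    title = page.find(selector)
    if title is None:
        return "No title found"
    return f"Found title: {title.text}"
```

With a real page object, the same helper would be called as `print(describe_title(page))`, provided `find` behaves as assumed here.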

{natural_pdf-0.1.1 → natural_pdf-0.1.2}/docs/regions/index.md RENAMED

@@ -221,12 +221,6 @@ print(f"Original text: {len(full_text)} chars\nText with exclusion: {len(text_wi
 print(f"Difference: {len(full_text) - len(text_with_exclusion)} chars excluded")
 ```
-```python
-# Temporarily bypass exclusions if needed
-text_ignoring_exclusion = full_page_region.extract_text(use_exclusions=False)
-print(f"Text ignoring exclusions: {len(text_ignoring_exclusion)} chars (should match original)")
-```
 ```python
 # When done with this page, clear exclusions
 page.clear_exclusions()
@@ -253,10 +247,11 @@ pdf.add_exclusion(
 # PDF-level exclusions are used whenever you extract text
 # Let's try on the first three pages
-for i in range(min(3, len(pdf.pages))):
+for page in pdf.pages[:3]:
     page_i = pdf.pages[i]
     text = page_i.extract_text()
-    print(f"Page {i+1}: {len(text)} characters after exclusions")
+    text_original = page_i.extract_text(use_exclusions=False)
+    print(f"Page {page.number} – Before: {len(text_original)} After: {len(text)}")
 ```
 ```python
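
Note on the second hunk above: the 0.1.2 loop iterates with `for page in pdf.pages[:3]`, but the unchanged body line still reads `pdf.pages[i]`, and `i` is no longer defined, so the snippet as shown would likely fail with a `NameError`. A sketch of a consistent version is below. It relies only on the two members the hunk itself uses (`extract_text(use_exclusions=...)` and `number`); `exclusion_report` and `StubPage` are hypothetical names introduced for illustration.

```python
from itertools import islice


def exclusion_report(pages, limit=3):
    """Compare text length with and without exclusions for the first pages."""
    lines = []
    for page in islice(pages, limit):
        # Use the loop variable directly instead of indexing pdf.pages again.
        text = page.extract_text()
        text_original = page.extract_text(use_exclusions=False)
        lines.append(
            f"Page {page.number} - Before: {len(text_original)} After: {len(text)}"
        )
    return lines


class StubPage:
    """Hypothetical page stand-in, used only to exercise the helper."""

    def __init__(self, number, full_text, kept_text):
        self.number = number
        self._full_text = full_text
        self._kept_text = kept_text

    def extract_text(self, use_exclusions=True):
        # Return the filtered text unless exclusions are bypassed.
        return self._kept_text if use_exclusions else self._full_text
```

Against the real library the call would presumably be `exclusion_report(pdf.pages)`, assuming `pdf.pages` is iterable as the hunk's slice implies.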

{natural_pdf-0.1.1 → natural_pdf-0.1.2}/docs/tutorials/05-excluding-content.md RENAMED

@@ -23,12 +23,12 @@ page = pdf.pages[0]
 full_text_unfiltered = page.extract_text()
 # Show the last 200 characters (likely containing footer text)
-"Unfiltered text (last 200 chars): " + full_text_unfiltered[-200:]
+full_text_unfiltered[-200:]
 ```
 ## Approach 1: Excluding a Fixed Area
-A simple way to exclude headers or footers is to define a fixed region based on page coordinates. Let's exclude the bottom 50 points of the page.
+A simple way to exclude headers or footers is to define a fixed region based on page coordinates. Let's exclude the bottom 200 pixels of the page.
 ```python
 from natural_pdf import PDF
@@ -36,26 +36,29 @@ from natural_pdf import PDF
 pdf_url = "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/0500000US42007.pdf"
 pdf = PDF(pdf_url)
-# Define the exclusion region directly using a lambda function
-footer_height = 50
+# Define the exclusion region on every page using a lambda function
+footer_height = 200
 pdf.add_exclusion(
     lambda page: page.region(top=page.height - footer_height),
-    label="Bottom 50pt Footer"
+    label="Bottom 200pt Footer"
 )
 # Now extract text from the first page again, exclusions are active by default
 page = pdf.pages[0]
-filtered_text = page.extract_text() # use_exclusions=True is default
-# Show the last 200 chars with footer area excluded
-"Fixed Area Excluded (last 200 chars): " + filtered_text[-200:]
 # Visualize the excluded area
 footer_region_viz = page.region(top=page.height - footer_height)
-footer_region_viz.show(label="Excluded Footer Area")
+footer_region_viz.highlight(label="Excluded Footer Area")
 page.to_image()
 ```
+```python
+filtered_text = page.extract_text() # use_exclusions=True is default
+# Show the last 200 chars with footer area excluded
+filtered_text[-200:]
+```
 This method is simple but might cut off content if the footer height varies or content extends lower on some pages.
 ## Approach 2: Excluding Based on Elements
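
Note on the hunks above: the 0.1.2 prose says "200 pixels" while the label says "200pt". PDF page coordinates are normally expressed in points (1/72 inch), so the label's unit is probably the intended one. The exclusion itself is plain arithmetic on the page height, sketched below. `footer_band` and `covered_fraction` are hypothetical helpers, and the 792-point height in the usage note assumes a US Letter page, which this diff does not confirm for the sample PDF.

```python
from typing import Tuple


def footer_band(page_height: float, footer_height: float) -> Tuple[float, float]:
    """Return (top, bottom) of the band covering the bottom of the page."""
    # Clamp at 0 so a footer taller than the page covers the whole page.
    top = max(0.0, page_height - footer_height)
    return (top, page_height)


def covered_fraction(page_height: float, footer_height: float) -> float:
    """Return the share of the page height that the band excludes."""
    top, bottom = footer_band(page_height, footer_height)
    return (bottom - top) / page_height
```

On a 792-point page, `footer_band(792, 200)` starts the band at 592, and the band covers roughly a quarter of the page. That is a large share, which is why a fixed area can cut off body content.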
