PyPI - natural-pdf - Versions diffs - 0.1.8__py3-none-any.whl → 0.1.10__py3-none-any.whl - Mend

natural-pdf 0.1.8__py3-none-any.whl → 0.1.10__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (134)

natural_pdf/__init__.py +1 -0
natural_pdf/analyzers/layout/base.py +1 -5
natural_pdf/analyzers/layout/gemini.py +61 -51
natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
natural_pdf/analyzers/layout/layout_manager.py +26 -84
natural_pdf/analyzers/layout/layout_options.py +7 -0
natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
natural_pdf/analyzers/layout/surya.py +46 -123
natural_pdf/analyzers/layout/tatr.py +51 -4
natural_pdf/analyzers/text_structure.py +3 -5
natural_pdf/analyzers/utils.py +3 -3
natural_pdf/classification/manager.py +241 -158
natural_pdf/classification/mixin.py +52 -38
natural_pdf/classification/results.py +71 -45
natural_pdf/collections/mixins.py +85 -20
natural_pdf/collections/pdf_collection.py +245 -100
natural_pdf/core/element_manager.py +30 -14
natural_pdf/core/highlighting_service.py +13 -22
natural_pdf/core/page.py +423 -101
natural_pdf/core/pdf.py +694 -195
natural_pdf/elements/base.py +134 -40
natural_pdf/elements/collections.py +610 -134
natural_pdf/elements/region.py +659 -90
natural_pdf/elements/text.py +1 -1
natural_pdf/export/mixin.py +137 -0
natural_pdf/exporters/base.py +3 -3
natural_pdf/exporters/paddleocr.py +4 -3
natural_pdf/extraction/manager.py +50 -49
natural_pdf/extraction/mixin.py +90 -57
natural_pdf/extraction/result.py +9 -23
natural_pdf/ocr/__init__.py +5 -5
natural_pdf/ocr/engine_doctr.py +346 -0
natural_pdf/ocr/ocr_factory.py +24 -4
natural_pdf/ocr/ocr_manager.py +61 -25
natural_pdf/ocr/ocr_options.py +70 -10
natural_pdf/ocr/utils.py +6 -4
natural_pdf/search/__init__.py +20 -34
natural_pdf/search/haystack_search_service.py +309 -265
natural_pdf/search/haystack_utils.py +99 -75
natural_pdf/search/search_service_protocol.py +11 -12
natural_pdf/selectors/parser.py +219 -143
natural_pdf/utils/debug.py +3 -3
natural_pdf/utils/identifiers.py +1 -1
natural_pdf/utils/locks.py +1 -1
natural_pdf/utils/packaging.py +8 -6
natural_pdf/utils/text_extraction.py +24 -16
natural_pdf/utils/tqdm_utils.py +18 -10
natural_pdf/utils/visualization.py +18 -0
natural_pdf/widgets/viewer.py +4 -25
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/METADATA +12 -3
natural_pdf-0.1.10.dist-info/RECORD +80 -0
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/top_level.txt +0 -2
docs/api/index.md +0 -386
docs/assets/favicon.png +0 -3
docs/assets/favicon.svg +0 -3
docs/assets/javascripts/custom.js +0 -17
docs/assets/logo.svg +0 -3
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +0 -17
docs/assets/social-preview.svg +0 -17
docs/assets/stylesheets/custom.css +0 -65
docs/categorizing-documents/index.md +0 -168
docs/data-extraction/index.md +0 -87
docs/document-qa/index.ipynb +0 -435
docs/document-qa/index.md +0 -79
docs/element-selection/index.ipynb +0 -969
docs/element-selection/index.md +0 -249
docs/finetuning/index.md +0 -176
docs/index.md +0 -189
docs/installation/index.md +0 -69
docs/interactive-widget/index.ipynb +0 -962
docs/interactive-widget/index.md +0 -12
docs/layout-analysis/index.ipynb +0 -818
docs/layout-analysis/index.md +0 -185
docs/ocr/index.md +0 -256
docs/pdf-navigation/index.ipynb +0 -314
docs/pdf-navigation/index.md +0 -97
docs/regions/index.ipynb +0 -816
docs/regions/index.md +0 -294
docs/tables/index.ipynb +0 -658
docs/tables/index.md +0 -144
docs/text-analysis/index.ipynb +0 -370
docs/text-analysis/index.md +0 -105
docs/text-extraction/index.ipynb +0 -1478
docs/text-extraction/index.md +0 -292
docs/tutorials/01-loading-and-extraction.ipynb +0 -1873
docs/tutorials/01-loading-and-extraction.md +0 -95
docs/tutorials/02-finding-elements.ipynb +0 -417
docs/tutorials/02-finding-elements.md +0 -149
docs/tutorials/03-extracting-blocks.ipynb +0 -152
docs/tutorials/03-extracting-blocks.md +0 -48
docs/tutorials/04-table-extraction.ipynb +0 -119
docs/tutorials/04-table-extraction.md +0 -50
docs/tutorials/05-excluding-content.ipynb +0 -275
docs/tutorials/05-excluding-content.md +0 -109
docs/tutorials/06-document-qa.ipynb +0 -337
docs/tutorials/06-document-qa.md +0 -91
docs/tutorials/07-layout-analysis.ipynb +0 -293
docs/tutorials/07-layout-analysis.md +0 -66
docs/tutorials/07-working-with-regions.ipynb +0 -414
docs/tutorials/07-working-with-regions.md +0 -151
docs/tutorials/08-spatial-navigation.ipynb +0 -513
docs/tutorials/08-spatial-navigation.md +0 -190
docs/tutorials/09-section-extraction.ipynb +0 -2439
docs/tutorials/09-section-extraction.md +0 -256
docs/tutorials/10-form-field-extraction.ipynb +0 -517
docs/tutorials/10-form-field-extraction.md +0 -201
docs/tutorials/11-enhanced-table-processing.ipynb +0 -59
docs/tutorials/11-enhanced-table-processing.md +0 -9
docs/tutorials/12-ocr-integration.ipynb +0 -3712
docs/tutorials/12-ocr-integration.md +0 -137
docs/tutorials/13-semantic-search.ipynb +0 -1718
docs/tutorials/13-semantic-search.md +0 -77
docs/visual-debugging/index.ipynb +0 -2970
docs/visual-debugging/index.md +0 -157
docs/visual-debugging/region.png +0 -0
natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -420
natural_pdf/templates/spa/css/style.css +0 -334
natural_pdf/templates/spa/index.html +0 -31
natural_pdf/templates/spa/js/app.js +0 -472
natural_pdf/templates/spa/words.txt +0 -235976
natural_pdf/widgets/frontend/viewer.js +0 -88
natural_pdf-0.1.8.dist-info/RECORD +0 -156
notebooks/Examples.ipynb +0 -1293
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +0 -543
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
{natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/licenses/LICENSE +0 -0

docs/text-extraction/index.md DELETED

@@ -1,292 +0,0 @@
-# Text Extraction Guide
-This guide demonstrates various ways to extract text from PDFs using Natural PDF, from simple page dumps to targeted extraction based on elements, regions, and styles.
-## Setup
-First, let's import the library and load a sample PDF. We'll use `01-practice.pdf` from the repository's `pdfs` directory, loaded by URL. *Adjust the path if your setup differs.*
-```python
-from natural_pdf import PDF
-# Load the PDF
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-# Select the first page for initial examples
-page = pdf.pages[0]
-# Display the first page
-page.show(width=700)
-```
-## Basic Text Extraction
-Get all text from a page or the entire document.
-```python
-# Extract all text from the first page
-# Displaying first 500 characters
-print(page.extract_text()[:500])
-```
-You can also preserve layout with `layout=True`.
-```python
-# Extract text from the first page, preserving layout
-# Displaying first 2000 characters
-print(page.extract_text(layout=True)[:2000])
-```
-## Extracting Text from Specific Elements
-Use selectors with `find()` or `find_all()` to target specific elements. *Selectors like `:contains("Summary")` are examples; adapt them to your PDF.*
-```python
-# Find a single element, e.g., text containing "Site"
-# Adjust selector as needed
-site_element = page.find('text:contains("Site")')
-site_element # Display the found element object
-```
-```python
-site_element.show()
-```
-```python
-site_element.text
-```
-```python
-# Find multiple elements, e.g., bold headings (size >= 8)
-heading_elements = page.find_all('text[size>=8]:bold')
-heading_elements
-```
-```python
-page.find_all('text[size>=8]:bold').show()
-```
-```python
-# Pull out all of their text (why? I don't know!)
-print(heading_elements.extract_text())
-```
-## Advanced Text Searches
-```python
-# Exact phrase (case-sensitive)
-page.find('text:contains("Hazardous Materials")').text
-```
-```python
-# Exact phrase (case-insensitive)
-page.find('text:contains("HAZARDOUS MATERIALS")', case=False).text
-```
-```python
-# Regular expression (e.g., digits, a comma, then a four-digit year)
-regex = r"\d+, \d{4}"
-page.find(f'text:contains("{regex}")', regex=True)
-```
-```python
-# Attribute selectors (e.g., font name and size)
-page.find_all('text[fontname="Helvetica"][size=10]')
-```
-## Regions
-```python
-# Region below an element (e.g., below "Summary")
-# Adjust selector as needed
-page.find('text:contains("Summary")').below(include_element=True).show()
-```
-```python
-(
-    page
-    .find('text:contains("Summary")')
-    .below(include_element=True)
-    .extract_text()
-    [:500]
-)
-```
-```python
-(
-    page
-    .find('text:contains("Summary")')
-    .below(include_element=True, until='line:horizontal')
-    .show()
-)
-```
-```python
-# Manually defined region via coordinates (x0, top, x1, bottom)
-manual_region = page.create_region(30, 60, 600, 300)
-manual_region.show()
-```
-```python
-# Extract text from the manual region
-manual_region.extract_text()[:500]
-```
-## Filtering Out Headers and Footers
-Use Exclusion Zones to remove unwanted content before extraction. *Adjust selectors for typical header/footer content.*
-```python
-header_content = page.find('rect')
-footer_content = page.find_all('line')[-1].below()
-header_content.highlight()
-footer_content.highlight()
-page.to_image()
-```
-```python
-page.extract_text()[:500]
-```
-```python
-page.add_exclusion(header_content)
-page.add_exclusion(footer_content)
-```
-```python
-page.extract_text()[:500]
-```
-```python
-full_text_no_exclusions = page.extract_text(use_exclusions=False)
-clean_text = page.extract_text()
-f"Original length: {len(full_text_no_exclusions)}, Excluded length: {len(clean_text)}"
-```
-```python
-page.clear_exclusions()
-```
-*Exclusions can also be defined globally at the PDF level using `pdf.add_exclusion()` with a function.*
-## Controlling Whitespace
-Manage how spaces and blank lines are handled during extraction using `layout`.
-```python
-print(page.extract_text())
-```
-```python
-print(page.extract_text(use_exclusions=False, layout=True))
-```
-### Font Information Access
-Inspect font details of text elements.
-```python
-# Find the first text element on the page
-first_text = page.find_all('text')[0]
-first_text # Display basic info
-```
-```python
-# Highlight the first text element
-first_text.show()
-```
-```python
-# Get detailed font properties dictionary
-first_text.font_info()
-```
-```python
-# Check specific style properties directly
-f"Is Bold: {first_text.bold}, Is Italic: {first_text.italic}, Font: {first_text.fontname}, Size: {first_text.size}"
-```
-```python
-# Find elements by font attributes (adjust selectors)
-# Example: Find Helvetica fonts
-helvetica_text = page.find_all('text[fontname*=Helvetica]')
-helvetica_text # Display list of found elements
-```
-```python
-# Example: Find large text (e.g., size >= 12)
-large_text = page.find_all('text[size>=12]')
-large_text
-```
-```python
-# Example: Find bold text
-bold_text = page.find_all('text:bold')
-bold_text
-```
-## Working with Font Styles
-Analyze and group text elements by their computed font *style*, which combines attributes like font name, size, boldness, etc., into logical groups.
-```python
-# Analyze styles on the page
-# This returns a dictionary mapping style names to ElementList objects
-page.analyze_text_styles()
-page.text_style_labels
-```
-```python
-page.find_all('text').highlight(group_by='style_label').to_image()
-```
-```python
-page.find_all('text[style_label="8.0pt Helvetica"]')
-```
-```python
-page.find_all('text[fontname="Helvetica"][size=8]')
-```
-*Font variants (e.g., `AAAAAB+FontName`) are also accessible via the `font-variant` attribute selector: `page.find_all('text[font-variant="AAAAAB"]')`.*
-## Reading Order
-Text extraction follows a simple approximation of natural reading order (top-to-bottom, left-to-right by default). `page.find_all('text')` returns elements already sorted this way.
-```python
-# Get first 5 text elements in reading order
-elements_in_order = page.find_all('text')
-elements_in_order[:5]
-```
-```python
-# Text extracted via page.extract_text() respects this order automatically
-# (Result already shown in Basic Text Extraction section)
-page.extract_text()[:100]
-```
-## Element Navigation
-Move between elements sequentially based on reading order using `.next()` and `.previous()`.
-```python
-page.clear_highlights()
-start = page.find('text:contains("Date")')
-start.highlight(label='Date label')
-start.next().highlight(label='Maybe the date', color='green')
-start.next(r'text:contains("\d")', regex=True).highlight(label='Probably the date')
-page.to_image()
-```
-## Next Steps
-Now that you know how to extract text, you might want to explore:
-- [Working with regions](../regions/index.ipynb) for more precise extraction
-- [OCR capabilities](../ocr/index.md) for scanned documents
-- [Document layout analysis](../layout-analysis/index.ipynb) for automatic structure detection
-- [Document QA](../document-qa/index.ipynb) for asking questions directly to your documents
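
The deleted guide describes its default reading order as top-to-bottom, then left-to-right. As a rough illustration of that idea only, the sketch below sorts a few made-up word boxes without using natural-pdf at all. `Word`, `reading_order`, and `line_tolerance` are hypothetical names introduced here, and the line-grouping rule is an assumption, not the library's actual algorithm.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical word box: text plus top and left coordinates."""
    text: str
    top: float
    x0: float


def reading_order(words: List[Word], line_tolerance: float = 3.0) -> List[Word]:
    """Order words top-to-bottom, then left-to-right within each line.

    Words whose `top` lies within `line_tolerance` of the first word
    of the current line are treated as part of that line (an assumption).
    """
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.top):
        if lines and abs(word.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Flatten the lines, sorting each one by its left edge
    return [w for line in lines for w in sorted(line, key=lambda w: w.x0)]


# Made-up coordinates, deliberately listed out of order
words = [
    Word("1905", top=100.4, x0=140.0),
    Word("Date:", top=100.0, x0=50.0),
    Word("Site:", top=80.0, x0=50.0),
    Word("February", top=99.8, x0=90.0),
]
ordered = [w.text for w in reading_order(words)]
print(ordered)
```

The tolerance matters because words on the same visual line rarely share an identical `top` value, so a plain sort on `(top, x0)` could interleave them.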
