PyPI - natural-pdf - Versions diffs - 0.1.3__py3-none-any.whl → 0.1.5__py3-none-any.whl - Mend

natural-pdf 0.1.3__py3-none-any.whl → 0.1.5__py3-none-any.whl

Files changed (132)

docs/api/index.md +386 -0
docs/assets/favicon.png +3 -0
docs/assets/favicon.svg +3 -0
docs/assets/javascripts/custom.js +17 -0
docs/assets/logo.svg +3 -0
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +17 -0
docs/assets/social-preview.svg +17 -0
docs/assets/stylesheets/custom.css +65 -0
docs/document-qa/index.ipynb +435 -0
docs/document-qa/index.md +79 -0
docs/element-selection/index.ipynb +915 -0
docs/element-selection/index.md +229 -0
docs/index.md +170 -0
docs/installation/index.md +69 -0
docs/interactive-widget/index.ipynb +962 -0
docs/interactive-widget/index.md +12 -0
docs/layout-analysis/index.ipynb +818 -0
docs/layout-analysis/index.md +185 -0
docs/ocr/index.md +222 -0
docs/pdf-navigation/index.ipynb +314 -0
docs/pdf-navigation/index.md +97 -0
docs/regions/index.ipynb +816 -0
docs/regions/index.md +294 -0
docs/tables/index.ipynb +658 -0
docs/tables/index.md +144 -0
docs/text-analysis/index.ipynb +370 -0
docs/text-analysis/index.md +105 -0
docs/text-extraction/index.ipynb +1478 -0
docs/text-extraction/index.md +292 -0
docs/tutorials/01-loading-and-extraction.ipynb +1696 -0
docs/tutorials/01-loading-and-extraction.md +95 -0
docs/tutorials/02-finding-elements.ipynb +340 -0
docs/tutorials/02-finding-elements.md +149 -0
docs/tutorials/03-extracting-blocks.ipynb +147 -0
docs/tutorials/03-extracting-blocks.md +48 -0
docs/tutorials/04-table-extraction.ipynb +114 -0
docs/tutorials/04-table-extraction.md +50 -0
docs/tutorials/05-excluding-content.ipynb +270 -0
docs/tutorials/05-excluding-content.md +109 -0
docs/tutorials/06-document-qa.ipynb +332 -0
docs/tutorials/06-document-qa.md +91 -0
docs/tutorials/07-layout-analysis.ipynb +260 -0
docs/tutorials/07-layout-analysis.md +66 -0
docs/tutorials/07-working-with-regions.ipynb +409 -0
docs/tutorials/07-working-with-regions.md +151 -0
docs/tutorials/08-spatial-navigation.ipynb +508 -0
docs/tutorials/08-spatial-navigation.md +190 -0
docs/tutorials/09-section-extraction.ipynb +2434 -0
docs/tutorials/09-section-extraction.md +256 -0
docs/tutorials/10-form-field-extraction.ipynb +484 -0
docs/tutorials/10-form-field-extraction.md +201 -0
docs/tutorials/11-enhanced-table-processing.ipynb +54 -0
docs/tutorials/11-enhanced-table-processing.md +9 -0
docs/tutorials/12-ocr-integration.ipynb +586 -0
docs/tutorials/12-ocr-integration.md +188 -0
docs/tutorials/13-semantic-search.ipynb +1888 -0
docs/tutorials/13-semantic-search.md +77 -0
docs/visual-debugging/index.ipynb +2970 -0
docs/visual-debugging/index.md +157 -0
docs/visual-debugging/region.png +0 -0
natural_pdf/__init__.py +39 -20
natural_pdf/analyzers/__init__.py +2 -1
natural_pdf/analyzers/layout/base.py +32 -24
natural_pdf/analyzers/layout/docling.py +131 -72
natural_pdf/analyzers/layout/layout_analyzer.py +156 -113
natural_pdf/analyzers/layout/layout_manager.py +98 -58
natural_pdf/analyzers/layout/layout_options.py +32 -17
natural_pdf/analyzers/layout/paddle.py +152 -95
natural_pdf/analyzers/layout/surya.py +164 -92
natural_pdf/analyzers/layout/tatr.py +149 -84
natural_pdf/analyzers/layout/yolo.py +84 -44
natural_pdf/analyzers/text_options.py +22 -15
natural_pdf/analyzers/text_structure.py +131 -85
natural_pdf/analyzers/utils.py +30 -23
natural_pdf/collections/pdf_collection.py +126 -98
natural_pdf/core/__init__.py +1 -1
natural_pdf/core/element_manager.py +416 -337
natural_pdf/core/highlighting_service.py +268 -196
natural_pdf/core/page.py +910 -516
natural_pdf/core/pdf.py +387 -289
natural_pdf/elements/__init__.py +1 -1
natural_pdf/elements/base.py +302 -214
natural_pdf/elements/collections.py +714 -514
natural_pdf/elements/line.py +39 -36
natural_pdf/elements/rect.py +32 -30
natural_pdf/elements/region.py +854 -883
natural_pdf/elements/text.py +122 -99
natural_pdf/exporters/__init__.py +0 -1
natural_pdf/exporters/searchable_pdf.py +261 -102
natural_pdf/ocr/__init__.py +23 -14
natural_pdf/ocr/engine.py +17 -8
natural_pdf/ocr/engine_easyocr.py +63 -47
natural_pdf/ocr/engine_paddle.py +97 -68
natural_pdf/ocr/engine_surya.py +54 -44
natural_pdf/ocr/ocr_manager.py +88 -62
natural_pdf/ocr/ocr_options.py +16 -10
natural_pdf/qa/__init__.py +1 -1
natural_pdf/qa/document_qa.py +119 -111
natural_pdf/search/__init__.py +37 -31
natural_pdf/search/haystack_search_service.py +312 -189
natural_pdf/search/haystack_utils.py +186 -122
natural_pdf/search/search_options.py +25 -14
natural_pdf/search/search_service_protocol.py +12 -6
natural_pdf/search/searchable_mixin.py +261 -176
natural_pdf/selectors/__init__.py +2 -1
natural_pdf/selectors/parser.py +159 -316
natural_pdf/templates/__init__.py +1 -1
natural_pdf/utils/highlighting.py +8 -2
natural_pdf/utils/reading_order.py +65 -63
natural_pdf/utils/text_extraction.py +195 -0
natural_pdf/utils/visualization.py +70 -61
natural_pdf/widgets/__init__.py +2 -3
natural_pdf/widgets/viewer.py +749 -718
{natural_pdf-0.1.3.dist-info → natural_pdf-0.1.5.dist-info}/METADATA +29 -15
natural_pdf-0.1.5.dist-info/RECORD +134 -0
natural_pdf-0.1.5.dist-info/top_level.txt +5 -0
notebooks/Examples.ipynb +1293 -0
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +543 -0
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
tests/test_loading.py +50 -0
tests/test_optional_deps.py +298 -0
natural_pdf-0.1.3.dist-info/RECORD +0 -61
natural_pdf-0.1.3.dist-info/top_level.txt +0 -1
{natural_pdf-0.1.3.dist-info → natural_pdf-0.1.5.dist-info}/WHEEL +0 -0
{natural_pdf-0.1.3.dist-info → natural_pdf-0.1.5.dist-info}/licenses/LICENSE +0 -0

docs/text-extraction/index.md ADDED

@@ -0,0 +1,292 @@
+# Text Extraction Guide
+This guide demonstrates various ways to extract text from PDFs using Natural PDF, from simple page dumps to targeted extraction based on elements, regions, and styles.
+## Setup
+First, let's import the library and load a sample PDF. We'll use `01-practice.pdf`, fetched directly from the project's GitHub repository. *Adjust the URL or path if your setup differs.*
+```python
+from natural_pdf import PDF
+# Load the PDF
+pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
+# Select the first page for initial examples
+page = pdf.pages[0]
+# Display the first page
+page.show(width=700)
+```
+## Basic Text Extraction
+Get all text from a page or the entire document.
+```python
+# Extract all text from the first page
+# Displaying first 500 characters
+print(page.extract_text()[:500])
+```
+You can also preserve layout with `layout=True`.
+```python
+# Extract text from the first page, preserving its layout
+# Displaying the first 2000 characters
+print(page.extract_text(layout=True)[:2000])
+```
+## Extracting Text from Specific Elements
+Use selectors with `find()` or `find_all()` to target specific elements. *Selectors like `:contains("Summary")` are examples; adapt them to your PDF.*
+```python
+# Find a single element, e.g., text containing "Site"
+# Adjust selector as needed
+site_element = page.find('text:contains("Site")')
+site_element # Display the found element object
+```
+```python
+site_element.show()
+```
+```python
+site_element.text
+```
+```python
+# Find multiple elements, e.g., bold headings (size >= 8)
+heading_elements = page.find_all('text[size>=8]:bold')
+heading_elements
+```
+```python
+page.find_all('text[size>=8]:bold').show()
+```
+```python
+# Extract the combined text of all matched elements
+print(heading_elements.extract_text())
+```
+## Advanced Text Searches
+```python
+# Exact phrase (case-sensitive)
+page.find('text:contains("Hazardous Materials")').text
+```
+```python
+# Exact phrase (case-insensitive)
+page.find('text:contains("HAZARDOUS MATERIALS")', case=False).text
+```
+```python
+# Regular expression (e.g., digits, a comma, then a four-digit year)
+regex = r"\d+, \d{4}"
+page.find(f'text:contains("{regex}")', regex=True)
+```
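To see what this pattern matches on its own, independent of Natural PDF, it can be tried with Python's standard `re` module. This is a minimal sketch: the sample string is made up for illustration, and whether the library applies the pattern with search semantics like `re.search` is an assumption.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"\d+, \d{4}")

# Hypothetical sample text, not taken from the PDF
sample = "Date: February 3, 1905"

match = pattern.search(sample)
matched_text = match.group(0) if match else None
print(matched_text)
```

Writing the pattern as a raw string avoids Python's invalid-escape warnings for sequences such as `\d`.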
+```python
+# Attribute filters: match on font name and size
+page.find_all('text[fontname="Helvetica"][size=10]')
+```
+## Regions
+```python
+# Region below an element (e.g., below "Summary")
+# Adjust selector as needed
+page.find('text:contains("Summary")').below(include_element=True).show()
+```
+```python
+(
+    page
+    .find('text:contains("Summary")')
+    .below(include_element=True)
+    .extract_text()
+    [:500]
+)
+```
+```python
+(
+    page
+    .find('text:contains("Summary")')
+    .below(include_element=True, until='line:horizontal')
+    .show()
+)
+```
+```python
+# Manually defined region via coordinates (x0, top, x1, bottom)
+manual_region = page.create_region(30, 60, 600, 300)
+manual_region.show()
+```
+```python
+# Extract text from the manual region
+manual_region.extract_text()[:500]
+```
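A coordinate-defined region can be pictured as a rectangle filter: only content whose bounding box lies inside `(x0, top, x1, bottom)` contributes to the extracted text. The sketch below illustrates that idea in plain Python. The `Word` tuples and the `words_in_region` helper are hypothetical and are not Natural PDF's implementation, which may treat partially overlapping elements differently.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word with a bounding box in PDF points (origin at top-left)."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def words_in_region(words, x0, top, x1, bottom):
    """Keep only the words whose boxes lie fully inside the rectangle."""
    return [
        w for w in words
        if w.x0 >= x0 and w.x1 <= x1 and w.top >= top and w.bottom <= bottom
    ]


# Made-up sample data for illustration
words = [
    Word("Header", 40, 20, 90, 32),     # above the region
    Word("Summary", 40, 80, 100, 92),   # inside
    Word("details", 105, 80, 150, 92),  # inside
    Word("Footer", 40, 700, 90, 712),   # below the region
]

region_words = words_in_region(words, 30, 60, 600, 300)
region_text = " ".join(w.text for w in region_words)
print(region_text)
```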
+## Filtering Out Headers and Footers
+Use Exclusion Zones to remove unwanted content before extraction. *Adjust selectors for typical header/footer content.*
+```python
+header_content = page.find('rect')
+footer_content = page.find_all('line')[-1].below()
+header_content.highlight()
+footer_content.highlight()
+page.to_image()
+```
+```python
+page.extract_text()[:500]
+```
+```python
+page.add_exclusion(header_content)
+page.add_exclusion(footer_content)
+```
+```python
+page.extract_text()[:500]
+```
+```python
+full_text_no_exclusions = page.extract_text(use_exclusions=False)
+clean_text = page.extract_text()
+f"Original length: {len(full_text_no_exclusions)}, Excluded length: {len(clean_text)}"
+```
+```python
+page.clear_exclusions()
+```
+*Exclusions can also be defined globally at the PDF level using `pdf.add_exclusion()` with a function.*
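One way to picture a function-based, document-wide exclusion: the function receives each page and returns the area to skip on that page, so the same rule adapts to every page. The plain-Python sketch below uses hypothetical dictionaries for pages and a hypothetical `top_band` rule. It illustrates the pattern only and does not reproduce the signature or behavior of `pdf.add_exclusion()`.

```python
def top_band(page, height=50):
    """Hypothetical rule: exclude everything in the top `height` points of a page."""
    return (0, 0, page["width"], height)  # (x0, top, x1, bottom)


def extract_without(page, exclusion_fn):
    """Join the words that fall outside the box returned by exclusion_fn."""
    x0, top, x1, bottom = exclusion_fn(page)
    kept = [
        text for text, wx0, wtop, wx1, wbottom in page["words"]
        # A word is dropped only if its box lies fully inside the excluded box
        if not (wx0 >= x0 and wx1 <= x1 and wtop >= top and wbottom <= bottom)
    ]
    return " ".join(kept)


# Made-up pages: each word is (text, x0, top, x1, bottom)
pages = [
    {"width": 612, "words": [("Confidential", 40, 10, 120, 22), ("Summary", 40, 80, 100, 92)]},
    {"width": 612, "words": [("Confidential", 40, 10, 120, 22), ("Violations", 40, 90, 110, 102)]},
]

clean_pages = [extract_without(page, top_band) for page in pages]
print(clean_pages)
```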
+## Controlling Whitespace
+Manage how spaces and blank lines are handled during extraction using `layout`.
+```python
+print(page.extract_text())
+```
+```python
+print(page.extract_text(use_exclusions=False, layout=True))
+```
+## Font Information Access
+Inspect font details of text elements.
+```python
+# Pick a text element to inspect (index 1 is the second text element on the page)
+first_text = page.find_all('text')[1]
+first_text # Display basic info
+```
+```python
+# Highlight the selected text element
+first_text.show()
+```
+```python
+# Get detailed font properties dictionary
+first_text.font_info()
+```
+```python
+# Check specific style properties directly
+f"Is Bold: {first_text.bold}, Is Italic: {first_text.italic}, Font: {first_text.fontname}, Size: {first_text.size}"
+```
+```python
+# Find elements by font attributes (adjust selectors)
+# Example: Find text whose font name contains "Helvetica"
+helvetica_text = page.find_all('text[fontname*=Helvetica]')
+helvetica_text # Display list of found elements
+```
+```python
+# Example: Find large text (e.g., size >= 12)
+large_text = page.find_all('text[size>=12]')
+large_text
+```
+```python
+# Example: Find bold text
+bold_text = page.find_all('text:bold')
+bold_text
+```
+## Working with Font Styles
+Analyze and group text elements by their computed font *style*, which combines attributes like font name, size, boldness, etc., into logical groups.
+```python
+# Analyze styles on the page, then list the style labels that were found
+page.analyze_text_styles()
+page.text_style_labels
+```
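The idea of a style label can be sketched without the library: elements that share a font name and size are placed in the same bucket, and the bucket gets a readable name. The snippet below is a rough illustration under stated assumptions. The element tuples are made up, and the label format simply mirrors the `"8.0pt Helvetica"` label used in this guide; how `analyze_text_styles()` actually computes its labels may differ.

```python
from collections import defaultdict

# Made-up text elements: (text, fontname, size)
elements = [
    ("Violations", "Helvetica-Bold", 10.0),
    ("Statute", "Helvetica", 8.0),
    ("Description", "Helvetica", 8.0),
    ("Summary", "Helvetica-Bold", 10.0),
]

groups = defaultdict(list)
for text, fontname, size in elements:
    label = f"{size}pt {fontname}"  # e.g. "8.0pt Helvetica"
    groups[label].append(text)

style_groups = dict(groups)
print(style_groups)
```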
+```python
+page.find_all('text').highlight(group_by='style_label').to_image()
+```
+```python
+page.find_all('text[style_label="8.0pt Helvetica"]')
+```
+```python
+page.find_all('text[fontname="Helvetica"][size=8]')
+```
+*Font variants (e.g., `AAAAAB+FontName`) are also accessible via the `font-variant` attribute selector: `page.find_all('text[font-variant="AAAAAB"]')`.*
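In PDFs, an embedded font subset is named with a six-uppercase-letter tag, a plus sign, and then the base font name, which is where a value like `AAAAAB` comes from. The helper below is a hypothetical illustration of splitting such a name in plain Python; it is not how Natural PDF derives its `font-variant` attribute.

```python
def split_subset_prefix(fontname):
    """Split 'AAAAAB+FontName' into ('AAAAAB', 'FontName').

    Names without a six-uppercase-letter prefix are returned with an empty variant.
    """
    prefix, sep, base = fontname.partition("+")
    if sep and len(prefix) == 6 and prefix.isalpha() and prefix.isupper():
        return prefix, base
    return "", fontname


print(split_subset_prefix("AAAAAB+Helvetica"))
print(split_subset_prefix("Helvetica"))
```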
+## Reading Order
+Text extraction follows a best-effort approximation of natural reading order (top-to-bottom, left-to-right by default). `page.find_all('text')` returns elements already sorted this way.
+```python
+# Get first 5 text elements in reading order
+elements_in_order = page.find_all('text')
+elements_in_order[:5]
+```
+```python
+# Text extracted via page.extract_text() respects this order automatically
+# (Result already shown in Basic Text Extraction section)
+page.extract_text()[:100]
+```
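Top-to-bottom, left-to-right ordering can be sketched as a two-key sort on each fragment's vertical and horizontal position. The fragments below are made up, and this is only an illustration of the ordering rule, not the library's own logic. Real pages usually need some tolerance for text on the same line whose `top` values differ slightly.

```python
# Made-up text fragments: (text, x0, top)
fragments = [
    ("world", 120, 50),
    ("Second", 40, 70),
    ("Hello", 40, 50),
    ("line", 100, 70),
]

# Sort top-to-bottom first, then left-to-right within the same top value
ordered = sorted(fragments, key=lambda f: (f[2], f[1]))
ordered_text = " ".join(text for text, _, _ in ordered)
print(ordered_text)
```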
+## Element Navigation
+Move between elements sequentially based on reading order using `.next()` and `.previous()`.
+```python
+page.clear_highlights()
+start = page.find('text:contains("Date")')
+start.highlight(label='Date label')
+start.next().highlight(label='Maybe the date', color='green')
+start.next(r'text:contains("\d")', regex=True).highlight(label='Probably the date')
+page.to_image()
+```
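The difference between an unfiltered step and a filtered one can be sketched with an ordinary list: the first takes whatever comes next in reading order, the second keeps walking until something matches. The element strings and the `next_matching` helper below are hypothetical and only illustrate the idea; they do not mirror the signature of `.next()`.

```python
import re

# Made-up elements, already in reading order
elements = ["Site:", "Durham", "Date:", "February", "3, 1905", "Summary"]


def next_matching(items, start_index, pattern=None):
    """Return the first item after start_index, optionally requiring a regex match."""
    for item in items[start_index + 1:]:
        if pattern is None or re.search(pattern, item):
            return item
    return None


start = elements.index("Date:")
immediate = next_matching(elements, start)          # whatever follows directly
with_digit = next_matching(elements, start, r"\d")  # first following item containing a digit
print(immediate, "|", with_digit)
```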
+## Next Steps
+Now that you know how to extract text, you might want to explore:
+- [Working with regions](../regions/index.ipynb) for more precise extraction
+- [OCR capabilities](../ocr/index.md) for scanned documents
+- [Document layout analysis](../layout-analysis/index.ipynb) for automatic structure detection
+- [Document QA](../document-qa/index.ipynb) for asking questions directly to your documents
