PyPI - natural-pdf - Versions diffs - 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl - Mend

natural_pdf-0.1.7-py3-none-any.whl → natural_pdf-0.1.9-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (134)

natural_pdf/__init__.py +3 -0
natural_pdf/analyzers/layout/base.py +1 -5
natural_pdf/analyzers/layout/gemini.py +61 -51
natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
natural_pdf/analyzers/layout/layout_manager.py +26 -84
natural_pdf/analyzers/layout/layout_options.py +7 -0
natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
natural_pdf/analyzers/layout/surya.py +46 -123
natural_pdf/analyzers/layout/tatr.py +51 -4
natural_pdf/analyzers/text_structure.py +3 -5
natural_pdf/analyzers/utils.py +3 -3
natural_pdf/classification/manager.py +422 -0
natural_pdf/classification/mixin.py +163 -0
natural_pdf/classification/results.py +80 -0
natural_pdf/collections/mixins.py +111 -0
natural_pdf/collections/pdf_collection.py +434 -15
natural_pdf/core/element_manager.py +83 -0
natural_pdf/core/highlighting_service.py +13 -22
natural_pdf/core/page.py +578 -93
natural_pdf/core/pdf.py +912 -460
natural_pdf/elements/base.py +134 -40
natural_pdf/elements/collections.py +712 -109
natural_pdf/elements/region.py +722 -69
natural_pdf/elements/text.py +4 -1
natural_pdf/export/mixin.py +137 -0
natural_pdf/exporters/base.py +3 -3
natural_pdf/exporters/paddleocr.py +5 -4
natural_pdf/extraction/manager.py +135 -0
natural_pdf/extraction/mixin.py +279 -0
natural_pdf/extraction/result.py +23 -0
natural_pdf/ocr/__init__.py +5 -5
natural_pdf/ocr/engine_doctr.py +346 -0
natural_pdf/ocr/engine_easyocr.py +6 -3
natural_pdf/ocr/ocr_factory.py +24 -4
natural_pdf/ocr/ocr_manager.py +122 -26
natural_pdf/ocr/ocr_options.py +94 -11
natural_pdf/ocr/utils.py +19 -6
natural_pdf/qa/document_qa.py +0 -4
natural_pdf/search/__init__.py +20 -34
natural_pdf/search/haystack_search_service.py +309 -265
natural_pdf/search/haystack_utils.py +99 -75
natural_pdf/search/search_service_protocol.py +11 -12
natural_pdf/selectors/parser.py +431 -230
natural_pdf/utils/debug.py +3 -3
natural_pdf/utils/identifiers.py +1 -1
natural_pdf/utils/locks.py +8 -0
natural_pdf/utils/packaging.py +8 -6
natural_pdf/utils/text_extraction.py +60 -1
natural_pdf/utils/tqdm_utils.py +51 -0
natural_pdf/utils/visualization.py +18 -0
natural_pdf/widgets/viewer.py +4 -25
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
natural_pdf-0.1.9.dist-info/RECORD +80 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
docs/api/index.md +0 -386
docs/assets/favicon.png +0 -3
docs/assets/favicon.svg +0 -3
docs/assets/javascripts/custom.js +0 -17
docs/assets/logo.svg +0 -3
docs/assets/sample-screen.png +0 -0
docs/assets/social-preview.png +0 -17
docs/assets/social-preview.svg +0 -17
docs/assets/stylesheets/custom.css +0 -65
docs/document-qa/index.ipynb +0 -435
docs/document-qa/index.md +0 -79
docs/element-selection/index.ipynb +0 -915
docs/element-selection/index.md +0 -229
docs/finetuning/index.md +0 -176
docs/index.md +0 -170
docs/installation/index.md +0 -69
docs/interactive-widget/index.ipynb +0 -962
docs/interactive-widget/index.md +0 -12
docs/layout-analysis/index.ipynb +0 -818
docs/layout-analysis/index.md +0 -185
docs/ocr/index.md +0 -209
docs/pdf-navigation/index.ipynb +0 -314
docs/pdf-navigation/index.md +0 -97
docs/regions/index.ipynb +0 -816
docs/regions/index.md +0 -294
docs/tables/index.ipynb +0 -658
docs/tables/index.md +0 -144
docs/text-analysis/index.ipynb +0 -370
docs/text-analysis/index.md +0 -105
docs/text-extraction/index.ipynb +0 -1478
docs/text-extraction/index.md +0 -292
docs/tutorials/01-loading-and-extraction.ipynb +0 -194
docs/tutorials/01-loading-and-extraction.md +0 -95
docs/tutorials/02-finding-elements.ipynb +0 -340
docs/tutorials/02-finding-elements.md +0 -149
docs/tutorials/03-extracting-blocks.ipynb +0 -147
docs/tutorials/03-extracting-blocks.md +0 -48
docs/tutorials/04-table-extraction.ipynb +0 -114
docs/tutorials/04-table-extraction.md +0 -50
docs/tutorials/05-excluding-content.ipynb +0 -270
docs/tutorials/05-excluding-content.md +0 -109
docs/tutorials/06-document-qa.ipynb +0 -332
docs/tutorials/06-document-qa.md +0 -91
docs/tutorials/07-layout-analysis.ipynb +0 -288
docs/tutorials/07-layout-analysis.md +0 -66
docs/tutorials/07-working-with-regions.ipynb +0 -413
docs/tutorials/07-working-with-regions.md +0 -151
docs/tutorials/08-spatial-navigation.ipynb +0 -508
docs/tutorials/08-spatial-navigation.md +0 -190
docs/tutorials/09-section-extraction.ipynb +0 -2434
docs/tutorials/09-section-extraction.md +0 -256
docs/tutorials/10-form-field-extraction.ipynb +0 -512
docs/tutorials/10-form-field-extraction.md +0 -201
docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
docs/tutorials/11-enhanced-table-processing.md +0 -9
docs/tutorials/12-ocr-integration.ipynb +0 -604
docs/tutorials/12-ocr-integration.md +0 -175
docs/tutorials/13-semantic-search.ipynb +0 -1328
docs/tutorials/13-semantic-search.md +0 -77
docs/visual-debugging/index.ipynb +0 -2970
docs/visual-debugging/index.md +0 -157
docs/visual-debugging/region.png +0 -0
natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
natural_pdf/templates/spa/css/style.css +0 -334
natural_pdf/templates/spa/index.html +0 -31
natural_pdf/templates/spa/js/app.js +0 -472
natural_pdf/templates/spa/words.txt +0 -235976
natural_pdf/widgets/frontend/viewer.js +0 -88
natural_pdf-0.1.7.dist-info/RECORD +0 -145
notebooks/Examples.ipynb +0 -1293
pdfs/.gitkeep +0 -0
pdfs/01-practice.pdf +0 -543
pdfs/0500000US42001.pdf +0 -0
pdfs/0500000US42007.pdf +0 -0
pdfs/2014 Statistics.pdf +0 -0
pdfs/2019 Statistics.pdf +0 -0
pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
pdfs/needs-ocr.pdf +0 -0
{natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0

docs/tables/index.md DELETED

@@ -1,144 +0,0 @@
-# Table Extraction
-Extracting tables from PDFs can range from straightforward to complex. Natural PDF provides several tools and methods to handle different scenarios, leveraging both rule-based (`pdfplumber`) and model-based (`TATR`) approaches.
-## Setup
-Let's load a PDF containing tables.
-```python
-from natural_pdf import PDF
-# Load the PDF
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-# Select the first page
-page = pdf.pages[0]
-# Display the page
-page.show()
-```
-## Basic Table Extraction (No Detection)
-If you know a table exists, you can try `extract_table()` directly on the page or a region. This uses `pdfplumber` behind the scenes.
-```python
-# Extract the first table found on the page using pdfplumber
-# This works best for simple tables with clear lines
-table_data = page.extract_table() # Returns a list of lists
-table_data
-```
-*This might fail or give poor results if there are multiple tables or the table structure is complex.*
-## Layout Analysis for Table Detection
-A more robust approach can be to first *detect* the table boundaries using layout analysis.
-### Using YOLO (Default)
-The default YOLO model finds the overall bounding box of tables.
-```python
-# Detect layout elements using YOLO (default)
-page.analyze_layout(engine='yolo')
-# Find regions detected as tables
-table_regions_yolo = page.find_all('region[type=table][model=yolo]')
-table_regions_yolo.show()
-```
-```python
-table_regions_yolo[0].extract_table()
-```
-### Using TATR (Table Transformer)
-The TATR model provides detailed table structure (rows, columns, headers).
-```python
-page.clear_detected_layout_regions() # Clear previous YOLO regions for clarity
-page.analyze_layout(engine='tatr')
-```
-```python
-# Find the main table region(s) detected by TATR
-tatr_table = page.find('region[type=table][model=tatr]')
-tatr_table.show()
-```
-```python
-# Find rows, columns, headers detected by TATR
-rows = page.find_all('region[type=table-row][model=tatr]')
-cols = page.find_all('region[type=table-column][model=tatr]')
-hdrs = page.find_all('region[type=table-column-header][model=tatr]')
-f"TATR found: {len(rows)} rows, {len(cols)} columns, {len(hdrs)} headers"
-```
-## Controlling Extraction Method (`plumber` vs `tatr`)
-When you call `extract_table()` on a region:
-- If the region was detected by **YOLO** (or not detected at all), it uses the `plumber` method.
-- If the region was detected by **TATR**, it defaults to the `tatr` method, which uses the detected row/column structure.
-You can override this using the `method` argument.
-```python
-tatr_table = page.find('region[type=table][model=tatr]')
-tatr_table.extract_table(method='tatr')
-```
-```python
-# Force using pdfplumber even on a TATR-detected region
-# (Might be useful for comparison or if TATR structure is flawed)
-tatr_table = page.find('region[type=table][model=tatr]')
-tatr_table.extract_table(method='pdfplumber')
-```
-### When to Use Which Method?
-- **`pdfplumber`**: Good for simple tables with clear grid lines. Faster.
-- **`tatr`**: Better for tables without clear lines, complex cell merging, or irregular layouts. Leverages the model's understanding of rows and columns.
-## Customizing `pdfplumber` Settings
-If using the `pdfplumber` method (explicitly or implicitly), you can pass `pdfplumber` settings via `table_settings`.
-```python
-# Example: Use text alignment for vertical lines, explicit lines for horizontal
-# See pdfplumber documentation for all settings
-table_settings = {
-    "vertical_strategy": "text",
-    "horizontal_strategy": "lines",
-    "intersection_x_tolerance": 5, # Increase tolerance for intersections
-}
-results = page.extract_table(
-    table_settings=table_settings
-)
-```
-## Saving Extracted Tables
-You can easily save the extracted data (list of lists) to common formats.
-```python
-import pandas as pd
-pd.DataFrame(page.extract_table())
-```
-## Working Directly with TATR Cells
-The TATR engine implicitly creates cell regions at the intersection of detected rows and columns. You can access these for fine-grained control.
-```python
-# This doesn't work! I forget why, I should troubleshoot later.
-# tatr_table.cells
-```
-## Next Steps
-- [Layout Analysis](../layout-analysis/index.ipynb): Understand how table detection fits into overall document structure analysis.
-- [Working with Regions](../regions/index.ipynb): Manually define table areas if detection fails.
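
Note on the removed "Saving Extracted Tables" section: its only example wraps the result in a pandas DataFrame and never writes a file. Below is a minimal sketch of serializing the same list-of-lists result to CSV with only the standard library. It assumes the first row holds the column headers and that empty cells may come back as `None`. The `table_to_csv` helper and the `rows` sample are hypothetical stand-ins for illustration, not part of natural-pdf.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a list-of-lists table to CSV text, writing None cells as empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample standing in for the output of page.extract_table()
rows = [["Name", "Count"], ["Widgets", "3"], ["Gadgets", None]]
csv_text = table_to_csv(rows)
print(csv_text)
```

Writing `csv_text` to a file opened with `newline=""` would give a CSV that spreadsheet tools can open. Cells containing commas are quoted automatically by the `csv` module.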
