natural-pdf 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- natural_pdf/__init__.py +3 -0
- natural_pdf/analyzers/layout/base.py +1 -5
- natural_pdf/analyzers/layout/gemini.py +61 -51
- natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
- natural_pdf/analyzers/layout/layout_manager.py +26 -84
- natural_pdf/analyzers/layout/layout_options.py +7 -0
- natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
- natural_pdf/analyzers/layout/surya.py +46 -123
- natural_pdf/analyzers/layout/tatr.py +51 -4
- natural_pdf/analyzers/text_structure.py +3 -5
- natural_pdf/analyzers/utils.py +3 -3
- natural_pdf/classification/manager.py +422 -0
- natural_pdf/classification/mixin.py +163 -0
- natural_pdf/classification/results.py +80 -0
- natural_pdf/collections/mixins.py +111 -0
- natural_pdf/collections/pdf_collection.py +434 -15
- natural_pdf/core/element_manager.py +83 -0
- natural_pdf/core/highlighting_service.py +13 -22
- natural_pdf/core/page.py +578 -93
- natural_pdf/core/pdf.py +912 -460
- natural_pdf/elements/base.py +134 -40
- natural_pdf/elements/collections.py +712 -109
- natural_pdf/elements/region.py +722 -69
- natural_pdf/elements/text.py +4 -1
- natural_pdf/export/mixin.py +137 -0
- natural_pdf/exporters/base.py +3 -3
- natural_pdf/exporters/paddleocr.py +5 -4
- natural_pdf/extraction/manager.py +135 -0
- natural_pdf/extraction/mixin.py +279 -0
- natural_pdf/extraction/result.py +23 -0
- natural_pdf/ocr/__init__.py +5 -5
- natural_pdf/ocr/engine_doctr.py +346 -0
- natural_pdf/ocr/engine_easyocr.py +6 -3
- natural_pdf/ocr/ocr_factory.py +24 -4
- natural_pdf/ocr/ocr_manager.py +122 -26
- natural_pdf/ocr/ocr_options.py +94 -11
- natural_pdf/ocr/utils.py +19 -6
- natural_pdf/qa/document_qa.py +0 -4
- natural_pdf/search/__init__.py +20 -34
- natural_pdf/search/haystack_search_service.py +309 -265
- natural_pdf/search/haystack_utils.py +99 -75
- natural_pdf/search/search_service_protocol.py +11 -12
- natural_pdf/selectors/parser.py +431 -230
- natural_pdf/utils/debug.py +3 -3
- natural_pdf/utils/identifiers.py +1 -1
- natural_pdf/utils/locks.py +8 -0
- natural_pdf/utils/packaging.py +8 -6
- natural_pdf/utils/text_extraction.py +60 -1
- natural_pdf/utils/tqdm_utils.py +51 -0
- natural_pdf/utils/visualization.py +18 -0
- natural_pdf/widgets/viewer.py +4 -25
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
- natural_pdf-0.1.9.dist-info/RECORD +80 -0
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
- docs/api/index.md +0 -386
- docs/assets/favicon.png +0 -3
- docs/assets/favicon.svg +0 -3
- docs/assets/javascripts/custom.js +0 -17
- docs/assets/logo.svg +0 -3
- docs/assets/sample-screen.png +0 -0
- docs/assets/social-preview.png +0 -17
- docs/assets/social-preview.svg +0 -17
- docs/assets/stylesheets/custom.css +0 -65
- docs/document-qa/index.ipynb +0 -435
- docs/document-qa/index.md +0 -79
- docs/element-selection/index.ipynb +0 -915
- docs/element-selection/index.md +0 -229
- docs/finetuning/index.md +0 -176
- docs/index.md +0 -170
- docs/installation/index.md +0 -69
- docs/interactive-widget/index.ipynb +0 -962
- docs/interactive-widget/index.md +0 -12
- docs/layout-analysis/index.ipynb +0 -818
- docs/layout-analysis/index.md +0 -185
- docs/ocr/index.md +0 -209
- docs/pdf-navigation/index.ipynb +0 -314
- docs/pdf-navigation/index.md +0 -97
- docs/regions/index.ipynb +0 -816
- docs/regions/index.md +0 -294
- docs/tables/index.ipynb +0 -658
- docs/tables/index.md +0 -144
- docs/text-analysis/index.ipynb +0 -370
- docs/text-analysis/index.md +0 -105
- docs/text-extraction/index.ipynb +0 -1478
- docs/text-extraction/index.md +0 -292
- docs/tutorials/01-loading-and-extraction.ipynb +0 -194
- docs/tutorials/01-loading-and-extraction.md +0 -95
- docs/tutorials/02-finding-elements.ipynb +0 -340
- docs/tutorials/02-finding-elements.md +0 -149
- docs/tutorials/03-extracting-blocks.ipynb +0 -147
- docs/tutorials/03-extracting-blocks.md +0 -48
- docs/tutorials/04-table-extraction.ipynb +0 -114
- docs/tutorials/04-table-extraction.md +0 -50
- docs/tutorials/05-excluding-content.ipynb +0 -270
- docs/tutorials/05-excluding-content.md +0 -109
- docs/tutorials/06-document-qa.ipynb +0 -332
- docs/tutorials/06-document-qa.md +0 -91
- docs/tutorials/07-layout-analysis.ipynb +0 -288
- docs/tutorials/07-layout-analysis.md +0 -66
- docs/tutorials/07-working-with-regions.ipynb +0 -413
- docs/tutorials/07-working-with-regions.md +0 -151
- docs/tutorials/08-spatial-navigation.ipynb +0 -508
- docs/tutorials/08-spatial-navigation.md +0 -190
- docs/tutorials/09-section-extraction.ipynb +0 -2434
- docs/tutorials/09-section-extraction.md +0 -256
- docs/tutorials/10-form-field-extraction.ipynb +0 -512
- docs/tutorials/10-form-field-extraction.md +0 -201
- docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
- docs/tutorials/11-enhanced-table-processing.md +0 -9
- docs/tutorials/12-ocr-integration.ipynb +0 -604
- docs/tutorials/12-ocr-integration.md +0 -175
- docs/tutorials/13-semantic-search.ipynb +0 -1328
- docs/tutorials/13-semantic-search.md +0 -77
- docs/visual-debugging/index.ipynb +0 -2970
- docs/visual-debugging/index.md +0 -157
- docs/visual-debugging/region.png +0 -0
- natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
- natural_pdf/templates/spa/css/style.css +0 -334
- natural_pdf/templates/spa/index.html +0 -31
- natural_pdf/templates/spa/js/app.js +0 -472
- natural_pdf/templates/spa/words.txt +0 -235976
- natural_pdf/widgets/frontend/viewer.js +0 -88
- natural_pdf-0.1.7.dist-info/RECORD +0 -145
- notebooks/Examples.ipynb +0 -1293
- pdfs/.gitkeep +0 -0
- pdfs/01-practice.pdf +0 -543
- pdfs/0500000US42001.pdf +0 -0
- pdfs/0500000US42007.pdf +0 -0
- pdfs/2014 Statistics.pdf +0 -0
- pdfs/2019 Statistics.pdf +0 -0
- pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
- pdfs/needs-ocr.pdf +0 -0
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
docs/tables/index.md
DELETED
@@ -1,144 +0,0 @@
-# Table Extraction
-
-Extracting tables from PDFs can range from straightforward to complex. Natural PDF provides several tools and methods to handle different scenarios, leveraging both rule-based (`pdfplumber`) and model-based (`TATR`) approaches.
-
-## Setup
-
-Let's load a PDF containing tables.
-
-```python
-from natural_pdf import PDF
-
-# Load the PDF
-pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-
-# Select the first page
-page = pdf.pages[0]
-
-# Display the page
-page.show()
-```
-
-## Basic Table Extraction (No Detection)
-
-If you know a table exists, you can try `extract_table()` directly on the page or a region. This uses `pdfplumber` behind the scenes.
-
-```python
-# Extract the first table found on the page using pdfplumber
-# This works best for simple tables with clear lines
-table_data = page.extract_table() # Returns a list of lists
-table_data
-```
-
-*This might fail or give poor results if there are multiple tables or the table structure is complex.*
-
-## Layout Analysis for Table Detection
-
-A more robust approach can be to first *detect* the table boundaries using layout analysis.
-
-### Using YOLO (Default)
-
-The default YOLO model finds the overall bounding box of tables.
-
-```python
-# Detect layout elements using YOLO (default)
-page.analyze_layout(engine='yolo')
-
-# Find regions detected as tables
-table_regions_yolo = page.find_all('region[type=table][model=yolo]')
-table_regions_yolo.show()
-```
-
-```python
-table_regions_yolo[0].extract_table()
-```
-
-### Using TATR (Table Transformer)
-
-The TATR model provides detailed table structure (rows, columns, headers).
-
-```python
-page.clear_detected_layout_regions() # Clear previous YOLO regions for clarity
-page.analyze_layout(engine='tatr')
-```
-
-```python
-# Find the main table region(s) detected by TATR
-tatr_table = page.find('region[type=table][model=tatr]')
-tatr_table.show()
-```
-
-```python
-# Find rows, columns, headers detected by TATR
-rows = page.find_all('region[type=table-row][model=tatr]')
-cols = page.find_all('region[type=table-column][model=tatr]')
-hdrs = page.find_all('region[type=table-column-header][model=tatr]')
-f"TATR found: {len(rows)} rows, {len(cols)} columns, {len(hdrs)} headers"
-```
-
-## Controlling Extraction Method (`plumber` vs `tatr`)
-
-When you call `extract_table()` on a region:
-- If the region was detected by **YOLO** (or not detected at all), it uses the `plumber` method.
-- If the region was detected by **TATR**, it defaults to the `tatr` method, which uses the detected row/column structure.
-
-You can override this using the `method` argument.
-
-```python
-tatr_table = page.find('region[type=table][model=tatr]')
-tatr_table.extract_table(method='tatr')
-```
-
-```python
-# Force using pdfplumber even on a TATR-detected region
-# (Might be useful for comparison or if TATR structure is flawed)
-tatr_table = page.find('region[type=table][model=tatr]')
-tatr_table.extract_table(method='pdfplumber')
-```
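The default-selection rule described in this section can be sketched in plain Python. This is an illustration only, not natural-pdf's actual implementation: `pick_table_method` and its `model` and `method` parameters are hypothetical names, and the method strings simply mirror the ones that appear in this document.

```python
from typing import Optional


def pick_table_method(model: Optional[str], method: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the rule described in the text."""
    # An explicit `method` argument always overrides the default.
    if method is not None:
        return method
    # Regions detected by TATR default to the structure-aware method.
    if model == "tatr":
        return "tatr"
    # YOLO-detected or undetected regions fall back to the pdfplumber method.
    return "plumber"


choices = {
    "yolo region": pick_table_method("yolo"),
    "undetected region": pick_table_method(None),
    "tatr region": pick_table_method("tatr"),
    "tatr region, forced": pick_table_method("tatr", method="pdfplumber"),
}
for label, chosen in choices.items():
    print(f"{label}: {chosen}")
```

Keeping the override check first means an explicit `method` wins regardless of which engine produced the region, which matches the override behaviour the text describes.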
-
-### When to Use Which Method?
-
-- **`pdfplumber`**: Good for simple tables with clear grid lines. Faster.
-- **`tatr`**: Better for tables without clear lines, complex cell merging, or irregular layouts. Leverages the model's understanding of rows and columns.
-
-## Customizing `pdfplumber` Settings
-
-If using the `pdfplumber` method (explicitly or implicitly), you can pass `pdfplumber` settings via `table_settings`.
-
-```python
-# Example: Use text alignment for vertical lines, explicit lines for horizontal
-# See pdfplumber documentation for all settings
-table_settings = {
-    "vertical_strategy": "text",
-    "horizontal_strategy": "lines",
-    "intersection_x_tolerance": 5, # Increase tolerance for intersections
-}
-
-results = page.extract_table(
-    table_settings=table_settings
-)
-```
-
-## Saving Extracted Tables
-
-You can easily save the extracted data (list of lists) to common formats.
-
-```python
-import pandas as pd
-
-pd.DataFrame(page.extract_table())
-```
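When pandas is not available, the standard-library `csv` module can serialize the same list of lists. The following is a minimal sketch, assuming the first row holds the headers and that empty cells may come back as `None`. `table_data` is made-up sample data standing in for the result of `page.extract_table()`, and `rows_to_csv` is a hypothetical helper, not part of natural-pdf.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of lists to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # csv.writer writes None as an empty field, so no cleanup is needed here.
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data standing in for page.extract_table()
table_data = [
    ["Name", "Qty"],
    ["Widget", "3"],
    ["Gadget", None],
]

csv_text = rows_to_csv(table_data)
print(csv_text)

# To write to disk instead (file name is illustrative):
# with open("table.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```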
-
-## Working Directly with TATR Cells
-
-The TATR engine implicitly creates cell regions at the intersection of detected rows and columns. You can access these for fine-grained control.
-
-```python
-# This doesn't work! I forget why, I should troubleshoot later.
-# tatr_table.cells
-```
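The idea of a cell as the intersection of a row and a column can be illustrated without the library. The following is a minimal sketch, assuming each detected row and column is reduced to an `(x0, top, x1, bottom)` bounding box. The `Box` type, the `intersect` and `cell_grid` helpers, and all coordinates are hypothetical and are not natural-pdf's API.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    x0: float
    top: float
    x1: float
    bottom: float


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping rectangle of two boxes, or None if they do not overlap."""
    x0, top = max(a.x0, b.x0), max(a.top, b.top)
    x1, bottom = min(a.x1, b.x1), min(a.bottom, b.bottom)
    if x0 >= x1 or top >= bottom:
        return None
    return Box(x0, top, x1, bottom)


def cell_grid(rows: List[Box], cols: List[Box]) -> List[List[Optional[Box]]]:
    """Build a row-major grid of cell boxes from row and column boxes."""
    rows = sorted(rows, key=lambda r: r.top)  # top-to-bottom
    cols = sorted(cols, key=lambda c: c.x0)   # left-to-right
    return [[intersect(r, c) for c in cols] for r in rows]


# Made-up coordinates: two rows and two columns spanning a 200 x 40 table
rows = [Box(0, 0, 200, 20), Box(0, 20, 200, 40)]
cols = [Box(0, 0, 120, 40), Box(120, 0, 200, 40)]

grid = cell_grid(rows, cols)
print(grid[1][0])
```

Each resulting box describes one cell's area. How such a box would be turned back into a region for text extraction depends on the library and is not shown here.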
-
-## Next Steps
-
-- [Layout Analysis](../layout-analysis/index.ipynb): Understand how table detection fits into overall document structure analysis.
-- [Working with Regions](../regions/index.ipynb): Manually define table areas if detection fails.