natural-pdf 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (134)
  1. natural_pdf/__init__.py +3 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +422 -0
  13. natural_pdf/classification/mixin.py +163 -0
  14. natural_pdf/classification/results.py +80 -0
  15. natural_pdf/collections/mixins.py +111 -0
  16. natural_pdf/collections/pdf_collection.py +434 -15
  17. natural_pdf/core/element_manager.py +83 -0
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +578 -93
  20. natural_pdf/core/pdf.py +912 -460
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +712 -109
  23. natural_pdf/elements/region.py +722 -69
  24. natural_pdf/elements/text.py +4 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +5 -4
  28. natural_pdf/extraction/manager.py +135 -0
  29. natural_pdf/extraction/mixin.py +279 -0
  30. natural_pdf/extraction/result.py +23 -0
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/engine_easyocr.py +6 -3
  34. natural_pdf/ocr/ocr_factory.py +24 -4
  35. natural_pdf/ocr/ocr_manager.py +122 -26
  36. natural_pdf/ocr/ocr_options.py +94 -11
  37. natural_pdf/ocr/utils.py +19 -6
  38. natural_pdf/qa/document_qa.py +0 -4
  39. natural_pdf/search/__init__.py +20 -34
  40. natural_pdf/search/haystack_search_service.py +309 -265
  41. natural_pdf/search/haystack_utils.py +99 -75
  42. natural_pdf/search/search_service_protocol.py +11 -12
  43. natural_pdf/selectors/parser.py +431 -230
  44. natural_pdf/utils/debug.py +3 -3
  45. natural_pdf/utils/identifiers.py +1 -1
  46. natural_pdf/utils/locks.py +8 -0
  47. natural_pdf/utils/packaging.py +8 -6
  48. natural_pdf/utils/text_extraction.py +60 -1
  49. natural_pdf/utils/tqdm_utils.py +51 -0
  50. natural_pdf/utils/visualization.py +18 -0
  51. natural_pdf/widgets/viewer.py +4 -25
  52. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
  53. natural_pdf-0.1.9.dist-info/RECORD +80 -0
  54. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
  55. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
  56. docs/api/index.md +0 -386
  57. docs/assets/favicon.png +0 -3
  58. docs/assets/favicon.svg +0 -3
  59. docs/assets/javascripts/custom.js +0 -17
  60. docs/assets/logo.svg +0 -3
  61. docs/assets/sample-screen.png +0 -0
  62. docs/assets/social-preview.png +0 -17
  63. docs/assets/social-preview.svg +0 -17
  64. docs/assets/stylesheets/custom.css +0 -65
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -915
  68. docs/element-selection/index.md +0 -229
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -170
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -209
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -194
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -340
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -147
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -114
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -270
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -332
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -288
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -413
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -508
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2434
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -512
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -604
  112. docs/tutorials/12-ocr-integration.md +0 -175
  113. docs/tutorials/13-semantic-search.ipynb +0 -1328
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.7.dist-info/RECORD +0 -145
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
docs/element-selection/index.md DELETED
@@ -1,229 +0,0 @@
- # Finding Elements with Selectors
-
- Natural PDF uses CSS-like selectors to find elements (text, lines, images, etc.) within a PDF page or document. This guide demonstrates how to use these selectors effectively.
-
- ## Setup
-
- Let's load a sample PDF to work with. We'll use `01-practice.pdf`, which has various elements.
-
- ```python
- from natural_pdf import PDF
-
- # Load the PDF
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-
- # Select the first page
- page = pdf.pages[0]
-
- # Display the page
- page.show()
- ```
-
- ## Basic Element Finding
-
- The core methods are `find()` (returns the first match) and `find_all()` (returns all matches as an `ElementCollection`).
-
- The basic selector structure is `element_type[attribute_filter]:pseudo_class`.
-
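Conceptually, a selector like `text[size>=11]:bold` breaks into those three parts. As a rough illustration (a hypothetical sketch, not natural-pdf's actual parser, which lives in `natural_pdf/selectors/parser.py`), the shape can be split with a regular expression:

```python
import re

# Illustrative model of the element_type[attribute_filter]:pseudo_class shape.
SELECTOR_RE = re.compile(
    r"^(?P<type>\w+)"                             # element type, e.g. 'text'
    r"(?P<attrs>(?:\[[^\]]+\])*)"                 # zero or more [attr<op>value] filters
    r"(?P<pseudos>(?::[\w-]+(?:\([^)]*\))?)*)$"   # zero or more :pseudo(...) parts
)

def parse_selector(selector: str) -> dict:
    match = SELECTOR_RE.match(selector)
    if not match:
        raise ValueError(f"Unrecognized selector: {selector!r}")
    attrs = re.findall(r"\[([^\]]+)\]", match.group("attrs"))
    pseudos = re.findall(r":([\w-]+(?:\([^)]*\))?)", match.group("pseudos"))
    return {"type": match.group("type"), "attrs": attrs, "pseudos": pseudos}

print(parse_selector("text[size>=11]:bold"))
# {'type': 'text', 'attrs': ['size>=11'], 'pseudos': ['bold']}
```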
- ### Finding Text by Content
-
- ```python
- # Find the first text element containing "Summary"
- summary_text = page.find('text:contains("Summary")')
- summary_text
- ```
-
- ```python
- # Find all text elements containing "Inadequate"
- contains_inadequate = page.find_all('text:contains("Inadequate")')
- len(contains_inadequate)
- ```
-
- ```python
- summary_text.highlight(label='summary')
- contains_inadequate.highlight(label="inadequate")
- page.to_image(width=700)
- ```
-
- ## Selecting by Element Type
-
- You can select specific types of elements found in PDFs.
-
- ```python
- # Find all text elements
- all_text = page.find_all('text')
- len(all_text)
- ```
-
- ```python
- # Find all rectangle elements
- all_rects = page.find_all('rect')
- len(all_rects)
- ```
-
- ```python
- # Find all line elements
- all_lines = page.find_all('line')
- len(all_lines)
- ```
-
- ```python
- page.find_all('line').show()
- ```
-
- ## Filtering by Attributes
-
- Use square brackets `[]` to filter elements by their properties (attributes).
-
- ### Common Attributes & Operators
-
- | Attribute | Example Usage | Operators | Notes |
- |---------------|------------------------|-----------|-------|
- | `size` (text) | `text[size>=12]` | `>`, `<`, `>=`, `<=` | Font size in points |
- | `fontname` | `text[fontname*=Bold]` | `=`, `*=` | `*=` for contains substring |
- | `color` (text)| `text[color~=red]` | `~=` | Approx. match (name, rgb, hex) |
- | `width` (line)| `line[width>1]` | `>`, `<`, `>=`, `<=` | Line thickness |
- | `source` | `text[source=ocr]` | `=` | `pdf`, `ocr`, `detected` |
- | `type` (region)| `region[type=table]` | `=` | Layout analysis region type |
-
- ```python
- # Find large text (size >= 11 points)
- page.find_all('text[size>=11]')
- ```
-
- ```python
- # Find text with 'Helvetica' in the font name
- page.find_all('text[fontname*=Helvetica]')
- ```
-
- ```python
- # Find red text (using approximate color match)
- # This PDF has text with color (0.8, 0.0, 0.0)
- red_text = page.find_all('text[color~=red]')
- ```
-
- ```python
- # Highlight the red text (ignoring existing highlights)
- red_text.show()
- ```
-
- ```python
- # Find thick lines (width >= 2)
- page.find_all('line[width>=2]')
- ```
-
- ## Using Pseudo-Classes
-
- Use colons `:` for special conditions (pseudo-classes).
-
- ### Common Pseudo-Classes
-
- | Pseudo-Class | Example Usage | Notes |
- |-----------------------|-----------------------------------------|-------|
- | `:contains('text')` | `text:contains('Report')` | Finds elements containing specific text |
- | `:bold` | `text:bold` | Finds text heuristically identified as bold |
- | `:italic` | `text:italic` | Finds text heuristically identified as italic |
- | `:below(selector)` | `text:below('line[width>=2]')` | Finds elements physically below the reference element |
- | `:above(selector)` | `text:above('text:contains("Summary")')`| Finds elements physically above the reference element |
- | `:left-of(selector)` | `line:left-of('rect')` | Finds elements physically left of the reference element |
- | `:right-of(selector)` | `text:right-of('rect')` | Finds elements physically right of the reference element |
- | `:near(selector)` | `text:near('image')` | Finds elements physically near the reference element |
-
- *Note: Spatial pseudo-classes like `:below`, `:above` identify elements based on bounding box positions relative to the **first** element matched by the inner selector.*
-
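As a rough mental model of that note (a hypothetical sketch, not natural-pdf's implementation): assuming pdfplumber-style `(x0, top, x1, bottom)` boxes with the origin at the top-left of the page, `:below` and `:above` reduce to simple edge comparisons against the reference element's bounding box:

```python
# Boxes are (x0, top, x1, bottom) tuples; smaller 'top' means higher on the page.

def is_below(candidate, reference):
    """True if candidate's top edge starts at or below reference's bottom edge."""
    _, cand_top, _, _ = candidate
    _, _, _, ref_bottom = reference
    return cand_top >= ref_bottom

def is_above(candidate, reference):
    """True if candidate's bottom edge ends at or above reference's top edge."""
    _, _, _, cand_bottom = candidate
    _, ref_top, _, _ = reference
    return cand_bottom <= ref_top

header = (50, 40, 550, 60)   # a heading near the top of the page
body = (50, 80, 550, 700)    # body text underneath it

print(is_below(body, header))   # True
print(is_above(header, body))   # True
```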
- ```python
- # Find bold text
- page.find_all('text:bold').show()
- ```
-
- ```python
- # Combine attribute and pseudo-class: bold text size >= 11
- page.find_all('text[size>=11]:bold')
- ```
-
- ### Spatial Pseudo-Class Examples
-
- ```python
- # Find the thick horizontal line first
- ref_line = page.find('line[width>=2]')
-
- # Find text elements strictly above that line
- text_above_line = page.find_all('text:above("line[width>=2]")')
- text_above_line
- ```
-
- ## Advanced Text Searching Options
-
- Pass options to `find()` or `find_all()` for more control over text matching.
-
- ```python
- # Case-insensitive search for "summary"
- page.find_all('text:contains("summary")', case=False)
- ```
-
- ```python
- # Regular expression search for the inspection ID (e.g., INS-XXX...)
- # The ID is in the red text we found earlier
- page.find_all('text:contains("INS-\\w+")', regex=True)
- ```
-
- ```python
- # Combine regex and case-insensitivity
- page.find_all('text:contains("jungle health")', regex=True, case=False)
- ```
-
- ## Working with ElementCollections
-
- `find_all()` returns an `ElementCollection`, which is like a list but with extra PDF-specific methods.
-
- ```python
- # Get all headings (using a selector for large, bold text)
- headings = page.find_all('text[size>=11]:bold')
- headings
- ```
-
- ```python
- # Get the first and last heading in reading order
- first = headings.first
- last = headings.last
- (first, last)
- ```
-
- ```python
- # Get the physically highest/lowest element in the collection
- highest = headings.highest()
- lowest = headings.lowest()
- (highest, lowest)
- ```
-
- ```python
- # Filter the collection further: headings containing "Service"
- service_headings = headings.find_all('text:contains("Service")')
- service_headings
- ```
-
- ```python
- # Extract text from all elements in the collection
- headings.extract_text()
- ```
-
- *Remember: `.highest()`, `.lowest()`, `.leftmost()`, and `.rightmost()` raise errors if the collection spans multiple pages.*
-
- ## Font Variants
-
- Sometimes PDFs use font variants (prefixes like `AAAAAB+`), which can be useful for selection.
-
- ```python
- # Find text elements with a specific font variant prefix (if any exist)
- # This example PDF doesn't use variants, but the selector works like this:
- page.find_all('text[font-variant=AAAAAB]')
- ```
-
- ## Next Steps
-
- Now that you can find elements, explore:
-
- - [Text Extraction](../text-extraction/index.ipynb): Get text content from found elements.
- - [Spatial Navigation](../pdf-navigation/index.ipynb): Use found elements as anchors to navigate (`.above()`, `.below()`, etc.).
- - [Working with Regions](../regions/index.ipynb): Define areas based on found elements.
- - [Visual Debugging](../visual-debugging/index.ipynb): Techniques for highlighting and visualizing elements.
docs/finetuning/index.md DELETED
@@ -1,176 +0,0 @@
- # OCR Fine-tuning
-
- While the built-in OCR engines (EasyOCR, PaddleOCR, Surya) offer good general performance, you might encounter situations where their accuracy isn't sufficient for your specific needs. This is often the case with:
-
- * **Unique Fonts:** Documents using unusual or stylized fonts.
- * **Specific Languages:** Languages or scripts not perfectly covered by the default models.
- * **Low-Quality Scans:** Noisy or degraded document images.
- * **Specialized Layouts:** Text within complex tables, forms, or unusual arrangements.
-
- Fine-tuning allows you to adapt a pre-trained OCR recognition model to your specific data, significantly improving its accuracy on documents similar to those used for training.
-
- ## Why Fine-tune?
-
- - **Higher Accuracy:** Achieve better text extraction results on your specific document types.
- - **Adaptability:** Train the model to recognize domain-specific terms, symbols, or layouts.
- - **Reduced Errors:** Minimize downstream errors in data extraction and processing pipelines.
-
- ## Strategy: Detect + LLM Correct + Export
-
- Training an OCR model requires accurate ground truth: images of text snippets paired with their correct transcriptions. Manually creating this data is tedious. A powerful alternative leverages the strengths of different models:
-
- 1. **Detect Text Regions:** Use a robust local OCR engine (like Surya or PaddleOCR) primarily for its *detection* capabilities (`detect_only=True`). This identifies the *locations* of text on the page, even if the initial *recognition* isn't perfect. You can combine this with layout analysis or region selections (`.region()`, `.below()`, `.add_exclusion()`) to focus on the specific areas you care about.
- 2. **Correct with LLM:** For each detected text region, send the image snippet to a powerful Large Language Model (LLM) with multimodal capabilities (like GPT-4o or Claude 3.5 Sonnet/Haiku) using the `direct_ocr_llm` utility. The LLM performs high-accuracy OCR on the snippet, providing a "ground truth" transcription.
- 3. **Export for Fine-tuning:** Use the `PaddleOCRRecognitionExporter` to package the original image snippets (from step 1) along with their corresponding LLM-generated text labels (from step 2) into the specific format required by PaddleOCR for fine-tuning its *recognition* model.
-
- This approach combines the efficient spatial detection of local models with the superior text recognition of large generative models to create a high-quality fine-tuning dataset with minimal manual effort.
-
- ## Example: Fine-tuning for Greek Spreadsheet Text
-
- Let's walk through an example of preparing data to fine-tune PaddleOCR for text from a scanned Greek spreadsheet, adapting the process described above.
-
- ```python
- # --- 1. Setup and Load PDF ---
- from natural_pdf import PDF
- from natural_pdf.ocr.utils import direct_ocr_llm
- from natural_pdf.exporters import PaddleOCRRecognitionExporter
- import openai  # Or your preferred LLM client library
- import os
-
- # Ensure your LLM API key is set (using environment variables is recommended)
- # os.environ["OPENAI_API_KEY"] = "sk-..."
- # os.environ["ANTHROPIC_API_KEY"] = "sk-..."
-
- # pdf_path = "path/to/your/document.pdf"
- pdf_path = "pdfs/hidden/the-bad-one.pdf"  # Replace with your PDF path
- pdf = PDF(pdf_path)
-
- # --- 2. (Optional) Exclude Irrelevant Areas ---
- # If the document has consistent headers, footers, or margins you want to ignore,
- # use exclusions *before* detection
- pdf.add_exclusion(lambda page: page.region(right=45))  # Exclude left margin/line numbers
- pdf.add_exclusion(lambda page: page.region(left=500))  # Exclude right margin
-
- # --- 3. Detect Text Regions ---
- # Use a good detection engine. Surya is often robust for line detection.
- # We only want the bounding boxes, not the initial (potentially inaccurate) OCR text.
- print("Detecting text regions...")
- # Process only a subset of pages for demonstration if needed
- num_pages_to_process = 10
- for page in pdf.pages[:num_pages_to_process]:
-     # Use a moderate resolution for detection; higher res used for LLM correction later
-     page.apply_ocr(engine='surya', resolution=120, detect_only=True)
- print(f"Detection complete for {num_pages_to_process} pages.")
-
- # (Optional) Visualize detected boxes on a sample page
- # pdf.pages[9].find_all('text[source=ocr]').show()
-
- # --- 4. Correct with LLM ---
- # Configure your LLM client (example using the OpenAI client, adaptable for others)
- # For Anthropic: client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key=os.environ.get("ANTHROPIC_API_KEY"))
- client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
-
- # Craft a clear prompt for the LLM.
- # Be as specific as possible! If it's in a specific language, what kinds
- # of characters, etc.
- prompt = """OCR this image patch. Return only the exact text content visible in the image.
- Preserve original spelling, capitalization, punctuation, and symbols.
- Do not add any explanatory text, translations, comments, or quotation marks around the result.
- The text is likely from a Greek document, potentially a spreadsheet, containing Modern Greek words or numbers."""
-
- # Define the correction function using direct_ocr_llm
- def correct_text_region(region):
-     # Use a high resolution for the LLM call for best accuracy
-     return direct_ocr_llm(
-         region,
-         client,
-         prompt=prompt,
-         resolution=300,
-         # model="claude-3-5-sonnet-20240620"  # Example Anthropic model
-         model="gpt-4o-mini"  # Example OpenAI model
-     )
-
- # Apply the correction function to the detected text regions
- print("Applying LLM correction to detected regions...")
- for page in pdf.pages[:num_pages_to_process]:
-     # This finds elements added by apply_ocr and passes their regions to 'correct_text_region'.
-     # The returned text from the LLM replaces the original OCR text for these elements.
-     # The source attribute is updated (e.g., to 'ocr-llm-corrected').
-     page.correct_ocr(correct_text_region)
- print("LLM correction complete.")
-
- # --- 5. Export for PaddleOCR Fine-tuning ---
- print("Configuring exporter...")
- exporter = PaddleOCRRecognitionExporter(
-     # Select all of the non-blank OCR text.
-     # Hopefully it's all been LLM-corrected!
-     selector="text[source^=ocr][text!='']",
-     resolution=300,     # Resolution for the exported image crops
-     padding=2,          # Add slight padding around text boxes
-     split_ratio=0.9,    # 90% for training, 10% for validation
-     random_seed=42,     # For reproducible train/val split
-     include_guide=True  # Include the Colab fine-tuning notebook
- )
-
- # Define the output directory
- output_directory = "./my_paddleocr_finetune_data"
- print(f"Exporting data to {output_directory}...")
-
- # Run the export process
- exporter.export(pdf, output_directory)
-
- print("Export complete.")
- print(f"Dataset ready for fine-tuning in: {output_directory}")
- print(f"Next step: Upload '{os.path.join(output_directory, 'fine_tune_paddleocr.ipynb')}' and the rest of the contents to Google Colab.")
-
- # --- Cleanup ---
- pdf.close()
- ```
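The `split_ratio`/`random_seed` pair above corresponds to a deterministic shuffle-then-split of the labeled snippets into training and validation sets. A minimal sketch of the idea (illustrative only, not the exporter's actual code):

```python
import random

def train_val_split(items, split_ratio=0.9, random_seed=42):
    """Shuffle deterministically, then cut into train/validation lists."""
    shuffled = list(items)
    random.Random(random_seed).shuffle(shuffled)  # seeded: same split every run
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical label lines in PaddleOCR's "image<TAB>text" format
labels = [f"images/snippet_{i:04d}.png\tsome text" for i in range(100)]
train, val = train_val_split(labels)
print(len(train), len(val))  # 90 10
```

Because the shuffle is seeded, re-running the export reproduces the same train/validation membership.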
- ## Running the Fine-tuning
-
- The `PaddleOCRRecognitionExporter` automatically includes a Jupyter notebook (`fine_tune_paddleocr.ipynb`) in the output directory. This notebook is pre-configured to guide you through the fine-tuning process on Google Colab (which offers free GPU access):
-
- 1. **Upload:** Upload the entire output directory (e.g., `my_paddleocr_finetune_data`) to your Google Drive or directly to your Colab instance.
- 2. **Open Notebook:** Open the `fine_tune_paddleocr.ipynb` notebook in Google Colab.
- 3. **Set Runtime:** Ensure the Colab runtime is set to use a GPU (Runtime -> Change runtime type -> GPU).
- 4. **Run Cells:** Execute the cells in the notebook sequentially. It will:
-     * Install necessary libraries (PaddlePaddle, PaddleOCR).
-     * Point the training configuration to your uploaded dataset (`images/`, `train.txt`, `val.txt`, `dict.txt`).
-     * Download a pre-trained PaddleOCR model (usually a multilingual one).
-     * Start the fine-tuning process using your data.
-     * Save the fine-tuned model checkpoints.
-     * Export the best model into an "inference format" suitable for use with `natural-pdf`.
- 5. **Download Model:** Download the resulting `inference_model` directory from Colab.
-
- ## Using the Fine-tuned Model
-
- Once you have the `inference_model` directory, you can instruct `natural-pdf` to use it for OCR:
-
- ```python
- import os
-
- from natural_pdf import PDF
- from natural_pdf.ocr import PaddleOCROptions
-
- # Path to the directory you downloaded from Colab
- finetuned_model_dir = "/path/to/your/downloaded/inference_model"
-
- # Specify the path in PaddleOCROptions
- paddle_opts = PaddleOCROptions(
-     rec_model_dir=finetuned_model_dir,
-     rec_char_dict_path=os.path.join(finetuned_model_dir, 'your_dict.txt'),  # Or wherever your dict is
-     use_gpu=True  # If using GPU locally
- )
-
- pdf = PDF("another-similar-document.pdf")
- page = pdf.pages[0]
-
- # Apply OCR using your fine-tuned model
- ocr_elements = page.apply_ocr(engine='paddle', options=paddle_opts)
-
- # Extract text using the improved results
- text = page.extract_text()
- print(text)
-
- pdf.close()
- ```
-
- By following this process, you can significantly enhance OCR performance on your specific documents using the power of fine-tuning.
docs/index.md DELETED
@@ -1,170 +0,0 @@
- # Natural PDF
-
- A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).
-
- Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
-
- - [Live demo here](https://colab.research.google.com/github/jsoma/natural-pdf/blob/main/notebooks/Examples.ipynb)
-
- <div style="max-width: 400px; margin: auto"><a href="assets/sample-screen.png"><img src="assets/sample-screen.png"></a></div>
-
- ## Installation
-
- ```
- pip install natural_pdf
- # All the extras
- pip install "natural_pdf[all]"
- ```
-
- ## Quick Example
-
- ```python
- from natural_pdf import PDF
-
- pdf = PDF('document.pdf')
- page = pdf.pages[0]
-
- # Find the title and get content below it
- title = page.find('text:contains("Summary"):bold')
- content = title.below().extract_text()
-
- # Exclude everything above 'CONFIDENTIAL' and below the last line on the page
- page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
- page.add_exclusion(page.find_all('line')[-1].below())
-
- # Get the clean text without header/footer
- clean_text = page.extract_text()
- ```
-
- ## Key Features
-
- Here are a few highlights of what you can do:
-
- ### Find Elements with Selectors
-
- Use CSS-like selectors to find text, shapes, and more.
-
- ```python
- # Find bold text containing "Revenue"
- page.find('text:contains("Revenue"):bold').extract_text()
-
- # Find all large text
- page.find_all('text[size>=12]').extract_text()
- ```
-
- [Learn more about selectors →](element-selection/index.ipynb)
-
- ### Navigate Spatially
-
- Move around the page relative to elements, not just coordinates.
-
- ```python
- # Extract text below a specific heading
- intro_text = page.find('text:contains("Introduction")').below().extract_text()
-
- # Extract text from one heading to the next
- methods_text = page.find('text:contains("Methods")').below(
-     until='text:contains("Results")'
- ).extract_text()
- ```
-
- [Explore more navigation methods →](pdf-navigation/index.ipynb)
-
- ### Extract Clean Text
-
- Easily extract text content, automatically handling common page elements like headers and footers (if exclusions are set).
-
- ```python
- # Extract all text from the page (respecting exclusions)
- page_text = page.extract_text()
-
- # Extract text from a specific region
- some_region = page.find(...)
- region_text = some_region.extract_text()
- ```
-
- [Learn about text extraction →](text-extraction/index.ipynb)
- [Learn about exclusion zones →](regions/index.ipynb#exclusion-zones)
-
- ### Apply OCR
-
- Extract text from scanned documents using various OCR engines.
-
- ```python
- # Apply OCR using the default engine
- ocr_elements = page.apply_ocr()
-
- # Extract text (will use OCR results if available)
- text = page.extract_text()
- ```
-
- [Explore OCR options →](ocr/index.md)
-
- ### Analyze Document Layout
-
- Use AI models to detect document structures like titles, paragraphs, and tables.
-
- ```python
- # Detect document structure
- page.analyze_layout()
-
- # Highlight titles and tables
- page.find_all('region[type=title]').highlight(color="purple")
- page.find_all('region[type=table]').highlight(color="blue")
-
- # Extract data from the first table
- table_data = page.find('region[type=table]').extract_table()
- ```
-
- [Learn about layout models →](layout-analysis/index.ipynb)
- [Working with tables? →](tables/index.ipynb)
-
- ### Document Question Answering
-
- Ask natural language questions directly to your documents.
-
- ```python
- # Ask a question
- result = pdf.ask("What was the company's revenue in 2022?")
- if result.get("found", False):
-     print(f"Answer: {result['answer']}")
- ```
-
- [Learn about Document QA →](document-qa/index.ipynb)
-
- ### Visualize Your Work
-
- Debug and understand your extractions visually.
-
- ```python
- # Highlight headings
- page.find_all('text[size>=14]').highlight(color="red", label="Headings")
-
- # Launch the interactive viewer (Jupyter)
- # Requires: pip install natural-pdf[interactive]
- page.viewer()
-
- # Or save an image
- # page.save_image("highlighted.png")
- ```
-
- [See more visualization options →](visual-debugging/index.ipynb)
-
- ## Documentation Topics
-
- Choose what you want to learn about:
-
- ### Task-based Guides
- - [Getting Started](installation/index.md): Install the library and run your first extraction
- - [PDF Navigation](pdf-navigation/index.ipynb): Open PDFs and work with pages
- - [Element Selection](element-selection/index.ipynb): Find text and other elements using selectors
- - [Text Extraction](text-extraction/index.ipynb): Extract clean text from documents
- - [Regions](regions/index.ipynb): Work with specific areas of a page
- - [Visual Debugging](visual-debugging/index.ipynb): See what you're extracting
- - [OCR](ocr/index.md): Extract text from scanned documents
- - [Layout Analysis](layout-analysis/index.ipynb): Detect document structure
- - [Tables](tables/index.ipynb): Extract tabular data
- - [Document QA](document-qa/index.ipynb): Ask questions to your documents
-
- ### Reference
- - [API Reference](api/index.md): Complete library reference
docs/installation/index.md DELETED
@@ -1,69 +0,0 @@
- # Getting Started with Natural PDF
-
- Let's get Natural PDF installed and run your first extraction.
-
- ## Installation
-
- The base installation includes the core library and necessary AI dependencies (like PyTorch and Transformers):
-
- ```bash
- pip install natural-pdf
- ```
-
- ### Optional Dependencies
-
- Natural PDF has modular dependencies for different features. Install them based on your needs:
-
- ```bash
- # --- OCR Engines ---
- # Install support for EasyOCR
- pip install natural-pdf[easyocr]
-
- # Install support for PaddleOCR (requires paddlepaddle)
- pip install natural-pdf[paddle]
-
- # Install support for Surya OCR
- pip install natural-pdf[surya]
-
- # --- Layout Detection ---
- # Install support for the YOLO layout model
- pip install natural-pdf[layout_yolo]
-
- # --- Interactive Widget ---
- # Install support for the interactive .viewer() widget in Jupyter
- pip install natural-pdf[interactive]
-
- # --- All Features ---
- # Install all optional dependencies
- pip install natural-pdf[all]
- ```
-
- ## Your First PDF Extraction
-
- Here's a quick example to make sure everything is working:
-
- ```python
- from natural_pdf import PDF
-
- # Open a PDF
- pdf = PDF('your_document.pdf')
-
- # Get the first page
- page = pdf.pages[0]
-
- # Extract all text
- text = page.extract_text()
- print(text)
-
- # Find something specific
- title = page.find('text:bold')
- print(f"Found title: {title.text}")
- ```
-
- ## What's Next?
-
- Now that you have Natural PDF installed, you can:
-
- - Learn to [navigate PDFs](../pdf-navigation/index.ipynb)
- - Explore how to [select elements](../element-selection/index.ipynb)
- - See how to [extract text](../text-extraction/index.ipynb)