mistocr 0.1.5__tar.gz → 0.2.4__tar.gz

mistocr-0.2.4/PKG-INFO ADDED
@@ -0,0 +1,256 @@
1
+ Metadata-Version: 2.4
2
+ Name: mistocr
3
+ Version: 0.2.4
4
+ Summary: Batch OCR for PDFs with heading restoration and visual content integration
5
+ Home-page: https://github.com/franckalbinet/mistocr
6
+ Author: Solveit
7
+ Author-email: nobody@fast.ai
8
+ License: Apache Software License 2.0
9
+ Keywords: nbdev jupyter notebook python
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Natural Language :: English
13
+ Classifier: Programming Language :: Python :: 3.9
14
+ Classifier: Programming Language :: Python :: 3.10
15
+ Classifier: Programming Language :: Python :: 3.11
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: License :: OSI Approved :: Apache Software License
18
+ Requires-Python: >=3.9
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Requires-Dist: fastcore
22
+ Requires-Dist: mistralai
23
+ Requires-Dist: pillow
24
+ Requires-Dist: dotenv
25
+ Requires-Dist: lisette
26
+ Provides-Extra: dev
27
+ Dynamic: author
28
+ Dynamic: author-email
29
+ Dynamic: classifier
30
+ Dynamic: description
31
+ Dynamic: description-content-type
32
+ Dynamic: home-page
33
+ Dynamic: keywords
34
+ Dynamic: license
35
+ Dynamic: license-file
36
+ Dynamic: provides-extra
37
+ Dynamic: requires-dist
38
+ Dynamic: requires-python
39
+ Dynamic: summary
40
+
41
+ # Mistocr
42
+
43
+
44
+ <!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
45
+
46
+ **PDF OCR is a critical bottleneck in AI pipelines.** It’s often
47
+ mentioned in passing, as if it’s a trivial step. Practice shows it’s far
48
+ from it. Poorly converted PDFs mean garbage-in-garbage-out for
49
+ downstream AI systems (RAG, …).
50
+
51
+ When [Mistral AI](https://mistral.ai) released their [state-of-the-art
52
+ OCR model](https://mistral.ai/fr/news/mistral-ocr) in March 2025, it
53
+ opened new possibilities for large-scale document processing. While
54
+ alternatives like [datalab.to](https://www.datalab.to) and
55
+ [docling.ai](https://www.docling.ai) offer viable solutions, Mistral OCR
56
+ delivers exceptional accuracy at a compelling price point.
57
+
58
+ **mistocr** emerged from months of real-world usage across projects
59
+ requiring large-scale processing of niche-domain PDFs. It addresses three
60
+ fundamental challenges that raw OCR output leaves unsolved:
61
+
62
+ - **Heading hierarchy restoration**: Even state-of-the-art OCR sometimes
63
+ produces inconsistent heading levels in large documents—a complex task
64
+ to get right. mistocr uses LLM-based analysis to restore proper
65
+ document structure, essential for downstream AI tasks.
66
+
67
+ - **Visual content integration**: Charts, figures and diagrams are
68
+ automatically classified and described, then integrated into the
69
+ markdown. This makes visual information searchable and accessible for
70
+ downstream applications.
71
+
72
+ - **Cost-efficient batch processing**: The OCR step exclusively uses
73
+ Mistral’s batch API, cutting costs by 50% (\$0.50 vs \$1.00 per 1000
74
+ pages) while eliminating the boilerplate code typically required.
75
+
76
+ **In short**: Complete PDF OCR with heading hierarchy fixes and image
77
+ descriptions for RAG and LLM pipelines.
78
+
79
+ ## Get Started
80
+
81
+ Install latest from [pypi](https://pypi.org/project/mistocr), then:
82
+
83
+ ``` sh
84
+ $ pip install mistocr
85
+ ```
86
+
87
+ Set your API keys:
88
+
89
+ ``` python
90
+ import os
91
+ os.environ['MISTRAL_API_KEY'] = 'your-key-here'
92
+ os.environ['ANTHROPIC_API_KEY'] = 'your-key-here' # for refine features (see Advanced Usage for other LLMs)
93
+ ```
94
+
95
+ ### Complete Pipeline
96
+
97
+ #### Single File Processing
98
+
99
+ Process a single PDF with OCR (using Mistral’s batch API for cost
100
+ efficiency), heading fixes, and image descriptions:
101
+
102
+ ``` python
103
+ from mistocr.pipeline import pdf_to_md
104
+ await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')
105
+ ```
106
+
107
+ Step 1/3: Running OCR on files/test/resnet.pdf...
108
+ Mistral batch job status: QUEUED
109
+ Mistral batch job status: RUNNING
110
+ Mistral batch job status: RUNNING
111
+ Step 2/3: Fixing heading hierarchy...
112
+ Step 3/3: Adding image descriptions...
113
+ Describing 7 images...
114
+ Saved descriptions to ocr_temp/resnet/img_descriptions.json
115
+ Adding descriptions to 12 pages...
116
+ Done! Enriched pages saved to files/test/md_test
117
+ Done!
118
+
119
+ This will (as indicated by the output):
120
+
121
+ 1. OCR the PDF using Mistral’s batch API
122
+ 2. Fix heading hierarchy inconsistencies
123
+ 3. Describe images (charts, diagrams) and add those descriptions into
124
+ the markdown
+ 4. Save everything to `files/test/md_test`
125
+
126
+ The output structure will be:
127
+
128
+ files/test/md_test/
129
+ ├── img/
130
+ │ ├── img-0.jpeg
131
+ │ ├── img-1.jpeg
132
+ │ └── ...
133
+ ├── page_1.md
134
+ ├── page_2.md
135
+ └── ...
136
+
137
+ Each page’s markdown will include inline image descriptions:
138
+
139
+ ```` markdown
140
+ ```markdown
141
+ ![Figure 1](img/img-0.jpeg)
142
+ AI-generated image description:
143
+ ___
144
+ A residual learning block...
145
+ ___
146
+ ```
147
+ ````
148
+
149
+ To print the processed markdown, you can use the
150
+ [`read_pgs`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
151
+ function. Here’s how:
152
+
153
+ Then to read the fully processed document:
154
+
155
+ ``` python
156
+ from mistocr.core import read_pgs
157
+ md = read_pgs('files/test/md_test')
158
+ print(md[:500])
159
+ ```
160
+
161
+ # Deep Residual Learning for Image Recognition ... page 1
162
+
163
+ Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
164
+
165
+
166
+ ## Abstract ... page 1
167
+
168
+ Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
169
+
170
+ By default,
171
+ [`read_pgs()`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
172
+ joins all pages. Pass `join=False` to get a list of individual pages
173
+ instead.
174
+
175
+ ### Advanced Usage
176
+
177
+ **Batch OCR for entire folders:**
178
+
179
+ ``` python
180
+ from mistocr.core import ocr_pdf
181
+
182
+ # OCR all PDFs in a folder using Mistral's batch API
183
+ output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')
184
+ ```
185
+
186
+ **Custom models and prompts for heading fixes:**
187
+
188
+ ``` python
189
+ from mistocr.refine import fix_hdgs
190
+
191
+ # Use a different model or custom prompt
192
+ fix_hdgs('ocr_output/doc1',
193
+ model='gpt-4o',
194
+ prompt=your_custom_prompt)
195
+ ```
196
+
197
+ **Custom image description with rate limiting:**
198
+
199
+ ``` python
200
+ from mistocr.refine import add_img_descs
201
+
202
+ # Control API usage and customize descriptions
203
+ await add_img_descs('ocr_output/doc1',
204
+ model='claude-opus-4',
205
+ semaphore=5, # More concurrent requests
206
+ delay=0.5) # Shorter delay between calls
207
+ ```
208
+
209
+ For complete control over each pipeline step, see the
210
+ [core](https://fr.anckalbi.net/mistocr/core.html),
211
+ [refine](https://fr.anckalbi.net/mistocr/refine.html), and
212
+ [pipeline](https://fr.anckalbi.net/mistocr/pipeline.html) module
213
+ documentation.
214
+
215
+ ## Known Limitations & Future Work
216
+
217
+ `mistocr` is under active development. Current limitations include:
218
+
219
+ - **No timeout on batch jobs**: Jobs poll indefinitely until completion.
220
+ If a job stalls, manual intervention is required.
221
+ - **Limited error handling**: When batch jobs fail, error reporting and
222
+ recovery options are minimal.
223
+ - **Progress monitoring**: Currently limited to periodic status prints.
224
+ Future versions will support callbacks or streaming updates for better
225
+ real-time monitoring.
226
+
227
+ Contributions are welcome! If you encounter issues or have ideas for
228
+ improvements, please open an issue or discussion on
229
+ [GitHub](https://github.com/franckalbinet/mistocr).
230
+
231
+ ## Developer Guide
232
+
233
+ If you are new to using `nbdev` here are some useful pointers to get you
234
+ started.
235
+
236
+ ### Install mistocr in Development mode
237
+
238
+ ``` sh
239
+ # make sure mistocr package is installed in development mode
240
+ $ pip install -e .
241
+
242
+ # make changes under nbs/ directory
243
+ # ...
244
+
245
+ # compile to have changes apply to mistocr
246
+ $ nbdev_prepare
247
+ ```
248
+
249
+ ### Documentation
250
+
251
+ Documentation can be found hosted on this GitHub
252
+ [repository](https://github.com/franckalbinet/mistocr)’s
253
+ [pages](https://franckalbinet.github.io/mistocr/). Additionally you can
254
+ find package manager specific guidelines on
255
+ [conda](https://anaconda.org/franckalbinet/mistocr) and
256
+ [pypi](https://pypi.org/project/mistocr/) respectively.
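Reviewer note: the README's Advanced Usage passes `semaphore` and `delay` to `add_img_descs` to control API usage. The diff does not show how those arguments are implemented, so the following is only a minimal sketch of the general pattern (a concurrency cap plus a pause after each call). The names `describe_all` and `fake_describe` are hypothetical and are not part of mistocr.

```python
import asyncio

async def describe_all(items, worker, semaphore=2, delay=1.0):
    """Run `worker` over `items` with at most `semaphore` calls in flight,
    pausing `delay` seconds after each call before freeing its slot."""
    sem = asyncio.Semaphore(semaphore)

    async def limited(item):
        async with sem:
            result = await worker(item)
            await asyncio.sleep(delay)  # spacing between calls (assumed behaviour)
            return result

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(limited(i) for i in items))

async def fake_describe(name):
    # Hypothetical stand-in for a real LLM request
    await asyncio.sleep(0)
    return f"description of {name}"

results = asyncio.run(
    describe_all(["img-0.jpeg", "img-1.jpeg"], fake_describe, semaphore=2, delay=0.01)
)
```

Raising `semaphore` allows more requests in flight, and lowering `delay` shortens the pause after each one, which matches the comments in the README example.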
@@ -0,0 +1,216 @@
1
+ # Mistocr
2
+
3
+
4
+ <!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
5
+
6
+ **PDF OCR is a critical bottleneck in AI pipelines.** It’s often
7
+ mentioned in passing, as if it’s a trivial step. Practice shows it’s far
8
+ from it. Poorly converted PDFs mean garbage-in-garbage-out for
9
+ downstream AI systems (RAG, …).
10
+
11
+ When [Mistral AI](https://mistral.ai) released their [state-of-the-art
12
+ OCR model](https://mistral.ai/fr/news/mistral-ocr) in March 2025, it
13
+ opened new possibilities for large-scale document processing. While
14
+ alternatives like [datalab.to](https://www.datalab.to) and
15
+ [docling.ai](https://www.docling.ai) offer viable solutions, Mistral OCR
16
+ delivers exceptional accuracy at a compelling price point.
17
+
18
+ **mistocr** emerged from months of real-world usage across projects
19
+ requiring large-scale processing of niche-domain PDFs. It addresses three
20
+ fundamental challenges that raw OCR output leaves unsolved:
21
+
22
+ - **Heading hierarchy restoration**: Even state-of-the-art OCR sometimes
23
+ produces inconsistent heading levels in large documents—a complex task
24
+ to get right. mistocr uses LLM-based analysis to restore proper
25
+ document structure, essential for downstream AI tasks.
26
+
27
+ - **Visual content integration**: Charts, figures and diagrams are
28
+ automatically classified and described, then integrated into the
29
+ markdown. This makes visual information searchable and accessible for
30
+ downstream applications.
31
+
32
+ - **Cost-efficient batch processing**: The OCR step exclusively uses
33
+ Mistral’s batch API, cutting costs by 50% (\$0.50 vs \$1.00 per 1000
34
+ pages) while eliminating the boilerplate code typically required.
35
+
36
+ **In short**: Complete PDF OCR with heading hierarchy fixes and image
37
+ descriptions for RAG and LLM pipelines.
38
+
39
+ ## Get Started
40
+
41
+ Install latest from [pypi](https://pypi.org/project/mistocr), then:
42
+
43
+ ``` sh
44
+ $ pip install mistocr
45
+ ```
46
+
47
+ Set your API keys:
48
+
49
+ ``` python
50
+ import os
51
+ os.environ['MISTRAL_API_KEY'] = 'your-key-here'
52
+ os.environ['ANTHROPIC_API_KEY'] = 'your-key-here' # for refine features (see Advanced Usage for other LLMs)
53
+ ```
54
+
55
+ ### Complete Pipeline
56
+
57
+ #### Single File Processing
58
+
59
+ Process a single PDF with OCR (using Mistral’s batch API for cost
60
+ efficiency), heading fixes, and image descriptions:
61
+
62
+ ``` python
63
+ from mistocr.pipeline import pdf_to_md
64
+ await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')
65
+ ```
66
+
67
+ Step 1/3: Running OCR on files/test/resnet.pdf...
68
+ Mistral batch job status: QUEUED
69
+ Mistral batch job status: RUNNING
70
+ Mistral batch job status: RUNNING
71
+ Step 2/3: Fixing heading hierarchy...
72
+ Step 3/3: Adding image descriptions...
73
+ Describing 7 images...
74
+ Saved descriptions to ocr_temp/resnet/img_descriptions.json
75
+ Adding descriptions to 12 pages...
76
+ Done! Enriched pages saved to files/test/md_test
77
+ Done!
78
+
79
+ This will (as indicated by the output):
80
+
81
+ 1. OCR the PDF using Mistral’s batch API
82
+ 2. Fix heading hierarchy inconsistencies
83
+ 3. Describe images (charts, diagrams) and add those descriptions into
84
+ the markdown
+ 4. Save everything to `files/test/md_test`
85
+
86
+ The output structure will be:
87
+
88
+ files/test/md_test/
89
+ ├── img/
90
+ │ ├── img-0.jpeg
91
+ │ ├── img-1.jpeg
92
+ │ └── ...
93
+ ├── page_1.md
94
+ ├── page_2.md
95
+ └── ...
96
+
97
+ Each page’s markdown will include inline image descriptions:
98
+
99
+ ```` markdown
100
+ ```markdown
101
+ ![Figure 1](img/img-0.jpeg)
102
+ AI-generated image description:
103
+ ___
104
+ A residual learning block...
105
+ ___
106
+ ```
107
+ ````
108
+
109
+ To print the processed markdown, you can use the
110
+ [`read_pgs`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
111
+ function. Here’s how:
112
+
113
+ Then to read the fully processed document:
114
+
115
+ ``` python
116
+ from mistocr.core import read_pgs
117
+ md = read_pgs('files/test/md_test')
118
+ print(md[:500])
119
+ ```
120
+
121
+ # Deep Residual Learning for Image Recognition ... page 1
122
+
123
+ Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
124
+
125
+
126
+ ## Abstract ... page 1
127
+
128
+ Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
129
+
130
+ By default,
131
+ [`read_pgs()`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
132
+ joins all pages. Pass `join=False` to get a list of individual pages
133
+ instead.
134
+
135
+ ### Advanced Usage
136
+
137
+ **Batch OCR for entire folders:**
138
+
139
+ ``` python
140
+ from mistocr.core import ocr_pdf
141
+
142
+ # OCR all PDFs in a folder using Mistral's batch API
143
+ output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')
144
+ ```
145
+
146
+ **Custom models and prompts for heading fixes:**
147
+
148
+ ``` python
149
+ from mistocr.refine import fix_hdgs
150
+
151
+ # Use a different model or custom prompt
152
+ fix_hdgs('ocr_output/doc1',
153
+ model='gpt-4o',
154
+ prompt=your_custom_prompt)
155
+ ```
156
+
157
+ **Custom image description with rate limiting:**
158
+
159
+ ``` python
160
+ from mistocr.refine import add_img_descs
161
+
162
+ # Control API usage and customize descriptions
163
+ await add_img_descs('ocr_output/doc1',
164
+ model='claude-opus-4',
165
+ semaphore=5, # More concurrent requests
166
+ delay=0.5) # Shorter delay between calls
167
+ ```
168
+
169
+ For complete control over each pipeline step, see the
170
+ [core](https://fr.anckalbi.net/mistocr/core.html),
171
+ [refine](https://fr.anckalbi.net/mistocr/refine.html), and
172
+ [pipeline](https://fr.anckalbi.net/mistocr/pipeline.html) module
173
+ documentation.
174
+
175
+ ## Known Limitations & Future Work
176
+
177
+ `mistocr` is under active development. Current limitations include:
178
+
179
+ - **No timeout on batch jobs**: Jobs poll indefinitely until completion.
180
+ If a job stalls, manual intervention is required.
181
+ - **Limited error handling**: When batch jobs fail, error reporting and
182
+ recovery options are minimal.
183
+ - **Progress monitoring**: Currently limited to periodic status prints.
184
+ Future versions will support callbacks or streaming updates for better
185
+ real-time monitoring.
186
+
187
+ Contributions are welcome! If you encounter issues or have ideas for
188
+ improvements, please open an issue or discussion on
189
+ [GitHub](https://github.com/franckalbinet/mistocr).
190
+
191
+ ## Developer Guide
192
+
193
+ If you are new to using `nbdev` here are some useful pointers to get you
194
+ started.
195
+
196
+ ### Install mistocr in Development mode
197
+
198
+ ``` sh
199
+ # make sure mistocr package is installed in development mode
200
+ $ pip install -e .
201
+
202
+ # make changes under nbs/ directory
203
+ # ...
204
+
205
+ # compile to have changes apply to mistocr
206
+ $ nbdev_prepare
207
+ ```
208
+
209
+ ### Documentation
210
+
211
+ Documentation can be found hosted on this GitHub
212
+ [repository](https://github.com/franckalbinet/mistocr)’s
213
+ [pages](https://franckalbinet.github.io/mistocr/). Additionally you can
214
+ find package manager specific guidelines on
215
+ [conda](https://anaconda.org/franckalbinet/mistocr) and
216
+ [pypi](https://pypi.org/project/mistocr/) respectively.
@@ -0,0 +1 @@
1
+ __version__ = "0.2.4"
@@ -11,7 +11,7 @@ d = { 'settings': { 'branch': 'main',
11
11
  'mistocr.core.create_batch_entry': ('core.html#create_batch_entry', 'mistocr/core.py'),
12
12
  'mistocr.core.download_results': ('core.html#download_results', 'mistocr/core.py'),
13
13
  'mistocr.core.get_api_key': ('core.html#get_api_key', 'mistocr/core.py'),
14
- 'mistocr.core.ocr': ('core.html#ocr', 'mistocr/core.py'),
14
+ 'mistocr.core.ocr_pdf': ('core.html#ocr_pdf', 'mistocr/core.py'),
15
15
  'mistocr.core.prep_pdf_batch': ('core.html#prep_pdf_batch', 'mistocr/core.py'),
16
16
  'mistocr.core.read_pgs': ('core.html#read_pgs', 'mistocr/core.py'),
17
17
  'mistocr.core.save_images': ('core.html#save_images', 'mistocr/core.py'),
@@ -20,12 +20,22 @@ d = { 'settings': { 'branch': 'main',
20
20
  'mistocr.core.submit_batch': ('core.html#submit_batch', 'mistocr/core.py'),
21
21
  'mistocr.core.upload_pdf': ('core.html#upload_pdf', 'mistocr/core.py'),
22
22
  'mistocr.core.wait_for_job': ('core.html#wait_for_job', 'mistocr/core.py')},
23
+ 'mistocr.pipeline': {'mistocr.pipeline.pdf_to_md': ('pipeline.html#pdf_to_md', 'mistocr/pipeline.py')},
23
24
  'mistocr.refine': { 'mistocr.refine.HeadingCorrections': ('refine.html#headingcorrections', 'mistocr/refine.py'),
25
+ 'mistocr.refine.ImgDescription': ('refine.html#imgdescription', 'mistocr/refine.py'),
26
+ 'mistocr.refine.add_descs_to_pg': ('refine.html#add_descs_to_pg', 'mistocr/refine.py'),
27
+ 'mistocr.refine.add_descs_to_pgs': ('refine.html#add_descs_to_pgs', 'mistocr/refine.py'),
28
+ 'mistocr.refine.add_img_descs': ('refine.html#add_img_descs', 'mistocr/refine.py'),
24
29
  'mistocr.refine.add_pg_hdgs': ('refine.html#add_pg_hdgs', 'mistocr/refine.py'),
25
30
  'mistocr.refine.apply_hdg_fixes': ('refine.html#apply_hdg_fixes', 'mistocr/refine.py'),
31
+ 'mistocr.refine.describe_img': ('refine.html#describe_img', 'mistocr/refine.py'),
32
+ 'mistocr.refine.describe_imgs': ('refine.html#describe_imgs', 'mistocr/refine.py'),
26
33
  'mistocr.refine.fix_hdg_hierarchy': ('refine.html#fix_hdg_hierarchy', 'mistocr/refine.py'),
27
- 'mistocr.refine.fix_md_hdgs': ('refine.html#fix_md_hdgs', 'mistocr/refine.py'),
34
+ 'mistocr.refine.fix_hdgs': ('refine.html#fix_hdgs', 'mistocr/refine.py'),
28
35
  'mistocr.refine.fmt_hdgs_idx': ('refine.html#fmt_hdgs_idx', 'mistocr/refine.py'),
29
36
  'mistocr.refine.get_hdgs': ('refine.html#get_hdgs', 'mistocr/refine.py'),
37
+ 'mistocr.refine.limit': ('refine.html#limit', 'mistocr/refine.py'),
30
38
  'mistocr.refine.mk_fixes_lut': ('refine.html#mk_fixes_lut', 'mistocr/refine.py'),
31
- 'mistocr.refine.read_pgs_pg': ('refine.html#read_pgs_pg', 'mistocr/refine.py')}}}
39
+ 'mistocr.refine.parse_r': ('refine.html#parse_r', 'mistocr/refine.py'),
40
+ 'mistocr.refine.read_pgs_pg': ('refine.html#read_pgs_pg', 'mistocr/refine.py'),
41
+ 'mistocr.refine.save_img_descs': ('refine.html#save_img_descs', 'mistocr/refine.py')}}}
@@ -4,7 +4,7 @@
4
4
 
5
5
  # %% auto 0
6
6
  __all__ = ['ocr_model', 'ocr_endpoint', 'get_api_key', 'upload_pdf', 'create_batch_entry', 'prep_pdf_batch', 'submit_batch',
7
- 'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr', 'read_pgs']
7
+ 'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr_pdf', 'read_pgs']
8
8
 
9
9
  # %% ../nbs/00_core.ipynb 3
10
10
  from fastcore.all import *
@@ -79,10 +79,11 @@ def submit_batch(
79
79
  def wait_for_job(
80
80
  job:dict, # Job dict,
81
81
  c:Mistral=None, # Mistral client,
82
- poll_interval:int=10 # Poll interval in seconds
82
+ poll_interval:int=1 # Poll interval in seconds
83
83
  ) -> dict: # Job dict (with status)
84
84
  "Poll job until completion and return final job status"
85
85
  while job.status in ["QUEUED", "RUNNING"]:
86
+ print(f'Mistral batch job status: {job.status}')
86
87
  time.sleep(poll_interval)
87
88
  job = c.batch.jobs.get(job_id=job.id)
88
89
  return job
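Reviewer note: the loop in this hunk polls while the status is `QUEUED` or `RUNNING` and has no exit condition beyond that, which is the "No timeout on batch jobs" limitation listed in the README. One possible way to bound the wait is sketched below. It is not the package's code: `wait_with_timeout` and the `get_status` callable are hypothetical, and the `SUCCESS` status in the usage lines is an assumed terminal value.

```python
import time

def wait_with_timeout(get_status, poll_interval=1.0, timeout=60.0,
                      pending=("QUEUED", "RUNNING")):
    """Poll `get_status()` until it leaves `pending`, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    status = get_status()
    while status in pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        time.sleep(poll_interval)
        status = get_status()
    return status

# Usage with a stand-in status source instead of a real batch job
statuses = iter(["QUEUED", "RUNNING", "SUCCESS"])
final = wait_with_timeout(lambda: next(statuses), poll_interval=0.01, timeout=5)
```

`time.monotonic()` is used for the deadline so that system clock adjustments do not shorten or extend the wait.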
@@ -161,7 +162,7 @@ def _run_batch(entries:list[dict], c:Mistral, poll_interval:int=2) -> list[dict]
161
162
  return download_results(job, c)
162
163
 
163
164
  # %% ../nbs/00_core.ipynb 43
164
- def ocr(
165
+ def ocr_pdf(
165
166
  path:str, # Path to PDF file or folder,
166
167
  dst:str='md', # Directory to save markdown pages,
167
168
  inc_img:bool=True, # Include image in response,
@@ -174,7 +175,7 @@ def ocr(
174
175
  results = _run_batch(entries, c, poll_interval)
175
176
  return L([save_pages(r['response']['body'], dst, r['custom_id']) for r in results])
176
177
 
177
- # %% ../nbs/00_core.ipynb 48
178
+ # %% ../nbs/00_core.ipynb 47
178
179
  def read_pgs(
179
180
  path:str, # OCR output directory,
180
181
  join:bool=True # Join pages into single string
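Reviewer note: the body of `read_pgs` is outside this hunk, so its actual behaviour is not visible here. Based only on the README (pages saved as `page_1.md`, `page_2.md`, and a `join` flag that switches between one string and a list), a function of this shape could look like the sketch below. The name `read_pages`, the numeric sort, and the blank-line separator are all assumptions.

```python
from pathlib import Path
import tempfile

def read_pages(path, join=True):
    """Read page_<n>.md files in numeric order; return one string or a list."""
    files = sorted(Path(path).glob("page_*.md"),
                   key=lambda p: int(p.stem.split("_")[1]))  # page_10 sorts after page_2
    pages = [f.read_text(encoding="utf-8") for f in files]
    return "\n\n".join(pages) if join else pages

# Usage against a throwaway directory
with tempfile.TemporaryDirectory() as d:
    for n in (1, 2, 10):
        (Path(d) / f"page_{n}.md").write_text(f"# Page {n}", encoding="utf-8")
    as_list = read_pages(d, join=False)
    as_text = read_pages(d)
```

Sorting on the integer suffix matters once a document has ten or more pages, because a plain lexical sort would place `page_10.md` before `page_2.md`.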
@@ -0,0 +1,37 @@
1
+ """End-to-End Pipeline: PDF OCR, Markdown Heading Correction, and AI Image Descriptions"""
2
+
3
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/02_pipeline.ipynb.
4
+
5
+ # %% auto 0
6
+ __all__ = ['pdf_to_md']
7
+
8
+ # %% ../nbs/02_pipeline.ipynb 3
9
+ from fastcore.all import *
10
+ from .core import read_pgs, ocr_pdf
11
+ from .refine import add_img_descs, fix_hdgs
12
+ from pathlib import Path
13
+ from asyncio import Semaphore, gather, sleep
14
+ import os, json, shutil
15
+
16
+ # %% ../nbs/02_pipeline.ipynb 4
17
+ @delegates(add_img_descs)
18
+ async def pdf_to_md(
19
+ pdf_path:str, # Path to input PDF file
20
+ dst:str, # Destination directory for output markdown
21
+ ocr_output:str=None, # Optional OCR output directory (defaults to pdf_path stem)
22
+ model:str='claude-sonnet-4-5', # Model to use for heading fixes and image descriptions
23
+ add_img_desc:bool=True, # Whether to add image descriptions
24
+ progress:bool=True, # Whether to show progress messages
25
+ **kwargs):
26
+ "Convert PDF to markdown with OCR, fixed heading hierarchy, and optional image descriptions"
27
+ n_steps = 3 if add_img_desc else 2
28
+ if progress: print(f"Step 1/{n_steps}: Running OCR on {pdf_path}...")
29
+ ocr_dirs = ocr_pdf(pdf_path, ocr_output or 'ocr_temp')
30
+ ocr_dir = ocr_dirs[0]
31
+ if progress: print(f"Step 2/{n_steps}: Fixing heading hierarchy...")
32
+ fix_hdgs(ocr_dir, model=model)
33
+ if add_img_desc:
34
+ if progress: print(f"Step 3/{n_steps}: Adding image descriptions...")
35
+ await add_img_descs(ocr_dir, dst=dst, model=model, progress=progress, **kwargs)
36
+ elif dst and Path(dst) != ocr_dir: shutil.copytree(ocr_dir, dst, dirs_exist_ok=True)
37
+ if progress: print("Done!")
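Reviewer note: `pdf_to_md` is declared `async`, and the README calls it with a top-level `await`, which works in a notebook or IPython session but not in a plain script. A minimal sketch of driving such a coroutine from synchronous code follows. `pdf_to_md_stub` is a hypothetical stand-in so the example runs without API keys; it returns `dst` only to make the result observable, whereas the function in this diff has no return statement.

```python
import asyncio

async def pdf_to_md_stub(pdf_path, dst):
    # Hypothetical stand-in; in real use this would be mistocr.pipeline.pdf_to_md
    await asyncio.sleep(0)
    return dst

def main():
    # asyncio.run creates an event loop, runs the coroutine to completion, and closes the loop
    return asyncio.run(pdf_to_md_stub("files/test/resnet.pdf", "files/test/md_test"))

out = main()
```

`asyncio.run` cannot be called from inside an already running event loop, so notebook users would keep the `await` form shown in the README.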