PyPI version diff: mistocr 0.0.3 (tar.gz) → 0.1.2 (tar.gz)


Files changed (20)

{mistocr-0.0.3/mistocr.egg-info → mistocr-0.1.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: mistocr
-Version: 0.0.3
+Version: 0.1.2
 Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
 Home-page: https://github.com/franckalbinet/mistocr
 Author: Solveit
@@ -22,6 +22,7 @@ Requires-Dist: fastcore
 Requires-Dist: mistralai
 Requires-Dist: pillow
 Requires-Dist: dotenv
+Requires-Dist: lisette
 Provides-Extra: dev
 Dynamic: author
 Dynamic: author-email
@@ -54,10 +55,11 @@ for large document sets.
 **Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
 \$0.50/1000 pages - a 50% reduction compared to synchronous processing.
-**Simplicity**: A single `ocr()` function handles everything -
-uploading, batch submission, polling for completion, and saving results
-as markdown with extracted images. Process one PDF or an entire folder
-with the same simple interface.
+**Simplicity**: A single
+[`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
+function handles everything - uploading, batch submission, polling for
+completion, and saving results as markdown with extracted images.
+Process one PDF or an entire folder with the same simple interface.
 **Organized output**: Each PDF is automatically saved to its own folder
 with pages as separate markdown files and images in an `img` subfolder,
@@ -80,57 +82,60 @@ $ pip install mistocr
 ## How to use
+### Basic usage
+Process a single PDF:
 ``` python
 from mistocr.core import ocr
-```
-- **Process a single PDF:**
-<!-- -->
+fname = 'files/test/attention-is-all-you-need.pdf'
+result = ocr(fname)
+```
-    fname = 'files/test/attention-is-all-you-need.pdf'
-    result = ocr(fname)
+Or process an entire folder:
 ``` python
+results = ocr('files/test')
 ```
-    files/test/md/attention-is-all-you-need:
-    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
-    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
-    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
+### Output structure
-    files/test/md/attention-is-all-you-need/img:
-    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg
+Each PDF is saved to its own folder with pages as separate markdown
+files and images in an `img` subfolder:
-- **Or process an entire folder:**
+    files/test/md/
+    ├── attention-is-all-you-need/
+    │   ├── img/
+    │   │   ├── img-0.jpeg
+    │   │   ├── img-1.jpeg
+    │   │   └── ...
+    │   ├── page_1.md
+    │   ├── page_2.md
+    │   └── ...
+    └── resnet/
+        ├── img/
+        └── ...
-``` python
-results = ocr('files/test')
-```
+### Reading results
-``` python
-```
+Read all pages from a processed PDF:
-    files/test/md:
-    attention-is-all-you-need/  resnet/
+``` python
+from mistocr.core import read_pgs
-    files/test/md/attention-is-all-you-need:
-    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
-    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
-    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
+text = read_pgs('files/test/md/attention-is-all-you-need')
+```
-    files/test/md/attention-is-all-you-need/img:
-    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg
+Or read a specific page:
-    files/test/md/resnet:
-    img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
-    page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md
+``` python
+text = read_pgs('files/test/md/attention-is-all-you-need', 10)
+```
-    files/test/md/resnet/img:
-    img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
-    img-1.jpeg  img-3.jpeg  img-5.jpeg
+### Customization
-- **Customize the output:**
+Customize output directory, image inclusion, and polling interval:
 ``` python
 results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)

{mistocr-0.0.3 → mistocr-0.1.2}/README.md RENAMED

@@ -15,10 +15,11 @@ for large document sets.
 **Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
 \$0.50/1000 pages - a 50% reduction compared to synchronous processing.
-**Simplicity**: A single `ocr()` function handles everything -
-uploading, batch submission, polling for completion, and saving results
-as markdown with extracted images. Process one PDF or an entire folder
-with the same simple interface.
+**Simplicity**: A single
+[`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
+function handles everything - uploading, batch submission, polling for
+completion, and saving results as markdown with extracted images.
+Process one PDF or an entire folder with the same simple interface.
 **Organized output**: Each PDF is automatically saved to its own folder
 with pages as separate markdown files and images in an `img` subfolder,
@@ -41,57 +42,60 @@ $ pip install mistocr
 ## How to use
+### Basic usage
+Process a single PDF:
 ``` python
 from mistocr.core import ocr
-```
-- **Process a single PDF:**
-<!-- -->
+fname = 'files/test/attention-is-all-you-need.pdf'
+result = ocr(fname)
+```
-    fname = 'files/test/attention-is-all-you-need.pdf'
-    result = ocr(fname)
+Or process an entire folder:
 ``` python
+results = ocr('files/test')
 ```
-    files/test/md/attention-is-all-you-need:
-    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
-    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
-    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
+### Output structure
-    files/test/md/attention-is-all-you-need/img:
-    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg
+Each PDF is saved to its own folder with pages as separate markdown
+files and images in an `img` subfolder:
-- **Or process an entire folder:**
+    files/test/md/
+    ├── attention-is-all-you-need/
+    │   ├── img/
+    │   │   ├── img-0.jpeg
+    │   │   ├── img-1.jpeg
+    │   │   └── ...
+    │   ├── page_1.md
+    │   ├── page_2.md
+    │   └── ...
+    └── resnet/
+        ├── img/
+        └── ...
-``` python
-results = ocr('files/test')
-```
+### Reading results
-``` python
-```
+Read all pages from a processed PDF:
-    files/test/md:
-    attention-is-all-you-need/  resnet/
+``` python
+from mistocr.core import read_pgs
-    files/test/md/attention-is-all-you-need:
-    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
-    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
-    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
+text = read_pgs('files/test/md/attention-is-all-you-need')
+```
-    files/test/md/attention-is-all-you-need/img:
-    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg
+Or read a specific page:
-    files/test/md/resnet:
-    img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
-    page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md
+``` python
+text = read_pgs('files/test/md/attention-is-all-you-need', 10)
+```
-    files/test/md/resnet/img:
-    img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
-    img-1.jpeg  img-3.jpeg  img-5.jpeg
+### Customization
-- **Customize the output:**
+Customize output directory, image inclusion, and polling interval:
 ``` python
 results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
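
Note on the customization example: the README still passes `out_dir='output'`, while the `core.py` diff in this same release renames that parameter of `ocr()` to `dst`. The sketch below uses a hypothetical stand-in, `ocr_stub`, that only copies the 0.1.2 parameter names from the diff and performs no OCR. It suggests that the old keyword would be rejected, assuming the released function has no `**kwargs` fallback (none appears in the diff).

```python
def ocr_stub(path, dst='md', inc_img=True, key=None, poll_interval=2):
    "Hypothetical stand-in using the 0.1.2 parameter names; does no OCR."
    return dict(path=path, dst=dst, inc_img=inc_img, poll_interval=poll_interval)

# The keyword used in the README example no longer matches any parameter.
try:
    ocr_stub('files/test', out_dir='output')
    old_kw_ok = True
except TypeError:
    old_kw_ok = False

# The same call with the renamed keyword binds as intended.
new_call = ocr_stub('files/test', dst='output', inc_img=False, poll_interval=5)
```

If that reading is right, callers upgrading from 0.0.3 would need to switch `out_dir=` to `dst=`.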

mistocr-0.1.2/mistocr/__init__.py ADDED

@@ -0,0 +1 @@
+__version__ = "0.1.2"

{mistocr-0.0.3 → mistocr-0.1.2}/mistocr/_modidx.py RENAMED

@@ -13,9 +13,17 @@ d = { 'settings': { 'branch': 'main',
                               'mistocr.core.get_api_key': ('core.html#get_api_key', 'mistocr/core.py'),
                               'mistocr.core.ocr': ('core.html#ocr', 'mistocr/core.py'),
                               'mistocr.core.prep_pdf_batch': ('core.html#prep_pdf_batch', 'mistocr/core.py'),
+                              'mistocr.core.read_pgs': ('core.html#read_pgs', 'mistocr/core.py'),
                               'mistocr.core.save_images': ('core.html#save_images', 'mistocr/core.py'),
                               'mistocr.core.save_page': ('core.html#save_page', 'mistocr/core.py'),
                               'mistocr.core.save_pages': ('core.html#save_pages', 'mistocr/core.py'),
                               'mistocr.core.submit_batch': ('core.html#submit_batch', 'mistocr/core.py'),
                               'mistocr.core.upload_pdf': ('core.html#upload_pdf', 'mistocr/core.py'),
-                              'mistocr.core.wait_for_job': ('core.html#wait_for_job', 'mistocr/core.py')}}}
+                              'mistocr.core.wait_for_job': ('core.html#wait_for_job', 'mistocr/core.py')},
+            'mistocr.refine': { 'mistocr.refine.HeadingCorrections': ('refine.html#headingcorrections', 'mistocr/refine.py'),
+                                'mistocr.refine.apply_hdg_fixes': ('refine.html#apply_hdg_fixes', 'mistocr/refine.py'),
+                                'mistocr.refine.fix_hdg_hierarchy': ('refine.html#fix_hdg_hierarchy', 'mistocr/refine.py'),
+                                'mistocr.refine.fix_md_hdgs': ('refine.html#fix_md_hdgs', 'mistocr/refine.py'),
+                                'mistocr.refine.fmt_hdgs_idx': ('refine.html#fmt_hdgs_idx', 'mistocr/refine.py'),
+                                'mistocr.refine.get_hdgs': ('refine.html#get_hdgs', 'mistocr/refine.py'),
+                                'mistocr.refine.mk_fixes_lut': ('refine.html#mk_fixes_lut', 'mistocr/refine.py')}}}
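
Note on the new `mistocr.refine` entries: three of the registered helpers (`get_hdgs`, `fmt_hdgs_idx`, `apply_hdg_fixes`) are plain string functions, so how they chain together can be sketched without any model call. The bodies below are copied from the `refine.py` diff in this release, with fastcore's `L` swapped for a plain list. The `corrections` dict is a hand-written stand-in for what the model is asked to return, not real model output.

```python
from re import sub, findall, MULTILINE

def get_hdgs(md):
    "Return markdown headings, ignoring '#' lines inside fenced code."
    md = sub(r'```[\s\S]*?```', '', md)
    return findall(r'^#{1,6} .+$', md, MULTILINE)

def fmt_hdgs_idx(hdgs):
    "Prefix each heading with its index, one per line."
    return '\n'.join(f"{i}. {h}" for i, h in enumerate(hdgs))

def apply_hdg_fixes(p, lut_fixes, pg=None):
    "Replace each heading by its fix (if any) and optionally tag the page number."
    for old in get_hdgs(p):
        p = p.replace(old, lut_fixes.get(old, old) + (f' .... page {pg}' if pg else ''))
    return p

fence = '`' * 3  # built at runtime to keep this example's own fence intact
page = (f"# Title\n\n{fence}python\n# a comment, not a heading\n{fence}\n\n"
        "#### 1.1 Setup\n\nBody text.\n")

hdgs = get_hdgs(page)                      # headings only, code comment excluded
listing = fmt_hdgs_idx(hdgs)               # the text that would go into the prompt
corrections = {1: "## 1.1 Setup"}          # hand-written stand-in for model output
lut = {hdgs[i]: new for i, new in corrections.items()}
fixed = apply_hdg_fixes(page, lut, pg=3)
```

Because `apply_hdg_fixes` relies on `str.replace`, every occurrence of a heading string on the page is rewritten, including inside longer lines that happen to contain it.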

{mistocr-0.0.3 → mistocr-0.1.2}/mistocr/core.py RENAMED

@@ -4,21 +4,17 @@
 # %% auto 0
 __all__ = ['ocr_model', 'ocr_endpoint', 'get_api_key', 'upload_pdf', 'create_batch_entry', 'prep_pdf_batch', 'submit_batch',
-           'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr']
+           'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr', 'read_pgs']
 # %% ../nbs/00_core.ipynb 3
 from fastcore.all import *
-from dotenv import load_dotenv
-import os, json, time, base64, tempfile
+import os, re, json, time, base64, tempfile, logging
 from io import BytesIO
 from pathlib import Path
 from PIL import Image
 from mistralai import Mistral
 # %% ../nbs/00_core.ipynb 6
-load_dotenv()
-# %% ../nbs/00_core.ipynb 7
 def get_api_key(
     key:str=None # Mistral API key
     ):
@@ -27,11 +23,11 @@ def get_api_key(
     if not key: raise ValueError("MISTRAL_API_KEY not found")
     return key
-# %% ../nbs/00_core.ipynb 8
+# %% ../nbs/00_core.ipynb 7
 ocr_model = "mistral-ocr-latest"
 ocr_endpoint = "/v1/ocr"
-# %% ../nbs/00_core.ipynb 11
+# %% ../nbs/00_core.ipynb 10
 def upload_pdf(
     path:str, # Path to PDF file
     key:str=None # Mistral API key
@@ -42,11 +38,11 @@ def upload_pdf(
     uploaded = c.files.upload(file=dict(file_name=path.stem, content=path.read_bytes()), purpose="ocr")
     return c.files.get_signed_url(file_id=uploaded.id).url, c
-# %% ../nbs/00_core.ipynb 16
+# %% ../nbs/00_core.ipynb 15
 def create_batch_entry(
     path:str, # Path to PDF file,
     url:str, # Mistral signed URL
-    cid:str=None, # Custom ID (by default using the file name without extention)
+    cid:str=None, # Custom ID (by default using the file name without extension)
     inc_img:bool=True # Include image in response
     ) -> dict[str, str | dict[str, str | bool]]: # Batch entry dict
     "Create a batch entry dict for OCR"
@@ -54,7 +50,7 @@ def create_batch_entry(
     if not cid: cid = path.stem
     return dict(custom_id=cid, body=dict(document=dict(type="document_url", document_url=url), include_image_base64=inc_img))
-# %% ../nbs/00_core.ipynb 18
+# %% ../nbs/00_core.ipynb 17
 def prep_pdf_batch(
     path:str, # Path to PDF file,
     cid:str=None, # Custom ID (by default using the file name without extention)
@@ -65,7 +61,7 @@ def prep_pdf_batch(
     url, c = upload_pdf(path, key)
     return create_batch_entry(path, url, cid, inc_img), c
-# %% ../nbs/00_core.ipynb 22
+# %% ../nbs/00_core.ipynb 21
 def submit_batch(
     entries:list[dict], # List of batch entries,
     c:Mistral=None, # Mistral client,
@@ -79,7 +75,7 @@ def submit_batch(
         batch_data = c.files.upload(file=dict(file_name="batch.jsonl", content=open(f.name, "rb")), purpose="batch")
     return c.batch.jobs.create(input_files=[batch_data.id], model=model, endpoint=endpoint)
-# %% ../nbs/00_core.ipynb 25
+# %% ../nbs/00_core.ipynb 24
 def wait_for_job(
     job:dict, # Job dict,
     c:Mistral=None, # Mistral client,
@@ -91,7 +87,7 @@ def wait_for_job(
         job = c.batch.jobs.get(job_id=job.id)
     return job
-# %% ../nbs/00_core.ipynb 27
+# %% ../nbs/00_core.ipynb 26
 def download_results(
     job:dict, # Job dict,
     c:Mistral=None # Mistral client
@@ -100,7 +96,7 @@ def download_results(
     content = c.files.download(file_id=job.output_file).read().decode('utf-8')
     return [json.loads(line) for line in content.strip().split('\n') if line]
-# %% ../nbs/00_core.ipynb 32
+# %% ../nbs/00_core.ipynb 31
 def save_images(
     page:dict, # Page dict,
     img_dir:str='img' # Directory to save images
@@ -111,32 +107,32 @@ def save_images(
             img_bytes = base64.b64decode(img['image_base64'].split(',')[1])
             Image.open(BytesIO(img_bytes)).save(img_dir / img['id'])
-# %% ../nbs/00_core.ipynb 33
+# %% ../nbs/00_core.ipynb 32
 def save_page(
     page:dict, # Page dict,
-    out_dir:str, # Directory to save page
+    dst:str, # Directory to save page
     img_dir:str='img' # Directory to save images
     ) -> None:
     "Save single page markdown and images"
-    (out_dir / f"page_{page['index']+1}.md").write_text(page['markdown'])
+    (dst / f"page_{page['index']+1}.md").write_text(page['markdown'])
     if page.get('images'):
         img_dir.mkdir(exist_ok=True)
         save_images(page, img_dir)
-# %% ../nbs/00_core.ipynb 35
+# %% ../nbs/00_core.ipynb 34
 def save_pages(
     ocr_resp:dict, # OCR response,
-    out_dir:str, # Directory to save pages,
+    dst:str, # Directory to save pages,
     cid:str # Custom ID
     ) -> Path: # Output directory
     "Save markdown pages and images from OCR response to output directory"
-    out_dir = Path(out_dir) / cid
-    out_dir.mkdir(parents=True, exist_ok=True)
-    img_dir = out_dir / 'img'
-    for page in ocr_resp['pages']: save_page(page, out_dir, img_dir)
-    return out_dir
+    dst = Path(dst) / cid
+    dst.mkdir(parents=True, exist_ok=True)
+    img_dir = dst / 'img'
+    for page in ocr_resp['pages']: save_page(page, dst, img_dir)
+    return dst
-# %% ../nbs/00_core.ipynb 41
+# %% ../nbs/00_core.ipynb 40
 def _get_paths(path:str) -> list[Path]:
     "Get list of PDFs from file or folder"
     path = Path(path)
@@ -147,7 +143,7 @@ def _get_paths(path:str) -> list[Path]:
         return pdfs
     raise ValueError(f"Path not found: {path}")
-# %% ../nbs/00_core.ipynb 42
+# %% ../nbs/00_core.ipynb 41
 def _prep_batch(pdfs:list[Path], inc_img:bool=True, key:str=None) -> tuple[list[dict], Mistral]:
     "Prepare batch entries for list of PDFs"
     entries, c = [], None
@@ -156,7 +152,7 @@ def _prep_batch(pdfs:list[Path], inc_img:bool=True, key:str=None) -> tuple[list[
         entries.append(entry)
     return entries, c
-# %% ../nbs/00_core.ipynb 43
+# %% ../nbs/00_core.ipynb 42
 def _run_batch(entries:list[dict], c:Mistral, poll_interval:int=2) -> list[dict]:
     "Submit batch, wait for completion, and download results"
     job = submit_batch(entries, c)
@@ -164,10 +160,10 @@ def _run_batch(entries:list[dict], c:Mistral, poll_interval:int=2) -> list[dict]
     if job.status != 'SUCCESS': raise Exception(f"Job failed with status: {job.status}")
     return download_results(job, c)
-# %% ../nbs/00_core.ipynb 44
+# %% ../nbs/00_core.ipynb 43
 def ocr(
     path:str, # Path to PDF file or folder,
-    out_dir:str='md', # Directory to save markdown pages,
+    dst:str='md', # Directory to save markdown pages,
     inc_img:bool=True, # Include image in response,
     key:str=None, # API key,
     poll_interval:int=2 # Poll interval in seconds
@@ -176,4 +172,15 @@ def ocr(
     pdfs = _get_paths(path)
     entries, c = _prep_batch(pdfs, inc_img, key)
     results = _run_batch(entries, c, poll_interval)
-    return L([save_pages(r['response']['body'], out_dir, r['custom_id']) for r in results])
+    return L([save_pages(r['response']['body'], dst, r['custom_id']) for r in results])
+# %% ../nbs/00_core.ipynb 48
+def read_pgs(
+    path:str, # OCR output directory,
+    join:bool=True # Join pages into single string
+    ) -> str|list[str]: # Joined string or list of page contents
+    "Read specific page or all pages from OCR output directory"
+    path = Path(path)
+    pgs = sorted(path.glob('page_*.md'), key=lambda p: int(p.stem.split('_')[1]))
+    contents = L([p.read_text() for p in pgs])
+    return '\n\n'.join(contents) if join else contents
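
Note on `read_pgs`: two behaviours follow from the signature shown in this hunk. Pages are ordered by the integer after `page_`, so `page_10.md` sorts after `page_2.md`. The second parameter is `join`, so a call such as `read_pgs(path, 10)` from the README would bind `10` to `join` rather than select page 10. The sketch below copies the function body from the diff, using a plain list instead of fastcore's `L`, and runs it on a throwaway directory; the file contents are made up for illustration.

```python
import tempfile
from pathlib import Path

def read_pgs_sketch(path, join=True):
    "Copy of the diff's read_pgs logic, with a plain list instead of fastcore's L."
    path = Path(path)
    pgs = sorted(path.glob('page_*.md'), key=lambda p: int(p.stem.split('_')[1]))
    contents = [p.read_text() for p in pgs]
    return '\n\n'.join(contents) if join else contents

with tempfile.TemporaryDirectory() as d:
    for n in (1, 2, 10):
        (Path(d) / f'page_{n}.md').write_text(f'page {n}')

    # Plain filename ordering puts page_10 before page_2 ...
    by_name = [p.name for p in sorted(Path(d).glob('page_*.md'))]
    # ... while the numeric key restores reading order.
    pages = read_pgs_sketch(d, join=False)

    text = read_pgs_sketch(d)            # all pages joined by blank lines
    positional = read_pgs_sketch(d, 10)  # 10 is truthy, so this is join=True again
```

Under this signature, one way to get a single page would be to index the list form, for example `read_pgs(path, join=False)[9]` for page 10.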

mistocr-0.1.2/mistocr/refine.py ADDED

@@ -0,0 +1,117 @@
+"""Postprocess markdown files by fixing heading hierarchy and describint images"""
+# AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/01_refine.ipynb.
+# %% auto 0
+__all__ = ['prompt_fix_hdgs', 'get_hdgs', 'fmt_hdgs_idx', 'HeadingCorrections', 'fix_hdg_hierarchy', 'mk_fixes_lut',
+           'apply_hdg_fixes', 'fix_md_hdgs']
+# %% ../nbs/01_refine.ipynb 3
+from fastcore.all import *
+from .core import read_pgs
+from re import sub, findall, MULTILINE
+from pydantic import BaseModel
+from lisette.core import completion
+import os
+import json
+# %% ../nbs/01_refine.ipynb 7
+def get_hdgs(
+    md:str # Markdown file string
+    ):
+    "Return the markdown headings"
+    # Sanitize removing '#' in python snippet if any
+    md = sub(r'```[\s\S]*?```', '', md)
+    return L(findall(r'^#{1,6} .+$', md, MULTILINE))
+# %% ../nbs/01_refine.ipynb 10
+def fmt_hdgs_idx(
+    hdgs: list[str] # List of markdown headings
+    ) -> str: # Formatted string with index
+    "Format the headings with index"
+    return '\n'.join(f"{i}. {h}" for i, h in enumerate(hdgs))
+# %% ../nbs/01_refine.ipynb 13
+class HeadingCorrections(BaseModel):
+    corrections: dict[int, str]  # index → corrected heading
+# %% ../nbs/01_refine.ipynb 15
+prompt_fix_hdgs = """Fix markdown heading hierarchy errors while preserving the document's intended structure.
+INPUT FORMAT: Each heading is prefixed with its index number (e.g., "0. # Title")
+RULES - Apply these fixes in order:
+1. **Single H1 rule**: Documents must have exactly ONE # heading (the title/main heading)
+   - All other headings should be ## or deeper
+2. **Infer depth from numbering patterns**: If headings contain section numbers, deeper nesting means deeper heading level
+   - Parent section (e.g., "1", "2", "A") should be shallower than child (e.g., "1.1", "2.a", "A.1")
+   - Child section should be one # deeper than parent
+   - Works with any numbering: "1/1.1/1.1.1", "A/A.1/A.1.a", "I/I.A/I.A.1", etc.
+3. **Level jumps**: Headings can only increase by one # at a time when moving deeper
+   - Wrong: ## Section → ##### Subsection
+   - Fixed: ## Section → ### Subsection
+4. **Decreasing levels is OK**: Moving back up the hierarchy (### to ##) is valid for new sections
+OUTPUT: Return a Python dictionary mapping index to corrected heading (without the index prefix).
+Only include entries that need changes.
+Headings to analyze:
+{headings_list}
+"""
+# %% ../nbs/01_refine.ipynb 16
+def fix_hdg_hierarchy(
+    hdgs: list[str], # List of markdown headings
+    prompt: str=prompt_fix_hdgs, # Prompt to use
+    model: str='claude-sonnet-4-5', # Model to use
+    api_key: str=os.getenv('ANTHROPIC_API_KEY') # API key
+    ) -> dict[int, str]: # Dictionary of index → corrected heading
+    "Fix the heading hierarchy"
+    r = completion(
+        model=model,
+        messages=[{"role": "user", "content": prompt_fix_hdgs.format(headings_list=fmt_hdgs_idx(hdgs))}],
+        response_format=HeadingCorrections,
+        api_key=api_key
+        )
+    return json.loads(r.choices[0].message.content)['corrections']
+# %% ../nbs/01_refine.ipynb 19
+def mk_fixes_lut(
+    hdgs: list[str], # List of markdown headings
+    model: str='claude-sonnet-4-5', # Model to use
+    api_key: str=os.getenv('ANTHROPIC_API_KEY') # API key
+    ) -> dict[str, str]: # Dictionary of old → new heading
+    "Make a lookup table of fixes"
+    fixes = fix_hdg_hierarchy(hdgs, model, api_key)
+    return {hdgs[int(k)]:v for k,v in fixes.items()}
+# %% ../nbs/01_refine.ipynb 22
+def apply_hdg_fixes(
+    p:str, # Page to fix
+    lut_fixes: dict[str, str], # Lookup table of fixes
+    pg: int=None, # Optionnaly specify the page number to append to original heading
+    ) -> str: # Page with fixes applied
+    "Apply the fixes to the page"
+    for old in get_hdgs(p): p = p.replace(old, lut_fixes.get(old, old) + (f' .... page {pg}' if pg else ''))
+    return p
+# %% ../nbs/01_refine.ipynb 25
+def fix_md_hdgs(
+    src:str, # Source directory with markdown pages
+    model:str='claude-sonnet-4-5', # Model
+    dst:str=None, # Destination directory (None=overwrite)
+    pg_nums:bool=True # Add page numbers
+):
+    "Fix heading hierarchy in markdown document"
+    src_path,dst_path = Path(src),Path(dst) if dst else Path(src)
+    if dst_path != src_path: dst_path.mkdir(parents=True, exist_ok=True)
+    lut = mk_fixes_lut(get_hdgs(read_pgs(src_path)), model)
+    for i,p in enumerate(read_pgs(src_path, join=False), 1):
+        (dst_path/f'page_{i}.md').write_text(apply_hdg_fixes(p, lut, pg=i if pg_nums else None))
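
Note on argument order in `refine.py`: `fix_hdg_hierarchy` is declared as `(hdgs, prompt, model, api_key)`, but `mk_fixes_lut` calls it positionally as `fix_hdg_hierarchy(hdgs, model, api_key)`. Reading the diff alone, that would seem to shift each argument one slot to the left. The sketch below reproduces only the two signatures with stub bodies that echo their bound arguments; no model is called, and `mk_fixes_lut_by_keyword` is a hypothetical variant added for comparison, not part of the package.

```python
def fix_hdg_hierarchy(hdgs, prompt='PROMPT', model='claude-sonnet-4-5', api_key=None):
    "Stub with the parameter order shown in the diff; echoes what it received."
    return dict(prompt=prompt, model=model, api_key=api_key)

def mk_fixes_lut_as_written(hdgs, model='claude-sonnet-4-5', api_key=None):
    # Positional call, as in the diff: model lands in `prompt`, api_key in `model`.
    return fix_hdg_hierarchy(hdgs, model, api_key)

def mk_fixes_lut_by_keyword(hdgs, model='claude-sonnet-4-5', api_key=None):
    # Hypothetical variant: keywords keep each value in its intended slot.
    return fix_hdg_hierarchy(hdgs, model=model, api_key=api_key)

as_written = mk_fixes_lut_as_written(['# T'], model='some-model', api_key='secret')
by_keyword = mk_fixes_lut_by_keyword(['# T'], model='some-model', api_key='secret')
```

In the released code the `prompt` argument is not used inside `fix_hdg_hierarchy` (the body formats `prompt_fix_hdgs` directly), so the visible effect would be on `model` and `api_key`. This is an observation from the diff text, not a tested claim against the installed package.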

{mistocr-0.0.3 → mistocr-0.1.2/mistocr.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: mistocr
-Version: 0.0.3
+Version: 0.1.2
 Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
 Home-page: https://github.com/franckalbinet/mistocr
 Author: Solveit
@@ -22,6 +22,7 @@ Requires-Dist: fastcore
 Requires-Dist: mistralai
 Requires-Dist: pillow
 Requires-Dist: dotenv
+Requires-Dist: lisette
 Provides-Extra: dev
 Dynamic: author
 Dynamic: author-email
@@ -54,10 +55,11 @@ for large document sets.
 **Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
 \$0.50/1000 pages - a 50% reduction compared to synchronous processing.
-**Simplicity**: A single `ocr()` function handles everything -
-uploading, batch submission, polling for completion, and saving results
-as markdown with extracted images. Process one PDF or an entire folder
-with the same simple interface.
+**Simplicity**: A single
+[`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
+function handles everything - uploading, batch submission, polling for
+completion, and saving results as markdown with extracted images.
+Process one PDF or an entire folder with the same simple interface.
 **Organized output**: Each PDF is automatically saved to its own folder
 with pages as separate markdown files and images in an `img` subfolder,
@@ -80,57 +82,60 @@ $ pip install mistocr
 ## How to use
+### Basic usage
+Process a single PDF:
 ``` python
 from mistocr.core import ocr
-```
-- **Process a single PDF:**
-<!-- -->
+fname = 'files/test/attention-is-all-you-need.pdf'
+result = ocr(fname)
+```
-    fname = 'files/test/attention-is-all-you-need.pdf'
-    result = ocr(fname)
+Or process an entire folder:
 ``` python
+results = ocr('files/test')
 ```
-    files/test/md/attention-is-all-you-need:
-    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
-    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
-    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
+### Output structure
-    files/test/md/attention-is-all-you-need/img:
-    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg
+Each PDF is saved to its own folder with pages as separate markdown
+files and images in an `img` subfolder:
-- **Or process an entire folder:**
+    files/test/md/
+    ├── attention-is-all-you-need/
+    │   ├── img/
+    │   │   ├── img-0.jpeg
+    │   │   ├── img-1.jpeg
+    │   │   └── ...
+    │   ├── page_1.md
+    │   ├── page_2.md
+    │   └── ...
+    └── resnet/
+        ├── img/
+        └── ...
-``` python
-results = ocr('files/test')
-```
+### Reading results
-``` python
-```
+Read all pages from a processed PDF:
-    files/test/md:
-    attention-is-all-you-need/  resnet/
+``` python
+from mistocr.core import read_pgs
-    files/test/md/attention-is-all-you-need:
-    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
-    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
-    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md
+text = read_pgs('files/test/md/attention-is-all-you-need')
+```
-    files/test/md/attention-is-all-you-need/img:
-    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg
+Or read a specific page:
-    files/test/md/resnet:
-    img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
-    page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md
+``` python
+text = read_pgs('files/test/md/attention-is-all-you-need', 10)
+```
-    files/test/md/resnet/img:
-    img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
-    img-1.jpeg  img-3.jpeg  img-5.jpeg
+### Customization
-- **Customize the output:**
+Customize output directory, image inclusion, and polling interval:
 ``` python
 results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)

{mistocr-0.0.3 → mistocr-0.1.2}/mistocr.egg-info/SOURCES.txt RENAMED

@@ -7,6 +7,7 @@ setup.py
 mistocr/__init__.py
 mistocr/_modidx.py
 mistocr/core.py
+mistocr/refine.py
 mistocr.egg-info/PKG-INFO
 mistocr.egg-info/SOURCES.txt
 mistocr.egg-info/dependency_links.txt

{mistocr-0.0.3 → mistocr-0.1.2}/mistocr.egg-info/requires.txt RENAMED

@@ -2,5 +2,6 @@ fastcore
 mistralai
 pillow
 dotenv
+lisette
 [dev]

{mistocr-0.0.3 → mistocr-0.1.2}/settings.ini RENAMED

@@ -1,7 +1,7 @@
 [DEFAULT]
 repo = mistocr
 lib_name = mistocr
-version = 0.0.3
+version = 0.1.2
 min_python = 3.9
 license = apache2
 black_formatting = False
@@ -27,7 +27,7 @@ keywords = nbdev jupyter notebook python
 language = English
 status = 3
 user = franckalbinet
-requirements = fastcore mistralai pillow dotenv
+requirements = fastcore mistralai pillow dotenv lisette
 readme_nb = index.ipynb
 allowed_metadata_keys =
 allowed_cell_metadata_keys =