anchorite 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (39) hide show
  1. anchorite-0.1.0/.github/workflows/build.yaml +33 -0
  2. anchorite-0.1.0/.github/workflows/lint.yaml +25 -0
  3. anchorite-0.1.0/.github/workflows/release.yaml +68 -0
  4. anchorite-0.1.0/.gitignore +7 -0
  5. anchorite-0.1.0/.markdownlint.yaml +5 -0
  6. anchorite-0.1.0/.pre-commit-config.yaml +51 -0
  7. anchorite-0.1.0/.python-version +1 -0
  8. anchorite-0.1.0/.readthedocs.yaml +17 -0
  9. anchorite-0.1.0/LICENSE +21 -0
  10. anchorite-0.1.0/PKG-INFO +246 -0
  11. anchorite-0.1.0/README.md +227 -0
  12. anchorite-0.1.0/docs/Makefile +20 -0
  13. anchorite-0.1.0/docs/make.bat +35 -0
  14. anchorite-0.1.0/docs/requirements.txt +4 -0
  15. anchorite-0.1.0/docs/source/_static/anchorite.svg +1 -0
  16. anchorite-0.1.0/docs/source/api.rst +32 -0
  17. anchorite-0.1.0/docs/source/conf.py +21 -0
  18. anchorite-0.1.0/docs/source/index.rst +16 -0
  19. anchorite-0.1.0/docs/source/readme.md +4 -0
  20. anchorite-0.1.0/pyproject.toml +63 -0
  21. anchorite-0.1.0/src/anchorite/__init__.py +392 -0
  22. anchorite-0.1.0/src/anchorite/bbox_alignment.py +421 -0
  23. anchorite-0.1.0/src/anchorite/document.py +102 -0
  24. anchorite-0.1.0/src/anchorite/markdown.py +18 -0
  25. anchorite-0.1.0/src/anchorite/orchestrator.py +111 -0
  26. anchorite-0.1.0/src/anchorite/providers.py +16 -0
  27. anchorite-0.1.0/src/anchorite/range_ops.py +182 -0
  28. anchorite-0.1.0/src/anchorite/types.py +30 -0
  29. anchorite-0.1.0/tests/fixtures/hubble_anchors.json +5194 -0
  30. anchorite-0.1.0/tests/fixtures/hubble_golden.md +312 -0
  31. anchorite-0.1.0/tests/fixtures/hubble_markdown_chunks.json +1 -0
  32. anchorite-0.1.0/tests/test_anchorite.py +76 -0
  33. anchorite-0.1.0/tests/test_bbox_alignment.py +88 -0
  34. anchorite-0.1.0/tests/test_markdown.py +12 -0
  35. anchorite-0.1.0/tests/test_ocr_annotation.py +80 -0
  36. anchorite-0.1.0/tests/test_ocr_nesting.py +75 -0
  37. anchorite-0.1.0/tests/test_range_ops.py +178 -0
  38. anchorite-0.1.0/tests/test_regression.py +80 -0
  39. anchorite-0.1.0/uv.lock +272 -0
@@ -0,0 +1,33 @@
1
+ name: CI Test Build
2
+
3
+ on:
4
+ pull_request:
5
+ push:
6
+ branches:
7
+ - main
8
+
9
+ jobs:
10
+ build:
11
+ name: Build pure Python package
12
+ runs-on: ubuntu-latest
13
+ steps:
14
+ - uses: actions/checkout@v4
15
+
16
+ - name: Install uv
17
+ uses: astral-sh/setup-uv@v5
18
+
19
+ - name: Set up Python
20
+ uses: actions/setup-python@v5
21
+ with:
22
+ python-version: '3.11'
23
+
24
+ - name: Update uv lock
25
+ run: uv lock
26
+
27
+ - name: Build package
28
+ run: uv build
29
+
30
+ - uses: actions/upload-artifact@v4
31
+ with:
32
+ name: dist
33
+ path: dist/*
@@ -0,0 +1,25 @@
1
+ name: Lint
2
+ on: [push]
3
+
4
+ jobs:
5
+ lint:
6
+ runs-on: ubuntu-latest
7
+ defaults:
8
+ run:
9
+ shell: bash -l {0}
10
+
11
+ steps:
12
+ - uses: actions/checkout@v4
13
+
14
+ - uses: actions/setup-python@v5
15
+ with:
16
+ python-version: '3.11'
17
+
18
+ - name: Install packages
19
+ run: pip install pre-commit
20
+
21
+ - name: Install pre-commit hooks
22
+ run: pre-commit install --install-hooks
23
+
24
+ - name: Run pre-commit
25
+ run: pre-commit run --all-files
@@ -0,0 +1,68 @@
1
+ name: Release
2
+
3
+ on:
4
+ release:
5
+ types: [created]
6
+
7
+ jobs:
8
+ build:
9
+ name: Build distribution
10
+ runs-on: ubuntu-latest
11
+ steps:
12
+ - uses: actions/checkout@v4
13
+
14
+ - name: Install uv
15
+ uses: astral-sh/setup-uv@v5
16
+
17
+ - name: Set up Python
18
+ uses: actions/setup-python@v5
19
+ with:
20
+ python-version: '3.11'
21
+
22
+ - name: Update uv lock
23
+ run: uv lock
24
+
25
+ - name: Build package
26
+ run: uv build
27
+
28
+ - uses: actions/upload-artifact@v4
29
+ with:
30
+ name: dist
31
+ path: dist/*
32
+
33
+ upload_release_assets:
34
+ name: Upload Assets to Release
35
+ runs-on: ubuntu-latest
36
+ needs: [build]
37
+ permissions:
38
+ contents: write
39
+ steps:
40
+ - name: Download all artifacts
41
+ uses: actions/download-artifact@v4
42
+ with:
43
+ path: artifacts
44
+ merge-multiple: true
45
+ - name: Upload Wheels and sdist
46
+ uses: softprops/action-gh-release@v2
47
+ with:
48
+ tag_name: ${{ github.event.release.tag_name }}
49
+ files: artifacts/*
50
+ overwrite_files: true
51
+
52
+ publish-to-pypi:
53
+ name: Publish to PyPI
54
+ runs-on: ubuntu-latest
55
+ needs: [build]
56
+ environment:
57
+ name: pypi
58
+ url: https://pypi.org/p/anchorite
59
+ permissions:
60
+ id-token: write # Required for trusted publishing
61
+ steps:
62
+ - name: Download all wheels and sdist
63
+ uses: actions/download-artifact@v4
64
+ with:
65
+ path: dist
66
+ merge-multiple: true
67
+ - name: Publish to PyPI
68
+ uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,7 @@
1
+ __pycache__/
2
+ *.py[cod]
3
+ *.egg-info/
4
+ dist/
5
+ build/
6
+ .venv/
7
+ docs/build/
@@ -0,0 +1,5 @@
1
+ default: true
2
+ MD013: false
3
+ MD033:
4
+ allowed_elements:
5
+ - img
@@ -0,0 +1,51 @@
1
+ default_language_version:
2
+ python: python3.11
3
+ repos:
4
+ - repo: https://github.com/pre-commit/pre-commit-hooks
5
+ rev: v6.0.0
6
+ hooks:
7
+ - id: check-yaml
8
+ - id: end-of-file-fixer
9
+ exclude: 'tests/fixtures'
10
+ - id: trailing-whitespace
11
+ exclude: '\.txt$|\.tsv$'
12
+ - id: check-case-conflict
13
+ - id: check-merge-conflict
14
+ - id: detect-private-key
15
+ - id: debug-statements
16
+ - id: check-added-large-files
17
+
18
+ - repo: https://github.com/igorshubovych/markdownlint-cli
19
+ rev: v0.45.0
20
+ hooks:
21
+ - id: markdownlint
22
+ exclude: 'tests/fixtures'
23
+
24
+ - repo: https://github.com/populationgenomics/pre-commits
25
+ rev: "e37928f761f17d54aca5cedf93848b40ec7cff26"
26
+ hooks:
27
+ - id: cpg-id-checker
28
+
29
+ - repo: https://github.com/astral-sh/ruff-pre-commit
30
+ rev: v0.14.1
31
+ hooks:
32
+ - id: ruff
33
+ args: ["--fix"]
34
+ - id: ruff-format
35
+
36
+ - repo: https://github.com/pre-commit/mirrors-mypy
37
+ rev: v1.18.2
38
+ hooks:
39
+ - id: mypy
40
+ exclude: "docs/"
41
+ args:
42
+ [
43
+ --pretty,
44
+ --show-error-codes,
45
+ --no-strict-optional,
46
+ --ignore-missing-imports,
47
+ --install-types,
48
+ --non-interactive,
49
+ --config-file=./pyproject.toml
50
+ ]
51
+ additional_dependencies: [types-PyYAML==6.0.4, types-toml]
@@ -0,0 +1 @@
1
+ 3.11
@@ -0,0 +1,17 @@
1
+ version: 2
2
+
3
+ build:
4
+ os: ubuntu-lts-latest
5
+ tools:
6
+ python: "3.11"
7
+
8
+ sphinx:
9
+ configuration: docs/source/conf.py
10
+
11
+ formats: all
12
+
13
+ python:
14
+ install:
15
+ - requirements: docs/requirements.txt
16
+ - method: pip
17
+ path: .
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Centre for Population Genomics
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,246 @@
1
+ Metadata-Version: 2.4
2
+ Name: anchorite
3
+ Version: 0.1.0
4
+ Summary: Spatial text alignment and resolution for document OCR
5
+ Author-email: Tobias Sargeant <tobias.sargeant@gmail.com>
6
+ License: MIT
7
+ License-File: LICENSE
8
+ Classifier: License :: OSI Approved :: MIT License
9
+ Classifier: Operating System :: OS Independent
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Classifier: Programming Language :: Python :: 3.13
14
+ Requires-Python: >=3.11
15
+ Requires-Dist: fsspec
16
+ Requires-Dist: pymupdf
17
+ Requires-Dist: seq-smith>=0.5.1
18
+ Description-Content-Type: text/markdown
19
+
20
+ # anchorite
21
+
22
+ <img src="https://raw.githubusercontent.com/populationgenomics/anchorite/main/docs/source/_static/anchorite.svg" alt="anchorite" width="200">
23
+
24
+ **Spatial text alignment for document AI pipelines.**
25
+
26
+ `anchorite` aligns generated Markdown text back to the physical bounding boxes that an OCR engine found on the original document pages. It bridges the gap between generative AI (which produces high-quality, readable Markdown) and traditional OCR (which provides precise coordinates) by finding where each OCR word or phrase appears in the generated output.
27
+
28
+ ---
29
+
30
+ ## The problem
31
+
32
+ Modern document AI pipelines often combine two sources:
33
+
34
+ 1. **A generative model** (Gemini, Claude, GPT-4) that reads a page image and produces clean, well-structured Markdown.
35
+ 2. **An OCR engine** (Google Document AI, Tesseract, Docling) that identifies individual words and their bounding boxes on the page.
36
+
37
+ The generative model's output is readable and accurate but has no coordinates. The OCR output has precise coordinates but poor structure. `anchorite` fuses them: it takes the Markdown as the ground truth for text content and finds the corresponding bounding box for each OCR word or phrase within it.
38
+
39
+ ---
40
+
41
+ ## Installation
42
+
43
+ ```shell
44
+ pip install anchorite
45
+ ```
46
+
47
+ ---
48
+
49
+ ## Core concepts
50
+
51
+ **`Anchor`** — a piece of OCR text with its location: a `text` string, a `page` number (0-indexed), and a `BBox` (bounding box in 0–1000 normalised coordinates).
52
+
53
+ **`BBox`** — a bounding box `(top, left, bottom, right)`.
54
+
55
+ **`alignment`** — a `dict[Anchor, tuple[int, int]]` mapping each anchor to a `(start, end)` character span in the Markdown string.
56
+
57
+ ---
58
+
59
+ ## Workflows
60
+
61
+ ### 1. Align and annotate
62
+
63
+ The most common workflow: align OCR anchors to Markdown, then inject coordinate spans.
64
+
65
+ ```python
66
+ import anchorite
67
+
68
+ anchors = [
69
+ anchorite.Anchor(text="Observations of a Nebula", page=0, box=anchorite.BBox(52, 120, 68, 880)),
70
+ anchorite.Anchor(text="Edwin Hubble", page=0, box=anchorite.BBox(80, 340, 92, 660)),
71
+ ]
72
+
73
+ markdown = "# Observations of a Nebula\n\n*Edwin Hubble*, 1929"
74
+
75
+ alignment = anchorite.align(anchors, markdown)
76
+ annotated = anchorite.annotate(markdown, alignment)
77
+ # <span data-bbox="52,120,68,880" data-page="0">Observations of a Nebula</span>
78
+ # <span data-bbox="80,340,92,660" data-page="0">Edwin Hubble</span>
79
+ ```
80
+
81
+ The annotated Markdown is otherwise valid Markdown and can be rendered normally; the `<span>` tags carry coordinate metadata as HTML attributes.
82
+
83
+ ### 2. Resolve quotes to coordinates
84
+
85
+ Given annotated Markdown and a list of verbatim quotes (e.g. extracted by an LLM), find the bounding boxes that each quote covers. Useful for grounding LLM citations.
86
+
87
+ ```python
88
+ locations = anchorite.resolve(annotated, quotes=["Observations of a Nebula"])
89
+ # {"Observations of a Nebula": [(0, BBox(52, 120, 68, 880))]}
90
+ ```
91
+
92
+ `resolve` uses fuzzy iterative matching so it tolerates minor transcription differences. Each quote maps to a list of `(page, BBox)` pairs — one per distinct OCR anchor the quote overlaps.
93
+
94
+ ### 3. Strip annotations for downstream validation
95
+
96
+ `strip` is the inverse of `annotate`. It removes the `<span>` tags and returns a plain-text string alongside a validation map you can use to check whether a generated quote is grounded in the source document.
97
+
98
+ ```python
99
+ stripped = anchorite.strip(annotated)
100
+ # stripped.plain_text — Markdown with tags removed
101
+ # stripped.validation_map — list of (start, end, Anchor) in plain_text
102
+ ```
103
+
104
+ ### 4. Orchestrated multi-page processing
105
+
106
+ For pipelines that process multi-page documents, `process_document` handles parallelism, page-chunk assembly, and alignment in one call. You supply pre-chunked document data and implement two provider protocols.
107
+
108
+ ```python
109
+ import asyncio
110
+ import anchorite
111
+ from anchorite.document import DocumentChunk
112
+ from anchorite.providers import MarkdownProvider, AnchorProvider
113
+
114
+ class MyMarkdownProvider:
115
+ async def generate_markdown(self, chunk: DocumentChunk) -> str:
116
+ # Call your LLM or OCR layout model here
117
+ ...
118
+
119
+ class MyAnchorProvider:
120
+ async def generate_anchors(self, chunk: DocumentChunk) -> list[anchorite.Anchor]:
121
+ # Call your OCR engine here and return Anchor objects
122
+ ...
123
+
124
+ # Chunk the document yourself (e.g. 10 pages per chunk)
125
+ chunks = list(anchorite.document.chunks("paper.pdf", page_count=10))
126
+
127
+ result = asyncio.run(anchorite.process_document(
128
+ chunks,
129
+ MyMarkdownProvider(),
130
+ MyAnchorProvider(),
131
+ ))
132
+
133
+ print(result.coverage_percent) # fraction of Markdown covered by aligned anchors
134
+ annotated = result.annotate() # AlignmentResult.annotate() calls anchorite.annotate internally
135
+ ```
136
+
137
+ `process_document` runs the markdown and anchor providers concurrently across all chunks using `asyncio.gather`, then aligns the assembled full-document Markdown against the complete anchor set.
138
+
139
+ #### Provider protocols
140
+
141
+ ```python
142
+ class MarkdownProvider(Protocol):
143
+ async def generate_markdown(self, chunk: DocumentChunk) -> str: ...
144
+
145
+ class AnchorProvider(Protocol):
146
+ async def generate_anchors(self, chunk: DocumentChunk) -> list[Anchor]: ...
147
+ ```
148
+
149
+ Both are structural protocols — no inheritance required, duck typing works.
150
+
151
+ #### Document chunking
152
+
153
+ `anchorite.document.chunks(source, *, page_count, mime_type)` splits a PDF into sub-documents of `page_count` pages each. `source` can be a file path, URL, `bytes`, or a file-like object. Images (PNG, JPEG, WebP) are yielded as a single chunk unchanged.
154
+
155
+ You do not have to use `anchorite.document.chunks`. If your pipeline already produces chunks (for example, Docling's own document parser), create `DocumentChunk` objects directly:
156
+
157
+ ```python
158
+ from anchorite.document import DocumentChunk
159
+
160
+ chunk = DocumentChunk(
161
+ document_sha256="abc123...",
162
+ start_page=0,
163
+ end_page=10,
164
+ data=pdf_bytes,
165
+ mime_type="application/pdf",
166
+ )
167
+ ```
168
+
169
+ ---
170
+
171
+ ## API reference
172
+
173
+ ### `anchorite.align(anchors, markdown, *, uniqueness_threshold, min_overlap)`
174
+
175
+ Aligns a sequence of `Anchor` objects to a Markdown string. Returns `dict[Anchor, tuple[int, int]]`.
176
+
177
+ | Parameter | Default | Description |
178
+ |---|---|---|
179
+ | `uniqueness_threshold` | `0.5` | An anchor is accepted only if its best-match score exceeds this fraction of its second-best score. Higher values demand more unique matches. |
180
+ | `min_overlap` | `0.9` | Minimum fraction of the anchor's normalised length that must be covered by the alignment. |
181
+
182
+ ### `anchorite.annotate(markdown, alignment)`
183
+
184
+ Injects `<span data-bbox="t,l,b,r" data-page="N">` tags into Markdown at the positions given by `alignment`. Handles overlapping and nested spans. Math blocks (`$...$`, `$$...$$`) are detected and span boundaries are snapped to their edges so LaTeX is not broken.
185
+
186
+ ### `anchorite.strip(annotated_md)`
187
+
188
+ Removes `<span>` tags and returns a `StrippedMarkdown` with fields:
189
+
190
+ - `plain_text`: the Markdown with all tags removed
191
+ - `validation_map`: sorted list of `(start, end, Anchor)` tuples in `plain_text` coordinates
192
+
193
+ ### `anchorite.resolve(annotated_md, quotes)`
194
+
195
+ Resolves a list of verbatim quote strings to their bounding boxes using fuzzy iterative Smith-Waterman alignment against the stripped text. Returns `dict[str, list[tuple[int, BBox]]]` mapping each quote to a list of `(page, BBox)` pairs.
196
+
197
+ ### `anchorite.process_document(chunks, markdown_provider, anchor_provider, *, ...)`
198
+
199
+ Orchestrates multi-chunk document alignment. Returns `AlignmentResult`.
200
+
201
+ | Parameter | Default | Description |
202
+ |---|---|---|
203
+ | `alignment_uniqueness_threshold` | `0.5` | Passed to `align`. |
204
+ | `alignment_min_overlap` | `0.9` | Passed to `align`. |
205
+ | `renumber` | `True` | Renumber `<!--table-->` and `<!--figure-->` markers across chunks before joining. |
206
+
207
+ ---
208
+
209
+ ## Algorithm
210
+
211
+ ### Normalisation
212
+
213
+ Before any alignment, text is normalised to a reduced alphabet: letters are lowercased, all non-alphanumeric characters (punctuation, whitespace variants) are mapped to a single space, and consecutive spaces are collapsed to one. This makes the alignment robust to minor formatting differences between the OCR text and the generated Markdown (e.g. hyphenation, ligatures, smart quotes).
214
+
215
+ ### Document fragmentation
216
+
217
+ The Markdown is split at HTML comment markers (e.g. `<!--page-->`, `<!--table: 1-->`) into contiguous fragments. Each fragment inherits a page range from its position in the assembled document, which is used to restrict which anchors can match it — anchors are only compared against fragments whose page range includes the anchor's page number.
218
+
219
+ ### Iterative alignment
220
+
221
+ The core loop runs until all anchors are matched or no further progress is made.
222
+
223
+ **Pass 1 — ungapped alignment.** Each unmatched anchor is aligned against each compatible document fragment using ungapped Smith-Waterman local alignment (via `seq_smith.top_k_ungapped_local_align_many`, retrieving the top-2 scores per anchor per fragment). An anchor is promoted to a high-confidence candidate only if both conditions hold:
224
+
225
+ - *Overlap*: the best-match score covers at least `min_overlap` of the anchor's normalised length.
226
+ - *Uniqueness*: the best-match score exceeds `uniqueness_threshold` × the second-best score, ensuring the match is not ambiguous.
227
+
228
+ **Subsequent passes — gapped alignment.** The same candidate-selection logic is repeated using semi-global alignment (`seq_smith.local_global_align_many`), which allows gaps within the alignment. This recovers anchors that the LLM paraphrased or reformatted slightly.
229
+
230
+ ### Span assignment
231
+
232
+ Once a set of high-confidence candidates is identified for a fragment, each candidate is assigned a precise character range within the fragment. Candidates are processed in descending alignment score order and are accepted only if:
233
+
234
+ 1. At least 90% of the aligned positions are exact character matches (no-gap criterion within the assignment step).
235
+ 2. The proposed range is *page-consistent*: anchors from earlier pages must map to earlier positions in the Markdown than anchors from later pages.
236
+ 3. At least 90% of the proposed range is *new* coverage — not already claimed by a higher-scoring anchor in the same fragment.
237
+
238
+ The assigned range is mapped back from normalised-character coordinates to original Markdown character offsets via the `normalized_to_source` index.
239
+
240
+ ### Fragment splitting
241
+
242
+ After assignment, any portion of a document fragment not covered by any accepted anchor becomes a new sub-fragment for subsequent iterations. This allows later iterations to focus on progressively smaller uncovered regions, recovering matches that were hidden by initially ambiguous context.
243
+
244
+ ### Result
245
+
246
+ The final result is a `dict[Anchor, (start, end)]` giving the character span in the original Markdown for each successfully aligned anchor. Anchors that could not be matched with sufficient confidence are omitted.