ocr-postprocess 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (64) hide show
  1. ocr_postprocess-0.1.0/MANIFEST.in +5 -0
  2. ocr_postprocess-0.1.0/PKG-INFO +189 -0
  3. ocr_postprocess-0.1.0/README.md +152 -0
  4. ocr_postprocess-0.1.0/ocr_postprocess/__init__.py +33 -0
  5. ocr_postprocess-0.1.0/ocr_postprocess/classifier.py +63 -0
  6. ocr_postprocess-0.1.0/ocr_postprocess/cli.py +130 -0
  7. ocr_postprocess-0.1.0/ocr_postprocess/engine/__init__.py +0 -0
  8. ocr_postprocess-0.1.0/ocr_postprocess/engine/denoiser.py +134 -0
  9. ocr_postprocess-0.1.0/ocr_postprocess/engine/extractor_stage.py +107 -0
  10. ocr_postprocess-0.1.0/ocr_postprocess/engine/normalizer.py +128 -0
  11. ocr_postprocess-0.1.0/ocr_postprocess/engine/reconciler.py +170 -0
  12. ocr_postprocess-0.1.0/ocr_postprocess/engine/reconstructor.py +469 -0
  13. ocr_postprocess-0.1.0/ocr_postprocess/engine/transform_stage.py +89 -0
  14. ocr_postprocess-0.1.0/ocr_postprocess/exceptions.py +30 -0
  15. ocr_postprocess-0.1.0/ocr_postprocess/extractors/__init__.py +0 -0
  16. ocr_postprocess-0.1.0/ocr_postprocess/extractors/base.py +103 -0
  17. ocr_postprocess-0.1.0/ocr_postprocess/extractors/helpers.py +63 -0
  18. ocr_postprocess-0.1.0/ocr_postprocess/extractors/label_anchor/__init__.py +0 -0
  19. ocr_postprocess-0.1.0/ocr_postprocess/extractors/label_anchor/line_after_label.py +53 -0
  20. ocr_postprocess-0.1.0/ocr_postprocess/extractors/label_anchor/regex_after_label.py +75 -0
  21. ocr_postprocess-0.1.0/ocr_postprocess/extractors/label_anchor/text_until_next_label.py +79 -0
  22. ocr_postprocess-0.1.0/ocr_postprocess/extractors/label_anchor/value_between_labels.py +65 -0
  23. ocr_postprocess-0.1.0/ocr_postprocess/extractors/label_anchor/value_in_same_line.py +60 -0
  24. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/__init__.py +0 -0
  25. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/cccd.py +120 -0
  26. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/cmnd.py +38 -0
  27. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/currency_vnd.py +48 -0
  28. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/date.py +89 -0
  29. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/email.py +38 -0
  30. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/gender_vn.py +48 -0
  31. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/phone_vn.py +83 -0
  32. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/plate_vn.py +39 -0
  33. ocr_postprocess-0.1.0/ocr_postprocess/extractors/pattern/tax_code.py +53 -0
  34. ocr_postprocess-0.1.0/ocr_postprocess/extractors/registry.py +45 -0
  35. ocr_postprocess-0.1.0/ocr_postprocess/extractors/structured/__init__.py +0 -0
  36. ocr_postprocess-0.1.0/ocr_postprocess/extractors/structured/mrz_cccd.py +111 -0
  37. ocr_postprocess-0.1.0/ocr_postprocess/extractors/universal.py +39 -0
  38. ocr_postprocess-0.1.0/ocr_postprocess/models.py +131 -0
  39. ocr_postprocess-0.1.0/ocr_postprocess/pipeline.py +179 -0
  40. ocr_postprocess-0.1.0/ocr_postprocess/profiles/__init__.py +0 -0
  41. ocr_postprocess-0.1.0/ocr_postprocess/profiles/_generic.yml +13 -0
  42. ocr_postprocess-0.1.0/ocr_postprocess/profiles/cccd_2024.yml +113 -0
  43. ocr_postprocess-0.1.0/ocr_postprocess/profiles/dang_kiem.yml +105 -0
  44. ocr_postprocess-0.1.0/ocr_postprocess/profiles/loader.py +63 -0
  45. ocr_postprocess-0.1.0/ocr_postprocess/profiles/matcher.py +71 -0
  46. ocr_postprocess-0.1.0/ocr_postprocess/profiles/schema.py +197 -0
  47. ocr_postprocess-0.1.0/ocr_postprocess/py.typed +0 -0
  48. ocr_postprocess-0.1.0/ocr_postprocess/renderer/__init__.py +0 -0
  49. ocr_postprocess-0.1.0/ocr_postprocess/renderer/json_renderer.py +59 -0
  50. ocr_postprocess-0.1.0/ocr_postprocess/renderer/llm.py +41 -0
  51. ocr_postprocess-0.1.0/ocr_postprocess/renderer/markdown.py +172 -0
  52. ocr_postprocess-0.1.0/ocr_postprocess/scorer.py +78 -0
  53. ocr_postprocess-0.1.0/ocr_postprocess/transformer.py +304 -0
  54. ocr_postprocess-0.1.0/ocr_postprocess.egg-info/PKG-INFO +189 -0
  55. ocr_postprocess-0.1.0/ocr_postprocess.egg-info/SOURCES.txt +62 -0
  56. ocr_postprocess-0.1.0/ocr_postprocess.egg-info/dependency_links.txt +1 -0
  57. ocr_postprocess-0.1.0/ocr_postprocess.egg-info/entry_points.txt +2 -0
  58. ocr_postprocess-0.1.0/ocr_postprocess.egg-info/requires.txt +7 -0
  59. ocr_postprocess-0.1.0/ocr_postprocess.egg-info/top_level.txt +1 -0
  60. ocr_postprocess-0.1.0/pyproject.toml +13 -0
  61. ocr_postprocess-0.1.0/requirements-dev.txt +7 -0
  62. ocr_postprocess-0.1.0/requirements.txt +7 -0
  63. ocr_postprocess-0.1.0/setup.cfg +4 -0
  64. ocr_postprocess-0.1.0/setup.py +45 -0
@@ -0,0 +1,5 @@
1
+ include README.md
2
+ include requirements.txt
3
+ include requirements-dev.txt
4
+ recursive-include ocr_postprocess/profiles *.yml
5
+ include ocr_postprocess/py.typed
@@ -0,0 +1,189 @@
1
+ Metadata-Version: 2.4
2
+ Name: ocr-postprocess
3
+ Version: 0.1.0
4
+ Summary: Biến raw OCR text thành structured document — trích xuất trường dữ liệu, xử lý nhiễu, cross-check, render JSON/Markdown.
5
+ Home-page: https://github.com/ohmygodvt95/ocr-postprocess
6
+ Author: ohmygodvt95
7
+ License: MIT
8
+ Keywords: ocr post-processing nlp document extraction vietnamese
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
17
+ Classifier: Topic :: Text Processing
18
+ Requires-Python: >=3.10
19
+ Description-Content-Type: text/markdown
20
+ Requires-Dist: pydantic>=2.5
21
+ Requires-Dist: PyYAML>=6.0
22
+ Requires-Dist: rapidfuzz>=3.5
23
+ Requires-Dist: regex>=2024.1.24
24
+ Requires-Dist: typer>=0.12
25
+ Requires-Dist: python-dateutil>=2.9
26
+ Requires-Dist: simpleeval>=0.9.13
27
+ Dynamic: author
28
+ Dynamic: classifier
29
+ Dynamic: description
30
+ Dynamic: description-content-type
31
+ Dynamic: home-page
32
+ Dynamic: keywords
33
+ Dynamic: license
34
+ Dynamic: requires-dist
35
+ Dynamic: requires-python
36
+ Dynamic: summary
37
+
38
+ # ocr-postprocess
39
+
40
+ Biến raw OCR text thành structured document — trích xuất trường dữ liệu, xử lý nhiễu, cross-check, và render ra JSON/Markdown.
41
+
42
+ ## Installation
43
+
44
+ ```bash
45
+ pip install ocr-postprocess
46
+ ```
47
+
48
+ Hoặc cài từ source:
49
+
50
+ ```bash
51
+ git clone https://github.com/ohmygodvt95/ocr-postprocess
52
+ cd ocr-postprocess
53
+ pip install -e .
54
+ ```
55
+
56
+ ## Library usage
57
+
58
+ ```python
59
+ from ocr_postprocess import Pipeline, ProcessedDocument, OcrPostprocessError
60
+
61
+ # Sử dụng profiles bundled sẵn (no extra files needed)
62
+ pipeline = Pipeline.from_default()
63
+
64
+ raw_text = open("scan.txt").read()
65
+
66
+ try:
67
+ doc: ProcessedDocument = pipeline.process(raw_text)
68
+ except OcrPostprocessError as exc:
69
+ print(f"Pipeline error: {exc}")
70
+ raise
71
+
72
+ # Lấy một trường
73
+ name_candidate = doc.get("ho_va_ten")
74
+ if name_candidate:
75
+ print(name_candidate.value) # "NGUYỄN VĂN A"
76
+ print(name_candidate.confidence) # 0.91
77
+
78
+ # Toàn bộ trường đã trích
79
+ fields = {c.key: c.value for c in doc.candidates}
80
+
81
+ # Export JSON
82
+ import json
83
+ print(json.dumps(doc.to_json(), ensure_ascii=False, indent=2))
84
+
85
+ # Export Markdown
86
+ print(doc.markdown)
87
+ ```
88
+
89
+ ### Custom profiles directory
90
+
91
+ ```python
92
+ # Dùng thư mục profiles riêng
93
+ pipeline = Pipeline.from_default(profiles_dir="my_profiles/")
94
+ ```
95
+
96
+ ### Classify only
97
+
98
+ ```python
99
+ profile_id, score = pipeline.classify(raw_text)
100
+ # "cccd_2024", 0.97
101
+ ```
102
+
103
+ ### ProcessedDocument fields
104
+
105
+ | Field | Type | Mô tả |
106
+ |---|---|---|
107
+ | `profile_id` | `str` | Profile được match |
108
+ | `profile_score` | `float` | Điểm classify (0–1) |
109
+ | `candidates` | `list[Candidate]` | Tất cả trường đã trích |
110
+ | `overall_confidence` | `float` | Điểm tin cậy tổng hợp |
111
+ | `warnings` | `list[str]` | Cảnh báo từ pipeline |
112
+ | `markdown` | `str` | Kết quả render Markdown |
113
+ | `cross_checks` | `list[CrossCheck]` | Kết quả cross-check |
114
+
115
+ ## CLI
116
+
117
+ ```bash
118
+ # Process a document
119
+ ocrpp process scan.txt
120
+
121
+ # Markdown output
122
+ ocrpp process scan.txt --format markdown
123
+
124
+ # Classify only
125
+ ocrpp classify scan.txt
126
+
127
+ # Validate a profile
128
+ ocrpp validate-profile profiles/my_profile.yml
129
+
130
+ # Custom profiles directory
131
+ ocrpp process scan.txt --profiles ./my_profiles/
132
+ ```
133
+
134
+ ## Adding a custom profile
135
+
136
+ Tạo file YAML trong thư mục profiles của bạn:
137
+
138
+ ```yaml
139
+ id: my_doc
140
+ version: 1
141
+ display_name: "My document type"
142
+
143
+ classify:
144
+ any_of:
145
+ - contains_any: ["MY DOCUMENT HEADER"]
146
+
147
+ extract:
148
+ - name: document_number
149
+ aliases: ["Document No", "Số chứng từ"]
150
+ extractor: value_in_same_line
151
+ required: true
152
+ ```
153
+
154
+ Sau đó:
155
+
156
+ ```python
157
+ pipeline = Pipeline.from_default(profiles_dir="my_profiles/")
158
+ ```
159
+
160
+ ## Exceptions
161
+
162
+ ```python
163
+ from ocr_postprocess import (
164
+ OcrPostprocessError, # base
165
+ ProfileNotFoundError,
166
+ ProfileValidationError,
167
+ ExtractorNotFoundError,
168
+ TransformError,
169
+ )
170
+ ```
171
+
172
+ ## Development
173
+
174
+ ```bash
175
+ python -m venv .venv && source .venv/bin/activate
176
+ pip install -r requirements-dev.txt
177
+ pip install -e .
178
+
179
+ pytest # all tests
180
+ pytest tests/unit # unit only
181
+ pytest -m golden # golden/regression
182
+ pytest -n auto --cov # parallel + coverage
183
+ ruff check . && black --check .
184
+ ```
185
+
186
+ ## Docs
187
+
188
+ Xem [docs/README.md](docs/README.md) để biết chi tiết về pipeline stages và profile schema.
189
+
@@ -0,0 +1,152 @@
1
+ # ocr-postprocess
2
+
3
+ Biến raw OCR text thành structured document — trích xuất trường dữ liệu, xử lý nhiễu, cross-check, và render ra JSON/Markdown.
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ pip install ocr-postprocess
9
+ ```
10
+
11
+ Hoặc cài từ source:
12
+
13
+ ```bash
14
+ git clone https://github.com/ohmygodvt95/ocr-postprocess
15
+ cd ocr-postprocess
16
+ pip install -e .
17
+ ```
18
+
19
+ ## Library usage
20
+
21
+ ```python
22
+ from ocr_postprocess import Pipeline, ProcessedDocument, OcrPostprocessError
23
+
24
+ # Sử dụng profiles bundled sẵn (no extra files needed)
25
+ pipeline = Pipeline.from_default()
26
+
27
+ raw_text = open("scan.txt").read()
28
+
29
+ try:
30
+ doc: ProcessedDocument = pipeline.process(raw_text)
31
+ except OcrPostprocessError as exc:
32
+ print(f"Pipeline error: {exc}")
33
+ raise
34
+
35
+ # Lấy một trường
36
+ name_candidate = doc.get("ho_va_ten")
37
+ if name_candidate:
38
+ print(name_candidate.value) # "NGUYỄN VĂN A"
39
+ print(name_candidate.confidence) # 0.91
40
+
41
+ # Toàn bộ trường đã trích
42
+ fields = {c.key: c.value for c in doc.candidates}
43
+
44
+ # Export JSON
45
+ import json
46
+ print(json.dumps(doc.to_json(), ensure_ascii=False, indent=2))
47
+
48
+ # Export Markdown
49
+ print(doc.markdown)
50
+ ```
51
+
52
+ ### Custom profiles directory
53
+
54
+ ```python
55
+ # Dùng thư mục profiles riêng
56
+ pipeline = Pipeline.from_default(profiles_dir="my_profiles/")
57
+ ```
58
+
59
+ ### Classify only
60
+
61
+ ```python
62
+ profile_id, score = pipeline.classify(raw_text)
63
+ # "cccd_2024", 0.97
64
+ ```
65
+
66
+ ### ProcessedDocument fields
67
+
68
+ | Field | Type | Mô tả |
69
+ |---|---|---|
70
+ | `profile_id` | `str` | Profile được match |
71
+ | `profile_score` | `float` | Điểm classify (0–1) |
72
+ | `candidates` | `list[Candidate]` | Tất cả trường đã trích |
73
+ | `overall_confidence` | `float` | Điểm tin cậy tổng hợp |
74
+ | `warnings` | `list[str]` | Cảnh báo từ pipeline |
75
+ | `markdown` | `str` | Kết quả render Markdown |
76
+ | `cross_checks` | `list[CrossCheck]` | Kết quả cross-check |
77
+
78
+ ## CLI
79
+
80
+ ```bash
81
+ # Process a document
82
+ ocrpp process scan.txt
83
+
84
+ # Markdown output
85
+ ocrpp process scan.txt --format markdown
86
+
87
+ # Classify only
88
+ ocrpp classify scan.txt
89
+
90
+ # Validate a profile
91
+ ocrpp validate-profile profiles/my_profile.yml
92
+
93
+ # Custom profiles directory
94
+ ocrpp process scan.txt --profiles ./my_profiles/
95
+ ```
96
+
97
+ ## Adding a custom profile
98
+
99
+ Tạo file YAML trong thư mục profiles của bạn:
100
+
101
+ ```yaml
102
+ id: my_doc
103
+ version: 1
104
+ display_name: "My document type"
105
+
106
+ classify:
107
+ any_of:
108
+ - contains_any: ["MY DOCUMENT HEADER"]
109
+
110
+ extract:
111
+ - name: document_number
112
+ aliases: ["Document No", "Số chứng từ"]
113
+ extractor: value_in_same_line
114
+ required: true
115
+ ```
116
+
117
+ Sau đó:
118
+
119
+ ```python
120
+ pipeline = Pipeline.from_default(profiles_dir="my_profiles/")
121
+ ```
122
+
123
+ ## Exceptions
124
+
125
+ ```python
126
+ from ocr_postprocess import (
127
+ OcrPostprocessError, # base
128
+ ProfileNotFoundError,
129
+ ProfileValidationError,
130
+ ExtractorNotFoundError,
131
+ TransformError,
132
+ )
133
+ ```
134
+
135
+ ## Development
136
+
137
+ ```bash
138
+ python -m venv .venv && source .venv/bin/activate
139
+ pip install -r requirements-dev.txt
140
+ pip install -e .
141
+
142
+ pytest # all tests
143
+ pytest tests/unit # unit only
144
+ pytest -m golden # golden/regression
145
+ pytest -n auto --cov # parallel + coverage
146
+ ruff check . && black --check .
147
+ ```
148
+
149
+ ## Docs
150
+
151
+ Xem [docs/README.md](docs/README.md) để biết chi tiết về pipeline stages và profile schema.
152
+
@@ -0,0 +1,33 @@
1
+ """ocr_postprocess — OCR post-processing pipeline."""
2
+
3
+ __version__ = "0.1.0"
4
+
5
+ from ocr_postprocess.exceptions import (
6
+ CyclicComputeError,
7
+ ExtractorNotFoundError,
8
+ OcrPostprocessError,
9
+ ProfileNotFoundError,
10
+ ProfileValidationError,
11
+ TransformError,
12
+ )
13
+ from ocr_postprocess.models import PipelineContext, ProcessedDocument
14
+ from ocr_postprocess.pipeline import Pipeline
15
+ from ocr_postprocess.renderer.llm import render_llm_markdown
16
+
17
+ __all__ = [
18
+ # Core
19
+ "Pipeline",
20
+ "ProcessedDocument",
21
+ "PipelineContext",
22
+ # Renderers
23
+ "render_llm_markdown",
24
+ # Exceptions — import these to catch errors without knowing internal paths
25
+ "OcrPostprocessError",
26
+ "ProfileNotFoundError",
27
+ "ProfileValidationError",
28
+ "ExtractorNotFoundError",
29
+ "TransformError",
30
+ "CyclicComputeError",
31
+ # Version
32
+ "__version__",
33
+ ]
@@ -0,0 +1,63 @@
1
+ """Stage 2 — Classifier: select best matching DocumentProfile."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
6
+
7
+ from ocr_postprocess.models import PipelineContext
8
+ from ocr_postprocess.profiles.matcher import evaluate
9
+
10
+ logger = logging.getLogger(__name__)
11
+
12
+
13
+ def classify_stage(ctx: PipelineContext) -> None:
14
+ """Pipeline stage 2: classify document and load matching profile."""
15
+ profiles: dict = ctx.__dict__.get("_profiles", {})
16
+
17
+ if not profiles:
18
+ logger.warning("No profiles loaded; using generic fallback")
19
+ ctx.classification_score = 0.0
20
+ return
21
+
22
+ text = ctx.normalized_text or ctx.raw_text
23
+ best_id: str | None = None
24
+ best_score = 0.0
25
+ second_score = 0.0
26
+
27
+ scores: list[tuple[float, str]] = []
28
+ for pid, profile in profiles.items():
29
+ if pid.startswith("_"):
30
+ continue # skip generic fallback in main scoring
31
+ try:
32
+ score = evaluate(profile.classify, text)
33
+ except Exception:
34
+ logger.exception("Error evaluating classify for profile '%s'", pid)
35
+ score = 0.0
36
+ scores.append((score, pid))
37
+
38
+ scores.sort(reverse=True)
39
+
40
+ if scores:
41
+ best_score, best_id = scores[0]
42
+ second_score = scores[1][0] if len(scores) > 1 else 0.0
43
+
44
+ if best_score < 0.5:
45
+ # Fall back to _generic
46
+ best_id = "_generic"
47
+ best_score = 0.0
48
+ logger.info("Classification score too low; using _generic")
49
+ elif len(scores) > 1 and best_score - second_score < 0.1:
50
+ logger.warning(
51
+ "Ambiguous classification: '%s'=%.2f vs '%s'=%.2f",
52
+ best_id,
53
+ best_score,
54
+ scores[1][1],
55
+ second_score,
56
+ )
57
+ else:
58
+ best_id = "_generic"
59
+
60
+ profile_obj = profiles.get(best_id)
61
+ ctx.profile = profile_obj
62
+ ctx.classification_score = best_score
63
+ logger.info("Classified as '%s' (score=%.3f)", best_id, best_score)
@@ -0,0 +1,130 @@
1
+ """Typer CLI for ocr_postprocess."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
6
+ import sys
7
+ from pathlib import Path
8
+ from typing import Optional
9
+
10
+ import typer
11
+
12
+ app = typer.Typer(name="ocrpp", help="OCR post-processing pipeline CLI.", add_completion=False)
13
+
14
+ _LOG_LEVELS = {
15
+ "debug": logging.DEBUG,
16
+ "info": logging.INFO,
17
+ "warning": logging.WARNING,
18
+ "error": logging.ERROR,
19
+ }
20
+
21
+
22
+ def _setup_logging(level: str) -> None:
23
+ """Configure root logger with given level string."""
24
+ logging.basicConfig(
25
+ level=_LOG_LEVELS.get(level.lower(), logging.INFO),
26
+ format="%(levelname)s %(name)s: %(message)s",
27
+ )
28
+
29
+
30
@app.command()
def classify(
    input_file: Optional[Path] = typer.Argument(
        None, help="Path to raw text file. Reads stdin if omitted."
    ),
    profiles_dir: Path = typer.Option(
        Path("profiles"), "--profiles", "-p", help="Profiles directory."
    ),
    log_level: str = typer.Option("info", "--log-level", "-l"),
) -> None:
    """Classify a document and print the best matching profile ID."""
    _setup_logging(log_level)
    # Imported lazily so `ocrpp --help` stays fast.
    from ocr_postprocess.pipeline import Pipeline

    text = _read_input(input_file)
    pipeline = Pipeline.from_default(profiles_dir=str(profiles_dir))
    best_id, best_score = pipeline.classify(text)
    typer.echo(f"{best_id}\t{best_score:.4f}")
48
+
49
+
50
@app.command()
def process(
    input_file: Optional[Path] = typer.Argument(
        None, help="Path to raw text file. Reads stdin if omitted."
    ),
    profiles_dir: Path = typer.Option(
        Path("profiles"), "--profiles", "-p", help="Profiles directory."
    ),
    output_format: str = typer.Option(
        "json", "--format", "-f", help="Output format: json, markdown, or llm."
    ),
    debug: bool = typer.Option(False, "--debug", "-d", help="Include debug trace."),
    log_level: str = typer.Option("info", "--log-level", "-l"),
) -> None:
    """Process a document and print extracted fields."""
    _setup_logging(log_level)
    # Imported lazily so `ocrpp --help` stays fast.
    from ocr_postprocess.pipeline import Pipeline
    from ocr_postprocess.renderer.json_renderer import to_json
    from ocr_postprocess.renderer.llm import render_llm_markdown
    from ocr_postprocess.renderer.markdown import render_markdown

    text = _read_input(input_file)
    pipeline = Pipeline.from_default(profiles_dir=str(profiles_dir))
    doc = pipeline.process(text, debug=debug)

    # Dispatch on the requested format; anything unrecognized renders as JSON.
    renderers = {"markdown": render_markdown, "llm": render_llm_markdown}
    render = renderers.get(output_format, to_json)
    typer.echo(render(doc))
81
+
82
+
83
+ @app.command("validate-profile")
84
+ def validate_profile(
85
+ profile_file: Path = typer.Argument(..., help="Path to YAML profile file."),
86
+ log_level: str = typer.Option("info", "--log-level", "-l"),
87
+ ) -> None:
88
+ """Validate a YAML profile file and print errors if any."""
89
+ _setup_logging(log_level)
90
+ from ocr_postprocess.profiles.loader import load_profile
91
+
92
+ try:
93
+ profile = load_profile(profile_file)
94
+ typer.echo(f"OK: {profile.id} (v{profile.version})")
95
+ except Exception as exc:
96
+ typer.echo(f"ERROR: {exc}", err=True)
97
+ raise typer.Exit(code=1)
98
+
99
+
100
+ @app.command("dump-canonical")
101
+ def dump_canonical(
102
+ input_file: Optional[Path] = typer.Argument(None, help="Path to raw text file."),
103
+ profiles_dir: Path = typer.Option(
104
+ Path("profiles"), "--profiles", "-p", help="Profiles directory."
105
+ ),
106
+ log_level: str = typer.Option("info", "--log-level", "-l"),
107
+ ) -> None:
108
+ """Process document and dump canonical JSON to stdout."""
109
+ _setup_logging(log_level)
110
+ from ocr_postprocess.pipeline import Pipeline
111
+ from ocr_postprocess.renderer.json_renderer import to_json
112
+
113
+ raw = _read_input(input_file)
114
+ pipeline = Pipeline.from_default(profiles_dir=str(profiles_dir))
115
+ doc = pipeline.process(raw)
116
+ typer.echo(to_json(doc))
117
+
118
+
119
+ def _read_input(path: Optional[Path]) -> str:
120
+ """Read text from file path or stdin; exit with code 1 if no input available."""
121
+ if path is not None:
122
+ return path.read_text(encoding="utf-8")
123
+ if not sys.stdin.isatty():
124
+ return sys.stdin.read()
125
+ typer.echo("Error: no input provided", err=True)
126
+ raise typer.Exit(code=1)
127
+
128
+
129
+ if __name__ == "__main__":
130
+ app()
@@ -0,0 +1,134 @@
1
+ """Stage 3 — Denoiser: remove boilerplate lines and patterns."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import logging
6
+
7
+ import regex as re
8
+
9
+ from ocr_postprocess.models import PipelineContext
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+ # Built-in rules (always applied regardless of profile config)
14
+ _BUILTIN_DROP_PATTERNS = [
15
+ re.compile(
16
+ r"^[\W_]{3,}$"
17
+ ), # lines consisting only of non-alphanumeric chars (e.g. "---", "***")
18
+ re.compile(r"^\s{0,2}\S\s{0,2}$"), # single character with optional surrounding spaces
19
+ ]
20
+
21
+
22
+ def _is_builtin_noise(line: str) -> bool:
23
+ """Return True if line matches built-in noise rules."""
24
+ stripped = line.strip()
25
+ if len(stripped) < 3:
26
+ return True
27
+ for pat in _BUILTIN_DROP_PATTERNS:
28
+ if pat.fullmatch(stripped):
29
+ return True
30
+ return False
31
+
32
+
33
+ def denoise(
34
+ text: str,
35
+ drop_line_patterns: list[str] | None = None,
36
+ drop_inline_patterns: list[str] | None = None,
37
+ mask_patterns: list[dict[str, str]] | None = None,
38
+ collapse_repeats: bool = False,
39
+ protected_substrings: set[str] | None = None,
40
+ ) -> str:
41
+ """Remove noise from text, preserving line count (empty string for dropped lines)."""
42
+ compiled_drop = [re.compile(p) for p in (drop_line_patterns or [])]
43
+ compiled_inline = [re.compile(p) for p in (drop_inline_patterns or [])]
44
+ compiled_mask = [
45
+ (re.compile(m["pattern"]), m.get("replacement", "")) for m in (mask_patterns or [])
46
+ ]
47
+
48
+ lines = text.split("\n")
49
+ prev_line: str | None = None
50
+ result: list[str] = []
51
+
52
+ for line in lines:
53
+ # 1. Consecutive duplicate
54
+ if collapse_repeats and line == prev_line and line.strip():
55
+ result.append("")
56
+ continue
57
+
58
+ # 2. Built-in noise (skip if line contains protected label)
59
+ stripped = line.strip()
60
+ if _is_builtin_noise(stripped):
61
+ if protected_substrings and any(p in line for p in protected_substrings):
62
+ pass # keep
63
+ else:
64
+ result.append("")
65
+ prev_line = line
66
+ continue
67
+
68
+ # 3. Profile drop_lines rules
69
+ dropped = False
70
+ for pat in compiled_drop:
71
+ if pat.search(stripped):
72
+ if protected_substrings and any(p in line for p in protected_substrings):
73
+ break
74
+ dropped = True
75
+ break
76
+ if dropped:
77
+ result.append("")
78
+ prev_line = line
79
+ continue
80
+
81
+ # 4. Inline drop patterns
82
+ for pat in compiled_inline:
83
+ line = pat.sub("", line)
84
+
85
+ # 5. Mask patterns
86
+ for pat, repl in compiled_mask:
87
+ line = pat.sub(repl, line)
88
+
89
+ result.append(line)
90
+ prev_line = line
91
+
92
+ return "\n".join(result)
93
+
94
+
95
def denoise_stage(ctx: PipelineContext) -> None:
    """Pipeline stage 3: denoise text.

    Builds the drop/mask rule set from the matched profile's noise config,
    collects field aliases as protected substrings, and rewrites
    ``ctx.normalized_text`` with the denoised result.
    """
    profile = ctx.profile
    noise = profile.noise if profile else None

    # Field labels must never be removed as noise — collect them as protected.
    protected: set[str] = set()
    if profile:
        for field in profile.fields:
            protected.update(field.aliases)

    # Merge explicit drop regexes with literal contains_any phrases (escaped).
    line_patterns: list[str] = []
    if noise and noise.drop_lines:
        rule = noise.drop_lines
        line_patterns.extend(rule.regex)
        line_patterns.extend(re.escape(phrase) for phrase in rule.contains_any)

    text = ctx.normalized_text or ctx.raw_text
    non_empty_before = sum(1 for ln in text.splitlines() if ln.strip())
    ctx.normalized_text = denoise(
        text,
        drop_line_patterns=line_patterns,
        drop_inline_patterns=list(noise.drop_patterns) if noise else [],
        mask_patterns=list(noise.mask_patterns) if noise else [],
        collapse_repeats=noise.collapse_repeats if noise else False,
        protected_substrings=protected,
    )
    non_empty_after = sum(1 for ln in ctx.normalized_text.splitlines() if ln.strip())
    logger.debug(
        "Denoiser: %d non-empty lines → %d (dropped %d, %d drop-patterns, %d mask-patterns)",
        non_empty_before,
        non_empty_after,
        non_empty_before - non_empty_after,
        len(line_patterns),
        len(noise.mask_patterns) if noise else 0,
    )