sigdetect (PyPI): 0.1.1 → 0.3.0 (tar.gz)

Files changed (35)

{sigdetect-0.1.1 → sigdetect-0.3.0}/PKG-INFO RENAMED

@@ -1,13 +1,12 @@
 Metadata-Version: 2.4
 Name: sigdetect
-Version: 0.1.1
+Version: 0.3.0
 Summary: Signature detection and role attribution for PDFs
 Author-email: BT Asmamaw <basmamaw@angeiongroup.com>
 License: MIT
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 Requires-Dist: pypdf>=4.0.0
-Requires-Dist: pandas>=2.0
 Requires-Dist: rich>=13.0
 Requires-Dist: typer>=0.12
 Requires-Dist: pydantic>=2.5
@@ -102,6 +101,8 @@ sigdetect detect \
 - `--profile` selects tuned role logic:
   - `hipaa` → patient / representative / attorney
   - `retainer` → client / firm (prefers detecting two signatures)
+- `--recursive/--no-recursive` toggles whether `sigdetect detect` descends into subdirectories when hunting for PDFs (recursive by default).
+- `--crop-signatures` enables PNG crops for each detected widget (requires installing the optional `pymupdf` dependency). Use `--crop-dir` to override the destination and `--crop-dpi` to choose rendering quality.
 - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
 ### EDA (quick aggregate stats)
@@ -135,7 +136,7 @@ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 print(result.to_dict())
 ~~~
-`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)).
+`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image.
 ---
@@ -146,7 +147,17 @@ Import from `sigdetect.api` and get plain dicts out (JSON-ready),
 with no I/O side effects by default:
 ~~~python
-from sigdetect.api import DetectPdf, DetectMany, ScanDirectory, ToCsvRow, Version
+from pathlib import Path
+from sigdetect.api import (
+    CropSignatureImages,
+    DetectMany,
+    DetectPdf,
+    ScanDirectory,
+    ToCsvRow,
+    Version,
+    get_detector,
+)
 print("sigdetect", Version())
@@ -178,8 +189,24 @@ for res in ScanDirectory(
     # store in DB, print, etc.
     pass
+# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
+detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
+file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
+crops = CropSignatureImages(
+    "/path/to/pdfs/example.pdf",
+    file_result,
+    outputDirectory="./signature_crops",
+    dpi=200,
+    returnBytes=True,  # also returns in-memory PNG bytes for each crop
+)
+first_crop = crops[0]
+print(first_crop.path, len(first_crop.image_bytes))
 ~~~
+When ``returnBytes=True`` the helper returns ``SignatureCrop`` objects containing the saved path,
+PNG bytes, and the originating signature metadata.
 ## Result schema
@@ -205,7 +232,10 @@ High-level summary (per file):
       "score": 5,
       "scores": { "field": 3, "page_label": 2 },
       "evidence": ["field:patient", "page_label:patient"],
-      "hint": "AcroSig:sig_patient"
+      "hint": "AcroSig:sig_patient",
+      "render_type": "typed",
+      "bounding_box": [10.0, 10.0, 150.0, 40.0],
+      "crop_path": "signature_crops/example/sig_01_patient.png"
     },
     {
       "page": null,
@@ -214,7 +244,10 @@ High-level summary (per file):
       "score": 6,
       "scores": { "page_label": 4, "general": 2 },
       "evidence": ["page_label:representative(parent/guardian)", "pseudo:true"],
-      "hint": "VendorOrAcroOnly"
+      "hint": "VendorOrAcroOnly",
+      "render_type": "unknown",
+      "bounding_box": null,
+      "crop_path": null
     }
   ]
 }
@@ -227,6 +260,8 @@ High-level summary (per file):
 - **`mixed`** means both `esign_found` and `scanned_pdf` are `true`.
 - **`roles`** summarizes unique non-`unknown` roles across signatures.
 - In retainer profile, emitter prefers two signatures (client + firm), often on the same page.
+- **`signatures[].bounding_box`** reports the widget rectangle in PDF points (origin bottom-left).
+- **`signatures[].crop_path`** is populated when PNG crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
 ---
@@ -252,6 +287,9 @@ engine: pypdf2
 pseudo_signatures: true
 recurse_xobjects: true
 profile: retainer    # or: hipaa
+crop_signatures: false   # enable to write PNG crops (requires pymupdf)
+# crop_output_dir: ./signature_crops
+crop_image_dpi: 200
 ~~~
 YAML files can be customized or loaded at runtime (see CLI `--config`, if available, or import and pass patterns into the engine).

{sigdetect-0.1.1 → sigdetect-0.3.0}/README.md RENAMED

@@ -85,6 +85,8 @@ sigdetect detect \
 - `--profile` selects tuned role logic:
   - `hipaa` → patient / representative / attorney
   - `retainer` → client / firm (prefers detecting two signatures)
+- `--recursive/--no-recursive` toggles whether `sigdetect detect` descends into subdirectories when hunting for PDFs (recursive by default).
+- `--crop-signatures` enables PNG crops for each detected widget (requires installing the optional `pymupdf` dependency). Use `--crop-dir` to override the destination and `--crop-dpi` to choose rendering quality.
 - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
 ### EDA (quick aggregate stats)
@@ -118,7 +120,7 @@ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 print(result.to_dict())
 ~~~
-`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)).
+`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image.
 ---
@@ -129,7 +131,17 @@ Import from `sigdetect.api` and get plain dicts out (JSON-ready),
 with no I/O side effects by default:
 ~~~python
-from sigdetect.api import DetectPdf, DetectMany, ScanDirectory, ToCsvRow, Version
+from pathlib import Path
+from sigdetect.api import (
+    CropSignatureImages,
+    DetectMany,
+    DetectPdf,
+    ScanDirectory,
+    ToCsvRow,
+    Version,
+    get_detector,
+)
 print("sigdetect", Version())
@@ -161,8 +173,24 @@ for res in ScanDirectory(
     # store in DB, print, etc.
     pass
+# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
+detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
+file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
+crops = CropSignatureImages(
+    "/path/to/pdfs/example.pdf",
+    file_result,
+    outputDirectory="./signature_crops",
+    dpi=200,
+    returnBytes=True,  # also returns in-memory PNG bytes for each crop
+)
+first_crop = crops[0]
+print(first_crop.path, len(first_crop.image_bytes))
 ~~~
+When ``returnBytes=True`` the helper returns ``SignatureCrop`` objects containing the saved path,
+PNG bytes, and the originating signature metadata.
 ## Result schema
@@ -188,7 +216,10 @@ High-level summary (per file):
       "score": 5,
       "scores": { "field": 3, "page_label": 2 },
       "evidence": ["field:patient", "page_label:patient"],
-      "hint": "AcroSig:sig_patient"
+      "hint": "AcroSig:sig_patient",
+      "render_type": "typed",
+      "bounding_box": [10.0, 10.0, 150.0, 40.0],
+      "crop_path": "signature_crops/example/sig_01_patient.png"
     },
     {
       "page": null,
@@ -197,7 +228,10 @@ High-level summary (per file):
       "score": 6,
       "scores": { "page_label": 4, "general": 2 },
       "evidence": ["page_label:representative(parent/guardian)", "pseudo:true"],
-      "hint": "VendorOrAcroOnly"
+      "hint": "VendorOrAcroOnly",
+      "render_type": "unknown",
+      "bounding_box": null,
+      "crop_path": null
     }
   ]
 }
@@ -210,6 +244,8 @@ High-level summary (per file):
 - **`mixed`** means both `esign_found` and `scanned_pdf` are `true`.
 - **`roles`** summarizes unique non-`unknown` roles across signatures.
 - In retainer profile, emitter prefers two signatures (client + firm), often on the same page.
+- **`signatures[].bounding_box`** reports the widget rectangle in PDF points (origin bottom-left).
+- **`signatures[].crop_path`** is populated when PNG crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
 ---
@@ -235,6 +271,9 @@ engine: pypdf2
 pseudo_signatures: true
 recurse_xobjects: true
 profile: retainer    # or: hipaa
+crop_signatures: false   # enable to write PNG crops (requires pymupdf)
+# crop_output_dir: ./signature_crops
+crop_image_dpi: 200
 ~~~
 YAML files can be customized or loaded at runtime (see CLI `--config`, if available, or import and pass patterns into the engine).
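
Note on the README changes above: `bounding_box` is documented as PDF points with a bottom-left origin, while the crop options (`--crop-dpi`, `crop_image_dpi`) work in raster terms. The following is a minimal sketch of how such a box could map to pixel coordinates in a top-left-origin image. It is not taken from sigdetect: the function name `pdf_box_to_pixels` is hypothetical, the box is assumed to be ordered `[x0, y0, x1, y1]`, and the page height (792 pt, US Letter) is an assumption that would have to come from the actual page.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # PDF default user space: 1 point = 1/72 inch


def pdf_box_to_pixels(
    box: Sequence[float],
    page_height_pt: float,
    dpi: int = 200,
) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin PDF box to (left, top, right, bottom) pixels.

    Hypothetical helper for illustration; assumes box = [x0, y0, x1, y1].
    """
    x0, y0, x1, y1 = box
    scale = dpi / POINTS_PER_INCH  # pixels per PDF point at the chosen DPI
    left = round(min(x0, x1) * scale)
    right = round(max(x0, x1) * scale)
    # Flip the y axis: PDF y grows upward, raster y grows downward.
    top = round((page_height_pt - max(y0, y1)) * scale)
    bottom = round((page_height_pt - min(y0, y1)) * scale)
    return left, top, right, bottom


# Box taken from the README's schema example; page height is an assumption.
pixel_box = pdf_box_to_pixels([10.0, 10.0, 150.0, 40.0], page_height_pt=792.0, dpi=200)
print(pixel_box)
```

A box near the bottom of the page in PDF space ends up near the bottom of the raster, with large `top` and `bottom` values, which is an easy way to sanity-check the axis flip.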

{sigdetect-0.1.1 → sigdetect-0.3.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sigdetect"
-version = "0.1.1"
+version = "0.3.0"
 description = "Signature detection and role attribution for PDFs"
 readme = "README.md"
 authors = [{ name = "BT Asmamaw", email = "basmamaw@angeiongroup.com" }]
@@ -12,7 +12,6 @@ license = { text = "MIT" }
 requires-python = ">=3.9"
 dependencies = [
   "pypdf>=4.0.0",
-  "pandas>=2.0",
   "rich>=13.0",
   "typer>=0.12",
   "pydantic>=2.5",
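
Note on the dependency change above: `pandas` is no longer declared in 0.3.0. This part of the diff does not show how sigdetect itself writes tabular output, so the following is an illustration only. Rows with the keys that `ToCsvRow` in `api.py` returns can be written with the standard library's `csv.DictWriter`. The helper name `write_rows` and the sample row values are placeholders, not package code or real results.

```python
import csv
import io
from typing import Any, Dict, Iterable, TextIO

# Key order mirrors the dict returned by ToCsvRow in api.py.
CSV_FIELDS = [
    "file", "size_kb", "pages", "esign_found", "scanned_pdf",
    "mixed", "sig_count", "sig_pages", "roles", "hints",
]


def write_rows(rows: Iterable[Dict[str, Any]], stream: TextIO) -> None:
    """Write detection summaries as CSV (hypothetical helper, stdlib only)."""
    writer = csv.DictWriter(stream, fieldnames=CSV_FIELDS)
    writer.writeheader()
    for row in rows:
        # Keep only the curated keys; missing keys become empty cells.
        writer.writerow({key: row.get(key) for key in CSV_FIELDS})


# Placeholder row shaped like a detection result; values are illustrative.
sample = {
    "file": "example.pdf", "size_kb": 12.5, "pages": 2,
    "esign_found": True, "scanned_pdf": False, "mixed": False,
    "sig_count": 1, "sig_pages": "1", "roles": "patient",
    "hints": "AcroSig:sig_patient", "signatures": [],
}

buffer = io.StringIO()
write_rows([sample], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The nested `signatures` list is dropped on purpose, since a flat CSV row has no natural place for it.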

sigdetect-0.3.0/src/sigdetect/api.py ADDED

@@ -0,0 +1,287 @@
+"""Public helpers for programmatic use of the signature detection engine."""
+from __future__ import annotations
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Any, Generator, Iterable, Iterator, Literal, overload
+from sigdetect.config import DetectConfiguration
+from sigdetect.cropping import SignatureCrop
+from sigdetect.detector import BuildDetector, Detector, FileResult, Signature
+EngineName = Literal["pypdf2", "pypdf", "pymupdf"]
+ProfileName = Literal["hipaa", "retainer"]
+def DetectPdf(
+    pdfPath: str | Path,
+    *,
+    profileName: ProfileName = "hipaa",
+    engineName: EngineName = "pypdf2",
+    includePseudoSignatures: bool = True,
+    recurseXObjects: bool = True,
+    detector: Detector | None = None,
+) -> dict[str, Any]:
+    """Detect signature evidence and assign roles for a single PDF."""
+    resolvedPath = Path(pdfPath)
+    activeDetector = detector or get_detector(
+        pdfRoot=resolvedPath.parent,
+        profileName=profileName,
+        engineName=engineName,
+        includePseudoSignatures=includePseudoSignatures,
+        recurseXObjects=recurseXObjects,
+        outputDirectory=None,
+    )
+    result = activeDetector.Detect(resolvedPath)
+    return _ToPlainDictionary(result)
+def get_detector(
+    *,
+    pdfRoot: str | Path | None = None,
+    profileName: ProfileName = "hipaa",
+    engineName: EngineName = "pypdf2",
+    includePseudoSignatures: bool = True,
+    recurseXObjects: bool = True,
+    outputDirectory: str | Path | None = None,
+) -> Detector:
+    """Return a reusable detector instance configured with the supplied options."""
+    configuration = DetectConfiguration(
+        PdfRoot=Path(pdfRoot) if pdfRoot is not None else Path.cwd(),
+        OutputDirectory=Path(outputDirectory) if outputDirectory is not None else None,
+        Engine=engineName,
+        PseudoSignatures=includePseudoSignatures,
+        RecurseXObjects=recurseXObjects,
+        Profile=profileName,
+    )
+    return BuildDetector(configuration)
+def _ToPlainDictionary(candidate: Any) -> dict[str, Any]:
+    """Convert pydantic/dataclass instances to plain dictionaries."""
+    if hasattr(candidate, "to_dict"):
+        return candidate.to_dict()
+    if hasattr(candidate, "model_dump"):
+        return candidate.model_dump()  # type: ignore[attr-defined]
+    if hasattr(candidate, "dict"):
+        return candidate.dict()  # type: ignore[attr-defined]
+    try:
+        from dataclasses import asdict, is_dataclass
+        if is_dataclass(candidate):
+            return asdict(candidate)
+    except Exception:
+        pass
+    if isinstance(candidate, dict):
+        return {key: _ToPlainValue(candidate[key]) for key in candidate}
+    raise TypeError(f"Unsupported result type: {type(candidate)!r}")
+def _ToPlainValue(value: Any) -> Any:
+    """Best effort conversion for nested structures."""
+    if hasattr(value, "to_dict"):
+        return value.to_dict()
+    if hasattr(value, "model_dump") or hasattr(value, "dict"):
+        return _ToPlainDictionary(value)
+    try:
+        from dataclasses import asdict, is_dataclass
+        if is_dataclass(value):
+            return asdict(value)
+    except Exception:
+        pass
+    if isinstance(value, list):
+        return [_ToPlainValue(item) for item in value]
+    if isinstance(value, tuple):
+        return tuple(_ToPlainValue(item) for item in value)
+    if isinstance(value, dict):
+        return {key: _ToPlainValue(result) for key, result in value.items()}
+    return value
+def DetectMany(
+    pdfPaths: Iterable[str | Path],
+    *,
+    detector: Detector | None = None,
+    **kwargs: Any,
+) -> Iterator[dict[str, Any]]:
+    """Yield :func:`DetectPdf` results for each path in ``pdfPaths``."""
+    if detector is not None:
+        for pdfPath in pdfPaths:
+            yield _DetectWithDetector(detector, pdfPath)
+        return
+    for pdfPath in pdfPaths:
+        yield DetectPdf(pdfPath, **kwargs)
+def ScanDirectory(
+    pdfRoot: str | Path,
+    *,
+    globPattern: str = "**/*.pdf",
+    detector: Detector | None = None,
+    **kwargs: Any,
+) -> Iterator[dict[str, Any]]:
+    """Walk ``pdfRoot`` and yield detection output for every matching PDF."""
+    rootDirectory = Path(pdfRoot)
+    if globPattern == "**/*.pdf":
+        iterator = (path for path in rootDirectory.rglob("*") if path.is_file())
+    else:
+        iterator = (
+            rootDirectory.rglob(globPattern.replace("**/", "", 1))
+            if globPattern.startswith("**/")
+            else rootDirectory.glob(globPattern)
+        )
+    for pdfPath in iterator:
+        if pdfPath.is_file() and pdfPath.suffix.lower() == ".pdf":
+            yield DetectPdf(pdfPath, detector=detector, **kwargs)
+def ToCsvRow(result: dict[str, Any]) -> dict[str, Any]:
+    """Return a curated subset of keys suitable for CSV export."""
+    return {
+        "file": result.get("file"),
+        "size_kb": result.get("size_kb"),
+        "pages": result.get("pages"),
+        "esign_found": result.get("esign_found"),
+        "scanned_pdf": result.get("scanned_pdf"),
+        "mixed": result.get("mixed"),
+        "sig_count": result.get("sig_count"),
+        "sig_pages": result.get("sig_pages"),
+        "roles": result.get("roles"),
+        "hints": result.get("hints"),
+    }
+def Version() -> str:
+    """Expose the installed package version without importing the CLI stack."""
+    try:
+        from importlib.metadata import version as resolveVersion
+        return resolveVersion("sigdetect")
+    except Exception:
+        return "0.0.0-dev"
+def _DetectWithDetector(detector: Detector, pdfPath: str | Path) -> dict[str, Any]:
+    """Helper that runs ``detector`` and returns the plain dictionary result."""
+    resolvedPath = Path(pdfPath)
+    return _ToPlainDictionary(detector.Detect(resolvedPath))
+@contextmanager
+def detector_context(**kwargs: Any) -> Generator[Detector, None, None]:
+    """Context manager wrapper around :func:`get_detector`."""
+    detector = get_detector(**kwargs)
+    try:
+        yield detector
+    finally:
+        pass
+@overload
+def CropSignatureImages(
+    pdfPath: str | Path,
+    fileResult: FileResult | dict[str, Any],
+    *,
+    outputDirectory: str | Path,
+    dpi: int = 200,
+    returnBytes: Literal[False] = False,
+) -> list[Path]: ...
+@overload
+def CropSignatureImages(
+    pdfPath: str | Path,
+    fileResult: FileResult | dict[str, Any],
+    *,
+    outputDirectory: str | Path,
+    dpi: int,
+    returnBytes: Literal[True],
+) -> list[SignatureCrop]: ...
+def CropSignatureImages(
+    pdfPath: str | Path,
+    fileResult: FileResult | dict[str, Any],
+    *,
+    outputDirectory: str | Path,
+    dpi: int = 200,
+    returnBytes: bool = False,
+) -> list[Path] | list[SignatureCrop]:
+    """Crop detected signature regions to PNG files.
+
+    Accepts either a :class:`FileResult` instance or the ``dict`` returned by
+    :func:`DetectPdf`. Requires the optional ``pymupdf`` dependency.
+
+    Set ``returnBytes=True`` to also receive in-memory PNG bytes for each crop.
+    """
+    from sigdetect.cropping import crop_signatures
+    file_result_obj, original_dict = _CoerceFileResult(fileResult)
+    paths = crop_signatures(
+        pdf_path=Path(pdfPath),
+        file_result=file_result_obj,
+        output_dir=Path(outputDirectory),
+        dpi=dpi,
+        return_bytes=returnBytes,
+    )
+    if original_dict is not None:
+        original_dict.clear()
+        original_dict.update(file_result_obj.to_dict())
+    return paths
+def _CoerceFileResult(
+    candidate: FileResult | dict[str, Any]
+) -> tuple[FileResult, dict[str, Any] | None]:
+    if isinstance(candidate, FileResult):
+        return candidate, None
+    if not isinstance(candidate, dict):
+        raise TypeError("fileResult must be FileResult or dict")
+    signatures: list[Signature] = []
+    for entry in candidate.get("signatures") or []:
+        bbox = entry.get("bounding_box")
+        signatures.append(
+            Signature(
+                Page=entry.get("page"),
+                FieldName=str(entry.get("field_name") or ""),
+                Role=str(entry.get("role") or "unknown"),
+                Score=int(entry.get("score") or 0),
+                Scores=dict(entry.get("scores") or {}),
+                Evidence=list(entry.get("evidence") or []),
+                Hint=str(entry.get("hint") or ""),
+                RenderType=str(entry.get("render_type") or "unknown"),
+                BoundingBox=tuple(bbox) if bbox else None,
+                CropPath=entry.get("crop_path"),
+            )
+        )
+    file_result = FileResult(
+        File=str(candidate.get("file") or ""),
+        SizeKilobytes=candidate.get("size_kb"),
+        PageCount=int(candidate.get("pages") or 0),
+        ElectronicSignatureFound=bool(candidate.get("esign_found")),
+        ScannedPdf=candidate.get("scanned_pdf"),
+        MixedContent=candidate.get("mixed"),
+        SignatureCount=int(candidate.get("sig_count") or len(signatures)),
+        SignaturePages=str(candidate.get("sig_pages") or ""),
+        Roles=str(candidate.get("roles") or "unknown"),
+        Hints=str(candidate.get("hints") or ""),
+        Signatures=signatures,
+    )
+    return file_result, candidate
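
Note on `ScanDirectory` above: with the default `globPattern`, the function walks every entry with `rglob("*")` and then filters on a lower-cased suffix, so files named `.PDF` are picked up as well as `.pdf`. A stdlib-only sketch of that filtering step follows. The helper name `iter_pdfs` and the temporary files are illustrative and not part of sigdetect.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_pdfs(root: Path) -> Iterator[Path]:
    """Yield every file under root whose suffix is .pdf, ignoring case."""
    for path in root.rglob("*"):
        # Same test as the diff: regular file plus lower-cased suffix check.
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Empty placeholder files; only the names matter for this sketch.
    for name in ("a.pdf", "B.PDF", "sub/c.pdf", "notes.txt"):
        (root / name).write_bytes(b"")
    found = sorted(p.relative_to(root).as_posix() for p in iter_pdfs(root))

print(found)
```

Filtering on the suffix after a broad walk trades a little extra directory traversal for not depending on how the platform matches case in glob patterns.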
