sigdetect 0.4.0__tar.gz → 0.5.0__tar.gz (PyPI)


Files changed (37)

{sigdetect-0.4.0 → sigdetect-0.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sigdetect
-Version: 0.4.0
+Version: 0.5.0
 Summary: Signature detection and role attribution for PDFs
 Author-email: BT Asmamaw <basmamaw@angeiongroup.com>
 License: MIT
@@ -10,9 +10,11 @@ Requires-Dist: pypdf>=4.0.0
 Requires-Dist: rich>=13.0
 Requires-Dist: typer>=0.12
 Requires-Dist: pydantic>=2.5
+Requires-Dist: pillow>=10.0
+Requires-Dist: python-docx>=1.1.0
+Requires-Dist: pytesseract>=0.3.10
+Requires-Dist: pymupdf>=1.23
 Requires-Dist: pyyaml>=6.0
-Provides-Extra: pymupdf
-Requires-Dist: pymupdf>=1.23; extra == "pymupdf"
 # CaseWorks.Automation.CaseDocumentIntake
@@ -95,14 +97,16 @@ sigdetect detect \
 ### Notes
 - The config file controls `pdf_root`, `out_dir`, `engine`, `pseudo_signatures`, `recurse_xobjects`, etc.
-- `--engine` accepts **auto** (default; prefers PyMuPDF when installed, falls back to PyPDF2), **pypdf2**, or **pymupdf**.
+- Engine selection is forced to **auto** (prefers PyMuPDF for geometry, falls back to PyPDF2); any configured `engine` value is overridden.
 - `--pseudo-signatures` enables a vendor/Acro-only pseudo-signature when no actual `/Widget` is present (useful for DocuSign / Acrobat Sign receipts).
 - `--recurse-xobjects` allows scanning Form XObjects for vendor markers and labels embedded in page resources.
 - `--profile` selects tuned role logic:
   - `hipaa` → patient / representative / attorney
   - `retainer` → client / firm (prefers detecting two signatures)
 - `--recursive/--no-recursive` toggles whether `sigdetect detect` descends into subdirectories when hunting for PDFs (recursive by default).
-- Cropping (`--crop-signatures`) and wet detection (`--detect-wet`) are enabled by default for single-pass runs; disable them if you want a light, e-sign-only pass. PyMuPDF is required for crops; PyMuPDF + Tesseract are required for wet detection.
+- Results output is disabled by default; set `write_results: true` or pass `--write-results` when you need `results.json` (for EDA).
+- Cropping (`--crop-signatures`) writes a one-image `.docx` per signature in the crop output directory (no PNG files on disk); `--crop-bytes` embeds base64 PNG data in `signatures[].crop_bytes` for in-memory use. PyMuPDF is required for crops, and `python-docx` is required for `.docx` output.
+- Wet detection runs automatically for non-e-sign PDFs when dependencies are available; missing OCR dependencies add a `ManualReview:*` hint instead of failing. PyMuPDF + Tesseract are required for wet detection.
 - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
 ### EDA (quick aggregate stats)
@@ -113,6 +117,8 @@ sigdetect eda \
 ~~~
+`sigdetect eda` expects `results.json`; enable `write_results: true` when running detect.
 ---
 ## Library usage
@@ -136,13 +142,13 @@ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 print(result.to_dict())
 ~~~
-`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
+`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When cropping is enabled, `crop_path` points at the generated `.docx`. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
 ---
 ## Library API (embed in another script)
-Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping:
+Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping. Engine selection is forced to `auto` (PyMuPDF preferred) to ensure geometry. Wet detection runs automatically for non-e-sign PDFs; pass `runWetDetection=False` to skip OCR.
 ~~~python
 from pathlib import Path
@@ -165,6 +171,7 @@ result = DetectPdf(
     profileName="retainer",
     includePseudoSignatures=True,
     recurseXObjects=True,
+    # runWetDetection=False,  # disable OCR-backed wet detection if desired
 )
 print(
     result["file"],
@@ -187,7 +194,7 @@ for res in ScanDirectory(
     # store in DB, print, etc.
     pass
-# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
+# 3) Create DOCX crops for FileResult objects (requires PyMuPDF + python-docx)
 detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
 file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 CropSignatureImages(
@@ -226,7 +233,7 @@ High-level summary (per file):
       "hint": "AcroSig:sig_patient",
       "render_type": "typed",
       "bounding_box": [10.0, 10.0, 150.0, 40.0],
-      "crop_path": "signature_crops/example/sig_01_patient.png"
+      "crop_path": "signature_crops/example/sig_01_patient.docx"
     },
     {
       "page": null,
@@ -252,7 +259,8 @@ High-level summary (per file):
 - **`roles`** summarizes unique non-`unknown` roles across signatures.
 - In retainer profile, emitter prefers two signatures (client + firm), often on the same page.
 - **`signatures[].bounding_box`** reports the widget rectangle in PDF points (origin bottom-left).
-- **`signatures[].crop_path`** is populated when PNG crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
+- **`signatures[].crop_path`** is populated when DOCX crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
+- **`signatures[].crop_bytes`** contains base64 PNG data when CLI `--crop-bytes` is enabled.
 ---
@@ -274,14 +282,15 @@ You can keep one config YAML per dataset, e.g.:
 # ./sample_data/config.yml (example)
 pdf_root: ./pdfs
 out_dir: ./sigdetect_out
-engine: pypdf2
+engine: auto
+write_results: false
 pseudo_signatures: true
 recurse_xobjects: true
 profile: retainer    # or: hipaa
-crop_signatures: false   # enable to write PNG crops (requires pymupdf)
+crop_signatures: false   # enable to write DOCX crops (requires pymupdf + python-docx)
 # crop_output_dir: ./signature_crops
 crop_image_dpi: 200
-detect_wet_signatures: false   # opt-in OCR wet detection (PyMuPDF + Tesseract)
+detect_wet_signatures: false   # kept for compatibility; non-e-sign PDFs still trigger OCR
 wet_ocr_dpi: 200
 wet_ocr_languages: eng
 wet_precision_threshold: 0.82
@@ -299,7 +308,7 @@ YAML files can be customized or load at runtime (see CLI `--config`, if availabl
   - Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
   - Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
   - When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
-- **Wet detection (opt-in):** With `detect_wet_signatures: true`, the CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
+- **Wet detection (non-e-sign):** The CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection whenever no e-sign evidence is found. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. When an image-based signature is present on a page, label-only OCR candidates are suppressed unless a stroke is detected. Results are deduped to the top signature per role (dropping `unknown`). Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
 ---
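
The notes above say that `--crop-bytes` embeds base64 PNG data in `signatures[].crop_bytes`. A minimal sketch of consuming that field follows. It assumes the result has already been loaded into a plain dict; `decode_crops` is a hypothetical helper, not part of sigdetect, and the sample payload is a placeholder that only carries the PNG file signature, not a real image.

```python
from __future__ import annotations

import base64

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte signature that starts every PNG file


def decode_crops(result: dict) -> dict[int, bytes]:
    """Map signature index -> decoded bytes for entries that carry PNG crop_bytes."""
    decoded: dict[int, bytes] = {}
    for index, signature in enumerate(result.get("signatures", [])):
        payload = signature.get("crop_bytes")
        if not payload:
            continue  # field absent or null: no crop was embedded for this entry
        raw = base64.b64decode(payload)
        if raw.startswith(PNG_MAGIC):
            decoded[index] = raw
    return decoded


# Placeholder data shaped like the result schema (not output from a real run).
sample = {
    "file": "example.pdf",
    "signatures": [
        {"crop_bytes": base64.b64encode(PNG_MAGIC + b"placeholder").decode("ascii")},
        {"crop_bytes": None},
    ],
}
crops = decode_crops(sample)
print(sorted(crops))
```

The decoded bytes can be handed to any PNG-aware consumer without touching the `.docx` files written by `--crop-signatures`.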

{sigdetect-0.4.0 → sigdetect-0.5.0}/README.md RENAMED

@@ -79,14 +79,16 @@ sigdetect detect \
 ### Notes
 - The config file controls `pdf_root`, `out_dir`, `engine`, `pseudo_signatures`, `recurse_xobjects`, etc.
-- `--engine` accepts **auto** (default; prefers PyMuPDF when installed, falls back to PyPDF2), **pypdf2**, or **pymupdf**.
+- Engine selection is forced to **auto** (prefers PyMuPDF for geometry, falls back to PyPDF2); any configured `engine` value is overridden.
 - `--pseudo-signatures` enables a vendor/Acro-only pseudo-signature when no actual `/Widget` is present (useful for DocuSign / Acrobat Sign receipts).
 - `--recurse-xobjects` allows scanning Form XObjects for vendor markers and labels embedded in page resources.
 - `--profile` selects tuned role logic:
   - `hipaa` → patient / representative / attorney
   - `retainer` → client / firm (prefers detecting two signatures)
 - `--recursive/--no-recursive` toggles whether `sigdetect detect` descends into subdirectories when hunting for PDFs (recursive by default).
-- Cropping (`--crop-signatures`) and wet detection (`--detect-wet`) are enabled by default for single-pass runs; disable them if you want a light, e-sign-only pass. PyMuPDF is required for crops; PyMuPDF + Tesseract are required for wet detection.
+- Results output is disabled by default; set `write_results: true` or pass `--write-results` when you need `results.json` (for EDA).
+- Cropping (`--crop-signatures`) writes a one-image `.docx` per signature in the crop output directory (no PNG files on disk); `--crop-bytes` embeds base64 PNG data in `signatures[].crop_bytes` for in-memory use. PyMuPDF is required for crops, and `python-docx` is required for `.docx` output.
+- Wet detection runs automatically for non-e-sign PDFs when dependencies are available; missing OCR dependencies add a `ManualReview:*` hint instead of failing. PyMuPDF + Tesseract are required for wet detection.
 - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
 ### EDA (quick aggregate stats)
@@ -97,6 +99,8 @@ sigdetect eda \
 ~~~
+`sigdetect eda` expects `results.json`; enable `write_results: true` when running detect.
 ---
 ## Library usage
@@ -120,13 +124,13 @@ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 print(result.to_dict())
 ~~~
-`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
+`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When cropping is enabled, `crop_path` points at the generated `.docx`. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
 ---
 ## Library API (embed in another script)
-Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping:
+Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping. Engine selection is forced to `auto` (PyMuPDF preferred) to ensure geometry. Wet detection runs automatically for non-e-sign PDFs; pass `runWetDetection=False` to skip OCR.
 ~~~python
 from pathlib import Path
@@ -149,6 +153,7 @@ result = DetectPdf(
     profileName="retainer",
     includePseudoSignatures=True,
     recurseXObjects=True,
+    # runWetDetection=False,  # disable OCR-backed wet detection if desired
 )
 print(
     result["file"],
@@ -171,7 +176,7 @@ for res in ScanDirectory(
     # store in DB, print, etc.
     pass
-# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
+# 3) Create DOCX crops for FileResult objects (requires PyMuPDF + python-docx)
 detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
 file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 CropSignatureImages(
@@ -210,7 +215,7 @@ High-level summary (per file):
       "hint": "AcroSig:sig_patient",
       "render_type": "typed",
       "bounding_box": [10.0, 10.0, 150.0, 40.0],
-      "crop_path": "signature_crops/example/sig_01_patient.png"
+      "crop_path": "signature_crops/example/sig_01_patient.docx"
     },
     {
       "page": null,
@@ -236,7 +241,8 @@ High-level summary (per file):
 - **`roles`** summarizes unique non-`unknown` roles across signatures.
 - In retainer profile, emitter prefers two signatures (client + firm), often on the same page.
 - **`signatures[].bounding_box`** reports the widget rectangle in PDF points (origin bottom-left).
-- **`signatures[].crop_path`** is populated when PNG crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
+- **`signatures[].crop_path`** is populated when DOCX crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
+- **`signatures[].crop_bytes`** contains base64 PNG data when CLI `--crop-bytes` is enabled.
 ---
@@ -258,14 +264,15 @@ You can keep one config YAML per dataset, e.g.:
 # ./sample_data/config.yml (example)
 pdf_root: ./pdfs
 out_dir: ./sigdetect_out
-engine: pypdf2
+engine: auto
+write_results: false
 pseudo_signatures: true
 recurse_xobjects: true
 profile: retainer    # or: hipaa
-crop_signatures: false   # enable to write PNG crops (requires pymupdf)
+crop_signatures: false   # enable to write DOCX crops (requires pymupdf + python-docx)
 # crop_output_dir: ./signature_crops
 crop_image_dpi: 200
-detect_wet_signatures: false   # opt-in OCR wet detection (PyMuPDF + Tesseract)
+detect_wet_signatures: false   # kept for compatibility; non-e-sign PDFs still trigger OCR
 wet_ocr_dpi: 200
 wet_ocr_languages: eng
 wet_precision_threshold: 0.82
@@ -283,7 +290,7 @@ YAML files can be customized or load at runtime (see CLI `--config`, if availabl
   - Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
   - Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
   - When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
-- **Wet detection (opt-in):** With `detect_wet_signatures: true`, the CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
+- **Wet detection (non-e-sign):** The CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection whenever no e-sign evidence is found. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. When an image-based signature is present on a page, label-only OCR candidates are suppressed unless a stroke is detected. Results are deduped to the top signature per role (dropping `unknown`). Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
 ---
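
Since 0.5.0 writes `results.json` only when `write_results` is enabled, downstream scripts that read it need the flag set. A rough sketch of a quick aggregate over such a file is below. It assumes the file is a JSON array of per-file objects with a `signatures` list whose entries carry `render_type`, as in the schema excerpt; `count_render_types` is a hypothetical helper and the inline JSON stands in for a real file.

```python
import json
from collections import Counter


def count_render_types(results: list) -> Counter:
    """Count signatures by render_type across all per-file results."""
    counts: Counter = Counter()
    for file_result in results:
        for signature in file_result.get("signatures", []):
            counts[signature.get("render_type", "unknown")] += 1
    return counts


# Stand-in for the contents of results.json (placeholder values, not real output).
sample_json = """
[
  {"file": "a.pdf", "signatures": [{"render_type": "typed"}, {"render_type": "wet"}]},
  {"file": "b.pdf", "signatures": [{"render_type": "typed"}]},
  {"file": "c.pdf", "signatures": []}
]
"""
results = json.loads(sample_json)
counts = count_render_types(results)
print(dict(counts))
```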

{sigdetect-0.4.0 → sigdetect-0.5.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sigdetect"
-version = "0.4.0"
+version = "0.5.0"
 description = "Signature detection and role attribution for PDFs"
 readme = "README.md"
 authors = [{ name = "BT Asmamaw", email = "basmamaw@angeiongroup.com" }]
@@ -15,12 +15,13 @@ dependencies = [
   "rich>=13.0",
   "typer>=0.12",
   "pydantic>=2.5",
+  "pillow>=10.0",
+  "python-docx>=1.1.0",
+  "pytesseract>=0.3.10",
+  "pymupdf>=1.23",
   "pyyaml>=6.0",
 ]
-[project.optional-dependencies]
-pymupdf = ["pymupdf>=1.23"]
 [project.scripts]
 sigdetect = "sigdetect.cli:app"
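
With this change `pillow`, `python-docx`, `pytesseract` and `pymupdf` move from optional to required. A small sketch for checking an environment is below. The mapping from distribution name to import name reflects how those packages are commonly imported (`fitz` is the long-standing PyMuPDF module name) and should be treated as an assumption; `missing_dependencies` and `tesseract_on_path` are hypothetical helpers. Note that `pytesseract` is only a wrapper, so the Tesseract executable is checked separately.

```python
from importlib.util import find_spec
import shutil

# Distribution name -> top-level import name (assumed; they differ for several packages).
IMPORT_NAMES = {
    "pillow": "PIL",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
    "pymupdf": "fitz",
}


def missing_dependencies(import_names: dict = IMPORT_NAMES) -> list:
    """Return the distribution names whose import module cannot be located."""
    return [dist for dist, module in import_names.items() if find_spec(module) is None]


def tesseract_on_path() -> bool:
    """True when a `tesseract` executable is discoverable on PATH."""
    return shutil.which("tesseract") is not None


print("missing:", missing_dependencies())
print("tesseract binary found:", tesseract_on_path())
```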

{sigdetect-0.4.0 → sigdetect-0.5.0}/src/sigdetect/api.py RENAMED

@@ -9,6 +9,7 @@ from typing import Any, Generator, Iterable, Iterator, Literal, overload
 from sigdetect.config import DetectConfiguration
 from sigdetect.cropping import SignatureCrop
 from sigdetect.detector import BuildDetector, Detector, FileResult, Signature
+from sigdetect.wet_detection import apply_wet_detection
 EngineName = Literal["pypdf2", "pypdf", "pymupdf", "auto"]
 ProfileName = Literal["hipaa", "retainer"]
@@ -21,9 +22,13 @@ def DetectPdf(
     engineName: EngineName = "auto",
     includePseudoSignatures: bool = True,
     recurseXObjects: bool = True,
+    runWetDetection: bool = True,
     detector: Detector | None = None,
 ) -> dict[str, Any]:
-    """Detect signature evidence and assign roles for a single PDF."""
+    """Detect signature evidence and assign roles for a single PDF.
+    Wet detection runs by default for non-e-sign PDFs; pass ``runWetDetection=False`` to skip OCR.
+    """
     resolvedPath = Path(pdfPath)
     activeDetector = detector or get_detector(
@@ -36,6 +41,10 @@ def DetectPdf(
     )
     result = activeDetector.Detect(resolvedPath)
+    if runWetDetection:
+        configuration = _ResolveConfiguration(activeDetector)
+        if configuration is not None:
+            apply_wet_detection(resolvedPath, configuration, result)
     return _ToPlainDictionary(result)
@@ -48,7 +57,10 @@ def get_detector(
     recurseXObjects: bool = True,
     outputDirectory: str | Path | None = None,
 ) -> Detector:
-    """Return a reusable detector instance configured with the supplied options."""
+    """Return a reusable detector instance configured with the supplied options.
+    Engine selection is forced to ``auto`` (prefers PyMuPDF when available).
+    """
     configuration = DetectConfiguration(
         PdfRoot=Path(pdfRoot) if pdfRoot is not None else Path.cwd(),
@@ -108,6 +120,7 @@ def _ToPlainValue(value: Any) -> Any:
 def DetectMany(
     pdfPaths: Iterable[str | Path],
     *,
+    runWetDetection: bool = True,
     detector: Detector | None = None,
     **kwargs: Any,
 ) -> Iterator[dict[str, Any]]:
@@ -115,17 +128,18 @@ def DetectMany(
     if detector is not None:
         for pdfPath in pdfPaths:
-            yield _DetectWithDetector(detector, pdfPath)
+            yield _DetectWithDetector(detector, pdfPath, runWetDetection=runWetDetection)
         return
     for pdfPath in pdfPaths:
-        yield DetectPdf(pdfPath, **kwargs)
+        yield DetectPdf(pdfPath, runWetDetection=runWetDetection, **kwargs)
 def ScanDirectory(
     pdfRoot: str | Path,
     *,
     globPattern: str = "**/*.pdf",
+    runWetDetection: bool = True,
     detector: Detector | None = None,
     **kwargs: Any,
 ) -> Iterator[dict[str, Any]]:
@@ -143,7 +157,7 @@ def ScanDirectory(
     for pdfPath in iterator:
         if pdfPath.is_file() and pdfPath.suffix.lower() == ".pdf":
-            yield DetectPdf(pdfPath, detector=detector, **kwargs)
+            yield DetectPdf(pdfPath, detector=detector, runWetDetection=runWetDetection, **kwargs)
 def ToCsvRow(result: dict[str, Any]) -> dict[str, Any]:
@@ -174,11 +188,25 @@ def Version() -> str:
         return "0.0.0-dev"
-def _DetectWithDetector(detector: Detector, pdfPath: str | Path) -> dict[str, Any]:
+def _DetectWithDetector(
+    detector: Detector, pdfPath: str | Path, *, runWetDetection: bool
+) -> dict[str, Any]:
     """Helper that runs ``detector`` and returns the plain dictionary result."""
     resolvedPath = Path(pdfPath)
-    return _ToPlainDictionary(detector.Detect(resolvedPath))
+    result = detector.Detect(resolvedPath)
+    if runWetDetection:
+        configuration = _ResolveConfiguration(detector)
+        if configuration is not None:
+            apply_wet_detection(resolvedPath, configuration, result)
+    return _ToPlainDictionary(result)
+def _ResolveConfiguration(detector: Detector) -> DetectConfiguration | None:
+    configuration = getattr(detector, "Configuration", None)
+    if isinstance(configuration, DetectConfiguration):
+        return configuration
+    return None
 @contextmanager
@@ -201,8 +229,7 @@ def CropSignatureImages(
     dpi: int = 200,
     returnBytes: Literal[False] = False,
     saveToDisk: bool = True,
-) -> list[Path]:
-    ...
+) -> list[Path]: ...
 @overload
@@ -214,8 +241,7 @@ def CropSignatureImages(
     dpi: int,
     returnBytes: Literal[True],
     saveToDisk: bool,
-) -> list[SignatureCrop]:
-    ...
+) -> list[SignatureCrop]: ...
 def CropSignatureImages(
@@ -227,12 +253,15 @@ def CropSignatureImages(
     returnBytes: bool = False,
     saveToDisk: bool = True,
 ) -> list[Path] | list[SignatureCrop]:
-    """Crop detected signature regions to PNG files.
+    """Create DOCX files containing cropped signature images.
     Accepts either a :class:`FileResult` instance or the ``dict`` returned by
     :func:`DetectPdf`. Requires the optional ``pymupdf`` dependency.
     Set ``returnBytes=True`` to also receive in-memory PNG bytes for each crop. Set
     ``saveToDisk=False`` to skip writing PNG files while still returning in-memory data.
+    When ``saveToDisk`` is enabled, a one-image DOCX file is also written per crop. When
+    ``returnBytes`` is True and ``python-docx`` is available, the returned
+    :class:`SignatureCrop` objects include ``docx_bytes``.
     """
     from sigdetect.cropping import crop_signatures
@@ -275,6 +304,7 @@ def _CoerceFileResult(
                 RenderType=str(entry.get("render_type") or "unknown"),
                 BoundingBox=tuple(bbox) if bbox else None,
                 CropPath=entry.get("crop_path"),
+                CropBytes=entry.get("crop_bytes"),
             )
         )
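
One consequence of the `_ResolveConfiguration` guard in this file is that a detector without a usable `Configuration` attribute skips wet detection silently, even when `runWetDetection=True`. The sketch below mirrors that control flow with stand-in classes. `StubConfiguration`, `StubDetector`, `fake_wet_pass` and `detect` are all hypothetical names for illustration and do not call sigdetect.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class StubConfiguration:
    """Stand-in for DetectConfiguration (hypothetical)."""
    profile: str = "hipaa"


@dataclass
class StubDetector:
    """Stand-in detector; Configuration may be missing or of the wrong type."""
    Configuration: Any = None

    def Detect(self, path: Path) -> dict[str, Any]:
        return {"file": path.name, "signatures": []}


def resolve_configuration(detector: Any) -> StubConfiguration | None:
    """Return the detector's configuration only if it has the expected type."""
    configuration = getattr(detector, "Configuration", None)
    return configuration if isinstance(configuration, StubConfiguration) else None


def fake_wet_pass(path: Path, configuration: StubConfiguration, result: dict) -> None:
    """Placeholder for the OCR pass: mutates the result in place."""
    result["signatures"].append({"render_type": "wet"})


def detect(detector: Any, path: str, *, run_wet_detection: bool = True) -> dict[str, Any]:
    resolved = Path(path)
    result = detector.Detect(resolved)
    if run_wet_detection:
        configuration = resolve_configuration(detector)
        if configuration is not None:  # no configuration -> wet pass is skipped
            fake_wet_pass(resolved, configuration, result)
    return result


with_config = detect(StubDetector(StubConfiguration()), "example.pdf")
opted_out = detect(StubDetector(StubConfiguration()), "example.pdf", run_wet_detection=False)
no_config = detect(StubDetector(), "example.pdf")
print(len(with_config["signatures"]), len(opted_out["signatures"]), len(no_config["signatures"]))
```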

{sigdetect-0.4.0 → sigdetect-0.5.0}/src/sigdetect/cli.py RENAMED

@@ -2,6 +2,7 @@
 from __future__ import annotations
+import base64
 import json
 from collections.abc import Iterator
 from dataclasses import asdict, is_dataclass
@@ -48,6 +49,12 @@ def Detect(
     configurationPath: Path | None = typer.Option(
         None, "--config", "-c", help="Path to YAML config"
     ),
+    writeResults: bool | None = typer.Option(
+        None,
+        "--write-results/--no-write-results",
+        help="Write results.json (or JSON to stdout when out_dir is none)",
+        show_default=False,
+    ),
     profileOverride: str | None = typer.Option(None, "--profile", "-p", help="hipaa or retainer"),
     recursive: bool = typer.Option(
         True,
@@ -57,13 +64,13 @@ def Detect(
     cropSignatures: bool | None = typer.Option(
         None,
         "--crop-signatures/--no-crop-signatures",
-        help="Crop detected signature regions to PNG files (requires PyMuPDF)",
+        help="Write DOCX files containing cropped signature images (requires PyMuPDF + python-docx)",
         show_default=False,
     ),
     cropDirectory: Path | None = typer.Option(
         None,
         "--crop-dir",
-        help="Directory for signature PNG crops (defaults to out_dir/signature_crops)",
+        help="Directory for signature DOCX crops (defaults to out_dir/signature_crops)",
     ),
     cropDpi: int | None = typer.Option(
         None,
@@ -73,10 +80,16 @@ def Detect(
         help="Rendering DPI for signature crops",
         show_default=False,
     ),
+    cropBytes: bool = typer.Option(
+        False,
+        "--crop-bytes/--no-crop-bytes",
+        help="Embed base64 PNG bytes for signature crops in results JSON",
+        show_default=False,
+    ),
     detectWetSignatures: bool | None = typer.Option(
         None,
         "--detect-wet/--no-detect-wet",
-        help="Run OCR-backed wet signature detection (requires PyMuPDF + Tesseract)",
+        help="Compatibility flag; non-e-sign PDFs always run OCR when deps are available",
         show_default=False,
     ),
     wetOcrDpi: int | None = typer.Option(
@@ -111,6 +124,8 @@ def Detect(
         configuration = configuration.model_copy(update={"Profile": normalized_profile})
     overrides: dict[str, object] = {}
+    if writeResults is not None:
+        overrides["WriteResults"] = writeResults
     if cropSignatures is not None:
         overrides["CropSignatures"] = cropSignatures
     if cropDirectory is not None:
@@ -145,44 +160,52 @@ def Detect(
     except StopIteration:
         raise SystemExit(f"No PDFs found in {configuration.PdfRoot}") from None
-    results_buffer: list[FileResult] | None = [] if configuration.OutputDirectory is None else None
+    write_results = configuration.WriteResults
+    results_buffer: list[FileResult] | None = (
+        [] if write_results and configuration.OutputDirectory is None else None
+    )
     json_handle = None
     json_path: Path | None = None
     wrote_first = False
-    if configuration.OutputDirectory is not None:
+    if write_results and configuration.OutputDirectory is not None:
         outputDirectory = configuration.OutputDirectory
         outputDirectory.mkdir(parents=True, exist_ok=True)
         json_path = outputDirectory / "results.json"
         json_handle = open(json_path, "w", encoding="utf-8")
         json_handle.write("[")
+    crop_bytes_enabled = bool(cropBytes)
     crop_dir = configuration.CropOutputDirectory
+    if crop_dir is None:
+        base_dir = configuration.OutputDirectory or configuration.PdfRoot
+        crop_dir = base_dir / "signature_crops"
     cropping_enabled = configuration.CropSignatures
     cropping_available = True
     cropping_attempted = False
-    if configuration.CropSignatures and crop_dir is None:
-        Logger.warning(
-            "CropSignatures enabled without an output directory",
-            extra={"pdf_root": str(configuration.PdfRoot)},
-        )
-        cropping_enabled = False
     total_bboxes = 0
     def _append_result(file_result: FileResult, source_pdf: Path) -> None:
         nonlocal wrote_first, json_handle, total_bboxes, cropping_available, cropping_attempted
-        if cropping_enabled and cropping_available and crop_dir is not None:
+        if cropping_available and (cropping_enabled or crop_bytes_enabled) and crop_dir is not None:
             try:
-                crop_signatures(
+                crops = crop_signatures(
                     pdf_path=source_pdf,
                     file_result=file_result,
                     output_dir=crop_dir,
                     dpi=configuration.CropImageDpi,
                     logger=Logger,
+                    return_bytes=crop_bytes_enabled,
+                    save_files=cropping_enabled,
                 )
                 cropping_attempted = True
+                if crop_bytes_enabled:
+                    for crop in crops:
+                        crop.signature.CropBytes = base64.b64encode(crop.image_bytes).decode(
+                            "ascii"
+                        )
             except SignatureCroppingUnavailable as exc:
                 cropping_available = False
                 Logger.warning("Signature cropping unavailable", extra={"error": str(exc)})
@@ -231,18 +254,24 @@ def Detect(
             json_handle.write(closing)
             json_handle.close()
-    if json_handle is not None:
-        typer.echo(f"Wrote {json_path}")
-    else:
-        payload = json.dumps(
-            results_buffer or [], indent=2, ensure_ascii=False, default=_JsonSerializer
-        )
-        typer.echo(payload)
-        typer.echo("Detection completed with output disabled (out_dir=none)")
-    if cropping_enabled and cropping_available and cropping_attempted and total_bboxes == 0:
+    if write_results:
+        if json_handle is not None:
+            typer.echo(f"Wrote {json_path}")
+        else:
+            payload = json.dumps(
+                results_buffer or [], indent=2, ensure_ascii=False, default=_JsonSerializer
+            )
+            typer.echo(payload)
+            typer.echo("Detection completed with output disabled (out_dir=none)")
+    if (
+        (cropping_enabled or crop_bytes_enabled)
+        and cropping_available
+        and cropping_attempted
+        and total_bboxes == 0
+    ):
         Logger.warning(
-            "No signature bounding boxes detected; try --engine pymupdf for crop-ready output",
+            "No signature bounding boxes detected; install PyMuPDF for crop-ready output",
             extra={"engine": configuration.Engine},
         )
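
The hunks above change two decisions in `Detect`: where results go, and where crops land when no crop directory is configured. A condensed sketch of both rules, as read from the diff, is below. `results_destination` and `default_crop_dir` are hypothetical helper names; the real command inlines this logic.

```python
from __future__ import annotations

from pathlib import Path


def results_destination(write_results: bool, out_dir: Path | None) -> str:
    """Classify where detection results are emitted."""
    if not write_results:
        return "none"  # new default: nothing is written or echoed
    return "stdout" if out_dir is None else "file"  # file means out_dir/results.json


def default_crop_dir(
    crop_output_dir: Path | None, out_dir: Path | None, pdf_root: Path
) -> Path:
    """Pick the crop directory, falling back to out_dir, then pdf_root."""
    if crop_output_dir is not None:
        return crop_output_dir
    base_dir = out_dir or pdf_root  # Path objects are always truthy, so `or` only skips None
    return base_dir / "signature_crops"


print(results_destination(False, Path("out")))
print(results_destination(True, None))
print(default_crop_dir(None, None, Path("pdfs")))
```

Under the 0.4.0 logic a missing crop directory disabled cropping with a warning; the fallback chain above replaces that behaviour.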

{sigdetect-0.4.0 → sigdetect-0.5.0}/src/sigdetect/config.py RENAMED

@@ -25,6 +25,7 @@ class DetectConfiguration(BaseModel):
     PdfRoot: Path = Field(default=Path("hipaa_results"), alias="pdf_root")
     OutputDirectory: Path | None = Field(default=Path("out"), alias="out_dir")
+    WriteResults: bool = Field(default=False, alias="write_results")
     Engine: EngineName = Field(default="auto", alias="engine")
     Profile: ProfileName = Field(default="hipaa", alias="profile")
     PseudoSignatures: bool = Field(default=True, alias="pseudo_signatures")
@@ -63,6 +64,10 @@ class DetectConfiguration(BaseModel):
     def out_dir(self) -> Path | None:  # pragma: no cover - simple passthrough
         return self.OutputDirectory
+    @property
+    def write_results(self) -> bool:  # pragma: no cover - simple passthrough
+        return self.WriteResults
     @property
     def engine(self) -> EngineName:  # pragma: no cover - simple passthrough
         return self.Engine
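
The new `WriteResults` field defaults to `False`, and the CLI only overrides it when `--write-results/--no-write-results` is passed explicitly (the option defaults to `None`). The sketch below shows that tri-state merge with a standard-library dataclass. It is an analog for illustration, not the pydantic model used by sigdetect; `SketchConfiguration` and `apply_cli_override` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfiguration:
    """Stdlib stand-in for the pydantic DetectConfiguration (hypothetical)."""
    WriteResults: bool = False

    @property
    def write_results(self) -> bool:
        # snake_case passthrough, mirroring the property added in the diff
        return self.WriteResults


def apply_cli_override(
    configuration: SketchConfiguration, write_results_flag: bool | None
) -> SketchConfiguration:
    """None means the flag was not given, so the configured value is kept."""
    if write_results_flag is None:
        return configuration
    return replace(configuration, WriteResults=write_results_flag)


base = SketchConfiguration()
print(base.write_results)
print(apply_cli_override(base, True).write_results)
```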
