PyPI - sigdetect - Versions diffs - 0.4.0__tar.gz → 0.5.1__tar.gz - Mend

sigdetect 0.4.0__tar.gz → 0.5.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (37)

{sigdetect-0.4.0/src/sigdetect.egg-info → sigdetect-0.5.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sigdetect
-Version: 0.4.0
+Version: 0.5.1
 Summary: Signature detection and role attribution for PDFs
 Author-email: BT Asmamaw <basmamaw@angeiongroup.com>
 License: MIT
@@ -10,9 +10,11 @@ Requires-Dist: pypdf>=4.0.0
 Requires-Dist: rich>=13.0
 Requires-Dist: typer>=0.12
 Requires-Dist: pydantic>=2.5
+Requires-Dist: pillow>=10.0
+Requires-Dist: python-docx>=1.1.0
+Requires-Dist: pytesseract>=0.3.10
+Requires-Dist: pymupdf>=1.23
 Requires-Dist: pyyaml>=6.0
-Provides-Extra: pymupdf
-Requires-Dist: pymupdf>=1.23; extra == "pymupdf"
 # CaseWorks.Automation.CaseDocumentIntake
@@ -95,14 +97,16 @@ sigdetect detect \
 ### Notes
 - The config file controls `pdf_root`, `out_dir`, `engine`, `pseudo_signatures`, `recurse_xobjects`, etc.
-- `--engine` accepts **auto** (default; prefers PyMuPDF when installed, falls back to PyPDF2), **pypdf2**, or **pymupdf**.
+- Engine selection is forced to **auto** (prefers PyMuPDF for geometry, falls back to PyPDF2); any configured `engine` value is overridden.
 - `--pseudo-signatures` enables a vendor/Acro-only pseudo-signature when no actual `/Widget` is present (useful for DocuSign / Acrobat Sign receipts).
 - `--recurse-xobjects` allows scanning Form XObjects for vendor markers and labels embedded in page resources.
 - `--profile` selects tuned role logic:
   - `hipaa` → patient / representative / attorney
   - `retainer` → client / firm (prefers detecting two signatures)
 - `--recursive/--no-recursive` toggles whether `sigdetect detect` descends into subdirectories when hunting for PDFs (recursive by default).
-- Cropping (`--crop-signatures`) and wet detection (`--detect-wet`) are enabled by default for single-pass runs; disable them if you want a light, e-sign-only pass. PyMuPDF is required for crops; PyMuPDF + Tesseract are required for wet detection.
+- Results output is disabled by default; set `write_results: true` or pass `--write-results` when you need `results.json` (for EDA).
+- Cropping (`--crop-signatures`) writes PNG crops to disk by default; enable `--crop-docx` to write DOCX files instead of PNGs. `--crop-bytes` embeds base64 PNG data in `signatures[].crop_bytes` and, when `--crop-docx` is enabled, embeds DOCX bytes in `signatures[].crop_docx_bytes`. PyMuPDF is required for crops, and `python-docx` is required for DOCX output.
+- Wet detection runs automatically for non-e-sign PDFs when dependencies are available; missing OCR dependencies add a `ManualReview:*` hint instead of failing. PyMuPDF + Tesseract are required for wet detection.
 - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
 ### EDA (quick aggregate stats)
@@ -113,6 +117,8 @@ sigdetect eda \
 ~~~
+`sigdetect eda` expects `results.json`; enable `write_results: true` when running detect.
 ---
 ## Library usage
@@ -136,13 +142,13 @@ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 print(result.to_dict())
 ~~~
-`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
+`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image; when DOCX cropping is enabled, `crop_docx_path` points at the generated doc. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
 ---
 ## Library API (embed in another script)
-Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping:
+Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping. Engine selection is forced to `auto` (PyMuPDF preferred) to ensure geometry. Wet detection runs automatically for non-e-sign PDFs; pass `runWetDetection=False` to skip OCR.
 ~~~python
 from pathlib import Path
@@ -165,6 +171,7 @@ result = DetectPdf(
     profileName="retainer",
     includePseudoSignatures=True,
     recurseXObjects=True,
+    # runWetDetection=False,  # disable OCR-backed wet detection if desired
 )
 print(
     result["file"],
@@ -187,7 +194,7 @@ for res in ScanDirectory(
     # store in DB, print, etc.
     pass
-# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
+# 3) Crop signature snippets for FileResult objects (requires PyMuPDF; DOCX needs python-docx)
 detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
 file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 CropSignatureImages(
@@ -226,7 +233,8 @@ High-level summary (per file):
       "hint": "AcroSig:sig_patient",
       "render_type": "typed",
       "bounding_box": [10.0, 10.0, 150.0, 40.0],
-      "crop_path": "signature_crops/example/sig_01_patient.png"
+      "crop_path": "signature_crops/example/sig_01_patient.png",
+      "crop_docx_path": null
     },
     {
       "page": null,
@@ -253,6 +261,9 @@ High-level summary (per file):
 - In retainer profile, emitter prefers two signatures (client + firm), often on the same page.
 - **`signatures[].bounding_box`** reports the widget rectangle in PDF points (origin bottom-left).
 - **`signatures[].crop_path`** is populated when PNG crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
+- **`signatures[].crop_docx_path`** is populated when DOCX crops are generated (`--crop-docx` or `docx=True`).
+- **`signatures[].crop_bytes`** contains base64 PNG data when CLI `--crop-bytes` is enabled.
+- **`signatures[].crop_docx_bytes`** contains base64 DOCX data when `--crop-docx` and `--crop-bytes` are enabled together.
 ---
@@ -274,14 +285,16 @@ You can keep one config YAML per dataset, e.g.:
 # ./sample_data/config.yml (example)
 pdf_root: ./pdfs
 out_dir: ./sigdetect_out
-engine: pypdf2
+engine: auto
+write_results: false
 pseudo_signatures: true
 recurse_xobjects: true
 profile: retainer    # or: hipaa
 crop_signatures: false   # enable to write PNG crops (requires pymupdf)
+crop_docx: false         # enable to write DOCX crops instead of PNGs (requires python-docx)
 # crop_output_dir: ./signature_crops
 crop_image_dpi: 200
-detect_wet_signatures: false   # opt-in OCR wet detection (PyMuPDF + Tesseract)
+detect_wet_signatures: false   # kept for compatibility; non-e-sign PDFs still trigger OCR
 wet_ocr_dpi: 200
 wet_ocr_languages: eng
 wet_precision_threshold: 0.82
@@ -299,7 +312,7 @@ YAML files can be customized or load at runtime (see CLI `--config`, if availabl
   - Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
   - Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
   - When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
-- **Wet detection (opt-in):** With `detect_wet_signatures: true`, the CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
+- **Wet detection (non-e-sign):** The CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection whenever no e-sign evidence is found. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. When an image-based signature is present on a page, label-only OCR candidates are suppressed unless a stroke is detected. Results are deduped to the top signature per role (dropping `unknown`). Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
 ---
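
The new `signatures[].crop_bytes` field is described above as base64 PNG data. A consumer reading `results.json` would need to decode it before use. The following is a minimal sketch of that step, not code from the package: `decode_crops` and the `sample` record are hypothetical, and the payload is only a PNG file signature rather than a real crop.

```python
import base64

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte signature every PNG file starts with


def decode_crops(file_result: dict) -> list[bytes]:
    """Return decoded bytes for every signature entry that carries crop_bytes."""
    decoded = []
    for signature in file_result.get("signatures", []):
        payload = signature.get("crop_bytes")
        if payload:  # absent or null when --crop-bytes was not used
            decoded.append(base64.b64decode(payload))
    return decoded


# Hypothetical stand-in for one per-file record; not real detector output.
sample = {
    "file": "example.pdf",
    "signatures": [
        {
            "hint": "AcroSig:sig_patient",
            "crop_bytes": base64.b64encode(PNG_MAGIC).decode("ascii"),
        },
        {"hint": "VendorOnly", "crop_bytes": None},
    ],
}

crops = decode_crops(sample)
```

The same approach should apply to `crop_docx_bytes`, with the decoded bytes written to a `.docx` file instead.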

sigdetect-0.4.0/PKG-INFO → sigdetect-0.5.1/README.md RENAMED

@@ -1,19 +1,3 @@
-Metadata-Version: 2.4
-Name: sigdetect
-Version: 0.4.0
-Summary: Signature detection and role attribution for PDFs
-Author-email: BT Asmamaw <basmamaw@angeiongroup.com>
-License: MIT
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-Requires-Dist: pypdf>=4.0.0
-Requires-Dist: rich>=13.0
-Requires-Dist: typer>=0.12
-Requires-Dist: pydantic>=2.5
-Requires-Dist: pyyaml>=6.0
-Provides-Extra: pymupdf
-Requires-Dist: pymupdf>=1.23; extra == "pymupdf"
 # CaseWorks.Automation.CaseDocumentIntake
 ## sigdetect
@@ -95,14 +79,16 @@ sigdetect detect \
 ### Notes
 - The config file controls `pdf_root`, `out_dir`, `engine`, `pseudo_signatures`, `recurse_xobjects`, etc.
-- `--engine` accepts **auto** (default; prefers PyMuPDF when installed, falls back to PyPDF2), **pypdf2**, or **pymupdf**.
+- Engine selection is forced to **auto** (prefers PyMuPDF for geometry, falls back to PyPDF2); any configured `engine` value is overridden.
 - `--pseudo-signatures` enables a vendor/Acro-only pseudo-signature when no actual `/Widget` is present (useful for DocuSign / Acrobat Sign receipts).
 - `--recurse-xobjects` allows scanning Form XObjects for vendor markers and labels embedded in page resources.
 - `--profile` selects tuned role logic:
   - `hipaa` → patient / representative / attorney
   - `retainer` → client / firm (prefers detecting two signatures)
 - `--recursive/--no-recursive` toggles whether `sigdetect detect` descends into subdirectories when hunting for PDFs (recursive by default).
-- Cropping (`--crop-signatures`) and wet detection (`--detect-wet`) are enabled by default for single-pass runs; disable them if you want a light, e-sign-only pass. PyMuPDF is required for crops; PyMuPDF + Tesseract are required for wet detection.
+- Results output is disabled by default; set `write_results: true` or pass `--write-results` when you need `results.json` (for EDA).
+- Cropping (`--crop-signatures`) writes PNG crops to disk by default; enable `--crop-docx` to write DOCX files instead of PNGs. `--crop-bytes` embeds base64 PNG data in `signatures[].crop_bytes` and, when `--crop-docx` is enabled, embeds DOCX bytes in `signatures[].crop_docx_bytes`. PyMuPDF is required for crops, and `python-docx` is required for DOCX output.
+- Wet detection runs automatically for non-e-sign PDFs when dependencies are available; missing OCR dependencies add a `ManualReview:*` hint instead of failing. PyMuPDF + Tesseract are required for wet detection.
 - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
 ### EDA (quick aggregate stats)
@@ -113,6 +99,8 @@ sigdetect eda \
 ~~~
+`sigdetect eda` expects `results.json`; enable `write_results: true` when running detect.
 ---
 ## Library usage
@@ -136,13 +124,13 @@ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 print(result.to_dict())
 ~~~
-`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
+`Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)). Each signature entry now exposes `bounding_box` coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, `crop_path` points at the generated image; when DOCX cropping is enabled, `crop_docx_path` points at the generated doc. Use `Engine="auto"` if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
 ---
 ## Library API (embed in another script)
-Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping:
+Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping. Engine selection is forced to `auto` (PyMuPDF preferred) to ensure geometry. Wet detection runs automatically for non-e-sign PDFs; pass `runWetDetection=False` to skip OCR.
 ~~~python
 from pathlib import Path
@@ -165,6 +153,7 @@ result = DetectPdf(
     profileName="retainer",
     includePseudoSignatures=True,
     recurseXObjects=True,
+    # runWetDetection=False,  # disable OCR-backed wet detection if desired
 )
 print(
     result["file"],
@@ -187,7 +176,7 @@ for res in ScanDirectory(
     # store in DB, print, etc.
     pass
-# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
+# 3) Crop signature snippets for FileResult objects (requires PyMuPDF; DOCX needs python-docx)
 detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
 file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
 CropSignatureImages(
@@ -226,7 +215,8 @@ High-level summary (per file):
       "hint": "AcroSig:sig_patient",
       "render_type": "typed",
       "bounding_box": [10.0, 10.0, 150.0, 40.0],
-      "crop_path": "signature_crops/example/sig_01_patient.png"
+      "crop_path": "signature_crops/example/sig_01_patient.png",
+      "crop_docx_path": null
     },
     {
       "page": null,
@@ -253,6 +243,9 @@ High-level summary (per file):
 - In retainer profile, emitter prefers two signatures (client + firm), often on the same page.
 - **`signatures[].bounding_box`** reports the widget rectangle in PDF points (origin bottom-left).
 - **`signatures[].crop_path`** is populated when PNG crops are generated (via CLI `--crop-signatures` or `CropSignatureImages`).
+- **`signatures[].crop_docx_path`** is populated when DOCX crops are generated (`--crop-docx` or `docx=True`).
+- **`signatures[].crop_bytes`** contains base64 PNG data when CLI `--crop-bytes` is enabled.
+- **`signatures[].crop_docx_bytes`** contains base64 DOCX data when `--crop-docx` and `--crop-bytes` are enabled together.
 ---
@@ -274,14 +267,16 @@ You can keep one config YAML per dataset, e.g.:
 # ./sample_data/config.yml (example)
 pdf_root: ./pdfs
 out_dir: ./sigdetect_out
-engine: pypdf2
+engine: auto
+write_results: false
 pseudo_signatures: true
 recurse_xobjects: true
 profile: retainer    # or: hipaa
 crop_signatures: false   # enable to write PNG crops (requires pymupdf)
+crop_docx: false         # enable to write DOCX crops instead of PNGs (requires python-docx)
 # crop_output_dir: ./signature_crops
 crop_image_dpi: 200
-detect_wet_signatures: false   # opt-in OCR wet detection (PyMuPDF + Tesseract)
+detect_wet_signatures: false   # kept for compatibility; non-e-sign PDFs still trigger OCR
 wet_ocr_dpi: 200
 wet_ocr_languages: eng
 wet_precision_threshold: 0.82
@@ -299,7 +294,7 @@ YAML files can be customized or load at runtime (see CLI `--config`, if availabl
   - Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
   - Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
   - When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
-- **Wet detection (opt-in):** With `detect_wet_signatures: true`, the CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
+- **Wet detection (non-e-sign):** The CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection whenever no e-sign evidence is found. It emits `RenderType="wet"` signatures for high-confidence label/stroke pairs in the lower page region. When an image-based signature is present on a page, label-only OCR candidates are suppressed unless a stroke is detected. Results are deduped to the top signature per role (dropping `unknown`). Missing OCR dependencies add a `ManualReview:*` hint instead of failing.
 ---
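
The wet-detection note above says results are deduped to the top signature per role, dropping `unknown`. The diff does not show how "top" is ranked, so the sketch below assumes a numeric `score` key and a `role` key on each candidate. Both key names and the `dedupe_by_role` helper are illustrative assumptions, not the package's implementation.

```python
def dedupe_by_role(candidates: list[dict]) -> list[dict]:
    """Keep the highest-scoring candidate per role and drop 'unknown' roles."""
    best: dict[str, dict] = {}
    for candidate in candidates:
        role = candidate.get("role", "unknown")
        if role == "unknown":
            continue  # unattributed candidates are discarded entirely
        current = best.get(role)
        if current is None or candidate["score"] > current["score"]:
            best[role] = candidate
    return list(best.values())


# Hypothetical OCR candidates for a retainer document.
candidates = [
    {"role": "client", "score": 0.84, "page": 3},
    {"role": "unknown", "score": 0.99, "page": 1},
    {"role": "client", "score": 0.91, "page": 5},
    {"role": "firm", "score": 0.88, "page": 5},
]

kept = dedupe_by_role(candidates)
```

Under these assumptions, a high-scoring `unknown` candidate never displaces an attributed one, which matches the "dropping `unknown`" wording.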

{sigdetect-0.4.0 → sigdetect-0.5.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sigdetect"
-version = "0.4.0"
+version = "0.5.1"
 description = "Signature detection and role attribution for PDFs"
 readme = "README.md"
 authors = [{ name = "BT Asmamaw", email = "basmamaw@angeiongroup.com" }]
@@ -15,12 +15,13 @@ dependencies = [
   "rich>=13.0",
   "typer>=0.12",
   "pydantic>=2.5",
+  "pillow>=10.0",
+  "python-docx>=1.1.0",
+  "pytesseract>=0.3.10",
+  "pymupdf>=1.23",
   "pyyaml>=6.0",
 ]
-[project.optional-dependencies]
-pymupdf = ["pymupdf>=1.23"]
 [project.scripts]
 sigdetect = "sigdetect.cli:app"
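
This hunk moves `pymupdf` from an optional extra to a hard requirement and adds `pillow`, `python-docx`, and `pytesseract`. Note that `pytesseract` is only a wrapper, so OCR still needs the Tesseract executable installed separately. A rough way to check an environment before upgrading might look like the sketch below. The helper names are hypothetical, and the import names (`PIL`, `docx`, `fitz`) are the conventional ones for these distributions rather than anything stated in the diff.

```python
import importlib.util
import shutil

# Distribution name -> conventional import name (assumed, not taken from the diff).
REQUIRED_MODULES = {
    "pillow": "PIL",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
    "pymupdf": "fitz",
}


def missing_modules(modules: dict[str, str]) -> list[str]:
    """Return the distribution names whose import module cannot be found."""
    return [
        dist
        for dist, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def tesseract_on_path() -> bool:
    """True when a 'tesseract' executable is discoverable on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    print("missing Python packages:", missing_modules(REQUIRED_MODULES))
    print("tesseract binary found:", tesseract_on_path())
```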

{sigdetect-0.4.0 → sigdetect-0.5.1}/src/sigdetect/api.py RENAMED

@@ -9,6 +9,7 @@ from typing import Any, Generator, Iterable, Iterator, Literal, overload
 from sigdetect.config import DetectConfiguration
 from sigdetect.cropping import SignatureCrop
 from sigdetect.detector import BuildDetector, Detector, FileResult, Signature
+from sigdetect.wet_detection import apply_wet_detection
 EngineName = Literal["pypdf2", "pypdf", "pymupdf", "auto"]
 ProfileName = Literal["hipaa", "retainer"]
@@ -21,9 +22,13 @@ def DetectPdf(
     engineName: EngineName = "auto",
     includePseudoSignatures: bool = True,
     recurseXObjects: bool = True,
+    runWetDetection: bool = True,
     detector: Detector | None = None,
 ) -> dict[str, Any]:
-    """Detect signature evidence and assign roles for a single PDF."""
+    """Detect signature evidence and assign roles for a single PDF.
+    Wet detection runs by default for non-e-sign PDFs; pass ``runWetDetection=False`` to skip OCR.
+    """
     resolvedPath = Path(pdfPath)
     activeDetector = detector or get_detector(
@@ -36,6 +41,10 @@ def DetectPdf(
     )
     result = activeDetector.Detect(resolvedPath)
+    if runWetDetection:
+        configuration = _ResolveConfiguration(activeDetector)
+        if configuration is not None:
+            apply_wet_detection(resolvedPath, configuration, result)
     return _ToPlainDictionary(result)
@@ -48,7 +57,10 @@ def get_detector(
     recurseXObjects: bool = True,
     outputDirectory: str | Path | None = None,
 ) -> Detector:
-    """Return a reusable detector instance configured with the supplied options."""
+    """Return a reusable detector instance configured with the supplied options.
+    Engine selection is forced to ``auto`` (prefers PyMuPDF when available).
+    """
     configuration = DetectConfiguration(
         PdfRoot=Path(pdfRoot) if pdfRoot is not None else Path.cwd(),
@@ -108,6 +120,7 @@ def _ToPlainValue(value: Any) -> Any:
 def DetectMany(
     pdfPaths: Iterable[str | Path],
     *,
+    runWetDetection: bool = True,
     detector: Detector | None = None,
     **kwargs: Any,
 ) -> Iterator[dict[str, Any]]:
@@ -115,17 +128,18 @@ def DetectMany(
     if detector is not None:
         for pdfPath in pdfPaths:
-            yield _DetectWithDetector(detector, pdfPath)
+            yield _DetectWithDetector(detector, pdfPath, runWetDetection=runWetDetection)
         return
     for pdfPath in pdfPaths:
-        yield DetectPdf(pdfPath, **kwargs)
+        yield DetectPdf(pdfPath, runWetDetection=runWetDetection, **kwargs)
 def ScanDirectory(
     pdfRoot: str | Path,
     *,
     globPattern: str = "**/*.pdf",
+    runWetDetection: bool = True,
     detector: Detector | None = None,
     **kwargs: Any,
 ) -> Iterator[dict[str, Any]]:
@@ -143,7 +157,7 @@ def ScanDirectory(
     for pdfPath in iterator:
         if pdfPath.is_file() and pdfPath.suffix.lower() == ".pdf":
-            yield DetectPdf(pdfPath, detector=detector, **kwargs)
+            yield DetectPdf(pdfPath, detector=detector, runWetDetection=runWetDetection, **kwargs)
 def ToCsvRow(result: dict[str, Any]) -> dict[str, Any]:
@@ -174,11 +188,25 @@ def Version() -> str:
         return "0.0.0-dev"
-def _DetectWithDetector(detector: Detector, pdfPath: str | Path) -> dict[str, Any]:
+def _DetectWithDetector(
+    detector: Detector, pdfPath: str | Path, *, runWetDetection: bool
+) -> dict[str, Any]:
     """Helper that runs ``detector`` and returns the plain dictionary result."""
     resolvedPath = Path(pdfPath)
-    return _ToPlainDictionary(detector.Detect(resolvedPath))
+    result = detector.Detect(resolvedPath)
+    if runWetDetection:
+        configuration = _ResolveConfiguration(detector)
+        if configuration is not None:
+            apply_wet_detection(resolvedPath, configuration, result)
+    return _ToPlainDictionary(result)
+def _ResolveConfiguration(detector: Detector) -> DetectConfiguration | None:
+    configuration = getattr(detector, "Configuration", None)
+    if isinstance(configuration, DetectConfiguration):
+        return configuration
+    return None
 @contextmanager
@@ -201,8 +229,8 @@ def CropSignatureImages(
     dpi: int = 200,
     returnBytes: Literal[False] = False,
     saveToDisk: bool = True,
-) -> list[Path]:
-    ...
+    docx: bool = False,
+) -> list[Path]: ...
 @overload
@@ -214,8 +242,8 @@ def CropSignatureImages(
     dpi: int,
     returnBytes: Literal[True],
     saveToDisk: bool,
-) -> list[SignatureCrop]:
-    ...
+    docx: bool = False,
+) -> list[SignatureCrop]: ...
 def CropSignatureImages(
@@ -226,13 +254,17 @@ def CropSignatureImages(
     dpi: int = 200,
     returnBytes: bool = False,
     saveToDisk: bool = True,
+    docx: bool = False,
 ) -> list[Path] | list[SignatureCrop]:
-    """Crop detected signature regions to PNG files.
+    """Create PNG files containing cropped signature images (or DOCX when enabled).
     Accepts either a :class:`FileResult` instance or the ``dict`` returned by
     :func:`DetectPdf`. Requires the optional ``pymupdf`` dependency.
     Set ``returnBytes=True`` to also receive in-memory PNG bytes for each crop. Set
     ``saveToDisk=False`` to skip writing PNG files while still returning in-memory data.
+    When ``docx`` is True, DOCX files are written instead of PNG files. When ``returnBytes`` is
+    True and ``docx`` is enabled, the returned :class:`SignatureCrop` objects include
+    ``docx_bytes``.
     """
     from sigdetect.cropping import crop_signatures
@@ -245,6 +277,7 @@ def CropSignatureImages(
         dpi=dpi,
         return_bytes=returnBytes,
         save_files=saveToDisk,
+        docx=docx,
     )
     if original_dict is not None:
         original_dict.clear()
@@ -275,6 +308,9 @@ def _CoerceFileResult(
                 RenderType=str(entry.get("render_type") or "unknown"),
                 BoundingBox=tuple(bbox) if bbox else None,
                 CropPath=entry.get("crop_path"),
+                CropBytes=entry.get("crop_bytes"),
+                CropDocxPath=entry.get("crop_docx_path"),
+                CropDocxBytes=entry.get("crop_docx_bytes"),
             )
         )
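
The `api.py` changes above repeat one control-flow pattern in `DetectPdf` and `_DetectWithDetector`: run the detector, then apply the wet pass only when `runWetDetection` is true and the detector exposes a usable `Configuration`. The sketch below reproduces that flow with stand-in classes so it runs without the package installed. Every `Stub*` name and `apply_wet_stub` is hypothetical. The real `apply_wet_detection` is described as acting only on non-e-sign PDFs, whereas this stub always appends an entry.

```python
from dataclasses import dataclass, field


@dataclass
class StubConfiguration:
    """Hypothetical stand-in for DetectConfiguration."""


@dataclass
class StubResult:
    """Hypothetical stand-in for FileResult."""
    signatures: list = field(default_factory=list)


class StubDetector:
    """Detector whose Configuration attribute may be absent."""

    def __init__(self, configuration=None):
        if configuration is not None:
            self.Configuration = configuration

    def Detect(self, path):
        return StubResult()


def resolve_configuration(detector):
    # Mirrors the getattr + isinstance guard: anything else resolves to None.
    configuration = getattr(detector, "Configuration", None)
    return configuration if isinstance(configuration, StubConfiguration) else None


def apply_wet_stub(path, configuration, result):
    # Stand-in for the OCR pass; mutates the result in place.
    result.signatures.append({"render_type": "wet", "source": str(path)})


def detect(detector, path, *, run_wet_detection=True):
    result = detector.Detect(path)
    if run_wet_detection:
        configuration = resolve_configuration(detector)
        if configuration is not None:
            apply_wet_stub(path, configuration, result)
    return result
```

One consequence of the guard is that a custom detector without a `Configuration` attribute silently skips the wet pass instead of raising.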

{sigdetect-0.4.0 → sigdetect-0.5.1}/src/sigdetect/cli.py RENAMED

@@ -2,6 +2,7 @@
 from __future__ import annotations
+import base64
 import json
 from collections.abc import Iterator
 from dataclasses import asdict, is_dataclass
@@ -48,6 +49,12 @@ def Detect(
     configurationPath: Path | None = typer.Option(
         None, "--config", "-c", help="Path to YAML config"
     ),
+    writeResults: bool | None = typer.Option(
+        None,
+        "--write-results/--no-write-results",
+        help="Write results.json (or JSON to stdout when out_dir is none)",
+        show_default=False,
+    ),
     profileOverride: str | None = typer.Option(None, "--profile", "-p", help="hipaa or retainer"),
     recursive: bool = typer.Option(
         True,
@@ -57,13 +64,19 @@ def Detect(
     cropSignatures: bool | None = typer.Option(
         None,
         "--crop-signatures/--no-crop-signatures",
-        help="Crop detected signature regions to PNG files (requires PyMuPDF)",
+        help="Write PNG crops for signature widgets (requires PyMuPDF)",
+        show_default=False,
+    ),
+    cropDocx: bool | None = typer.Option(
+        None,
+        "--crop-docx/--no-crop-docx",
+        help="Write DOCX crops instead of PNG files (requires PyMuPDF + python-docx)",
         show_default=False,
     ),
     cropDirectory: Path | None = typer.Option(
         None,
         "--crop-dir",
-        help="Directory for signature PNG crops (defaults to out_dir/signature_crops)",
+        help="Directory for signature crops (defaults to out_dir/signature_crops)",
     ),
     cropDpi: int | None = typer.Option(
         None,
@@ -73,10 +86,16 @@ def Detect(
         help="Rendering DPI for signature crops",
         show_default=False,
     ),
+    cropBytes: bool = typer.Option(
+        False,
+        "--crop-bytes/--no-crop-bytes",
+        help="Embed base64 PNG bytes (and DOCX bytes when --crop-docx) in results JSON",
+        show_default=False,
+    ),
     detectWetSignatures: bool | None = typer.Option(
         None,
         "--detect-wet/--no-detect-wet",
-        help="Run OCR-backed wet signature detection (requires PyMuPDF + Tesseract)",
+        help="Compatibility flag; non-e-sign PDFs always run OCR when deps are available",
         show_default=False,
     ),
     wetOcrDpi: int | None = typer.Option(
@@ -111,8 +130,12 @@ def Detect(
         configuration = configuration.model_copy(update={"Profile": normalized_profile})
     overrides: dict[str, object] = {}
+    if writeResults is not None:
+        overrides["WriteResults"] = writeResults
     if cropSignatures is not None:
         overrides["CropSignatures"] = cropSignatures
+    if cropDocx is not None:
+        overrides["CropDocx"] = cropDocx
     if cropDirectory is not None:
         overrides["CropOutputDirectory"] = cropDirectory
     if cropDpi is not None:
@@ -145,53 +168,66 @@ def Detect(
     except StopIteration:
         raise SystemExit(f"No PDFs found in {configuration.PdfRoot}") from None
-    results_buffer: list[FileResult] | None = [] if configuration.OutputDirectory is None else None
+    write_results = configuration.WriteResults
+    results_buffer: list[FileResult] | None = (
+        [] if write_results and configuration.OutputDirectory is None else None
+    )
     json_handle = None
     json_path: Path | None = None
     wrote_first = False
-    if configuration.OutputDirectory is not None:
+    if write_results and configuration.OutputDirectory is not None:
         outputDirectory = configuration.OutputDirectory
         outputDirectory.mkdir(parents=True, exist_ok=True)
         json_path = outputDirectory / "results.json"
         json_handle = open(json_path, "w", encoding="utf-8")
         json_handle.write("[")
+    crop_bytes_enabled = bool(cropBytes)
     crop_dir = configuration.CropOutputDirectory
+    if crop_dir is None:
+        base_dir = configuration.OutputDirectory or configuration.PdfRoot
+        crop_dir = base_dir / "signature_crops"
     cropping_enabled = configuration.CropSignatures
+    docx_enabled = configuration.CropDocx
     cropping_available = True
     cropping_attempted = False
-    if configuration.CropSignatures and crop_dir is None:
-        Logger.warning(
-            "CropSignatures enabled without an output directory",
-            extra={"pdf_root": str(configuration.PdfRoot)},
-        )
-        cropping_enabled = False
     total_bboxes = 0
     def _append_result(file_result: FileResult, source_pdf: Path) -> None:
         nonlocal wrote_first, json_handle, total_bboxes, cropping_available, cropping_attempted
-        if cropping_enabled and cropping_available and crop_dir is not None:
+        if cropping_available and (cropping_enabled or crop_bytes_enabled) and crop_dir is not None:
             try:
-                crop_signatures(
+                crops = crop_signatures(
                     pdf_path=source_pdf,
                     file_result=file_result,
                     output_dir=crop_dir,
                     dpi=configuration.CropImageDpi,
                     logger=Logger,
+                    return_bytes=crop_bytes_enabled,
+                    save_files=cropping_enabled,
+                    docx=docx_enabled,
                 )
                 cropping_attempted = True
+                if crop_bytes_enabled:
+                    for crop in crops:
+                        crop.signature.CropBytes = base64.b64encode(crop.image_bytes).decode(
+                            "ascii"
+                        )
+                        if crop.docx_bytes:
+                            crop.signature.CropDocxBytes = base64.b64encode(
+                                crop.docx_bytes
+                            ).decode("ascii")
             except SignatureCroppingUnavailable as exc:
                 cropping_available = False
                 Logger.warning("Signature cropping unavailable", extra={"error": str(exc)})
                 typer.echo(str(exc), err=True)
             except Exception as exc:  # pragma: no cover - defensive
-                Logger.warning(
-                    "Unexpected error while cropping signatures",
-                    extra={"error": str(exc)},
-                )
+                cropping_available = False
+                Logger.warning("Signature cropping unavailable", extra={"error": str(exc)})
+                typer.echo(str(exc), err=True)
         total_bboxes += sum(1 for sig in file_result.Signatures if sig.BoundingBox)
@@ -231,18 +267,24 @@ def Detect(
             json_handle.write(closing)
             json_handle.close()
-    if json_handle is not None:
-        typer.echo(f"Wrote {json_path}")
-    else:
-        payload = json.dumps(
-            results_buffer or [], indent=2, ensure_ascii=False, default=_JsonSerializer
-        )
-        typer.echo(payload)
-        typer.echo("Detection completed with output disabled (out_dir=none)")
-    if cropping_enabled and cropping_available and cropping_attempted and total_bboxes == 0:
+    if write_results:
+        if json_handle is not None:
+            typer.echo(f"Wrote {json_path}")
+        else:
+            payload = json.dumps(
+                results_buffer or [], indent=2, ensure_ascii=False, default=_JsonSerializer
+            )
+            typer.echo(payload)
+            typer.echo("Detection completed with output disabled (out_dir=none)")
+    if (
+        (cropping_enabled or crop_bytes_enabled)
+        and cropping_available
+        and cropping_attempted
+        and total_bboxes == 0
+    ):
         Logger.warning(
-            "No signature bounding boxes detected; try --engine pymupdf for crop-ready output",
+            "No signature bounding boxes detected; install PyMuPDF for crop-ready output",
             extra={"engine": configuration.Engine},
         )
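
Two small decisions in the `cli.py` hunks above are easy to miss. First, tri-state flags (`bool | None`) override the config only when explicitly passed. Second, the crop directory now falls back to `<out_dir or pdf_root>/signature_crops` instead of disabling cropping with a warning. The following sketch isolates both rules. The function names are hypothetical and the code is a restatement of the logic shown in the diff, not the CLI itself.

```python
from pathlib import Path
from typing import Optional


def collect_overrides(**flags: Optional[bool]) -> dict[str, bool]:
    """Keep only flags the user actually passed (None means 'not specified')."""
    return {name: value for name, value in flags.items() if value is not None}


def resolve_crop_dir(
    crop_output_dir: Optional[Path],
    output_dir: Optional[Path],
    pdf_root: Path,
) -> Path:
    """Explicit crop dir wins; otherwise default under out_dir, else pdf_root."""
    if crop_output_dir is not None:
        return crop_output_dir
    base_dir = output_dir or pdf_root
    return base_dir / "signature_crops"


overrides = collect_overrides(WriteResults=None, CropSignatures=False, CropDocx=True)
default_dir = resolve_crop_dir(None, None, Path("pdfs"))
```

Because `False` is kept and only `None` is dropped, `--no-crop-signatures` can still turn off a setting that the YAML config enabled.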
