sigdetect 0.1.0__py3-none-any.whl

@@ -0,0 +1,394 @@
1
+ Metadata-Version: 2.4
2
+ Name: sigdetect
3
+ Version: 0.1.0
4
+ Summary: Signature detection and role attribution for PDFs
5
+ Author-email: BT Asmamaw <basmamaw@angeiongroup.com>
6
+ License: MIT
7
+ Requires-Python: >=3.9
8
+ Description-Content-Type: text/markdown
9
+ Requires-Dist: pypdf>=4.0.0
10
+ Requires-Dist: pandas>=2.0
11
+ Requires-Dist: rich>=13.0
12
+ Requires-Dist: typer>=0.12
13
+ Requires-Dist: pydantic>=2.5
14
+ Requires-Dist: pyyaml>=6.0
15
+ Provides-Extra: pymupdf
16
+ Requires-Dist: pymupdf>=1.23; extra == "pymupdf"
17
+
18
+ # CaseWorks.Automation.CaseDocumentIntake
19
+
20
+ ## sigdetect
21
+
22
+ `sigdetect` is a small Python library + CLI that detects **e-signature evidence** in PDFs and infers the **signer role** (e.g., _patient_, _attorney_, _representative_).
23
+
24
+ It looks for:
25
+
26
+ - Real signature **form fields** (`/Widget` annotations with `/FT /Sig`)
27
+ - **AcroForm** signature fields present only at the document level
28
+ - Common **vendor markers** (e.g., DocuSign, “Signature Certificate”)
29
+ - Page **labels** (like “Signature of Patient” or “Signature of Parent/Guardian”)
30
+
31
+ It returns a structured summary per file (pages, counts, roles, hints, etc.) that can be used downstream.
32
+
33
+ ---
34
+
35
+ ## Contents
36
+
37
+ - [Quick start](#quick-start)
38
+ - [CLI usage](#cli-usage)
39
+ - [Library usage](#library-usage)
40
+ - [Result schema](#result-schema)
41
+ - [Configuration & rules](#configuration--rules)
42
+ - [Smoke tests](#smoke-tests)
43
+ - [Dev workflow](#dev-workflow)
44
+ - [Troubleshooting](#troubleshooting)
45
+ - [License](#license)
46
+
47
+ ---
48
+
49
+ ## Quick start
50
+
51
+ ### Requirements
52
+
53
+ - Python **3.9+** (developed & tested on **3.11**)
54
+ - macOS / Linux / WSL
55
+
56
+ ### Setup
57
+
58
+ ~~~bash
59
+ # 1) Create and activate a virtualenv (example uses Python 3.11)
60
+ python3.11 -m venv .venv
61
+ source .venv/bin/activate
62
+
63
+ # 2) Install in editable (dev) mode
64
+ python -m pip install --upgrade pip
65
+ pip install -e .
66
+ ~~~
67
+
68
+ ### Sanity check
69
+
70
+ ~~~bash
71
+ # Run unit & smoke tests
72
+ pytest -q
73
+ ~~~
74
+
75
+ ---
76
+
77
+ ## CLI usage
78
+
79
+ The project ships a Typer-based CLI (exposed either as `sigdetect` or runnable via `python -m sigdetect.cli`, depending on how it is installed).
80
+
81
+ ~~~bash
82
+ sigdetect --help
83
+ # or
84
+ python -m sigdetect.cli --help
85
+ ~~~
86
+
87
+ ### Detect (per-file summary)
88
+
89
+ ~~~bash
90
+ # Execute detection according to the YAML configuration
91
+ sigdetect detect \
92
+   --config ./sample_data/config.yml \
93
+   --profile hipaa  # or: retainer
94
+ ~~~
95
+
96
+ ### Notes
97
+
98
+ - The config file controls `pdf_root`, `out_dir`, `engine`, `pseudo_signatures`, `recurse_xobjects`, etc.
99
+ - `--engine` supports **pypdf2** (default); a **pymupdf** engine placeholder exists and may be included in a future build.
100
+ - `--pseudo-signatures` enables a vendor/Acro-only pseudo-signature when no actual `/Widget` is present (useful for DocuSign / Acrobat Sign receipts).
101
+ - `--recurse-xobjects` allows scanning Form XObjects for vendor markers and labels embedded in page resources.
102
+ - `--profile` selects tuned role logic:
103
+   - `hipaa` → patient / representative / attorney
104
+   - `retainer` → client / firm (prefers detecting two signatures)
105
+ - If the executable is not on `PATH`, you can always fall back to `python -m sigdetect.cli ...`.
106
+
107
+ ### EDA (quick aggregate stats)
108
+
109
+ ~~~bash
110
+ sigdetect eda \
111
+   --config ./sample_data/config.yml
112
+
113
+ ~~~
114
+
115
+ ---
116
+
117
+ ## Library usage
118
+
119
+ ~~~python
120
+ from pathlib import Path
121
+ from sigdetect.config import DetectConfiguration
122
+ from sigdetect.detector.pypdf2_engine import PyPDF2Detector
123
+
124
+ configuration = DetectConfiguration(
125
+     PdfRoot=Path("/path/to/pdfs"),
126
+     OutputDirectory=Path("./out"),
127
+     Engine="pypdf2",
128
+     PseudoSignatures=True,
129
+     RecurseXObjects=True,
130
+     Profile="retainer",  # or "hipaa"
131
+ )
132
+
133
+ detector = PyPDF2Detector(configuration)
134
+ result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
135
+ print(result.to_dict())
136
+ ~~~
137
+
138
+ `Detect(Path)` returns a **FileResult** dataclass; call `.to_dict()` for the JSON-friendly representation (see [Result schema](#result-schema)).
139
+
140
+ ---
141
+
142
+ ## Library API (embed in another script)
143
+
144
+ A minimal, plug-and-play API is also available.
145
+ Import from `sigdetect.api` and get plain dicts out (JSON-ready),
146
+ with no I/O side effects by default:
147
+
148
+ ~~~python
149
+ from sigdetect.api import DetectPdf, DetectMany, ScanDirectory, ToCsvRow, Version
150
+
151
+ print("sigdetect", Version())
152
+
153
+ # 1) Single file → dict
154
+ result = DetectPdf(
155
+     "/path/to/file.pdf",
156
+     profileName="retainer",
157
+     includePseudoSignatures=True,
158
+     recurseXObjects=True,
159
+ )
160
+ print(
161
+     result["file"],
162
+     result["pages"],
163
+     result["esign_found"],
164
+     result["sig_count"],
165
+     result["sig_pages"],
166
+     result["roles"],
167
+     result["hints"],
168
+ )
169
+
170
+
171
+ # 2) Directory walk (generator of dicts)
172
+ for res in ScanDirectory(
173
+     "/path/to/pdfs",
174
+     profileName="hipaa",
175
+     includePseudoSignatures=True,
176
+     recurseXObjects=True,
177
+ ):
178
+     # store in DB, print, etc.
179
+     pass
180
+
181
+ ~~~
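
Because the functions above return plain dicts, simple aggregates need nothing beyond the standard library. The sketch below is an illustration only: `summarize` is a hypothetical helper (not part of `sigdetect`), and the sample dicts are hand-written stand-ins for what `ScanDirectory` would yield, using only keys printed in the example above.

```python
from typing import Any, Dict, Iterable


def summarize(results: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count files, files with e-signature evidence, and total signatures."""
    summary = {"files": 0, "with_esign": 0, "signatures": 0}
    for res in results:
        summary["files"] += 1
        if res.get("esign_found"):
            summary["with_esign"] += 1
        # Treat a missing or null sig_count as zero.
        summary["signatures"] += int(res.get("sig_count") or 0)
    return summary


# Hand-written stand-ins for dicts yielded by ScanDirectory (assumed shape).
sample_results = [
    {"file": "a.pdf", "pages": 3, "esign_found": True, "sig_count": 2},
    {"file": "b.pdf", "pages": 1, "esign_found": False, "sig_count": 0},
]
print(summarize(sample_results))
```

Since `summarize` accepts any iterable, the generator returned by `ScanDirectory(...)` could presumably be passed in directly without building a list first.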
182
+
183
+
184
+ ## Result schema
185
+
186
+ High-level summary (per file):
187
+
188
+ ~~~json
189
+ {
190
+   "file": "example.pdf",
191
+   "size_kb": 123.4,
192
+   "pages": 3,
193
+   "esign_found": true,
194
+   "scanned_pdf": false,
195
+   "mixed": false,
196
+   "sig_count": 2,
197
+   "sig_pages": "1,3",
198
+   "roles": "patient;representative",
199
+   "hints": "AcroSig:sig_patient;VendorText:DocuSign\\s+Envelope\\s+ID",
200
+   "signatures": [
201
+     {
202
+       "page": 1,
203
+       "field_name": "sig_patient",
204
+       "role": "patient",
205
+       "score": 5,
206
+       "scores": { "field": 3, "page_label": 2 },
207
+       "evidence": ["field:patient", "page_label:patient"],
208
+       "hint": "AcroSig:sig_patient"
209
+     },
210
+     {
211
+       "page": null,
212
+       "field_name": "vendor_or_acro_detected",
213
+       "role": "representative",
214
+       "score": 6,
215
+       "scores": { "page_label": 4, "general": 2 },
216
+       "evidence": ["page_label:representative(parent/guardian)", "pseudo:true"],
217
+       "hint": "VendorOrAcroOnly"
218
+     }
219
+   ]
220
+ }
221
+ ~~~
222
+
223
+ ### Field notes
224
+
225
+ - **`esign_found`** is `true` if any signature widget, AcroForm `/Sig` field, or vendor marker is detected.
226
+ - **`scanned_pdf`** is a heuristic flag for PDFs whose pages contain only images and no extractable text.
227
+ - **`mixed`** means both `esign_found` and `scanned_pdf` are `true`.
228
+ - **`roles`** summarizes unique non-`unknown` roles across signatures.
229
+ - In the `retainer` profile, the emitter prefers two signatures (client + firm), often on the same page.
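
The summary fields `sig_pages` and `roles` are delimited strings in the sample above, so downstream code will usually want to split them. This is a minimal sketch, assuming the delimiters are `,` and `;` as they appear in that sample; `parse_summary` is a hypothetical helper, not something `sigdetect` ships.

```python
from typing import Any, Dict


def parse_summary(result: Dict[str, Any]) -> Dict[str, Any]:
    """Split the delimited summary strings of a result dict into lists."""
    raw_pages = result.get("sig_pages") or ""
    raw_roles = result.get("roles") or ""
    return {
        "file": result["file"],
        # "1,3" -> [1, 3]; empty or missing -> []
        "sig_pages": [int(p) for p in raw_pages.split(",") if p.strip()],
        # "patient;representative" -> ["patient", "representative"]
        "roles": [r.strip() for r in raw_roles.split(";") if r.strip()],
        # Recomputed from the two flags, following the field note for `mixed`.
        "mixed": bool(result.get("esign_found")) and bool(result.get("scanned_pdf")),
    }


sample = {
    "file": "example.pdf",
    "esign_found": True,
    "scanned_pdf": False,
    "sig_pages": "1,3",
    "roles": "patient;representative",
}
parsed = parse_summary(sample)
print(parsed)
```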
230
+
231
+ ---
232
+
233
+ ## Configuration & rules
234
+
235
+ Built-in rules live under **`src/sigdetect/data/`**:
236
+
237
+ - **`vendor_patterns.yml`** – vendor byte/text patterns (e.g., DocuSign, Acrobat Sign).
238
+ - **`role_rules.yml`** – signer-role logic:
239
+   - `labels` – strong page labels (e.g., “Signature of Patient”, including Parent/Guardian cases)
240
+   - `general` – weaker role hints in surrounding text
241
+   - `field_hints` – field-name keywords (e.g., `sig_patient`)
242
+   - `doc_hard` – strong document-level triggers (relationship to patient, “minor/unable to sign”, first-person consent)
243
+   - `weights` – scoring weights for the above
244
+ - **`role_rules.retainer.yml`** – retainer-specific rules (labels for client/firm, general tokens, and field hints).
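
To make the role of `weights` concrete, here is a rough sketch of how per-source scores could combine into a role decision. It assumes the total is a plain sum of the per-source entries, which matches the `score` and `scores` values in the sample result above, but the engine's actual logic may differ. `total_score`, `best_role`, and the candidate numbers are illustrative only.

```python
from typing import Dict, Tuple


def total_score(scores: Dict[str, int]) -> int:
    """Sum the per-source contributions (e.g. field, page_label, general)."""
    return sum(scores.values())


def best_role(candidates: Dict[str, Dict[str, int]]) -> Tuple[str, int]:
    """Return the highest-scoring role, or ("unknown", 0) if there is no evidence."""
    if not candidates:
        return ("unknown", 0)
    role = max(candidates, key=lambda name: total_score(candidates[name]))
    return (role, total_score(candidates[role]))


# Illustrative evidence for one signature: role -> {source: points}.
candidates = {
    "patient": {"field": 3, "page_label": 2},
    "attorney": {"general": 1},
}
print(best_role(candidates))
```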
245
+
246
+ You can keep one config YAML per dataset, e.g.:
247
+ ~~~yaml
248
+ # ./sample_data/config.yml (example)
249
+ pdf_root: ./pdfs
250
+ out_dir: ./sigdetect_out
251
+ engine: pypdf2
252
+ pseudo_signatures: true
253
+ recurse_xobjects: true
254
+ profile: retainer # or: hipaa
255
+ ~~~
256
+
257
+ YAML files can be customized or loaded at runtime (see the CLI `--config` option, if available, or import the patterns and pass them into the engine).
258
+
259
+ ### Key detection behaviors
260
+
261
+ - **Widget-first in mixed docs:** if a real `/Widget` exists, no pseudo “VendorOrAcroOnly” signature is emitted.
262
+ - **Acro-only dedupe:** multiple `/Sig` fields at the document level collapse to a single pseudo signature.
263
+ - **Parent/Guardian label:** “Signature of Parent/Guardian” maps to the `representative` role.
264
+ - **Field-name fallbacks:** role hints are pulled from `/T`, `/TU`, or `/TM` (in that order).
265
+ - **Retainer heuristics:**
266
+   - Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
267
+   - Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
268
+   - When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
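
The field-name fallback listed above can be sketched with a plain dict standing in for a PDF field dictionary. This assumes "in that order" means the first non-empty entry wins; `field_name_hint` and `FALLBACK_KEYS` are hypothetical names, not the library's internals.

```python
from typing import Mapping, Optional

# Lookup order taken from the behavior list: /T, then /TU, then /TM.
FALLBACK_KEYS = ("/T", "/TU", "/TM")


def field_name_hint(field: Mapping[str, object]) -> Optional[str]:
    """Return the first non-empty string among /T, /TU, /TM, else None."""
    for key in FALLBACK_KEYS:
        value = field.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


# /T is absent here, so the lookup falls through to /TU.
print(field_name_hint({"/TU": "Signature of Patient", "/TM": "sig_patient"}))
```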
269
+
270
+ ---
271
+
272
+ ## Smoke tests
273
+
274
+ Drop-in smoke tests live under **`tests/`** and cover:
275
+
276
+ - Vendor-only (multiple markers)
277
+ - Acro-only (single pseudo with multiple `/Sig`)
278
+ - Mixed (real widget + vendor markers → widget role, no pseudo)
279
+ - Field-name fallbacks (`/TU`, `/TM`)
280
+ - Parent/Guardian label → `representative`
281
+ - Encrypted PDFs (graceful handling)
282
+
283
+ Run a subset:
284
+
285
+ ~~~bash
286
+ pytest -q -k smoke
287
+ # or specific files:
288
+ pytest -q tests/test_mixed_widget_vendor_smoke.py
289
+ ~~~
290
+
291
+ ---
292
+
293
+ ## Debugging
294
+ If you need to debug or inspect the detection logic, you can run the CLI with `--debug`, or run the detector directly on a single PDF from Python:
295
+ ~~~python
296
+ from pathlib import Path
297
+ from sigdetect.config import DetectConfiguration
298
+ from sigdetect.detector.pypdf2_engine import PyPDF2Detector
299
+
300
+ pdf = Path("/path/to/one.pdf")
301
+ configuration = DetectConfiguration(
302
+     PdfRoot=pdf.parent,
303
+     OutputDirectory=Path("."),
304
+     Engine="pypdf2",
305
+     Profile="retainer",
306
+     PseudoSignatures=True,
307
+     RecurseXObjects=True,
308
+ )
309
+ print(PyPDF2Detector(configuration).Detect(pdf).to_dict())
310
+
311
+ ~~~
312
+
313
+ ---
314
+
315
+ ## Dev workflow
316
+
317
+ ### Project layout
318
+
319
+ ~~~text
320
+ src/
321
+   sigdetect/
322
+     detector/
323
+       base.py
324
+       pypdf2_engine.py
325
+     data/
326
+       role_rules.yml
327
+       vendor_patterns.yml
328
+     cli.py
329
+ tests/
330
+ pyproject.toml
331
+ .pre-commit-config.yaml
332
+ ~~~
333
+
334
+ ### Formatting & linting (pre-commit)
335
+
336
+ ~~~bash
337
+ # one-time
338
+ pip install pre-commit
339
+ pre-commit install
340
+
341
+ # run on all files
342
+ pre-commit run --all-files
343
+ ~~~
344
+
345
+ Hooks: `black`, `isort`, `ruff`, plus `pytest` (optional).
346
+ Ensure your virtualenv folders are excluded in `.pre-commit-config.yaml` (e.g., `^\.venv`).
347
+
348
+ ### Typical loop
349
+
350
+ ~~~bash
351
+ # run tests
352
+ pytest -q
353
+
354
+ # run only smoke tests while iterating
355
+ pytest -q -k smoke
356
+ ~~~
357
+
358
+ ---
359
+
360
+ ## Troubleshooting
361
+
362
+ **Using the wrong Python**
363
+
364
+ ~~~bash
365
+ which python
366
+ python -V
367
+ ~~~
368
+
369
+ If you see 3.8 or system Python, recreate the venv with 3.11.
370
+
371
+ **ModuleNotFoundError: typer / click / pytest**
372
+
373
+ ~~~bash
374
+ pip install typer click pytest
375
+ ~~~
376
+
377
+ **Pre-commit reformats files in `.venv`**
378
+
379
+ ~~~yaml
380
+ exclude: |
381
+   ^(\.venv|\.venv311|dist|build)/
382
+ ~~~
383
+
384
+ **Vendor markers not detected**
385
+ Set `--recurse-xobjects true` and enable pseudo signatures. Many providers embed markers in Form XObjects or compressed streams.
386
+
387
+ **Parent/Guardian not recognized**
388
+ The rules already include a fallback for “Signature of Parent/Guardian”; if your variant differs, add it to `role_rules.yml → labels.representative`.
389
+
390
+ ---
391
+
392
+ ## License
393
+ MIT
394
+
@@ -0,0 +1,22 @@
1
+ sigdetect/__init__.py,sha256=LhY78mDZ1ClYVNTxW_qtE-vqJoN9N7N5ZcNRDUI_3ss,575
2
+ sigdetect/api.py,sha256=Un4SaZHNAmRLPh1aF9bzOfT6ibilT_y9C0xVmNlqHtI,4248
3
+ sigdetect/cli.py,sha256=jm7aStuv64MCcZZkzv8ncNVGGg8FYIFKjkTPNfXWUgs,3136
4
+ sigdetect/config.py,sha256=d3_AlAEFUHBoXyTbUAHQLTARVqM8q4I8q4xfwakPE0M,4165
5
+ sigdetect/eda.py,sha256=S92G1Gjmepri__D0n_V6foq0lQgH-RXI9anW8A58jfw,4681
6
+ sigdetect/logging_setup.py,sha256=LMF8ao_a-JwH0S522T6aYTFX3e8Ajjv_5ODS2YiBcHA,6404
7
+ sigdetect/utils.py,sha256=T9rubLf5T9JmjOHYMOba1j34fhOJaWocAXccnGTxRUE,5198
8
+ sigdetect/data/role_rules.retainer.yml,sha256=IFdwKnDBXR2cTkdfrsZ6ku6CXD8S_dg5A3vKRKLW5h8,2532
9
+ sigdetect/data/role_rules.yml,sha256=HuLKsZR_A6sD9XvY4NHiY_VG3dS5ERNCBF9-Mxawomw,2751
10
+ sigdetect/data/vendor_patterns.yml,sha256=NRbZNQxcx_GuL6n1jAphBn6MM6ChCpeWGCsjbRx-PEo,384
11
+ sigdetect/detector/__init__.py,sha256=up2FCmD09f2bRHcS4WbY-clx3GQbWuk1PM2JlxgusHg,1608
12
+ sigdetect/detector/base.py,sha256=L-iXWXqsTetDc4jRZo_wOdbNpKqOY20mX9FefrugdT0,263
13
+ sigdetect/detector/base_detector.py,sha256=GmAgUWO_fQgIfnihZSoyhR3wpnwZ-X3hS0Kuyz4G6Ys,608
14
+ sigdetect/detector/file_result_model.py,sha256=j2gTc9Sw3fJOHlexYsR_m5DiwHA8DzIzAMToESfvo4A,1767
15
+ sigdetect/detector/pymupdf_engine.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
16
+ sigdetect/detector/pypdf2_engine.py,sha256=e3JasLxI8K10IkpMcijES2EjA7RluNpKq6027oNROPU,45770
17
+ sigdetect/detector/signature_model.py,sha256=nApd53aDRMZhOLdUlmoEPjHO1hs8leM6NysG10v-jVc,857
18
+ sigdetect-0.1.0.dist-info/METADATA,sha256=7au6ZW0VN_y3JyZQJux6zEUO8BMBEp6qVn0HO86aXlU,10363
19
+ sigdetect-0.1.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
20
+ sigdetect-0.1.0.dist-info/entry_points.txt,sha256=iqtfKjBU44-omM7Sh-idGz2ahw19oAvpvSyKZVArG3o,48
21
+ sigdetect-0.1.0.dist-info/top_level.txt,sha256=PKlfwUobkRC0viwiSXmhtw83G26FSNpimWYC1Uy00FY,10
22
+ sigdetect-0.1.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.9.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ sigdetect = sigdetect.cli:app
@@ -0,0 +1 @@
1
+ sigdetect