@flexorch/audit 0.3.1 → 0.5.0

package/README.md CHANGED
@@ -1,148 +1,275 @@
1
- # @flexorch/audit
2
-
3
- [![npm](https://img.shields.io/npm/v/@flexorch/audit)](https://www.npmjs.com/package/@flexorch/audit)
4
- [![Node](https://img.shields.io/node/v/@flexorch/audit)](https://www.npmjs.com/package/@flexorch/audit)
5
- [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
6
-
7
- Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
8
-
9
- ## Why
10
-
11
- Before feeding documents into an LLM pipeline you need to answer three questions:
12
-
13
- 1. **Does this text contain personal data?** Sending PII to a language model is a compliance risk.
14
- 2. **Is the text quality high enough?** Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
15
- 3. **How bad is the noise?** Garbled encodings and control characters degrade model output silently.
16
-
17
- Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. `@flexorch/audit` answers all three with one call — using only regex and Node.js built-ins. No model weights, no network calls, no external packages.
18
-
19
- ## Features
20
-
21
- - **Quality grade** — A/B/C/D composite score: is this text LLM-ready at a glance?
22
- - **PII detection** — email, phone (TR mobile + E.164), credit card (Luhn), IPv4, IPv6, TCKN, VKN, IBAN (mod-97 validated), SSN, label-prefixed names
23
- - **Batch audit** — `auditBatch()` aggregates duplicate ratio and PII counts across an entire dataset in one call
24
- - **Noise metrics** — garbage character ratio, encoding health check
25
- - **Masking** — four strategies: redact, replace (synthetic), token, hash
26
- - **Zero runtime dependencies** — pure Node.js built-ins, Node 18+
27
- - **TypeScript-first** — full type definitions, no `@types/` package needed
28
-
29
- ## Install
30
-
31
- ```bash
32
- npm install @flexorch/audit
33
- ```
34
-
35
- ## Quick start
36
-
37
- ```ts
38
- import { audit, mask } from "@flexorch/audit"
39
- import { readFileSync } from "fs"
40
-
41
- const text = readFileSync("contract.txt", "utf8")
42
- const result = audit(text, { locale: "tr" })
43
-
44
- result.quality_grade // "A"
45
- result.quality_score // 0.91 (0.0–1.0 composite)
46
- result.pii_summary // [{ type: "national_id_tr", count: 3 }, { type: "email", count: 1 }]
47
-
48
- result.pii // [{ type: "email", value: "...", start: 8, end: 23 }]
49
- result.quality // { completeness: 1.0, avg_length: 342, duplicate_ratio: null }
50
- result.noise // { garbage_ratio: 0.0, encoding_ok: true }
51
-
52
- const clean = mask(text, result.pii, { strategy: "redact" })
53
- // "Contact: [REDACTED_EMAIL]"
54
- ```
55
-
56
- ![demo](assets/demo.svg)
57
-
58
- ## Batch audit
59
-
60
- Use `auditBatch()` to audit an entire dataset and get aggregate metrics including `duplicate_ratio`:
61
-
62
- ```ts
63
- import { auditBatch } from "@flexorch/audit"
64
-
65
- const texts = dataset.map((r) => r.text)
66
- const batch = auditBatch(texts, { locale: "tr" })
67
-
68
- batch.duplicate_ratio // 0.12 fraction of exact-duplicate records
69
- batch.avg_quality_score // 0.78
70
- batch.pii_summary // [{ type: "email", count: 47 }, ...]
71
- batch.results // AuditResult[], one per text
72
- ```
73
-
74
- ## Locale support
75
-
76
- | `locale` | Active detectors |
77
- |----------|-----------------|
78
- | `"tr"` (default) | email, iban, credit_card, ip, ip_v6 + TCKN, VKN, phone_tr, name |
79
- | `"us"` | email, iban, credit_card, ip, ip_v6 + SSN, E.164 phone |
80
- | `"eu"` | email, iban, credit_card, ip, ip_v6 + E.164 phone |
81
- | `"all"` | All of the above (phone_tr takes precedence over generic phone) |
82
-
83
- ## PII types
84
-
85
- | Type | Description | Locale |
86
- |------|-------------|--------|
87
- | `email` | RFC-5321 address | all |
88
- | `iban` | ISO 13616 IBAN — mod-97 checksum validated | all |
89
- | `credit_card` | 16-digit groups, Luhn-validated | all |
90
- | `ip` | IPv4 address | all |
91
- | `ip_v6` | IPv6 address (full, compressed, loopback) | all |
92
- | `phone_tr` | Turkish mobile (+90/0 prefix + 10 digits) | tr |
93
- | `national_id_tr` | TCKN — 11-digit modular arithmetic checksum | tr |
94
- | `tax_id_tr` | VKN — 10-digit Luhn-variant checksum | tr |
95
- | `name` | Label-prefixed name (e.g. "Adı: Ali Yıldız", "Full Name: Jane Doe") | tr |
96
- | `phone` | E.164 international phone | us, eu |
97
- | `ssn` | US Social Security Number (###-##-####) | us |
98
-
99
- ## Masking strategies
100
-
101
- | Strategy | Example output |
102
- |----------|----------------|
103
- | `redact` (default) | `[REDACTED_EMAIL]` |
104
- | `replace` | `user@example.com` (static synthetic) |
105
- | `token` | `<PII_EMAIL_1>` (unique per type per call) |
106
- | `hash` | `[3d4f9a1b2c8e7f0a]` (SHA-256 first 16 hex chars) |
107
-
108
- ## TypeScript
109
-
110
- ```ts
111
- import {
112
- audit, auditBatch, mask,
113
- type AuditResult, type BatchAuditResult, type PiiFinding,
114
- } from "@flexorch/audit"
115
- ```
116
-
117
- ## Quality grade
118
-
119
- `quality_grade` (A–D) and `quality_score` (0.0–1.0) are composite signals:
120
-
121
- | Grade | Score | Signal |
122
- |-------|-------|--------|
123
- | A | ≥ 0.85 | Ready for LLM training or RAG |
124
- | B | ≥ 0.65 | Usable with minor cleanup |
125
- | C | ≥ 0.40 | Review before use |
126
- | D | < 0.40 | Not suitable — empty, too short, or high noise |
127
-
128
- Score formula: `completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2)`
129
- `lengthScore = Math.min(charCount / 500, 1.0)` · `noiseScore = Math.max(0, 1 − garbageRatio × 10)`
130
-
131
- ## Limitations (v0.4)
132
-
133
- - Free-standing name detection (without a label prefix) requires NLP/NER — not included.
134
- - `replace` masking strategy uses static synthetic values; locale-aware realistic synthesis is not yet implemented.
135
-
136
- ## Also available for Python
137
-
138
- ```bash
139
- pip install flexorch-audit
140
- ```
141
-
142
- ## Contributing
143
-
144
- See [CONTRIBUTING.md](CONTRIBUTING.md).
145
-
146
- ## License
147
-
148
- MIT
1
+ # @flexorch/audit
2
+
3
+ [![npm](https://img.shields.io/npm/v/@flexorch/audit)](https://www.npmjs.com/package/@flexorch/audit)
4
+ [![Node](https://img.shields.io/node/v/@flexorch/audit)](https://www.npmjs.com/package/@flexorch/audit)
5
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
6
+
7
+ Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
8
+
9
+ ## Why
10
+
11
+ Before feeding documents into an LLM pipeline you need to answer three questions:
12
+
13
+ 1. **Does this text contain personal data?** Sending PII to a language model is a compliance risk.
14
+ 2. **Is the text quality high enough?** Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
15
+ 3. **How bad is the noise?** Garbled encodings and symbol clutter degrade model output silently.
16
+
17
+ Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. `@flexorch/audit` answers all three with one call — using only regex and Node.js built-ins. No model weights, no network calls, no external packages.
18
+
19
+ ## Features
20
+
21
+ - **Quality grade** — A/B/C/D composite score: is this text LLM-ready at a glance?
22
+ - **Noise ratio** — line-level symbol clutter detection (`noise_ratio`); values above 0.20 indicate likely extraction artifacts
23
+ - **PII detection** — 30+ types across 8 countries (TR/DE/FR/IT/NL/ES/UK/US) + universal types; all regex-based with checksum validation
24
+ - **Batch audit** — `auditBatch()` aggregates duplicate ratio and PII counts across an entire dataset in one call
25
+ - **Masking** — four strategies: redact, replace (synthetic), token, hash
26
+ - **Zero runtime dependencies** — pure Node.js built-ins, Node 18+
27
+ - **TypeScript-first** — full type definitions, no `@types/` package needed
28
+
29
+ ## Install
30
+
31
+ ```bash
32
+ npm install @flexorch/audit
33
+ ```
34
+
35
+ ## Quick start
36
+
37
+ ```ts
38
+ import { audit, mask } from "@flexorch/audit"
39
+ import { readFileSync } from "fs"
40
+
41
+ const text = readFileSync("contract.txt", "utf8") // extract from PDF/DOCX first
42
+
43
+ const result = audit(text) // "und" by default — all detectors active
44
+ // const result = audit(text, { locale: "tr" }) // restrict to TR-only detectors
45
+
46
+ result.quality_grade // "B"
47
+ result.quality_score // 0.73 (0.0–1.0 composite)
48
+ result.noise_ratio // 0.04 (fraction of blank/garbage lines; >0.20 = low quality)
49
+ result.detected_language // "und" (locale you passed in; caller controls language)
50
+ result.pii_summary // [{ type: "email", count: 2 }, { type: "national_id_tr", count: 1 }]
51
+
52
+ result.pii // [{ type: "email", value: "ali@example.com", start: 8, end: 23 }]
53
+ result.quality // { completeness: 1.0, avg_length: 342, duplicate_ratio: null }
54
+ result.noise // { garbage_ratio: 0.0, encoding_ok: true }
55
+
56
+ const clean = mask(text, result.pii, { strategy: "redact" })
57
+ // "Contact: [REDACTED_EMAIL]"
58
+ ```
59
+
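The quick start shows both a flat `pii` array and an aggregated `pii_summary`. As a rough sketch of how one could be derived from the other, the snippet below counts findings per type. The `PiiHit` interface and the `summarize` helper are hypothetical names used for illustration, and ordering by descending count is an assumption based on the sample output, not something the README states.

```ts
// Hypothetical shape, modelled on the `pii` entries in the quick start.
interface PiiHit {
  type: string
  value: string
  start: number
  end: number
}

// Count findings per type; sorting by count (descending) is an assumption.
function summarize(findings: PiiHit[]): { type: string; count: number }[] {
  const counts = new Map<string, number>()
  for (const f of findings) {
    counts.set(f.type, (counts.get(f.type) ?? 0) + 1)
  }
  return Array.from(counts, ([type, count]) => ({ type, count })).sort(
    (a, b) => b.count - a.count,
  )
}
```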
60
+ ![demo](assets/demo.svg)
61
+
62
+ ## Batch audit
63
+
64
+ ```ts
65
+ import { auditBatch } from "@flexorch/audit"
66
+
67
+ const texts = dataset.map((r) => r.text)
68
+ const batch = auditBatch(texts) // locale: "und" by default
69
+
70
+ batch.duplicate_ratio // 0.12 fraction of exact-duplicate records
71
+ batch.avg_quality_score // 0.78
72
+ batch.pii_summary // [{ type: "email", count: 47 }, ...]
73
+ batch.results // AuditResult[], one per text
74
+ ```
75
+
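The README describes `duplicate_ratio` as the fraction of exact-duplicate records. One plausible reading, sketched here, counts every record that repeats an earlier one. The `duplicateRatio` helper is hypothetical, and both the exact definition and the `null` result for an empty batch are assumptions rather than documented behavior.

```ts
// Hypothetical helper: share of records that exactly repeat an earlier record.
function duplicateRatio(texts: string[]): number | null {
  if (texts.length === 0) return null // assumption: no ratio for an empty batch
  const unique = new Set(texts).size
  return (texts.length - unique) / texts.length
}
```

Under this reading, a batch of four records where one text appears three times has a ratio of 0.5, since two of the four records are repeats.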
76
+ ## Country coverage
77
+
78
+ | `locale` | Detectors activated |
79
+ |----------|---------------------|
80
+ | `"und"` **(default)** | All locales combined; use when document language is unknown |
81
+ | `"all"` | Alias for `"und"` |
82
+ | `"tr"` | TCKN · VKN · phone_tr · name · IBAN_TR · company_name_tr · MERSIS · postal_code_tr · province_tr |
83
+ | `"de"` | Steueridentifikationsnummer · Sozialversicherungsnummer |
84
+ | `"fr"` | SIREN · SIRET · INSEE/NIR |
85
+ | `"it"` | Codice Fiscale · Partita IVA |
86
+ | `"nl"` | BSN · KvK |
87
+ | `"es"` | DNI/NIE · CIF |
88
+ | `"uk"` | NI number · UTR |
89
+ | `"us"` | SSN · EIN · ITIN |
90
+ | `"eu"` | E.164 phone · IBAN (EU+GB+CH+NO) · company name |
91
+
92
+ Universal detectors (always active regardless of locale): `email` · `iban` · `credit_card` · `ip` · `ip_v6`
93
+
94
+ > **Language detection:** `@flexorch/audit` is zero-dependency: no language detection library is included.
95
+ > Pass the correct `locale` yourself, or use `"und"` (default) to activate all detectors.
96
+
97
+ ## PII types
98
+
99
+ ### Universal
100
+
101
+ | Type | Description |
102
+ |------|-------------|
103
+ | `email` | RFC-5321 email address |
104
+ | `iban` | ISO 13616 IBAN — mod-97 validated; suppressed when `iban_tr` or `iban_intl` fires on same span |
105
+ | `credit_card` | 16-digit groups, Luhn-validated |
106
+ | `ip` | IPv4 address |
107
+ | `ip_v6` | IPv6 — full, compressed `::`, loopback forms |
108
+
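The universal table mentions two standard checksums: Luhn for card numbers and mod-97 for IBANs. The sketch below shows the textbook form of both checks. `luhnValid` and `ibanValid` are hypothetical helper names, and the package's own implementation may differ (for example, it may also apply a per-country length table).

```ts
// Textbook Luhn check: double every second digit from the right, sum, test mod 10.
function luhnValid(input: string): boolean {
  const digits = input.replace(/[\s-]/g, "")
  if (!/^\d+$/.test(digits)) return false
  let sum = 0
  let doubleIt = false
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48
    if (doubleIt) {
      d *= 2
      if (d > 9) d -= 9
    }
    sum += d
    doubleIt = !doubleIt
  }
  return sum % 10 === 0
}

// ISO 13616 mod-97 check: move the first four characters to the end,
// map letters to 10..35, and require the resulting number mod 97 to equal 1.
function ibanValid(input: string): boolean {
  const s = input.replace(/\s/g, "").toUpperCase()
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(s)) return false
  const rearranged = s.slice(4) + s.slice(0, 4)
  let remainder = 0
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i)
    const value = code >= 65 ? code - 55 : code - 48 // A = 10 ... Z = 35
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97
  }
  return remainder === 1
}
```

Processing the IBAN one character at a time keeps the running remainder small, so no big-integer arithmetic is needed.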
109
+ ### Turkey (`locale="tr"`)
110
+
111
+ | Type | Description |
112
+ |------|-------------|
113
+ | `national_id_tr` | TCKN — 11-digit, modular arithmetic checksum |
114
+ | `tax_id_tr` | VKN — 10-digit, Luhn-variant checksum |
115
+ | `phone_tr` | Turkish mobile: `+90`/`0` prefix + 10 digits |
116
+ | `name` | Label-prefixed name: `Adı:`, `Full Name:`, `Customer Name:`, etc. |
117
+ | `iban_tr` | Turkish IBAN (`TR` + 24 chars), mod-97 validated |
118
+ | `company_name_tr` | Company with TR legal suffix: A.Ş. · Ltd.Şti. · Koll.Şti. · Koop. · T.A.Ş. |
119
+ | `mersis_no` | MERSIS 16-digit company registry number |
120
+ | `postal_code_tr` | Turkish postal code (province plate 01–81) |
121
+ | `province_tr` | All 81 Turkish provinces |
122
+
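For readers unfamiliar with the TCKN checksum named in the table, here is a minimal sketch of the commonly published rule: the tenth digit is derived from the odd- and even-position sums, and the eleventh digit is the sum of the first ten modulo 10. `tcknValid` is a hypothetical helper, not the package's API.

```ts
// Commonly published TCKN rule (11 digits, first digit non-zero).
function tcknValid(value: string): boolean {
  if (!/^[1-9]\d{10}$/.test(value)) return false
  const d = value.split("").map(Number)
  const odd = d[0] + d[2] + d[4] + d[6] + d[8] // positions 1, 3, 5, 7, 9
  const even = d[1] + d[3] + d[5] + d[7] // positions 2, 4, 6, 8
  // Normalise the remainder so a negative intermediate still maps to 0..9.
  const tenth = (((odd * 7 - even) % 10) + 10) % 10
  const eleventh = (odd + even + d[9]) % 10
  return d[9] === tenth && d[10] === eleventh
}
```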
123
+ ### Germany (`locale="de"`)
124
+
125
+ | Type | Description |
126
+ |------|-------------|
127
+ | `tax_id_de` | Steueridentifikationsnummer — 11 digits, ISO 7064 MOD 11,10 checksum |
128
+ | `social_id_de` | Sozialversicherungsnummer — area + DOB + letter + serial |
129
+
130
+ ### France (`locale="fr"`)
131
+
132
+ | Type | Description |
133
+ |------|-------------|
134
+ | `siret_fr` | SIRET — 14 digits, label-prefix gated |
135
+ | `company_id_fr` | SIREN — 9 digits, label-prefix gated |
136
+ | `social_id_fr` | INSEE/NIR — 15 digits, starts with `1` or `2` |
137
+
138
+ ### Italy (`locale="it"`)
139
+
140
+ | Type | Description |
141
+ |------|-------------|
142
+ | `national_id_it` | Codice Fiscale — 16 chars alphanumeric, uppercase normalized |
143
+ | `tax_id_it` | Partita IVA — 11 digits, Agenzia delle Entrate checksum |
144
+
145
+ ### Netherlands (`locale="nl"`)
146
+
147
+ | Type | Description |
148
+ |------|-------------|
149
+ | `national_id_nl` | BSN — 9 digits, 11-check (weighted sum mod 11) |
150
+ | `company_id_nl` | KvK — 8 digits, label-prefix gated |
151
+
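The BSN "11-check" in the table is a weighted sum modulo 11. A minimal sketch of the usual formulation follows, with weights 9 down to 2 for the first eight digits and -1 for the last. `bsnValid` is a hypothetical helper, and rejecting the all-zero value is an extra assumption.

```ts
// BSN 11-check: weights 9,8,7,6,5,4,3,2,-1; the weighted sum must be divisible by 11.
function bsnValid(value: string): boolean {
  if (!/^\d{9}$/.test(value)) return false
  if (value === "000000000") return false // assumption: all zeros is not a real BSN
  const weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
  let sum = 0
  for (let i = 0; i < 9; i++) {
    sum += Number(value[i]) * weights[i]
  }
  return sum % 11 === 0
}
```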
152
+ ### Spain (`locale="es"`)
153
+
154
+ | Type | Description |
155
+ |------|-------------|
156
+ | `national_id_es` | DNI (8 digits + letter, mod-23) and NIE (X/Y/Z prefix, same check) |
157
+ | `tax_id_es` | CIF — letter prefix + 7 digits + control character |
158
+
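The mod-23 check shared by DNI and NIE can be sketched in a few lines: the numeric part modulo 23 indexes into a fixed letter table, and for NIE the leading X, Y, or Z is first replaced by 0, 1, or 2. `dniNieValid` is a hypothetical helper shown for illustration only.

```ts
// Standard control-letter table for Spanish DNI/NIE.
const DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

function dniNieValid(value: string): boolean {
  const match = /^([XYZ]\d{7}|\d{8})([A-Z])$/.exec(value.toUpperCase())
  if (!match) return false
  // NIE prefix letters stand in for a leading digit.
  const digits = match[1].replace("X", "0").replace("Y", "1").replace("Z", "2")
  return DNI_LETTERS[Number(digits) % 23] === match[2]
}
```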
159
+ ### United Kingdom (`locale="uk"`)
160
+
161
+ | Type | Description |
162
+ |------|-------------|
163
+ | `social_id_uk` | NI number — 2 letters + 6 digits + A/B/C/D; HMRC forbidden prefixes excluded |
164
+ | `tax_id_uk` | UTR — 10 digits, label-prefix gated |
165
+
166
+ ### United States (`locale="us"`)
167
+
168
+ | Type | Description |
169
+ |------|-------------|
170
+ | `ssn` | SSN — `###-##-####`, invalid prefixes (000/666/9xx) excluded |
171
+ | `tax_id_us` | EIN — `XX-XXXXXXX`, IRS invalid area prefixes excluded |
172
+ | `national_id_us` | ITIN — `9XX-7X/8X/9X-XXXX`, middle group validated |
173
+
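To illustrate what "invalid prefixes excluded" can look like in a regex-only detector, the sketch below matches the `###-##-####` shape and uses a negative lookahead to reject the 000, 666, and 9xx area prefixes. The pattern and the `findSsns` helper are assumptions for illustration; the package's actual expression is not shown in the README.

```ts
// Hypothetical pattern: SSN shape with the excluded area prefixes filtered out.
const SSN_PATTERN = /\b(?!000|666|9\d{2})\d{3}-\d{2}-\d{4}\b/g

function findSsns(text: string): string[] {
  return text.match(SSN_PATTERN) ?? []
}
```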
174
+ ### EU / International (`locale="eu"`)
175
+
176
+ | Type | Description |
177
+ |------|-------------|
178
+ | `phone_intl` | E.164 international phone — 7–15 digits, TR (+90) excluded |
179
+ | `iban_intl` | IBAN for EU+GB+CH+NO — ISO 13616 country+length table + mod-97 |
180
+ | `company_name_intl` | Company with international suffix: GmbH · LLC · S.r.l. · B.V. · SAS · Inc. · Ltd. etc. |
181
+
182
+ ## Noise detection
183
+
184
+ `noise_ratio` measures the fraction of lines that are blank or contain symbol clutter:
185
+
186
+ ```ts
187
+ const result = audit("clean line\n@@@garbage\n\nclean")
188
+ result.noise_ratio // 0.5 (2 noisy lines out of 4)
189
+ ```
190
+
191
+ A line is "noisy" when it is blank (after trim) or contains 3+ consecutive characters from `@ # ! ~ * =`.
192
+
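The line-level rule above can be sketched as follows. This is a hypothetical re-implementation, not the package's code: `noiseRatio` is an illustrative name, and it assumes that any mix of the listed symbols counts toward the run of three, which the README leaves open.

```ts
// Three or more consecutive characters drawn from @ # ! ~ * = (any mix, by assumption).
const CLUTTER = /[@#!~*=]{3,}/

function noiseRatio(text: string): number {
  const lines = text.split("\n")
  const noisy = lines.filter(
    (line) => line.trim() === "" || CLUTTER.test(line),
  ).length
  return noisy / lines.length
}
```

Applied to the four-line example in the README, this yields 0.5: the `@@@garbage` line and the blank line are noisy, the other two are not.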
193
+ | `noise_ratio` | Signal |
194
+ |---------------|--------|
195
+ | `< 0.05` | Clean — likely well-extracted text |
196
+ | `0.05–0.20` | Acceptable — minor formatting artifacts |
197
+ | `> 0.20` | Low quality — likely OCR noise or extraction failure |
198
+
199
+ ## Masking strategies
200
+
201
+ ```ts
202
+ const redacted = mask(text, result.pii) // redact (default)
203
+ const tokenized = mask(text, result.pii, { strategy: "token" })
204
+ const hashed = mask(text, result.pii, { strategy: "hash" })
205
+ const replaced = mask(text, result.pii, { strategy: "replace" })
206
+ ```
207
+
208
+ | Strategy | Example output |
209
+ |----------|----------------|
210
+ | `redact` (default) | `[REDACTED_EMAIL]` |
211
+ | `replace` | `user@example.com` (static synthetic) |
212
+ | `token` | `<PII_EMAIL_1>` (unique per type per call) |
213
+ | `hash` | `[3d4f9a1b2c8e7f0a]` (SHA-256 first 16 hex chars) |
214
+
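As a rough illustration of how span-based masking with these output formats might work, the sketch below rebuilds the text around each finding's `start`/`end` range. `maskSketch`, `Finding`, and `Strategy` are hypothetical names. Hashing the matched value itself and numbering tokens per finding in document order are assumptions; the `replace` strategy is omitted because its synthetic values are not listed in full.

```ts
import { createHash } from "node:crypto"

interface Finding {
  type: string
  value: string
  start: number
  end: number
}

type Strategy = "redact" | "token" | "hash"

function maskSketch(text: string, findings: Finding[], strategy: Strategy = "redact"): string {
  const counters = new Map<string, number>()
  const ordered = [...findings].sort((a, b) => a.start - b.start)
  let out = ""
  let cursor = 0
  for (const f of ordered) {
    if (f.start < cursor) continue // skip spans overlapping an earlier finding
    const label = f.type.toUpperCase()
    let replacement: string
    if (strategy === "token") {
      const n = (counters.get(f.type) ?? 0) + 1
      counters.set(f.type, n)
      replacement = `<PII_${label}_${n}>`
    } else if (strategy === "hash") {
      // First 16 hex characters of SHA-256 over the matched value (assumption).
      const digest = createHash("sha256").update(f.value).digest("hex")
      replacement = `[${digest.slice(0, 16)}]`
    } else {
      replacement = `[REDACTED_${label}]`
    }
    out += text.slice(cursor, f.start) + replacement
    cursor = f.end
  }
  return out + text.slice(cursor)
}
```

Building the output left to right avoids the index drift that in-place replacement would cause when a replacement is longer or shorter than the original span.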
215
+ ## TypeScript
216
+
217
+ Full type definitions — no `@types/` package needed:
218
+
219
+ ```ts
220
+ import {
221
+ audit, auditBatch, mask,
222
+ type AuditResult, type BatchAuditResult,
223
+ type PiiFinding, type AuditOptions,
224
+ } from "@flexorch/audit"
225
+ ```
226
+
227
+ `AuditResult` includes:
228
+
229
+ ```ts
230
+ interface AuditResult {
231
+ quality_grade: "A" | "B" | "C" | "D"
232
+ quality_score: number
233
+ noise_ratio: number
234
+ detected_language: string
235
+ pii_summary: { type: string; count: number }[]
236
+ pii: { type: string; value: string; start: number; end: number }[]
237
+ quality: { completeness: number; avg_length: number; duplicate_ratio: number | null }
238
+ noise: { garbage_ratio: number; encoding_ok: boolean }
239
+ }
240
+ ```
241
+
242
+ ## Quality grade
243
+
244
+ `quality_grade` (A–D) and `quality_score` (0.0–1.0) are composite signals:
245
+
246
+ | Grade | Score | Signal |
247
+ |-------|-------|--------|
248
+ | A | ≥ 0.85 | Ready for LLM training or RAG |
249
+ | B | ≥ 0.65 | Usable with minor cleanup |
250
+ | C | ≥ 0.40 | Review before use |
251
+ | D | < 0.40 | Not suitable — empty, too short, or high noise |
252
+
253
+ Score formula: `completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2)`
254
+ `lengthScore = Math.min(charCount / 500, 1.0)` · `noiseScore = Math.max(0, 1 − garbageRatio × 10)`
255
+
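The formula and grade thresholds above translate directly into code. The sketch below is a transcription of the README's formula rather than the package's internals; `qualityScore` and `gradeFor` are hypothetical names.

```ts
type Grade = "A" | "B" | "C" | "D"

// Transcribes: completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2)
function qualityScore(completeness: number, charCount: number, garbageRatio: number): number {
  const lengthScore = Math.min(charCount / 500, 1.0)
  const noiseScore = Math.max(0, 1 - garbageRatio * 10)
  return completeness * (0.4 * noiseScore + 0.4 * lengthScore + 0.2)
}

// Thresholds from the grade table.
function gradeFor(score: number): Grade {
  if (score >= 0.85) return "A"
  if (score >= 0.65) return "B"
  if (score >= 0.4) return "C"
  return "D"
}
```

Two consequences follow from the formula: a record needs at least 500 characters to reach the full length contribution, and a garbage ratio of 0.10 or more drives the noise contribution to zero.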
256
+ ## Limitations
257
+
258
+ - **No automatic language detection** — `@flexorch/audit` has zero dependencies. Pass `locale` explicitly, or use the default `"und"` to activate all detectors.
259
+ - **Free-standing name detection** (without a label prefix) requires NLP/NER — not included.
260
+ - `replace` masking uses static synthetic values; locale-aware realistic synthesis is not implemented.
261
+ - The library audits plain text. PDF/DOCX parsing, e-invoice extraction, and pipeline orchestration are out of scope.
262
+
263
+ ## Also available for Python
264
+
265
+ ```bash
266
+ pip install flexorch-audit
267
+ ```
268
+
269
+ ## Contributing
270
+
271
+ See [CONTRIBUTING.md](CONTRIBUTING.md).
272
+
273
+ ## License
274
+
275
+ MIT