npm - @flexorch/audit - Versions diffs - 0.3.0 → 0.3.1 - Mend

@flexorch/audit 0.3.0 → 0.3.1

This diff shows the changes between two published versions of the package as they appear in the public registry. It is provided for informational purposes only.

Files changed (6)

package/README.md CHANGED

@@ -1,13 +1,38 @@
 # @flexorch/audit
-Zero-dependency PII + quality + noise audit for LLM datasets. Answers one question: **is this dataset ready for LLM training?**
+[![npm](https://img.shields.io/npm/v/@flexorch/audit)](https://www.npmjs.com/package/@flexorch/audit)
+[![Node](https://img.shields.io/node/v/@flexorch/audit)](https://www.npmjs.com/package/@flexorch/audit)
+[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
-- **Quality grade** — A/B/C/D score that signals LLM-readiness at a glance
-- **PII detection** — email, phone (TR + E.164), credit card (Luhn), IP, TCKN, IBAN, SSN, label-prefixed names
-- **Quality metrics** — completeness, average length, duplicate ratio
-- **Noise metrics** — garbage character ratio, encoding health
-- **Masking** — redact / replace / token / hash strategies
+Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
+## Why
+Before feeding documents into an LLM pipeline you need to answer three questions:
+1. **Does this text contain personal data?** Sending PII to a language model is a compliance risk.
+2. **Is the text quality high enough?** Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
+3. **How bad is the noise?** Garbled encodings and control characters degrade model output silently.
+Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. `@flexorch/audit` answers all three with one call — using only regex and Node.js built-ins. No model weights, no network calls, no external packages.
+## Features
+- **Quality grade** — A/B/C/D composite score: is this text LLM-ready at a glance?
+- **PII detection** — email, phone (TR mobile + E.164), credit card (Luhn), IPv4, IPv6, TCKN, VKN, IBAN (mod-97 validated), SSN, label-prefixed names
+- **Batch audit** — `auditBatch()` aggregates duplicate ratio and PII counts across an entire dataset in one call
+- **Noise metrics** — garbage character ratio, encoding health check
+- **Masking** — four strategies: redact, replace (synthetic), token, hash
 - **Zero runtime dependencies** — pure Node.js built-ins, Node 18+
+- **TypeScript-first** — full type definitions, no `@types/` package needed
+## Install
+```bash
+npm install @flexorch/audit
+```
+## Quick start
 ```ts
 import { audit, mask } from "@flexorch/audit"
@@ -20,7 +45,6 @@ result.quality_grade   // "A"
 result.quality_score   // 0.91  (0.0–1.0 composite)
 result.pii_summary     // [{ type: "national_id_tr", count: 3 }, { type: "email", count: 1 }]
-// Raw findings and metrics — also available:
 result.pii             // [{ type: "email", value: "...", start: 8, end: 23 }]
 result.quality         // { completeness: 1.0, avg_length: 342, duplicate_ratio: null }
 result.noise           // { garbage_ratio: 0.0, encoding_ok: true }
@@ -29,21 +53,31 @@ const clean = mask(text, result.pii, { strategy: "redact" })
 // "Contact: [REDACTED_EMAIL]"
 ```
-## Install
+![demo](assets/demo.svg)
-```bash
-npm install @flexorch/audit
-```
+## Batch audit
-![demo](assets/demo.svg)
+Use `auditBatch()` to audit an entire dataset and get aggregate metrics including `duplicate_ratio`:
+```ts
+import { auditBatch } from "@flexorch/audit"
+const texts = dataset.map((r) => r.text)
+const batch = auditBatch(texts, { locale: "tr" })
+batch.duplicate_ratio    // 0.12 — fraction of exact-duplicate records
+batch.avg_quality_score  // 0.78
+batch.pii_summary        // [{ type: "email", count: 47 }, ...]
+batch.results            // AuditResult[], one per text
+```
 ## Locale support
 | `locale` | Active detectors |
 |----------|-----------------|
-| `"tr"` (default) | email, iban, credit_card, ip + TCKN, phone_tr, name |
-| `"us"` | email, iban, credit_card, ip + SSN, E.164 phone |
-| `"eu"` | email, iban, credit_card, ip + E.164 phone |
+| `"tr"` (default) | email, iban, credit_card, ip, ip_v6 + TCKN, VKN, phone_tr, name |
+| `"us"` | email, iban, credit_card, ip, ip_v6 + SSN, E.164 phone |
+| `"eu"` | email, iban, credit_card, ip, ip_v6 + E.164 phone |
 | `"all"` | All of the above (phone_tr takes precedence over generic phone) |
 ## PII types
@@ -51,11 +85,13 @@ npm install @flexorch/audit
 | Type | Description | Locale |
 |------|-------------|--------|
 | `email` | RFC-5321 address | all |
-| `iban` | ISO 13616 IBAN (any country) | all |
+| `iban` | ISO 13616 IBAN — mod-97 checksum validated | all |
 | `credit_card` | 16-digit groups, Luhn-validated | all |
 | `ip` | IPv4 address | all |
+| `ip_v6` | IPv6 address (full, compressed, loopback) | all |
 | `phone_tr` | Turkish mobile (+90/0 prefix + 10 digits) | tr |
 | `national_id_tr` | TCKN — 11-digit modular arithmetic checksum | tr |
+| `tax_id_tr` | VKN — 10-digit Luhn-variant checksum | tr |
 | `name` | Label-prefixed name (e.g. "Adı: Ali Yıldız", "Full Name: Jane Doe") | tr |
 | `phone` | E.164 international phone | us, eu |
 | `ssn` | US Social Security Number (###-##-####) | us |
@@ -65,53 +101,37 @@ npm install @flexorch/audit
 | Strategy | Example output |
 |----------|----------------|
 | `redact` (default) | `[REDACTED_EMAIL]` |
-| `replace` | `user@example.com` (realistic synthetic) |
-| `token` | `<PII_EMAIL_1>` (unique per type) |
+| `replace` | `user@example.com` (static synthetic) |
+| `token` | `<PII_EMAIL_1>` (unique per type per call) |
 | `hash` | `[3d4f9a1b2c8e7f0a]` (SHA-256 first 16 hex chars) |
 ## TypeScript
-Full type definitions included. No `@types/` package needed.
 ```ts
-import { audit, mask, type AuditResult, type PiiFinding } from "@flexorch/audit"
+import {
+  audit, auditBatch, mask,
+  type AuditResult, type BatchAuditResult, type PiiFinding,
+} from "@flexorch/audit"
 ```
 ## Quality grade
-The `quality_grade` (A–D) and `quality_score` (0.0–1.0) are composite signals derived from three dimensions:
+`quality_grade` (A–D) and `quality_score` (0.0–1.0) are composite signals:
-| Grade | Score | Meaning |
-|-------|-------|---------|
+| Grade | Score | Signal |
+|-------|-------|--------|
 | A | ≥ 0.85 | Ready for LLM training or RAG |
 | B | ≥ 0.65 | Usable with minor cleanup |
-| C | ≥ 0.40 | Needs review before use |
+| C | ≥ 0.40 | Review before use |
 | D | < 0.40 | Not suitable — empty, too short, or high noise |
-Score formula: `completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2)`
-where `lengthScore = Math.min(charCount / 500, 1.0)` and `noiseScore = Math.max(0, 1 − garbageRatio × 10)`.
+Score formula: `completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2)`
+`lengthScore = Math.min(charCount / 500, 1.0)` · `noiseScore = Math.max(0, 1 − garbageRatio × 10)`
-## Quality & noise
-`duplicate_ratio` is `null` for single-string input. Compute it across your dataset:
-```ts
-const texts = dataset.map((r) => r.text)
-const seen = new Set<string>()
-let duplicates = 0
-for (const t of texts) {
-  if (seen.has(t)) duplicates++
-  else seen.add(t)
-}
-const duplicateRatio = duplicates / texts.length
-```
-## Limitations (v0.2)
+## Limitations (v0.4)
 - Free-standing name detection (without a label prefix) requires NLP/NER — not included.
-- `duplicate_ratio` is per-call; aggregate across your dataset manually (see above).
-- IPv6 not detected.
-- IBAN format-only check; mod-97 validation not performed.
+- `replace` masking strategy uses static synthetic values; locale-aware realistic synthesis is not yet implemented.
 ## Also available for Python
@@ -119,6 +139,10 @@ const duplicateRatio = duplicates / texts.length
 pip install flexorch-audit
 ```
+## Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md).
 ## License
 MIT
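
Note on the README change above: the `iban` row now reads "mod-97 checksum validated", and the matching limitation was removed. The package's actual validator is not part of this diff, so the following is only a minimal sketch of the standard ISO 13616 mod-97 check. The helper name `isValidIban` and the length limit in the regex are assumptions for illustration, not the package's API.

```ts
// Hypothetical helper, not taken from @flexorch/audit.
// Standard mod-97 check: move the first four characters to the end,
// map letters to numbers (A=10 ... Z=35), and require remainder 1.
function isValidIban(input: string): boolean {
  const iban = input.replace(/\s+/g, "").toUpperCase();

  // Country code (2 letters), check digits (2 digits), then the account part.
  // The 30-character cap on the account part is an assumption of this sketch.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{1,30}$/.test(iban)) return false;

  const rearranged = iban.slice(4) + iban.slice(0, 4);

  // Fold the remainder character by character so no big integers are needed.
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    if (code >= 65) {
      // Letter: contributes a two-digit value (A=10 ... Z=35).
      remainder = (remainder * 100 + (code - 55)) % 97;
    } else {
      // Digit: contributes a one-digit value.
      remainder = (remainder * 10 + (code - 48)) % 97;
    }
  }
  return remainder === 1;
}

isValidIban("GB82 WEST 1234 5698 7654 32"); // true
isValidIban("GB82 WEST 1234 5698 7654 33"); // false (last digit altered)
```

A checksum like this lets a detector reject digit strings that merely look like an IBAN. That is presumably the motivation for the change, though the diff itself does not say so.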

package/dist/index.cjs CHANGED

@@ -274,7 +274,7 @@ function applyMask(text, findings, strategy = "redact") {
 }
 // src/index.ts
-var version = "0.3.0";
+var version = "0.3.1";
 function computeQualityScore(completeness, avgLength, garbageRatio) {
   const lengthScore = Math.min(avgLength / 500, 1);
   const noiseScore = Math.max(0, 1 - garbageRatio * 10);
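
Note on the hunk above: it shows only the first two lines of `computeQualityScore`. Combining them with the score formula and grade thresholds quoted in the README gives the sketch below. The `return` expression is taken from the README formula, not from the bundled source. `gradeFor` is a hypothetical name, and any rounding the package applies is not visible in this diff.

```ts
type QualityGrade = "A" | "B" | "C" | "D";

// The first two lines mirror the diff context. The return expression follows
// the README formula: completeness × (0.4 × noiseScore + 0.4 × lengthScore + 0.2).
function computeQualityScore(
  completeness: number,
  avgLength: number,
  garbageRatio: number,
): number {
  const lengthScore = Math.min(avgLength / 500, 1);
  const noiseScore = Math.max(0, 1 - garbageRatio * 10);
  return completeness * (0.4 * noiseScore + 0.4 * lengthScore + 0.2);
}

// Hypothetical helper: maps a score to the README's grade table.
function gradeFor(score: number): QualityGrade {
  if (score >= 0.85) return "A";
  if (score >= 0.65) return "B";
  if (score >= 0.4) return "C";
  return "D";
}

gradeFor(computeQualityScore(1, 500, 0));    // "A": full length, no noise
gradeFor(computeQualityScore(1, 250, 0.05)); // "C": score is about 0.6
gradeFor(computeQualityScore(0, 500, 0));    // "D": empty input scores 0
```

Under this formula a garbage ratio of 0.1 or more drives `noiseScore` to zero, and texts of 500 characters or more get the full length credit.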

package/dist/index.d.cts CHANGED

@@ -45,7 +45,7 @@ declare function applyMask(text: string, findings: PiiFinding[], strategy?: Mask
  * // "Contact: [REDACTED_EMAIL]"
  */
-declare const version = "0.3.0";
+declare const version = "0.3.1";
 type QualityGrade = "A" | "B" | "C" | "D";
 interface PiiSummaryEntry {
     type: string;
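
Note on the `PiiSummaryEntry` type above: the README's quick start shows `pii_summary` as an array of `{ type, count }` entries. The diff does not show how the package builds it, so the sketch below is one plausible aggregation. `summarizePii` is a hypothetical helper. The `count: number` field and the descending sort by count are assumptions inferred from the README example, and `PiiFinding` is modeled on the README's `result.pii` example.

```ts
// Shapes modeled on the README examples; field lists are assumptions.
interface PiiFinding {
  type: string;
  value: string;
  start: number;
  end: number;
}

interface PiiSummaryEntry {
  type: string;
  count: number;
}

// Hypothetical helper: counts findings per type, most frequent first.
function summarizePii(findings: PiiFinding[]): PiiSummaryEntry[] {
  const counts: Record<string, number> = {};
  for (const finding of findings) {
    counts[finding.type] = (counts[finding.type] || 0) + 1;
  }
  return Object.keys(counts)
    .map((type) => ({ type, count: counts[type] }))
    .sort((a, b) => b.count - a.count);
}

const findings: PiiFinding[] = [
  { type: "email", value: "a@example.com", start: 0, end: 13 },
  { type: "national_id_tr", value: "x", start: 20, end: 31 },
  { type: "national_id_tr", value: "y", start: 40, end: 51 },
  { type: "national_id_tr", value: "z", start: 60, end: 71 },
];

const summary = summarizePii(findings);
// [{ type: "national_id_tr", count: 3 }, { type: "email", count: 1 }]
```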

package/dist/index.d.ts CHANGED

@@ -45,7 +45,7 @@ declare function applyMask(text: string, findings: PiiFinding[], strategy?: Mask
  * // "Contact: [REDACTED_EMAIL]"
  */
-declare const version = "0.3.0";
+declare const version = "0.3.1";
 type QualityGrade = "A" | "B" | "C" | "D";
 interface PiiSummaryEntry {
     type: string;

package/dist/index.js CHANGED

@@ -241,7 +241,7 @@ function applyMask(text, findings, strategy = "redact") {
 }
 // src/index.ts
-var version = "0.3.0";
+var version = "0.3.1";
 function computeQualityScore(completeness, avgLength, garbageRatio) {
   const lengthScore = Math.min(avgLength / 500, 1);
   const noiseScore = Math.max(0, 1 - garbageRatio * 10);

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@flexorch/audit",
-  "version": "0.3.0",
+  "version": "0.3.1",
   "description": "Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)",
   "keywords": [
     "pii",