@datafog/fogclaw 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (97)
  1. package/.github/workflows/harness-docs.yml +30 -0
  2. package/AGENTS.md +28 -0
  3. package/LICENSE +21 -0
  4. package/README.md +208 -0
  5. package/dist/config.d.ts +4 -0
  6. package/dist/config.d.ts.map +1 -0
  7. package/dist/config.js +30 -0
  8. package/dist/config.js.map +1 -0
  9. package/dist/engines/gliner.d.ts +14 -0
  10. package/dist/engines/gliner.d.ts.map +1 -0
  11. package/dist/engines/gliner.js +75 -0
  12. package/dist/engines/gliner.js.map +1 -0
  13. package/dist/engines/regex.d.ts +5 -0
  14. package/dist/engines/regex.d.ts.map +1 -0
  15. package/dist/engines/regex.js +54 -0
  16. package/dist/engines/regex.js.map +1 -0
  17. package/dist/index.d.ts +19 -0
  18. package/dist/index.d.ts.map +1 -0
  19. package/dist/index.js +157 -0
  20. package/dist/index.js.map +1 -0
  21. package/dist/redactor.d.ts +3 -0
  22. package/dist/redactor.d.ts.map +1 -0
  23. package/dist/redactor.js +37 -0
  24. package/dist/redactor.js.map +1 -0
  25. package/dist/scanner.d.ts +11 -0
  26. package/dist/scanner.d.ts.map +1 -0
  27. package/dist/scanner.js +77 -0
  28. package/dist/scanner.js.map +1 -0
  29. package/dist/types.d.ts +31 -0
  30. package/dist/types.d.ts.map +1 -0
  31. package/dist/types.js +18 -0
  32. package/dist/types.js.map +1 -0
  33. package/docs/DATA.md +28 -0
  34. package/docs/DESIGN.md +17 -0
  35. package/docs/DOMAIN_DOCS.md +30 -0
  36. package/docs/FRONTEND.md +24 -0
  37. package/docs/OBSERVABILITY.md +25 -0
  38. package/docs/PLANS.md +171 -0
  39. package/docs/PRODUCT_SENSE.md +20 -0
  40. package/docs/RELIABILITY.md +60 -0
  41. package/docs/SECURITY.md +50 -0
  42. package/docs/design-docs/core-beliefs.md +17 -0
  43. package/docs/design-docs/index.md +8 -0
  44. package/docs/generated/README.md +36 -0
  45. package/docs/generated/memory.md +1 -0
  46. package/docs/plans/2026-02-16-fogclaw-design.md +172 -0
  47. package/docs/plans/2026-02-16-fogclaw-implementation.md +1606 -0
  48. package/docs/plans/README.md +15 -0
  49. package/docs/plans/active/2026-02-16-feat-openclaw-official-submission-plan.md +386 -0
  50. package/docs/plans/active/2026-02-17-feat-release-fogclaw-via-datafog-package-plan.md +318 -0
  51. package/docs/plans/active/2026-02-17-feat-submit-fogclaw-to-openclaw-plan.md +244 -0
  52. package/docs/plans/tech-debt-tracker.md +42 -0
  53. package/docs/plugins/fogclaw.md +95 -0
  54. package/docs/runbooks/address-review-findings.md +30 -0
  55. package/docs/runbooks/ci-failures.md +46 -0
  56. package/docs/runbooks/code-review.md +34 -0
  57. package/docs/runbooks/merge-change.md +28 -0
  58. package/docs/runbooks/pull-request.md +45 -0
  59. package/docs/runbooks/record-evidence.md +43 -0
  60. package/docs/runbooks/reproduce-bug.md +42 -0
  61. package/docs/runbooks/respond-to-feedback.md +42 -0
  62. package/docs/runbooks/review-findings.md +31 -0
  63. package/docs/runbooks/submit-openclaw-plugin.md +68 -0
  64. package/docs/runbooks/update-agents-md.md +59 -0
  65. package/docs/runbooks/update-domain-docs.md +42 -0
  66. package/docs/runbooks/validate-current-state.md +41 -0
  67. package/docs/runbooks/verify-release.md +69 -0
  68. package/docs/specs/2026-02-16-feat-openclaw-official-submission-spec.md +115 -0
  69. package/docs/specs/2026-02-17-feat-submit-fogclaw-to-openclaw.md +125 -0
  70. package/docs/specs/README.md +5 -0
  71. package/docs/specs/index.md +8 -0
  72. package/docs/spikes/README.md +8 -0
  73. package/fogclaw.config.example.json +15 -0
  74. package/openclaw.plugin.json +45 -0
  75. package/package.json +37 -0
  76. package/scripts/ci/he-docs-config.json +123 -0
  77. package/scripts/ci/he-docs-drift.sh +112 -0
  78. package/scripts/ci/he-docs-lint.sh +234 -0
  79. package/scripts/ci/he-plans-lint.sh +354 -0
  80. package/scripts/ci/he-runbooks-lint.sh +445 -0
  81. package/scripts/ci/he-specs-lint.sh +258 -0
  82. package/scripts/ci/he-spikes-lint.sh +249 -0
  83. package/scripts/runbooks/select-runbooks.sh +154 -0
  84. package/src/config.ts +46 -0
  85. package/src/engines/gliner.ts +88 -0
  86. package/src/engines/regex.ts +71 -0
  87. package/src/index.ts +223 -0
  88. package/src/redactor.ts +51 -0
  89. package/src/scanner.ts +90 -0
  90. package/src/types.ts +52 -0
  91. package/tests/config.test.ts +104 -0
  92. package/tests/gliner.test.ts +184 -0
  93. package/tests/plugin-smoke.test.ts +114 -0
  94. package/tests/redactor.test.ts +320 -0
  95. package/tests/regex.test.ts +345 -0
  96. package/tests/scanner.test.ts +199 -0
  97. package/tsconfig.json +20 -0
@@ -0,0 +1,1606 @@
1
+ # FogClaw Implementation Plan
2
+
3
+ > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
4
+
5
+ **Goal:** Build a pure TypeScript OpenClaw plugin that detects and redacts PII and custom entities using regex and GLiNER (ONNX), exposed as both a message guardrail and an on-demand agent tool.
6
+
7
+ **Architecture:** Dual-engine pipeline (regex first for structured PII, GLiNER second for zero-shot NER) in a single OpenClaw plugin that registers a `before_agent_start` hook and two tools (`fogclaw_scan`, `fogclaw_redact`). The action taken for each entity type is driven by config.
8
+
9
+ **Tech Stack:** TypeScript, Node.js 22+, vitest, `gliner` npm package, `onnxruntime-node`, OpenClaw plugin API.
10
+
11
+ **Design doc:** `docs/plans/2026-02-16-fogclaw-design.md`
12
+
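The dual-engine pipeline named under Architecture can be sketched as a merge step. This is a minimal illustration, assuming that regex hits take precedence and that a GLiNER hit is dropped when it overlaps a span already kept. The names `mergeEntities` and `overlaps` are hypothetical and do not come from the package source; the real scanner may resolve overlaps differently.

```typescript
// Hypothetical sketch: mergeEntities and overlaps are illustrative names,
// not part of the package. The Entity shape mirrors the plan's src/types.ts.
interface Entity {
  text: string;
  label: string;
  start: number;
  end: number;
  confidence: number;
  source: "regex" | "gliner";
}

// Two half-open spans [start, end) overlap when each starts before the other ends.
function overlaps(a: Entity, b: Entity): boolean {
  return a.start < b.end && b.start < a.end;
}

// Keep every regex hit; add a GLiNER hit only if it overlaps nothing kept so far.
function mergeEntities(regexHits: Entity[], glinerHits: Entity[]): Entity[] {
  const merged = [...regexHits];
  for (const candidate of glinerHits) {
    if (!merged.some((kept) => overlaps(kept, candidate))) {
      merged.push(candidate);
    }
  }
  // Return in reading order.
  return merged.sort((a, b) => a.start - b.start);
}

// Sample text: "John Smith: john@example.com"
const regexHits: Entity[] = [
  { text: "john@example.com", label: "EMAIL", start: 12, end: 28, confidence: 1.0, source: "regex" },
];
const glinerHits: Entity[] = [
  { text: "John Smith", label: "PERSON", start: 0, end: 10, confidence: 0.95, source: "gliner" },
  // Overlaps the regex EMAIL span, so it is dropped.
  { text: "john@example.com", label: "EMAIL", start: 12, end: 28, confidence: 0.8, source: "gliner" },
];
const merged = mergeEntities(regexHits, glinerHits);
```

Preferring regex hits is only one possible policy; choosing the higher-confidence span would be another, and neither is confirmed by this part of the plan.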
13
+ ---
14
+
15
+ ### Task 1: Repository Scaffold
16
+
17
+ **Files:**
18
+ - Create: `package.json`
19
+ - Create: `tsconfig.json`
20
+ - Create: `.gitignore`
21
+ - Create: `openclaw.plugin.json`
22
+ - Create: `fogclaw.config.example.json`
23
+ - Create: `src/types.ts`
24
+
25
+ **Step 1: Initialize the repo**
26
+
27
+ Initialize a local Git repository (the GitHub repo itself lives under the `datafog` org):
28
+
29
+ ```bash
30
+ mkdir fogclaw && cd fogclaw
31
+ git init
32
+ ```
33
+
34
+ **Step 2: Create `package.json`**
35
+
36
+ ```json
37
+ {
38
+ "name": "@datafog/fogclaw",
39
+ "version": "0.1.0",
40
+ "description": "OpenClaw plugin for PII detection & custom entity redaction powered by DataFog",
41
+ "type": "module",
42
+ "main": "dist/index.js",
43
+ "types": "dist/index.d.ts",
44
+ "scripts": {
45
+ "build": "tsc",
46
+ "test": "vitest run",
47
+ "test:watch": "vitest",
48
+ "lint": "tsc --noEmit"
49
+ },
50
+ "dependencies": {
51
+ "gliner": "^0.2.0",
52
+ "onnxruntime-node": "^1.20.0"
53
+ },
54
+ "devDependencies": {
55
+ "@types/node": "^22.0.0",
56
+ "typescript": "^5.7.0",
57
+ "vitest": "^2.1.0"
58
+ },
59
+ "engines": {
60
+ "node": ">=22.0.0"
61
+ },
62
+ "license": "MIT",
63
+ "repository": {
64
+ "type": "git",
65
+ "url": "https://github.com/datafog/fogclaw"
66
+ }
67
+ }
68
+ ```
69
+
70
+ **Step 3: Create `tsconfig.json`**
71
+
72
+ ```json
73
+ {
74
+ "compilerOptions": {
75
+ "target": "ES2022",
76
+ "module": "ESNext",
77
+ "moduleResolution": "bundler",
78
+ "lib": ["ES2022"],
79
+ "outDir": "dist",
80
+ "rootDir": "src",
81
+ "strict": true,
82
+ "declaration": true,
83
+ "declarationMap": true,
84
+ "sourceMap": true,
85
+ "esModuleInterop": true,
86
+ "skipLibCheck": true,
87
+ "forceConsistentCasingInFileNames": true,
88
+ "resolveJsonModule": true
89
+ },
90
+ "include": ["src/**/*"],
91
+ "exclude": ["node_modules", "dist", "tests"]
92
+ }
93
+ ```
94
+
95
+ **Step 4: Create `.gitignore`**
96
+
97
+ ```
98
+ node_modules/
99
+ dist/
100
+ models/
101
+ *.onnx
102
+ .env
103
+ ```
104
+
105
+ **Step 5: Create `openclaw.plugin.json`**
106
+
107
+ ```json
108
+ {
109
+ "id": "fogclaw",
110
+ "name": "FogClaw",
111
+ "version": "0.1.0",
112
+ "description": "PII detection & custom entity redaction powered by DataFog",
113
+ "configSchema": {
114
+ "type": "object",
115
+ "properties": {
116
+ "enabled": { "type": "boolean", "default": true },
117
+ "guardrail_mode": {
118
+ "type": "string",
119
+ "enum": ["redact", "block", "warn"],
120
+ "default": "redact"
121
+ },
122
+ "redactStrategy": {
123
+ "type": "string",
124
+ "enum": ["token", "mask", "hash"],
125
+ "default": "token"
126
+ },
127
+ "model": {
128
+ "type": "string",
129
+ "default": "onnx-community/gliner_large-v2.1"
130
+ },
131
+ "confidence_threshold": {
132
+ "type": "number",
133
+ "default": 0.5,
134
+ "minimum": 0,
135
+ "maximum": 1
136
+ },
137
+ "custom_entities": {
138
+ "type": "array",
139
+ "items": { "type": "string" },
140
+ "default": []
141
+ },
142
+ "entityActions": {
143
+ "type": "object",
144
+ "additionalProperties": {
145
+ "type": "string",
146
+ "enum": ["redact", "block", "warn"]
147
+ },
148
+ "default": {}
149
+ }
150
+ }
151
+ }
152
+ }
153
+ ```
154
+
155
+ **Step 6: Create `fogclaw.config.example.json`**
156
+
157
+ ```json
158
+ {
159
+ "enabled": true,
160
+ "guardrail_mode": "redact",
161
+ "redactStrategy": "token",
162
+ "model": "onnx-community/gliner_large-v2.1",
163
+ "confidence_threshold": 0.5,
164
+ "custom_entities": ["project codename", "internal tool name"],
165
+ "entityActions": {
166
+ "SSN": "block",
167
+ "CREDIT_CARD": "block",
168
+ "EMAIL": "redact",
169
+ "PHONE": "redact",
170
+ "PERSON": "warn"
171
+ }
172
+ }
173
+ ```
174
+
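The example config suggests a two-level lookup: a per-type entry in `entityActions`, with `guardrail_mode` as the default. A minimal sketch follows, assuming that an `entityActions` entry overrides the global mode and that unlisted types fall back to it. This section of the plan does not spell out that rule, and `resolveAction` is a hypothetical helper name.

```typescript
type GuardrailAction = "redact" | "block" | "warn";

// Only the two fields needed for the lookup; the full config has more.
interface ActionConfig {
  guardrail_mode: GuardrailAction;
  entityActions: Record<string, GuardrailAction>;
}

// Hypothetical helper: per-type override first, then the global default.
function resolveAction(label: string, config: ActionConfig): GuardrailAction {
  return config.entityActions[label] ?? config.guardrail_mode;
}

const exampleConfig: ActionConfig = {
  guardrail_mode: "redact",
  entityActions: {
    SSN: "block",
    CREDIT_CARD: "block",
    EMAIL: "redact",
    PHONE: "redact",
    PERSON: "warn",
  },
};

const ssnAction = resolveAction("SSN", exampleConfig);      // listed: "block"
const zipAction = resolveAction("ZIP_CODE", exampleConfig); // not listed: falls back to "redact"
```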
175
+ **Step 7: Create `src/types.ts`**
176
+
177
+ ```typescript
178
+ export interface Entity {
179
+ text: string;
180
+ label: string;
181
+ start: number;
182
+ end: number;
183
+ confidence: number;
184
+ source: "regex" | "gliner";
185
+ }
186
+
187
+ export type RedactStrategy = "token" | "mask" | "hash";
188
+
189
+ export type GuardrailAction = "redact" | "block" | "warn";
190
+
191
+ export interface FogClawConfig {
192
+ enabled: boolean;
193
+ guardrail_mode: GuardrailAction;
194
+ redactStrategy: RedactStrategy;
195
+ model: string;
196
+ confidence_threshold: number;
197
+ custom_entities: string[];
198
+ entityActions: Record<string, GuardrailAction>;
199
+ }
200
+
201
+ export interface ScanResult {
202
+ entities: Entity[];
203
+ text: string;
204
+ }
205
+
206
+ export interface RedactResult {
207
+ redacted_text: string;
208
+ mapping: Record<string, string>;
209
+ entities: Entity[];
210
+ }
211
+
212
+ export const CANONICAL_TYPE_MAP: Record<string, string> = {
213
+ DOB: "DATE",
214
+ ZIP: "ZIP_CODE",
215
+ PER: "PERSON",
216
+ ORG: "ORGANIZATION",
217
+ GPE: "LOCATION",
218
+ LOC: "LOCATION",
219
+ FAC: "ADDRESS",
220
+ PHONE_NUMBER: "PHONE",
221
+ SOCIAL_SECURITY_NUMBER: "SSN",
222
+ CREDIT_CARD_NUMBER: "CREDIT_CARD",
223
+ DATE_OF_BIRTH: "DATE",
224
+ };
225
+
226
+ export function canonicalType(entityType: string): string {
227
+ const normalized = entityType.toUpperCase().trim();
228
+ return CANONICAL_TYPE_MAP[normalized] ?? normalized;
229
+ }
230
+ ```
231
+
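A short usage sketch of `canonicalType` may help when writing `entityActions` keys. It uses a reduced copy of the map, so only the three entries shown are present. The point about multi-word labels assumes that actions are looked up by canonical label, which this section does not confirm.

```typescript
// Reduced copy of the map from src/types.ts, enough for the demonstration.
const CANONICAL_TYPE_MAP: Record<string, string> = {
  PER: "PERSON",
  DOB: "DATE",
  PHONE_NUMBER: "PHONE",
};

function canonicalType(entityType: string): string {
  const normalized = entityType.toUpperCase().trim();
  return CANONICAL_TYPE_MAP[normalized] ?? normalized;
}

// Aliases are resolved case-insensitively and with surrounding whitespace ignored.
const fromAlias = canonicalType(" per ");
// Labels without an alias are only upper-cased.
const fromLowercase = canonicalType("organization");
// Inner spaces are preserved, so a multi-word custom entity keeps its space.
const fromCustom = canonicalType("project codename");
```

Under that assumption, a custom entity such as `"project codename"` would need the key `"PROJECT CODENAME"` in `entityActions`.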
232
+ **Step 8: Install dependencies & verify build**
233
+
234
+ ```bash
235
+ npm install
236
+ npx tsc --noEmit
237
+ ```
238
+
239
+ Expected: Clean compile, no errors.
240
+
241
+ **Step 9: Commit**
242
+
243
+ ```bash
244
+ git add -A
245
+ git commit -m "chore: scaffold fogclaw repo with types, config, and plugin manifest"
246
+ ```
247
+
248
+ ---
249
+
250
+ ### Task 2: Regex Engine
251
+
252
+ **Files:**
253
+ - Create: `src/engines/regex.ts`
254
+ - Create: `tests/regex.test.ts`
255
+
256
+ **Step 1: Write the failing tests**
257
+
258
+ Create `tests/regex.test.ts`:
259
+
260
+ ```typescript
261
+ import { describe, it, expect } from "vitest";
262
+ import { RegexEngine } from "../src/engines/regex.js";
263
+
264
+ const engine = new RegexEngine();
265
+
266
+ describe("RegexEngine", () => {
267
+ describe("EMAIL", () => {
268
+ it("detects simple email", () => {
269
+ const entities = engine.scan("Contact john@example.com for info");
270
+ const emails = entities.filter((e) => e.label === "EMAIL");
271
+ expect(emails).toHaveLength(1);
272
+ expect(emails[0].text).toBe("john@example.com");
273
+ expect(emails[0].confidence).toBe(1.0);
274
+ expect(emails[0].source).toBe("regex");
275
+ });
276
+
277
+ it("detects email with subdomain", () => {
278
+ const entities = engine.scan("Email first.last@example.co.uk");
279
+ const emails = entities.filter((e) => e.label === "EMAIL");
280
+ expect(emails).toHaveLength(1);
281
+ expect(emails[0].text).toBe("first.last@example.co.uk");
282
+ });
283
+
284
+ it("detects email with plus tag", () => {
285
+ const entities = engine.scan("Send to user+tag@example.org");
286
+ const emails = entities.filter((e) => e.label === "EMAIL");
287
+ expect(emails).toHaveLength(1);
288
+ });
289
+
290
+ it("does not match bare @", () => {
291
+ const entities = engine.scan("@ is not an email");
292
+ const emails = entities.filter((e) => e.label === "EMAIL");
293
+ expect(emails).toHaveLength(0);
294
+ });
295
+ });
296
+
297
+ describe("PHONE", () => {
298
+ it("detects US phone with dashes", () => {
299
+ const entities = engine.scan("Call 555-123-4567");
300
+ const phones = entities.filter((e) => e.label === "PHONE");
301
+ expect(phones).toHaveLength(1);
302
+ expect(phones[0].text).toBe("555-123-4567");
303
+ });
304
+
305
+ it("detects US phone with parens", () => {
306
+ const entities = engine.scan("Call (555) 123-4567");
307
+ const phones = entities.filter((e) => e.label === "PHONE");
308
+ expect(phones).toHaveLength(1);
309
+ });
310
+
311
+ it("detects international phone", () => {
312
+ const entities = engine.scan("Call +44 20 7946 0958");
313
+ const phones = entities.filter((e) => e.label === "PHONE");
314
+ expect(phones).toHaveLength(1);
315
+ });
316
+ });
317
+
318
+ describe("SSN", () => {
319
+ it("detects SSN with dashes", () => {
320
+ const entities = engine.scan("SSN: 123-45-6789");
321
+ const ssns = entities.filter((e) => e.label === "SSN");
322
+ expect(ssns).toHaveLength(1);
323
+ expect(ssns[0].text).toBe("123-45-6789");
324
+ });
325
+
326
+ it("rejects SSN with area code 000", () => {
327
+ const entities = engine.scan("SSN: 000-45-6789");
328
+ const ssns = entities.filter((e) => e.label === "SSN");
329
+ expect(ssns).toHaveLength(0);
330
+ });
331
+
332
+ it("rejects SSN with area code 666", () => {
333
+ const entities = engine.scan("SSN: 666-45-6789");
334
+ const ssns = entities.filter((e) => e.label === "SSN");
335
+ expect(ssns).toHaveLength(0);
336
+ });
337
+ });
338
+
339
+ describe("CREDIT_CARD", () => {
340
+ it("detects Visa", () => {
341
+ const entities = engine.scan("Card: 4111111111111111");
342
+ const cards = entities.filter((e) => e.label === "CREDIT_CARD");
343
+ expect(cards).toHaveLength(1);
344
+ });
345
+
346
+ it("detects Mastercard", () => {
347
+ const entities = engine.scan("Card: 5500000000000004");
348
+ const cards = entities.filter((e) => e.label === "CREDIT_CARD");
349
+ expect(cards).toHaveLength(1);
350
+ });
351
+
352
+ it("detects Amex", () => {
353
+ const entities = engine.scan("Card: 340000000000009");
354
+ const cards = entities.filter((e) => e.label === "CREDIT_CARD");
355
+ expect(cards).toHaveLength(1);
356
+ });
357
+ });
358
+
359
+ describe("IP_ADDRESS", () => {
360
+ it("detects valid IPv4", () => {
361
+ const entities = engine.scan("Server at 192.168.1.1");
362
+ const ips = entities.filter((e) => e.label === "IP_ADDRESS");
363
+ expect(ips).toHaveLength(1);
364
+ expect(ips[0].text).toBe("192.168.1.1");
365
+ });
366
+
367
+ it("rejects invalid octet", () => {
368
+ const entities = engine.scan("Not valid: 256.168.1.1");
369
+ const ips = entities.filter((e) => e.label === "IP_ADDRESS");
370
+ expect(ips).toHaveLength(0);
371
+ });
372
+ });
373
+
374
+ describe("DATE", () => {
375
+ it("detects MM/DD/YYYY", () => {
376
+ const entities = engine.scan("Born on 01/15/1990");
377
+ const dates = entities.filter((e) => e.label === "DATE");
378
+ expect(dates).toHaveLength(1);
379
+ });
380
+
381
+ it("detects YYYY-MM-DD", () => {
382
+ const entities = engine.scan("Date: 2020-01-15");
383
+ const dates = entities.filter((e) => e.label === "DATE");
384
+ expect(dates).toHaveLength(1);
385
+ });
386
+
387
+ it("detects Month DD, YYYY", () => {
388
+ const entities = engine.scan("Born January 15, 2000");
389
+ const dates = entities.filter((e) => e.label === "DATE");
390
+ expect(dates).toHaveLength(1);
391
+ });
392
+ });
393
+
394
+ describe("ZIP_CODE", () => {
395
+ it("detects 5-digit zip", () => {
396
+ const entities = engine.scan("ZIP: 10001");
397
+ const zips = entities.filter((e) => e.label === "ZIP_CODE");
398
+ expect(zips).toHaveLength(1);
399
+ });
400
+
401
+ it("detects zip+4", () => {
402
+ const entities = engine.scan("ZIP: 10001-1234");
403
+ const zips = entities.filter((e) => e.label === "ZIP_CODE");
404
+ expect(zips).toHaveLength(1);
405
+ });
406
+ });
407
+
408
+ describe("multiple entities", () => {
409
+ it("detects multiple entity types in one text", () => {
410
+ const text =
411
+ "John's email is john@example.com, phone 555-123-4567, SSN 123-45-6789";
412
+ const entities = engine.scan(text);
413
+ const labels = new Set(entities.map((e) => e.label));
414
+ expect(labels.has("EMAIL")).toBe(true);
415
+ expect(labels.has("PHONE")).toBe(true);
416
+ expect(labels.has("SSN")).toBe(true);
417
+ });
418
+ });
419
+
420
+ describe("empty input", () => {
421
+ it("returns empty array for empty string", () => {
422
+ const entities = engine.scan("");
423
+ expect(entities).toHaveLength(0);
424
+ });
425
+ });
426
+
427
+ describe("span offsets", () => {
428
+ it("returns correct start/end offsets", () => {
429
+ const text = "Email: john@example.com here";
430
+ const entities = engine.scan(text);
431
+ const email = entities.find((e) => e.label === "EMAIL")!;
432
+ expect(text.slice(email.start, email.end)).toBe("john@example.com");
433
+ });
434
+ });
435
+ });
436
+ ```
437
+
438
+ **Step 2: Run tests to verify they fail**
439
+
440
+ ```bash
441
+ npx vitest run tests/regex.test.ts
442
+ ```
443
+
444
+ Expected: FAIL — `Cannot find module '../src/engines/regex.js'`
445
+
446
+ **Step 3: Write the regex engine**
447
+
448
+ Create `src/engines/regex.ts`:
449
+
450
+ ```typescript
451
+ import type { Entity } from "../types.js";
452
+
453
+ interface PatternDef {
454
+ label: string;
455
+ pattern: RegExp;
456
+ /** Canonical label to use in output (e.g., DOB → DATE) */
457
+ canonicalLabel?: string;
458
+ }
459
+
460
+ const PATTERNS: PatternDef[] = [
461
+ {
462
+ label: "EMAIL",
463
+ pattern:
464
+ /(?<![A-Za-z0-9._%+\-@])(?![A-Za-z_]{2,20}=)[A-Za-z0-9!#$%&*+\-/=^_`{|}~][A-Za-z0-9!#$%&'*+\-/=?^_`{|}~.]*@(?:\.?[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?=$|[^A-Za-z])/gim,
465
+ },
466
+ {
467
+ label: "PHONE",
468
+ pattern:
469
+ /(?<![A-Za-z0-9])(?:(?:(?:\+?1)[-.\s]?)?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}|\+\d{1,3}[\s\-.]?\d{1,4}(?:[\s\-.]?\d{2,4}){2,3})(?![-A-Za-z0-9])/gim,
470
+ },
471
+ {
472
+ label: "SSN",
473
+ pattern:
474
+ /(?<!\d)(?:(?!000|666)\d{3}-(?!00)\d{2}-(?!0000)\d{4}|(?!000|666)\d{3}(?!00)\d{2}(?!0000)\d{4})(?!\d)/gm,
475
+ },
476
+ {
477
+ label: "CREDIT_CARD",
478
+ pattern:
479
+ /\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|(?:(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2})[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4})|(?:3[47]\d{2}[-\s]?\d{6}[-\s]?\d{5}))\b/gm,
480
+ },
481
+ {
482
+ label: "IP_ADDRESS",
483
+ pattern:
484
+ /\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.(?:25[0-5]|2[0-4]\d|1?\d?\d)\.(?:25[0-5]|2[0-4]\d|1?\d?\d)\.(?:25[0-5]|2[0-4]\d|1?\d?\d))\b/gm,
485
+ },
486
+ {
487
+ label: "DATE",
488
+ canonicalLabel: "DATE",
489
+ pattern:
490
+ /\b(?:(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:\d{2}|\d{4})|(?:\d{4})-(?:0?[1-9]|1[0-2])-(?:0?[1-9]|[12]\d|3[01])|(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+(?:0?[1-9]|[12]\d|3[01]),\s+(?:19|20)\d{2})\b/gim,
491
+ },
492
+ {
493
+ label: "ZIP_CODE",
494
+ pattern: /\b\d{5}(?:-\d{4})?\b/gm,
495
+ },
496
+ ];
497
+
498
+ export class RegexEngine {
499
+ scan(text: string): Entity[] {
500
+ if (!text) return [];
501
+
502
+ const entities: Entity[] = [];
503
+
504
+ for (const { label, pattern, canonicalLabel } of PATTERNS) {
505
+ // Reset lastIndex since we reuse the regex
506
+ pattern.lastIndex = 0;
507
+
508
+ let match: RegExpExecArray | null;
509
+ while ((match = pattern.exec(text)) !== null) {
510
+ entities.push({
511
+ text: match[0],
512
+ label: canonicalLabel ?? label,
513
+ start: match.index,
514
+ end: match.index + match[0].length,
515
+ confidence: 1.0,
516
+ source: "regex",
517
+ });
518
+ }
519
+ }
520
+
521
+ return entities;
522
+ }
523
+ }
524
+ ```
525
+
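The `pattern.lastIndex = 0` reset in the engine matters because a regex with the `g` flag stores its search position between calls. The sketch below shows the pitfall and one alternative, scanning with a fresh copy of the pattern. `findAll` is a hypothetical helper, not part of the plan.

```typescript
// A /g regex keeps its position in lastIndex between calls.
const zip = /\b\d{5}\b/g;

const first = zip.test("ZIP: 10001");  // true; lastIndex is now 10
const second = zip.test("ZIP: 10001"); // false; the search resumed at index 10, then lastIndex reset to 0

// Hypothetical helper: scan with a fresh copy so no state is shared between calls.
function findAll(
  pattern: RegExp,
  text: string,
): Array<{ text: string; start: number; end: number }> {
  // The "g" flag is required, otherwise the loop below would never advance.
  if (!pattern.global) throw new Error("pattern must use the g flag");
  // A new RegExp always starts with lastIndex === 0.
  const fresh = new RegExp(pattern.source, pattern.flags);
  const hits: Array<{ text: string; start: number; end: number }> = [];
  let match: RegExpExecArray | null;
  while ((match = fresh.exec(text)) !== null) {
    hits.push({ text: match[0], start: match.index, end: match.index + match[0].length });
    // Guard against zero-length matches, which would otherwise loop forever.
    if (match[0].length === 0) fresh.lastIndex++;
  }
  return hits;
}

const once = findAll(zip, "10001 and 94105");
const twice = findAll(zip, "10001 and 94105");
```

Resetting `lastIndex` as the engine does is equally valid; the copy just removes the need to remember the reset.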
526
+ **Step 4: Run tests to verify they pass**
527
+
528
+ ```bash
529
+ npx vitest run tests/regex.test.ts
530
+ ```
531
+
532
+ Expected: All tests PASS.
533
+
534
+ **Step 5: Commit**
535
+
536
+ ```bash
537
+ git add src/engines/regex.ts tests/regex.test.ts
538
+ git commit -m "feat: add regex engine with ported DataFog PII patterns"
539
+ ```
540
+
541
+ ---
542
+
543
+ ### Task 3: Redactor
544
+
545
+ **Files:**
546
+ - Create: `src/redactor.ts`
547
+ - Create: `tests/redactor.test.ts`
548
+
549
+ **Step 1: Write the failing tests**
550
+
551
+ Create `tests/redactor.test.ts`:
552
+
553
+ ```typescript
554
+ import { describe, it, expect } from "vitest";
555
+ import { redact } from "../src/redactor.js";
556
+ import type { Entity } from "../src/types.js";
557
+
558
+ const email: Entity = {
559
+ text: "john@example.com",
560
+ label: "EMAIL",
561
+ start: 8,
562
+ end: 24,
563
+ confidence: 1.0,
564
+ source: "regex",
565
+ };
566
+
567
+ const phone: Entity = {
568
+ text: "555-123-4567",
569
+ label: "PHONE",
570
+ start: 31,
570
+ end: 43,
572
+ confidence: 1.0,
573
+ source: "regex",
574
+ };
575
+
576
+ const baseText = "Contact john@example.com, call 555-123-4567 please";
577
+
578
+ describe("redact", () => {
579
+ describe("token strategy", () => {
580
+ it("replaces entities with type tokens", () => {
581
+ const result = redact(baseText, [email, phone], "token");
582
+ expect(result.redacted_text).toContain("[EMAIL_1]");
583
+ expect(result.redacted_text).toContain("[PHONE_1]");
584
+ expect(result.redacted_text).not.toContain("john@example.com");
585
+ expect(result.redacted_text).not.toContain("555-123-4567");
586
+ });
587
+
588
+ it("increments counter for same type", () => {
589
+ const email2: Entity = {
590
+ text: "jane@example.com",
591
+ label: "EMAIL",
592
+ start: 32,
592
+ end: 48,
594
+ confidence: 1.0,
595
+ source: "regex",
596
+ };
597
+ const text = "Email john@example.com and also jane@example.com";
598
+ const result = redact(text, [
599
+ { ...email, start: 6, end: 22 },
600
+ { ...email2 },
601
+ ], "token");
602
+ expect(result.redacted_text).toContain("[EMAIL_1]");
603
+ expect(result.redacted_text).toContain("[EMAIL_2]");
604
+ });
605
+
606
+ it("builds mapping from replacement to original", () => {
607
+ const result = redact(baseText, [email], "token");
608
+ expect(result.mapping["[EMAIL_1]"]).toBe("john@example.com");
609
+ });
610
+ });
611
+
612
+ describe("mask strategy", () => {
613
+ it("replaces with asterisks matching length", () => {
614
+ const result = redact("Contact john@example.com", [
615
+ { ...email, start: 8, end: 24 },
616
+ ], "mask");
617
+ expect(result.redacted_text).toBe("Contact ****************");
618
+ });
619
+ });
620
+
621
+ describe("hash strategy", () => {
622
+ it("replaces with type and hash prefix", () => {
623
+ const result = redact("Contact john@example.com", [
624
+ { ...email, start: 8, end: 24 },
625
+ ], "hash");
626
+ expect(result.redacted_text).toMatch(/Contact \[EMAIL_[a-f0-9]{12}\]/);
627
+ });
628
+
629
+ it("produces consistent hashes for same input", () => {
630
+ const r1 = redact("Contact john@example.com", [
631
+ { ...email, start: 8, end: 24 },
632
+ ], "hash");
633
+ const r2 = redact("Contact john@example.com", [
634
+ { ...email, start: 8, end: 24 },
635
+ ], "hash");
636
+ expect(r1.redacted_text).toBe(r2.redacted_text);
637
+ });
638
+ });
639
+
640
+ describe("empty input", () => {
641
+ it("returns original text when no entities", () => {
642
+ const result = redact("Hello world", [], "token");
643
+ expect(result.redacted_text).toBe("Hello world");
644
+ expect(result.entities).toHaveLength(0);
645
+ });
646
+ });
647
+
648
+ describe("entity ordering", () => {
649
+ it("handles entities in any order without offset corruption", () => {
650
+ const result = redact(baseText, [phone, email], "token");
651
+ expect(result.redacted_text).not.toContain("john@example.com");
652
+ expect(result.redacted_text).not.toContain("555-123-4567");
653
+ });
654
+ });
655
+ });
656
+ ```
657
+
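Fixture offsets like the ones in these tests are counted by hand. A miscount can go unnoticed because most assertions only use `toContain` and `not.toContain`. One way to avoid that, sketched here with a hypothetical helper `entityAt`, is to derive the offsets from the fixture text.

```typescript
interface Entity {
  text: string;
  label: string;
  start: number;
  end: number;
  confidence: number;
  source: "regex" | "gliner";
}

// Hypothetical test helper: locate the first occurrence of `needle` and build
// an Entity whose start/end are guaranteed to match the fixture text.
function entityAt(text: string, needle: string, label: string): Entity {
  const start = text.indexOf(needle);
  if (start === -1) {
    throw new Error(`"${needle}" not found in fixture text`);
  }
  return {
    text: needle,
    label,
    start,
    end: start + needle.length,
    confidence: 1.0,
    source: "regex",
  };
}

const baseText = "Contact john@example.com, call 555-123-4567 please";
const emailFixture = entityAt(baseText, "john@example.com", "EMAIL");
const phoneFixture = entityAt(baseText, "555-123-4567", "PHONE");
```

An assertion such as `expect(baseText.slice(e.start, e.end)).toBe(e.text)` on each fixture would catch the same class of mistake without a helper.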
658
+ **Step 2: Run tests to verify they fail**
659
+
660
+ ```bash
661
+ npx vitest run tests/redactor.test.ts
662
+ ```
663
+
664
+ Expected: FAIL — `Cannot find module '../src/redactor.js'`
665
+
666
+ **Step 3: Write the redactor**
667
+
668
+ Create `src/redactor.ts`:
669
+
670
+ ```typescript
671
+ import { createHash } from "node:crypto";
672
+ import type { Entity, RedactResult, RedactStrategy } from "./types.js";
673
+
674
+ export function redact(
675
+ text: string,
676
+ entities: Entity[],
677
+ strategy: RedactStrategy = "token",
678
+ ): RedactResult {
679
+ if (entities.length === 0) {
680
+ return { redacted_text: text, mapping: {}, entities: [] };
681
+ }
682
+
683
+ // Sort by start position descending so we can replace from end to start
684
+ // without corrupting earlier offsets
685
+ const sorted = [...entities].sort((a, b) => b.start - a.start);
686
+
687
+ const counters: Record<string, number> = {};
688
+ const mapping: Record<string, string> = {};
689
+ let result = text;
690
+
691
+ for (const entity of sorted) {
692
+ const replacement = makeReplacement(entity, strategy, counters);
693
+ mapping[replacement] = entity.text;
694
+ result = result.slice(0, entity.start) + replacement + result.slice(entity.end);
695
+ }
696
+
697
+ return { redacted_text: result, mapping, entities };
698
+ }
699
+
700
+ function makeReplacement(
701
+ entity: Entity,
702
+ strategy: RedactStrategy,
703
+ counters: Record<string, number>,
704
+ ): string {
705
+ switch (strategy) {
706
+ case "token": {
707
+ counters[entity.label] = (counters[entity.label] ?? 0) + 1;
708
+ return `[${entity.label}_${counters[entity.label]}]`;
709
+ }
710
+ case "mask": {
711
+ return "*".repeat(Math.max(entity.text.length, 1));
712
+ }
713
+ case "hash": {
714
+ const digest = createHash("sha256")
715
+ .update(entity.text)
716
+ .digest("hex")
717
+ .slice(0, 12);
718
+ return `[${entity.label}_${digest}]`;
719
+ }
720
+ }
721
+ }
722
+ ```
723
+
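Because the loop above walks entities from the end of the text to the start, the `token` counters are incremented in that order, so the last `EMAIL` in the text becomes `[EMAIL_1]`. If reading-order numbering is preferred, one option is to assign tokens in ascending order and then splice in descending order. This is an alternative sketch, not what the plan specifies. `redactInReadingOrder` is a hypothetical name, and the code assumes the spans do not overlap.

```typescript
interface Span {
  text: string;
  label: string;
  start: number;
  end: number;
}

// Assumes non-overlapping spans with offsets that are valid for `text`.
function redactInReadingOrder(text: string, spans: Span[]): string {
  const counters: Record<string, number> = {};

  // Pass 1: ascending by start, so counters follow reading order.
  const withTokens = [...spans]
    .sort((a, b) => a.start - b.start)
    .map((span) => {
      counters[span.label] = (counters[span.label] ?? 0) + 1;
      return { span, token: `[${span.label}_${counters[span.label]}]` };
    });

  // Pass 2: splice from the end so earlier offsets stay valid.
  let result = text;
  for (const { span, token } of withTokens.reverse()) {
    result = result.slice(0, span.start) + token + result.slice(span.end);
  }
  return result;
}

const sample = "Email john@example.com and also jane@example.com";
const redactedSample = redactInReadingOrder(sample, [
  { text: "jane@example.com", label: "EMAIL", start: 32, end: 48 },
  { text: "john@example.com", label: "EMAIL", start: 6, end: 22 },
]);
// redactedSample === "Email [EMAIL_1] and also [EMAIL_2]"
```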
724
+ **Step 4: Run tests to verify they pass**
725
+
726
+ ```bash
727
+ npx vitest run tests/redactor.test.ts
728
+ ```
729
+
730
+ Expected: All tests PASS.
731
+
732
+ **Step 5: Commit**
733
+
734
+ ```bash
735
+ git add src/redactor.ts tests/redactor.test.ts
736
+ git commit -m "feat: add redactor with token, mask, and hash strategies"
737
+ ```
738
+
739
+ ---
740
+
741
+ ### Task 4: Config Loader
742
+
743
+ **Files:**
744
+ - Create: `src/config.ts`
745
+ - Create: `tests/config.test.ts`
746
+
747
+ **Step 1: Write the failing tests**
748
+
749
+ Create `tests/config.test.ts`:
750
+
751
+ ```typescript
752
+ import { describe, it, expect } from "vitest";
753
+ import { loadConfig, DEFAULT_CONFIG } from "../src/config.js";
754
+
755
+ describe("loadConfig", () => {
756
+ it("returns defaults when no overrides", () => {
757
+ const config = loadConfig({});
758
+ expect(config.enabled).toBe(true);
759
+ expect(config.guardrail_mode).toBe("redact");
760
+ expect(config.redactStrategy).toBe("token");
761
+ expect(config.model).toBe("onnx-community/gliner_large-v2.1");
762
+ expect(config.confidence_threshold).toBe(0.5);
763
+ expect(config.custom_entities).toEqual([]);
764
+ expect(config.entityActions).toEqual({});
765
+ });
766
+
767
+ it("merges partial overrides with defaults", () => {
768
+ const config = loadConfig({
769
+ guardrail_mode: "block",
770
+ custom_entities: ["competitor name"],
771
+ });
772
+ expect(config.guardrail_mode).toBe("block");
773
+ expect(config.custom_entities).toEqual(["competitor name"]);
774
+ expect(config.enabled).toBe(true); // default preserved
775
+ });
776
+
777
+ it("validates guardrail_mode", () => {
778
+ expect(() => loadConfig({ guardrail_mode: "invalid" as any })).toThrow();
779
+ });
780
+
781
+ it("validates confidence_threshold range", () => {
782
+ expect(() => loadConfig({ confidence_threshold: -0.1 })).toThrow();
783
+ expect(() => loadConfig({ confidence_threshold: 1.5 })).toThrow();
784
+ });
785
+
786
+ it("validates entityActions values", () => {
787
+ expect(() =>
788
+ loadConfig({ entityActions: { EMAIL: "invalid" as any } }),
789
+ ).toThrow();
790
+ });
791
+ });
792
+ ```
793
+
794
+ **Step 2: Run tests to verify they fail**
795
+
796
+ ```bash
797
+ npx vitest run tests/config.test.ts
798
+ ```
799
+
800
+ Expected: FAIL — `Cannot find module '../src/config.js'`
801
+
802
+ **Step 3: Write the config loader**
803
+
804
+ Create `src/config.ts`:
805
+
806
+ ```typescript
807
+ import type { FogClawConfig, GuardrailAction, RedactStrategy } from "./types.js";
808
+
809
+ const VALID_GUARDRAIL_MODES: GuardrailAction[] = ["redact", "block", "warn"];
810
+ const VALID_REDACT_STRATEGIES: RedactStrategy[] = ["token", "mask", "hash"];
811
+
812
+ export const DEFAULT_CONFIG: FogClawConfig = {
813
+ enabled: true,
814
+ guardrail_mode: "redact",
815
+ redactStrategy: "token",
816
+ model: "onnx-community/gliner_large-v2.1",
817
+ confidence_threshold: 0.5,
818
+ custom_entities: [],
819
+ entityActions: {},
820
+ };
821
+
822
+ export function loadConfig(overrides: Partial<FogClawConfig>): FogClawConfig {
823
+ const config: FogClawConfig = { ...DEFAULT_CONFIG, ...overrides };
824
+
825
+ if (!VALID_GUARDRAIL_MODES.includes(config.guardrail_mode)) {
826
+ throw new Error(
827
+ `Invalid guardrail_mode "${config.guardrail_mode}". Must be one of: ${VALID_GUARDRAIL_MODES.join(", ")}`,
828
+ );
829
+ }
830
+
831
+ if (!VALID_REDACT_STRATEGIES.includes(config.redactStrategy)) {
832
+ throw new Error(
833
+ `Invalid redactStrategy "${config.redactStrategy}". Must be one of: ${VALID_REDACT_STRATEGIES.join(", ")}`,
834
+ );
835
+ }
836
+
837
+ if (config.confidence_threshold < 0 || config.confidence_threshold > 1) {
838
+ throw new Error(
839
+ `confidence_threshold must be between 0 and 1, got ${config.confidence_threshold}`,
840
+ );
841
+ }
842
+
843
+ for (const [entityType, action] of Object.entries(config.entityActions)) {
844
+ if (!VALID_GUARDRAIL_MODES.includes(action)) {
845
+ throw new Error(
846
+ `Invalid action "${action}" for entity type "${entityType}". Must be one of: ${VALID_GUARDRAIL_MODES.join(", ")}`,
847
+ );
848
+ }
849
+ }
850
+
851
+ return config;
852
+ }
853
+ ```
854
+
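One edge case of the range check above is that every comparison with `NaN` is false, so `NaN < 0 || NaN > 1` does not trigger the error. If that matters, a stricter check could look like the sketch below. `assertThreshold` is a hypothetical helper and not part of the plan.

```typescript
// Both comparisons with NaN are false, so the plain range check lets NaN through.
const nanPassesRangeCheck = !(NaN < 0 || NaN > 1);

// Hypothetical stricter check: also rejects NaN and +/-Infinity.
function assertThreshold(value: number): number {
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new Error(
      `confidence_threshold must be a finite number between 0 and 1, got ${value}`,
    );
  }
  return value;
}

// Small helper for the demonstration: reports whether a call throws.
function throwsFor(value: number): boolean {
  try {
    assertThreshold(value);
    return false;
  } catch {
    return true;
  }
}

const acceptsHalf = assertThreshold(0.5);
const rejectsNaN = throwsFor(NaN);
const rejectsAboveOne = throwsFor(1.5);
```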
855
+ **Step 4: Run tests to verify they pass**
856
+
857
+ ```bash
858
+ npx vitest run tests/config.test.ts
859
+ ```
860
+
861
+ Expected: All tests PASS.
862
+
863
+ **Step 5: Commit**
864
+
865
+ ```bash
866
+ git add src/config.ts tests/config.test.ts
867
+ git commit -m "feat: add config loader with validation and defaults"
868
+ ```
869
+
870
+ ---
871
+
872
+ ### Task 5: GLiNER Engine Wrapper
873
+
874
+ **Files:**
875
+ - Create: `src/engines/gliner.ts`
876
+ - Create: `tests/gliner.test.ts`
877
+
878
+ **Step 1: Write the failing tests**
879
+
880
+ Create `tests/gliner.test.ts`:
881
+
882
+ ```typescript
883
+ import { describe, it, expect, vi, beforeEach } from "vitest";
884
+ import { GlinerEngine } from "../src/engines/gliner.js";
885
+
886
+ // Mock the gliner npm package since we don't want to download
887
+ // a 1.4GB model in tests
888
+ vi.mock("gliner", () => {
889
+ return {
890
+ Gliner: class MockGliner {
891
+ async initialize() {}
892
+ async inference(
893
+ text: string,
894
+ labels: string[],
895
+ _opts: { threshold: number },
896
+ ) {
897
+ // Simulate GLiNER output based on the input text
898
+ const results: Array<{
899
+ text: string;
900
+ label: string;
901
+ score: number;
902
+ start: number;
903
+ end: number;
904
+ }> = [];
905
+
906
+ if (text.includes("John Smith")) {
907
+ const idx = text.indexOf("John Smith");
908
+ results.push({
909
+ text: "John Smith",
910
+ label: "person",
911
+ score: 0.95,
912
+ start: idx,
913
+ end: idx + 10,
914
+ });
915
+ }
916
+
917
+ if (text.includes("Acme Corp")) {
918
+ const idx = text.indexOf("Acme Corp");
919
+ results.push({
920
+ text: "Acme Corp",
921
+ label: "organization",
922
+ score: 0.88,
923
+ start: idx,
924
+ end: idx + 9,
925
+ });
926
+ }
927
+
928
+ // Only return entities whose labels were requested
929
+ return results.filter((r) => labels.includes(r.label));
930
+ }
931
+ },
932
+ };
933
+ });
934
+
935
+ describe("GlinerEngine", () => {
936
+ let engine: GlinerEngine;
937
+
938
+ beforeEach(async () => {
939
+ engine = new GlinerEngine("mock-model", 0.5);
940
+ await engine.initialize();
941
+ });
942
+
943
+ it("detects person entities", async () => {
944
+ const entities = await engine.scan("John Smith works here");
945
+ const persons = entities.filter((e) => e.label === "PERSON");
946
+ expect(persons).toHaveLength(1);
947
+ expect(persons[0].text).toBe("John Smith");
948
+ expect(persons[0].source).toBe("gliner");
949
+ expect(persons[0].confidence).toBe(0.95);
950
+ });
951
+
952
+ it("detects organization entities", async () => {
953
+ const entities = await engine.scan("Works at Acme Corp");
954
+ const orgs = entities.filter((e) => e.label === "ORGANIZATION");
955
+ expect(orgs).toHaveLength(1);
956
+ expect(orgs[0].text).toBe("Acme Corp");
957
+ });
958
+
959
+ it("detects multiple entity types", async () => {
960
+ const entities = await engine.scan(
961
+ "John Smith works at Acme Corp",
962
+ );
963
+ expect(entities.length).toBeGreaterThanOrEqual(2);
964
+ });
965
+
966
+ it("returns empty array for text with no entities", async () => {
967
+ const entities = await engine.scan("The weather is nice today");
968
+ expect(entities).toHaveLength(0);
969
+ });
970
+
971
+ it("includes custom labels in detection", async () => {
972
+ engine.setCustomLabels(["competitor name"]);
973
+ const entities = await engine.scan("John Smith works here");
974
+ // Custom labels are passed to GLiNER but mock doesn't generate them
975
+ // Just verify no crash
976
+ expect(entities).toBeDefined();
977
+ });
978
+
979
+ it("applies canonical type mapping", async () => {
980
+ const entities = await engine.scan("John Smith works here");
981
+ const person = entities.find((e) => e.text === "John Smith");
982
+ // "person" from GLiNER → "PERSON" canonical
983
+ expect(person?.label).toBe("PERSON");
984
+ });
985
+ });
986
+ ```
987
+
988
+ **Step 2: Run tests to verify they fail**
989
+
990
+ ```bash
991
+ npx vitest run tests/gliner.test.ts
992
+ ```
993
+
994
+ Expected: FAIL — `Cannot find module '../src/engines/gliner.js'`
995
+
996
+ **Step 3: Write the GLiNER engine wrapper**
997
+
998
+ Create `src/engines/gliner.ts`:
999
+
1000
+ ```typescript
1001
+ import type { Entity } from "../types.js";
1002
+ import { canonicalType } from "../types.js";
1003
+
1004
+ const DEFAULT_NER_LABELS = [
1005
+ "person",
1006
+ "organization",
1007
+ "location",
1008
+ "address",
1009
+ "date of birth",
1010
+ "medical record number",
1011
+ "account number",
1012
+ "passport number",
1013
+ ];
1014
+
1015
+ export class GlinerEngine {
1016
+ private model: any = null;
1017
+ private modelPath: string;
1018
+ private threshold: number;
1019
+ private customLabels: string[] = [];
1020
+ private initialized = false;
1021
+
1022
+ constructor(modelPath: string, threshold: number = 0.5) {
1023
+ this.modelPath = modelPath;
1024
+ this.threshold = threshold;
1025
+ }
1026
+
1027
+ async initialize(): Promise<void> {
1028
+ if (this.initialized) return;
1029
+
1030
+ try {
1031
+ const { Gliner } = await import("gliner");
1032
+ this.model = new Gliner({
1033
+ tokenizerPath: this.modelPath,
1034
+ onnxSettings: {
1035
+ modelPath: this.modelPath,
1036
+ executionProvider: "cpu",
1037
+ },
1038
+ maxWidth: 12,
1039
+ modelType: "gliner",
1040
+ });
1041
+ await this.model.initialize();
1042
+ this.initialized = true;
1043
+ } catch (err) {
1044
+ throw new Error(
1045
+ `Failed to initialize GLiNER model "${this.modelPath}": ${err instanceof Error ? err.message : String(err)}`,
1046
+ );
1047
+ }
1048
+ }
1049
+
1050
+ setCustomLabels(labels: string[]): void {
1051
+ this.customLabels = labels;
1052
+ }
1053
+
1054
+ async scan(text: string, extraLabels?: string[]): Promise<Entity[]> {
1055
+ if (!text) return [];
1056
+ if (!this.model) {
1057
+ throw new Error("GLiNER engine not initialized. Call initialize() first.");
1058
+ }
1059
+
1060
+ const labels = [
1061
+ ...DEFAULT_NER_LABELS,
1062
+ ...this.customLabels,
1063
+ ...(extraLabels ?? []),
1064
+ ];
1065
+
1066
+ // Deduplicate labels
1067
+ const uniqueLabels = [...new Set(labels)];
1068
+
1069
+ const results = await this.model.inference(text, uniqueLabels, {
1070
+ threshold: this.threshold,
1071
+ });
1072
+
1073
+ return results.map(
1074
+ (r: { text: string; label: string; score: number; start: number; end: number }) => ({
1075
+ text: r.text,
1076
+ label: canonicalType(r.label),
1077
+ start: r.start,
1078
+ end: r.end,
1079
+ confidence: r.score,
1080
+ source: "gliner" as const,
1081
+ }),
1082
+ );
1083
+ }
1084
+
1085
+ get isInitialized(): boolean {
1086
+ return this.initialized;
1087
+ }
1088
+ }
1089
+ ```
1090
+
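The label handling inside `scan` can be looked at in isolation. The sketch below uses a hypothetical helper, `buildLabels` (not part of the plan), with a shortened default list, to show how default, configured, and per-call labels might be merged and deduplicated before they are handed to the model. It uses `Array.from` rather than the spread in the wrapper, but the result should be the same.

```typescript
// Hypothetical helper, not part of src/engines/gliner.ts. It mirrors the
// label merging that GlinerEngine.scan performs before calling inference.
// The default list is shortened here for illustration.
const SKETCH_DEFAULT_LABELS: string[] = ["person", "organization", "location"];

function buildLabels(customLabels: string[], extraLabels?: string[]): string[] {
  const labels = SKETCH_DEFAULT_LABELS.concat(customLabels, extraLabels ?? []);
  // A Set keeps the first occurrence of each label and preserves insertion order.
  return Array.from(new Set(labels));
}

// "person" is requested twice (default + per-call) but is sent only once.
const mergedLabels = buildLabels(["competitor name"], ["person", "project codename"]);
console.log(mergedLabels);
```

Because a `Set` preserves insertion order, defaults stay first, followed by configured custom labels and then any labels passed at scan time.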
1091
+ **Step 4: Run tests to verify they pass**
1092
+
1093
+ ```bash
1094
+ npx vitest run tests/gliner.test.ts
1095
+ ```
1096
+
1097
+ Expected: All tests PASS.
1098
+
1099
+ **Step 5: Commit**
1100
+
1101
+ ```bash
1102
+ git add src/engines/gliner.ts tests/gliner.test.ts
1103
+ git commit -m "feat: add GLiNER ONNX engine wrapper with zero-shot NER"
1104
+ ```
1105
+
1106
+ ---
1107
+
1108
+ ### Task 6: Scanner (Pipeline Orchestrator)
1109
+
1110
+ **Files:**
1111
+ - Create: `src/scanner.ts`
1112
+ - Create: `tests/scanner.test.ts`
1113
+
1114
+ **Step 1: Write the failing tests**
1115
+
1116
+ Create `tests/scanner.test.ts`:
1117
+
1118
+ ```typescript
1119
+ import { describe, it, expect, vi, beforeEach } from "vitest";
1120
+ import { Scanner } from "../src/scanner.js";
1121
+ import type { FogClawConfig, Entity } from "../src/types.js";
1122
+ import { DEFAULT_CONFIG } from "../src/config.js";
1123
+
1124
+ // Mock GLiNER to avoid model downloads
1125
+ vi.mock("gliner", () => {
1126
+ return {
1127
+ Gliner: class MockGliner {
1128
+ async initialize() {}
1129
+ async inference(
1130
+ text: string,
1131
+ labels: string[],
1132
+ _opts: { threshold: number },
1133
+ ) {
1134
+ const results: any[] = [];
1135
+ if (text.includes("John Smith")) {
1136
+ const idx = text.indexOf("John Smith");
1137
+ results.push({
1138
+ text: "John Smith",
1139
+ label: "person",
1140
+ score: 0.95,
1141
+ start: idx,
1142
+ end: idx + 10,
1143
+ });
1144
+ }
1145
+ return results.filter((r) => labels.includes(r.label));
1146
+ }
1147
+ },
1148
+ };
1149
+ });
1150
+
1151
+ describe("Scanner", () => {
1152
+ let scanner: Scanner;
1153
+
1154
+ beforeEach(async () => {
1155
+ scanner = new Scanner(DEFAULT_CONFIG);
1156
+ await scanner.initialize();
1157
+ });
1158
+
1159
+ it("detects regex entities (email)", async () => {
1160
+ const result = await scanner.scan("Contact john@example.com");
1161
+ const emails = result.entities.filter((e) => e.label === "EMAIL");
1162
+ expect(emails).toHaveLength(1);
1163
+ });
1164
+
1165
+ it("detects GLiNER entities (person)", async () => {
1166
+ const result = await scanner.scan("John Smith is here");
1167
+ const persons = result.entities.filter((e) => e.label === "PERSON");
1168
+ expect(persons).toHaveLength(1);
1169
+ });
1170
+
1171
+ it("merges results from both engines", async () => {
1172
+ const result = await scanner.scan(
1173
+ "John Smith's email is john@example.com",
1174
+ );
1175
+ const labels = new Set(result.entities.map((e) => e.label));
1176
+ expect(labels.has("EMAIL")).toBe(true);
1177
+ expect(labels.has("PERSON")).toBe(true);
1178
+ });
1179
+
1180
+ it("deduplicates overlapping spans preferring higher confidence", async () => {
1181
+ const result = await scanner.scan(
1182
+ "John Smith's email is john@example.com",
1183
+ );
1184
+ // Check no duplicate spans at same position
1185
+ const seen = new Set<string>();
1186
+ for (const e of result.entities) {
1187
+ const key = `${e.start}-${e.end}`;
1188
+ expect(seen.has(key)).toBe(false);
1189
+ seen.add(key);
1190
+ }
1191
+ });
1192
+
1193
+ it("returns original text in result", async () => {
1194
+ const text = "Hello world";
1195
+ const result = await scanner.scan(text);
1196
+ expect(result.text).toBe(text);
1197
+ });
1198
+
1199
+ it("works with extra labels passed at scan time", async () => {
1200
+ const result = await scanner.scan("John Smith is here", [
1201
+ "competitor name",
1202
+ ]);
1203
+ expect(result).toBeDefined();
1204
+ });
1205
+
1206
+ it("works in regex-only mode when GLiNER fails to init", async () => {
1207
+ const failScanner = new Scanner({
1208
+ ...DEFAULT_CONFIG,
1209
+ model: "nonexistent/model",
1210
+ });
1211
+ // initialize() is never called, so glinerAvailable stays false and only the regex engine runs
1212
+ const result = await failScanner.scan("Contact john@example.com");
1213
+ const emails = result.entities.filter((e) => e.label === "EMAIL");
1214
+ expect(emails).toHaveLength(1);
1215
+ });
1216
+ });
1217
+ ```
1218
+
1219
+ **Step 2: Run tests to verify they fail**
1220
+
1221
+ ```bash
1222
+ npx vitest run tests/scanner.test.ts
1223
+ ```
1224
+
1225
+ Expected: FAIL — `Cannot find module '../src/scanner.js'`
1226
+
1227
+ **Step 3: Write the scanner**
1228
+
1229
+ Create `src/scanner.ts`:
1230
+
1231
+ ```typescript
1232
+ import type { Entity, FogClawConfig, ScanResult } from "./types.js";
1233
+ import { RegexEngine } from "./engines/regex.js";
1234
+ import { GlinerEngine } from "./engines/gliner.js";
1235
+
1236
+ export class Scanner {
1237
+ private regexEngine: RegexEngine;
1238
+ private glinerEngine: GlinerEngine;
1239
+ private glinerAvailable = false;
1240
+ private config: FogClawConfig;
1241
+
1242
+ constructor(config: FogClawConfig) {
1243
+ this.config = config;
1244
+ this.regexEngine = new RegexEngine();
1245
+ this.glinerEngine = new GlinerEngine(
1246
+ config.model,
1247
+ config.confidence_threshold,
1248
+ );
1249
+ if (config.custom_entities.length > 0) {
1250
+ this.glinerEngine.setCustomLabels(config.custom_entities);
1251
+ }
1252
+ }
1253
+
1254
+ async initialize(): Promise<void> {
1255
+ try {
1256
+ await this.glinerEngine.initialize();
1257
+ this.glinerAvailable = true;
1258
+ } catch (err) {
1259
+ console.warn(
1260
+ `[fogclaw] GLiNER failed to initialize, falling back to regex-only mode: ${err instanceof Error ? err.message : String(err)}`,
1261
+ );
1262
+ this.glinerAvailable = false;
1263
+ }
1264
+ }
1265
+
1266
+ async scan(text: string, extraLabels?: string[]): Promise<ScanResult> {
1267
+ if (!text) return { entities: [], text };
1268
+
1269
+ // Step 1: Regex pass (always runs, synchronous)
1270
+ const regexEntities = this.regexEngine.scan(text);
1271
+
1272
+ // Step 2: GLiNER pass (if available)
1273
+ let glinerEntities: Entity[] = [];
1274
+ if (this.glinerAvailable) {
1275
+ try {
1276
+ glinerEntities = await this.glinerEngine.scan(text, extraLabels);
1277
+ } catch (err) {
1278
+ console.warn(`[fogclaw] GLiNER scan failed, using regex results only: ${err instanceof Error ? err.message : String(err)}`);
1279
+ }
1280
+ }
1281
+
1282
+ // Step 3: Merge and deduplicate
1283
+ const merged = deduplicateEntities([...regexEntities, ...glinerEntities]);
1284
+
1285
+ return { entities: merged, text };
1286
+ }
1287
+ }
1288
+
1289
+ /**
1290
+ * Remove overlapping entity spans. When two entities overlap,
1291
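+ * keep the one with higher confidence. On a tie, keep the entity that sorts first: the earlier start, or regex when both start at the same offset (regex entities come first in the input and the sort is stable).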
1292
+ */
1293
+ function deduplicateEntities(entities: Entity[]): Entity[] {
1294
+ if (entities.length <= 1) return entities;
1295
+
1296
+ // Sort by start position, then by confidence descending
1297
+ const sorted = [...entities].sort((a, b) => {
1298
+ if (a.start !== b.start) return a.start - b.start;
1299
+ return b.confidence - a.confidence;
1300
+ });
1301
+
1302
+ const result: Entity[] = [sorted[0]];
1303
+
1304
+ for (let i = 1; i < sorted.length; i++) {
1305
+ const current = sorted[i];
1306
+ const last = result[result.length - 1];
1307
+
1308
+ // Check for overlap
1309
+ if (current.start < last.end) {
1310
+ // Overlapping: keep higher confidence (already in result if first)
1311
+ if (current.confidence > last.confidence) {
1312
+ result[result.length - 1] = current;
1313
+ }
1314
+ // Otherwise keep what's already in result
1315
+ } else {
1316
+ result.push(current);
1317
+ }
1318
+ }
1319
+
1320
+ return result;
1321
+ }
1322
+ ```
1323
+
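To make the overlap rule concrete, here is a small self-contained sketch of the same sort-then-sweep strategy applied to one sentence. The `SketchEntity` shape, the `dedupe` name, and the regex confidence of `1.0` are assumptions for illustration; the real `Entity` type lives in `src/types.ts`.

```typescript
// Minimal entity shape assumed for this sketch only.
interface SketchEntity {
  text: string;
  label: string;
  start: number;
  end: number;
  confidence: number;
  source: "regex" | "gliner";
}

// Same strategy as deduplicateEntities: sort by start (then confidence
// descending) and keep the higher-confidence entity when two spans overlap.
function dedupe(entities: SketchEntity[]): SketchEntity[] {
  if (entities.length <= 1) return entities;
  const sorted = entities.slice().sort((a, b) =>
    a.start !== b.start ? a.start - b.start : b.confidence - a.confidence,
  );
  const result: SketchEntity[] = [sorted[0]];
  for (let i = 1; i < sorted.length; i++) {
    const current = sorted[i];
    const last = result[result.length - 1];
    if (current.start < last.end) {
      // Overlap: replace only if the newcomer is strictly more confident.
      if (current.confidence > last.confidence) {
        result[result.length - 1] = current;
      }
    } else {
      result.push(current);
    }
  }
  return result;
}

// Offsets refer to "Contact john@example.com at Acme Corp".
const overlapping: SketchEntity[] = [
  { text: "john@example.com", label: "EMAIL", start: 8, end: 24, confidence: 1.0, source: "regex" },
  { text: "john", label: "PERSON", start: 8, end: 12, confidence: 0.7, source: "gliner" },
  { text: "Acme Corp", label: "ORGANIZATION", start: 28, end: 37, confidence: 0.88, source: "gliner" },
];

const kept = dedupe(overlapping);
console.log(kept.map((e) => `${e.label}@${e.start}-${e.end}`));
```

Here the `PERSON` span sits inside the `EMAIL` span and has lower confidence, so it is dropped, while the non-overlapping `ORGANIZATION` span is kept.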
1324
+ **Step 4: Run tests to verify they pass**
1325
+
1326
+ ```bash
1327
+ npx vitest run tests/scanner.test.ts
1328
+ ```
1329
+
1330
+ Expected: All tests PASS.
1331
+
1332
+ **Step 5: Commit**
1333
+
1334
+ ```bash
1335
+ git add src/scanner.ts tests/scanner.test.ts
1336
+ git commit -m "feat: add scanner pipeline orchestrating regex → GLiNER with dedup"
1337
+ ```
1338
+
1339
+ ---
1340
+
1341
+ ### Task 7: OpenClaw Plugin Entry Point
1342
+
1343
+ **Files:**
1344
+ - Create: `src/index.ts`
1345
+
1346
+ **Step 1: Write the plugin entry point**
1347
+
1348
+ Create `src/index.ts`:
1349
+
1350
+ ```typescript
1351
+ import { Scanner } from "./scanner.js";
1352
+ import { redact } from "./redactor.js";
1353
+ import { loadConfig } from "./config.js";
1354
+ import type { FogClawConfig, GuardrailAction } from "./types.js";
1355
+
1356
+ export { Scanner } from "./scanner.js";
1357
+ export { redact } from "./redactor.js";
1358
+ export { loadConfig, DEFAULT_CONFIG } from "./config.js";
1359
+ export type {
1360
+ Entity,
1361
+ FogClawConfig,
1362
+ ScanResult,
1363
+ RedactResult,
1364
+ RedactStrategy,
1365
+ GuardrailAction,
1366
+ } from "./types.js";
1367
+
1368
+ /**
1369
+ * OpenClaw plugin registration.
1370
+ *
1371
+ * Registers:
1372
+ * - `before_agent_start` hook for automatic PII guardrail
1373
+ * - `fogclaw_scan` tool for on-demand entity detection
1374
+ * - `fogclaw_redact` tool for on-demand redaction
1375
+ */
1376
+ export async function register(api: any) {
1377
+ const rawConfig = api.getConfig?.() ?? {};
1378
+ const config = loadConfig(rawConfig);
1379
+
1380
+ if (!config.enabled) {
1381
+ console.log("[fogclaw] Plugin disabled via config");
1382
+ return;
1383
+ }
1384
+
1385
+ const scanner = new Scanner(config);
1386
+ await scanner.initialize();
1387
+
1388
+ // --- HOOK: Guardrail on incoming messages ---
1389
+ api.registerHook("before_agent_start", async (context: any) => {
1390
+ const result = await scanner.scan(context.message);
1391
+
1392
+ if (result.entities.length === 0) return;
1393
+
1394
+ // Check for any "block" actions
1395
+ for (const entity of result.entities) {
1396
+ const action: GuardrailAction =
1397
+ config.entityActions[entity.label] ?? config.guardrail_mode;
1398
+
1399
+ if (action === "block") {
1400
+ return api.reply(
1401
+ `Message blocked: detected ${entity.label}. Please rephrase without sensitive information.`,
1402
+ );
1403
+ }
1404
+ }
1405
+
1406
+ // Check for any "warn" actions
1407
+ const warnings = result.entities.filter((e) => {
1408
+ const action = config.entityActions[e.label] ?? config.guardrail_mode;
1409
+ return action === "warn";
1410
+ });
1411
+ if (warnings.length > 0) {
1412
+ const types = [...new Set(warnings.map((w) => w.label))].join(", ");
1413
+ api.notify?.(`PII detected: ${types}`);
1414
+ }
1415
+
1416
+ // Apply redaction for "redact" action entities
1417
+ const toRedact = result.entities.filter((e) => {
1418
+ const action = config.entityActions[e.label] ?? config.guardrail_mode;
1419
+ return action === "redact";
1420
+ });
1421
+ if (toRedact.length > 0) {
1422
+ const redacted = redact(context.message, toRedact, config.redactStrategy);
1423
+ context.message = redacted.redacted_text;
1424
+ }
1425
+ });
1426
+
1427
+ // --- TOOL: On-demand scan ---
1428
+ api.registerTool({
1429
+ id: "fogclaw_scan",
1430
+ name: "Scan for PII",
1431
+ description:
1432
+ "Scan text for PII and custom entities. Returns detected entities with types, positions, and confidence scores.",
1433
+ parameters: {
1434
+ text: {
1435
+ type: "string",
1436
+ description: "Text to scan for entities",
1437
+ required: true,
1438
+ },
1439
+ custom_labels: {
1440
+ type: "array",
1441
+ description:
1442
+ "Additional entity labels for zero-shot detection (e.g., ['competitor name', 'project codename'])",
1443
+ required: false,
1444
+ },
1445
+ },
1446
+ handler: async ({
1447
+ text,
1448
+ custom_labels,
1449
+ }: {
1450
+ text: string;
1451
+ custom_labels?: string[];
1452
+ }) => {
1453
+ const result = await scanner.scan(text, custom_labels);
1454
+ return {
1455
+ entities: result.entities,
1456
+ count: result.entities.length,
1457
+ summary: result.entities.length > 0
1458
+ ? `Found ${result.entities.length} entities: ${[...new Set(result.entities.map((e) => e.label))].join(", ")}`
1459
+ : "No entities detected",
1460
+ };
1461
+ },
1462
+ });
1463
+
1464
+ // --- TOOL: On-demand redact ---
1465
+ api.registerTool({
1466
+ id: "fogclaw_redact",
1467
+ name: "Redact PII",
1468
+ description:
1469
+ "Scan and redact PII/custom entities from text. Returns sanitized text with entities replaced.",
1470
+ parameters: {
1471
+ text: {
1472
+ type: "string",
1473
+ description: "Text to scan and redact",
1474
+ required: true,
1475
+ },
1476
+ strategy: {
1477
+ type: "string",
1478
+ description:
1479
+ 'Redaction strategy: "token" ([EMAIL_1]), "mask" (****), or "hash" ([EMAIL_a1b2c3...])',
1480
+ enum: ["token", "mask", "hash"],
1481
+ required: false,
1482
+ },
1483
+ custom_labels: {
1484
+ type: "array",
1485
+ description: "Additional entity labels for zero-shot detection",
1486
+ required: false,
1487
+ },
1488
+ },
1489
+ handler: async ({
1490
+ text,
1491
+ strategy,
1492
+ custom_labels,
1493
+ }: {
1494
+ text: string;
1495
+ strategy?: "token" | "mask" | "hash";
1496
+ custom_labels?: string[];
1497
+ }) => {
1498
+ const result = await scanner.scan(text, custom_labels);
1499
+ const redacted = redact(
1500
+ text,
1501
+ result.entities,
1502
+ strategy ?? config.redactStrategy,
1503
+ );
1504
+ return {
1505
+ redacted_text: redacted.redacted_text,
1506
+ entities_found: result.entities.length,
1507
+ mapping: redacted.mapping,
1508
+ };
1509
+ },
1510
+ });
1511
+
1512
+ console.log(
1513
+ `[fogclaw] Plugin registered — guardrail: ${config.guardrail_mode}, model: ${config.model}, custom entities: ${config.custom_entities.length}`,
1514
+ );
1515
+ }
1516
+ ```
1517
+
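The hook resolves an action for every detected entity: a per-entity override in `entityActions` wins, otherwise `guardrail_mode` applies. The sketch below isolates that lookup. `resolveAction`, the `SSN` label, and the literal union used for `GuardrailAction` are assumptions for illustration, inferred from the comparisons in the hook rather than taken from `src/types.ts`.

```typescript
// Assumed union, inferred from the "block" / "warn" / "redact" checks in the hook.
type GuardrailAction = "redact" | "block" | "warn";

// Hypothetical helper: a per-entity override wins, otherwise use the global mode.
function resolveAction(
  label: string,
  entityActions: Record<string, GuardrailAction>,
  guardrailMode: GuardrailAction,
): GuardrailAction {
  return entityActions[label] ?? guardrailMode;
}

// Example configuration: block SSNs, warn on names, redact everything else.
const overrides: Record<string, GuardrailAction> = { SSN: "block", PERSON: "warn" };
const detectedLabels = ["EMAIL", "PERSON", "SSN"];

const resolved = detectedLabels.map((label) => resolveAction(label, overrides, "redact"));
console.log(resolved);

// As in the hook, a single "block" result short-circuits everything else.
const shouldBlock = resolved.some((action) => action === "block");
console.log(shouldBlock);
```

Evaluating `block` first means a message containing any blocked entity type is rejected before warnings or redaction are considered, which matches the order of the checks in the hook above.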
1518
+ **Step 2: Verify the project builds**
1519
+
1520
+ ```bash
1521
+ npx tsc
1522
+ ```
1523
+
1524
+ Expected: Clean compile, no errors.
1525
+
1526
+ **Step 3: Commit**
1527
+
1528
+ ```bash
1529
+ git add src/index.ts
1530
+ git commit -m "feat: add OpenClaw plugin entry point with hook and tool registration"
1531
+ ```
1532
+
1533
+ ---
1534
+
1535
+ ### Task 8: Run Full Test Suite & Final Verification
1536
+
1537
+ **Step 1: Run all tests**
1538
+
1539
+ ```bash
1540
+ npx vitest run
1541
+ ```
1542
+
1543
+ Expected: All tests in `regex.test.ts`, `redactor.test.ts`, `config.test.ts`, `gliner.test.ts`, and `scanner.test.ts` pass.
1544
+
1545
+ **Step 2: Verify clean build**
1546
+
1547
+ ```bash
1548
+ npx tsc
1549
+ ```
1550
+
1551
+ Expected: No errors.
1552
+
1553
+ **Step 3: Verify package structure**
1554
+
1555
+ ```bash
1556
+ ls dist/
1557
+ ```
1558
+
1559
+ Expected: `index.js`, `index.d.ts`, `types.js`, `types.d.ts`, `config.js`, `config.d.ts`, `scanner.js`, `scanner.d.ts`, `redactor.js`, `redactor.d.ts`, `engines/regex.js`, `engines/regex.d.ts`, `engines/gliner.js`, `engines/gliner.d.ts` (plus `.map` files).
1560
+
1561
+ **Step 4: Commit any remaining changes**
1562
+
1563
+ ```bash
1564
+ git add -A
1565
+ git commit -m "chore: verify full build and test suite"
1566
+ ```
1567
+
1568
+ ---
1569
+
1570
+ ### Task 9: Push to GitHub
1571
+
1572
+ **Step 1: Create the repo on GitHub**
1573
+
1574
+ ```bash
1575
+ gh repo create datafog/fogclaw --public --description "OpenClaw plugin for PII detection & custom entity redaction powered by DataFog"
1576
+ ```
1577
+
1578
+ **Step 2: Add remote and push**
1579
+
1580
+ ```bash
1581
+ git remote add origin https://github.com/datafog/fogclaw.git
1582
+ git branch -M main
1583
+ git push -u origin main
1584
+ ```
1585
+
1586
+ **Step 3: Verify on GitHub**
1587
+
1588
+ ```bash
1589
+ gh repo view datafog/fogclaw --web
1590
+ ```
1591
+
1592
+ ---
1593
+
1594
+ ## Summary
1595
+
1596
+ | Task | What | Key files |
1597
+ |------|------|-----------|
1598
+ | 1 | Repo scaffold | `package.json`, `tsconfig.json`, `openclaw.plugin.json`, `src/types.ts` |
1599
+ | 2 | Regex engine | `src/engines/regex.ts`, `tests/regex.test.ts` |
1600
+ | 3 | Redactor | `src/redactor.ts`, `tests/redactor.test.ts` |
1601
+ | 4 | Config loader | `src/config.ts`, `tests/config.test.ts` |
1602
+ | 5 | GLiNER wrapper | `src/engines/gliner.ts`, `tests/gliner.test.ts` |
1603
+ | 6 | Scanner pipeline | `src/scanner.ts`, `tests/scanner.test.ts` |
1604
+ | 7 | Plugin entry | `src/index.ts` |
1605
+ | 8 | Full verification | Run all tests + build |
1606
+ | 9 | Push to GitHub | Create repo + push |