@datafog/fogclaw 0.1.0
- package/.github/workflows/harness-docs.yml +30 -0
- package/AGENTS.md +28 -0
- package/LICENSE +21 -0
- package/README.md +208 -0
- package/dist/config.d.ts +4 -0
- package/dist/config.d.ts.map +1 -0
- package/dist/config.js +30 -0
- package/dist/config.js.map +1 -0
- package/dist/engines/gliner.d.ts +14 -0
- package/dist/engines/gliner.d.ts.map +1 -0
- package/dist/engines/gliner.js +75 -0
- package/dist/engines/gliner.js.map +1 -0
- package/dist/engines/regex.d.ts +5 -0
- package/dist/engines/regex.d.ts.map +1 -0
- package/dist/engines/regex.js +54 -0
- package/dist/engines/regex.js.map +1 -0
- package/dist/index.d.ts +19 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +157 -0
- package/dist/index.js.map +1 -0
- package/dist/redactor.d.ts +3 -0
- package/dist/redactor.d.ts.map +1 -0
- package/dist/redactor.js +37 -0
- package/dist/redactor.js.map +1 -0
- package/dist/scanner.d.ts +11 -0
- package/dist/scanner.d.ts.map +1 -0
- package/dist/scanner.js +77 -0
- package/dist/scanner.js.map +1 -0
- package/dist/types.d.ts +31 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/types.js +18 -0
- package/dist/types.js.map +1 -0
- package/docs/DATA.md +28 -0
- package/docs/DESIGN.md +17 -0
- package/docs/DOMAIN_DOCS.md +30 -0
- package/docs/FRONTEND.md +24 -0
- package/docs/OBSERVABILITY.md +25 -0
- package/docs/PLANS.md +171 -0
- package/docs/PRODUCT_SENSE.md +20 -0
- package/docs/RELIABILITY.md +60 -0
- package/docs/SECURITY.md +50 -0
- package/docs/design-docs/core-beliefs.md +17 -0
- package/docs/design-docs/index.md +8 -0
- package/docs/generated/README.md +36 -0
- package/docs/generated/memory.md +1 -0
- package/docs/plans/2026-02-16-fogclaw-design.md +172 -0
- package/docs/plans/2026-02-16-fogclaw-implementation.md +1606 -0
- package/docs/plans/README.md +15 -0
- package/docs/plans/active/2026-02-16-feat-openclaw-official-submission-plan.md +386 -0
- package/docs/plans/active/2026-02-17-feat-release-fogclaw-via-datafog-package-plan.md +318 -0
- package/docs/plans/active/2026-02-17-feat-submit-fogclaw-to-openclaw-plan.md +244 -0
- package/docs/plans/tech-debt-tracker.md +42 -0
- package/docs/plugins/fogclaw.md +95 -0
- package/docs/runbooks/address-review-findings.md +30 -0
- package/docs/runbooks/ci-failures.md +46 -0
- package/docs/runbooks/code-review.md +34 -0
- package/docs/runbooks/merge-change.md +28 -0
- package/docs/runbooks/pull-request.md +45 -0
- package/docs/runbooks/record-evidence.md +43 -0
- package/docs/runbooks/reproduce-bug.md +42 -0
- package/docs/runbooks/respond-to-feedback.md +42 -0
- package/docs/runbooks/review-findings.md +31 -0
- package/docs/runbooks/submit-openclaw-plugin.md +68 -0
- package/docs/runbooks/update-agents-md.md +59 -0
- package/docs/runbooks/update-domain-docs.md +42 -0
- package/docs/runbooks/validate-current-state.md +41 -0
- package/docs/runbooks/verify-release.md +69 -0
- package/docs/specs/2026-02-16-feat-openclaw-official-submission-spec.md +115 -0
- package/docs/specs/2026-02-17-feat-submit-fogclaw-to-openclaw.md +125 -0
- package/docs/specs/README.md +5 -0
- package/docs/specs/index.md +8 -0
- package/docs/spikes/README.md +8 -0
- package/fogclaw.config.example.json +15 -0
- package/openclaw.plugin.json +45 -0
- package/package.json +37 -0
- package/scripts/ci/he-docs-config.json +123 -0
- package/scripts/ci/he-docs-drift.sh +112 -0
- package/scripts/ci/he-docs-lint.sh +234 -0
- package/scripts/ci/he-plans-lint.sh +354 -0
- package/scripts/ci/he-runbooks-lint.sh +445 -0
- package/scripts/ci/he-specs-lint.sh +258 -0
- package/scripts/ci/he-spikes-lint.sh +249 -0
- package/scripts/runbooks/select-runbooks.sh +154 -0
- package/src/config.ts +46 -0
- package/src/engines/gliner.ts +88 -0
- package/src/engines/regex.ts +71 -0
- package/src/index.ts +223 -0
- package/src/redactor.ts +51 -0
- package/src/scanner.ts +90 -0
- package/src/types.ts +52 -0
- package/tests/config.test.ts +104 -0
- package/tests/gliner.test.ts +184 -0
- package/tests/plugin-smoke.test.ts +114 -0
- package/tests/redactor.test.ts +320 -0
- package/tests/regex.test.ts +345 -0
- package/tests/scanner.test.ts +199 -0
- package/tsconfig.json +20 -0
@@ -0,0 +1,1606 @@
# FogClaw Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build a pure TypeScript OpenClaw plugin that detects and redacts PII + custom entities using regex and GLiNER ONNX, exposed as both a message guardrail and an on-demand agent tool.

**Architecture:** Dual-engine pipeline (regex first for structured PII, GLiNER second for zero-shot NER) in a single OpenClaw plugin that registers a `before_agent_start` hook and two tools (`fogclaw_scan`, `fogclaw_redact`). Config-driven per-entity-type actions.

**Tech Stack:** TypeScript, Node.js 22+, vitest, `gliner` npm package, `onnxruntime-node`, OpenClaw plugin API.

**Design doc:** `docs/plans/2026-02-16-fogclaw-design.md`

---
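
The Architecture paragraph above describes a regex-first, GLiNER-second pipeline. A minimal sketch of how two such engines might be combined follows. The merge policy (regex hits win on overlap, model hits must clear a confidence threshold) is an assumption made for illustration, not something the plan states at this point. `Engine`, `overlaps`, and `scanPipeline` are hypothetical names, and both engines are modeled as synchronous functions for brevity even though real model inference would be asynchronous.

```typescript
// Mirrors the Entity shape from src/types.ts so the sketch is self-contained.
interface Entity {
  text: string;
  label: string;
  start: number;
  end: number;
  confidence: number;
  source: "regex" | "gliner";
}

// Hypothetical engine signature (synchronous here only to keep the sketch short).
type Engine = (text: string) => Entity[];

// Hypothetical helper: true when two spans share at least one character.
function overlaps(a: Entity, b: Entity): boolean {
  return a.start < b.end && b.start < a.end;
}

// Assumed merge policy: regex hits always win; a model hit survives only if
// it clears the confidence threshold and does not overlap any regex hit.
function scanPipeline(
  text: string,
  regexEngine: Engine,
  modelEngine: Engine,
  confidenceThreshold: number,
): Entity[] {
  const regexHits = regexEngine(text);
  const modelHits = modelEngine(text).filter(
    (hit) =>
      hit.confidence >= confidenceThreshold &&
      !regexHits.some((regexHit) => overlaps(regexHit, hit)),
  );
  // Return everything in reading order.
  return [...regexHits, ...modelHits].sort((a, b) => a.start - b.start);
}
```

Running the deterministic engine first means the structured types (email, SSN, card numbers) never depend on model confidence, while the model only contributes spans the patterns cannot express.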

### Task 1: Repository Scaffold

**Files:**
- Create: `package.json`
- Create: `tsconfig.json`
- Create: `.gitignore`
- Create: `openclaw.plugin.json`
- Create: `fogclaw.config.example.json`
- Create: `src/types.ts`

**Step 1: Initialize the repo**

Create the GitHub repo under the `datafog` org:

```bash
mkdir fogclaw && cd fogclaw
git init
```

**Step 2: Create `package.json`**

```json
{
  "name": "@datafog/fogclaw",
  "version": "0.1.0",
  "description": "OpenClaw plugin for PII detection & custom entity redaction powered by DataFog",
  "type": "module",
  "main": "dist/index.js",
  "types": "dist/index.d.ts",
  "scripts": {
    "build": "tsc",
    "test": "vitest run",
    "test:watch": "vitest",
    "lint": "tsc --noEmit"
  },
  "dependencies": {
    "gliner": "^0.2.0",
    "onnxruntime-node": "^1.20.0"
  },
  "devDependencies": {
    "@types/node": "^22.0.0",
    "typescript": "^5.7.0",
    "vitest": "^2.1.0"
  },
  "engines": {
    "node": ">=22.0.0"
  },
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/datafog/fogclaw"
  }
}
```

**Step 3: Create `tsconfig.json`**

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "lib": ["ES2022"],
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist", "tests"]
}
```

**Step 4: Create `.gitignore`**

```
node_modules/
dist/
models/
*.onnx
.env
```

**Step 5: Create `openclaw.plugin.json`**

```json
{
  "id": "fogclaw",
  "name": "FogClaw",
  "version": "0.1.0",
  "description": "PII detection & custom entity redaction powered by DataFog",
  "configSchema": {
    "type": "object",
    "properties": {
      "enabled": { "type": "boolean", "default": true },
      "guardrail_mode": {
        "type": "string",
        "enum": ["redact", "block", "warn"],
        "default": "redact"
      },
      "redactStrategy": {
        "type": "string",
        "enum": ["token", "mask", "hash"],
        "default": "token"
      },
      "model": {
        "type": "string",
        "default": "onnx-community/gliner_large-v2.1"
      },
      "confidence_threshold": {
        "type": "number",
        "default": 0.5,
        "minimum": 0,
        "maximum": 1
      },
      "custom_entities": {
        "type": "array",
        "items": { "type": "string" },
        "default": []
      },
      "entityActions": {
        "type": "object",
        "additionalProperties": {
          "type": "string",
          "enum": ["redact", "block", "warn"]
        },
        "default": {}
      }
    }
  }
}
```

**Step 6: Create `fogclaw.config.example.json`**

```json
{
  "enabled": true,
  "guardrail_mode": "redact",
  "redactStrategy": "token",
  "model": "onnx-community/gliner_large-v2.1",
  "confidence_threshold": 0.5,
  "custom_entities": ["project codename", "internal tool name"],
  "entityActions": {
    "SSN": "block",
    "CREDIT_CARD": "block",
    "EMAIL": "redact",
    "PHONE": "redact",
    "PERSON": "warn"
  }
}
```
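
The example config pairs a global `guardrail_mode` with per-label `entityActions`. One plausible reading is that a per-label entry overrides the global mode and every other label falls back to it. The sketch below illustrates that reading only; `ActionConfig` and `resolveAction` are hypothetical names, and the fallback rule is an assumption rather than something this step of the plan specifies.

```typescript
type GuardrailAction = "redact" | "block" | "warn";

// Only the two config fields this sketch needs.
interface ActionConfig {
  guardrail_mode: GuardrailAction;
  entityActions: Record<string, GuardrailAction>;
}

// Assumed lookup rule: a per-label override wins, otherwise use guardrail_mode.
function resolveAction(label: string, config: ActionConfig): GuardrailAction {
  return config.entityActions[label] ?? config.guardrail_mode;
}

const exampleConfig: ActionConfig = {
  guardrail_mode: "redact",
  entityActions: { SSN: "block", PERSON: "warn" },
};

console.log(resolveAction("SSN", exampleConfig));   // "block"
console.log(resolveAction("EMAIL", exampleConfig)); // "redact" (falls back to guardrail_mode)
```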

**Step 7: Create `src/types.ts`**

```typescript
export interface Entity {
  text: string;
  label: string;
  start: number;
  end: number;
  confidence: number;
  source: "regex" | "gliner";
}

export type RedactStrategy = "token" | "mask" | "hash";

export type GuardrailAction = "redact" | "block" | "warn";

export interface FogClawConfig {
  enabled: boolean;
  guardrail_mode: GuardrailAction;
  redactStrategy: RedactStrategy;
  model: string;
  confidence_threshold: number;
  custom_entities: string[];
  entityActions: Record<string, GuardrailAction>;
}

export interface ScanResult {
  entities: Entity[];
  text: string;
}

export interface RedactResult {
  redacted_text: string;
  mapping: Record<string, string>;
  entities: Entity[];
}

export const CANONICAL_TYPE_MAP: Record<string, string> = {
  DOB: "DATE",
  ZIP: "ZIP_CODE",
  PER: "PERSON",
  ORG: "ORGANIZATION",
  GPE: "LOCATION",
  LOC: "LOCATION",
  FAC: "ADDRESS",
  PHONE_NUMBER: "PHONE",
  SOCIAL_SECURITY_NUMBER: "SSN",
  CREDIT_CARD_NUMBER: "CREDIT_CARD",
  DATE_OF_BIRTH: "DATE",
};

export function canonicalType(entityType: string): string {
  const normalized = entityType.toUpperCase().trim();
  return CANONICAL_TYPE_MAP[normalized] ?? normalized;
}
```
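
To make the behavior of `canonicalType` concrete, here is a small usage sketch. It repeats a subset of `CANONICAL_TYPE_MAP` so the block runs on its own; the inputs are illustrative values, not taken from the plan.

```typescript
// Subset of CANONICAL_TYPE_MAP from src/types.ts, repeated so the block runs alone.
const CANONICAL_TYPE_MAP: Record<string, string> = {
  PER: "PERSON",
  GPE: "LOCATION",
  PHONE_NUMBER: "PHONE",
  DATE_OF_BIRTH: "DATE",
};

function canonicalType(entityType: string): string {
  // Upper-case and trim first, then map known aliases to their canonical label.
  const normalized = entityType.toUpperCase().trim();
  return CANONICAL_TYPE_MAP[normalized] ?? normalized;
}

console.log(canonicalType(" per "));            // "PERSON"
console.log(canonicalType("phone_number"));     // "PHONE"
console.log(canonicalType("email"));            // "EMAIL" (no alias, just upper-cased)
console.log(canonicalType("project codename")); // "PROJECT CODENAME"
```

Note that labels without an alias, including custom entity names, come back upper-cased. If action lookups are keyed by canonical label (an assumption here), `entityActions` keys for custom entities would likely need the upper-case form as well.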

**Step 8: Install dependencies & verify build**

```bash
npm install
npx tsc --noEmit
```

Expected: Clean compile, no errors.

**Step 9: Commit**

```bash
git add -A
git commit -m "chore: scaffold fogclaw repo with types, config, and plugin manifest"
```

---

### Task 2: Regex Engine

**Files:**
- Create: `src/engines/regex.ts`
- Create: `tests/regex.test.ts`

**Step 1: Write the failing tests**

Create `tests/regex.test.ts`:

```typescript
import { describe, it, expect } from "vitest";
import { RegexEngine } from "../src/engines/regex.js";

const engine = new RegexEngine();

describe("RegexEngine", () => {
  describe("EMAIL", () => {
    it("detects simple email", () => {
      const entities = engine.scan("Contact john@example.com for info");
      const emails = entities.filter((e) => e.label === "EMAIL");
      expect(emails).toHaveLength(1);
      expect(emails[0].text).toBe("john@example.com");
      expect(emails[0].confidence).toBe(1.0);
      expect(emails[0].source).toBe("regex");
    });

    it("detects email with subdomain", () => {
      const entities = engine.scan("Email first.last@example.co.uk");
      const emails = entities.filter((e) => e.label === "EMAIL");
      expect(emails).toHaveLength(1);
      expect(emails[0].text).toBe("first.last@example.co.uk");
    });

    it("detects email with plus tag", () => {
      const entities = engine.scan("Send to user+tag@example.org");
      const emails = entities.filter((e) => e.label === "EMAIL");
      expect(emails).toHaveLength(1);
    });

    it("does not match bare @", () => {
      const entities = engine.scan("@ is not an email");
      const emails = entities.filter((e) => e.label === "EMAIL");
      expect(emails).toHaveLength(0);
    });
  });

  describe("PHONE", () => {
    it("detects US phone with dashes", () => {
      const entities = engine.scan("Call 555-123-4567");
      const phones = entities.filter((e) => e.label === "PHONE");
      expect(phones).toHaveLength(1);
      expect(phones[0].text).toBe("555-123-4567");
    });

    it("detects US phone with parens", () => {
      const entities = engine.scan("Call (555) 123-4567");
      const phones = entities.filter((e) => e.label === "PHONE");
      expect(phones).toHaveLength(1);
    });

    it("detects international phone", () => {
      const entities = engine.scan("Call +44 20 7946 0958");
      const phones = entities.filter((e) => e.label === "PHONE");
      expect(phones).toHaveLength(1);
    });
  });

  describe("SSN", () => {
    it("detects SSN with dashes", () => {
      const entities = engine.scan("SSN: 123-45-6789");
      const ssns = entities.filter((e) => e.label === "SSN");
      expect(ssns).toHaveLength(1);
      expect(ssns[0].text).toBe("123-45-6789");
    });

    it("rejects SSN with area code 000", () => {
      const entities = engine.scan("SSN: 000-45-6789");
      const ssns = entities.filter((e) => e.label === "SSN");
      expect(ssns).toHaveLength(0);
    });

    it("rejects SSN with area code 666", () => {
      const entities = engine.scan("SSN: 666-45-6789");
      const ssns = entities.filter((e) => e.label === "SSN");
      expect(ssns).toHaveLength(0);
    });
  });

  describe("CREDIT_CARD", () => {
    it("detects Visa", () => {
      const entities = engine.scan("Card: 4111111111111111");
      const cards = entities.filter((e) => e.label === "CREDIT_CARD");
      expect(cards).toHaveLength(1);
    });

    it("detects Mastercard", () => {
      const entities = engine.scan("Card: 5500000000000004");
      const cards = entities.filter((e) => e.label === "CREDIT_CARD");
      expect(cards).toHaveLength(1);
    });

    it("detects Amex", () => {
      const entities = engine.scan("Card: 340000000000009");
      const cards = entities.filter((e) => e.label === "CREDIT_CARD");
      expect(cards).toHaveLength(1);
    });
  });

  describe("IP_ADDRESS", () => {
    it("detects valid IPv4", () => {
      const entities = engine.scan("Server at 192.168.1.1");
      const ips = entities.filter((e) => e.label === "IP_ADDRESS");
      expect(ips).toHaveLength(1);
      expect(ips[0].text).toBe("192.168.1.1");
    });

    it("rejects invalid octet", () => {
      const entities = engine.scan("Not valid: 256.168.1.1");
      const ips = entities.filter((e) => e.label === "IP_ADDRESS");
      expect(ips).toHaveLength(0);
    });
  });

  describe("DATE", () => {
    it("detects MM/DD/YYYY", () => {
      const entities = engine.scan("Born on 01/15/1990");
      const dates = entities.filter((e) => e.label === "DATE");
      expect(dates).toHaveLength(1);
    });

    it("detects YYYY-MM-DD", () => {
      const entities = engine.scan("Date: 2020-01-15");
      const dates = entities.filter((e) => e.label === "DATE");
      expect(dates).toHaveLength(1);
    });

    it("detects Month DD, YYYY", () => {
      const entities = engine.scan("Born January 15, 2000");
      const dates = entities.filter((e) => e.label === "DATE");
      expect(dates).toHaveLength(1);
    });
  });

  describe("ZIP_CODE", () => {
    it("detects 5-digit zip", () => {
      const entities = engine.scan("ZIP: 10001");
      const zips = entities.filter((e) => e.label === "ZIP_CODE");
      expect(zips).toHaveLength(1);
    });

    it("detects zip+4", () => {
      const entities = engine.scan("ZIP: 10001-1234");
      const zips = entities.filter((e) => e.label === "ZIP_CODE");
      expect(zips).toHaveLength(1);
    });
  });

  describe("multiple entities", () => {
    it("detects multiple entity types in one text", () => {
      const text =
        "John's email is john@example.com, phone 555-123-4567, SSN 123-45-6789";
      const entities = engine.scan(text);
      const labels = new Set(entities.map((e) => e.label));
      expect(labels.has("EMAIL")).toBe(true);
      expect(labels.has("PHONE")).toBe(true);
      expect(labels.has("SSN")).toBe(true);
    });
  });

  describe("empty input", () => {
    it("returns empty array for empty string", () => {
      const entities = engine.scan("");
      expect(entities).toHaveLength(0);
    });
  });

  describe("span offsets", () => {
    it("returns correct start/end offsets", () => {
      const text = "Email: john@example.com here";
      const entities = engine.scan(text);
      const email = entities.find((e) => e.label === "EMAIL")!;
      expect(text.slice(email.start, email.end)).toBe("john@example.com");
    });
  });
});
```

**Step 2: Run tests to verify they fail**

```bash
npx vitest run tests/regex.test.ts
```

Expected: FAIL with `Cannot find module '../src/engines/regex.js'`

**Step 3: Write the regex engine**

Create `src/engines/regex.ts`:

```typescript
import type { Entity } from "../types.js";

interface PatternDef {
  label: string;
  pattern: RegExp;
  /** Canonical label to use in output (e.g., DOB → DATE) */
  canonicalLabel?: string;
}

const PATTERNS: PatternDef[] = [
  {
    label: "EMAIL",
    pattern:
      /(?<![A-Za-z0-9._%+\-@])(?![A-Za-z_]{2,20}=)[A-Za-z0-9!#$%&*+\-/=^_`{|}~][A-Za-z0-9!#$%&'*+\-/=?^_`{|}~.]*@(?:\.?[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?=$|[^A-Za-z])/gim,
  },
  {
    label: "PHONE",
    pattern:
      /(?<![A-Za-z0-9])(?:(?:(?:\+?1)[-.\s]?)?(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}|\+\d{1,3}[\s\-.]?\d{1,4}(?:[\s\-.]?\d{2,4}){2,3})(?![-A-Za-z0-9])/gim,
  },
  {
    label: "SSN",
    pattern:
      /(?<!\d)(?:(?!000|666)\d{3}-(?!00)\d{2}-(?!0000)\d{4}|(?!000|666)\d{3}(?!00)\d{2}(?!0000)\d{4})(?!\d)/gm,
  },
  {
    label: "CREDIT_CARD",
    pattern:
      /\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13}|(?:(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2})[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4})|(?:3[47]\d{2}[-\s]?\d{6}[-\s]?\d{5}))\b/gm,
  },
  {
    label: "IP_ADDRESS",
    pattern:
      /\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.(?:25[0-5]|2[0-4]\d|1?\d?\d)\.(?:25[0-5]|2[0-4]\d|1?\d?\d)\.(?:25[0-5]|2[0-4]\d|1?\d?\d))\b/gm,
  },
  {
    label: "DATE",
    canonicalLabel: "DATE",
    pattern:
      /\b(?:(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:\d{2}|\d{4})|(?:\d{4})-(?:0?[1-9]|1[0-2])-(?:0?[1-9]|[12]\d|3[01])|(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+(?:0?[1-9]|[12]\d|3[01]),\s+(?:19|20)\d{2})\b/gim,
  },
  {
    label: "ZIP_CODE",
    pattern: /\b\d{5}(?:-\d{4})?\b/gm,
  },
];

export class RegexEngine {
  scan(text: string): Entity[] {
    if (!text) return [];

    const entities: Entity[] = [];

    for (const { label, pattern, canonicalLabel } of PATTERNS) {
      // Reset lastIndex since we reuse the regex
      pattern.lastIndex = 0;

      let match: RegExpExecArray | null;
      while ((match = pattern.exec(text)) !== null) {
        entities.push({
          text: match[0],
          label: canonicalLabel ?? label,
          start: match.index,
          end: match.index + match[0].length,
          confidence: 1.0,
          source: "regex",
        });
      }
    }

    return entities;
  }
}
```
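
The `pattern.lastIndex = 0` line in `scan` matters because a `RegExp` with the `g` flag remembers where its previous search stopped. The sketch below reproduces the pitfall in isolation. `threeDigits` and `findAll` are hypothetical names used only for this illustration and are not part of the engine.

```typescript
// A shared global regex, like the module-level PATTERNS entries.
const threeDigits = /\d{3}/g;

// Collects every match; `reset` toggles the defensive lastIndex reset.
function findAll(pattern: RegExp, text: string, reset: boolean): string[] {
  if (reset) pattern.lastIndex = 0;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[0]);
  }
  return found;
}

// Simulate an earlier caller that stopped after the first hit.
threeDigits.test("123 456"); // leaves lastIndex at 3
console.log(findAll(threeDigits, "123 456", false)); // ["456"], the first match is skipped

threeDigits.test("123 456"); // leaves lastIndex at 3 again
console.log(findAll(threeDigits, "123 456", true)); // ["123", "456"]
```

An `exec` loop that runs until it returns `null` resets `lastIndex` on its own, so the explicit reset mainly guards against a previous scan that exited early or threw.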

**Step 4: Run tests to verify they pass**

```bash
npx vitest run tests/regex.test.ts
```

Expected: All tests PASS.

**Step 5: Commit**

```bash
git add src/engines/regex.ts tests/regex.test.ts
git commit -m "feat: add regex engine with ported DataFog PII patterns"
```

---

### Task 3: Redactor

**Files:**
- Create: `src/redactor.ts`
- Create: `tests/redactor.test.ts`

**Step 1: Write the failing tests**

Create `tests/redactor.test.ts`:

```typescript
import { describe, it, expect } from "vitest";
import { redact } from "../src/redactor.js";
import type { Entity } from "../src/types.js";

const email: Entity = {
  text: "john@example.com",
  label: "EMAIL",
  start: 8,
  end: 24,
  confidence: 1.0,
  source: "regex",
};

const phone: Entity = {
  text: "555-123-4567",
  label: "PHONE",
  start: 31,
  end: 43,
  confidence: 1.0,
  source: "regex",
};

const baseText = "Contact john@example.com, call 555-123-4567 please";

describe("redact", () => {
  describe("token strategy", () => {
    it("replaces entities with type tokens", () => {
      const result = redact(baseText, [email, phone], "token");
      expect(result.redacted_text).toContain("[EMAIL_1]");
      expect(result.redacted_text).toContain("[PHONE_1]");
      expect(result.redacted_text).not.toContain("john@example.com");
      expect(result.redacted_text).not.toContain("555-123-4567");
    });

    it("increments counter for same type", () => {
      const email2: Entity = {
        text: "jane@example.com",
        label: "EMAIL",
        start: 32,
        end: 48,
        confidence: 1.0,
        source: "regex",
      };
      const text = "Email john@example.com and also jane@example.com";
      const result = redact(text, [
        { ...email, start: 6, end: 22 },
        { ...email2 },
      ], "token");
      expect(result.redacted_text).toContain("[EMAIL_1]");
      expect(result.redacted_text).toContain("[EMAIL_2]");
    });

    it("builds mapping from replacement to original", () => {
      const result = redact(baseText, [email], "token");
      expect(result.mapping["[EMAIL_1]"]).toBe("john@example.com");
    });
  });

  describe("mask strategy", () => {
    it("replaces with asterisks matching length", () => {
      const result = redact("Contact john@example.com", [
        { ...email, start: 8, end: 24 },
      ], "mask");
      expect(result.redacted_text).toBe("Contact ****************");
    });
  });

  describe("hash strategy", () => {
    it("replaces with type and hash prefix", () => {
      const result = redact("Contact john@example.com", [
        { ...email, start: 8, end: 24 },
      ], "hash");
      expect(result.redacted_text).toMatch(/Contact \[EMAIL_[a-f0-9]{12}\]/);
    });

    it("produces consistent hashes for same input", () => {
      const r1 = redact("Contact john@example.com", [
        { ...email, start: 8, end: 24 },
      ], "hash");
      const r2 = redact("Contact john@example.com", [
        { ...email, start: 8, end: 24 },
      ], "hash");
      expect(r1.redacted_text).toBe(r2.redacted_text);
    });
  });

  describe("empty input", () => {
    it("returns original text when no entities", () => {
      const result = redact("Hello world", [], "token");
      expect(result.redacted_text).toBe("Hello world");
      expect(result.entities).toHaveLength(0);
    });
  });

  describe("entity ordering", () => {
    it("handles entities in any order without offset corruption", () => {
      const result = redact(baseText, [phone, email], "token");
      expect(result.redacted_text).not.toContain("john@example.com");
      expect(result.redacted_text).not.toContain("555-123-4567");
    });
  });
});
```

**Step 2: Run tests to verify they fail**

```bash
npx vitest run tests/redactor.test.ts
```

Expected: FAIL with `Cannot find module '../src/redactor.js'`

**Step 3: Write the redactor**

Create `src/redactor.ts`:

```typescript
import { createHash } from "node:crypto";
import type { Entity, RedactResult, RedactStrategy } from "./types.js";

export function redact(
  text: string,
  entities: Entity[],
  strategy: RedactStrategy = "token",
): RedactResult {
  if (entities.length === 0) {
    return { redacted_text: text, mapping: {}, entities: [] };
  }

  // Sort by start position descending so we can replace from end to start
  // without corrupting earlier offsets
  const sorted = [...entities].sort((a, b) => b.start - a.start);

  const counters: Record<string, number> = {};
  const mapping: Record<string, string> = {};
  let result = text;

  for (const entity of sorted) {
    const replacement = makeReplacement(entity, strategy, counters);
    mapping[replacement] = entity.text;
    result = result.slice(0, entity.start) + replacement + result.slice(entity.end);
  }

  return { redacted_text: result, mapping, entities };
}

function makeReplacement(
  entity: Entity,
  strategy: RedactStrategy,
  counters: Record<string, number>,
): string {
  switch (strategy) {
    case "token": {
      counters[entity.label] = (counters[entity.label] ?? 0) + 1;
      return `[${entity.label}_${counters[entity.label]}]`;
    }
    case "mask": {
      return "*".repeat(Math.max(entity.text.length, 1));
    }
    case "hash": {
      const digest = createHash("sha256")
        .update(entity.text)
        .digest("hex")
        .slice(0, 12);
      return `[${entity.label}_${digest}]`;
    }
  }
}
```
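
One consequence of the `redact` function above is worth spelling out. Because the token counters are incremented while walking the descending-sorted list, the entity that appears last in the text receives `_1`. If left-to-right numbering is preferred, a two-pass variant is one option: number the spans in reading order, then apply the replacements from the end of the string. This is a sketch of that alternative, not the plan's implementation; `Span` and `redactInReadingOrder` are hypothetical names, and only the token strategy is covered.

```typescript
// Minimal span shape; a trimmed-down stand-in for Entity.
interface Span {
  text: string;
  label: string;
  start: number;
  end: number;
}

function redactInReadingOrder(text: string, spans: Span[]): string {
  // Pass 1: assign token numbers left to right.
  const ascending = [...spans].sort((a, b) => a.start - b.start);
  const counters: Record<string, number> = {};
  const planned = ascending.map((span) => {
    counters[span.label] = (counters[span.label] ?? 0) + 1;
    return { span, token: `[${span.label}_${counters[span.label]}]` };
  });

  // Pass 2: apply right to left so earlier offsets stay valid.
  let result = text;
  for (const { span, token } of planned.reverse()) {
    result = result.slice(0, span.start) + token + result.slice(span.end);
  }
  return result;
}

const sample = "Email john@example.com and also jane@example.com";
console.log(
  redactInReadingOrder(sample, [
    { text: "jane@example.com", label: "EMAIL", start: 32, end: 48 },
    { text: "john@example.com", label: "EMAIL", start: 6, end: 22 },
  ]),
); // "Email [EMAIL_1] and also [EMAIL_2]"
```

Both versions rely on the same invariant: replacing from the highest offset downward never shifts the positions of spans that have not been processed yet.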

**Step 4: Run tests to verify they pass**

```bash
npx vitest run tests/redactor.test.ts
```

Expected: All tests PASS.

**Step 5: Commit**

```bash
git add src/redactor.ts tests/redactor.test.ts
git commit -m "feat: add redactor with token, mask, and hash strategies"
```

---

### Task 4: Config Loader

**Files:**
- Create: `src/config.ts`
- Create: `tests/config.test.ts`

**Step 1: Write the failing tests**

Create `tests/config.test.ts`:

```typescript
import { describe, it, expect } from "vitest";
import { loadConfig, DEFAULT_CONFIG } from "../src/config.js";

describe("loadConfig", () => {
  it("returns defaults when no overrides", () => {
    const config = loadConfig({});
    expect(config.enabled).toBe(true);
    expect(config.guardrail_mode).toBe("redact");
    expect(config.redactStrategy).toBe("token");
    expect(config.model).toBe("onnx-community/gliner_large-v2.1");
    expect(config.confidence_threshold).toBe(0.5);
    expect(config.custom_entities).toEqual([]);
    expect(config.entityActions).toEqual({});
  });

  it("merges partial overrides with defaults", () => {
    const config = loadConfig({
      guardrail_mode: "block",
      custom_entities: ["competitor name"],
    });
    expect(config.guardrail_mode).toBe("block");
    expect(config.custom_entities).toEqual(["competitor name"]);
    expect(config.enabled).toBe(true); // default preserved
  });

  it("validates guardrail_mode", () => {
    expect(() => loadConfig({ guardrail_mode: "invalid" as any })).toThrow();
  });

  it("validates confidence_threshold range", () => {
    expect(() => loadConfig({ confidence_threshold: -0.1 })).toThrow();
    expect(() => loadConfig({ confidence_threshold: 1.5 })).toThrow();
  });

  it("validates entityActions values", () => {
    expect(() =>
      loadConfig({ entityActions: { EMAIL: "invalid" as any } }),
    ).toThrow();
  });
});
```

**Step 2: Run tests to verify they fail**

```bash
npx vitest run tests/config.test.ts
```

Expected: FAIL with `Cannot find module '../src/config.js'`

**Step 3: Write the config loader**

Create `src/config.ts`:

```typescript
import type { FogClawConfig, GuardrailAction, RedactStrategy } from "./types.js";

const VALID_GUARDRAIL_MODES: GuardrailAction[] = ["redact", "block", "warn"];
const VALID_REDACT_STRATEGIES: RedactStrategy[] = ["token", "mask", "hash"];

export const DEFAULT_CONFIG: FogClawConfig = {
  enabled: true,
  guardrail_mode: "redact",
  redactStrategy: "token",
  model: "onnx-community/gliner_large-v2.1",
  confidence_threshold: 0.5,
  custom_entities: [],
  entityActions: {},
};

export function loadConfig(overrides: Partial<FogClawConfig>): FogClawConfig {
  // Overrides win over defaults, key by key (shallow merge).
  const config: FogClawConfig = { ...DEFAULT_CONFIG, ...overrides };

  if (!VALID_GUARDRAIL_MODES.includes(config.guardrail_mode)) {
    throw new Error(
      `Invalid guardrail_mode "${config.guardrail_mode}". Must be one of: ${VALID_GUARDRAIL_MODES.join(", ")}`,
    );
  }

  if (!VALID_REDACT_STRATEGIES.includes(config.redactStrategy)) {
    throw new Error(
      `Invalid redactStrategy "${config.redactStrategy}". Must be one of: ${VALID_REDACT_STRATEGIES.join(", ")}`,
    );
  }

  // NaN is rejected explicitly because it passes both range comparisons.
  if (
    Number.isNaN(config.confidence_threshold) ||
    config.confidence_threshold < 0 ||
    config.confidence_threshold > 1
  ) {
    throw new Error(
      `confidence_threshold must be between 0 and 1, got ${config.confidence_threshold}`,
    );
  }

  for (const [entityType, action] of Object.entries(config.entityActions)) {
    if (!VALID_GUARDRAIL_MODES.includes(action)) {
      throw new Error(
        `Invalid action "${action}" for entity type "${entityType}". Must be one of: ${VALID_GUARDRAIL_MODES.join(", ")}`,
      );
    }
  }

  return config;
}
```
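
The merge-then-validate pattern can also be tried in isolation. The following is a minimal sketch using a cut-down, hypothetical `MiniConfig` type and `loadMiniConfig` function (neither exists in the package). It only illustrates how object spread layers overrides onto defaults, plus one edge case worth knowing: a key explicitly set to `undefined` still overwrites the default.

```typescript
// Hypothetical cut-down config for illustration; the real FogClawConfig has more fields.
interface MiniConfig {
  guardrail_mode: "redact" | "block" | "warn";
  confidence_threshold: number;
}

const MINI_DEFAULTS: MiniConfig = {
  guardrail_mode: "redact",
  confidence_threshold: 0.5,
};

function loadMiniConfig(overrides: Partial<MiniConfig>): MiniConfig {
  // Later spreads win, so overrides replace defaults key by key.
  const merged = { ...MINI_DEFAULTS, ...overrides } as MiniConfig;
  const t = merged.confidence_threshold;
  if (typeof t !== "number" || Number.isNaN(t) || t < 0 || t > 1) {
    throw new Error(`confidence_threshold must be between 0 and 1, got ${t}`);
  }
  return merged;
}

// A partial override keeps every default it does not mention.
const partial = loadMiniConfig({ guardrail_mode: "block" });

// Out-of-range values throw.
let rejected = false;
try {
  loadMiniConfig({ confidence_threshold: 1.5 });
} catch {
  rejected = true;
}

// An explicit `undefined` is copied by spread and replaces the default,
// so this call also throws instead of falling back to 0.5.
let undefinedRejected = false;
try {
  loadMiniConfig({ confidence_threshold: undefined });
} catch {
  undefinedRejected = true;
}
```

If config objects may arrive with `undefined` values (for example from a host that serializes missing keys that way), filtering those keys out before the spread is one way to keep the defaults.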

**Step 4: Run tests to verify they pass**

```bash
npx vitest run tests/config.test.ts
```

Expected: All tests PASS.

**Step 5: Commit**

```bash
git add src/config.ts tests/config.test.ts
git commit -m "feat: add config loader with validation and defaults"
```

---

### Task 5: GLiNER Engine Wrapper

**Files:**
- Create: `src/engines/gliner.ts`
- Create: `tests/gliner.test.ts`

**Step 1: Write the failing tests**

Create `tests/gliner.test.ts`:

```typescript
import { describe, it, expect, vi, beforeEach } from "vitest";
import { GlinerEngine } from "../src/engines/gliner.js";

// Mock the gliner npm package since we don't want to download
// a 1.4GB model in tests
vi.mock("gliner", () => {
  return {
    Gliner: class MockGliner {
      async initialize() {}
      async inference(
        text: string,
        labels: string[],
        _opts: { threshold: number },
      ) {
        // Simulate GLiNER output based on the input text
        const results: Array<{
          text: string;
          label: string;
          score: number;
          start: number;
          end: number;
        }> = [];

        if (text.includes("John Smith")) {
          const idx = text.indexOf("John Smith");
          results.push({
            text: "John Smith",
            label: "person",
            score: 0.95,
            start: idx,
            end: idx + 10,
          });
        }

        if (text.includes("Acme Corp")) {
          const idx = text.indexOf("Acme Corp");
          results.push({
            text: "Acme Corp",
            label: "organization",
            score: 0.88,
            start: idx,
            end: idx + 9,
          });
        }

        // Only return entities whose labels were requested
        return results.filter((r) => labels.includes(r.label));
      }
    },
  };
});

describe("GlinerEngine", () => {
  let engine: GlinerEngine;

  beforeEach(async () => {
    engine = new GlinerEngine("mock-model", 0.5);
    await engine.initialize();
  });

  it("detects person entities", async () => {
    const entities = await engine.scan("John Smith works here");
    const persons = entities.filter((e) => e.label === "PERSON");
    expect(persons).toHaveLength(1);
    expect(persons[0].text).toBe("John Smith");
    expect(persons[0].source).toBe("gliner");
    expect(persons[0].confidence).toBe(0.95);
  });

  it("detects organization entities", async () => {
    const entities = await engine.scan("Works at Acme Corp");
    const orgs = entities.filter((e) => e.label === "ORGANIZATION");
    expect(orgs).toHaveLength(1);
    expect(orgs[0].text).toBe("Acme Corp");
  });

  it("detects multiple entity types", async () => {
    const entities = await engine.scan(
      "John Smith works at Acme Corp",
    );
    expect(entities.length).toBeGreaterThanOrEqual(2);
  });

  it("returns empty array for text with no entities", async () => {
    const entities = await engine.scan("The weather is nice today");
    expect(entities).toHaveLength(0);
  });

  it("includes custom labels in detection", async () => {
    engine.setCustomLabels(["competitor name"]);
    const entities = await engine.scan("John Smith works here");
    // Custom labels are passed to GLiNER but mock doesn't generate them
    // Just verify no crash
    expect(entities).toBeDefined();
  });

  it("applies canonical type mapping", async () => {
    const entities = await engine.scan("John Smith works here");
    const person = entities.find((e) => e.text === "John Smith");
    // "person" from GLiNER → "PERSON" canonical
    expect(person?.label).toBe("PERSON");
  });
});
```

**Step 2: Run tests to verify they fail**

```bash
npx vitest run tests/gliner.test.ts
```

Expected: FAIL with `Cannot find module '../src/engines/gliner.js'`

**Step 3: Write the GLiNER engine wrapper**

Create `src/engines/gliner.ts`:

```typescript
import type { Entity } from "../types.js";
import { canonicalType } from "../types.js";

const DEFAULT_NER_LABELS = [
  "person",
  "organization",
  "location",
  "address",
  "date of birth",
  "medical record number",
  "account number",
  "passport number",
];

export class GlinerEngine {
  private model: any = null;
  private modelPath: string;
  private threshold: number;
  private customLabels: string[] = [];
  private initialized = false;

  constructor(modelPath: string, threshold: number = 0.5) {
    this.modelPath = modelPath;
    this.threshold = threshold;
  }

  async initialize(): Promise<void> {
    if (this.initialized) return;

    try {
      const { Gliner } = await import("gliner");
      this.model = new Gliner({
        tokenizerPath: this.modelPath,
        onnxSettings: {
          modelPath: this.modelPath,
          executionProvider: "cpu",
        },
        maxWidth: 12,
        modelType: "gliner",
      });
      await this.model.initialize();
      this.initialized = true;
    } catch (err) {
      throw new Error(
        `Failed to initialize GLiNER model "${this.modelPath}": ${err instanceof Error ? err.message : String(err)}`,
      );
    }
  }

  setCustomLabels(labels: string[]): void {
    this.customLabels = labels;
  }

  async scan(text: string, extraLabels?: string[]): Promise<Entity[]> {
    if (!text) return [];
    if (!this.model) {
      throw new Error("GLiNER engine not initialized. Call initialize() first.");
    }

    const labels = [
      ...DEFAULT_NER_LABELS,
      ...this.customLabels,
      ...(extraLabels ?? []),
    ];

    // Deduplicate labels
    const uniqueLabels = [...new Set(labels)];

    const results = await this.model.inference(text, uniqueLabels, {
      threshold: this.threshold,
    });

    return results.map(
      (r: { text: string; label: string; score: number; start: number; end: number }) => ({
        text: r.text,
        label: canonicalType(r.label),
        start: r.start,
        end: r.end,
        confidence: r.score,
        source: "gliner" as const,
      }),
    );
  }

  get isInitialized(): boolean {
    return this.initialized;
  }
}
```
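
As a small illustration of how `scan` assembles its label list, the sketch below combines built-in, custom, and per-call labels and removes duplicates. `buildLabels` is a hypothetical helper written for this example; the plan inlines this logic inside `GlinerEngine.scan`.

```typescript
// Hypothetical helper mirroring the label assembly in GlinerEngine.scan.
function buildLabels(
  defaults: string[],
  custom: string[],
  extra: string[] = [],
): string[] {
  // A Set drops exact duplicates and keeps first-seen insertion order.
  return [...new Set([...defaults, ...custom, ...extra])];
}

const labels = buildLabels(
  ["person", "organization"],
  ["competitor name"],
  ["person", "project codename"],
);
// labels: ["person", "organization", "competitor name", "project codename"]
```

Matching is exact, so `"Person"` and `"person"` would both be kept. Normalizing case before deduplicating is an option if callers may pass mixed-case labels.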

**Step 4: Run tests to verify they pass**

```bash
npx vitest run tests/gliner.test.ts
```

Expected: All tests PASS.

**Step 5: Commit**

```bash
git add src/engines/gliner.ts tests/gliner.test.ts
git commit -m "feat: add GLiNER ONNX engine wrapper with zero-shot NER"
```

---

### Task 6: Scanner (Pipeline Orchestrator)

**Files:**
- Create: `src/scanner.ts`
- Create: `tests/scanner.test.ts`

**Step 1: Write the failing tests**

Create `tests/scanner.test.ts`:

```typescript
import { describe, it, expect, vi, beforeEach } from "vitest";
import { Scanner } from "../src/scanner.js";
import type { FogClawConfig, Entity } from "../src/types.js";
import { DEFAULT_CONFIG } from "../src/config.js";

// Mock GLiNER to avoid model downloads
vi.mock("gliner", () => {
  return {
    Gliner: class MockGliner {
      async initialize() {}
      async inference(
        text: string,
        labels: string[],
        _opts: { threshold: number },
      ) {
        const results: any[] = [];
        if (text.includes("John Smith")) {
          const idx = text.indexOf("John Smith");
          results.push({
            text: "John Smith",
            label: "person",
            score: 0.95,
            start: idx,
            end: idx + 10,
          });
        }
        return results.filter((r) => labels.includes(r.label));
      }
    },
  };
});

describe("Scanner", () => {
  let scanner: Scanner;

  beforeEach(async () => {
    scanner = new Scanner(DEFAULT_CONFIG);
    await scanner.initialize();
  });

  it("detects regex entities (email)", async () => {
    const result = await scanner.scan("Contact john@example.com");
    const emails = result.entities.filter((e) => e.label === "EMAIL");
    expect(emails).toHaveLength(1);
  });

  it("detects GLiNER entities (person)", async () => {
    const result = await scanner.scan("John Smith is here");
    const persons = result.entities.filter((e) => e.label === "PERSON");
    expect(persons).toHaveLength(1);
  });

  it("merges results from both engines", async () => {
    const result = await scanner.scan(
      "John Smith's email is john@example.com",
    );
    const labels = new Set(result.entities.map((e) => e.label));
    expect(labels.has("EMAIL")).toBe(true);
    expect(labels.has("PERSON")).toBe(true);
  });

  it("deduplicates overlapping spans preferring higher confidence", async () => {
    const result = await scanner.scan(
      "John Smith's email is john@example.com",
    );
    // Check no duplicate spans at same position
    const seen = new Set<string>();
    for (const e of result.entities) {
      const key = `${e.start}-${e.end}`;
      expect(seen.has(key)).toBe(false);
      seen.add(key);
    }
  });

  it("returns original text in result", async () => {
    const text = "Hello world";
    const result = await scanner.scan(text);
    expect(result.text).toBe(text);
  });

  it("works with extra labels passed at scan time", async () => {
    const result = await scanner.scan("John Smith is here", [
      "competitor name",
    ]);
    expect(result).toBeDefined();
  });

  it("works in regex-only mode when GLiNER is not initialized", async () => {
    const failScanner = new Scanner({
      ...DEFAULT_CONFIG,
      model: "nonexistent/model",
    });
    // initialize() is never called here, so GLiNER stays unavailable
    // and the scanner should fall back to regex-only
    const result = await failScanner.scan("Contact john@example.com");
    const emails = result.entities.filter((e) => e.label === "EMAIL");
    expect(emails).toHaveLength(1);
  });
});
```

**Step 2: Run tests to verify they fail**

```bash
npx vitest run tests/scanner.test.ts
```

Expected: FAIL with `Cannot find module '../src/scanner.js'`

**Step 3: Write the scanner**

Create `src/scanner.ts`:

```typescript
import type { Entity, FogClawConfig, ScanResult } from "./types.js";
import { RegexEngine } from "./engines/regex.js";
import { GlinerEngine } from "./engines/gliner.js";

export class Scanner {
  private regexEngine: RegexEngine;
  private glinerEngine: GlinerEngine;
  private glinerAvailable = false;
  private config: FogClawConfig;

  constructor(config: FogClawConfig) {
    this.config = config;
    this.regexEngine = new RegexEngine();
    this.glinerEngine = new GlinerEngine(
      config.model,
      config.confidence_threshold,
    );
    if (config.custom_entities.length > 0) {
      this.glinerEngine.setCustomLabels(config.custom_entities);
    }
  }

  async initialize(): Promise<void> {
    try {
      await this.glinerEngine.initialize();
      this.glinerAvailable = true;
    } catch (err) {
      console.warn(
        `[fogclaw] GLiNER failed to initialize, falling back to regex-only mode: ${err instanceof Error ? err.message : String(err)}`,
      );
      this.glinerAvailable = false;
    }
  }

  async scan(text: string, extraLabels?: string[]): Promise<ScanResult> {
    if (!text) return { entities: [], text };

    // Step 1: Regex pass (always runs, synchronous)
    const regexEntities = this.regexEngine.scan(text);

    // Step 2: GLiNER pass (if available)
    let glinerEntities: Entity[] = [];
    if (this.glinerAvailable) {
      try {
        glinerEntities = await this.glinerEngine.scan(text, extraLabels);
      } catch (err) {
        console.warn(
          `[fogclaw] GLiNER scan failed, using regex results only: ${err instanceof Error ? err.message : String(err)}`,
        );
      }
    }

    // Step 3: Merge and deduplicate
    const merged = deduplicateEntities([...regexEntities, ...glinerEntities]);

    return { entities: merged, text };
  }
}

/**
 * Remove overlapping entity spans. When two entities overlap,
 * keep the one with higher confidence. On a confidence tie, the
 * entity that sorts first is kept: the earlier start wins, and for
 * identical starts the regex result wins because regex entities are
 * listed first in the input and Array.prototype.sort is stable.
 */
function deduplicateEntities(entities: Entity[]): Entity[] {
  if (entities.length <= 1) return entities;

  // Sort by start position, then by confidence descending
  const sorted = [...entities].sort((a, b) => {
    if (a.start !== b.start) return a.start - b.start;
    return b.confidence - a.confidence;
  });

  const result: Entity[] = [sorted[0]];

  for (let i = 1; i < sorted.length; i++) {
    const current = sorted[i];
    const last = result[result.length - 1];

    // Check for overlap
    if (current.start < last.end) {
      // Overlapping: keep higher confidence (already in result if first)
      if (current.confidence > last.confidence) {
        result[result.length - 1] = current;
      }
      // Otherwise keep what's already in result
    } else {
      result.push(current);
    }
  }

  return result;
}
```
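
A worked example may help show what the overlap pass keeps and drops. This is a standalone sketch: `Span` and `dedupe` are hypothetical stand-ins for the plan's `Entity` type and `deduplicateEntities`, and the `"regex"` source value and the regex confidence of `1.0` are assumptions made for the illustration.

```typescript
// Minimal span shape for this sketch; the plan's Entity type also carries `text`.
interface Span {
  label: string;
  start: number;
  end: number;
  confidence: number;
  source: "regex" | "gliner"; // "regex" is an assumed value
}

function dedupe(spans: Span[]): Span[] {
  if (spans.length <= 1) return spans;
  // Sort by start ascending, then confidence descending.
  const sorted = [...spans].sort((a, b) =>
    a.start !== b.start ? a.start - b.start : b.confidence - a.confidence,
  );
  const result: Span[] = [sorted[0]];
  for (let i = 1; i < sorted.length; i++) {
    const current = sorted[i];
    const last = result[result.length - 1];
    if (current.start < last.end) {
      // Overlap: replace only when the newcomer is strictly more confident.
      if (current.confidence > last.confidence) {
        result[result.length - 1] = current;
      }
    } else {
      result.push(current);
    }
  }
  return result;
}

// Spans for "John Smith's email is john@example.com".
const kept = dedupe([
  { label: "EMAIL", start: 22, end: 38, confidence: 1.0, source: "regex" },
  { label: "PERSON", start: 0, end: 10, confidence: 0.95, source: "gliner" },
  // A weaker guess that overlaps the email span and should be dropped.
  { label: "PERSON", start: 22, end: 26, confidence: 0.6, source: "gliner" },
]);
const keptLabels = kept.map((s) => `${s.label}@${s.start}`);
// keptLabels: ["PERSON@0", "EMAIL@22"]
```

Only the most recently kept span is compared against each newcomer, which is sufficient here because the list is sorted by start position first.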

**Step 4: Run tests to verify they pass**

```bash
npx vitest run tests/scanner.test.ts
```

Expected: All tests PASS.

**Step 5: Commit**

```bash
git add src/scanner.ts tests/scanner.test.ts
git commit -m "feat: add scanner pipeline orchestrating regex → GLiNER with dedup"
```

---

### Task 7: OpenClaw Plugin Entry Point

**Files:**
- Create: `src/index.ts`

**Step 1: Write the plugin entry point**

Create `src/index.ts`:

```typescript
import { Scanner } from "./scanner.js";
import { redact } from "./redactor.js";
import { loadConfig } from "./config.js";
import type { FogClawConfig, GuardrailAction } from "./types.js";

export { Scanner } from "./scanner.js";
export { redact } from "./redactor.js";
export { loadConfig, DEFAULT_CONFIG } from "./config.js";
export type {
  Entity,
  FogClawConfig,
  ScanResult,
  RedactResult,
  RedactStrategy,
  GuardrailAction,
} from "./types.js";

/**
 * OpenClaw plugin registration.
 *
 * Registers:
 * - `before_agent_start` hook for automatic PII guardrail
 * - `fogclaw_scan` tool for on-demand entity detection
 * - `fogclaw_redact` tool for on-demand redaction
 */
export async function register(api: any) {
  const rawConfig = api.getConfig?.() ?? {};
  const config = loadConfig(rawConfig);

  if (!config.enabled) {
    console.log("[fogclaw] Plugin disabled via config");
    return;
  }

  const scanner = new Scanner(config);
  await scanner.initialize();

  // --- HOOK: Guardrail on incoming messages ---
  api.registerHook("before_agent_start", async (context: any) => {
    const result = await scanner.scan(context.message);

    if (result.entities.length === 0) return;

    // Check for any "block" actions
    for (const entity of result.entities) {
      const action: GuardrailAction =
        config.entityActions[entity.label] ?? config.guardrail_mode;

      if (action === "block") {
        return api.reply(
          `Message blocked: detected ${entity.label}. Please rephrase without sensitive information.`,
        );
      }
    }

    // Check for any "warn" actions
    const warnings = result.entities.filter((e) => {
      const action = config.entityActions[e.label] ?? config.guardrail_mode;
      return action === "warn";
    });
    if (warnings.length > 0) {
      const types = [...new Set(warnings.map((w) => w.label))].join(", ");
      api.notify?.(`PII detected: ${types}`);
    }

    // Apply redaction for "redact" action entities
    const toRedact = result.entities.filter((e) => {
      const action = config.entityActions[e.label] ?? config.guardrail_mode;
      return action === "redact";
    });
    if (toRedact.length > 0) {
      const redacted = redact(context.message, toRedact, config.redactStrategy);
      context.message = redacted.redacted_text;
    }
  });

  // --- TOOL: On-demand scan ---
  api.registerTool({
    id: "fogclaw_scan",
    name: "Scan for PII",
    description:
      "Scan text for PII and custom entities. Returns detected entities with types, positions, and confidence scores.",
    parameters: {
      text: {
        type: "string",
        description: "Text to scan for entities",
        required: true,
      },
      custom_labels: {
        type: "array",
        description:
          "Additional entity labels for zero-shot detection (e.g., ['competitor name', 'project codename'])",
        required: false,
      },
    },
    handler: async ({
      text,
      custom_labels,
    }: {
      text: string;
      custom_labels?: string[];
    }) => {
      const result = await scanner.scan(text, custom_labels);
      return {
        entities: result.entities,
        count: result.entities.length,
        summary: result.entities.length > 0
          ? `Found ${result.entities.length} entities: ${[...new Set(result.entities.map((e) => e.label))].join(", ")}`
          : "No entities detected",
      };
    },
  });

  // --- TOOL: On-demand redact ---
  api.registerTool({
    id: "fogclaw_redact",
    name: "Redact PII",
    description:
      "Scan and redact PII/custom entities from text. Returns sanitized text with entities replaced.",
    parameters: {
      text: {
        type: "string",
        description: "Text to scan and redact",
        required: true,
      },
      strategy: {
        type: "string",
        description:
          'Redaction strategy: "token" ([EMAIL_1]), "mask" (****), or "hash" ([EMAIL_a1b2c3...])',
        enum: ["token", "mask", "hash"],
        required: false,
      },
      custom_labels: {
        type: "array",
        description: "Additional entity labels for zero-shot detection",
        required: false,
      },
    },
    handler: async ({
      text,
      strategy,
      custom_labels,
    }: {
      text: string;
      strategy?: "token" | "mask" | "hash";
      custom_labels?: string[];
    }) => {
      const result = await scanner.scan(text, custom_labels);
      const redacted = redact(
        text,
        result.entities,
        strategy ?? config.redactStrategy,
      );
      return {
        redacted_text: redacted.redacted_text,
        entities_found: result.entities.length,
        mapping: redacted.mapping,
      };
    },
  });

  console.log(
    `[fogclaw] Plugin registered — guardrail: ${config.guardrail_mode}, model: ${config.model}, custom entities: ${config.custom_entities.length}`,
  );
}
```
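
The hook resolves an action for each entity by checking `entityActions` first and falling back to `guardrail_mode`. The sketch below isolates that lookup. `resolveAction` and `Action` are hypothetical names for this example (the plan inlines the expression in the hook), and the `SSN` and `PERSON` keys are illustrative labels only.

```typescript
type Action = "redact" | "block" | "warn";

// Hypothetical helper; the plan writes this lookup inline inside the hook.
function resolveAction(
  label: string,
  entityActions: Record<string, Action>,
  fallback: Action,
): Action {
  // A per-label override wins; otherwise the global guardrail_mode applies.
  return entityActions[label] ?? fallback;
}

// Illustrative overrides: block one label, warn on another, default for the rest.
const entityActions: Record<string, Action> = { SSN: "block", PERSON: "warn" };

const forSsn = resolveAction("SSN", entityActions, "redact"); // "block"
const forPerson = resolveAction("PERSON", entityActions, "redact"); // "warn"
const forEmail = resolveAction("EMAIL", entityActions, "redact"); // "redact"
```

Because the hook checks for `block` first and returns early, a single blocked entity stops the message before any warning or redaction is applied.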

**Step 2: Verify the project builds**

```bash
npx tsc
```

Expected: Clean compile, no errors.

**Step 3: Commit**

```bash
git add src/index.ts
git commit -m "feat: add OpenClaw plugin entry point with hook and tool registration"
```

---

### Task 8: Run Full Test Suite & Final Verification

**Step 1: Run all tests**

```bash
npx vitest run
```

Expected: All tests in `regex.test.ts`, `redactor.test.ts`, `config.test.ts`, `gliner.test.ts`, and `scanner.test.ts` pass.

**Step 2: Verify clean build**

```bash
npx tsc
```

Expected: No errors.

**Step 3: Verify package structure**

```bash
ls dist/
```

Expected: `index.js`, `index.d.ts`, `types.js`, `types.d.ts`, `config.js`, `config.d.ts`, `scanner.js`, `scanner.d.ts`, `redactor.js`, `redactor.d.ts`, `engines/regex.js`, `engines/regex.d.ts`, `engines/gliner.js`, `engines/gliner.d.ts` (plus `.map` files).

**Step 4: Commit any remaining changes**

```bash
git add -A
git commit -m "chore: verify full build and test suite"
```

---

### Task 9: Push to GitHub

**Step 1: Create the repo on GitHub**

```bash
gh repo create datafog/fogclaw --public --description "OpenClaw plugin for PII detection & custom entity redaction powered by DataFog" --license MIT
```

**Step 2: Add remote and push**

```bash
git remote add origin https://github.com/datafog/fogclaw.git
git branch -M main
git push -u origin main
```

**Step 3: Verify on GitHub**

```bash
gh repo view datafog/fogclaw --web
```

---

## Summary

| Task | What | Key files |
|------|------|-----------|
| 1 | Repo scaffold | `package.json`, `tsconfig.json`, `openclaw.plugin.json`, `src/types.ts` |
| 2 | Regex engine | `src/engines/regex.ts`, `tests/regex.test.ts` |
| 3 | Redactor | `src/redactor.ts`, `tests/redactor.test.ts` |
| 4 | Config loader | `src/config.ts`, `tests/config.test.ts` |
| 5 | GLiNER wrapper | `src/engines/gliner.ts`, `tests/gliner.test.ts` |
| 6 | Scanner pipeline | `src/scanner.ts`, `tests/scanner.test.ts` |
| 7 | Plugin entry | `src/index.ts` |
| 8 | Full verification | Run all tests + build |
| 9 | Push to GitHub | Create repo + push |