rehydra 0.3.4 → 0.4.0
- package/README.md +170 -760
- package/dist/index.d.ts +2 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +4 -0
- package/dist/index.js.map +1 -1
- package/dist/proxy/index.d.ts +12 -0
- package/dist/proxy/index.d.ts.map +1 -0
- package/dist/proxy/index.js +11 -0
- package/dist/proxy/index.js.map +1 -0
- package/dist/proxy/providers/anthropic.d.ts +17 -0
- package/dist/proxy/providers/anthropic.d.ts.map +1 -0
- package/dist/proxy/providers/anthropic.js +117 -0
- package/dist/proxy/providers/anthropic.js.map +1 -0
- package/dist/proxy/providers/index.d.ts +19 -0
- package/dist/proxy/providers/index.d.ts.map +1 -0
- package/dist/proxy/providers/index.js +40 -0
- package/dist/proxy/providers/index.js.map +1 -0
- package/dist/proxy/providers/openai.d.ts +17 -0
- package/dist/proxy/providers/openai.d.ts.map +1 -0
- package/dist/proxy/providers/openai.js +92 -0
- package/dist/proxy/providers/openai.js.map +1 -0
- package/dist/proxy/providers/types.d.ts +29 -0
- package/dist/proxy/providers/types.d.ts.map +1 -0
- package/dist/proxy/providers/types.js +6 -0
- package/dist/proxy/providers/types.js.map +1 -0
- package/dist/proxy/proxy-server.d.ts +53 -0
- package/dist/proxy/proxy-server.d.ts.map +1 -0
- package/dist/proxy/proxy-server.js +146 -0
- package/dist/proxy/proxy-server.js.map +1 -0
- package/dist/proxy/rehydra-fetch.d.ts +35 -0
- package/dist/proxy/rehydra-fetch.d.ts.map +1 -0
- package/dist/proxy/rehydra-fetch.js +217 -0
- package/dist/proxy/rehydra-fetch.js.map +1 -0
- package/dist/proxy/rehydra-proxy.d.ts +40 -0
- package/dist/proxy/rehydra-proxy.d.ts.map +1 -0
- package/dist/proxy/rehydra-proxy.js +82 -0
- package/dist/proxy/rehydra-proxy.js.map +1 -0
- package/dist/proxy/sse-parser.d.ts +59 -0
- package/dist/proxy/sse-parser.d.ts.map +1 -0
- package/dist/proxy/sse-parser.js +112 -0
- package/dist/proxy/sse-parser.js.map +1 -0
- package/dist/proxy/types.d.ts +49 -0
- package/dist/proxy/types.d.ts.map +1 -0
- package/dist/proxy/types.js +5 -0
- package/dist/proxy/types.js.map +1 -0
- package/dist/proxy/wrap-client.d.ts +47 -0
- package/dist/proxy/wrap-client.d.ts.map +1 -0
- package/dist/proxy/wrap-client.js +70 -0
- package/dist/proxy/wrap-client.js.map +1 -0
- package/dist/storage/session.d.ts +3 -0
- package/dist/storage/session.d.ts.map +1 -1
- package/dist/storage/session.js +16 -0
- package/dist/storage/session.js.map +1 -1
- package/dist/storage/types.d.ts +16 -0
- package/dist/storage/types.d.ts.map +1 -1
- package/dist/streaming/anonymizer-stream.d.ts +63 -0
- package/dist/streaming/anonymizer-stream.d.ts.map +1 -0
- package/dist/streaming/anonymizer-stream.js +184 -0
- package/dist/streaming/anonymizer-stream.js.map +1 -0
- package/dist/streaming/index.d.ts +9 -0
- package/dist/streaming/index.d.ts.map +1 -0
- package/dist/streaming/index.js +8 -0
- package/dist/streaming/index.js.map +1 -0
- package/dist/streaming/sentence-buffer.d.ts +78 -0
- package/dist/streaming/sentence-buffer.d.ts.map +1 -0
- package/dist/streaming/sentence-buffer.js +238 -0
- package/dist/streaming/sentence-buffer.js.map +1 -0
- package/dist/streaming/stream-factory.d.ts +38 -0
- package/dist/streaming/stream-factory.d.ts.map +1 -0
- package/dist/streaming/stream-factory.js +69 -0
- package/dist/streaming/stream-factory.js.map +1 -0
- package/dist/streaming/types.d.ts +121 -0
- package/dist/streaming/types.d.ts.map +1 -0
- package/dist/streaming/types.js +5 -0
- package/dist/streaming/types.js.map +1 -0
- package/package.json +18 -1
package/README.md
CHANGED
@@ -6,898 +6,308 @@
 
 [](https://codecov.io/github/rehydra-ai/rehydra)
 
-
-On-device PII anonymization module for high-privacy AI workflows. Detects and replaces Personally Identifiable Information (PII) with semantically valuable placeholder tags while maintaining an encrypted mapping for rehydration.
-
-```bash
-npm install rehydra
-```
-
-**Works in Node.js, Bun, and browsers**
-
-## Features
-
-- **Structured PII Detection**: Regex-based detection for emails, phones, IBANs, credit cards, IPs, URLs
-- **Soft PII Detection**: ONNX-powered NER model for names, organizations, locations (auto-downloads on first use if enabled)
-- **Semantic Enrichment**: AI/MT-friendly tags with gender/location attributes
-- **Secure PII Mapping**: AES-256-GCM encrypted storage of original PII values
-- **Cross-Platform**: Works identically in Node.js, Bun, and browsers
-- **Configurable Policies**: Customizable detection rules, thresholds, and allowlists
-- **Validation & Leak Scanning**: Built-in validation and optional leak detection
-
-## Installation
-
-### Node.js
+On-device PII anonymization for AI workflows. Detects names, emails, phones, IBANs, and more — replaces them with encrypted placeholder tags — and rehydrates them back after processing.
 
 ```bash
 npm install rehydra
 ```
-For bun support see [Bun Support](#bun-support)
-
-### Browser (with bundler)
-
-```bash
-npm install rehydra onnxruntime-web
-```
 
-
-
-```html
-<script type="module">
-  // Import directly from your dist folder or CDN
-  import { createAnonymizer } from './node_modules/rehydra/dist/index.js';
-
-  // onnxruntime-web is automatically loaded from CDN when needed
-</script>
-```
+**Works in Node.js, Bun, and browsers.** No data leaves your machine.
 
 ## Quick Start
 
-### Full pipeline (Anonymize → LLM → Rehydrate)
-
-The full workflow for privacy-preserving LLM workflows:
-
 ```typescript
-import {
-  createAnonymizer,
-  decryptPIIMap,
-  rehydrate,
-  InMemoryKeyProvider
-} from 'rehydra';
+import { createAnonymizer, decryptPIIMap, rehydrate, InMemoryKeyProvider } from 'rehydra';
 
-// 1. Create a key provider (required to decrypt later)
 const keyProvider = new InMemoryKeyProvider();
-
-// 2. Create anonymizer with key provider
 const anonymizer = createAnonymizer({
-  ner: { mode: 'quantized' },
-
-  keyProvider: keyProvider
+  ner: { mode: 'quantized' }, // ~280 MB model, auto-downloads on first use
+  keyProvider,
 });
 
-await anonymizer.
-
-
-const original = 'Hello John Smith from Acme Corp in Berlin!';
-const result = await anonymizer.anonymize(original);
+const result = await anonymizer.anonymize(
+  'Email john.smith@acme-corp.com or call John at +41 79 123 45 67'
+);
 
 console.log(result.anonymizedText);
-// "
-
-// 4. Translate (or do other AI workloads that preserve placeholders)
-const translated = await yourAIWorkflow(result.anonymizedText, { from: 'en', to: 'de' });
-// "Hallo <PII type="PERSON" gender="male" id="1"/> von <PII type="ORG" id="2"/> in <PII type="LOCATION" scope="city" id="3"/>!"
+// "Email <PII type="EMAIL" id="1"/> or call <PII type="PERSON" id="2"/> at <PII type="PHONE" id="3"/>"
 
-//
-const
-const piiMap = await decryptPIIMap(result.piiMap
-
-//
-const rehydrated = rehydrate(translated, piiMap);
-// "Hallo John Smith von Acme Corp in Berlin!"
+// Rehydrate after translation or other processing
+const key = await keyProvider.getKey();
+const piiMap = await decryptPIIMap(result.piiMap!, key);
+const original = rehydrate(result.anonymizedText, piiMap);
+// "Email john.smith@acme-corp.com or call John at +41 79 123 45 67"
 
-// 7. Clean up
 await anonymizer.dispose();
 ```
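The round trip in the Quick Start can be illustrated without the library. The sketch below is a toy, regex-only version for emails: `anonymizeEmails`, `rehydrateTags`, and the unencrypted `Map` are hypothetical names introduced here for illustration, and only the tag shape and the `TYPE:id` key format are taken from the README text. The real package also runs NER and encrypts the map.

```typescript
// Illustrative only: a toy email anonymizer, not rehydra's implementation.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

interface ToyResult {
  anonymizedText: string;
  piiMap: Map<string, string>; // key "EMAIL:<id>" -> original value (plain text here)
}

function anonymizeEmails(text: string): ToyResult {
  const piiMap = new Map<string, string>();
  let nextId = 1;
  const anonymizedText = text.replace(EMAIL_RE, (match: string) => {
    const id = nextId++;
    piiMap.set(`EMAIL:${id}`, match);
    return `<PII type="EMAIL" id="${id}"/>`;
  });
  return { anonymizedText, piiMap };
}

function rehydrateTags(text: string, piiMap: Map<string, string>): string {
  // Swap each tag back for its stored value; unknown tags are left untouched.
  return text.replace(
    /<PII type="([A-Z_]+)"[^>]*? id="(\d+)"\/>/g,
    (tag: string, type: string, id: string) => piiMap.get(`${type}:${id}`) ?? tag
  );
}
```

Because rehydration is a lookup by type and ID, the text between the two calls can be translated or rewritten freely as long as the tags survive intact.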
 
-
-
-For structured PII like emails, phones, IBANs, credit cards:
-
-```typescript
-import { anonymizeRegexOnly } from 'rehydra';
-
-const result = await anonymizeRegexOnly(
-  'Contact john@example.com or call +49 30 123456. IBAN: DE89370400440532013000'
-);
-
-console.log(result.anonymizedText);
-// "Contact <PII type="EMAIL" id="1"/> or call <PII type="PHONE" id="2"/>. IBAN: <PII type="IBAN" id="3"/>"
-```
+## LLM Proxy
 
-
+Drop-in middleware that anonymizes prompts before they leave your machine and rehydrates responses. Works with OpenAI, Anthropic, and any OpenAI-compatible API.
 
-
+### Wrap any fetch-based client
 
 ```typescript
-import
-
-
-
-
-
-
+import OpenAI from 'openai';
+import { createRehydraFetch, InMemoryKeyProvider, InMemoryPIIStorageProvider } from 'rehydra';
+
+const openai = new OpenAI({
+  fetch: createRehydraFetch({
+    anonymizer: { ner: { mode: 'quantized' } },
+    keyProvider: new InMemoryKeyProvider(),
+    piiStorageProvider: new InMemoryPIIStorageProvider(),
+  }),
 });
 
-
-
-
-'
-);
-
-console.log(result.anonymizedText);
-// "Hello <PII type="PERSON" id="1"/> from <PII type="ORG" id="2"/> in <PII type="LOCATION" id="3"/>!"
-
-// Clean up when done
-await anonymizer.dispose();
+// PII is anonymized before leaving your machine, response is rehydrated automatically
+const response = await openai.chat.completions.create({
+  model: 'gpt-4o',
+  messages: [{ role: 'user', content: 'Draft a reply to john@example.com about the meeting' }],
+});
 ```
 
-###
-
-Add gender and location scope for better machine translation:
+### Or use wrapLLMClient for even less code
 
 ```typescript
-import
+import OpenAI from 'openai';
+import { wrapLLMClient, InMemoryKeyProvider, InMemoryPIIStorageProvider } from 'rehydra';
 
-const
-
-
-enabled: true, // Downloads ~12 MB of semantic data on first use
-onStatus: (status) => console.log(status),
-}
+const openai = wrapLLMClient(new OpenAI(), {
+  keyProvider: new InMemoryKeyProvider(),
+  piiStorageProvider: new InMemoryPIIStorageProvider(),
 });
-
-await anonymizer.initialize();
-
-const result = await anonymizer.anonymize(
-  'Hello Maria Schmidt from Berlin!'
-);
-
-console.log(result.anonymizedText);
-// "Hello <PII type="PERSON" gender="female" id="1"/> from <PII type="LOCATION" scope="city" id="2"/>!"
 ```
 
-
-
-Full documentation on [https://docs.rehydra.ai](https://docs.rehydra.ai).
+### Standalone proxy server
 
-
+Point any LLM client at a local proxy — zero code changes needed:
 
 ```typescript
-import {
+import { createRehydraProxyServer, InMemoryKeyProvider, InMemoryPIIStorageProvider } from 'rehydra';
 
-const
-
-
-mode: 'quantized', // 'standard' | 'quantized' | 'disabled' | 'custom'
-backend: 'local', // 'local' (default) | 'inference-server'
-autoDownload: true, // Auto-download model if not present
-onStatus: (status) => {}, // Status messages callback
-onDownloadProgress: (progress) => {
-console.log(`${progress.file}: ${progress.percent}%`);
-},
-
-// For 'inference-server' backend:
-inferenceServerUrl: 'http://localhost:8080',
-
-// For 'custom' mode only:
-modelPath: './my-model.onnx',
-vocabPath: './vocab.txt',
-},
-
-// Semantic enrichment (adds gender/scope attributes)
-semantic: {
-enabled: true, // Enable MT-friendly attributes
-autoDownload: true, // Auto-download semantic data (~12 MB)
-onStatus: (status) => {},
-onDownloadProgress: (progress) => {},
-},
-
-// Encryption key provider
+const proxy = await createRehydraProxyServer({
+  port: 8080,
+  upstream: 'https://api.openai.com',
   keyProvider: new InMemoryKeyProvider(),
-
-// Custom policy (optional)
-defaultPolicy: { /* see Policy section */ },
+  piiStorageProvider: new InMemoryPIIStorageProvider(),
 });
 
-
+// Point your client at the proxy
+const openai = new OpenAI({ baseURL: 'http://localhost:8080/v1' });
 ```
 
-
-
-| Mode | Description | Size | Auto-Download |
-|------|-------------|------|---------------|
-| `'disabled'` | No NER, regex only | 0 | N/A |
-| `'quantized'` | Smaller model, ~95% accuracy | ~280 MB | Yes |
-| `'standard'` | Full model, best accuracy | ~1.1 GB | Yes |
-| `'custom'` | Your own ONNX model | Varies | No |
+Supports non-streaming and streaming (SSE) responses for both OpenAI and Anthropic APIs.
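Streaming responses arrive as Server-Sent Events: text blocks separated by a blank line, with the payload carried on `data:` lines. As a rough sketch of what a proxy has to do before it can rehydrate streamed text, the hypothetical helper below splits a buffer into complete events and keeps the unfinished tail for the next chunk. `parseSseChunk` and `SseParseResult` are names invented here; this is not the interface of the package's `sse-parser` module, which the diff lists but does not show.

```typescript
// Hypothetical helper: incremental parser for Server-Sent Events text.
interface SseParseResult {
  events: string[];  // complete `data` payloads, in arrival order
  remainder: string; // trailing partial event, to be prepended to the next chunk
}

function parseSseChunk(buffer: string): SseParseResult {
  const normalized = buffer.replace(/\r\n/g, "\n");
  const blocks = normalized.split("\n\n");
  // Everything after the last blank line may still be incomplete.
  const remainder = blocks.pop() ?? "";
  const events: string[] = [];
  for (const block of blocks) {
    const dataLines = block
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      // Per the SSE format, one leading space after the colon is dropped.
      .map((line) => line.slice(5).replace(/^ /, ""));
    if (dataLines.length > 0) events.push(dataLines.join("\n"));
  }
  return { events, remainder };
}
```

A placeholder tag can be split across two events, so a real implementation would presumably also buffer partial tags before rehydrating; that step is left out of this sketch.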
 
-
+## Streaming
 
-
+Process text chunk-by-chunk with constant memory. Works as a Node.js Transform stream.
 
 ```typescript
-
-
-mode: 'quantized',
-sessionOptions: {
-// Graph optimization level: 'disabled' | 'basic' | 'extended' | 'all'
-graphOptimizationLevel: 'all', // default
-
-// Threading (Node.js only)
-intraOpNumThreads: 4, // threads within operators
-interOpNumThreads: 1, // threads between operators
-
-// Memory optimization
-enableCpuMemArena: true,
-enableMemPattern: true,
-}
-}
-});
-```
-
-#### Execution Providers
-
-By default, Rehydra uses:
-- **Node.js**: CPU (fastest for quantized models)
-- **Browsers**: CPU (WASM)
+import { createReadStream, createWriteStream } from 'fs';
+import { createAnonymizerStream, InMemoryKeyProvider } from 'rehydra';
 
-
-
-
-
-
-For high-throughput production deployments, Rehydra supports GPU-accelerated inference via a dedicated inference server. This is useful for large documents.
-
-```typescript
-const anonymizer = createAnonymizer({
-  ner: {
-    backend: 'inference-server',
-    inferenceServerUrl: 'http://localhost:8080',
-  }
+const stream = await createAnonymizerStream({
+  anonymizer: { ner: { mode: 'quantized' } },
+  keyProvider: new InMemoryKeyProvider(),
+  sessionId: 'batch-job-001',
+  piiStorageProvider: storage,
 });
 
-
+createReadStream('input.txt').pipe(stream).pipe(createWriteStream('anonymized.txt'));
 ```
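Chunked input can cut a value such as an email address in half, so a streaming anonymizer has to hold text back until it reaches a safe boundary. The sketch below shows one simple way to do that, cutting at the last sentence-ending punctuation mark followed by whitespace. `ToySentenceBuffer` is a hypothetical class written for this note; the package ships its own `sentence-buffer` module, whose behaviour is not visible in this diff and may differ.

```typescript
// Illustrative sketch: release text only up to the last sentence boundary,
// so a value spanning two chunks is never processed in pieces.
class ToySentenceBuffer {
  private pending = "";

  // Returns the text that is safe to process now (possibly empty).
  push(chunk: string): string {
    this.pending += chunk;
    let cut = -1;
    const boundary = /[.!?]\s/g;
    let match: RegExpExecArray | null;
    // Walk all boundaries and remember the end of the last one.
    while ((match = boundary.exec(this.pending)) !== null) {
      cut = match.index + match[0].length;
    }
    if (cut === -1) return "";
    const ready = this.pending.slice(0, cut);
    this.pending = this.pending.slice(cut);
    return ready;
  }

  // Call once at end of input to release whatever is left.
  flush(): string {
    const rest = this.pending;
    this.pending = "";
    return rest;
  }
}
```

Memory stays bounded only if boundaries keep arriving; a production buffer would presumably also force a flush once the pending text exceeds some size limit.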
 
-
-
-| Text Size | CPU (local) | GPU (server) |
-|-----------|-------------|--------------|
-| Short (~40 chars) | 4.3ms | 62ms |
-| Medium (~500 chars) | 26ms | 73ms |
-| Long (~2000 chars) | 93ms | 117ms |
-| Entity-dense | 13ms | 68ms |
-
-Local CPU faster for most use cases due to network overhead. GPU is beneficial for batch processing and large documents.
-
-**Backend Options:**
-
-| Backend | Description | Latency (2K chars) |
-|---------|-------------|-------------------|
-| `'local'` | CPU inference (default) | ~4,300ms |
-| `'inference-server'` | GPU server (enterprise) | ~117ms |
-
-
-### Main Functions
-
-#### `createAnonymizer(config?)`
+### Low-latency mode for LLM token streams
 
-
+Regex-only, smaller buffers, flushes aggressively — designed for real-time token streams:
 
 ```typescript
-const
-
+const stream = await createAnonymizerStream({
+  buffer: { lowLatency: true },
 });
 
-
-
-await anonymizer.dispose();
-```
-
-#### `anonymize(text, locale?, policy?)`
-
-One-off anonymization (regex-only by default):
-
-```typescript
-import { anonymize } from 'rehydra';
-
-const result = await anonymize('Contact test@example.com');
-```
-
-#### `anonymizeWithNER(text, nerConfig, policy?)`
-
-One-off anonymization with NER:
-
-```typescript
-import { anonymizeWithNER } from 'rehydra';
-
-const result = await anonymizeWithNER(
-  'Hello John Smith',
-  { mode: 'quantized' }
-);
-```
-
-#### `anonymizeRegexOnly(text, policy?)`
-
-Fast regex-only anonymization:
-
-```typescript
-import { anonymizeRegexOnly } from 'rehydra';
-
-const result = await anonymizeRegexOnly('Card: 4111111111111111');
-```
-
-### Rehydration Functions
-
-#### `decryptPIIMap(encryptedMap, key)`
-
-Decrypts the PII map for rehydration:
-
-```typescript
-import { decryptPIIMap } from 'rehydra';
-
-const piiMap = await decryptPIIMap(result.piiMap, encryptionKey);
-// Returns Map<string, string> where key is "PERSON:1" and value is "John Smith"
-```
-
-#### `rehydrate(text, piiMap)`
-
-Replaces placeholders with original values:
-
-```typescript
-import { rehydrate } from 'rehydra';
-
-const original = rehydrate(translatedText, piiMap);
-```
-
-### Result Structure
-
-```typescript
-interface AnonymizationResult {
-  // Text with PII replaced by placeholder tags
-  anonymizedText: string;
-
-  // Detected entities (without original text for safety)
-  entities: Array<{
-    type: PIIType;
-    id: number;
-    start: number;
-    end: number;
-    confidence: number;
-    source: 'REGEX' | 'NER';
-  }>;
-
-  // Encrypted PII mapping (for later rehydration)
-  piiMap: {
-    ciphertext: string; // Base64
-    iv: string; // Base64
-    authTag: string; // Base64
-  };
-
-  // Processing statistics
-  stats: {
-    countsByType: Record<PIIType, number>;
-    totalEntities: number;
-    processingTimeMs: number;
-    modelVersion: string;
-    leakScanPassed?: boolean;
-  };
-}
-```
-
-## Supported PII Types
-
-| Type | Description | Detection | Semantic Attributes |
-|------|-------------|-----------|---------------------|
-| `EMAIL` | Email addresses | Regex | - |
-| `PHONE` | Phone numbers (international) | Regex | - |
-| `IBAN` | International Bank Account Numbers | Regex + Checksum | - |
-| `BIC_SWIFT` | Bank Identifier Codes | Regex | - |
-| `CREDIT_CARD` | Credit card numbers | Regex + Luhn | - |
-| `IP_ADDRESS` | IPv4 and IPv6 addresses | Regex | - |
-| `URL` | Web URLs | Regex | - |
-| `CASE_ID` | Case/ticket numbers | Regex (configurable) | - |
-| `CUSTOMER_ID` | Customer identifiers | Regex (configurable) | - |
-| `PERSON` | Person names | NER | `gender` (male/female/neutral) |
-| `ORG` | Organization names | NER | - |
-| `LOCATION` | Location/place names | NER | `scope` (city/country/region) |
-| `ADDRESS` | Physical addresses | NER | - |
-| `DATE_OF_BIRTH` | Dates of birth | NER | - |
-
-## Configuration
-
-### Anonymization Policy
-
-```typescript
-import { createAnonymizer, PIIType } from 'rehydra';
-
-const anonymizer = createAnonymizer({
-  ner: { mode: 'quantized' },
-  defaultPolicy: {
-    // Which PII types to detect
-    enabledTypes: new Set([PIIType.EMAIL, PIIType.PHONE, PIIType.PERSON]),
-
-    // Confidence thresholds per type (0.0 - 1.0)
-    confidenceThresholds: new Map([
-      [PIIType.PERSON, 0.8],
-      [PIIType.EMAIL, 0.5],
-    ]),
-
-    // Terms to never treat as PII
-    allowlistTerms: new Set(['Customer Service', 'Help Desk']),
-
-    // Enable semantic enrichment (gender/scope)
-    enableSemanticMasking: true,
-
-    // Enable leak scanning on output
-    enableLeakScan: true,
-  },
+llmTokenStream.pipe(stream).on('data', (chunk) => {
+  ws.send(chunk.toString());
 });
 ```
 
-###
-
-Add domain-specific patterns:
+### Stream from a session
 
 ```typescript
-
-
-
-{
-name: 'Order Number',
-pattern: /\bORD-[A-Z0-9]{8}\b/g,
-type: PIIType.CASE_ID,
-},
-]);
-
-const anonymizer = createAnonymizer();
-anonymizer.getRegistry().register(customRecognizer);
+const session = anonymizer.session('chat-123');
+const stream = await session.createStream();
+input.pipe(stream).pipe(output);
 ```
 
-##
-
-Models and semantic data are cached locally for offline use.
+## Sessions
 
-
-
-| Data | macOS | Linux | Windows |
-|------|-------|-------|---------|
-| NER Models | `~/Library/Caches/rehydra/models/` | `~/.cache/rehydra/models/` | `%LOCALAPPDATA%/rehydra/models/` |
-| Semantic Data | `~/Library/Caches/rehydra/semantic-data/` | `~/.cache/rehydra/semantic-data/` | `%LOCALAPPDATA%/rehydra/semantic-data/` |
-
-### Browser Cache
-
-In browsers, data is stored using:
-- **IndexedDB**: For semantic data and smaller files
-- **Origin Private File System (OPFS)**: For large model files (~280 MB)
-
-Data persists across page reloads and browser sessions.
-
-### Manual Data Management
+For multi-message conversations where PII IDs need to stay consistent and PII maps need to persist:
 
 ```typescript
-import {
-
-
-
-clearModelCache,
-listDownloadedModels,
-
-// Semantic data management
-isSemanticDataDownloaded,
-downloadSemanticData,
-clearSemanticDataCache,
+import {
+  createAnonymizer,
+  InMemoryKeyProvider,
+  SQLitePIIStorageProvider, // or InMemoryPIIStorageProvider, IndexedDBPIIStorageProvider
 } from 'rehydra';
 
-
-
-
-
-await downloadModel('quantized', (progress) => {
-  console.log(`${progress.file}: ${progress.percent}%`);
+const anonymizer = createAnonymizer({
+  ner: { mode: 'quantized' },
+  keyProvider: new InMemoryKeyProvider(),
+  piiStorageProvider: new SQLitePIIStorageProvider('./pii.db'),
 });
 
-
-const hasSemanticData = await isSemanticDataDownloaded();
-
-// List downloaded models
-const models = await listDownloadedModels();
-
-// Clear caches
-await clearModelCache('quantized'); // or clearModelCache() for all
-await clearSemanticDataCache();
-```
-
-## Encryption & Security
-
-The PII map is encrypted using **AES-256-GCM** via the Web Crypto API (works in both Node.js and browsers).
+const session = anonymizer.session('chat-123');
 
-
+// Message 1
+await session.anonymize('Contact me at user@example.com');
+// → "Contact me at <PII type="EMAIL" id="1"/>"
 
-
-
-
-ConfigKeyProvider, // For production with pre-configured key
-KeyProvider, // Interface for custom implementations
-generateKey,
-} from 'rehydra';
+// Message 2 — same email gets the same ID
+await session.anonymize('CC: user@example.com and admin@example.com');
+// → "CC: <PII type="EMAIL" id="1"/> and <PII type="EMAIL" id="2"/>"
 
-//
-const
-
-// Production: Pre-configured key
-// Generate key: openssl rand -base64 32
-const keyBase64 = process.env.PII_ENCRYPTION_KEY; // or read from config
-const prodKeyProvider = new ConfigKeyProvider(keyBase64);
-
-// Custom: Implement KeyProvider interface
-class SecureKeyProvider implements KeyProvider {
-  async getKey(): Promise<Uint8Array> {
-    // Retrieve from secure storage, HSM, keychain, etc.
-    return await getKeyFromSecureStorage();
-  }
-}
+// Rehydrate any message — auto-loads the PII map from storage
+const original = await session.rehydrate(translatedText);
 ```
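The ID stability shown in the session example comes down to remembering which value was given which ID. A minimal sketch of that bookkeeping, for emails only and with everything held in memory, might look like the following. `ToySession` is a hypothetical class written for this note; it has no storage provider or encryption and is not how the package implements sessions.

```typescript
// Illustrative sketch: a value seen earlier in the session reuses its ID.
const SESSION_EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

class ToySession {
  private readonly idByValue = new Map<string, number>();
  private readonly valueById = new Map<number, string>();

  anonymize(text: string): string {
    return text.replace(SESSION_EMAIL_RE, (match: string) => {
      let id = this.idByValue.get(match);
      if (id === undefined) {
        // First time this value appears: assign the next free ID.
        id = this.idByValue.size + 1;
        this.idByValue.set(match, id);
        this.valueById.set(id, match);
      }
      return `<PII type="EMAIL" id="${id}"/>`;
    });
  }

  rehydrate(text: string): string {
    return text.replace(
      /<PII type="EMAIL" id="(\d+)"\/>/g,
      (tag: string, id: string) => this.valueById.get(Number(id)) ?? tag
    );
  }
}
```

Keeping the maps on the session object is what lets a later message, or a model response that mentions only one of the tags, be rehydrated without the caller passing a PII map around.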
 
-### Security Best Practices
-
-- **Never log the raw PII map** - Always use encrypted storage
-- **Persist the encryption key securely** - Use platform keystores (iOS Keychain, Android Keystore, etc.)
-- **Rotate keys** - Implement key rotation for long-running applications
-- **Enable leak scanning** - Catch any missed PII in output
-
-## PII Map Storage
-
-For applications that need to persist encrypted PII maps (e.g., chat applications where you need to rehydrate later), use sessions with built-in storage providers.
-
 ### Storage Providers
 
-| Provider | Environment | Persistence |
-
-| `InMemoryPIIStorageProvider` | All | None (lost on restart) |
-| `SQLitePIIStorageProvider` | Node.js, Bun
-| `IndexedDBPIIStorageProvider` | Browser | Browser storage |
+| Provider | Environment | Persistence |
+|----------|-------------|-------------|
+| `InMemoryPIIStorageProvider` | All | None (lost on restart) |
+| `SQLitePIIStorageProvider` | Node.js, Bun | File-based (`better-sqlite3` on Node, `bun:sqlite` on Bun) |
+| `IndexedDBPIIStorageProvider` | Browser | Browser storage |
+
+## Supported PII Types
 
-
+| Type | Detection | Notes |
+|------|-----------|-------|
+| `PERSON` | NER | Names, with optional `gender` attribute |
+| `ORG` | NER | Organization names |
+| `LOCATION` | NER | Places, with optional `scope` attribute (city/country/region) |
+| `ADDRESS` | NER | Physical addresses |
+| `DATE_OF_BIRTH` | NER | Dates of birth |
+| `EMAIL` | Regex | Email addresses |
+| `PHONE` | Regex | International phone numbers |
+| `IBAN` | Regex + checksum | International Bank Account Numbers |
+| `BIC_SWIFT` | Regex | Bank Identifier Codes |
+| `CREDIT_CARD` | Regex + Luhn | Credit card numbers |
+| `IP_ADDRESS` | Regex | IPv4 and IPv6 |
+| `URL` | Regex | Web URLs |
+| `CASE_ID` | Regex | Configurable case/ticket patterns |
+| `CUSTOMER_ID` | Regex | Configurable customer ID patterns |
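The table pairs two of the regex detectors with a validation step. As an illustration of why that cuts false positives, here is a sketch of the two standard checks: the Luhn algorithm for card numbers and the mod-97 check for IBANs. The function names are hypothetical, and the assumption that the table's "checksum" means the usual mod-97 rule is mine; the package's own validators are not shown in this diff.

```typescript
// Luhn check: double every second digit from the right, sum, test mod 10.
function passesLuhn(cardNumber: string): boolean {
  const digits = cardNumber.replace(/[\s-]/g, "");
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// IBAN check: move the first four characters to the end, map A..Z to 10..35,
// and require the resulting number to be congruent to 1 modulo 97.
function passesIbanMod97(iban: string): boolean {
  const compact = iban.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]+$/.test(compact)) return false;
  const rearranged = compact.slice(4) + compact.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code <= 57 ? code - 48 : code - 55; // '0'..'9' or 'A'..'Z'
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

A 16-digit order number that happens to look like a card will usually fail the Luhn test, which is the kind of match a regex alone cannot rule out.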
|
|
573
198
|
|
|
574
|
-
|
|
199
|
+
## Configuration
|
|
575
200
|
|
|
576
|
-
|
|
577
|
-
> Calling `anonymizer.anonymize()` directly does NOT save to storage - the encrypted PII map
|
|
578
|
-
> is only returned in the result for you to handle manually.
|
|
201
|
+
### NER Modes
|
|
579
202
|
|
|
580
|
-
|
|
581
|
-
|
|
582
|
-
|
|
583
|
-
|
|
584
|
-
|
|
585
|
-
|
|
586
|
-
const session = anonymizer.session('conversation-123');
|
|
587
|
-
const result = await session.anonymize('Hello John!');
|
|
588
|
-
// result.piiMap is automatically saved to storage
|
|
589
|
-
```
|
|
203
|
+
| Mode | Size | Description |
|
|
204
|
+
|------|------|-------------|
|
|
205
|
+
| `'disabled'` | 0 | Regex only — no model download |
|
|
206
|
+
| `'quantized'` | ~280 MB | Recommended — good accuracy, smaller download |
|
|
207
|
+
| `'standard'` | ~1.1 GB | Best accuracy |
|
|
208
|
+
| `'custom'` | Varies | Bring your own ONNX model |
|
|
590
209
|
|
|
591
|
-
###
|
|
210
|
+
### Semantic Enrichment
|
|
592
211
|
|
|
593
|
-
|
|
212
|
+
Adds gender/scope attributes for better machine translation:
|
|
594
213
|
|
|
595
214
|
```typescript
|
|
596
|
-
import { createAnonymizer, decryptPIIMap, rehydrate, InMemoryKeyProvider } from 'rehydra';
|
|
597
|
-
|
|
598
|
-
const keyProvider = new InMemoryKeyProvider();
|
|
599
215
|
const anonymizer = createAnonymizer({
|
|
600
216
|
ner: { mode: 'quantized' },
|
|
601
|
-
|
|
217
|
+
semantic: { enabled: true }, // Downloads ~12 MB of name/location data
|
|
602
218
|
});
|
|
603
|
-
await anonymizer.initialize();
|
|
604
|
-
|
|
605
|
-
// Anonymize
|
|
606
|
-
const result = await anonymizer.anonymize('Hello John Smith!');
|
|
607
219
|
|
|
608
|
-
//
|
|
609
|
-
const translated = await translateAPI(result.anonymizedText);
|
|
610
|
-
|
|
611
|
-
// Rehydrate manually using the returned PII map
|
|
612
|
-
const key = await keyProvider.getKey();
|
|
613
|
-
const piiMap = await decryptPIIMap(result.piiMap, key);
|
|
614
|
-
const original = rehydrate(translated, piiMap);
|
|
220
|
+
// "Hello <PII type="PERSON" gender="female" id="1"/> from <PII type="LOCATION" scope="city" id="2"/>!"
|
|
615
221
|
```
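A downstream consumer (for example, a translation post-processor) may want to read the `gender` and `scope` attributes back out of the placeholder tags. The following is a minimal sketch that assumes the tag shape shown in the example comment above; `parsePiiTags` and `PiiTag` are hypothetical names for illustration and are not part of rehydra's API.

```typescript
// Hypothetical helper (not a rehydra export): extracts attributes from
// self-closing <PII .../> placeholders as shown in the README example.
interface PiiTag {
  type: string;
  id: string;
  attributes: Record<string, string>; // everything except type and id
}

function parsePiiTags(text: string): PiiTag[] {
  const tagPattern = /<PII\s+([^>]*?)\/>/g;
  const tags: PiiTag[] = [];
  let tagMatch: RegExpExecArray | null;
  while ((tagMatch = tagPattern.exec(text)) !== null) {
    const attributes: Record<string, string> = {};
    const attrPattern = /(\w+)="([^"]*)"/g;
    let attrMatch: RegExpExecArray | null;
    while ((attrMatch = attrPattern.exec(tagMatch[1])) !== null) {
      attributes[attrMatch[1]] = attrMatch[2];
    }
    // Separate the two attributes every tag carries from the semantic extras.
    const { type = "", id = "", ...rest } = attributes;
    tags.push({ type, id, attributes: rest });
  }
  return tags;
}

const sample =
  'Hello <PII type="PERSON" gender="female" id="1"/> from <PII type="LOCATION" scope="city" id="2"/>!';
const tags = parsePiiTags(sample);
console.log(tags.map((t) => `${t.type}#${t.id}`).join(", "));
```

A regex-based reader like this is only adequate because the placeholders are flat, self-closing tags; anything more elaborate would call for a real parser.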
|
|
616
222
|
|
|
617
|
-
###
|
|
618
|
-
|
|
619
|
-
For applications that need to persist PII maps across requests/restarts:
|
|
223
|
+
### Anonymization Policy
|
|
620
224
|
|
|
621
225
|
```typescript
|
|
622
|
-
import {
|
|
623
|
-
createAnonymizer,
|
|
624
|
-
InMemoryKeyProvider,
|
|
625
|
-
SQLitePIIStorageProvider,
|
|
626
|
-
} from 'rehydra';
|
|
627
|
-
|
|
628
|
-
// 1. Setup storage (once at app start)
|
|
629
|
-
const storage = new SQLitePIIStorageProvider('./pii-maps.db');
|
|
630
|
-
await storage.initialize();
|
|
631
|
-
|
|
632
|
-
// 2. Create anonymizer with storage and key provider
|
|
633
226
|
const anonymizer = createAnonymizer({
|
|
634
227
|
ner: { mode: 'quantized' },
|
|
635
|
-
|
|
636
|
-
|
|
228
|
+
defaultPolicy: {
|
|
229
|
+
enabledTypes: new Set([PIIType.EMAIL, PIIType.PHONE, PIIType.PERSON]),
|
|
230
|
+
confidenceThresholds: new Map([[PIIType.PERSON, 0.8]]),
|
|
231
|
+
allowlistTerms: new Set(['Customer Service']),
|
|
232
|
+
enableLeakScan: true,
|
|
233
|
+
},
|
|
637
234
|
});
|
|
638
|
-
await anonymizer.initialize();
|
|
639
|
-
|
|
640
|
-
// 3. Create a session for each conversation
|
|
641
|
-
const session = anonymizer.session('conversation-123');
|
|
642
|
-
|
|
643
|
-
// 4. Anonymize - auto-saves to storage
|
|
644
|
-
const result = await session.anonymize('Hello John Smith from Acme Corp!');
|
|
645
|
-
console.log(result.anonymizedText);
|
|
646
|
-
// "Hello <PII type="PERSON" id="1"/> from <PII type="ORG" id="1"/>!"
|
|
647
|
-
|
|
648
|
-
// 5. Later (even after app restart): rehydrate - auto-loads and decrypts
|
|
649
|
-
const translated = await translateAPI(result.anonymizedText);
|
|
650
|
-
const original = await session.rehydrate(translated);
|
|
651
|
-
console.log(original);
|
|
652
|
-
// "Hello John Smith from Acme Corp!"
|
|
653
|
-
|
|
654
|
-
// 6. Optional: check existence or delete
|
|
655
|
-
await session.exists(); // true
|
|
656
|
-
await session.delete(); // removes from storage
|
|
657
235
|
```
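To make the policy fields concrete, here is a rough sketch of how such a policy could be applied to a list of detections. It is not rehydra's implementation: `PiiKind`, `Detection`, `PolicySketch`, and `applyPolicy` are local stand-ins invented for this example, and details such as a default threshold of 0 and exact-match allowlisting are assumptions.

```typescript
// Illustrative stand-ins only; rehydra's real PIIType and policy types may differ.
type PiiKind = "EMAIL" | "PHONE" | "PERSON" | "ORG";

interface Detection {
  kind: PiiKind;
  text: string;
  confidence: number; // 0..1
}

interface PolicySketch {
  enabledTypes: Set<PiiKind>;
  confidenceThresholds: Map<PiiKind, number>;
  allowlistTerms: Set<string>;
}

function applyPolicy(detections: Detection[], policy: PolicySketch): Detection[] {
  return detections.filter((d) => {
    if (!policy.enabledTypes.has(d.kind)) return false; // type not enabled
    if (policy.allowlistTerms.has(d.text)) return false; // explicitly allowed term
    // Assumption: types without an explicit threshold accept any confidence.
    const threshold = policy.confidenceThresholds.get(d.kind) ?? 0;
    return d.confidence >= threshold;
  });
}

const policy: PolicySketch = {
  enabledTypes: new Set<PiiKind>(["EMAIL", "PHONE", "PERSON"]),
  confidenceThresholds: new Map<PiiKind, number>([["PERSON", 0.8]]),
  allowlistTerms: new Set<string>(["Customer Service"]),
};

const kept = applyPolicy(
  [
    { kind: "PERSON", text: "John Smith", confidence: 0.93 },
    { kind: "PERSON", text: "Customer Service", confidence: 0.9 },
    { kind: "PERSON", text: "Will", confidence: 0.55 },
    { kind: "ORG", text: "Acme Corp", confidence: 0.99 },
    { kind: "EMAIL", text: "a@example.com", confidence: 0.5 },
  ],
  policy
);
console.log(kept.map((d) => d.text));
```

In this sketch only "John Smith" and the email survive: the allowlisted term, the low-confidence name, and the disabled `ORG` type are all filtered out.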
|
|
658
236
|
|
|
659
|
-
###
|
|
660
|
-
|
|
661
|
-
Each session ID maps to a separate stored PII map:
|
|
237
|
+
### Anonymization Modes
|
|
662
238
|
|
|
663
239
|
```typescript
|
|
664
|
-
//
|
|
665
|
-
const
|
|
666
|
-
const chat2 = anonymizer.session('user-bob-chat');
|
|
667
|
-
|
|
668
|
-
await chat1.anonymize('Alice: Contact me at alice@example.com');
|
|
669
|
-
await chat2.anonymize('Bob: My number is +49 30 123456');
|
|
240
|
+
// Pseudonymize (default): reversible, returns encrypted PII map
|
|
241
|
+
const anonymizer = createAnonymizer({ mode: 'pseudonymize' });
|
|
670
242
|
|
|
671
|
-
//
|
|
672
|
-
|
|
673
|
-
await chat2.rehydrate(translatedText2); // Uses Bob's PII map
|
|
243
|
+
// Anonymize: irreversible, no PII map returned
|
|
244
|
+
const irreversibleAnonymizer = createAnonymizer({ mode: 'anonymize' });
|
|
674
245
|
```
|
|
675
246
|
|
|
676
|
-
###
|
|
677
|
-
|
|
678
|
-
Within a session, entity IDs are consistent across multiple `anonymize()` calls:
|
|
679
|
-
|
|
680
|
-
```typescript
|
|
681
|
-
const session = anonymizer.session('chat-123');
|
|
682
|
-
|
|
683
|
-
// Message 1: User provides contact info
|
|
684
|
-
const msg1 = await session.anonymize('Contact me at user@example.com');
|
|
685
|
-
// → "Contact me at <PII type="EMAIL" id="1"/>"
|
|
686
|
-
|
|
687
|
-
// Message 2: References same email + new one
|
|
688
|
-
const msg2 = await session.anonymize('CC: user@example.com and admin@example.com');
|
|
689
|
-
// → "CC: <PII type="EMAIL" id="1"/> and <PII type="EMAIL" id="2"/>"
|
|
690
|
-
// ↑ Same ID (reused) ↑ New ID
|
|
691
|
-
|
|
692
|
-
// Message 3: No PII
|
|
693
|
-
await session.anonymize('Please translate to German');
|
|
694
|
-
// Previous PII preserved
|
|
695
|
-
|
|
696
|
-
// All messages can be rehydrated correctly
|
|
697
|
-
await session.rehydrate(msg1.anonymizedText); // ✓
|
|
698
|
-
await session.rehydrate(msg2.anonymizedText); // ✓
|
|
699
|
-
```
|
|
700
|
-
|
|
701
|
-
This ensures that follow-up messages referencing the same PII produce consistent placeholders, and rehydration works correctly across the entire conversation.
|
|
702
|
-
|
|
703
|
-
### SQLite Provider (Node.js + Bun only)
|
|
704
|
-
|
|
705
|
-
The SQLite provider works on both Node.js and Bun with automatic runtime detection.
|
|
706
|
-
|
|
707
|
-
> **Note:** `SQLitePIIStorageProvider` is **not available in browser builds**. When bundling for browser with Vite/webpack, use `IndexedDBPIIStorageProvider` instead. The browser-safe build automatically excludes SQLite to avoid bundling Node.js dependencies.
|
|
247
|
+
### Custom Recognizers
|
|
708
248
|
|
|
709
249
|
```typescript
|
|
710
|
-
|
|
711
|
-
import { SQLitePIIStorageProvider } from 'rehydra';
|
|
712
|
-
// Or explicitly: import { SQLitePIIStorageProvider } from 'rehydra/storage/sqlite';
|
|
250
|
+
import { createCustomIdRecognizer, PIIType } from 'rehydra';
|
|
713
251
|
|
|
714
|
-
|
|
715
|
-
|
|
716
|
-
|
|
252
|
+
const recognizer = createCustomIdRecognizer([{
|
|
253
|
+
name: 'Order Number',
|
|
254
|
+
pattern: /\bORD-[A-Z0-9]{8}\b/g,
|
|
255
|
+
type: PIIType.CASE_ID,
|
|
256
|
+
}]);
|
|
717
257
|
|
|
718
|
-
|
|
719
|
-
const testStorage = new SQLitePIIStorageProvider(':memory:');
|
|
720
|
-
await testStorage.initialize();
|
|
258
|
+
anonymizer.getRegistry().register(recognizer);
|
|
721
259
|
```
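Before registering a pattern, it can help to check it against sample text in isolation. The sketch below exercises the same `ORD-` regex with plain JavaScript string matching; `findOrderNumbers` is a hypothetical helper for this example and does not touch rehydra's registry.

```typescript
// Same pattern as the recognizer above: "ORD-" followed by exactly eight
// uppercase letters or digits, delimited by word boundaries.
const orderPattern = /\bORD-[A-Z0-9]{8}\b/g;

// Hypothetical helper for trying the pattern out; not part of rehydra.
function findOrderNumbers(text: string): string[] {
  // With a global regex, String.prototype.match returns every match (or null).
  return text.match(orderPattern) ?? [];
}

const found = findOrderNumbers(
  "Refund ORD-A1B2C3D4 and ORD-12345678, but not ORD-123 or ORD-A1B2C3D4E."
);
console.log(found); // ["ORD-A1B2C3D4", "ORD-12345678"]
```

Because the pattern carries the `g` flag, calling `orderPattern.test()` repeatedly would advance `lastIndex` between calls; `String.prototype.match` avoids that statefulness for quick checks.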
|
|
722
260
|
|
|
723
|
-
|
|
724
|
-
- **Bun**: Uses built-in `bun:sqlite` (no additional install needed)
|
|
725
|
-
- **Node.js**: Requires `better-sqlite3`:
|
|
261
|
+
### GPU Acceleration
|
|
726
262
|
|
|
727
|
-
|
|
728
|
-
npm install better-sqlite3
|
|
729
|
-
```
|
|
730
|
-
|
|
731
|
-
### IndexedDB Provider (Browser)
|
|
263
|
+
For high-throughput batch processing, use a remote inference server with GPU:
|
|
732
264
|
|
|
733
265
|
```typescript
|
|
734
|
-
import {
|
|
735
|
-
createAnonymizer,
|
|
736
|
-
InMemoryKeyProvider,
|
|
737
|
-
IndexedDBPIIStorageProvider,
|
|
738
|
-
} from 'rehydra';
|
|
739
|
-
|
|
740
|
-
// Custom database name (defaults to 'rehydra-pii-storage')
|
|
741
|
-
const storage = new IndexedDBPIIStorageProvider('my-app-pii');
|
|
742
|
-
|
|
743
266
|
const anonymizer = createAnonymizer({
|
|
744
|
-
ner: {
|
|
745
|
-
|
|
746
|
-
|
|
267
|
+
ner: {
|
|
268
|
+
backend: 'inference-server',
|
|
269
|
+
inferenceServerUrl: 'http://localhost:8080',
|
|
270
|
+
},
|
|
747
271
|
});
|
|
748
|
-
await anonymizer.initialize();
|
|
749
|
-
|
|
750
|
-
// Use sessions as usual
|
|
751
|
-
const session = anonymizer.session('browser-chat-123');
|
|
752
|
-
const result = await session.anonymize('Hello John!');
|
|
753
|
-
const original = await session.rehydrate(result.anonymizedText);
|
|
754
272
|
```
|
|
755
273
|
|
|
756
|
-
|
|
274
|
+
## Encryption
|
|
757
275
|
|
|
758
|
-
|
|
276
|
+
PII maps are encrypted with **AES-256-GCM** via the Web Crypto API.
|
|
759
277
|
|
|
760
278
|
```typescript
|
|
761
|
-
|
|
762
|
-
|
|
763
|
-
anonymize(text: string, locale?: string, policy?: Partial<AnonymizationPolicy>): Promise<AnonymizationResult>;
|
|
764
|
-
rehydrate(text: string): Promise<string>;
|
|
765
|
-
load(): Promise<StoredPIIMap | null>;
|
|
766
|
-
delete(): Promise<boolean>;
|
|
767
|
-
exists(): Promise<boolean>;
|
|
768
|
-
}
|
|
769
|
-
```
|
|
770
|
-
|
|
771
|
-
### Data Retention
|
|
772
|
-
|
|
773
|
-
**Entries persist forever by default.** Use `cleanup()` on the storage provider to remove old entries:
|
|
774
|
-
|
|
775
|
-
```typescript
|
|
776
|
-
// Delete entries older than 7 days
|
|
777
|
-
const count = await storage.cleanup(new Date(Date.now() - 7 * 24 * 60 * 60 * 1000));
|
|
778
|
-
|
|
779
|
-
// Or delete specific sessions
|
|
780
|
-
await session.delete();
|
|
781
|
-
|
|
782
|
-
// List all stored sessions
|
|
783
|
-
const sessionIds = await storage.list();
|
|
784
|
-
```
|
|
785
|
-
|
|
786
|
-
## Browser Usage
|
|
787
|
-
|
|
788
|
-
The library works seamlessly in browsers without any special configuration.
|
|
789
|
-
|
|
790
|
-
### Browser Notes
|
|
791
|
-
|
|
792
|
-
- **First-use downloads**: NER model (~280 MB) and semantic data (~12 MB) are downloaded on first use
|
|
793
|
-
- **ONNX runtime**: Automatically loaded from CDN if not bundled
|
|
794
|
-
- **Offline support**: After initial download, everything works offline
|
|
795
|
-
- **Storage**: Uses IndexedDB and OPFS - data persists across sessions
|
|
796
|
-
|
|
797
|
-
### Bundler Support (Vite, webpack, esbuild)
|
|
798
|
-
|
|
799
|
-
The package uses [conditional exports](https://nodejs.org/api/packages.html#conditional-exports) to automatically provide a browser-safe build when bundling for the web. This means:
|
|
800
|
-
|
|
801
|
-
- **Automatic**: Vite, webpack, esbuild, and other modern bundlers will automatically use `dist/browser.js`
|
|
802
|
-
- **No Node.js modules**: The browser build excludes `SQLitePIIStorageProvider` and other Node.js-specific code
|
|
803
|
-
- **Tree-shakable**: Only the code you use is included in your bundle
|
|
804
|
-
|
|
805
|
-
```json
|
|
806
|
-
// package.json exports (simplified)
|
|
807
|
-
{
|
|
808
|
-
"exports": {
|
|
809
|
-
".": {
|
|
810
|
-
"browser": "./dist/browser.js",
|
|
811
|
-
"node": "./dist/index.js",
|
|
812
|
-
"default": "./dist/index.js"
|
|
813
|
-
}
|
|
814
|
-
}
|
|
815
|
-
}
|
|
816
|
-
```
|
|
817
|
-
|
|
818
|
-
## Bun Support
|
|
819
|
-
|
|
820
|
-
This library works with [Bun](https://bun.sh). Since `onnxruntime-node` is a native Node.js addon, Bun uses `onnxruntime-web`:
|
|
279
|
+
// Development: random key, lost on restart
|
|
280
|
+
const keyProvider = new InMemoryKeyProvider();
|
|
821
281
|
|
|
822
|
-
|
|
823
|
-
|
|
282
|
+
// Production: persistent key (generate with: openssl rand -base64 32)
|
|
283
|
+
const productionKeyProvider = new ConfigKeyProvider(process.env.PII_ENCRYPTION_KEY!);
|
|
824
284
|
```
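AES-256 needs a 256-bit (32-byte) key, which is what `openssl rand -base64 32` produces in base64 form. A startup check along the following lines can catch a truncated or mis-pasted key early. This is a sketch under the assumption that the key is supplied as standard padded base64; `base64ByteLength` and `isAes256Key` are hypothetical helpers, and whether `ConfigKeyProvider` performs its own validation is not stated in the README.

```typescript
// Hypothetical helpers (not rehydra exports) for sanity-checking a key string.

// Returns the decoded byte length of a standard, padded base64 string,
// or -1 if the input is not well-formed base64.
function base64ByteLength(encoded: string): number {
  const trimmed = encoded.trim();
  const wellFormed =
    trimmed.length > 0 &&
    trimmed.length % 4 === 0 &&
    /^[A-Za-z0-9+/]+={0,2}$/.test(trimmed);
  if (!wellFormed) return -1;
  const padding = trimmed.length - trimmed.replace(/=+$/, "").length;
  return (trimmed.length / 4) * 3 - padding;
}

// AES-256 requires exactly 32 bytes of key material.
function isAes256Key(encoded: string): boolean {
  return base64ByteLength(encoded) === 32;
}

// 43 base64 characters plus one "=" decode to 32 bytes.
const exampleKey = "A".repeat(43) + "=";
console.log(isAes256Key(exampleKey));
```

The all-`A` key is for demonstrating the length arithmetic only; a real key should come from a cryptographically secure generator such as the `openssl` command mentioned above.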
|
|
825
285
|
|
|
826
|
-
|
|
827
|
-
|
|
828
|
-
## Performance
|
|
829
|
-
|
|
830
|
-
Benchmarks on Apple M-series (CPU) and NVIDIA T4 (GPU). Run `npm run benchmark:compare` to measure on your hardware.
|
|
831
|
-
|
|
832
|
-
### Backend Comparison
|
|
833
|
-
|
|
834
|
-
| Backend | Short (~40 chars) | Medium (~500 chars) | Long (~2K chars) | Entity-dense |
|
|
835
|
-
|---------|-------------------|---------------------|------------------|--------------|
|
|
836
|
-
| **Regex-only** | 0.38 ms | 0.50 ms | 0.91 ms | 0.35 ms |
|
|
837
|
-
| **NER CPU** | 4.3 ms | 26 ms | 93 ms | 13 ms |
|
|
838
|
-
| **NER GPU** | 62 ms | 73 ms | 117 ms | 68 ms |
|
|
839
|
-
|
|
840
|
-
Local CPU inference is faster than GPU for typical workloads due to network overhead. GPU servers are beneficial for high-throughput batch processing where many requests can be parallelized.
|
|
841
|
-
|
|
842
|
-
### Throughput (ops/sec)
|
|
843
|
-
|
|
844
|
-
| Backend | Short | Medium | Long |
|
|
845
|
-
|---------|-------|--------|------|
|
|
846
|
-
| **Regex-only** | ~2,640 | ~2,017 | ~1,096 |
|
|
847
|
-
| **NER CPU** | ~234 | ~38 | ~11 |
|
|
848
|
-
| **NER GPU** | ~16 | ~14 | ~9 |
|
|
849
|
-
|
|
850
|
-
### Model Downloads
|
|
851
|
-
|
|
852
|
-
| Model | Size | First-Use Download |
|
|
853
|
-
|-------|------|-------------------|
|
|
854
|
-
| Quantized NER | ~265 MB | ~30s on fast connection |
|
|
855
|
-
| Standard NER | ~1.1 GB | ~2min on fast connection |
|
|
856
|
-
| Semantic Data | ~12 MB | ~5s on fast connection |
|
|
857
|
-
|
|
858
|
-
### Recommendations
|
|
859
|
-
|
|
860
|
-
| Use Case | Recommended Backend |
|
|
861
|
-
|----------|---------------------|
|
|
862
|
-
| Structured PII only (email, phone, IBAN) | Regex-only |
|
|
863
|
-
| General use with name/org/location detection | **NER CPU (default)** |
|
|
864
|
-
| High-throughput batch processing (1000s of docs) | NER GPU |
|
|
865
|
-
| Privacy-sensitive / zero-knowledge required | NER CPU (data never leaves device) |
|
|
866
|
-
|
|
867
|
-
> **Note:** Local CPU inference now outperforms GPU for most use cases due to network overhead elimination. The trie-based tokenizer provides O(token_length) lookups instead of O(vocab_size), making local inference practical for production use.
|
|
868
|
-
|
|
869
|
-
## Requirements
|
|
286
|
+
## Platform Support
|
|
870
287
|
|
|
871
288
|
| Environment | Version | Notes |
|
|
872
289
|
|-------------|---------|-------|
|
|
873
290
|
| Node.js | >= 18.0.0 | Uses native `onnxruntime-node` |
|
|
874
|
-
| Bun | >= 1.0.0 |
|
|
875
|
-
| Browsers | Chrome 86+, Firefox 89+, Safari 15.4
|
|
291
|
+
| Bun | >= 1.0.0 | Install `onnxruntime-web`: `bun add rehydra onnxruntime-web` |
|
|
292
|
+
| Browsers | Chrome 86+, Firefox 89+, Safari 15.4+ | Uses OPFS for model storage |
|
|
876
293
|
|
|
877
|
-
|
|
294
|
+
The browser build (`rehydra/browser`) automatically excludes Node.js dependencies. Modern bundlers (Vite, webpack, esbuild) select the right entry point via conditional exports.
|
|
878
295
|
|
|
879
|
-
|
|
880
|
-
# Install dependencies
|
|
881
|
-
npm install
|
|
882
|
-
|
|
883
|
-
# Run tests
|
|
884
|
-
npm test
|
|
885
|
-
|
|
886
|
-
# Build
|
|
887
|
-
npm run build
|
|
888
|
-
|
|
889
|
-
# Lint
|
|
890
|
-
npm run lint
|
|
891
|
-
```
|
|
892
|
-
|
|
893
|
-
### Building Custom Models
|
|
894
|
-
|
|
895
|
-
For development or custom models:
|
|
296
|
+
## Development
|
|
896
297
|
|
|
897
298
|
```bash
|
|
898
|
-
#
|
|
899
|
-
npm run
|
|
900
|
-
npm
|
|
299
|
+
npm install # Install dependencies
|
|
300
|
+
npm run build # Compile TypeScript
|
|
301
|
+
npm test # Run tests (watch mode)
|
|
302
|
+
npm run test:run # Run tests once
|
|
303
|
+
npm run lint # ESLint
|
|
304
|
+
npm run setup:ner # Pre-download NER model (~280 MB)
|
|
305
|
+
npm run benchmark # Run benchmarks
|
|
306
|
+
|
|
307
|
+
# Integration tests (API keys required unless noted)
|
|
308
|
+
npm run test:streaming # No API key needed
|
|
309
|
+
OPENAI_API_KEY=... npm run test:proxy:openai -- --ner # OpenAI with NER
|
|
310
|
+
ANTHROPIC_API_KEY=... npm run test:proxy:anthropic -- --ner # Anthropic with NER
|
|
901
311
|
```
|
|
902
312
|
|
|
903
313
|
## License
|