pseudonym-mcp 0.7.1 → 0.7.3

package/README.md CHANGED
@@ -1,21 +1,23 @@
1
1
  # pseudonym-mcp
2
2
 
3
- Local privacy proxy for LLMs: pseudonymizes sensitive data before it reaches the cloud, then restores it on the way back.
3
+ Local pseudonymisation layer for LLM workflows: replaces detected PII with opaque tokens before the prompt reaches the cloud, then restores it on the way back.
4
4
 
5
5
  [![npm version](https://img.shields.io/npm/v/pseudonym-mcp?style=flat-square&logo=npm&logoColor=white)](https://www.npmjs.com/package/pseudonym-mcp)
6
6
  [![License: MIT](https://img.shields.io/badge/License-MIT-ffd60a?style=flat-square)](LICENSE)
7
7
  [![Node 18+](https://img.shields.io/badge/node-18%2B-339933?logo=node.js&logoColor=white&style=flat-square)](#)
8
- [![GDPR Ready](https://img.shields.io/badge/GDPR-ready-0070f3?style=flat-square)](#gdpr--ai-compliance)
9
- [![Zero Cloud](https://img.shields.io/badge/zero%20cloud-detection-brightgreen?style=flat-square)](#)
8
+ [![GDPR-aligned](https://img.shields.io/badge/GDPR-aligned-0070f3?style=flat-square)](#gdpr--ai-compliance)
9
+ [![Local detection](https://img.shields.io/badge/detection-local-brightgreen?style=flat-square)](#)
10
10
  [![Offline NER](https://img.shields.io/badge/NER-local%20%2F%20offline-blue?style=flat-square)](#)
11
11
 
12
- Sits between your application and any cloud LLM (Claude, GPT-4, Gemini…). Replaces PII with opaque tokens locally before the prompt ever leaves your machine, then seamlessly restores original values in the response — so users never see the tags.
12
+ Sits between your application and any cloud LLM (Claude, GPT-4, Gemini…). Detects PII locally and replaces it with opaque tokens before the prompt leaves your machine, then restores original values in the response — so users never see the tags.
13
+
14
+ It is a **defense-in-depth measure**, not a compliance silver bullet. Read the [Limitations](#limitations) and [GDPR & AI Compliance](#gdpr--ai-compliance) sections before assuming this stack does more than it does.
13
15
 
14
16
  ## What you get
15
17
 
16
18
  - **Multi-language PII detection**: Built-in support for English (SSN, credit cards, US phone) and Polish (PESEL, IBAN, Polish phone). New **heuristic language detection** (`detectLanguage()`) infers the language from text content — `--lang` remains the authoritative override but is no longer the only input.
17
- - **Hybrid NER engine**: Regex for structured PII (SSN, credit cards, IBAN, email, phone) + local Ollama LLM for unstructured entities (names, organizations).
18
- - **Zero-trust architecture**: All detection and substitution happens on your machine. No PII reaches a third-party API.
19
+ - **Hybrid NER engine**: Regex for structured PII (SSN, credit cards, IBAN, email, phone) + local Ollama LLM for unstructured entities (names, organisations).
20
+ - **Local-detection architecture**: Detection and substitution happen on your machine. The cloud LLM call still happens (that's the point) — but it sees tokens instead of detected PII.
19
21
  - **Session-keyed mapping store**: Tokens like `[PERSON:1]` map back to originals in an isolated, per-request session. Multiple round-trips preserve token coherence.
20
22
  - **Auto-unmask**: Optional mode that automatically restores tokens in the LLM's response before returning it to the user.
21
23
  - **Flexible engines**: Run `regex` only (no Ollama required), `llm` only, or `hybrid` (default).
@@ -27,53 +29,57 @@ Sits between your application and any cloud LLM (Claude, GPT-4, Gemini…). Repl
27
29
 
28
30
  ❌ **Without pseudonym-mcp:**
29
31
 
30
- - Prompt: `"John Smith, SSN 123-45-6789, card 4111 1111 1111 1111"` → sent verbatim to OpenAI / Anthropic servers
31
- - Every name, ID number, and credit card in your prompt is processed and potentially logged by the LLM provider
32
- - A data breach at the provider's end exposes your users' real PII
33
- - Sending personal data to a US-based LLM provider without explicit safeguards may violate GDPR Article 44 (international data transfers)
32
+ - Prompt: `"John Smith, SSN 123-45-6789, card 4111 1111 1111 1111"` → sent verbatim to the LLM provider
33
+ - Every name, ID number, and credit card in your prompt is processed and potentially logged by the provider
34
+ - A breach at the provider's end exposes those values in cleartext
35
+ - Sending personal data to a non-EU LLM provider without further safeguards raises GDPR Article 44 questions you'll need to answer
34
36
 
35
37
  ✅ **With pseudonym-mcp:**
36
38
 
37
39
  - The same prompt becomes `"[PERSON:1], SSN [SSN:1], card [CREDIT_CARD:1]"` before it leaves your machine
38
- - The LLM reasons about the structure and content without ever seeing the real values
39
- - The response is automatically de-tokenized locally before reaching the user
40
- - Your GDPR DPA can truthfully state: _personal data never left the local environment_
40
+ - The LLM reasons about structure and content without seeing those detected values in cleartext
41
+ - The response is locally de-tokenised before reaching the user
42
+ - Detected direct identifiers are no longer shipped upstream, though structure, dates, indirect references, and any missed PII still are
43
+
44
+ This is a meaningful reduction in cleartext PII exposure. It is **not** "no personal data leaves your machine" — see [Limitations](#limitations).
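
The before/after shown above can be illustrated with a minimal regex-only sketch. The function name `maskStructured`, the two simplified patterns, and the shape of the returned mapping are assumptions made for this illustration, not the package's actual rules. The name `John Smith` is left untouched on purpose, since free-form names need the NER phase.

```javascript
// Hypothetical, simplified patterns for illustration only.
const RULES = [
  { tag: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { tag: 'CREDIT_CARD', pattern: /\b\d{4}(?:[ -]\d{4}){3}\b/g },
];

function maskStructured(text) {
  const mapping = new Map(); // token -> original value
  const seen = new Map();    // tag + value -> token (keeps tokens stable)
  const counters = {};       // per-tag running number
  let masked = text;
  for (const { tag, pattern } of RULES) {
    masked = masked.replace(pattern, (value) => {
      const key = `${tag}\u0000${value}`;
      if (!seen.has(key)) {
        counters[tag] = (counters[tag] || 0) + 1;
        const token = `[${tag}:${counters[tag]}]`;
        seen.set(key, token);
        mapping.set(token, value);
      }
      return seen.get(key);
    });
  }
  return { masked, mapping };
}

const { masked, mapping } = maskStructured(
  'John Smith, SSN 123-45-6789, card 4111 1111 1111 1111'
);
console.log(masked); // John Smith, SSN [SSN:1], card [CREDIT_CARD:1]
```

The mapping stays on the local side; only `masked` would be sent upstream.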
41
45
 
42
46
  ## GDPR & AI Compliance
43
47
 
44
- pseudonym-mcp directly addresses the regulatory challenges of using cloud AI in data-sensitive contexts.
48
+ pseudonym-mcp is relevant to compliance work, but it is a **technical control**, not a compliance product. Whether you are compliant with any specific regulation depends on your full stack, your role (controller/processor), your contracts, your DPIA, and your jurisdiction.
45
49
 
46
50
  ### Why this matters
47
51
 
48
- The EU **General Data Protection Regulation (GDPR)** classifies names, national ID numbers (like SSN or PESEL), bank account numbers (IBAN), email addresses, credit card numbers, and phone numbers as **personal data** under Article 4(1). Sending this data to a cloud LLM provider constitutes **processing** under Article 4(2) and triggers a range of obligations:
52
+ The EU **General Data Protection Regulation (GDPR)** classifies names, national ID numbers (like SSN or PESEL), bank account numbers (IBAN), email addresses, credit card numbers, and phone numbers as **personal data** under Article 4(1). Sending this data to a cloud LLM provider constitutes **processing** under Article 4(2). Pseudonymisation is explicitly recognised under Art. 4(5) as a risk-reduction measure — but, critically, **pseudonymised data is still personal data** (Recital 26).
49
53
 
50
- | GDPR Article | Obligation | How pseudonym-mcp helps |
51
- | ------------ | -------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
52
- | Art. 5(1)(c) | **Data minimisation**: only necessary data should be processed | Strips PII before transmission; the LLM receives only what it needs to reason |
53
- | Art. 25 | **Privacy by design and by default** | Pseudonymization layer is built into the MCP transport, not bolted on |
54
- | Art. 32 | **Security of processing**: appropriate technical measures | Local token substitution is a recognized technical measure under Recital 83 |
55
- | Art. 44 | **Transfers to third countries**: requires safeguards | If no personal data is transferred, Art. 44 restrictions do not apply |
56
- | Art. 4(5) | **Pseudonymisation**: explicitly recognized as a protective measure | Tokens are opaque; re-identification requires access to the local mapping store |
54
+ | GDPR Article | Obligation | Where pseudonym-mcp helps | Where it doesn't |
55
+ | ------------ | -------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- |
56
+ | Art. 5(1)(c) | **Data minimisation** | Strips detected direct identifiers before transmission | Doesn't minimise context, structure, or undetected PII |
57
+ | Art. 25 | **Privacy by design and by default** | Provides a technical layer that fits into a privacy-by-design architecture | Architecture and policy decisions are still your responsibility |
58
+ | Art. 32 | **Security of processing** | Recognised technical measure under Recital 83 (pseudonymisation) | One control among many; doesn't replace access control, logging, encryption |
59
+ | Art. 44 | **Transfers to third countries** | Reduces the cleartext PII you transfer | Pseudonymised personal data is still personal data; transfer rules still apply |
60
+ | Art. 4(5) | **Pseudonymisation** definition | The mapping store is opaque to the cloud LLM; re-identification requires the local session | Re-identification is possible from context for anyone with side knowledge |
57
61
 
58
- > **Note:** Pseudonymisation under GDPR (Art. 4(5)) does not equal anonymisation; the data is still personal data in your system. However, it substantially reduces risk and demonstrates compliance with the accountability principle (Art. 5(2)).
62
+ > **The honest bottom line:** pseudonymisation under GDPR Art. 4(5) is **not** anonymisation. The data remains personal data in your system, and Art. 44 transfer obligations are not switched off just because you tokenised the name field.
59
63
 
60
64
  ### AI Act alignment
61
65
 
62
- The EU **AI Act** (in force from 2024) places additional requirements on high-risk AI systems that process personal data. Using pseudonym-mcp as an intermediary layer:
66
+ The EU **AI Act** places additional requirements on high-risk AI systems that process personal data. Using pseudonym-mcp as an intermediary layer can:
63
67
 
64
- - Reduces the risk classification of downstream LLM usage by ensuring the model never processes identifiable natural persons' data directly.
65
- - Supports documentation requirements for AI system transparency and human oversight.
66
- - Aligns with the principle of **technical robustness and safety** (Art. 15) by limiting PII exposure surface.
68
+ - Support data minimisation in your AI system's data flows.
69
+ - Help document a technical control for transparency and human-oversight requirements.
70
+ - Align with the principle of **technical robustness and safety** (Art. 15) by limiting cleartext PII exposure.
71
+
72
+ It does not change your AI Act risk classification on its own — classification is a function of use-case and deployment context, not of the masking step in front of the model.
67
73
 
68
74
  ### US & international applicability
69
75
 
70
- While GDPR originates in the EU, pseudonym-mcp is equally relevant for:
76
+ The tool is also relevant outside the EU, with the same caveats:
71
77
 
72
- - **CCPA / CPRA** (California) — consumers have the right to know what personal information is collected; minimising data sent to third-party LLMs reduces disclosure surface.
73
- - **HIPAA** (US healthcare) — PHI (Protected Health Information) must not be sent to non-BAA cloud providers; local pseudonymization allows LLM use without a BAA.
74
- - **PCI DSS** (payment industry) — credit card numbers (PAN) must never be stored or transmitted in the clear; masking before LLM transit satisfies requirement 3.4.
75
- - **SOC 2** — data handling controls are strengthened by demonstrating that PII is replaced before leaving the trust boundary.
76
- - **PIPEDA** (Canada), **LGPD** (Brazil), **POPIA** (South Africa) — all require appropriate safeguards for cross-border personal data transfers.
78
+ - **CCPA / CPRA** (California) — reduces personal information sent to third-party processors; doesn't change controller/business obligations or consumer rights.
79
+ - **HIPAA** (US healthcare) — pseudonymised PHI is still PHI under HIPAA. Using this tool does **not** eliminate the need for a BAA with your cloud LLM provider if you're a covered entity or business associate. It can be part of a defensible safeguard posture; it cannot substitute for one.
80
+ - **PCI DSS** (payment industry) — Luhn-validated detection reduces the chance card numbers ride in cleartext to an LLM. It is one control; PCI scope, segmentation, and storage rules are separate concerns.
81
+ - **SOC 2** — useful evidence of a technical control limiting PII exposure. Auditors will look at the full picture, not just this layer.
82
+ - **PIPEDA** (Canada), **LGPD** (Brazil), **POPIA** (South Africa) — all require appropriate safeguards for cross-border personal data transfers. This tool is a relevant safeguard, not a substitute for the legal basis of the transfer.
77
83
 
78
84
  ### Sector-specific applicability
79
85
 
@@ -82,11 +88,13 @@ While GDPR originates in the EU, pseudonym-mcp is equally relevant for:
82
88
  | Healthcare | GDPR + HIPAA + national health data laws | Patient names, SSN, diagnoses |
83
89
  | Banking & Finance | GDPR + PCI DSS + PSD2 + DORA | Credit cards, IBAN, SSN, PESEL |
84
90
  | HR & Recruitment | GDPR Art. 9 (special categories) | Names, national IDs, contact details |
85
- | Legal | GDPR + attorney-client privilege | Names, case numbers, personal details |
91
+ | Legal | GDPR + attorney-client privilege | Names, case numbers, personal details |
86
92
  | Insurance | GDPR + Solvency II | Personal identifiers, health data |
87
93
  | Public Sector (US) | CCPA + state privacy laws | SSN, driver's license numbers |
88
94
  | Public Sector (PL) | GDPR + UODO + KRI | PESEL, NIP, REGON |
89
95
 
96
+ In every row of this table, pseudonym-mcp is a useful **building block**. None of those regimes can be satisfied by a masking tool alone.
97
+
90
98
  ## How it works
91
99
 
92
100
  ```
@@ -102,7 +110,7 @@ Your App / Claude Desktop
102
110
  │ Phase 2: Ollama NER │ ← PERSON, ORG (local LLM)
103
111
  │ MappingStore (session) │ ← [TAG:N] ↔ original value
104
112
  └────────────┬────────────┘
105
- sanitized prompt (no PII)
113
+ sanitised prompt (detected PII → tokens)
106
114
 
107
115
  Cloud LLM API
108
116
  (Claude / GPT-4 / Gemini)
@@ -154,17 +162,17 @@ We discussed a contract for 45 000 zł. Contact: jan.kowalski@acme.pl
154
162
  In Claude Code you type:
155
163
 
156
164
  ```
157
- Use mask_text on this note, then summarize the key points of the meeting.
165
+ Use mask_text on this note, then summarise the key points of the meeting.
158
166
  ```
159
167
 
160
- **pseudonym-mcp replaces PII locally before sending to Claude:**
168
+ **pseudonym-mcp replaces detected PII locally before the prompt goes upstream:**
161
169
 
162
170
  ```
163
171
  Meeting with [PERSON:1] ([PESEL:1]) from [ORG:1].
164
172
  We discussed a contract for 45 000 zł. Contact: [EMAIL:1]
165
173
  ```
166
174
 
167
- **Claude responds (sees tokens only):**
175
+ **Claude responds (working from tokens):**
168
176
 
169
177
  ```
170
178
  Meeting with [PERSON:1] from [ORG:1] covered a contract
@@ -178,7 +186,7 @@ Meeting with Jan Kowalski from Acme sp. z o.o. covered
178
186
  a contract for 45 000 zł. Follow up via jan.kowalski@acme.pl
179
187
  ```
180
188
 
181
- Anthropic / OpenAI never saw any real data. The entire swap happens on your machine.
189
+ The cloud provider saw the structure of the meeting and the amount — but not the detected name, PESEL, organisation, or email in cleartext. The swap happens on your machine.
182
190
 
183
191
  ### Obsidian vault with `session_id`
184
192
 
@@ -187,13 +195,13 @@ Anthropic / OpenAI never saw any real data. The entire swap happens on your mach
187
195
  Use mask_text on my notes — remember the session_id
188
196
 
189
197
  # ask Claude anything across multiple prompts
190
- Summarize all meetings from Q1
198
+ Summarise all meetings from Q1
191
199
 
192
200
  # Claude replies with tokens; restore originals
193
201
  Use unmask_text with session_id abc123 on the response
194
202
  ```
195
203
 
196
- The `session_id` keeps the token map alive for the entire session — the same `[PERSON:1]` always refers to the same person, no matter how many times they appear across different notes.
204
+ The `session_id` keeps the token map alive for the session — the same `[PERSON:1]` always refers to the same person across notes. That consistency is what makes cross-note reasoning possible; it is also what makes a masked corpus potentially re-identifiable to anyone with side knowledge of your work. Use long-lived sessions deliberately.
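
A minimal sketch of a session-keyed store may make the trade-off concrete. The class name `SessionStore` and its methods are hypothetical, chosen for illustration, and are not the package's `MappingStore` API. The sketch shows that the same value in the same session yields the same token, and that ending the session discards the only link back to the original.

```javascript
// Hypothetical in-memory store: sessionId -> { forward, reverse, counters }.
class SessionStore {
  constructor() {
    this.sessions = new Map();
  }

  session(sessionId) {
    if (!this.sessions.has(sessionId)) {
      this.sessions.set(sessionId, { forward: new Map(), reverse: new Map(), counters: {} });
    }
    return this.sessions.get(sessionId);
  }

  // Same (tag, value) pair in the same session always yields the same token.
  tokenFor(sessionId, tag, value) {
    const s = this.session(sessionId);
    const key = `${tag}\u0000${value}`;
    if (!s.forward.has(key)) {
      s.counters[tag] = (s.counters[tag] || 0) + 1;
      const token = `[${tag}:${s.counters[tag]}]`;
      s.forward.set(key, token);
      s.reverse.set(token, value);
    }
    return s.forward.get(key);
  }

  originalFor(sessionId, token) {
    const s = this.sessions.get(sessionId);
    return s ? s.reverse.get(token) : undefined;
  }

  // Ending a session removes the mapping needed to restore originals.
  end(sessionId) {
    this.sessions.delete(sessionId);
  }
}

const store = new SessionStore();
const first = store.tokenFor('abc123', 'PERSON', 'Jan Kowalski'); // first note
const again = store.tokenFor('abc123', 'PERSON', 'Jan Kowalski'); // later note, same session
const other = store.tokenFor('abc123', 'PERSON', 'Anna Nowak');
console.log(first, again, other); // [PERSON:1] [PERSON:1] [PERSON:2]
```

Calling `store.end('abc123')` once the work is done is one way to keep a session from living longer than intended.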
197
205
 
198
206
  ## MCP Prompt Templates
199
207
 
@@ -207,28 +215,28 @@ pseudonym-mcp ships two built-in prompt templates that chain masking, an LLM tas
207
215
 
208
216
  What happens:
209
217
 
210
- 1. pseudonym-mcp masks PII locally → `[PERSON:1]`, `[PESEL:1]`
211
- 2. Claude processes the anonymized text
218
+ 1. pseudonym-mcp masks detected PII locally → `[PERSON:1]`, `[PESEL:1]`
219
+ 2. Claude processes the masked text
212
220
  3. pseudonym-mcp restores originals in the response
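
The three steps above can be sketched as a small round trip. Everything here is illustrative: `unmask`, `fakeLlm`, and the hard-coded mapping are hypothetical names and data, and the real template delegates these steps to the `mask_text` and `unmask_text` tools.

```javascript
// Replace every [TAG:N] token that has a mapping entry; leave unknown tokens untouched.
function unmask(text, mapping) {
  return text.replace(/\[[A-Z_]+:\d+\]/g, (token) =>
    mapping.has(token) ? mapping.get(token) : token
  );
}

// Hypothetical stand-in for the cloud model: it only receives the masked text.
function fakeLlm(maskedText) {
  return `Summary: ${maskedText}`;
}

const sessionMapping = new Map([
  ['[PERSON:1]', 'Jan Kowalski'],
  ['[ORG:1]', 'Acme sp. z o.o.'],
]);

const maskedNote = 'Meeting with [PERSON:1] from [ORG:1] covered a contract.'; // step 1 output
const reply = fakeLlm(maskedNote);                                             // step 2
const restored = unmask(reply, sessionMapping);                                // step 3
console.log(restored);
// Summary: Meeting with Jan Kowalski from Acme sp. z o.o. covered a contract.
```

Using a replacer function rather than a replacement string avoids surprises when an original value contains `$` sequences that `String.prototype.replace` would otherwise interpret.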
213
221
 
214
222
  Optional `lang` argument: `en` (default) or `pl`.
215
223
 
216
224
  ### `privacy_scan_file` — file / PDF (macOS only)
217
225
 
218
- > **Requires [macos-vision-mcp](https://github.com/woladi/macos-vision-mcp)** — a separate MCP server that uses Apple's Vision framework to extract text from PDFs and images. macOS only.
226
+ > **Requires [macos-vision-mcp](https://github.com/woladi/macos-vision-mcp)** — a separate MCP server that uses Apple's Vision framework to extract text from PDFs and images on-device. macOS only.
219
227
 
220
228
  ```
221
- /privacy_scan_file filePath="/Users/me/contracts/nda.pdf" task="Summarize obligations and deadlines"
229
+ /privacy_scan_file filePath="/Users/me/contracts/nda.pdf" task="Summarise obligations and deadlines"
222
230
  ```
223
231
 
224
232
  What happens:
225
233
 
226
- 1. macos-vision-mcp extracts text from the file
227
- 2. pseudonym-mcp masks all PII locally
228
- 3. Claude processes the anonymized content
234
+ 1. macos-vision-mcp extracts text from the file on-device
235
+ 2. pseudonym-mcp masks detected PII locally
236
+ 3. Claude processes the masked content
229
237
  4. pseudonym-mcp restores originals before the response is shown
230
238
 
231
- Optional arguments: `task` (default: _summarize the key points_), `lang` (`en` or `pl`).
239
+ Optional arguments: `task` (default: _summarise the key points_), `lang` (`en` or `pl`).
232
240
 
233
241
  ## Quick Start
234
242
 
@@ -244,7 +252,7 @@ claude mcp add pseudonym-mcp -- npx -y pseudonym-mcp --engines hybrid
244
252
  ollama pull llama3
245
253
  ```
246
254
 
247
- Skip this step if you only need regex-based masking (`--engines regex`).
255
+ Skip this step if you only need regex-based masking (`--engines regex`). Without Ollama, you'll catch structured identifiers (SSN, IBAN, cards, email, phone, PESEL) but not free-form names and organisations.
248
256
 
249
257
  > **Global install** — if you prefer `npm install -g pseudonym-mcp`, replace `npx -y pseudonym-mcp` with `pseudonym-mcp` in all snippets below.
250
258
 
@@ -254,7 +262,7 @@ Restart your client. The `mask_text` and `unmask_text` tools appear automaticall
254
262
 
255
263
  | Tool | What it does | Example prompt |
256
264
  | ------------- | -------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
257
- | `mask_text` | Pseudonymize PII in text. Returns `masked_text` + `session_id`. | _"Use mask_text on this customer letter before summarizing it"_ |
265
+ | `mask_text` | Pseudonymise detected PII in text. Returns `masked_text` + `session_id`. | _"Use mask_text on this customer letter before summarising it"_ |
258
266
  | `unmask_text` | Restore original values from a session. Pass the `session_id` returned by `mask_text`. | _"Use unmask_text with session_id X to restore the response"_ |
259
267
 
260
268
  ### `mask_text` input
@@ -327,7 +335,7 @@ pseudonym-mcp --lang en --engines regex --ollama-model llama3 --auto-unmask
327
335
  | `--ollama-model` | Ollama model to use for NER |
328
336
  | `--ollama-base-url` | Ollama base URL |
329
337
  | `--config` | Path to a custom JSON config file |
330
- | `--auto-unmask` | Enable automatic response de-tokenization |
338
+ | `--auto-unmask` | Enable automatic response de-tokenisation |
331
339
  | `--custom-literals` | Comma-separated strings to always redact, e.g. `"Jan Kowalski,78091512345"` |
332
340
 
333
341
  ### Claude Code
@@ -368,6 +376,8 @@ Add to `~/.cursor/mcp.json`:
368
376
 
369
377
  ## Supported PII types
370
378
 
379
+ Detection is best-effort. The patterns below are what the tool **looks for** — not a guarantee of what it will always catch. See [Limitations](#limitations) for known gaps.
380
+
371
381
  ### Custom literals
372
382
 
373
383
  | Tag | Detection | Match |
@@ -396,7 +406,7 @@ Custom literals are applied after the regex phase and before LLM NER, regardless
396
406
  | `PHONE` | `+1 (XXX) XXX-XXXX`, `XXX-XXX-XXXX`, `XXX.XXX.XXXX` | Format match |
397
407
  | `ZIP_CODE` | `XXXXX` or `XXXXX-XXXX` (paranoid mode only) | Format match |
398
408
  | `PERSON` | Full names | Ollama NER (hybrid / llm engines) |
399
- | `ORG` | Company / organization names | Ollama NER (hybrid / llm engines) |
409
+ | `ORG` | Company / organisation names | Ollama NER (hybrid / llm engines) |
400
410
 
401
411
  ### Polish (`--lang pl`)
402
412
 
@@ -409,7 +419,7 @@ Custom literals are applied after the regex phase and before LLM NER, regardless
409
419
  | `NIP` | 10-digit tax ID (strict / paranoid modes) | Checksum (weights `[6,5,7,2,3,4,5,6,7]`) |
410
420
  | `POSTAL_CODE` | `XX-XXX` (paranoid mode only) | Format match |
411
421
  | `PERSON` | Full names | Ollama NER (hybrid / llm engines) |
412
- | `ORG` | Company / organization names | Ollama NER (hybrid / llm engines) |
422
+ | `ORG` | Company / organisation names | Ollama NER (hybrid / llm engines) |
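
The `NIP` row above lists the checksum weights. A minimal sketch of such a check follows, assuming the standard NIP rule that the weighted sum of the first nine digits modulo 11 must equal the tenth digit (a remainder of 10 never yields a valid number). The function name `nipChecksum` is hypothetical and the code is not taken from the package source.

```javascript
function nipChecksum(input) {
  const digits = input.replace(/\D/g, ''); // tolerate separators such as "123-456-32-18"
  if (digits.length !== 10) return false;
  const weights = [6, 5, 7, 2, 3, 4, 5, 6, 7];
  const d = digits.split('').map(Number);
  const sum = weights.reduce((acc, w, i) => acc + w * d[i], 0);
  const check = sum % 11;
  // A remainder of 10 cannot be expressed as a single check digit.
  return check !== 10 && check === d[9];
}

console.log(nipChecksum('123-456-32-18')); // true  (weighted sum 118, 118 % 11 = 8)
console.log(nipChecksum('1234563217'));    // false (check digit should be 8)
```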
413
423
 
414
424
  ## Language Detection
415
425
 
@@ -432,7 +442,7 @@ detectLanguage('Hello')
432
442
  | `confidence` | Score 0–1 from franc, or `null` when franc was not called |
433
443
 
434
444
  Texts shorter than 20 characters or with low confidence return `detected: 'unknown'`.
435
- The detector does not affect the current pseudonymization pipeline — `--lang` config remains authoritative.
445
+ The detector does not affect the current pseudonymisation pipeline — `--lang` config remains authoritative.
436
446
  It is a building block for future multi-language and auto-select modes.
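
The short-text and low-confidence rules can be sketched as a thin wrapper around a scorer. This is an assumption-heavy illustration: `classify`, the injected `score` callback (a stand-in for the franc-backed scoring), and the `0.5` threshold are all hypothetical, and the real `detectLanguage()` return value may carry more fields than the two shown here.

```javascript
// `score` is a hypothetical stand-in: (text) => ({ lang, confidence }).
function classify(text, score, minConfidence = 0.5) {
  if (text.length < 20) {
    // Too short to judge; the scorer is not called, so confidence is null.
    return { detected: 'unknown', confidence: null };
  }
  const { lang, confidence } = score(text);
  return confidence < minConfidence
    ? { detected: 'unknown', confidence }
    : { detected: lang, confidence };
}

const stubScore = () => ({ lang: 'pl', confidence: 0.92 });
console.log(classify('Hello', stubScore));
console.log(classify('Spotkanie z klientem w sprawie umowy', stubScore));
```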
437
447
 
438
448
  ## Engine modes
@@ -443,27 +453,37 @@ It is a building block for future multi-language and auto-select modes.
443
453
  | `llm` | Yes | No | Yes |
444
454
  | `hybrid` (default) | Yes (graceful fallback) | Yes | Yes |
445
455
 
446
- In `hybrid` mode, Ollama runs after the regex pass so the LLM never sees already-tokenized values. If Ollama is unreachable, the server logs a warning to stderr and returns the regex-only masked text — no crash, no hang.
456
+ In `hybrid` mode, Ollama runs after the regex pass so the LLM never sees already-tokenised values. If Ollama is unreachable, the server logs a warning to stderr and returns the regex-only masked text — no crash, no hang.
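
A minimal sketch of that ordering and fallback, under stated assumptions: `maskHybrid`, `regexPass`, and `nerPass` are hypothetical names, the two phases are injected as plain functions, and timeouts are not modelled here.

```javascript
// `regexPass` and `nerPass` are hypothetical stand-ins for the two phases.
async function maskHybrid(text, regexPass, nerPass) {
  const afterRegex = regexPass(text); // phase 1: structured PII becomes tokens
  try {
    // Phase 2 receives the already-tokenised text, not the raw values.
    return await nerPass(afterRegex);
  } catch (err) {
    console.error(`warning: NER unavailable, returning regex-only result (${err.message})`);
    return afterRegex;
  }
}

// Usage with stubs: the NER phase fails, so the regex-only text comes back.
const regexStub = (t) => t.replace('jan.kowalski@acme.pl', '[EMAIL:1]');
const failingNer = async () => { throw new Error('ECONNREFUSED'); };

maskHybrid('Contact: jan.kowalski@acme.pl', regexStub, failingNer)
  .then((out) => console.log(out)); // Contact: [EMAIL:1]
```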
447
457
 
448
458
  ## Privacy & Security notes
449
459
 
450
- - **No telemetry.** pseudonym-mcp makes no network requests except to your local Ollama instance and (optionally) the MCP stdio transport.
451
- - **In-memory only.** The mapping store is never written to disk. Sessions are scoped to the server process lifetime.
452
- - **Idempotent tokens.** The same original value always maps to the same token within a session (`[PERSON:1]` will not become `[PERSON:2]` for the same name on a second occurrence), preserving semantic coherence in LLM reasoning.
453
- - **No model training.** The local Ollama model operates entirely offline. Your data is not used to train any model.
460
+ Calibrated claims:
461
+
462
+ - **No telemetry from the tool itself.** pseudonym-mcp makes no network requests except to your local Ollama instance and (optionally) the MCP stdio transport.
463
+ - **In-memory mapping by default.** The mapping store is not written to disk. Sessions are scoped to the server process lifetime.
464
+ - **Idempotent tokens within a session.** The same original value always maps to the same token (`[PERSON:1]` will not become `[PERSON:2]` for the same name on a second occurrence), preserving semantic coherence in LLM reasoning.
465
+ - **No model training.** The local Ollama model operates offline. Your data is not used to train any model by this tool.
454
466
  - **Strict validation by default.** Invalid SSNs (area 000/666/900+), failed-Luhn credit card numbers, and invalid-checksum PESELs are not masked, preventing false positives from OCR errors or random digit sequences.
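
The validation bullet above can be illustrated with two small checks. This is a sketch under stated assumptions: `luhnValid` and `ssnAreaValid` are hypothetical names, the 13 to 19 digit length range is a common convention rather than something the README states, and only the SSN area rule mentioned above is covered.

```javascript
// Luhn: from the right, double every second digit, subtract 9 if the result exceeds 9,
// and require the total to be divisible by 10.
function luhnValid(input) {
  const digits = input.replace(/\D/g, '');
  if (digits.length < 13 || digits.length > 19) return false; // assumed length range
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let n = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      n *= 2;
      if (n > 9) n -= 9;
    }
    sum += n;
  }
  return sum % 10 === 0;
}

// Area check only: 000, 666 and 900+ are rejected.
function ssnAreaValid(ssn) {
  const area = Number(ssn.replace(/\D/g, '').slice(0, 3));
  return area !== 0 && area !== 666 && area < 900;
}

console.log(luhnValid('4111 1111 1111 1111')); // true
console.log(luhnValid('4111 1111 1111 1112')); // false
console.log(ssnAreaValid('123-45-6789'));      // true
console.log(ssnAreaValid('666-12-3456'));      // false
```

A candidate that fails either check would be left unmasked, which is how a random digit run or an OCR misread avoids being treated as PII.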
455
467
 
468
+ What this does **not** guarantee:
469
+
470
+ - That all PII in your input is detected.
471
+ - That tokenised text is unlinkable to real people — re-identification from context is possible.
472
+ - That the cloud provider can't learn sensitive things from structure, timing, or content.
473
+ - Compliance with any specific regulation — that's a system-level property, not a tool-level one.
474
+
456
475
  ## Limitations
457
476
 
458
477
  pseudonym-mcp is a technical privacy control, not a legal guarantee of compliance.
459
478
 
460
- - Detection is best-effort; false negatives and false positives are possible.
461
- - Indirect references (e.g. _"the tall guy from accounting"_) are not detected.
462
- - If plaintext is logged before being passed to `mask_text`, pseudonym-mcp cannot protect it.
463
- - The mapping store is process-local; restarting the server ends the session.
464
- - Re-identification is possible for anyone with access to the local mapping store; this is pseudonymization, not anonymization.
479
+ - **Detection is best-effort.** False negatives and false positives are both possible. Indirect references (e.g. _"the tall guy from accounting"_, _"my landlord"_, _"the place near the bridge"_) are not detected. Nicknames, initials, and partial names are typically missed.
480
+ - **Structure still travels.** Dates, amounts, relationships between tokens, narrative content, and any PII the detector missed all reach the cloud LLM. Tokenisation hides _who_, not _what kind of situation_.
481
+ - **Pre-mask logging is your problem.** If your application logs plaintext before passing it to `mask_text`, this tool cannot help you.
482
+ - **Process-local mapping.** Restarting the server ends the session and discards mappings. This is intentional.
483
+ - **Re-identification is possible** for anyone with access to the local mapping store, and may be possible from context alone for anyone with side knowledge. This is pseudonymisation under GDPR Art. 4(5), not anonymisation.
484
+ - **No legal advice.** Nothing in this README constitutes legal advice. Compliance is a system-level property — talk to your DPO, your compliance team, and your lawyers about your specific deployment.
465
485
 
466
- > Under GDPR Art. 4(5), pseudonymized data is still personal data in your system. pseudonym-mcp substantially reduces risk but does not eliminate your legal obligations.
486
+ > Under GDPR Art. 4(5) and Recital 26, pseudonymised data is still personal data. pseudonym-mcp substantially reduces cleartext PII exposure but does not eliminate your legal obligations.
467
487
 
468
488
  ## Development
469
489
 
@@ -1 +1 @@
1
- {"version":3,"file":"pesel.d.ts","sourceRoot":"","sources":["../../../../src/patterns/locale/pl/pesel.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,WAAW,EAAE,MAAM,gBAAgB,CAAA;AAiBjD,eAAO,MAAM,SAAS,EAAE,WASvB,CAAA"}
1
+ {"version":3,"file":"pesel.d.ts","sourceRoot":"","sources":["../../../../src/patterns/locale/pl/pesel.ts"],"names":[],"mappings":"AAAA,OAAO,KAAK,EAAE,WAAW,EAAE,MAAM,gBAAgB,CAAA;AAEjD,eAAO,MAAM,SAAS,EAAE,WAWvB,CAAA"}
@@ -1,26 +1,13 @@
- /**
-  * Validates a Polish PESEL number using the official checksum algorithm.
-  * Weights: [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]
-  * Check digit = (10 - (weighted_sum % 10)) % 10
-  */
- function peselChecksum(input) {
-     const digits = input.replace(/\D/g, '');
-     if (digits.length !== 11)
-         return false;
-     const weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3];
-     const d = digits.split('').map(Number);
-     const sum = weights.reduce((acc, w, i) => acc + w * d[i], 0);
-     const check = (10 - (sum % 10)) % 10;
-     return check === d[10];
- }
  export const peselRule = {
      id: 'pl.pesel',
      entityType: 'PESEL',
-     // Matches "nr PESEL: XXXXXXXXXXX", "PESEL XXXXXXXXXXX", or standalone 11 digits
-     pattern: /(?:(?:nr\s+)?PESEL:?\s*)?(?<!\d)\d{11}(?!\d)/gi,
+     // Matches exactly 11 consecutive digits (word-bounded).
+     // Negative lookbehind for '+' prevents matching the digit portion of a
+     // compact international phone like "+48601234567" (which is 11 digits after '+').
+     // The label "PESEL" / "nr PESEL:" stays in the text — only the digits are replaced.
+     pattern: /(?<!\+)\b\d{11}\b/g,
      locales: ['pl'],
      engines: ['balanced', 'strict', 'paranoid'],
-     description: 'Polish national identification number (PESEL) — 11 digits with checksum',
-     validate: peselChecksum,
+     description: 'Polish national identification number (PESEL) — exactly 11 consecutive digits',
  };
  //# sourceMappingURL=pesel.js.map
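
To see what the pattern change means in practice, here is a small sketch that runs the new pattern from the diff above against a few inputs and compares the result with the checksum from the removed helper. `findPeselCandidates` is a hypothetical wrapper added for illustration. As far as this diff shows, the new rule object carries no `validate` field, so the checksum appears here only as an optional post-filter.

```javascript
const peselPattern = /(?<!\+)\b\d{11}\b/g; // pattern copied from the new rule

// Checksum logic from the removed peselChecksum helper.
function peselChecksum(input) {
  const digits = input.replace(/\D/g, '');
  if (digits.length !== 11) return false;
  const weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3];
  const d = digits.split('').map(Number);
  const sum = weights.reduce((acc, w, i) => acc + w * d[i], 0);
  return (10 - (sum % 10)) % 10 === d[10];
}

function findPeselCandidates(text) {
  return text.match(peselPattern) || [];
}

console.log(findPeselCandidates('nr PESEL: 44051401359')); // [ '44051401359' ]
console.log(findPeselCandidates('tel. +48601234567'));     // [] (lookbehind rejects the phone)
console.log(findPeselCandidates('ref 123456789012'));      // [] (12 digits, no word boundary)
console.log(findPeselCandidates('id 12345678901'));        // [ '12345678901' ]
console.log(peselChecksum('12345678901'));                 // false (pattern match, bad checksum)
console.log(peselChecksum('44051401359'));                 // true
```

The fourth and fifth lines show the practical difference: an arbitrary 11-digit run is a candidate for the pattern even though it would not pass the checksum.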
@@ -1 +1 @@
1
- {"version":3,"file":"pesel.js","sourceRoot":"","sources":["../../../../src/patterns/locale/pl/pesel.ts"],"names":[],"mappings":"AAEA;;;;GAIG;AACH,SAAS,aAAa,CAAC,KAAa;IAClC,MAAM,MAAM,GAAG,KAAK,CAAC,OAAO,CAAC,KAAK,EAAE,EAAE,CAAC,CAAA;IACvC,IAAI,MAAM,CAAC,MAAM,KAAK,EAAE;QAAE,OAAO,KAAK,CAAA;IACtC,MAAM,OAAO,GAAG,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAA;IAC9C,MAAM,CAAC,GAAG,MAAM,CAAC,KAAK,CAAC,EAAE,CAAC,CAAC,GAAG,CAAC,MAAM,CAAC,CAAA;IACtC,MAAM,GAAG,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,GAAG,CAAC,CAAC,CAAC,CAAC,EAAE,CAAC,CAAC,CAAA;IAC5D,MAAM,KAAK,GAAG,CAAC,EAAE,GAAG,CAAC,GAAG,GAAG,EAAE,CAAC,CAAC,GAAG,EAAE,CAAA;IACpC,OAAO,KAAK,KAAK,CAAC,CAAC,EAAE,CAAC,CAAA;AACxB,CAAC;AAED,MAAM,CAAC,MAAM,SAAS,GAAgB;IACpC,EAAE,EAAE,UAAU;IACd,UAAU,EAAE,OAAO;IACnB,gFAAgF;IAChF,OAAO,EAAE,gDAAgD;IACzD,OAAO,EAAE,CAAC,IAAI,CAAC;IACf,OAAO,EAAE,CAAC,UAAU,EAAE,QAAQ,EAAE,UAAU,CAAC;IAC3C,WAAW,EAAE,yEAAyE;IACtF,QAAQ,EAAE,aAAa;CACxB,CAAA"}
1
+ {"version":3,"file":"pesel.js","sourceRoot":"","sources":["../../../../src/patterns/locale/pl/pesel.ts"],"names":[],"mappings":"AAEA,MAAM,CAAC,MAAM,SAAS,GAAgB;IACpC,EAAE,EAAE,UAAU;IACd,UAAU,EAAE,OAAO;IACnB,wDAAwD;IACxD,uEAAuE;IACvE,kFAAkF;IAClF,oFAAoF;IACpF,OAAO,EAAE,oBAAoB;IAC7B,OAAO,EAAE,CAAC,IAAI,CAAC;IACf,OAAO,EAAE,CAAC,UAAU,EAAE,QAAQ,EAAE,UAAU,CAAC;IAC3C,WAAW,EAAE,+EAA+E;CAC7F,CAAA"}
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "pseudonym-mcp",
3
- "version": "0.7.1",
3
+ "version": "0.7.3",
4
4
  "mcpName": "io.github.woladi/pseudonym-mcp",
5
5
  "description": "MCP server for privacy-preserving pseudonymization of sensitive data before cloud LLM processing",
6
6
  "type": "module",