data_redactor 0.7.2 → 0.9.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +19 -1
- data/ext/data_redactor/patterns.c +221 -208
- data/ext/data_redactor/patterns.h +1 -1
- data/lib/data_redactor/integrations/rack.rb +21 -0
- data/lib/data_redactor/name_pattern.rb +170 -0
- data/lib/data_redactor/version.rb +1 -1
- data/lib/data_redactor.rb +75 -0
- data/readme.md +100 -3
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 007d59e430d1675a13b84670f6c34c300f8b72fd7ee4744aa191f846bb89b072
+  data.tar.gz: a23f3b99c3ead341d2c9415a1b4b2eb32a45ee002f052a8e58d928eb1ce03919
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ccd4f6f97a0110585e4f43f9402eac2a1f57b2aef01a3c6870f0e57ea578377291a7367ee924585d9d11e92af98f4178bb0b9488c1a24a2338f6a41936efad30
+  data.tar.gz: 5281171119b4892167a6b1d55e0996db47408c8a6d334656998f8f2ca50794a3a7b5c987132369ca32965da0943f954eab61f34f5a97c683b8a14851e9beca1e
data/CHANGELOG.md
CHANGED
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.9.0] - 2026-05-22
+
+### Added
+- `DataRedactor.name_pattern(first, last, middle:)` — generates a POSIX ERE that matches a person's name across common written variations (case-insensitivity, First/Last order swaps, `Last, First`, initials, diacritics, and interchangeable space/hyphen separators). Returns a String ready to pass to `add_pattern`. The pattern is boundary-wrapped, so `"Mario"` matches as a word but not inside `"Mariolino"`. When `middle:` is given, both the no-middle and with-middle forms match.
+
+## [0.8.0] - 2026-05-21
+
+### Added
+- `DataRedactor.redact_deep(data, only:, except:, placeholder:)` — recursively redacts every String value in a nested Hash/Array structure. Non-string scalars (Integer, Float, nil, Boolean) and Hash keys are passed through unchanged. Returns a deep copy; never mutates the input. Raises `ArgumentError` on circular references.
+- `DataRedactor.redact_json(json_string, only:, except:, placeholder:)` — parses JSON, redacts via `redact_deep`, and returns valid JSON. Raises `JSON::ParserError` on invalid input.
+- HashiCorp Vault service tokens (`hvs.` prefix, 90–120 chars) — pattern `hashicorp_vault_service_token`
+- HashiCorp Vault batch tokens (`hvb.` prefix, 138–300 chars) — pattern `hashicorp_vault_batch_token`
+- HashiCorp Terraform Cloud API tokens (`<14-char-id>.atlasv1.<token>`) — pattern `hashicorp_terraform_api_token`
+
+All three HashiCorp patterns are tagged `:credentials` and do not require word-boundary wrapping (distinctive prefixes eliminate false positives).
+
 ## [0.7.2] - 2026-05-09

 **Supersedes 0.7.1, which has been yanked from RubyGems.**
@@ -161,7 +177,9 @@ features as 0.7.1 plus the pipeline fix.
 - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
 - RSpec suite with one example per pattern.

-[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.
+[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...HEAD
+[0.9.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.8.0...v0.9.0
+[0.8.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...v0.8.0
 [0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
 [0.7.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.0...v0.7.1
 [0.7.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.6.1...v0.7.0
|
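The `redact_deep` contract in the 0.8.0 entry (String values redacted, other scalars and Hash keys passed through, a deep copy returned, `ArgumentError` on cycles) can be sketched in plain Ruby. This is an illustration under assumptions, not the gem's implementation: `deep_redact_sketch` and `SCRUB` are hypothetical names, and a single email regex stands in for the gem's full pattern set.

```ruby
require "json"

# Hypothetical stand-in for DataRedactor.redact: replaces email-like text only.
SCRUB = ->(text) { text.gsub(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/, "[REDACTED]") }

# Recursively copy +data+, redacting every String value.
# +seen+ tracks object_ids of the containers on the current path so that a
# structure containing itself raises instead of recursing forever.
def deep_redact_sketch(data, seen = {})
  case data
  when String
    SCRUB.call(data) # gsub returns a new String, so the input is not mutated
  when Hash, Array
    raise ArgumentError, "circular reference detected" if seen.key?(data.object_id)

    seen[data.object_id] = true
    copy =
      if data.is_a?(Hash)
        # Keys are copied as-is; only values are visited.
        data.each_with_object({}) { |(k, v), out| out[k] = deep_redact_sketch(v, seen) }
      else
        data.map { |v| deep_redact_sketch(v, seen) }
      end
    seen.delete(data.object_id)
    copy
  else
    data # Integer, Float, nil, true/false pass through unchanged
  end
end

input = { "user" => { "email" => "mario@example.com", "age" => 41 }, "tags" => ["a@b.io", nil, true] }
output = deep_redact_sketch(input)

# The redact_json shape: parse, redact the parsed structure, serialise again.
json_out = JSON.generate(deep_redact_sketch(JSON.parse('{"contact":"x@y.org","n":1}')))
```

Tracking only the containers on the current path (rather than every container ever visited) means a Hash that is referenced twice without forming a cycle is still copied normally.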
|
data/ext/data_redactor/patterns.c
CHANGED
@@ -56,67 +56,70 @@ const int boundary_wrapped[NUM_PATTERNS] = {
|
|
|
56
56
|
0, /* 26: Scaleway Access Key */
|
|
57
57
|
0, /* 27: PEM private key header (generic) */
|
|
58
58
|
0, /* 28: GPG Private Key Block */
|
|
59
|
+
0, /* 29: HashiCorp Vault Service Token (hvs.) */
|
|
60
|
+
0, /* 30: HashiCorp Vault Batch Token (hvb.) */
|
|
61
|
+
0, /* 31: HashiCorp Terraform Cloud API Token (atlasv1) */
|
|
59
62
|
/* ---- Tier 3: IBANs (longest → shortest) ---- */
|
|
60
|
-
0, /*
|
|
61
|
-
0, /*
|
|
62
|
-
0, /*
|
|
63
|
-
0, /*
|
|
64
|
-
0, /*
|
|
65
|
-
0, /*
|
|
66
|
-
0, /*
|
|
67
|
-
0, /*
|
|
68
|
-
0, /*
|
|
69
|
-
0, /*
|
|
70
|
-
0, /*
|
|
71
|
-
0, /*
|
|
72
|
-
0, /*
|
|
73
|
-
0, /*
|
|
74
|
-
0, /*
|
|
75
|
-
0, /*
|
|
76
|
-
0, /*
|
|
77
|
-
0, /*
|
|
63
|
+
0, /* 32: Hungary IBAN (28 chars) */
|
|
64
|
+
0, /* 33: Poland IBAN (28 chars) */
|
|
65
|
+
0, /* 34: France IBAN (27 chars) */
|
|
66
|
+
0, /* 35: Italy IBAN (27 chars) */
|
|
67
|
+
0, /* 36: Portugal IBAN (25 chars) */
|
|
68
|
+
0, /* 37: Spain IBAN (24 chars) */
|
|
69
|
+
0, /* 38: Czechia IBAN (24 chars) */
|
|
70
|
+
0, /* 39: Romania IBAN (24 chars) */
|
|
71
|
+
0, /* 40: Sweden IBAN (24 chars) */
|
|
72
|
+
0, /* 41: Germany IBAN (22 chars) */
|
|
73
|
+
0, /* 42: Ireland IBAN (22 chars) */
|
|
74
|
+
0, /* 43: Switzerland IBAN (21 chars) */
|
|
75
|
+
0, /* 44: Austria IBAN (20 chars) */
|
|
76
|
+
0, /* 45: Netherlands IBAN (18 chars) */
|
|
77
|
+
0, /* 46: Denmark IBAN (18 chars) */
|
|
78
|
+
0, /* 47: Finland IBAN (18 chars) */
|
|
79
|
+
0, /* 48: Belgium IBAN (16 chars) */
|
|
80
|
+
0, /* 49: Norway IBAN (15 chars) */
|
|
78
81
|
/* ---- Tier 4: Structured formats (dots, dashes, slashes, @) ---- */
|
|
79
|
-
0, /*
|
|
80
|
-
0, /*
|
|
81
|
-
0, /*
|
|
82
|
-
0, /*
|
|
83
|
-
0, /*
|
|
84
|
-
0, /*
|
|
85
|
-
0, /*
|
|
86
|
-
0, /*
|
|
82
|
+
0, /* 50: Email Address */
|
|
83
|
+
0, /* 51: International Phone Number */
|
|
84
|
+
0, /* 52: Brazilian CNPJ (XX.XXX.XXX/XXXX-XX) */
|
|
85
|
+
0, /* 53: Brazilian CPF (XXX.XXX.XXX-XX) */
|
|
86
|
+
0, /* 54: UUID v4 */
|
|
87
|
+
0, /* 55: IPv4 address */
|
|
88
|
+
0, /* 56: Credit card numbers */
|
|
89
|
+
0, /* 57: Indian Aadhaar (XXXX XXXX XXXX) */
|
|
87
90
|
/* ---- Tier 5: Letter-anchored patterns ---- */
|
|
88
|
-
0, /*
|
|
89
|
-
0, /*
|
|
90
|
-
0, /*
|
|
91
|
-
0, /*
|
|
92
|
-
0, /*
|
|
93
|
-
0, /*
|
|
91
|
+
0, /* 58: Mexican CURP (18 alphanum, distinctive structure) */
|
|
92
|
+
0, /* 59: Italian CF with omocodia (16 chars) */
|
|
93
|
+
0, /* 60: Italian CF basic (16 chars) */
|
|
94
|
+
0, /* 61: UK National Insurance Number */
|
|
95
|
+
0, /* 62: Spanish NIE (X/Y/Z prefix) */
|
|
96
|
+
0, /* 63: Passport letter prefix + digits */
|
|
94
97
|
/* ---- Tier 6: Boundary-wrapped structured (dash/dot/slash separated) ---- */
|
|
95
|
-
1, /*
|
|
96
|
-
1, /*
|
|
97
|
-
1, /*
|
|
98
|
-
1, /*
|
|
99
|
-
1, /*
|
|
100
|
-
1, /*
|
|
101
|
-
1, /*
|
|
102
|
-
1, /*
|
|
103
|
-
1, /*
|
|
104
|
-
1, /*
|
|
105
|
-
1, /*
|
|
106
|
-
1, /*
|
|
107
|
-
1, /*
|
|
98
|
+
1, /* 64: South Korean RRN (YYMMDD-XXXXXXX, 14 chars) */
|
|
99
|
+
1, /* 65: Swiss AHV Number (756.XXXX.XXXX.XX) */
|
|
100
|
+
1, /* 66: Finnish HETU (DDMMYY[+-A]XXXC) */
|
|
101
|
+
1, /* 67: Swedish Personnummer (YYMMDD[-+]XXXX) */
|
|
102
|
+
1, /* 68: Danish CPR Number (DDMMYY-XXXX) */
|
|
103
|
+
1, /* 69: Czech Rodné číslo (YYMMDD/XXXX) */
|
|
104
|
+
1, /* 70: US Social Security Number (XXX-XX-XXXX) */
|
|
105
|
+
1, /* 71: US ITIN (9XX-XX-XXXX) */
|
|
106
|
+
1, /* 72: Canadian SIN (XXX-XXX-XXX) */
|
|
107
|
+
1, /* 73: Australian TFN (XXX-XXX-XXX) */
|
|
108
|
+
1, /* 74: Indian PAN (AAAAA0000A) */
|
|
109
|
+
1, /* 75: Spanish DNI (8 digits + letter) */
|
|
110
|
+
1, /* 76: Hungarian Tax ID (8XXXXXXXXX, 10 digits) */
|
|
108
111
|
/* ---- Tier 7: Boundary-wrapped pure digits (longest → shortest) ---- */
|
|
109
|
-
1, /*
|
|
110
|
-
1, /*
|
|
111
|
-
1, /*
|
|
112
|
-
1, /*
|
|
113
|
-
1, /*
|
|
114
|
-
1, /*
|
|
115
|
-
1, /*
|
|
116
|
-
1, /*
|
|
117
|
-
1, /*
|
|
118
|
-
1, /*
|
|
119
|
-
1 /*
|
|
112
|
+
1, /* 77: French NIR (15 digits) */
|
|
113
|
+
1, /* 78: South African ID (13 digits) */
|
|
114
|
+
1, /* 79: Romanian CNP (13 digits) */
|
|
115
|
+
1, /* 80: Japanese My Number (12 digits) */
|
|
116
|
+
1, /* 81: Polish PESEL (11 digits) */
|
|
117
|
+
1, /* 82: Belgian National Number (11 digits) */
|
|
118
|
+
1, /* 83: Norwegian Fødselsnummer (11 digits) */
|
|
119
|
+
1, /* 84: Passport 9 digits */
|
|
120
|
+
1, /* 85: Dutch BSN (8-9 digits) */
|
|
121
|
+
1, /* 86: Austrian Abgabenkontonummer (9 digits) */
|
|
122
|
+
1 /* 87: Polish PESEL duplicate */
|
|
120
123
|
};
|
|
121
124
|
|
|
122
125
|
/*
|
|
@@ -124,56 +127,57 @@ const int boundary_wrapped[NUM_PATTERNS] = {
|
|
|
124
127
|
* patterns run when the caller passes a mask (only/except).
|
|
125
128
|
*/
|
|
126
129
|
const int pattern_tags[NUM_PATTERNS] = {
|
|
127
|
-
/* 0-
|
|
130
|
+
/* 0-31: secrets, API keys, tokens, private keys, webhooks */
|
|
128
131
|
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
129
132
|
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
130
133
|
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
131
134
|
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
132
135
|
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
133
136
|
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
134
|
-
|
|
137
|
+
TAG_CREDENTIALS, TAG_CREDENTIALS, TAG_CREDENTIALS,
|
|
138
|
+
/* 32-49: IBANs */
|
|
135
139
|
TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL,
|
|
136
140
|
TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL,
|
|
137
141
|
TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL,
|
|
138
142
|
TAG_FINANCIAL, TAG_FINANCIAL, TAG_FINANCIAL,
|
|
139
|
-
TAG_CONTACT, /*
|
|
140
|
-
TAG_CONTACT, /*
|
|
141
|
-
TAG_TAX_ID, /*
|
|
142
|
-
TAG_TAX_ID, /*
|
|
143
|
-
TAG_OTHER, /*
|
|
144
|
-
TAG_NETWORK, /*
|
|
145
|
-
TAG_FINANCIAL, /*
|
|
146
|
-
TAG_NATIONAL_ID, /*
|
|
147
|
-
TAG_NATIONAL_ID, /*
|
|
148
|
-
TAG_TAX_ID, /*
|
|
149
|
-
TAG_TAX_ID, /*
|
|
150
|
-
TAG_NATIONAL_ID, /*
|
|
151
|
-
TAG_NATIONAL_ID, /*
|
|
152
|
-
TAG_TRAVEL, /*
|
|
153
|
-
TAG_NATIONAL_ID, /*
|
|
154
|
-
TAG_NATIONAL_ID, /*
|
|
155
|
-
TAG_NATIONAL_ID, /*
|
|
156
|
-
TAG_NATIONAL_ID, /*
|
|
157
|
-
TAG_NATIONAL_ID, /*
|
|
158
|
-
TAG_NATIONAL_ID, /*
|
|
159
|
-
TAG_NATIONAL_ID, /*
|
|
160
|
-
TAG_TAX_ID, /*
|
|
161
|
-
TAG_NATIONAL_ID, /*
|
|
162
|
-
TAG_TAX_ID, /*
|
|
163
|
-
TAG_TAX_ID, /*
|
|
164
|
-
TAG_NATIONAL_ID, /*
|
|
165
|
-
TAG_TAX_ID, /*
|
|
166
|
-
TAG_NATIONAL_ID, /*
|
|
167
|
-
TAG_NATIONAL_ID, /*
|
|
168
|
-
TAG_NATIONAL_ID, /*
|
|
169
|
-
TAG_TAX_ID, /*
|
|
170
|
-
TAG_NATIONAL_ID, /*
|
|
171
|
-
TAG_NATIONAL_ID, /*
|
|
172
|
-
TAG_NATIONAL_ID, /*
|
|
173
|
-
TAG_TRAVEL, /*
|
|
174
|
-
TAG_NATIONAL_ID, /*
|
|
175
|
-
TAG_TAX_ID, /*
|
|
176
|
-
TAG_NATIONAL_ID /*
|
|
143
|
+
TAG_CONTACT, /* 50: email */
|
|
144
|
+
TAG_CONTACT, /* 51: phone */
|
|
145
|
+
TAG_TAX_ID, /* 52: Brazilian CNPJ */
|
|
146
|
+
TAG_TAX_ID, /* 53: Brazilian CPF */
|
|
147
|
+
TAG_OTHER, /* 54: UUID v4 */
|
|
148
|
+
TAG_NETWORK, /* 55: IPv4 */
|
|
149
|
+
TAG_FINANCIAL, /* 56: credit card */
|
|
150
|
+
TAG_NATIONAL_ID, /* 57: Indian Aadhaar */
|
|
151
|
+
TAG_NATIONAL_ID, /* 58: Mexican CURP */
|
|
152
|
+
TAG_TAX_ID, /* 59: Italian CF (omocodia) */
|
|
153
|
+
TAG_TAX_ID, /* 60: Italian CF (basic) */
|
|
154
|
+
TAG_NATIONAL_ID, /* 61: UK NIN */
|
|
155
|
+
TAG_NATIONAL_ID, /* 62: Spanish NIE */
|
|
156
|
+
TAG_TRAVEL, /* 63: passport letter prefix */
|
|
157
|
+
TAG_NATIONAL_ID, /* 64: Korean RRN */
|
|
158
|
+
TAG_NATIONAL_ID, /* 65: Swiss AHV */
|
|
159
|
+
TAG_NATIONAL_ID, /* 66: Finnish HETU */
|
|
160
|
+
TAG_NATIONAL_ID, /* 67: Swedish Personnummer */
|
|
161
|
+
TAG_NATIONAL_ID, /* 68: Danish CPR */
|
|
162
|
+
TAG_NATIONAL_ID, /* 69: Czech Rodné číslo */
|
|
163
|
+
TAG_NATIONAL_ID, /* 70: US SSN */
|
|
164
|
+
TAG_TAX_ID, /* 71: US ITIN */
|
|
165
|
+
TAG_NATIONAL_ID, /* 72: Canadian SIN */
|
|
166
|
+
TAG_TAX_ID, /* 73: Australian TFN */
|
|
167
|
+
TAG_TAX_ID, /* 74: Indian PAN */
|
|
168
|
+
TAG_NATIONAL_ID, /* 75: Spanish DNI */
|
|
169
|
+
TAG_TAX_ID, /* 76: Hungarian Tax ID */
|
|
170
|
+
TAG_NATIONAL_ID, /* 77: French NIR */
|
|
171
|
+
TAG_NATIONAL_ID, /* 78: South African ID */
|
|
172
|
+
TAG_NATIONAL_ID, /* 79: Romanian CNP */
|
|
173
|
+
TAG_TAX_ID, /* 80: Japanese My Number */
|
|
174
|
+
TAG_NATIONAL_ID, /* 81: Polish PESEL */
|
|
175
|
+
TAG_NATIONAL_ID, /* 82: Belgian National Number */
|
|
176
|
+
TAG_NATIONAL_ID, /* 83: Norwegian Fødselsnummer */
|
|
177
|
+
TAG_TRAVEL, /* 84: passport 9 digits */
|
|
178
|
+
TAG_NATIONAL_ID, /* 85: Dutch BSN */
|
|
179
|
+
TAG_TAX_ID, /* 86: Austrian Abgabenkontonummer */
|
|
180
|
+
TAG_NATIONAL_ID /* 87: Polish PESEL duplicate */
|
|
177
181
|
};
|
|
178
182
|
|
|
179
183
|
const char *pattern_names[NUM_PATTERNS] = {
|
|
@@ -206,62 +210,65 @@ const char *pattern_names[NUM_PATTERNS] = {
|
|
|
206
210
|
"scaleway_access_key", /* 26 */
|
|
207
211
|
"pem_private_key", /* 27 */
|
|
208
212
|
"gpg_private_key", /* 28 */
|
|
209
|
-
"
|
|
210
|
-
"
|
|
211
|
-
"
|
|
212
|
-
"
|
|
213
|
-
"
|
|
214
|
-
"
|
|
215
|
-
"
|
|
216
|
-
"
|
|
217
|
-
"
|
|
218
|
-
"
|
|
219
|
-
"
|
|
220
|
-
"
|
|
221
|
-
"
|
|
222
|
-
"
|
|
223
|
-
"
|
|
224
|
-
"
|
|
225
|
-
"
|
|
226
|
-
"
|
|
227
|
-
"
|
|
228
|
-
"
|
|
229
|
-
"
|
|
230
|
-
"
|
|
231
|
-
"
|
|
232
|
-
"
|
|
233
|
-
"
|
|
234
|
-
"
|
|
235
|
-
"
|
|
236
|
-
"
|
|
237
|
-
"
|
|
238
|
-
"
|
|
239
|
-
"
|
|
240
|
-
"
|
|
241
|
-
"
|
|
242
|
-
"
|
|
243
|
-
"
|
|
244
|
-
"
|
|
245
|
-
"
|
|
246
|
-
"
|
|
247
|
-
"
|
|
248
|
-
"
|
|
249
|
-
"
|
|
250
|
-
"
|
|
251
|
-
"
|
|
252
|
-
"
|
|
253
|
-
"
|
|
254
|
-
"
|
|
255
|
-
"
|
|
256
|
-
"
|
|
257
|
-
"
|
|
258
|
-
"
|
|
259
|
-
"
|
|
260
|
-
"
|
|
261
|
-
"
|
|
262
|
-
"
|
|
263
|
-
"
|
|
264
|
-
"
|
|
213
|
+
"hashicorp_vault_service_token", /* 29 */
|
|
214
|
+
"hashicorp_vault_batch_token", /* 30 */
|
|
215
|
+
"hashicorp_terraform_api_token", /* 31 */
|
|
216
|
+
"iban_hu", /* 32 */
|
|
217
|
+
"iban_pl", /* 33 */
|
|
218
|
+
"iban_fr", /* 34 */
|
|
219
|
+
"iban_it", /* 35 */
|
|
220
|
+
"iban_pt", /* 36 */
|
|
221
|
+
"iban_es", /* 37 */
|
|
222
|
+
"iban_cz", /* 38 */
|
|
223
|
+
"iban_ro", /* 39 */
|
|
224
|
+
"iban_se", /* 40 */
|
|
225
|
+
"iban_de", /* 41 */
|
|
226
|
+
"iban_ie", /* 42 */
|
|
227
|
+
"iban_ch", /* 43 */
|
|
228
|
+
"iban_at", /* 44 */
|
|
229
|
+
"iban_nl", /* 45 */
|
|
230
|
+
"iban_dk", /* 46 */
|
|
231
|
+
"iban_fi", /* 47 */
|
|
232
|
+
"iban_be", /* 48 */
|
|
233
|
+
"iban_no", /* 49 */
|
|
234
|
+
"email", /* 50 */
|
|
235
|
+
"phone_e164", /* 51 */
|
|
236
|
+
"brazilian_cnpj", /* 52 */
|
|
237
|
+
"brazilian_cpf", /* 53 */
|
|
238
|
+
"uuid_v4", /* 54 */
|
|
239
|
+
"ipv4", /* 55 */
|
|
240
|
+
"credit_card", /* 56 */
|
|
241
|
+
"indian_aadhaar", /* 57 */
|
|
242
|
+
"mexican_curp", /* 58 */
|
|
243
|
+
"italian_cf_omocodia", /* 59 */
|
|
244
|
+
"italian_cf", /* 60 */
|
|
245
|
+
"uk_nin", /* 61 */
|
|
246
|
+
"spanish_nie", /* 62 */
|
|
247
|
+
"passport_letter_prefix", /* 63 */
|
|
248
|
+
"korean_rrn", /* 64 */
|
|
249
|
+
"swiss_ahv", /* 65 */
|
|
250
|
+
"finnish_hetu", /* 66 */
|
|
251
|
+
"swedish_personnummer", /* 67 */
|
|
252
|
+
"danish_cpr", /* 68 */
|
|
253
|
+
"czech_rodne_cislo", /* 69 */
|
|
254
|
+
"us_ssn", /* 70 */
|
|
255
|
+
"us_itin", /* 71 */
|
|
256
|
+
"canadian_sin", /* 72 */
|
|
257
|
+
"australian_tfn", /* 73 */
|
|
258
|
+
"indian_pan", /* 74 */
|
|
259
|
+
"spanish_dni", /* 75 */
|
|
260
|
+
"hungarian_tax_id", /* 76 */
|
|
261
|
+
"french_nir", /* 77 */
|
|
262
|
+
"south_african_id", /* 78 */
|
|
263
|
+
"romanian_cnp", /* 79 */
|
|
264
|
+
"japanese_my_number", /* 80 */
|
|
265
|
+
"polish_pesel", /* 81 */
|
|
266
|
+
"belgian_national_number", /* 82 */
|
|
267
|
+
"norwegian_fodselsnummer", /* 83 */
|
|
268
|
+
"passport_9digits", /* 84 */
|
|
269
|
+
"dutch_bsn", /* 85 */
|
|
270
|
+
"austrian_abgabenkontonummer", /* 86 */
|
|
271
|
+
"polish_pesel_2" /* 87 */
|
|
265
272
|
};
|
|
266
273
|
|
|
267
274
|
/*
|
|
@@ -330,126 +337,132 @@ const char *pattern_strings[NUM_PATTERNS] = {
|
|
|
330
337
|
"-----BEGIN [A-Z ]*PRIVATE KEY-----",
|
|
331
338
|
/* 28: GPG Private Key Block */
|
|
332
339
|
"-----BEGIN PGP PRIVATE KEY BLOCK-----",
|
|
340
|
+
/* 29: HashiCorp Vault Service Token (hvs. + 90-120 base64url chars) */
|
|
341
|
+
"hvs\\.[A-Za-z0-9_-]{90,120}",
|
|
342
|
+
/* 30: HashiCorp Vault Batch Token (hvb. + 138-300 base64url chars) */
|
|
343
|
+
"hvb\\.[A-Za-z0-9_-]{138,300}",
|
|
344
|
+
/* 31: HashiCorp Terraform Cloud API Token (14 alphanum + .atlasv1. + 60-70 base64url chars) */
|
|
345
|
+
"[A-Za-z0-9]{14}\\.atlasv1\\.[A-Za-z0-9_=-]{60,70}",
|
|
333
346
|
|
|
334
347
|
/* ---- Tier 3: IBANs (longest → shortest) ---- */
|
|
335
|
-
/*
|
|
348
|
+
/* 32: Hungary IBAN (HU, 28 chars) */
|
|
336
349
|
"HU[0-9]{2}[0-9]{24}",
|
|
337
|
-
/*
|
|
350
|
+
/* 33: Poland IBAN (PL, 28 chars) */
|
|
338
351
|
"PL[0-9]{2}[0-9]{24}",
|
|
339
|
-
/*
|
|
352
|
+
/* 34: France IBAN (FR, 27 chars) */
|
|
340
353
|
"FR[0-9]{2}[0-9]{10}[A-Z0-9]{11}[0-9]{2}",
|
|
341
|
-
/*
|
|
354
|
+
/* 35: Italy IBAN (IT, 27 chars) */
|
|
342
355
|
"IT[0-9]{2}[A-Z][0-9]{10}[A-Z0-9]{12}",
|
|
343
|
-
/*
|
|
356
|
+
/* 36: Portugal IBAN (PT, 25 chars) */
|
|
344
357
|
"PT[0-9]{2}[0-9]{21}",
|
|
345
|
-
/*
|
|
358
|
+
/* 37: Spain IBAN (ES, 24 chars) */
|
|
346
359
|
"ES[0-9]{2}[0-9]{20}",
|
|
347
|
-
/*
|
|
360
|
+
/* 38: Czechia IBAN (CZ, 24 chars) */
|
|
348
361
|
"CZ[0-9]{2}[0-9]{20}",
|
|
349
|
-
/*
|
|
362
|
+
/* 39: Romania IBAN (RO, 24 chars) */
|
|
350
363
|
"RO[0-9]{2}[A-Z]{4}[A-Z0-9]{16}",
|
|
351
|
-
/*
|
|
364
|
+
/* 40: Sweden IBAN (SE, 24 chars) */
|
|
352
365
|
"SE[0-9]{2}[0-9]{20}",
|
|
353
|
-
/*
|
|
366
|
+
/* 41: Germany IBAN (DE, 22 chars) */
|
|
354
367
|
"DE[0-9]{2}[0-9]{18}",
|
|
355
|
-
/*
|
|
368
|
+
/* 42: Ireland IBAN (IE, 22 chars) */
|
|
356
369
|
"IE[0-9]{2}[A-Z]{4}[0-9]{14}",
|
|
357
|
-
/*
|
|
370
|
+
/* 43: Switzerland IBAN (CH, 21 chars) */
|
|
358
371
|
"CH[0-9]{2}[0-9]{5}[A-Z0-9]{12}",
|
|
359
|
-
/*
|
|
372
|
+
/* 44: Austria IBAN (AT, 20 chars) */
|
|
360
373
|
"AT[0-9]{2}[0-9]{16}",
|
|
361
|
-
/*
|
|
374
|
+
/* 45: Netherlands IBAN (NL, 18 chars) */
|
|
362
375
|
"NL[0-9]{2}[A-Z]{4}[0-9]{10}",
|
|
363
|
-
/*
|
|
376
|
+
/* 46: Denmark IBAN (DK, 18 chars) */
|
|
364
377
|
"DK[0-9]{2}[0-9]{14}",
|
|
365
|
-
/*
|
|
378
|
+
/* 47: Finland IBAN (FI, 18 chars) */
|
|
366
379
|
"FI[0-9]{2}[0-9]{14}",
|
|
367
|
-
/*
|
|
380
|
+
/* 48: Belgium IBAN (BE, 16 chars) */
|
|
368
381
|
"BE[0-9]{2}[0-9]{12}",
|
|
369
|
-
/*
|
|
382
|
+
/* 49: Norway IBAN (NO, 15 chars) */
|
|
370
383
|
"NO[0-9]{2}[0-9]{11}",
|
|
371
384
|
|
|
372
385
|
/* ---- Tier 4: Structured formats (dots, dashes, slashes, @) ---- */
|
|
373
|
-
/*
|
|
386
|
+
/* 50: Email Address */
|
|
374
387
|
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
|
|
375
|
-
/*
|
|
388
|
+
/* 51: International Phone Number (E.164) */
|
|
376
389
|
"\\+[0-9]{1,3}[- ]?[0-9][0-9 -]{6,13}[0-9]",
|
|
377
|
-
/*
|
|
390
|
+
/* 52: Brazilian CNPJ (XX.XXX.XXX/XXXX-XX) */
|
|
378
391
|
"[0-9]{2}\\.[0-9]{3}\\.[0-9]{3}/[0-9]{4}-[0-9]{2}",
|
|
379
|
-
/*
|
|
392
|
+
/* 53: Brazilian CPF (XXX.XXX.XXX-XX) */
|
|
380
393
|
"[0-9]{3}\\.[0-9]{3}\\.[0-9]{3}-[0-9]{2}",
|
|
381
|
-
/*
|
|
394
|
+
/* 54: UUID v4 / Scaleway Secret Key */
|
|
382
395
|
"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
|
|
383
|
-
/*
|
|
396
|
+
/* 55: IPv4 address */
|
|
384
397
|
"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)",
|
|
385
|
-
/*
|
|
398
|
+
/* 56: Credit card numbers (Visa, Mastercard, Amex, Discover, JCB) */
|
|
386
399
|
"(4[0-9]{15}|4[0-9]{12}|5[1-5][0-9]{14}|6011[0-9]{12}|65[0-9]{14}|3[47][0-9]{13}|3[068][0-9]{11}|35[0-9]{14})",
|
|
387
|
-
/*
|
|
400
|
+
/* 57: Indian Aadhaar (XXXX XXXX XXXX or XXXX-XXXX-XXXX) */
|
|
388
401
|
"[0-9]{4}[- ][0-9]{4}[- ][0-9]{4}",
|
|
389
402
|
|
|
390
403
|
/* ---- Tier 5: Letter-anchored patterns ---- */
|
|
391
|
-
/*
|
|
404
|
+
/* 58: Mexican CURP (18 alphanum, distinctive structure) */
|
|
392
405
|
"[A-Z]{4}[0-9]{6}[HM][A-Z]{5}[A-Z0-9][0-9]",
|
|
393
|
-
/*
|
|
406
|
+
/* 59: Italian CF with omocodia (16 chars) */
|
|
394
407
|
"[A-Z]{6}[0-9LMNPQRSTUV]{2}[ABCDEHLMPRST][0-9LMNPQRSTUV]{2}[A-Z][0-9LMNPQRSTUV]{3}[A-Z]",
|
|
395
|
-
/*
|
|
408
|
+
/* 60: Italian CF basic (16 chars) */
|
|
396
409
|
"[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]",
|
|
397
|
-
/*
|
|
410
|
+
/* 61: UK National Insurance Number (AA 99 99 99 A-D) */
|
|
398
411
|
"[A-Z]{2} ?[0-9]{2} ?[0-9]{2} ?[0-9]{2} ?[A-D]",
|
|
399
|
-
/*
|
|
412
|
+
/* 62: Spanish NIE (X/Y/Z + 7 digits + letter) */
|
|
400
413
|
"[XYZ][0-9]{7}[A-Z]",
|
|
401
|
-
/*
|
|
414
|
+
/* 63: Passport - letter prefix + digits (e.g. AB1234567) */
|
|
402
415
|
"[A-Z]{1,2}[0-9]{6,7}",
|
|
403
416
|
|
|
404
417
|
/* ---- Tier 6: Boundary-wrapped structured (dash/dot/slash separated) ---- */
|
|
405
|
-
/*
|
|
418
|
+
/* 64: South Korean RRN (YYMMDD-XXXXXXX, 14 chars with dash) */
|
|
406
419
|
"[0-9]{6}-[0-9]{7}",
|
|
407
|
-
/*
|
|
420
|
+
/* 65: Swiss AHV Number (756.XXXX.XXXX.XX) */
|
|
408
421
|
"756\\.[0-9]{4}\\.[0-9]{4}\\.[0-9]{2}",
|
|
409
|
-
/*
|
|
422
|
+
/* 66: Finnish HETU (DDMMYY[+-A]XXXC) */
|
|
410
423
|
"[0-9]{6}[-+A][0-9]{3}[0-9A-Y]",
|
|
411
|
-
/*
|
|
424
|
+
/* 67: Swedish Personnummer (YYMMDD[-+]XXXX) */
|
|
412
425
|
"[0-9]{6}[-+][0-9]{4}",
|
|
413
|
-
/*
|
|
426
|
+
/* 68: Danish CPR Number (DDMMYY-XXXX) */
|
|
414
427
|
"[0-9]{6}-[0-9]{4}",
|
|
415
|
-
/*
|
|
428
|
+
/* 69: Czech Rodné číslo (YYMMDD/XXXX or YYMMDDXXXX) */
|
|
416
429
|
"[0-9]{6}/?[0-9]{3,4}",
|
|
417
|
-
/*
|
|
430
|
+
/* 70: US Social Security Number (XXX-XX-XXXX) */
|
|
418
431
|
"[0-9]{3}-[0-9]{2}-[0-9]{4}",
|
|
419
|
-
/*
|
|
432
|
+
/* 71: US ITIN (9XX-XX-XXXX) */
|
|
420
433
|
"9[0-9]{2}-[0-9]{2}-[0-9]{4}",
|
|
421
|
-
/*
|
|
434
|
+
/* 72: Canadian SIN (XXX-XXX-XXX) */
|
|
422
435
|
"[0-9]{3}-[0-9]{3}-[0-9]{3}",
|
|
423
|
-
/*
|
|
436
|
+
/* 73: Australian TFN (XXX-XXX-XXX or XXX XXX XXX) */
|
|
424
437
|
"[0-9]{3}[- ][0-9]{3}[- ][0-9]{3}",
|
|
425
|
-
/*
|
|
438
|
+
/* 74: Indian PAN (5 letters + 4 digits + 1 letter) */
|
|
426
439
|
"[A-Z]{5}[0-9]{4}[A-Z]",
|
|
427
|
-
/*
|
|
440
|
+
/* 75: Spanish DNI (8 digits + 1 letter) */
|
|
428
441
|
"[0-9]{8}[A-Z]",
|
|
429
|
-
/*
|
|
442
|
+
/* 76: Hungarian Tax ID (starts with 8, 10 digits) */
|
|
430
443
|
"8[0-9]{9}",
|
|
431
444
|
|
|
432
445
|
/* ---- Tier 7: Boundary-wrapped pure digits (longest → shortest) ---- */
|
|
433
|
-
/*
|
|
446
|
+
/* 77: French NIR / Social Security (15 digits) */
|
|
434
447
|
"[12][0-9]{2}[01][0-9][0-9]{2}[0-9]{3}[0-9]{3}[0-9]{2}",
|
|
435
|
-
/*
|
|
448
|
+
/* 78: South African ID (13 digits) */
|
|
436
449
|
"[0-9]{13}",
|
|
437
|
-
/*
|
|
450
|
+
/* 79: Romanian CNP (13 digits, first digit 1-8) */
|
|
438
451
|
"[1-8][0-9]{12}",
|
|
439
|
-
/*
|
|
452
|
+
/* 80: Japanese My Number (12 digits) */
|
|
440
453
|
"[0-9]{12}",
|
|
441
|
-
/*
|
|
454
|
+
/* 81: Polish PESEL (11 digits) */
|
|
442
455
|
"[0-9]{11}",
|
|
443
|
-
/*
|
|
456
|
+
/* 82: Belgian National Number (11 digits) */
|
|
444
457
|
"[0-9]{11}",
|
|
445
|
-
/*
|
|
458
|
+
/* 83: Norwegian Fødselsnummer (11 digits) */
|
|
446
459
|
"[0-9]{11}",
|
|
447
|
-
/*
|
|
460
|
+
/* 84: Passport - 9 consecutive digits */
|
|
448
461
|
"[0-9]{9}",
|
|
449
|
-
/*
|
|
462
|
+
/* 85: Dutch BSN (8-9 digits) */
|
|
450
463
|
"[0-9]{8,9}",
|
|
451
|
-
/*
|
|
464
|
+
/* 86: Austrian Abgabenkontonummer (9 digits) */
|
|
452
465
|
"[0-9]{9}",
|
|
453
|
-
/*
|
|
466
|
+
/* 87: Polish PESEL duplicate */
|
|
454
467
|
"[0-9]{11}"
|
|
455
468
|
};
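The three HashiCorp entries (29 to 31) are plain ERE strings, so they can be sanity-checked outside the C extension. The sketch below assumes Ruby's `Regexp` engine as a stand-in for whatever engine the extension compiles the patterns with; `HASHICORP_SKETCH` and `matching_names` are hypothetical names, and the tokens are synthetic strings built to the documented lengths.

```ruby
# Pattern sources copied from entries 29-31, with the C string escaping (\\.)
# reduced to the regex escape (\.).
HASHICORP_SKETCH = {
  hashicorp_vault_service_token: /hvs\.[A-Za-z0-9_-]{90,120}/,
  hashicorp_vault_batch_token:   /hvb\.[A-Za-z0-9_-]{138,300}/,
  hashicorp_terraform_api_token: /[A-Za-z0-9]{14}\.atlasv1\.[A-Za-z0-9_=-]{60,70}/
}.freeze

# Synthetic tokens; not real credentials.
service_token   = "hvs." + ("A" * 95)                        # inside the 90-120 range
short_token     = "hvs." + ("A" * 20)                        # too short to match
terraform_token = "a1B2c3D4e5F6g7" + ".atlasv1." + ("x" * 64) # 14-char id + 64-char secret

# Names of every sketch pattern that matches somewhere in +text+.
def matching_names(text)
  HASHICORP_SKETCH.select { |_name, re| re.match?(text) }.keys
end

service_hits   = matching_names("token=#{service_token}")
short_hits     = matching_names(short_token)
terraform_hits = matching_names(terraform_token)
```

The minimum lengths are what keep a short string such as `hvs.` followed by a handful of characters from being treated as a token.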
|
data/lib/data_redactor/integrations/rack.rb
CHANGED
|
@@ -1,6 +1,12 @@
|
|
|
1
1
|
require "data_redactor"
|
|
2
2
|
|
|
3
3
|
module DataRedactor
|
|
4
|
+
# Namespace for the optional framework adapters under
|
|
5
|
+
# +lib/data_redactor/integrations/+ ({Logger}, +Rails+, {Rack}).
|
|
6
|
+
#
|
|
7
|
+
# Each adapter is soft-required — none load with +require "data_redactor"+;
|
|
8
|
+
# +require+ only the one you need. They add no runtime gem dependencies and
|
|
9
|
+
# all redaction is delegated to {DataRedactor.redact}.
|
|
4
10
|
module Integrations
|
|
5
11
|
# Rack middleware that scrubs sensitive data from selectable surfaces of
|
|
6
12
|
# the response (and request headers, for downstream loggers to see scrubbed
|
|
@@ -23,8 +29,13 @@ module DataRedactor
|
|
|
23
29
|
# the env hash so any downstream middleware that logs them sees scrubbed
|
|
24
30
|
# values.
|
|
25
31
|
class Rack
|
|
32
|
+
# Surfaces scrubbed when +scrub:+ is not given to {#initialize}.
|
|
33
|
+
# @return [Array<Symbol>]
|
|
26
34
|
DEFAULT_SCRUB = [:body, :headers].freeze
|
|
27
35
|
|
|
36
|
+
# Request-header env keys redacted in place when +:headers+ is scrubbed,
|
|
37
|
+
# so downstream middleware that logs the env sees scrubbed values.
|
|
38
|
+
# @return [Array<String>] Rack env keys (HTTP_-prefixed, upper-case).
|
|
28
39
|
SENSITIVE_REQUEST_HEADERS = %w[
|
|
29
40
|
HTTP_AUTHORIZATION
|
|
30
41
|
HTTP_PROXY_AUTHORIZATION
|
|
@@ -34,6 +45,9 @@ module DataRedactor
|
|
|
34
45
|
HTTP_X_ACCESS_TOKEN
|
|
35
46
|
].freeze
|
|
36
47
|
|
|
48
|
+
# Response headers whose values are redacted when +:headers+ is scrubbed.
|
|
49
|
+
# Matched case-insensitively (Rack 2 capitalises, Rack 3 lower-cases).
|
|
50
|
+
# @return [Array<String>]
|
|
37
51
|
SENSITIVE_RESPONSE_HEADERS = %w[
|
|
38
52
|
Set-Cookie
|
|
39
53
|
Authorization
|
|
@@ -60,6 +74,13 @@ module DataRedactor
|
|
|
60
74
|
@placeholder = placeholder
|
|
61
75
|
end
|
|
62
76
|
|
|
77
|
+
# Rack entry point. Scrubs the configured surfaces of the request and
|
|
78
|
+
# response and returns the standard Rack response triple.
|
|
79
|
+
#
|
|
80
|
+
# @param env [Hash] the Rack environment.
|
|
81
|
+
# @return [Array(Integer, Hash, #each)] the +[status, headers, body]+
|
|
82
|
+
# triple, with sensitive data redacted from the surfaces named in
|
|
83
|
+
# +scrub:+. When +:body+ is scrubbed, +Content-Length+ is dropped.
|
|
63
84
|
def call(env)
|
|
64
85
|
scrub_request_headers(env) if @scrub.include?(:headers)
|
|
65
86
|
status, headers, body = @app.call(env)
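The comments added in this file describe two behaviours worth seeing side by side: response header names are compared case-insensitively, and `Content-Length` is dropped once the body has been scrubbed. A minimal sketch follows, assuming a hypothetical `ScrubbingMiddlewareSketch` class and a stand-in `SCRUB_TEXT` lambda. It is not the gem's `Rack` class, it needs no `rack` gem, and replacing a sensitive header value wholesale with a placeholder is a simplification of whatever the gem does.

```ruby
# Hypothetical stand-in for DataRedactor.redact: masks SSN-shaped text only.
SCRUB_TEXT = ->(text) { text.gsub(/[0-9]{3}-[0-9]{2}-[0-9]{4}/, "[REDACTED]") }

class ScrubbingMiddlewareSketch
  SENSITIVE = %w[set-cookie authorization].freeze # stored lower-cased

  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)

    scrubbed_headers = headers.each_with_object({}) do |(name, value), out|
      # Rack 2 apps send "Set-Cookie", Rack 3 apps send "set-cookie";
      # downcasing the name handles both.
      out[name] = SENSITIVE.include?(name.downcase) ? "[REDACTED]" : value
    end

    # The scrubbed body can differ in length, so the old value would be stale.
    scrubbed_headers.delete_if { |name, _value| name.casecmp?("content-length") }

    scrubbed_body = []
    body.each { |chunk| scrubbed_body << SCRUB_TEXT.call(chunk) }
    body.close if body.respond_to?(:close)

    [status, scrubbed_headers, scrubbed_body]
  end
end

app = lambda do |_env|
  [200,
   { "Set-Cookie" => "sid=abc", "Content-Length" => "15", "content-type" => "text/plain" },
   ["ssn 123-45-6789"]]
end
status, headers, body = ScrubbingMiddlewareSketch.new(app).call({})
```

Dropping `Content-Length` rather than recomputing it leaves it to the server to frame the response for the new body.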
|
data/lib/data_redactor/name_pattern.rb
ADDED
|
@@ -0,0 +1,170 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module DataRedactor
|
|
4
|
+
# Maps a base ASCII letter to the set of accented characters that should
|
|
5
|
+
# also match it. Used to make generated name patterns diacritic-tolerant:
|
|
6
|
+
# an input "Jose" still matches "José", and "Munoz" matches "Muñoz".
|
|
7
|
+
#
|
|
8
|
+
# @api private
|
|
9
|
+
DIACRITIC_FOLD = {
|
|
10
|
+
"a" => "àáâãäåāăą",
|
|
11
|
+
"c" => "çćĉċč",
|
|
12
|
+
"e" => "èéêëēĕėęě",
|
|
13
|
+
"i" => "ìíîïĩīĭįı",
|
|
14
|
+
"n" => "ñńņňʼn",
|
|
15
|
+
"o" => "òóôõöøōŏő",
|
|
16
|
+
"u" => "ùúûüũūŭůűų",
|
|
17
|
+
"y" => "ýÿŷ",
|
|
18
|
+
"s" => "śŝşš",
|
|
19
|
+
"z" => "źżž",
|
|
20
|
+
"g" => "ĝğġģ",
|
|
21
|
+
"l" => "ĺļľŀł",
|
|
22
|
+
"r" => "ŕŗř",
|
|
23
|
+
"t" => "ţťŧ"
|
|
24
|
+
}.freeze
|
|
25
|
+
|
|
26
|
+
module_function
|
|
27
|
+
|
|
28
|
+
# Build a POSIX ERE that matches a person's name across common written
|
|
29
|
+
# variations, ready to hand to {add_pattern}.
|
|
30
|
+
#
|
|
31
|
+
# The returned pattern is **boundary-wrapped** — it embeds
|
|
32
|
+
# +(^|[^A-Za-z])+ ... +([^A-Za-z]|$)+ so that +"Mario"+ matches as a whole
|
|
33
|
+
# word but not inside +"Mariolino"+. Because the wrapper uses capture
|
|
34
|
+
# groups, register the pattern with the default +boundary: false+ (do
|
|
35
|
+
# **not** pass +boundary: true+ — that would double-wrap and reject the
|
|
36
|
+
# groups).
|
|
37
|
+
#
|
|
38
|
+
# Variations covered:
|
|
39
|
+
# - **Case** — every letter becomes a case-insensitive character class
|
|
40
|
+
# (+[Mm][Aa]...+), since POSIX ERE has no +/i+ flag.
|
|
41
|
+
# - **Order** — +"First Last"+, +"Last First"+, +"Last, First"+,
|
|
42
|
+
# +"Last,First"+.
|
|
43
|
+
# - **Initials** — +"M. Last"+, +"M Last"+, +"First R."+, +"First R"+,
|
|
44
|
+
# +"M.R."+, +"M R"+, +"MR"+.
|
|
45
|
+
# - **Diacritics** — an ASCII letter with a {DIACRITIC_FOLD} entry also
|
|
46
|
+
# matches its accented forms (+"Jose"+ matches +"José"+). An accented
|
|
47
|
+
# input letter also matches its bare ASCII form.
|
|
48
|
+
# - **Separators** — spaces and hyphens are interchangeable between and
|
|
49
|
+
# within name parts. A hyphenated part like +"Anne-Marie"+ also matches
|
|
50
|
+
# +"Anne Marie"+, +"AnneMarie"+, and each half on its own (+"Anne"+,
|
|
51
|
+
# +"Marie"+). Multi-word parts like +"Van der Berg"+ tolerate any
|
|
52
|
+
# space/hyphen separator between words.
|
|
53
|
+
#
|
|
54
|
+
# @param first [String] the given name. May contain hyphens or spaces.
|
|
55
|
+
# @param last [String] the family name. May contain hyphens or spaces.
|
|
56
|
+
# @param middle [String, nil] optional middle name. When given, the pattern
|
|
57
|
+
# matches **both** the no-middle forms and the with-middle forms.
|
|
58
|
+
# @return [String] a POSIX ERE source string.
|
|
59
|
+
# @raise [ArgumentError] if +first+ or +last+ is not a non-empty String,
|
|
60
|
+
# or +middle+ is given but is not a non-empty String.
|
|
61
|
+
#
|
|
62
|
+
# @example Register a name pattern
|
|
63
|
+
# DataRedactor.add_pattern(
|
|
64
|
+
# name: "person_mario_rossi",
|
|
65
|
+
# regex: DataRedactor.name_pattern("Mario", "Rossi"),
|
|
66
|
+
# tag: :contact
|
|
67
|
+
# )
|
|
68
|
+
#
|
|
69
|
+
# @example With a middle name
|
|
70
|
+
# DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")
|
|
71
|
+
def name_pattern(first, last, middle: nil)
|
|
72
|
+
_validate_name_arg!(first, "first")
|
|
73
|
+
_validate_name_arg!(last, "last")
|
|
74
|
+
_validate_name_arg!(middle, "middle") unless middle.nil?
|
|
75
|
+
|
|
76
|
+
first_tok = _part_token(first)
|
|
77
|
+
last_tok = _part_token(last)
|
|
78
|
+
middle_tok = middle && _part_token(middle)
|
|
79
|
+
|
|
80
|
+
# Separator between name parts. Optional so initial-only forms collapse
|
|
81
|
+
# ("MR", "M.R.") and so "First,Last" with no space still matches.
|
|
82
|
+
sep = "[ ,-]*"
|
|
83
|
+
|
|
84
|
+
bodies = []
|
|
85
|
+
bodies << "#{first_tok}#{sep}#{last_tok}" # First Last
|
|
86
|
+
bodies << "#{last_tok}#{sep}#{first_tok}" # Last First / Last, First
|
|
87
|
+
|
|
88
|
+
if middle_tok
|
|
89
|
+
bodies << "#{first_tok}#{sep}#{middle_tok}#{sep}#{last_tok}" # First Middle Last
|
|
90
|
+
bodies << "#{last_tok}#{sep}#{first_tok}#{sep}#{middle_tok}" # Last First Middle
|
|
91
|
+
end
|
|
92
|
+
|
|
93
|
+
"(^|[^A-Za-z])(#{bodies.join('|')})([^A-Za-z]|$)"
|
|
94
|
+
end
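The boundary wrapper and the order alternation built by `name_pattern` can be tried out with a cut-down version. The sketch below is an assumption-laden simplification: `ci_word` and `simple_name_pattern` are hypothetical helpers that cover only case-insensitive letter classes and the two name orders (no initials, diacritics, middle names, or hyphenated parts), and Ruby's `Regexp` stands in for the POSIX ERE engine. In Ruby, `^` and `$` also match at line breaks, so the checks use single-line strings.

```ruby
# Turn each ASCII letter into a two-member class, e.g. "M" -> "[mM]".
def ci_word(word)
  word.chars.map { |ch| ch =~ /[A-Za-z]/ ? "[#{ch.downcase}#{ch.upcase}]" : Regexp.escape(ch) }.join
end

# Simplified analogue of name_pattern: same separator and same boundary wrapper.
def simple_name_pattern(first, last)
  sep = "[ ,-]*"
  f = ci_word(first)
  l = ci_word(last)
  bodies = ["#{f}#{sep}#{l}", "#{l}#{sep}#{f}"] # First Last / Last, First
  "(^|[^A-Za-z])(#{bodies.join('|')})([^A-Za-z]|$)"
end

NAME_RE = Regexp.new(simple_name_pattern("Mario", "Rossi"))

mixed_case  = NAME_RE.match?("Contact mario ROSSI today")
swapped     = NAME_RE.match?("Rossi, Mario")
inside_word = NAME_RE.match?("Mariolino Rossi")
# Group 2 holds the name itself; groups 1 and 3 hold the boundary characters.
captured    = NAME_RE.match("Hi Mario Rossi!")[2]
```

Because the boundary characters are part of the overall match, the name sits in the second capture group, which is why the method's documentation warns against wrapping the pattern a second time.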
|
|
95
|
+
|
|
96
|
+
# @api private
|
|
97
|
+
# Build the alternation for one name part: the full case-insensitive name,
|
|
98
|
+
# or its initial (with optional dot). Hyphenated/multi-word parts also
|
|
99
|
+
# match each sub-word alone and tolerant separators between sub-words.
|
|
100
|
+
#
|
|
101
|
+
# @param part [String] a single name part, e.g. "Mario" or "Anne-Marie".
|
|
102
|
+
# @return [String] a parenthesised POSIX ERE alternation.
|
|
103
|
+
def _part_token(part)
|
|
104
|
+
words = part.split(/[ -]+/).reject(&:empty?)
|
|
105
|
+
|
|
106
|
+
word_alts = words.map { |w| _word_alternatives(w) }
|
|
107
|
+
|
|
108
|
+
forms = []
|
|
109
|
+
# whole part with tolerant separators between its words
|
|
110
|
+
forms << word_alts.map { |alts| "(#{alts.join('|')})" }.join("[ -]?")
|
|
111
|
+
# each word on its own (covers "Anne" / "Marie" from "Anne-Marie")
|
|
112
|
+
if words.length > 1
|
|
113
|
+
word_alts.each { |alts| forms << "(#{alts.join('|')})" }
|
|
114
|
+
end
|
|
115
|
+
|
|
116
|
+
"(#{forms.uniq.join('|')})"
|
|
117
|
+
end
|
|
118
|
+
|
|
119
|
+
# @api private
|
|
120
|
+
# Alternatives for a single whitespace-free word: the full name (each
|
|
121
|
+
# letter as a case-insensitive, diacritic-folded class) and its initial.
|
|
122
|
+
#
|
|
123
|
+
# @param word [String] a single word with no spaces or hyphens.
|
|
124
|
+
# @return [Array<String>] alternation members for this word.
|
|
125
|
+
def _word_alternatives(word)
|
|
126
|
+
full = word.chars.map { |ch| _letter_class(ch) }.join
|
|
127
|
+
initial = "#{_letter_class(word[0])}\\.?"
|
|
128
|
+
[full, initial]
|
|
129
|
+
end
|
|
130
|
+
|
|
131
|
+
# @api private
|
|
132
|
+
# Build a POSIX bracket expression matching one letter case-insensitively
|
|
133
|
+
# and, where applicable, its accented variants.
|
|
134
|
+
#
|
|
135
|
+
# @param char [String] a single character.
|
|
136
|
+
# @return [String] a bracket expression, e.g. "[mM]"; letters with a DIACRITIC_FOLD entry also include their accented forms.
|
|
137
|
+
def _letter_class(char)
|
|
138
|
+
down = char.downcase
|
|
139
|
+
up = char.upcase
|
|
140
|
+
members = [down]
|
|
141
|
+
members << up unless up == down
|
|
142
|
+
|
|
143
|
+
base = DIACRITIC_FOLD.key?(down) ? down : _ascii_base(down)
|
|
144
|
+
if base && DIACRITIC_FOLD.key?(base)
|
|
145
|
+
accented = DIACRITIC_FOLD[base]
|
|
146
|
+
members << accented << accented.upcase
|
|
147
|
+
members << base << base.upcase # accented input still matches bare ASCII
|
|
148
|
+
end
|
|
149
|
+
|
|
150
|
+
"[#{members.join}]"
|
|
151
|
+
end
|
|
152
|
+
|
|
153
|
+
# @api private
|
|
154
|
+
# If +char+ is an accented letter, return the bare ASCII letter it folds
|
|
155
|
+
# to; otherwise nil.
|
|
156
|
+
#
|
|
157
|
+
# @param char [String] a single lowercase character.
|
|
158
|
+
# @return [String, nil]
|
|
159
|
+
def _ascii_base(char)
|
|
160
|
+
DIACRITIC_FOLD.each { |ascii, accents| return ascii if accents.include?(char) }
|
|
161
|
+
nil
|
|
162
|
+
end
|
|
163
|
+
|
|
164
|
+
# @api private
|
|
165
|
+
def _validate_name_arg!(value, label)
|
|
166
|
+
return if value.is_a?(String) && !value.strip.empty?
|
|
167
|
+
|
|
168
|
+
raise ArgumentError, "#{label} must be a non-empty String, got #{value.inspect}"
|
|
169
|
+
end
|
|
170
|
+
end
|
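The way `name_pattern` assembles its output can be checked with a simplified, standalone sketch. It mirrors the structure of `_letter_class`, `_word_alternatives`, `_part_token`, and the final boundary wrapper shown above, but it is ASCII-only: the contents of `DIACRITIC_FOLD` are not visible in this diff, so diacritic folding and `middle:` are left out. The method names (`letter_class`, `part_token`, `simple_name_pattern`) are hypothetical, and Ruby's `Regexp` is used here only as a stand-in for the POSIX ERE engine the C extension relies on.

```ruby
# Simplified stand-in for the helpers in name_pattern.rb.
# Assumption: ASCII input only, no diacritic folding, no middle name.

# One letter, matched case-insensitively via a bracket expression.
def letter_class(ch)
  down = ch.downcase
  up = ch.upcase
  up == down ? "[#{down}]" : "[#{down}#{up}]"
end

# Full word or its initial (with optional dot).
def word_alternatives(word)
  full = word.chars.map { |ch| letter_class(ch) }.join
  initial = "#{letter_class(word[0])}\\.?"
  [full, initial]
end

# One name part; multi-word parts also match each word on its own.
def part_token(part)
  words = part.split(/[ -]+/).reject(&:empty?)
  alts = words.map { |w| "(#{word_alternatives(w).join('|')})" }
  forms = [alts.join("[ -]?")]
  forms.concat(alts) if words.length > 1
  "(#{forms.uniq.join('|')})"
end

# First/Last in either order, wrapped in a letter boundary.
def simple_name_pattern(first, last)
  f = part_token(first)
  l = part_token(last)
  sep = "[ ,-]*"
  "(^|[^A-Za-z])(#{f}#{sep}#{l}|#{l}#{sep}#{f})([^A-Za-z]|$)"
end

PATTERN = Regexp.new(simple_name_pattern("Mario", "Rossi"))

puts PATTERN.match?("ticket from Mario Rossi about") # full name
puts PATTERN.match?("ROSSI,mario")                   # swapped order, no space
puts PATTERN.match?("M. Rossi")                      # initial plus last name
puts PATTERN.match?("Mariolino")                     # embedded in a longer word
puts PATTERN.match?("Mr Smith")                      # initials-only form
```

In this sketch the initials-only alternative is as permissive as the full forms, so the honorific `Mr` also matches the pattern for "Mario Rossi". It may be worth running a generated pattern against representative text (for example with `scan`) before registering it.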
data/lib/data_redactor.rb
CHANGED
|
@@ -1,6 +1,8 @@
|
|
|
1
1
|
require "set"
|
|
2
|
+
require "json"
|
|
2
3
|
require_relative "data_redactor/version"
|
|
3
4
|
require_relative "data_redactor/data_redactor" # loads the compiled .so
|
|
5
|
+
require_relative "data_redactor/name_pattern"
|
|
4
6
|
|
|
5
7
|
# High-performance regex-based redactor for sensitive data.
|
|
6
8
|
#
|
|
@@ -161,6 +163,54 @@ module DataRedactor
|
|
|
161
163
|
result
|
|
162
164
|
end
|
|
163
165
|
|
|
166
|
+
# Recursively redact every String value in a nested Hash/Array structure.
|
|
167
|
+
#
|
|
168
|
+
# Walks the structure depth-first. Only String leaves are passed through
|
|
169
|
+
# {redact}; all other leaf types (Integer, Float, nil, Symbol, Boolean)
|
|
170
|
+
# are returned as-is (the same objects, not duplicates). Hash keys are never modified.
|
|
171
|
+
#
|
|
172
|
+
# Returns a new structure: Hashes and Arrays are rebuilt, so the original containers are never mutated.
|
|
173
|
+
#
|
|
174
|
+
# @param data [Hash, Array, String, Object] the structure to walk.
|
|
175
|
+
# Any type is accepted; non-String scalars are returned as-is.
|
|
176
|
+
# @param only [Symbol, String, Array, nil] forwarded to {redact}.
|
|
177
|
+
# @param except [Symbol, String, Array, nil] forwarded to {redact}.
|
|
178
|
+
# @param placeholder [String, :tagged, :hash] forwarded to {redact}.
|
|
179
|
+
# @return [Hash, Array, String, Object] a new structure of the same shape
|
|
180
|
+
# with all String leaves redacted.
|
|
181
|
+
# @raise [ArgumentError] if the structure contains a circular reference.
|
|
182
|
+
#
|
|
183
|
+
# @example Rails params
|
|
184
|
+
# safe = DataRedactor.redact_deep(params.to_h)
|
|
185
|
+
#
|
|
186
|
+
# @example Mixed filter
|
|
187
|
+
# DataRedactor.redact_deep(payload, only: :credentials, placeholder: :tagged)
|
|
188
|
+
def redact_deep(data, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
|
|
189
|
+
_walk(data, only: only, except: except, placeholder: placeholder, seen: Set.new)
|
|
190
|
+
end
|
|
191
|
+
|
|
192
|
+
# Parse +json_string+, redact every String value in the resulting structure,
|
|
193
|
+
# and return valid JSON.
|
|
194
|
+
#
|
|
195
|
+
# Delegates traversal to {redact_deep}. All keyword arguments are forwarded
|
|
196
|
+
# to {redact}.
|
|
197
|
+
#
|
|
198
|
+
# @param json_string [String] valid JSON input.
|
|
199
|
+
# @param only [Symbol, String, Array, nil] forwarded to {redact}.
|
|
200
|
+
# @param except [Symbol, String, Array, nil] forwarded to {redact}.
|
|
201
|
+
# @param placeholder [String, :tagged, :hash] forwarded to {redact}.
|
|
202
|
+
# @return [String] a JSON string with all String values redacted.
|
|
203
|
+
# @raise [JSON::ParserError] if +json_string+ is not valid JSON.
|
|
204
|
+
#
|
|
205
|
+
# @example
|
|
206
|
+
# DataRedactor.redact_json('{"email":"alice@example.com","count":3}')
|
|
207
|
+
# # => '{"email":"[REDACTED]","count":3}'
|
|
208
|
+
def redact_json(json_string, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
|
|
209
|
+
parsed = JSON.parse(json_string)
|
|
210
|
+
redacted = redact_deep(parsed, only: only, except: except, placeholder: placeholder)
|
|
211
|
+
JSON.generate(redacted)
|
|
212
|
+
end
|
|
213
|
+
|
|
164
214
|
# Register a custom redaction pattern.
|
|
165
215
|
#
|
|
166
216
|
# Patterns must be valid POSIX ERE. Ruby-only syntax (+\d+, +\s+, +\w+,
|
|
@@ -317,6 +367,31 @@ module DataRedactor
|
|
|
317
367
|
bits
|
|
318
368
|
end
|
|
319
369
|
|
|
370
|
+
# @api private
|
|
371
|
+
# Depth-first recursive walker for {redact_deep}.
|
|
372
|
+
# +seen+ is a Set of object_ids already on the current traversal stack,
|
|
373
|
+
# used to detect circular references.
|
|
374
|
+
def _walk(node, only:, except:, placeholder:, seen:)
|
|
375
|
+
case node
|
|
376
|
+
when String
|
|
377
|
+
redact(node, only: only, except: except, placeholder: placeholder)
|
|
378
|
+
when Hash
|
|
379
|
+
raise ArgumentError, "redact_deep: circular reference detected" if seen.include?(node.object_id)
|
|
380
|
+
seen.add(node.object_id)
|
|
381
|
+
result = node.transform_values { |v| _walk(v, only: only, except: except, placeholder: placeholder, seen: seen) }
|
|
382
|
+
seen.delete(node.object_id)
|
|
383
|
+
result
|
|
384
|
+
when Array
|
|
385
|
+
raise ArgumentError, "redact_deep: circular reference detected" if seen.include?(node.object_id)
|
|
386
|
+
seen.add(node.object_id)
|
|
387
|
+
result = node.map { |v| _walk(v, only: only, except: except, placeholder: placeholder, seen: seen) }
|
|
388
|
+
seen.delete(node.object_id)
|
|
389
|
+
result
|
|
390
|
+
else
|
|
391
|
+
node
|
|
392
|
+
end
|
|
393
|
+
end
|
|
394
|
+
|
|
320
395
|
# @api private
|
|
321
396
|
def pattern_enabled?(name, tag_bit, only_present, only_bits, only_names,
|
|
322
397
|
except_bits, except_names)
|
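The traversal in `_walk` can be illustrated in isolation. The sketch below follows the same shape (String leaves are transformed, Hashes and Arrays are rebuilt, everything else passes through) and the same stack-based cycle check, but it substitutes a hypothetical `REDACT` lambda for `DataRedactor.redact`, which lives in the C extension and is not available here. The email regex in the lambda is an illustrative assumption, not one of the gem's patterns.

```ruby
require "set"

# Hypothetical stand-in for DataRedactor.redact: masks email-like tokens.
REDACT = ->(s) { s.gsub(/[^\s@]+@[^\s@]+/, "[REDACTED]") }

# Depth-first walk. `seen` holds the object_ids of containers currently
# on the traversal stack, so only true cycles raise.
def walk(node, seen = Set.new)
  case node
  when String
    REDACT.call(node)
  when Hash, Array
    raise ArgumentError, "circular reference detected" if seen.include?(node.object_id)

    seen.add(node.object_id)
    result =
      if node.is_a?(Hash)
        node.transform_values { |v| walk(v, seen) }
      else
        node.map { |v| walk(v, seen) }
      end
    seen.delete(node.object_id) # pop: siblings may reuse this object
    result
  else
    node
  end
end

payload = {
  "user" => { "email" => "alice@example.com" },
  "count" => 3,
  "tags" => ["admin", "alice@example.com"]
}
safe = walk(payload)

# The same Array referenced twice is a shared reference, not a cycle.
shared = ["bob@example.com"]
dag = walk({ "a" => shared, "b" => shared })

# A Hash that contains itself is a genuine cycle.
cyclic = {}
cyclic["self"] = cyclic
cycle_detected =
  begin
    walk(cyclic)
    false
  rescue ArgumentError
    true
  end
```

Because each container's id is removed from `seen` once its children have been processed, a structure that references the same Hash or Array from two places is walked twice rather than rejected; only a container that is its own ancestor triggers the error.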
data/readme.md
CHANGED
|
@@ -8,7 +8,32 @@ A Ruby gem with a C extension for high-performance regex-based redaction of sens
|
|
|
8
8
|
|
|
9
9
|
## What it does
|
|
10
10
|
|
|
11
|
-
DataRedactor scans text for sensitive
|
|
11
|
+
DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs,
|
|
12
|
+
credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
|
|
13
|
+
each match with a placeholder. The scanning runs in a C extension backed by POSIX
|
|
14
|
+
`regex.h`, so the heavy lifting happens outside the Ruby VM and stays fast enough
|
|
15
|
+
to run inline on large payloads.
|
|
16
|
+
|
|
17
|
+
It ships **88 built-in patterns** across 15+ countries, grouped into tags
|
|
18
|
+
(`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
|
|
19
|
+
care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
|
|
20
|
+
audit a payload without mutating it (`scan`), and plug into Logger, Rails, and
|
|
21
|
+
Rack. You can also register your own patterns at boot.
|
|
22
|
+
|
|
23
|
+
### Use cases
|
|
24
|
+
|
|
25
|
+
- **Log scrubbing** — drop the `Logger` formatter in so no secret or PII ever
|
|
26
|
+
reaches disk or your log aggregator.
|
|
27
|
+
- **Rails parameter filtering** — feed `filter_parameters` a redactor-backed proc
|
|
28
|
+
to keep request params out of logs and error reports.
|
|
29
|
+
- **HTTP request/response sanitising** — Rack middleware scrubs response bodies
|
|
30
|
+
and sensitive headers in flight.
|
|
31
|
+
- **Sanitising LLM / API payloads** — run `redact_deep` over a params hash or
|
|
32
|
+
`redact_json` over a JSON body before it leaves the process.
|
|
33
|
+
- **Compliance & auditing** — `scan` reports every match with byte offsets, tag,
|
|
34
|
+
and pattern name without changing the text, for false-positive tuning.
|
|
35
|
+
- **Internal identifiers** — register company-specific patterns (`add_pattern`)
|
|
36
|
+
or generate them from a person's name (`name_pattern`).
|
|
12
37
|
|
|
13
38
|
## Usage
|
|
14
39
|
|
|
@@ -103,6 +128,36 @@ DataRedactor.scan(text, except: :network)
|
|
|
103
128
|
DataRedactor.scan(text, only: :contact, except: ["email"])
|
|
104
129
|
```
|
|
105
130
|
|
|
131
|
+
### Hash / JSON traversal
|
|
132
|
+
|
|
133
|
+
Redact every string value inside a nested Hash or Array — useful for params hashes, Sidekiq job payloads, webhook bodies, and anything that isn't a flat string:
|
|
134
|
+
|
|
135
|
+
```ruby
|
|
136
|
+
# Hash — returns a deep copy, never mutates the input
|
|
137
|
+
result = DataRedactor.redact_deep({
|
|
138
|
+
"user" => { "email" => "alice@example.com" },
|
|
139
|
+
"count" => 3,
|
|
140
|
+
"tags" => ["admin", "alice@example.com"]
|
|
141
|
+
})
|
|
142
|
+
# => { "user" => { "email" => "[REDACTED]" }, "count" => 3, "tags" => ["admin", "[REDACTED]"] }
|
|
143
|
+
|
|
144
|
+
# Hash keys are never touched — only values are redacted
|
|
145
|
+
# Non-string scalars (Integer, Float, nil, Boolean) pass through unchanged
|
|
146
|
+
|
|
147
|
+
# Accepts the same filters as redact
|
|
148
|
+
DataRedactor.redact_deep(params, only: :credentials)
|
|
149
|
+
DataRedactor.redact_deep(payload, except: :network, placeholder: :tagged)
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
```ruby
|
|
153
|
+
# JSON string — parse → redact_deep → re-serialise
|
|
154
|
+
safe_json = DataRedactor.redact_json('{"email":"alice@example.com","count":3}')
|
|
155
|
+
# => '{"email":"[REDACTED]","count":3}'
|
|
156
|
+
|
|
157
|
+
# Raises JSON::ParserError on invalid input
|
|
158
|
+
DataRedactor.redact_json("not json") # raises JSON::ParserError
|
|
159
|
+
```
|
|
160
|
+
|
|
106
161
|
### Custom patterns
|
|
107
162
|
|
|
108
163
|
Teams often have internal IDs that the gem can't ship. Register them at boot:
|
|
@@ -128,6 +183,46 @@ DataRedactor.clear_custom_patterns! # mostly for test suites
|
|
|
128
183
|
|
|
129
184
|
**`boundary: true`** — wraps the pattern with `(^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$)` so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
|
|
130
185
|
|
|
186
|
+
### Name patterns
|
|
187
|
+
|
|
188
|
+
Personal names can't ship as built-ins — every team has different ones — but the regex
|
|
189
|
+
boilerplate to match a name across its written variations is the same every time.
|
|
190
|
+
`name_pattern` generates that regex for you, ready to hand to `add_pattern`:
|
|
191
|
+
|
|
192
|
+
```ruby
|
|
193
|
+
DataRedactor.add_pattern(
|
|
194
|
+
name: "person_mario_rossi",
|
|
195
|
+
regex: DataRedactor.name_pattern("Mario", "Rossi"),
|
|
196
|
+
tag: :contact
|
|
197
|
+
)
|
|
198
|
+
|
|
199
|
+
DataRedactor.redact("ticket from Mario Rossi about ...")
|
|
200
|
+
# => "ticket from [REDACTED] about ..."
|
|
201
|
+
```
|
|
202
|
+
|
|
203
|
+
A single generated pattern matches all of these:
|
|
204
|
+
|
|
205
|
+
- **Case** — `Mario Rossi`, `mario rossi`, `MARIO ROSSI`
|
|
206
|
+
- **Order** — `Mario Rossi`, `Rossi Mario`, `Rossi, Mario`, `Rossi,Mario`
|
|
207
|
+
- **Initials** — `M. Rossi`, `M Rossi`, `Mario R.`, `M.R.`, `MR`
|
|
208
|
+
- **Diacritics** — `name_pattern("Jose", "Munoz")` also matches `José Muñoz` (and vice versa)
|
|
209
|
+
- **Separators** — spaces and hyphens are interchangeable. `name_pattern("Anne-Marie", "Berg")`
|
|
210
|
+
matches `Anne-Marie Berg`, `Anne Marie Berg`, `AnneMarie Berg`, and each half alone
|
|
211
|
+
(`Anne Berg`, `Marie Berg`). Multi-word parts like `"Van der Berg"` tolerate any
|
|
212
|
+
space/hyphen separator between words.
|
|
213
|
+
|
|
214
|
+
It does **not** match a name embedded in a longer word — `Mario` will not fire inside
|
|
215
|
+
`Mariolino` — because the generated pattern is boundary-wrapped. For that reason, register
|
|
216
|
+
it with the default `boundary: false` (the wrapper is already baked into the returned
|
|
217
|
+
string; `boundary: true` would double-wrap and reject its capture groups).
|
|
218
|
+
|
|
219
|
+
Pass `middle:` to also cover a middle name — both the no-middle and with-middle forms match:
|
|
220
|
+
|
|
221
|
+
```ruby
|
|
222
|
+
DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")
|
|
223
|
+
# matches "Mario Rossi" AND "Mario Luigi Rossi" AND "Rossi Mario Luigi"
|
|
224
|
+
```
|
|
225
|
+
|
|
131
226
|
## Integrations
|
|
132
227
|
|
|
133
228
|
Optional adapters for Logger, Rails, and Rack. None are loaded automatically — `require` only what you use, and the gem adds zero runtime dependencies in the gemspec.
|
|
@@ -179,7 +274,7 @@ Pass an empty subset (e.g. `scrub: [:headers]`) to opt out of body wrapping. For
|
|
|
179
274
|
|
|
180
275
|
> **Body wrapping is buffering.** The middleware reads the entire response body into memory before scanning. For streaming endpoints (SSE, large file downloads, Rack::Hijack) use `scrub: [:headers]` and rely on the Logger formatter for application logs instead.
|
|
181
276
|
|
|
182
|
-
## Detected patterns (
|
|
277
|
+
## Detected patterns (88 total)
|
|
183
278
|
|
|
184
279
|
The table below is a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list — it stays in sync with the C extension automatically.
|
|
185
280
|
|
|
@@ -276,7 +371,9 @@ redactor/
|
|
|
276
371
|
├── lib/
|
|
277
372
|
│ ├── data_redactor.rb # Ruby entry point, loads the .so
|
|
278
373
|
│ └── data_redactor/
|
|
279
|
-
│
|
|
374
|
+
│ ├── version.rb
|
|
375
|
+
│ ├── name_pattern.rb # name_pattern helper — generates a name regex for add_pattern
|
|
376
|
+
│ └── integrations/ # soft-required Logger / Rails / Rack adapters
|
|
280
377
|
├── ext/
|
|
281
378
|
│ └── data_redactor/
|
|
282
379
|
│ ├── extconf.rb # Checks for C headers, generates Makefile (globs *.c)
|
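The parse, redact, re-serialise pipeline that the README describes for `redact_json` can be sketched with only Ruby's standard `json` library. `MASK`, `mask_values`, and `mask_json` are hypothetical names, and the email-matching lambda is an assumption standing in for the gem's own redaction, so this shows the round trip rather than the gem's actual behaviour.

```ruby
require "json"

# Hypothetical stand-in for the gem's redaction of a single String.
MASK = ->(s) { s.gsub(/[^\s@]+@[^\s@]+/, "[REDACTED]") }

# Rebuild the parsed structure, masking String values only.
# Hash keys and non-String scalars are left untouched.
def mask_values(node)
  case node
  when String then MASK.call(node)
  when Hash   then node.transform_values { |v| mask_values(v) }
  when Array  then node.map { |v| mask_values(v) }
  else node
  end
end

# Parse, mask, and serialise again. JSON::ParserError propagates
# to the caller when the input is not valid JSON.
def mask_json(json_string)
  JSON.generate(mask_values(JSON.parse(json_string)))
end

out = mask_json('{"email":"alice@example.com","count":3}')
puts out

invalid_rejected =
  begin
    mask_json("not json")
    false
  rescue JSON::ParserError
    true
  end
puts invalid_rejected
```

One consequence of this design is that the output is re-generated from the parsed structure, so whitespace and formatting of the original JSON text are not preserved; only the values and key order carry over.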
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: data_redactor
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.9.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Daniele Frisanco
|
|
@@ -110,6 +110,7 @@ files:
|
|
|
110
110
|
- lib/data_redactor/integrations/logger.rb
|
|
111
111
|
- lib/data_redactor/integrations/rack.rb
|
|
112
112
|
- lib/data_redactor/integrations/rails.rb
|
|
113
|
+
- lib/data_redactor/name_pattern.rb
|
|
113
114
|
- lib/data_redactor/version.rb
|
|
114
115
|
- readme.md
|
|
115
116
|
homepage: https://github.com/danielefrisanco/data_redactor
|