data_redactor 0.13.0 → 0.15.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +61 -1
- data/README.md +29 -17
- data/ext/data_redactor/matcher.c +35 -15
- data/ext/data_redactor/matcher.h +8 -7
- data/ext/data_redactor/patterns.c +51 -13
- data/ext/data_redactor/patterns.h +10 -1
- data/lib/data_redactor/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: dfda0b2a543fc9b415c0816dd08a3fb727c1abcfb082c4c0b6d04362c83ee4d2
|
|
4
|
+
data.tar.gz: 6e1b528c5ce1759ebbefff1f3f2e72bdfc2b2fa2959e5c523945752bebb999f2
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: a6a3e64351089a1b69b94a9ed33ee50eeb7339a0d359d357ce0ae6ca57c722d3b9fc5c92a4cb3b5fe8d7cb0fa40d0f3e50529e5b27226465294df78a30872714
|
|
7
|
+
data.tar.gz: 9f4082c693c639a8f4ab211bbbf92d4c9e6d79422f1696971937d95539670171cac662d64f2a3dd05185d5946b9c79f3b6a35d4ec1c4c4376d8853ecaaea719c
|
data/CHANGELOG.md
CHANGED
|
@@ -7,6 +7,63 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
7
7
|
|
|
8
8
|
## [Unreleased]
|
|
9
9
|
|
|
10
|
+
## [0.15.0] - 2026-06-17
|
|
11
|
+
|
|
12
|
+
### Changed
|
|
13
|
+
- **Overlap resolution is now longest-match-wins** (was earlier-index-wins). When
|
|
14
|
+
two patterns match overlapping spans, the engine keeps the **longer** span;
|
|
15
|
+
equal-length ties go to the lower pattern index (preserving prior behaviour for
|
|
16
|
+
same-length matches). The previous "earliest pattern by index wins any region it
|
|
17
|
+
can match" semantic was an accidental by-product of sequential per-pattern
|
|
18
|
+
rewriting, and it could leave a secret **partly unredacted** — e.g.
|
|
19
|
+
`AKIA…EXAMPLE` followed by 20 more alphanumeric bytes used to redact only the
|
|
20
|
+
20-char access-key prefix and leak the trailing 20 bytes; it now redacts the full
|
|
21
|
+
40-char secret. The public API (`redact`, `scan`) is unchanged; `scan` may report
|
|
22
|
+
one longer match where it previously reported several shorter overlapping ones.
|
|
23
|
+
Aligns with Onigmo/PCRE/RE2/Hyperscan semantics. Resolver only — no measurable
|
|
24
|
+
throughput change (still ~2.4× over pure-Ruby on the 1 MB log).
|
|
25
|
+
|
|
26
|
+
### Added
|
|
27
|
+
- **CI throughput regression gate** (`throughput-gate` job). Runs
|
|
28
|
+
`benchmark/ci_throughput_gate.rb`, which gates on the ratio of the C engine to
|
|
29
|
+
a pure-Ruby gsub loop over the same patterns (the ratio cancels CI-runner
|
|
30
|
+
speed variance, unlike absolute MB/s). Loose floor (1.5×; known result
|
|
31
|
+
~2.25×), informational throughput output, plus a correctness guard so an
|
|
32
|
+
engine that redacts less cannot pass as "faster". Repo/CI only — not packaged.
|
|
33
|
+
|
|
34
|
+
## [0.14.1] - 2026-06-17
|
|
35
|
+
|
|
36
|
+
### Changed
|
|
37
|
+
- **Bounded the greedy tails of seven built-in token patterns** (`jwt`,
|
|
38
|
+
`grafana_api_token`, `ssh_public_key`, `bearer_token`, `anthropic_api_key`,
|
|
39
|
+
`openai_project_api_key`, `sendgrid_api_key`). Open-ended quantifiers (`+` and
|
|
40
|
+
`{n,}`) are capped at the POSIX `RE_DUP_MAX` of 255 (`{n,255}`), matching the
|
|
41
|
+
existing `hashicorp_vault_batch_token` precedent. A token is unusable once its
|
|
42
|
+
front is redacted, so a bounded prefix is sufficient to neutralize it. This
|
|
43
|
+
restores a finite `max_len` for these patterns (re-enabling the engine's
|
|
44
|
+
literal back-up skip) and removes a theoretical O(N²) worst case where a
|
|
45
|
+
crafted prefix plus a megabyte of matching characters forces a long greedy
|
|
46
|
+
scan. Tokens longer than 255 characters are still neutralized — only a
|
|
47
|
+
cryptographically-dead tail may remain.
|
|
48
|
+
|
|
49
|
+
### Added
|
|
50
|
+
- **Key-name-anchored secret redaction** (`:credentials`). A new pattern tier
|
|
51
|
+
redacts a secret by the *name of the field it is assigned to*, for values with
|
|
52
|
+
no distinctive shape of their own — the primary case being an `.env` file or
|
|
53
|
+
config blob passed through the redactor. Anchored on the key words `password`,
|
|
54
|
+
`passwd`, `pwd`, `secret`, `token`, `api_key`, `apikey`, `access_key`, and
|
|
55
|
+
`client_secret` (case-insensitive), followed by `=` or `:` (dotenv and YAML
|
|
56
|
+
styles), with quoted (`"..."`/`'...'`) or unquoted (≥6 chars) values. Only the
|
|
57
|
+
**value** is redacted; the key is kept so logs stay greppable
|
|
58
|
+
(`PASSWORD=[REDACTED]`). Compound key names match whether the secret word is a
|
|
59
|
+
prefix or suffix segment (`POSTGRES_DB_PASSWORD=`, `PASSWORD_POSTGRES=`).
|
|
60
|
+
Requires the assignment separator, so the word in prose ("reset your password")
|
|
61
|
+
is not a false positive.
|
|
62
|
+
- `examples/` directory with runnable, copy-pasteable usage scripts for every
|
|
63
|
+
feature (core redaction, scan/dry-run, custom patterns, deep/JSON traversal,
|
|
64
|
+
and the Logger / Rack / Rails / LLM integrations). Repo-only — not packaged in
|
|
65
|
+
the gem. Linked from the README.
|
|
66
|
+
|
|
10
67
|
## [0.13.0] - 2026-06-13
|
|
11
68
|
|
|
12
69
|
### Changed
|
|
@@ -255,7 +312,10 @@ features as 0.7.1 plus the pipeline fix.
|
|
|
255
312
|
- `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
|
|
256
313
|
- RSpec suite with one example per pattern.
|
|
257
314
|
|
|
258
|
-
[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.
|
|
315
|
+
[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.15.0...HEAD
|
|
316
|
+
[0.15.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.1...v0.15.0
|
|
317
|
+
[0.14.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.0...v0.14.1
|
|
318
|
+
[0.14.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.13.0...v0.14.0
|
|
259
319
|
[0.13.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.11.0...v0.13.0
|
|
260
320
|
[0.11.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...v0.11.0
|
|
261
321
|
[0.10.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.0...v0.10.1
|
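The 0.14.1 entry above caps open-ended quantifiers at 255. The sketch below illustrates the effect on an oversized token. It is an illustration only: it evaluates the `sk-proj-` pattern text from this diff with Ruby's own `Regexp` instead of the gem's C engine, and the 1000-character token is a made-up input.

```ruby
# Pattern text copied from the diff; evaluated with Ruby's Regexp as a
# stand-in for the C engine (assumption: same match on this ASCII input).
BOUNDED   = /sk-proj-[A-Za-z0-9_-]{20,255}/
UNBOUNDED = /sk-proj-[A-Za-z0-9_-]{20,}/

# Hypothetical oversized token: 8-byte prefix plus 1000 tail characters.
token = "sk-proj-" + ("a" * 1000)

bounded_span   = token[BOUNDED]    # prefix + at most 255 tail characters
unbounded_span = token[UNBOUNDED]  # prefix + the whole tail

# Replacing the bounded span removes the front of the token; the
# remaining tail is left in place.
redacted = token.sub(BOUNDED, "[REDACTED]")

puts bounded_span.length    # 263
puts unbounded_span.length  # 1008
puts redacted[0, 10]        # [REDACTED]
puts redacted.length        # 755
```

The bounded form stops after 263 bytes, so the front of the token is always replaced while the leftover tail (745 characters here) stays in the output, which is the trade-off the changelog entry describes.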
data/README.md
CHANGED
|
@@ -12,10 +12,10 @@ DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs
|
|
|
12
12
|
credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
|
|
13
13
|
each match with a placeholder. The scanning runs in a C extension backed by a
|
|
14
14
|
zero-dependency Thompson NFA → lazy-DFA multi-pattern engine (v19) that scans
|
|
15
|
-
|
|
15
|
+
every built-in pattern in a single pass — 2–2.5× faster than pure-Ruby `gsub`
|
|
16
16
|
on large payloads, with no external library dependencies.
|
|
17
17
|
|
|
18
|
-
It ships **
|
|
18
|
+
It ships **89 built-in patterns** across 15+ countries, grouped into tags
|
|
19
19
|
(`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
|
|
20
20
|
care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
|
|
21
21
|
audit a payload without mutating it (`scan`), and plug into Logger, Rails, and
|
|
@@ -46,6 +46,12 @@ DataRedactor.redact(text)
|
|
|
46
46
|
# => "User CF is [REDACTED] and key is [REDACTED]"
|
|
47
47
|
```
|
|
48
48
|
|
|
49
|
+
Prefer runnable code? The [`examples/`](examples/) directory has self-contained,
|
|
50
|
+
copy-pasteable scripts for every feature below — core redaction, scan/dry-run,
|
|
51
|
+
custom patterns, deep/JSON traversal, and the Logger / Rack / Rails / LLM
|
|
52
|
+
integrations. Run any of them with `bundle exec ruby examples/<name>.rb` (see
|
|
53
|
+
[examples/README.md](examples/README.md)).
|
|
54
|
+
|
|
49
55
|
### Filtering by tag or pattern name
|
|
50
56
|
|
|
51
57
|
`only:` and `except:` both accept a single value or an Array, mixing **Symbols** (tag names) and **Strings** (specific pattern names).
|
|
@@ -303,7 +309,7 @@ safe_response = DataRedactor::Integrations::OpenAI.redact_response(response)
|
|
|
303
309
|
|
|
304
310
|
`content` may be a plain String or an array of content blocks/parts (`{ type: "text", text: "..." }`) — only the `text` of `text` blocks is redacted; image and other block types pass through untouched. For Claude, a top-level `system:` String is also redacted; for OpenAI, a `{ role: "system" }` message in the array is redacted like any other. Pass a bare `messages` array or the whole request Hash (with a `messages` key) — either works.
|
|
305
311
|
|
|
306
|
-
## Detected patterns (
|
|
312
|
+
## Detected patterns (89 total)
|
|
307
313
|
|
|
308
314
|
The table below is a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list — it stays in sync with the C extension automatically.
|
|
309
315
|
|
|
@@ -415,11 +421,21 @@ redactor/
|
|
|
415
421
|
│ └── tags.h # TAG_* bit constants
|
|
416
422
|
├── spec/
|
|
417
423
|
│ └── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
|
|
424
|
+
├── examples/ # Repo-only runnable usage scripts (not packaged in the gem)
|
|
425
|
+
│ ├── README.md # Index + how to run
|
|
426
|
+
│ ├── basic_redact.rb # redact, tag filters, placeholder modes
|
|
427
|
+
│ ├── scan_report.rb # scan dry-run with byte offsets
|
|
428
|
+
│ ├── custom_pattern.rb # add_pattern + name_pattern
|
|
429
|
+
│ ├── deep_and_json.rb # redact_deep / redact_json
|
|
430
|
+
│ ├── logger.rb # Logger::Formatter integration
|
|
431
|
+
│ ├── rack_middleware.rb # Rack middleware (body + headers)
|
|
432
|
+
│ ├── rails_filter.rb # filter_parameters adapter
|
|
433
|
+
│ └── llm_payload.rb # Claude / OpenAI message + response redaction
|
|
418
434
|
├── benchmark/ # Repo-only perf scripts (not packaged in the gem)
|
|
419
435
|
│ ├── README.md # How to run, what each script measures
|
|
420
436
|
│ ├── support/corpus.rb # Shared payload builders + pure-Ruby baseline redactor
|
|
421
437
|
│ ├── throughput.rb # MB/s on representative payloads
|
|
422
|
-
│ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same
|
|
438
|
+
│ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same patterns)
|
|
423
439
|
│ ├── scaling.rb # Runtime vs input size 1KB → 50MB
|
|
424
440
|
│ └── per_pattern.rb # Per-pattern scan cost
|
|
425
441
|
└── docs/ # Design and execution docs for future work
|
|
@@ -507,7 +523,7 @@ different angles. They are **not** packaged with the gem.
|
|
|
507
523
|
```bash
|
|
508
524
|
bundle install # pulls benchmark-ips, benchmark-memory (dev deps)
|
|
509
525
|
bundle exec rake compile
|
|
510
|
-
bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same
|
|
526
|
+
bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same patterns
|
|
511
527
|
bundle exec ruby benchmark/throughput.rb # MB/s on a log line, JSON, 1MB and 10MB log files
|
|
512
528
|
bundle exec ruby benchmark/scaling.rb # runtime vs input size (1KB → 50MB), confirms linear scaling
|
|
513
529
|
bundle exec ruby benchmark/per_pattern.rb # per-pattern scan cost over a 1MB payload
|
|
@@ -519,11 +535,8 @@ C engine uses, via `DataRedactor::BUILTIN_PATTERN_SOURCES`).
|
|
|
519
535
|
|
|
520
536
|
### Performance (0.10.0 — v19 multi-pattern engine)
|
|
521
537
|
|
|
522
|
-
|
|
523
|
-
|
|
524
|
-
with two selective-merge passes (pure-digit group + IBAN union) that further
|
|
525
|
-
reduce work for the most common pattern classes. Custom patterns (`add_pattern`)
|
|
526
|
-
still use the glibc path (required for correct UTF-8 diacritic matching).
|
|
538
|
+
Measured on the v19 engine ([How it works](#how-it-works)) vs a pure-Ruby `gsub`
|
|
539
|
+
loop over the same patterns:
|
|
527
540
|
|
|
528
541
|
| Payload | v19 engine (0.10.0) | Pure-Ruby `gsub` | Ratio |
|
|
529
542
|
|-----------------------|---------------------|------------------|-----------------|
|
|
@@ -560,14 +573,14 @@ machine-dependent, but the flat curve is not.
|
|
|
560
573
|
|
|
561
574
|
## How it works
|
|
562
575
|
|
|
563
|
-
1. At load time, `
|
|
564
|
-
2. `DataRedactor.redact(text)`
|
|
565
|
-
3.
|
|
566
|
-
4.
|
|
576
|
+
1. At load time, `mm_init()` compiles every built-in pattern from a Thompson NFA into bytecode, lazily building each pattern's DFA on first use (interned and cached). Boundary-wrapped patterns are expanded with the word-boundary group before compilation.
|
|
577
|
+
2. `DataRedactor.redact(text)` / `scan(text)` hand the input to the v19 engine, which scans it **once** and emits `(pattern_id, start, length)` events for every enabled pattern. Two selective-merge passes (a pure-digit group and an IBAN union) collapse the most common pattern classes into shared scans. The single pass over the original buffer is what makes the engine O(N).
|
|
578
|
+
3. The raw events are resolved by `mm_resolve` under the **longest-match-wins** policy: overlapping spans are reduced to a non-overlapping set keeping the longest match at each position, with the lower pattern index breaking equal-length ties.
|
|
579
|
+
4. `redact` rewrites the surviving spans to placeholders in one buffer build (preserving the boundary characters of boundary-wrapped matches); `scan` returns the event list with byte offsets into the original string. Custom patterns (`add_pattern`) run on the glibc `regexec` path afterward — required for correct UTF-8 diacritic matching.
|
|
567
580
|
|
|
568
581
|
## Memory management
|
|
569
582
|
|
|
570
|
-
All C-side buffers are heap-allocated
|
|
583
|
+
All C-side working buffers are heap-allocated and freed before the call returns; the only Ruby-managed allocation is the final result `String`. No Ruby objects are created mid-scan, so GC cannot collect anything out from under the C code. Per-thread engine scratch (NFA state, lazy-DFA cache) is freed automatically when the thread exits — see [Thread safety](#thread-safety).
|
|
571
584
|
|
|
572
585
|
## Thread safety
|
|
573
586
|
|
|
@@ -585,7 +598,6 @@ Released under the [MIT License](LICENSE).
|
|
|
585
598
|
|
|
586
599
|
## Known limitations
|
|
587
600
|
|
|
588
|
-
- **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
|
|
589
601
|
- **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
|
|
590
602
|
- **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
|
|
591
|
-
- **
|
|
603
|
+
- **Overlap resolution is longest-match-wins** — when two patterns match overlapping spans the engine keeps the longer span; equal-length ties go to the lower pattern index. This favours redacting *more* when uncertain (a 40-char secret is redacted whole rather than leaking the bytes past a shorter prefix match). When two secrets abut with **no separator** between them, a boundary-wrapped pattern can fail to match because the original buffer has no word boundary where one token meets the next, leaving the abutting token unredacted. This is rare in real text (secrets are almost always separator-delimited).
|
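The longest-match-wins policy described in the README hunks above can be sketched in a few lines of Ruby. This is a simplified model, not the gem's code: the `Match` struct and `resolve` method are hypothetical names, and the event values are invented. The pattern ids reuse indices that appear in this diff (14 and 15 for the AWS key patterns, 84 and 86 for two 9-digit patterns).

```ruby
# Hypothetical event type mirroring the (pattern_id, start, length) triple.
Match = Struct.new(:pattern_id, :start, :length) do
  def stop
    start + length
  end
end

# Order by start ascending, length descending, pattern_id ascending, then
# keep each event whose span [start, stop) overlaps no already-kept span.
def resolve(events)
  ordered = events.sort_by { |e| [e.start, -e.length, e.pattern_id] }
  kept = []
  ordered.each do |e|
    overlaps = kept.any? { |k| e.start < k.stop && k.start < e.stop }
    kept << e unless overlaps
  end
  kept
end

events = [
  Match.new(14, 0, 20),  # shorter match at offset 0
  Match.new(15, 0, 40),  # longer match at the same offset
  Match.new(84, 50, 9),  # equal-length tie at offset 50
  Match.new(86, 50, 9)
]

resolve(events).each { |m| puts m.to_a.inspect }
# [15, 0, 40]
# [84, 50, 9]
```

The 40-byte span beats the 20-byte span at offset 0, and the tie at offset 50 goes to the lower pattern id, matching the two rules stated in the README text.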
data/ext/data_redactor/matcher.c
CHANGED
|
@@ -16,10 +16,10 @@
|
|
|
16
16
|
*
|
|
17
17
|
* 2. Output contract. mm_scan() takes an enable_bits gate and emits ORIGINAL-
|
|
18
18
|
* frame (pattern_id, start, span) events for ALL enabled patterns in one
|
|
19
|
-
* pass
|
|
20
|
-
*
|
|
21
|
-
*
|
|
22
|
-
*
|
|
19
|
+
* pass. The caller applies mm_resolve() (longest-match-wins greedy claim:
|
|
20
|
+
* longest span at each position wins, equal lengths broken by lower
|
|
21
|
+
* pattern_id) to pick the final non-overlapping set. See TODO.md §1d Gap 5
|
|
22
|
+
* and the overlap-resolution specs in spec/data_redactor_spec.rb.
|
|
23
23
|
*
|
|
24
24
|
* The infix-literal classification and the BM_INFIX hint table below are ported
|
|
25
25
|
* from prototypes/multi_pattern_matcher/gen_patterns.rb (which derived them from the
|
|
@@ -406,6 +406,7 @@ typedef struct {
|
|
|
406
406
|
int has_first_filter;
|
|
407
407
|
int use_dfa;
|
|
408
408
|
int boundary_wrapped;
|
|
409
|
+
int keyname_anchored;
|
|
409
410
|
int has_eol;
|
|
410
411
|
size_t max_len;
|
|
411
412
|
/* selective-merge membership (built-ins only; customs never join a merge) */
|
|
@@ -1014,6 +1015,24 @@ static size_t scan_one(int p, scan_state_t *state, const char *input, size_t len
|
|
|
1014
1015
|
!isalnum((unsigned char)input[core_so])) core_so++;
|
|
1015
1016
|
if (core_eo > core_so &&
|
|
1016
1017
|
!isalnum((unsigned char)input[core_eo-1])) core_eo--;
|
|
1018
|
+
} else if (eng->keyname_anchored) {
|
|
1019
|
+
/* The match is KEY<sep>VALUE (e.g. PASSWORD="hunter2"). We redact
|
|
1020
|
+
* only VALUE and keep KEY<sep> so logs stay greppable. The value
|
|
1021
|
+
* grammar forbids '=' and ':' unquoted, so the FIRST separator in
|
|
1022
|
+
* the span unambiguously ends the key. Advance past it, then past
|
|
1023
|
+
* surrounding whitespace and a single opening/closing quote. */
|
|
1024
|
+
size_t s = core_so;
|
|
1025
|
+
while (s < core_eo && input[s] != '=' && input[s] != ':') s++;
|
|
1026
|
+
if (s < core_eo) s++; /* skip the separator */
|
|
1027
|
+
while (s < core_eo &&
|
|
1028
|
+
(input[s] == ' ' || input[s] == '\t')) s++;
|
|
1029
|
+
if (s < core_eo &&
|
|
1030
|
+
(input[s] == '"' || input[s] == '\'')) {
|
|
1031
|
+
char q = input[s];
|
|
1032
|
+
s++;
|
|
1033
|
+
if (core_eo > s && input[core_eo-1] == q) core_eo--;
|
|
1034
|
+
}
|
|
1035
|
+
core_so = s;
|
|
1017
1036
|
}
|
|
1018
1037
|
if (count < max)
|
|
1019
1038
|
out[count++] = (mm_match_t){p, core_so, core_eo - core_so};
|
|
@@ -1150,6 +1169,7 @@ void mm_init(void) {
|
|
|
1150
1169
|
for (int p = 0; p < NUM_PATTERNS; p++) {
|
|
1151
1170
|
engine_t *eng = eng_grow_one();
|
|
1152
1171
|
engine_build(eng, pattern_strings[p], boundary_wrapped[p], pattern_names[p]);
|
|
1172
|
+
eng->keyname_anchored = keyname_anchored[p];
|
|
1153
1173
|
|
|
1154
1174
|
const char *lit = pattern_required_literal[p];
|
|
1155
1175
|
if (lit) {
|
|
@@ -1240,14 +1260,14 @@ size_t mm_scan(const char *input, size_t len,
|
|
|
1240
1260
|
return count;
|
|
1241
1261
|
}
|
|
1242
1262
|
|
|
1243
|
-
/* Order events for the
|
|
1244
|
-
*
|
|
1245
|
-
*
|
|
1263
|
+
/* Order events for the longest-match-wins greedy claim: ascending start, then
|
|
1264
|
+
* descending length (so the longest span at a given start is seen first), then
|
|
1265
|
+
* ascending pattern_id (lower index wins a tie of equal length). */
|
|
1246
1266
|
static int ev_cmp_resolve(const void *a, const void *b) {
|
|
1247
1267
|
const mm_match_t *x = a, *y = b;
|
|
1248
|
-
if (x->pattern_id != y->pattern_id) return x->pattern_id - y->pattern_id;
|
|
1249
1268
|
if (x->start != y->start) return x->start < y->start ? -1 : 1;
|
|
1250
|
-
return
|
|
1269
|
+
if (x->length != y->length) return x->length > y->length ? -1 : 1;
|
|
1270
|
+
return x->pattern_id - y->pattern_id;
|
|
1251
1271
|
}
|
|
1252
1272
|
|
|
1253
1273
|
/* Order kept events for emission: ascending start. */
|
|
@@ -1261,12 +1281,12 @@ size_t mm_resolve(mm_match_t *ev, size_t n) {
|
|
|
1261
1281
|
if (n == 0) return 0;
|
|
1262
1282
|
qsort(ev, n, sizeof(mm_match_t), ev_cmp_resolve);
|
|
1263
1283
|
|
|
1264
|
-
/* Greedy claim in (
|
|
1265
|
-
*
|
|
1266
|
-
*
|
|
1267
|
-
*
|
|
1268
|
-
*
|
|
1269
|
-
*
|
|
1284
|
+
/* Greedy claim in (start, -length, pattern_id) order: the longest span at
|
|
1285
|
+
* each position is offered first and claims its region; any later (shorter,
|
|
1286
|
+
* or equal-length higher-id) event overlapping an already-kept span is
|
|
1287
|
+
* dropped. An event is kept iff its span [start, start+length) does not
|
|
1288
|
+
* overlap any already-kept span. Match counts are modest, so a linear
|
|
1289
|
+
* overlap check against the kept set is used. */
|
|
1270
1290
|
mm_match_t *kept = mm_xmalloc(n * sizeof(mm_match_t));
|
|
1271
1291
|
size_t nk = 0;
|
|
1272
1292
|
for (size_t i = 0; i < n; i++) {
|
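The `keyname_anchored` branch in the `scan_one` hunk above narrows a `KEY<sep>VALUE` match to the value alone. A Ruby transliteration of that trimming logic is sketched below. `value_span` and `redact_value` are hypothetical helper names, and for simplicity the sketch treats the whole input line as the match span, whereas the engine supplies the span itself.

```ruby
# Given the KEY<sep>VALUE span [so, eo) in input, return [start, length]
# of VALUE only. Follows the steps of the C branch shown in the hunk.
def value_span(input, so, eo)
  s = so
  # Advance to the first separator, which ends the key.
  s += 1 while s < eo && input[s] != "=" && input[s] != ":"
  s += 1 if s < eo                                   # skip the separator
  s += 1 while s < eo && [" ", "\t"].include?(input[s])
  if s < eo && ['"', "'"].include?(input[s])
    quote = input[s]
    s += 1                                           # drop opening quote
    eo -= 1 if eo > s && input[eo - 1] == quote      # drop closing quote
  end
  [s, eo - s]
end

# Hypothetical helper: replace only the value, keep key and separator.
def redact_value(line, placeholder = "[REDACTED]")
  start, len = value_span(line, 0, line.length)
  line[0, start] + placeholder + line[(start + len)..-1]
end

puts redact_value('PASSWORD="hunter2"')      # PASSWORD="[REDACTED]"
puts redact_value("DB_TOKEN: abcdef123456")  # DB_TOKEN: [REDACTED]
```

Because the quotes are trimmed off the span rather than included in it, they survive in the output of this sketch alongside the key and separator.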
data/ext/data_redactor/matcher.h
CHANGED
|
@@ -53,19 +53,20 @@ void mm_clear_custom(void);
|
|
|
53
53
|
* array disables out-of-range patterns. Events carry ORIGINAL-frame offsets.
|
|
54
54
|
*
|
|
55
55
|
* Events are NOT pre-resolved for cross-pattern overlap — the caller applies
|
|
56
|
-
* the
|
|
57
|
-
*
|
|
56
|
+
* the longest-match-wins greedy claim (mm_resolve) to pick the final
|
|
57
|
+
* non-overlapping set.
|
|
58
58
|
*/
|
|
59
59
|
size_t mm_scan(const char *input, size_t len,
|
|
60
60
|
const int *enable_bits, size_t n_bits,
|
|
61
61
|
mm_match_t *out, size_t max);
|
|
62
62
|
|
|
63
63
|
/*
|
|
64
|
-
* Resolve raw scan events into the non-overlapping set the
|
|
65
|
-
*
|
|
66
|
-
* event iff its CORE span does not overlap an
|
|
67
|
-
*
|
|
68
|
-
*
|
|
64
|
+
* Resolve raw scan events into the final non-overlapping set under the
|
|
65
|
+
* longest-match-wins policy: process events in (start asc, length desc,
|
|
66
|
+
* pattern_id asc) order and keep an event iff its CORE span does not overlap an
|
|
67
|
+
* already-kept span. The longest match at each position wins; equal-length ties
|
|
68
|
+
* go to the lower pattern_id. Sorts `ev` in place and returns the kept count
|
|
69
|
+
* (compacted to the front of `ev`), in ascending start order.
|
|
69
70
|
*/
|
|
70
71
|
size_t mm_resolve(mm_match_t *ev, size_t n);
|
|
71
72
|
|
|
data/ext/data_redactor/patterns.c
CHANGED
|
@@ -120,7 +120,17 @@ const int boundary_wrapped[NUM_PATTERNS] = {
|
|
|
120
120
|
1, /* 84: Passport 9 digits */
|
|
121
121
|
1, /* 85: Dutch BSN (8-9 digits) */
|
|
122
122
|
1, /* 86: Austrian Abgabenkontonummer (9 digits) */
|
|
123
|
-
1
|
|
123
|
+
1, /* 87: Polish PESEL duplicate */
|
|
124
|
+
0 /* 88: Key-name-anchored secret (KEY=VALUE / KEY: VALUE) */
|
|
125
|
+
};
|
|
126
|
+
|
|
127
|
+
/*
|
|
128
|
+
* keyname_anchored[i] == 1 marks a KEY<sep>VALUE pattern whose match span has
|
|
129
|
+
* the key + separator (and any quotes) stripped so only VALUE is redacted.
|
|
130
|
+
* Mutually exclusive with boundary_wrapped[] above. See patterns.h.
|
|
131
|
+
*/
|
|
132
|
+
const int keyname_anchored[NUM_PATTERNS] = {
|
|
133
|
+
[88] = 1,
|
|
124
134
|
};
|
|
125
135
|
|
|
126
136
|
/*
|
|
@@ -178,7 +188,8 @@ const int pattern_tags[NUM_PATTERNS] = {
|
|
|
178
188
|
TAG_TRAVEL, /* 84: passport 9 digits */
|
|
179
189
|
TAG_NATIONAL_ID, /* 85: Dutch BSN */
|
|
180
190
|
TAG_TAX_ID, /* 86: Austrian Abgabenkontonummer */
|
|
181
|
-
TAG_NATIONAL_ID
|
|
191
|
+
TAG_NATIONAL_ID, /* 87: Polish PESEL duplicate */
|
|
192
|
+
TAG_CREDENTIALS /* 88: Key-name-anchored secret */
|
|
182
193
|
};
|
|
183
194
|
|
|
184
195
|
const char *pattern_names[NUM_PATTERNS] = {
|
|
@@ -269,7 +280,8 @@ const char *pattern_names[NUM_PATTERNS] = {
|
|
|
269
280
|
"passport_9digits", /* 84 */
|
|
270
281
|
"dutch_bsn", /* 85 */
|
|
271
282
|
"austrian_abgabenkontonummer", /* 86 */
|
|
272
|
-
"polish_pesel_2"
|
|
283
|
+
"polish_pesel_2", /* 87 */
|
|
284
|
+
"keyname_anchored_secret" /* 88 */
|
|
273
285
|
};
|
|
274
286
|
|
|
275
287
|
/*
|
|
@@ -387,7 +399,8 @@ const char *pattern_required_literal[NUM_PATTERNS] = {
|
|
|
387
399
|
NULL, /* 84: passport 9 digits — pure digits */
|
|
388
400
|
NULL, /* 85: Dutch BSN — pure digits */
|
|
389
401
|
NULL, /* 86: Austrian Abgabenkontonummer — pure digits */
|
|
390
|
-
NULL
|
|
402
|
+
NULL, /* 87: Polish PESEL duplicate — pure digits */
|
|
403
|
+
NULL /* 88: Key-name-anchored — key name is an alternation, no single required literal */
|
|
391
404
|
};
|
|
392
405
|
|
|
393
406
|
/*
|
|
@@ -412,18 +425,21 @@ const char *pattern_strings[NUM_PATTERNS] = {
|
|
|
412
425
|
/* ---- Tier 2: Long prefixed tokens ---- */
|
|
413
426
|
/* 6: GitHub PAT fine-grained (github_pat_ + 82 chars) */
|
|
414
427
|
"github_pat_[0-9a-zA-Z_]{82}",
|
|
415
|
-
/* 7: JWT (three base64url segments)
|
|
416
|
-
|
|
428
|
+
/* 7: JWT (three base64url segments). Tails bounded at RE_DUP_MAX (255):
|
|
429
|
+
* a JWT is unusable once its front is gone, so a bounded prefix is enough to
|
|
430
|
+
* neutralize it. Bounding restores a finite max_len (re-enables the engine's
|
|
431
|
+
* literal back-up skip) and removes the O(N^2) greedy-tail worst case. */
|
|
432
|
+
"eyJ[A-Za-z0-9_-]{10,255}\\.eyJ[A-Za-z0-9_-]{10,255}\\.[A-Za-z0-9_-]{1,255}",
|
|
417
433
|
/* 8: Grafana API Token (base64 of {\"k\":\") */
|
|
418
|
-
"eyJrIjoi[A-Za-z0-9_=-]{42,}",
|
|
434
|
+
"eyJrIjoi[A-Za-z0-9_=-]{42,255}",
|
|
419
435
|
/* 9: SSH Public Key */
|
|
420
|
-
"ssh-(rsa|ed25519|ecdsa) [a-zA-Z0-9/+=]{20,}",
|
|
436
|
+
"ssh-(rsa|ed25519|ecdsa) [a-zA-Z0-9/+=]{20,255}",
|
|
421
437
|
/* 10: Bearer Token */
|
|
422
|
-
"[Bb]earer [a-zA-Z0-9_.=/+:-]{12,}",
|
|
438
|
+
"[Bb]earer [a-zA-Z0-9_.=/+:-]{12,255}",
|
|
423
439
|
/* 11: Anthropic API Key (sk-ant-apiNN-... ~ 95+ chars) */
|
|
424
|
-
"sk-ant-api[0-9]{2}-[A-Za-z0-9_-]{90,}",
|
|
440
|
+
"sk-ant-api[0-9]{2}-[A-Za-z0-9_-]{90,255}",
|
|
425
441
|
/* 12: OpenAI Project API Key (sk-proj-...) */
|
|
426
|
-
"sk-proj-[A-Za-z0-9_-]{20,}",
|
|
442
|
+
"sk-proj-[A-Za-z0-9_-]{20,255}",
|
|
427
443
|
/* 13: Google API Key (AIza + 35 chars) */
|
|
428
444
|
"AIza[0-9A-Za-z_-]{35}",
|
|
429
445
|
/* 14: AWS Access Key ID (all prefixes + 16 chars) */
|
|
@@ -431,7 +447,7 @@ const char *pattern_strings[NUM_PATTERNS] = {
|
|
|
431
447
|
/* 15: AWS Secret Access Key (40 base64 chars) */
|
|
432
448
|
"[A-Za-z0-9/+=]{40}",
|
|
433
449
|
/* 16: SendGrid API Key */
|
|
434
|
-
"SG\\.[a-zA-Z0-9_-]{5,}\\.[a-zA-Z0-9_-]{5,}",
|
|
450
|
+
"SG\\.[a-zA-Z0-9_-]{5,255}\\.[a-zA-Z0-9_-]{5,255}",
|
|
435
451
|
/* 17: Amazon MWS Auth Token */
|
|
436
452
|
"amzn\\.mws\\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
|
|
437
453
|
/* 18: LaunchDarkly API Key (api-UUID or sdk-UUID) */
|
|
@@ -587,5 +603,27 @@ const char *pattern_strings[NUM_PATTERNS] = {
|
|
|
587
603
|
/* 86: Austrian Abgabenkontonummer (9 digits) */
|
|
588
604
|
"[0-9]{9}",
|
|
589
605
|
/* 87: Polish PESEL duplicate */
|
|
590
|
-
"[0-9]{11}"
|
|
606
|
+
"[0-9]{11}",
|
|
607
|
+
/* 88: Key-name-anchored secret (dotenv KEY=VALUE / YAML KEY: VALUE).
|
|
608
|
+
* POSIX ERE has no /i, so each key name is char-class case-folded by hand.
|
|
609
|
+
* Keys ordered longest-first so leftmost-longest picks the full name.
|
|
610
|
+
* The key word may be surrounded by other key-name chars on either side
|
|
611
|
+
* (unanchored left; [A-Za-z0-9_]* right) so compound names match both ways:
|
|
612
|
+
* POSTGRES_DB_PASSWORD= (prefix) and PASSWORD_POSTGRES= (suffix).
|
|
613
|
+
* Separator is = or : with optional surrounding space. Value is either a
|
|
614
|
+
* quoted run ("..."/'...') or an unquoted token of >=6 chars that stops at
|
|
615
|
+
* whitespace, quotes, ; , : =. The matcher strips key+sep (keyname_anchored)
|
|
616
|
+
* so only the value is redacted, the full compound key name is kept. */
|
|
617
|
+
"([Cc][Ll][Ii][Ee][Nn][Tt]_[Ss][Ee][Cc][Rr][Ee][Tt]"
|
|
618
|
+
"|[Aa][Cc][Cc][Ee][Ss][Ss]_[Kk][Ee][Yy]"
|
|
619
|
+
"|[Aa][Pp][Ii]_[Kk][Ee][Yy]"
|
|
620
|
+
"|[Aa][Pp][Ii][Kk][Ee][Yy]"
|
|
621
|
+
"|[Pp][Aa][Ss][Ss][Ww][Oo][Rr][Dd]"
|
|
622
|
+
"|[Pp][Aa][Ss][Ss][Ww][Dd]"
|
|
623
|
+
"|[Ss][Ee][Cc][Rr][Ee][Tt]"
|
|
624
|
+
"|[Tt][Oo][Kk][Ee][Nn]"
|
|
625
|
+
"|[Pp][Ww][Dd])"
|
|
626
|
+
"[A-Za-z0-9_]*"
|
|
627
|
+
"[[:space:]]*[=:][[:space:]]*"
|
|
628
|
+
"(\"[^\"]+\"|'[^']+'|[^[:space:]\"';,:=]{6,})"
|
|
591
629
|
};
|
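The comment on pattern 88 notes that POSIX ERE has no `/i` flag, so every key name is case-folded by hand. The Ruby sketch below shows one way such a string could be generated and tried out. `case_fold` and the constant names are hypothetical, the key words are listed in the order they appear in the hunk, and Ruby's `Regexp` (leftmost-first) stands in for the C engine's POSIX matching, so it is an approximation rather than the gem's behaviour.

```ruby
# Expand each ASCII letter to a two-character class; keep other characters.
def case_fold(word)
  word.each_char.map { |c|
    c =~ /[A-Za-z]/ ? "[#{c.upcase}#{c.downcase}]" : c
  }.join
end

# Key words in the order used by the pattern string in the hunk.
KEY_WORDS = %w[client_secret access_key api_key apikey
               password passwd secret token pwd]

KEY   = "(" + KEY_WORDS.map { |w| case_fold(w) }.join("|") + ")"
REST  = "[A-Za-z0-9_]*"                    # trailing key-name characters
SEP   = "[[:space:]]*[=:][[:space:]]*"     # = or : with optional space
VALUE = %q<("[^"]+"|'[^']+'|[^[:space:]"';,:=]{6,})>

PATTERN = Regexp.new(KEY + REST + SEP + VALUE)

puts case_fold("pwd")      # [Pp][Ww][Dd]
puts case_fold("api_key")  # [Aa][Pp][Ii]_[Kk][Ee][Yy]
puts PATTERN.match?("POSTGRES_DB_PASSWORD=hunter22")  # true
puts PATTERN.match?("reset your password now")        # false
```

The second check fails to match because no `=` or `:` follows the key word, which is the property the changelog relies on to avoid false positives in prose.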
|
data/ext/data_redactor/patterns.h
CHANGED
|
@@ -3,13 +3,22 @@
|
|
|
3
3
|
|
|
4
4
|
#include <regex.h>
|
|
5
5
|
|
|
6
|
-
#define NUM_PATTERNS
|
|
6
|
+
#define NUM_PATTERNS 89
|
|
7
7
|
|
|
8
8
|
extern const char *pattern_strings[NUM_PATTERNS];
|
|
9
9
|
extern const int boundary_wrapped[NUM_PATTERNS];
|
|
10
10
|
extern const int pattern_tags[NUM_PATTERNS];
|
|
11
11
|
extern const char *pattern_names[NUM_PATTERNS];
|
|
12
12
|
|
|
13
|
+
/*
|
|
14
|
+
* Key-name-anchored patterns match KEY<sep>VALUE (e.g. PASSWORD="hunter2") and
|
|
15
|
+
* redact only VALUE, preserving KEY<sep> so logs stay greppable. The matcher
|
|
16
|
+
* strips the key+separator (and surrounding quotes/whitespace) from the match
|
|
17
|
+
* span; see the keyname_anchored branch in matcher.c's match emission. These
|
|
18
|
+
* are mutually exclusive with boundary_wrapped[] (a span has one strip rule).
|
|
19
|
+
*/
|
|
20
|
+
extern const int keyname_anchored[NUM_PATTERNS];
|
|
21
|
+
|
|
13
22
|
/*
|
|
14
23
|
* Optional case-sensitive literal substring that the input must contain for
|
|
15
24
|
* the pattern to have any chance of matching. NULL means no pre-filter — the
|