data_redactor 0.14.0-x86_64-linux → 0.15.0-x86_64-linux
- checksums.yaml +4 -4
- data/CHANGELOG.md +41 -2
- data/README.md +13 -17
- data/lib/data_redactor/3.0/data_redactor.so +0 -0
- data/lib/data_redactor/3.1/data_redactor.so +0 -0
- data/lib/data_redactor/3.2/data_redactor.so +0 -0
- data/lib/data_redactor/3.3/data_redactor.so +0 -0
- data/lib/data_redactor/3.4/data_redactor.so +0 -0
- data/lib/data_redactor/4.0/data_redactor.so +0 -0
- data/lib/data_redactor/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 9e46ff430308e5ed5a1908907a393e88ec383c0f1cfa7a068fd39ada4a196d59
|
|
4
|
+
data.tar.gz: '094f0ebbfdbd0299377adc33db8725e2dec5f974df4281403bbb829e09ee8519'
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: b1b4862afe360943b379874dbc87853d25c2b686d84fe94db2e1b0946b843ec365b722223fc1c719767558657661a1bef0d30e7296f9b5da85711be19e86deda
|
|
7
|
+
data.tar.gz: 1e77bf2887c050776b7765d2d946d4f2b0650dfc29f5360859f08ddc24232b9bc34482ec3cf9cacf53726379b0f3fcb8c3700866e1870b747074e0eb4178667c
|
data/CHANGELOG.md
CHANGED
|
@@ -7,7 +7,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
7
7
|
|
|
8
8
|
## [Unreleased]
|
|
9
9
|
|
|
10
|
-
## [0.
|
|
10
|
+
## [0.15.0] - 2026-06-17
|
|
11
|
+
|
|
12
|
+
### Changed
|
|
13
|
+
- **Overlap resolution is now longest-match-wins** (was earlier-index-wins). When
|
|
14
|
+
two patterns match overlapping spans, the engine keeps the **longer** span;
|
|
15
|
+
equal-length ties go to the lower pattern index (preserving prior behaviour for
|
|
16
|
+
same-length matches). The previous "earliest pattern by index wins any region it
|
|
17
|
+
can match" semantic was an accidental by-product of sequential per-pattern
|
|
18
|
+
rewriting, and it could leave a secret **partly unredacted** — e.g.
|
|
19
|
+
`AKIA…EXAMPLE` followed by 20 more alphanumeric bytes used to redact only the
|
|
20
|
+
20-char access-key prefix and leak the trailing 20 bytes; it now redacts the full
|
|
21
|
+
40-char secret. The public API (`redact`, `scan`) is unchanged; `scan` may report
|
|
22
|
+
one longer match where it previously reported several shorter overlapping ones.
|
|
23
|
+
Aligns with Onigmo/PCRE/RE2/Hyperscan semantics. Resolver only — no measurable
|
|
24
|
+
throughput change (still ~2.4× over pure-Ruby on the 1 MB log).
|
|
25
|
+
|
|
26
|
+
### Added
|
|
27
|
+
- **CI throughput regression gate** (`throughput-gate` job). Runs
|
|
28
|
+
`benchmark/ci_throughput_gate.rb`, which gates on the ratio of the C engine to
|
|
29
|
+
a pure-Ruby gsub loop over the same patterns (the ratio cancels CI-runner
|
|
30
|
+
speed variance, unlike absolute MB/s). Loose floor (1.5×; known result
|
|
31
|
+
~2.25×), informational throughput output, plus a correctness guard so an
|
|
32
|
+
engine that redacts less cannot pass as "faster". Repo/CI only — not packaged.
|
|
33
|
+
|
|
34
|
+
## [0.14.1] - 2026-06-17
|
|
35
|
+
|
|
36
|
+
### Changed
|
|
37
|
+
- **Bounded the greedy tails of seven built-in token patterns** (`jwt`,
|
|
38
|
+
`grafana_api_token`, `ssh_public_key`, `bearer_token`, `anthropic_api_key`,
|
|
39
|
+
`openai_project_api_key`, `sendgrid_api_key`). Open-ended quantifiers (`+` and
|
|
40
|
+
`{n,}`) are capped at the POSIX `RE_DUP_MAX` of 255 (`{n,255}`), matching the
|
|
41
|
+
existing `hashicorp_vault_batch_token` precedent. A token is unusable once its
|
|
42
|
+
front is redacted, so a bounded prefix is sufficient to neutralize it. This
|
|
43
|
+
restores a finite `max_len` for these patterns (re-enabling the engine's
|
|
44
|
+
literal back-up skip) and removes a theoretical O(N²) worst case where a
|
|
45
|
+
crafted prefix plus a megabyte of matching characters forces a long greedy
|
|
46
|
+
scan. Tokens longer than 255 characters are still neutralized — only a
|
|
47
|
+
cryptographically-dead tail may remain.
|
|
11
48
|
|
|
12
49
|
### Added
|
|
13
50
|
- **Key-name-anchored secret redaction** (`:credentials`). A new pattern tier
|
|
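The ratio-based throughput gate described in the 0.15.0 "Added" entry can be sketched roughly as follows. This is not the contents of `benchmark/ci_throughput_gate.rb`, which is not shown in this diff. The method names, the return shape, and the form of the correctness guard (comparing how many bytes are left unredacted) are all assumptions made for illustration; only the 1.5× floor and the `[REDACTED]` placeholder come from the changelog.

```ruby
require "benchmark"

PLACEHOLDER = "[REDACTED]"

# Bytes that survive redaction. Assumed proxy for "how much was redacted".
def unredacted_bytes(text)
  text.gsub(PLACEHOLDER, "").bytesize
end

# Hypothetical gate. `engine` and `baseline` are callables that take a String
# and return the redacted String; `floor` is the minimum speed-up ratio.
def throughput_gate(payload, engine:, baseline:, floor: 1.5)
  engine_out = nil
  baseline_out = nil
  engine_s   = Benchmark.realtime { engine_out = engine.call(payload) }
  baseline_s = Benchmark.realtime { baseline_out = baseline.call(payload) }

  # Correctness guard: an engine that leaves more text unredacted than the
  # baseline fails outright, however fast it ran.
  if unredacted_bytes(engine_out) > unredacted_bytes(baseline_out)
    return { pass: false, reason: :redacts_less, ratio: nil }
  end

  # Gate on the ratio, not on absolute time, so runner speed cancels out.
  ratio = baseline_s / [engine_s, 1e-9].max
  { pass: ratio >= floor, reason: (ratio >= floor ? :ok : :too_slow), ratio: ratio }
end
```

Because both timings are taken on the same runner in the same job, a slow CI machine slows both sides and the ratio stays roughly stable, which is the property the changelog entry relies on.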
@@ -275,7 +312,9 @@ features as 0.7.1 plus the pipeline fix.
|
|
|
275
312
|
- `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
|
|
276
313
|
- RSpec suite with one example per pattern.
|
|
277
314
|
|
|
278
|
-
[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.
|
|
315
|
+
[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.15.0...HEAD
|
|
316
|
+
[0.15.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.1...v0.15.0
|
|
317
|
+
[0.14.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.0...v0.14.1
|
|
279
318
|
[0.14.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.13.0...v0.14.0
|
|
280
319
|
[0.13.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.11.0...v0.13.0
|
|
281
320
|
[0.11.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...v0.11.0
|
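The effect of the `{n,255}` cap described in the 0.14.1 entry can be seen with a small sketch. It uses Ruby's own regex engine, not the gem's C engine, and the `tok_` prefix and character class are made up for illustration; they are not one of the seven built-in patterns.

```ruby
# Hypothetical token shape for illustration only; not a real built-in pattern.
UNBOUNDED = /tok_[a-z0-9]{16,}/
BOUNDED   = /tok_[a-z0-9]{16,255}/ # open-ended tail capped at 255

# Replace every match of `pattern` in `text` with the placeholder.
def redact_with(pattern, text)
  text.gsub(pattern, "[REDACTED]")
end

# A token whose tail is 300 characters long, followed by ordinary text.
token = "tok_" + ("a" * 300)
line  = "auth=#{token} user=alice"

puts redact_with(UNBOUNDED, line) # whole token replaced
puts redact_with(BOUNDED, line)   # prefix replaced, last 45 tail chars remain
```

With the bounded pattern the final 45 characters of the 300-character tail stay in the output. That is the trade-off the entry describes: the front of the token is gone, so the remaining tail is not usable as a credential, and the pattern regains a finite maximum match length.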
data/README.md
CHANGED
|
@@ -12,10 +12,10 @@ DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs
|
|
|
12
12
|
credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
|
|
13
13
|
each match with a placeholder. The scanning runs in a C extension backed by a
|
|
14
14
|
zero-dependency Thompson NFA → lazy-DFA multi-pattern engine (v19) that scans
|
|
15
|
-
|
|
15
|
+
every built-in pattern in a single pass — 2–2.5× faster than pure-Ruby `gsub`
|
|
16
16
|
on large payloads, with no external library dependencies.
|
|
17
17
|
|
|
18
|
-
It ships **
|
|
18
|
+
It ships **89 built-in patterns** across 15+ countries, grouped into tags
|
|
19
19
|
(`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
|
|
20
20
|
care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
|
|
21
21
|
audit a payload without mutating it (`scan`), and plug into Logger, Rails, and
|
|
@@ -309,7 +309,7 @@ safe_response = DataRedactor::Integrations::OpenAI.redact_response(response)
|
|
|
309
309
|
|
|
310
310
|
`content` may be a plain String or an array of content blocks/parts (`{ type: "text", text: "..." }`) — only the `text` of `text` blocks is redacted; image and other block types pass through untouched. For Claude, a top-level `system:` String is also redacted; for OpenAI, a `{ role: "system" }` message in the array is redacted like any other. Pass a bare `messages` array or the whole request Hash (with a `messages` key) — either works.
|
|
311
311
|
|
|
312
|
-
## Detected patterns (
|
|
312
|
+
## Detected patterns (89 total)
|
|
313
313
|
|
|
314
314
|
The table below is a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list — it stays in sync with the C extension automatically.
|
|
315
315
|
|
|
@@ -435,7 +435,7 @@ redactor/
|
|
|
435
435
|
│ ├── README.md # How to run, what each script measures
|
|
436
436
|
│ ├── support/corpus.rb # Shared payload builders + pure-Ruby baseline redactor
|
|
437
437
|
│ ├── throughput.rb # MB/s on representative payloads
|
|
438
|
-
│ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same
|
|
438
|
+
│ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same patterns)
|
|
439
439
|
│ ├── scaling.rb # Runtime vs input size 1KB → 50MB
|
|
440
440
|
│ └── per_pattern.rb # Per-pattern scan cost
|
|
441
441
|
└── docs/ # Design and execution docs for future work
|
|
@@ -523,7 +523,7 @@ different angles. They are **not** packaged with the gem.
|
|
|
523
523
|
```bash
|
|
524
524
|
bundle install # pulls benchmark-ips, benchmark-memory (dev deps)
|
|
525
525
|
bundle exec rake compile
|
|
526
|
-
bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same
|
|
526
|
+
bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same patterns
|
|
527
527
|
bundle exec ruby benchmark/throughput.rb # MB/s on a log line, JSON, 1MB and 10MB log files
|
|
528
528
|
bundle exec ruby benchmark/scaling.rb # runtime vs input size (1KB → 50MB), confirms linear scaling
|
|
529
529
|
bundle exec ruby benchmark/per_pattern.rb # per-pattern scan cost over a 1MB payload
|
|
@@ -535,11 +535,8 @@ C engine uses, via `DataRedactor::BUILTIN_PATTERN_SOURCES`).
|
|
|
535
535
|
|
|
536
536
|
### Performance (0.10.0 — v19 multi-pattern engine)
|
|
537
537
|
|
|
538
|
-
|
|
539
|
-
|
|
540
|
-
with two selective-merge passes (pure-digit group + IBAN union) that further
|
|
541
|
-
reduce work for the most common pattern classes. Custom patterns (`add_pattern`)
|
|
542
|
-
still use the glibc path (required for correct UTF-8 diacritic matching).
|
|
538
|
+
Measured on the v19 engine ([How it works](#how-it-works)) vs a pure-Ruby `gsub`
|
|
539
|
+
loop over the same patterns:
|
|
543
540
|
|
|
544
541
|
| Payload | v19 engine (0.10.0) | Pure-Ruby `gsub` | Ratio |
|
|
545
542
|
|-----------------------|---------------------|------------------|-----------------|
|
|
@@ -576,14 +573,14 @@ machine-dependent, but the flat curve is not.
|
|
|
576
573
|
|
|
577
574
|
## How it works
|
|
578
575
|
|
|
579
|
-
1. At load time, `
|
|
580
|
-
2. `DataRedactor.redact(text)`
|
|
581
|
-
3.
|
|
582
|
-
4.
|
|
576
|
+
1. At load time, `mm_init()` compiles every built-in pattern from a Thompson NFA into bytecode, lazily building each pattern's DFA on first use (interned and cached). Boundary-wrapped patterns are expanded with the word-boundary group before compilation.
|
|
577
|
+
2. `DataRedactor.redact(text)` / `scan(text)` hand the input to the v19 engine, which scans it **once** and emits `(pattern_id, start, length)` events for every enabled pattern. Two selective-merge passes (a pure-digit group and an IBAN union) collapse the most common pattern classes into shared scans. The single pass over the original buffer is what makes the engine O(N).
|
|
578
|
+
3. The raw events are resolved by `mm_resolve` under the **longest-match-wins** policy: overlapping spans are reduced to a non-overlapping set keeping the longest match at each position, with the lower pattern index breaking equal-length ties.
|
|
579
|
+
4. `redact` rewrites the surviving spans to placeholders in one buffer build (preserving the boundary characters of boundary-wrapped matches); `scan` returns the event list with byte offsets into the original string. Custom patterns (`add_pattern`) run on the glibc `regexec` path afterward — required for correct UTF-8 diacritic matching.
|
|
583
580
|
|
|
584
581
|
## Memory management
|
|
585
582
|
|
|
586
|
-
All C-side buffers are heap-allocated
|
|
583
|
+
All C-side working buffers are heap-allocated and freed before the call returns; the only Ruby-managed allocation is the final result `String`. No Ruby objects are created mid-scan, so GC cannot collect anything out from under the C code. Per-thread engine scratch (NFA state, lazy-DFA cache) is freed automatically when the thread exits — see [Thread safety](#thread-safety).
|
|
587
584
|
|
|
588
585
|
## Thread safety
|
|
589
586
|
|
|
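Step 3 of "How it works" in the hunk above describes the longest-match-wins policy in prose. One way to express that policy in plain Ruby is sketched below. It is not the gem's `mm_resolve`, whose implementation is not shown in this diff; the `Event` struct, the method name, and the greedy strategy (rank by length, then pattern index, and keep whatever does not overlap an already-kept span) are assumptions chosen to match the stated rules.

```ruby
# Hypothetical event shape mirroring the (pattern_id, start, length) triples.
Event = Struct.new(:pattern_id, :start, :length) do
  # Exclusive end offset of the span.
  def stop
    start + length
  end

  def overlaps?(other)
    start < other.stop && other.start < stop
  end
end

# Reduce overlapping events to a non-overlapping set.
# Longer spans win; equal lengths go to the lower pattern index.
def resolve_longest_match_wins(events)
  ranked = events.sort_by { |e| [-e.length, e.pattern_id, e.start] }
  kept = []
  ranked.each do |candidate|
    kept << candidate unless kept.any? { |k| k.overlaps?(candidate) }
  end
  kept.sort_by(&:start)
end

# The changelog's example: a 20-char prefix match and a 40-char match that
# start at the same offset. Pattern ids 0 and 1 are placeholders.
events = [Event.new(0, 0, 20), Event.new(1, 0, 40)]
resolve_longest_match_wins(events).each do |e|
  puts "pattern #{e.pattern_id}: #{e.start}...#{e.stop}"
end
```

Under the earlier index-wins behaviour the pattern-0 span would have been kept and the trailing 20 bytes left in place; here only the 40-char span survives. This sketch is quadratic in the number of events, which is fine for illustration but says nothing about how the C resolver is structured.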
@@ -601,7 +598,6 @@ Released under the [MIT License](LICENSE).
|
|
|
601
598
|
|
|
602
599
|
## Known limitations
|
|
603
600
|
|
|
604
|
-
- **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
|
|
605
601
|
- **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
|
|
606
602
|
- **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
|
|
607
|
-
- **
|
|
603
|
+
- **Overlap resolution is longest-match-wins** — when two patterns match overlapping spans the engine keeps the longer span; equal-length ties go to the lower pattern index. This favours redacting *more* when uncertain (a 40-char secret is redacted whole rather than leaking the bytes past a shorter prefix match). When two secrets abut with **no separator** between them, a boundary-wrapped pattern can fail to match because the original buffer has no word boundary where one token meets the next, leaving the abutting token unredacted. This is rare in real text (secrets are almost always separator-delimited).
|
|
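The abutting-secrets limitation in the last bullet can be reproduced with an ordinary boundary-wrapped regex. The sketch below uses Ruby's `\b` as a stand-in for the gem's boundary wrapper, which is an assumption: the gem's word-boundary group may be defined differently. The nine-digit pattern is only an illustrative shape.

```ruby
# Ruby's \b stands in for the gem's boundary wrapper (assumption).
NINE_DIGITS = /\b\d{9}\b/

# Return every boundary-delimited nine-digit run in `text`.
def boundary_matches(text)
  text.scan(NINE_DIGITS)
end

separated = "ids: 123456789 987654321"
abutting  = "ids: 123456789987654321"

p boundary_matches(separated) # both numbers are found
p boundary_matches(abutting)  # nothing is found
```

In the abutting case the 18 digits form one unbroken run, so there is no word boundary after the ninth digit and neither number matches on its own. A separator of any kind between the two values restores both matches.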
Binary files changed (contents not shown): the six `data/lib/data_redactor/<ruby-version>/data_redactor.so` builds listed in the summary above.