data_redactor 0.13.0-aarch64-linux → 0.15.0-aarch64-linux

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: c7438c67d7a8b27f8ae85e2b5cbc173c8e0cc62b13e9967a5cd773fba8050615
- data.tar.gz: f7ed50daa5e34ecdb6bba03ec9e90d0c56045cb73bc10c5fb831eb069e52fd36
+ metadata.gz: 186d6a1ed6ad690e0b022c9410dd1bc9ce79ca6d57f8f953c1081321757025ed
+ data.tar.gz: 13f8fbf32f59af749fa660fadd5ce47820ad726d86acb239907a3237e2a3c478
  SHA512:
- metadata.gz: 4e369c1ab92f3a66d63a66749c3573432929fc557ddd0e9170f257d2241ecb0551833e13240eb7443a63fdf4983fabedba8720ea8233c240195c0ec49b156c8f
- data.tar.gz: c1796e78765583613885df0a03e7fd683f12a44a668b9c83a883345376956f2a058137e8a77a844f2f823b0c8dc02507dc4179afbe767323a511f483cd6e2deb
+ metadata.gz: 6377bd25095e6614a85c4333b8b1d7a9d74d3fe892485520dd4643f53bf48faab684c431dd1fc52cbecf17abdff6c4a9823f8adbfce88104aa37d5d27f0c23fb
+ data.tar.gz: 622c9bd536de4a34cf53ea4c09f36cb6b14efd82d7b5228155d308b4284c8ec05e067ebee5535520cc29857100a2eadd188a8499310c7faa62f49b0c57a2e0ee
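Reviewer note: the new digests in the hunk above can be checked against a locally unpacked gem. The following is a minimal sketch using Ruby's standard `digest` library; the helper name `sha256_matches?` and the file path in the usage comment are assumptions for illustration, not part of the gem or of RubyGems.

```ruby
require "digest"

# Hypothetical helper: true when the SHA256 hex digest of `bytes`
# equals `expected_hex` (compared case-insensitively).
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes).casecmp?(expected_hex)
end

# Expected value copied from the 0.15.0 side of the checksums.yaml hunk.
EXPECTED_DATA_TAR_GZ_SHA256 =
  "13f8fbf32f59af749fa660fadd5ce47820ad726d86acb239907a3237e2a3c478"

# Usage (path is an assumption; data.tar.gz is one member of the .gem archive):
#   sha256_matches?(File.binread("data.tar.gz"), EXPECTED_DATA_TAR_GZ_SHA256)
```

The same pattern applies to `metadata.gz` and to the SHA512 entries via `Digest::SHA512`.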
data/CHANGELOG.md CHANGED
@@ -7,6 +7,63 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
  ## [Unreleased]
 
+ ## [0.15.0] - 2026-06-17
+
+ ### Changed
+ - **Overlap resolution is now longest-match-wins** (was earlier-index-wins). When
+ two patterns match overlapping spans, the engine keeps the **longer** span;
+ equal-length ties go to the lower pattern index (preserving prior behaviour for
+ same-length matches). The previous "earliest pattern by index wins any region it
+ can match" semantic was an accidental by-product of sequential per-pattern
+ rewriting, and it could leave a secret **partly unredacted** — e.g.
+ `AKIA…EXAMPLE` followed by 20 more alphanumeric bytes used to redact only the
+ 20-char access-key prefix and leak the trailing 20 bytes; it now redacts the full
+ 40-char secret. The public API (`redact`, `scan`) is unchanged; `scan` may report
+ one longer match where it previously reported several shorter overlapping ones.
+ Aligns with Onigmo/PCRE/RE2/Hyperscan semantics. Resolver only — no measurable
+ throughput change (still ~2.4× over pure-Ruby on the 1 MB log).
+
+ ### Added
+ - **CI throughput regression gate** (`throughput-gate` job). Runs
+ `benchmark/ci_throughput_gate.rb`, which gates on the ratio of the C engine to
+ a pure-Ruby gsub loop over the same patterns (the ratio cancels CI-runner
+ speed variance, unlike absolute MB/s). Loose floor (1.5×; known result
+ ~2.25×), informational throughput output, plus a correctness guard so an
+ engine that redacts less cannot pass as "faster". Repo/CI only — not packaged.
+
+ ## [0.14.1] - 2026-06-17
+
+ ### Changed
+ - **Bounded the greedy tails of seven built-in token patterns** (`jwt`,
+ `grafana_api_token`, `ssh_public_key`, `bearer_token`, `anthropic_api_key`,
+ `openai_project_api_key`, `sendgrid_api_key`). Open-ended quantifiers (`+` and
+ `{n,}`) are capped at the POSIX `RE_DUP_MAX` of 255 (`{n,255}`), matching the
+ existing `hashicorp_vault_batch_token` precedent. A token is unusable once its
+ front is redacted, so a bounded prefix is sufficient to neutralize it. This
+ restores a finite `max_len` for these patterns (re-enabling the engine's
+ literal back-up skip) and removes a theoretical O(N²) worst case where a
+ crafted prefix plus a megabyte of matching characters forces a long greedy
+ scan. Tokens longer than 255 characters are still neutralized — only a
+ cryptographically-dead tail may remain.
+
+ ### Added
+ - **Key-name-anchored secret redaction** (`:credentials`). A new pattern tier
+ redacts a secret by the *name of the field it is assigned to*, for values with
+ no distinctive shape of their own — the primary case being an `.env` file or
+ config blob passed through the redactor. Anchored on the key words `password`,
+ `passwd`, `pwd`, `secret`, `token`, `api_key`, `apikey`, `access_key`, and
+ `client_secret` (case-insensitive), followed by `=` or `:` (dotenv and YAML
+ styles), with quoted (`"..."`/`'...'`) or unquoted (≥6 chars) values. Only the
+ **value** is redacted; the key is kept so logs stay greppable
+ (`PASSWORD=[REDACTED]`). Compound key names match whether the secret word is a
+ prefix or suffix segment (`POSTGRES_DB_PASSWORD=`, `PASSWORD_POSTGRES=`).
+ Requires the assignment separator, so the word in prose ("reset your password")
+ is not a false positive.
+ - `examples/` directory with runnable, copy-pasteable usage scripts for every
+ feature (core redaction, scan/dry-run, custom patterns, deep/JSON traversal,
+ and the Logger / Rack / Rails / LLM integrations). Repo-only — not packaged in
+ the gem. Linked from the README.
+
  ## [0.13.0] - 2026-06-13
 
  ### Changed
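Reviewer note: the key-name-anchored tier described in the 0.14.1 entry of the hunk above can be approximated in pure Ruby to see what the stated rules imply. This is only a sketch under assumptions: the regex, the constant names, and `redact_assignments` are hypothetical and are not the gem's C pattern; in particular, this sketch replaces the surrounding quotes along with a quoted value, which the changelog does not specify.

```ruby
# Key words listed in the changelog entry.
KEY_WORDS = %w[
  password passwd pwd secret token api_key apikey access_key client_secret
].freeze

# Assumed shape: optional NAME_ prefix segments, a key word, optional _NAME
# suffix segments, then "=" or ":", then a quoted value or an unquoted value
# of at least 6 characters.
KEY_VALUE = /
  (?<key>\b(?:[A-Za-z0-9]+_)*(?:#{KEY_WORDS.join("|")})(?:_[A-Za-z0-9]+)*)
  (?<sep>[\t\x20]*[=:][\t\x20]*)
  (?<value>"[^"\n]*"|'[^'\n]*'|[^\s"']{6,})
/xi

# Replace only the value; keep the key and separator so logs stay greppable.
def redact_assignments(text, placeholder: "[REDACTED]")
  text.gsub(KEY_VALUE) { "#{$~[:key]}#{$~[:sep]}#{placeholder}" }
end

puts redact_assignments("POSTGRES_DB_PASSWORD=s3cr3tvalue")
puts redact_assignments("Please reset your password soon.")
```

Because the separator is mandatory in the pattern, the prose sentence passes through unchanged while the dotenv-style assignment loses only its value.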
@@ -255,7 +312,10 @@ features as 0.7.1 plus the pipeline fix.
  - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
  - RSpec suite with one example per pattern.
 
- [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.13.0...HEAD
+ [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.15.0...HEAD
+ [0.15.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.1...v0.15.0
+ [0.14.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.0...v0.14.1
+ [0.14.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.13.0...v0.14.0
  [0.13.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.11.0...v0.13.0
  [0.11.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...v0.11.0
  [0.10.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.0...v0.10.1
data/README.md CHANGED
@@ -12,10 +12,10 @@ DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs
  credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
  each match with a placeholder. The scanning runs in a C extension backed by a
  zero-dependency Thompson NFA → lazy-DFA multi-pattern engine (v19) that scans
- all 88 built-in patterns in a single pass — 2–2.5× faster than pure-Ruby `gsub`
+ every built-in pattern in a single pass — 2–2.5× faster than pure-Ruby `gsub`
  on large payloads, with no external library dependencies.
 
- It ships **88 built-in patterns** across 15+ countries, grouped into tags
+ It ships **89 built-in patterns** across 15+ countries, grouped into tags
  (`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
  care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
  audit a payload without mutating it (`scan`), and plug into Logger, Rails, and
@@ -46,6 +46,12 @@ DataRedactor.redact(text)
  # => "User CF is [REDACTED] and key is [REDACTED]"
  ```
 
+ Prefer runnable code? The [`examples/`](examples/) directory has self-contained,
+ copy-pasteable scripts for every feature below — core redaction, scan/dry-run,
+ custom patterns, deep/JSON traversal, and the Logger / Rack / Rails / LLM
+ integrations. Run any of them with `bundle exec ruby examples/<name>.rb` (see
+ [examples/README.md](examples/README.md)).
+
  ### Filtering by tag or pattern name
 
  `only:` and `except:` both accept a single value or an Array, mixing **Symbols** (tag names) and **Strings** (specific pattern names).
@@ -303,7 +309,7 @@ safe_response = DataRedactor::Integrations::OpenAI.redact_response(response)
 
  `content` may be a plain String or an array of content blocks/parts (`{ type: "text", text: "..." }`) — only the `text` of `text` blocks is redacted; image and other block types pass through untouched. For Claude, a top-level `system:` String is also redacted; for OpenAI, a `{ role: "system" }` message in the array is redacted like any other. Pass a bare `messages` array or the whole request Hash (with a `messages` key) — either works.
 
- ## Detected patterns (88 total)
+ ## Detected patterns (89 total)
 
  The table below is a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list — it stays in sync with the C extension automatically.
 
@@ -415,11 +421,21 @@ redactor/
  │ └── tags.h # TAG_* bit constants
  ├── spec/
  │ └── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
+ ├── examples/ # Repo-only runnable usage scripts (not packaged in the gem)
+ │ ├── README.md # Index + how to run
+ │ ├── basic_redact.rb # redact, tag filters, placeholder modes
+ │ ├── scan_report.rb # scan dry-run with byte offsets
+ │ ├── custom_pattern.rb # add_pattern + name_pattern
+ │ ├── deep_and_json.rb # redact_deep / redact_json
+ │ ├── logger.rb # Logger::Formatter integration
+ │ ├── rack_middleware.rb # Rack middleware (body + headers)
+ │ ├── rails_filter.rb # filter_parameters adapter
+ │ └── llm_payload.rb # Claude / OpenAI message + response redaction
  ├── benchmark/ # Repo-only perf scripts (not packaged in the gem)
  │ ├── README.md # How to run, what each script measures
  │ ├── support/corpus.rb # Shared payload builders + pure-Ruby baseline redactor
  │ ├── throughput.rb # MB/s on representative payloads
- │ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same 88 patterns)
+ │ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same patterns)
  │ ├── scaling.rb # Runtime vs input size 1KB → 50MB
  │ └── per_pattern.rb # Per-pattern scan cost
  └── docs/ # Design and execution docs for future work
@@ -507,7 +523,7 @@ different angles. They are **not** packaged with the gem.
  ```bash
  bundle install # pulls benchmark-ips, benchmark-memory (dev deps)
  bundle exec rake compile
- bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same 88 patterns
+ bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same patterns
  bundle exec ruby benchmark/throughput.rb # MB/s on a log line, JSON, 1MB and 10MB log files
  bundle exec ruby benchmark/scaling.rb # runtime vs input size (1KB → 50MB), confirms linear scaling
  bundle exec ruby benchmark/per_pattern.rb # per-pattern scan cost over a 1MB payload
@@ -519,11 +535,8 @@ C engine uses, via `DataRedactor::BUILTIN_PATTERN_SOURCES`).
 
  ### Performance (0.10.0 — v19 multi-pattern engine)
 
- As of 0.10.0 the C extension runs a **Thompson NFA lazy-DFA multi-pattern
- engine** (v19) that scans the input once across all 88 built-in patterns,
- with two selective-merge passes (pure-digit group + IBAN union) that further
- reduce work for the most common pattern classes. Custom patterns (`add_pattern`)
- still use the glibc path (required for correct UTF-8 diacritic matching).
+ Measured on the v19 engine ([How it works](#how-it-works)) vs a pure-Ruby `gsub`
+ loop over the same patterns:
 
  | Payload | v19 engine (0.10.0) | Pure-Ruby `gsub` | Ratio |
  |-----------------------|---------------------|------------------|-----------------|
@@ -560,14 +573,14 @@ machine-dependent, but the flat curve is not.
 
  ## How it works
 
- 1. At load time, `Init_data_redactor` compiles all 85 regex patterns once using `regcomp` (POSIX ERE) and stores them as static `regex_t` structs. Patterns marked as boundary-wrapped are expanded with `wrap_boundary()` before compilation.
- 2. `DataRedactor.redact(text)` receives a Ruby `String`, converts it to a C `char*` via `StringValueCStr`, and runs each compiled pattern in sequence on a working buffer.
- 3. For each pattern, `replace_all_matches` iterates using `regexec`, copies non-matching segments to a fresh output buffer, and inserts `[REDACTED]` in place of each match. For boundary-wrapped patterns, `regexec` is called with `nmatch=4` and sub-match groups `[1]`/`[3]` identify the boundary characters so they are preserved verbatim.
- 4. The output buffer is grown with `realloc` as needed. After all patterns are applied the result is returned as a Ruby `String` via `rb_str_new_cstr`. All intermediate `malloc`/`strdup` allocations are explicitly `free`d.
+ 1. At load time, `mm_init()` compiles every built-in pattern from a Thompson NFA into bytecode, lazily building each pattern's DFA on first use (interned and cached). Boundary-wrapped patterns are expanded with the word-boundary group before compilation.
+ 2. `DataRedactor.redact(text)` / `scan(text)` hand the input to the v19 engine, which scans it **once** and emits `(pattern_id, start, length)` events for every enabled pattern. Two selective-merge passes (a pure-digit group and an IBAN union) collapse the most common pattern classes into shared scans. The single pass over the original buffer is what makes the engine O(N).
+ 3. The raw events are resolved by `mm_resolve` under the **longest-match-wins** policy: overlapping spans are reduced to a non-overlapping set keeping the longest match at each position, with the lower pattern index breaking equal-length ties.
+ 4. `redact` rewrites the surviving spans to placeholders in one buffer build (preserving the boundary characters of boundary-wrapped matches); `scan` returns the event list with byte offsets into the original string. Custom patterns (`add_pattern`) run on the glibc `regexec` path afterward, as required for correct UTF-8 diacritic matching.
 
  ## Memory management
 
- All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before the function returns. The only Ruby-managed allocation is the final return value from `rb_str_new_cstr`. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
+ All C-side working buffers are heap-allocated and freed before the call returns; the only Ruby-managed allocation is the final result `String`. No Ruby objects are created mid-scan, so GC cannot collect anything out from under the C code. Per-thread engine scratch (NFA state, lazy-DFA cache) is freed automatically when the thread exits — see [Thread safety](#thread-safety).
 
  ## Thread safety
 
@@ -585,7 +598,6 @@ Released under the [MIT License](LICENSE).
 
  ## Known limitations
 
- - **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
  - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
  - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
- - **Single-pass overlap semantics** — built-in patterns are resolved by an index-order greedy claim: the lower-index pattern wins any region it matches. When two secrets abut with no separator, a rewrite-created word boundary can cause the second to be missed. This is rare in real text (secrets are almost always separator-delimited) and will be fixed by the upcoming longest-match-wins resolver in 1.0.
+ - **Overlap resolution is longest-match-wins** — when two patterns match overlapping spans the engine keeps the longer span; equal-length ties go to the lower pattern index. This favours redacting *more* when uncertain (a 40-char secret is redacted whole rather than leaking the bytes past a shorter prefix match). When two secrets abut with **no separator** between them, a boundary-wrapped pattern can fail to match because the original buffer has no word boundary where one token meets the next, leaving the abutting token unredacted. This is rare in real text (secrets are almost always separator-delimited).
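Reviewer note: the longest-match-wins policy in the bullet above can be modelled in a few lines of Ruby. This is a sketch of one plausible reading (rank all events by length, break ties by lower pattern index, keep a span only if it does not overlap one already kept); it is not the C `mm_resolve` implementation, and the `Event` struct, its `len` member (standing in for the event's length field), and `resolve_longest_match` are hypothetical names.

```ruby
# One scan event: which pattern matched, where it starts, and how many bytes.
Event = Struct.new(:pattern_id, :start, :len) do
  # Exclusive end offset of the span.
  def stop
    start + len
  end

  # Half-open interval overlap test; adjacent spans do not overlap.
  def overlaps?(other)
    start < other.stop && other.start < stop
  end
end

# Reduce overlapping events to a non-overlapping set, longest span first,
# lower pattern index winning equal-length ties.
def resolve_longest_match(events)
  ranked = events.sort_by { |e| [-e.len, e.pattern_id, e.start] }
  kept = []
  ranked.each do |candidate|
    kept << candidate unless kept.any? { |k| k.overlaps?(candidate) }
  end
  kept.sort_by(&:start)
end

events = [
  Event.new(0, 0, 20),  # shorter prefix match (e.g. a 20-char key id)
  Event.new(1, 0, 40),  # longer match covering the same start
  Event.new(7, 50, 11), # two equal-length matches on the same span
  Event.new(3, 50, 11)
]
resolve_longest_match(events).each { |e| p e.to_a }
```

With these inputs the 40-byte span replaces the 20-byte prefix match, and the equal-length tie at offset 50 goes to pattern 3, the lower index.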
Binary file
Binary file
Binary file
Binary file
Binary file
Binary file
@@ -1,4 +1,4 @@
  module DataRedactor
  # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
- VERSION = "0.13.0"
+ VERSION = "0.15.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: data_redactor
  version: !ruby/object:Gem::Version
- version: 0.13.0
+ version: 0.15.0
  platform: aarch64-linux
  authors:
  - Daniele Frisanco