RubyGems - data_redactor - Versions diffs - 0.14.0 → 0.15.0 - Mend

data_redactor 0.14.0 → 0.15.0


Files changed (8)

checksums.yaml +4 -4
data/CHANGELOG.md +41 -2
data/README.md +13 -17
data/ext/data_redactor/matcher.c +15 -15
data/ext/data_redactor/matcher.h +8 -7
data/ext/data_redactor/patterns.c +11 -8
data/lib/data_redactor/version.rb +1 -1
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b29290519836ca25d5188a5ef4da2585bd7f11faa0c072927863c637fb618eeb
-  data.tar.gz: 465091099d2fcf4b990d4e4259c3c4ad549588839d918d831c9747236f84e864
+  metadata.gz: dfda0b2a543fc9b415c0816dd08a3fb727c1abcfb082c4c0b6d04362c83ee4d2
+  data.tar.gz: 6e1b528c5ce1759ebbefff1f3f2e72bdfc2b2fa2959e5c523945752bebb999f2
 SHA512:
-  metadata.gz: fbc51cb331674163af43d4e952bce6ec936db4e3235ca356082a83211ae552d84409bebcdffcb364c09ed8099504ac8418d2fffed3d273d4392d762d99098d59
-  data.tar.gz: e57d9545b5acec4ca25c1c5a30b1987d3d9769f027725e6613f396e7b2bedbe352620278a6cb8c613d5e1a1c1ecabb0e446bf2a4b6ae0339bedf8a4563a33b01
+  metadata.gz: a6a3e64351089a1b69b94a9ed33ee50eeb7339a0d359d357ce0ae6ca57c722d3b9fc5c92a4cb3b5fe8d7cb0fa40d0f3e50529e5b27226465294df78a30872714
+  data.tar.gz: 9f4082c693c639a8f4ab211bbbf92d4c9e6d79422f1696971937d95539670171cac662d64f2a3dd05185d5946b9c79f3b6a35d4ec1c4c4376d8853ecaaea719c

data/CHANGELOG.md CHANGED

@@ -7,7 +7,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
-## [0.14.0] - 2026-06-17
+## [0.15.0] - 2026-06-17
+### Changed
+- **Overlap resolution is now longest-match-wins** (was earlier-index-wins). When
+  two patterns match overlapping spans, the engine keeps the **longer** span;
+  equal-length ties go to the lower pattern index (preserving prior behaviour for
+  same-length matches). The previous "earliest pattern by index wins any region it
+  can match" semantic was an accidental by-product of sequential per-pattern
+  rewriting, and it could leave a secret **partly unredacted** — e.g.
+  `AKIA…EXAMPLE` followed by 20 more alphanumeric bytes used to redact only the
+  20-char access-key prefix and leak the trailing 20 bytes; it now redacts the full
+  40-char secret. The public API (`redact`, `scan`) is unchanged; `scan` may report
+  one longer match where it previously reported several shorter overlapping ones.
+  Aligns with Onigmo/PCRE/RE2/Hyperscan semantics. Resolver only — no measurable
+  throughput change (still ~2.4× over pure-Ruby on the 1 MB log).
+### Added
+- **CI throughput regression gate** (`throughput-gate` job). Runs
+  `benchmark/ci_throughput_gate.rb`, which gates on the ratio of the C engine to
+  a pure-Ruby gsub loop over the same patterns (the ratio cancels CI-runner
+  speed variance, unlike absolute MB/s). Loose floor (1.5×; known result
+  ~2.25×), informational throughput output, plus a correctness guard so an
+  engine that redacts less cannot pass as "faster". Repo/CI only — not packaged.
+## [0.14.1] - 2026-06-17
+### Changed
+- **Bounded the greedy tails of seven built-in token patterns** (`jwt`,
+  `grafana_api_token`, `ssh_public_key`, `bearer_token`, `anthropic_api_key`,
+  `openai_project_api_key`, `sendgrid_api_key`). Open-ended quantifiers (`+` and
+  `{n,}`) are capped at the POSIX `RE_DUP_MAX` of 255 (`{n,255}`), matching the
+  existing `hashicorp_vault_batch_token` precedent. A token is unusable once its
+  front is redacted, so a bounded prefix is sufficient to neutralize it. This
+  restores a finite `max_len` for these patterns (re-enabling the engine's
+  literal back-up skip) and removes a theoretical O(N²) worst case where a
+  crafted prefix plus a megabyte of matching characters forces a long greedy
+  scan. Tokens longer than 255 characters are still neutralized — only a
+  cryptographically-dead tail may remain.
 ### Added
 - **Key-name-anchored secret redaction** (`:credentials`). A new pattern tier
@@ -275,7 +312,9 @@ features as 0.7.1 plus the pipeline fix.
 - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
 - RSpec suite with one example per pattern.
-[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.0...HEAD
+[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.15.0...HEAD
+[0.15.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.1...v0.15.0
+[0.14.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.14.0...v0.14.1
 [0.14.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.13.0...v0.14.0
 [0.13.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.11.0...v0.13.0
 [0.11.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...v0.11.0
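
The 0.15.0 changelog entry above describes a CI gate that compares the C engine to a pure-Ruby loop by ratio rather than by absolute MB/s, with a correctness guard. The sketch below illustrates that idea only; it is not the contents of `benchmark/ci_throughput_gate.rb`, and `seconds_for`, `throughput_gate`, and the stand-in workloads are hypothetical names invented for this example.

```ruby
# Hypothetical sketch of a ratio-based throughput gate. Not the gem's
# benchmark/ci_throughput_gate.rb; every name here is illustrative.

# Wall-clock seconds taken by the block, measured on the monotonic clock.
def seconds_for
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

# Decide pass/fail from two timings of the same workload.
# floor:       minimum acceptable baseline/candidate speed ratio.
# same_output: correctness guard; a candidate whose output differs from the
#              baseline fails no matter how fast it ran.
def throughput_gate(candidate_seconds:, baseline_seconds:, same_output:, floor: 1.5)
  return { pass: false, reason: :output_mismatch } unless same_output

  ratio = baseline_seconds.to_f / candidate_seconds
  { pass: ratio >= floor, ratio: ratio, reason: ratio >= floor ? :ok : :below_floor }
end

# Stand-in workloads (assumptions): both redact 16-digit runs with gsub; the
# "baseline" simply repeats the work five times to simulate a slower engine.
payload   = "user=alice card=4111111111111111 " * 2_000
candidate = nil
baseline  = nil
fast = seconds_for { candidate = payload.gsub(/\d{16}/, "[REDACTED]") }
slow = seconds_for { 5.times { baseline = payload.gsub(/\d{16}/, "[REDACTED]") } }

puts throughput_gate(candidate_seconds: fast, baseline_seconds: slow,
                     same_output: candidate == baseline).inspect
```

Because both timings come from the same runner, a slow or fast CI machine scales numerator and denominator together, which is why a ratio is less noisy than an absolute throughput figure.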

data/README.md CHANGED

@@ -12,10 +12,10 @@ DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs
 credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
 each match with a placeholder. The scanning runs in a C extension backed by a
 zero-dependency Thompson NFA → lazy-DFA multi-pattern engine (v19) that scans
-all 88 built-in patterns in a single pass — 2–2.5× faster than pure-Ruby `gsub`
+every built-in pattern in a single pass — 2–2.5× faster than pure-Ruby `gsub`
 on large payloads, with no external library dependencies.
-It ships **88 built-in patterns** across 15+ countries, grouped into tags
+It ships **89 built-in patterns** across 15+ countries, grouped into tags
 (`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
 care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
 audit a payload without mutating it (`scan`), and plug into Logger, Rails, and
@@ -309,7 +309,7 @@ safe_response = DataRedactor::Integrations::OpenAI.redact_response(response)
 `content` may be a plain String or an array of content blocks/parts (`{ type: "text", text: "..." }`) — only the `text` of `text` blocks is redacted; image and other block types pass through untouched. For Claude, a top-level `system:` String is also redacted; for OpenAI, a `{ role: "system" }` message in the array is redacted like any other. Pass a bare `messages` array or the whole request Hash (with a `messages` key) — either works.
-## Detected patterns (88 total)
+## Detected patterns (89 total)
 The table below is a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list — it stays in sync with the C extension automatically.
@@ -435,7 +435,7 @@ redactor/
 │   ├── README.md                 # How to run, what each script measures
 │   ├── support/corpus.rb         # Shared payload builders + pure-Ruby baseline redactor
 │   ├── throughput.rb             # MB/s on representative payloads
-│   ├── vs_pure_ruby.rb           # C extension vs pure-Ruby gsub (same 88 patterns)
+│   ├── vs_pure_ruby.rb           # C extension vs pure-Ruby gsub (same patterns)
 │   ├── scaling.rb                # Runtime vs input size 1KB → 50MB
 │   └── per_pattern.rb            # Per-pattern scan cost
 └── docs/                         # Design and execution docs for future work
@@ -523,7 +523,7 @@ different angles. They are **not** packaged with the gem.
 ```bash
 bundle install                                   # pulls benchmark-ips, benchmark-memory (dev deps)
 bundle exec rake compile
-bundle exec ruby benchmark/vs_pure_ruby.rb       # head-to-head vs pure-Ruby gsub, same 88 patterns
+bundle exec ruby benchmark/vs_pure_ruby.rb       # head-to-head vs pure-Ruby gsub, same patterns
 bundle exec ruby benchmark/throughput.rb         # MB/s on a log line, JSON, 1MB and 10MB log files
 bundle exec ruby benchmark/scaling.rb            # runtime vs input size (1KB → 50MB), confirms linear scaling
 bundle exec ruby benchmark/per_pattern.rb        # per-pattern scan cost over a 1MB payload
@@ -535,11 +535,8 @@ C engine uses, via `DataRedactor::BUILTIN_PATTERN_SOURCES`).
 ### Performance (0.10.0 — v19 multi-pattern engine)
-As of 0.10.0 the C extension runs a **Thompson NFA → lazy-DFA multi-pattern
-engine** (v19) that scans the input once across all 88 built-in patterns,
-with two selective-merge passes (pure-digit group + IBAN union) that further
-reduce work for the most common pattern classes. Custom patterns (`add_pattern`)
-still use the glibc path (required for correct UTF-8 diacritic matching).
+Measured on the v19 engine ([How it works](#how-it-works)) vs a pure-Ruby `gsub`
+loop over the same patterns:
 | Payload               | v19 engine (0.10.0) | Pure-Ruby `gsub` | Ratio           |
 |-----------------------|---------------------|------------------|-----------------|
@@ -576,14 +573,14 @@ machine-dependent, but the flat curve is not.
 ## How it works
-1. At load time, `Init_data_redactor` compiles all 85 regex patterns once using `regcomp` (POSIX ERE) and stores them as static `regex_t` structs. Patterns marked as boundary-wrapped are expanded with `wrap_boundary()` before compilation.
-2. `DataRedactor.redact(text)` receives a Ruby `String`, converts it to a C `char*` via `StringValueCStr`, and runs each compiled pattern in sequence on a working buffer.
-3. For each pattern, `replace_all_matches` iterates using `regexec`, copies non-matching segments to a fresh output buffer, and inserts `[REDACTED]` in place of each match. For boundary-wrapped patterns, `regexec` is called with `nmatch=4` and sub-match groups `[1]`/`[3]` identify the boundary characters so they are preserved verbatim.
-4. The output buffer is grown with `realloc` as needed. After all patterns are applied the result is returned as a Ruby `String` via `rb_str_new_cstr`. All intermediate `malloc`/`strdup` allocations are explicitly `free`d.
+1. At load time, `mm_init()` compiles every built-in pattern from a Thompson NFA into bytecode, lazily building each pattern's DFA on first use (interned and cached). Boundary-wrapped patterns are expanded with the word-boundary group before compilation.
+2. `DataRedactor.redact(text)` / `scan(text)` hand the input to the v19 engine, which scans it **once** and emits `(pattern_id, start, length)` events for every enabled pattern. Two selective-merge passes (a pure-digit group and an IBAN union) collapse the most common pattern classes into shared scans. The single pass over the original buffer is what makes the engine O(N).
+3. The raw events are resolved by `mm_resolve` under the **longest-match-wins** policy: overlapping spans are reduced to a non-overlapping set keeping the longest match at each position, with the lower pattern index breaking equal-length ties.
+4. `redact` rewrites the surviving spans to placeholders in one buffer build (preserving the boundary characters of boundary-wrapped matches); `scan` returns the event list with byte offsets into the original string. Custom patterns (`add_pattern`) run on the glibc `regexec` path afterward — required for correct UTF-8 diacritic matching.
 ## Memory management
-All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before the function returns. The only Ruby-managed allocation is the final return value from `rb_str_new_cstr`. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
+All C-side working buffers are heap-allocated and freed before the call returns; the only Ruby-managed allocation is the final result `String`. No Ruby objects are created mid-scan, so GC cannot collect anything out from under the C code. Per-thread engine scratch (NFA state, lazy-DFA cache) is freed automatically when the thread exits — see [Thread safety](#thread-safety).
 ## Thread safety
@@ -601,7 +598,6 @@ Released under the [MIT License](LICENSE).
 ## Known limitations
-- **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
 - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
 - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
-- **Single-pass overlap semantics** — built-in patterns are resolved by an index-order greedy claim: the lower-index pattern wins any region it matches. When two secrets abut with no separator, a rewrite-created word boundary can cause the second to be missed. This is rare in real text (secrets are almost always separator-delimited) and will be fixed by the upcoming longest-match-wins resolver in 1.0.
+- **Overlap resolution is longest-match-wins** — when two patterns match overlapping spans the engine keeps the longer span; equal-length ties go to the lower pattern index. This favours redacting *more* when uncertain (a 40-char secret is redacted whole rather than leaking the bytes past a shorter prefix match). When two secrets abut with **no separator** between them, a boundary-wrapped pattern can fail to match because the original buffer has no word boundary where one token meets the next, leaving the abutting token unredacted. This is rare in real text (secrets are almost always separator-delimited).
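
Step 4 of the new "How it works" text says `redact` rewrites the surviving spans in one buffer build, using byte offsets into the original string. A minimal Ruby sketch of that step follows, assuming the spans are already non-overlapping and sorted by start. `rewrite_spans` is a hypothetical helper, not part of the gem, and it does not model the preservation of boundary characters.

```ruby
# Hypothetical helper: replace non-overlapping byte spans in a single pass.
# spans: array of [start, length] byte offsets, sorted by ascending start.
def rewrite_spans(text, spans, placeholder = "[REDACTED]")
  bytes  = text.b                                  # binary copy, so offsets are bytes
  out    = String.new(encoding: Encoding::BINARY)
  cursor = 0
  spans.each do |start, length|
    out << bytes.byteslice(cursor, start - cursor) # untouched text before the span
    out << placeholder                             # the span itself is dropped
    cursor = start + length
  end
  out << bytes.byteslice(cursor, bytes.bytesize - cursor) # tail after the last span
  out.force_encoding(text.encoding)
end

puts rewrite_spans("id=42 token=abcdef end", [[12, 6]])
```

Each input byte is copied at most once, so the rewrite stays linear in the input size regardless of how many spans survive resolution.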

data/ext/data_redactor/matcher.c CHANGED

@@ -16,10 +16,10 @@
  *
  *   2. Output contract. mm_scan() takes an enable_bits gate and emits ORIGINAL-
  *      frame (pattern_id, start, span) events for ALL enabled patterns in one
- *      pass; it does NOT model the gem's cross-pattern sequential rewrite. The
- *      caller applies mm_resolve() (index-order greedy claim) to reproduce
- *      today's "earlier-index pattern wins" semantics byte-for-byte. See
- *      TODO.md §1d Gap 5 and the AKIA specs in spec/data_redactor_spec.rb.
+ *      pass. The caller applies mm_resolve() (longest-match-wins greedy claim:
+ *      longest span at each position wins, equal lengths broken by lower
+ *      pattern_id) to pick the final non-overlapping set. See TODO.md §1d Gap 5
+ *      and the overlap-resolution specs in spec/data_redactor_spec.rb.
  *
  * The infix-literal classification and the BM_INFIX hint table below are ported
  * from prototypes/multi_pattern_matcher/gen_patterns.rb (which derived them from the
@@ -1260,14 +1260,14 @@ size_t mm_scan(const char *input, size_t len,
     return count;
 }
-/* Order events for the index-order greedy claim: ascending pattern_id, then
- * ascending start (so a lower-index pattern always gets first claim on a region;
- * within a pattern, earlier matches are seen first). */
+/* Order events for the longest-match-wins greedy claim: ascending start, then
+ * descending length (so the longest span at a given start is seen first), then
+ * ascending pattern_id (lower index wins a tie of equal length). */
 static int ev_cmp_resolve(const void *a, const void *b) {
     const mm_match_t *x = a, *y = b;
-    if (x->pattern_id != y->pattern_id) return x->pattern_id - y->pattern_id;
     if (x->start != y->start) return x->start < y->start ? -1 : 1;
-    return 0;
+    if (x->length != y->length) return x->length > y->length ? -1 : 1;
+    return x->pattern_id - y->pattern_id;
 }
 /* Order kept events for emission: ascending start. */
@@ -1281,12 +1281,12 @@ size_t mm_resolve(mm_match_t *ev, size_t n) {
     if (n == 0) return 0;
     qsort(ev, n, sizeof(mm_match_t), ev_cmp_resolve);
-    /* Greedy claim in (pattern_id, start) order. An event is kept iff its span
-     * [start, start+length) does not overlap any already-kept span. Kept spans
-     * are accumulated in `kept`; we check membership against them. n is small
-     * for typical inputs, but to stay linear-ish we keep `kept` sorted by start
-     * and binary-search the neighbourhood. For simplicity and because match
-     * counts are modest, a linear overlap check against the kept set is used. */
+    /* Greedy claim in (start, -length, pattern_id) order: the longest span at
+     * each position is offered first and claims its region; any later (shorter,
+     * or equal-length higher-id) event overlapping an already-kept span is
+     * dropped. An event is kept iff its span [start, start+length) does not
+     * overlap any already-kept span. Match counts are modest, so a linear
+     * overlap check against the kept set is used. */
     mm_match_t *kept = mm_xmalloc(n * sizeof(mm_match_t));
     size_t nk = 0;
     for (size_t i = 0; i < n; i++) {

data/ext/data_redactor/matcher.h CHANGED

@@ -53,19 +53,20 @@ void mm_clear_custom(void);
  * array disables out-of-range patterns. Events carry ORIGINAL-frame offsets.
  *
  * Events are NOT pre-resolved for cross-pattern overlap — the caller applies
- * the index-order greedy claim (mm_resolve) to reproduce the gem's sequential
- * per-pattern rewrite semantics.
+ * the longest-match-wins greedy claim (mm_resolve) to pick the final
+ * non-overlapping set.
  */
 size_t mm_scan(const char *input, size_t len,
                const int *enable_bits, size_t n_bits,
                mm_match_t *out, size_t max);
 /*
- * Resolve raw scan events into the non-overlapping set the gem's sequential
- * per-pattern rewrite would produce: in (pattern_id, start) order, keep an
- * event iff its CORE span does not overlap an already-kept span. Sorts `ev`
- * in place and returns the kept count (compacted to the front of `ev`), in
- * ascending start order. n_total is the pattern-id upper bound for ordering.
+ * Resolve raw scan events into the final non-overlapping set under the
+ * longest-match-wins policy: process events in (start asc, length desc,
+ * pattern_id asc) order and keep an event iff its CORE span does not overlap an
+ * already-kept span. The longest match at each position wins; equal-length ties
+ * go to the lower pattern_id. Sorts `ev` in place and returns the kept count
+ * (compacted to the front of `ev`), in ascending start order.
  */
 size_t mm_resolve(mm_match_t *ev, size_t n);
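
The ordering change in `mm_resolve` can be illustrated outside C. The Ruby sketch below is a model of the two sort orders described in the comments above, not a port of the actual function: `Match`, `greedy_claim`, `resolve_longest`, and `resolve_index_order` are hypothetical names, and the pattern ids 14 and 15 are borrowed from the comments in `patterns.c` purely for illustration.

```ruby
# Hypothetical model of the resolver's greedy claim; not the C implementation.
Match = Struct.new(:pattern_id, :start, :length) do
  def stop
    start + length
  end
end

# Keep an event iff its span does not overlap any span kept so far.
def greedy_claim(ordered)
  kept = []
  ordered.each do |e|
    overlaps = kept.any? { |k| e.start < k.stop && k.start < e.stop }
    kept << e unless overlaps
  end
  kept.sort_by(&:start)
end

# New order: ascending start, then longest first, then lower pattern_id.
def resolve_longest(events)
  greedy_claim(events.sort_by { |e| [e.start, -e.length, e.pattern_id] })
end

# Previous order: ascending pattern_id, then ascending start.
def resolve_index_order(events)
  greedy_claim(events.sort_by { |e| [e.pattern_id, e.start] })
end

# A 20-byte match (id 14) and a 40-byte match (id 15) starting at the same offset.
events = [Match.new(14, 0, 20), Match.new(15, 0, 40)]
p resolve_index_order(events).map(&:to_a) # keeps the 20-byte span only
p resolve_longest(events).map(&:to_a)     # keeps the 40-byte span
```

Under the previous order the 20-byte claim blocks the 40-byte event, so the trailing 20 bytes stay unredacted. This is the leak the changelog describes.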

data/ext/data_redactor/patterns.c CHANGED

@@ -425,18 +425,21 @@ const char *pattern_strings[NUM_PATTERNS] = {
     /* ---- Tier 2: Long prefixed tokens ---- */
     /*  6: GitHub PAT fine-grained (github_pat_ + 82 chars) */
     "github_pat_[0-9a-zA-Z_]{82}",
-    /*  7: JWT (three base64url segments) */
-    "eyJ[A-Za-z0-9_-]{10,}\\.eyJ[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9_-]+",
+    /*  7: JWT (three base64url segments). Tails bounded at RE_DUP_MAX (255):
+     * a JWT is unusable once its front is gone, so a bounded prefix is enough to
+     * neutralize it. Bounding restores a finite max_len (re-enables the engine's
+     * literal back-up skip) and removes the O(N^2) greedy-tail worst case. */
+    "eyJ[A-Za-z0-9_-]{10,255}\\.eyJ[A-Za-z0-9_-]{10,255}\\.[A-Za-z0-9_-]{1,255}",
     /*  8: Grafana API Token (base64 of {\"k\":\") */
-    "eyJrIjoi[A-Za-z0-9_=-]{42,}",
+    "eyJrIjoi[A-Za-z0-9_=-]{42,255}",
     /*  9: SSH Public Key */
-    "ssh-(rsa|ed25519|ecdsa) [a-zA-Z0-9/+=]{20,}",
+    "ssh-(rsa|ed25519|ecdsa) [a-zA-Z0-9/+=]{20,255}",
     /* 10: Bearer Token */
-    "[Bb]earer [a-zA-Z0-9_.=/+:-]{12,}",
+    "[Bb]earer [a-zA-Z0-9_.=/+:-]{12,255}",
     /* 11: Anthropic API Key (sk-ant-apiNN-... ~ 95+ chars) */
-    "sk-ant-api[0-9]{2}-[A-Za-z0-9_-]{90,}",
+    "sk-ant-api[0-9]{2}-[A-Za-z0-9_-]{90,255}",
     /* 12: OpenAI Project API Key (sk-proj-...) */
-    "sk-proj-[A-Za-z0-9_-]{20,}",
+    "sk-proj-[A-Za-z0-9_-]{20,255}",
     /* 13: Google API Key (AIza + 35 chars) */
     "AIza[0-9A-Za-z_-]{35}",
     /* 14: AWS Access Key ID (all prefixes + 16 chars) */
@@ -444,7 +447,7 @@ const char *pattern_strings[NUM_PATTERNS] = {
     /* 15: AWS Secret Access Key (40 base64 chars) */
     "[A-Za-z0-9/+=]{40}",
     /* 16: SendGrid API Key */
-    "SG\\.[a-zA-Z0-9_-]{5,}\\.[a-zA-Z0-9_-]{5,}",
+    "SG\\.[a-zA-Z0-9_-]{5,255}\\.[a-zA-Z0-9_-]{5,255}",
     /* 17: Amazon MWS Auth Token */
     "amzn\\.mws\\.[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
     /* 18: LaunchDarkly API Key (api-UUID or sdk-UUID) */
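
To see what the `{n,255}` cap means for an over-long token, the bounded Bearer pattern from the hunk above can be transcribed into a Ruby `Regexp`. This is only an illustration: the gem compiles these patterns in its C engine, Ruby's regex engine is a stand-in here, and `BEARER_BOUNDED` and `redact_bearer` are hypothetical names.

```ruby
# The bounded Bearer pattern from the diff, transcribed to Ruby syntax
# (the "/" is escaped because it sits inside a /.../ literal).
BEARER_BOUNDED = /[Bb]earer [a-zA-Z0-9_.=\/+:-]{12,255}/

# Hypothetical helper: redact with this single pattern only.
def redact_bearer(text)
  text.gsub(BEARER_BOUNDED, "[REDACTED]")
end

long_token = "Bearer " + "a" * 300

# The match covers "Bearer " (7 bytes) plus the first 255 token characters.
puts long_token[BEARER_BOUNDED].length
# The front of the token is gone; 45 trailing characters remain in the output.
puts redact_bearer(long_token)
```

The leftover tail is the "cryptographically-dead tail" the 0.14.1 changelog entry refers to: the prefix needed to use the token has been replaced.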

data/lib/data_redactor/version.rb CHANGED

@@ -1,4 +1,4 @@
 module DataRedactor
   # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
-  VERSION = "0.14.0"
+  VERSION = "0.15.0"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: data_redactor
 version: !ruby/object:Gem::Version
-  version: 0.14.0
+  version: 0.15.0
 platform: ruby
 authors:
 - Daniele Frisanco