RubyGems - data_redactor - Versions diffs - 0.10.0 → 0.10.1 - Mend

data_redactor 0.10.0 → 0.10.1

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 59ae814186478e16f6ba16f66aa1dfa8f3fd63d088cd5c837221d7530c6a0c73
-  data.tar.gz: 631c2a6f5198d7c2e741f9a283263ffbadfb49053bf6767ba57cce67b33e381f
+  metadata.gz: e744f5e18d5ce2311c21197b6d47b20d6032a38a72c8bfb2287a80219b8e2d77
+  data.tar.gz: cfd314afec1d018175a8424d81c12958582d5ddd9c16d05baa6048c17bba8ca8
 SHA512:
-  metadata.gz: a5fdcc1bf088c9065f7e0c458fa4cf210917d688cb9b2b17e0824e59e9757f2f9c0491ef53c0e91f7ff29ac971b1ef6cc2d434a1d80505206e2d9f5b36893ca9
-  data.tar.gz: 1457805dc7599d1655ebb8bc569607be3380290f99f65cdf15cdc17fc7431932e4ff4250f967ff0d7459350d1d836c4d63857bf0b90ad408c2fcd7a969ab453f
+  metadata.gz: f7ea267c8927f9852621180d77530818980af4fc5c089b7db95e6b0f980c1f3cb09c11806e9459241ca363b5c06203f34ede5e2b1473b1a2046a9a4e37e63fbb
+  data.tar.gz: 4081b898b339423bb5dd06f557e32bea5c7242869fc61a6b0a2447053ddcec28a751768b83d1ba66a2fcb82755ceb654c6e95052bc01e2cb23e1fac4b44b8d37

data/CHANGELOG.md CHANGED

@@ -7,6 +7,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.10.1] - 2026-06-10
+### Fixed
+- **musl/Alpine load failure** — the `hashicorp_vault_batch_token` pattern used a
+  `{138,300}` interval whose upper bound exceeds POSIX `RE_DUP_MAX` (255). glibc
+  accepts it, but musl's `regcomp` rejects it ("Invalid contents of {}"), so the
+  native musl gem raised at load (`require "data_redactor"`) on Alpine. Capped the
+  bound at 255; tokens are still neutralized (prefix + 251+ chars redacted).
 ## [0.10.0] - 2026-06-09
 ### Changed
@@ -204,7 +213,9 @@ features as 0.7.1 plus the pipeline fix.
 - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
 - RSpec suite with one example per pattern.
-[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...HEAD
+[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...HEAD
+[0.10.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.0...v0.10.1
+[0.10.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...v0.10.0
 [0.9.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.8.0...v0.9.0
 [0.8.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...v0.8.0
 [0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
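The load failure described in the 0.10.1 entry can be checked against the local libc with a small probe. This is a minimal sketch, not code from the gem: the helper names `compile_status` and `pattern_matches` are hypothetical, and compiling with `REG_EXTENDED` is an assumption about how the gem calls `regcomp`. Per the changelog, the `{138,300}` form is expected to compile on glibc and fail on musl, while the `{138,255}` form should compile on both.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` as a POSIX extended regex and
 * return regcomp's status (0 on success). On failure, print the message
 * the local libc reports for that status. */
int compile_status(const char *pattern)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp(\"%s\") failed: %s\n", pattern, msg);
        return rc;
    }
    regfree(&re);
    return 0;
}

/* Hypothetical helper: return 1 if `text` contains a match for `pattern`,
 * 0 if it does not, and -1 if the pattern does not compile. */
int pattern_matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling `compile_status("hvb\\.[A-Za-z0-9_-]{138,300}")` and `compile_status("hvb\\.[A-Za-z0-9_-]{138,255}")` from a `main` built inside an Alpine container and again on a glibc distribution shows whether the interval bound is what the local `regcomp` objects to.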

data/{readme.md → README.md} RENAMED

@@ -523,7 +523,7 @@ All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before th
 ## Thread safety
-`DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. Built-in patterns are compiled into a static `regex_t` array at load time and never mutated afterward, and each call allocates its own working buffers. POSIX `regexec` is documented as thread-safe.
+`DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. The v19 engine holds MRI's GVL for the duration of each call (no `rb_thread_call_without_gvl`), so concurrent calls are serialised by the GVL. Each call allocates its own working buffers; built-in engine state is read-only after `mm_init()` at load time.
 `DataRedactor.add_pattern`, `remove_pattern`, and `clear_custom_patterns!` mutate a shared dynamic array and are **not** thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent `redact`/`scan` call.
@@ -540,4 +540,4 @@ Released under the [MIT License](LICENSE).
 - **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
 - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
 - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
-- **Performance is currently slower than pure-Ruby `gsub`.** A May 2026 investigation found the C extension is 3–5× slower than a pure-Ruby `gsub` loop running the same 88 patterns, across input sizes from 168 bytes to 1 MB. The root cause is glibc's POSIX `regexec()`: each call allocates an O(input-length) state buffer before any matching begins, and the gem calls it once per pattern in sequence. Ruby's Onigmo engine wins by using a built-in Boyer-Moore literal pre-filter that this gem can only approximate. Two perf fixes have shipped (buffer-sizing in `replace_all_matches`, a `strstr` literal pre-filter, and input chunking for large payloads), which gave ~25-30% improvement and made scaling linear, but the absolute gap remains. Use the gem on small payloads where the absolute latency is still acceptable (< 1 ms for typical log lines); for high-throughput pipelines, hold off until the next major release. See `docs/standalone_matcher_design.md` for the long-term plan.
+- **Single-pass overlap semantics** — built-in patterns are resolved by an index-order greedy claim: the lower-index pattern wins any region it matches. When two secrets abut with no separator, a rewrite-created word boundary can cause the second to be missed. This is rare in real text (secrets are almost always separator-delimited) and will be fixed by the upcoming longest-match-wins resolver in 1.0.
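The "index-order greedy claim" rule in the new README bullet can be illustrated with a short sketch. It is not the gem's resolver: the `struct candidate` layout and the `claim_in_index_order` function are hypothetical, and the sketch assumes candidate matches have already been collected and sorted by ascending pattern index. It only shows the claim rule itself (a lower-index pattern wins any region it matches) and does not model the word-boundary effect on abutting secrets.

```c
#include <stddef.h>

/* Hypothetical candidate match: the pattern that produced it and the
 * half-open byte range [start, end) it covers in the input. */
struct candidate {
    int pattern_index;
    size_t start;
    size_t end;
};

/* Resolve overlaps by index-order greedy claim.
 * `cands` must be sorted by ascending pattern_index. keep[i] is set to 1
 * if candidate i claims its region and to 0 if an earlier kept candidate
 * already claimed any byte of it. Returns the number of kept candidates.
 * Quadratic on purpose: this is an illustration, not a tuned resolver. */
size_t claim_in_index_order(const struct candidate *cands, size_t n,
                            unsigned char *keep)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < i; j++) {
            /* Two half-open ranges overlap when each starts before the
             * other ends. */
            if (keep[j] &&
                cands[i].start < cands[j].end &&
                cands[j].start < cands[i].end) {
                keep[i] = 0;
                break;
            }
        }
        kept += keep[i];
    }
    return kept;
}
```

Under this rule a pattern-1 match over bytes 5..16 is dropped when pattern 0 has already claimed bytes 0..9, whereas two ranges that merely touch (one ending where the next begins) are both kept.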

data/ext/data_redactor/matcher.c CHANGED

@@ -1,6 +1,6 @@
 /* matcher.c — the v19 multi-pattern engine, ported into the gem.
  *
- * Ported from prototypes/multi_matcher_v1/matcher19.c (the standalone prototype
+ * Ported from prototypes/multi_pattern_matcher/matcher19.c (the standalone prototype
  * proven in docs/research_log.md). The matching core — regex parser -> Thompson
  * bytecode -> per-pattern lazy DFA, the v14 first-byte filter, the v12 literal
  * skip, the v18.1 anchor lowering, the v19 pure-digit and IBAN selective merges,
@@ -22,7 +22,7 @@
  *      TODO.md §1d Gap 5 and the AKIA specs in spec/data_redactor_spec.rb.
  *
  * The infix-literal classification and the BM_INFIX hint table below are ported
- * from prototypes/multi_matcher_v1/gen_patterns.rb (which derived them from the
+ * from prototypes/multi_pattern_matcher/gen_patterns.rb (which derived them from the
  * same gem arrays at codegen time). They are pure optimisation hints — the
  * first-byte filter computed from the program itself is what guarantees
  * correctness — so a stale hint can only cost speed, never miss a match.

data/ext/data_redactor/matcher.h CHANGED

@@ -8,7 +8,7 @@
  * interned DFA) with two selective merges (pure-digit run pass, IBAN union
  * pass) and the v19.1 EOL-at-buffer-end fix. Zero dependencies beyond libc.
  * See docs/research_log.md (v15..v19) for the derivation, and
- * prototypes/multi_matcher_v1/ for the standalone prototype this is ported from.
+ * prototypes/multi_pattern_matcher/ for the standalone prototype this is ported from.
  *
  * Built-in pattern engines are sourced from the gem's pattern arrays
  * (pattern_strings[]/boundary_wrapped[]/pattern_required_literal[]), NOT a

data/ext/data_redactor/patterns.c CHANGED

@@ -458,8 +458,12 @@ const char *pattern_strings[NUM_PATTERNS] = {
     "-----BEGIN PGP PRIVATE KEY BLOCK-----",
     /* 29: HashiCorp Vault Service Token (hvs. + 90-120 base64url chars) */
     "hvs\\.[A-Za-z0-9_-]{90,120}",
-    /* 30: HashiCorp Vault Batch Token (hvb. + 138-300 base64url chars) */
-    "hvb\\.[A-Za-z0-9_-]{138,300}",
+    /* 30: HashiCorp Vault Batch Token (hvb. + 138+ base64url chars).
+     * Upper bound capped at POSIX RE_DUP_MAX (255), not gitleaks' 300: musl's
+     * regcomp rejects {m,n} with n>255 ("Invalid contents of {}"), so the gem
+     * failed to load on Alpine. 255 still neutralizes the token (prefix + 251+
+     * chars redacted); only an unusually long >255-char token leaves a dead tail. */
+    "hvb\\.[A-Za-z0-9_-]{138,255}",
     /* 31: HashiCorp Terraform Cloud API Token (14 alphanum + .atlasv1. + 60-70 base64url chars) */
     "[A-Za-z0-9]{14}\\.atlasv1\\.[A-Za-z0-9_=-]{60,70}",

data/lib/data_redactor/version.rb CHANGED

@@ -1,4 +1,4 @@
 module DataRedactor
   # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
-  VERSION = "0.10.0"
+  VERSION = "0.10.1"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: data_redactor
 version: !ruby/object:Gem::Version
-  version: 0.10.0
+  version: 0.10.1
 platform: ruby
 authors:
 - Daniele Frisanco
@@ -121,6 +121,7 @@ extra_rdoc_files: []
 files:
 - CHANGELOG.md
 - LICENSE
+- README.md
 - ext/data_redactor/custom_patterns.c
 - ext/data_redactor/custom_patterns.h
 - ext/data_redactor/data_redactor.c
@@ -142,7 +143,6 @@ files:
 - lib/data_redactor/integrations/rails.rb
 - lib/data_redactor/name_pattern.rb
 - lib/data_redactor/version.rb
-- readme.md
 homepage: https://github.com/danielefrisanco/data_redactor
 licenses:
 - MIT