data_redactor 0.10.0 → 0.10.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 59ae814186478e16f6ba16f66aa1dfa8f3fd63d088cd5c837221d7530c6a0c73
- data.tar.gz: 631c2a6f5198d7c2e741f9a283263ffbadfb49053bf6767ba57cce67b33e381f
+ metadata.gz: e744f5e18d5ce2311c21197b6d47b20d6032a38a72c8bfb2287a80219b8e2d77
+ data.tar.gz: cfd314afec1d018175a8424d81c12958582d5ddd9c16d05baa6048c17bba8ca8
  SHA512:
- metadata.gz: a5fdcc1bf088c9065f7e0c458fa4cf210917d688cb9b2b17e0824e59e9757f2f9c0491ef53c0e91f7ff29ac971b1ef6cc2d434a1d80505206e2d9f5b36893ca9
- data.tar.gz: 1457805dc7599d1655ebb8bc569607be3380290f99f65cdf15cdc17fc7431932e4ff4250f967ff0d7459350d1d836c4d63857bf0b90ad408c2fcd7a969ab453f
+ metadata.gz: f7ea267c8927f9852621180d77530818980af4fc5c089b7db95e6b0f980c1f3cb09c11806e9459241ca363b5c06203f34ede5e2b1473b1a2046a9a4e37e63fbb
+ data.tar.gz: 4081b898b339423bb5dd06f557e32bea5c7242869fc61a6b0a2447053ddcec28a751768b83d1ba66a2fcb82755ceb654c6e95052bc01e2cb23e1fac4b44b8d37
data/CHANGELOG.md CHANGED
@@ -7,6 +7,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
  ## [Unreleased]
 
+ ## [0.10.1] - 2026-06-10
+
+ ### Fixed
+ - **musl/Alpine load failure** — the `hashicorp_vault_batch_token` pattern used a
+ `{138,300}` interval whose upper bound exceeds POSIX `RE_DUP_MAX` (255). glibc
+ accepts it, but musl's `regcomp` rejects it ("Invalid contents of {}"), so the
+ native musl gem raised at load (`require "data_redactor"`) on Alpine. Capped the
+ bound at 255; tokens are still neutralized (prefix + 251+ chars redacted).
+
  ## [0.10.0] - 2026-06-09
 
  ### Changed
@@ -204,7 +213,9 @@ features as 0.7.1 plus the pipeline fix.
  - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
  - RSpec suite with one example per pattern.
 
- [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...HEAD
+ [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...HEAD
+ [0.10.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.0...v0.10.1
+ [0.10.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...v0.10.0
  [0.9.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.8.0...v0.9.0
  [0.8.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...v0.8.0
  [0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
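
Reviewer note on the 0.10.1 fix: the effect of capping the interval at 255 can be sketched in plain Ruby. This is an illustration only. It uses Ruby's built-in regex engine rather than the gem's native matcher, the helper name `redact_batch_token` is hypothetical, and the gem's boundary handling is not modelled, so real output may differ.

```ruby
# Illustration only: Ruby's own regex engine stands in for the gem's native
# matcher. The pattern text mirrors the capped entry introduced in 0.10.1.
CAPPED_BATCH_TOKEN = /hvb\.[A-Za-z0-9_-]{138,255}/

# Hypothetical helper, not part of the gem's API.
def redact_batch_token(text)
  text.gsub(CAPPED_BATCH_TOKEN, "[REDACTED]")
end

typical   = "hvb." + "a" * 200  # body length inside the 138..255 window
oversized = "hvb." + "a" * 300  # body longer than the capped upper bound

puts redact_batch_token(typical)    # the whole token is replaced
puts redact_batch_token(oversized)  # prefix + 255 chars replaced, 45 chars remain
```

Under these assumptions the oversized token keeps a 45-character tail after the replacement, which corresponds to the "dead tail" the source comment mentions for tokens longer than 255 characters.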
@@ -523,7 +523,7 @@ All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before th
 
  ## Thread safety
 
- `DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. Built-in patterns are compiled into a static `regex_t` array at load time and never mutated afterward, and each call allocates its own working buffers. POSIX `regexec` is documented as thread-safe.
+ `DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. The v19 engine holds MRI's GVL for the duration of each call (no `rb_thread_call_without_gvl`), so concurrent calls are serialised by the GVL. Each call allocates its own working buffers; built-in engine state is read-only after `mm_init()` at load time.
 
  `DataRedactor.add_pattern`, `remove_pattern`, and `clear_custom_patterns!` mutate a shared dynamic array and are **not** thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent `redact`/`scan` call.
 
@@ -540,4 +540,4 @@ Released under the [MIT License](LICENSE).
  - **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
  - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
  - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
- - **Performance is currently slower than pure-Ruby `gsub`.** A May 2026 investigation found the C extension is 3–5× slower than a pure-Ruby `gsub` loop running the same 88 patterns, across input sizes from 168 bytes to 1 MB. The root cause is glibc's POSIX `regexec()`: each call allocates an O(input-length) state buffer before any matching begins, and the gem calls it once per pattern in sequence. Ruby's Onigmo engine wins by using a built-in Boyer-Moore literal pre-filter that this gem can only approximate. Two perf fixes have shipped (buffer-sizing in `replace_all_matches`, a `strstr` literal pre-filter, and input chunking for large payloads), which gave ~25-30% improvement and made scaling linear, but the absolute gap remains. Use the gem on small payloads where the absolute latency is still acceptable (< 1 ms for typical log lines); for high-throughput pipelines, hold off until the next major release. See `docs/standalone_matcher_design.md` for the long-term plan.
+ - **Single-pass overlap semantics**: built-in patterns are resolved by an index-order greedy claim: the lower-index pattern wins any region it matches. When two secrets abut with no separator, a rewrite-created word boundary can cause the second to be missed. This is rare in real text (secrets are almost always separator-delimited) and will be fixed by the upcoming longest-match-wins resolver in 1.0.
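
Reviewer note on the new limitation: the "index-order greedy claim" can be sketched in plain Ruby as below. This is a simplified model and not the gem's engine. `Claim`, `greedy_claims`, and `redact_with_claims` are hypothetical names, the two patterns are made up for the example, and boundary wrapping is not modelled.

```ruby
# Simplified model of an index-order greedy claim (all names hypothetical).
Claim = Struct.new(:start, :stop, :index)

# Walk the patterns in index order; a match is kept only when it does not
# overlap a span already claimed by a lower-index pattern.
def greedy_claims(text, patterns)
  claims = []
  patterns.each_with_index do |re, idx|
    pos = 0
    while (m = re.match(text, pos))
      s = m.begin(0)
      e = m.end(0)
      break if s == e # stop on empty matches so the loop always advances
      overlaps = claims.any? { |c| s < c.stop && c.start < e }
      claims << Claim.new(s, e, idx) unless overlaps
      pos = e
    end
  end
  claims.sort_by(&:start)
end

# Replace every claimed span with the redaction marker.
def redact_with_claims(text, patterns)
  out = +""
  cursor = 0
  greedy_claims(text, patterns).each do |c|
    out << text[cursor...c.start] << "[REDACTED]"
    cursor = c.stop
  end
  out << text[cursor..-1]
end

patterns = [/AKIA[A-Z0-9]{16}/, /[A-Z0-9]{20}/] # made-up pair for the demo
puts redact_with_claims("key=AKIAABCDEFGHIJKLMNOP", patterns)
```

In this model both patterns match the same 20 characters, and the span is attributed to index 0 because it claimed the region first. A longest-match-wins resolver, as planned for 1.0, would rank competing spans by length instead of by index.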
@@ -1,6 +1,6 @@
  /* matcher.c — the v19 multi-pattern engine, ported into the gem.
  *
- * Ported from prototypes/multi_matcher_v1/matcher19.c (the standalone prototype
+ * Ported from prototypes/multi_pattern_matcher/matcher19.c (the standalone prototype
  * proven in docs/research_log.md). The matching core — regex parser -> Thompson
  * bytecode -> per-pattern lazy DFA, the v14 first-byte filter, the v12 literal
  * skip, the v18.1 anchor lowering, the v19 pure-digit and IBAN selective merges,
@@ -22,7 +22,7 @@
  * TODO.md §1d Gap 5 and the AKIA specs in spec/data_redactor_spec.rb.
  *
  * The infix-literal classification and the BM_INFIX hint table below are ported
- * from prototypes/multi_matcher_v1/gen_patterns.rb (which derived them from the
+ * from prototypes/multi_pattern_matcher/gen_patterns.rb (which derived them from the
  * same gem arrays at codegen time). They are pure optimisation hints — the
  * first-byte filter computed from the program itself is what guarantees
  * correctness — so a stale hint can only cost speed, never miss a match.
@@ -8,7 +8,7 @@
  * interned DFA) with two selective merges (pure-digit run pass, IBAN union
  * pass) and the v19.1 EOL-at-buffer-end fix. Zero dependencies beyond libc.
  * See docs/research_log.md (v15..v19) for the derivation, and
- * prototypes/multi_matcher_v1/ for the standalone prototype this is ported from.
+ * prototypes/multi_pattern_matcher/ for the standalone prototype this is ported from.
  *
  * Built-in pattern engines are sourced from the gem's pattern arrays
  * (pattern_strings[]/boundary_wrapped[]/pattern_required_literal[]), NOT a
@@ -458,8 +458,12 @@ const char *pattern_strings[NUM_PATTERNS] = {
  "-----BEGIN PGP PRIVATE KEY BLOCK-----",
  /* 29: HashiCorp Vault Service Token (hvs. + 90-120 base64url chars) */
  "hvs\\.[A-Za-z0-9_-]{90,120}",
- /* 30: HashiCorp Vault Batch Token (hvb. + 138-300 base64url chars) */
- "hvb\\.[A-Za-z0-9_-]{138,300}",
+ /* 30: HashiCorp Vault Batch Token (hvb. + 138+ base64url chars).
+ * Upper bound capped at POSIX RE_DUP_MAX (255), not gitleaks' 300: musl's
+ * regcomp rejects {m,n} with n>255 ("Invalid contents of {}"), so the gem
+ * failed to load on Alpine. 255 still neutralizes the token (prefix + 251+
+ * chars redacted); only an unusually long >255-char token leaves a dead tail. */
+ "hvb\\.[A-Za-z0-9_-]{138,255}",
  /* 31: HashiCorp Terraform Cloud API Token (14 alphanum + .atlasv1. + 60-70 base64url chars) */
  "[A-Za-z0-9]{14}\\.atlasv1\\.[A-Za-z0-9_=-]{60,70}",
 
@@ -1,4 +1,4 @@
  module DataRedactor
  # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
- VERSION = "0.10.0"
+ VERSION = "0.10.1"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: data_redactor
  version: !ruby/object:Gem::Version
- version: 0.10.0
+ version: 0.10.1
  platform: ruby
  authors:
  - Daniele Frisanco
@@ -121,6 +121,7 @@ extra_rdoc_files: []
  files:
  - CHANGELOG.md
  - LICENSE
+ - README.md
  - ext/data_redactor/custom_patterns.c
  - ext/data_redactor/custom_patterns.h
  - ext/data_redactor/data_redactor.c
@@ -142,7 +143,6 @@ files:
  - lib/data_redactor/integrations/rails.rb
  - lib/data_redactor/name_pattern.rb
  - lib/data_redactor/version.rb
- - readme.md
  homepage: https://github.com/danielefrisanco/data_redactor
  licenses:
  - MIT