data_redactor 0.10.0 → 0.10.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +12 -1
- data/{readme.md → README.md} +2 -2
- data/ext/data_redactor/matcher.c +2 -2
- data/ext/data_redactor/matcher.h +1 -1
- data/ext/data_redactor/patterns.c +6 -2
- data/lib/data_redactor/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: e744f5e18d5ce2311c21197b6d47b20d6032a38a72c8bfb2287a80219b8e2d77
+ data.tar.gz: cfd314afec1d018175a8424d81c12958582d5ddd9c16d05baa6048c17bba8ca8
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: f7ea267c8927f9852621180d77530818980af4fc5c089b7db95e6b0f980c1f3cb09c11806e9459241ca363b5c06203f34ede5e2b1473b1a2046a9a4e37e63fbb
+ data.tar.gz: 4081b898b339423bb5dd06f557e32bea5c7242869fc61a6b0a2447053ddcec28a751768b83d1ba66a2fcb82755ceb654c6e95052bc01e2cb23e1fac4b44b8d37
data/CHANGELOG.md
CHANGED
@@ -7,6 +7,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

  ## [Unreleased]

+ ## [0.10.1] - 2026-06-10
+
+ ### Fixed
+ - **musl/Alpine load failure** — the `hashicorp_vault_batch_token` pattern used a
+ `{138,300}` interval whose upper bound exceeds POSIX `RE_DUP_MAX` (255). glibc
+ accepts it, but musl's `regcomp` rejects it ("Invalid contents of {}"), so the
+ native musl gem raised at load (`require "data_redactor"`) on Alpine. Capped the
+ bound at 255; tokens are still neutralized (prefix + 251+ chars redacted).
+
  ## [0.10.0] - 2026-06-09

  ### Changed
@@ -204,7 +213,9 @@ features as 0.7.1 plus the pipeline fix.
  - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
  - RSpec suite with one example per pattern.

- [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.
+ [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.1...HEAD
+ [0.10.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.10.0...v0.10.1
+ [0.10.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...v0.10.0
  [0.9.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.8.0...v0.9.0
  [0.8.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...v0.8.0
  [0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
data/{readme.md → README.md}
RENAMED
@@ -523,7 +523,7 @@ All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before th

  ## Thread safety

- `DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads.
+ `DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. The v19 engine holds MRI's GVL for the duration of each call (no `rb_thread_call_without_gvl`), so concurrent calls are serialised by the GVL. Each call allocates its own working buffers; built-in engine state is read-only after `mm_init()` at load time.

  `DataRedactor.add_pattern`, `remove_pattern`, and `clear_custom_patterns!` mutate a shared dynamic array and are **not** thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent `redact`/`scan` call.

@@ -540,4 +540,4 @@ Released under the [MIT License](LICENSE).
  - **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
  - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
  - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
- - **
+ - **Single-pass overlap semantics** — built-in patterns are resolved by an index-order greedy claim: the lower-index pattern wins any region it matches. When two secrets abut with no separator, a rewrite-created word boundary can cause the second to be missed. This is rare in real text (secrets are almost always separator-delimited) and will be fixed by the upcoming longest-match-wins resolver in 1.0.
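The index-order greedy claim described in the README's overlap note can be illustrated with a small C sketch. It is not the gem's resolver: `struct span`, `overlaps`, and `claim_in_index_order` are hypothetical names, and the sketch assumes candidate matches have already been collected and sorted by ascending pattern index. A lower-index span is accepted first, and any later span that overlaps an accepted one is dropped.

```c
#include <stddef.h>

/* Hypothetical candidate match: the pattern index that produced it and the
 * half-open byte range [start, end) it covers in the input. */
struct span {
    int pattern;
    size_t start;
    size_t end;
};

/* Returns 1 when two half-open ranges share at least one byte. */
static int overlaps(const struct span *a, const struct span *b)
{
    return a->start < b->end && b->start < a->end;
}

/* Index-order greedy claim (illustrative only).
 * `cands` must be sorted by ascending pattern index. A candidate is kept
 * only if it overlaps no span that was already kept, so the lower-index
 * pattern wins any contested region. `out` needs room for `n` spans.
 * Returns the number of spans written to `out`. */
static size_t claim_in_index_order(const struct span *cands, size_t n,
                                   struct span *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int region_is_free = 1;
        for (size_t j = 0; j < kept; j++) {
            if (overlaps(&cands[i], &out[j])) {
                region_is_free = 0;
                break;
            }
        }
        if (region_is_free)
            out[kept++] = cands[i];
    }
    return kept;
}
```

With candidates `{0, 0, 9}`, `{3, 5, 21}`, and `{7, 30, 40}`, the pattern 3 span is dropped because pattern 0 already claimed bytes 5 to 8. This mirrors the passport versus credit card case in the README's pattern-ordering note.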
data/ext/data_redactor/matcher.c
CHANGED
@@ -1,6 +1,6 @@
  /* matcher.c — the v19 multi-pattern engine, ported into the gem.
  *
- * Ported from prototypes/
+ * Ported from prototypes/multi_pattern_matcher/matcher19.c (the standalone prototype
  * proven in docs/research_log.md). The matching core — regex parser -> Thompson
  * bytecode -> per-pattern lazy DFA, the v14 first-byte filter, the v12 literal
  * skip, the v18.1 anchor lowering, the v19 pure-digit and IBAN selective merges,
@@ -22,7 +22,7 @@
  * TODO.md §1d Gap 5 and the AKIA specs in spec/data_redactor_spec.rb.
  *
  * The infix-literal classification and the BM_INFIX hint table below are ported
- * from prototypes/
+ * from prototypes/multi_pattern_matcher/gen_patterns.rb (which derived them from the
  * same gem arrays at codegen time). They are pure optimisation hints — the
  * first-byte filter computed from the program itself is what guarantees
  * correctness — so a stale hint can only cost speed, never miss a match.
data/ext/data_redactor/matcher.h
CHANGED
@@ -8,7 +8,7 @@
  * interned DFA) with two selective merges (pure-digit run pass, IBAN union
  * pass) and the v19.1 EOL-at-buffer-end fix. Zero dependencies beyond libc.
  * See docs/research_log.md (v15..v19) for the derivation, and
- * prototypes/
+ * prototypes/multi_pattern_matcher/ for the standalone prototype this is ported from.
  *
  * Built-in pattern engines are sourced from the gem's pattern arrays
  * (pattern_strings[]/boundary_wrapped[]/pattern_required_literal[]), NOT a
data/ext/data_redactor/patterns.c
CHANGED
@@ -458,8 +458,12 @@ const char *pattern_strings[NUM_PATTERNS] = {
  "-----BEGIN PGP PRIVATE KEY BLOCK-----",
  /* 29: HashiCorp Vault Service Token (hvs. + 90-120 base64url chars) */
  "hvs\\.[A-Za-z0-9_-]{90,120}",
- /* 30: HashiCorp Vault Batch Token (hvb. + 138
-
+ /* 30: HashiCorp Vault Batch Token (hvb. + 138+ base64url chars).
+ * Upper bound capped at POSIX RE_DUP_MAX (255), not gitleaks' 300: musl's
+ * regcomp rejects {m,n} with n>255 ("Invalid contents of {}"), so the gem
+ * failed to load on Alpine. 255 still neutralizes the token (prefix + 251+
+ * chars redacted); only an unusually long >255-char token leaves a dead tail. */
+ "hvb\\.[A-Za-z0-9_-]{138,255}",
  /* 31: HashiCorp Terraform Cloud API Token (14 alphanum + .atlasv1. + 60-70 base64url chars) */
  "[A-Za-z0-9]{14}\\.atlasv1\\.[A-Za-z0-9_=-]{60,70}",

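The effect of the cap can be checked against the host libc's POSIX regex functions. The sketch below is an illustration only, not code from the gem: `compiles` and `matched_bytes` are hypothetical helpers, and compiling with `REG_EXTENDED` is an assumption about how the pattern is used. `compiles` reports whether `regcomp` accepts a pattern, and `matched_bytes` reports how many bytes of an `hvb.` token with a given body length the capped pattern covers.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `pattern` compiles as a POSIX extended regex on this libc,
 * 0 otherwise. The regex_t is freed only when regcomp succeeded. */
static int compiles(const char *pattern)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    regfree(&re);
    return 1;
}

/* Builds "hvb." followed by `body_len` 'A' characters and returns the number
 * of bytes the capped pattern matches, or -1 if it does not compile or does
 * not match. The pattern string is the one added in this diff. */
static long matched_bytes(size_t body_len)
{
    static const char capped[] = "hvb\\.[A-Za-z0-9_-]{138,255}";
    char token[512];
    regex_t re;
    regmatch_t m;
    long len = -1;

    /* 4 prefix bytes + body + terminating NUL must fit in the buffer. */
    assert(body_len + 5 <= sizeof token);
    memcpy(token, "hvb.", 4);
    memset(token + 4, 'A', body_len);
    token[4 + body_len] = '\0';

    if (regcomp(&re, capped, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, token, 1, &m, 0) == 0)
        len = (long)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

Per the changelog entry in this diff, `compiles("hvb\\.[A-Za-z0-9_-]{138,300}")` is expected to return 1 on glibc and 0 on musl, while the capped pattern compiles on both. For a 300-character body, `matched_bytes` returns 259 (the 4-byte prefix plus 255 characters), leaving the remaining 45 characters as the unredacted tail that the source comment mentions.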
metadata
CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: data_redactor
  version: !ruby/object:Gem::Version
- version: 0.10.
+ version: 0.10.1
  platform: ruby
  authors:
  - Daniele Frisanco
@@ -121,6 +121,7 @@ extra_rdoc_files: []
  files:
  - CHANGELOG.md
  - LICENSE
+ - README.md
  - ext/data_redactor/custom_patterns.c
  - ext/data_redactor/custom_patterns.h
  - ext/data_redactor/data_redactor.c
@@ -142,7 +143,6 @@ files:
  - lib/data_redactor/integrations/rails.rb
  - lib/data_redactor/name_pattern.rb
  - lib/data_redactor/version.rb
- - readme.md
  homepage: https://github.com/danielefrisanco/data_redactor
  licenses:
  - MIT