data_redactor 0.9.0-x86_64-darwin → 0.10.0-x86_64-darwin

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 38318caae259b4bdb91aa0b8bab99c26e2827348e0d16ab3ce08cde5dd71af68
-   data.tar.gz: 6c3322602789eba6b388f06e7476ef5d98cade654c978fc2cdd77d36c1ee4b22
+   metadata.gz: 9981022b2ac65d6a44a6c219816e7fc5ed1d349c000863f7ff729280951d9245
+   data.tar.gz: 2d7d9fd895324d94776447525905cd96c17e2feab7fbd4d918ec8c693edc9cfc
  SHA512:
-   metadata.gz: 776812b41c38f3810897205a2702243c0e7954ff06596d20275c8a0259226ad98f3e3593a1aba8194d6b3e3abfd0275c7838e49fbc9d3bc07b1235cc983fa0f7
-   data.tar.gz: 8b4bf0dde0974e75db557861a70818eaabb1829b32260f236c726b8a199711599adb46571d980a24b2a776625b927503e605db66ef1cac1dbdfafd622374f9ee
+   metadata.gz: 5f4bf64d38faf953792f2e1196ceb9743ec47388a304915451333ee9133f9f2184178813581b8e8c518e6a4868822038fc4b914a1646aa69a0dd406434974a35
+   data.tar.gz: e43fb509e3d628069abf87f437fe827f344e3550f4f451d83a4d978316f40278c52db5e03683b36c376d55381a2bdf45ec8f626b48b3e1df0f464710f3008654
data/CHANGELOG.md CHANGED
@@ -7,6 +7,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
  ## [Unreleased]
 
+ ## [0.10.0] - 2026-06-09
+
+ ### Changed
+ - **Engine rewrite (v19 hybrid)** — `redact` and `scan` now run through a
+ Thompson NFA → bytecode → lazy-DFA multi-pattern engine (v19) for all 88
+ built-in patterns, replacing the previous per-pattern POSIX `regexec` loop.
+ Custom patterns (`add_pattern`) continue to use the glibc path (hybrid split
+ — required for correct UTF-8 multibyte character-class matching in user regex).
+ - Throughput on a 1 MB log: **~8.4× faster** than the previous C engine
+ (0.87 i/s → 7.27 i/s); **2.25× faster** than pure-Ruby `gsub` (was 4×
+ slower). Small per-call strings: 1.7–2.3× faster (was 3–4.6× slower).
+ - Overlap resolution: built-in matches are now resolved by an index-order
+ greedy claim (`mm_resolve`) that reproduces today's sequential per-pattern
+ rewrite semantics exactly. The one accepted divergence (rewrite-created
+ boundary when two secrets abut with no separator) is documented in
+ `TODO.md §1d` and pinned by `DIVERGENCE` specs.
+ - `rb_data_redactor_scan`: coordinate mapping (`repl_log` / `WORKING_TO_ORIG`)
+ replaced by direct original-frame offset emission from the v19 engine; custom
+ patterns use a lightweight offset-walk over the built-in event list.
+
+ ### Fixed
+ - **Swiss AHV false-negative** — boundary-wrapped patterns with a
+ start-anchored required literal now correctly set `max_back = 1` (not 0) so
+ the literal-skip does not overshoot the boundary byte. `756.1234.5678.90`
+ now matches as expected. (Pre-existing bug in the old engine, caught by
+ going live.)
+
  ## [0.9.0] - 2026-05-22
 
  ### Added
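The "index-order greedy claim" described in the Overlap resolution entry above can be pictured with a small Ruby sketch. This is only an illustration under stated assumptions, not the C `mm_resolve` routine: `Candidate`, `resolve_overlaps`, and the sample offsets are hypothetical, and the sketch assumes that a lower pattern index claims its bytes first and that any later candidate overlapping a claimed range is dropped.

```ruby
# Hypothetical candidate record: which pattern matched and where (byte offsets).
Candidate = Struct.new(:pattern_index, :start, :length) do
  # Exclusive end offset of the match.
  def stop
    start + length
  end
end

# Greedy claim in pattern-index order (an assumption of this sketch):
# candidates are visited by ascending pattern index, then by start offset,
# and a candidate is kept only if it overlaps nothing claimed so far.
def resolve_overlaps(candidates)
  claimed = []
  candidates.sort_by { |c| [c.pattern_index, c.start] }.each do |c|
    overlaps = claimed.any? { |k| c.start < k.stop && k.start < c.stop }
    claimed << c unless overlaps
  end
  claimed.sort_by(&:start)
end

candidates = [
  Candidate.new(5, 0, 16),  # later pattern, wide match
  Candidate.new(2, 0, 9),   # earlier pattern, inside the same bytes
  Candidate.new(7, 20, 10), # unrelated match further along
]
winners = resolve_overlaps(candidates)
winners.each { |w| puts "pattern #{w.pattern_index}: bytes #{w.start}...#{w.stop}" }
```

Under these assumptions the earlier pattern wins the contested bytes, which mirrors a sequential rewrite where the first pattern to run consumes the text before later patterns see it.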
@@ -1,4 +1,4 @@
  module DataRedactor
    # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
-   VERSION = "0.9.0"
+   VERSION = "0.10.0"
  end
data/lib/data_redactor.rb CHANGED
@@ -74,6 +74,15 @@ module DataRedactor
    # Default placeholder used when +placeholder:+ is not given to {redact}.
    PLACEHOLDER_DEFAULT = "[REDACTED]"
 
+   # @api private
+   # Inputs larger than this (bytes) are split into newline-bounded chunks before
+   # being handed to the C engine. Bounds the per-call O(N) cost glibc regexec
+   # pays for state-log allocation, turning total redaction cost from O(N²) (one
+   # giant pass) into O(N × CHUNK_SIZE) (many bounded passes). 64 KB is a
+   # compromise: small enough to keep per-call cost low, large enough that
+   # typical log/JSON inputs use few chunks. See option G in TODO.md.
+   CHUNK_SIZE = 64 * 1024
+
    module_function
 
    # List of supported tag symbols.
@@ -132,6 +141,11 @@ module DataRedactor
    def redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
      enable_bits = build_enable_bits(only, except)
      ph_mode, ph_str = resolve_placeholder(placeholder)
+     # Defer to the C layer's TypeError for non-Strings; only chunk if the input
+     # is a String big enough to benefit (avoid bytesize on non-Strings).
+     if text.is_a?(String) && text.bytesize > CHUNK_SIZE
+       return _chunk_bytes(text).map { |c| _redact(c, ph_mode, ph_str, enable_bits) }.join
+     end
      _redact(text, ph_mode, ph_str, enable_bits)
    end
 
@@ -157,7 +171,12 @@ module DataRedactor
    # # value: "user@example.com", start: 0, length: 16}] }
    def scan(text, only: nil, except: nil)
      enable_bits = build_enable_bits(only, except)
-     result = _scan(text, enable_bits)
+     result =
+       if text.is_a?(String) && text.bytesize > CHUNK_SIZE
+         _chunked_scan(text, enable_bits)
+       else
+         _scan(text, enable_bits)
+       end
      # Normalise: convert tag string from C (uppercase) back to the Symbol used in TAGS
      result[:matches].each { |m| m[:tag] = m[:tag].to_s.downcase.to_sym }
      result
@@ -419,4 +438,59 @@ module DataRedactor
          "placeholder must be a String, :tagged, or :hash — got #{placeholder.inspect}"
      end
    end
+
+   # @api private
+   # Split +text+ into byte-bounded chunks for the chunked redact/scan path.
+   # Chunks end at a +\n+ when possible so no match straddles a boundary; if a
+   # single line exceeds {CHUNK_SIZE} (rare in real inputs), it becomes one
+   # oversized chunk and pays the per-pattern O(N) cost — documented limitation.
+   # Returns an Array of byte-Strings whose concatenation equals +text+ exactly
+   # (including the original newline separators).
+   #
+   # @param text [String]
+   # @return [Array<String>]
+   def _chunk_bytes(text)
+     chunks = []
+     pos = 0
+     len = text.bytesize
+     while pos < len
+       remaining = len - pos
+       if remaining <= CHUNK_SIZE
+         chunks << text.byteslice(pos, remaining)
+         break
+       end
+       # Find the last \n in [pos, pos+CHUNK_SIZE). If none, chunk is one long
+       # line — take CHUNK_SIZE bytes as a fallback (boundary-split risk).
+       window = text.byteslice(pos, CHUNK_SIZE)
+       nl = window.b.rindex("\n") # byte offset; String#rindex alone counts characters
+       take = nl ? nl + 1 : CHUNK_SIZE
+       chunks << text.byteslice(pos, take)
+       pos += take
+     end
+     chunks
+   end
+
+   # @api private
+   # Chunked variant of +_scan+: runs the C scanner on each chunk, then offsets
+   # each match's +:start+ by the chunk's base byte-position in the original
+   # input so the byteslice invariant holds end-to-end.
+   #
+   # @param text [String]
+   # @param enable_bits [Array<Integer>]
+   # @return [Hash{Symbol => Object}] +{ redacted: String, matches: Array<Hash> }+
+   def _chunked_scan(text, enable_bits)
+     redacted = +""
+     matches = []
+     base = 0
+     _chunk_bytes(text).each do |chunk|
+       part = _scan(chunk, enable_bits)
+       redacted << part[:redacted]
+       part[:matches].each do |m|
+         m[:start] += base
+         matches << m
+       end
+       base += chunk.bytesize
+     end
+     { redacted: redacted, matches: matches }
+   end
  end
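The chunked path added in this file can be tried outside the gem with a minimal sketch. It is not the gem's implementation: `TOY_CHUNK`, `chunk_bytes`, `toy_scan`, and `chunked_scan` are hypothetical names, the chunk size is shrunk to 32 bytes so a short string actually splits, and `toy_scan` is a pure-Ruby stand-in for the C `_scan` that only finds a simplistic e-mail shape. The sketch searches the binary view of each window (`String#b`) so the newline position is a byte offset even when the text contains multibyte characters.

```ruby
TOY_CHUNK = 32 # bytes; deliberately tiny for the demo (the gem uses 64 KB)
EMAIL = /[a-z0-9._]+@[a-z0-9.]+\.[a-z]+/ # simplistic stand-in pattern

# Split text into chunks of at most `size` bytes, ending at "\n" when possible.
def chunk_bytes(text, size = TOY_CHUNK)
  chunks = []
  pos = 0
  len = text.bytesize
  while pos < len
    remaining = len - pos
    if remaining <= size
      chunks << text.byteslice(pos, remaining)
      break
    end
    window = text.byteslice(pos, size)
    nl = window.b.rindex("\n")   # byte offset of the last newline, or nil
    take = nl ? nl + 1 : size    # no newline: fall back to a hard split
    chunks << text.byteslice(pos, take)
    pos += take
  end
  chunks
end

# Stand-in scanner: returns matches with chunk-local byte offsets.
def toy_scan(chunk)
  matches = []
  chunk.b.scan(EMAIL) do
    md = Regexp.last_match
    value = md[0]
    matches << { value: value.force_encoding(Encoding::UTF_8),
                 start: md.begin(0), length: value.bytesize }
  end
  { matches: matches }
end

# Scan chunk by chunk and rebase each :start onto the original input.
def chunked_scan(text)
  base = 0
  matches = []
  chunk_bytes(text).each do |chunk|
    toy_scan(chunk)[:matches].each { |m| matches << m.merge(start: m[:start] + base) }
    base += chunk.bytesize
  end
  matches
end

text = "héllo ann@example.com\nnothing here\nping bob@example.org now\n"
chunks = chunk_bytes(text)
matches = chunked_scan(text)
matches.each { |m| puts "#{m[:value]} at byte #{m[:start]}" }
```

The two properties worth checking are that the chunks concatenate back to the input and that `text.byteslice(m[:start], m[:length])` returns the matched value for every rebased match.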
data/readme.md CHANGED
@@ -10,9 +10,10 @@ A Ruby gem with a C extension for high-performance regex-based redaction of sens
 
  DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs,
  credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
- each match with a placeholder. The scanning runs in a C extension backed by POSIX
- `regex.h`, so the heavy lifting happens outside the Ruby VM and stays fast enough
- to run inline on large payloads.
+ each match with a placeholder. The scanning runs in a C extension backed by a
+ zero-dependency Thompson NFA lazy-DFA multi-pattern engine (v19) that scans
+ all 88 built-in patterns in a single pass — 2–2.5× faster than pure-Ruby `gsub`
+ on large payloads, with no external library dependencies.
 
  It ships **88 built-in patterns** across 15+ countries, grouped into tags
  (`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
@@ -384,8 +385,18 @@ redactor/
  │ ├── scan.{c,h} # _scan + byte-offset replacement-log macros
  │ ├── custom_patterns.{c,h} # Dynamic registry: add/remove/clear/list
  │ └── tags.h # TAG_* bit constants
- └── spec/
- └── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
+ ├── spec/
+ └── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
+ ├── benchmark/ # Repo-only perf scripts (not packaged in the gem)
+ │ ├── README.md # How to run, what each script measures
+ │ ├── support/corpus.rb # Shared payload builders + pure-Ruby baseline redactor
+ │ ├── throughput.rb # MB/s on representative payloads
+ │ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same 88 patterns)
+ │ ├── scaling.rb # Runtime vs input size 1KB → 50MB
+ │ └── per_pattern.rb # Per-pattern scan cost
+ └── docs/ # Design and execution docs for future work
+ ├── standalone_matcher_design.md
+ └── combined_matcher_plan.md
  ```
 
  ## Requirements
@@ -460,6 +471,45 @@ Or compile and test in one step:
  bundle exec rake
  ```
 
+ ## Benchmarks
+
+ The `benchmark/` directory holds four scripts that measure the C engine under
+ different angles. They are **not** packaged with the gem.
+
+ ```bash
+ bundle install # pulls benchmark-ips, benchmark-memory (dev deps)
+ bundle exec rake compile
+ bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same 88 patterns
+ bundle exec ruby benchmark/throughput.rb # MB/s on a log line, JSON, 1MB and 10MB log files
+ bundle exec ruby benchmark/scaling.rb # runtime vs input size (1KB → 50MB), confirms linear scaling
+ bundle exec ruby benchmark/per_pattern.rb # per-pattern scan cost over a 1MB payload
+ ```
+
+ See [`benchmark/README.md`](benchmark/README.md) for what each script measures
+ and how the pure-Ruby baseline is kept honest (it reads the same patterns the
+ C engine uses, via `DataRedactor::BUILTIN_PATTERN_SOURCES`).
+
+ ### Performance (0.10.0 — v19 multi-pattern engine)
+
+ As of 0.10.0 the C extension runs a **Thompson NFA → lazy-DFA multi-pattern
+ engine** (v19) that scans the input once across all 88 built-in patterns,
+ with two selective-merge passes (pure-digit group + IBAN union) that further
+ reduce work for the most common pattern classes. Custom patterns (`add_pattern`)
+ still use the glibc path (required for correct UTF-8 diacritic matching).
+
+ | Payload | v19 engine (0.10.0) | Pure-Ruby `gsub` | Ratio |
+ |-----------------------|---------------------|------------------|-----------------|
+ | log line (168 B) | 41 µs / call | 71 µs / call | **1.7× faster** |
+ | JSON blob (~580 B) | 81 µs / call | 132 µs / call | **1.6× faster** |
+ | 8 log lines (1.3 KB) | 175 µs / call | 399 µs / call | **2.3× faster** |
+ | 100 log lines (17 KB) | 2.0 ms / call | 4.6 ms / call | **2.3× faster** |
+ | 1 MB log | 138 ms / call | 294 ms / call | **2.1× faster** |
+ | 10 MB log | 1.44 s / call | — | 6.9 MB/s |
+
+ All payload sizes pass a correctness check (redaction count matches pure-Ruby `gsub`).
+ The previous engine (per-pattern `regexec`) was **4.25× slower** than pure Ruby on the
+ 1 MB payload — a ~9× swing. Old numbers are in git history (`CHANGELOG.md` [0.9.0]).
+
  ## How it works
 
  1. At load time, `Init_data_redactor` compiles all 85 regex patterns once using `regcomp` (POSIX ERE) and stores them as static `regex_t` structs. Patterns marked as boundary-wrapped are expanded with `wrap_boundary()` before compilation.
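The ratios in the Performance table of the hunk above can be rechecked from its per-call timings. The following is a small arithmetic sketch only: the numbers are copied from the table, while `TIMINGS`, `speedups`, `throughput_10mb`, and `swing` are names made up for this check.

```ruby
# [v19 engine seconds, pure-Ruby gsub seconds] per call, copied from the table.
TIMINGS = {
  "log line (168 B)"      => [41e-6, 71e-6],
  "JSON blob (~580 B)"    => [81e-6, 132e-6],
  "8 log lines (1.3 KB)"  => [175e-6, 399e-6],
  "100 log lines (17 KB)" => [2.0e-3, 4.6e-3],
  "1 MB log"              => [138e-3, 294e-3],
}.freeze

# Speedup = pure-Ruby time divided by engine time, rounded as in the table.
speedups = TIMINGS.transform_values { |(engine, ruby)| (ruby / engine).round(1) }

# 10 MB processed in 1.44 s per call.
throughput_10mb = (10 / 1.44).round(1)

# "4.25x slower" before and roughly 2.1x faster now gives the quoted swing.
swing = (4.25 * (294.0 / 138.0)).round(1)

speedups.each { |name, ratio| puts format("%-22s %.1fx faster", name, ratio) }
puts format("10 MB log: %.1f MB/s, swing on 1 MB: about %.1fx", throughput_10mb, swing)
```

The computed values agree with the Ratio column and with the "~9× swing" remark when rounded to the precision the table uses.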
@@ -490,3 +540,4 @@ Released under the [MIT License](LICENSE).
  - **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
  - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
  - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
+ - **Performance is currently slower than pure-Ruby `gsub`.** A May 2026 investigation found the C extension is 3–5× slower than a pure-Ruby `gsub` loop running the same 88 patterns, across input sizes from 168 bytes to 1 MB. The root cause is glibc's POSIX `regexec()`: each call allocates an O(input-length) state buffer before any matching begins, and the gem calls it once per pattern in sequence. Ruby's Onigmo engine wins by using a built-in Boyer-Moore literal pre-filter that this gem can only approximate. Three perf fixes have shipped (buffer-sizing in `replace_all_matches`, a `strstr` literal pre-filter, and input chunking for large payloads), which gave ~25-30% improvement and made scaling linear, but the absolute gap remains. Use the gem on small payloads where the absolute latency is still acceptable (< 1 ms for typical log lines); for high-throughput pipelines, hold off until the next major release. See `docs/standalone_matcher_design.md` for the long-term plan.
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: data_redactor
  version: !ruby/object:Gem::Version
-   version: 0.9.0
+   version: 0.10.0
  platform: x86_64-darwin
  authors:
  - Daniele Frisanco
@@ -65,6 +65,34 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '2.0'
+ - !ruby/object:Gem::Dependency
+   name: benchmark-ips
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.13'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.13'
+ - !ruby/object:Gem::Dependency
+   name: benchmark-memory
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.2'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.2'
  description: A Ruby gem with a C extension for high-performance scanning and redaction
    of 85 sensitive patterns — API keys, tokens, credentials, IBANs, national IDs, emails,
    phone numbers, and PII from 15+ countries. Optional Logger formatter, Rails filter_parameters