data_redactor 0.7.2-x86_64-linux

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 308a3387249fae55cdff4392a2655072450589df3a07409a3623156eb9c1c4fd
+   data.tar.gz: 2cbdf4a55c7648ae74245b11c9ee298eef10116e154f2fb9e27926546c909182
+ SHA512:
+   metadata.gz: 2e4a8c4ce276ccbd236e023eca96640b293a644ed3b0c4c2a8d844f0151f0b9befb18bdbd2221ea4ca4524d8dc28e376651b742eb9d6488fcdbb71a267672f95
+   data.tar.gz: 7079f9aeb48c18264a2d74b4d5c50383f3b38c0227b39cb336fab94dbac28db4147bb8963d7d6f2d5a1e04813c5e7ef71167ee906f0bd3eaad0613be39b3d882
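The digests in `checksums.yaml` can be checked against the `metadata.gz` and `data.tar.gz` members once a `.gem` archive has been unpacked (a `.gem` file is a plain tar archive). The following is a minimal sketch of that comparison using Ruby's standard `Digest` library; the helper name `sha256_matches?` and the throwaway temp file are illustrative assumptions, not part of the gem.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's SHA256 digest equals the
# expected hex string (as listed under `SHA256:` in checksums.yaml).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Demo with a throwaway file standing in for metadata.gz / data.tar.gz.
Tempfile.create("artifact") do |f|
  f.write("hello")
  f.flush
  expected = Digest::SHA256.hexdigest("hello")
  puts sha256_matches?(f.path, expected)  # digest agrees
  puts sha256_matches?(f.path, "0" * 64)  # digest differs
end
```

The same shape works for the `SHA512:` entries by swapping in `Digest::SHA512`.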
data/CHANGELOG.md ADDED
@@ -0,0 +1,171 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ## [0.7.2] - 2026-05-09
11
+
12
+ **Supersedes 0.7.1, which has been yanked from RubyGems.**
13
+
14
+ 0.7.1 had a release pipeline bug: the source gem and the precompiled native
15
+ gems were published by two independent workflows, with no gating between
16
+ them. When the native-binary builds failed (`oxidize-rb/actions/cross-gem`
17
+ couldn't pull `rbsys/aarch64-linux:0.9.128` from Docker Hub), the source
18
+ gem still published — leaving users with release notes that promised
19
+ precompiled binaries that didn't exist on RubyGems. 0.7.2 ships the same
20
+ features as 0.7.1 plus the pipeline fix.
21
+
22
+ ### Changed
23
+ - **Atomic release pipeline.** Source-gem publishing moved out of `ci.yml`
24
+ and into `release-binaries.yml`, alongside the native-gem builds. The
25
+ publish job now `needs: [build-source, build-native]`; if any native
26
+ platform fails to build, **nothing publishes**. This guarantees the
27
+ RubyGems release matches what the GitHub release notes promise.
28
+ - **Direct `rake-compiler-dock` invocation in CI** instead of the
29
+ `oxidize-rb/actions/cross-gem` action. Same code path as `rake gem:all`
30
+ locally and the existing PR-time smoke test in `ci.yml`. Uses
31
+ `ghcr.io/rake-compiler/*` images (no Docker Hub rate limits).
32
+
33
+ ### Fixed
34
+ - All 6 precompiled native gems now actually publish on release — the
35
+ `aarch64-linux` variant in particular was previously failing.
36
+
37
+ ### Documentation
38
+ - README installation section rewritten around the user's question
39
+ ("what changes for me?"). Adds explicit Docker / Alpine guidance and a
40
+ heads-up about `bundle lock --add-platform` for cross-platform deploys.
41
+
42
+ ## [0.7.1] - 2026-05-09 [YANKED]
43
+
44
+ ### Added
45
+ - **Precompiled native gems** for the most common platforms — installing
46
+ `data_redactor` no longer requires a C toolchain on these targets:
47
+ - `x86_64-linux`, `aarch64-linux` (glibc)
48
+ - `x86_64-linux-musl`, `aarch64-linux-musl` (Alpine)
49
+ - `x86_64-darwin`, `arm64-darwin` (macOS Intel + Apple Silicon)
50
+   Each native gem ships precompiled extension binaries (`.so`; `.bundle` on macOS) for Ruby 3.1, 3.2, 3.3, and 3.4.
51
+ Bundler/RubyGems automatically picks the right gem for the host; users on
52
+ any other platform fall back to the source gem and compile as before.
53
+ - `rake gem:all` task — builds every native gem locally via `rake-compiler-dock`
54
+ (requires Docker). Single command to regenerate the full release matrix.
55
+ - `.github/workflows/release-binaries.yml` — builds & publishes all native
56
+ gems on every GitHub release. Also exposes `workflow_dispatch` so a
57
+ maintainer can rebuild any past release without cutting a new tag.
58
+
59
+ ### Changed
60
+ - CI test matrix now includes Ruby 3.4 in addition to 3.1, 3.2, 3.3.
61
+ - Gemspec: added `rake-compiler-dock` as a development dependency. Source-only
62
+ gem size is unchanged — native gems strip `ext/` and the `extconf.rb`
63
+ extension hook so they only carry the prebuilt `.so` files.
64
+
65
+ ## [0.7.0] - 2026-05-08
66
+
67
+ ### Added
68
+ - **Rails / Rack / Logger integrations** under `lib/data_redactor/integrations/`. Soft-required — none are loaded by default; the gem still has zero runtime dependencies in the gemspec.
69
+ - `DataRedactor::Integrations::Logger` — drop-in `Logger::Formatter` that scrubs every emitted line, wraps an inner formatter (default `Logger::Formatter`), and preserves exception cause chains.
70
+ - `DataRedactor::Integrations::Rails.filter(...)` — returns a `(key, value)` proc for `Rails.application.config.filter_parameters`. Mutates String values in place via `String#replace`.
71
+ - `DataRedactor::Integrations::Rack` — middleware with selectable surfaces. `scrub:` accepts any subset of `[:body, :headers]` (default both). `:body` buffers the response and drops `Content-Length`; `:headers` scrubs sensitive response headers (`Set-Cookie`, `Authorization`, `X-Api-Key`, ...) and request headers in the env hash. Unknown surfaces raise `ArgumentError`.
72
+ - All three integrations forward `only:`, `except:`, `placeholder:` to `DataRedactor.redact`.
73
+
74
+ ### Changed
75
+ - Gemspec: added `rack` as a development dependency. No new runtime dependencies.
76
+
77
+ ## [0.6.1] - 2026-05-08
78
+
79
+ ### Added
80
+ - Six new distinctive-prefix API key patterns under the `:credentials` tag, exposed via `DataRedactor.pattern_names`:
81
+ - `anthropic_api_key` — `sk-ant-apiNN-...`
82
+ - `openai_project_api_key` — `sk-proj-...`
83
+ - `gitlab_pat` — `glpat-...`
84
+ - `digitalocean_pat` — `dop_v1_...`
85
+ - `databricks_api_token` — `dapi...`
86
+ - `sentry_dsn` — `https://KEY@oNNN.ingest.sentry.io/PID` (also matches the legacy `KEY:SECRET@` form)
87
+
88
+ ### Changed
89
+ - `NUM_PATTERNS` is now 85 (was 79). Built-in pattern indices in C have shifted accordingly; the public Ruby API and pattern names are stable.
90
+
91
+ ## [0.6.0] - 2026-05-08
92
+
93
+ ### Added
94
+ - **Per-pattern allow / deny via `only:` / `except:`.** Both kwargs now accept a mix of Symbols (tags) and Strings (pattern names from `DataRedactor.pattern_names`). They can be combined: `only: :contact, except: ["email"]` redacts every contact pattern except email. Mixed-list shapes like `only: [:credentials, "iban_de"]` also work. Precedence: `except:` always wins when the two overlap.
95
+ - `DataRedactor.pattern_names` — array of every known pattern name (built-ins + currently registered custom).
96
+ - `DataRedactor::BUILTIN_PATTERN_NAMES` and `DataRedactor::BUILTIN_PATTERN_TAG_BITS` constants (frozen) exposing the compiled-in pattern roster.
97
+ - `DataRedactor::UnknownPatternError` raised when a String passed to `only:`/`except:` does not match any known pattern.
98
+ - YARD docs deploy job in `.github/workflows/ci.yml` publishes `bundle exec yard doc` output to GitHub Pages on every push to `main`.
99
+
100
+ ### Changed
101
+ - **C entry-point signatures.** `_redact(text, ph_mode, ph_str, enable_bits)` and `_scan(text, enable_bits)` now take a per-pattern enable bit array (built by the Ruby wrapper from `only:`/`except:`) instead of a tag bitmask. The public `DataRedactor.redact` / `.scan` API is fully backward compatible — only the underscore-prefixed C boundary changed. Single-pass: filtering happens in C, no second pass through `_scan`.
102
+ - `only:` and `except:` may now be combined (previously raised `ArgumentError` if both were passed).
103
+ - **Internal: C extension split into focused modules.** `ext/data_redactor/data_redactor.c` was a single ~1000-line file; it is now a 60-line entry point plus `patterns.{c,h}`, `placeholder.{c,h}`, `redact.{c,h}`, `scan.{c,h}`, `custom_patterns.{c,h}`, and `tags.h`. `extconf.rb` now globs every `.c` in the extension directory via `$srcs`, so adding a new module needs no Makefile edits.
104
+ - **YARD inline docs** — every public method on `DataRedactor` now has `@param`/`@return`/`@raise` annotations (100% coverage); `.yardopts` configures markdown rendering with the README as the front page.
105
+
106
+ ### Documentation
107
+ - README: gem version / CI / license badges; new "Thread safety" section clarifying that `redact`/`scan` are thread-safe but `add_pattern`/`remove_pattern`/`clear_custom_patterns!` are not (register custom patterns once at boot).
108
+
109
+ ## [0.5.0] - 2026-05-02
110
+
111
+ ### Added
112
+ - `DataRedactor.scan(text, only:, except:)` — returns `{ redacted: String, matches: Array<Hash> }` where each match contains `:tag` (Symbol), `:name` (pattern name String), `:value` (matched text), `:start` (byte offset into original), `:length` (byte length). Accepts the same `only:`/`except:` tag filters as `redact`. Includes both built-in and custom pattern matches.
113
+ - `pattern_names[]` array in the C extension mapping each built-in pattern index to a stable snake_case name string (e.g. `"aws_access_key_id"`, `"email"`, `"iban_de"`).
114
+
115
+ ## [0.4.0] - 2026-05-02
116
+
117
+ ### Added
118
+ - `placeholder:` keyword argument on `DataRedactor.redact`.
119
+ - Plain string (default `"[REDACTED]"`): `placeholder: "***"`
120
+ - Tagged: `placeholder: :tagged` → `[REDACTED:CONTACT]`, `[REDACTED:CREDENTIALS]`, etc.
121
+ - Deterministic hash: `placeholder: :hash` → `[CONTACT_a3f9]` (4-hex djb2 suffix, same value always produces the same token — useful for correlating redactions across log lines).
122
+ - `PH_MODE_PLAIN`, `PH_MODE_TAGGED`, `PH_MODE_HASH` integer constants exposed from C.
123
+ - `DataRedactor::PLACEHOLDER_DEFAULT` constant (`"[REDACTED]"`).
124
+
125
+ ### Changed
126
+ - `DataRedactor._redact` now takes 4 arguments: `(text, mask, ph_mode, ph_str)`. The public `DataRedactor.redact` API is fully backward compatible.
127
+
128
+ ## [0.3.0] - 2026-05-02
129
+
130
+ ### Added
131
+ - User-supplied custom patterns via `DataRedactor.add_pattern(name:, regex:, tag: :custom, boundary: false)`.
132
+ - `DataRedactor.remove_pattern(name)` — remove a named custom pattern (returns `true`/`false`).
133
+ - `DataRedactor.custom_patterns` — list all registered custom patterns as an array of hashes.
134
+ - `DataRedactor.clear_custom_patterns!` — remove all custom patterns (useful in test suites).
135
+ - New `:custom` tag and `TAG_CUSTOM` bitmask constant for custom patterns. Works with `only:`/`except:`.
136
+ - `DataRedactor::InvalidPatternError` raised when a pattern fails `regcomp` or uses unsupported Ruby-only syntax (`\d`, `\s`, `\w`, `\b`, lookaround, non-greedy quantifiers, named groups).
137
+ - Capture groups rejected at registration when `boundary: true` (group indices would shift).
138
+ - Name collisions replace the existing pattern (the old compiled `regex_t` is freed).
139
+
140
+ ## [0.2.0] - 2026-05-02
141
+
142
+ ### Added
143
+ - Tag system: every pattern now belongs to one of 8 tags (`:credentials`, `:financial`, `:tax_id`, `:national_id`, `:contact`, `:network`, `:travel`, `:other`).
144
+ - `DataRedactor.redact(text, only: [...])` to redact only patterns in the given tags.
145
+ - `DataRedactor.redact(text, except: [...])` to redact every tag except the given ones.
146
+ - `DataRedactor.tags` returning the list of supported tags.
147
+ - `DataRedactor::TAGS` constant mapping tag symbols to bitmask values, plus `TAG_*` integer constants exposed from C for advanced use.
148
+ - `DataRedactor::UnknownTagError` raised when an unknown tag symbol is passed.
149
+
150
+ ### Changed
151
+ - The C-level entry point is now `DataRedactor._redact(text, mask)` (two-arg, mask is an integer bitmask). The public API is the Ruby wrapper `DataRedactor.redact`, which remains backward compatible: `redact(text)` with no keyword arguments runs every pattern exactly as before.
152
+
153
+ ## [0.1.0] - 2026-05-02
154
+
155
+ ### Added
156
+ - Initial release.
157
+ - C extension (`ext/data_redactor/data_redactor.c`) using POSIX `regex.h` for high-throughput scanning.
158
+ - 79 redaction patterns across cloud secrets, API keys, IBANs, national IDs, and PII for 15+ countries.
159
+ - Patterns ordered most-specific to most-generic to prevent shorter patterns from consuming parts of longer matches.
160
+ - Boundary-wrapping mechanism for generic digit/alphanum sequences so they only match at word boundaries.
161
+ - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
162
+ - RSpec suite with one example per pattern.
163
+
164
+ [Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...HEAD
165
+ [0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
166
+ [0.7.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.0...v0.7.1
167
+ [0.7.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.6.1...v0.7.0
168
+ [0.6.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.6.0...v0.6.1
169
+ [0.6.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.5.0...v0.6.0
170
+ [0.2.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.1.0...v0.2.0
171
+ [0.1.0]: https://github.com/danielefrisanco/data_redactor/releases/tag/v0.1.0
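The `only:` / `except:` precedence described in the 0.6.0 entry can be modelled in a few lines of plain Ruby. This is a sketch of the stated rule only (a pattern is enabled when it matches `only:`, or `only:` is absent, and it does not match `except:`); the `PATTERNS` roster, the `"phone"` name, and the helper names are hypothetical and do not reflect how the C extension builds its enable bits.

```ruby
# Hypothetical pattern roster (name => tag); not the gem's real table.
PATTERNS = {
  "email"             => :contact,
  "phone"             => :contact,
  "aws_access_key_id" => :credentials
}.freeze

# True when the pattern is selected by a mixed Symbol/String filter:
# Symbols are compared with the tag, Strings with the pattern name.
def selected?(filter, name, tag)
  Array(filter).any? { |f| f.is_a?(Symbol) ? f == tag : f == name }
end

# Enabled iff (only is nil OR pattern matches only) AND NOT (pattern matches except).
def enabled_patterns(only: nil, except: nil)
  PATTERNS.select do |name, tag|
    (only.nil? || selected?(only, name, tag)) && !selected?(except, name, tag)
  end.keys
end

p enabled_patterns(only: :contact, except: ["email"])  # => ["phone"]
p enabled_patterns(only: :contact, except: :contact)   # => []
p enabled_patterns(only: [:credentials, "email"])      # => ["email", "aws_access_key_id"]
```

Because the `except:` check is applied last, an overlap between the two filters always resolves to "not redacted".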
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Daniele Frisanco
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,42 @@
+ require "logger"
+ require "data_redactor"
+
+ module DataRedactor
+   module Integrations
+     # Logger formatter that lets an inner formatter render each log line and
+     # then runs the rendered line through {DataRedactor.redact}.
+     #
+     # @example Drop-in replacement for Ruby's default formatter
+     #   logger = Logger.new($stdout)
+     #   logger.formatter = DataRedactor::Integrations::Logger.new
+     #   logger.info("Auth failed for user alice@example.com")
+     #   # => "I, [...] -- : Auth failed for user [REDACTED]"
+     #
+     # @example Wrapping an existing formatter (e.g. Rails JSON logger)
+     #   logger.formatter = DataRedactor::Integrations::Logger.new(
+     #     inner: Rails.logger.formatter,
+     #     only: [:credentials, :contact]
+     #   )
+     class Logger
+       # @param inner [#call, nil] formatter to wrap. Defaults to a new {::Logger::Formatter}.
+       # @param only [Symbol, String, Array, nil] forwarded to {DataRedactor.redact}.
+       # @param except [Symbol, String, Array, nil] forwarded to {DataRedactor.redact}.
+       # @param placeholder forwarded to {DataRedactor.redact}.
+       def initialize(inner: ::Logger::Formatter.new, only: nil, except: nil, placeholder: DataRedactor::PLACEHOLDER_DEFAULT)
+         # Fall back to the default formatter when nil is passed explicitly.
+         @inner = inner || ::Logger::Formatter.new
+         @only = only
+         @except = except
+         @placeholder = placeholder
+       end
+
+       # Formatter contract: called by Logger for every emitted line.
+       # Lets the inner formatter render whatever it likes (string, exception,
+       # arbitrary object) and scrubs the resulting line in one pass. The
+       # original message object is handed to the inner formatter untouched, so
+       # an exception's cause chain is still visible to it.
+       def call(severity, time, progname, msg)
+         line = @inner.call(severity, time, progname, msg)
+         DataRedactor.redact(line.to_s, only: @only, except: @except, placeholder: @placeholder)
+       end
+     end
+   end
+ end
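The wrap-then-scrub flow of the formatter above can be tried without the compiled extension. The sketch below is an assumption-laden stand-in: `REDACT` is a hypothetical lambda that only masks email-like tokens (the real gem runs its full pattern set), and `ScrubbingFormatter` is an illustrative name, not a class shipped by the gem.

```ruby
require "logger"
require "stringio"

# Stand-in for DataRedactor.redact (assumption): masks email-like tokens only.
REDACT = lambda do |line|
  line.gsub(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, "[REDACTED]")
end

# Hypothetical formatter: the inner formatter renders first, then the
# rendered line is scrubbed in a single pass.
class ScrubbingFormatter
  def initialize(inner: ::Logger::Formatter.new, redactor: REDACT)
    @inner = inner
    @redactor = redactor
  end

  def call(severity, time, progname, msg)
    @redactor.call(@inner.call(severity, time, progname, msg).to_s)
  end
end

io = StringIO.new
logger = Logger.new(io)
logger.formatter = ScrubbingFormatter.new
logger.info("Auth failed for user alice@example.com")
puts io.string
```

Scrubbing after rendering means anything the inner formatter adds to the line (progname, backtraces, JSON fields) is covered as well.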
@@ -0,0 +1,121 @@
1
+ require "data_redactor"
2
+
3
+ module DataRedactor
4
+ module Integrations
5
+ # Rack middleware that scrubs sensitive data from selectable surfaces of
6
+ # the response (and request headers, for downstream loggers to see scrubbed
7
+ # values).
8
+ #
9
+ # @example Both surfaces (default)
10
+ # use DataRedactor::Integrations::Rack, scrub: [:body, :headers]
11
+ #
12
+ # @example Headers only — leave the response body untouched
13
+ # use DataRedactor::Integrations::Rack, scrub: [:headers]
14
+ #
15
+ # ### Surfaces
16
+ #
17
+ # - `:body` — wraps the response body so emitted bytes pass through
18
+     #   {DataRedactor.redact} before reaching the client. The whole body is
19
+     #   buffered in memory to do this, and the `Content-Length` header is
20
+     #   dropped because the redacted body may have a different byte length.
21
+ # - `:headers` — scrubs response headers in place. Sensitive request
22
+ # headers (`Authorization`, `Cookie`, `X-Api-Key`, etc.) are redacted in
23
+ # the env hash so any downstream middleware that logs them sees scrubbed
24
+ # values.
25
+ class Rack
26
+ DEFAULT_SCRUB = [:body, :headers].freeze
27
+
28
+ SENSITIVE_REQUEST_HEADERS = %w[
29
+ HTTP_AUTHORIZATION
30
+ HTTP_PROXY_AUTHORIZATION
31
+ HTTP_COOKIE
32
+ HTTP_X_API_KEY
33
+ HTTP_X_AUTH_TOKEN
34
+ HTTP_X_ACCESS_TOKEN
35
+ ].freeze
36
+
37
+ SENSITIVE_RESPONSE_HEADERS = %w[
38
+ Set-Cookie
39
+ Authorization
40
+ X-Api-Key
41
+ X-Auth-Token
42
+ X-Access-Token
43
+ ].freeze
44
+
45
+ # @param app [#call] the Rack app
46
+ # @param scrub [Array<Symbol>] which surfaces to redact. Subset of
47
+ # `[:body, :headers]`. Defaults to `[:body, :headers]`.
48
+ # @param only forwarded to {DataRedactor.redact}
49
+ # @param except forwarded to {DataRedactor.redact}
50
+ # @param placeholder forwarded to {DataRedactor.redact}
51
+ def initialize(app, scrub: DEFAULT_SCRUB, only: nil, except: nil, placeholder: DataRedactor::PLACEHOLDER_DEFAULT)
52
+ @app = app
53
+ @scrub = Array(scrub).map(&:to_sym)
54
+ unknown = @scrub - [:body, :headers]
55
+ unless unknown.empty?
56
+ raise ArgumentError, "unknown scrub surface(s) #{unknown.inspect}; valid: [:body, :headers]"
57
+ end
58
+ @only = only
59
+ @except = except
60
+ @placeholder = placeholder
61
+ end
62
+
63
+ def call(env)
64
+ scrub_request_headers(env) if @scrub.include?(:headers)
65
+ status, headers, body = @app.call(env)
66
+ headers = scrub_response_headers(headers) if @scrub.include?(:headers)
67
+ if @scrub.include?(:body)
68
+ body, headers = wrap_body(body, headers)
69
+ end
70
+ [status, headers, body]
71
+ end
72
+
73
+ private
74
+
75
+ def redact(s)
76
+ DataRedactor.redact(s, only: @only, except: @except, placeholder: @placeholder)
77
+ end
78
+
79
+ def scrub_request_headers(env)
80
+ SENSITIVE_REQUEST_HEADERS.each do |key|
81
+ value = env[key]
82
+ env[key] = redact(value) if value.is_a?(String) && !value.empty?
83
+ end
84
+ end
85
+
86
+ def scrub_response_headers(headers)
87
+ # Rack 3 uses lower-case header names; Rack 2 uses Capitalized.
88
+ # Match case-insensitively against our known list.
89
+ sensitive_lc = SENSITIVE_RESPONSE_HEADERS.map(&:downcase)
90
+ headers.each_with_object({}) do |(key, value), out|
91
+ if sensitive_lc.include?(key.to_s.downcase)
92
+ out[key] = scrub_header_value(value)
93
+ else
94
+ out[key] = value
95
+ end
96
+ end
97
+ end
98
+
99
+ def scrub_header_value(value)
100
+ case value
101
+ when String then redact(value)
102
+ when Array then value.map { |v| v.is_a?(String) ? redact(v) : v }
103
+ else value
104
+ end
105
+ end
106
+
107
+ def wrap_body(body, headers)
108
+ # Buffer the body, redact, return as a single-element array.
109
+ # Stripping Content-Length because the redacted body may differ in
110
+ # byte length; downstream servers will recompute or chunk-encode.
111
+ buffered = +""
112
+ body.each { |chunk| buffered << chunk.to_s }
113
+ body.close if body.respond_to?(:close)
114
+
115
+ scrubbed = redact(buffered)
116
+ new_headers = headers.reject { |k, _| k.to_s.downcase == "content-length" }
117
+ [[scrubbed], new_headers]
118
+ end
119
+ end
120
+ end
121
+ end
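One reason the `:body` surface buffers the whole response before redacting is that a sensitive value may be split across chunks. The sketch below illustrates that with plain Ruby objects that follow the Rack calling convention (`call(env)` returning `[status, headers, body]`); it does not load the `rack` gem, and `SCRUB` and `BodyScrubber` are hypothetical stand-ins rather than the gem's middleware.

```ruby
# Stand-in for DataRedactor.redact (assumption): masks email-like tokens.
SCRUB = ->(s) { s.gsub(/[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}/, "[REDACTED]") }

# Hypothetical middleware mirroring the buffering strategy shown above.
class BodyScrubber
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    buffered = +""
    body.each { |chunk| buffered << chunk.to_s }  # join chunks before matching
    body.close if body.respond_to?(:close)
    # The length may change after redaction, so the stale header is dropped.
    headers = headers.reject { |k, _| k.to_s.downcase == "content-length" }
    [status, headers, [SCRUB.call(buffered)]]
  end
end

# A toy app whose body splits an email address across two chunks.
app = lambda do |_env|
  [200,
   { "content-type" => "text/plain", "content-length" => "27" },
   ["mail bob@exa", "mple.com please"]]
end

status, headers, body = BodyScrubber.new(app).call({})
p status   # => 200
p headers  # => {"content-type"=>"text/plain"}
p body     # => ["mail [REDACTED] please"]
```

Redacting chunk by chunk would miss the address here; the trade-off is that streaming responses are held in memory until complete.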
@@ -0,0 +1,38 @@
+ require "data_redactor"
+
+ module DataRedactor
+   module Integrations
+     # Rails `config.filter_parameters` adapter. Returns a `Proc` that Rails
+     # invokes with `(key, value)` for every leaf in the params tree; we redact
+     # the value in place when it is a String.
+     #
+     # @example
+     #   # config/initializers/filter_parameter_logging.rb
+     #   require "data_redactor/integrations/rails"
+     #   Rails.application.config.filter_parameters += [
+     #     DataRedactor::Integrations::Rails.filter
+     #   ]
+     #
+     # @example Restricting to specific tags
+     #   Rails.application.config.filter_parameters += [
+     #     DataRedactor::Integrations::Rails.filter(only: [:credentials, :financial])
+     #   ]
+     module Rails
+       module_function
+
+       # @param only forwarded to {DataRedactor.redact}
+       # @param except forwarded to {DataRedactor.redact}
+       # @param placeholder forwarded to {DataRedactor.redact}
+       # @return [Proc] a `(key, value)` proc compatible with `config.filter_parameters`
+       def filter(only: nil, except: nil, placeholder: DataRedactor::PLACEHOLDER_DEFAULT)
+         lambda do |_key, value|
+           next unless value.is_a?(String)
+           # Rails' parameter filter expects the proc to mutate the value in
+           # place. Reassigning `value` would only rebind the block-local
+           # variable, so use String#replace.
+           redacted = DataRedactor.redact(value, only: only, except: except, placeholder: placeholder)
+           value.replace(redacted) if redacted != value
+         end
+       end
+     end
+   end
+ end
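The reason the filter uses `String#replace` rather than assignment can be shown in isolation. In the sketch below, `MASK` is a hypothetical stand-in for `DataRedactor.redact` that only masks runs of six or more digits, and the two lambdas are illustrative: one rebinds its block parameter, the other mutates the String object the caller still holds.

```ruby
# Stand-in for DataRedactor.redact (assumption): masks runs of 6+ digits.
MASK = ->(s) { s.gsub(/[0-9]{6,}/, "[REDACTED]") }

# Rebinds the block-local variable only; the caller's String is untouched.
REASSIGNING = lambda do |_key, value|
  value = MASK.call(value) if value.is_a?(String)
  value
end

# Mutates the very String object the caller passed in.
MUTATING = lambda do |_key, value|
  value.replace(MASK.call(value)) if value.is_a?(String)
end

a = +"card 4111111111111111"
b = +"card 4111111111111111"
REASSIGNING.call("note", a)
MUTATING.call("note", b)
puts a  # card 4111111111111111
puts b  # card [REDACTED]
```

Note that `String#replace` raises `FrozenError` on frozen strings, so this approach assumes the values handed to the proc are mutable.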
@@ -0,0 +1,4 @@
+ module DataRedactor
+   # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
+   VERSION = "0.7.2"
+ end
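The changelog and the main module's documentation describe the `:hash` placeholder as a tag name plus a "4-hex djb2" suffix. As a rough illustration of how such a deterministic token could be derived, the sketch below implements the classic djb2 string hash; how the gem truncates the hash to four hex digits is not shown in this diff, so keeping the low 16 bits is an assumption, and `djb2` / `hash_token` are hypothetical helper names.

```ruby
# Classic djb2: start at 5381, then hash = hash * 33 + byte, kept to 32 bits.
def djb2(str)
  str.each_byte.reduce(5381) { |h, b| ((h << 5) + h + b) & 0xFFFFFFFF }
end

# Hypothetical token builder. Assumption: the 4 hex digits are the low
# 16 bits of the hash; the gem's C code may truncate differently.
def hash_token(tag, value)
  format("[%s_%04x]", tag.to_s.upcase, djb2(value) & 0xFFFF)
end

puts hash_token(:contact, "alice@example.com")
puts hash_token(:contact, "alice@example.com")  # same input, same token
puts hash_token(:contact, "bob@example.com")
```

With only 16 bits in the suffix, distinct values can collide, so tokens of this kind are suited to correlating log lines rather than to uniquely identifying values.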
@@ -0,0 +1,347 @@
1
+ require "set"
2
+ require_relative "data_redactor/version"
3
+ require_relative "data_redactor/data_redactor" # loads the compiled .so
4
+
5
+ # High-performance regex-based redactor for sensitive data.
6
+ #
7
+ # DataRedactor scans text for sensitive patterns (API keys, IBANs, national
8
+ # IDs, emails, phone numbers, etc.) and replaces matches with a configurable
9
+ # placeholder. The matching is done by a C extension backed by POSIX
10
+ # +regex.h+, so it is fast enough to run inline on large payloads.
11
+ #
12
+ # @example Basic redaction
13
+ # DataRedactor.redact("key is AKIAIOSFODNN7EXAMPLE")
14
+ # # => "key is [REDACTED]"
15
+ #
16
+ # @example Filter by tag or pattern name
17
+ # DataRedactor.redact(text, only: :credentials)
18
+ # DataRedactor.redact(text, except: [:contact, :network])
19
+ # DataRedactor.redact(text, only: :contact, except: ["email"])
20
+ # DataRedactor.redact(text, only: ["aws_access_key_id"])
21
+ #
22
+ # @example Custom placeholder
23
+ # DataRedactor.redact(text, placeholder: "***")
24
+ # DataRedactor.redact(text, placeholder: :tagged) # => "[REDACTED:CONTACT]"
25
+ # DataRedactor.redact(text, placeholder: :hash) # => "[CONTACT_a3f9]"
26
+ #
27
+ # @example Audit / dry-run
28
+ # DataRedactor.scan(text)
29
+ # # => { redacted: "...", matches: [{tag:, name:, value:, start:, length:}, ...] }
30
+ #
31
+ # @example Custom pattern
32
+ # DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
33
+ module DataRedactor
34
+ # Map of tag symbol to the integer bit used by the C layer.
35
+ #
36
+ # The keys of this hash are the canonical list of supported tags; pass any
37
+ # of them to {redact} or {scan} via +only:+ / +except:+.
38
+ #
39
+ # @return [Hash{Symbol => Integer}] frozen tag-to-bit map
40
+ TAGS = {
41
+ credentials: TAG_CREDENTIALS,
42
+ financial: TAG_FINANCIAL,
43
+ tax_id: TAG_TAX_ID,
44
+ national_id: TAG_NATIONAL_ID,
45
+ contact: TAG_CONTACT,
46
+ network: TAG_NETWORK,
47
+ travel: TAG_TRAVEL,
48
+ other: TAG_OTHER,
49
+ custom: TAG_CUSTOM
50
+ }.freeze
51
+
52
+ # Raised when a tag symbol passed to +only:+ / +except:+ / +tag:+ is not in {TAGS}.
53
+ class UnknownTagError < ArgumentError; end
54
+
55
+ # Raised when a String passed via +only:+ / +except:+ does not match any
56
+ # registered pattern name. See {pattern_names}.
57
+ class UnknownPatternError < ArgumentError; end
58
+
59
+ # Raised by {add_pattern} when the supplied regex is not valid POSIX ERE,
60
+ # uses Ruby-only syntax (+\d+, +\s+, lookaround, non-greedy, etc.), or
61
+ # contains capture groups while +boundary: true+ is requested.
62
+ class InvalidPatternError < ArgumentError; end
63
+
64
+ # @api private
65
+ # Capture groups break boundary-wrapper group index assumptions ([1],[2],[3] shift).
66
+ CAPTURE_GROUP_RE = /(?<!\\)\((?!\?:)/.freeze
67
+
68
+ # @api private
69
+ # Ruby regex syntax that has no POSIX ERE equivalent.
70
+ RUBY_ONLY_SYNTAX_RE = /\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/.freeze
71
+
72
+ # Default placeholder used when +placeholder:+ is not given to {redact}.
73
+ PLACEHOLDER_DEFAULT = "[REDACTED]"
74
+
75
+ module_function
76
+
77
+ # List of supported tag symbols.
78
+ #
79
+ # @return [Array<Symbol>] every key from {TAGS}
80
+ def tags
81
+ TAGS.keys
82
+ end
83
+
84
+ # List of every pattern name the redactor knows about.
85
+ #
86
+ # Includes the {BUILTIN_PATTERN_NAMES} plus any names registered via
87
+ # {add_pattern}. Useful for discovering what String values +only:+ /
88
+ # +except:+ accept, and for filtering / debugging.
89
+ #
90
+ # @return [Array<String>] built-in names first (in execution order),
91
+ # then custom names in registration order.
92
+ def pattern_names
93
+ BUILTIN_PATTERN_NAMES + _custom_patterns.map { |h| h[:name] }
94
+ end
95
+
96
+ # Redact every match of the configured patterns in +text+.
97
+ #
98
+ # +only:+ and +except:+ both accept a single value or an Array, mixing:
99
+ # - **Symbols** — tag names from {TAGS} (e.g. +:contact+, +:credentials+).
100
+ # - **Strings** — specific pattern names from {pattern_names} (e.g. +"email"+).
101
+ #
102
+ # They can be combined: +only: :contact, except: ["email"]+ means
103
+ # "redact every contact pattern except email." Symbols give you tag-level
104
+ # control; Strings give you per-pattern precision.
105
+ #
106
+ # **Precedence:** a pattern is redacted iff
107
+ # +(only is nil OR pattern matches only:)+ AND +(pattern does not match except:)+.
108
+ # +except:+ always wins over +only:+ when they overlap — e.g.
109
+ # +only: :contact, except: :contact+ produces an empty redaction (no-op),
110
+ # and +only: ["email"], except: ["email"]+ likewise skips email entirely.
111
+ #
112
+ # @param text [String] input string. Returned unchanged if no patterns match.
113
+ # @param only [Symbol, String, Array, nil] include only the given tag(s)
114
+ # and/or pattern name(s).
115
+ # @param except [Symbol, String, Array, nil] exclude the given tag(s)
116
+ # and/or pattern name(s). May be combined with +only:+.
117
+ # @param placeholder [String, :tagged, :hash] replacement strategy.
118
+ # A String is used verbatim. +:tagged+ produces +[REDACTED:TAGNAME]+.
119
+ # +:hash+ produces a deterministic +[TAGNAME_xxxx]+ token (4-hex djb2)
120
+ # so the same input value always maps to the same token.
121
+ # @return [String] a new string with every match replaced.
122
+ # @raise [ArgumentError] if +placeholder:+ is not a String/:tagged/:hash.
123
+ # @raise [UnknownTagError] if any Symbol in +only:+/+except:+ is not in {TAGS}.
124
+ # @raise [UnknownPatternError] if any String in +only:+/+except:+ is not in {pattern_names}.
125
+ #
126
+ # @example
127
+ # DataRedactor.redact("token sk_live_abc123", only: :credentials)
128
+ # DataRedactor.redact(text, only: [:contact, "aws_access_key_id"])
129
+ # DataRedactor.redact(text, only: :contact, except: ["email"])
130
+ def redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
131
+ enable_bits = build_enable_bits(only, except)
132
+ ph_mode, ph_str = resolve_placeholder(placeholder)
133
+ _redact(text, ph_mode, ph_str, enable_bits)
134
+ end
135
+
136
+ # Scan +text+ and return both the redacted string and per-match metadata.
137
+ #
138
+ # Useful for auditing, false-positive tuning, and compliance pipelines.
139
+ # +:start+ and +:length+ are byte offsets into the *original* string, so
140
+ # +text.byteslice(m[:start], m[:length]) == m[:value]+.
141
+ #
142
+ # @param text [String] input string.
143
+ # @param only [Symbol, String, Array, nil] same semantics as {redact}.
144
+ # @param except [Symbol, String, Array, nil] same semantics as {redact}.
145
+ # @return [Hash{Symbol => Object}] +{ redacted: String, matches:
146
+ # Array<Hash> }+. Each match hash has +:tag+ (Symbol), +:name+ (String),
147
+ # +:value+ (String), +:start+ (Integer byte offset), +:length+ (Integer).
148
+ # @raise [UnknownTagError] if any Symbol in +only:+/+except:+ is not in {TAGS}.
149
+ # @raise [UnknownPatternError] if any String in +only:+/+except:+ is not in {pattern_names}.
150
+ #
151
+ # @example
152
+ # DataRedactor.scan("user@example.com")
153
+ # # => { redacted: "[REDACTED]",
154
+ # # matches: [{tag: :contact, name: "email",
155
+ # # value: "user@example.com", start: 0, length: 16}] }
156
+ def scan(text, only: nil, except: nil)
157
+ enable_bits = build_enable_bits(only, except)
158
+ result = _scan(text, enable_bits)
159
+ # Normalise: convert tag string from C (uppercase) back to the Symbol used in TAGS
160
+ result[:matches].each { |m| m[:tag] = m[:tag].to_s.downcase.to_sym }
161
+ result
162
+ end
163
+
164
+ # Register a custom redaction pattern.
165
+ #
166
+ # Patterns must be valid POSIX ERE. Ruby-only syntax (+\d+, +\s+, +\w+,
167
+ # +\b+, lookaround, non-greedy quantifiers, named groups) is rejected
168
+ # at registration time, never at redaction time.
169
+ #
170
+ # If a pattern with the same +name+ is already registered, it is replaced
171
+ # (the old compiled +regex_t+ is freed).
172
+ #
173
+ # @param name [String] unique identifier for this pattern. Used by {remove_pattern}.
174
+ # @param regex [String, Regexp] POSIX ERE source. A Regexp is accepted
175
+ # for convenience but only its +.source+ is used; flags are ignored.
176
+ # @param tag [Symbol] one of {TAGS} keys. Defaults to +:custom+.
177
+ # @param boundary [Boolean] when true, the pattern is wrapped with
178
+ # +(^|[^0-9A-Za-z])(...)([^0-9A-Za-z]|$)+ so it only matches when not
179
+ # embedded in a longer alphanumeric token. Incompatible with patterns
180
+ # that contain capture groups.
181
+ # @return [Boolean] +true+ on success.
182
+ # @raise [ArgumentError] if +name+ is not a non-empty String, or +regex+
183
+ # is neither a String nor a Regexp.
184
+ # @raise [InvalidPatternError] if the pattern uses Ruby-only syntax,
185
+ # contains capture groups while +boundary: true+, or fails +regcomp+.
186
+ # @raise [UnknownTagError] if +tag+ is not in {TAGS}.
187
+ #
188
+ # @example
189
+ # DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
190
+ # DataRedactor.add_pattern(name: "internal_key",
191
+ # regex: /INT-[A-Z]{3}/,
192
+ # tag: :credentials,
193
+ # boundary: true)
194
+ def add_pattern(name:, regex:, tag: :custom, boundary: false)
195
+ raise ArgumentError, "name must be a non-empty String" \
196
+ unless name.is_a?(String) && !name.empty?
197
+
198
+ source = case regex
199
+ when String then regex
200
+ when Regexp then regex.source
201
+ else raise ArgumentError, "regex must be a String or Regexp, got #{regex.class}"
202
+ end
203
+
204
+ if source =~ RUBY_ONLY_SYNTAX_RE
205
+ raise InvalidPatternError,
206
+ "pattern #{name.inspect} uses Ruby-only syntax (#{$&.inspect}); " \
207
+ "use POSIX ERE — no \\d, \\s, \\w, \\b, lookaround, non-greedy, or named groups"
208
+ end
209
+
210
+ if boundary && source =~ CAPTURE_GROUP_RE
211
+ raise InvalidPatternError,
212
+ "pattern #{name.inspect} has capture groups and cannot use boundary: true"
213
+ end
214
+
215
+ tag_bit = TAGS[tag] or raise UnknownTagError,
216
+ "unknown tag #{tag.inspect}; valid tags: #{TAGS.keys.inspect}"
217
+
218
+ _add_pattern(name, source, tag_bit, boundary ? 1 : 0)
219
+ end
220
+
221
+ # Remove a previously registered custom pattern.
222
+ #
223
+ # @param name [String, Symbol] the +name+ used in {add_pattern}.
224
+ # @return [Boolean] +true+ if a pattern was removed, +false+ if no
225
+ # pattern with that name was registered.
226
+ def remove_pattern(name)
227
+ _remove_pattern(name.to_s)
228
+ end
229
+
230
+ # List every currently registered custom pattern.
231
+ #
232
+ # @return [Array<Hash{Symbol => Object}>] one hash per pattern with keys
233
+ # +:name+ (String), +:source+ (String — the POSIX ERE source),
234
+ # +:tag+ (Symbol), +:boundary+ (Boolean).
235
+ def custom_patterns
236
+ _custom_patterns.map do |h|
237
+ { name: h[:name], source: h[:source], tag: TAGS.key(h[:tag_bit]) || :custom,
238
+ boundary: h[:boundary] }
239
+ end
240
+ end
241
+
242
+ # Remove every registered custom pattern.
243
+ #
244
+ # Mostly useful in test suites that need a clean slate between examples.
245
+ #
246
+ # @return [nil]
247
+ def clear_custom_patterns!
248
+ _clear_custom_patterns
249
+ end
250
+
251
+ # @api private
252
+ # Split a mixed Symbol/String filter list into +(tag_bitmask, name_set)+.
253
+ #
254
+ # @param entries [nil, Symbol, String, Array]
255
+ # @return [Array(Integer, Set<String>)] tag bits OR-ed together; set of
256
+ # pattern-name Strings.
257
+ # @raise [UnknownTagError] for unknown Symbols.
258
+ # @raise [UnknownPatternError] for unknown Strings.
259
+ def split_filter(entries)
260
+ bits = 0
261
+ names = Set.new
262
+ return [bits, names] if entries.nil?
263
+ Array(entries).each do |e|
264
+ case e
265
+ when Symbol
266
+ bit = TAGS[e] or raise UnknownTagError,
267
+ "unknown tag #{e.inspect}; valid tags: #{TAGS.keys.inspect}"
268
+ bits |= bit
269
+ when String
270
+ unless pattern_names.include?(e)
271
+ raise UnknownPatternError,
272
+ "unknown pattern name #{e.inspect}; see DataRedactor.pattern_names"
273
+ end
274
+ names << e
275
+ else
276
+ raise ArgumentError,
277
+ "only:/except: entries must be a Symbol (tag) or String (pattern name), got #{e.inspect}"
278
+ end
279
+ end
280
+ [bits, names]
281
+ end
282
+
283
+ # @api private
284
+ # Build the per-pattern enable bit-list passed to the C layer.
285
+ #
286
+ # The list has one Integer (0 or 1) per pattern in execution order:
287
+ # built-ins first (NUM_PATTERNS entries), then currently registered custom
288
+ # patterns in registration order. C iterates by index and skips zeros.
289
+ #
290
+ # Semantics of +only:+ / +except:+ — both accept a mix of Symbols (tags)
291
+ # and Strings (pattern names):
292
+ # enabled(p) iff
293
+ # (only is nil OR p.tag ∈ only_tags OR p.name ∈ only_names)
294
+ # AND p.tag ∉ except_tags AND p.name ∉ except_names
295
+ #
296
+ # @return [Array<Integer>] same length as built-ins + customs.
297
+ def build_enable_bits(only, except)
298
+ only_bits, only_names = split_filter(only)
299
+ except_bits, except_names = split_filter(except)
300
+ only_present = !only.nil?
301
+
302
+ bits = Array.new(BUILTIN_PATTERN_NAMES.length + _custom_patterns.length, 0)
303
+
304
+ BUILTIN_PATTERN_NAMES.each_with_index do |name, i|
305
+ tag_bit = BUILTIN_PATTERN_TAG_BITS[i]
306
+ bits[i] = 1 if pattern_enabled?(name, tag_bit, only_present,
307
+ only_bits, only_names,
308
+ except_bits, except_names)
309
+ end
310
+
311
+ _custom_patterns.each_with_index do |h, i|
312
+ bits[BUILTIN_PATTERN_NAMES.length + i] = 1 if pattern_enabled?(
313
+ h[:name], h[:tag_bit], only_present,
314
+ only_bits, only_names, except_bits, except_names)
315
+ end
316
+
317
+ bits
318
+ end
319
+
320
+ # @api private
321
+ def pattern_enabled?(name, tag_bit, only_present, only_bits, only_names,
322
+ except_bits, except_names)
323
+ return false if (tag_bit & except_bits) != 0
324
+ return false if except_names.include?(name)
325
+ return true unless only_present
326
+ return true if (tag_bit & only_bits) != 0
327
+ only_names.include?(name)
328
+ end
329
+
330
+ # @api private
331
+ # Translate the user-facing +placeholder:+ value into the +(mode_int, str)+
332
+ # pair the C layer expects.
333
+ #
334
+ # @param placeholder [String, :tagged, :hash]
335
+ # @return [Array(Integer, String)]
336
+ # @raise [ArgumentError] if +placeholder+ is none of the accepted values.
337
+ def resolve_placeholder(placeholder)
338
+ case placeholder
339
+ when :tagged then [PH_MODE_TAGGED, ""]
340
+ when :hash then [PH_MODE_HASH, ""]
341
+ when String then [PH_MODE_PLAIN, placeholder]
342
+ else
343
+ raise ArgumentError,
344
+ "placeholder must be a String, :tagged, or :hash — got #{placeholder.inspect}"
345
+ end
346
+ end
347
+ end
data/readme.md ADDED
@@ -0,0 +1,395 @@
1
+ # DataRedactor
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/data_redactor.svg)](https://rubygems.org/gems/data_redactor)
4
+ [![CI](https://github.com/danielefrisanco/data_redactor/actions/workflows/ci.yml/badge.svg)](https://github.com/danielefrisanco/data_redactor/actions/workflows/ci.yml)
5
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
6
+
7
+ A Ruby gem with a C extension for high-performance regex-based redaction of sensitive data from strings.
8
+
9
+ ## What it does
10
+
11
+ DataRedactor scans text for sensitive patterns and replaces matches with `[REDACTED]`. It uses a C extension backed by POSIX `regex.h` so the heavy lifting happens outside the Ruby VM, making it fast enough for large payloads.
12
+
13
+ ## Usage
14
+
15
+ ```ruby
16
+ require "data_redactor"
17
+
18
+ text = "User CF is RSSMRA85M01H501Z and key is AKIAIOSFODNN7EXAMPLE"
19
+ DataRedactor.redact(text)
20
+ # => "User CF is [REDACTED] and key is [REDACTED]"
21
+ ```
22
+
23
+ ### Filtering by tag or pattern name
24
+
25
+ `only:` and `except:` both accept a single value or an Array, mixing **Symbols** (tag names) and **Strings** (specific pattern names).
26
+
27
+ ```ruby
28
+ DataRedactor.tags
29
+ # => [:credentials, :financial, :tax_id, :national_id, :contact, :network, :travel, :other, :custom]
30
+
31
+ DataRedactor.pattern_names
32
+ # => ["aws_s3_presigned_url", "aws_access_key_id", "email", "phone_e164", "ipv4", ...]
33
+
34
+ # Tag-level filtering
35
+ DataRedactor.redact(text, only: [:credentials])
36
+ DataRedactor.redact(text, except: :contact)
37
+
38
+ # Single specific pattern
39
+ DataRedactor.redact(text, only: ["aws_access_key_id"])
40
+
41
+ # Mix — every credentials pattern PLUS aws_access_key_id (even if it lived in another tag)
42
+ DataRedactor.redact(text, only: [:credentials, "aws_access_key_id"])
43
+
44
+ # Combine — every contact pattern EXCEPT email
45
+ DataRedactor.redact(text, only: :contact, except: ["email"])
46
+ ```
47
+
48
+ **Precedence:** a pattern is redacted iff `(only is nil OR matches only:)` AND `(does not match except:)`. `except:` always wins when the two overlap, so `only: :contact, except: :contact` produces a no-op (everything is excluded).
49
+
50
+ **Errors:** an unknown tag Symbol raises `DataRedactor::UnknownTagError`; an unknown pattern name String raises `DataRedactor::UnknownPatternError`.
51
+
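The precedence rule can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's code path: the `Pattern` struct and the `enabled?` helper are hypothetical names, and tags are modelled as Symbols rather than the bitmask the C layer receives.

```ruby
# Hypothetical stand-in for a registered pattern: a name plus its tag.
Pattern = Struct.new(:name, :tag)

# Returns true when the pattern should run:
# (only is nil OR matches only:) AND (does not match except:).
def enabled?(pattern, only: nil, except: nil)
  only_list   = only.nil? ? nil : Array(only)
  except_list = Array(except)

  # Symbols are compared against the tag, Strings against the pattern name.
  matches = lambda do |list|
    list.any? { |e| e.is_a?(Symbol) ? e == pattern.tag : e == pattern.name }
  end

  return false if matches.call(except_list) # except: always wins
  only_list.nil? || matches.call(only_list)
end

email = Pattern.new("email", :contact)
phone = Pattern.new("phone_e164", :contact)

enabled?(email, only: :contact)                     # => true
enabled?(email, only: :contact, except: ["email"])  # => false
enabled?(phone, only: :contact, except: ["email"])  # => true
enabled?(email, only: :contact, except: :contact)   # => false
```

Checking `except:` first keeps the "except always wins" behaviour obvious: nothing in `only:` can re-enable a pattern that was excluded.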
52
+ ### Configurable placeholder
53
+
54
+ By default every match is replaced with `[REDACTED]`. Use the `placeholder:` keyword to change this:
55
+
56
+ ```ruby
57
+ # Plain string — any replacement text
58
+ DataRedactor.redact(text, placeholder: "***")
59
+ DataRedactor.redact(text, placeholder: "")
60
+
61
+ # Tagged — embeds the pattern's tag name so you know what was redacted
62
+ DataRedactor.redact(text, placeholder: :tagged)
63
+ # "user@example.com" → "[REDACTED:CONTACT]"
64
+ # "AKIAIOSFODNN7EXAMPLE" → "[REDACTED:CREDENTIALS]"
65
+ # "DE89370400440532013000" → "[REDACTED:FINANCIAL]"
66
+
67
+ # Hash — deterministic 4-hex suffix of the matched value
68
+ # Same value always produces the same token — useful for correlating
69
+ # redactions across log lines without leaking the original.
70
+ DataRedactor.redact(text, placeholder: :hash)
71
+ # "user@example.com" → "[CONTACT_3d7a]"
72
+ # "user@example.com" → "[CONTACT_3d7a]" (same every time)
73
+ # "other@example.com" → "[CONTACT_91fc]" (different value, different hash)
74
+ ```
75
+
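As a rough illustration of how a deterministic token like this can be derived, here is a sketch built on the classic djb2 string hash. The 32-bit mask, the use of the low 16 bits for the 4-hex suffix, and the helper names `djb2` and `hash_placeholder` are all assumptions made for this example; the gem's actual output for a given value may differ.

```ruby
# Classic djb2: start at 5381, then hash = hash * 33 + byte for each byte.
# The 32-bit mask is an assumption made to mimic fixed-width C arithmetic.
def djb2(value)
  value.each_byte.reduce(5381) { |hash, byte| ((hash * 33) + byte) & 0xFFFFFFFF }
end

# Hypothetical formatter: upper-cased tag name plus 4 hex digits.
def hash_placeholder(tag, value)
  format("[%s_%04x]", tag.to_s.upcase, djb2(value) & 0xFFFF)
end

a = hash_placeholder(:contact, "user@example.com")
b = hash_placeholder(:contact, "user@example.com")

a == b                                   # => true (same input, same token)
a.match?(/\A\[CONTACT_[0-9a-f]{4}\]\z/)  # => true
```

Four hex digits give only 65,536 possible suffixes, so a token of this shape is suited to correlating repeated values, not to identifying them uniquely.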
76
+ All three modes compose with `only:` and `except:`:
77
+
78
+ ```ruby
79
+ DataRedactor.redact(text, only: :contact, placeholder: :tagged)
80
+ ```
81
+
82
+ ### Scan / dry-run mode
83
+
84
+ `DataRedactor.scan` returns every match alongside the redacted string — useful for auditing, tuning false positives, and compliance pipelines:
85
+
86
+ ```ruby
87
+ result = DataRedactor.scan("User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1")
88
+ # => {
89
+ # redacted: "User [REDACTED] logged in from [REDACTED]",
90
+ # matches: [
91
+ # { tag: :credentials, name: "aws_access_key_id", value: "AKIAIOSFODNN7EXAMPLE", start: 5, length: 20 },
92
+ # { tag: :network, name: "ipv4", value: "192.168.1.1", start: 41, length: 11 }
93
+ # ]
94
+ # }
95
+
96
+ # :start and :length are byte offsets into the original string (original_text below is the String passed to scan)
97
+ m = result[:matches].first
98
+ original_text.byteslice(m[:start], m[:length]) # => "AKIAIOSFODNN7EXAMPLE"
99
+
100
+ # Accepts the same filters as redact (tags + specific pattern names)
101
+ DataRedactor.scan(text, only: :credentials)
102
+ DataRedactor.scan(text, except: :network)
103
+ DataRedactor.scan(text, only: :contact, except: ["email"])
104
+ ```
105
+
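Because `:start` and `:length` are byte offsets, they only line up with `String#byteslice`, not with `String#[]`, once the text contains multi-byte characters. The sketch below uses plain Ruby (no gem calls) to show the difference; the sample text is made up for illustration.

```ruby
text  = "Città: RSSMRA85M01H501Z"
value = "RSSMRA85M01H501Z"

char_start = text.index(value)             # character offset
byte_start = text[0, char_start].bytesize  # byte offset of the same position

char_start  # => 7
byte_start  # => 8 ("à" occupies two bytes in UTF-8)

text.byteslice(byte_start, value.bytesize)  # => "RSSMRA85M01H501Z"
text[byte_start, value.length]              # => "SSMRA85M01H501Z" (wrong: character indexing)
```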
106
+ ### Custom patterns
107
+
108
+ Teams often have internal IDs that the gem can't ship. Register them at boot:
109
+
110
+ ```ruby
111
+ # String (POSIX ERE) or Regexp — both accepted
112
+ DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
113
+ DataRedactor.add_pattern(name: "ticket_ref", regex: /TICKET-[A-Z]{2}[0-9]{4}/, boundary: true)
114
+
115
+ # Custom patterns are tagged :custom by default; pass any built-in tag to group differently
116
+ DataRedactor.add_pattern(name: "internal_key", regex: "INT-[A-Z]{3}", tag: :credentials)
117
+
118
+ DataRedactor.redact(text) # runs all patterns including custom
119
+ DataRedactor.redact(text, only: [:custom]) # only user patterns
120
+ DataRedactor.redact(text, only: [:custom, :credentials]) # mix
121
+
122
+ DataRedactor.custom_patterns # => [{name:, source:, tag:, boundary:}, ...]
123
+ DataRedactor.remove_pattern("employee_id")
124
+ DataRedactor.clear_custom_patterns! # mostly for test suites
125
+ ```
126
+
127
+ **Regex rules** — patterns must be POSIX ERE (the same engine used for built-ins). Not supported: `\d`, `\s`, `\w`, `\b`, lookahead/lookbehind, non-greedy quantifiers, named groups. Violations raise `DataRedactor::InvalidPatternError` at registration time, never at redaction time. Use `[0-9]` instead of `\d`, `[[:space:]]` instead of `\s`, etc.
128
+
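When porting an existing Ruby pattern, the simple shorthand classes can be rewritten mechanically. The helper below is a hypothetical sketch (`to_posix_ere` is not part of the gem): it covers only `\d`, `\s`, and `\w` outside bracket expressions, and it does not attempt `\b`, lookaround, non-greedy quantifiers, or named groups, which have no direct ERE equivalent.

```ruby
# Hypothetical mapping from Ruby shorthand classes to POSIX ERE equivalents.
POSIX_EQUIVALENTS = {
  "\\d" => "[0-9]",
  "\\s" => "[[:space:]]",
  "\\w" => "[0-9A-Za-z_]"
}.freeze

# Rewrites each shorthand in place. Caveat: a shorthand inside a bracket
# expression (e.g. [\d-]) or after an escaped backslash is not handled.
def to_posix_ere(source)
  source.gsub(/\\[dsw]/) { |shorthand| POSIX_EQUIVALENTS.fetch(shorthand) }
end

to_posix_ere('EMP-\d{6}')  # => "EMP-[0-9]{6}"
to_posix_ere('\w+\s\d+')   # => "[0-9A-Za-z_]+[[:space:]][0-9]+"
```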
129
+ **`boundary: true`** — wraps the pattern with `(^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$)` so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
130
+
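The effect of the wrapper can be tried out in plain Ruby. In this sketch Ruby's `Regexp` stands in for POSIX ERE, which is a reasonable approximation for a simple single-line pattern; the `wrap_boundary` method here is a Ruby illustration written for this example, not the C function of the same name.

```ruby
# Builds the wrapped source: boundary, pattern, boundary.
def wrap_boundary(source)
  "(^|[^0-9A-Za-z])(#{source})([^0-9A-Za-z]|$)"
end

wrapped = Regexp.new(wrap_boundary("EMP-[0-9]{6}"))

m = wrapped.match("id: EMP-123456, ok")
m[2]  # => "EMP-123456" (group 2 is the token itself)
m[1]  # => " "          (groups 1 and 3 are the boundary characters)
m[3]  # => ","

wrapped.match?("XEMP-123456")  # => false (preceded by a letter)
wrapped.match?("EMP-1234567")  # => false (followed by a digit)
```

The wrapper depends on the token always being group 2. A capture group inside the pattern would shift that numbering, which is a plausible reason for the restriction on capture groups.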
131
+ ## Integrations
132
+
133
+ Optional adapters for Logger, Rails, and Rack. None are loaded automatically — `require` only what you use, and the gem adds zero runtime dependencies in the gemspec.
134
+
135
+ ### Logger formatter
136
+
137
+ Drop-in `Logger::Formatter` replacement that scrubs every emitted line:
138
+
139
+ ```ruby
140
+ require "data_redactor/integrations/logger"
141
+
142
+ logger = Logger.new($stdout)
143
+ logger.formatter = DataRedactor::Integrations::Logger.new
144
+ logger.info("Auth failed for alice@example.com")
145
+ # => I, [...]  INFO -- : Auth failed for [REDACTED]
146
+ ```
147
+
148
+ Wraps an inner formatter (defaults to `Logger::Formatter`), so it composes with structured loggers. Forwards `only:`, `except:`, `placeholder:` to `DataRedactor.redact`. Exception messages and arbitrary objects are scrubbed too — the wrapped object is passed unchanged to the inner formatter so the exception cause chain is preserved; only the rendered string is redacted.
149
+
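The wrapping idea can be sketched with the standard library alone. `ScrubbingFormatter` and its `redactor:` argument are hypothetical names for this example; the lambda stands in for `DataRedactor.redact`, and the real adapter may be structured differently.

```ruby
require "logger"
require "stringio"

# Hypothetical wrapper: delegate formatting, then scrub the rendered line.
class ScrubbingFormatter
  def initialize(inner = Logger::Formatter.new, redactor:)
    @inner = inner
    @redactor = redactor
  end

  # Logger invokes formatters with (severity, time, progname, msg).
  # msg is handed to the inner formatter untouched; only its output is scrubbed.
  def call(severity, time, progname, msg)
    @redactor.call(@inner.call(severity, time, progname, msg))
  end
end

io = StringIO.new
logger = Logger.new(io)
logger.formatter = ScrubbingFormatter.new(
  redactor: ->(line) { line.gsub(/[^\s@]+@[^\s@]+/, "[REDACTED]") }
)
logger.info("Auth failed for alice@example.com")
io.string.end_with?("Auth failed for [REDACTED]\n")  # => true
```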
150
+ ### Rails `filter_parameters` adapter
151
+
152
+ ```ruby
153
+ # config/initializers/filter_parameter_logging.rb
154
+ require "data_redactor/integrations/rails"
155
+
156
+ Rails.application.config.filter_parameters += [
157
+ DataRedactor::Integrations::Rails.filter
158
+ ]
159
+ ```
160
+
161
+ Returns a `(key, value)` proc compatible with Rails' parameter filter. String values are mutated in place via `String#replace` so Rails sees the redacted value. Non-strings are left alone. Accepts the same `only:`/`except:`/`placeholder:` kwargs.
162
+
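A proc of this shape can be sketched without Rails. `build_filter` and the stand-in redactor are hypothetical; the sketch assumes, as the paragraph above describes, that the caller observes the in-place mutation and ignores the proc's return value.

```ruby
# Hypothetical builder; `redactor` stands in for DataRedactor.redact.
def build_filter(redactor)
  lambda do |_key, value|
    # Mutate in place so the caller sees the redacted text; skip non-Strings.
    value.replace(redactor.call(value)) if value.is_a?(String) && !value.frozen?
  end
end

filter = build_filter(->(s) { s.gsub(/[0-9]{6}/, "[REDACTED]") })

note = +"ticket EMP-123456"
filter.call("note", note)
note   # => "ticket EMP-[REDACTED]"

count = 42
filter.call("count", count)
count  # => 42 (non-strings are left alone)
```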
163
+ ### Rack middleware
164
+
165
+ ```ruby
166
+ # config.ru
167
+ require "data_redactor/integrations/rack"
168
+
169
+ use DataRedactor::Integrations::Rack, scrub: [:body, :headers]
170
+ run MyApp
171
+ ```
172
+
173
+ `scrub:` selects which surfaces to redact (default `[:body, :headers]`):
174
+
175
+ - **`:body`** — buffers the response body, runs `DataRedactor.redact` over it, returns it as a single chunk. Drops the `Content-Length` header so the server recomputes (the redacted body may differ in byte length).
176
+ - **`:headers`** — scrubs sensitive **response** headers (`Set-Cookie`, `Authorization`, `X-Api-Key`, `X-Auth-Token`, `X-Access-Token`) in place, and sensitive **request** headers (`HTTP_AUTHORIZATION`, `HTTP_PROXY_AUTHORIZATION`, `HTTP_COOKIE`, `HTTP_X_API_KEY`, `HTTP_X_AUTH_TOKEN`, `HTTP_X_ACCESS_TOKEN`) in the env hash so any downstream middleware that logs them sees redacted values.
177
+
178
+ Pass a subset (e.g. `scrub: [:headers]`) to opt out of body wrapping. Forwards `only:`/`except:`/`placeholder:` to `DataRedactor.redact`. Unknown surfaces raise `ArgumentError` at boot.
179
+
180
+ > **Body wrapping is buffering.** The middleware reads the entire response body into memory before scanning. For streaming endpoints (SSE, large file downloads, Rack::Hijack) use `scrub: [:headers]` and rely on the Logger formatter for application logs instead.
181
+
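To make the buffering cost concrete, here is a minimal Rack-style sketch that needs no `rack` gem, since a Rack app is any object responding to `call(env)`. `BufferingScrubber` and its `redactor:` argument are hypothetical names for this example; this is not the gem's middleware, only the general technique of buffering the body, redacting it, and dropping the stale length header.

```ruby
# Hypothetical middleware; `redactor` is any callable String -> String.
class BufferingScrubber
  def initialize(app, redactor:)
    @app = app
    @redactor = redactor
  end

  def call(env)
    status, headers, body = @app.call(env)

    buffered = +""
    body.each { |chunk| buffered << chunk } # the whole body is held in memory
    body.close if body.respond_to?(:close)

    # The redacted body may differ in byte length, so the old value is dropped.
    headers = headers.reject { |name, _| name.casecmp?("content-length") }

    [status, headers, [@redactor.call(buffered)]]
  end
end

app = lambda do |_env|
  [200, { "content-type" => "text/plain", "content-length" => "22" },
   ["mail ", "alice@example.com"]]
end
scrubber = BufferingScrubber.new(
  app, redactor: ->(s) { s.gsub(/[^\s@]+@[^\s@]+/, "[REDACTED]") }
)

status, headers, body = scrubber.call({})
body  # => ["mail [REDACTED]"]
```

Because every chunk is appended to `buffered` before anything is returned, a streaming response would be fully materialised in memory, which is the trade-off the note above warns about.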
182
+ ## Detected patterns (85 total)
183
+
184
+ The tables below are a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list; it stays in sync with the C extension automatically.
185
+
186
+ ### Cloud & API secrets
187
+
188
+ | # | Pattern | Example |
189
+ |---|---|---|
190
+ | — | AWS Access Key ID | `AKIAIOSFODNN7EXAMPLE` |
191
+ | — | AWS Secret Access Key | 40-character base64 string |
192
+ | — | Google API Key | `AIzaSyXXXX...` |
193
+ | — | GitHub Personal Access Token | `github_pat_XXXX...` |
194
+ | — | GitHub Classic PAT / OAuth | `ghp_XXXX...` / `gho_XXXX...` |
195
+ | — | Slack Webhook URL | `https://hooks.slack.com/services/T.../B.../...` |
196
+ | — | Stripe Secret Key | `sk_live_XXXX...` |
197
+ | — | Anthropic API Key | `sk-ant-api03-XXXX...` |
198
+ | — | OpenAI Project API Key | `sk-proj-XXXX...` |
199
+ | — | GitLab Personal Access Token | `glpat-XXXX...` |
200
+ | — | DigitalOcean PAT | `dop_v1_XXXX...` |
201
+ | — | Databricks API Token | `dapiXXXX...` |
202
+ | — | Sentry DSN | `https://KEY@oNNN.ingest.sentry.io/PID` |
203
+ | — | PEM Private Key header | `-----BEGIN RSA PRIVATE KEY-----` |
204
+ | — | Scaleway Access Key | `SCW12345ABCDE6789FGHIJ` |
205
+ | — | UUID v4 / Scaleway Secret Key | `550e8400-e29b-41d4-a716-446655440000` |
206
+
207
+ ### Tax codes & travel documents
208
+
209
+ | # | Pattern | Example |
210
+ |---|---|---|
211
+ | 2 | Italian Codice Fiscale (basic) | `RSSMRA85M01H501Z` |
212
+ | 3 | Passport — letter prefix + digits | `AB1234567` |
213
+ | 4 | Passport — 9 consecutive digits ¹ | `123456789` |
214
+ | 22 | Italian Codice Fiscale (omocodia) | `RSSMRALPMNLH5LMZ` |
215
+
216
+ ### Payment & network
217
+
218
+ | # | Pattern | Example |
219
+ |---|---|---|
220
+ | 11 | Credit card — Visa, Mastercard, Amex, Discover, JCB | `4111111111111111` |
221
+ | 12 | IPv4 address | `192.168.1.100` |
222
+
223
+ ### IBANs
224
+
225
+ | # | Country | Example |
226
+ |---|---|---|
227
+ | 10 | Italy | `IT60X0542811101000000123456` |
228
+ | 15 | France | `FR7630006000011234567890189` |
229
+ | 16 | Germany | `DE89370400440532013000` |
230
+ | 17 | Spain | `ES9121000418450200051332` |
231
+ | 18 | Netherlands | `NL91ABNA0417164300` |
232
+ | 19 | Belgium | `BE68539007547034` |
233
+ | 20 | Portugal | `PT50000201231234567890154` |
234
+ | 21 | Ireland | `IE29AIBK93115212345678` |
235
+ | 28 | Sweden | `SE4550000000058398257466` |
236
+ | 29 | Denmark | `DK5000400440116243` |
237
+ | 30 | Norway | `NO9386011117947` |
238
+ | 31 | Finland | `FI2112345600000785` |
239
+ | 37 | Poland | `PL61109010140000071219812874` |
240
+ | 38 | Austria | `AT611904300234573201` |
241
+ | 39 | Switzerland | `CH9300762011623852957` |
242
+ | 40 | Czechia | `CZ6508000000192000145399` |
243
+ | 41 | Hungary | `HU42117730161111101800000000` |
244
+ | 42 | Romania | `RO49AAAA1B31007593840000` |
245
+
246
+ ### National personal identifiers
247
+
248
+ | # | Country | Type | Example |
249
+ |---|---|---|---|
250
+ | 23 | France | NIR / Social Security ¹ | `185126203450342` |
251
+ | 24 | Spain | DNI ¹ | `12345678Z` |
252
+ | 25 | Spain | NIE | `X1234567L` |
253
+ | 26 | Netherlands | BSN ¹ | `123456789` |
254
+ | 27 | Poland | PESEL ¹ | `85121612345` |
255
+ | 32 | Belgium | National Number ¹ | `85121612345` |
256
+ | 33 | Sweden | Personnummer ¹ | `850101-1234` |
257
+ | 34 | Denmark | CPR Number ¹ | `010185-1234` |
258
+ | 35 | Norway | Fødselsnummer ¹ | `01018512345` |
259
+ | 36 | Finland | HETU ¹ | `010185-123A` |
260
+ | 43 | Poland | PESEL (alt slot) ¹ | `90010112345` |
261
+ | 44 | Austria | Abgabenkontonummer ¹ | `123456789` |
262
+ | 45 | Switzerland | AHV Number ¹ | `756.1234.5678.90` |
263
+ | 46 | Czechia | Rodné číslo ¹ | `856121/1234` |
264
+ | 47 | Hungary | Tax ID ¹ | `8012345678` |
265
+ | 48 | Romania | CNP ¹ | `1850101123456` |
266
+
267
+ > ¹ **Word-boundary protected** — these patterns are wrapped with `(^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$)` at compile time so they do not fire when the digit sequence appears inside a longer alphanumeric token.
268
+
269
+ ## Directory structure
270
+
271
+ ```
272
+ redactor/
273
+ ├── data_redactor.gemspec
274
+ ├── Gemfile
275
+ ├── Rakefile
276
+ ├── lib/
277
+ │ ├── data_redactor.rb # Ruby entry point, loads the .so
278
+ │ └── data_redactor/
279
+ │ └── version.rb
280
+ ├── ext/
281
+ │ └── data_redactor/
282
+ │ ├── extconf.rb # Checks for C headers, generates Makefile (globs *.c)
283
+ │ ├── data_redactor.c # Entry point: Init_data_redactor only
284
+ │ ├── patterns.{c,h} # Built-in pattern table + compiled regex_t array
285
+ │ ├── placeholder.{c,h} # write_placeholder, djb2 hash, tag_name_for_bit
286
+ │ ├── redact.{c,h} # _redact + replace_all_matches + wrap_boundary
287
+ │ ├── scan.{c,h} # _scan + byte-offset replacement-log macros
288
+ │ ├── custom_patterns.{c,h} # Dynamic registry: add/remove/clear/list
289
+ │ └── tags.h # TAG_* bit constants
290
+ └── spec/
291
+ └── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
292
+ ```
293
+
294
+ ## Requirements
295
+
296
+ - Ruby >= 2.7
297
+ - A C compiler (`gcc` or `clang`) — only required when installing the source gem
298
+ - POSIX `regex.h` — only required when installing the source gem (standard on Linux and macOS)
299
+
300
+ ## Installation
301
+
302
+ ```ruby
303
+ # Gemfile
304
+ gem "data_redactor"
305
+ ```
306
+
307
+ ```bash
308
+ bundle install
309
+ ```
310
+
311
+ That's it — there is nothing extra to configure for precompiled binaries. Bundler/RubyGems looks at your platform and Ruby version and picks the right gem automatically.
312
+
313
+ ### What you'll see
314
+
315
+ - **On a supported platform** (Linux glibc/musl, macOS Intel/ARM): bundler downloads a precompiled gem with the C extension already built. Install is near-instant — **no compiler, no `make`, no `regex.h` headers needed**. Especially valuable in slim Docker images (`ruby:3.x-alpine`, `ruby:3.x-slim`) that don't ship `gcc`.
316
+ - **On any other platform** (FreeBSD, OpenBSD, etc.): bundler downloads the source gem and compiles the C extension on install — the same behavior as before 0.7.1. You'll need a C compiler and POSIX `regex.h` available.
317
+
318
+ ### Supported precompiled targets
319
+
320
+ Each precompiled gem ships compiled binaries for Ruby 3.0, 3.1, 3.2, 3.3, 3.4, and 4.0.
321
+
322
+ | Platform | Targets |
323
+ |---|---|
324
+ | Linux (glibc) | `x86_64-linux`, `aarch64-linux` |
325
+ | Linux (musl / Alpine) | `x86_64-linux-musl`, `aarch64-linux-musl` |
326
+ | macOS | `x86_64-darwin` (Intel), `arm64-darwin` (Apple Silicon) |
327
+
328
+ ### Bundler-locked deploys
329
+
330
+ If your `Gemfile.lock` was generated on one platform but you deploy to another, run `bundle lock --add-platform <target>` so bundler resolves the right native gem at deploy time. Example for Alpine deploys built from a glibc dev box:
331
+
332
+ ```bash
333
+ bundle lock --add-platform x86_64-linux-musl aarch64-linux-musl
334
+ ```
335
+
336
+ ## Compile the C extension (source / development install only)
337
+
338
+ ```bash
339
+ bundle exec rake compile
340
+ ```
341
+
342
+ This runs `extconf.rb` via `rake-compiler`, which generates a `Makefile` and compiles the C sources in `ext/data_redactor/` into a `.so` shared library placed under `lib/data_redactor/`.
343
+
344
+ ## Building precompiled gems locally
345
+
346
+ Maintainers can rebuild the full set of native gems with one command (requires Docker):
347
+
348
+ ```bash
349
+ bundle exec rake gem:all
350
+ ```
351
+
352
+ This invokes `rake-compiler-dock` to cross-compile every supported (platform × Ruby ABI) combination. Output lands in `pkg/`.
353
+
354
+ ## Run the tests
355
+
356
+ ```bash
357
+ bundle exec rake spec
358
+ ```
359
+
360
+ Or compile and test in one step:
361
+
362
+ ```bash
363
+ bundle exec rake
364
+ ```
365
+
366
+ ## How it works
367
+
368
+ 1. At load time, `Init_data_redactor` compiles all 85 regex patterns once using `regcomp` (POSIX ERE) and stores them as static `regex_t` structs. Patterns marked as boundary-wrapped are expanded with `wrap_boundary()` before compilation.
369
+ 2. `DataRedactor.redact(text)` receives a Ruby `String`, converts it to a C `char*` via `StringValueCStr`, and runs each compiled pattern in sequence on a working buffer.
370
+ 3. For each pattern, `replace_all_matches` iterates using `regexec`, copies non-matching segments to a fresh output buffer, and inserts `[REDACTED]` in place of each match. For boundary-wrapped patterns, `regexec` is called with `nmatch=4` and sub-match groups `[1]`/`[3]` identify the boundary characters so they are preserved verbatim.
371
+ 4. The output buffer is grown with `realloc` as needed. After all patterns are applied the result is returned as a Ruby `String` via `rb_str_new_cstr`. All intermediate `malloc`/`strdup` allocations are explicitly `free`d.
372
+
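The copy-and-replace loop from step 3 can be rendered in Ruby as a rough sketch. This illustrates the technique only: the real work happens in C with `regexec`, the two patterns below are simplified stand-ins rather than the gem's built-ins, and the boundary-group handling is omitted.

```ruby
# Hypothetical Ruby rendering of the loop: copy the text before each match,
# emit the placeholder, continue after the match.
def replace_all_matches(input, regex, placeholder = "[REDACTED]")
  output = +""
  pos = 0
  while pos <= input.length && (m = regex.match(input, pos))
    output << input[pos...m.begin(0)] # non-matching segment, copied verbatim
    output << placeholder
    if m.end(0) == m.begin(0)
      # Empty match: copy one character and step forward to avoid looping forever.
      output << input[m.begin(0), 1].to_s
      pos = m.end(0) + 1
    else
      pos = m.end(0)
    end
  end
  output << input[pos..].to_s # tail after the last match
end

# Patterns run in sequence, each on the previous pattern's output.
patterns = [/AKIA[0-9A-Z]{16}/, /[0-9]{1,3}(\.[0-9]{1,3}){3}/]
text = "User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1"
patterns.reduce(text) { |buffer, re| replace_all_matches(buffer, re) }
# => "User [REDACTED] logged in from [REDACTED]"
```

Running each pattern over the previous pattern's output is what makes ordering significant: text replaced by an earlier pattern is no longer available to later ones.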
373
+ ## Memory management
374
+
375
+ All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before the function returns. The only Ruby-managed allocation is the final return value from `rb_str_new_cstr`. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
376
+
377
+ ## Thread safety
378
+
379
+ `DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. Built-in patterns are compiled into a static `regex_t` array at load time and never mutated afterward, and each call allocates its own working buffers. POSIX `regexec` is documented as thread-safe.
380
+
381
+ `DataRedactor.add_pattern`, `remove_pattern`, and `clear_custom_patterns!` mutate a shared dynamic array and are **not** thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent `redact`/`scan` call.
382
+
383
+ ## Versioning
384
+
385
+ This project follows [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html). Until `1.0.0`, minor versions may introduce breaking changes; from `1.0.0` onward, breaking changes will only land in major versions. See [CHANGELOG.md](CHANGELOG.md) for the release history.
386
+
387
+ ## License
388
+
389
+ Released under the [MIT License](LICENSE).
390
+
391
+ ## Known limitations
392
+
393
+ - **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
394
+ - **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
395
+ - **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
metadata ADDED
@@ -0,0 +1,122 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: data_redactor
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.7.2
5
+ platform: x86_64-linux
6
+ authors:
7
+ - Daniele Frisanco
8
+ bindir: bin
9
+ cert_chain: []
10
+ date: 1980-01-02 00:00:00.000000000 Z
11
+ dependencies:
12
+ - !ruby/object:Gem::Dependency
13
+ name: rake-compiler-dock
14
+ requirement: !ruby/object:Gem::Requirement
15
+ requirements:
16
+ - - "~>"
17
+ - !ruby/object:Gem::Version
18
+ version: '1.5'
19
+ type: :development
20
+ prerelease: false
21
+ version_requirements: !ruby/object:Gem::Requirement
22
+ requirements:
23
+ - - "~>"
24
+ - !ruby/object:Gem::Version
25
+ version: '1.5'
26
+ - !ruby/object:Gem::Dependency
27
+ name: rspec
28
+ requirement: !ruby/object:Gem::Requirement
29
+ requirements:
30
+ - - "~>"
31
+ - !ruby/object:Gem::Version
32
+ version: '3.12'
33
+ type: :development
34
+ prerelease: false
35
+ version_requirements: !ruby/object:Gem::Requirement
36
+ requirements:
37
+ - - "~>"
38
+ - !ruby/object:Gem::Version
39
+ version: '3.12'
40
+ - !ruby/object:Gem::Dependency
41
+ name: yard
42
+ requirement: !ruby/object:Gem::Requirement
43
+ requirements:
44
+ - - "~>"
45
+ - !ruby/object:Gem::Version
46
+ version: '0.9'
47
+ type: :development
48
+ prerelease: false
49
+ version_requirements: !ruby/object:Gem::Requirement
50
+ requirements:
51
+ - - "~>"
52
+ - !ruby/object:Gem::Version
53
+ version: '0.9'
54
+ - !ruby/object:Gem::Dependency
55
+ name: rack
56
+ requirement: !ruby/object:Gem::Requirement
57
+ requirements:
58
+ - - ">="
59
+ - !ruby/object:Gem::Version
60
+ version: '2.0'
61
+ type: :development
62
+ prerelease: false
63
+ version_requirements: !ruby/object:Gem::Requirement
64
+ requirements:
65
+ - - ">="
66
+ - !ruby/object:Gem::Version
67
+ version: '2.0'
68
+ description: A Ruby gem with a C extension for high-performance scanning and redaction
69
+ of 85 sensitive patterns — API keys, tokens, credentials, IBANs, national IDs, emails,
70
+ phone numbers, and PII from 15+ countries. Optional Logger formatter, Rails filter_parameters
71
+ adapter, and Rack middleware. Designed to sanitize text before sending to LLMs,
72
+ logging systems, or any public/third-party API.
73
+ email:
74
+ - daniele.frisanco@gmail.com
75
+ executables: []
76
+ extensions: []
77
+ extra_rdoc_files: []
78
+ files:
79
+ - CHANGELOG.md
80
+ - LICENSE
81
+ - lib/data_redactor.rb
82
+ - lib/data_redactor/3.0/data_redactor.so
83
+ - lib/data_redactor/3.1/data_redactor.so
84
+ - lib/data_redactor/3.2/data_redactor.so
85
+ - lib/data_redactor/3.3/data_redactor.so
86
+ - lib/data_redactor/3.4/data_redactor.so
87
+ - lib/data_redactor/4.0/data_redactor.so
88
+ - lib/data_redactor/integrations/logger.rb
89
+ - lib/data_redactor/integrations/rack.rb
90
+ - lib/data_redactor/integrations/rails.rb
91
+ - lib/data_redactor/version.rb
92
+ - readme.md
93
+ homepage: https://github.com/danielefrisanco/data_redactor
94
+ licenses:
95
+ - MIT
96
+ metadata:
97
+ homepage_uri: https://github.com/danielefrisanco/data_redactor
98
+ source_code_uri: https://github.com/danielefrisanco/data_redactor
99
+ changelog_uri: https://github.com/danielefrisanco/data_redactor/blob/main/CHANGELOG.md
100
+ bug_tracker_uri: https://github.com/danielefrisanco/data_redactor/issues
101
+ rubygems_mfa_required: 'true'
102
+ rdoc_options: []
103
+ require_paths:
104
+ - lib
105
+ required_ruby_version: !ruby/object:Gem::Requirement
106
+ requirements:
107
+ - - ">="
108
+ - !ruby/object:Gem::Version
109
+ version: '3.0'
110
+ - - "<"
111
+ - !ruby/object:Gem::Version
112
+ version: 4.1.dev
113
+ required_rubygems_version: !ruby/object:Gem::Requirement
114
+ requirements:
115
+ - - ">="
116
+ - !ruby/object:Gem::Version
117
+ version: '0'
118
+ requirements: []
119
+ rubygems_version: 4.0.6
120
+ specification_version: 4
121
+ summary: Redact PII and secrets from strings before sending to AI or external services
122
+ test_files: []