data_redactor 0.7.2-x86_64-linux
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/CHANGELOG.md +171 -0
- data/LICENSE +21 -0
- data/lib/data_redactor/3.0/data_redactor.so +0 -0
- data/lib/data_redactor/3.1/data_redactor.so +0 -0
- data/lib/data_redactor/3.2/data_redactor.so +0 -0
- data/lib/data_redactor/3.3/data_redactor.so +0 -0
- data/lib/data_redactor/3.4/data_redactor.so +0 -0
- data/lib/data_redactor/4.0/data_redactor.so +0 -0
- data/lib/data_redactor/integrations/logger.rb +42 -0
- data/lib/data_redactor/integrations/rack.rb +121 -0
- data/lib/data_redactor/integrations/rails.rb +38 -0
- data/lib/data_redactor/version.rb +4 -0
- data/lib/data_redactor.rb +347 -0
- data/readme.md +395 -0
- metadata +122 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 308a3387249fae55cdff4392a2655072450589df3a07409a3623156eb9c1c4fd
+  data.tar.gz: 2cbdf4a55c7648ae74245b11c9ee298eef10116e154f2fb9e27926546c909182
+SHA512:
+  metadata.gz: 2e4a8c4ce276ccbd236e023eca96640b293a644ed3b0c4c2a8d844f0151f0b9befb18bdbd2221ea4ca4524d8dc28e376651b742eb9d6488fcdbb71a267672f95
+  data.tar.gz: 7079f9aeb48c18264a2d74b4d5c50383f3b38c0227b39cb336fab94dbac28db4147bb8963d7d6f2d5a1e04813c5e7ef71167ee906f0bd3eaad0613be39b3d882
data/CHANGELOG.md
ADDED
@@ -0,0 +1,171 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [0.7.2] - 2026-05-09
+
+**Supersedes 0.7.1, which has been yanked from RubyGems.**
+
+0.7.1 had a release pipeline bug: the source gem and the precompiled native
+gems were published by two independent workflows, with no gating between
+them. When the native-binary builds failed (`oxidize-rb/actions/cross-gem`
+couldn't pull `rbsys/aarch64-linux:0.9.128` from Docker Hub), the source
+gem still published — leaving users with release notes that promised
+precompiled binaries that didn't exist on RubyGems. 0.7.2 ships the same
+features as 0.7.1 plus the pipeline fix.
+
+### Changed
+- **Atomic release pipeline.** Source-gem publishing moved out of `ci.yml`
+  and into `release-binaries.yml`, alongside the native-gem builds. The
+  publish job now `needs: [build-source, build-native]`; if any native
+  platform fails to build, **nothing publishes**. This guarantees the
+  RubyGems release matches what the GitHub release notes promise.
+- **Direct `rake-compiler-dock` invocation in CI** instead of the
+  `oxidize-rb/actions/cross-gem` action. Same code path as `rake gem:all`
+  locally and the existing PR-time smoke test in `ci.yml`. Uses
+  `ghcr.io/rake-compiler/*` images (no Docker Hub rate limits).
+
+### Fixed
+- All 6 precompiled native gems now actually publish on release — the
+  `aarch64-linux` variant in particular was previously failing.
+
+### Documentation
+- README installation section rewritten around the user's question
+  ("what changes for me?"). Adds explicit Docker / Alpine guidance and a
+  heads-up about `bundle lock --add-platform` for cross-platform deploys.
+
+## [0.7.1] - 2026-05-09 [YANKED]
+
+### Added
+- **Precompiled native gems** for the most common platforms — installing
+  `data_redactor` no longer requires a C toolchain on these targets:
+  - `x86_64-linux`, `aarch64-linux` (glibc)
+  - `x86_64-linux-musl`, `aarch64-linux-musl` (Alpine)
+  - `x86_64-darwin`, `arm64-darwin` (macOS Intel + Apple Silicon)
+  Each native gem ships compiled `.so` files for Ruby 3.1, 3.2, 3.3, and 3.4.
+  Bundler/RubyGems automatically picks the right gem for the host; users on
+  any other platform fall back to the source gem and compile as before.
+- `rake gem:all` task — builds every native gem locally via `rake-compiler-dock`
+  (requires Docker). Single command to regenerate the full release matrix.
+- `.github/workflows/release-binaries.yml` — builds & publishes all native
+  gems on every GitHub release. Also exposes `workflow_dispatch` so a
+  maintainer can rebuild any past release without cutting a new tag.
+
+### Changed
+- CI test matrix now includes Ruby 3.4 in addition to 3.1, 3.2, 3.3.
+- Gemspec: added `rake-compiler-dock` as a development dependency. Source-only
+  gem size is unchanged — native gems strip `ext/` and the `extconf.rb`
+  extension hook so they only carry the prebuilt `.so` files.
+
+## [0.7.0] - 2026-05-08
+
+### Added
+- **Rails / Rack / Logger integrations** under `lib/data_redactor/integrations/`. Soft-required — none are loaded by default; the gem still has zero runtime dependencies in the gemspec.
+- `DataRedactor::Integrations::Logger` — drop-in `Logger::Formatter` that scrubs every emitted line, wraps an inner formatter (default `Logger::Formatter`), and preserves exception cause chains.
+- `DataRedactor::Integrations::Rails.filter(...)` — returns a `(key, value)` proc for `Rails.application.config.filter_parameters`. Mutates String values in place via `String#replace`.
+- `DataRedactor::Integrations::Rack` — middleware with selectable surfaces. `scrub:` accepts any subset of `[:body, :headers]` (default both). `:body` buffers the response and drops `Content-Length`; `:headers` scrubs sensitive response headers (`Set-Cookie`, `Authorization`, `X-Api-Key`, ...) and request headers in the env hash. Unknown surfaces raise `ArgumentError`.
+- All three integrations forward `only:`, `except:`, `placeholder:` to `DataRedactor.redact`.
+
+### Changed
+- Gemspec: added `rack` as a development dependency. No new runtime dependencies.
+
+## [0.6.1] - 2026-05-08
+
+### Added
+- Six new distinctive-prefix API key patterns under the `:credentials` tag, exposed via `DataRedactor.pattern_names`:
+  - `anthropic_api_key` — `sk-ant-apiNN-...`
+  - `openai_project_api_key` — `sk-proj-...`
+  - `gitlab_pat` — `glpat-...`
+  - `digitalocean_pat` — `dop_v1_...`
+  - `databricks_api_token` — `dapi...`
+  - `sentry_dsn` — `https://KEY@oNNN.ingest.sentry.io/PID` (also matches the legacy `KEY:SECRET@` form)
+
+### Changed
+- `NUM_PATTERNS` is now 85 (was 79). Built-in pattern indices in C have shifted accordingly; the public Ruby API and pattern names are stable.
+
+## [0.6.0] - 2026-05-08
+
+### Added
+- **Per-pattern allow / deny via `only:` / `except:`.** Both kwargs now accept a mix of Symbols (tags) and Strings (pattern names from `DataRedactor.pattern_names`). They can be combined: `only: :contact, except: ["email"]` redacts every contact pattern except email. Mixed-list shapes like `only: [:credentials, "iban_de"]` also work. Precedence: `except:` always wins when the two overlap.
+- `DataRedactor.pattern_names` — array of every known pattern name (built-ins + currently registered custom).
+- `DataRedactor::BUILTIN_PATTERN_NAMES` and `DataRedactor::BUILTIN_PATTERN_TAG_BITS` constants (frozen) exposing the compiled-in pattern roster.
+- `DataRedactor::UnknownPatternError` raised when a String passed to `only:`/`except:` does not match any known pattern.
+- YARD docs deploy job in `.github/workflows/ci.yml` publishes `bundle exec yard doc` output to GitHub Pages on every push to `main`.
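
The `only:` / `except:` precedence rule in the 0.6.0 entry can be sketched in plain Ruby. This is an illustration under stated assumptions, not the gem's implementation: the `PATTERNS` table, the tag assigned to each name, the `phone_number` name, and the helpers `selected?` and `enabled_patterns` are all hypothetical.

```ruby
# Hypothetical roster (pattern name => tag). The real gem compiles in 85 patterns.
PATTERNS = {
  "email"             => :contact,
  "phone_number"      => :contact,      # hypothetical pattern name
  "aws_access_key_id" => :credentials,
  "iban_de"           => :financial     # tag assumed for illustration
}.freeze

# True when the pattern is named by +filter+, which may mix
# Symbols (tags) and Strings (pattern names). A nil filter selects nothing.
def selected?(filter, name, tag)
  Array(filter).any? { |f| f.is_a?(Symbol) ? f == tag : f == name }
end

# A pattern runs iff (only is nil OR it matches only:) AND it does not match except:.
def enabled_patterns(only: nil, except: nil)
  PATTERNS.select do |name, tag|
    (only.nil? || selected?(only, name, tag)) && !selected?(except, name, tag)
  end.keys
end

enabled_patterns(only: :contact, except: ["email"])   # every contact pattern except email
enabled_patterns(only: :contact, except: :contact)    # except: wins, nothing is enabled
```

Because the `except:` test is applied last, an overlap between the two filters always resolves in favour of exclusion.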
+
+### Changed
+- **C entry-point signatures.** `_redact(text, ph_mode, ph_str, enable_bits)` and `_scan(text, enable_bits)` now take a per-pattern enable bit array (built by the Ruby wrapper from `only:`/`except:`) instead of a tag bitmask. The public `DataRedactor.redact` / `.scan` API is fully backward compatible — only the underscore-prefixed C boundary changed. Single-pass: filtering happens in C, no second pass through `_scan`.
+- `only:` and `except:` may now be combined (previously raised `ArgumentError` if both were passed).
+- **Internal: C extension split into focused modules.** `ext/data_redactor/data_redactor.c` was a single ~1000-line file; it is now a 60-line entry point plus `patterns.{c,h}`, `placeholder.{c,h}`, `redact.{c,h}`, `scan.{c,h}`, `custom_patterns.{c,h}`, and `tags.h`. `extconf.rb` now globs every `.c` in the extension directory via `$srcs`, so adding a new module needs no Makefile edits.
+- **YARD inline docs** — every public method on `DataRedactor` now has `@param`/`@return`/`@raise` annotations (100% coverage); `.yardopts` configures markdown rendering with the README as the front page.
+
+### Documentation
+- README: gem version / CI / license badges; new "Thread safety" section clarifying that `redact`/`scan` are thread-safe but `add_pattern`/`remove_pattern`/`clear_custom_patterns!` are not (register custom patterns once at boot).
+
+## [0.5.0] - 2026-05-02
+
+### Added
+- `DataRedactor.scan(text, only:, except:)` — returns `{ redacted: String, matches: Array<Hash> }` where each match contains `:tag` (Symbol), `:name` (pattern name String), `:value` (matched text), `:start` (byte offset into original), `:length` (byte length). Accepts the same `only:`/`except:` tag filters as `redact`. Includes both built-in and custom pattern matches.
+- `pattern_names[]` array in the C extension mapping each built-in pattern index to a stable snake_case name string (e.g. `"aws_access_key_id"`, `"email"`, `"iban_de"`).
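
Because `:start` and `:length` are byte offsets, they only line up with `String#byteslice`, not with character indexing, once multi-byte text precedes a match. The sketch below shows the difference; `EMAIL_RE` and `find_email` are hypothetical stand-ins written in plain Ruby so the example runs without the gem.

```ruby
# Hypothetical stand-in matcher; the gem itself matches in C via POSIX regex.h.
EMAIL_RE = /[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]+/

# Returns a hash shaped like one entry of the :matches array described above.
def find_email(text)
  m = EMAIL_RE.match(text)
  return nil unless m

  {
    tag: :contact,
    name: "email",
    value: m[0],
    start: m.pre_match.bytesize, # byte offset, not character index
    length: m[0].bytesize
  }
end

text  = "héllo user@example.com"
match = find_email(text)

# Byte offsets round-trip through String#byteslice.
by_bytes = text.byteslice(match[:start], match[:length])
# Character slicing with the same numbers lands one character late here,
# because "é" occupies two bytes in UTF-8.
by_chars = text[match[:start], match[:length]]
```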
+
+## [0.4.0] - 2026-05-02
+
+### Added
+- `placeholder:` keyword argument on `DataRedactor.redact`.
+  - Plain string (default `"[REDACTED]"`): `placeholder: "***"`
+  - Tagged: `placeholder: :tagged` → `[REDACTED:CONTACT]`, `[REDACTED:CREDENTIALS]`, etc.
+  - Deterministic hash: `placeholder: :hash` → `[CONTACT_a3f9]` (4-hex djb2 suffix, same value always produces the same token — useful for correlating redactions across log lines).
+- `PH_MODE_PLAIN`, `PH_MODE_TAGGED`, `PH_MODE_HASH` integer constants exposed from C.
+- `DataRedactor::PLACEHOLDER_DEFAULT` constant (`"[REDACTED]"`).
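
The `:hash` placeholder mode can be pictured with a small djb2 sketch. The changelog only says "4-hex djb2 suffix", so the 32-bit wraparound and the use of the low 16 bits below are assumptions, and tokens produced here may not match the gem's output. `djb2` and `hash_placeholder` are illustrative names.

```ruby
# Classic djb2: start at 5381, then hash = hash * 33 + byte.
# Assumption: wrap at 32 bits, as a C uint32_t would.
def djb2(value)
  value.each_byte.reduce(5381) { |h, b| ((h * 33) + b) & 0xFFFFFFFF }
end

# Assumption: the 4-hex suffix is the low 16 bits of the hash.
def hash_placeholder(tag, value)
  format("[%s_%04x]", tag.to_s.upcase, djb2(value) & 0xFFFF)
end

first  = hash_placeholder(:contact, "alice@example.com")
second = hash_placeholder(:contact, "alice@example.com")
# first == second: the same value always yields the same token.
```

With only 16 bits in the suffix, two different values can occasionally share a token, so such tokens suit correlation across log lines rather than unique identification.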
+
+### Changed
+- `DataRedactor._redact` now takes 4 arguments: `(text, mask, ph_mode, ph_str)`. The public `DataRedactor.redact` API is fully backward compatible.
+
+## [0.3.0] - 2026-05-02
+
+### Added
+- User-supplied custom patterns via `DataRedactor.add_pattern(name:, regex:, tag: :custom, boundary: false)`.
+- `DataRedactor.remove_pattern(name)` — remove a named custom pattern (returns `true`/`false`).
+- `DataRedactor.custom_patterns` — list all registered custom patterns as an array of hashes.
+- `DataRedactor.clear_custom_patterns!` — remove all custom patterns (useful in test suites).
+- New `:custom` tag and `TAG_CUSTOM` bitmask constant for custom patterns. Works with `only:`/`except:`.
+- `DataRedactor::InvalidPatternError` raised when a pattern fails `regcomp` or uses unsupported Ruby-only syntax (`\d`, `\s`, `\w`, `\b`, lookaround, non-greedy quantifiers, named groups).
+- Capture groups rejected at registration when `boundary: true` (group indices would shift).
+- Name collisions replace the existing pattern (the old compiled `regex_t` is freed).
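
Since custom patterns are compiled with POSIX `regcomp`, Ruby shorthands such as `\d` have to be rewritten as bracket expressions before registration. The sketch below is a pre-flight check that can run without the gem. The `posix_safe?` helper is hypothetical; its regex is copied from the `RUBY_ONLY_SYNTAX_RE` constant that appears in `lib/data_redactor.rb` in this diff.

```ruby
# Shorthand classes, lookaround, named groups, inline flags, non-greedy quantifiers.
RUBY_ONLY = /\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/

# Hypothetical helper: true when +source+ avoids the Ruby-only constructs above.
# It does not prove the pattern compiles; regcomp can still reject it.
def posix_safe?(source)
  !source.match?(RUBY_ONLY)
end

posix_safe?("EMP-[0-9]{6}")    # bracket range instead of \d
posix_safe?('EMP-\d{6}')       # rejected: \d is Ruby-only
posix_safe?("[[:alpha:]]+?")   # rejected: non-greedy quantifier
```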
+
+## [0.2.0] - 2026-05-02
+
+### Added
+- Tag system: every pattern now belongs to one of 8 tags (`:credentials`, `:financial`, `:tax_id`, `:national_id`, `:contact`, `:network`, `:travel`, `:other`).
+- `DataRedactor.redact(text, only: [...])` to redact only patterns in the given tags.
+- `DataRedactor.redact(text, except: [...])` to redact every tag except the given ones.
+- `DataRedactor.tags` returning the list of supported tags.
+- `DataRedactor::TAGS` constant mapping tag symbols to bitmask values, plus `TAG_*` integer constants exposed from C for advanced use.
+- `DataRedactor::UnknownTagError` raised when an unknown tag symbol is passed.
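
A tag bitmask of the kind described in the 0.2.0 entry can be built by OR-ing one bit per tag. In the sketch below the bit values are hypothetical (the real `TAG_*` constants come from the C extension) and `mask_for` is an illustrative helper, not the gem's code. It rejects `only:` together with `except:`, matching the pre-0.6.0 behaviour noted elsewhere in this changelog.

```ruby
# Hypothetical bit assignments, one bit per tag.
TAG_BITS = {
  credentials: 1 << 0, financial: 1 << 1, tax_id: 1 << 2, national_id: 1 << 3,
  contact:     1 << 4, network:   1 << 5, travel: 1 << 6, other:       1 << 7
}.freeze
ALL_TAGS = TAG_BITS.values.reduce(:|)

# Build the integer mask for only: [...] or except: [...].
# Hash#fetch raises KeyError for an unknown tag symbol.
def mask_for(only: nil, except: nil)
  raise ArgumentError, "pass only: or except:, not both" if only && except

  to_mask = ->(tags) { Array(tags).reduce(0) { |mask, tag| mask | TAG_BITS.fetch(tag) } }
  return to_mask.call(only) if only

  ALL_TAGS & ~to_mask.call(except)
end

mask_for                               # every tag enabled
mask_for(only: [:contact, :network])   # just those two bits
mask_for(except: :credentials)         # everything but credentials
```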
+
+### Changed
+- The C-level entry point is now `DataRedactor._redact(text, mask)` (two-arg, mask is an integer bitmask). The public API is the Ruby wrapper `DataRedactor.redact`, which remains backward compatible: `redact(text)` with no keyword arguments runs every pattern exactly as before.
+
+## [0.1.0] - 2026-05-02
+
+### Added
+- Initial release.
+- C extension (`ext/data_redactor/data_redactor.c`) using POSIX `regex.h` for high-throughput scanning.
+- 79 redaction patterns across cloud secrets, API keys, IBANs, national IDs, and PII for 15+ countries.
+- Patterns ordered most-specific to most-generic to prevent shorter patterns from consuming parts of longer matches.
+- Boundary-wrapping mechanism for generic digit/alphanum sequences so they only match at word boundaries.
+- `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
+- RSpec suite with one example per pattern.
+
+[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...HEAD
+[0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
+[0.7.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.0...v0.7.1
+[0.7.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.6.1...v0.7.0
+[0.6.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.6.0...v0.6.1
+[0.6.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.5.0...v0.6.0
+[0.2.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.1.0...v0.2.0
+[0.1.0]: https://github.com/danielefrisanco/data_redactor/releases/tag/v0.1.0
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Daniele Frisanco
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
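data/lib/data_redactor/3.0/data_redactor.so
ADDED
Binary file
data/lib/data_redactor/3.1/data_redactor.so
ADDED
Binary file
data/lib/data_redactor/3.2/data_redactor.so
ADDED
Binary file
data/lib/data_redactor/3.3/data_redactor.so
ADDED
Binary file
data/lib/data_redactor/3.4/data_redactor.so
ADDED
Binary file
data/lib/data_redactor/4.0/data_redactor.so
ADDED
Binary file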
data/lib/data_redactor/integrations/logger.rb
ADDED
@@ -0,0 +1,42 @@
+require "logger"
+require "data_redactor"
+
+module DataRedactor
+  module Integrations
+    # Logger formatter that runs every log message through {DataRedactor.redact}
+    # before delegating to an inner formatter.
+    #
+    # @example Drop-in replacement for Ruby's default formatter
+    #   logger = Logger.new($stdout)
+    #   logger.formatter = DataRedactor::Integrations::Logger.new
+    #   logger.info("Auth failed for user alice@example.com")
+    #   # => "I, [...] -- : Auth failed for user [REDACTED]"
+    #
+    # @example Wrapping an existing formatter (e.g. Rails JSON logger)
+    #   logger.formatter = DataRedactor::Integrations::Logger.new(
+    #     inner: Rails.logger.formatter,
+    #     only: [:credentials, :contact]
+    #   )
+    class Logger
+      # @param inner [#call] formatter to wrap. Defaults to a new {::Logger::Formatter}.
+      # @param only [Symbol, String, Array, nil] forwarded to {DataRedactor.redact}.
+      # @param except [Symbol, String, Array, nil] forwarded to {DataRedactor.redact}.
+      # @param placeholder forwarded to {DataRedactor.redact}.
+      def initialize(inner: ::Logger::Formatter.new, only: nil, except: nil, placeholder: DataRedactor::PLACEHOLDER_DEFAULT)
+        @inner = inner
+        @only = only
+        @except = except
+        @placeholder = placeholder
+      end
+
+      # Formatter contract — called by Logger for every emitted line.
+      # Lets the inner formatter render whatever it likes (string, exception,
+      # arbitrary object) and scrubs the resulting line in one pass. Keeps the
+      # exception cause chain intact so downstream formatters still see it.
+      def call(severity, time, progname, msg)
+        line = @inner.call(severity, time, progname, msg)
+        DataRedactor.redact(line.to_s, only: @only, except: @except, placeholder: @placeholder)
+      end
+    end
+  end
+end
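
The wrap-then-scrub approach of the formatter above can be tried without installing the gem. In this minimal sketch `SCRUB` is a hypothetical stand-in for `DataRedactor.redact` (a single email-shaped regex) and `ScrubbingFormatter` is an illustrative name. Only the standard library `Logger` API is used.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for DataRedactor.redact: masks anything email-shaped.
SCRUB = ->(line) { line.gsub(/[\w.+-]+@[\w-]+(\.[\w-]+)+/, "[REDACTED]") }

# Renders with the inner formatter first, then scrubs the finished line,
# so severity, timestamp, and progname formatting stay untouched.
class ScrubbingFormatter
  def initialize(inner: ::Logger::Formatter.new)
    @inner = inner
  end

  def call(severity, time, progname, msg)
    SCRUB.call(@inner.call(severity, time, progname, msg).to_s)
  end
end

io = StringIO.new
logger = Logger.new(io)
logger.formatter = ScrubbingFormatter.new
logger.info("Auth failed for user alice@example.com")
output = io.string
```

Scrubbing the rendered line rather than the raw message means exceptions and arbitrary objects are covered too, since the inner formatter has already turned them into text.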
data/lib/data_redactor/integrations/rack.rb
ADDED
@@ -0,0 +1,121 @@
+require "data_redactor"
+
+module DataRedactor
+  module Integrations
+    # Rack middleware that scrubs sensitive data from selectable surfaces of
+    # the response, plus sensitive request headers so that downstream loggers
+    # only see scrubbed values.
+    #
+    # @example Both surfaces (default)
+    #   use DataRedactor::Integrations::Rack, scrub: [:body, :headers]
+    #
+    # @example Headers only — leave the response body untouched
+    #   use DataRedactor::Integrations::Rack, scrub: [:headers]
+    #
+    # ### Surfaces
+    #
+    # - `:body`: buffers the whole response body, runs it through
+    #   {DataRedactor.redact}, and returns it as a single chunk. Drops the
+    #   `Content-Length` header because the redacted body may have a
+    #   different byte length; downstream servers recompute or chunk-encode.
+    # - `:headers`: scrubs sensitive response headers (`Set-Cookie`,
+    #   `Authorization`, `X-Api-Key`, etc.) into a new headers hash, and
+    #   redacts sensitive request headers (`Authorization`, `Cookie`, etc.)
+    #   in the env hash so downstream middleware that logs them sees scrubbed values.
+    class Rack
+      DEFAULT_SCRUB = [:body, :headers].freeze
+
+      SENSITIVE_REQUEST_HEADERS = %w[
+        HTTP_AUTHORIZATION
+        HTTP_PROXY_AUTHORIZATION
+        HTTP_COOKIE
+        HTTP_X_API_KEY
+        HTTP_X_AUTH_TOKEN
+        HTTP_X_ACCESS_TOKEN
+      ].freeze
+
+      SENSITIVE_RESPONSE_HEADERS = %w[
+        Set-Cookie
+        Authorization
+        X-Api-Key
+        X-Auth-Token
+        X-Access-Token
+      ].freeze
+
+      # @param app [#call] the Rack app
+      # @param scrub [Array<Symbol>] which surfaces to redact. Subset of
+      #   `[:body, :headers]`. Defaults to `[:body, :headers]`.
+      # @param only forwarded to {DataRedactor.redact}
+      # @param except forwarded to {DataRedactor.redact}
+      # @param placeholder forwarded to {DataRedactor.redact}
+      def initialize(app, scrub: DEFAULT_SCRUB, only: nil, except: nil, placeholder: DataRedactor::PLACEHOLDER_DEFAULT)
+        @app = app
+        @scrub = Array(scrub).map(&:to_sym)
+        unknown = @scrub - [:body, :headers]
+        unless unknown.empty?
+          raise ArgumentError, "unknown scrub surface(s) #{unknown.inspect}; valid: [:body, :headers]"
+        end
+        @only = only
+        @except = except
+        @placeholder = placeholder
+      end
+
+      def call(env)
+        scrub_request_headers(env) if @scrub.include?(:headers)
+        status, headers, body = @app.call(env)
+        headers = scrub_response_headers(headers) if @scrub.include?(:headers)
+        if @scrub.include?(:body)
+          body, headers = wrap_body(body, headers)
+        end
+        [status, headers, body]
+      end
+
+      private
+
+      def redact(s)
+        DataRedactor.redact(s, only: @only, except: @except, placeholder: @placeholder)
+      end
+
+      def scrub_request_headers(env)
+        SENSITIVE_REQUEST_HEADERS.each do |key|
+          value = env[key]
+          env[key] = redact(value) if value.is_a?(String) && !value.empty?
+        end
+      end
+
+      def scrub_response_headers(headers)
+        # Rack 3 uses lower-case header names; Rack 2 uses Capitalized.
+        # Match case-insensitively against our known list.
+        sensitive_lc = SENSITIVE_RESPONSE_HEADERS.map(&:downcase)
+        headers.each_with_object({}) do |(key, value), out|
+          if sensitive_lc.include?(key.to_s.downcase)
+            out[key] = scrub_header_value(value)
+          else
+            out[key] = value
+          end
+        end
+      end
+
+      def scrub_header_value(value)
+        case value
+        when String then redact(value)
+        when Array then value.map { |v| v.is_a?(String) ? redact(v) : v }
+        else value
+        end
+      end
+
+      def wrap_body(body, headers)
+        # Buffer the body, redact, return as a single-element array.
+        # Stripping Content-Length because the redacted body may differ in
+        # byte length; downstream servers will recompute or chunk-encode.
+        buffered = +""
+        body.each { |chunk| buffered << chunk.to_s }
+        body.close if body.respond_to?(:close)
+
+        scrubbed = redact(buffered)
+        new_headers = headers.reject { |k, _| k.to_s.downcase == "content-length" }
+        [[scrubbed], new_headers]
+      end
+    end
+  end
+end
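
The case-insensitive header match above matters because a Rack 2 app typically returns `Set-Cookie` while a Rack 3 app returns `set-cookie`, sometimes with an Array value. The sketch below isolates that lookup. `SENSITIVE`, `MASK` (a stand-in for `DataRedactor.redact` that masks the whole value), and `scrub_headers` are hypothetical names chosen for illustration.

```ruby
SENSITIVE = %w[set-cookie authorization x-api-key].freeze

# Hypothetical stand-in for DataRedactor.redact: masks the entire value.
MASK = ->(_value) { "[REDACTED]" }

# Returns a new hash; sensitive values are masked whatever the key's casing,
# and the caller's original key spelling is preserved.
def scrub_headers(headers)
  headers.to_h do |key, value|
    next [key, value] unless SENSITIVE.include?(key.to_s.downcase)

    [key, value.is_a?(Array) ? value.map(&MASK) : MASK.call(value)]
  end
end

rack2 = scrub_headers("Set-Cookie" => "sid=abc123", "Content-Type" => "text/html")
rack3 = scrub_headers("set-cookie" => ["sid=abc123", "theme=dark"], "content-type" => "text/html")
```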
data/lib/data_redactor/integrations/rails.rb
ADDED
@@ -0,0 +1,38 @@
+require "data_redactor"
+
+module DataRedactor
+  module Integrations
+    # Rails `config.filter_parameters` adapter. Returns a `Proc` that Rails
+    # invokes with `(key, value)` for every leaf in the params tree; we redact
+    # the value in place when it is a String.
+    #
+    # @example
+    #   # config/initializers/filter_parameter_logging.rb
+    #   require "data_redactor/integrations/rails"
+    #   Rails.application.config.filter_parameters += [
+    #     DataRedactor::Integrations::Rails.filter
+    #   ]
+    #
+    # @example Restricting to specific tags
+    #   Rails.application.config.filter_parameters += [
+    #     DataRedactor::Integrations::Rails.filter(only: [:credentials, :financial])
+    #   ]
+    module Rails
+      module_function
+
+      # @param only forwarded to {DataRedactor.redact}
+      # @param except forwarded to {DataRedactor.redact}
+      # @param placeholder forwarded to {DataRedactor.redact}
+      # @return [Proc] a `(key, value)` proc compatible with `config.filter_parameters`
+      def filter(only: nil, except: nil, placeholder: DataRedactor::PLACEHOLDER_DEFAULT)
+        lambda do |_key, value|
+          next unless value.is_a?(String)
+          # Rails' parameter filter ignores the proc's return value and expects
+          # `value` to be mutated in place, so use String#replace.
+          redacted = DataRedactor.redact(value, only: only, except: except, placeholder: placeholder)
+          value.replace(redacted) if redacted != value
+        end
+      end
+    end
+  end
+end
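
The reason the filter above uses `String#replace` can be shown with plain Ruby. In this sketch `apply_filter` is a hypothetical caller that, like a `(key, value)` filter hook, discards the block's return value, so only in-place mutation is visible afterwards.

```ruby
# Hypothetical caller: invokes the filter for each pair and ignores what it returns.
def apply_filter(params, filter)
  params.each { |key, value| filter.call(key, value) }
  params
end

# Rebinding the block-local variable leaves the caller's String untouched.
reassign = ->(_key, value) { value = "[REDACTED]"; value }
# String#replace rewrites the contents of the very object the caller holds.
in_place = ->(_key, value) { value.replace("[REDACTED]") if value.is_a?(String) }

unchanged = apply_filter({ "token" => +"secret-1" }, reassign)
scrubbed  = apply_filter({ "token" => +"secret-2", "count" => 3 }, in_place)
```

The unary `+` yields unfrozen Strings, which `String#replace` requires; a frozen value would raise `FrozenError`.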
data/lib/data_redactor.rb
ADDED
@@ -0,0 +1,347 @@
+require "set"
+require_relative "data_redactor/version"
+require_relative "data_redactor/data_redactor" # loads the compiled .so
+
+# High-performance regex-based redactor for sensitive data.
+#
+# DataRedactor scans text for sensitive patterns (API keys, IBANs, national
+# IDs, emails, phone numbers, etc.) and replaces matches with a configurable
+# placeholder. The matching is done by a C extension backed by POSIX
+# +regex.h+, so it is fast enough to run inline on large payloads.
+#
+# @example Basic redaction
+#   DataRedactor.redact("key is AKIAIOSFODNN7EXAMPLE")
+#   # => "key is [REDACTED]"
+#
+# @example Filter by tag or pattern name
+#   DataRedactor.redact(text, only: :credentials)
+#   DataRedactor.redact(text, except: [:contact, :network])
+#   DataRedactor.redact(text, only: :contact, except: ["email"])
+#   DataRedactor.redact(text, only: ["aws_access_key_id"])
+#
+# @example Custom placeholder
+#   DataRedactor.redact(text, placeholder: "***")
+#   DataRedactor.redact(text, placeholder: :tagged) # => "[REDACTED:CONTACT]"
+#   DataRedactor.redact(text, placeholder: :hash) # => "[CONTACT_a3f9]"
+#
+# @example Audit / dry-run
+#   DataRedactor.scan(text)
+#   # => { redacted: "...", matches: [{tag:, name:, value:, start:, length:}, ...] }
+#
+# @example Custom pattern
+#   DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
+module DataRedactor
+  # Map of tag symbol to the integer bit used by the C layer.
+  #
+  # The keys of this hash are the canonical list of supported tags; pass any
+  # of them to {redact} or {scan} via +only:+ / +except:+.
+  #
+  # @return [Hash{Symbol => Integer}] frozen tag-to-bit map
+  TAGS = {
+    credentials: TAG_CREDENTIALS,
+    financial: TAG_FINANCIAL,
+    tax_id: TAG_TAX_ID,
+    national_id: TAG_NATIONAL_ID,
+    contact: TAG_CONTACT,
+    network: TAG_NETWORK,
+    travel: TAG_TRAVEL,
+    other: TAG_OTHER,
+    custom: TAG_CUSTOM
+  }.freeze
+
+  # Raised when a tag symbol passed to +only:+ / +except:+ / +tag:+ is not in {TAGS}.
+  class UnknownTagError < ArgumentError; end
+
+  # Raised when a String passed via +only:+ / +except:+ does not match any
+  # registered pattern name. See {pattern_names}.
+  class UnknownPatternError < ArgumentError; end
+
+  # Raised by {add_pattern} when the supplied regex is not valid POSIX ERE,
+  # uses Ruby-only syntax (+\d+, +\s+, lookaround, non-greedy, etc.), or
+  # contains capture groups while +boundary: true+ is requested.
+  class InvalidPatternError < ArgumentError; end
+
+  # @api private
+  # Capture groups break boundary-wrapper group index assumptions ([1],[2],[3] shift).
+  CAPTURE_GROUP_RE = /(?<!\\)\((?!\?:)/.freeze
+
+  # @api private
+  # Ruby regex syntax that has no POSIX ERE equivalent.
+  RUBY_ONLY_SYNTAX_RE = /\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/.freeze
+
+  # Default placeholder used when +placeholder:+ is not given to {redact}.
+  PLACEHOLDER_DEFAULT = "[REDACTED]"
+
+  module_function
+
+  # List of supported tag symbols.
+  #
+  # @return [Array<Symbol>] every key from {TAGS}
+  def tags
+    TAGS.keys
+  end
+
+  # List of every pattern name the redactor knows about.
+  #
+  # Includes the {BUILTIN_PATTERN_NAMES} plus any names registered via
+  # {add_pattern}. Useful for discovering what String values +only:+ /
+  # +except:+ accept, and for filtering / debugging.
+  #
+  # @return [Array<String>] built-in names first (in execution order),
+  #   then custom names in registration order.
+  def pattern_names
+    BUILTIN_PATTERN_NAMES + _custom_patterns.map { |h| h[:name] }
+  end
+
+  # Redact every match of the configured patterns in +text+.
+  #
+  # +only:+ and +except:+ both accept a single value or an Array, mixing:
+  # - **Symbols** — tag names from {TAGS} (e.g. +:contact+, +:credentials+).
+  # - **Strings** — specific pattern names from {pattern_names} (e.g. +"email"+).
+  #
+  # They can be combined: +only: :contact, except: ["email"]+ means
+  # "redact every contact pattern except email." Symbols give you tag-level
+  # control; Strings give you per-pattern precision.
+  #
+  # **Precedence:** a pattern is redacted iff
+  # +(only is nil OR pattern matches only:)+ AND +(pattern does not match except:)+.
+  # +except:+ always wins over +only:+ when they overlap — e.g.
+  # +only: :contact, except: :contact+ produces an empty redaction (no-op),
+  # and +only: ["email"], except: ["email"]+ likewise skips email entirely.
+  #
+  # @param text [String] input string. Returned unchanged if no patterns match.
+  # @param only [Symbol, String, Array, nil] include only the given tag(s)
+  #   and/or pattern name(s).
+  # @param except [Symbol, String, Array, nil] exclude the given tag(s)
+  #   and/or pattern name(s). May be combined with +only:+.
+  # @param placeholder [String, :tagged, :hash] replacement strategy.
+  #   A String is used verbatim. +:tagged+ produces +[REDACTED:TAGNAME]+.
+  #   +:hash+ produces a deterministic +[TAGNAME_xxxx]+ token (4-hex djb2)
+  #   so the same input value always maps to the same token.
+  # @return [String] a new string with every match replaced.
+  # @raise [ArgumentError] if +placeholder:+ is not a String/:tagged/:hash.
+  # @raise [UnknownTagError] if any Symbol in +only:+/+except:+ is not in {TAGS}.
+  # @raise [UnknownPatternError] if any String in +only:+/+except:+ is not in {pattern_names}.
+  #
+  # @example
+  #   DataRedactor.redact("token sk_live_abc123", only: :credentials)
+  #   DataRedactor.redact(text, only: [:contact, "aws_access_key_id"])
+  #   DataRedactor.redact(text, only: :contact, except: ["email"])
+  def redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
+    enable_bits = build_enable_bits(only, except)
+    ph_mode, ph_str = resolve_placeholder(placeholder)
+    _redact(text, ph_mode, ph_str, enable_bits)
+  end
+
+  # Scan +text+ and return both the redacted string and per-match metadata.
+  #
+  # Useful for auditing, false-positive tuning, and compliance pipelines.
+  # +:start+ and +:length+ are byte offsets into the *original* string, so
+  # +text.byteslice(m[:start], m[:length]) == m[:value]+.
+  #
+  # @param text [String] input string.
+  # @param only [Symbol, String, Array, nil] same semantics as {redact}.
+  # @param except [Symbol, String, Array, nil] same semantics as {redact}.
+  # @return [Hash{Symbol => Object}] +{ redacted: String, matches:
+  #   Array<Hash> }+. Each match hash has +:tag+ (Symbol), +:name+ (String),
+  #   +:value+ (String), +:start+ (Integer byte offset), +:length+ (Integer).
+  # @raise [UnknownTagError] if any Symbol in +only:+/+except:+ is not in {TAGS}.
+  # @raise [UnknownPatternError] if any String in +only:+/+except:+ is not in {pattern_names}.
+  #
+  # @example
+  #   DataRedactor.scan("user@example.com")
+  #   # => { redacted: "[REDACTED]",
+  #   #      matches: [{tag: :contact, name: "email",
+  #   #                 value: "user@example.com", start: 0, length: 16}] }
+  def scan(text, only: nil, except: nil)
+    enable_bits = build_enable_bits(only, except)
+    result = _scan(text, enable_bits)
+    # Normalise: convert tag string from C (uppercase) back to the Symbol used in TAGS
+    result[:matches].each { |m| m[:tag] = m[:tag].to_s.downcase.to_sym }
+    result
+  end
+
+  # Register a custom redaction pattern.
+  #
+  # Patterns must be valid POSIX ERE. Ruby-only syntax (+\d+, +\s+, +\w+,
|
|
167
|
+
# +\b+, lookaround, non-greedy quantifiers, named groups) is rejected
|
|
168
|
+
# at registration time, never at redaction time.
|
|
169
|
+
#
|
|
170
|
+
# If a pattern with the same +name+ is already registered, it is replaced
|
|
171
|
+
# (the old compiled +regex_t+ is freed).
|
|
172
|
+
#
|
|
173
|
+
# @param name [String] unique identifier for this pattern. Used by {remove_pattern}.
|
|
174
|
+
# @param regex [String, Regexp] POSIX ERE source. A Regexp is accepted
|
|
175
|
+
# for convenience but only its +.source+ is used; flags are ignored.
|
|
176
|
+
# @param tag [Symbol] one of {TAGS} keys. Defaults to +:custom+.
|
|
177
|
+
# @param boundary [Boolean] when true, the pattern is wrapped with
|
|
178
|
+
# +(^|[^0-9A-Za-z])(...)([^0-9A-Za-z]|$)+ so it only matches when not
|
|
179
|
+
# embedded in a longer alphanumeric token. Incompatible with patterns
|
|
180
|
+
# that contain capture groups.
|
|
181
|
+
# @return [Boolean] +true+ on success.
|
|
182
|
+
# @raise [ArgumentError] if +name+ is not a non-empty String, or +regex+
|
|
183
|
+
# is neither a String nor a Regexp.
|
|
184
|
+
# @raise [InvalidPatternError] if the pattern uses Ruby-only syntax,
|
|
185
|
+
# contains capture groups while +boundary: true+, or fails +regcomp+.
|
|
186
|
+
# @raise [UnknownTagError] if +tag+ is not in {TAGS}.
|
|
187
|
+
#
|
|
188
|
+
# @example
|
|
189
|
+
# DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
|
|
190
|
+
# DataRedactor.add_pattern(name: "internal_key",
|
|
191
|
+
# regex: /INT-[A-Z]{3}/,
|
|
192
|
+
# tag: :credentials,
|
|
193
|
+
# boundary: true)
|
|
194
|
+
def add_pattern(name:, regex:, tag: :custom, boundary: false)
|
|
195
|
+
raise ArgumentError, "name must be a non-empty String" \
|
|
196
|
+
unless name.is_a?(String) && !name.empty?
|
|
197
|
+
|
|
198
|
+
source = case regex
|
|
199
|
+
when String then regex
|
|
200
|
+
when Regexp then regex.source
|
|
201
|
+
else raise ArgumentError, "regex must be a String or Regexp, got #{regex.class}"
|
|
202
|
+
end
|
|
203
|
+
|
|
204
|
+
if source =~ RUBY_ONLY_SYNTAX_RE
|
|
205
|
+
raise InvalidPatternError,
|
|
206
|
+
"pattern #{name.inspect} uses Ruby-only syntax (#{$&.inspect}); " \
|
|
207
|
+
"use POSIX ERE — no \\d, \\s, \\w, \\b, lookaround, non-greedy, or named groups"
|
|
208
|
+
end
|
|
209
|
+
|
|
210
|
+
if boundary && source =~ CAPTURE_GROUP_RE
|
|
211
|
+
raise InvalidPatternError,
|
|
212
|
+
"pattern #{name.inspect} has capture groups and cannot use boundary: true"
|
|
213
|
+
end
|
|
214
|
+
|
|
215
|
+
tag_bit = TAGS[tag] or raise UnknownTagError,
|
|
216
|
+
"unknown tag #{tag.inspect}; valid tags: #{TAGS.keys.inspect}"
|
|
217
|
+
|
|
218
|
+
_add_pattern(name, source, tag_bit, boundary ? 1 : 0)
|
|
219
|
+
end
|
|
220
|
+
|
|
221
|
+
# Remove a previously registered custom pattern.
|
|
222
|
+
#
|
|
223
|
+
# @param name [String, Symbol] the +name+ used in {add_pattern}.
|
|
224
|
+
# @return [Boolean] +true+ if a pattern was removed, +false+ if no
|
|
225
|
+
# pattern with that name was registered.
|
|
226
|
+
def remove_pattern(name)
|
|
227
|
+
_remove_pattern(name.to_s)
|
|
228
|
+
end
|
|
229
|
+
|
|
230
|
+
# List every currently registered custom pattern.
|
|
231
|
+
#
|
|
232
|
+
# @return [Array<Hash{Symbol => Object}>] one hash per pattern with keys
|
|
233
|
+
# +:name+ (String), +:source+ (String — the POSIX ERE source),
|
|
234
|
+
# +:tag+ (Symbol), +:boundary+ (Boolean).
|
|
235
|
+
def custom_patterns
|
|
236
|
+
_custom_patterns.map do |h|
|
|
237
|
+
{ name: h[:name], source: h[:source], tag: TAGS.key(h[:tag_bit]) || :custom,
|
|
238
|
+
boundary: h[:boundary] }
|
|
239
|
+
end
|
|
240
|
+
end
|
|
241
|
+
|
|
242
|
+
# Remove every registered custom pattern.
|
|
243
|
+
#
|
|
244
|
+
# Mostly useful in test suites that need a clean slate between examples.
|
|
245
|
+
#
|
|
246
|
+
# @return [nil]
|
|
247
|
+
def clear_custom_patterns!
|
|
248
|
+
_clear_custom_patterns
|
|
249
|
+
end
|
|
250
|
+
|
|
251
|
+
# @api private
|
|
252
|
+
# Split a mixed Symbol/String filter list into +(tag_bitmask, name_set)+.
|
|
253
|
+
#
|
|
254
|
+
# @param entries [nil, Symbol, String, Array]
|
|
255
|
+
# @return [Array(Integer, Set<String>)] tag bits OR-ed together; set of
|
|
256
|
+
# pattern-name Strings.
|
|
257
|
+
# @raise [UnknownTagError] for unknown Symbols.
|
|
258
|
+
# @raise [UnknownPatternError] for unknown Strings.
|
|
259
|
+
def split_filter(entries)
|
|
260
|
+
bits = 0
|
|
261
|
+
names = Set.new
|
|
262
|
+
return [bits, names] if entries.nil?
|
|
263
|
+
Array(entries).each do |e|
|
|
264
|
+
case e
|
|
265
|
+
when Symbol
|
|
266
|
+
bit = TAGS[e] or raise UnknownTagError,
|
|
267
|
+
"unknown tag #{e.inspect}; valid tags: #{TAGS.keys.inspect}"
|
|
268
|
+
bits |= bit
|
|
269
|
+
when String
|
|
270
|
+
unless pattern_names.include?(e)
|
|
271
|
+
raise UnknownPatternError,
|
|
272
|
+
"unknown pattern name #{e.inspect}; see DataRedactor.pattern_names"
|
|
273
|
+
end
|
|
274
|
+
names << e
|
|
275
|
+
else
|
|
276
|
+
raise ArgumentError,
|
|
277
|
+
"only:/except: entries must be a Symbol (tag) or String (pattern name), got #{e.inspect}"
|
|
278
|
+
end
|
|
279
|
+
end
|
|
280
|
+
[bits, names]
|
|
281
|
+
end
|
|
282
|
+
|
|
283
|
+
# @api private
|
|
284
|
+
# Build the per-pattern enable bit-list passed to the C layer.
|
|
285
|
+
#
|
|
286
|
+
# The list has one Integer (0 or 1) per pattern in execution order:
|
|
287
|
+
# built-ins first (NUM_PATTERNS entries), then currently registered custom
|
|
288
|
+
# patterns in registration order. C iterates by index and skips zeros.
|
|
289
|
+
#
|
|
290
|
+
# Semantics of +only:+ / +except:+ — both accept a mix of Symbols (tags)
|
|
291
|
+
# and Strings (pattern names):
|
|
292
|
+
# enabled(p) iff
|
|
293
|
+
# (only is nil OR p.tag ∈ only_tags OR p.name ∈ only_names)
|
|
294
|
+
# AND p.tag ∉ except_tags AND p.name ∉ except_names
|
|
295
|
+
#
|
|
296
|
+
# @return [Array<Integer>] same length as built-ins + customs.
|
|
297
|
+
def build_enable_bits(only, except)
|
|
298
|
+
only_bits, only_names = split_filter(only)
|
|
299
|
+
except_bits, except_names = split_filter(except)
|
|
300
|
+
only_present = !only.nil?
|
|
301
|
+
|
|
302
|
+
bits = Array.new(BUILTIN_PATTERN_NAMES.length + _custom_patterns.length, 0)
|
|
303
|
+
|
|
304
|
+
BUILTIN_PATTERN_NAMES.each_with_index do |name, i|
|
|
305
|
+
tag_bit = BUILTIN_PATTERN_TAG_BITS[i]
|
|
306
|
+
bits[i] = 1 if pattern_enabled?(name, tag_bit, only_present,
|
|
307
|
+
only_bits, only_names,
|
|
308
|
+
except_bits, except_names)
|
|
309
|
+
end
|
|
310
|
+
|
|
311
|
+
_custom_patterns.each_with_index do |h, i|
|
|
312
|
+
bits[BUILTIN_PATTERN_NAMES.length + i] = 1 if pattern_enabled?(
|
|
313
|
+
h[:name], h[:tag_bit], only_present,
|
|
314
|
+
only_bits, only_names, except_bits, except_names)
|
|
315
|
+
end
|
|
316
|
+
|
|
317
|
+
bits
|
|
318
|
+
end
|
|
319
|
+
|
|
320
|
+
# @api private
|
|
321
|
+
def pattern_enabled?(name, tag_bit, only_present, only_bits, only_names,
|
|
322
|
+
except_bits, except_names)
|
|
323
|
+
return false if (tag_bit & except_bits) != 0
|
|
324
|
+
return false if except_names.include?(name)
|
|
325
|
+
return true unless only_present
|
|
326
|
+
return true if (tag_bit & only_bits) != 0
|
|
327
|
+
only_names.include?(name)
|
|
328
|
+
end
|
|
329
|
+
|
|
330
|
+
# @api private
|
|
331
|
+
# Translate the user-facing +placeholder:+ value into the +(mode_int, str)+
|
|
332
|
+
# pair the C layer expects.
|
|
333
|
+
#
|
|
334
|
+
# @param placeholder [String, :tagged, :hash]
|
|
335
|
+
# @return [Array(Integer, String)]
|
|
336
|
+
# @raise [ArgumentError] if +placeholder+ is none of the accepted values.
|
|
337
|
+
def resolve_placeholder(placeholder)
|
|
338
|
+
case placeholder
|
|
339
|
+
when :tagged then [PH_MODE_TAGGED, ""]
|
|
340
|
+
when :hash then [PH_MODE_HASH, ""]
|
|
341
|
+
when String then [PH_MODE_PLAIN, placeholder]
|
|
342
|
+
else
|
|
343
|
+
raise ArgumentError,
|
|
344
|
+
"placeholder must be a String, :tagged, or :hash — got #{placeholder.inspect}"
|
|
345
|
+
end
|
|
346
|
+
end
|
|
347
|
+
end
|
data/readme.md
ADDED
|
@@ -0,0 +1,395 @@
|
|
|
1
|
+
# DataRedactor
|
|
2
|
+
|
|
3
|
+
[Gem Version](https://rubygems.org/gems/data_redactor)
|
|
4
|
+
[CI](https://github.com/danielefrisanco/data_redactor/actions/workflows/ci.yml)
|
|
5
|
+
[License: MIT](LICENSE)
|
|
6
|
+
|
|
7
|
+
A Ruby gem with a C extension for high-performance regex-based redaction of sensitive data from strings.
|
|
8
|
+
|
|
9
|
+
## What it does
|
|
10
|
+
|
|
11
|
+
DataRedactor scans text for sensitive patterns and replaces matches with `[REDACTED]`. It uses a C extension backed by POSIX `regex.h` so the heavy lifting happens outside the Ruby VM, making it fast enough for large payloads.
|
|
12
|
+
|
|
13
|
+
## Usage
|
|
14
|
+
|
|
15
|
+
```ruby
|
|
16
|
+
require "data_redactor"
|
|
17
|
+
|
|
18
|
+
text = "User CF is RSSMRA85M01H501Z and key is AKIAIOSFODNN7EXAMPLE"
|
|
19
|
+
DataRedactor.redact(text)
|
|
20
|
+
# => "User CF is [REDACTED] and key is [REDACTED]"
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
### Filtering by tag or pattern name
|
|
24
|
+
|
|
25
|
+
`only:` and `except:` both accept a single value or an Array, mixing **Symbols** (tag names) and **Strings** (specific pattern names).
|
|
26
|
+
|
|
27
|
+
```ruby
|
|
28
|
+
DataRedactor.tags
|
|
29
|
+
# => [:credentials, :financial, :tax_id, :national_id, :contact, :network, :travel, :other, :custom]
|
|
30
|
+
|
|
31
|
+
DataRedactor.pattern_names
|
|
32
|
+
# => ["aws_s3_presigned_url", "aws_access_key_id", "email", "phone_e164", "ipv4", ...]
|
|
33
|
+
|
|
34
|
+
# Tag-level filtering
|
|
35
|
+
DataRedactor.redact(text, only: [:credentials])
|
|
36
|
+
DataRedactor.redact(text, except: :contact)
|
|
37
|
+
|
|
38
|
+
# Single specific pattern
|
|
39
|
+
DataRedactor.redact(text, only: ["aws_access_key_id"])
|
|
40
|
+
|
|
41
|
+
# Mix — every credentials pattern PLUS aws_access_key_id (even if it lived in another tag)
|
|
42
|
+
DataRedactor.redact(text, only: [:credentials, "aws_access_key_id"])
|
|
43
|
+
|
|
44
|
+
# Combine — every contact pattern EXCEPT email
|
|
45
|
+
DataRedactor.redact(text, only: :contact, except: ["email"])
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
**Precedence:** a pattern is redacted iff `(only is nil OR matches only:)` AND `(does not match except:)`. `except:` always wins when the two overlap, so `only: :contact, except: :contact` produces a no-op (everything is excluded).
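
The precedence rule can be sketched in plain Ruby. This is a minimal illustration, not the gem's internals: `Pattern` and `enabled?` are hypothetical names, and tags are compared as Symbols rather than as the bit flags the C layer uses.

```ruby
# Hypothetical stand-in for a registered pattern: a name plus the tag it belongs to.
Pattern = Struct.new(:name, :tag)

# Returns true when the pattern should run under the given filters.
# Symbols select tags, Strings select individual pattern names.
def enabled?(pattern, only: nil, except: nil)
  only_list   = Array(only)
  except_list = Array(except)

  # except: is checked first, so it wins whenever the two filters overlap.
  return false if except_list.include?(pattern.tag) || except_list.include?(pattern.name)
  return true if only.nil?

  only_list.include?(pattern.tag) || only_list.include?(pattern.name)
end

email = Pattern.new("email", :contact)
phone = Pattern.new("phone_e164", :contact)

enabled?(email, only: :contact, except: ["email"])  # => false
enabled?(phone, only: :contact, except: ["email"])  # => true
enabled?(phone, only: :contact, except: :contact)   # => false
```

Checking `except:` before `only:` is what makes the overlap case a no-op: nothing that `except:` names can be re-enabled by `only:`.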
|
|
49
|
+
|
|
50
|
+
**Errors:** an unknown tag Symbol raises `DataRedactor::UnknownTagError`; an unknown pattern name String raises `DataRedactor::UnknownPatternError`.
|
|
51
|
+
|
|
52
|
+
### Configurable placeholder
|
|
53
|
+
|
|
54
|
+
By default every match is replaced with `[REDACTED]`. Use the `placeholder:` keyword to change this:
|
|
55
|
+
|
|
56
|
+
```ruby
|
|
57
|
+
# Plain string — any replacement text
|
|
58
|
+
DataRedactor.redact(text, placeholder: "***")
|
|
59
|
+
DataRedactor.redact(text, placeholder: "")
|
|
60
|
+
|
|
61
|
+
# Tagged — embeds the pattern's tag name so you know what was redacted
|
|
62
|
+
DataRedactor.redact(text, placeholder: :tagged)
|
|
63
|
+
# "user@example.com" → "[REDACTED:CONTACT]"
|
|
64
|
+
# "AKIAIOSFODNN7EXAMPLE" → "[REDACTED:CREDENTIALS]"
|
|
65
|
+
# "DE89370400440532013000" → "[REDACTED:FINANCIAL]"
|
|
66
|
+
|
|
67
|
+
# Hash — deterministic 4-hex suffix of the matched value
|
|
68
|
+
# Same value always produces the same token — useful for correlating
|
|
69
|
+
# redactions across log lines without leaking the original.
|
|
70
|
+
DataRedactor.redact(text, placeholder: :hash)
|
|
71
|
+
# "user@example.com" → "[CONTACT_3d7a]"
|
|
72
|
+
# "user@example.com" → "[CONTACT_3d7a]" (same every time)
|
|
73
|
+
# "other@example.com" → "[CONTACT_91fc]" (different value, different hash)
|
|
74
|
+
```
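
For intuition about why `:hash` tokens are stable, here is a sketch of a djb2-based token in plain Ruby. The helper names are hypothetical, and the way the hash is folded down to four hex digits (keeping the low 16 bits) is an assumption, so the tokens it prints will not necessarily match the gem's output.

```ruby
# Classic djb2: start at 5381, then hash = hash * 33 + byte for every byte.
# The mask keeps the running value within 32 bits, as a C unsigned int would.
def djb2(value)
  value.each_byte.reduce(5381) { |hash, byte| ((hash << 5) + hash + byte) & 0xFFFFFFFF }
end

# Assumption: keep the low 16 bits and print them as 4 lowercase hex digits.
def hash_token(tag, value)
  format("[%s_%04x]", tag.to_s.upcase, djb2(value) & 0xFFFF)
end

first  = hash_token(:contact, "user@example.com")
second = hash_token(:contact, "user@example.com")
first == second  # => true, the same input always yields the same token
```

A 16-bit suffix is enough to correlate values within a log stream, but two different values can collide, so the token should not be treated as a unique identifier.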
|
|
75
|
+
|
|
76
|
+
All three modes compose with `only:` and `except:`:
|
|
77
|
+
|
|
78
|
+
```ruby
|
|
79
|
+
DataRedactor.redact(text, only: :contact, placeholder: :tagged)
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
### Scan / dry-run mode
|
|
83
|
+
|
|
84
|
+
`DataRedactor.scan` returns every match alongside the redacted string — useful for auditing, tuning false positives, and compliance pipelines:
|
|
85
|
+
|
|
86
|
+
```ruby
|
|
87
|
+
result = DataRedactor.scan("User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1")
|
|
88
|
+
# => {
|
|
89
|
+
# redacted: "User [REDACTED] logged in from [REDACTED]",
|
|
90
|
+
# matches: [
|
|
91
|
+
# { tag: :credentials, name: "aws_access_key_id", value: "AKIAIOSFODNN7EXAMPLE", start: 5, length: 20 },
|
|
92
|
+
# { tag: :network, name: "ipv4", value: "192.168.1.1", start: 35, length: 11 }
|
|
93
|
+
# ]
|
|
94
|
+
# }
|
|
95
|
+
|
|
96
|
+
# :start and :length are byte offsets into the original string (original_text below is the String passed to scan)
|
|
97
|
+
m = result[:matches].first
|
|
98
|
+
original_text.byteslice(m[:start], m[:length]) # => "AKIAIOSFODNN7EXAMPLE"
|
|
99
|
+
|
|
100
|
+
# Accepts the same filters as redact (tags + specific pattern names)
|
|
101
|
+
DataRedactor.scan(text, only: :credentials)
|
|
102
|
+
DataRedactor.scan(text, except: :network)
|
|
103
|
+
DataRedactor.scan(text, only: :contact, except: ["email"])
|
|
104
|
+
```
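
Byte offsets and character offsets diverge as soon as the input contains multibyte characters. The sketch below uses only core `String` methods to show the difference; it computes the byte offset by hand as a stand-in for the `:start` value that `scan` reports.

```ruby
text  = "caf\u00e9 AKIAIOSFODNN7EXAMPLE"     # "é" occupies 2 bytes in UTF-8
value = "AKIAIOSFODNN7EXAMPLE"

char_start = text.index(value)               # character offset of the match
byte_start = text[0, char_start].bytesize    # byte offset of the same position

text.byteslice(byte_start, value.bytesize)   # => "AKIAIOSFODNN7EXAMPLE"
text[byte_start, value.bytesize]             # => "KIAIOSFODNN7EXAMPLE"
```

Indexing with `String#[]` treats the numbers as character positions and lands one character too far, which is why `byteslice` is the right accessor for these offsets.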
|
|
105
|
+
|
|
106
|
+
### Custom patterns
|
|
107
|
+
|
|
108
|
+
Teams often have internal IDs that the gem can't ship. Register them at boot:
|
|
109
|
+
|
|
110
|
+
```ruby
|
|
111
|
+
# String (POSIX ERE) or Regexp — both accepted
|
|
112
|
+
DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
|
|
113
|
+
DataRedactor.add_pattern(name: "ticket_ref", regex: /TICKET-[A-Z]{2}[0-9]{4}/, boundary: true)
|
|
114
|
+
|
|
115
|
+
# Custom patterns are tagged :custom by default; pass any built-in tag to group differently
|
|
116
|
+
DataRedactor.add_pattern(name: "internal_key", regex: "INT-[A-Z]{3}", tag: :credentials)
|
|
117
|
+
|
|
118
|
+
DataRedactor.redact(text) # runs all patterns including custom
|
|
119
|
+
DataRedactor.redact(text, only: [:custom]) # only user patterns
|
|
120
|
+
DataRedactor.redact(text, only: [:custom, :credentials]) # mix
|
|
121
|
+
|
|
122
|
+
DataRedactor.custom_patterns # => [{name:, source:, tag:, boundary:}, ...]
|
|
123
|
+
DataRedactor.remove_pattern("employee_id")
|
|
124
|
+
DataRedactor.clear_custom_patterns! # mostly for test suites
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
**Regex rules** — patterns must be POSIX ERE (the same engine used for built-ins). Not supported: `\d`, `\s`, `\w`, `\b`, lookahead/lookbehind, non-greedy quantifiers, named groups. Violations raise `DataRedactor::InvalidPatternError` at registration time, never at redaction time. Use `[0-9]` instead of `\d`, `[[:space:]]` instead of `\s`, etc.
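
When porting an existing Ruby pattern, the shorthand classes can be rewritten mechanically. The helper below is a hypothetical sketch, not part of the gem. It only handles `\d`, `\s`, and `\w` outside bracket expressions; `\b`, lookaround, and non-greedy quantifiers have no direct POSIX ERE equivalent and still need a manual rewrite.

```ruby
# Hypothetical helper, not part of the gem.
POSIX_EQUIVALENTS = {
  "\\d" => "[0-9]",
  "\\s" => "[[:space:]]",
  "\\w" => "[[:alnum:]_]"
}.freeze

# Replaces each Ruby shorthand class with its POSIX bracket expression.
# Caveat: a shorthand inside an existing [...] would produce nested brackets.
def to_posix_ere(source)
  source.gsub(/\\[dsw]/, POSIX_EQUIVALENTS)
end

to_posix_ere('EMP-\d{6}')              # => "EMP-[0-9]{6}"
to_posix_ere('key:\s*\w+')             # => "key:[[:space:]]*[[:alnum:]_]+"
to_posix_ere(/TICKET-\d{4}/.source)    # => "TICKET-[0-9]{4}"
```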
|
|
128
|
+
|
|
129
|
+
**`boundary: true`** — wraps the pattern with `(^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$)` so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
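
The effect of the wrapper can be reproduced with Ruby's own regex engine. This is a sketch under assumptions: `wrap_boundary` and `redact_bounded` are hypothetical Ruby helpers, and `\A`/`\z` stand in for `^`/`$`, which in Ruby would also match at line breaks.

```ruby
# Hypothetical helper mirroring the wrapper described above.
def wrap_boundary(source)
  /(\A|[^0-9A-Za-z])(#{source})([^0-9A-Za-z]|\z)/
end

def redact_bounded(text, source, placeholder = "[REDACTED]")
  # Groups 1 and 3 hold the boundary characters, so they are written back unchanged.
  text.gsub(wrap_boundary(source)) { "#{$1}#{placeholder}#{$3}" }
end

redact_bounded("id EMP-123456.", "EMP-[0-9]{6}")  # => "id [REDACTED]."
redact_bounded("XEMP-1234567", "EMP-[0-9]{6}")    # => "XEMP-1234567"
```

The sketch also suggests why capture groups are disallowed: an extra group inside the pattern would shift the numbering, and group 3 would no longer be the trailing boundary. Note that this simplified version consumes the boundary character, so two tokens separated by a single space would not both match.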
|
|
130
|
+
|
|
131
|
+
## Integrations
|
|
132
|
+
|
|
133
|
+
Optional adapters for Logger, Rails, and Rack. None are loaded automatically — `require` only what you use, and the gem adds zero runtime dependencies in the gemspec.
|
|
134
|
+
|
|
135
|
+
### Logger formatter
|
|
136
|
+
|
|
137
|
+
Drop-in `Logger::Formatter` replacement that scrubs every emitted line:
|
|
138
|
+
|
|
139
|
+
```ruby
|
|
140
|
+
require "data_redactor/integrations/logger"
|
|
141
|
+
|
|
142
|
+
logger = Logger.new($stdout)
|
|
143
|
+
logger.formatter = DataRedactor::Integrations::Logger.new
|
|
144
|
+
logger.info("Auth failed for alice@example.com")
|
|
145
|
+
# => I, [...] -- : Auth failed for [REDACTED]
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
Wraps an inner formatter (defaults to `Logger::Formatter`), so it composes with structured loggers. Forwards `only:`, `except:`, `placeholder:` to `DataRedactor.redact`. Exception messages and arbitrary objects are scrubbed too — the wrapped object is passed unchanged to the inner formatter so the exception cause chain is preserved; only the rendered string is redacted.
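
The wrapping design can be sketched with the standard library alone. `ScrubbingFormatter` and `EMAIL_SCRUB` are hypothetical names; the lambda is a crude stand-in for `DataRedactor.redact` so the example runs without the gem.

```ruby
require "logger"

# Stand-in for DataRedactor.redact: replaces anything that looks like an email.
EMAIL_SCRUB = lambda do |line|
  line.gsub(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, "[REDACTED]")
end

class ScrubbingFormatter
  def initialize(inner = Logger::Formatter.new, scrub: EMAIL_SCRUB)
    @inner = inner
    @scrub = scrub
  end

  # Logger calls its formatter with (severity, time, progname, msg).
  # The message object goes to the inner formatter untouched; only the
  # rendered line is scrubbed.
  def call(severity, time, progname, msg)
    @scrub.call(@inner.call(severity, time, progname, msg))
  end
end

logger = Logger.new($stdout)
logger.formatter = ScrubbingFormatter.new
logger.info("Auth failed for alice@example.com")
```

Scrubbing the rendered line rather than the message object means exceptions and arbitrary objects are covered by whatever the inner formatter prints for them.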
|
|
149
|
+
|
|
150
|
+
### Rails `filter_parameters` adapter
|
|
151
|
+
|
|
152
|
+
```ruby
|
|
153
|
+
# config/initializers/filter_parameter_logging.rb
|
|
154
|
+
require "data_redactor/integrations/rails"
|
|
155
|
+
|
|
156
|
+
Rails.application.config.filter_parameters += [
|
|
157
|
+
DataRedactor::Integrations::Rails.filter
|
|
158
|
+
]
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
Returns a `(key, value)` proc compatible with Rails' parameter filter. String values are mutated in place via `String#replace` so Rails sees the redacted value. Non-strings are left alone. Accepts the same `only:`/`except:`/`placeholder:` kwargs.
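
The shape of such a filter proc can be sketched without Rails. `build_param_filter` is a hypothetical builder, the scrub lambda stands in for `DataRedactor.redact`, and the frozen-string guard is an assumption added for safety rather than something the gem documents.

```ruby
# Hypothetical builder; returns a (key, value) lambda in the shape Rails accepts.
def build_param_filter(scrub)
  lambda do |_key, value|
    # Mutate in place so the caller's String object carries the redacted text.
    # Non-strings (and, as an assumption, frozen strings) are left alone.
    value.replace(scrub.call(value)) if value.is_a?(String) && !value.frozen?
  end
end

scrub  = ->(s) { s.gsub(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, "[REDACTED]") }
filter = build_param_filter(scrub)

note = +"contact alice@example.com"
filter.call("note", note)
note  # => "contact [REDACTED]"
```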
|
|
162
|
+
|
|
163
|
+
### Rack middleware
|
|
164
|
+
|
|
165
|
+
```ruby
|
|
166
|
+
# config.ru
|
|
167
|
+
require "data_redactor/integrations/rack"
|
|
168
|
+
|
|
169
|
+
use DataRedactor::Integrations::Rack, scrub: [:body, :headers]
|
|
170
|
+
run MyApp
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
`scrub:` selects which surfaces to redact (default `[:body, :headers]`):
|
|
174
|
+
|
|
175
|
+
- **`:body`** — buffers the response body, runs `DataRedactor.redact` over it, returns it as a single chunk. Drops the `Content-Length` header so the server recomputes (the redacted body may differ in byte length).
|
|
176
|
+
- **`:headers`** — scrubs sensitive **response** headers (`Set-Cookie`, `Authorization`, `X-Api-Key`, `X-Auth-Token`, `X-Access-Token`) in place, and sensitive **request** headers (`HTTP_AUTHORIZATION`, `HTTP_PROXY_AUTHORIZATION`, `HTTP_COOKIE`, `HTTP_X_API_KEY`, `HTTP_X_AUTH_TOKEN`, `HTTP_X_ACCESS_TOKEN`) in the env hash so any downstream middleware that logs them sees redacted values.
|
|
177
|
+
|
|
178
|
+
Pass a subset (e.g. `scrub: [:headers]`) to opt out of body wrapping. Forwards `only:`/`except:`/`placeholder:` to `DataRedactor.redact`. Unknown surfaces raise `ArgumentError` at boot.
|
|
179
|
+
|
|
180
|
+
> **Body wrapping is buffering.** The middleware reads the entire response body into memory before scanning. For streaming endpoints (SSE, large file downloads, Rack::Hijack) use `scrub: [:headers]` and rely on the Logger formatter for application logs instead.
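
To make the buffering cost concrete, here is a sketch of a body-scrubbing middleware written against the plain Rack calling convention (no `rack` gem required). `BodyScrubber` is a hypothetical class and the `scrub:` lambda stands in for `DataRedactor.redact`; the gem's own middleware may differ in detail.

```ruby
class BodyScrubber
  def initialize(app, scrub:)
    @app = app
    @scrub = scrub
  end

  def call(env)
    status, headers, body = @app.call(env)

    # Read every chunk into memory; this is the step that breaks streaming.
    buffered = +""
    body.each { |chunk| buffered << chunk }
    body.close if body.respond_to?(:close)

    # The redacted body may differ in byte length, so the stale header is dropped.
    headers = headers.reject { |name, _| name.to_s.casecmp?("content-length") }

    [status, headers, [@scrub.call(buffered)]]
  end
end

app = lambda do |_env|
  [200, { "content-type" => "text/plain", "content-length" => "22" }, ["mail ", "alice@example.com"]]
end
scrub = ->(s) { s.gsub(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, "[REDACTED]") }

status, headers, body = BodyScrubber.new(app, scrub: scrub).call({})
body  # => ["mail [REDACTED]"]
```

Because the chunks are joined before scanning, a secret split across two chunks is still caught, at the price of holding the whole response in memory.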
|
|
181
|
+
|
|
182
|
+
## Detected patterns (85 total)
|
|
183
|
+
|
|
184
|
+
The table below is a representative sample. Use `DataRedactor.pattern_names` for the canonical, machine-readable list — it stays in sync with the C extension automatically.
|
|
185
|
+
|
|
186
|
+
### Cloud & API secrets
|
|
187
|
+
|
|
188
|
+
| # | Pattern | Example |
|
|
189
|
+
|---|---|---|
|
|
190
|
+
| — | AWS Access Key ID | `AKIAIOSFODNN7EXAMPLE` |
|
|
191
|
+
| — | AWS Secret Access Key | 40-character base64 string |
|
|
192
|
+
| — | Google API Key | `AIzaSyXXXX...` |
|
|
193
|
+
| — | GitHub Personal Access Token | `github_pat_XXXX...` |
|
|
194
|
+
| — | GitHub Classic PAT / OAuth | `ghp_XXXX...` / `gho_XXXX...` |
|
|
195
|
+
| — | Slack Webhook URL | `https://hooks.slack.com/services/T.../B.../...` |
|
|
196
|
+
| — | Stripe Secret Key | `sk_live_XXXX...` |
|
|
197
|
+
| — | Anthropic API Key | `sk-ant-api03-XXXX...` |
|
|
198
|
+
| — | OpenAI Project API Key | `sk-proj-XXXX...` |
|
|
199
|
+
| — | GitLab Personal Access Token | `glpat-XXXX...` |
|
|
200
|
+
| — | DigitalOcean PAT | `dop_v1_XXXX...` |
|
|
201
|
+
| — | Databricks API Token | `dapiXXXX...` |
|
|
202
|
+
| — | Sentry DSN | `https://KEY@oNNN.ingest.sentry.io/PID` |
|
|
203
|
+
| — | PEM Private Key header | `-----BEGIN RSA PRIVATE KEY-----` |
|
|
204
|
+
| — | Scaleway Access Key | `SCW12345ABCDE6789FGHIJ` |
|
|
205
|
+
| — | UUID v4 / Scaleway Secret Key | `550e8400-e29b-41d4-a716-446655440000` |
|
|
206
|
+
|
|
207
|
+
### Travel documents
|
|
208
|
+
|
|
209
|
+
| # | Pattern | Example |
|
|
210
|
+
|---|---|---|
|
|
211
|
+
| 2 | Italian Codice Fiscale (basic) | `RSSMRA85M01H501Z` |
|
|
212
|
+
| 3 | Passport — letter prefix + digits | `AB1234567` |
|
|
213
|
+
| 4 | Passport — 9 consecutive digits ¹ | `123456789` |
|
|
214
|
+
| 22 | Italian Codice Fiscale (omocodia) | `RSSMRALPMNLH5LMZ` |
|
|
215
|
+
|
|
216
|
+
### Payment & network
|
|
217
|
+
|
|
218
|
+
| # | Pattern | Example |
|
|
219
|
+
|---|---|---|
|
|
220
|
+
| 11 | Credit card — Visa, Mastercard, Amex, Discover, JCB | `4111111111111111` |
|
|
221
|
+
| 12 | IPv4 address | `192.168.1.100` |
|
|
222
|
+
|
|
223
|
+
### IBANs
|
|
224
|
+
|
|
225
|
+
| # | Country | Example |
|
|
226
|
+
|---|---|---|
|
|
227
|
+
| 10 | Italy | `IT60X0542811101000000123456` |
|
|
228
|
+
| 15 | France | `FR7630006000011234567890189` |
|
|
229
|
+
| 16 | Germany | `DE89370400440532013000` |
|
|
230
|
+
| 17 | Spain | `ES9121000418450200051332` |
|
|
231
|
+
| 18 | Netherlands | `NL91ABNA0417164300` |
|
|
232
|
+
| 19 | Belgium | `BE68539007547034` |
|
|
233
|
+
| 20 | Portugal | `PT50000201231234567890154` |
|
|
234
|
+
| 21 | Ireland | `IE29AIBK93115212345678` |
|
|
235
|
+
| 28 | Sweden | `SE4550000000058398257466` |
|
|
236
|
+
| 29 | Denmark | `DK5000400440116243` |
|
|
237
|
+
| 30 | Norway | `NO9386011117947` |
|
|
238
|
+
| 31 | Finland | `FI2112345600000785` |
|
|
239
|
+
| 37 | Poland | `PL61109010140000071219812874` |
|
|
240
|
+
| 38 | Austria | `AT611904300234573201` |
|
|
241
|
+
| 39 | Switzerland | `CH9300762011623852957` |
|
|
242
|
+
| 40 | Czechia | `CZ6508000000192000145399` |
|
|
243
|
+
| 41 | Hungary | `HU42117730161111101800000000` |
|
|
244
|
+
| 42 | Romania | `RO49AAAA1B31007593840000` |
|
|
245
|
+
|
|
246
|
+
### National personal identifiers
|
|
247
|
+
|
|
248
|
+
| # | Country | Type | Example |
|
|
249
|
+
|---|---|---|---|
|
|
250
|
+
| 23 | France | NIR / Social Security ¹ | `185126203450342` |
|
|
251
|
+
| 24 | Spain | DNI ¹ | `12345678Z` |
|
|
252
|
+
| 25 | Spain | NIE | `X1234567L` |
|
|
253
|
+
| 26 | Netherlands | BSN ¹ | `123456789` |
|
|
254
|
+
| 27 | Poland | PESEL ¹ | `85121612345` |
|
|
255
|
+
| 32 | Belgium | National Number ¹ | `85121612345` |
|
|
256
|
+
| 33 | Sweden | Personnummer ¹ | `850101-1234` |
|
|
257
|
+
| 34 | Denmark | CPR Number ¹ | `010185-1234` |
|
|
258
|
+
| 35 | Norway | Fødselsnummer ¹ | `01018512345` |
|
|
259
|
+
| 36 | Finland | HETU ¹ | `010185-123A` |
|
|
260
|
+
| 43 | Poland | PESEL (alt slot) ¹ | `90010112345` |
|
|
261
|
+
| 44 | Austria | Abgabenkontonummer ¹ | `123456789` |
|
|
262
|
+
| 45 | Switzerland | AHV Number ¹ | `756.1234.5678.90` |
|
|
263
|
+
| 46 | Czechia | Rodné číslo ¹ | `856121/1234` |
|
|
264
|
+
| 47 | Hungary | Tax ID ¹ | `8012345678` |
|
|
265
|
+
| 48 | Romania | CNP ¹ | `1850101123456` |
|
|
266
|
+
|
|
267
|
+
> ¹ **Word-boundary protected** — these patterns are wrapped with `(^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$)` at compile time so they do not fire when the digit sequence appears inside a longer alphanumeric token.
|
|
268
|
+
|
|
269
|
+
## Directory structure
|
|
270
|
+
|
|
271
|
+
```
|
|
272
|
+
redactor/
|
|
273
|
+
├── data_redactor.gemspec
|
|
274
|
+
├── Gemfile
|
|
275
|
+
├── Rakefile
|
|
276
|
+
├── lib/
|
|
277
|
+
│ ├── data_redactor.rb # Ruby entry point, loads the .so
|
|
278
|
+
│ └── data_redactor/
|
|
279
|
+
│ └── version.rb
|
|
280
|
+
├── ext/
|
|
281
|
+
│ └── data_redactor/
|
|
282
|
+
│ ├── extconf.rb # Checks for C headers, generates Makefile (globs *.c)
|
|
283
|
+
│ ├── data_redactor.c # Entry point: Init_data_redactor only
|
|
284
|
+
│ ├── patterns.{c,h} # Built-in pattern table + compiled regex_t array
|
|
285
|
+
│ ├── placeholder.{c,h} # write_placeholder, djb2 hash, tag_name_for_bit
|
|
286
|
+
│ ├── redact.{c,h} # _redact + replace_all_matches + wrap_boundary
|
|
287
|
+
│ ├── scan.{c,h} # _scan + byte-offset replacement-log macros
|
|
288
|
+
│ ├── custom_patterns.{c,h} # Dynamic registry: add/remove/clear/list
|
|
289
|
+
│ └── tags.h # TAG_* bit constants
|
|
290
|
+
└── spec/
|
|
291
|
+
└── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
|
|
292
|
+
```
|
|
293
|
+
|
|
294
|
+
## Requirements
|
|
295
|
+
|
|
296
|
+
- Ruby >= 2.7
|
|
297
|
+
- A C compiler (`gcc` or `clang`) — only required when installing the source gem
|
|
298
|
+
- POSIX `regex.h` — only required when installing the source gem (standard on Linux and macOS)
|
|
299
|
+
|
|
300
|
+
## Installation
|
|
301
|
+
|
|
302
|
+
```ruby
|
|
303
|
+
# Gemfile
|
|
304
|
+
gem "data_redactor"
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
```bash
|
|
308
|
+
bundle install
|
|
309
|
+
```
|
|
310
|
+
|
|
311
|
+
That's it — there is nothing extra to configure for precompiled binaries. Bundler/RubyGems looks at your platform and Ruby version and picks the right gem automatically.
|
|
312
|
+
|
|
313
|
+
### What you'll see
|
|
314
|
+
|
|
315
|
+
- **On a supported platform** (Linux glibc/musl, macOS Intel/ARM): bundler downloads a precompiled gem with the C extension already built. Install is near-instant — **no compiler, no `make`, no `regex.h` headers needed**. Especially valuable in slim Docker images (`ruby:3.x-alpine`, `ruby:3.x-slim`) that don't ship `gcc`.
|
|
316
|
+
- **On any other platform** (FreeBSD, OpenBSD, etc.): bundler downloads the source gem and compiles the C extension on install — the same behavior as before 0.7.1. You'll need a C compiler and POSIX `regex.h` available.
|
|
317
|
+
|
|
318
|
+
### Supported precompiled targets
|
|
319
|
+
|
|
320
|
+
Each precompiled gem ships compiled binaries for Ruby 3.0, 3.1, 3.2, 3.3, 3.4, and 4.0 (one `data_redactor.so` per ABI directory under `lib/data_redactor/`).
|
|
321
|
+
|
|
322
|
+
| Platform | Targets |
|
|
323
|
+
|---|---|
|
|
324
|
+
| Linux (glibc) | `x86_64-linux`, `aarch64-linux` |
|
|
325
|
+
| Linux (musl / Alpine) | `x86_64-linux-musl`, `aarch64-linux-musl` |
|
|
326
|
+
| macOS | `x86_64-darwin` (Intel), `arm64-darwin` (Apple Silicon) |
|
|
327
|
+
|
|
328
|
+
### Bundler-locked deploys
|
|
329
|
+
|
|
330
|
+
If your `Gemfile.lock` was generated on one platform but you deploy to another, run `bundle lock --add-platform <target>` so bundler resolves the right native gem at deploy time. Example for Alpine deploys built from a glibc dev box:
|
|
331
|
+
|
|
332
|
+
```bash
|
|
333
|
+
bundle lock --add-platform x86_64-linux-musl aarch64-linux-musl
|
|
334
|
+
```
|
|
335
|
+
|
|
336
|
+
## Compile the C extension (source / development install only)
|
|
337
|
+
|
|
338
|
+
```bash
|
|
339
|
+
bundle exec rake compile
|
|
340
|
+
```
|
|
341
|
+
|
|
342
|
+
This runs `extconf.rb` via `rake-compiler`, which generates a `Makefile` and compiles the C sources in `ext/data_redactor/` into a `.so` shared library placed under `lib/data_redactor/`.
|
|
343
|
+
|
|
344
|
+
## Building precompiled gems locally
|
|
345
|
+
|
|
346
|
+
Maintainers can rebuild the full set of native gems with one command (requires Docker):
|
|
347
|
+
|
|
348
|
+
```bash
|
|
349
|
+
bundle exec rake gem:all
|
|
350
|
+
```
|
|
351
|
+
|
|
352
|
+
This invokes `rake-compiler-dock` to cross-compile every supported (platform × Ruby ABI) combination. Output lands in `pkg/`.
|
|
353
|
+
|
|
354
|
+
## Run the tests
|
|
355
|
+
|
|
356
|
+
```bash
|
|
357
|
+
bundle exec rake spec
|
|
358
|
+
```
|
|
359
|
+
|
|
360
|
+
Or compile and test in one step:
|
|
361
|
+
|
|
362
|
+
```bash
|
|
363
|
+
bundle exec rake
|
|
364
|
+
```
|
|
365
|
+
|
|
366
|
+
## How it works
|
|
367
|
+
|
|
368
|
+
1. At load time, `Init_data_redactor` compiles all 85 regex patterns once using `regcomp` (POSIX ERE) and stores them as static `regex_t` structs. Patterns marked as boundary-wrapped are expanded with `wrap_boundary()` before compilation.
|
|
369
|
+
2. `DataRedactor.redact(text)` receives a Ruby `String`, converts it to a C `char*` via `StringValueCStr`, and runs each compiled pattern in sequence on a working buffer.
|
|
370
|
+
3. For each pattern, `replace_all_matches` iterates using `regexec`, copies non-matching segments to a fresh output buffer, and inserts `[REDACTED]` in place of each match. For boundary-wrapped patterns, `regexec` is called with `nmatch=4` and sub-match groups `[1]`/`[3]` identify the boundary characters so they are preserved verbatim.
|
|
371
|
+
4. The output buffer is grown with `realloc` as needed. After all patterns are applied the result is returned as a Ruby `String` via `rb_str_new_cstr`. All intermediate `malloc`/`strdup` allocations are explicitly `free`d.
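
The loop in step 3 can be sketched in Ruby to show the control flow. This is an illustration only: the real work happens in C with `regexec`, and the two regexes below are simplified examples rather than the gem's built-in patterns.

```ruby
# Copies non-matching segments to a fresh buffer and inserts the placeholder
# in place of each match, advancing past the match each time.
def replace_all_matches(input, regex, placeholder = "[REDACTED]")
  output = +""
  pos = 0
  while (m = regex.match(input, pos))
    break if m.end(0) == m.begin(0) # stop on an empty match rather than loop forever
    output << input[pos...m.begin(0)] << placeholder
    pos = m.end(0)
  end
  output << input[pos..]
end

# Each pattern runs in sequence over the output of the previous one.
def redact_all(input, regexes)
  regexes.reduce(input) { |buffer, regex| replace_all_matches(buffer, regex) }
end

AWS_KEY = /AKIA[0-9A-Z]{16}/
IPV4    = /[0-9]{1,3}(\.[0-9]{1,3}){3}/

redact_all("User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1", [AWS_KEY, IPV4])
# => "User [REDACTED] logged in from [REDACTED]"
```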
|
|
372
|
+
|
|
373
|
+
## Memory management
|
|
374
|
+
|
|
375
|
+
All C-side buffers are heap-allocated with `malloc`/`strdup` and freed before the function returns. The only Ruby-managed allocation is the final return value from `rb_str_new_cstr`. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
|
|
376
|
+
|
|
377
|
+
## Thread safety
|
|
378
|
+
|
|
379
|
+
`DataRedactor.redact` and `DataRedactor.scan` are safe to call concurrently from multiple threads. Built-in patterns are compiled into a static `regex_t` array at load time and never mutated afterward, and each call allocates its own working buffers. POSIX `regexec` is documented as thread-safe.
|
|
380
|
+
|
|
381
|
+
`DataRedactor.add_pattern`, `remove_pattern`, and `clear_custom_patterns!` mutate a shared dynamic array and are **not** thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent `redact`/`scan` call.
|
|
382
|
+
|
|
383
|
+
## Versioning
|
|
384
|
+
|
|
385
|
+
This project follows [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html). Until `1.0.0`, minor versions may introduce breaking changes; from `1.0.0` onward, breaking changes will only land in major versions. See [CHANGELOG.md](CHANGELOG.md) for the release history.
|
|
386
|
+
|
|
387
|
+
## License
|
|
388
|
+
|
|
389
|
+
Released under the [MIT License](LICENSE).
|
|
390
|
+
|
|
391
|
+
## Known limitations
|
|
392
|
+
|
|
393
|
+
- **Pattern ordering matters** — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
|
|
394
|
+
- **AWS Secret Key (pattern 1)** — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
|
|
395
|
+
- **Duplicate digit patterns** — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
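
The pattern-ordering limitation above can be reproduced with plain Ruby regular expressions. This is a minimal sketch under stated assumptions: the two patterns and the `[REDACTED]` placeholder are illustrative and are not the gem's actual expressions, and Ruby lookarounds stand in for the boundary wrapping that the C extension applies (POSIX regex has no lookaround syntax).

```ruby
# Illustrative patterns only; not the expressions shipped with the gem.
NAIVE_PASSPORT = /\d{9}/
NAIVE_CARD     = /\d{16}/

# Same patterns wrapped so a match cannot start or end inside a longer digit run.
BOUNDED_PASSPORT = /(?<!\d)\d{9}(?!\d)/
BOUNDED_CARD     = /(?<!\d)\d{16}(?!\d)/

# Applies each pattern in order, feeding the output of one into the next.
def redact_in_order(text, patterns)
  patterns.reduce(text) { |acc, re| acc.gsub(re, "[REDACTED]") }
end

input = "card 4111111111111111"

# The broad 9-digit pattern runs first and eats part of the card number,
# so the 16-digit pattern no longer finds anything to match.
naive = redact_in_order(input, [NAIVE_PASSPORT, NAIVE_CARD])

# With boundaries, the 9-digit pattern skips the 16-digit run entirely.
bounded = redact_in_order(input, [BOUNDED_PASSPORT, BOUNDED_CARD])

puts naive    # card [REDACTED]1111111
puts bounded  # card [REDACTED]
```

Boundaries only help when the competing patterns differ in length. Two patterns with the same digit count, such as the 11-digit national IDs, still match the same input regardless of order.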
|
metadata
ADDED
|
@@ -0,0 +1,122 @@
|
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
|
2
|
+
name: data_redactor
|
|
3
|
+
version: !ruby/object:Gem::Version
|
|
4
|
+
version: 0.7.2
|
|
5
|
+
platform: x86_64-linux
|
|
6
|
+
authors:
|
|
7
|
+
- Daniele Frisanco
|
|
8
|
+
bindir: bin
|
|
9
|
+
cert_chain: []
|
|
10
|
+
date: 1980-01-02 00:00:00.000000000 Z
|
|
11
|
+
dependencies:
|
|
12
|
+
- !ruby/object:Gem::Dependency
|
|
13
|
+
name: rake-compiler-dock
|
|
14
|
+
requirement: !ruby/object:Gem::Requirement
|
|
15
|
+
requirements:
|
|
16
|
+
- - "~>"
|
|
17
|
+
- !ruby/object:Gem::Version
|
|
18
|
+
version: '1.5'
|
|
19
|
+
type: :development
|
|
20
|
+
prerelease: false
|
|
21
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
22
|
+
requirements:
|
|
23
|
+
- - "~>"
|
|
24
|
+
- !ruby/object:Gem::Version
|
|
25
|
+
version: '1.5'
|
|
26
|
+
- !ruby/object:Gem::Dependency
|
|
27
|
+
name: rspec
|
|
28
|
+
requirement: !ruby/object:Gem::Requirement
|
|
29
|
+
requirements:
|
|
30
|
+
- - "~>"
|
|
31
|
+
- !ruby/object:Gem::Version
|
|
32
|
+
version: '3.12'
|
|
33
|
+
type: :development
|
|
34
|
+
prerelease: false
|
|
35
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
36
|
+
requirements:
|
|
37
|
+
- - "~>"
|
|
38
|
+
- !ruby/object:Gem::Version
|
|
39
|
+
version: '3.12'
|
|
40
|
+
- !ruby/object:Gem::Dependency
|
|
41
|
+
name: yard
|
|
42
|
+
requirement: !ruby/object:Gem::Requirement
|
|
43
|
+
requirements:
|
|
44
|
+
- - "~>"
|
|
45
|
+
- !ruby/object:Gem::Version
|
|
46
|
+
version: '0.9'
|
|
47
|
+
type: :development
|
|
48
|
+
prerelease: false
|
|
49
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
50
|
+
requirements:
|
|
51
|
+
- - "~>"
|
|
52
|
+
- !ruby/object:Gem::Version
|
|
53
|
+
version: '0.9'
|
|
54
|
+
- !ruby/object:Gem::Dependency
|
|
55
|
+
name: rack
|
|
56
|
+
requirement: !ruby/object:Gem::Requirement
|
|
57
|
+
requirements:
|
|
58
|
+
- - ">="
|
|
59
|
+
- !ruby/object:Gem::Version
|
|
60
|
+
version: '2.0'
|
|
61
|
+
type: :development
|
|
62
|
+
prerelease: false
|
|
63
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
64
|
+
requirements:
|
|
65
|
+
- - ">="
|
|
66
|
+
- !ruby/object:Gem::Version
|
|
67
|
+
version: '2.0'
|
|
68
|
+
description: A Ruby gem with a C extension for high-performance scanning and redaction
|
|
69
|
+
of 85 sensitive patterns — API keys, tokens, credentials, IBANs, national IDs, emails,
|
|
70
|
+
phone numbers, and PII from 15+ countries. Optional Logger formatter, Rails filter_parameters
|
|
71
|
+
adapter, and Rack middleware. Designed to sanitize text before sending to LLMs,
|
|
72
|
+
logging systems, or any public/third-party API.
|
|
73
|
+
email:
|
|
74
|
+
- daniele.frisanco@gmail.com
|
|
75
|
+
executables: []
|
|
76
|
+
extensions: []
|
|
77
|
+
extra_rdoc_files: []
|
|
78
|
+
files:
|
|
79
|
+
- CHANGELOG.md
|
|
80
|
+
- LICENSE
|
|
81
|
+
- lib/data_redactor.rb
|
|
82
|
+
- lib/data_redactor/3.0/data_redactor.so
|
|
83
|
+
- lib/data_redactor/3.1/data_redactor.so
|
|
84
|
+
- lib/data_redactor/3.2/data_redactor.so
|
|
85
|
+
- lib/data_redactor/3.3/data_redactor.so
|
|
86
|
+
- lib/data_redactor/3.4/data_redactor.so
|
|
87
|
+
- lib/data_redactor/4.0/data_redactor.so
|
|
88
|
+
- lib/data_redactor/integrations/logger.rb
|
|
89
|
+
- lib/data_redactor/integrations/rack.rb
|
|
90
|
+
- lib/data_redactor/integrations/rails.rb
|
|
91
|
+
- lib/data_redactor/version.rb
|
|
92
|
+
- readme.md
|
|
93
|
+
homepage: https://github.com/danielefrisanco/data_redactor
|
|
94
|
+
licenses:
|
|
95
|
+
- MIT
|
|
96
|
+
metadata:
|
|
97
|
+
homepage_uri: https://github.com/danielefrisanco/data_redactor
|
|
98
|
+
source_code_uri: https://github.com/danielefrisanco/data_redactor
|
|
99
|
+
changelog_uri: https://github.com/danielefrisanco/data_redactor/blob/main/CHANGELOG.md
|
|
100
|
+
bug_tracker_uri: https://github.com/danielefrisanco/data_redactor/issues
|
|
101
|
+
rubygems_mfa_required: 'true'
|
|
102
|
+
rdoc_options: []
|
|
103
|
+
require_paths:
|
|
104
|
+
- lib
|
|
105
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
|
106
|
+
requirements:
|
|
107
|
+
- - ">="
|
|
108
|
+
- !ruby/object:Gem::Version
|
|
109
|
+
version: '3.0'
|
|
110
|
+
- - "<"
|
|
111
|
+
- !ruby/object:Gem::Version
|
|
112
|
+
version: 4.1.dev
|
|
113
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
|
114
|
+
requirements:
|
|
115
|
+
- - ">="
|
|
116
|
+
- !ruby/object:Gem::Version
|
|
117
|
+
version: '0'
|
|
118
|
+
requirements: []
|
|
119
|
+
rubygems_version: 4.0.6
|
|
120
|
+
specification_version: 4
|
|
121
|
+
summary: Redact PII and secrets from strings before sending to AI or external services
|
|
122
|
+
test_files: []
|