RubyGems - data_redactor - Versions diffs - 0.8.0 → 0.9.0 - Mend

data_redactor 0.8.0 → 0.9.0


Files changed (8)

checksums.yaml +4 -4
data/CHANGELOG.md +10 -1
data/lib/data_redactor/integrations/rack.rb +21 -0
data/lib/data_redactor/name_pattern.rb +170 -0
data/lib/data_redactor/version.rb +1 -1
data/lib/data_redactor.rb +1 -0
data/readme.md +69 -2
metadata +2 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: aae84ce43ab8d2ad6751ade655397480057d687bdff7a6b857ee2821dffeb91b
-  data.tar.gz: 6e01ebe9d76e64ac3a93c31f14f7089ddaff4645e0819dec675d7585e37c4078
+  metadata.gz: 007d59e430d1675a13b84670f6c34c300f8b72fd7ee4744aa191f846bb89b072
+  data.tar.gz: a23f3b99c3ead341d2c9415a1b4b2eb32a45ee002f052a8e58d928eb1ce03919
 SHA512:
-  metadata.gz: a80b34b6e35fdf97cca2d9fecf1cb136b0e0c676ca1e3080c3680aaeb41b442cb2400c371e38417703910fc023ad01a3cf61f2f4a8f8dc5d4bd681174420d2b4
-  data.tar.gz: 4dbf049d027385c21a721044ac6651f4b24b33500af98a0c4c88d7860c06eb2721ea5aab4b8793f6fcd5614cac0cc645c1f67e73b5ebf8d7ff04ead58b67a244
+  metadata.gz: ccd4f6f97a0110585e4f43f9402eac2a1f57b2aef01a3c6870f0e57ea578377291a7367ee924585d9d11e92af98f4178bb0b9488c1a24a2338f6a41936efad30
+  data.tar.gz: 5281171119b4892167a6b1d55e0996db47408c8a6d334656998f8f2ca50794a3a7b5c987132369ca32965da0943f954eab61f34f5a97c683b8a14851e9beca1e
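
The digests above cover the `metadata.gz` and `data.tar.gz` members packed inside the `.gem` archive. As a minimal sketch of how such a digest could be checked by hand, assuming the member has already been extracted to disk: the helper name `digest_matches?` and the temporary file are hypothetical and stand in for a real extracted member.

```ruby
require "digest"
require "tempfile"

# Compare a file's digest against an expected hex string.
# `digest_matches?` is a hypothetical helper, not part of the gem or RubyGems.
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

ok = bad = nil
# A throwaway file plays the role of an extracted data.tar.gz.
Tempfile.create("data.tar.gz") do |file|
  file.write("example bytes")
  file.flush
  expected = Digest::SHA256.hexdigest("example bytes")
  ok  = digest_matches?(file.path, expected)  # digest agrees
  bad = digest_matches?(file.path, "0" * 64)  # digest differs
end
```

Passing `algorithm: Digest::SHA512` would check the second set of values in the same way.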

data/CHANGELOG.md CHANGED

@@ -7,6 +7,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.9.0] - 2026-05-22
+### Added
+- `DataRedactor.name_pattern(first, last, middle:)` — generates a POSIX ERE that matches a person's name across common written variations (case-insensitivity, First/Last order swaps, `Last, First`, initials, diacritics, and interchangeable space/hyphen separators). Returns a String ready to pass to `add_pattern`. The pattern is boundary-wrapped, so `"Mario"` matches as a word but not inside `"Mariolino"`. When `middle:` is given, both the no-middle and with-middle forms match.
+## [0.8.0] - 2026-05-21
 ### Added
 - `DataRedactor.redact_deep(data, only:, except:, placeholder:)` — recursively redacts every String value in a nested Hash/Array structure. Non-string scalars (Integer, Float, nil, Boolean) and Hash keys are passed through unchanged. Returns a deep copy; never mutates the input. Raises `ArgumentError` on circular references.
 - `DataRedactor.redact_json(json_string, only:, except:, placeholder:)` — parses JSON, redacts via `redact_deep`, and returns valid JSON. Raises `JSON::ParserError` on invalid input.
@@ -170,7 +177,9 @@ features as 0.7.1 plus the pipeline fix.
 - `DataRedactor.redact(text)` module function returning the input with every match replaced by `[REDACTED]`.
 - RSpec suite with one example per pattern.
-[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...HEAD
+[Unreleased]: https://github.com/danielefrisanco/data_redactor/compare/v0.9.0...HEAD
+[0.9.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.8.0...v0.9.0
+[0.8.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.2...v0.8.0
 [0.7.2]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.1...v0.7.2
 [0.7.1]: https://github.com/danielefrisanco/data_redactor/compare/v0.7.0...v0.7.1
 [0.7.0]: https://github.com/danielefrisanco/data_redactor/compare/v0.6.1...v0.7.0

data/lib/data_redactor/integrations/rack.rb CHANGED

@@ -1,6 +1,12 @@
 require "data_redactor"
 module DataRedactor
+  # Namespace for the optional framework adapters under
+  # +lib/data_redactor/integrations/+ ({Logger}, +Rails+, {Rack}).
+  #
+  # Each adapter is soft-required — none load with +require "data_redactor"+;
+  # +require+ only the one you need. They add no runtime gem dependencies and
+  # all redaction is delegated to {DataRedactor.redact}.
   module Integrations
     # Rack middleware that scrubs sensitive data from selectable surfaces of
     # the response (and request headers, for downstream loggers to see scrubbed
@@ -23,8 +29,13 @@ module DataRedactor
     #   the env hash so any downstream middleware that logs them sees scrubbed
     #   values.
     class Rack
+      # Surfaces scrubbed when +scrub:+ is not given to {#initialize}.
+      # @return [Array<Symbol>]
       DEFAULT_SCRUB = [:body, :headers].freeze
+      # Request-header env keys redacted in place when +:headers+ is scrubbed,
+      # so downstream middleware that logs the env sees scrubbed values.
+      # @return [Array<String>] Rack env keys (HTTP_-prefixed, upper-case).
       SENSITIVE_REQUEST_HEADERS = %w[
         HTTP_AUTHORIZATION
         HTTP_PROXY_AUTHORIZATION
@@ -34,6 +45,9 @@ module DataRedactor
         HTTP_X_ACCESS_TOKEN
       ].freeze
+      # Response headers whose values are redacted when +:headers+ is scrubbed.
+      # Matched case-insensitively (Rack 2 capitalises, Rack 3 lower-cases).
+      # @return [Array<String>]
       SENSITIVE_RESPONSE_HEADERS = %w[
         Set-Cookie
         Authorization
@@ -60,6 +74,13 @@ module DataRedactor
         @placeholder = placeholder
       end
+      # Rack entry point. Scrubs the configured surfaces of the request and
+      # response and returns the standard Rack response triple.
+      #
+      # @param env [Hash] the Rack environment.
+      # @return [Array(Integer, Hash, #each)] the +[status, headers, body]+
+      #   triple, with sensitive data redacted from the surfaces named in
+      #   +scrub:+. When +:body+ is scrubbed, +Content-Length+ is dropped.
       def call(env)
         scrub_request_headers(env) if @scrub.include?(:headers)
         status, headers, body = @app.call(env)
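
The new comment on `SENSITIVE_RESPONSE_HEADERS` says header names are matched case-insensitively because Rack 2 capitalises them and Rack 3 lower-cases them. The sketch below illustrates that idea only and is not the middleware's code: `scrub_response_headers` and the two-entry `SENSITIVE_HEADERS` list are hypothetical, and it swaps the whole value for a placeholder, whereas the diff states that the adapters delegate redaction to `DataRedactor.redact`.

```ruby
# Hypothetical, shortened stand-in for the gem's response-header list.
SENSITIVE_HEADERS = %w[Set-Cookie Authorization].freeze

# Return a copy of the headers with sensitive values replaced.
# Names are compared in lower case so both Rack 2 style ("Set-Cookie")
# and Rack 3 style ("set-cookie") keys are caught; keys are left as given.
def scrub_response_headers(headers, placeholder: "[REDACTED]")
  sensitive = SENSITIVE_HEADERS.map(&:downcase)
  headers.each_with_object({}) do |(name, value), out|
    out[name] = sensitive.include?(name.downcase) ? placeholder : value
  end
end

rack2 = scrub_response_headers({ "Set-Cookie" => "sid=abc", "Content-Type" => "text/html" })
rack3 = scrub_response_headers({ "set-cookie" => "sid=abc", "content-type" => "text/html" })
```

Normalising only for the comparison, and not rewriting the keys, leaves the response in whatever casing the rest of the stack expects.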

data/lib/data_redactor/name_pattern.rb ADDED

@@ -0,0 +1,170 @@
+# frozen_string_literal: true
+module DataRedactor
+  # Maps a base ASCII letter to the set of accented characters that should
+  # also match it. Used to make generated name patterns diacritic-tolerant:
+  # an input "Jose" still matches "José", and "Munoz" matches "Muñoz".
+  #
+  # @api private
+  DIACRITIC_FOLD = {
+    "a" => "àáâãäåāăą",
+    "c" => "çćĉċč",
+    "e" => "èéêëēĕėęě",
+    "i" => "ìíîïĩīĭįı",
+    "n" => "ñńņňŉ",
+    "o" => "òóôõöøōŏő",
+    "u" => "ùúûüũūŭůűų",
+    "y" => "ýÿŷ",
+    "s" => "śŝşš",
+    "z" => "źżž",
+    "g" => "ĝğġģ",
+    "l" => "ĺļľŀł",
+    "r" => "ŕŗř",
+    "t" => "ţťŧ"
+  }.freeze
+  module_function
+  # Build a POSIX ERE that matches a person's name across common written
+  # variations, ready to hand to {add_pattern}.
+  #
+  # The returned pattern is **boundary-wrapped** — it embeds
+  # +(^|[^A-Za-z])+ ... +([^A-Za-z]|$)+ so that +"Mario"+ matches as a whole
+  # word but not inside +"Mariolino"+. Because the wrapper uses capture
+  # groups, register the pattern with the default +boundary: false+ (do
+  # **not** pass +boundary: true+ — that would double-wrap and reject the
+  # groups).
+  #
+  # Variations covered:
+  # - **Case** — every letter becomes a case-insensitive character class
+  #   (+[Mm][Aa]...+), since POSIX ERE has no +/i+ flag.
+  # - **Order** — +"First Last"+, +"Last First"+, +"Last, First"+,
+  #   +"Last,First"+.
+  # - **Initials** — +"M. Last"+, +"M Last"+, +"First R."+, +"First R"+,
+  #   +"M.R."+, +"M R"+, +"MR"+.
+  # - **Diacritics** — an ASCII letter with a {DIACRITIC_FOLD} entry also
+  #   matches its accented forms (+"Jose"+ matches +"José"+). An accented
+  #   input letter also matches its bare ASCII form.
+  # - **Separators** — spaces and hyphens are interchangeable between and
+  #   within name parts. A hyphenated part like +"Anne-Marie"+ also matches
+  #   +"Anne Marie"+, +"AnneMarie"+, and each half on its own (+"Anne"+,
+  #   +"Marie"+). Multi-word parts like +"Van der Berg"+ tolerate any
+  #   space/hyphen separator between words.
+  #
+  # @param first [String] the given name. May contain hyphens or spaces.
+  # @param last [String] the family name. May contain hyphens or spaces.
+  # @param middle [String, nil] optional middle name. When given, the pattern
+  #   matches **both** the no-middle forms and the with-middle forms.
+  # @return [String] a POSIX ERE source string.
+  # @raise [ArgumentError] if +first+ or +last+ is not a non-empty String,
+  #   or +middle+ is given but is not a non-empty String.
+  #
+  # @example Register a name pattern
+  #   DataRedactor.add_pattern(
+  #     name:  "person_mario_rossi",
+  #     regex: DataRedactor.name_pattern("Mario", "Rossi"),
+  #     tag:   :contact
+  #   )
+  #
+  # @example With a middle name
+  #   DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")
+  def name_pattern(first, last, middle: nil)
+    _validate_name_arg!(first, "first")
+    _validate_name_arg!(last, "last")
+    _validate_name_arg!(middle, "middle") unless middle.nil?
+    first_tok  = _part_token(first)
+    last_tok   = _part_token(last)
+    middle_tok = middle && _part_token(middle)
+    # Separator between name parts. Optional so initial-only forms collapse
+    # ("MR", "M.R.") and so "First,Last" with no space still matches.
+    sep = "[ ,-]*"
+    bodies = []
+    bodies << "#{first_tok}#{sep}#{last_tok}"            # First Last
+    bodies << "#{last_tok}#{sep}#{first_tok}"            # Last First / Last, First
+    if middle_tok
+      bodies << "#{first_tok}#{sep}#{middle_tok}#{sep}#{last_tok}" # First Middle Last
+      bodies << "#{last_tok}#{sep}#{first_tok}#{sep}#{middle_tok}" # Last First Middle
+    end
+    "(^|[^A-Za-z])(#{bodies.join('|')})([^A-Za-z]|$)"
+  end
+  # @api private
+  # Build the alternation for one name part: the full case-insensitive name,
+  # or its initial (with optional dot). Hyphenated/multi-word parts also
+  # match each sub-word alone and tolerant separators between sub-words.
+  #
+  # @param part [String] a single name part, e.g. "Mario" or "Anne-Marie".
+  # @return [String] a parenthesised POSIX ERE alternation.
+  def _part_token(part)
+    words = part.split(/[ -]+/).reject(&:empty?)
+    word_alts = words.map { |w| _word_alternatives(w) }
+    forms = []
+    # whole part with tolerant separators between its words
+    forms << word_alts.map { |alts| "(#{alts.join('|')})" }.join("[ -]?")
+    # each word on its own (covers "Anne" / "Marie" from "Anne-Marie")
+    if words.length > 1
+      word_alts.each { |alts| forms << "(#{alts.join('|')})" }
+    end
+    "(#{forms.uniq.join('|')})"
+  end
+  # @api private
+  # Alternatives for a single whitespace-free word: the full name (each
+  # letter as a case-insensitive, diacritic-folded class) and its initial.
+  #
+  # @param word [String] a single word with no spaces or hyphens.
+  # @return [Array<String>] alternation members for this word.
+  def _word_alternatives(word)
+    full    = word.chars.map { |ch| _letter_class(ch) }.join
+    initial = "#{_letter_class(word[0])}\\.?"
+    [full, initial]
+  end
+  # @api private
+  # Build a POSIX bracket expression matching one letter case-insensitively
+  # and, where applicable, its accented variants.
+  #
+  # @param char [String] a single character.
+  # @return [String] a bracket expression, e.g. "[Mm]" or "[EeÈÉÊËèéêë]".
+  def _letter_class(char)
+    down = char.downcase
+    up   = char.upcase
+    members = [down]
+    members << up unless up == down
+    base = DIACRITIC_FOLD.key?(down) ? down : _ascii_base(down)
+    if base && DIACRITIC_FOLD.key?(base)
+      accented = DIACRITIC_FOLD[base]
+      members << accented << accented.upcase
+      members << base << base.upcase # accented input still matches bare ASCII
+    end
+    "[#{members.join}]"
+  end
+  # @api private
+  # If +char+ is an accented letter, return the bare ASCII letter it folds
+  # to; otherwise nil.
+  #
+  # @param char [String] a single lowercase character.
+  # @return [String, nil]
+  def _ascii_base(char)
+    DIACRITIC_FOLD.each { |ascii, accents| return ascii if accents.include?(char) }
+    nil
+  end
+  # @api private
+  def _validate_name_arg!(value, label)
+    return if value.is_a?(String) && !value.strip.empty?
+    raise ArgumentError, "#{label} must be a non-empty String, got #{value.inspect}"
+  end
+end
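
To see how the pieces in this file fit together, here is a reduced sketch of the same construction: per-letter case classes, a full-name-or-initial alternation, both name orders, and the boundary wrapper. It leaves out diacritics, hyphenated parts, and middle names. The method names are hypothetical, and Ruby's own `Regexp` is used as a stand-in for the POSIX `regex.h` engine, on the assumption that this subset of syntax behaves the same in both.

```ruby
# One case-insensitive bracket expression per character, e.g. "[mM]".
def letter_class(char)
  down = char.downcase
  up   = char.upcase
  up == down ? "[#{down}]" : "[#{down}#{up}]"
end

# The full word, or its initial with an optional dot ("Mario" or "M.").
def part_token(word)
  full    = word.chars.map { |c| letter_class(c) }.join
  initial = "#{letter_class(word[0])}\\.?"
  "(#{full}|#{initial})"
end

# First/Last in either order, wrapped so the match cannot start or end
# in the middle of a longer run of ASCII letters.
def simple_name_pattern(first, last)
  sep = "[ ,-]*"
  f = part_token(first)
  l = part_token(last)
  "(^|[^A-Za-z])(#{f}#{sep}#{l}|#{l}#{sep}#{f})([^A-Za-z]|$)"
end

re = Regexp.new(simple_name_pattern("Mario", "Rossi"))

re.match?("ticket from mario rossi") # first/last, lower case
re.match?("Rossi, Mario wrote")      # swapped order with a comma
re.match?("M. Rossi")                # initial plus surname
re.match?("Mariolino Rossini")       # embedded names are rejected
```

One side effect of the optional separator combined with bare initials is that this sketch also matches the standalone token `Mr`. The file above documents `MR` as an accepted form and builds initials case-insensitively, so the same may hold for the real pattern; it seems worth testing before registering names whose initials spell a common word.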

data/lib/data_redactor/version.rb CHANGED

@@ -1,4 +1,4 @@
 module DataRedactor
   # Current gem version. Follows {https://semver.org Semantic Versioning 2.0.0}.
-  VERSION = "0.8.0"
+  VERSION = "0.9.0"
 end

data/lib/data_redactor.rb CHANGED

@@ -2,6 +2,7 @@ require "set"
 require "json"
 require_relative "data_redactor/version"
 require_relative "data_redactor/data_redactor" # loads the compiled .so
+require_relative "data_redactor/name_pattern"
 # High-performance regex-based redactor for sensitive data.
 #

data/readme.md CHANGED

@@ -8,7 +8,32 @@ A Ruby gem with a C extension for high-performance regex-based redaction of sens
 ## What it does
-DataRedactor scans text for sensitive patterns and replaces matches with `[REDACTED]`. It uses a C extension backed by POSIX `regex.h` so the heavy lifting happens outside the Ruby VM, making it fast enough for large payloads.
+DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs,
+credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
+each match with a placeholder. The scanning runs in a C extension backed by POSIX
+`regex.h`, so the heavy lifting happens outside the Ruby VM and stays fast enough
+to run inline on large payloads.
+It ships **88 built-in patterns** across 15+ countries, grouped into tags
+(`:credentials`, `:financial`, `:contact`, ...) so you can redact only what you
+care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
+audit a payload without mutating it (`scan`), and plug into Logger, Rails, and
+Rack. You can also register your own patterns at boot.
+### Use cases
+- **Log scrubbing** — drop the `Logger` formatter in so no secret or PII ever
+  reaches disk or your log aggregator.
+- **Rails parameter filtering** — feed `filter_parameters` a redactor-backed proc
+  to keep request params out of logs and error reports.
+- **HTTP request/response sanitising** — Rack middleware scrubs response bodies
+  and sensitive headers in flight.
+- **Sanitising LLM / API payloads** — run `redact_deep` over a params hash or
+  `redact_json` over a JSON body before it leaves the process.
+- **Compliance & auditing** — `scan` reports every match with byte offsets, tag,
+  and pattern name without changing the text, for false-positive tuning.
+- **Internal identifiers** — register company-specific patterns (`add_pattern`)
+  or generate them from a person's name (`name_pattern`).
 ## Usage
@@ -158,6 +183,46 @@ DataRedactor.clear_custom_patterns!               # mostly for test suites
 **`boundary: true`** — wraps the pattern with `(^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$)` so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
+### Name patterns
+Personal names can't ship as built-ins — every team has different ones — but the regex
+boilerplate to match a name across its written variations is the same every time.
+`name_pattern` generates that regex for you, ready to hand to `add_pattern`:
+```ruby
+DataRedactor.add_pattern(
+  name:  "person_mario_rossi",
+  regex: DataRedactor.name_pattern("Mario", "Rossi"),
+  tag:   :contact
+)
+DataRedactor.redact("ticket from Mario Rossi about ...")
+# => "ticket from [REDACTED] about ..."
+```
+A single generated pattern matches all of these:
+- **Case** — `Mario Rossi`, `mario rossi`, `MARIO ROSSI`
+- **Order** — `Mario Rossi`, `Rossi Mario`, `Rossi, Mario`, `Rossi,Mario`
+- **Initials** — `M. Rossi`, `M Rossi`, `Mario R.`, `M.R.`, `MR`
+- **Diacritics** — `name_pattern("Jose", "Munoz")` also matches `José Muñoz` (and vice versa)
+- **Separators** — spaces and hyphens are interchangeable. `name_pattern("Anne-Marie", "Berg")`
+  matches `Anne-Marie Berg`, `Anne Marie Berg`, `AnneMarie Berg`, and each half alone
+  (`Anne Berg`, `Marie Berg`). Multi-word parts like `"Van der Berg"` tolerate any
+  space/hyphen separator between words.
+It does **not** match a name embedded in a longer word — `Mario` will not fire inside
+`Mariolino` — because the generated pattern is boundary-wrapped. For that reason, register
+it with the default `boundary: false` (the wrapper is already baked into the returned
+string; `boundary: true` would double-wrap and reject its capture groups).
+Pass `middle:` to also cover a middle name — both the no-middle and with-middle forms match:
+```ruby
+DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")
+# matches "Mario Rossi" AND "Mario Luigi Rossi" AND "Rossi Mario Luigi"
+```
 ## Integrations
 Optional adapters for Logger, Rails, and Rack. None are loaded automatically — `require` only what you use, and the gem adds zero runtime dependencies in the gemspec.
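
The diacritics bullet in the "Name patterns" hunk above says that `name_pattern("Jose", "Munoz")` also matches `José Muñoz`. A minimal sketch of that folding idea follows, assuming a much smaller table than the gem's `DIACRITIC_FOLD` and using Ruby's `Regexp` as a stand-in for the C regex engine. `FOLD`, `folded_class`, and `folded_word` are hypothetical names, and only the ASCII-to-accented direction is shown.

```ruby
# Hypothetical, trimmed fold table: base letter => accented variants.
FOLD = { "e" => "èéêë", "n" => "ñ", "o" => "òóôõö", "u" => "ùúûü" }.freeze

# Bracket expression for one letter: both cases, plus both cases of any
# accented variants listed for it in FOLD.
def folded_class(char)
  base    = char.downcase
  accents = FOLD.fetch(base, "")
  "[#{base}#{base.upcase}#{accents}#{accents.upcase}]"
end

# Concatenate one folded class per character of the word.
def folded_word(word)
  word.chars.map { |c| folded_class(c) }.join
end

jose  = Regexp.new("\\A#{folded_word('Jose')}\\z")
munoz = Regexp.new("\\A#{folded_word('Munoz')}\\z")

jose.match?("Jos\u00e9")    # precomposed e-acute
munoz.match?("MU\u00d1OZ")  # upper-case N-tilde
jose.match?("Josef")        # longer word, rejected by the anchors
```

The sketch only recognises precomposed characters. Text that stores an accent as a separate combining mark would need Unicode normalisation first, for example `String#unicode_normalize(:nfc)`.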
@@ -306,7 +371,9 @@ redactor/
 ├── lib/
 │   ├── data_redactor.rb          # Ruby entry point, loads the .so
 │   └── data_redactor/
-│       └── version.rb
+│       ├── version.rb
+│       ├── name_pattern.rb        # name_pattern helper — generates a name regex for add_pattern
+│       └── integrations/          # soft-required Logger / Rails / Rack adapters
 ├── ext/
 │   └── data_redactor/
 │       ├── extconf.rb            # Checks for C headers, generates Makefile (globs *.c)

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: data_redactor
 version: !ruby/object:Gem::Version
-  version: 0.8.0
+  version: 0.9.0
 platform: ruby
 authors:
 - Daniele Frisanco
@@ -110,6 +110,7 @@ files:
 - lib/data_redactor/integrations/logger.rb
 - lib/data_redactor/integrations/rack.rb
 - lib/data_redactor/integrations/rails.rb
+- lib/data_redactor/name_pattern.rb
 - lib/data_redactor/version.rb
 - readme.md
 homepage: https://github.com/danielefrisanco/data_redactor