prompt-sanitizer 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 7e864f26f679a72a9bad801c49bbe351e708553593dc67441f122e79c6d292b6
+   data.tar.gz: d29b0754dfd1c75caebd1ffd854ae19ae1e4016b9820933a20fa64ddbf87f976
+ SHA512:
+   metadata.gz: 9db1e6f1aa3b42cd67cf72368c45c5fb20af4573b9f1f129e36eb83d4ebf80a438b4c9b0e1056065fcdb692eb650175516b2e2782c465bb66d4c90bdd5293577
+   data.tar.gz: 7493b5d3c26df78d25173f53082d92f70c59475eed3f53a6721d5137ca2864e66dbbb1cf02b5c8813f6b553e51bde5c45db42ff00db9d082c42b0e414df12434
data/CHANGELOG.md ADDED
@@ -0,0 +1,34 @@
+ # Changelog — prompt-sanitizer (Ruby)
+
+ All notable changes to this project will be documented in this file.
+
+ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ---
+
+ ## [Unreleased]
+
+ ### Added
+ - **Rails integrations**: Railtie, Rack middleware, `ActionControllerConcern`,
+   `ActiveJobConcern`, and an install generator (`rails g prompt_sanitizer:install`)
+ - **`Session`** — multi-turn vault with `anonymize` / `deanonymize` / `anonymize_with_result`,
+   block form (`session.use { |s| … }`) with automatic vault cleanup
+ - **`Sanitizer`** — full pipeline: regex → secrets → NER (mode-gated) → dedup →
+   on_detect callback → replace → reconstruct → audit
+ - **`SyntheticEngine`** — Faker-backed realistic replacements for all 27 entity types;
+   graceful fallback to `[TYPE_N]` tokens when Faker is not installed
+ - **`MemoryAuditLog`** — thread-safe, Mutex-backed audit log; JSON/CSV export;
+   `since:` / `session_id:` filtering
+ - **`AuditEvent`** — struct capturing timestamp, session ID, entity type, confidence,
+   detection layer, redaction method, and a truncated SHA-256 hash of the original
+   value (never stores raw PII)
+ - **`PIIDetectedError`** — raised in `:block` mode; carries `entities` array
+ - **`RegexEngine`** — 27 built-in patterns; `add_pattern` accepts String or Regexp
+ - **`SecretsEngine`** — API keys, JWTs, bearer tokens, AWS credentials, private keys,
+   DB connection strings
+ - **`NEREngine`** — pluggable backend (`informers` distilbert / `mitie`); lazy load
+
+ ### Technical notes
+ - Zero runtime dependencies in FAST mode
+ - Thread-safe throughout (Mutex-backed vault and audit log)
+ - Ruby ≥ 3.1 required (Ruby ≥ 3.3 needed for `informers` / `mitie` NER backends)
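Reviewer sketch: the regex, dedup, and replace stages named in the `Sanitizer` changelog entry can be illustrated with a small dependency-free example. Everything in it is an assumption made for illustration: the `PATTERNS` table, the `Match` struct, the `detect`, `dedup`, and `redact` helpers, and the two simplified regexes are hypothetical stand-ins, not the gem's implementation.

```ruby
# Hypothetical, simplified patterns; the gem ships its own, more thorough set.
PATTERNS = {
  email: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/,
  phone: /\b\d{3}-\d{3}-\d{4}\b/
}.freeze

Match = Struct.new(:type, :start, :stop, :text)

# Stage 1: collect every match together with its character offsets.
def detect(text)
  matches = []
  PATTERNS.each do |type, regex|
    pos = 0
    while (m = regex.match(text, pos))
      matches << Match.new(type, m.begin(0), m.end(0), m[0])
      pos = m.end(0)
    end
  end
  matches
end

# Stage 2: drop overlapping matches, preferring the earliest and then the longest.
def dedup(matches)
  kept = []
  matches.sort_by { |m| [m.start, -(m.stop - m.start)] }.each do |m|
    kept << m if kept.empty? || m.start >= kept.last.stop
  end
  kept
end

# Stage 3: rebuild the text with [TYPE_N] tokens; a repeated value reuses its token.
def redact(text)
  counters = Hash.new(0)
  tokens = {} # [type, original] => token
  out = +""
  cursor = 0
  dedup(detect(text)).each do |m|
    token = tokens[[m.type, m.text]] ||= begin
      counters[m.type] += 1
      "[#{m.type.to_s.upcase}_#{counters[m.type]}]"
    end
    out << text[cursor...m.start] << token
    cursor = m.stop
  end
  out << text[cursor..]
end

puts redact("Mail john@acme.com or call 555-867-5309, again john@acme.com")
# => Mail [EMAIL_1] or call [PHONE_1], again [EMAIL_1]
```

Sorting by start offset before replacing lets the text be rebuilt in a single left-to-right pass, which keeps the offsets valid even though tokens and originals differ in length.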
data/README.md ADDED
@@ -0,0 +1,269 @@
+ # prompt-sanitizer — Ruby
+
+ **Bidirectional PII sanitizer for LLM pipelines. Zero cloud calls. GDPR & HIPAA ready.**
+
+ Strips PII from prompts before they reach any model API, then optionally restores
+ original values in the response — all in-process, with no third-party telemetry.
+
+ ---
+
+ ## Table of contents
+
+ - [Installation](#installation)
+ - [Quick start](#quick-start)
+ - [Modes](#modes)
+ - [Multi-turn sessions](#multi-turn-sessions)
+ - [Rails integration](#rails-integration)
+   - [Install generator](#install-generator)
+   - [Initializer](#initializer)
+   - [Rack middleware](#rack-middleware)
+   - [ActionController concern](#actioncontroller-concern)
+   - [ActiveJob concern](#activejob-concern)
+ - [Audit log](#audit-log)
+ - [Custom patterns](#custom-patterns)
+ - [Entity types detected](#entity-types-detected)
+ - [Optional dependencies](#optional-dependencies)
+ - [License](#license)
+
+ ---
+
+ ## Installation
+
+ ```ruby
+ # Gemfile
+ gem "prompt-sanitizer"
+ ```
+
+ ```bash
+ bundle install
+ ```
+
+ ---
+
+ ## Quick start
+
+ ```ruby
+ require "prompt_sanitizer"
+
+ # :smart adds NER, which is what catches the name below.
+ # The default :fast mode has zero dependencies but covers only regex and secrets patterns.
+ sanitizer = PromptSanitizer::Sanitizer.new(mode: :smart)
+
+ result = sanitizer.sanitize("Hi, I'm John Doe. Reach me at john@acme.com or 555-867-5309")
+ puts result.text
+ # => "Hi, I'm [PERSON_1]. Reach me at [EMAIL_1] or [PHONE_1]"
+
+ puts result.entities.map { |e| [e.type, e.original] }.inspect
+ # => [[:person, "John Doe"], [:email, "john@acme.com"], [:phone, "555-867-5309"]]
+ ```
+
+ ---
+
+ ## Modes
+
+ | Mode | Engines | Latency | Catches |
+ |------|---------|---------|---------|
+ | `:fast` *(default)* | Regex + Secrets | < 1 ms | Email, phone, SSN, CC, IBAN, IP, MAC, URL, ZIP, dates, crypto, bank, passport, DL, API keys, JWTs, AWS keys, DB strings |
+ | `:smart` | Fast + NER | ~25–50 ms | + Names, organisations, locations, miscellaneous entities |
+ | `:full` | Smart + Synthetic + Audit | ~25–50 ms | + Realistic fake replacements, compliance audit trail |
+
+ ```ruby
+ # SMART mode — requires `gem "informers"` (or `gem "mitie"`)
+ sanitizer = PromptSanitizer::Sanitizer.new(mode: :smart)
+
+ # FULL mode — also requires `gem "faker"`
+ sanitizer = PromptSanitizer::Sanitizer.new(mode: :full)
+ ```
+
+ ### on_detect callbacks
+
+ ```ruby
+ # :redact (default) — replace PII with tokens
+ sanitizer = PromptSanitizer::Sanitizer.new(on_detect: :redact)
+
+ # :warn — replace AND call a warning handler
+ sanitizer = PromptSanitizer::Sanitizer.new(
+   on_detect: :warn,
+   on_detect_callback: ->(entities) { Rails.logger.warn "PII: #{entities.map(&:type)}" }
+ )
+
+ # :block — raise PIIDetectedError immediately
+ sanitizer = PromptSanitizer::Sanitizer.new(on_detect: :block)
+ begin
+   sanitizer.sanitize("SSN: 123-45-6789")
+ rescue PromptSanitizer::PIIDetectedError => e
+   puts e.entities.first.type # => :ssn
+ end
+ ```
+
+ ---
+
+ ## Multi-turn sessions
+
+ Sessions share a vault across conversation turns so the original values can be
+ restored from the model's reply:
+
+ ```ruby
+ # :smart is used here because the [PERSON_1] token below relies on NER.
+ sanitizer = PromptSanitizer::Sanitizer.new(mode: :smart)
+ session = sanitizer.session
+
+ # Turn 1
+ clean_prompt = session.anonymize("Book a flight for Alice Chen, alice@example.com")
+ # => "Book a flight for [PERSON_1], [EMAIL_1]"
+
+ llm_reply = YourLLMClient.chat(clean_prompt)
+ # => "Sure! I've booked a flight for [PERSON_1] ([EMAIL_1])."
+
+ final_reply = session.deanonymize(llm_reply)
+ # => "Sure! I've booked a flight for Alice Chen (alice@example.com)."
+
+ # Block form — vault is cleared automatically on exit
+ sanitizer.session do |s|
+   clean = s.anonymize(user_prompt)
+   s.deanonymize(llm_client.chat(clean))
+ end
+ ```
+
+ ---
+
+ ## Rails integration
+
+ ### Install generator
+
+ ```bash
+ rails generate prompt_sanitizer:install
+ ```
+
+ This creates `config/initializers/prompt_sanitizer.rb` with all options commented.
+
+ ### Initializer
+
+ ```ruby
+ # config/initializers/prompt_sanitizer.rb
+ PromptSanitizer.configure do |config|
+   config.mode = :smart # :fast | :smart | :full
+   config.ner_backend = :informers # :informers | :mitie
+   config.on_detect = :redact # :redact | :warn | :block
+   config.audit_log = :memory # :memory (more backends coming)
+ end
+ ```
+
+ ### Rack middleware
+
+ Automatically sanitizes JSON request bodies before they hit your controllers.
+ Supports `prompt`, `messages[].content` (OpenAI format), `input`, `text`, `query`.
+
+ ```ruby
+ # config/initializers/prompt_sanitizer.rb
+ PromptSanitizer.configure do |config|
+   config.use_middleware = true
+   config.middleware_routes = ["/api/"] # only sanitize these path prefixes
+   config.restore_response = false # set true to deanonymize JSON responses
+ end
+ ```
+
+ Alternatively, insert manually:
+
+ ```ruby
+ # config/application.rb
+ config.middleware.use PromptSanitizer::Integrations::SanitizerMiddleware,
+   routes: ["/api/v1/chat"],
+   restore_response: false
+ ```
+
+ ### ActionController concern
+
+ Fine-grained control inside individual actions:
+
+ ```ruby
+ class ChatController < ApplicationController
+   include PromptSanitizer::Integrations::ActionControllerConcern
+
+   def create
+     # Sanitize specific params in-place
+     sanitize_params!(:prompt, :message)
+
+     # Or use a multi-turn session scoped to this request
+     with_pii_session do |session|
+       clean = session.anonymize(params[:prompt])
+       raw = LLMClient.chat(clean)
+       @reply = session.deanonymize(raw)
+     end
+   end
+ end
+ ```
+
+ ### ActiveJob concern
+
+ Scrubs PII from job arguments before the job performs:
+
+ ```ruby
+ class LLMJob < ApplicationJob
+   include PromptSanitizer::Integrations::ActiveJobConcern
+
+   sanitize_argument :prompt # sanitized in-place before perform
+
+   def perform(prompt:, user_id:)
+     LLMClient.chat(prompt) # prompt is already clean
+   end
+ end
+ ```
+
+ ---
+
+ ## Audit log
+
+ The audit log records every sanitization event — entity type, confidence, session ID,
+ and a truncated SHA-256 hash (first 16 hex characters) of the original value.
+ **Raw PII is never stored.**
+
+ ```ruby
+ PromptSanitizer.configure do |c|
+   c.audit_log = :memory
+ end
+
+ sanitizer = PromptSanitizer::Sanitizer.new(mode: :full)
+ sanitizer.sanitize("Call Jane at 555-123-4567")
+
+ log = PromptSanitizer.audit_log
+ puts log.count # => 1
+ puts log.export(format: :json) # JSON array of events
+ puts log.export(since: "1h") # events in the last hour
+ ```
+
+ ---
+
+ ## Custom patterns
+
+ ```ruby
+ sanitizer = PromptSanitizer::Sanitizer.new
+ sanitizer.add_pattern(/EMP-\d{6}/, :custom) # employee IDs
+ sanitizer.sanitize("Assigned to EMP-004821")
+ # => "Assigned to [CUSTOM_1]"
+ ```
+
+ ---
+
+ ## Entity types detected
+
+ `EMAIL` · `PHONE` · `SSN` · `CREDIT_CARD` · `IBAN` · `IP_ADDRESS` ·
+ `MAC_ADDRESS` · `URL` · `ZIP_CODE` · `DATE_OF_BIRTH` · `DATE` ·
+ `CRYPTO_ADDRESS` · `BANK_ACCOUNT` · `PASSPORT` · `DRIVING_LICENSE` ·
+ `API_KEY` · `JWT` · `BEARER_TOKEN` · `AWS_ACCESS_KEY` · `AWS_SECRET_KEY` ·
+ `PRIVATE_KEY` · `DB_CONNECTION` · `PERSON` · `ORGANIZATION` · `LOCATION` ·
+ `AGE` · `CUSTOM`
+
+ ---
+
+ ## Optional dependencies
+
+ | Gem | Version | Required for |
+ |-----|---------|-------------|
+ | [`informers`](https://github.com/ankane/informers) | `>= 1.3` | SMART / FULL mode NER (recommended) |
+ | [`mitie`](https://github.com/ankane/mitie) | `>= 0.4` | SMART / FULL mode NER (alternative) |
+ | [`faker`](https://github.com/faker-ruby/faker) | `>= 2.0` | FULL mode synthetic replacements |
+
+ > **Note:** `informers` and `mitie` require Ruby ≥ 3.3.
+
+ ---
+
+ ## License
+
+ MIT
+
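Reviewer sketch: the multi-turn flow in the README (anonymize, call the model, deanonymize) depends on a vault that maps tokens back to originals. A minimal illustration follows, assuming a hypothetical `MiniVault` class that handles only emails with a simplified regex. It is not the gem's `Session` class and says nothing about how the gem stores its vault.

```ruby
# Hypothetical stand-in for a session vault; not the gem's Session class.
class MiniVault
  EMAIL = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/ # simplified, for illustration only

  def initialize
    @mutex = Mutex.new
    @token_for = {} # original value => token
    @value_for = {} # token => original value
  end

  # Replace each email with a stable token and remember the mapping.
  def anonymize(text)
    text.gsub(EMAIL) { |value| token(value) }
  end

  # Put the original values back wherever a known token appears.
  def deanonymize(text)
    mapping = @mutex.synchronize { @value_for.dup }
    text.gsub(/\[EMAIL_\d+\]/) { |tok| mapping.fetch(tok, tok) }
  end

  # Drop all stored originals, as a block form would do on exit.
  def clear
    @mutex.synchronize do
      @token_for.clear
      @value_for.clear
    end
    nil
  end

  private

  # Return the existing token for a value, or mint the next numbered one.
  def token(value)
    @mutex.synchronize do
      @token_for[value] ||= begin
        tok = "[EMAIL_#{@token_for.size + 1}]"
        @value_for[tok] = value
        tok
      end
    end
  end
end

vault = MiniVault.new
puts vault.anonymize("Email alice@example.com, cc bob@example.com and alice@example.com")
# => Email [EMAIL_1], cc [EMAIL_2] and [EMAIL_1]
puts vault.deanonymize("Sent to [EMAIL_1] and [EMAIL_2].")
# => Sent to alice@example.com and bob@example.com.
```

Keeping both directions of the mapping makes repeated values reuse one token across turns, and unknown tokens in a reply are left untouched rather than raising.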
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require "rails/generators"
+
+ module PromptSanitizer
+   module Generators
+     # Generates a prompt_sanitizer initializer.
+     #
+     # Usage:
+     #   rails g prompt_sanitizer:install
+     #
+     # Creates:
+     #   config/initializers/prompt_sanitizer.rb
+     class InstallGenerator < Rails::Generators::Base
+       source_root File.expand_path("templates", __dir__)
+
+       desc "Creates a PromptSanitizer initializer in config/initializers/."
+
+       def create_initializer
+         template "initializer.rb", "config/initializers/prompt_sanitizer.rb"
+       end
+
+       def show_instructions
+         say ""
+         say "✅ prompt_sanitizer initializer created.", :green
+         say ""
+         say "Next steps:", :bold
+         say " 1. Review config/initializers/prompt_sanitizer.rb"
+         say " 2. Choose a mode: :fast (default), :smart (+ NER), or :full (+ audit log)"
+         say " 3. For :smart/:full mode, add to your Gemfile:"
+         say ' gem "informers", ">= 1.3" # downloads distilbert-NER (~66 MB on first run)'
+         say ""
+         say " Full docs: https://github.com/jeslor/prompt-sanitizer/tree/main/packages/ruby"
+         say ""
+       end
+     end
+   end
+ end
@@ -0,0 +1,36 @@
+ # frozen_string_literal: true
+
+ PromptSanitizer.configure do |config|
+   # Detection mode:
+   #   :fast  — regex + secrets patterns only. Zero ML deps. Best for production
+   #            where latency matters and NER is not required. (default)
+   #   :smart — adds NER via the `informers` gem (distilbert-NER, ~66 MB).
+   #            Catches names, orgs, and locations missed by regex alone.
+   #   :full  — SMART + in-memory audit log. Every detection event is recorded
+   #            (hashed, never raw PII) for compliance export.
+   config.mode = :fast
+
+   # BCP-47 locale used by the synthetic replacement engine (Faker).
+   # e.g. "en", "fr", "de", "es", "ja"
+   config.locale = "en"
+
+   # NER backend — used only when mode is :smart or :full.
+   #   :informers — distilbert-NER (ONNX, int8, ~66 MB). Downloads once to
+   #                ~/.cache/huggingface/ on first call. Recommended.
+   #   :mitie     — MITIE C++ library (~600 MB model). Faster than informers
+   #                but requires a separate model file and the `mitie` gem.
+   config.ner_backend = :informers
+
+   # Optional: supply a custom audit log backend (must inherit from
+   # PromptSanitizer::Audit::Base). When nil, a MemoryAuditLog is used
+   # automatically in :full mode.
+   # config.audit_log = MyActiveRecordAuditLog.new
+ end
+
+ # ── Optional: Rack middleware ──────────────────────────────────────────────────
+ # Auto-sanitize JSON request bodies (messages[].content, prompt, input, …)
+ # before they reach your controllers. Uncomment to enable.
+ #
+ # Rails.application.config.prompt_sanitizer.middleware = true
+ # Rails.application.config.prompt_sanitizer.middleware_routes = ["/api/llm", "/chat"]
+ # Rails.application.config.prompt_sanitizer.restore_response = false
@@ -0,0 +1,105 @@
+ # frozen_string_literal: true
+
+ require "digest"
+ require "time"
+ require "json"
+
+ module PromptSanitizer
+   module Audit
+     # Value object representing a single PII-detection event
+     # (a plain Struct, so instances are not frozen by default).
+     #
+     # The original PII text is *never* stored — only a 16-char SHA-256 prefix
+     # hash, making the log itself safe to persist or export for compliance.
+     AuditEvent = Struct.new(
+       :timestamp,        # String — ISO-8601 UTC
+       :entity_type,      # Symbol — EntityType constant
+       :confidence,       # Float — detection confidence 0.0..1.0
+       :layer,            # Symbol — :regex | :secrets | :ner
+       :redaction_method, # Symbol — :synthetic | :placeholder
+       :text_hash,        # String — SHA-256[:16] of original PII (NOT the value)
+       :session_id,       # String | nil — caller-supplied session identifier
+       keyword_init: true
+     ) do
+       def to_h
+         super.transform_keys(&:to_s)
+       end
+     end
+
+     # Compute a SHA-256 prefix hash of a PII value for safe audit storage.
+     def self.hash_value(value)
+       Digest::SHA256.hexdigest(value.to_s)[0, 16]
+     end
+
+     # Return the current UTC time as an ISO-8601 string.
+     def self.now_iso
+       Time.now.utc.iso8601
+     end
+
+     # Parse a "since" argument into a comparable UTC Time.
+     #
+     # Accepts:
+     # - +nil+ → no cutoff
+     # - Integer → seconds ago
+     # - "7d" → 7 days ago
+     # - "12h" → 12 hours ago
+     # - Time → as-is
+     # - ISO-8601 String → parsed
+     #
+     # Any other type falls through and returns +nil+ (no cutoff).
+     def self.parse_since(since)
+       return nil if since.nil?
+       return since if since.is_a?(Time)
+
+       if since.is_a?(String)
+         if since =~ /\A(\d+)d\z/
+           Time.now.utc - (Regexp.last_match(1).to_i * 86_400) - 1
+         elsif since =~ /\A(\d+)h\z/
+           Time.now.utc - (Regexp.last_match(1).to_i * 3_600) - 1
+         else
+           Time.parse(since).utc
+         end
+       elsif since.is_a?(Integer)
+         Time.now.utc - since
+       end
+     end
+
+     # Abstract base class for audit log backends.
+     #
+     # Subclass and implement +record+, +export+, +count+, and +clear+.
+     #
+     # Example custom backend:
+     #
+     #   class MyAuditLog < PromptSanitizer::Audit::Base
+     #     def record(event) = MyDB.insert(event.to_h)
+     #     def export(format: :json, since: nil, session_id: nil) = "..."
+     #     def count(since: nil) = MyDB.count
+     #     def clear = MyDB.truncate
+     #   end
+     class Base
+       # Record a detection event. Must not store the original PII value.
+       # @param event [AuditEvent]
+       def record(_event)
+         raise NotImplementedError, "#{self.class}#record is not implemented"
+       end
+
+       # Export events as a formatted string.
+       # @param format [Symbol] :json or :csv
+       # @param since [String, Time, Integer, nil] cutoff (e.g. "7d", "1h", Time object, seconds ago)
+       # @param session_id [String, nil] filter to a specific session
+       # @return [String]
+       def export(format: :json, since: nil, session_id: nil)
+         raise NotImplementedError, "#{self.class}#export is not implemented"
+       end
+
+       # Count matching events.
+       # @param since [String, Time, Integer, nil]
+       # @return [Integer]
+       def count(since: nil)
+         raise NotImplementedError, "#{self.class}#count is not implemented"
+       end
+
+       # Remove all stored events.
+       def clear
+         raise NotImplementedError, "#{self.class}#clear is not implemented"
+       end
+     end
+   end
+ end
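Reviewer sketch: because `text_hash` is a deterministic SHA-256 prefix, an auditor can check whether a known value was ever detected without the log holding that value. The example below assumes exported JSON rows carry a `"text_hash"` key, which is what `AuditEvent#to_h` with string keys suggests. The `hash_value` method mirrors `Audit.hash_value` from the file above, while the `seen?` helper and the sample export are hypothetical.

```ruby
require "digest"
require "json"

# Mirrors Audit.hash_value from the diff: first 16 hex chars of SHA-256.
def hash_value(value)
  Digest::SHA256.hexdigest(value.to_s)[0, 16]
end

# Hypothetical helper: true if any exported row carries the hash of `candidate`.
def seen?(exported_json, candidate)
  target = hash_value(candidate)
  JSON.parse(exported_json).any? { |row| row["text_hash"] == target }
end

# Hypothetical export, shaped like an array of AuditEvent#to_h hashes.
export = JSON.generate(
  [{ "entity_type" => "email", "text_hash" => hash_value("john@acme.com") }]
)

puts seen?(export, "john@acme.com") # => true
puts seen?(export, "jane@acme.com") # => false
```

The same property cuts the other way: an unsalted hash of a low-entropy value such as a phone number or SSN can be recovered by enumerating candidates, so the hash is safer than raw storage but should not be treated as anonymous.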
@@ -0,0 +1,86 @@
+ # frozen_string_literal: true
+
+ module PromptSanitizer
+   module Audit
+     # Thread-safe in-memory audit log backend.
+     #
+     # Events are stored in a plain Array protected by a Mutex.
+     # Data is lost on process restart — use this for development,
+     # testing, or short-lived jobs. For persistence, implement a
+     # custom backend (e.g. ActiveRecord, SQLite) using Audit::Base.
+     #
+     # Usage:
+     #
+     #   log = PromptSanitizer::Audit::MemoryAuditLog.new
+     #   sanitizer = PromptSanitizer.sanitizer(audit_log: log)
+     #   sanitizer.sanitize("contact john@acme.com")
+     #   log.count # => 1
+     #   log.export # => "[{\"entity_type\":\"email\", ...}]"
+     class MemoryAuditLog < Base
+       def initialize
+         super
+         @mutex = Mutex.new
+         @events = []
+       end
+
+       # @param event [AuditEvent]
+       def record(event)
+         @mutex.synchronize { @events << event }
+         nil
+       end
+
+       # @return [Array<AuditEvent>] a snapshot copy taken under the lock
+       #   (the returned Array is a shallow copy and is not frozen)
+       def events
+         @mutex.synchronize { @events.dup }
+       end
+
+       # @param format [:json, :csv]
+       # @param since [String, Time, Integer, nil] e.g. "7d", "1h", Time.now - 3600
+       # @param session_id [String, nil]
+       # @return [String]
+       def export(format: :json, since: nil, session_id: nil)
+         rows = _filter(since: since, session_id: session_id).map(&:to_h)
+         case format.to_sym
+         when :json
+           JSON.generate(rows)
+         when :csv
+           return "" if rows.empty?
+
+           fields = rows.first.keys
+           lines = [fields.join(",")]
+           rows.each { |r| lines << fields.map { |f| r[f].to_s }.join(",") }
+           lines.join("\n")
+         else
+           raise ArgumentError, "Unknown format: #{format.inspect}. Use :json or :csv"
+         end
+       end
+
+       # @param since [String, Time, Integer, nil]
+       # @return [Integer]
+       def count(since: nil)
+         _filter(since: since).length
+       end
+
+       def clear
+         @mutex.synchronize { @events.clear }
+         nil
+       end
+
+       private
+
+       def _filter(since: nil, session_id: nil)
+         cutoff = Audit.parse_since(since)
+         evts = @mutex.synchronize { @events.dup }
+
+         if cutoff
+           evts = evts.select do |e|
+             Time.parse(e.timestamp).utc >= cutoff
+           end
+         end
+
+         evts = evts.select { |e| e.session_id == session_id } if session_id
+         evts
+       end
+     end
+   end
+ end
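Reviewer sketch: the `:csv` branch above joins fields with bare commas, so a caller-supplied `session_id` containing a comma, quote, or line break would shift columns in the output. One possible hardening is RFC 4180-style quoting, shown here with hypothetical helper names `csv_field` and `to_csv`. Neither is part of the gem; the sketch only shows the quoting rule on hashes shaped like `AuditEvent#to_h`.

```ruby
# Quote a field only when it contains a comma, double quote, or line break,
# doubling any embedded double quotes (RFC 4180 style).
def csv_field(value)
  s = value.to_s
  return s unless s.match?(/[",\r\n]/)

  '"' + s.gsub('"', '""') + '"'
end

# rows: Array of Hashes that share the same keys.
def to_csv(rows)
  return "" if rows.empty?

  fields = rows.first.keys
  lines = [fields.map { |f| csv_field(f) }.join(",")]
  rows.each { |r| lines << fields.map { |f| csv_field(r[f]) }.join(",") }
  lines.join("\n")
end

rows = [{ "entity_type" => "email", "session_id" => 'chat,42 "beta"' }]
puts to_csv(rows)
# entity_type,session_id
# email,"chat,42 ""beta"""
```

Fields without special characters pass through unchanged, so output for typical events would look the same as the current join-based export.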