prompt-sanitizer 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/CHANGELOG.md +34 -0
- data/README.md +269 -0
- data/lib/generators/prompt_sanitizer/install_generator.rb +38 -0
- data/lib/generators/prompt_sanitizer/templates/initializer.rb +36 -0
- data/lib/prompt_sanitizer/audit/base.rb +105 -0
- data/lib/prompt_sanitizer/audit/memory_audit_log.rb +86 -0
- data/lib/prompt_sanitizer/engines/ner_engine.rb +279 -0
- data/lib/prompt_sanitizer/engines/regex_engine.rb +216 -0
- data/lib/prompt_sanitizer/engines/secrets_engine.rb +230 -0
- data/lib/prompt_sanitizer/entities.rb +56 -0
- data/lib/prompt_sanitizer/integrations/action_controller.rb +64 -0
- data/lib/prompt_sanitizer/integrations/active_job.rb +79 -0
- data/lib/prompt_sanitizer/integrations/middleware.rb +153 -0
- data/lib/prompt_sanitizer/modes.rb +26 -0
- data/lib/prompt_sanitizer/railtie.rb +44 -0
- data/lib/prompt_sanitizer/result.rb +37 -0
- data/lib/prompt_sanitizer/sanitizer.rb +221 -0
- data/lib/prompt_sanitizer/session.rb +97 -0
- data/lib/prompt_sanitizer/synthetic.rb +152 -0
- data/lib/prompt_sanitizer/vault.rb +88 -0
- data/lib/prompt_sanitizer/version.rb +5 -0
- data/lib/prompt_sanitizer.rb +110 -0
- metadata +131 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 7e864f26f679a72a9bad801c49bbe351e708553593dc67441f122e79c6d292b6
|
|
4
|
+
data.tar.gz: d29b0754dfd1c75caebd1ffd854ae19ae1e4016b9820933a20fa64ddbf87f976
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: 9db1e6f1aa3b42cd67cf72368c45c5fb20af4573b9f1f129e36eb83d4ebf80a438b4c9b0e1056065fcdb692eb650175516b2e2782c465bb66d4c90bdd5293577
|
|
7
|
+
data.tar.gz: 7493b5d3c26df78d25173f53082d92f70c59475eed3f53a6721d5137ca2864e66dbbb1cf02b5c8813f6b553e51bde5c45db42ff00db9d082c42b0e414df12434
|
data/CHANGELOG.md
ADDED
|
@@ -0,0 +1,34 @@
|
|
|
1
|
+
# Changelog — prompt-sanitizer (Ruby)
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## [Unreleased]
|
|
11
|
+
|
|
12
|
+
### Added
|
|
13
|
+
- **Rails integrations**: Railtie, Rack middleware, `ActionControllerConcern`,
|
|
14
|
+
`ActiveJobConcern`, and an install generator (`rails g prompt_sanitizer:install`)
|
|
15
|
+
- **`Session`** — multi-turn vault with `anonymize` / `deanonymize` / `anonymize_with_result`,
|
|
16
|
+
block form (`session.use { |s| … }`) with automatic vault cleanup
|
|
17
|
+
- **`Sanitizer`** — full pipeline: regex → secrets → NER (mode-gated) → dedup →
|
|
18
|
+
on_detect callback → replace → reconstruct → audit
|
|
19
|
+
- **`SyntheticEngine`** — Faker-backed realistic replacements for all 27 entity types;
|
|
20
|
+
graceful fallback to `[TYPE_N]` tokens when Faker is not installed
|
|
21
|
+
- **`MemoryAuditLog`** — thread-safe, Mutex-backed audit log; JSON/CSV export;
|
|
22
|
+
`since:` / `session_id:` filtering
|
|
23
|
+
- **`AuditEvent`** — struct capturing timestamp, session, entity type, confidence,
|
|
24
|
+
mode, and SHA-256 hash of the original value (never stores raw PII)
|
|
25
|
+
- **`PIIDetectedError`** — raised in `:block` mode; carries `entities` array
|
|
26
|
+
- **`RegexEngine`** — 27 built-in patterns; `add_pattern` accepts String or Regexp
|
|
27
|
+
- **`SecretsEngine`** — API keys, JWTs, bearer tokens, AWS credentials, private keys,
|
|
28
|
+
DB connection strings
|
|
29
|
+
- **`NEREngine`** — pluggable backend (`informers` distilbert / `mitie`); lazy load
|
|
30
|
+
|
|
31
|
+
### Technical notes
|
|
32
|
+
- Zero runtime dependencies in FAST mode
|
|
33
|
+
- Thread-safe throughout (Mutex-backed vault and audit log)
|
|
34
|
+
- Ruby ≥ 3.1 required (Ruby ≥ 3.3 needed for `informers` / `mitie` NER backends)
|
data/README.md
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
1
|
+
# prompt-sanitizer — Ruby
|
|
2
|
+
|
|
3
|
+
**Bidirectional PII sanitizer for LLM pipelines. Zero cloud calls. GDPR & HIPAA ready.**
|
|
4
|
+
|
|
5
|
+
Strips PII from prompts before they reach any model API, then optionally restores
|
|
6
|
+
original values in the response — all in-process, with no third-party telemetry.
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## Table of contents
|
|
11
|
+
|
|
12
|
+
- [Installation](#installation)
|
|
13
|
+
- [Quick start](#quick-start)
|
|
14
|
+
- [Modes](#modes)
|
|
15
|
+
- [Multi-turn sessions](#multi-turn-sessions)
|
|
16
|
+
- [Rails integration](#rails-integration)
|
|
17
|
+
- [Rack middleware](#rack-middleware)
|
|
18
|
+
- [ActionController concern](#actioncontroller-concern)
|
|
19
|
+
- [ActiveJob concern](#activejob-concern)
|
|
20
|
+
- [Install generator](#install-generator)
|
|
21
|
+
- [Audit log](#audit-log)
|
|
22
|
+
- [Custom patterns](#custom-patterns)
|
|
23
|
+
- [Entity types detected](#entity-types-detected)
|
|
24
|
+
- [Optional dependencies](#optional-dependencies)
|
|
25
|
+
- [License](#license)
|
|
26
|
+
|
|
27
|
+
---
|
|
28
|
+
|
|
29
|
+
## Installation
|
|
30
|
+
|
|
31
|
+
```ruby
|
|
32
|
+
# Gemfile
|
|
33
|
+
gem "prompt-sanitizer"
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
```bash
|
|
37
|
+
bundle install
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## Quick start
|
|
43
|
+
|
|
44
|
+
```ruby
|
|
45
|
+
require "prompt_sanitizer"
|
|
46
|
+
|
|
47
|
+
sanitizer = PromptSanitizer::Sanitizer.new # FAST mode — zero dependencies
|
|
48
|
+
|
|
49
|
+
result = sanitizer.sanitize("Hi, I'm John Doe. Reach me at john@acme.com or 555-867-5309")
|
|
50
|
+
puts result.text
|
|
51
|
+
# => "Hi, I'm [PERSON_1]. Reach me at [EMAIL_1] or [PHONE_1]"
|
|
52
|
+
|
|
53
|
+
puts result.entities.map { |e| [e.type, e.original] }.inspect
|
|
54
|
+
# => [[:person, "John Doe"], [:email, "john@acme.com"], [:phone, "555-867-5309"]]
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
---
|
|
58
|
+
|
|
59
|
+
## Modes
|
|
60
|
+
|
|
61
|
+
| Mode | Engines | Latency | Catches |
|
|
62
|
+
|------|---------|---------|---------|
|
|
63
|
+
| `:fast` *(default)* | Regex + Secrets | < 1 ms | Email, phone, SSN, CC, IBAN, IP, MAC, URL, ZIP, dates, crypto, bank, passport, DL, API keys, JWTs, AWS keys, DB strings |
|
|
64
|
+
| `:smart` | Fast + NER | ~25–50 ms | + Names, organisations, locations, miscellaneous entities |
|
|
65
|
+
| `:full` | Smart + Synthetic + Audit | ~25–50 ms | + Realistic fake replacements, compliance audit trail |
|
|
66
|
+
|
|
67
|
+
```ruby
|
|
68
|
+
# SMART mode — requires `gem "informers"` (or `gem "mitie"`)
|
|
69
|
+
sanitizer = PromptSanitizer::Sanitizer.new(mode: :smart)
|
|
70
|
+
|
|
71
|
+
# FULL mode — also requires `gem "faker"`
|
|
72
|
+
sanitizer = PromptSanitizer::Sanitizer.new(mode: :full)
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
### on_detect callbacks
|
|
76
|
+
|
|
77
|
+
```ruby
|
|
78
|
+
# :redact (default) — replace PII with tokens
|
|
79
|
+
sanitizer = PromptSanitizer::Sanitizer.new(on_detect: :redact)
|
|
80
|
+
|
|
81
|
+
# :warn — replace AND call a warning handler
|
|
82
|
+
sanitizer = PromptSanitizer::Sanitizer.new(
|
|
83
|
+
on_detect: :warn,
|
|
84
|
+
on_detect_callback: ->(entities) { Rails.logger.warn "PII: #{entities.map(&:type)}" }
|
|
85
|
+
)
|
|
86
|
+
|
|
87
|
+
# :block — raise PIIDetectedError immediately
|
|
88
|
+
sanitizer = PromptSanitizer::Sanitizer.new(on_detect: :block)
|
|
89
|
+
begin
|
|
90
|
+
sanitizer.sanitize("SSN: 123-45-6789")
|
|
91
|
+
rescue PromptSanitizer::PIIDetectedError => e
|
|
92
|
+
puts e.entities.first.type # => :ssn
|
|
93
|
+
end
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
---
|
|
97
|
+
|
|
98
|
+
## Multi-turn sessions
|
|
99
|
+
|
|
100
|
+
Sessions share a vault across conversation turns so the original values can be
|
|
101
|
+
restored from the model's reply:
|
|
102
|
+
|
|
103
|
+
```ruby
|
|
104
|
+
sanitizer = PromptSanitizer::Sanitizer.new
|
|
105
|
+
session = sanitizer.session
|
|
106
|
+
|
|
107
|
+
# Turn 1
|
|
108
|
+
clean_prompt = session.anonymize("Book a flight for Alice Chen, alice@example.com")
|
|
109
|
+
# => "Book a flight for [PERSON_1], [EMAIL_1]"
|
|
110
|
+
|
|
111
|
+
llm_reply = YourLLMClient.chat(clean_prompt)
|
|
112
|
+
# => "Sure! I've booked a flight for [PERSON_1] ([EMAIL_1])."
|
|
113
|
+
|
|
114
|
+
final_reply = session.deanonymize(llm_reply)
|
|
115
|
+
# => "Sure! I've booked a flight for Alice Chen (alice@example.com)."
|
|
116
|
+
|
|
117
|
+
# Block form — vault is cleared automatically on exit
|
|
118
|
+
sanitizer.session do |s|
|
|
119
|
+
clean = s.anonymize(user_prompt)
|
|
120
|
+
s.deanonymize(llm_client.chat(clean))
|
|
121
|
+
end
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
---
|
|
125
|
+
|
|
126
|
+
## Rails integration
|
|
127
|
+
|
|
128
|
+
### Install generator
|
|
129
|
+
|
|
130
|
+
```bash
|
|
131
|
+
rails generate prompt_sanitizer:install
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
This creates `config/initializers/prompt_sanitizer.rb` with all options commented.
|
|
135
|
+
|
|
136
|
+
### Initializer
|
|
137
|
+
|
|
138
|
+
```ruby
|
|
139
|
+
# config/initializers/prompt_sanitizer.rb
|
|
140
|
+
PromptSanitizer.configure do |config|
|
|
141
|
+
config.mode = :smart # :fast | :smart | :full
|
|
142
|
+
config.ner_backend = :informers # :informers | :mitie
|
|
143
|
+
config.on_detect = :redact # :redact | :warn | :block
|
|
144
|
+
config.audit_log = :memory # :memory (more backends coming)
|
|
145
|
+
end
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
### Rack middleware
|
|
149
|
+
|
|
150
|
+
Automatically sanitizes JSON request bodies before they hit your controllers.
|
|
151
|
+
Supports `prompt`, `messages[].content` (OpenAI format), `input`, `text`, `query`.
|
|
152
|
+
|
|
153
|
+
```ruby
|
|
154
|
+
# config/initializers/prompt_sanitizer.rb
|
|
155
|
+
PromptSanitizer.configure do |config|
|
|
156
|
+
config.use_middleware = true
|
|
157
|
+
config.middleware_routes = ["/api/"] # only sanitize these path prefixes
|
|
158
|
+
config.restore_response = false # set true to deanonymize JSON responses
|
|
159
|
+
end
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
Alternatively, insert manually:
|
|
163
|
+
|
|
164
|
+
```ruby
|
|
165
|
+
# config/application.rb
|
|
166
|
+
config.middleware.use PromptSanitizer::Integrations::SanitizerMiddleware,
|
|
167
|
+
routes: ["/api/v1/chat"],
|
|
168
|
+
restore_response: false
|
|
169
|
+
```
|
|
170
|
+
|
|
171
|
+
### ActionController concern
|
|
172
|
+
|
|
173
|
+
Fine-grained control inside individual actions:
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
class ChatController < ApplicationController
|
|
177
|
+
include PromptSanitizer::Integrations::ActionControllerConcern
|
|
178
|
+
|
|
179
|
+
def create
|
|
180
|
+
# Sanitize specific params in-place
|
|
181
|
+
sanitize_params!(:prompt, :message)
|
|
182
|
+
|
|
183
|
+
# Or use a multi-turn session scoped to this request
|
|
184
|
+
with_pii_session do |session|
|
|
185
|
+
clean = session.anonymize(params[:prompt])
|
|
186
|
+
raw = LLMClient.chat(clean)
|
|
187
|
+
@reply = session.deanonymize(raw)
|
|
188
|
+
end
|
|
189
|
+
end
|
|
190
|
+
end
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
### ActiveJob concern
|
|
194
|
+
|
|
195
|
+
Scrubs PII from job arguments before the job performs:
|
|
196
|
+
|
|
197
|
+
```ruby
|
|
198
|
+
class LLMJob < ApplicationJob
|
|
199
|
+
include PromptSanitizer::Integrations::ActiveJobConcern
|
|
200
|
+
|
|
201
|
+
sanitize_argument :prompt # sanitized in-place before perform
|
|
202
|
+
|
|
203
|
+
def perform(prompt:, user_id:)
|
|
204
|
+
LLMClient.chat(prompt) # prompt is already clean
|
|
205
|
+
end
|
|
206
|
+
end
|
|
207
|
+
```
|
|
208
|
+
|
|
209
|
+
---
|
|
210
|
+
|
|
211
|
+
## Audit log
|
|
212
|
+
|
|
213
|
+
The audit log records every sanitization event — entity type, confidence, session ID,
|
|
214
|
+
and a SHA-256 hash of the original value. **Raw PII is never stored.**
|
|
215
|
+
|
|
216
|
+
```ruby
|
|
217
|
+
PromptSanitizer.configure do |c|
|
|
218
|
+
c.audit_log = :memory
|
|
219
|
+
end
|
|
220
|
+
|
|
221
|
+
sanitizer = PromptSanitizer::Sanitizer.new(mode: :full)
|
|
222
|
+
sanitizer.sanitize("Call Jane at 555-123-4567")
|
|
223
|
+
|
|
224
|
+
log = PromptSanitizer.audit_log
|
|
225
|
+
puts log.count # => 1
|
|
226
|
+
puts log.export(format: :json) # JSON array of events
|
|
227
|
+
puts log.export(since: "1h") # events in the last hour
|
|
228
|
+
```
|
|
229
|
+
|
|
230
|
+
---
|
|
231
|
+
|
|
232
|
+
## Custom patterns
|
|
233
|
+
|
|
234
|
+
```ruby
|
|
235
|
+
sanitizer = PromptSanitizer::Sanitizer.new
|
|
236
|
+
sanitizer.add_pattern(/EMP-\d{6}/, :custom) # employee IDs
|
|
237
|
+
sanitizer.sanitize("Assigned to EMP-004821")
|
|
238
|
+
# => "Assigned to [CUSTOM_1]"
|
|
239
|
+
```
|
|
240
|
+
|
|
241
|
+
---
|
|
242
|
+
|
|
243
|
+
## Entity types detected
|
|
244
|
+
|
|
245
|
+
`EMAIL` · `PHONE` · `SSN` · `CREDIT_CARD` · `IBAN` · `IP_ADDRESS` ·
|
|
246
|
+
`MAC_ADDRESS` · `URL` · `ZIP_CODE` · `DATE_OF_BIRTH` · `DATE` ·
|
|
247
|
+
`CRYPTO_ADDRESS` · `BANK_ACCOUNT` · `PASSPORT` · `DRIVING_LICENSE` ·
|
|
248
|
+
`API_KEY` · `JWT` · `BEARER_TOKEN` · `AWS_ACCESS_KEY` · `AWS_SECRET_KEY` ·
|
|
249
|
+
`PRIVATE_KEY` · `DB_CONNECTION` · `PERSON` · `ORGANIZATION` · `LOCATION` ·
|
|
250
|
+
`AGE` · `CUSTOM`
|
|
251
|
+
|
|
252
|
+
---
|
|
253
|
+
|
|
254
|
+
## Optional dependencies
|
|
255
|
+
|
|
256
|
+
| Gem | Version | Required for |
|
|
257
|
+
|-----|---------|-------------|
|
|
258
|
+
| [`informers`](https://github.com/ankane/informers) | `>= 1.3` | SMART / FULL mode NER (recommended) |
|
|
259
|
+
| [`mitie`](https://github.com/ankane/mitie) | `>= 0.4` | SMART / FULL mode NER (alternative) |
|
|
260
|
+
| [`faker`](https://github.com/faker-ruby/faker) | `>= 2.0` | FULL mode synthetic replacements |
|
|
261
|
+
|
|
262
|
+
> **Note:** `informers` and `mitie` require Ruby ≥ 3.3.
|
|
263
|
+
|
|
264
|
+
---
|
|
265
|
+
|
|
266
|
+
## License
|
|
267
|
+
|
|
268
|
+
MIT
|
|
269
|
+
|
|
@@ -0,0 +1,38 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "rails/generators"
|
|
4
|
+
|
|
5
|
+
module PromptSanitizer
|
|
6
|
+
module Generators
|
|
7
|
+
# Generates a prompt_sanitizer initializer.
|
|
8
|
+
#
|
|
9
|
+
# Usage:
|
|
10
|
+
# rails g prompt_sanitizer:install
|
|
11
|
+
#
|
|
12
|
+
# Creates:
|
|
13
|
+
# config/initializers/prompt_sanitizer.rb
|
|
14
|
+
class InstallGenerator < Rails::Generators::Base
|
|
15
|
+
source_root File.expand_path("templates", __dir__)
|
|
16
|
+
|
|
17
|
+
desc "Creates a PromptSanitizer initializer in config/initializers/."
|
|
18
|
+
|
|
19
|
+
def create_initializer
|
|
20
|
+
template "initializer.rb", "config/initializers/prompt_sanitizer.rb"
|
|
21
|
+
end
|
|
22
|
+
|
|
23
|
+
def show_instructions
|
|
24
|
+
say ""
|
|
25
|
+
say "✅ prompt_sanitizer initializer created.", :green
|
|
26
|
+
say ""
|
|
27
|
+
say "Next steps:", :bold
|
|
28
|
+
say " 1. Review config/initializers/prompt_sanitizer.rb"
|
|
29
|
+
say " 2. Choose a mode: :fast (default), :smart (+ NER), or :full (+ audit log)"
|
|
30
|
+
say " 3. For :smart/:full mode, add to your Gemfile:"
|
|
31
|
+
say ' gem "informers", ">= 1.3" # downloads distilbert-NER (~66 MB on first run)'
|
|
32
|
+
say ""
|
|
33
|
+
say " Full docs: https://github.com/jeslor/prompt-sanitizer/tree/main/packages/ruby"
|
|
34
|
+
say ""
|
|
35
|
+
end
|
|
36
|
+
end
|
|
37
|
+
end
|
|
38
|
+
end
|
|
@@ -0,0 +1,36 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
PromptSanitizer.configure do |config|
|
|
4
|
+
# Detection mode:
|
|
5
|
+
# :fast — regex + secrets patterns only. Zero ML deps. Best for production
|
|
6
|
+
# where latency matters and NER is not required. (default)
|
|
7
|
+
# :smart — adds NER via the `informers` gem (distilbert-NER, ~66 MB).
|
|
8
|
+
# Catches names, orgs, and locations missed by regex alone.
|
|
9
|
+
# :full — SMART + in-memory audit log. Every detection event is recorded
|
|
10
|
+
# (hashed, never raw PII) for compliance export.
|
|
11
|
+
config.mode = :fast
|
|
12
|
+
|
|
13
|
+
# BCP-47 locale used by the synthetic replacement engine (Faker).
|
|
14
|
+
# e.g. "en", "fr", "de", "es", "ja"
|
|
15
|
+
config.locale = "en"
|
|
16
|
+
|
|
17
|
+
# NER backend — used only when mode is :smart or :full.
|
|
18
|
+
# :informers — distilbert-NER (ONNX, int8, ~66 MB). Downloads once to
|
|
19
|
+
# ~/.cache/huggingface/ on first call. Recommended.
|
|
20
|
+
# :mitie — MITIE C++ library (~600 MB model). Faster than informers
|
|
21
|
+
# but requires a separate model file and the `mitie` gem.
|
|
22
|
+
config.ner_backend = :informers
|
|
23
|
+
|
|
24
|
+
# Optional: supply a custom audit log backend (must inherit from
|
|
25
|
+
# PromptSanitizer::Audit::Base). When nil, a MemoryAuditLog is used
|
|
26
|
+
# automatically in :full mode.
|
|
27
|
+
# config.audit_log = MyActiveRecordAuditLog.new
|
|
28
|
+
end
|
|
29
|
+
|
|
30
|
+
# ── Optional: Rack middleware ──────────────────────────────────────────────────
|
|
31
|
+
# Auto-sanitize JSON request bodies (messages[].content, prompt, input, …)
|
|
32
|
+
# before they reach your controllers. Uncomment to enable.
|
|
33
|
+
#
|
|
34
|
+
# Rails.application.config.prompt_sanitizer.middleware = true
|
|
35
|
+
# Rails.application.config.prompt_sanitizer.middleware_routes = ["/api/llm", "/chat"]
|
|
36
|
+
# Rails.application.config.prompt_sanitizer.restore_response = false
|
|
@@ -0,0 +1,105 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "digest"
|
|
4
|
+
require "time"
|
|
5
|
+
require "json"
|
|
6
|
+
|
|
7
|
+
module PromptSanitizer
|
|
8
|
+
module Audit
|
|
9
|
+
# Immutable value object representing a single PII-detection event.
|
|
10
|
+
#
|
|
11
|
+
# The original PII text is *never* stored — only a 16-char SHA-256 prefix
|
|
12
|
+
# hash, making the log itself safe to persist or export for compliance.
|
|
13
|
+
AuditEvent = Struct.new(
|
|
14
|
+
:timestamp, # String — ISO-8601 UTC
|
|
15
|
+
:entity_type, # Symbol — EntityType constant
|
|
16
|
+
:confidence, # Float — detection confidence 0.0..1.0
|
|
17
|
+
:layer, # Symbol — :regex | :secrets | :ner
|
|
18
|
+
:redaction_method, # Symbol — :synthetic | :placeholder
|
|
19
|
+
:text_hash, # String — SHA-256[:16] of original PII (NOT the value)
|
|
20
|
+
:session_id, # String | nil — caller-supplied session identifier
|
|
21
|
+
keyword_init: true
|
|
22
|
+
) do
|
|
23
|
+
def to_h
|
|
24
|
+
super.transform_keys(&:to_s)
|
|
25
|
+
end
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
# Compute a SHA-256 prefix hash of a PII value for safe audit storage.
|
|
29
|
+
def self.hash_value(value)
|
|
30
|
+
Digest::SHA256.hexdigest(value.to_s)[0, 16]
|
|
31
|
+
end
|
|
32
|
+
|
|
33
|
+
# Return the current UTC time as an ISO-8601 string.
|
|
34
|
+
def self.now_iso
|
|
35
|
+
Time.now.utc.iso8601
|
|
36
|
+
end
|
|
37
|
+
|
|
38
|
+
# Parse a "since" argument into a comparable UTC Time.
|
|
39
|
+
#
|
|
40
|
+
# Accepts:
|
|
41
|
+
# - +nil+ → no cutoff
|
|
42
|
+
# - Integer → seconds ago
|
|
43
|
+
# - "7d" → 7 days ago
|
|
44
|
+
# - "12h" → 12 hours ago
|
|
45
|
+
# - Time → as-is
|
|
46
|
+
# - ISO-8601 String → parsed
|
|
47
|
+
def self.parse_since(since)
|
|
48
|
+
return nil if since.nil?
|
|
49
|
+
return since if since.is_a?(Time)
|
|
50
|
+
|
|
51
|
+
if since.is_a?(String)
|
|
52
|
+
if since =~ /\A(\d+)d\z/
|
|
53
|
+
Time.now.utc - (Regexp.last_match(1).to_i * 86_400) - 1
|
|
54
|
+
elsif since =~ /\A(\d+)h\z/
|
|
55
|
+
Time.now.utc - (Regexp.last_match(1).to_i * 3_600) - 1
|
|
56
|
+
else
|
|
57
|
+
Time.parse(since).utc
|
|
58
|
+
end
|
|
59
|
+
elsif since.is_a?(Integer)
|
|
60
|
+
Time.now.utc - since
|
|
61
|
+
end
|
|
62
|
+
end
|
|
63
|
+
|
|
64
|
+
# Abstract base class for audit log backends.
|
|
65
|
+
#
|
|
66
|
+
# Subclass and implement +record+, +export+, +count+, and +clear+.
|
|
67
|
+
#
|
|
68
|
+
# Example custom backend:
|
|
69
|
+
#
|
|
70
|
+
# class MyAuditLog < PromptSanitizer::Audit::Base
|
|
71
|
+
# def record(event) = MyDB.insert(event.to_h)
|
|
72
|
+
# def export(format: :json, since: nil, session_id: nil) = "..."
|
|
73
|
+
# def count(since: nil) = MyDB.count
|
|
74
|
+
# def clear = MyDB.truncate
|
|
75
|
+
# end
|
|
76
|
+
class Base
|
|
77
|
+
# Record a detection event. Must not store the original PII value.
|
|
78
|
+
# @param event [AuditEvent]
|
|
79
|
+
def record(_event)
|
|
80
|
+
raise NotImplementedError, "#{self.class}#record is not implemented"
|
|
81
|
+
end
|
|
82
|
+
|
|
83
|
+
# Export events as a formatted string.
|
|
84
|
+
# @param format [Symbol] :json or :csv
|
|
85
|
+
# @param since [String, Time, nil] cutoff (e.g. "7d", "1h", Time object)
|
|
86
|
+
# @param session_id [String, nil] filter to a specific session
|
|
87
|
+
# @return [String]
|
|
88
|
+
def export(format: :json, since: nil, session_id: nil)
|
|
89
|
+
raise NotImplementedError, "#{self.class}#export is not implemented"
|
|
90
|
+
end
|
|
91
|
+
|
|
92
|
+
# Count matching events.
|
|
93
|
+
# @param since [String, Time, nil]
|
|
94
|
+
# @return [Integer]
|
|
95
|
+
def count(since: nil)
|
|
96
|
+
raise NotImplementedError, "#{self.class}#count is not implemented"
|
|
97
|
+
end
|
|
98
|
+
|
|
99
|
+
# Remove all stored events.
|
|
100
|
+
def clear
|
|
101
|
+
raise NotImplementedError, "#{self.class}#clear is not implemented"
|
|
102
|
+
end
|
|
103
|
+
end
|
|
104
|
+
end
|
|
105
|
+
end
|
|
@@ -0,0 +1,86 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module PromptSanitizer
|
|
4
|
+
module Audit
|
|
5
|
+
# Thread-safe in-memory audit log backend.
|
|
6
|
+
#
|
|
7
|
+
# Events are stored in a plain Array protected by a Mutex.
|
|
8
|
+
# Data is lost on process restart — use this for development,
|
|
9
|
+
# testing, or short-lived jobs. For persistence, implement a
|
|
10
|
+
# custom backend (e.g. ActiveRecord, SQLite) using Audit::Base.
|
|
11
|
+
#
|
|
12
|
+
# Usage:
|
|
13
|
+
#
|
|
14
|
+
# log = PromptSanitizer::Audit::MemoryAuditLog.new
|
|
15
|
+
# sanitizer = PromptSanitizer.sanitizer(audit_log: log)
|
|
16
|
+
# sanitizer.sanitize("contact john@acme.com")
|
|
17
|
+
# log.count # => 1
|
|
18
|
+
# log.export # => "[{\"entity_type\":\"email\", ...}]"
|
|
19
|
+
class MemoryAuditLog < Base
|
|
20
|
+
def initialize
|
|
21
|
+
super
|
|
22
|
+
@mutex = Mutex.new
|
|
23
|
+
@events = []
|
|
24
|
+
end
|
|
25
|
+
|
|
26
|
+
# @param event [AuditEvent]
|
|
27
|
+
def record(event)
|
|
28
|
+
@mutex.synchronize { @events << event }
|
|
29
|
+
nil
|
|
30
|
+
end
|
|
31
|
+
|
|
32
|
+
# @return [Array<AuditEvent>] a frozen snapshot (thread-safe copy)
|
|
33
|
+
def events
|
|
34
|
+
@mutex.synchronize { @events.dup }
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
# @param format [:json, :csv]
|
|
38
|
+
# @param since [String, Time, nil] e.g. "7d", "1h", Time.now - 3600
|
|
39
|
+
# @param session_id [String, nil]
|
|
40
|
+
# @return [String]
|
|
41
|
+
def export(format: :json, since: nil, session_id: nil)
|
|
42
|
+
rows = _filter(since: since, session_id: session_id).map(&:to_h)
|
|
43
|
+
case format.to_sym
|
|
44
|
+
when :json
|
|
45
|
+
JSON.generate(rows)
|
|
46
|
+
when :csv
|
|
47
|
+
return "" if rows.empty?
|
|
48
|
+
|
|
49
|
+
fields = rows.first.keys
|
|
50
|
+
lines = [fields.join(",")]
|
|
51
|
+
rows.each { |r| lines << fields.map { |f| r[f].to_s }.join(",") }
|
|
52
|
+
lines.join("\n")
|
|
53
|
+
else
|
|
54
|
+
raise ArgumentError, "Unknown format: #{format.inspect}. Use :json or :csv"
|
|
55
|
+
end
|
|
56
|
+
end
|
|
57
|
+
|
|
58
|
+
# @param since [String, Time, nil]
|
|
59
|
+
# @return [Integer]
|
|
60
|
+
def count(since: nil)
|
|
61
|
+
_filter(since: since).length
|
|
62
|
+
end
|
|
63
|
+
|
|
64
|
+
def clear
|
|
65
|
+
@mutex.synchronize { @events.clear }
|
|
66
|
+
nil
|
|
67
|
+
end
|
|
68
|
+
|
|
69
|
+
private
|
|
70
|
+
|
|
71
|
+
def _filter(since: nil, session_id: nil)
|
|
72
|
+
cutoff = Audit.parse_since(since)
|
|
73
|
+
evts = @mutex.synchronize { @events.dup }
|
|
74
|
+
|
|
75
|
+
if cutoff
|
|
76
|
+
evts = evts.select do |e|
|
|
77
|
+
Time.parse(e.timestamp).utc >= cutoff
|
|
78
|
+
end
|
|
79
|
+
end
|
|
80
|
+
|
|
81
|
+
evts = evts.select { |e| e.session_id == session_id } if session_id
|
|
82
|
+
evts
|
|
83
|
+
end
|
|
84
|
+
end
|
|
85
|
+
end
|
|
86
|
+
end
|