pumice 0.7.1

data/README.md ADDED
@@ -0,0 +1,962 @@
1
+ # Pumice
2
+
3
+ Database PII sanitization for Rails. Declarative scrubbing, pruning, and safe export of PII-free database copies. All operations are **non-destructive** to the source database unless you explicitly opt into destructive mode.
4
+
5
+ ---
6
+
7
+ ## Table of Contents
8
+
9
+ - [Quick Start](#quick-start)
10
+ - [Sanitizer DSL](#sanitizer-dsl)
11
+ - [Verification](#verification)
12
+ - [Helpers](#helpers)
13
+ - [Rake Tasks](#rake-tasks)
14
+ - [Configuration](#configuration)
15
+ - [Safe Scrub](#safe-scrub)
16
+ - [Pruning](#pruning)
17
+ - [Soft Scrubbing](#soft-scrubbing)
18
+ - [Testing](#testing)
19
+ - [Materialized Views](#materialized-views)
20
+ - [Gotchas](#gotchas)
21
+
22
+ ---
23
+
24
+ ## Quick Start
25
+
26
+ ### 1. Install
27
+
28
+ ```ruby
29
+ # Gemfile
30
+ gem 'pumice'
31
+ ```
32
+
33
+ ```bash
34
+ bundle install
35
+ ```
36
+
37
+ ### 2. Create the initializer
38
+
39
+ ```bash
40
+ rails generate pumice:install
41
+ ```
42
+
43
+ This creates [config/initializers/pumice.rb](config/initializers/pumice.rb) with commented defaults. The defaults work out of the box — customize later as needed.
44
+
45
+ ### 3. Generate a sanitizer (and test)
46
+
47
+ ```bash
48
+ rails generate pumice:sanitizer User # sanitizer + test (if applicable)
49
+ ```
50
+
51
+ This inspects your model's columns and generates `app/sanitizers/user_sanitizer.rb` — PII columns get `scrub` stubs, credentials get flagged, and safe columns get `keep` declarations. Every `scrub` block raises `NotImplementedError` until you define the logic.
52
+
53
+ ```bash
54
+ rails generate pumice:sanitizer User # stubs (you define scrub logic)
55
+ rails generate pumice:sanitizer User --defaults # pre-filled with Faker defaults
56
+ rails generate pumice:sanitizer User --no-test # skip test generation
57
+ rails generate pumice:test User # test only (backfill existing sanitizers)
58
+ ```
59
+
60
+ If your project uses RSpec (detected by the presence of `spec/`), a spec is generated with `have_scrubbed` and `have_kept` matchers. See [Testing](#testing) for the full RSpec integration.
61
+
62
+ ### 4. Review and adjust the generated sanitizer
63
+
64
+ Without `--defaults`, scrub blocks require you to define the logic:
65
+
66
+ ```ruby
67
+ # app/sanitizers/user_sanitizer.rb
68
+ class UserSanitizer < Pumice::Sanitizer
69
+ # PII - scrub with fake data
70
+ scrub(:email) { raise NotImplementedError }
71
+ scrub(:first_name) { raise NotImplementedError }
72
+ scrub(:last_name) { raise NotImplementedError }
73
+
74
+ # Credentials - clear sensitive data
75
+ scrub(:encrypted_password) { raise NotImplementedError }
76
+
77
+ # Non-PII - safe to keep
78
+ keep :roles, :active
79
+ end
80
+ ```
81
+
82
+ With `--defaults`, blocks are pre-filled with Faker-based logic chosen from the column name:
83
+
84
+ ```ruby
85
+ # app/sanitizers/user_sanitizer.rb (--defaults)
86
+ class UserSanitizer < Pumice::Sanitizer
87
+ scrub(:email) { fake_email(record) }
88
+ scrub(:first_name) { Faker::Name.first_name }
89
+ scrub(:last_name) { Faker::Name.last_name }
90
+ scrub(:encrypted_password) { nil }
91
+
92
+ keep :roles, :active
93
+ end
94
+ ```
95
+
96
+ | Column name contains | Pre-filled scrubbing definition |
97
+ |---|---|
98
+ | `email` | `fake_email(record)` (nil-safe when nullable) |
99
+ | `phone`, `call_number` | `fake_phone` (nil-safe when nullable) |
100
+ | `first_name` | `Faker::Name.first_name` |
101
+ | `last_name` | `Faker::Name.last_name` |
102
+ | `name`, `display_name`, `full_name` | `Faker::Name.name` |
103
+ | `address`, `street` | `Faker::Address.street_address` |
104
+ | `city` | `Faker::Address.city` |
105
+ | `state` | `Faker::Address.state_abbr` |
106
+ | `zip` | `Faker::Address.zip` |
107
+ | `username`, `login` | `"user_#{record.id}"` |
108
+ | `bio`, `description`, `notes` | `match_length(value, use: :paragraph)` |
109
+ | other `text` columns | `match_length(value, use: :paragraph)` |
110
+ | other `string` columns | `Faker::Lorem.word` |
111
+ | **Credentials** (`password`, `token`, `secret`, `key`, `encrypted`, `oauth`, etc.) | `nil` |
112
+
113
+ ### 5. Run it
114
+
115
+ ```bash
116
+ # Preview what would change (no writes)
117
+ rake db:scrub:test
118
+
119
+ # Generate a scrubbed database dump (source untouched)
120
+ rake db:scrub:generate
121
+
122
+ # Or copy-and-scrub to a separate database
123
+ SOURCE_DATABASE_URL=postgres://prod/myapp \
124
+ TARGET_DATABASE_URL=postgres://local/myapp_dev \
125
+ rake db:scrub:safe
126
+
127
+ # Or destructively scrub the attached database (WARNING!)
128
+ rake db:scrub:all
129
+ ```
130
+
131
+ That's it. Pumice auto-discovers sanitizers in `app/sanitizers/` and auto-registers them by class name (`UserSanitizer` → `users`).
132
+
133
+ ---
134
+
135
+ ## Sanitizer DSL
136
+
137
+ Each sanitizer handles one ActiveRecord model. Place them in `app/sanitizers/`.
138
+
139
+ ### `scrub(column, &block)`
140
+
141
+ Define how to replace a PII column. The block receives the original value and has access to `record` (the ActiveRecord instance) and all [helpers](#helpers).
142
+
143
+ ```ruby
144
+ scrub(:first_name) { Faker::Name.first_name }
145
+ scrub(:bio) { |value| match_length(value, use: :paragraph) }
146
+ scrub(:notes) { |value| value.present? ? Faker::Lorem.sentence : nil }
147
+ scrub(:email) { fake_email(record, domain: 'test.example') }
148
+ ```
149
+
150
+ ### `keep(*columns)`
151
+
152
+ Mark columns as non-PII. No changes applied. *Note: `id`, `created_at`, and `updated_at` are kept automatically — you never need to declare them.*
153
+
154
+ ```ruby
155
+ keep :role, :status
156
+ ```
157
+
158
+ ### `keep_undefined_columns!`
159
+
160
+ Keeps all columns not explicitly defined via `scrub` or `keep`. **Bypasses PII review.** Use only during initial development. Disable globally with:
161
+
162
+ ```ruby
163
+ Pumice.configure { |c| c.allow_keep_undefined_columns = false }
164
+ ```
165
+
166
+ ### Referencing other attributes in scrub blocks
167
+
168
+ **Bare names** return scrubbed values. **`raw(:attribute_name)`** returns original database values.
169
+
170
+ ```ruby
171
+ class UserSanitizer < Pumice::Sanitizer
172
+ scrub(:first_name) { Faker::Name.first_name }
173
+ scrub(:last_name) { Faker::Name.last_name }
174
+ scrub(:display_name) { "#{first_name} #{last_name}" } # scrubbed values
175
+ scrub(:email) { "#{raw(:first_name)}.#{raw(:last_name)}@example.test".downcase } # original values
176
+
177
+ # ...
178
+ end
179
+ ```
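The difference between bare names (scrubbed values) and `raw` (original values) can be modeled in plain Ruby. The sketch below is a toy, not Pumice's implementation: `MiniScrubContext` and its `scrubbed` method are hypothetical names, and the rules are ordinary lambdas evaluated with `instance_exec`.

```ruby
# Toy model of the lookup rule; MiniScrubContext is hypothetical, not a Pumice class.
class MiniScrubContext
  def initialize(original, rules)
    @original = original # column => value as stored in the database
    @rules = rules       # column => lambda producing the replacement
    @scrubbed = {}
  end

  # Original database value, untouched by any rule.
  def raw(column)
    @original.fetch(column)
  end

  # Scrubbed value, computed once and memoized so repeated reads agree.
  def scrubbed(column)
    return @scrubbed[column] if @scrubbed.key?(column)

    rule = @rules[column]
    @scrubbed[column] = rule ? instance_exec(raw(column), &rule) : raw(column)
  end
end

context = MiniScrubContext.new(
  { first_name: "Ada", last_name: "Lovelace", display_name: "Ada L.", email: "ada@gmail.com" },
  {
    first_name: ->(_value) { "Jane" },
    last_name: ->(_value) { "Doe" },
    # Built from other scrubbed columns.
    display_name: ->(_value) { "#{scrubbed(:first_name)} #{scrubbed(:last_name)}" },
    # Built from original columns via raw.
    email: ->(_value) { "#{raw(:first_name)}.#{raw(:last_name)}@example.test".downcase }
  }
)

puts context.scrubbed(:display_name) # Jane Doe
puts context.scrubbed(:email)        # ada.lovelace@example.test
```

Note that an address derived from `raw` values still contains the real name, so reserve `raw` for cases where the original value is genuinely needed.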
180
+
181
+ ### Model binding
182
+
183
+ Inferred from class name by default — `UserSanitizer` automatically binds to `User`, so `sanitizes` is optional when the naming convention matches. Use it when the class name doesn't map directly to the model:
184
+
185
+ ```ruby
186
+ class LegacyUserDataSanitizer < Pumice::Sanitizer
187
+ sanitizes :users # binds to User
188
+ end
189
+
190
+ class AdminUserSanitizer < Pumice::Sanitizer
191
+ sanitizes :admin_users, class_name: 'Admin::User' # namespaced model
192
+ end
193
+ ```
194
+
195
+ ### Friendly names
196
+
197
+ Controls the name used in rake tasks. Default: the class name with the `Sanitizer` suffix removed, then underscored and pluralized.
198
+
199
+ ```ruby
200
+ class TutorSessionFeedbackSanitizer < Pumice::Sanitizer
201
+ friendly_name 'feedback' # rake 'db:scrub:only[feedback]'
202
+ end
203
+ ```
204
+
205
+ | Class Name | Default | Custom |
206
+ |---|---|---|
207
+ | `UserSanitizer` | `users` | - |
208
+ | `TutorSessionFeedbackSanitizer` | `tutor_session_feedbacks` | `feedback` |
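The default naming rule in the table above can be approximated in plain Ruby. This is only a sketch: `default_friendly_name` is a hypothetical helper rather than a Pumice method, and it pluralizes by appending `s`, whereas the gem presumably relies on a real inflector.

```ruby
# Hypothetical helper, not part of Pumice: approximates the default naming rule.
def default_friendly_name(class_name)
  base = class_name.sub(/Sanitizer\z/, "")                       # drop the suffix
  underscored = base.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase  # CamelCase to snake_case
  "#{underscored}s" # naive pluralization; irregular nouns need a real inflector
end

puts default_friendly_name("UserSanitizer")                 # users
puts default_friendly_name("TutorSessionFeedbackSanitizer") # tutor_session_feedbacks
```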
209
+
210
+ ### `prune` (pre-step, not terminal)
211
+
212
+ Removes matching records **before** record-by-record scrubbing. Survivors get scrubbed. Use when you have records worth keeping but need to reduce the dataset first.
213
+
214
+ ```ruby
215
+ class EmailLogSanitizer < Pumice::Sanitizer
216
+ prune { where(created_at: ..1.year.ago) } # delete old logs
217
+
218
+ scrub(:email) { fake_email(record) } # scrub the rest
219
+ scrub(:body) { |value| match_length(value, use: :paragraph) }
220
+
221
+ # ...
222
+ end
223
+ ```
224
+
225
+ Convenience shorthands:
226
+
227
+ ```ruby
228
+ prune_older_than 1.year
229
+ prune_older_than 90.days, column: :updated_at
230
+ prune_older_than "2024-01-01"
231
+ prune_newer_than 30.days
232
+ ```
233
+
234
+ ### Bulk operations (terminal)
235
+
236
+ For tables where you want records **gone**, not scrubbed. The entire sanitizer is just the deletion — no `scrub`/`keep` declarations needed, and no scrubbing runs after. Use `destroy_all` over `delete_all` when you need ActiveRecord callbacks (e.g., `dependent: :destroy` associations).
237
+
238
+ ```ruby
239
+ # Wipe entire table (fastest, resets auto-increment)
240
+ class SessionSanitizer < Pumice::Sanitizer
241
+ truncate!
242
+ end
243
+
244
+ # SQL DELETE with optional scope (no callbacks)
245
+ class VersionSanitizer < Pumice::Sanitizer
246
+ sanitizes :versions, class_name: 'PaperTrail::Version'
247
+
248
+ delete_all { where(item_type: %w[User Message]) }
249
+ end
250
+
251
+ # ActiveRecord destroy with callbacks and dependent associations
252
+ class AttachmentSanitizer < Pumice::Sanitizer
253
+ destroy_all { where(attachable_id: nil) }
254
+ end
255
+ ```
256
+
257
+ ### When to use what
258
+
259
+ The key distinction: `prune` is a pre-step that scrubs survivors, while bulk operations are terminal — deletion is the entire sanitizer.
260
+
261
+ | Goal | DSL | Scrubs survivors? |
262
+ |---|---|:---:|
263
+ | Delete old records, scrub the rest | `prune` / `prune_[older\|newer]_than` | Yes |
264
+ | Wipe entire table | `truncate!` | No |
265
+ | Delete matching records (fast, no callbacks) | `delete_all { scope }` | No |
266
+ | Delete with callbacks/associations | `destroy_all { scope }` | No |
267
+
268
+ ### Programmatic usage
269
+
270
+ ```ruby
271
+ UserSanitizer.sanitize(user) # returns hash, does not persist
272
+ UserSanitizer.sanitize(user, :email) # returns single scrubbed value
273
+ UserSanitizer.scrub!(user) # persists all scrubbed values
274
+ UserSanitizer.scrub!(user, :email) # persists single scrubbed value
275
+ UserSanitizer.scrub_all! # batch: prune → scrub → verify
276
+ ```
277
+
278
+ ---
279
+
280
+ ## Verification
281
+
282
+ Post-operation checks declared inside a sanitizer definition. All verification raises `Pumice::VerificationError` on failure and is skipped during dry runs.
283
+
284
+ ### Table-level
285
+
286
+ ```ruby
287
+ class UserSanitizer < Pumice::Sanitizer
288
+ scrub(:email) { Faker::Internet.email }
289
+
290
+ verify_all "No real emails should remain" do
291
+ where("email LIKE '%@gmail.com'").none?
292
+ end
293
+ end
294
+ ```
295
+
296
+ The `verify_all` block runs in model scope (`User.instance_exec`). Return truthy for success.
297
+
298
+ ### Per-record
299
+
300
+ ```ruby
301
+ class UserSanitizer < Pumice::Sanitizer
302
+ scrub(:email) { Faker::Internet.email }
303
+
304
+ verify_each "Email should be scrubbed" do |record|
305
+ !record.email.match?(/gmail|yahoo|hotmail/)
306
+ end
307
+ end
308
+ ```
309
+
310
+ ### Inline (bulk operations)
311
+
312
+ Bulk operations accept a `verify: true` option that uses a default check after execution:
313
+
314
+ ```ruby
315
+ class AuditLogSanitizer < Pumice::Sanitizer
316
+ truncate!(verify: true) # verifies count.zero?
317
+ end
318
+
319
+ class VersionSanitizer < Pumice::Sanitizer
320
+ delete_all(verify: true) { where(item_type: 'User') } # verifies scope.none?
321
+ end
322
+ ```
323
+
324
+ ### Default verification for bulk operations
325
+
326
+ | Operation | Default check |
327
+ |---|---|
328
+ | `truncate!` | `count.zero?` |
329
+ | `delete_all` (no scope) | `count.zero?` |
330
+ | `delete_all { scope }` | `scope.none?` |
331
+ | `destroy_all` (no scope) | `count.zero?` |
332
+ | `destroy_all { scope }` | `scope.none?` |
333
+
334
+ Call `verify_all` without a block on a bulk sanitizer to use the default. Calling `verify_all` without a block on a non-bulk sanitizer raises `ArgumentError`.
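To make the defaults concrete, here is a small plain-Ruby sketch of how a default check could be chosen. It assumes rows are plain hashes and a "scope" is a lambda that selects rows; `default_check` is a hypothetical name, not the gem's internal method.

```ruby
# Hypothetical stand-in for the default-check table: no scope means the table
# must be empty, a scope means no row may match it any more.
def default_check(scope = nil)
  if scope
    ->(rows) { rows.none? { |row| scope.call(row) } } # scope.none?
  else
    ->(rows) { rows.empty? }                          # count.zero?
  end
end

versions = [{ item_type: "User" }, { item_type: "Post" }]
user_scope = ->(row) { row[:item_type] == "User" }

# Simulate `delete_all { scope }` on an in-memory table.
after_delete = versions.reject { |row| user_scope.call(row) }

puts default_check(user_scope).call(after_delete) # true: no User rows remain
puts default_check.call(after_delete)             # false: the Post row is still there
```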
335
+
336
+ ### Custom verification policy
337
+
338
+ ```ruby
339
+ Pumice.configure do |config|
340
+ config.default_verification = ->(_model_class, operation) {
341
+ case operation[:type]
342
+ when :truncate
343
+ -> { count.zero? }
344
+ when :delete, :destroy
345
+ operation[:scope] || -> { count.zero? }
346
+ end
347
+ }
348
+ end
349
+ ```
350
+
351
+ ---
352
+
353
+ ## Helpers
354
+
355
+ All helpers are available inside `scrub` blocks via `Pumice::Helpers`.
356
+
357
+ ### Quick reference
358
+
359
+ | Helper | Output | Example |
360
+ |---|---|---|
361
+ | `fake_email(record)` | `user_123@example.test` | Deterministic per record |
362
+ | `fake_phone(digits = 10)` | `5551234567` | Random digits |
363
+ | `fake_password(pwd = 'password123', cost: 4)` | `$2a$04$...` | BCrypt hash |
364
+ | `fake_id(id, prefix: 'ID')` | `ID000123` | Zero-padded |
365
+ | `match_length(value, use: :sentence)` | `Lorem ipsum...` | Matches original length |
366
+ | `fake_json(value, preserve_keys: true, keep: [])` | `{"name": "lorem"}` | Structure-preserving |
367
+
368
+ ### `fake_email`
369
+
370
+ Deterministic — same record always produces the same email across runs. Important for data consistency.
371
+
372
+ ```ruby
373
+ class UserSanitizer < Pumice::Sanitizer
374
+ sanitizes :users
375
+
376
+ scrub(:email) { fake_email(record) } # user_123@example.test
377
+ scrub(:email) { fake_email(record, domain: 'test.example.com') } # user_123@test.example.com
378
+ scrub(:contact_email) {
379
+ fake_email(prefix: 'contact', unique_id: record.unique_id) # contact_789@example.test
380
+ }
381
+ end
382
+ ```
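Determinism here simply means the address is a pure function of a stable identifier. A minimal sketch, assuming the output format shown in the comments above; `sketch_fake_email` is a hypothetical function, not the helper's actual source:

```ruby
# Hypothetical re-creation of the output format; the address depends only on
# its arguments, so the same record id yields the same address on every run.
def sketch_fake_email(unique_id, prefix: "user", domain: "example.test")
  "#{prefix}_#{unique_id}@#{domain}"
end

puts sketch_fake_email(123)                    # user_123@example.test
puts sketch_fake_email(789, prefix: "contact") # contact_789@example.test
```

Because the result is stable, rows in different tables that derive an address from the same id stay consistent with each other across runs.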
383
+
384
+ ### `fake_password`
385
+
386
+ Uses low BCrypt cost (4) for speed. All scrubbed users get the same password so devs can log in.
387
+
388
+ ```ruby
389
+ scrub(:encrypted_password) { fake_password } # hash of default 'password123'
390
+ scrub(:encrypted_password) { fake_password('testpass') } # custom password
391
+ ```
392
+
393
+ ### `match_length`
394
+
395
+ Generates text approximating the original value's length. Respects column constraints.
396
+
397
+ ```ruby
398
+ scrub(:bio) { |value| match_length(value, use: :paragraph) }
399
+ scrub(:code) { |value| match_length(value, use: :characters) } # random alphanumeric
400
+ scrub(:title) { |value| match_length(value, use: -> { Faker::Book.title }) } # custom generator
401
+ ```
402
+
403
+ | Generator | Best for |
404
+ |---|---|
405
+ | `:sentence` | Bios, comments (default) |
406
+ | `:paragraph` | Long-form content |
407
+ | `:word` | Short fields, names |
408
+ | `:characters` | Codes, tokens |
409
+ | `-> { ... }` | Any custom Faker or logic |
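The length-matching idea can be illustrated without Faker. The sketch below is an assumption-laden stand-in: `filler_of_length` and the fixed `FILLER_WORDS` list are hypothetical, and unlike the real helper the output is not random.

```ruby
FILLER_WORDS = %w[lorem ipsum dolor sit amet consectetur adipiscing elit].freeze

# Hypothetical helper: builds filler text no longer than the original value.
def filler_of_length(value)
  return value if value.nil? || value.empty? # nothing to match

  text = String.new
  index = 0
  while text.length < value.length
    text << " " unless text.empty?
    text << FILLER_WORDS[index % FILLER_WORDS.length]
    index += 1
  end
  # Cut to the original length; a trailing space left by the cut is dropped,
  # so the result is at most one character shorter than the original.
  text[0, value.length].rstrip
end

puts filler_of_length("Hello world") # lorem ipsum
```

Capping the output at the original length is one simple way to stay within a column's size limit, since the original value already fit.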
410
+
411
+ ### `fake_json`
412
+
413
+ Sanitizes JSON structures. Strings become random words, numbers become `0`, booleans and `nil` are preserved. Structure (nesting depth, array lengths) is always retained.
414
+
415
+ ```ruby
416
+ scrub(:preferences) { |value| fake_json(value) } # fake values, keep keys
417
+ scrub(:metadata) { |value| fake_json(value, preserve_keys: false) } # fake keys AND values
418
+ scrub(:config) { |value| fake_json(value, keep: ['api_version']) } # preserve specific key/value pairs
419
+ scrub(:data) { |value| fake_json(value, keep: ['user.profile.email']) } # dot notation for nesting
420
+ ```
421
+
422
+ | Option | Keys | Values |
423
+ |---|---|---|
424
+ | `fake_json(value)` | Original | Faked |
425
+ | `fake_json(value, preserve_keys: false)` | Faked | Faked |
426
+ | `fake_json(value, keep: ['path'])` | Original (kept paths preserved) | Faked (kept paths preserved) |
427
+ | `fake_json(value, preserve_keys: false, keep: ['path'])` | Faked (kept paths preserved) | Faked (kept paths preserved) |
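A structure-preserving scrub is essentially a recursive walk. The following sketch covers the default mode plus `keep:` with dot-notation paths. It is illustrative only: `scrub_json` is a hypothetical function, it replaces every string with the fixed word `lorem` rather than a random one, and it does not implement `preserve_keys: false`.

```ruby
require "json"

# Hypothetical walker: keeps keys, nesting, and array lengths; replaces leaves.
def scrub_json(node, keep: [], path: nil)
  case node
  when Hash
    node.each_with_object({}) do |(key, child), out|
      child_path = [path, key].compact.join(".")
      # Paths listed in `keep` are copied through untouched.
      out[key] = keep.include?(child_path) ? child : scrub_json(child, keep: keep, path: child_path)
    end
  when Array
    node.map { |child| scrub_json(child, keep: keep, path: path) }
  when String then "lorem"
  when Numeric then 0
  else node # true, false, and nil are preserved
  end
end

data = JSON.parse('{"user":{"profile":{"email":"a@b.c","age":41},"tags":["x","y"]},"active":true}')
puts JSON.generate(scrub_json(data, keep: ["user.profile.email"]))
# {"user":{"profile":{"email":"a@b.c","age":0},"tags":["lorem","lorem"]},"active":true}
```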
428
+
429
+ ### Custom helpers
430
+
431
+ Extend `Pumice::Helpers` for project-specific needs:
432
+
433
+ ```ruby
434
+ # config/initializers/pumice_helpers.rb
435
+ module Pumice
436
+ module Helpers
437
+ def fake_student_id(record)
438
+ "STU-#{record.student_id}"
439
+ end
440
+
441
+ def redact(value, show_last: 4)
442
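+       return nil if value.blank?
442
+       "*" * [value.length - show_last, 0].max + value.last(show_last) # mask all but the last show_last characters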
444
+ end
445
+ end
446
+ end
447
+ ```
448
+
449
+ ---
450
+
451
+ ## Rake Tasks
452
+
453
+ ### Inspection
454
+
455
+ ```bash
456
+ rake db:scrub:list # list registered sanitizers and their friendly names
457
+ rake db:scrub:lint # check all columns are defined (scrub or keep), exits 1 on issues
458
+ rake db:scrub:validate # check scrubbed DB for PII leaks (real emails, uncleared tokens)
459
+ rake db:scrub:analyze # show top 20 tables by size, row counts for sensitive tables
460
+ ```
461
+
462
+ ### Safe operations (source never modified)
463
+
464
+ ```bash
465
+ rake db:scrub:test # dry run all sanitizers
466
+ rake 'db:scrub:test[users,messages]' # dry run specific sanitizers
467
+ rake db:scrub:generate # create temp DB, scrub, export dump, cleanup
468
+ rake db:scrub:safe # copy to target DB, scrub target (interactive)
469
+ rake 'db:scrub:safe_confirmed[mydb]' # same, but auto-confirmed for CI
470
+ ```
471
+
472
+ ### ⚠️ Destructive operations (modifies current database) ⚠️
473
+
474
+ The following tasks modify the currently attached database. You will be prompted to confirm, but be warned:
475
+
476
+ ```bash
477
+ rake db:scrub:all # scrub current DB in-place (interactive confirmation)
478
+ rake 'db:scrub:only[users,messages]' # scrub specific tables in-place
479
+ ```
480
+
481
+ ### Progress indicators
482
+
483
+ Long-running operations display progress bars when output is a TTY:
484
+
485
+ ```
486
+ Sanitizers: |============================ | 5/7 ETA: 00:12
487
+ Users: |================================== | 980/1024 ETA: 00:02
488
+ ```
489
+
490
+ Progress bars are automatically hidden when:
491
+ - `VERBOSE=true` (verbose mode shows per-record detail instead)
492
+ - Output is piped or redirected (non-TTY)
493
+ - The collection is empty
494
+
495
+ Safe Scrub operations show a numbered step counter:
496
+
497
+ ```
498
+ [1/5] Creating fresh target database...
499
+ [2/5] Copying data from source to target...
500
+ ```
501
+
502
+ ### Environment variables
503
+
504
+ | Variable | Effect |
505
+ |---|---|
506
+ | `DRY_RUN=true` | Log changes without persisting |
507
+ | `VERBOSE=true` | Detailed per-record output (disables progress bars) |
508
+ | `PRUNE=false` | Disable pruning without changing config |
509
+ | `SOURCE_DATABASE_URL` | Source DB for safe scrub |
510
+ | `TARGET_DATABASE_URL` | Target DB for safe scrub |
511
+ | `SCRUBBED_DATABASE_URL` | Alternative to `TARGET_DATABASE_URL` |
512
+ | `EXPORT_PATH` | Path to export scrubbed dump |
513
+ | `EXCLUDE_INDEXES=true` | Exclude indexes/triggers/constraints from dump |
514
+ | `EXCLUDE_MATVIEWS=false` | Include materialized views in dump (excluded by default) |
515
+
516
+ ---
517
+
518
+ ## Configuration
519
+
520
+ Create an initializer. All settings have sensible defaults — only override what you need.
521
+
522
+ ```ruby
523
+ # config/initializers/pumice.rb
524
+ Pumice.configure do |config|
525
+ # Column coverage enforcement (default: true)
526
+ # Raises if a sanitizer doesn't define every column as scrub or keep
527
+ config.strict = true
528
+
529
+ # Tables to report row counts for in db:scrub:analyze (default: [])
530
+ config.sensitive_tables = %w[users messages student_profiles]
531
+
532
+ # Email domains that indicate real PII — validation fails if found (default: [])
533
+ config.sensitive_email_domains = %w[gmail.com yahoo.com hotmail.com]
534
+ end
535
+ ```
536
+
537
+ ### Full options reference
538
+
539
+ | Option | Default | Description |
540
+ |---|---|---|
541
+ | `verbose` | `false` | Increase console output detail |
542
+ | `strict` | `true` | Raise if sanitizer columns are undefined |
543
+ | `continue_on_error` | `false` | Continue on sanitizer failure vs halt |
544
+ | `allow_keep_undefined_columns` | `true` | Allow `keep_undefined_columns!` DSL |
545
+ | `sensitive_tables` | `[]` | Tables to analyze for row counts |
546
+ | `sensitive_email_domains` | `[]` | Domains indicating real PII |
547
+ | `sensitive_email_model` | `'User'` | Model to query for email validation |
548
+ | `sensitive_email_column` | `'email'` | Column for email lookup |
549
+ | `sensitive_token_columns` | `%w[reset_password_token confirmation_token]` | Token columns to verify are cleared |
550
+ | `sensitive_external_id_columns` | `[]` | External ID columns to verify are cleared |
551
+ | `source_database_url` | `nil` | Source DB for safe scrub (`:auto` to derive from Rails config) |
552
+ | `target_database_url` | `nil` | Target DB for safe scrub |
553
+ | `export_path` | `nil` | Path to export scrubbed dump |
554
+ | `export_format` | `:custom` | `:custom` (pg_dump -Fc) or `:plain` (SQL) |
555
+ | `require_readonly_source` | `false` | Enforce read-only source (error vs warn) |
556
+ | `soft_scrubbing` | `false` | Runtime PII masking — set to hash to enable |
557
+ | `pruning` | `false` | Pre-sanitization record pruning — set to hash to enable |
558
+
559
+ ---
560
+
561
+ ## Safe Scrub
562
+
563
+ Safe Scrub creates a sanitized copy of your database without modifying the source. This is the recommended workflow for production environments.
564
+
565
+ ### Flow
566
+
567
+ ```
568
+ rake db:scrub:generate
569
+ ├─ Create temp database
570
+ ├─ Copy source → temp
571
+ ├─ Run global pruning (if configured)
572
+ ├─ Run all sanitizers
573
+ ├─ Export dump file
574
+ └─ Drop temp database
575
+
576
+ rake db:scrub:safe
577
+ ├─ Validate source ≠ target
578
+ ├─ Confirm target DB name (interactive or argument)
579
+ ├─ Drop and recreate target
580
+ ├─ Copy source → target
581
+ ├─ Run global pruning
582
+ ├─ Run sanitizers
583
+ ├─ Verify
584
+ └─ Export (if configured)
585
+ ```
586
+
587
+ ### Configuration
588
+
589
+ ```ruby
590
+ Pumice.configure do |config|
591
+ # Auto-detect source from database.yml (works in Docker dev with zero env vars)
592
+ config.source_database_url = :auto unless Rails.env.production?
593
+
594
+ # Or set explicitly
595
+ # config.source_database_url = ENV['DATABASE_URL']
596
+
597
+ config.target_database_url = ENV['SCRUBBED_DATABASE_URL']
598
+ config.export_path = "tmp/scrubbed_#{Date.today}.dump"
599
+ config.export_format = :custom # :custom (pg_dump -Fc) or :plain (SQL)
600
+ end
601
+ ```
602
+
603
+ When `source_database_url` is `:auto`, Pumice derives the URL from `ActiveRecord::Base.connection_db_config`. This means `rake db:scrub:generate` works locally with no env vars.
604
+
605
+ Environment variables (`SOURCE_DATABASE_URL`) always take precedence over config.
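The precedence rule can be sketched as a three-step lookup. This is an illustration, not the gem's code: `resolve_source_url` is a hypothetical function, and `rails_default` stands in for whatever URL `:auto` would derive from the Rails database config.

```ruby
# Hypothetical resolver: environment first, then :auto, then the explicit setting.
def resolve_source_url(env:, configured:, rails_default: nil)
  from_env = env["SOURCE_DATABASE_URL"]
  return from_env unless from_env.nil? || from_env.empty? # env always wins
  return rails_default if configured == :auto             # derived from Rails config
  configured                                              # explicit URL or nil
end

puts resolve_source_url(
  env: { "SOURCE_DATABASE_URL" => "postgres://prod/myapp" },
  configured: :auto,
  rails_default: "postgres://localhost/myapp_dev"
) # postgres://prod/myapp
```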
606
+
607
+ ### Safety guarantees
608
+
609
+ - Source database is **never modified** — read-only access
610
+ - Target cannot equal `DATABASE_URL` — prevents accidental production writes
611
+ - Source and target must differ — validated at startup
612
+ - Interactive confirmation — must type the target DB name
613
+ - Write-access detection — warns (or errors) if source credentials can write
614
+
615
+ ### Read-only source credentials (recommended)
616
+
617
+ ```sql
618
+ -- On source (production): read-only
619
+ CREATE ROLE pumice_readonly WITH LOGIN PASSWORD 'readonly_secret';
620
+ GRANT CONNECT ON DATABASE myapp_production TO pumice_readonly;
621
+ GRANT USAGE ON SCHEMA public TO pumice_readonly;
622
+ GRANT SELECT ON ALL TABLES IN SCHEMA public TO pumice_readonly;
623
+
624
+ -- On target: full access
625
+ CREATE ROLE pumice_writer WITH LOGIN PASSWORD 'writer_secret';
626
+ CREATE DATABASE myapp_scrubbed OWNER pumice_writer;
627
+ ```
628
+
629
+ ```bash
630
+ SOURCE_DATABASE_URL=postgres://pumice_readonly:readonly_secret@prod-host/myapp_production
631
+ TARGET_DATABASE_URL=postgres://pumice_writer:writer_secret@scrub-host/myapp_scrubbed
632
+ ```
633
+
634
+ Even if URLs are swapped, the read-only credential cannot modify production.
635
+
636
+ To enforce read-only source (error instead of warning):
637
+
638
+ ```ruby
639
+ config.require_readonly_source = true
640
+ ```
641
+
642
+ ### CI mode
643
+
644
+ ```bash
645
+ # Auto-confirmed — argument must match target DB name or the task fails
646
+ rake 'db:scrub:safe_confirmed[myapp_scrubbed]'
647
+ ```
648
+
649
+ ### Programmatic usage
650
+
651
+ ```ruby
652
+ Pumice::SafeScrubber.new(
653
+ source_url: ENV['DATABASE_URL'],
654
+ target_url: ENV['SCRUBBED_DATABASE_URL'],
655
+ export_path: 'tmp/scrubbed.dump',
656
+ confirm: true # skip interactive prompt
657
+ ).run
658
+ ```
659
+
660
+ ### Error types
661
+
662
+ | Error | Cause |
663
+ |---|---|
664
+ | `Pumice::ConfigurationError` | Missing URL, source = target, target = DATABASE_URL, confirmation mismatch |
665
+ | `Pumice::SourceWriteAccessError` | `require_readonly_source = true` and source has write access |
666
+
667
+ ---
668
+
669
+ ## Pruning
670
+
671
+ Removes old records before sanitization to reduce dataset size. Useful for log tables, audit trails, and event streams.
672
+
673
+ Pumice supports pruning at two levels with a **cascading override** model:
674
+
675
+ - **Global pruning** — configured once in the initializer. Applies a single age-based rule across many tables at once, before any sanitizers run. This is the default policy.
676
+ - **Per-sanitizer `prune`** — defined inside a sanitizer with a custom scope. **Overrides** global pruning for that table. See [`prune` in the Sanitizer DSL](#prune-pre-step-not-terminal).
677
+
678
+ When a sanitizer defines its own `prune`, global pruning skips that table entirely — the sanitizer's prune takes over. Use global pruning for a blanket retention policy and per-sanitizer `prune` to override specific tables with custom scopes.
679
+
680
+ ### Analyze first
681
+
682
+ ```bash
683
+ rake db:prune:analyze
684
+
685
+ # Customize thresholds
686
+ RETENTION_DAYS=30 MIN_SIZE=50000000 MIN_ROWS=5000 rake db:prune:analyze
687
+ ```
688
+
689
+ The analyzer categorizes tables by confidence:
690
+
691
+ - **High**: Log tables, >50% old records, no foreign key dependencies
692
+ - **Medium**: Log tables OR >70% old, no dependencies
693
+ - **Low**: Everything else — review before pruning
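One possible reading of these tiers, expressed as a function. The sketch treats "no dependencies" as a requirement for both High and Medium, which is an interpretation of the list above rather than confirmed analyzer behavior; `prune_confidence` and its parameters are hypothetical.

```ruby
# Hypothetical classifier; old_ratio is the fraction of rows older than the
# retention window (0.0 to 1.0).
def prune_confidence(log_table:, old_ratio:, has_dependencies:)
  return :low if has_dependencies                        # foreign keys need manual review
  return :high if log_table && old_ratio > 0.5
  return :medium if log_table || old_ratio > 0.7
  :low
end

puts prune_confidence(log_table: true, old_ratio: 0.6, has_dependencies: false)  # high
puts prune_confidence(log_table: false, old_ratio: 0.8, has_dependencies: false) # medium
puts prune_confidence(log_table: false, old_ratio: 0.8, has_dependencies: true)  # low
```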
694
+
695
+ ### Global pruning configuration
696
+
697
+ ```ruby
698
+ Pumice.configure do |config|
699
+ config.pruning = {
700
+ older_than: 90.days, # required (mutually exclusive with newer_than)
701
+ column: :created_at, # default
702
+ except: %w[users messages], # never prune these (mutually exclusive with only)
703
+
704
+ analyzer: {
705
+ table_patterns: %w[portal_session voice_log], # domain-specific log patterns
706
+ min_table_size: 10_000_000, # 10 MB (default)
707
+ min_row_count: 1000 # default
708
+ }
709
+ }
710
+ end
711
+ ```
712
+
713
+ ### Execution order
714
+
715
+ ```
716
+ 1. Global prune → delete old records from all eligible tables
717
+ (tables with a sanitizer-level prune are skipped)
718
+
719
+ 2. Sanitizers → for each sanitizer, in order:
720
+ a. run sanitizer-level prune, if defined
721
+ b. scrub surviving records
722
+ ```
723
+
724
+ The sanitizer-level `prune` replaces global pruning for that table — they never both run on the same table.
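The cascading override can be pictured as building an ordered list of steps. The sketch below assumes every listed table has a sanitizer and uses plain table names; `pruning_plan` is a hypothetical function, not part of Pumice.

```ruby
# Hypothetical planner: global pruning skips excluded tables and tables whose
# sanitizer defines its own prune, then each sanitizer prunes (if defined) and scrubs.
def pruning_plan(tables, sanitizer_prunes:, except: [])
  steps = (tables - except - sanitizer_prunes).map { |table| [:global_prune, table] }
  tables.each do |table|
    steps << [:sanitizer_prune, table] if sanitizer_prunes.include?(table)
    steps << [:scrub, table]
  end
  steps
end

plan = pruning_plan(%w[users email_logs events], sanitizer_prunes: %w[email_logs], except: %w[users])
plan.each { |step, table| puts "#{step} #{table}" }
# global_prune events
# scrub users
# sanitizer_prune email_logs
# scrub email_logs
# scrub events
```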
725
+
726
+ ### Disable at runtime
727
+
728
+ ```bash
729
+ PRUNE=false rake db:scrub:generate
730
+ ```
731
+
732
+ ---
733
+
734
+ ## Soft Scrubbing
735
+
736
+ Masks data at read time without modifying the database. Use for runtime access control — e.g., non-admin users see scrubbed PII, admins see real data.
737
+
738
+ ### Enable
739
+
740
+ ```ruby
741
+ Pumice.configure do |config|
742
+ config.soft_scrubbing = {
743
+ context: :current_user,
744
+ if: ->(record, viewer) { viewer.nil? || !viewer.admin? }
745
+ }
746
+ end
747
+ ```
748
+
749
+ When enabled, Pumice prepends an attribute interceptor on `ActiveRecord::Base`. On attribute read, the policy is checked; if it calls for scrubbing, the `scrub` block runs and the scrubbed value is returned. The database is never modified.
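The interception technique itself is standard Ruby: a prepended module sits in front of the class in the method lookup chain and can call `super` to reach the real reader. The sketch below shows the mechanism on a plain class. `Person` and `MaskedEmail` are hypothetical, and the viewer is held in a simple module-level accessor rather than Pumice's context.

```ruby
class Person
  attr_reader :id, :email

  def initialize(id, email)
    @id = id
    @email = email
  end
end

# Hypothetical interceptor: wraps the reader, never touches the stored value.
module MaskedEmail
  class << self
    attr_accessor :viewer # a real app would set this per request
  end

  def email
    viewer = MaskedEmail.viewer
    return super if viewer && viewer[:admin] # admins see the real value

    "user_#{id}@example.test"
  end
end

Person.prepend(MaskedEmail)

person = Person.new(7, "real@gmail.com")
MaskedEmail.viewer = { admin: false }
puts person.email # user_7@example.test
MaskedEmail.viewer = { admin: true }
puts person.email # real@gmail.com
```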
750
+
751
+ ### Policy options
752
+
753
+ | Option | Behavior |
754
+ |---|---|
755
+ | `if:` | Scrub when lambda returns **true** |
756
+ | `unless:` | Scrub when lambda returns **false** |
757
+ | Neither | Always scrub |
758
+
759
+ Both lambdas receive `(record, viewer)`. Supply only one of them; if both are given, `if:` takes precedence.
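The policy table reduces to a short decision function. A sketch assuming the policy is a plain hash with optional `:if` and `:unless` lambdas; `scrub_for?` and the `Viewer` struct are hypothetical names used only for illustration.

```ruby
Viewer = Struct.new(:admin) do
  def admin?
    admin
  end
end

# Hypothetical evaluator: `if:` is consulted first, then `unless:`, otherwise always scrub.
def scrub_for?(record, viewer, policy = {})
  return policy[:if].call(record, viewer) ? true : false if policy[:if]
  return policy[:unless].call(record, viewer) ? false : true if policy[:unless]

  true
end

non_admins_only = { if: ->(_record, viewer) { viewer.nil? || !viewer.admin? } }

puts scrub_for?(:some_record, Viewer.new(false), non_admins_only) # true
puts scrub_for?(:some_record, Viewer.new(true), non_admins_only)  # false
puts scrub_for?(:some_record, nil, non_admins_only)               # true
```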
760
+
761
+ ### Setting viewer context
762
+
763
+ ```ruby
764
+ # In ApplicationController
765
+ before_action { Pumice.soft_scrubbing_context = current_user }
766
+
767
+ # Or scoped
768
+ Pumice.with_soft_scrubbing_context(current_user) do
769
+ @users = User.all # reads scrubbed for non-admins
770
+ end
771
+ ```
772
+
773
+ The `context:` config option resolves a Symbol through: `record.method` → `Pumice.method` → `Current.method` → `Thread.current[:key]`.
774
+
775
+ ### Accessing original values
776
+
777
+ When soft scrubbing is enabled, attribute reads return scrubbed values. To access the original database value:
778
+
779
+ **Inside sanitizer definitions** — `raw(:attribute_name)` and `raw_attr` are available via the sanitizer DSL (see [Referencing other attributes](#referencing-other-attributes-in-scrub-blocks)).
780
+
781
+ **Inside ActiveRecord models** — use `read_attribute(:attr)` or define a helper:
782
+
783
+ ```ruby
784
+ class User < ApplicationRecord
785
+ def admin?
786
+ ADMIN_EMAILS.include?(read_attribute(:email))
787
+ end
788
+
789
+ # Or define a convenience method:
790
+ def raw(attr_name)
791
+ if Pumice.soft_scrubbing?
792
+ read_attribute(attr_name)
793
+ else
794
+ @attributes.fetch_value(attr_name.to_s)
795
+ end
796
+ end
797
+ end
798
+ ```
799
+
800
+ ---
801
+
802
+ ## Testing
803
+
804
+ ### Setup
805
+
806
+ ```ruby
807
+ # spec/rails_helper.rb
808
+ require 'pumice/rspec'
809
+ ```
810
+
811
+ This gives you:
812
+
813
+ - **Auto-reset** — `Pumice.reset!` runs before each `type: :sanitizer` spec
814
+ - **Auto-lint** — column coverage is verified automatically; incomplete sanitizers fail before examples run
815
+ - **Path inference** — specs in `spec/sanitizers/` are automatically tagged `type: :sanitizer`
816
+ - **Helpers** — `with_soft_scrubbing` and `without_soft_scrubbing` available in sanitizer specs
817
+ - **Matchers** — `have_scrubbed(:attr)` and `have_kept(:attr)` for verifying sanitizer definitions
818
+
819
+ ### Sanitizer specs
820
+
821
+ ```ruby
822
+ # spec/sanitizers/user_sanitizer_spec.rb
823
+ RSpec.describe UserSanitizer, type: :sanitizer do
824
+ let(:user) { create(:user, email: 'real@gmail.com', first_name: 'John') }
825
+
826
+ # Column coverage is checked automatically — no need to add a lint test.
827
+
828
+ describe '.sanitize' do
829
+ it 'returns sanitized values without persisting' do
830
+ result = described_class.sanitize(user)
831
+
832
+ expect(result[:email]).to match(/user_\d+@example\.test/)
833
+ expect(user.reload.email).to eq('real@gmail.com')
834
+ end
835
+ end
836
+
837
+ describe '.scrub!' do
838
+ it 'persists sanitized values' do
839
+ described_class.scrub!(user)
840
+ expect(user.reload.email).to match(/user_\d+@example\.test/)
841
+ end
842
+ end
843
+ end
844
+ ```
845
+
846
+ To skip auto-lint for a specific sanitizer (e.g., during initial development):
847
+
848
+ ```ruby
849
+ RSpec.describe UserSanitizer, type: :sanitizer, lint: false do
850
+ # ...
851
+ end
852
+ ```
853
+
854
+ ### Soft scrubbing specs
855
+
856
+ ```ruby
857
+ RSpec.describe 'User soft scrubbing', type: :sanitizer do
858
+ let(:user) { create(:user, email: 'real@gmail.com') }
859
+ let(:admin) { create(:user, :admin) }
860
+ let(:regular) { create(:user) }
861
+
862
+ it 'scrubs for non-admins' do
863
+ with_soft_scrubbing(viewer: regular, if: ->(r, v) { !v.admin? }) do
864
+ expect(user.email).to match(/user_\d+@example\.test/)
865
+ end
866
+ end
867
+
868
+ it 'shows real data to admins' do
869
+ with_soft_scrubbing(viewer: admin, if: ->(r, v) { !v.admin? }) do
870
+ expect(user.email).to eq('real@gmail.com')
871
+ end
872
+ end
873
+ end
874
+ ```
875
+
876
+ ### Helpers reference
877
+
878
+ | Helper | Use |
879
+ |---|---|
880
+ | `with_soft_scrubbing(viewer:, if:, unless:)` | Enable soft scrubbing for a block |
881
+ | `without_soft_scrubbing { ... }` | Disable soft scrubbing for a block |
882
+ | `have_scrubbed(:attr)` | Assert a sanitizer defines a scrub rule for `:attr` |
883
+ | `have_kept(:attr)` | Assert a sanitizer marks `:attr` as kept |
884
+
885
+ Both soft scrubbing helpers restore the original config after the block, even if the block raises.
886
+
887
+ ```ruby
888
+ # Matcher examples
889
+ RSpec.describe UserSanitizer, type: :sanitizer do
890
+ it { expect(described_class).to have_scrubbed(:email) }
891
+ it { expect(described_class).to have_scrubbed(:first_name) }
892
+ it { expect(described_class).to have_kept(:role) }
893
+ end
894
+ ```
895
+
896
+ ---
897
+
898
+ ## Materialized Views
899
+
900
+ Pumice includes rake tasks for managing materialized views. These matter during safe scrub because materialized view data is excluded from dumps by default.
901
+
902
+ ```bash
903
+ rake db:matviews:list # list all materialized views with sizes
904
+ rake db:matviews:refresh # refresh all materialized views
905
+ rake 'db:matviews:refresh[view1,view2]' # refresh specific views
906
+ ```
907
+
908
+ After restoring a scrubbed dump, refresh materialized views to rebuild their data:
909
+
910
+ ```bash
911
+ pg_restore -d myapp_dev tmp/scrubbed.dump && rake db:matviews:refresh
912
+ ```
913
+
914
+ Set `EXCLUDE_MATVIEWS=false` to include materialized view data in the dump, which removes the need to refresh after restore.
915
+
916
+ ---
917
+
918
+ ## Gotchas
919
+
920
+ ### Strict mode and new columns
921
+
922
+ When `strict: true` (the default), adding a column to a model without updating its sanitizer raises an error on the next scrub. Run `rake db:scrub:lint` in CI to catch this early.
923
+
924
+ ### Bulk operations skip column validation
925
+
926
+ `truncate!`, `delete_all`, and `destroy_all` don't require `scrub`/`keep` declarations. Strict mode doesn't apply to them.
927
+
928
+ ### Faker seeding
929
+
930
+ Pumice seeds Faker with `record.id` before each record. This makes scrubbing **deterministic**: the same record always produces the same fake values, which keeps output consistent across runs.
931
+
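As a rough illustration of why seeding by record ID gives repeatable output, the sketch below uses Ruby's standard `Random` class in place of Faker so it runs without any gems. `fake_email_for` and `NAMES` are hypothetical names, not part of Pumice. With Faker itself, the analogous step would be assigning a seeded generator to `Faker::Config.random`, though how Pumice wires this up internally is an assumption here.

```ruby
# Hypothetical stand-in for a Faker-style generator: same seed, same output.
NAMES = %w[alex blake casey devon emery].freeze

def fake_email_for(record_id)
  rng = Random.new(record_id)          # seed with the record's primary key
  name = NAMES[rng.rand(NAMES.size)]   # drawn deterministically from the seeded stream
  "#{name}_#{rng.rand(10_000)}@example.test"
end

first_run  = fake_email_for(42)
second_run = fake_email_for(42)
puts first_run == second_run  # same record id, same fake value
```

Because the generator is rebuilt from the ID on every call, the result does not depend on how many records were processed earlier in the run.
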
932
+ ### Protected columns
933
+
934
+ `id`, `created_at`, and `updated_at` are automatically excluded from column coverage checks. You never need to declare them.
935
+
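The coverage rule described here and under strict mode can be pictured as simple set arithmetic. This is a sketch only: `undeclared_columns` and `PROTECTED_COLUMNS` are hypothetical names, and Pumice's real lint logic may differ.

```ruby
# Columns the README lists as excluded from coverage checks.
PROTECTED_COLUMNS = %w[id created_at updated_at].freeze

# Returns the columns that have neither a scrub nor a keep declaration.
def undeclared_columns(model_columns, scrubbed:, kept:)
  model_columns - PROTECTED_COLUMNS - scrubbed - kept
end

columns = %w[id email first_name role phone created_at updated_at]
missing = undeclared_columns(columns, scrubbed: %w[email first_name], kept: %w[role])
p missing  # the newly added "phone" column is reported
```

In this picture, strict mode amounts to treating a non-empty result as an error.
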
936
+ ### Soft scrubbing circular dependency
937
+
938
+ If your policy check reads a scrubbed attribute (e.g., `viewer.admin?` checks `viewer.email`), use `read_attribute(:email)` inside the policy instead. Otherwise the policy triggers scrubbing, which triggers the policy again. Pumice includes a recursion guard that falls through to `super` (the real value) on re-entry, so the app won't loop forever or crash, but `read_attribute()` makes the intent explicit.
939
+
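A recursion guard of the kind described can be sketched in plain Ruby. Everything below is illustrative: `GuardedUser`, `raw_email`, and the thread-local key are hypothetical and do not reflect Pumice's actual implementation. To keep it short, the record acts as its own viewer.

```ruby
class GuardedUser
  def initialize(email)
    @email = email
  end

  # Plays the role of read_attribute(:email): always the stored value.
  def raw_email
    @email
  end

  # Policy that (problematically) reads the scrubbed attribute.
  def admin?
    email.end_with?('@admin.example.test')
  end

  def email
    # On re-entry, fall through to the real value instead of recursing.
    return raw_email if Thread.current[:scrub_guard]

    Thread.current[:scrub_guard] = true
    begin
      admin? ? raw_email : 'user_1@example.test'
    ensure
      Thread.current[:scrub_guard] = false
    end
  end
end

puts GuardedUser.new('real@gmail.com').email
puts GuardedUser.new('boss@admin.example.test').email
```

Writing the policy against `raw_email` directly would avoid the re-entry altogether, which is the point of the `read_attribute` advice.
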
940
+ ### `source_database_url = :auto`
941
+
942
+ Works only with PostgreSQL. Pumice builds the URL from `ActiveRecord::Base.connection_db_config` components and returns `nil` for non-PostgreSQL adapters.
943
+
944
+ ### Pruning mutual exclusivity
945
+
946
+ - `older_than` and `newer_than` cannot both be set — raises `ArgumentError`
947
+ - `only` and `except` cannot both be set
948
+ - One of `older_than` or `newer_than` is required
949
+
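The three rules above could be enforced along these lines. `validate_prune_options!` is a hypothetical helper, and the README only confirms `ArgumentError` for the `older_than`/`newer_than` conflict; raising the same error class for the other two rules is an assumption.

```ruby
# Hypothetical validator mirroring the three rules listed above.
def validate_prune_options!(older_than: nil, newer_than: nil, only: nil, except: nil)
  if older_than && newer_than
    raise ArgumentError, 'older_than and newer_than cannot both be set'
  end
  # Assumed error class for the remaining two rules.
  raise ArgumentError, 'one of older_than or newer_than is required' unless older_than || newer_than
  raise ArgumentError, 'only and except cannot both be set' if only && except

  true
end

validate_prune_options!(older_than: 90, only: %w[logs])  # passes

begin
  validate_prune_options!(older_than: 90, newer_than: 30)
rescue ArgumentError => e
  puts e.message
end
```
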
950
+ ### Global pruning and foreign keys
951
+
952
+ The global pruner skips tables with foreign key dependencies and logs a warning. Per-sanitizer `prune` does **not** check dependencies — that's on you.
953
+
954
+ ### Safe scrub connection management
955
+
956
+ Safe Scrub temporarily changes `ActiveRecord::Base.connection_db_config` to operate on the target. It always restores the original connection, even on error. Existing connections to the target are terminated before DROP/CREATE.
957
+
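The "always restores, even on error" behavior follows the usual Ruby `ensure` pattern. The sketch below uses a plain stand-in object rather than ActiveRecord; `FakeConnectionManager` and `with_target` are hypothetical names, and the URLs are placeholders.

```ruby
# Hypothetical stand-in for a connection holder; not ActiveRecord.
class FakeConnectionManager
  attr_reader :current

  def initialize(url)
    @current = url
  end

  # Point at the target for the duration of the block, then restore.
  def with_target(target_url)
    original = @current
    @current = target_url
    yield
  ensure
    @current = original  # runs on success and on error
  end
end

manager = FakeConnectionManager.new('postgres://localhost/source')
begin
  manager.with_target('postgres://localhost/target') { raise 'scrub failed' }
rescue RuntimeError
  # the error still propagates, but the original connection is already restored
end
puts manager.current
```
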
958
+ ---
959
+
960
+ ## License
961
+
962
+ MIT