top_secret 0.1.1 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 6ca966ae09c911bbffe556c5a4d043b5d57d344ec7169fa2fdd853b8daefe407
4
- data.tar.gz: f2f931cbf81c0df4165b0431d6b680d8f1316454052e38d4e3bca903fab55459
3
+ metadata.gz: cef5c9c267dd71870fd244408a3f9a020d19978381810f99e7ed5defc67f12a7
4
+ data.tar.gz: ec20793792721a47371cc6a7cf3c687ad4284b9c9f1f9b14c57e002118c5532a
5
5
  SHA512:
6
- metadata.gz: a77cf5529144525965ff471b2d12bd980f01e21b72714b04894e186120570e54828d1b2c08738d5c779a2e560e6f05a9642d43e4047e48c4aa3a7fe938c1bfb4
7
- data.tar.gz: ba347a50b9cf63a88ea6f7b98e56e713330d38e5341a00a792eb180c965b23c9bf822cafa5d28aa832e9810f5da6455f6065d7af50a9257597bb5afbaf4a32ac
6
+ metadata.gz: 0eea9a1087e5082e245b73cdbb434543dbfcff277b6d58b7a89e463b1162215e0058f5190df3863a73fd32223fdf42fa65a253781193ea7e44ecc6b68c359e45
7
+ data.tar.gz: 6626056578ec3feecf27d843a994f782c144b363a6b7473e6e0be0fa80044c2283c329ae068c4f0f66af13f021743039864e13f220174489bd822e0c162872a7
data/CHANGELOG.md CHANGED
@@ -1,5 +1,49 @@
1
1
  ## [Unreleased]
2
2
 
3
+ ## [0.3.0] - 2025-09-19
4
+
5
+ ### Added
6
+
7
+ - Added `TopSecret::Text.scan` method for detecting sensitive information without redacting text
8
+ - Added `TopSecret::Text::ScanResult` class to hold scan operation results with `mapping` and `sensitive?` methods
9
+ - Added `TopSecret::Text::GlobalMapping` class to manage consistent labeling across multiple filtering operations
10
+ - Added factory methods to domain objects: `BatchResult.from_messages`, `Result.from_messages`, and `Result.with_global_labels`
11
+ - Added support for disabling NER filtering by setting `model_path` to `nil` for improved performance and deployment flexibility
12
+ - Added support for Rails 7.0 and newer
13
+ - Added `#safe?` predicate method as the logical opposite of `#sensitive?` for `BatchResult`, `Result` and `ScanResult` classes
14
+
15
+ ### Changed
16
+
17
+ - **BREAKING:** `TopSecret::Text.filter_all` now returns `TopSecret::Text::Result` objects instead of `TopSecret::Text::BatchResult::Item` objects for individual items
18
+ - Each item in `BatchResult#items` now includes an individual `mapping` attribute containing only the sensitive information found in that specific message
19
+ - `TopSecret::Text.filter_all` now only processes sensitive results when building global mappings, improving efficiency
20
+ - Refactored `TopSecret::Text.filter_all` to use domain objects with better separation of concerns and testability
21
+ - Improved performance by implementing lazy loading of MITIE model and document processing
22
+ - NER filtering now gracefully falls back when MITIE model is unavailable, continuing with regex-based filters only
23
+
24
+ ## [0.2.0] - 2025-08-18
25
+
26
+ ### Added
27
+
28
+ - Added `TopSecret::Text.filter_all` for batch processing multiple messages with globally consistent redaction labels
29
+ - Added `TopSecret::Text::BatchResult` class to hold results from batch operations
30
+ - Added `TopSecret::FilteredText` class for restoring filtered text by substituting placeholders with original values
31
+ - Added `TopSecret::FilteredText::Result` class to track restoration success and failures
32
+
33
+ ### Changed
34
+
35
+ - **BREAKING:** Moved `TopSecret::Result` to `TopSecret::Text::Result` and `TopSecret::BatchResult` to `TopSecret::Text::BatchResult` for better namespace organization
36
+ - **BREAKING:** Refactored configuration system to use individual filter accessors instead of nested `default_filters`
37
+ - Updated `TopSecret::Text.filter` to accept keyword arguments for filter overrides and `custom_filters` array
38
+ - Each default filter now has its own configuration accessor (e.g., `TopSecret.email_filter`, `TopSecret.people_filter`)
39
+
40
+ ### Migration Guide
41
+
42
+ - Replace `TopSecret::Result` with `TopSecret::Text::Result` and `TopSecret::BatchResult` with `TopSecret::Text::BatchResult`
43
+ - Replace `TopSecret.configure { |c| c.default_filters.email_filter = filter }` with `TopSecret.configure { |c| c.email_filter = filter }`
44
+ - Replace `TopSecret::Text.filter(text, filters: { email_filter: filter })` with `TopSecret::Text.filter(text, email_filter: filter)`
45
+ - For new filters, use `TopSecret::Text.filter(text, custom_filters: [filter])` instead of adding to `default_filters`
46
+
3
47
  ## [0.1.1] - 2025-08-08
4
48
 
5
49
  - Ensure `TopSecret.min_confidence_score` is respected
data/CODEOWNERS ADDED
@@ -0,0 +1,15 @@
1
+ # Lines starting with '#' are comments.
2
+ # Each line is a file pattern followed by one or more owners.
3
+
4
+ # More details are here: https://help.github.com/articles/about-codeowners/
5
+
6
+ # The '*' pattern is global owners.
7
+
8
+ # Order is important. The last matching pattern has the most precedence.
9
+ # The folders are ordered as follows:
10
+
11
+ # In each subsection folders are ordered first by depth, then alphabetically.
12
+ # This should make it easy to add new rules without breaking existing ones.
13
+
14
+ # Global rule:
15
+ * @stevepolitodesign
data/README.md CHANGED
@@ -1,6 +1,8 @@
1
1
  # Top Secret
2
2
 
3
- Filter sensitive information from free text before sending it to external services or APIs, such as Chatbots.
3
+ [![Ruby](https://github.com/thoughtbot/top_secret/actions/workflows/main.yml/badge.svg?branch=main)](https://github.com/thoughtbot/top_secret/actions/workflows/main.yml)
4
+
5
+ Filter sensitive information from free text before sending it to external services or APIs, such as chatbots and LLMs.
4
6
 
5
7
  By default it filters the following:
6
8
 
@@ -32,6 +34,13 @@ gem install top_secret
32
34
  >
33
35
  > You'll need to download and extract [ner_model.dat][] first.
34
36
 
37
+ > [!TIP]
38
+ > Due to its large size, you'll likely want to avoid committing [ner_model.dat][] into version control.
39
+ >
40
+ > You'll need to ensure the file exists in deployed environments. See relevant [discussion][discussions_60] for details.
41
+ >
42
+ > Alternatively, you can disable NER filtering entirely by setting `model_path` to `nil` if you only need regex-based filters (credit cards, emails, phone numbers, SSNs). This improves performance and eliminates the model file dependency.
43
+
35
44
  By default, Top Secret assumes the file will live at the root of your project, but this can be configured.
36
45
 
37
46
  ```ruby
@@ -123,7 +132,7 @@ TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com")
123
132
  This will return
124
133
 
125
134
  ```ruby
126
- <TopSecret::Result
135
+ <TopSecret::Text::Result
127
136
  @input="Ralph can be reached at ralph@thoughtbot.com",
128
137
  @mapping={:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"},
129
138
  @output="[PERSON_1] can be reached at [EMAIL_1]"
@@ -154,26 +163,264 @@ result.mapping
154
163
  # => {:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
155
164
  ```
156
165
 
166
+ Check if sensitive information was found
167
+
168
+ ```ruby
169
+ result.sensitive?
170
+
171
+ # => true
172
+
173
+ result.safe?
174
+
175
+ # => false
176
+ ```
177
+
178
+ ### Scanning for Sensitive Information
179
+
180
+ Use `TopSecret::Text.scan` to detect sensitive information without redacting the text. This is useful when you only need to check if sensitive data exists or get a mapping of what was found:
181
+
182
+ ```ruby
183
+ TopSecret::Text.scan("Ralph can be reached at ralph@thoughtbot.com")
184
+ ```
185
+
186
+ This will return
187
+
188
+ ```ruby
189
+ <TopSecret::Text::ScanResult
190
+ @mapping={:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
191
+ >
192
+ ```
193
+
194
+ Check if sensitive information was found
195
+
196
+ ```ruby
197
+ result.sensitive?
198
+
199
+ # => true
200
+
201
+ result.safe?
202
+
203
+ # => false
204
+ ```
205
+
206
+ View the mapping of found sensitive information
207
+
208
+ ```ruby
209
+ result.mapping
210
+
211
+ # => {:EMAIL_1=>"ralph@thoughtbot.com", :PERSON_1=>"Ralph"}
212
+ ```
213
+
214
+ The `scan` method accepts the same filter options as `filter`:
215
+
216
+ ```ruby
217
+ # Override default filters
218
+ email_filter = TopSecret::Filters::Regex.new(
219
+ label: "EMAIL_ADDRESS",
220
+ regex: /\w+\[at\]\w+\.\w+/
221
+ )
222
+ result = TopSecret::Text.scan("Contact user[at]example.com", email_filter:)
223
+ result.mapping
224
+ # => {:EMAIL_ADDRESS_1=>"user[at]example.com"}
225
+
226
+ # Disable specific filters
227
+ result = TopSecret::Text.scan("Ralph works in Boston", people_filter: nil)
228
+ result.mapping
229
+ # => {:LOCATION_1=>"Boston"}
230
+
231
+ # Add custom filters
232
+ ip_filter = TopSecret::Filters::Regex.new(
233
+ label: "IP_ADDRESS",
234
+ regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
235
+ )
236
+ result = TopSecret::Text.scan("Server IP is 192.168.1.1", custom_filters: [ip_filter])
237
+ result.mapping
238
+ # => {:IP_ADDRESS_1=>"192.168.1.1"}
239
+ ```
240
+
241
+ ### Batch Processing
242
+
243
+ When processing multiple messages, use `filter_all` to ensure consistent redaction labels across all messages:
244
+
245
+ ```ruby
246
+ messages = [
247
+ "Contact ralph@thoughtbot.com for details",
248
+ "Email ralph@thoughtbot.com again if needed",
249
+ "Also CC ruby@thoughtbot.com on the thread"
250
+ ]
251
+
252
+ result = TopSecret::Text.filter_all(messages)
253
+ ```
254
+
255
+ This will return
256
+
257
+ ```ruby
258
+ <TopSecret::Text::BatchResult
259
+ @mapping={:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"},
260
+ @items=[
261
+ <TopSecret::Text::Result @input="Contact ralph@thoughtbot.com for details", @output="Contact [EMAIL_1] for details", @mapping={:EMAIL_1=>"ralph@thoughtbot.com"}>,
262
+ <TopSecret::Text::Result @input="Email ralph@thoughtbot.com again if needed", @output="Email [EMAIL_1] again if needed", @mapping={:EMAIL_1=>"ralph@thoughtbot.com"}>,
263
+ <TopSecret::Text::Result @input="Also CC ruby@thoughtbot.com on the thread", @output="Also CC [EMAIL_2] on the thread", @mapping={:EMAIL_2=>"ruby@thoughtbot.com"}>
264
+ ]
265
+ >
266
+ ```
267
+
268
+ Access the global mapping
269
+
270
+ ```ruby
271
+ result.mapping
272
+
273
+ # => {:EMAIL_1=>"ralph@thoughtbot.com", :EMAIL_2=>"ruby@thoughtbot.com"}
274
+ ```
275
+
276
+ Access individual items
277
+
278
+ ```ruby
279
+ result.items[0].input
280
+ # => "Contact ralph@thoughtbot.com for details"
281
+
282
+ result.items[0].output
283
+ # => "Contact [EMAIL_1] for details"
284
+
285
+ result.items[0].mapping
286
+ # => {:EMAIL_1=>"ralph@thoughtbot.com"}
287
+
288
+ result.items[0].sensitive?
289
+ # => true
290
+
291
+ result.items[0].safe?
292
+ # => false
293
+ ```
294
+
295
+ The key benefit is that identical values receive the same labels across all messages - notice how `ralph@thoughtbot.com` becomes `[EMAIL_1]` in both the first and second messages.
296
+
297
+ Each item also maintains its own mapping containing only the sensitive information found in that specific message, while the batch result provides a global mapping of all sensitive information across all messages.
298
+
299
+ ### Restoring Filtered Text
300
+
301
+ When external services (like LLMs) return responses containing filter placeholders, use `TopSecret::FilteredText.restore` to substitute them back with original values:
302
+
303
+ ```ruby
304
+ # Filter messages before sending to LLM
305
+ messages = ["Contact ralph@thoughtbot.com for details"]
306
+ batch_result = TopSecret::Text.filter_all(messages)
307
+
308
+ # Send filtered text to LLM: "Contact [EMAIL_1] for details"
309
+ # LLM responds with: "I'll email [EMAIL_1] about this request"
310
+ llm_response = "I'll email [EMAIL_1] about this request"
311
+
312
+ # Restore the original values
313
+ restore_result = TopSecret::FilteredText.restore(llm_response, mapping: batch_result.mapping)
314
+ ```
315
+
316
+ This will return
317
+
318
+ ```ruby
319
+ <TopSecret::FilteredText::Result
320
+ @output="I'll email ralph@thoughtbot.com about this request",
321
+ @restored=["[EMAIL_1]"],
322
+ @unrestored=[]
323
+ >
324
+ ```
325
+
326
+ Access the restored text
327
+
328
+ ```ruby
329
+ restore_result.output
330
+ # => "I'll email ralph@thoughtbot.com about this request"
331
+ ```
332
+
333
+ Track which placeholders were restored
334
+
335
+ ```ruby
336
+ restore_result.restored
337
+ # => ["[EMAIL_1]"]
338
+
339
+ restore_result.unrestored
340
+ # => []
341
+ ```
342
+
343
+ The restoration process tracks both successful and failed placeholder substitutions, allowing you to handle cases where the LLM response contains placeholders not found in your mapping.
344
+
345
+ ### Working with LLMs
346
+
347
+ LLMs will likely need to be told how to handle the filter placeholders in the text they receive. Otherwise, they may omit the placeholders from their response, which would break the restoration process.
348
+
349
+ Here's a recommended approach:
350
+
351
+ ```ruby
352
+ instructions = <<~TEXT
353
+ I'm going to send filtered information to you in the form of free text.
354
+ If you need to refer to the filtered information in a response, just reference it by the filter.
355
+ TEXT
356
+ ```
357
+
358
+ Complete example:
359
+
360
+ ```ruby
361
+ require "openai"
362
+ require "top_secret"
363
+
364
+ openai = OpenAI::Client.new(
365
+ api_key: Rails.application.credentials.openai.api_key!
366
+ )
367
+
368
+ original_messages = [
369
+ "Ralph lives in Boston.",
370
+ "You can reach them at ralph@thoughtbot.com or 877-976-2687"
371
+ ]
372
+
373
+ # Filter all messages
374
+ result = TopSecret::Text.filter_all(original_messages)
375
+ filtered_messages = result.items.map(&:output)
376
+
377
+ user_messages = filtered_messages.map { {role: "user", content: it} }
378
+
379
+ # Instruct LLM how to handle filtered messages
380
+ instructions = <<~TEXT
381
+ I'm going to send filtered information to you in the form of free text.
382
+ If you need to refer to the filtered information in a response, just reference it by the filter.
383
+ TEXT
384
+
385
+ messages = [
386
+ {role: "system", content: instructions},
387
+ *user_messages
388
+ ]
389
+
390
+ chat_completion = openai.chat.completions.create(messages:, model: :"gpt-5")
391
+ response = chat_completion.choices.last.message.content
392
+
393
+ # Restore the response from the mapping
394
+ mapping = result.mapping
395
+ restored_response = TopSecret::FilteredText.restore(response, mapping:).output
396
+
397
+ puts(restored_response)
398
+ ```
399
+
157
400
  ### Advanced Examples
158
401
 
159
402
  #### Overriding the default filters
160
403
 
161
404
  When overriding or [disabling](#disabling-a-default-filter-1) a [default filter](#default-filters), you must map to the correct key.
162
405
 
406
+ > [!IMPORTANT]
407
+ > Invalid filter keys will raise an `ArgumentError`. Only the following keys are valid:
408
+ > `credit_card_filter`, `email_filter`, `phone_number_filter`, `ssn_filter`, `people_filter`, `location_filter`
409
+
163
410
  ```ruby
164
411
  regex_filter = TopSecret::Filters::Regex.new(label: "EMAIL_ADDRESS", regex: /\b\w+\[at\]\w+\.\w+\b/)
165
412
  ner_filter = TopSecret::Filters::NER.new(label: "NAME", tag: :person, min_confidence_score: 0.25)
166
413
 
167
- TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com", filters: {
414
+ TopSecret::Text.filter("Ralph can be reached at ralph[at]thoughtbot.com",
168
415
  email_filter: regex_filter,
169
416
  people_filter: ner_filter
170
- })
417
+ )
171
418
  ```
172
419
 
173
420
  This will return
174
421
 
175
422
  ```ruby
176
- <TopSecret::Result
423
+ <TopSecret::Text::Result
177
424
  @input="Ralph can be reached at ralph[at]thoughtbot.com",
178
425
  @mapping={:EMAIL_ADDRESS_1=>"ralph[at]thoughtbot.com", :NAME_1=>"Ralph", :NAME_2=>"ralph["},
179
426
  @output="[NAME_1] can be reached at [EMAIL_ADDRESS_1]"
@@ -183,22 +430,29 @@ This will return
183
430
  #### Disabling a default filter
184
431
 
185
432
  ```ruby
186
- TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com", filters: {
433
+ TopSecret::Text.filter("Ralph can be reached at ralph@thoughtbot.com",
187
434
  email_filter: nil,
188
435
  people_filter: nil
189
- })
436
+ )
190
437
  ```
191
438
 
192
439
  This will return
193
440
 
194
441
  ```ruby
195
- <TopSecret::Result
442
+ <TopSecret::Text::Result
196
443
  @input="Ralph can be reached at ralph@thoughtbot.com",
197
444
  @mapping={},
198
445
  @output="Ralph can be reached at ralph@thoughtbot.com"
199
446
  >
200
447
  ```
201
448
 
449
+ #### Error handling for invalid filter keys
450
+
451
+ ```ruby
452
+ # This will raise ArgumentError: Unknown key: :invalid_filter. Valid keys are: ...
453
+ TopSecret::Text.filter("some text", invalid_filter: some_filter)
454
+ ```
455
+
202
456
  ### Custom Filters
203
457
 
204
458
  #### Adding new [Regex filters][]
@@ -209,15 +463,15 @@ ip_address_filter = TopSecret::Filters::Regex.new(
209
463
  regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
210
464
  )
211
465
 
212
- TopSecret::Text.filter("Ralph's IP address is 192.168.1.1", filters: {
213
- ip_address_filter: ip_address_filter
214
- })
466
+ TopSecret::Text.filter("Ralph's IP address is 192.168.1.1",
467
+ custom_filters: [ip_address_filter]
468
+ )
215
469
  ```
216
470
 
217
471
  This will return
218
472
 
219
473
  ```ruby
220
- <TopSecret::Result
474
+ <TopSecret::Text::Result
221
475
  @input="Ralph's IP address is 192.168.1.1",
222
476
  @mapping={:PERSON_1=>"Ralph", :IP_ADDRESS_1=>"192.168.1.1"},
223
477
  @output="[PERSON_1]'s IP address is [IP_ADDRESS_1]"
@@ -235,15 +489,15 @@ language_filter = TopSecret::Filters::NER.new(
235
489
  min_confidence_score: 0.75
236
490
  )
237
491
 
238
- TopSecret::Text.filter("Ralph's favorite programming language is Ruby.", filters: {
239
- language_filter: language_filter
240
- })
492
+ TopSecret::Text.filter("Ralph's favorite programming language is Ruby.",
493
+ custom_filters: [language_filter]
494
+ )
241
495
  ```
242
496
 
243
497
  This will return
244
498
 
245
499
  ```ruby
246
- <TopSecret::Result
500
+ <TopSecret::Text::Result
247
501
  @input="Ralph's favorite programming language is Ruby.",
248
502
  @mapping={:PERSON_1=>"Ralph", :LANGUAGE_1=>"Ruby"},
249
503
  @output="[PERSON_1]'s favorite programming language is [LANGUAGE_1]"
@@ -265,9 +519,9 @@ regex_filter = TopSecret::Filters::Regex.new(
265
519
  regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
266
520
  )
267
521
 
268
- result = TopSecret::Text.filter("Server IP: 192.168.1.1", filters: {
269
- ip_address_filter: regex_filter
270
- })
522
+ result = TopSecret::Text.filter("Server IP: 192.168.1.1",
523
+ custom_filters: [regex_filter]
524
+ )
271
525
 
272
526
  result.output
273
527
  # => "Server IP: [IP_ADDRESS_1]"
@@ -285,9 +539,9 @@ ner_filter = TopSecret::Filters::NER.new(
285
539
  min_confidence_score: 0.25
286
540
  )
287
541
 
288
- result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.", filters: {
542
+ result = TopSecret::Text.filter("Ralph and Ruby work at thoughtbot.",
289
543
  people_filter: ner_filter
290
- })
544
+ )
291
545
 
292
546
  result.output
293
547
  # => "[PERSON_1] and [PERSON_2] work at thoughtbot."
@@ -314,6 +568,22 @@ TopSecret.configure do |config|
314
568
  end
315
569
  ```
316
570
 
571
+ ### Disabling NER filtering
572
+
573
+ For improved performance or when the MITIE model file cannot be deployed, you can disable NER-based filtering entirely. This will disable people and location detection but retain all regex-based filters (credit cards, emails, phone numbers, SSNs):
574
+
575
+ ```ruby
576
+ TopSecret.configure do |config|
577
+ config.model_path = nil
578
+ end
579
+ ```
580
+
581
+ This is useful in environments where:
582
+
583
+ - The model file cannot be deployed due to size constraints
584
+ - You only need regex-based filtering
585
+ - You want to optimize for performance over NER capabilities
586
+
317
587
  ### Overriding the confidence score
318
588
 
319
589
  ```ruby
@@ -326,7 +596,7 @@ end
326
596
 
327
597
  ```ruby
328
598
  TopSecret.configure do |config|
329
- config.default_filters.email_filter = TopSecret::Filters::Regex.new(
599
+ config.email_filter = TopSecret::Filters::Regex.new(
330
600
  label: "EMAIL_ADDRESS",
331
601
  regex: /\b\w+\[at\]\w+\.\w+\b/
332
602
  )
@@ -337,18 +607,20 @@ end
337
607
 
338
608
  ```ruby
339
609
  TopSecret.configure do |config|
340
- config.default_filters.email_filter = nil
610
+ config.email_filter = nil
341
611
  end
342
612
  ```
343
613
 
344
- ### Adding new default filters
614
+ ### Adding custom filters globally
345
615
 
346
616
  ```ruby
617
+ ip_address_filter = TopSecret::Filters::Regex.new(
618
+ label: "IP_ADDRESS",
619
+ regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
620
+ )
621
+
347
622
  TopSecret.configure do |config|
348
- config.default_filters.ip_address_filter = TopSecret::Filters::Regex.new(
349
- label: "IP_ADDRESS",
350
- regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/
351
- )
623
+ config.custom_filters << ip_address_filter
352
624
  end
353
625
  ```
354
626
 
@@ -361,11 +633,26 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
361
633
  >
362
634
  > You'll need to download and extract [ner_model.dat][] first, and place it in the root of this project.
363
635
 
636
+ ### Performance Benchmarks
637
+
638
+ Run `bin/benchmark` to test performance and catch regressions:
639
+
640
+ ```bash
641
+ bin/benchmark # CI-optimized benchmark with pass/fail thresholds
642
+ ```
643
+
644
+ > [!NOTE]
645
+ > When adding new public methods to the API, ensure they are included in the benchmark script to catch performance regressions.
646
+
364
647
  To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
365
648
 
366
649
  ## Contributing
367
650
 
368
- Bug reports and pull requests are welcome on GitHub at [https://github.com/thoughtbot/top_secret](https://github.com/thoughtbot/top_secret). This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/thoughtbot/top_secret/blob/main/CODE_OF_CONDUCT.md).
651
+ [Bug reports](https://github.com/thoughtbot/top_secret/issues/new?template=bug_report.md) and [pull requests](https://github.com/thoughtbot/top_secret/pulls) are welcome on GitHub at [https://github.com/thoughtbot/top_secret](https://github.com/thoughtbot/top_secret).
652
+
653
+ Please create a [new discussion](https://github.com/thoughtbot/top_secret/discussions/new?category=ideas) if you want to share ideas for new features.
654
+
655
+ This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/thoughtbot/top_secret/blob/main/CODE_OF_CONDUCT.md).
369
656
 
370
657
  ## License
371
658
 
@@ -400,3 +687,4 @@ We are [available for hire][hire].
400
687
  [train]: https://github.com/ankane/mitie-ruby?tab=readme-ov-file#training
401
688
  [Regex filters]: https://github.com/thoughtbot/top_secret/blob/main/lib/top_secret/filters/regex.rb
402
689
  [NER filters]: https://github.com/thoughtbot/top_secret/blob/main/lib/top_secret/filters/ner.rb
690
+ [discussions_60]: https://github.com/thoughtbot/top_secret/discussions/60
@@ -1,6 +1,9 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module TopSecret
4
+ # @return [String] The path to the NER model file
5
+ MODEL_PATH = "ner_model.dat"
6
+
4
7
  # @return [Regexp] Matches credit card numbers
5
8
  CREDIT_CARD_REGEX = /
6
9
  \b[3456]\d{15}\b |
@@ -0,0 +1,29 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class FilteredText
5
+ # Result object returned by FilteredText restoration operations.
6
+ #
7
+ # Contains the restored text along with tracking information about which
8
+ # placeholders were successfully restored and which remain unrestored.
9
+ class Result
10
+ # @return [String] The text with placeholders restored to original values
11
+ attr_reader :output
12
+
13
+ # @return [Array<String>] Array of placeholder strings that could not be restored
14
+ attr_reader :unrestored
15
+
16
+ # @return [Array<String>] Array of placeholder strings that were successfully restored
17
+ attr_reader :restored
18
+
19
+ # @param output [String] The restored text
20
+ # @param unrestored [Array<String>] Placeholders that could not be restored
21
+ # @param restored [Array<String>] Placeholders that were successfully restored
22
+ def initialize(output, unrestored, restored)
23
+ @output = output
24
+ @unrestored = unrestored
25
+ @restored = restored
26
+ end
27
+ end
28
+ end
29
+ end
@@ -0,0 +1,73 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative "filtered_text/result"
4
+
5
+ module TopSecret
6
+ # Restores filtered text by substituting placeholders with original values.
7
+ #
8
+ # This class is used to reverse the filtering process, typically when processing
9
+ # responses from external services like LLMs that may contain filtered placeholders.
10
+ class FilteredText
11
+ # @return [String] The text being processed for restoration
12
+ attr_reader :output
13
+
14
+ # @param filtered_text [String] Text containing filter placeholders like [EMAIL_1]
15
+ # @param mapping [Hash] Hash mapping filter symbols to original values
16
+ def initialize(filtered_text, mapping:)
17
+ @mapping = mapping
18
+ @output = filtered_text.dup
19
+ end
20
+
21
+ # Convenience method to restore filtered text in one call
22
+ #
23
+ # @param filtered_text [String] Text containing filter placeholders
24
+ # @param mapping [Hash] Hash mapping filter symbols to original values
25
+ # @return [Result] Contains restored text and tracking information
26
+ #
27
+ # @example Basic restoration
28
+ # mapping = {EMAIL_1: "john@example.com"}
29
+ # result = TopSecret::FilteredText.restore("Contact [EMAIL_1]", mapping: mapping)
30
+ # result.output # => "Contact john@example.com"
31
+ # result.restored # => ["[EMAIL_1]"]
32
+ # result.unrestored # => []
33
+ def self.restore(filtered_text, mapping:)
34
+ new(filtered_text, mapping:).restore
35
+ end
36
+
37
+ # Performs the restoration process
38
+ #
39
+ # Substitutes all found placeholders with their mapped values and tracks
40
+ # which placeholders were successfully restored vs those that remain unrestored.
41
+ #
42
+ # @return [Result] Contains the restored text and tracking arrays
43
+ def restore
44
+ restored = []
45
+
46
+ mapping.each do |filter, value|
47
+ placeholder = build_placeholder(filter)
48
+
49
+ if output.include? placeholder
50
+ restored << placeholder
51
+ output.gsub! placeholder, value
52
+ end
53
+ end
54
+
55
+ unrestored = output.scan(/\[\w*_\d\]/)
56
+
57
+ Result.new(output, unrestored, restored)
58
+ end
59
+
60
+ private
61
+
62
+ # @return [Hash] Mapping from filter symbols to original values
63
+ attr_reader :mapping
64
+
65
+ # Builds a placeholder string from a filter symbol
66
+ #
67
+ # @param filter [Symbol] The filter symbol (e.g., :EMAIL_1)
68
+ # @return [String] The placeholder string (e.g., "[EMAIL_1]")
69
+ def build_placeholder(filter)
70
+ "[#{filter}]"
71
+ end
72
+ end
73
+ end
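The restore loop in this hunk can be tried without installing the gem. The following is a minimal standalone sketch, not the gem's implementation: `restore_placeholders` is a hypothetical helper name, and it returns a plain array rather than a `Result` object. One deliberate deviation is labeled in the code: the source pattern `/\[\w*_\d\]/` matches a single trailing digit, so a leftover placeholder such as `[EMAIL_10]` would not be reported in `unrestored`; the sketch uses `\d+` instead.

```ruby
# frozen_string_literal: true

# Hypothetical standalone helper mirroring the restore loop in the hunk above.
# Returns [output, restored, unrestored].
def restore_placeholders(filtered_text, mapping)
  output = filtered_text.dup
  restored = []

  mapping.each do |label, value|
    placeholder = "[#{label}]"
    next unless output.include?(placeholder)

    restored << placeholder
    output.gsub!(placeholder, value)
  end

  # Assumption: \d+ rather than the source's \d, so indexes of 10 and above
  # are also reported as unrestored.
  unrestored = output.scan(/\[\w*_\d+\]/)

  [output, restored, unrestored]
end

output, restored, unrestored = restore_placeholders(
  "Email [EMAIL_1] and CC [EMAIL_10]",
  {EMAIL_1: "ralph@thoughtbot.com"}
)
puts output
p restored
p unrestored
```

Here `[EMAIL_1]` is substituted and `[EMAIL_10]` is left in the text and listed as unrestored, because the mapping has no entry for it.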
@@ -0,0 +1,15 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ module Mapping
5
+ # @return [Boolean] Whether sensitive information was found
6
+ def sensitive?
7
+ mapping.any?
8
+ end
9
+
10
+ # @return [Boolean] Whether sensitive information was not found
11
+ def safe?
12
+ !sensitive?
13
+ end
14
+ end
15
+ end
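The mixin in this hunk only requires that the including class expose a `mapping` method. A minimal sketch of that contract follows; the module body is copied from the hunk so the example runs standalone, and `ExampleResult` is a hypothetical stand-in rather than one of the gem's result classes.

```ruby
# frozen_string_literal: true

# Copy of the predicate mixin from the hunk above (outside the TopSecret
# namespace) so this sketch runs on its own.
module Mapping
  def sensitive?
    mapping.any?
  end

  def safe?
    !sensitive?
  end
end

# Hypothetical result type: any object that responds to #mapping can
# include the mixin.
ExampleResult = Struct.new(:mapping) do
  include Mapping
end

puts ExampleResult.new({EMAIL_1: "ralph@thoughtbot.com"}).sensitive?
puts ExampleResult.new({}).safe?
```

Both calls print `true`: a non-empty mapping is sensitive, and an empty mapping is safe.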
@@ -0,0 +1,32 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ # A null object implementation that provides a no-op interface compatible with Mitie::NER.
5
+ # Used when NER filtering is disabled (model_path is nil) to eliminate conditional checks
6
+ # throughout the codebase.
7
+ #
8
+ # @example
9
+ # model = TopSecret::NullModel.new
10
+ # doc = model.doc("some text")
11
+ # doc.entities # => []
12
+ class NullModel
13
+ # A null document implementation that provides an empty entities array.
14
+ # Used as the return value from NullModel#doc to maintain interface compatibility.
15
+ class NullDoc
16
+ # Returns an empty array of entities.
17
+ #
18
+ # @return [Array] Always returns an empty array
19
+ def entities
20
+ []
21
+ end
22
+ end
23
+
24
+ # Creates a null document that returns empty entities.
25
+ #
26
+ # @param input [String] The input text (ignored)
27
+ # @return [NullDoc] A document-like object with empty entities
28
+ def doc(input)
29
+ NullDoc.new
30
+ end
31
+ end
32
+ end
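The value of the null object is that callers can use the same `doc(...).entities` chain whether or not NER is enabled. A minimal sketch under stated assumptions: the class body is a trimmed copy of the hunk above, `entity_texts` is a hypothetical consumer, and the `entity[:text]` hash shape is an assumption about what a real model's entities look like.

```ruby
# frozen_string_literal: true

# Trimmed copy of the null object from the hunk above so the sketch runs
# standalone.
class NullModel
  class NullDoc
    def entities
      []
    end
  end

  def doc(_input)
    NullDoc.new
  end
end

# Hypothetical consumer: it relies only on model.doc(text).entities, so it
# needs no branch for the "NER disabled" case.
# Assumption: each entity is a hash with a :text key.
def entity_texts(model, text)
  model.doc(text).entities.map { |entity| entity[:text] }
end

p entity_texts(NullModel.new, "Ralph lives in Boston.")
```

With the null model the call returns an empty array, so downstream code that iterates over entities simply does nothing.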
@@ -0,0 +1,42 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class Text
5
+ # Holds the result of a batch redaction operation on multiple messages.
6
+ # Contains a global mapping that ensures consistent labeling across all messages
7
+ # and a collection of individual input/output pairs.
8
+ class BatchResult # TODO Rename to FilterBatchResult
9
+ include Mapping
10
+
11
+ # @return [Hash] Global mapping of redaction labels to original values across all messages
12
+ attr_reader :mapping
13
+
14
+ # @return [Array<Result>] Array of Result objects, one for each processed message
15
+ attr_reader :items
16
+
17
+ # Creates a new BatchResult instance
18
+ #
19
+ # @param mapping [Hash] Global mapping of redaction labels to original values
20
+ # @param items [Array<Result>] Array of Result objects
21
+ def initialize(mapping: {}, items: [])
22
+ @mapping = mapping
23
+ @items = items
24
+ end
25
+
26
+ # Creates a BatchResult from multiple messages with consistent global labeling
27
+ #
28
+ # @param messages [Array<String>] Array of text messages to filter
29
+ # @param custom_filters [Array] Additional custom filters to apply
30
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
31
+ # @return [BatchResult] Contains global mapping and array of Result objects with individual mappings
32
+ # @raise [ArgumentError] If invalid filter keys are provided
33
+ def self.from_messages(messages, custom_filters: [], **filters)
34
+ individual_results = TopSecret::Text::Result.from_messages(messages, custom_filters:, **filters)
35
+ mapping = TopSecret::Text::GlobalMapping.from_results(individual_results)
36
+ items = TopSecret::Text::Result.with_global_labels(individual_results, mapping)
37
+
38
+ Text::BatchResult.new(mapping:, items:)
39
+ end
40
+ end
41
+ end
42
+ end
@@ -0,0 +1,63 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class Text
5
+ # Manages consistent labeling across multiple filtering operations by ensuring
6
+ # identical sensitive values receive the same redaction labels globally.
7
+ class GlobalMapping
8
+ # Creates a global mapping from individual filter results
9
+ #
10
+ # @param individual_results [Array<Result>] Array of individual filter results
11
+ # @return [Hash] Inverted mapping from filter labels to original values
12
+ def self.from_results(individual_results)
13
+ new.build_from_results(individual_results)
14
+ end
15
+
16
+ # Creates a new GlobalMapping instance
17
+ def initialize
18
+ @mapping = {}
19
+ @label_counters = {}
20
+ end
21
+
22
+ # Builds the global mapping by processing all individual results
23
+ #
24
+ # @param individual_results [Array<Result>] Array of individual filter results
25
+ # @return [Hash] Inverted mapping from filter labels to original values
26
+ def build_from_results(individual_results)
27
+ individual_results.each { |result| process_result(result) if result.sensitive? }
28
+
29
+ mapping.invert
30
+ end
31
+
32
+ private
33
+
34
+ attr_reader :mapping
35
+ attr_reader :label_counters
36
+
37
+ # Processes a single result, adding new values to the global mapping
38
+ #
39
+ # @param result [Result] Individual filter result to process
40
+ def process_result(result)
41
+ result.mapping.each do |individual_key, value|
42
+ next if mapping.key?(value)
43
+
44
+ mapping[value] = generate_global_key(individual_key)
45
+ end
46
+ end
47
+
48
+ # Generates a consistent global key for a given individual key
49
+ #
50
+ # @param individual_key [Symbol] The individual key from a filter result
51
+ # @return [Symbol] The global key with consistent numbering
52
+ def generate_global_key(individual_key)
53
+ # TODO: This assumes labels are formatted consistently.
54
+ # We still need to handle the case where a label itself begins with an "_".
55
+ label_type = individual_key.to_s.rpartition("_").first
56
+
57
+ label_counters[label_type] ||= 0
58
+ label_counters[label_type] += 1
59
+ :"#{label_type}_#{label_counters[label_type]}"
60
+ end
61
+ end
62
+ end
63
+ end
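The labeling logic in GlobalMapping can be tried in isolation. The following is a minimal sketch, assuming plain hashes in place of Result objects; `build_global_mapping` is a hypothetical helper name used only for illustration and is not part of top_secret.

```ruby
# Standalone sketch of the global labeling idea; it does not load top_secret.
# `build_global_mapping` is a hypothetical name used only in this example.
def build_global_mapping(individual_mappings)
  value_to_label = {}     # e.g. "john@test.com" => :EMAIL_1
  counters = Hash.new(0)  # one counter per label type, e.g. "EMAIL" => 1

  individual_mappings.each do |mapping|
    mapping.each do |individual_key, value|
      # A value that was already seen keeps the label it was first given.
      next if value_to_label.key?(value)

      # Split on the LAST underscore: "EMAIL_ADDR_2" => "EMAIL_ADDR".
      label_type = individual_key.to_s.rpartition("_").first
      counters[label_type] += 1
      value_to_label[value] = :"#{label_type}_#{counters[label_type]}"
    end
  end

  # Invert so the result reads label => value, like a per-message mapping.
  value_to_label.invert
end

first  = { EMAIL_1: "john@test.com" }
second = { EMAIL_1: "jane@test.com", EMAIL_2: "john@test.com" }
global = build_global_mapping([first, second])
# global => { EMAIL_1: "john@test.com", EMAIL_2: "jane@test.com" }
```

Both messages used `EMAIL_1` locally for different addresses. The global pass renumbers by first appearance, so each distinct value ends up with exactly one label.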
@@ -0,0 +1,59 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class Text
5
+ # Holds the result of a redaction operation.
6
+ class Result # TODO: Rename to FilterResult
7
+ include Mapping
8
+
9
+ # @return [String] The original unredacted input
10
+ attr_reader :input
11
+
12
+ # @return [String] The redacted output
13
+ attr_reader :output
14
+
15
+ # @return [Hash] Mapping of redacted labels to matched values
16
+ attr_reader :mapping
17
+
18
+ # @param input [String] The original text
19
+ # @param output [String] The redacted text
20
+ # @param mapping [Hash] Map of labels to matched values
21
+ def initialize(input, output, mapping)
22
+ @input = input
23
+ @output = output
24
+ @mapping = mapping
25
+ end
26
+
27
+ # Filters multiple messages individually using a shared model for performance
28
+ #
29
+ # @param messages [Array<String>] Array of text messages to filter
30
+ # @param custom_filters [Array] Additional custom filters to apply
31
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
32
+ # @return [Array<Result>] Array of individual Result objects for each message
33
+ # @raise [ArgumentError] If invalid filter keys are provided
34
+ def self.from_messages(messages, custom_filters: [], **filters)
35
+ shared_model = TopSecret.model_path ? Mitie::NER.new(TopSecret.model_path) : nil
36
+
37
+ messages.map do |message|
38
+ TopSecret::Text.new(message, filters:, custom_filters:, model: shared_model).filter
39
+ end
40
+ end
41
+
42
+ # Creates Result objects with globally consistent labels applied to text
43
+ #
44
+ # @param individual_results [Array<Result>] Array of individual filter results
45
+ # @param global_mapping [Hash] Global mapping from filter labels to original values
46
+ # @return [Array<Result>] Array of Result objects with globally consistent redaction and individual mappings
47
+ def self.with_global_labels(individual_results, global_mapping)
48
+ individual_results.map do |result|
49
+ output = global_mapping.reduce(result.input.dup) do |text, (filter, value)|
50
+ text.gsub(value, "[#{filter}]")
51
+ end
52
+ filter_keys = output.scan(/\[([^\]]+)\]/).flatten.map(&:to_sym)
53
+ mapping = global_mapping.slice(*filter_keys)
54
+ new(result.input, output, mapping)
55
+ end
56
+ end
57
+ end
58
+ end
59
+ end
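The re-labeling step in `Result.with_global_labels` comes down to a `reduce` over the global mapping followed by a `slice`. The sketch below mirrors that logic on plain strings and hashes; `apply_global_labels` is a hypothetical name chosen for this example, not a method of the gem.

```ruby
# Standalone sketch; `apply_global_labels` is a hypothetical name, not a top_secret method.
def apply_global_labels(input, global_mapping)
  # Replace every known sensitive value with its global label.
  output = global_mapping.reduce(input.dup) do |text, (label, value)|
    text.gsub(value, "[#{label}]")
  end

  # Keep only the labels that actually appear in this message.
  used_labels = output.scan(/\[([^\]]+)\]/).flatten.map(&:to_sym)
  [output, global_mapping.slice(*used_labels)]
end

global_mapping = { EMAIL_1: "john@test.com", EMAIL_2: "jane@test.com" }
output, mapping = apply_global_labels("Email jane@test.com again", global_mapping)
# output  => "Email [EMAIL_2] again"
# mapping => { EMAIL_2: "jane@test.com" }
```

Because the per-message mapping is recovered by scanning the output for bracketed tokens, bracketed text that was already in the input is also treated as a candidate label. `Hash#slice` ignores candidates that are not keys of the global mapping.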
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TopSecret
4
+ class Text
5
+ # Holds the result of a scan operation.
6
+ class ScanResult
7
+ include Mapping
8
+
9
+ # @return [Hash] Mapping of labels to matched values (a scan does not redact the text)
10
+ attr_reader :mapping
11
+
12
+ # @param mapping [Hash] Map of labels to matched values
13
+ def initialize(mapping)
14
+ @mapping = mapping
15
+ end
16
+ end
17
+ end
18
+ end
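ScanResult gets its `sensitive?` method from the included Mapping module, whose source is not part of this hunk. As a rough sketch only, and assuming `sensitive?` simply reports whether the mapping has any entries, the arrangement could look like this (`MappingSketch` and `ScanResultSketch` are hypothetical stand-ins, not the gem's classes):

```ruby
# Hypothetical stand-in for TopSecret::Mapping, whose source is not shown in this hunk.
module MappingSketch
  # Assumed behavior: true when at least one sensitive value was recorded.
  def sensitive?
    mapping.any?
  end
end

# Hypothetical stand-in for TopSecret::Text::ScanResult.
class ScanResultSketch
  include MappingSketch

  attr_reader :mapping

  def initialize(mapping)
    @mapping = mapping
  end
end

found = ScanResultSketch.new({ EMAIL_1: "john@example.com" })
clean = ScanResultSketch.new({})
puts found.sensitive?
puts clean.sensitive?
```

Putting the predicate in a shared module lets Result and ScanResult answer the same question without duplicating code. That is presumably why both classes include Mapping.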
@@ -1,37 +1,116 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require "active_support/core_ext/hash/keys"
4
+ require_relative "null_model"
5
+ require_relative "text/result"
6
+ require_relative "text/batch_result"
7
+ require_relative "text/scan_result"
8
+ require_relative "text/global_mapping"
9
+
3
10
  module TopSecret
4
11
  # Processes text to identify and redact sensitive information using configured filters.
5
12
  class Text
6
13
  # @param input [String] The original text to be filtered
7
14
  # @param filters [Hash, nil] Optional set of filters to override the defaults
8
- def initialize(input, filters: TopSecret.default_filters)
15
+ # @param custom_filters [Array] Additional custom filters to apply
16
+ # @param model [Mitie::NER, nil] Optional pre-loaded MITIE model for performance
17
+ def initialize(input, custom_filters: [], filters: {}, model: nil)
9
18
  @input = input
10
19
  @output = input.dup
11
20
  @mapping = {}
12
21
 
13
- @model = Mitie::NER.new(TopSecret.model_path)
14
- @doc = @model.doc(@output)
15
- @entities = @doc.entities
22
+ @model = model || default_model
16
23
 
17
24
  @filters = filters
25
+ @custom_filters = custom_filters
18
26
  end
19
27
 
20
28
  # Convenience method to create an instance and filter input
21
29
  #
22
30
  # @param input [String] The text to filter
23
- # @param filters [Hash] Optional filters to override defaults
31
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
32
+ # @param custom_filters [Array] Additional custom filters to apply
24
33
  # @return [Result] The filtered result
25
- def self.filter(input, filters: {})
26
- new(input, filters:).filter
34
+ # @raise [ArgumentError] If invalid filter keys are provided
35
+ def self.filter(input, custom_filters: [], **filters)
36
+ new(input, filters:, custom_filters:).filter
27
37
  end
28
38
 
29
- # Applies configured filters to the input, redacting matches and building a mapping.
39
+ # Filters multiple messages with globally consistent redaction labels
30
40
  #
31
- # @return [Result] Contains original input, redacted output, and mapping of labels to values
41
+ # Processes a collection of messages and ensures that identical sensitive values
42
+ # receive the same redaction labels across all messages. This is useful when
43
+ # processing conversation threads or document collections where consistency matters.
44
+ #
45
+ # @param messages [Array<String>] Array of text messages to filter
46
+ # @param custom_filters [Array] Additional custom filters to apply
47
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
48
+ # @return [BatchResult] Contains global mapping and array of Result objects with individual mappings
49
+ # @raise [ArgumentError] If invalid filter keys are provided
50
+ #
51
+ # @example Basic usage
52
+ # messages = ["Contact john@test.com", "Email john@test.com again"]
53
+ # result = TopSecret::Text.filter_all(messages)
54
+ # result.items[0].output # => "Contact [EMAIL_1]"
55
+ # result.items[1].output # => "Email [EMAIL_1] again"
56
+ # result.items[0].mapping # => { EMAIL_1: "john@test.com" }
57
+ # result.mapping # => { EMAIL_1: "john@test.com" }
58
+ #
59
+ # @example With custom filters
60
+ # ip_filter = TopSecret::Filters::Regex.new(label: "IP", regex: /\d+\.\d+\.\d+\.\d+/)
61
+ # result = TopSecret::Text.filter_all(messages, custom_filters: [ip_filter])
62
+ def self.filter_all(messages, custom_filters: [], **filters)
63
+ Text::BatchResult.from_messages(messages, custom_filters:, **filters)
64
+ end
65
+
66
+ # Convenience method to scan input text for sensitive information without redacting it
67
+ #
68
+ # This method detects sensitive information using configured filters but does not modify
69
+ # the original text. Use this when you only need to check if sensitive data exists or
70
+ # get a mapping of what was found.
71
+ #
72
+ # @param input [String] The text to scan for sensitive information
73
+ # @param filters [Hash] Optional filters to override defaults (only valid filter keys accepted)
74
+ # @param custom_filters [Array] Additional custom filters to apply
75
+ # @return [ScanResult] Contains mapping of found sensitive information and sensitive? flag
76
+ # @raise [ArgumentError] If invalid filter keys are provided
77
+ #
78
+ # @example Basic scanning
79
+ # result = TopSecret::Text.scan("Contact john@example.com")
80
+ # result.sensitive? # => true
81
+ # result.mapping # => {:EMAIL_1=>"john@example.com"}
82
+ #
83
+ # @example With custom filters
84
+ # ip_filter = TopSecret::Filters::Regex.new(label: "IP", regex: /\d+\.\d+\.\d+\.\d+/)
85
+ # result = TopSecret::Text.scan("Server IP: 192.168.1.1", custom_filters: [ip_filter])
86
+ # result.mapping # => {:IP_1=>"192.168.1.1"}
87
+ #
88
+ # @example Overriding default filters
89
+ # custom_email = TopSecret::Filters::Regex.new(label: "EMAIL_ADDR", regex: /\w+@\w+/)
90
+ # result = TopSecret::Text.scan("user@test.com", email_filter: custom_email)
91
+ # result.mapping # => {:EMAIL_ADDR_1=>"user@test.com"}
92
+ def self.scan(input, custom_filters: [], **filters)
93
+ new(input, filters:, custom_filters:).scan
94
+ end
95
+
96
+ # Scans the input text for sensitive information using configured filters
97
+ #
98
+ # This method applies all active filters to detect sensitive information but does not
99
+ # redact the original text. It builds a mapping of found values and returns a
100
+ # ScanResult that reports whether any sensitive information was detected.
101
+ #
102
+ # @return [ScanResult] Contains mapping of found sensitive information and sensitive? flag
32
103
  # @raise [Error] If an unsupported filter is encountered
33
- def filter
34
- TopSecret.default_filters.merge(filters).compact.each_value do |filter|
104
+ # @raise [ArgumentError] If invalid filter keys are provided
105
+ def scan
106
+ @doc ||= model.doc(@output) if model
107
+ @entities ||= doc.entities if model
108
+
109
+ validate_filters!
110
+
111
+ all_filters.each do |filter|
112
+ next if filter.nil?
113
+
35
114
  values = case filter
36
115
  when TopSecret::Filters::Regex
37
116
  filter.call(input)
@@ -43,9 +122,20 @@ module TopSecret
43
122
  build_mapping(values, label: filter.label)
44
123
  end
45
124
 
46
- substitute_text
125
+ ScanResult.new(mapping)
126
+ end
47
127
 
48
- Result.new(input, output, mapping)
128
+ # Applies configured filters to the input, redacting matches and building a mapping.
129
+ #
130
+ # @return [Result] Contains original input, redacted output, and mapping of labels to values
131
+ # @raise [Error] If an unsupported filter is encountered
132
+ # @raise [ArgumentError] If invalid filter keys are provided
133
+ def filter
134
+ scan_result = scan
135
+
136
+ substitute_text if scan_result.sensitive?
137
+
138
+ Text::Result.new(input, output, scan_result.mapping)
49
139
  end
50
140
 
51
141
  private
@@ -59,12 +149,21 @@ module TopSecret
59
149
  # @return [Hash] Mapping from redaction labels to original values
60
150
  attr_reader :mapping
61
151
 
152
+ # @return [Object] The NER model (typically Mitie::NER or a test double)
153
+ attr_reader :model
154
+
155
+ # @return [Object] The document created from the output text (typically Mitie::Document or a test double)
156
+ attr_reader :doc
157
+
62
158
  # @return [Array<Hash>] Named entities extracted by MITIE
63
159
  attr_reader :entities
64
160
 
65
161
  # @return [Hash] Active filters used for redaction
66
162
  attr_reader :filters
67
163
 
164
+ # @return [Array] Custom filters to apply
165
+ attr_reader :custom_filters
166
+
68
167
  # Builds the mapping of label keys to matched values, indexed uniquely.
69
168
  #
70
169
  # @param values [Array<String>] Values matched by a filter
@@ -85,5 +184,55 @@ module TopSecret
85
184
  output.gsub! value, "[#{filter}]"
86
185
  end
87
186
  end
187
+
188
+ # Collects all filters to apply: default filters with overrides plus custom filters
189
+ #
190
+ # @return [Array] Array of filter objects to apply
191
+ def all_filters
192
+ merged_filters.values.compact + TopSecret.custom_filters + custom_filters
193
+ end
194
+
195
+ # Merges default filters with user-provided filter overrides
196
+ #
197
+ # @return [Hash] Hash containing default filters with any user overrides applied
198
+ # @private
199
+ def merged_filters
200
+ default_filters.merge(filters)
201
+ end
202
+
203
+ # Validates that all provided filter keys are recognized
204
+ #
205
+ # @return [void]
206
+ # @raise [ArgumentError] If invalid filter keys are provided
207
+ def validate_filters!
208
+ merged_filters.assert_valid_keys(*default_filters.keys)
209
+ end
210
+
211
+ # Returns the default filters configuration hash
212
+ #
213
+ # @return [Hash] Hash containing all configured default filters, keyed by filter name
214
+ # @private
215
+ def default_filters
216
+ {
217
+ credit_card_filter: TopSecret.credit_card_filter,
218
+ email_filter: TopSecret.email_filter,
219
+ phone_number_filter: TopSecret.phone_number_filter,
220
+ ssn_filter: TopSecret.ssn_filter,
221
+ people_filter: TopSecret.people_filter,
222
+ location_filter: TopSecret.location_filter
223
+ }
224
+ end
225
+
226
+ # Creates the default model based on configuration.
227
+ # Returns a MITIE NER model if a model path is configured, otherwise returns a null model.
228
+ #
229
+ # @return [Mitie::NER, NullModel] The model instance to use for NER processing
230
+ def default_model
231
+ if TopSecret.model_path
232
+ Mitie::NER.new(TopSecret.model_path)
233
+ else
234
+ NullModel.new
235
+ end
236
+ end
88
237
  end
89
238
  end
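The way `merged_filters`, `validate_filters!`, and `all_filters` combine can be illustrated without loading the gem. This is a minimal sketch under stated assumptions: symbols stand in for filter objects, a plain-Ruby check stands in for ActiveSupport's `Hash#assert_valid_keys`, and the globally configured `TopSecret.custom_filters` are left out. `DEFAULTS` and `resolve_filters` are hypothetical names.

```ruby
# Standalone sketch; symbols stand in for real filter objects.
DEFAULTS = { email_filter: :default_email, ssn_filter: :default_ssn }.freeze

# `resolve_filters` is a hypothetical name used only in this example.
def resolve_filters(overrides, custom_filters: [])
  merged = DEFAULTS.merge(overrides)

  # Plain-Ruby replacement for assert_valid_keys: reject keys that are not defaults.
  unknown = merged.keys - DEFAULTS.keys
  raise ArgumentError, "Unknown key: #{unknown.first.inspect}" unless unknown.empty?

  # A nil override drops that default filter; custom filters are appended last.
  merged.values.compact + custom_filters
end

active = resolve_filters({ ssn_filter: nil }, custom_filters: [:ip_filter])
# active => [:default_email, :ip_filter]

begin
  resolve_filters({ emial_filter: :typo })
rescue ArgumentError => e
  puts e.message
end
```

Validating the merged hash against the default keys means a misspelled override such as `emial_filter:` fails loudly instead of being silently ignored.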
@@ -1,5 +1,7 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module TopSecret
4
- VERSION = "0.1.1"
4
+ VERSION = "0.3.0"
5
+ MINIMUM_RAILS_VERSION = ">= 7.0.8"
6
+ MAXIMUM_RAILS_VERSION = "< 9"
5
7
  end
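The two version bounds added here match the activesupport requirement recorded in the gem metadata (`>= 7.0.8`, `< 9`). How the gemspec wires them together is not shown in this diff, so the snippet below is only a sketch that checks the combined range with RubyGems' own `Gem::Requirement`:

```ruby
require "rubygems" # usually loaded already; listed so the sketch is self-contained

# The same two constraints that appear in the version constants.
requirement = Gem::Requirement.new(">= 7.0.8", "< 9")

%w[7.0.7 7.0.8 8.0.2 9.0.0].each do |version|
  allowed = requirement.satisfied_by?(Gem::Version.new(version))
  puts "activesupport #{version}: #{allowed ? "allowed" : "rejected"}"
end
```

Compared with the previous `~> 8.0` plus `>= 8.0.2` pair, this range also admits the 7.x releases from 7.0.8 onward.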
data/lib/top_secret.rb CHANGED
@@ -8,13 +8,14 @@ require "mitie"
8
8
  # modules
9
9
  require_relative "top_secret/version"
10
10
  require_relative "top_secret/constants"
11
+ require_relative "top_secret/mapping"
11
12
  require_relative "top_secret/filters/ner"
12
13
  require_relative "top_secret/filters/regex"
13
14
  require_relative "top_secret/error"
14
- require_relative "top_secret/result"
15
15
  require_relative "top_secret/text"
16
+ require_relative "top_secret/filtered_text"
16
17
 
17
- # TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as Chatbots.
18
+ # TopSecret filters sensitive information from free text before it's sent to external services or APIs, such as chatbots and LLMs.
18
19
  #
19
20
  # @!attribute [rw] model_path
20
21
  # @return [String] the path to the MITIE NER model
@@ -22,23 +23,38 @@ require_relative "top_secret/text"
22
23
  # @!attribute [rw] min_confidence_score
23
24
  # @return [Float] the minimum confidence score required for NER matches
24
25
  #
25
- # @!attribute [rw] default_filters
26
- # @return [ActiveSupport::OrderedOptions] a set of default filters used to identify sensitive data
26
+ # @!attribute [rw] custom_filters
27
+ # @return [Array] array of custom filters that can be configured
28
+ #
29
+ # @!attribute [rw] credit_card_filter
30
+ # @return [TopSecret::Filters::Regex] filter for credit card numbers
31
+ #
32
+ # @!attribute [rw] email_filter
33
+ # @return [TopSecret::Filters::Regex] filter for email addresses
34
+ #
35
+ # @!attribute [rw] phone_number_filter
36
+ # @return [TopSecret::Filters::Regex] filter for phone numbers
37
+ #
38
+ # @!attribute [rw] ssn_filter
39
+ # @return [TopSecret::Filters::Regex] filter for social security numbers
40
+ #
41
+ # @!attribute [rw] people_filter
42
+ # @return [TopSecret::Filters::NER] filter for person names
43
+ #
44
+ # @!attribute [rw] location_filter
45
+ # @return [TopSecret::Filters::NER] filter for location names
27
46
  module TopSecret
28
47
  include ActiveSupport::Configurable
29
48
 
30
- config_accessor :model_path, default: "ner_model.dat"
49
+ config_accessor :model_path, default: MODEL_PATH
31
50
  config_accessor :min_confidence_score, default: MIN_CONFIDENCE_SCORE
32
51
 
33
- config_accessor :default_filters do
34
- options = ActiveSupport::OrderedOptions.new
35
- options.credit_card_filter = TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
36
- options.email_filter = TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
37
- options.phone_number_filter = TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
38
- options.ssn_filter = TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
39
- options.people_filter = TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
40
- options.location_filter = TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
52
+ config_accessor :custom_filters, default: []
41
53
 
42
- options
43
- end
54
+ config_accessor :credit_card_filter, default: TopSecret::Filters::Regex.new(label: "CREDIT_CARD", regex: CREDIT_CARD_REGEX)
55
+ config_accessor :email_filter, default: TopSecret::Filters::Regex.new(label: "EMAIL", regex: EMAIL_REGEX)
56
+ config_accessor :phone_number_filter, default: TopSecret::Filters::Regex.new(label: "PHONE_NUMBER", regex: PHONE_REGEX)
57
+ config_accessor :ssn_filter, default: TopSecret::Filters::Regex.new(label: "SSN", regex: SSN_REGEX)
58
+ config_accessor :people_filter, default: TopSecret::Filters::NER.new(label: "PERSON", tag: :person)
59
+ config_accessor :location_filter, default: TopSecret::Filters::NER.new(label: "LOCATION", tag: :location)
44
60
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: top_secret
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Steve Polito
@@ -13,22 +13,22 @@ dependencies:
13
13
  name: activesupport
14
14
  requirement: !ruby/object:Gem::Requirement
15
15
  requirements:
16
- - - "~>"
17
- - !ruby/object:Gem::Version
18
- version: '8.0'
19
16
  - - ">="
20
17
  - !ruby/object:Gem::Version
21
- version: 8.0.2
18
+ version: 7.0.8
19
+ - - "<"
20
+ - !ruby/object:Gem::Version
21
+ version: '9'
22
22
  type: :runtime
23
23
  prerelease: false
24
24
  version_requirements: !ruby/object:Gem::Requirement
25
25
  requirements:
26
- - - "~>"
27
- - !ruby/object:Gem::Version
28
- version: '8.0'
29
26
  - - ">="
30
27
  - !ruby/object:Gem::Version
31
- version: 8.0.2
28
+ version: 7.0.8
29
+ - - "<"
30
+ - !ruby/object:Gem::Version
31
+ version: '9'
32
32
  - !ruby/object:Gem::Dependency
33
33
  name: mitie
34
34
  requirement: !ruby/object:Gem::Requirement
@@ -44,7 +44,7 @@ dependencies:
44
44
  - !ruby/object:Gem::Version
45
45
  version: 0.3.2
46
46
  description: Filter sensitive information from free text before sending it to external
47
- services or APIs, such as Chatbots.
47
+ services or APIs, such as chatbots and LLMs.
48
48
  email:
49
49
  - stevepolito@hey.com
50
50
  executables: []
@@ -52,6 +52,7 @@ extensions: []
52
52
  extra_rdoc_files: []
53
53
  files:
54
54
  - CHANGELOG.md
55
+ - CODEOWNERS
55
56
  - CODE_OF_CONDUCT.md
56
57
  - LICENSE.txt
57
58
  - README.md
@@ -59,10 +60,17 @@ files:
59
60
  - lib/top_secret.rb
60
61
  - lib/top_secret/constants.rb
61
62
  - lib/top_secret/error.rb
63
+ - lib/top_secret/filtered_text.rb
64
+ - lib/top_secret/filtered_text/result.rb
62
65
  - lib/top_secret/filters/ner.rb
63
66
  - lib/top_secret/filters/regex.rb
64
- - lib/top_secret/result.rb
67
+ - lib/top_secret/mapping.rb
68
+ - lib/top_secret/null_model.rb
65
69
  - lib/top_secret/text.rb
70
+ - lib/top_secret/text/batch_result.rb
71
+ - lib/top_secret/text/global_mapping.rb
72
+ - lib/top_secret/text/result.rb
73
+ - lib/top_secret/text/scan_result.rb
66
74
  - lib/top_secret/version.rb
67
75
  - sig/top_secret.rbs
68
76
  homepage: https://github.com/thoughtbot/top_secret
@@ -1,24 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- module TopSecret
4
- # Holds the result of a redaction operation.
5
- class Result
6
- # @return [String] The original unredacted input
7
- attr_reader :input
8
-
9
- # @return [String] The redacted output
10
- attr_reader :output
11
-
12
- # @return [Hash] Mapping of redacted labels to matched values
13
- attr_reader :mapping
14
-
15
- # @param input [String] The original text
16
- # @param output [String] The redacted text
17
- # @param mapping [Hash] Map of labels to matched values
18
- def initialize(input, output, mapping)
19
- @input = input
20
- @output = output
21
- @mapping = mapping
22
- end
23
- end
24
- end